Open Source Archives - Kai Waehner

Apache Kafka 4.0: The Business Case for Scaling Data Streaming Enterprise-Wide

Apache Kafka 4.0 represents a major milestone in the evolution of real-time data infrastructure. Used by over 150,000 organizations worldwide, Kafka has become the de facto standard for data streaming across industries. This article focuses on the business value of Kafka 4.0, highlighting how it enables operational efficiency, faster time-to-market, and architectural flexibility across cloud, on-premise, and edge environments. Rather than detailing technical improvements, it explores Kafka’s strategic role in modern data platforms, the growing data streaming ecosystem, and how enterprises can turn event-driven architecture into competitive advantage. Kafka is no longer just infrastructure—it’s a foundation for digital business.

Apache Kafka 4.0 is more than a version bump. It marks a pivotal moment in how modern organizations build, operate, and scale their data infrastructure. While developers and architects may celebrate feature-level improvements, the true value of this release is what it enables at the business level: operational excellence, faster time-to-market, and competitive agility powered by data in motion. Kafka 4.0 represents a maturity milestone in the evolution of the event-driven enterprise.

The Business Case for Data Streaming at Enterprise Scale

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch. Also, download my free book about data streaming use cases and business value, including customer stories across all industries.

From Event Hype to Event Infrastructure

Over the last decade, Apache Kafka has evolved from a scalable log for engineers at LinkedIn to the de facto event streaming platform adopted across every industry. Banks, automakers, telcos, logistics firms, and retailers alike rely on Kafka as the nervous system for critical data.

Event-driven Architecture for Data Streaming

Today, over 150,000 organizations globally use Apache Kafka to enable real-time operations, modernize legacy systems, and support digital innovation. Kafka 4.0 moves even deeper into this role as a business-critical backbone. If you want to learn more about use cases and industry success stories, download my free ebook and subscribe to my newsletter.

Version 4.0 of Apache Kafka signals readiness for CIOs, CTOs, and enterprise architects who demand:

  • Uninterrupted uptime and failover for global operations
  • Data-driven automation and decision-making at scale
  • Flexible deployment across on-premises, cloud, and edge environments
  • A future-proof foundation for modernization and innovation

Apache Kafka 4.0 doesn’t just scale throughput—it scales business outcomes:

Use Cases for Data Streaming with Apache Kafka by Business Value
Source: Lyndon Hedderly (Confluent)

This post does not cover the technical improvements and new features of the 4.0 release, like ZooKeeper removal, Queues for Kafka, and so on. Those are well-documented elsewhere. Instead, it highlights the strategic business value Kafka 4.0 delivers to modern enterprises.

Kafka 4.0: A Platform Built for Growth

Today’s IT leaders are not just looking at throughput and latency. They are investing in platforms that align with long-term architectural goals and unlock value across the organization.

Apache Kafka 4.0 offers four core advantages for business growth:

1. Open De Facto Standard for Data Streaming

Apache Kafka is the open, vendor-neutral protocol that has become the de facto standard for data streaming across industries. Its wide adoption and strong community ecosystem make it both a reliable choice and a flexible one.

Organizations can choose between open-source Kafka distributions, managed services like Confluent Cloud, or even build their own custom engines using Kafka’s open protocol. This openness enables strategic independence and long-term adaptability—critical factors for any enterprise architect planning a future-proof data infrastructure.

2. Operational Efficiency at Enterprise Scale

Reliability, resilience, and ease of operation are key to any business infrastructure. Kafka 4.0 reduces operational complexity and increases uptime through a simplified architecture. Key components of the platform have been re-engineered to streamline deployment and reduce points of failure, minimizing the effort required to keep systems running smoothly.

Kafka is now easier to manage, scale, and secure—whether deployed in the cloud, on-premises, or at the edge in environments like factories or retail locations. It reduces the need for lengthy maintenance windows, accelerates troubleshooting, and makes system upgrades far less disruptive. As a result, teams can operate with greater efficiency, allowing leaner teams to support larger, more complex workloads with greater confidence and stability.

Storage management has also evolved in the past releases by decoupling compute and storage. This optimization allows organizations to retain large volumes of event data cost-effectively without compromising performance. This extends Kafka’s role from a real-time pipeline to a durable system of record that supports both immediate and long-term data needs.

With fewer manual interventions, less custom integration, and more built-in intelligence, Kafka 4.0 allows engineering teams to focus on delivering new services and capabilities—rather than maintaining infrastructure. This operational maturity translates directly into faster time-to-value and lower total cost of ownership at enterprise scale.

3. Innovation Enablement Through Real-Time Data

Real-time data unlocks entirely new business models: predictive maintenance in manufacturing, personalized digital experiences in retail, and fraud detection in financial services. Kafka 4.0 empowers teams to build applications around streams of events, driving automation and responsiveness across the value chain.

This shift is not just technical—it’s organizational. Kafka decouples producers and consumers of data, enabling individual teams to innovate independently without being held back by rigid system dependencies or central coordination. Whether building with Java, Python, Go, or integrating with SaaS platforms and cloud-native services, teams can choose the tools and technologies that best fit their goals.

This architectural flexibility accelerates development cycles and reduces cross-team friction. As a result, new features and services reach the market faster, experimentation is easier, and the overall organization becomes more agile in responding to customer needs and competitive pressures. Kafka 4.0 turns real-time architecture into a strategic asset for business acceleration.

4. Cloud-Native Flexibility

Kafka 4.0 reinforces Kafka’s role as the backbone of hybrid and multi-cloud strategies. In a data streaming landscape that spans public cloud, private infrastructure, and on-premise environments, Kafka provides the consistency, portability, and control that modern organizations require.

Whether deployed in AWS, Azure, GCP, or edge locations like factories or retail stores, Kafka delivers uniform performance, API compatibility, and integration capabilities. This ensures operational continuity across regions, satisfies data sovereignty and regulatory needs, and reduces latency by keeping data processing close to where it’s generated.

Beyond Kafka brokers, it is the Kafka protocol itself that has become the standard for real-time data streaming—adopted by vendors, platforms, and developers alike. This protocol standardization gives organizations the freedom to integrate with a growing ecosystem of tools, services, and managed offerings that speak Kafka natively, regardless of the underlying engine.

For instance, innovative data streaming platforms built using the Kafka protocol, such as WarpStream, provide a Bring Your Own Cloud (BYOC) model to allow organizations to maintain full control over their data and infrastructure while still benefiting from managed services and platform automation. This flexibility is especially valuable in regulated industries and globally distributed enterprises, where cloud neutrality and deployment independence are strategic priorities.

Kafka 4.0 not only supports cloud-native operations—it strengthens the organization’s ability to evolve, modernize, and scale without vendor lock-in or architectural compromise.

Real-Time as a Business Imperative

Data is no longer static. It is dynamic, fast-moving, and continuous. Businesses that treat data as something to collect and analyze later will fall behind. Kafka enables a shift from data at rest to data in motion.

Kafka 4.0 supports this transformation across all industries. For instance:

  • Automotive: Streaming data from factories, fleets, and connected vehicles
  • Banking: Real-time fraud detection and transaction analytics
  • Telecom: Customer engagement, network monitoring, and monetization
  • Healthcare: Monitoring devices, alerts, and compliance tracking
  • Retail: Dynamic pricing, inventory tracking, and personalized offers

These use cases cannot be solved by daily batch jobs. Kafka 4.0 enables systems—and decision-making—to operate at business speed. “The Top 20 Problems with Batch Processing (and How to Fix Them with Data Streaming)” explores this in more detail.

Additionally, Apache Kafka ensures data consistency across real-time streams, batch processes, and request-response APIs—because not all workloads are real-time, and that’s okay.

The Kafka Ecosystem and the Data Streaming Landscape

Running Apache Kafka at enterprise scale requires more than open-source software. Kafka has become the de facto standard for data streaming, but success with Kafka depends on using more than just the core project. Real-time applications demand capabilities like data integration, stream processing, governance, security, and 24/7 operational support.

Today, a rich and rapidly developing data streaming ecosystem has emerged. Organizations can choose from a growing number of platforms and cloud services built on or compatible with the Kafka protocol—ranging from self-managed infrastructure to Bring Your Own Cloud (BYOC) models and fully managed SaaS offerings. These solutions aim to simplify operations, accelerate time-to-market, and reduce risk while maintaining the flexibility and openness that Kafka is known for.

Confluent leads this category as the most complete data streaming platform, but it is part of a broader ecosystem that includes vendors like Amazon MSK, Cloudera, Azure Event Hubs, and emerging players in cloud-native and BYOC deployments. The data streaming landscape explores all the different vendors in this software category:

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

The market is moving toward complete data streaming platforms (DSP)—offering end-to-end capabilities from ingestion to stream processing and governance. Choosing the right solution means evaluating not only performance and compatibility but also how well the platform aligns with your business strategy, security requirements, and deployment preferences.

Kafka is at the center—but the future of data streaming belongs to platforms that turn Kafka 4.0’s architecture into real business value.

The Road Ahead with Apache Kafka 4.0 and Beyond

Apache Kafka 4.0 is a strategic enabler of modernization, innovation, and resilience. It directly supports key transformation goals:

  • Modernization without disruption: Kafka integrates seamlessly with legacy systems and provides a bridge to cloud-native, event-driven architectures.
  • Platform standardization: Kafka becomes a central nervous system across departments and business units, reducing fragmentation and enabling shared services.
  • Faster ROI from digital initiatives: Kafka accelerates the launch and evolution of digital services, helping teams iterate and deliver measurable value quickly.

Kafka 4.0 reduces operational complexity, unlocks developer productivity, and allows organizations to respond in real time to both opportunities and risks. This release marks a significant milestone in the evolution of real-time business architecture.

Kafka is no longer an emerging technology—it is a reliable foundation for companies that treat data as a continuous, strategic asset. Data streaming is now as foundational as databases and APIs. With Kafka 4.0, organizations can build connected products, automate operations, and reinvent the customer experience more easily than ever before.

And with innovations on the horizon—such as built-in queueing capabilities, brokerless writes directly to object storage, and expanded transactional guarantees supporting the two-phase commit protocol (2PC)—Kafka continues to push the boundaries of what’s possible in real-time, event-driven architecture.

The future of digital business is real-time. Apache Kafka 4.0 is ready.

Want to learn more about Kafka in the enterprise? Let’s connect and exchange ideas. Subscribe to the Data Streaming Newsletter. Explore the Kafka Use Case Book for real-world stories from industry leaders.

JavaScript, Node.js and Apache Kafka for Full-Stack Data Streaming

JavaScript is a pivotal technology for web applications. With the emergence of Node.js, JavaScript became relevant for both client-side and server-side development, enabling a full-stack development approach with a single programming language. Both Node.js and Apache Kafka are built around event-driven architectures, making them naturally compatible for real-time data streaming. This blog post explores open-source JavaScript Clients for Apache Kafka and discusses the trade-offs and limitations of JavaScript Kafka producers and consumers compared to stream processing technologies such as Kafka Streams or Apache Flink.

JavaScript Node JS Apache Kafka for Full Stack Data Streaming in Event Driven Architecture

JavaScript: A Pivotal Technology for Web Applications

JavaScript is a pivotal technology for web applications, serving as the backbone of interactive and dynamic web experiences. Here are several reasons JavaScript is essential for web applications:

  1. Interactivity: JavaScript enables the creation of highly interactive web pages. It responds to user actions in real-time, allowing for the development of features such as interactive forms, animations, games, and dynamic content updates without the need to reload the page.
  2. Client-Side Scripting: Running in the user’s browser, JavaScript reduces server load by handling many tasks on the client’s side. This can lead to faster web page loading times and a smoother user experience.
  3. Universal Browser Support: All modern web browsers support JavaScript, making it a universally accessible programming language for web development. This wide support ensures that JavaScript-based features work consistently across different browsers and devices.
  4. Versatile Frameworks and Libraries: The JavaScript ecosystem includes a vast array of frameworks and libraries (such as React, Angular, Vue.js) that streamline the development of web applications, from single-page applications to complex web-based software. These tools offer reusable components, two-way data binding, and other features that enhance productivity and maintainability.
  5. Real-Time Applications: JavaScript is ideal for building real-time applications, such as chat apps and live streaming services, thanks to technologies like WebSockets and frameworks that support real-time communication.
  6. Rich Web APIs: JavaScript can access a wide range of web APIs provided by browsers, allowing for the development of complex features, including manipulating the Document Object Model (DOM), making HTTP requests (AJAX or Fetch API), handling multimedia, and tracking user geolocation.
  7. SEO and Performance Optimization: Modern JavaScript frameworks and server-side rendering solutions help in building fast-loading web pages that are also search engine friendly, addressing one of the traditional criticisms of JavaScript-heavy applications.

In conclusion, JavaScript’s capabilities make it indispensable for modern web development, offering the tools and flexibility needed to build everything from simple websites to complex, high-performance web applications.

Full-Stack Development: JavaScript for the Server-Side with Node.js

With the advent of Node.js, JavaScript is no longer limited to the client side of web applications. It serves both client-side and server-side development, enabling a full-stack development approach with a single programming language. This simplifies the development process and allows for seamless integration between the frontend and backend.

Full Stack Web Development with JavaScript and Node JS

Using JavaScript for backend applications, especially with Node.js, offers several advantages:

  1. Unified Language for Frontend and Backend: JavaScript on the backend allows developers to use the same language across the entire stack, simplifying development and reducing context switching. This can lead to more efficient development processes and easier maintenance.
  2. High Performance: Node.js is a popular JavaScript runtime built on Chrome’s V8 engine, which is known for its speed and efficiency. Its non-blocking, event-driven architecture makes it particularly suitable for I/O-heavy operations and real-time applications like chat applications and online gaming.
  3. Vast Ecosystem: JavaScript has one of the largest ecosystems, powered by npm (Node Package Manager). npm provides a vast library of modules and packages that can be easily integrated into your projects, significantly reducing development time.
  4. Community Support: The JavaScript community is one of the largest and most active, offering a wealth of resources, frameworks, and tools. This community support can be invaluable for solving problems, learning new skills, and staying up to date with the latest technologies and best practices.
  5. Versatility: JavaScript with Node.js can be used for developing a wide range of applications, from web and mobile applications to serverless functions and microservices. This versatility makes it a go-to choice for many developers and companies.
  6. Real-time Data Processing: JavaScript is well-suited for applications requiring real-time data processing and updates, such as live chats, online gaming, and collaboration tools, because of its non-blocking nature and efficient handling of concurrent connections.
  7. Cross-platform Development: Tools like Electron and React Native allow JavaScript developers to build cross-platform desktop and mobile applications, respectively, further extending JavaScript’s reach beyond the web.

Node.js’s efficiency and scalability, combined with the ability to use JavaScript for both frontend and backend development, have made it a popular choice among developers and companies around the world. Its non-blocking, event-driven I/O characteristics are a perfect match for an event-driven architecture.
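
To make the event-driven model concrete, here is a minimal sketch of non-blocking, concurrent I/O in Node.js. The simulated delay stands in for any real I/O such as a database call or HTTP request:

    // Node.js multiplexes many I/O operations on a single thread: work is
    // scheduled on the event loop, and callbacks run as operations complete.
    const { setTimeout: sleep } = require('node:timers/promises');

    async function handleRequest(id) {
      await sleep(100); // simulate non-blocking I/O (database call, HTTP request, ...)
      return `response ${id}`;
    }

    // All three "requests" overlap; total time is ~100 ms, not 300 ms.
    Promise.all([1, 2, 3].map(handleRequest)).then(console.log);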

JavaScript and Apache Kafka for Event-Driven Applications

Using Node.js with Apache Kafka offers several benefits for building scalable, high-performance applications that require real-time data processing and streaming capabilities. Here are several reasons integrating Node.js with Apache Kafka is helpful:

  1. Unified Language for Full-Stack Development: Node.js allows developers to use JavaScript across both the client and server sides, simplifying development workflows and enabling seamless integration between frontend and backend systems, including Kafka-based messaging or event streaming architectures.
  2. Event-driven Architecture: Both Node.js and Apache Kafka are built around event-driven architectures, making them naturally compatible. Node.js can efficiently handle Kafka’s real-time data streams, processing events asynchronously and non-blocking.
  3. Scalability: Node.js is known for its ability to handle concurrent connections efficiently, which complements Kafka’s scalability. This combination is ideal for applications that require handling high volumes of data or requests simultaneously, such as IoT platforms, real-time analytics, and online gaming.
  4. Large Ecosystem and Community Support: Node.js’s extensive npm ecosystem includes Kafka libraries and tools that facilitate the integration. This support speeds up development, offering pre-built modules for connecting to Kafka clusters, producing and consuming messages, and managing topics.
  5. Real-time Data Processing: Node.js is well-suited for building applications that require real-time data processing and streaming, a core strength of Apache Kafka. Developers can leverage Node.js to build responsive and dynamic applications that process and react to Kafka data streams in real time.
  6. Microservices and Cloud-native Applications: The combination of Node.js and Kafka is powerful for developing microservices and cloud-native applications. Kafka serves as the backbone for inter-service communication. Node.js is used to build lightweight, scalable service components.
  7. Flexibility and Speed: Node.js enables rapid development and prototyping. Kafka environments can implement new streaming data pipelines and applications quickly.

In summary, using Node.js with Apache Kafka leverages the strengths of both technologies to build efficient, scalable, and real-time applications. The combination is an attractive choice for many developers.
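
As a minimal, hedged sketch of this combination, the following Node.js snippet produces and consumes Kafka events with the open source KafkaJS client (covered in the next section). The broker address and topic name are placeholders:

    const { Kafka } = require('kafkajs');

    const kafka = new Kafka({ clientId: 'demo-app', brokers: ['localhost:9092'] });

    async function run() {
      // Producer: publish an event to the 'orders' topic.
      const producer = kafka.producer();
      await producer.connect();
      await producer.send({
        topic: 'orders',
        messages: [{ key: 'order-1', value: JSON.stringify({ amount: 42 }) }],
      });

      // Consumer: process events asynchronously as they arrive.
      const consumer = kafka.consumer({ groupId: 'order-processors' });
      await consumer.connect();
      await consumer.subscribe({ topic: 'orders', fromBeginning: true });
      await consumer.run({
        eachMessage: async ({ topic, partition, message }) => {
          console.log(`${topic}[${partition}] ${message.key}: ${message.value}`);
        },
      });
    }

    run().catch(console.error);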

Open Source JavaScript Clients for Apache Kafka

Various open source JavaScript clients exist for Apache Kafka. Developers use them to build everything from simple message production and consumption to complex streaming applications. When choosing a JavaScript client for Apache Kafka, consider factors like performance requirements, ease of use, community support, commercial support, and compatibility with your Kafka version and features.

Open Source JavaScript Clients for Apache Kafka

For working with Apache Kafka in JavaScript environments, several clients and libraries can help you integrate Kafka into your JavaScript or Node.js applications. Here are some of the notable JavaScript clients for Apache Kafka from the past years:

  1. kafka-node: One of the original Node.js clients for Apache Kafka, kafka-node provides a straightforward and comprehensive API for interacting with Kafka clusters, including producing and consuming messages.
  2. node-rdkafka: This client is a high-performance library for Apache Kafka that wraps the native librdkafka library. It’s known for its robustness and is suitable for heavy-duty operations. node-rdkafka offers advanced features and high throughput for both producing and consuming messages.
  3. KafkaJS: An Apache Kafka client for Node.js, which is entirely written in JavaScript. It focuses on simplicity and ease of use and supports the latest Kafka features. KafkaJS is designed to be lightweight and flexible, making it a good choice for applications that require a simple and efficient way to interact with a Kafka cluster.

Challenges with Open Source Projects In General

Open source projects are only successful if an active community maintains them. Therefore, common issues with open source projects include:

  1. Lack of Documentation: Incomplete or outdated documentation can hinder new users and contributors.
  2. Complex Contribution Process: A complicated process for contributing can deter potential contributors. This is not just a disadvantage, as it guarantees code reviews and quality checks of new commits.
  3. Limited Support: Relying on community support can lead to slow issue resolution times. Critical projects often require commercial support by a vendor.
  4. Project Abandonment: Projects can become inactive if maintainers lose interest or lack time.
  5. Code Quality and Security: Ensuring high code quality and addressing security vulnerabilities can be challenging if nobody is responsible and no critical SLAs are in place.
  6. Governance Issues: Disagreements on project direction or decisions can lead to forks or conflicts.

Issues with Kafka’s JavaScript Open Source Clients

Some of the above challenges apply to the available open source JavaScript clients for Kafka. We have seen maintenance inactivity and quality issues as the biggest problems in these projects.

And be aware that it is difficult for maintainers to keep up not only with issues but also with new KIPs (Kafka Improvement Proposals). The Apache Kafka project is very active, releasing new features two to three times a year.

Kafka-node, KafkaJS and node-rdkafka are all on different parts of the “unmaintained” spectrum. For example, kafka-node has not had a commit in 5 years. KafkaJS had an open call for maintainers around a year ago.

Additionally, commercial support was not available for enterprises to get guaranteed response times and support help in case of production issues. Unfortunately, production issues happened regularly in critical deployments.

For this reason, Confluent open sourced a new JavaScript client for Apache Kafka with guaranteed maintenance and commercial support.

Confluent’s Open Source JavaScript Client for Kafka powered by librdkafka

Confluent, the company founded by the creators of Kafka, provides a Kafka client for JavaScript. This client works seamlessly with Confluent Cloud (fully managed service) and Confluent Platform (self-managed deployments). But it is an open source project and works with any Apache Kafka environment.

The JavaScript client for Kafka comes with a long-term support and development strategy. The source code is available now on GitHub, and the client is available via npm (Node Package Manager), the default package manager for Node.js.

This JavaScript client is a librdkafka-based library (derived from node-rdkafka) with API compatibility for the very popular KafkaJS library. Users of KafkaJS can easily migrate their code over (details in the migration guide in the repo).
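
Based on the announced API compatibility, a migration might be as small as swapping the import. The package name and configuration shape below are my assumptions; verify them against the migration guide in the repository:

    // Before (KafkaJS):
    // const { Kafka } = require('kafkajs');

    // After (Confluent's client; package name and KafkaJS-compatible
    // namespace assumed from the repository's documentation):
    const { Kafka } = require('@confluentinc/kafka-javascript').KafkaJS;

    const kafka = new Kafka({ kafkaJS: { brokers: ['localhost:9092'] } });

    // The rest of the producer/consumer code can stay unchanged
    // thanks to the KafkaJS API compatibility.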

At the time of writing in February 2024, the new Confluent JavaScript Kafka Client is in early access and not for production usage. GA is later in 2024. Please review the GitHub project, try it out, and share feedback and issues when you build new projects or migrate from other JavaScript clients.

What About Stream Processing?

Keep in mind that Kafka clients only provide produce and consume APIs. However, the real potential of event-driven architectures comes with stream processing. This is a computing paradigm that allows for the continuous ingestion, processing, and analysis of data streams in real time. Event stream processing enables immediate responses to incoming data without the need to store and process it in batches.

JavaScript NodeJS Apache Kafka Flink Snowflake and MongoDB for an Event-Driven Architecture

Stream processing frameworks like Kafka Streams or Apache Flink offer several key features that enable real-time data processing and analytics:

  1. State Management: Stream processing systems can manage state across data streams, allowing for complex event processing and aggregation over time.
  2. Windowing: They support processing data in windows, which can be based on time, data size, or other criteria, enabling temporal data analysis.
  3. Exactly-once Processing: Advanced systems provide guarantees for exactly-once processing semantics, ensuring data is processed once and only once, even in the event of failures.
  4. Integration with External Systems: They offer connectors for integrating with various data sources and sinks, including databases, message queues, and file systems.
  5. Event Time Processing: They can handle out-of-order data based on the time events actually occurred, not just when they are processed.

Stream processing frameworks are NOT available for most programming languages, including JavaScript. Therefore, if you live in the JavaScript world, you have three options:

  • Build all the stream processing capabilities by yourself. Trade-off: A lot of work! (See the sketch after this list.)
  • Leverage a stream processing framework in SQL (or another programming language). Trade-off: This is not JavaScript!
  • Don’t do stream processing and stay with APIs and databases. Trade-off: Cannot solve many innovative use cases.
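
To illustrate why the first option means real work, here is a naive, hand-rolled tumbling window counting events per key in plain JavaScript. Production-grade frameworks add state persistence, exactly-once guarantees, event-time handling, and rebalancing on top of this:

    // Naive one-minute tumbling window: counts events per key in memory.
    const windowSizeMs = 60_000;
    const counts = new Map(); // key -> count for the current window
    let windowStart = Date.now();

    function onEvent(key) {
      const now = Date.now();
      if (now - windowStart >= windowSizeMs) {
        // Emit the finished window, then start a new one.
        console.log(new Date(windowStart).toISOString(), Object.fromEntries(counts));
        counts.clear();
        windowStart = now;
      }
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }

    // Wire this into a Kafka consumer's eachMessage handler, e.g.:
    // eachMessage: async ({ message }) => onEvent(message.key.toString())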

Apache Flink provides APIs for Java, Python, and ANSI SQL. SQL is an excellent option to complement JavaScript code. In a fully managed data streaming platform like Confluent Cloud, you can leverage serverless Flink SQL for stream processing and combine it with your JavaScript applications.
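
To sketch the recommended combination: a Flink SQL statement (shown as a comment because it runs in the stream processor, not in Node.js) maintains a windowed aggregate, and the JavaScript application simply consumes the result topic. Topic and column names are illustrative only:

    // A Flink SQL job (e.g. serverless Flink in Confluent Cloud) might maintain
    // the aggregation -- illustrative syntax only:
    //   INSERT INTO orders_per_minute
    //   SELECT window_start, COUNT(*) AS order_count
    //   FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
    //   GROUP BY window_start, window_end;

    // The JavaScript side stays a plain Kafka consumer of the processed stream:
    const { Kafka } = require('kafkajs');
    const consumer = new Kafka({ brokers: ['localhost:9092'] }).consumer({ groupId: 'dashboard' });

    (async () => {
      await consumer.connect();
      await consumer.subscribe({ topic: 'orders_per_minute' });
      await consumer.run({
        eachMessage: async ({ message }) => {
          // e.g. push the aggregate to a web dashboard via WebSocket
          console.log('window result:', message.value.toString());
        },
      });
    })().catch(console.error);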

One Programming Language Does NOT Solve All Problems

JavaScript has broad adoption and sweet spots for client and server development. The new Kafka Client for JavaScript from Confluent is open source and has a long-term development strategy, including commercial support.

Easy migration from KafkaJS makes the adoption very simple. If you can live with the dependency on librdkafka (which is acceptable for most situations), then this is the way to go for JavaScript/Node.js development with Kafka producers and consumers.

JavaScript is NOT an all-rounder. The data streaming ecosystem is broad, open and flexible. Modern enterprise architectures leverage microservices or data mesh principles. You can choose the right technology for your application.

Learn how to build data streaming applications using your favorite programming language and open source Kafka client by looking at Confluent’s developer examples:

  • JavaScript/Node.js
  • Java
  • HTTP/REST
  • C/C++/.NET
  • Kafka Connect DataGen
  • Go
  • Spring Boot
  • Python
  • Clojure
  • Groovy
  • Kotlin
  • Ruby
  • Rust
  • Scala

For stream processing, get started with Kafka Streams or Apache Flink.

Which JavaScript Kafka client do you use? What are your experiences? Or do you already develop most applications with stream processing using Kafka Streams or Apache Flink? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Why Tiered Storage for Apache Kafka is a BIG THING…

Apache Kafka added Tiered Storage to separate compute and storage. The capability enables more scalable, reliable and cost-efficient enterprise architectures. This blog post explores the architecture, use cases, benefits, and a case study for storing Petabytes of data in the Kafka commit log. The end discusses why Tiered Storage does NOT replace other databases and how Apache Iceberg might change future Kafka architectures even more.

Tiered Storage for Apache Kafka - Use Cases Architecture Benefits


If you prefer watching a ten-minute video, check out this summary about the “Evolution of Storage for Apache Kafka covering Tiered Storage, Direct Write to Object Storage and the relation to Open Table Formats such as Apache Iceberg”:

Now, let’s explore why Tiered Storage for Apache Kafka is a BIG THING:

Compute vs. Storage vs. Tiered Storage

Let’s define the terms compute, storage, and tiered storage to have the same understanding when exploring this in the context of the data streaming platform Apache Kafka.

Compute and Storage

Two fundamental components of a computing system are compute and storage. They serve different purposes in information processing.

Compute refers to the processing power and capability of a computer system to perform tasks, execute instructions, and carry out computations. The compute component includes the CPU (Central Processing Unit) and GPU (Graphics Processing Unit).

Storage refers to the components and systems that store and retrieve data over the long term. It is where data is persistently maintained for later use. Storage includes devices such as hard disk drives (HDDs), solid-state drives (SSDs), and other types of non-volatile memory that keep data even when the power is turned off.

Tiered Storage

Tiered storage refers to a storage architecture that uses different classes or tiers of storage (e.g., Object Storage on S3) to efficiently manage and store data based on its access patterns, performance requirements, and cost considerations.

The goal of tiered storage is to optimize the use of storage resources, balancing performance and cost, by placing data on the most suitable storage media based on its characteristics and the organization’s policies.

Data placement and movement between these tiers can be automated based on policies and algorithms that analyze usage patterns, access frequency, and other factors. This ensures that the most critical and frequently accessed data lives in high-performance storage, while less critical or infrequently accessed data is moved to lower-cost, lower-performance storage.

Long-Term Storage in Apache Kafka

Apache Kafka is an open-source distributed streaming platform that is used for building real-time data pipelines and streaming applications. Kafka is the established de facto standard for data streaming. The event streaming platform handles large volumes of data, providing a scalable and fault-tolerant architecture.

Applications and data stores use Kafka for ingesting, storing, and processing real-time data streams, making it a fundamental component in building event-driven architectures and systems that require the processing of continuous data flows. Additionally, many use cases leverage Kafka not just for real-time data but to ensure data consistency across real-time, batch, and request-response APIs.

Use Cases for Apache Kafka as Storage System

While most people think about Kafka as a message broker, real-time analytics platform, or big data ingestion system, the distributed commit log with ordering guarantees and timestamps enables plenty of use cases for accessing data long after its creation or replaying historical data.

Use Cases for Replaying Historical Events with Apache Kafka

Here are a few examples of use cases that leverage long-term storage of data in Kafka (a replay sketch follows the list):

  • New consumer: Deploy a new application / database / data warehouse, data lake and synchronize the state of the business objects.
  • Offloading: Reducing cost significantly by NOT consuming again and again from expensive or non-scalable systems (e.g. mainframe and MIPS)
  • Error-handling: Re-process historical data after fixing an issue in the business logic.
  • Compliance / regulatory processing: Replay historical data to analyze an incident.
  • Query and analyze existing events: Consume data from a notebook for data engineering, analytics, or reporting.
  • Schema changes in analytics platform: Re-process data after updating data contracts.
  • Model training: Batch ingestion into an AI framework to apply a machine learning algorithm
  • Disaster recovery: Operational data stores replay data again from the persistent commit log in the case of a failure.
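
To make replay concrete, below is a minimal sketch using the open source KafkaJS client (introduced in the JavaScript post above). It looks up the partition offsets for a given timestamp and rewinds a fresh consumer group to re-process history. Topic name, broker address, and processing logic are placeholders:

    const { Kafka } = require('kafkajs');

    const kafka = new Kafka({ brokers: ['localhost:9092'] });

    // Replay all 'payments' events since a given point in time, e.g. to
    // re-process historical data after fixing a bug in the business logic.
    async function replaySince(timestampMs) {
      const admin = kafka.admin();
      await admin.connect();
      // First offset at or after the timestamp, per partition.
      const offsets = await admin.fetchTopicOffsetsByTimestamp('payments', timestampMs);
      await admin.disconnect();

      const consumer = kafka.consumer({ groupId: `replay-${Date.now()}` });
      await consumer.connect();
      await consumer.subscribe({ topic: 'payments' });

      await consumer.run({
        eachMessage: async ({ message }) => {
          // Apply the (fixed) business logic to each historical event.
          console.log(message.timestamp, message.value.toString());
        },
      });

      // seek() must be called after run(); rewind every partition to the timestamp.
      for (const { partition, offset } of offsets) {
        consumer.seek({ topic: 'payments', partition, offset });
      }
    }

    replaySince(Date.parse('2024-01-01T00:00:00Z')).catch(console.error);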

Objections for Storing Data Long-Term in Kafka

Storing data long term in Kafka has a few drawbacks. The following arguments are valid concerns:

  • Cost: Storing large volumes of data on attached disks is much more expensive than external storage systems like an object store.
  • Scalability: Operating Kafka brokers with lots of data (say many gigabytes, or even terabytes, and more) is challenging, especially in the case of failures when you need to rebalance partitions.
  • Risk: Downtime or data inconsistencies happen if operations struggle with large volumes or when hardware needs to be migrated.

Therefore, you should NOT store big data sets in Kafka without Tiered Storage! With this in mind, let’s explore how Tiered Storage for Kafka solves these problems.

Introducing Tiered Storage for Apache Kafka

Apache Kafka’s backend is a distributed system running Kafka brokers. Each Kafka broker has processing and storage capabilities.

The applications are producers and consumers of events. Many interfaces communicate with Kafka brokers:

  • An application written in Java, Python, C++, Go, or any other programming language
  • A Kafka Connect source or sink connector connecting to IBM MQ, Spark, Snowflake, or any other data store or SaaS application
  • A stream processor built with Kafka-native Kafka Streams, KSQL, or external infrastructures like Apache Flink
  • Any other endpoint, like an HTTP interface or an out-of-the-box integration of another middleware or data platform

What is Tiered Storage for Kafka?

Tiered storage for Apache Kafka refers to the capability of configuring different storage tiers to optimize the storage infrastructure based on the access patterns and requirements of the data stored in Kafka brokers.

A Kafka cluster stores data in Kafka Topics. These topics can have different characteristics in terms of importance, access frequency, and retention policies.

The concept is like the general idea of tiered storage in storage systems, but it is adapted to the specific needs of Kafka. Tiered Storage is one critical component in making the Kafka architecture cloud-native.

Kafka Architecture without Tiered Storage

Kafka applications communicate with logical Kafka Topics to produce messages to or consume messages from partitions:

Apache Kafka Architecture without Tiered Storage

The storage is a disk attached to the broker. This can be HDD or SSD disks on-premise, or e.g. EBS volumes in the AWS cloud.

Kafka Architecture with Tiered Storage

Tiered Storage for Kafka does NOT change how applications communicate with Kafka brokers. Tiered Storage is an implementation detail:

Apache Kafka Architecture with Tiered Storage

Besides the disks attached to the broker, Kafka offloads data to an external storage. Most times, this is an object store like Amazon S3, Azure Blob Storage, Google Cloud Storage, or MinIO for Kubernetes.

Serverless cloud offerings handle the offloading for the operator. Self-managed solutions allow operators to configure hot and cold storage durations for each Kafka Topic.
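
As a rough sketch of what this per-topic configuration can look like with KIP-405 (exact property names and availability depend on the Kafka version and vendor), the following uses the KafkaJS admin client to create a topic with a one-day hot tier on broker disks and one year of total retention:

    const { Kafka } = require('kafkajs');

    const admin = new Kafka({ brokers: ['localhost:9092'] }).admin();

    async function createTieredTopic() {
      await admin.connect();
      // Assumes the brokers run Kafka 3.6+ with remote storage enabled
      // (remote.log.storage.system.enable=true plus a RemoteStorageManager).
      await admin.createTopics({
        topics: [{
          topic: 'payments',
          numPartitions: 6,
          configEntries: [
            { name: 'remote.storage.enable', value: 'true' },  // offload to the cold tier
            { name: 'retention.ms', value: '31536000000' },    // total retention: 1 year
            { name: 'local.retention.ms', value: '86400000' }, // hot tier on broker disk: 1 day
          ],
        }],
      });
      await admin.disconnect();
    }

    createTieredTopic().catch(console.error);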

Benefits of Tiered Storage for Apache Kafka

Let’s review the above-discussed objections to storing big data sets long-term in Kafka and how Tiered Storage helps:

  • Reduced cost: Most data is offloaded to an external storage. This reduces the storage cost significantly.
  • Improved scalability: Only data on the disks attached to the Kafka brokers must be rebalanced. As most data is offloaded, rebalancing only takes seconds or minutes, even if the external storage holds petabytes.
  • Reduced risk: Better scalability and separation of compute and storage makes operations much easier and significantly reduces the risk of downtime or data inconsistency.

The Implementation of Tiered Storage in Apache Kafka

Tiered Storage for Apache Kafka is available. However, be aware that different implementations exist with different features, maturity, and support levels.

And open source Apache Kafka only provides the interface for tiered storage. You must choose an open source implementation, build your own integration into an external storage system, or leverage a commercial product or cloud service that embeds tiered storage into its offering.

Keep in mind that the interface alone is not helpful. The implementation needs to be battle-tested and guarantee data consistency across the hot storage on the broker and cold storage in the external storage; even in the case of failure, network issues, etc.

Kafka consumers do not see the implementation details of Kafka’s Tiered Storage. They simply consume as if there were no tiered storage implementation (and can expect the same behavior). There are no API or code changes needed in Kafka client applications. Hence, you can easily migrate an existing deployment to a Kafka cluster leveraging Tiered Storage.

Many people ask about the performance impact of tiered storage for Kafka. The short answer: There is no performance impact for most scenarios. Real-time consumers consume from the memory / page cache as before. And replaying historical data from the event log performs similarly whether it is read from the local disk or the remote object store.

Apache Kafka 3.6 Release Makes Tiered Storage Available

When writing this blog post (December 2023), KIP-405: Kafka Tiered Storage is available as early access in Apache Kafka 3.6. This release introduces Tiered Storage to Kafka, but the feature is only for non-production environments (see the early access notes for more information).

GA of this feature is just a foreseeable matter of time. The bulk of KIP-405 was part of early access in release 3.6. But there are a few additional features that are slated for 3.7. And GA likely comes after that in 3.8+.

KIP-405 Provides a Pluggable Storage API for Tiering

KIP-405 separates compute and storage in the Kafka broker and introduces pluggable storage tiering natively in Kafka, bringing a seamless storage extension to remote objects with minimal operational changes.

Apache Kafka’s LocalTieredStorage default implementation is a local file-based RemoteStorageManager. LocalTieredStorage facilitates the simulation of remote storage behavior in a controlled and isolated environment during testing. It is not meant for production use cases! Enterprises need to write their own implementation, embed an open-source alternative, or rely on a software vendor or cloud service.

How Confluent, Uber, and Others use Tiered Storage

KIP-405 is only available as early access with Kafka 3.6. But some proprietary implementations have already existed in production for years. This also helped to define the KIP with lessons learned from running Kafka in production with tiered storage under the hood.

Implementation details of tiered storage for Kafka vary, and there may be different approaches or tools available to achieve this, depending on the specific Kafka distribution or storage infrastructure being used. Organizations might also use external systems or cloud storage solutions to implement tiered storage strategies for Kafka.

Confluent pioneered tiered storage for Kafka and has provided the capability for several years already. It is available for the self-managed Confluent Platform and the fully managed Confluent Cloud in AWS, Azure, and GCP. Confluent chose the S3 interface to implement storage support for the cloud providers (AWS, Azure, GCP) and several on-premise solutions like PureStorage Flash Blade, Nutanix Objects, Netapp Object Storage, Dell EMC ECS, Hitachi Content Platform Object Storage, or MinIO for Kubernetes.

Uber, which led the implementation of KIP-405 in open source Apache Kafka, runs its tiered storage against HDFS. Confluent and AWS contributed to refactoring, best practices, and performance / integration testing. Satish Duggana, tech lead for Data and Streaming Infrastructure at Uber, presented the details of their implementation and deployment in a talk at Current 2023.

Other vendors like AWS with MSK and Aiven are adopting KIP-405 and provide their own tiered storage implementations these days.

Case Study: KOR Financial stores 160 Petabytes in Kafka for Regulatory Reporting

KOR is a cloud-native family of global trade repositories and regulatory reporting services that has adopted Confluent Cloud and a data streaming architecture to improve compliance processes.

Regulatory reporting is obviously a perfect use case for Tiered Storage in Kafka to replay historical data. As the Kafka log provides guaranteed ordering and timestamps, there is no need for another database or data lake besides Kafka.

Daan Gerits, Chief Data Officer, KOR Financial, explains at Diginomica: “At KOR Financial, we have a very specific problem that we are trying to solve, which is collecting trading information for regulators. And we decided to do it in a totally different way to the way that most people are doing it. Where others would be using data storage or big data technologies, we decided to go all in on Kafka. We are building our system to store 160 petabytes in Confluent Cloud and then work on top of that. We don’t have any other database. So it’s a long retention use case.”

Kafka is NOT a Database (Replacement)

Apache Kafka is a database. It provides ACID guarantees. Hundreds of companies deploy Kafka for mission-critical workloads, including transactional ones. However, most of the time, Kafka is NOT competitive with other databases.

Can Apache Kafka Replace a Database like Oracle Hadoop NoSQL MongoDB Elastic MySQL et al

Kafka is an event streaming platform for messaging, storage, processing and integration at scale in real-time with zero downtime and zero data loss. Almost all deployments connect Kafka to database sources and sinks for data integration, decoupling and data consistency, where the heart of the cloud-native enterprise architecture is real-time, scalable and reliable.

Apache Kafka is complementary to database, data warehouse, data lake and lakehouse architectures. I wrote a blog series about use cases and architectures for data streaming together with other storage platforms.

The Future: Apache Iceberg for Kafka?

The adoption of Tiered Storage for Apache Kafka is just getting started. Many teams will store (some) data longer in Kafka to offload data from expensive systems or replay historical data without needing another database.

However, most analytics platforms do NOT use the Kafka protocol to consume and query data. The trend across most data platforms goes towards Apache Iceberg as a standardized abstraction layer for storing and querying (non-real-time) data in an object store or other storage.

Apache Iceberg is an open-source table format and processing framework for big data. It aims to provide the best of both worlds: the performance of a traditional table format with the flexibility of a schema-on-read approach. Iceberg addresses the challenges of managing large-scale and evolving data sets in distributed storage environments.

Apache Iceberg supports popular data processing frameworks, such as Apache Spark, Apache Flink, Apache Hive, Presto, and more. With Kafka’s Tiered Storage and especially the S3 support by some vendors, I can see how this can be a complete game changer for storing and processing events in real-time with the Kafka protocol or with other analytics engines and databases in near-real-time or batch.

The future will show us. For now, let’s be excited about how Tiered Storage for Kafka is the next big thing around data streaming.

Tiered Storage makes Kafka more Scalable, Cost-Efficient and Reliable

Tiered Storage for Apache Kafka makes event-driven architectures more scalable, cost-efficient and reliable. It enables new use cases that required another database or data lake in the past.

However, Kafka’s goal is still NOT to replace other data and analytics platforms. Design patterns like microservices and data mesh enable a true decoupling of applications and data stores. Kafka provides this decoupling. With tiered storage in mind for various use cases such as offloading, new consumers, or error-handling, you can consider new approaches for your cloud-native enterprise architecture.

Are you excited about Tiered Storage for Apache Kafka? How will you use it? Or do you already use an existing implementation, like Confluent Cloud? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The State of Data Streaming for the Public Sector

This blog post explores the state of data streaming for the public sector. The evolution of government digitalization, citizen expectations, and cybersecurity risks requires optimized end-to-end visibility into information, convenient mobile apps, and integration of legacy platforms like the mainframe with pioneering technologies like social media. Data streaming provides consistency across all layers and allows integrating and correlating data in real-time at any scale. I look at public sector trends to explore how data streaming with Apache Kafka helps as a business enabler, including customer stories from the US Department of Defense (DoD), NASA, Deutsche Bahn (German Railway), and others. A complete slide deck and on-demand video recording are included.

The State of Data Streaming for the Public Sector in 2023

The public sector covers so many different areas. Examples include defense, law enforcement, national security, healthcare, public administration, police, judiciary, finance and tax, research, aerospace, agriculture, etc. Many of these terms and sectors overlap. Many of these use cases are applicable across many sectors.

Several disruptive trends impact innovation in the public sector to automate processes, provide a better experience for citizens, and strengthen cybersecurity defense tactics.

The two critical pillars across departments in the public sector are IT modernization and data-driven applications.

IT modernization in the government

The research company Gartner identified the following technology trends for the government to accelerate the digital transformation as they prepare for post-digital government:

Gartner Top Technology Trends in Government for 2023

These trends do not differ much from those in traditional companies in the private sector, like banking or insurance. Data consistency across monolithic legacy infrastructure and cloud-native applications matters.

Accelerating data maturity in the public sector

The public sector is often still very slow in innovation. Time-to-market is crucial. IT modernization requires up-to-date technologies and development principles. Data sharing across applications, departments, or states requires a data-driven enterprise architecture.

McKinsey & Company says “Government entities have created real-time pandemic dashboards, conducted geospatial mapping for drawing new public transportation routes, and analyzed public sentiment to inform economic recovery investment.

While many of these examples were born out of necessity, public-sector agencies are now embracing the impact that data-driven decision making can have on residents, employees, and other agencies. Embedding data and analytics at the core of operations can help optimize government resources by targeting them more effectively and enable civil servants to focus their efforts on activities that deliver the greatest results.”

McKinsey and Company - Accelerating Data Maturity in the Government

AI and Machine Learning help with automation. Chatbots and other conversational AI improve the total experience of citizens and public sector employees.

Data streaming in the government and public sector

Real-time data beats slow data in almost all use cases. No matter which agency or department you look at in the government and public sector:

Real-Time Data Streaming in the Government and Public Sector

Data streaming combines the power of real-time messaging at any scale with storage for true decoupling, data integration, and data correlation capabilities. Apache Kafka is the de facto standard for data streaming.

Check out the below links for a broad spectrum of examples and best practices. Additionally, here are a few new customer stories from recent months.

New customer stories for data streaming in the public sector and government

So much innovation is happening worldwide, even in the “slow” public sector. Automation and digitalization change how we search and buy products and services, communicate with partners and customers, provide hybrid shopping models, and more.

More and more governments and non-profit organizations use a cloud-first approach to improve time-to-market, increase flexibility, and focus on business logic instead of operating IT infrastructure.

Here are a few customer stories from worldwide organizations in the public sector and government:

  • University of California, San Diego: Integration Platform as a Service (iPaaS) as “Swiss army knife” of integration.
  • U.S. Citizenship and Immigration Services (USCIS): Real-time inter-agency data sharing.
  • Deutsche Bahn (German Railway): Customer data platform for real-time notification about delays and cancellations, plus B2B integration with Google Maps.
  • NASA: General Coordinates Network (GCN) for multi-messenger astronomy alerts between space- and ground-based observatories, physics experiments, and thousands of astronomers worldwide.
  • US Department of Defense (DOD): Joint All Domain Command and Control (JADC2), a strategic warfighting concept that connects the data sensors, shooters, and related communications devices of all U.S. military services. DOD uses the ride-sharing service Uber as an analogy to describe its desired end state for JADC2 leveraging data streaming.

Resources to learn more

This blog post is just the starting point.

I wrote a blog series exploring why many governments and public infrastructure sectors leverage data streaming for various use cases. Learn about real-world deployments and different architectures for data streaming with Apache Kafka in the public sector:

  1. Life is a Stream of Events
  2. Smart City
  3. Citizen Services
  4. Energy and Utilities
  5. National Security

Learn more about data streaming for the government and public sector in the following on-demand webinar recording, the related slide deck, and further resources, including pretty cool lightboard videos about use cases. I presented with my colleague Will La Forest, SME for the public sector and governments.

On-demand video recording

The video recording explores public sector trends and architectures for data streaming leveraging Apache Kafka and other modern and cloud-native technologies. The primary focus is the data streaming case studies. Check out our on-demand recording:

The State of Data Streaming for Public Sector and Government in 2023

Slides

If you prefer learning from slides, check out the deck used for the above recording:


Case studies and lightboard videos for data streaming in the public sector and government

The state of data streaming for the public sector in 2023 is fascinating. New use cases and case studies come up every month. Mission-critical deployments at governments in the United States and Germany prove the maturity of data streaming regarding security and data privacy. The success stories demonstrate better data governance across the entire organization, secure data collection and processing in real-time, data sharing and cross-agency partnerships with Open APIs for new business models, and many more scenarios.

We recorded lightboard videos showing the value of data streaming simply and effectively. These five-minute videos explore the business value of data streaming, related architectures, and customer stories. Stay tuned; I will update the links in the next few weeks and publish a separate blog post for each story and lightboard video.

And this is just the beginning. Every month, we will talk about the status of data streaming in a different industry. Manufacturing was first, financial services second, then retail, telcos, gaming, and so on…

Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post The State of Data Streaming for the Public Sector appeared first on Kai Waehner.

Apache Kafka (including Kafka Streams) + Apache Flink = Match Made in Heaven https://www.kai-waehner.de/blog/2023/01/23/apache-kafka-and-apache-flink-a-match-made-in-heaven/ Mon, 23 Jan 2023 09:35:05 +0000 https://www.kai-waehner.de/?p=5160 Apache Kafka and Apache Flink are increasingly joining forces to build innovative real-time stream processing applications. This blog post explores the benefits of combining both open-source frameworks, shows unique differentiators of Flink versus Kafka, and discusses when to use a Kafka-native streaming engine like Kafka Streams instead of Flink.

The post Apache Kafka (including Kafka Streams) + Apache Flink = Match Made in Heaven appeared first on Kai Waehner.


Apache Kafka and Apache Flink for Open Source and Cloud-native Data Streaming

Apache Kafka became the de facto standard for data streaming. The core of Kafka is messaging at any scale in combination with a distributed storage (= commit log) for reliable durability, decoupling of applications, and replayability of historical data.

Kafka also includes a stream processing engine with Kafka Streams. And KSQL is another successful Kafka-native streaming SQL engine built on top of Kafka Streams. Both are fantastic tools. In parallel, Apache Flink became a very successful stream processing engine.

The first prominent Kafka + Flink case study I remember is the fraud detection use case of ING Bank. The first publications came up in 2017, i.e., over five years ago: “StreamING Machine Learning Models: How ING Adds Fraud Detection Models at Runtime with Apache Kafka and Apache Flink“. This is just one of many Kafka fraud detection case studies.

One of the last case studies I blogged about goes in the same direction: “Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink“.

The adoption of Kafka is already outstanding. And Flink gets into enterprises more and more, very often in combination with Kafka. This article is no introduction to Apache Kafka or Apache Flink. Instead, I explore why these two technologies are a perfect match for many use cases and when other Kafka-native tools are the appropriate choice instead of Flink.

Stream processing is a paradigm that continuously correlates events of one or more data sources. Data is processed in motion, in contrast to traditional processing at rest with a database and request-response API (e.g., a web service or a SQL query). Stream processing is either stateless (e.g., filter or transform a single message) or stateful (e.g., an aggregation or sliding window). Especially state management is very challenging in a distributed stream processing application.

A vital advantage of the Apache Flink engine is its efficiency in stateful applications.  Flink has expressive APIs, advanced operators, and low-level control. But Flink is also scalable in stateful applications, even for relatively complex streaming JOIN queries.

Flink’s scalable and flexible engine is fundamental to providing a tremendous stream processing framework for big data workloads. But there is more. The following aspects are my favorite features and design principles of Apache Flink:

  • Unified streaming and batch APIs
  • Connectivity to one or multiple Kafka clusters
  • Transactions across Kafka and Flink
  • Complex Event Processing
  • Standard SQL support
  • Machine Learning with Kafka, Flink, and Python

But keep in mind that every design approach has pros and cons. While there are a lot of advantages, sometimes it is also a drawback.

Unified streaming and batch APIs

Apache Flink’s DataStream API unifies batch and streaming APIs. It supports different runtime execution modes for stream processing and batch processing, from which you can choose the right one for your use case and the characteristics of your job. In the case of the SQL/Table API, the switch happens automatically based on the characteristics of the sources: If all sources are bounded, the job runs in BATCH execution mode; at least one unbounded source means STREAMING execution mode.
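
As a minimal sketch (the pipeline and its data are illustrative, not from a specific project), selecting the execution mode in the DataStream API looks like this:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExecutionModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // STREAMING is the default. For bounded input (e.g., a finite file or a
        // Kafka topic read up to fixed offsets), BATCH mode lets Flink use more
        // efficient batch-style scheduling and shuffles.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // The same pipeline code runs unchanged in either mode.
        env.fromElements("a", "b", "a")
           .map(String::toUpperCase)
           .print();

        env.execute("unified-batch-and-streaming");
    }
}
```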

The unification of streaming and batch brings a lot of advantages:

  • Reuse of logic/code for real-time and historical processing
  • Consistent semantics across stream and batch processing
  • A single system to operate
  • Applications mixing historical and real-time data processing

This sounds similar to Apache Spark. But there is a significant difference: Contrary to Spark, the foundation of Flink is data streaming, not batch processing. Hence, streaming is the default execution runtime mode in Apache Flink.

Continuous stateless or stateful processing enables real-time streaming analytics using an unbounded stream of events. Batch execution is more efficient for bounded jobs (i.e., a bounded subset of a stream) for which you have a known fixed input and which do not run continuously. This executes jobs in a way that is more reminiscent of batch processing frameworks, such as MapReduce in the Hadoop and Spark ecosystems.

Apache Flink makes moving from a Lambda to a Kappa enterprise architecture easier. The foundation of the architecture is real-time, with Kafka as its heart. But batch processing is still possible out-of-the-box with Kafka and Flink using consistent semantics. However, this combination will likely not (try to) replace traditional ETL batch tools, e.g., for a one-time lift-and-shift migration of large workloads.

Connectivity to one or multiple Kafka clusters

Apache Flink is a separate infrastructure from the Kafka cluster. This has various pros and cons. First, I often emphasize the vast benefit of Kafka-native applications: you only need to operate, scale and support one infrastructure for end-to-end data processing. A second infrastructure adds additional complexity, cost, and risk. However, imagine a cloud vendor taking over that burden, so you consume the end-to-end pipeline as a single cloud service.

With that in mind, let’s look at a few benefits of separate clusters for the data hub (Kafka) and the stream processing engine (Flink):

  • Focus on data processing in a separate infrastructure with dedicated APIs and features independent of the data streaming platform.
  • More efficient streaming pipelines before hitting the Kafka Topics again; the data exchange happens directly between the Flink workers.
  • Data processing across different Kafka topics of independent Kafka clusters of different business units. If it makes sense from a technical and organizational perspective, you can connect directly to non-Kafka sources and sinks. But be careful, this can quickly become an anti-pattern in the enterprise architecture and create complex and unmanageable “spaghetti integrations”.
  • Implement new fail-over strategies for applications.

I emphasize Flink is usually NOT the recommended choice for implementing your aggregation, migration, or hybrid integration scenario. Multiple Kafka clusters for hybrid and global architectures are the norm, not an exception. Flink does not change these architectures.

Kafka-native replication tools like MirrorMaker 2 or Confluent Cluster Linking are still the right choice for disaster recovery. It is still easier to do such a scenario with just one technology. Tools like Cluster Linking solve challenges like offset management out-of-the-box.
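
To make the multi-cluster point concrete, here is a minimal sketch of a Flink job that consumes from one Kafka cluster and produces to another. The bootstrap servers and topic names are hypothetical examples:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CrossClusterPipeline {
    public static void main(String[] args) throws Exception {
        // Source: Kafka cluster of business unit A
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka-unit-a:9092")
                .setTopics("orders")
                .setGroupId("flink-order-enrichment")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Sink: a different Kafka cluster of business unit B
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka-unit-b:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("orders-enriched")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders-source")
           .sinkTo(sink);
        env.execute("cross-cluster-pipeline");
    }
}
```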

Transactions across Kafka and Flink

Workloads for analytics and transactions have very different characteristics and requirements. The use cases differ significantly. SLAs are very different, too. Many people think that data streaming is not built for transactions and should only be used for big data analytics.

However, Apache Kafka and Apache Flink are deployed in many resilient, mission-critical architectures. The concept of exactly-once semantics (EOS) allows stream processing applications to process data through Kafka without loss or duplication. This ensures that computed results are always accurate.

Transactions are possible across Kafka and Flink. The feature is mature and battle-tested in production. Operating separate clusters is still challenging for transactional workloads. However, a cloud service can take over this risk and burden. 

Many companies already use EOS in production with Kafka Streams. But EOS can even be used if you combine Kafka and Flink. That is a massive benefit if you choose Flink for transactional workloads. So, to be clear: EOS is not a differentiator in Flink (vs. Kafka Streams), but it is an excellent option to use EOS across Kafka and Flink, too.
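
As a sketch of what this looks like in code (broker, topic, and prefix names are made up for illustration), a Flink Kafka sink becomes transactional via its delivery guarantee:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceSink {
    public static KafkaSink<String> build() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("payments")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // Writes run inside Kafka transactions that are committed
                // together with Flink checkpoints.
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("payments-pipeline")
                .build();
    }
}
```

Downstream Kafka consumers then need isolation.level=read_committed so they only see committed results.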

Complex Event Processing with FlinkCEP

The goal of complex event processing (CEP) is to identify meaningful events in real-time situations and respond to them as quickly as possible. CEP usually does not send continuous events to other systems but detects when something significant occurs. A common use case for CEP is handling late-arriving events or the non-occurrence of events.

The big difference between CEP and event stream processing (ESP) is that CEP generates new events to trigger action based on situations it detects across multiple event streams with events of different types (situations that build up over time and space). ESP detects patterns over event streams with homogeneous events (i.e., patterns over time). Pattern matching is a technique to implement either approach, but the features look different.

FlinkCEP is an add-on for Flink for complex event processing. The powerful pattern API of FlinkCEP allows you to define complex pattern sequences you want to extract from your input stream. After specifying the pattern sequence, you apply it to the input stream to detect potential matches. This is also possible with SQL via the MATCH_RECOGNIZE clause.
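
The following sketch shows the flavor of the Pattern API with an invented fraud scenario (the event type and thresholds are purely illustrative): a tiny “probe” payment followed by a large one on the same card within ten minutes.

```java
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class FraudPattern {

    // Hypothetical event type
    public static class Txn {
        public String cardId;
        public double amount;
    }

    public static PatternStream<Txn> detect(DataStream<Txn> transactions) {
        Pattern<Txn, ?> fraud = Pattern.<Txn>begin("probe")
                .where(new SimpleCondition<Txn>() {
                    @Override
                    public boolean filter(Txn t) { return t.amount < 1.00; }
                })
                .next("drain")
                .where(new SimpleCondition<Txn>() {
                    @Override
                    public boolean filter(Txn t) { return t.amount > 500.00; }
                })
                .within(Time.minutes(10));

        // Key by card so the pattern is evaluated per card, not globally
        return CEP.pattern(transactions.keyBy(t -> t.cardId), fraud);
    }
}
```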

Standard SQL support

Structured Query Language (SQL) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS). However, it is so predominant that other technologies like non-relational databases (NoSQL) and streaming platforms adopt it, too.

SQL became a standard of the American National Standards Institute (ANSI) in 1986 and the International Organization for Standardization (ISO) in 1987. Hence, if a tool supports ANSI SQL, it ensures that any 3rd party tool can easily integrate using standard SQL queries (at least in theory).

Apache Flink supports ANSI SQL, including the Data Definition Language (DDL), Data Manipulation Language (DML), and Query Language. Flink’s SQL support is based on Apache Calcite, which implements the SQL standard. This is great because many personas, including developers, architects, and business analysts, already use SQL in their daily job.

The SQL integration is based on the so-called Flink SQL Gateway, a part of the Flink framework that allows other applications to easily interact with a Flink cluster through a REST API. User applications (e.g., Java/Python/Shell programs, Postman) can use the REST API to submit queries, cancel jobs, retrieve results, etc. This enables a possible integration of Flink SQL with traditional business intelligence tools like Tableau, Microsoft Power BI, or Qlik.

However, to be clear, ANSI SQL was not built for stream processing. Incorporating Streaming SQL functionality into the official SQL standard is still in the works. The Streaming SQL working group includes database vendors like Microsoft, Oracle, and IBM, cloud vendors like Google and Alibaba, and data streaming vendors like Confluent. More details: “The History and Future of SQL: Databases Meet Stream Processing“.

Having said this, Flink supports continuous sliding windows and various streaming joins via ANSI SQL. Some features require additional non-standard SQL keywords, but sliding windows and streaming joins are, in general, possible.
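
For illustration, here is a sketch of a continuous query submitted from Java via the Table API. The table definition and connector options are hypothetical examples (details vary by Flink version):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingSqlExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Source table backed by a Kafka topic
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  customer_id STRING," +
            "  amount DOUBLE," +
            "  order_time TIMESTAMP(3)," +
            "  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'broker:9092'," +
            "  'format' = 'json'" +
            ")");

        // Continuous query: revenue per customer in one-hour tumbling windows
        tEnv.executeSql(
            "SELECT window_start, customer_id, SUM(amount) AS revenue " +
            "FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' HOUR)) " +
            "GROUP BY window_start, window_end, customer_id").print();
    }
}
```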

Machine Learning with Kafka, Flink, and Python

In conjunction with data streaming, machine learning solves the impedance mismatch of reliably bringing analytic models into production for real-time scoring at any scale. I explored ML deployments within Kafka applications in various blog posts, e.g., embedded models in Kafka Streams applications or using a machine learning model server with streaming capabilities like Seldon.

PyFlink is a Python API for Apache Flink that allows you to build scalable batch and streaming workloads, such as real-time data processing pipelines, large-scale exploratory data analysis, Machine Learning (ML) pipelines, and ETL processes. If you’re already familiar with Python and libraries such as Pandas, then PyFlink makes it simpler to leverage the full capabilities of the Flink ecosystem.

PyFlink is the missing piece for an ML-powered data streaming infrastructure, as almost every data engineer uses Python. The combination of Tiered Storage in Kafka and Data Streaming with Flink in Python is excellent for model training without the need for a separate data lake.

When to use Kafka Streams instead of Apache Flink?

Don’t underestimate the power and use cases of Kafka-native stream processing with Kafka Streams. The adoption rate is massive, as Kafka Streams is easy to use. And it is part of Apache Kafka. To be clear: Kafka Streams is already included if you download Kafka from the Apache website.

The most significant difference between Kafka Streams and Apache Flink is that Kafka Streams is a Java library, while Flink is a separate cluster infrastructure. Developers can deploy the Flink infrastructure in session mode (a shared cluster for many small, homogeneous workloads like SQL queries) or in application mode (for fewer, bigger, heterogeneous data processing tasks, e.g., isolated applications running in a Kubernetes cluster).

No matter your deployment option, you still need to operate a complex cluster infrastructure for Flink (including separate metadata management on a ZooKeeper cluster or an etcd cluster in a Kubernetes environment).

TL;DR: Apache Flink is a fantastic stream processing framework and one of the top five Apache open-source projects. But it is also complex to deploy and difficult to manage.

Benefits of using the lightweight library of Kafka Streams

Kafka Streams is a single Java library. This adds a few benefits:

  • Kafka-native integration supports critical SLAs and low latency for end-to-end data pipelines and applications with a single cluster infrastructure instead of operating separate messaging and processing engines with Kafka and Flink. Kafka Streams apps still run in their VMs or Kubernetes containers, but high availability and persistence are guaranteed via Kafka Topics.
  • Very lightweight with no other dependencies (Flink needs S3 or similar storage as the state backend)
  • Easy integration into testing / CI / DevOps pipelines
  • Embedded stream processing into any existing JVM application, like a lightweight Spring Boot app or a legacy monolith built with old Java EE technologies like EJB.
  • Interactive Queries allow leveraging the state of your application from outside your application. The Kafka Streams API enables your applications to be queryable. Flink’s similar feature “queryable state” is approaching the end of its life due to a lack of maintainers.
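
To illustrate how lightweight this is, here is a minimal sketch of a complete Kafka Streams application (the topic names and filter condition are invented for illustration). It is a plain Java program, not a job submitted to a cluster:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class OrdersFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        // Topology: read, filter, write - state and fault tolerance are
        // handled via Kafka topics, not a separate processing cluster.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value.contains("\"currency\":\"EUR\""))
               .to("orders-eur", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```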

Kafka Streams is well-known for building independent, decoupled, lightweight microservices. This differs from submitting a processing job into the Flink (or Spark) cluster; each data product team controls its destiny (e.g., don’t depend on the central Flink team for upgrades or get forced to upgrade). Flink’s application mode enables a similar deployment style for microservices. But:

Today, Kafka Streams and Flink are usually used for different applications. While Flink provides an application mode to build microservices, most people use Kafka Streams for this today. Interactive queries are available in Kafka Streams and Flink, but the feature got deprecated in Flink as there is not much demand from the community. These are two examples that show that there is no clear winner. Sometimes Flink is the better choice, and sometimes Kafka Streams makes more sense.

“In summary, while there certainly is an overlap between the Streams API in Kafka and Flink, they live in different parts of a company, largely due to differences in their architecture and thus we see them as complementary systems.” That’s the quote of a “Kafka Streams vs. Flink comparison” article written in 2016 (!) by Stephan Ewen, former CTO of data Artisans, and Neha Narkhede, former CTO of Confluent. While some details changed over time, this old blog post is still pretty accurate today and a good read for a more technical audience.

The domain-specific language (DSL) of Kafka Streams differs from Flink but is also very similar. How are both characteristics possible? It depends on who you ask. This (legitimate) subject for debate often segregates Kafka Streams and Flink communities. Kafka Streams has Stream and Table APIs. Flink has DataStream, Table, and SQL API. I guess 95% of use cases can be built with both technologies. APIs, infrastructure, experience, history, and many other factors are relevant for choosing the proper stream processing framework.

Some architectural aspects are very different in Kafka Streams and Flink. These need to be understood and can be a pro or con for your use case. For instance, Flink’s checkpointing has the advantage of getting a consistent snapshot, but the disadvantage is that every local error stops the whole job, and everything has to be rolled back to the last checkpoint. Kafka Streams does not have this concept. Local errors can be recovered locally (move the corresponding tasks somewhere else; the tasks/threads without errors just continue normally). Another example is Kafka Streams’ hot standby for high availability versus Flink’s fault-tolerant checkpointing system.

Apache Kafka is the de facto standard for data streaming. It includes Kafka Streams, a widely used Java library for stream processing. Apache Flink is an independent and successful open-source project offering a stream processing engine for real-time and batch workloads. The combination of Kafka (including Kafka Streams) and Flink is already widespread in enterprises across all industries.

Both Kafka Streams and Flink have benefits and tradeoffs for stream processing. The freedom of choice of these two leading open-source technologies and the tight integration of Kafka with Flink enables any kind of stream processing use case. This includes hybrid, global and multi-cloud deployments, mission-critical transactional workloads, and powerful analytics with embedded machine learning. As always, understand the different options and choose the right tool for your use case and requirements.

What is your favorite for streaming processing, Kafka Streams, Apache Flink, or another open-source or proprietary engine? In which use cases do you leverage stream processing? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka (including Kafka Streams) + Apache Flink = Match Made in Heaven appeared first on Kai Waehner.

The Data Streaming Landscape 2023 https://www.kai-waehner.de/blog/2022/12/21/data-streaming-landscape-2023/ Wed, 21 Dec 2022 07:29:58 +0000 https://www.kai-waehner.de/?p=4953 Data streaming is a new software category to process data in motion. Apache Kafka is the de facto standard used by over 100,000 organizations. Plenty of vendors offer Kafka platforms and cloud services. Many complementary stream processing engines like Apache Flink and SaaS offerings have emerged. And competitive technologies like Pulsar and Redpanda try to get market share. This blog post explores the data streaming landscape of 2023 to summarize existing solutions and market trends.

The post The Data Streaming Landscape 2023 appeared first on Kai Waehner.


Data Streaming Landscape 2023 with Apache Kafka Flink and much more

Data streaming is a new software category

Data-driven applications are the new black. The overall goal is to increase business value: growing revenue, reducing cost, reducing risk, or improving the customer experience.

Plenty of software categories and related data platforms exist to process and analyze data:

  • Database: Store and execute transactional workloads.
  • Data Warehouse: Processing structured historical data to create recurring reports and unique insights.
  • Data Lake: Processing structured and semi- or unstructured big data sets with batch processing to create recurring reports and unique insights.
  • Lakehouse: A mix of data warehouse and data lake to process all data on one platform.
  • Data Streaming: Continuously process data in motion and provide data consistency across communication paradigms instead of storing and analyzing data at rest.

Of course, these data platforms often overlap a bit. I did a complete blog series exploring the use cases and how they complement each other.

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Data streaming use cases by business value

Use cases for data streaming exist across all industries:

Use Cases for Data Streaming with Apache Kafka by Business Value

Adding business value is crucial for any enterprise. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products. Search my blog for your favorite industry to find plenty of case studies and architectures. Or read about use cases for Apache Kafka across industries to get started.

The data streaming landscape of 2023

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category.

The data streaming landscape shows that most vendors use Kafka or implement its protocol because it has become the de facto standard.

New software companies have emerged in this category in the last few years. And several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem.

Apache Kafka is the de facto standard for data streaming, like Amazon S3 is the de facto standard for object storage. Most software vendors use Kafka for their data streaming platforms. However, there is more than Kafka. Some vendors only use the Kafka protocol (Azure Event Hubs) or utterly different APIs (like Amazon Kinesis).

The following Data Streaming Landscape 2023 summarizes the current status of relevant products and cloud services:

Data Streaming Landscape 2023 around Apache Kafka and Cloud

Please note: This is not a complete list of frameworks, cloud services, or vendors. It is not an official research landscape. If your favorite technology is not in this diagram, then I did not see it in my conversations with customers, prospects, partners, analysts, or the broader data streaming community. We will probably see many more logos in this diagram in a year or two, as this is still the beginning of the data streaming era.

Also, note that I focus on general data streaming infrastructure. Brilliant solutions exist for using and analyzing streaming data for specific scenarios, like time series databases, machine learning engines, or observability platforms. These are complementary and often connected out of the box to a streaming cluster.

Evaluation criteria for data streaming platforms

I often recommend using the following four aspects to look at different frameworks, platforms, and cloud services to evaluate a technology for your business project or enterprise architecture strategy:

  • Cloud-native: Is the solution elastic to scale up and down? Is it fully managed / serverless, or just a bunch of server instances hosted in the cloud? Can you automate the development, operations, and testing process using DevOps, GitOps, test-driven development, and similar principles?
  • Complete: Does the solution offer all required capabilities? Data streaming requires more than just messaging or data ingestion. Hence, does it provide connectors, data processing, governance, security, self-service, and so on?
  • Everywhere: Where can you use the solution? Cloud-only? Are all required cloud service providers supported? Is there an option to deploy in a data center or even at the edge (i.e., outside a data center)? How can you share data between regions, clouds or data centers? What use cases are supported (e.g., aggregation, disaster recovery, hybrid integration, etc.)?
  • Supported: Is the solution mature and battle-tested? Are public case studies available for your use case or industry? Does the vendor fully support the product? What are the SLAs? Are specific features excluded from commercial enterprise support? It is a shame that this aspect needs to be evaluated. Still, some vendors offer data streaming cloud services and exclude support in the terms and conditions (that many people don’t read in cloud services, unfortunately).

Let’s take a deeper look into the different categories and start with the leading technology: Native Apache Kafka…

Apache Kafka is the de facto standard for data streaming

Starting with the leader and de facto standard Apache Kafka and related vendors and SaaS offerings. Apache Kafka became the de facto standard for data streaming like Amazon S3 is the de facto standard for object storage:

De Facto Standard API - Amazon S3 for Object Storage and Apache Kafka for Event Streaming

Read the detailed blog post to learn more about the differences between an open-source standard like Kafka and a proprietary protocol like S3.

When you explore the data streaming world, there is no way not to look at the Apache Kafka ecosystem.

Apache Kafka adoption and growth

The growth of the Apache Kafka community in the last few years is impressive. Here are some statistics that Jay Kreps presented at the data streaming conference “Current – The Next Generation of Kafka Summit” in Austin, Texas, in October 2022:

  • >100,000 organizations using Apache Kafka
  • >41,000 Kafka meetup attendees
  • >32,000 Stack Overflow questions
  • >12,000 Jiras for Apache Kafka
  • >31,000 Open job listings request Kafka skills

And look at the increased number of active monthly unique users downloading the Kafka Java client library with Maven:

Sonatype Maven Kafka Client Downloads
Source: Sonatype

Fun fact: The leading conference for Kafka was rebranded from “Kafka Summit” to “Current 2022 – The Next Generation of Kafka Summit”. Why? Because data streaming is more than Kafka. Many complementary and competitive technologies were present, including vendors, booths, demos, and customer case studies. That’s a remarkable evolution of data streaming for the community and enterprises across the globe!

Apache Kafka Vendors: self-managed vs. cloud offerings

New software companies focus on data streaming. And traditional players like IBM and Amazon jumped on the bandwagon in the past few years. On a top level – to keep it simple – three kinds of offerings exist for Apache Kafka:

Comparison of Apache Kafka Data Streaming Offerings

I made a detailed comparison of on-premise Kafka vendors and cloud services using this car analogy. Only Amazon MSK Serverless (i.e., the fully managed service, not the partially managed MSK) was not available when writing this comparison. Hence, read Confluent Cloud versus Amazon MSK Serverless.

Kafka-native Data Streaming Products and SaaS

Here are a few notes on each vendor as a summary.

  • Apache Kafka: The de facto standard for data streaming. Open source with a vast community. All the vendors in this list rely on (parts of) this project.
  • Confluent: Provides data streaming everywhere with Confluent Platform (self-managed) and Confluent Cloud (fully managed and available across cloud providers).
  • Cloudera: Provides Kafka as a self-managed offering. Focuses on combining many data technologies like Kafka, Hadoop, Spark, Flink, NiFi, and many more.
  • Red Hat: Provides Kafka as a partially managed cloud offering and self-managed Kafka on Kubernetes via OpenShift. Kafka is part of the integration portfolio that includes other open-source frameworks like Apache Camel.
  • TIBCO: Offers Kafka for Linux and Windows. Strange product (as Kafka experts know Kafka does not work well on Windows) and minimal documentation.
  • AWS: Provides two separate products with Amazon MSK (partially managed) and Amazon MSK Serverless (fully managed). Kafka support is excluded in the MSK offerings. AWS has hundreds of cloud services, and Kafka is part of that broad spectrum. Only available on AWS clouds.
  • Instaclustr and Aiven: Partially managed Kafka cloud offerings across cloud providers. The product portfolios offer various hosted services of open-source technologies. Instaclustr also offers a (semi-)managed offering for on-premise infrastructure.
  • Microsoft Azure HDInsight: A piece of Azure’s Hadoop infrastructure. Not intended for other use cases. Only available on Azure clouds.
  • Lenses and Conduktor: Tools for managing and monitoring Kafka clusters. Complementary to the other vendors.

This is not a comparison, just a list with a few notes. Make your own evaluation of your favorite vendors. Check what you need: Cloud-native? Complete? Everywhere? Supported?

Kafka-compatible open-source frameworks and SaaS

A few vendors don’t rely on open-source Apache Kafka but built their own implementations for different reasons. The Kafka protocol compatibility is limited (though marketing will not tell you). This can create risk in operating existing Kafka workloads against the cluster and differs in operations and execution (which can be good or bad).

Kafka-compatible Open-Source Frameworks and Cloud Services

Here are a few notes on each vendor as a summary:

  • Apache Pulsar: A competitor to Apache Kafka. Similar story and use cases, but a different architecture: Kafka is one distributed cluster (after removing the ZooKeeper dependency in 2022), while Pulsar requires three distributed clusters (Pulsar brokers, ZooKeeper, BookKeeper). I wrote about Pulsar vs. Kafka two years ago, and I think the status is still the same (and it is too late now to get more market traction).
  • StreamNative: The primary vendor behind Apache Pulsar. Offers self-managed and fully managed solutions. StreamNative Cloud for Kafka is in beta and not production ready.
  • DataStax: A Pulsar offering integrated into the database-focused product portfolio. Not sure if the streaming product is just marketing or not. If you want to try out the Astra Streaming cloud service powered by Pulsar, it refers you to the multi-cloud DBaaS built on Apache Cassandra.
  • Redpanda: A new entrant into the data streaming market offering self-managed and fully managed products. Interesting approach to implementing the Kafka protocol with C++. It might take some market share if they can find the proper use cases and differentiators. Today, I don’t see Redpanda as an alternative to a Kafka-native offering because of its early stage in the maturity curve and no added value for solving business problems versus the added risk compared to Apache Kafka.
  • Azure Event Hubs: A mature, fully managed cloud service. The service does one thing, and that is done very well: Data ingestion via the Kafka protocol (with limited compatibility). Hence, it is not a complete streaming platform, but is more comparable to Amazon Kinesis or Google Cloud PubSub. Only available on Azure cloud.

Be careful about statements of vendors that reimplement the Kafka protocol. Most of these vendors oversell the Kafka protocol compatibility. Additionally, “benchmarketing” (i.e., picking a sweet spot or niche scenario where you perform better than your competitor) is the favorite marketing technique to “prove” differentiators to the real Apache Kafka.

Data streaming is more than Apache Kafka…

While Apache Kafka is the de facto standard for data streaming, many complementary and competitive technologies exist.

Data Streaming SaaS like Apache Flink Spark Databricks Amazon Kinesis and Google PubSub

Even more technologies emerge these days because of the growth of this software category across the globe and all industries. That’s excellent news. Data streaming is here to stay and grow.

The situation is challenging to explore as part of the data streaming landscape, as some products are complementary and competitive to the Apache Kafka ecosystem.

Some data streaming technologies are competitive to Kafka

In some situations, you must evaluate whether Apache Kafka or another technology is the right choice. Here are a few open-source and cloud competitors:

  • Amazon Kinesis: Data ingestion into AWS data stores. Mature product for a specific problem. Only available on AWS.
  • Google Cloud PubSub: Data ingestion into GCP data stores. Mature product for a specific problem. Only available on GCP.
  • Pravega and Hazelcast Jet: Open-source frameworks for stream processing. I added these to show that there is more than Kafka and Flink in the open-source world. However, I see little market traction.

Amazon Kinesis and Google Cloud PubSub are excellent cloud services if you “just” want to ingest data into a specific cloud storage. If there are no other use cases, these tools might be the right choice (if pricing at scale and other limitations work for you).

Apache Kafka is a much more flexible and strategic data streaming platform. Many projects still start with data ingestion and build the first pipeline. But providing access to the same stream of events to any other data sink or for powerful stream processing with tools like Kafka Streams or Apache Flink is a significant advantage.

Some data streaming technologies are complementary to Kafka

Each stream processing framework or cloud service has trade-offs. There is no single size that fits all use cases. Here are a few mature and emerging technologies that complement Apache Kafka:

  • Apache Flink: Together with Kafka Streams (part of Apache Kafka), the leading open-source stream processing framework. Advanced features include ANSI SQL support and APIs for stream and batch workloads.
  • Decodable and Immerok: Two brand new cloud services. Very early stage. I still added them, as I think it is an excellent strategic move to build a data streaming cloud service on top of Apache Flink. Huge potential if it is combined with existing Kafka infrastructures in enterprises.
  • Spark Streaming: The streaming part of Apache Spark. I am still not 100 percent convinced. Kafka Streams and Apache Flink are the better choices for stream processing. However, the enormous installed base of Spark clusters in enterprises broadens adoption.
  • Databricks: The leading vendor behind Apache Spark. Getting or at least trying to get much more into the business of real-time data. I like the platform, but I am not convinced by the lakehouse story around “doing everything within one big data lake”. Check out my blog series “Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?“.

Most of these technologies complement Apache Kafka. But stream processing frameworks like Flink or cloud services like Databricks do NOT need Kafka as an ingestion layer. There are other options…

Flink, Spark, et al. can consume data from other streaming platforms or directly from data stores. However, be careful with the latter: If you use Flink or Spark Streaming for stream processing, that’s fine. But if the first thing to do is read the data from an S3 object store, well, that is data at rest. Don’t do stream processing with data at rest.

Or in other words, don’t store data in a database or data lake just to reverse it later. Almost all Spark Streaming examples and case studies I saw last year at conferences and customer meetings looked like this. That is an anti-pattern for stream processing!

To be clear: It is okay to ingest data from S3 or another data store to a stream processing application built with Kafka Streams, Flink, et al. This data can be used in the stateful backend for your tasks like enrichment purposes. A stream processing application is not just about real-time data feeds. It also correlates these real-time feeds with (already ingested) historical data. This is a common approach for metadata or business data that is updated less frequently (like from an SAP ERP system).

Why are Kafka Streams and KSQL missing in the data streaming landscape?

I intentionally did not put Kafka Streams and KSQL into the data streaming landscape. Both are Kafka-native stream processing technologies.

Kafka Streams, like Kafka Connect, is part of open-source Apache Kafka. Hence, the Java library is included if you download Kafka from the Apache website. It is already included in the data streaming landscape with the Kafka logo. You should always ask yourself if you need another framework besides Kafka Streams for stream processing. The significant benefit: One technology, one vendor, one infrastructure.

Many vendors exclude or do not focus on Kafka Streams and Kafka Connect and only offer incomplete Kafka; they want to sell their own integration and processing products instead.

KSQL is an abstraction layer on top of Kafka Streams to provide stream processing with streaming SQL. A great tool, also Kafka-native. It comes with a Confluent Community License and is free to use. Hence, like Kafka Streams, I see it as part of Kafka and did not explicitly put it into the data streaming landscape as a separate product. But you need to evaluate it against Flink, Decodable, and others, for your use case, of course.

The data streaming era is just beginning…

The data streaming landscape 2023 shows how a new software category is emerging. We are still in a very early stage. In most conversations with customers, partners, and the community, I hear statements like:

“We see the value, but we are not there yet – we now start with building first data streaming pipelines and have a roadmap for the next years to add more advanced stream processing”.

Data streaming is a long journey, as it is a paradigm shift. We will hopefully see a Gartner Magic Quadrant for Event Streaming and a Forrester Wave for Data Streaming in the foreseeable future, too. A new category takes time to create. But did you already notice how much more the analysts of Gartner, Forrester, and others already write about data streaming and the various vendors? I also wrote a dedicated blog explaining why data streaming is its own software category.

Looking at the competitive data streaming market, one of my favorite real-world examples for choosing the right stream processing technologies comes from DoorDash: Why companies migrate from Amazon SQS and Kinesis to Apache Kafka and Flink. The article explores the trade-offs between cloud-specific solutions like Kinesis or PubSub and an open ecosystem around open-source technologies like Kafka and Flink.

Last but not least, check out my Top 5 Data Streaming Trends for 2023 to understand how the data streaming landscape fits into emerging trends like data mesh, data sharing, and data governance.

What are your most relevant and exciting trends for data streaming and Apache Kafka in 2023 to set data in motion? What does your enterprise landscape for data streaming look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Data Streaming Landscape 2023 appeared first on Kai Waehner.

A cloud-native SCADA System for Industrial IoT built with Apache Kafka https://www.kai-waehner.de/blog/2022/10/04/cloud-native-scada-system-for-industrial-iot-with-apache-kafka/ Tue, 04 Oct 2022 11:10:06 +0000 https://www.kai-waehner.de/?p=4874 Industrial IoT and Industry 4.0 enable digitalization and innovation. SCADA control systems are a vital component of IT/OT modernization. The SCADA evolution started with monolithic applications and moved to networked and web-based platforms. This blog post explores building the 5th generation: A cloud-native SCADA infrastructure with Apache Kafka. A real-world case study explores the journey of a German system operator for electricity to show how such a journey to open and scalable real-time workloads and edge-to-cloud integration progressed.

The post A cloud-native SCADA System for Industrial IoT built with Apache Kafka appeared first on Kai Waehner.


Cloud Native SCADA Industrial IoT with Apache Kafka Data Streaming

What is a SCADA system?

Supervisory control and data acquisition (SCADA) is a control system architecture comprising computers, networked data communications, and graphical user interfaces for high-level supervision of machines and processes. It also covers sensors and other devices, such as programmable logic controllers, which interface with process plants or machinery.

While many people refer to specific commercial products, SCADA is a concept or architecture. It can include various components, functions, and products (from different vendors) on different levels:

Functional levels of a Distributed Control System aka SCADA

Wikipedia has a detailed article explaining the terms, history, components, and functions of SCADA. The evolution describes four generations of SCADA systems:

  1. First generation: Monolithic
  2. Second generation: Distributed
  3. Third generation: Networked
  4. Fourth generation: Web-based

The evolution did not stop here. The following explores the 5th generation: cloud-native and open SCADA systems.

How does Apache Kafka help in Industrial IoT?

Industrial IoT (IIoT) and Industry 4.0 create a few new challenges across industries:

  • The need for a much bigger scale
  • The demand for real-time information
  • Hybrid architectures with mission-critical workloads at the edge and analytics in elastic public cloud infrastructure.
  • A flexible Open API culture and data sharing across OT/IT environments, and between partners (e.g., supplier, OEM, and mobility service).

Apache Kafka is unique in its characteristics for IoT infrastructures, being very scalable (for transactional and analytical requirements and SLAs), reliable, and open. Hence, many new Industrial IoT projects adopt Apache Kafka for various use cases, including data hub between OT and IT, global integration of smart factories for analytics, predictive maintenance, customer 360, and many other scenarios.

Cloud-native data historian powered by Apache Kafka (operating at the edge or in the cloud)

Data Historian is a well-known concept in Industrial IoT. It helps to ensure and improve the Overall Equipment Effectiveness (OEE). The term often overlaps with SCADA. Some people even use it as a synonym.

Apache Kafka can be used as a component of a Data Historian to improve the OEE and reduce/eliminate the most common causes of equipment-based productivity loss in manufacturing (aka Six Big Losses):

Apache Kafka as open scalable Data Historian for IIoT with MQTT and OPC UA

Continuous real-time data ingestion, processing, and monitoring 24/7 at scale is a crucial requirement for thriving Industry 4.0 initiatives. Data Streaming with Apache Kafka and its ecosystem brings enormous value to implementing these modern IoT architectures.
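
As a simple illustration: ingesting sensor readings into Kafka boils down to a plain producer. The machine ID, topic name, and payload below are hypothetical; in real IIoT deployments, the data typically arrives via MQTT or OPC UA connectors rather than a hand-written client:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorIngestion {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // durable writes for audit-relevant OT data

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed by machine ID so all readings of one machine stay ordered
            producer.send(new ProducerRecord<>(
                    "machine.sensors",
                    "press-17",
                    "{\"temperature\": 87.2, \"vibration\": 0.04, \"ts\": 1664900000}"));
        }
    }
}
```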

Let’s explore a concrete example of a cloud-native SCADA system.

50hertz: A case study for a cloud-native SCADA system built with Apache Kafka

50hertz is a transmission system operator for electricity in Germany. The company secures electricity supply to 18 million people in northern and eastern Germany.

The infrastructure must operate 24 hours, seven days a week. Various shift teams and a mission-critical SCADA infrastructure supervise and control the OT systems.

50hertz presented their OT/IT and SCADA modernization leveraging data streaming with Apache Kafka at the Confluent Data in Motion tour 2021. The on-demand video recording is available (the speech is in German, unfortunately).

The Journey of 50hertz in a big picture

Look at this fantastic picture of 50hertz’s digital transformation journey from monolithic and proprietary legacy technology to a modern cloud-native integration platform powered by Kafka to modernize their IoT ecosystem, such as SCADA systems:

50hertz Journey OT IT Modernization
Source: 50hertz

Notice the details in the above picture:

  • The legacy infrastructure on the left side glues and patches together different components. It almost breaks together. No changes are possible to existing components.
  • The new infrastructure on the right side is based on flexible, standardized containers. It is easy to scale, add, or remove applications. The communication happens via standard sizes and schemas.
  • The bridge in the middle shows the journey. This is a brownfield approach where the old and new world have to communicate with each other for many years. Over time, the company can shut down more and more of the legacy infrastructure.

A great example of innovation in the energy sector! Let’s explore the details of building a cloud-native SCADA system with Apache Kafka:

Challenges of the monolithic legacy IoT infrastructure

The old IT/OT infrastructure and SCADA system are monolithic, proprietary, not scalable, and miss open APIs based on standard interfaces:

50hertz Legacy Monolith Modular Control Center System
Source: 50hertz

A very common infrastructure setup. Most existing OT/IT infrastructures have exactly the same challenges. This is how factories and production lines were built in the past decades.

The consequence is inflexibility regarding software updates, hardware changes, security fixes, and no option for scalability or innovation. Applications run in disconnected mode and are air-gapped from the internet because the old Windows servers are not even supported and no longer get security patches.

Digital transformation in the industrial space requires modernization. Legacy infrastructure still needs to be integrated in most scenarios. Not every company starts from scratch like Tesla, which builds brand-new factories designed around automation and digitalization from day one.

Cloud-native SCADA with Kafka to enable innovation (and legacy integration)

50hertz next-generation Modular Control Center System (MCCS) leverages a central, scalable, event-based integration platform based on Confluent:

Cloud-native SCADA system built with Apache Kafka at 50hertz
Source: 50hertz

The first four containers include the Supervisory & Control (SCADA), Load Frequency Control (LFC), and Time Series Management & Forecasting applications. Each container can have multiple services/functions that follow the event-based microservices pattern.

50hertz provides central governance for security, protocols, and data schemas (CIM compliant) between platform containers/ modules. The cloud-native 24/7 SCADA system is developed in the cloud and deployed in safety-critical edge environments.

More on data streaming and Industrial IoT

If you want to learn more about real-world case studies, use cases, and technical architectures for data streaming with Apache Kafka in IIoT scenarios, check out the related articles on this blog.

If this is insufficient, please let me know what else you need to know… 🙂

Cloud-native architectures and Open API are the future of Industrial IoT

50hertz is a tremendous real-world case study about the modernization of the OT/IT world. A modern SCADA architecture requires real-time data processing at any scale, true decoupling between data producers and consumers (no matter what API these apps use), and open interfaces to integrate with any other application like MES, ERP, cloud services, and so on.

From the IT side, this is nothing new. The last decade brought up scalable open source technologies like Kafka, Spark, Flink, Iceberg, and many more, plus related fully managed, elastic cloud services like Confluent Cloud, Databricks, Snowflake, and so on.

However, the OT side has to change. Instead of using monolithic legacy systems, unsupported and unstable Windows servers, and proprietary protocols, next-generation SCADA systems need to use the same cloud-native IT systems, adopt modern OT hardware/software combinations, and integrate the old and new world to enable digitalization and innovation in industry verticals like manufacturing, automotive, military, energy, and so on.

What role does data streaming play in your Industrial IoT environments and OT/IT modernization? Do you run everything around Kafka in the cloud or operate hybrid edge scenarios? What tasks does Kafka take over – is it “just” the data hub, or are IoT use cases built with it, too? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post A cloud-native SCADA System for Industrial IoT built with Apache Kafka appeared first on Kai Waehner.

Request-Response with REST/HTTP vs. Data Streaming with Apache Kafka – Friends, Enemies, Frenemies? https://www.kai-waehner.de/blog/2022/08/12/request-response-with-rest-http-vs-data-streaming-with-apache-kafka/ Fri, 12 Aug 2022 06:28:02 +0000 https://www.kai-waehner.de/?p=4715 Request-response communication with REST / HTTP is simple, well understood, and supported by most technologies, products, and SaaS cloud services. Contrarily, data streaming with Apache Kafka is a fundamental change to process data continuously. HTTP and Kafka complement each other in various ways. This post explores the architectures and use cases to leverage request-response together with data streaming in the control plane for management or in the data plane for producing and consuming events.

The post Request-Response with REST/HTTP vs. Data Streaming with Apache Kafka – Friends, Enemies, Frenemies? appeared first on Kai Waehner.


Kafka versus HTTP REST API

Request-response (HTTP) versus data streaming (Apache Kafka)

Prior to discussing the relationship between HTTP/REST and Apache Kafka, let’s explore the concepts behind both. Traditionally, request-response and data streaming are two different paradigms.

Request-response with REST/HTTP

The following characteristics make HTTP so prevalent in software engineering for request-response (aka request-reply) communication:

  • The foundation of data communication for the World Wide Web
  • The standard application layer protocol in the internet protocol suite, commonly known as TCP/IP
  • Simple and well understood
  • Supported by most open source frameworks, proprietary products, and SaaS cloud services
  • Pre-defined API with GET, POST, PUT, and DELETE commands
  • Typically synchronous communication, but chunked transfer encoding (i.e., streaming) is also possible
  • Point-to-point message exchange between two applications (like a client and server or two independent microservices)

Data streaming with Apache Kafka

HTTP is about communication between two applications. On the contrary, data streaming is much more than just data communication between a client and a server. Hence, data streaming platforms like Apache Kafka have very different characteristics:

  • No official web standard like HTTP
  • Plenty of open-source and proprietary implementations exist
  • An event-driven system with asynchronous communication using truly decoupled producers and consumers due to the storage of the streaming platform
  • General-purpose events instead of pre-defined APIs – contract management using schemas is crucial in larger projects for API enforcement and data governance
  • Continuous data processing in real-time at any scale – a fundamental change for developers that are used to web services and databases for building applications
  • Backpressure handling, slow consumers, and replayability of historical events are core concepts built-in out-of-the-box
  • Data integration and data processing capabilities are built into the data streaming platform, i.e., it is not just a message queue

Please note that this article specifically discusses Apache Kafka as it is the established de facto standard for data streaming. It powers most data streaming distributions like Confluent, Red Hat, IBM, Cloudera, TIBCO, and many more, plus cloud services like Confluent Cloud and Amazon MSK. Nevertheless, other frameworks and cloud services like Apache Pulsar, Redpanda, Apache Flink, AWS Kinesis, and many other data streaming technologies follow the same principles. Just be aware of the technical differences and trade-offs between data streaming products.

Request-response and data streaming are complementary

Most architectures need request-response for point-to-point communication (e.g., between a server and mobile app) and data streaming for continuous event processing.

Event sourcing with CQRS is the better design for most data streaming scenarios. However, developers can implement the request-response message exchange pattern natively with Apache Kafka.
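
A common way to emulate request-reply on top of Kafka is a correlation ID plus a reply topic, carried in record headers. The following sketch shows the requesting side; the topic and header names are illustrative conventions, not a fixed API:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaRequestReply {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        String correlationId = UUID.randomUUID().toString();

        ProducerRecord<String, String> request = new ProducerRecord<>(
                "quote.requests", correlationId, "{\"symbol\":\"KAFKA\"}");
        // Headers tell an asynchronous responder which request this answers
        // and where to send the reply.
        request.headers()
               .add("correlationId", correlationId.getBytes(StandardCharsets.UTF_8))
               .add("replyTo", "quote.responses".getBytes(StandardCharsets.UTF_8));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(request);
        }
        // A consumer on "quote.responses" then matches replies by correlation ID.
    }
}
```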

Nevertheless, direct HTTP communication with Kafka is the easier and often better approach for appropriate use cases. With this in mind, let’s look at use cases where HTTP is used with Kafka and how they complement each other.

Why is HTTP / REST so popular?

Most developers and administrators are familiar with REST APIs. They are the natural option for many best practices and security guidelines. Here are some good reasons why this will not change in the future:

  • Avoiding technology lock-in: Sometimes, you want to embed the communication or proxy it with a more agnostic API.
  • Familiarity with a known technology: Developers are familiar with REST endpoints. If they are under pressure or need a quick result, using a familiar API is faster than learning a new one.
  • Supported by almost all products: Most open-source frameworks, commercial products, and SaaS cloud services provide HTTP APIs.
  • Security: HTTP ports are much easier to open by security teams compared to the TCP ports of the Kafka-native protocol used by client APIs from programming languages such as Java, Go, C++, or Python. For instance, in DMZ pass-through requirements, InfoSec owns the F5 proxies in the DMZ. A Kafka REST Proxy makes the integration easier.
  • Domain-driven design (DDD): Often, HTTP/REST and Kafka are combined to leverage the best of both worlds: Kafka for decoupling and HTTP for synchronous client-server communication. A service mesh using Kafka with REST APIs is a common architecture.

Other great implementations exist for request-response communication. For instance:

  • gRPC: Efficient request-response communication via a cross-platform open source high-performance Remote Procedure Call framework
  • GraphQL: Data query and manipulation language for APIs and a runtime for fulfilling queries with existing data

Nevertheless, HTTP is the first choice in most projects. gRPC, GraphQL, or other implementations are chosen for specific problems if HTTP is not good enough.

Use cases for HTTP / REST APIs with Apache Kafka

RESTful interfaces to an Apache Kafka cluster make it easy to produce and consume messages, view the cluster’s metadata, and perform administrative actions using standard HTTP(S) instead of the native TCP-based Kafka protocol or clients.

Each scenario differs significantly in its purpose. Some use cases are implemented out of convenience, while others are required because of technical specifications.

There are two major categories of use cases for combining HTTP with Kafka. In terms of cloud-native architectures, this can be divided into the management plane (i.e., administration and configuration) and the data plane (i.e., producing and consuming data).

Management plane with HTTP and Kafka

The management and administration of a Kafka cluster involve various tasks, such as:

  • Cluster management: Creation of a cluster, expanding or shrinking a cluster, etc.
  • Cluster configuration: Management of Kafka topics, consumer groups, key management, role-based access control (RBAC), etc.
  • CI/CD and DevOps integration: HTTP APIs are the most popular way to build delivery pipelines and automate administration, instead of using Python or other alternative scripting options (see the sketch after this list)
  • Data governance: Tools for data lineage, data catalogs, and policy enforcement need to be configured using APIs.
  • 3rd party monitoring integration: Connect metrics APIs, alerts, and other notifications into systems like Datadog, Slack, etc.
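
For illustration, here is a minimal sketch of such automation with plain Java: it creates a topic via the v3 Admin REST API of the Confluent REST Proxy. The proxy address and the cluster id (“my-cluster”) are placeholders – the real cluster id is returned by GET /v3/clusters:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TopicCreationSketch {
    public static void main(String[] args) throws Exception {
        // Topic creation request against the v3 Admin API (assumed proxy on localhost:8082).
        String body = "{\"topic_name\":\"payments\",\"partitions_count\":6,\"replication_factor\":3}";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8082/v3/clusters/my-cluster/topics"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}
```

The same call works from any CI/CD tool that can issue HTTP requests, which is precisely why REST APIs are so popular for administrative automation.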

Data plane with HTTP and Kafka

Many scenarios require or prefer the usage of REST APIs for producing and consuming messages to/from Kafka, such as:

  • Natural request-response applications such as mobile apps: These applications and their frameworks almost always require integration via HTTP and request-response. WebSockets, Server-Sent Events (SSE), and similar concepts would be a better fit for data streaming with Kafka, but they are often not supported by the client frameworks.
  • Legacy application and third-party tool integration: Legacy applications, standard software, and traditional middleware are often proprietary. The only integration capabilities are HTTP/REST. Extract, transform, load (ETL), enterprise service bus (ESB), and other third-party tools are complementary to data streaming with Kafka. Mainframe integration using REST APIs from COBOL to Kafka is another example.
  • API gateway: Most API management tools do not provide native support for data streaming and Kafka today and only work on top of REST interfaces. Kafka (via the REST interface) and API management are still very complementary for some use cases, such as service monetization or integration with partner systems.
  • Other programming languages: Kafka provides Java and Scala clients. Confluent provides and supports additional clients, including Python, .NET, C, C++, and Go. More Kafka clients exist from the community, including Erlang, Kotlin, Node.js, PHP, Ruby, and Rust. Many of these community clients are not battle tested or supported. Therefore, calling the REST API from your favorite programming language is sometimes the better and easier option. Others, such as COBOL on the mainframe, don’t even provide a Kafka client at all. Hence, a REST API is the only viable solution.

Example: HTTP + Kafka with Confluent REST Proxy

The Confluent REST Proxy has been around for a long time and is available under the Confluent Community License. Many companies use it in production as a management plane and data plane as a self-managed component in conjunction with open source Apache Kafka, Confluent Platform, or Confluent Cloud.

While I am not a lawyer, the short version is that you can use the Confluent REST Proxy for free – even for your production workloads at any scale – as long as you don’t build a competitive cloud service with it (say, e.g., the “AWS Kafka HTTP Proxy”) and charge per hour or volume for the serverless offering.

The Confluent REST Proxy and REST APIs are separated into both a data plane and a management plane:

Data plane: Produce, Consume | Management Plane: Brokers, Topics, Consumer Groups, ACLs

While some applications require both, in many scenarios, only one or the other is used.

The management plane is typically used for very low throughput and a few API calls. The data plane, on the other hand, varies. Many applications produce and consume data continuously. The biggest limitation of the REST Proxy data plane is that it is a synchronous request-response protocol. A minimal produce call via HTTP is sketched below.
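
Here is a minimal sketch of such a synchronous produce call against the v2 API of the Confluent REST Proxy, using only the JDK’s HTTP client. Proxy address, topic name, and payload are assumptions for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProduceSketch {
    public static void main(String[] args) throws Exception {
        // One HTTP POST produces one batch of records to the "orders" topic.
        String body = "{\"records\":[{\"key\":\"order-4711\",\"value\":{\"item\":\"book\",\"amount\":1}}]}";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8082/topics/orders"))
            .header("Content-Type", "application/vnd.kafka.json.v2+json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        // The response reports the partition and offset of each written record.
        System.out.println(response.statusCode() + ": " + response.body());
    }
}
```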

What scale and volumes does a REST Proxy for Kafka support?

Don’t underestimate the power of the REST Proxy as a data plane: batching and many parallel REST Proxy instances allow it to scale. There are deployments where four REST Proxy instances can handle ~20,000 events per second, which is sufficient for many use cases.

Many HTTP use cases do not require millions of events per second. Hence, the Confluent REST Proxy is often good enough. This is even true for many IoT use cases I have seen in the wild where devices or machines connect to Kafka via HTTP.

How does HTTP’s streaming data transfer fit into the architecture?

Please note that chunked transfer encoding is a streaming data transfer mechanism available in version 1.1 of HTTP. In chunked transfer encoding, the data stream is divided into a series of non-overlapping “chunks”. The chunks are sent out and received independently of one another.

Some Kafka REST Produce APIs support a streaming mode that allows sending multiple records over a single stream. The stream remains open unless explicitly terminated. The streaming mode can be achieved by setting an additional header “Transfer-Encoding: chunked” on the initial request. Check if your favorite Kafka proxy or cloud API supports the HTTP streaming mode.
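
As a sketch of the idea, the following Java snippet uses HttpURLConnection, whose setChunkedStreamingMode() switches the request to chunked transfer encoding. The endpoint path and the newline-delimited record format are hypothetical – they depend on the concrete proxy or cloud API you use:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ChunkedProduceSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical streaming-mode produce endpoint; check your proxy's documentation.
        URL url = new URL("https://kafka-proxy.example.com/v3/clusters/my-cluster/topics/orders/records");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setChunkedStreamingMode(0); // sends the "Transfer-Encoding: chunked" header
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            for (int i = 0; i < 3; i++) {
                // Hypothetical record format: one JSON document per line over the open stream.
                String record = "{\"value\":{\"type\":\"JSON\",\"data\":\"event-" + i + "\"}}\n";
                out.write(record.getBytes(StandardCharsets.UTF_8));
                out.flush(); // each record leaves as its own chunk; the connection stays open
            }
        }
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}
```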

Architecture options for Kafka + Rest Proxy

Different deployment options exist for the Confluent REST Proxy:

REST Proxy Deployment Options for Apache Kafka and HTTP

The self-managed REST Proxy instance or cluster of instances (as a “dedicated node”) is still decoupled from the open-source Kafka broker or commercial Confluent Server. This is the ideal option for a data plane to produce and consume messages.

The management plane is also embedded as a unified REST API into Confluent Server (as a “broker plugin”) and Confluent Cloud for administrative operations. This simplifies the architecture because no additional nodes are required for using the administration APIs.

In some deployments, both approaches may be combined: The management plane is used via the embedded REST APIs in Confluent Server or in Confluent Cloud. Meanwhile, data plane use cases are decoupled into their own REST Proxy instances to easily handle scalability and be independent of the server side.

The developer does not have to care about the infrastructure or architecture for REST APIs in Confluent Cloud. The HTTP interfaces are fully managed by the vendor.

The REST APIs of the self-managed REST Proxy and Confluent Cloud are compatible. Hybrid architectures and cloud migration are possible without implementing any breaking changes.

Data governance for data streaming and HTTP services with a Schema Registry

Data governance is an important part of most data streaming projects. Kafka deployments usually include various decoupled producers and consumers, often following the DDD principle for microservice architectures. Hence, Confluent Schema Registry is used in most projects for schema enforcement and versioning.

Any Kafka client built by Confluent can leverage the Schema Registry using Avro, Protobuf, or JSON Schema. This includes programming APIs like Java, Python, and Go, but also Kafka Connect sources and sinks, Kafka Streams, ksqlDB, and the Confluent REST Proxy. Like the REST Proxy, Schema Registry is available under the Confluent Community License.

Schema Registry lives separately from your Kafka brokers. Confluent REST Proxy still talks to Kafka to publish and read data (messages) to topics. Concurrently, the REST Proxy can also talk to Schema Registry to send and retrieve schemas that describe the data models for the messages.

Schema Registry provides a serving layer for your metadata and enables data governance and schema enforcement for all events. It provides a RESTful interface for storing and retrieving your Avro, JSON Schema, and Protobuf schemas. The Schema Registry stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, and allows schemas to evolve according to the configured compatibility settings. It also provides serializers that plug into Kafka clients and handle schema storage and retrieval for Kafka messages sent in any of the supported formats:

Nest schemas | Plugins | Custom plugin | Schema Registry

Schema enforcement happens on the client side. Additionally, Confluent Platform and Confluent Cloud provide server-side schema validation. The latter is helpful if incorrect or malicious client applications send messages to Kafka without using the client-side Schema Registry integration.
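
For illustration, here is a sketch of the RESTful interface in action: it registers an Avro schema for the value side of an “orders” topic (following the default <topic>-value subject name strategy) and then fetches the latest version. The registry address and the schema itself are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaRegistrySketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Register a new schema version under the subject "orders-value".
        String body = "{\"schema\": \"{\\\"type\\\":\\\"record\\\",\\\"name\\\":\\\"Order\\\","
            + "\\\"fields\\\":[{\\\"name\\\":\\\"id\\\",\\\"type\\\":\\\"string\\\"}]}\"}";
        HttpRequest register = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8081/subjects/orders-value/versions"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        // The response contains the globally unique schema id, e.g. {"id":1}.
        System.out.println(client.send(register, HttpResponse.BodyHandlers.ofString()).body());

        // Retrieve the latest registered version of the same subject.
        HttpRequest latest = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8081/subjects/orders-value/versions/latest"))
            .GET()
            .build();
        System.out.println(client.send(latest, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```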

API Management and data sharing

API Management is a term from the Open API world that puts a layer on top of HTTP / REST APIs for the management, monitoring, and monetization of APIs. Solutions include Apigee, MuleSoft Anypoint, Kong, IBM API Connect, TIBCO Mashery, and many more.

Features of API gateways and API management products

API Gateway and API Management Tools provide many outstanding features:

  • API Portal for creating and publishing APIs
  • Enforcing usage policies and controlling access
  • Technical features for data transformations
  • Nurturing the subscriber community
  • Collecting and analyzing usage statistics
  • Reporting on performance
  • Monetization and billing

These features are unavailable in any data streaming platform like Kafka, Pulsar, Kinesis, et al. On the other hand, API tools like MuleSoft or Kong are not built for processing real-time data at scale with low latency.

API Management and data streaming are complementary

Hence, API Management and data streaming are complementary, not competitive! The blog post “Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies?” explores this:

Integration of Kafka and API Management Tools using REST HTTP

API == REST/HTTP for most API Management products and related API gateways. Vendors are starting to integrate either the Kafka API or event standards like AsyncAPI to get into event-based architectures. That’s great news!

Sharing of streaming data requires a stream data exchange instead of HTTP APIs

Data sharing becomes crucial to modern and flexible enterprise architectures that build on concepts like microservices and data mesh. Real-time data beats slow data. That’s not just true for applications but also for data replication across business units, organizations, B2B, clouds, hybrid environments, and other scenarios. Therefore, the next generation of data sharing is built on top of data streaming.

HTTP APIs make little sense in many data streaming scenarios, especially if you expect high volumes or require low latency. Hence, data sharing in real-time by linking Kafka clusters or using a stream data exchange is the much better approach:

Stream Data Exchange and Data Sharing with Apache Kafka instead of REST and HTTP

I won’t go into more detail here. The dedicated blog post “Streaming Data Exchange with Kafka for Real-Time Data Sharing” explores the idea, its trade-offs, and some real-world examples.

Not one or the other, instead combine Kafka and HTTP/REST in your projects!

Various use cases employ HTTP/REST with Apache Kafka as a management plane or data plane. This combination will not go away in the future.

The Confluent REST Proxy can be used for HTTP(S) communication with your favorite client interface, no matter if you run open-source Apache Kafka, Confluent Platform, or Confluent Cloud. Check out the source code on GitHub or get started with an intuitive tutorial.

How do you combine data streaming with request-response principles? How do you combine HTTP and Kafka? What proxy are you using? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Request-Response with REST/HTTP vs. Data Streaming with Apache Kafka – Friends, Enemies, Frenemies? appeared first on Kai Waehner.

Cloud-native Core Banking Modernization with Apache Kafka https://www.kai-waehner.de/blog/2022/03/16/core-banking-mainframe-modernization-with-cloud-native-apache-kafka/ Wed, 16 Mar 2022 07:52:11 +0000 https://www.kai-waehner.de/?p=4336 Most financial service institutions operate their core banking platform on legacy mainframe technologies. The monolithic, proprietary, inflexible architecture creates many challenges for innovation and cost-efficiency. This blog post explores three open, elastic, and scalable banking solutions powered by Apache Kafka to solve these problems.

The post Cloud-native Core Banking Modernization with Apache Kafka appeared first on Kai Waehner.

Most financial service institutions operate their core banking platform on legacy mainframe technologies. The monolithic, proprietary, inflexible architecture creates many challenges for innovation and cost-efficiency. This blog post explores what an open, elastic, and scalable solution powered by Apache Kafka can look like to solve these problems. Three cloud-native real-world banking solutions show how transactional and analytical workloads can be built at any scale in real-time for traditional business processes like payments or regulatory reporting and innovative new applications like crypto trading.

Cloud Native Core Banking Platform powered by Apache Kafka Data Streaming

What is core banking?

Core banking is a banking service provided by networked bank branches. Customers may access their bank accounts and perform basic transactions from member branch offices or connected software applications like mobile apps. Core banking is often associated with retail banking, and many banks treat retail customers as their core banking customers. Wholesale banking is a business conducted between banks. Securities trading involves the buying and selling of stocks, shares, and so on.

Businesses are usually managed via the corporate banking division of the institution. Core banking covers the basic depositing and lending of money. Standard core banking functions include transaction accounts, loans, mortgages, and payments.

Typical business processes of the banking operating system include KYC (“Know Your Customer”), product opening, credit scoring, fraud, refunds, collections, etc.

Banks make these services available across multiple channels like automated teller machines, Internet banking, mobile banking, and branches. Banking software and network technology allow a bank to centralize its record-keeping and allow access from any location.

Automatic analytics for regulatory reporting, flexible configuration to adjust workflows and innovate, and an open API to integrate with 3rd party ecosystems are crucial for modern banking platforms.

Transactional vs. analytical workloads in core banking

Workloads for analytics and transactions have very different characteristics and requirements. The use cases differ significantly. Most core banking workflows are transactional.

SLAs are very different and are crucial to understanding to guarantee the proper behavior in case of infrastructure issues and for disaster recovery:

  • RPO (Recovery Point Objective): The actual data loss you can live with in the case of a disaster
  • RTO (Recovery Time Objective): The actual recovery period (i.e., downtime) you can live with in the case of a disaster

Disaster Recovery - RPO Recovery Point Objective and RTO Recovery Time Objective

While downtime or data loss is never good, even in analytics use cases, it is often acceptable when compared to the cost and risk of guaranteeing an RTO and RPO close to zero. Hence, if the reporting function for end-users is down for an hour or, even worse, a few reports are lost, life goes on.

In transactional workloads within a core banking platform, RTO and RPO need to be as close to zero as possible, even in the case of a disaster (e.g., if a complete data center or cloud region is down). If the core banking platform loses payment transactions or other events required for compliance processing, then the bank is in huge trouble.

Legacy core banking platforms

Advancements in the Internet and information technology reduced manual work in banks and increased efficiency. Computer software was developed decades ago to perform core banking operations like recording transactions, passbook maintenance, interest calculations on loans and deposits, customer records, the balance of payments, and withdrawal.

Banking software running on a mainframe

Most core banking platforms of traditional financial services companies still run on mainframes. The machines, operating systems, and applications still do a great job. SLAs like RPO and RTO are not new. If you look at IBM’s mainframe products and docs, the core concepts are similar to cutting-edge cloud-native technologies. Downtime, data loss, and similar requirements need to be defined.

The resulting architecture provides the needed guarantees. IBM DB2, IMS, CICS, and Cobol code still operate transactional workloads very reliably. A modern IBM z15 mainframe, announced in 2019, provides up to 40 TB of RAM and 190 cores. That’s very impressive.

Monolithic, proprietary, inflexible mainframe applications

So, what’s the problem with legacy core banking platforms running on a mainframe or similar other infrastructure?

  • Monolithic: Legacy mainframe applications are extremely monolithic. This is not comparable to a monolithic web application from the 2000s running on IBM WebSphere or another J2EE / Java EE application server. It is much worse.
  • Proprietary: IMS, CICS, MQ, DB2, et al. are very mature technologies. However, next-generation decision makers, cloud architects, and developers expect open APIs, open-source core infrastructure, and best-of-breed solutions and SaaS with independent freedom of choice for each problem.
  • Inflexible: Most legacy core banking applications have done their job for decades. The Cobol code runs. However, it is neither understood nor documented, and Cobol developers are scarce. The only option is to keep extending existing applications. Also, the infrastructure is not elastic enough to scale up and down in a software-defined manner. Instead, companies have to buy hardware for millions of dollars (and still pay an additional fortune for the transactions).

Yes, the mainframe supports up-to-date technologies such as DB2, MQ, WebSphere, Java, Linux, Web Services, Kubernetes, Ansible, and Blockchain! Nevertheless, this does not solve the existing problems. It only helps when you build new applications. However, new applications are usually built with modern cloud-native infrastructure and frameworks instead of relying on legacy concepts.

Optimization and cost reduction of existing mainframe applications

The above sections looked at the enterprise architecture with RPO/RTO in mind to guarantee uptime and no data loss. This is crucial for decision-makers responsible for the business unit’s risk and revenue.

However, the third aspect besides revenue and risk is cost. Beyond providing an elastic and flexible infrastructure for the next-generation core banking platform, companies also move away from legacy solutions for cost reasons.

Enterprises save millions of dollars by just offloading data from a mainframe to modern event streaming:

Mainframe Offloading from Cobol to Apache Kafka and Java

For instance, streaming data empowers the Royal Bank of Canada (RBC) to save millions of dollars by offloading data from the mainframe to Kafka. Here is a quote from RBC:

… rescue data off of the mainframe, in a cloud-native, microservice-based fashion … [to] … significantly reduce the reads on the mainframe, saving RBC fixed infrastructure costs (OPEX). RBC stayed compliant with bank regulations and business logic, and is now able to create new applications using the same event-based architecture.
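
Such offloading often starts with a source connector deployed via the Kafka Connect REST API. The following sketch submits a hypothetical JDBC source connector configuration that copies new rows from a DB2 table into a Kafka topic – Connect endpoint, JDBC URL, table, and column names are placeholders, and change data capture tools are a common alternative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class OffloadConnectorSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC source connector that offloads a DB2 table into the
        // "mainframe-PAYMENTS" topic by polling for new rows via an incrementing column.
        String connector = """
            {
              "name": "db2-offload",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:db2://mainframe.example.com:50000/COREDB",
                "table.whitelist": "PAYMENTS",
                "mode": "incrementing",
                "incrementing.column.name": "TX_ID",
                "topic.prefix": "mainframe-"
              }
            }""";
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8083/connectors")) // assumed Connect REST endpoint
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(connector))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}
```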

Read my dedicated blog post if you want to learn more about mainframe offloading, integration, and migration to Apache Kafka.

Modern cloud-native core banking platforms

This post is not just about offloading and integration. Instead, we look at real-world examples where cloud-native core banking replaced existing legacy mainframes or enabled new FinTech companies to start in a cutting-edge real-time cloud environment from scratch to compete with the traditional FinServ contenders.

Requirements for a modern banking platform?

Here are the requirements I hear regularly on the wish list of executives and lead architects from financial services companies for new banking infrastructure and applications:

  • Real-time data processing
  • Scalable infrastructure
  • High availability
  • True decoupling and backpressure handling
  • Cost-efficient cost model
  • Flexible architecture for agile development
  • Elastic scalability
  • Standards-based interfaces and open APIs
  • Extensible functions and domain-driven separation of concerns
  • Secure authentication, authorization, encryption, and audit logging
  • Infrastructure-independent deployments across an edge, hybrid, and multi-region / multi-cloud environments

What are cloud-native infrastructure and applications?

And here are the capabilities of a genuinely cloud-native infrastructure to build a next-generation core banking system:

  • Real-time data processing
  • Scalable infrastructure
  • High availability
  • True decoupling and backpressure handling
  • Cost-efficient cost model
  • Flexible architecture for agile development
  • Elastic scalability
  • Standards-based interfaces and open APIs
  • Extensible functions and domain-driven separation of concerns
  • Secure authentication, authorization, encryption, and audit logging
  • Infrastructure-independent deployments across an edge, hybrid, and multi-region / multi-cloud environments

I think you get my point here: Adopting cloud-native infrastructure is critical for success in building next-generation banking software. No matter if you

  • are on-premise or in the cloud
  • are a traditional player or a startup
  • focus on a specific country or language, or operate across regions or even globally

Apache Kafka = cloud-native infrastructure for real-time transactional workloads

Many people think that Apache Kafka is not built for transactions and should only be used for big data analytics. A dedicated blog post explores when and how to use Kafka in resilient, mission-critical architectures and when to use the built-in Transaction API.

Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). Zero downtime and zero data loss can be guaranteed, just like with your favorite database, mainframe, or other core platforms.

Elastic scalability and rolling upgrades allow a flexible and reliable data streaming infrastructure for transactional workloads to guarantee business continuity. The architect can even stretch a cluster across regions with tools such as Confluent Multi-Region Clusters. This setup ensures zero data loss and zero downtime even in case of a disaster where a data center is entirely down.

The post “Global Kafka Deployments” explores the different deployment options and their trade-offs in more detail. Check out my blog post about transactional vs. analytical workloads with Apache Kafka for more information.

Apache Kafka in banking and financial services

The rise of event streaming in financial services is growing like crazy. Continuous real-time data integration and processing are mandatory for many use cases. Many business departments in the financial services sector deploy Apache Kafka for mission-critical transactional workloads and big data analytics, including core banking. High scalability, high reliability, and an elastic open infrastructure are the critical reasons for Kafka’s success.

To learn more about different use cases, architectures, and real-world examples in the FinServ sector, check out the post “Apache Kafka in the Financial Services Industry“. Use cases include

  • Wealth management and capital markets
  • Market and credit risk
  • Cybersecurity
  • IT Modernization
  • Retail and corporate banking
  • Customer experience

Modern cloud-native core banking solutions powered by Kafka

Now, let’s explore specific cloud-native core banking solutions built with Apache Kafka. The following subsections show three real-world examples.

Thought Machine – Correctness and scale in a single platform

Thought Machine is an innovative and flexible core banking operating system. The core capabilities of Thought Machine’s solution include

  • Cloud-native core banking software
  • Transactional workloads (24/7, zero data loss)
  • Flexible product engine powered by smart contracts (not blockchain)

The cloud-native core banking operating system enables a bank to achieve a wide scale of customization without having to change anything on the underlying platform. This is highly advantageous and a crucial part of how its architecture is a counterweight to the “spaghetti” that arises in other systems when customization and platform functionality are not separated.

Thought Machine Cloud Native Core Banking powered by Apache Kafka

Thought Machine’s Kafka Summit talk from 2021 explores how Thought Machine’s core banking system ‘Vault’ was built in a cloud-first manner with Apache Kafka at its heart. It leverages event streaming to enable asynchronous and parallel processing at scale, specifically focusing on the architectural patterns to ensure ‘correctness’ in such an environment.

10x Banking – Channel agnostic transactions in real-time

10X Banking provides a cloud-native core banking platform. In their Kafka Summit talk, they talked about the history of core banking and how they leverage Apache Kafka in conjunction with other open-source technologies within their commercial platform.

10X Core Banking powered by Apache Kafka

10x’s cloud-native approach provides flexible product lifecycles. Time-to-market is a key benefit. Organizations do not need to start from scratch. A unified data model and tooling allow teams to focus on the business problems.

The 10x platform is a secure, reliable, scalable, and regulatory-compliant SaaS platform that minimizes the regulatory burden and cost. It is built on a domain-driven design with true decoupling between transactional workloads and analytics/reporting.

Kafka is a data hub within the comprehensive platform for real-time analytics, transactions, and cybersecurity. Apache Kafka is not the silver bullet for every problem. Hence, 10x chose a best-of-breed approach to combine different open-source frameworks, commercial products, and SaaS offerings to build the cloud-native banking framework.

Here is how 10X Banking built a cloud-native core banking platform to enable real-time and batch analytics with a single data streaming pipeline leveraging the Kappa architecture:

Kappa Architecture at 10X Banking powered by Apache Kafka

The key components include Apache Kafka for data streaming, plus Apache Pinot and Trino for analytics.

Custodigit – Secure crypto investments with stateful data streaming and orchestration

Custodigit is a modern banking platform for digital assets and cryptocurrencies. It provides crucial features and guarantees for seriously regulated crypto investments:

  • Secure storage of wallets
  • Sending and receiving on the blockchain
  • Trading via brokers and exchanges
  • Regulated environment (a key aspect, and no surprise, as this product comes from Switzerland – a very regulated market)

Kafka is the central core banking nervous system of Custodigit’s microservice architecture. Stateful Kafka Streams applications provide workflow orchestration with the “distributed saga” design pattern for the choreography between microservices. Kafka Streams was selected because of:

  • lean, decoupled microservices
  • metadata management in Kafka
  • unified data structure across microservices
  • transaction API (aka exactly-once semantics)
  • scalability and reliability
  • real-time processing at scale
  • a higher-level domain-specific language for stream processing
  • long-running stateful processes

I covered Custodigit and other blockchains/crypto platforms in a separate blog post: Apache Kafka as Data Hub for Crypto, DeFi, NFT, Metaverse – Beyond the Buzz.

Cloud-native core banking provides elastic scale for transactional workloads

Modern core banking software needs to be elastic, scalable, and real-time. This is true for transactional workloads like KYC or credit scoring and analytical workloads, like regulatory reporting. Apache Kafka enables processing transactional and analytical workloads in many modern banking solutions.

Thought Machine, 10X Banking, and Custodigit are three cloud-native examples powered by the Apache Kafka ecosystem to enable the next generation of banking software in real-time. Open Banking is achieved with open APIs to integrate with other 3rd party services.

The integration, offloading, and later replacement of legacy technologies such as mainframe with modern data streaming technologies prove the business value in many organizations. Kafka is not a silver bullet, but the central and mission-critical data hub for real-time data integration and processing.

How do you leverage data streaming for analytical or transactional workloads? What architecture does your platform use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Cloud-native Core Banking Modernization with Apache Kafka appeared first on Kai Waehner.

Analytics vs. Transactions in Data Streaming with Apache Kafka https://www.kai-waehner.de/blog/2022/03/09/analytics-vs-transactions-api-data-streaming-with-apache-kafka/ Wed, 09 Mar 2022 14:01:10 +0000 https://www.kai-waehner.de/?p=4337 Workloads for analytics and transactions have very different characteristics and requirements. Many people think that Apache Kafka is not built for transactions and should only be used for big data analytics. This blog post explores when and how to use Kafka in resilient, mission-critical architectures and when to use the built-in Transaction API.

The post Analytics vs. Transactions in Data Streaming with Apache Kafka appeared first on Kai Waehner.

Workloads for analytics and transactions have very different characteristics and requirements. The use cases differ significantly. SLAs are very different, too. Many people think that Apache Kafka is not built for transactions and should only be used for big data analytics. This blog post explores when and how to use Kafka in resilient, mission-critical architectures and when to use the built-in Transaction API.

Apache Kafka Transactions API vs Big Data Lake and Batch Analytics

Analytical and transactional workloads

Let’s begin by defining the terms. The YouTube channel ‘Databases Demystified’ has a great episode: Analytical vs. Transactional. I use and enhance its explanation in the following subsections.

Some people refer to this as an “OLTP vs. OLAP” discussion:
  • In OLTP (online transaction processing), information systems typically facilitate and manage transaction-oriented applications.
  • In OLAP (online analytical processing), information systems generally execute much more complex queries, in a smaller volume, for the purpose of business intelligence or reporting rather than to process transactions.

There are some overlaps in some use cases and products. Hence, I use the more generic terms “transactions” and “analytics” in this blog post.

Analytical workloads

Analytical workloads have the following characteristics:

  • Processing large amounts of information for creating aggregates
  • Read-only queries and (usually) batch-write data loads
  • Supporting complex queries with multiple steps of data processing, join conditions, and filtering
  • Highly variable ad hoc queries, many of which may only be run once, ever
  • Not mission-critical, meaning downtime or data loss is not good, but in most cases not a disaster for the core business

Analytics solutions

Analytics solutions exist on-premises and in all major clouds. The tools differ regarding their capabilities and sweet spots. Examples include:

  • Redshift (Amazon Web Services)
  • BigQuery (Google Cloud)
  • Snowflake
  • Hive / HDFS / Spark
  • And many more!

Transactional workloads

Transactional workloads have unique characteristics and SLAs compared to analytical workloads:

  • Manipulating one object at a time (often across different systems)
  • Create, Read, Update, and Delete (CRUD) operations, inserting data one object at a time or updating existing data (often across different systems)
  • Precisely managing state with guarantees about what has or hasn’t been written to disk
  • Supporting many operations per second in real-time with high throughput
  • Mission-critical SLAs for uptime, availability, and latency of the end-to-end data communication

Transactional solutions

Transactional solutions include applications, databases, messaging systems, and integration middleware:

  • IBM Mainframe (including CICS, IMS, DB2)
  • TIBCO EMS
  • PostgreSQL
  • Oracle Database
  • MongoDB
  • And many more!

Often, a transactional workload has to guarantee ACID principles (i.e., all-or-nothing writes across different applications and technologies).

A mix of transactional and analytical workloads

Many solutions support a mix of transactional and analytical workloads.

For instance, many enterprises store transactional data in MongoDB but also process complex queries for analytics use cases in the same database. MongoDB started as a document-based NoSQL database. In the meantime, it has become a general-purpose database platform that also supports other forms of queries, such as graph and tree traversal:

MongoDB Database Query Capabilities

Hence, focus on the business problem first. Then, you can decide if your existing infrastructure can solve the problem or if you need yet another one. But there is no silver bullet. A vendor-independent best-of-breed approach works best in most enterprise architectures I see in the success stories from the field.

Data at Rest vs. Data in Motion

Batch vs. real-time data processing is an important discussion you should have in every project. Statements like “batch processing is for analytics, real-time processing is for transactions” are not always correct. Real-time beats slow data in almost all use cases from a business value perspective. Nevertheless, batch processing is the better approach for some specific use cases.

Analytics platforms for batch processing

Data at Rest means storing data in a database, data warehouse, or data lake. In many use cases, this means the data is processed too late – even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at Rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) require this approach… If you do it right! Data at Rest can be used for transactional workloads, too!

Apache Kafka for real-time data streaming

The Kafka API is the de facto standard API for Data in Motion, like Amazon S3 is for object storage. Why is Kafka so successful? Real-time beats slow data in most use cases across industries.

The same cloud-native approach is required for event streaming as for the modern data lake. Building a scalable and cost-efficient infrastructure is crucial for the success of a project. Event streaming and data lake technologies are complementary, not competitive.

I will not explore the reasons and use cases for the success of Kafka in this post. Instead, check out my overview about Kafka use cases across industries for more details. Or read some of my vertical-specific blog posts.

In short, most added value comes from processing Data in Motion while it is relevant instead of storing Data at Rest and processing it later (or too late). Many analytical and transactional workloads use Kafka for this reason.

Apache Kafka for analytics

Even in 2022, many people think about Kafka as a data ingestion layer into data stores. This is still a critical use case. Enterprises use Kafka as the ingestion layer for different analytics platforms:

  • Batch reporting and dashboards
  • Interactive queries (using Tableau, Qlik, and similar tools)
  • Data preparation for batch calculations, model training, and other analytics
  • Connectivity into different data warehouses, data lakes, and other data sinks using a best-of-breed approach

But Kafka is much more than a messaging and ingestion layer. Here are a few examples using Kafka for analytics (often together with other analytics tools to solve a specific problem):

  • Data integration for various source systems using Kafka Connect and pre-built connectors (including real-time, near real-time, batch, web service, file, and proprietary interfaces)
  • Decoupling and backpressure handling as the sink systems are often not ready for vast volumes of real-time data. Domain-driven design (DDD) for true decoupling is a crucial differentiator of Kafka compared to other middleware and message queues.
  • Data processing at scale in real-time filters, transforms, generalizes, or aggregates incoming data sets before ingesting them into sink systems.
  • Real-time analytics applied within the Kafka application. Many analytics platforms were designed for near real-time or batch workloads, but not for resilient model scoring with low latency – especially at scale. An example could be an analytic model trained with batch machine learning algorithms in a data lake with Spark MLlib or TensorFlow and then deployed into a Kafka Streams or ksqlDB application (a small stream processing sketch follows this list).
  • Replay historical events in cases such as onboarding a new consumer application, error-handling, compliance or regulatory processing, or schema changes in an analytics platform. This becomes especially relevant if Tiered Storage is used under the hood of Kafka for cost-efficient and scalable long-term storage.
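
As a small example of such continuous processing, the following Kafka Streams sketch counts events per key in tumbling one-minute windows before the results would be forwarded to an analytics sink. Topic name and broker address are assumptions:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PaymentsPerMinuteSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-per-minute");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");
        payments.groupByKey()
                // Tumbling one-minute windows, continuously updated as events arrive.
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
                .count()
                .toStream()
                .foreach((window, count) -> System.out.printf("%s @ %s -> %d%n",
                        window.key(), window.window().startTime(), count));

        new KafkaStreams(builder.build(), props).start();
    }
}
```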

Analytics example with Confluent Cloud and AWS services

Here is an illustration from an AWS architecture combining Confluent and its ecosystem including connectors, stream processing capabilities, and schema management together with several 1st party AWS cloud services:

Real-time analytics with Kafka Confluent Cloud and AWS

As you can see, Kafka is an excellent tool for analytical workloads. It is not a silver bullet but used for appropriate parts of the overall data management architecture. I have another blog post that explores the relationship between Kafka and other serverless analytics platforms.

However, Kafka is NOT just used for analytical workloads!

Apache Kafka for transactions

Around 60 to 70% of use cases and deployments I see at customers across the globe leverage the Kafka ecosystem for transactional workloads. Enterprises use Kafka for:

  • core banking platforms
  • fraud detection
  • global replication of order and inventory information
  • integration with business-critical platforms like CRM, ERP, MES, and many other transactional systems
  • supply chain management
  • customer communication like point-of-sale integration or context-specific upselling
  • and many other use cases where every single event counts.

Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). Zero downtime and zero data loss can be guaranteed, just like with your favorite database, mainframe, or other core platforms.

Elastic scalability and rolling upgrades allow building a flexible and reliable data streaming infrastructure for transactional workloads to guarantee business continuity. The architect can even stretch a cluster across regions to ensure zero data loss and zero downtime even in case of a disaster where a data center is completely down. The post “Global Kafka Deployments” explores the different deployment options and their trade-offs in more detail.

Kafka Transactions API example

And even better: Kafka’s Transaction API, i.e., Exactly-Once Semantics (EOS), has been available since Kafka 0.11 (which went GA a long time ago). EOS makes building transactional workloads even easier as you don’t need to handle duplicates anymore.

Kafka now supports atomic writes across multiple partitions through the transactions API. This allows a producer to send a batch of messages to multiple partitions. Either all messages in the batch are eventually visible to any consumer, or none are ever visible to consumers. Here is an example:

Transaction API in Apache Kafka
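
Here is a minimal sketch of the producer side with the plain Java client: two writes to different topics either become visible together or not at all. Topic names, keys, and the transactional id are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Setting a transactional id enables idempotence and the transactions API.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-tx-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes become visible atomically - or not at all.
                producer.send(new ProducerRecord<>("debits", "account-1", "-100"));
                producer.send(new ProducerRecord<>("credits", "account-2", "+100"));
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction(); // consumers never see the partial writes
                throw e;
            }
        }
    }
}
```

Consumers only see committed records if they set isolation.level=read_committed.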

Kafka provides a built-in transactions API. And the performance impact (that many people are worried about) is minimal. Here is a simple rule of thumb: If you care about exactly-once semantics, simply activate it! If performance issues arise, you can still fine-tune your application or disable it later. But most projects are fine with the minimal performance trade-off versus the enormous benefit of handling transactional behavior out-of-the-box.

Nevertheless, to be clear: You don’t need to use Kafka’s Transaction API to build mission-critical, transactional workloads.

SAGA design pattern for transactional data in Kafka without transactions

The Kafka Transactions API is optional. As discussed above, Kafka is resilient without transactions – though eliminating duplicates is then your task. Exactly-once semantics solve this problem out-of-the-box across all Kafka components. Kafka Connect, Kafka Streams, ksqlDB, and different clients like Java, C++, .NET, and Go support EOS.

However, I am also not saying that you should always use the Kafka Transaction API or that it solves every transactional problem. Keep in mind that scalable distributed systems require other design patterns than a traditional “Oracle to IBM MQ transaction”.

Some business transactions span multiple services. Hence, you need a mechanism to implement transactions that span services. A familiar design pattern and implementation for such a transactional workload is the SAGA pattern with a stateful orchestration application.

Swisscom’s Custodigit is an excellent example of such an implementation leveraging Kafka Streams. It is a modern banking platform for digital assets and cryptocurrencies that provides crucial features and guarantees for seriously regulated crypto investments – more details in my blog post about Blockchain, Crypto, NFTs, and Kafka.

And yes, there are always trade-offs between the Kafka Transaction API with exactly-once semantics, stateful orchestration in a separate application, and two-phase-commit transactions as Oracle DB and IBM MQ use them. Choose the right tool to define your appropriate enterprise architecture!

Kafka with other data stores and streaming engines

Most enterprises use Kafka as the central scalable real-time data hub. Hence, use cases include analytical and transactional workloads.

Most Kafka projects I see today also leverage Kafka Connect for data integration, Kafka Streams/ksqlDB for continuous data processing, and Schema Registry for data governance.

Thus, with Kafka, one (distributed and scalable) infrastructure enables messaging, storage, integration, and data processing. But of course, most Kafka clusters connect to other applications (like SAP or Salesforce) and data management systems (like MongoDB, Snowflake, Databricks, et al.) for analytics:

Real-time decision making for claim processing and fraud detection in insurance with Apache Kafka

In a dedicated blog post, I explored in detail why Kafka is a database for some specific use cases but will NOT replace other databases and data lakes.

In addition to Kafka-native stream processing engines like Kafka Streams or ksqlDB, other streaming analytics frameworks like Apache Flink or Spark Streaming can easily be connected for transactional or analytical workloads. Just keep in mind that especially transactional workloads get harder end-to-end with every additional system and infrastructure you add to the enterprise architecture.

Kappa architecture for analytics AND transactions with Kafka as the data hub

Real-time data beats slow data. That’s true for almost every use case. Yet, enterprise architects build new infrastructures with the Lambda architecture that includes a separate batch layer for analytics and a real-time layer for transactional workloads.

A single real-time pipeline, called Kappa architecture, is the better fit. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa, but also show how batch processing still fits into this discussion without a separate Lambda batch layer. In the dedicated post, learn how a Kappa architecture can revolutionize how you build analytical and transactional workloads with the same scalable real-time data hub powered by Kafka.

How do you leverage data streaming for analytical or transactional workloads? Do you use exactly-once semantics to ease the implementation of transactions? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Analytics vs. Transactions in Data Streaming with Apache Kafka appeared first on Kai Waehner.
