The Past, Present and Future of Stream Processing
https://www.kai-waehner.de/blog/2024/03/20/the-past-present-and-future-of-stream-processing/
Wed, 20 Mar 2024
Stream processing has existed for decades. However, it has really taken off in the 2020s thanks to the adoption of open source frameworks like Apache Kafka and Apache Flink. Fully managed cloud services make it easy to configure and deploy stream processing in a cloud-native way, even without the need to write any code. This blog post explores the past, present and future of stream processing. The discussion includes various technologies and cloud services, low-code/no-code trade-offs, outlooks into the support of machine learning and GenAI, streaming databases, and the integration between data streaming and data lakes with Apache Iceberg.

The Past, Present and Future of Stream Processing

In December 2023, the research company Forrester proved that data streaming is a new software category and not just yet another integration or data platform by publishing "The Forrester Wave™: Streaming Data Platforms, Q4 2023". Get free access to the report here. The leaders are Microsoft, Google and Confluent, followed by Oracle, Amazon, Cloudera, and a few others. A great time to review the past, present and future of stream processing as a key component in a data streaming architecture.

The Past of Stream Processing: The Move from Batch to Real-Time

The evolution of stream processing began as industries sought more timely insights from their data. Initially, batch processing was the norm. Data was collected over a period, stored, and processed at intervals. This method, while effective for historical analysis, proved inefficient for real-time decision-making.

In parallel to batch processing, message queues were created to provide real-time communication for transactional data. Message brokers like IBM MQ or TIBCO EMS were a common way to decouple applications. Applications send and receive data in an event-driven architecture without worrying about whether the recipient is ready or how to handle backpressure. The stream processing journey began.

Slow Batch Processing vs Real-Time Stream Processing

Stream Processing is a Journey Over Decades…

… and we are still in a very early stage at most enterprises. Here is an excellent timeline from TimePlus about the journey of stream processing across open source frameworks, proprietary platforms, and SaaS cloud services:

30 Year Journey Into Streaming Analytics with Open Source Frameworks Proprietary Products and Cloud
Source: TimePlus

The stream processing journey started decades ago with research and first purpose-built proprietary products for specific use cases like stock trading.

Open source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Today, most enterprises are at least starting to understand the value of stream processing for analytical and transactional use cases across industries. The cloud is a fundamental change: you can start streaming and processing data with a button click, leveraging fully managed SaaS and simple UIs (if you don't want to operate infrastructure or write low-level source code).

TIBCO StreamBase, Software AG Apama, IBM Streams

The advent of message queue technologies like IBM MQ and TIBCO EMS moved many critical applications to real-time message brokers. Real-time messaging enabled the consumption of data in real-time to store it in a database, mainframe, or application for further processing.

However, only true stream processing capabilities included in tools like TIBCO StreamBase, Software AG Apama or IBM (InfoSphere) Streams marked a significant shift towards real-time data processing. These products enabled businesses to react to information as it arrived by processing and correlating the data in motion.

Visual coding in tools like StreamBase or Apama represents an innovative approach to developing stream processing solutions. These tools provide a graphical interface that allows developers and analysts to design, build, and test applications by connecting various components and logic blocks visually, rather than writing code manually. Under the hood, the code generation worked with a Streaming SQL language.

Here is a screenshot of the TIBCO StreamBase IDE for visual drag & drop of streaming pipelines:

TIBCO StreamBase IDE

Drawbacks of these early stream processing solutions included high cost, vendor lock-in, little flexibility regarding tools or APIs, and missing communities. These platforms were monolithic and built long before cloud-native elasticity and scalability became a requirement in most RFIs and RFPs for evaluating vendors.

Open Source Event Streaming with Apache Kafka

The truly significant change for stream processing came with the introduction of Apache Kafka, a distributed streaming platform that allowed for high-throughput, fault-tolerant handling of real-time data feeds. Kafka, alongside other technologies like Apache Flink, revolutionized the landscape by providing the tools necessary to move from batch to real-time stream processing seamlessly.

The adoption of open source technologies changed all industries. Openness, flexibility, and community-driven development enabled easier influence on the features and faster innovation.

Over 100,000 organizations use Apache Kafka. The massive adoption came from a unique combination of capabilities: messaging, storage, data integration, and stream processing, all in one scalable and distributed infrastructure.

Various open source stream processing engines emerged. Kafka Streams was added to the Apache Kafka project. Other examples include Apache Storm, Spark Streaming, and Apache Flink.

The Present of Stream Processing: Architectural Evolution and Mass Adoption

The fundamental change to processing data in motion has enabled the development of data products and data mesh. Decentralizing data ownership and management with domain-driven design and technology-independent microservices promotes a more collaborative and flexible approach to data architecture. Each business unit can choose its own technology, API, cloud service, and communication paradigm like real-time, batch, or request-response.

From Lambda Architecture to Kappa Architecture

Today, stream processing is at the heart of modern data architecture, thanks in part to the emergence of the Kappa architecture. This model simplifies the traditional Lambda Architecture by using a single stream processing system to handle both real-time and historical data analysis, reducing complexity and increasing efficiency.

Lambda architecture with separate real-time and batch layers:

Lambda Architecture for Batch and Real Time Data

Kappa architecture with a single pipeline for real-time and batch processing:

Kappa Architecture as a Unified Data Streaming Pipeline for Batch and Real-Time Events

For more details about the pros and cons of Kappa vs. Lambda, check out my article "Kappa Architecture is Mainstream Replacing Lambda". It explores case studies from Uber, Twitter, Disney and Shopify.
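
To make the Kappa idea concrete, here is a minimal, hedged sketch of a plain Java Kafka consumer that first replays the entire retained history of a topic and then keeps processing live events with the exact same code path, with no separate batch layer. The topic name and broker address are placeholders:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KappaReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "kappa-replay-demo");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"), new ConsumerRebalanceListener() { // hypothetical topic
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Rewind to the earliest retained offset: the application first
                    // replays all history, then seamlessly continues with live events.
                    consumer.seekToBeginning(partitions);
                }
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
            });
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // identical logic for historical and live data
                }
            }
        }
    }

    private static void process(String event) { /* business logic goes here */ }
}
```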

Apache Kafka has become synonymous with building scalable and fault-tolerant streaming data pipelines. Because Kafka facilitates true decoupling of domains and applications, it is integral to microservices and data mesh architectures.

Plenty of stream processing frameworks, products, and cloud services have emerged in the past years. This includes open source frameworks like Kafka Streams, Apache Storm, Samza, Flume, Apex, Flink, and Spark Streaming, and cloud services like Amazon Kinesis, Google Cloud Dataflow, and Azure Stream Analytics. The "Data Streaming Landscape 2024" gives an overview of relevant technologies and vendors.

Apache Flink seems to be becoming the de facto standard for many enterprises (and vendors). Its adoption curve looks like Kafka's four years ago:

The Rise of Open Source Streaming Processing with Apache Kafka and Apache Flink
Source: Confluent

This does not mean other frameworks and solutions are bad. For instance, Kafka Streams is complementary to Apache Flink, as it suits different use cases.

No matter what technology enterprises choose, the mass adoption of stream processing is in progress right now. This includes modernizing existing batch processes AND building innovative new business models that only work in real time. As a concrete example, think about ride-hailing apps like Uber, Lyft, FREENOW, Grab. They are only possible because events are processed and correlated in real-time. Otherwise, everyone would still prefer a traditional taxi.

Stateless and Stateful Stream Processing

In data streaming, stateless and stateful stream processing are two approaches that define how data is handled and processed over time:

Stateless and Stateful Stream Processing for Fraud Detection

The choice between stateless and stateful processing depends on the specific requirements of the application, including the nature of the data, the complexity of the processing needed, and the performance and scalability requirements.

Stateless Stream Processing

Stateless Stream Processing refers to the handling of each data point or event independently from others. In this model, the processing of an event does not depend on the outcomes of previous events or require keeping track of the state between events. Each event is processed based on the information it contains, without the need for historical context or future data points. This approach is simpler and can be highly efficient for tasks that don’t require knowledge beyond the current event being processed.

The implementation could be a stream processor (like Kafka Streams or Flink), functionality in a connector (like Kafka Connect Single Message Transforms), or a WebAssembly (Wasm) module embedded into a streaming platform.
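
Here is a hedged Kafka Streams sketch of stateless processing: every event is filtered and masked on its own, without any state store. The topic names, the masking rule, and the broker address are illustrative, not a definitive implementation:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StatelessProcessingApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stateless-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments"); // hypothetical input topic

        payments
            .filter((key, value) -> value != null && !value.isBlank())  // drop empty events
            .mapValues(value -> value.replaceAll("\\d{12,19}", "****")) // naive card-number masking
            .to("payments-masked");                                     // hypothetical output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Each record is processed in isolation, so this topology needs no state store, no changelog topic, and scales out trivially.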

Stateful Stream Processing

Stateful Stream Processing involves keeping track of information (state) across multiple events to perform computations that depend on data beyond the current event. This model allows for more complex operations like windowing (aggregating events over a specific time frame), joining streams of data based on keys, and tracking sequences of events or patterns over time. Stateful processing is essential for scenarios where the outcome depends on accumulated knowledge or trends derived from a series of data points, not just on a single input.

The implementation is much more complex and challenging than stateless stream processing. A dedicated stream processing implementation is required. Dedicated distributed engines (like Apache Flink) handle stateful computations, memory usage, and scalability better than Kafka-native tools like Kafka Streams or KSQL (because the latter are bound to Kafka topics).
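
For contrast, here is a hedged sketch of a stateful counterpart matching the fraud detection example above: counting payments per card within tumbling windows requires a state store, because the result depends on more than the current event. Topic names and the threshold are illustrative, and the configuration boilerplate from the stateless sketch applies here as well:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

public class StatefulProcessingTopology {
    static StreamsBuilder fraudDetectionTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("payments") // hypothetical topic, keyed by card ID
            .groupByKey()
            // State: a running count per card and 5-minute window, kept in a
            // local state store and backed up via a changelog topic.
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
            .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("payments-per-card"))
            .toStream()
            .filter((windowedCardId, count) -> count > 10) // hypothetical fraud threshold
            .map((windowedCardId, count) -> KeyValue.pair(windowedCardId.key(),
                    "suspicious: " + count + " payments within 5 minutes"))
            .to("fraud-alerts", Produced.with(Serdes.String(), Serdes.String()));
        return builder;
    }
}
```

Kafka Streams keeps such window state in an embedded store and replicates it through changelog topics in Kafka, which is exactly the Kafka-bound design the paragraph above contrasts with Flink's dedicated state backends.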

Low Code, No Code, AND A Lot of Code!

No-code and low-code tools are software platforms that enable users to develop applications quickly and with minimal coding knowledge. These tools provide graphical user interfaces with drag-and-drop capabilities, allowing users to assemble and configure applications visually rather than writing extensive lines of code.

Common features and benefits of visual coding:

  • Rapid Development: Both types of platforms significantly reduce development time, enabling faster delivery of applications.
  • User-Friendly Interface: The graphical interface and drag-and-drop functionality make it easy for users to design, build, and iterate on applications.
  • Cost Reduction: By enabling quicker development with fewer resources, these platforms can lower the cost of software creation and maintenance.
  • Accessibility: They make application development accessible to a broader range of people, reducing the dependency on skilled developers for every project.

So far, the theory.

Disadvantages of Visual Coding Tools

Actually, StreamBase, Apama, et al. had great visual coding offerings. However, no-code/low-code tools have many drawbacks and disadvantages, too:

  1. Limited Customization and Flexibility: While these platforms can speed up development for standard applications, they may lack the flexibility needed for highly customized solutions. Developers might find it challenging to implement specific functionalities that aren’t supported out of the box.
  2. Dependency on Vendors: Using no-code/low-code platforms often means relying on third-party vendors for the platform’s stability, updates, and security. This dependency can lead to potential issues if the vendor cannot maintain the platform or goes out of business. And often the platform team is the bottleneck for implementing new business or integration logic.
  3. Performance Concerns: Applications built with no-code/low-code platforms may not be as optimized as those developed with traditional coding, potentially leading to lower performance or inefficiencies, especially for complex applications.
  4. Scalability Issues: As businesses grow, applications might need to scale up to support increased loads. No-code/low-code platforms might not always support this level of scalability or might require significant workarounds, affecting performance and user experience.
  5. Over-reliance on Non-Technical Users: While empowering citizen developers is a key advantage of these platforms, it can also lead to governance challenges. Without proper oversight, non-technical users might create inefficient workflows or data structures, leading to technical debt and maintenance issues.
  6. Cost Over Time: Initially, no-code/low-code platforms can reduce development costs. However, as applications grow and evolve, the ongoing subscription costs or fees for additional features and scalability can become significant.

Flexibility is King: Stream Processing for Everyone!

Microservices, domain-driven design, data mesh… All these modern design approaches taught us to provide flexible enterprise architectures. Each business unit and persona should be able to choose its own technology, API, or SaaS, no matter whether the communication is real-time, near real-time, batch, or request-response.

Apache Kafka provides true decoupling out of the box. Therefore, low-code or no-code tools are an option. However, a data scientist, data engineer, software developer, or citizen integrator can also choose their own technology for stream processing.

The past, present and future of stream processing shows different frameworks, visual coding tools and even applied generative AI. One solution does NOT replace but complement the other alternatives:

Stream Processing with Clients Low Code No Code Tools or GenAI

The Future of Stream Processing: Serverless SaaS, GenAI and Streaming Databases

Stream processing is set to grow exponentially in the future, thanks to advancements in cloud computing, SaaS, and AI. Let’s explore the future of stream processing and look at the expected short, mid and long-term developments.

SHORT TERM: Fully Managed Serverless SaaS for Stream Processing

The cloud’s scalability and flexibility offer an ideal environment for stream processing applications, reducing the overhead and resources required for on-premise solutions. As SaaS models continue to evolve, stream processing capabilities will become more accessible to a broader range of businesses, democratizing real-time data analytics.

For instance, look at the serverless Flink Actions in Confluent Cloud. You can configure and deploy stream processing for use cases like deduplication or masking without any code:

Serverless Flink Actions in Confluent Cloud
Source: Confluent

MID TERM: Automated Tooling and the Help of GenAI

Integrating AI and machine learning with stream processing will enable more sophisticated predictive analytics. This opens new frontiers for automated decision-making and intelligent applications while continuously processing incoming event streams. The full potential of embedding AI into stream processing has to be learned and implemented in the upcoming years.

For instance, automated data profiling is one stream processing task that GenAI can support significantly. Software tools analyze and understand the quality, structure, and content of a dataset without manual intervention as the events flow through the data pipeline in real-time. This process typically involves examining the data to identify patterns, anomalies, missing values, and inconsistencies. A perfect fit for stream processing!

Automated data profiling in the stream processor can provide insights into data types, frequency distributions, relationships between columns, and other metadata information crucial for data quality assessment, governance, and preparation for further analysis or processing.
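
While the GenAI layer on top is still an outlook, the underlying mechanics can be sketched today. The following plain Java profiler (field handling is illustrative; a production version would run inside a Flink or Kafka Streams operator rather than standalone) incrementally tracks per-field null-value statistics as events flow by, exactly the kind of metadata a GenAI assistant could reason about:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Incremental data profiling: track per-field null counts while events
// stream through the pipeline. A GenAI layer could sit on top of these
// statistics to explain anomalies or suggest data quality rules.
public class StreamingProfiler {
    private final ObjectMapper mapper = new ObjectMapper();
    private final Map<String, LongAdder> nullCounts = new ConcurrentHashMap<>();
    private final LongAdder total = new LongAdder();

    public void profile(String jsonEvent) throws Exception {
        JsonNode event = mapper.readTree(jsonEvent);
        total.increment();
        event.fieldNames().forEachRemaining(field -> {
            if (event.get(field).isNull()) {
                nullCounts.computeIfAbsent(field, f -> new LongAdder()).increment();
            }
        });
    }

    public void report() {
        nullCounts.forEach((field, count) ->
            System.out.printf("%s: %.1f%% null values%n",
                field, 100.0 * count.sum() / total.sum()));
    }
}
```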

MID TERM: Streaming Storage and Analytics with Apache Iceberg

Apache Iceberg is an open-source table format for huge analytic datasets that provides powerful capabilities in managing large-scale data in data lakes. Its integration with streaming data sources like Apache Kafka and analytics platforms, such as Snowflake, Starburst, Dremio, AWS Athena or Databricks, can significantly enhance data management and analytics workflows.

Integration between Streaming Data from Kafka and Analytics on Databricks or Snowflake using Apache Iceberg

Supporting the Apache Iceberg table format might be a crucial strategic move by streaming and analytics frameworks, vendors and cloud services. Here are some key benefits from the enterprise architecture perspective, two of which are illustrated in the code sketch after the list:

  • Unified Batch and Stream Processing: Iceberg tables can serve as a bridge between streaming data ingestion from Kafka and downstream analytic processing. By treating streaming data as an extension of a batch-based table, Iceberg enables a seamless transition from real-time to batch analytics, allowing organizations to analyze data with minimal latency.
  • Schema Evolution: Iceberg supports schema evolution without breaking downstream systems. This is useful when dealing with streaming data from Kafka, where the schema might evolve. Consumers can continue reading data using the schema they understand, ensuring compatibility and reducing the need for data pipeline modifications.
  • Time Travel and Snapshot Isolation: Iceberg’s time travel feature allows analytics on data as it looked at any point in time, providing snapshot isolation for consistent reads. This is crucial for reproducible reporting and debugging, especially when dealing with continuously updating streaming data from Kafka.
  • Cross-Platform Compatibility: Iceberg provides a unified data layer accessible by different compute engines, including those used by Databricks and Snowflake. This enables organizations to maintain a single copy of their data that is queryable across different platforms, facilitating a multi-tool analytics ecosystem without data silos.
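
As a hedged illustration of the schema evolution and time travel points, here is a small sketch using Iceberg's Java API. Catalog setup is omitted, and the table name, the new column, and the timestamp are hypothetical:

```java
import java.time.Duration;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

public class IcebergSketch {
    static void evolveAndTimeTravel(Table table) { // table loaded from a catalog (setup omitted)
        // Schema evolution: add a column without rewriting data files or
        // breaking existing readers of the Kafka-fed table.
        table.updateSchema()
             .addColumn("risk_score", Types.DoubleType.get()) // hypothetical new field
             .commit();

        // Time travel: plan a scan of the table exactly as it looked 24 hours
        // ago, e.g., to reproduce a report over continuously updated data.
        long yesterday = System.currentTimeMillis() - Duration.ofHours(24).toMillis();
        table.newScan()
             .asOfTime(yesterday)
             .planFiles()
             .forEach(task -> System.out.println(task.file().path()));
    }
}
```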

LONG TERM: Transactional + Analytics = Streaming Database?

Streaming databases, like RisingWave or Materialize, are designed to handle real-time data processing and analytics. They offer a way to manage and query data that is continuously generated from sources like IoT devices, online transactions, and application logs. Traditional databases are optimized for static data stored on disk. Streaming databases, in contrast, are built to process and analyze data in motion, providing insights almost instantaneously as the data flows through the system.

Streaming databases offer the ability to perform complex queries and analytics on streaming data, further empowering organizations to harness real-time insights.
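
As a small, hedged example: Materialize speaks the PostgreSQL wire protocol, so a standard JDBC connection suffices to define a view that is maintained incrementally as new events stream in. The connection details, credentials, and the orders source below are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingDatabaseDemo {
    public static void main(String[] args) throws Exception {
        // Connect with the standard PostgreSQL driver (Materialize is wire-compatible).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:6875/materialize", "materialize", "");
             Statement stmt = conn.createStatement()) {

            // A materialized view that is updated incrementally as events arrive,
            // instead of being recomputed on every query.
            stmt.execute(
                "CREATE MATERIALIZED VIEW revenue_per_region AS " +
                "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region");

            // Querying the view returns up-to-date results with low latency.
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM revenue_per_region")) {
                while (rs.next()) {
                    System.out.println(rs.getString("region") + ": " + rs.getBigDecimal("revenue"));
                }
            }
        }
    }
}
```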

The ongoing innovation in streaming databases will probably lead to more advanced, efficient, and user-friendly solutions, facilitating broader adoption and more creative applications of stream processing technologies.

Having said this, we are still at a very early stage. It is not yet clear when you really need a streaming database instead of a mature and scalable stream processor like Apache Flink. The future will show us, and competition is great for innovation.

The Future of Stream Processing is Open Source and Cloud

The journey from batch to real-time processing has transformed how businesses interact with their data. The continued evolution couples technologies like Apache Kafka, Kafka Streams, and Apache Flink with the growth of cloud computing and SaaS. Stream processing will redefine the future of data analytics and decision-making.

As we look ahead, the future possibilities for stream processing are boundless, promising more agile, intelligent, and real-time insights into the ever-increasing streams of data.

If you want to learn more, listen to the following on-demand webinar about the past, present and future of stream processing, where I am joined by two streaming industry veterans, Richard Tibbets (founder of StreamBase) and Michael Benjamin (TimePlus). I had the pleasure of working with them for a few years at TIBCO, where we deployed StreamBase at many financial services companies for stock trading and similar use cases:

What does your stream processing journey look like? In which decade did you join? Or are you just getting started with the latest open-source frameworks or cloud services? Let's connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

When to choose Redpanda instead of Apache Kafka?
https://www.kai-waehner.de/blog/2022/11/16/when-to-choose-redpanda-instead-of-apache-kafka/
Wed, 16 Nov 2022
Data streaming emerged as a new software category. It complements traditional middleware, data warehouse, and data lakes. Apache Kafka became the de facto standard. New players enter the market because of Kafka’s success. One of those is Redpanda, a lightweight Kafka-compatible C++ implementation. This blog post explores the differences between Apache Kafka and Redpanda, when to choose which framework, and how the Kafka ecosystem, licensing, and community adoption impact a proper evaluation.

Apache Kafka vs Redpanda Comparison

Disclaimer: I work for Confluent. However, the post is not about comparing features but explaining the concepts behind the alternatives of using Apache Kafka (and related products, including Confluent) or Redpanda. I talk to enterprises across the globe every week. Below, I summarize common misunderstandings or missing knowledge about both technologies. I hope it helps you to make the right decision. Either choose to run open-source Apache Kafka, one of the various commercial Kafka offerings or cloud services, or Redpanda. All are great options with pros and cons…

Data streaming: A new software category

Data-driven applications are the new black. As part of this, data streaming is a new software category. If you don’t understand yet how it differs from other data management platforms like a data warehouse or data lake, check out the following blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

And if you wonder how Apache Kafka differs from other middleware, check out how Kafka fits into comparison with ETL, ESB, and iPaas.

Apache Kafka: The de facto standard for data streaming

Apache Kafka became the de facto standard for data streaming, similar to how Amazon S3 is the de facto standard for object storage. Kafka is used across industries for many use cases.

The adoption curve of Apache Kafka

The growth of the Apache Kafka community in recent years is impressive:

  • >100,000 organizations using Apache Kafka
  • >41,000 Kafka Meetup attendees
  • >32,000 Stack Overflow Questions
  • >12,000 Jiras for Apache Kafka
  • >31,000 Open Job Listings Request Kafka Skills

And look at the increased number of active monthly unique users downloading the Kafka Java client library with Maven:

Sonatype Maven Kafka Client Downloads
Source: Sonatype

The numbers grow exponentially. That’s no surprise to me as the adoption pattern and maturity curve for Kafka are similar in most companies:

  1. Start with one or few use cases (that prove the business value quickly)
  2. Deploy the first applications to production and operate them 24/7
  3. Tap into the data streams from many domains, business units, and technologies
  4. Move to a strategic central nervous system with a decentralized data hub

Kafka use cases by business value across industries

The main reason for the incredible growth of Kafka's adoption curve is the variety of potential use cases for data streaming. The potential is almost endless. Kafka's characteristics of combining low latency, scalability, reliability, and true decoupling establish benefits across all industries and use cases:

Use Cases for Data Streaming by Business Value

Search my blog for your favorite industry to find plenty of case studies and architectures. Or to get started, read about use cases for Apache Kafka across industries.

The emergence of many Kafka vendors

The market for data streaming is enormous. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products. Most vendors use Kafka or implement its protocol because Kafka has become the de facto standard for data streaming.

Learn more about the various data streaming vendors in the following blog posts:

To be clear: An increasing number of Kafka vendors is a great thing! It proves the creation of a new software category. Competition pushes innovation. The market share is big enough for many vendors. And I am 100% convinced that we are still in a very early stage of the data streaming hype cycle…

After a lengthy introduction to set the context, let’s now review a new entrant into the Kafka market: Redpanda…

Introducing Redpanda: Kafka-compatible data streaming

Redpanda is a data streaming platform. Its website explains its positioning in the market and product strategy as follows (to differentiate it from Apache Kafka):

  • No Java: A JVM-free and ZooKeeper-free infrastructure.
  • Designed in C++: Designed for a better performance than Apache Kafka.
  • A single-binary architecture: No dependencies on other libraries or nodes.
  • Self-managing and self-healing: A simple but scalable architecture for on-premise and cloud deployments.
  • Kafka-compatible: Out-of-the-box support for the Kafka protocol with existing applications, tools, and integrations.

This sounds great. You need to evaluate whether Redpanda is the right choice for your next project or if you should stick with “real Apache Kafka”.

How to choose the proper “Kafka” implementation for your project?

A recommendation that some people find surprising: Qualify out first! That's much easier, similar to what I explained about when NOT to use Apache Kafka.

As part of the evaluation, the first question is whether Kafka is the proper protocol for you at all. If it is, pick a few different offerings and begin the comparison.

Start your evaluation with the business case requirements and define your most critical needs like uptime SLAs, disaster recovery strategy, enterprise support, operations tooling, self-managed vs. fully-managed cloud service, capabilities like messaging vs. data ingestion vs. data integration vs. applications, and so on. Based on your use cases and requirements, you can start qualifying out vendors like Confluent, Redpanda, Cloudera, Red Hat / IBM, Amazon MSK, Amazon Kinesis, Google Pub Sub, and others to create a shortlist.

The following sections compare the open-source project Apache Kafka versus the re-implementation of the Kafka protocol of Redpanda. You can use these criteria (and information from other blogs, articles, videos, and so on) to evaluate your options.

Similarities between Redpanda and Apache Kafka

The high-level value propositions are the same in Redpanda and Apache Kafka:

  • Data streaming to process data in real-time at scale continuously
  • Decouple applications and domains with a distributed storage layer
  • Integrate with various data sources and data sinks
  • Leverage stream processing to correlate data and take action in real-time
  • Self-managed operations or consuming a fully-managed cloud offering

However, the devil is in the details and facts. Don’t trust marketing, but look deeper into the various products and cloud services.

Deployment options: Self-managed vs. cloud service

Data streaming is required everywhere. While most companies across industries have a cloud-first strategy, some workloads must stay at the edge for different reasons: Cost, latency, or security requirements. My blog about use cases for Apache Kafka at the edge is still one of the most read articles I have written in recent years.

Besides operating Redpanda by yourself, you can buy it as a product and deploy it in your environment: either as a data plane in your own infrastructure using Kubernetes (supported by the vendor's external control plane) or as a cloud service (fully managed by the vendor).

The different deployment options for Redpanda are great. Pick what you need. This is very similar to Confluent’s deployment options for Apache Kafka. Some other Kafka vendors only provide either self-managed (e.g., Cloudera) or fully managed (e.g., Amazon MSK Serverless) deployment options.

What I miss from Redpanda: No official documentation about SLAs of the cloud service and enterprise support. I hope they do better than Amazon MSK (excluding Kafka support from their cloud offerings). I am sure you will get that information if you reach out to the Redpanda team, who will probably soon incorporate some information into their website.

Bring your own Cluster (BYOC)

There is a third option besides self-managing a data streaming cluster and leveraging a fully managed cloud service: Bring your own Cluster (BYOC). This alternative allows end users to deploy a solution partially managed by the vendor in your own infrastructure (like your data center or your cloud VPC).

Here is Redpanda’s marketing slogan: “Redpanda clusters hosted on your cloud, fully managed by Redpanda, so that your data never leaves your environment!”

This sounds very appealing in theory. Unfortunately, it creates more questions and problems than it solves:

  • How does the vendor access your data center or VPC?
  • Who decides how and when to scale a cluster?
  • When to act on issues? How and when do you roll a cluster to incorporate bug fixes or version upgrades?
  • What about cost management? What is the total cost of ownership? How much value does the vendor solution bring?
  • How do you guarantee SLAs? Who has to guarantee them, you or the vendor?
  • For regulated industries, how are security controls and compliance supported? How can you be sure about what the vendor does in an environment you ostensibly control? How much harder will a bespoke third-party risk assessment be if you aren't using pure SaaS?

For these reasons, cloud vendors only host managed services in the cloud vendor’s environment. Look at Amazon MSK, Azure Event Hubs, Google Pub Sub, Confluent Cloud, etc. All fully managed cloud services are only in the VPC of the vendor for the above reasons.

There are only two options: Either you hand over the responsibility to a SaaS offering or control it yourself. Everything in the middle is still your responsibility in the end.

Community vs. commercial offerings

The sales approach of Redpanda looks almost identical to how Confluent sells data streaming. A free community edition is available, even for production usage. The enterprise edition adds enterprise features like tiered storage, automatic data balancing, or 24/7 enterprise support.

No surprise here. And a good strategy, as data streaming is required everywhere for different users and buyers.

Technical differences between Apache Kafka and Redpanda

There are plenty of technical and non-functional differences between Apache Kafka products and Redpanda. Keep in mind that Redpanda is NOT Kafka. Redpanda uses the Kafka protocol. This is a small but critical difference. Let’s explore these details in the following sections.

Apache Kafka vs. Kafka protocol compatibility

Redpanda is NOT an Apache Kafka distribution like Confluent Platform, Cloudera, or Red Hat. Instead, Redpanda re-implements the Kafka protocol to provide API compatibility. Being Kafka-compatible is not the same as using Apache Kafka under the hood, even if it sounds great in theory.

Two other examples of Kafka-compatible offerings:

  • Azure Event Hubs: A Kafka-compatible SaaS cloud service offering from Microsoft Azure. The service itself works and performs well. However, its Kafka compatibility has many limitations. Microsoft lists a lot of them on its website. Some limitations of the cloud service are the consequence of a different implementation under the hood, like limited retention time and message sizes.
  • Apache Pulsar: An open-source framework competing with Kafka. The feature set overlaps a lot. Unfortunately, Pulsar often only has good marketing for advanced features to compete with Kafka or to differentiate. And one example is its Kafka mapper to be compatible with the Kafka protocol. Contrary to Azure Event Hubs as a serious implementation (with some limitations), Pulsar’s compatibility wrapper provides a basic implementation that is compatible with only minor parts of the Kafka protocol. So, while alleged “Kafka compatibility” sounds nice on paper, one shouldn’t seriously consider this for migrating your running Kafka infrastructure to Pulsar.

We have seen compatible products for open-source frameworks in the past. Re-implementations are usually far from complete and perfect. For instance, MongoDB compared its official open source protocol to its competitor Amazon DocumentDB to pinpoint the fact that DocumentDB only passes ~33% of the MongoDB integration test suite.

In summary, it is totally fine to use these non-Kafka solutions like Azure Event Hubs, Apache Pulsar, or Redpanda for a new project if they fulfill your requirements better than Apache Kafka. But keep in mind that it is not Kafka. There is no guarantee that additional components from the Kafka ecosystem (like Kafka Connect, Kafka Streams, REST Proxy, and Schema Registry) behave the same when integrated with a non-Kafka solution that only uses the Kafka protocol with its own implementation.

How good is Redpanda’s Kafka protocol compatibility?

Frankly, I don't know. Probably and hopefully, Redpanda has better Kafka compatibility than Pulsar. The whole product is based on this value proposition. Hence, we can assume that the Redpanda team spends plenty of time on compatibility. However, Redpanda has NOT achieved 100% API compatibility yet.

Time will tell when we see more case studies from enterprises across industries that migrated some Apache Kafka projects to Redpanda and successfully operated the infrastructure for a few years. Why wait a few years to see? Well, I compare it to what I see from people starting with Amazon MSK. It is pretty easy to get started. However, after a few months, the first issues happen. Users find out that Amazon MSK is not a fully-managed product and does not provide serious Kafka SLAs. Hence, I see too many teams starting with Amazon MSK and then migrating to Confluent Cloud after some months.

But let’s be clear: If you run an application against Apache Kafka and migrate to a re-implementation supporting the Kafka protocol, you should NOT expect 100% the same behavior as with Kafka!

Some underlying behavior will differ even if the API is 100% compatible. This is sometimes a benefit. For instance, Redpanda focuses on performance optimization with C++. This is only possible in some workloads because of the re-implementation. C++ is superior compared to Java and the JVM for some performance and memory scenarios.

Redpanda = Apache Kafka – Kafka Connect – Kafka Streams

Apache Kafka includes Kafka Connect for data integration and Kafka Streams for stream processing.

Like most Kafka-compatible projects, Redpanda excludes these critical pieces from its offering. Hence, even 100 percent protocol compatibility would not mean a product re-implements everything in the Apache Kafka project.

Lower latency vs. benchmarketing

Always think about your performance requirements before starting a project. If necessary, do a proof of concept (POC) with Apache Kafka, Apache Pulsar, and Redpanda. I bet that in 99% of scenarios, all three of them will show a good enough performance for your use case.

Don’t trust opinionated benchmarks from others! Your use case will have different requirements and characteristics. And performance is typically just one of many evaluation dimensions.

I am not a fan of most “benchmarks” of performance and throughput. Benchmarks are almost always opinionated and configured for a specific problem (whether a vendor, independent consultant or researcher conducts them).

My colleague Jack Vanlightly explained this concept of benchmarketing with excellent diagrams:

Benchmarks for Benchmarketing
Source: Jack Vanlightly

Here is one concrete example you will find in one of Redpanda's benchmarks: Kafka was not built for very high throughput per producer, and this is what Redpanda exploits when claiming that Kafka's throughput is inferior. Ask yourself this question: In which real-world 1GB/s use case would anyone create that throughput with just 4 producers? Benchmarketing at its finest.

Hence, once again, start with your business requirements. Then choose the right tool for the job. Benchmarks are always built for winning against others. Nobody will publish a benchmark where the competition wins.

Soft real-time vs. hard real-time

When we speak about real-time in the IT world, we mean end-to-end data processing pipelines that need at least a few milliseconds. This is called soft real-time, and this is where Apache Kafka, Apache Pulsar, Redpanda, Azure Event Hubs, Apache Flink, Amazon Kinesis, and similar platforms fit in. None of these can do hard real-time.

Hard real-time requires a deterministic network with zero latency and no spikes. Typical scenarios include embedded systems, field buses, and PLCs in manufacturing, cars, robots, securities trading, etc. Time-Sensitive Networking (TSN) is the right keyword if you want more research.

I wrote a dedicated blog post about why data streaming is NOT hard real-time. Hence, don't try to use Kafka or Redpanda for these use cases. That's OT (operational technology), not IT (information technology). OT means plain C or Rust on embedded hardware.

No ZooKeeper with Redpanda vs. no ZooKeeper with Kafka

Besides being implemented in C++ instead of using the JVM, the second big differentiator of Redpanda is no need for ZooKeeper, and thus no need to operate two complex distributed systems… Well, with Apache Kafka 3.3, this differentiator is gone. Kafka is now production-ready without ZooKeeper! KIP-500 was a multi-year journey and an operation at Kafka's heart.

ZooKeeper Removal KIP 500 in Apache Kafka

To be fair, it will still take some time until the new ZooKeeper-less architecture goes into production everywhere. Also, today, it is only supported for new Kafka clusters. However, migration scenarios with zero downtime and without data loss will be supported in 2023, too. But that's how a serious release cycle works for a mature software product: step-by-step implementation and battle-testing instead of starting with marketing and selling alpha and beta features.

ZooKeeper-less data streaming with Kafka is not just a massive benefit for the scalability and reliability of Kafka but also makes operations much more straightforward, similar to ZooKeeper-less Redpanda.

By the way, this was one of the major arguments why I did not see the value of Apache Pulsar. The latter requires not just two but three distributed systems: Pulsar broker, ZooKeeper, and BookKeeper. That’s nonsense and unnecessary complexity for virtually all projects and use cases.

Lightweight Redpanda + heavyweight ecosystem = middleweight data streaming?

Redpanda is very lightweight and efficient because of its C++ implementation. This can help in limited compute environments like edge hardware. As an additional consequence, Redpanda has fewer latency spikes than Apache Kafka. Those are significant arguments for Redpanda in some use cases!

However, you need to look at the complete end-to-end data pipeline. If you only use Redpanda as a message queue, you get these benefits compared to the JVM-based Kafka engine. But then you might as well pick a message queue like RabbitMQ or NATS instead. I won't start that discussion here, as I focus on the much more powerful and advanced data streaming use cases.

Even in edge use cases where you deploy a single Kafka broker, the hardware, like an industrial computer (IPC), usually provides at least 4GB or 8GB of memory. That is sufficient for deploying the whole data streaming platform around Kafka and other technologies.

Data streaming is more than messaging or data ingestion

My fundamental question is: what is the benefit of a C++ implementation of the data hub if all the surrounding systems are built with JVM technology, or even slower technologies like Python?

Kafka-compatible tools like Redpanda integrate well with the Kafka ecosystem, as they use the same protocol. Hence, tools like Kafka Connect, Kafka Streams, KSQL, Apache Flink, Faust, and all other components from the Kafka ecosystem work with Redpanda. You will find such an example for almost every existing Kafka tool on the Redpanda blog.
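
This compatibility is easy to see from the client's perspective: a standard Kafka producer only knows a bootstrap address, so identical code runs against an Apache Kafka broker or a Redpanda broker, within the compatibility limits discussed above. A minimal sketch with placeholder address and topic:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProtocolCompatibleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Point this at a Kafka broker OR a Redpanda broker: the standard Kafka
        // client library works unchanged against any endpoint that implements
        // the Kafka protocol.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("sensor-data", "machine-42", "{\"temp\": 71.3}"));
        }
    }
}
```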

However, these combinations kill almost all the benefits of having a C++ layer in the middle. All integration and processing components would also need to be as efficient as Redpanda and use C++ (or Go or Rust) under the hood. These tools do not exist today (likely because not many people need them). And here is an additional drawback: the debugging, testing, and monitoring infrastructure must combine C++, Python, and JVM platforms if you combine tools like Java-based Kafka Connect and Python-based Faust with C++-based Redpanda. So, I don't get the value proposition here.

Data replication across clusters

Having more than one Kafka cluster is the norm, not an exception. Use cases like disaster recovery, aggregation, data sovereignty in different countries, or migration from on-premise to the cloud require multiple data streaming clusters.

Replication across clusters is part of open-source Apache Kafka. MirrorMaker 2 (based on Kafka Connect) supports these use cases. More advanced (proprietary) tools from vendors like Confluent Replicator or Cluster Linking make these use cases more effortless and reliable.

Data streaming with the Kafka ecosystem is perfect as the foundation of a decentralized data mesh:

Cluster Linking for data replication with the Kafka protocol

How do you build these use cases with Redpanda?

It is the same story as for data integration and stream processing: how much does it help to have a very lightweight and performant core if all other components rely on “3rd party” code bases and infrastructure? In the case of data replication, Redpanda uses Kafka's MirrorMaker.

And make sure to compare MirrorMaker to Confluent Cluster Linking – the latter uses the Kafka protocol for replication and does not need additional infrastructure, operations, offset sync, etc.

Non-functional differences between Apache Kafka and Redpanda

Technical evaluations are dominant when talking about Redpanda vs. Apache Kafka. However, the non-functional differences are as crucial before making the strategic decision to choose the data streaming platform for your next project.

Licensing, adoption curve and the total cost of ownership (TCO) are critical for the success of establishing a data streaming platform.

Open source (Kafka) vs. source available (Redpanda)

As the name says, Apache Kafka is under the very permissive Apache License 2.0. Everyone, including cloud providers, can use the framework for building internal applications, commercial products, and cloud services. Committers and contributions are spread across various companies and individuals.

Redpanda is released under the more restrictive Business Source License (BSL), a source-available license. The intention is to deter cloud providers from offering Redpanda's work as a service. For most companies, this is fine, but it limits broader adoption across different communities and vendors. The likelihood of external contributors, committers, or even other vendors picking up the technology is much smaller than in Apache projects like Kafka.

This has a significant impact on the (future) adoption curve.

Maturity, community and ecosystem

The introduction of this article showed the impressive adoption of Kafka. Just keep in mind: Redpanda is NOT Apache Kafka! It just supports the Kafka protocol.

Redpanda is a brand-new product and implementation. Operations are different. The behavior of the engine is different. Experts are not available. Job offerings do not exist. And so on.

Kafka is significantly better documented, has a tremendously larger community of experts, and has a vast array of supporting tooling that makes operations more straightforward.

There are many local and online Kafka training options, including online courses, books, meetups, and conferences. You won’t find much for Redpanda beyond the content of the vendor behind it.

And don’t trust marketing! That’s true for every vendor, of course. If you read a great feature list on the Redpanda website, double-check if the feature truly exists and in what shape it is. Example: RBAC (role-based access control) is available for Redpanda. The devil lies in the details. Quote from the Redpanda RBAC documentation: “This page describes RBAC in Redpanda Console and therefore manages access only for Console users but not clients that interact via the Kafka API. To restrict Kafka API access, you need to use Kafka ACLs.” There are plenty of similar examples today. Just try to use the Redpanda cloud service. You will find many things that are more alpha than beta today. Make sure not to fall into the same myths around the marketing of product features as some users did with Apache Pulsar a few years ago.

The total cost of ownership and business value

When you define your project’s business requirements and SLAs, ask yourself how much downtime or data loss is acceptable. The RTO (recovery time objective) and RPO (recovery point objective) impact a data streaming platform’s architecture and overall process to ensure business continuity, even in the case of a disaster.

The TCO is not just about the cost of a product or cloud service. Full-time engineers need to operate and integrate the data streaming platform. Expensive project leads, architects, and developers build applications.

Project risk includes the maturity of the product and the expertise you can bring in for consulting and 24/7 support.

Similar to benchmarketing regarding latency, vendors use the same strategy for TCO calculations! Here is one concrete example you always hear from Redpanda: “C++ does enable more efficient use of CPU resources.”

This statement is correct. However, the problem with that statement is that Kafka is rarely CPU-bound and much more IO-bound. Redpanda has the same network and disk requirements as Kafka, which means Redpanda has limited differences from Kafka in terms of TCO regarding infrastructure.

When to choose Redpanda instead of Apache Kafka?

You need to evaluate whether Redpanda is the right choice for your next project or if you should stick with the “real Apache Kafka” and related products or cloud offerings. Read articles and blogs, watch videos, search for case studies in your industry, talk to different competitive vendors, and build your proof of concept or pilot project. Qualifying out products is much easier than evaluating plenty of offerings.

When to seriously consider Redpanda?

  • You need C++ infrastructure because your ops team cannot handle and analyze JVM logs – but be aware that this is only the messaging core, not the data integration, data processing, or other capabilities of the Kafka ecosystem
  • The slight performance differences matter to you – and you still don’t need hard real-time
  • Simple, lightweight development on your laptop and in automated test environments – but you should then also run Redpanda in production (using different implementations of an API for TEST and PROD is a risky anti-pattern)

You should evaluate Redpanda against Apache Kafka distributions and cloud services in these cases.

This post explored the trade-offs Redpanda has from a technical and non-functional perspective. If you need an enterprise-grade solution or fully-managed cloud service, a broad ecosystem (connectors, data processing capabilities, etc.), and if 10ms latency is good enough and a few p99 spikes are okay, then I don’t see many reasons why you would take the risk of adopting Redpanda instead of an actual Apache Kafka product or cloud service.

The future will tell us if Redpanda becomes a serious competitor…

I didn't even cover the fact that a startup always has challenges finding great case studies, especially with big enterprises like Fortune 500 companies. The first great logos are always the hardest to win. Sometimes, startups never get there. In other cases, a truly competitive technology and product are created. Such a journey takes years. Let's revisit this blog post in one, two, and five years to see the evolution of Redpanda (and Apache Kafka).

What are your thoughts? When do you consider using Redpanda instead of Apache Kafka? Are you using Redpanda already? Why and for what use cases? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk)
https://www.kai-waehner.de/blog/2022/04/06/disaster-recovery-kafka-across-edge-hybrid-cloud-qcon-talk/
Wed, 06 Apr 2022
I spoke at QCon London in April 2022 about building disaster recovery and resilient real-time enterprise architectures with Apache Kafka. This blog post summarizes the use cases, architectures, and real-world examples. The slide deck and video recording of the presentation are included as well.

What is QCon?

QCon is a leading software development conference that has been held across the globe for 16 years. It provides a realistic look at what is trending in tech. The QCon events are organized by InfoQ, a well-known website for professional software development with over two million unique visitors per month.

QCon in 2022 uncovers emerging software trends and practices. Developers and architects learn how to solve complex engineering challenges without the product pitches.

There is no Call for Papers (CfP) for QCon. The organizers invite trusted speakers to talk about trends, best practices, and real-world stories. This makes QCon so strong and respected in the software development community.

QCon London 2022

Disaster Recovery and Resiliency with Apache Kafka

Apache Kafka is the de facto data streaming platform for analytical AND transactional workloads. Multiple options exist to design Kafka for resilient applications. For instance, MirrorMaker 2 and Confluent Replicator enable uni- or bi-directional real-time replication between independent Kafka clusters in different data centers or clouds.

Cluster Linking is a more advanced and straightforward option from Confluent leveraging the native Kafka protocol instead of additional infrastructure and complexity using Kafka Connect (like MirrorMaker 2 and Replicator).

Stretching a single Kafka cluster across multiple regions is the best option to guarantee no downtime and seamless failover in the case of a disaster. However, it is hard to operate and only recommended (i.e., consistent, stable, and mission-critical) across distances with enhanced add-ons to open-source Kafka:

Disaster Recovery and Resiliency across Multi Region with Apache Kafka

QCon Presentation: Disaster Recovery with Apache Kafka

In my QCon talk, I intentionally showed the broad spectrum of real-world success stories across industries for data streaming with Apache Kafka from companies such as BMW, JPMorgan Chase, Robinhood, Royal Caribbean, and Devon Energy.

Best practices explored how to build resilient enterprise architectures for disaster recovery with RPO (Recovery Point Objective) and RTO (Recovery Time Objective) in mind. The audience learned how to get SLAs and requirements for downtime and data loss right.

The examples looked at serverless cloud offerings integrating with the IoT edge, hybrid retail architectures, and the disconnected edge in military scenarios.

The agenda looks like this:

  1. Resilient enterprise architectures
  2. Real-time data streaming with the Apache Kafka ecosystem
  3. Cloud-first and serverless Industrial IoT in automotive
  4. Multi-region infrastructure for core banking
  5. Hybrid cloud for customer experiences in retail
  6. Disconnected edge for safety and security in the public sector

Slide Deck from QCon Talk:

Here is the slide deck of my presentation from QCon London 2022:

We also had a great panel that discussed lessons learned from building resilient applications on the code and infrastructure level, plus the organizational challenges and best practices:

QCon Panel about Resilient Architectures

Video Recording from QCon Talk:

With the risk of Covid in mind, InfoQ decided not to record QCon sessions live.

Kai Waehner speaking at QCon London April 2022 about Resiliency with Apache Kafka

Instead, speakers had to submit a pre-recorded video. The recording is already available for QCon attendees (no matter if on-site in London or at the QCon Plus virtual event):

Disaster Recovery at the Edge and in Hybrid Data Streaming Architectures with Apache Kafka (QCon Talk)

QCon makes conference talks available for free some time after the event. I will update this post with the free link as soon as it is available.

Disaster Recovery with Apache Kafka across all Industries

I hope you enjoyed the slides and video on this exciting topic. Hybrid and global Kafka infrastructures for disaster recovery and other use cases are the norm, not exceptions.

Real-time data beats slow data. That is true in almost every use case. Hence, data streaming with the de facto standard Apache Kafka gets adopted more and more across all industries.

How do you leverage data streaming with Apache Kafka for building resilient applications and enterprise architectures? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Streaming ETL with Apache Kafka in the Healthcare Industry https://www.kai-waehner.de/blog/2022/04/01/streaming-etl-with-apache-kafka-healthcare-pharma-industry/ Fri, 01 Apr 2022 05:47:00 +0000

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part three: Streaming ETL. Examples include Babylon Health and Bayer.

Streaming ETL with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain, and most companies deploy data streaming in several business domains. Use cases often overlap. I categorized a few real-world deployments into different technical scenarios and added concrete examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Streaming ETL with Apache Kafka

Streaming ETL is similar to concepts you might know from traditional ETL tools. I have already explored how data streaming with Kafka differs from data integration tools and iPaaS cloud services. The critical difference is that you leverage a single platform for data integration and processing at scale in real-time. There is no need to combine several platforms to achieve this. The result is a Kappa architecture that enables real-time but also batch workloads with a single integration architecture.

Streaming ETL with Apache Kafka Streams Connect ksqlDB

Streaming ETL with Kafka combines different components and features:

  • Kafka Connect as Kafka-native integration framework
  • Kafka Connect source and sink connectors to consume and produce data from/to any other database, application, or API
  • Single Message Transform (SMT) – an optional Kafka Connect feature – to process (filter, change, remove, etc.) incoming or outgoing messages within the connector deployment
  • Kafka Streams or ksqlDB for continuous data processing in real-time at scale for stateless or stateful ETL jobs
  • Data governance via schema management, enforcement and versioning using the Schema Registry
  • Security and access control using features like role-based access control, audit logs, and end-to-end encryption

In the cloud, you can leverage a serverless Kafka offering for the whole Streaming ETL pipeline. Confluent Cloud fully manages Kafka’s end-to-end infrastructure, including connectors, ksqlDB workloads, data governance, and security.
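To make the stream processing part of such a pipeline more tangible, here is a minimal Kafka Streams sketch of a stateless streaming ETL step (filter plus transformation). The topic names and the plain string payloads are hypothetical, chosen only for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamingEtlSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "healthcare-streaming-etl");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical source topic filled by Kafka Connect source connectors
        KStream<String, String> rawEvents = builder.stream("raw-patient-events");

        rawEvents
                .filter((key, value) -> value != null && !value.isBlank()) // drop empty records
                .mapValues(value -> value.trim().toLowerCase())            // normalize the payload
                .to("curated-patient-events");                             // hypothetical target topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

In a real pipeline, the payload would typically be Avro with schemas enforced via the Schema Registry rather than plain strings.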

One last general note: Don’t Design for Data at Rest to Reverse it! Learn more here: “When to Use Reverse ETL and when it is an Anti-Pattern“. Instead, use real-time Streaming ETL for Data in Motion and the Kappa architecture from scratch.

Let’s look at a few real-world deployments in the healthcare sector.

Babylon Health – PII and GDPR-Compliant Security

Babylon Health is a digital-first health service provider and value-based care company that combines an artificial intelligence-powered platform with virtual clinical operations for patients. Patients are connected with health care professionals through its web and mobile application.

Babylon’s mission is to put an accessible and affordable health service in the hands of every person on earth. For that mission, Babylon built an agile microservice architecture with the Kafka ecosystem:

Kafka for Streaming ETL at Babylon Health

Here are the “wonders of working” in Healthcare for Babylon (= reasons to choose Kafka):

  • Real-time data processing
  • Replayability of historical information
  • Order matters and is ensured with guaranteed ordering
  • GDPR and data ownership for PII-compliant security
  • Data governance via the schema registry to provide true decoupling and access via many programming languages like Java, Python, and Ruby

Bayer – Data Integration and Processing in R&D

Bayer AG is a German multinational pharmaceutical and life sciences company and one of the largest pharmaceutical companies in the world. They leverage Kafka in various use cases and business domains.

The following scenario is from the research and development department of the pharma business unit. Their focus areas are cardiovascular diseases, oncology, and women’s health. The division employs over 7,500 R&D people and spends over 2.75 billion euros on R&D.

The use case Bayer presented at a recent Kafka Summit is about analyzing clinical trials, patents, reports, news, and literature leveraging the Kafka ecosystem. The R&D team processes 250 million documents from 30+ individual data sources. The data includes 7 TB of raw text-rich data with daily updates, additions, and deletions. Algorithms and data evolve, so Bayer needs to completely reprocess the data regularly. Various document streams with different formats and schemas flow through several text processing and enrichment steps.

Research and Development from Molecules to Medicine at Bayer

Scalable, reliable Kafka pipelines with Kafka Streams (Java) and Faust (Python) replaced custom, error-prone, non-scalable scripts. Schemas are used as the data interface to ensure data governance. Avro is the first-class citizen data format to enable compression and better throughput.

Streaming ETL Pipeline with Apache Kafka at Bayer

The true decoupling of Kafka in conjunction with the Schema Registry guarantees interoperability among different components and technologies (Java, Python, commercial tools, open source, scientific, proprietary).

Streaming ETL with Kafka for Real-Time Data Integration at any Scale

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka Ecosystem for Streaming ETL. Real-world deployments from Babylon Health and Bayer showed how enterprises successfully deploy Kafka for different enterprise architecture use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Kafka for Cybersecurity (Part 2 of 6) – Cyber Situational Awareness https://www.kai-waehner.de/blog/2021/07/08/kafka-cybersecurity-part-2-of-6-cyber-situational-awareness-real-time-scalability/ Thu, 08 Jul 2021 09:01:29 +0000

Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part two: Cyber Situational Awareness.

Cyber Situational Awareness with Apache Kafka

Blog series: Apache Kafka for Cybersecurity

This blog series explores why security features such as RBAC, encryption, and audit logs are only the foundation of a secure event streaming infrastructure. Learn about use cases, architectures, and reference deployments for Kafka in the cybersecurity space:

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’s posts as soon as published.

The Four Stages of an Adaptive Security Architecture

Gartner defines four stages of adaptive security architecture to prevent, detect, respond and predict cybersecurity incidents:

The Four Stages of an Adaptive Security Architecture by Gartner

Continuous monitoring and analytics are the keys to building a successful proactive cybersecurity solution. It should be obvious: Continuous monitoring and analytics require a scalable real-time infrastructure. Batch processes over data at rest, i.e., data stored in a database, data warehouse, or data lake, cannot deliver continuous real-time monitoring.

Situational Awareness

“Situation awareness is the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future.” Source: Endsley, M. R. SAGAT: A methodology for the measurement of situation awareness (NOR DOC 87-83). Hawthorne, CA: Northrop Corp.

Here is a theoretical view on situational awareness and the relation between humans and computers:

Human – Computer Interface for Decision Making
Endsley, M. R. Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors, 1995

Cyber Situational Awareness = Continuous Real-Time Analytics

Cyber Situational Awareness is the subset of all situation awareness necessary to support taking action in cyber. It is the mandatory key concept to defend against cybersecurity attacks.

Automation and analytics in real-time are key characteristics:

Situational Awareness and Automated Analytics
Endsley, M. R. Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors, 1995

No matter how good the threat detection algorithms and security platforms are, prevention or at least detection of attacks needs to happen in real-time. And predictions with cutting-edge machine learning models do not help if they are executed in a batch process overnight.

Situational awareness covers various levels beyond the raw network events. It includes all environments, including application data, logs, people, and processes.

I covered the challenges in the first post of this blog series. In summary, cybersecurity experts’ key challenge is finding the needle(s) in the haystack. The haystack is typically huge, i.e., massive volumes of data. Often, it is not just one haystack but many. Hence, a key task is to reduce false positives.

Situational awareness is not just about viewing a dashboard but about understanding what’s going on in real-time. Situational awareness finds the relevant data to create critical (rare) alerts automatically. No human can handle the huge volumes of data.

Situational Awareness in Motion with Kafka

The Kafka ecosystem provides the components to correlate massive volumes of data in real-time, provide situational awareness across the enterprise, and find all needles in the haystack:

The Confluent Curation Fabric for Cybersecurity powered by Apache Kafka and KSQL

Event streaming powered by the Kafka ecosystem delivers contextually rich data to reduce false positives:

  • Collect all events from data sources with Kafka Connect
  • Filter event streams with Kafka Connect’s Single Message Transforms (SMT) so that only relevant data gets into the Kafka topic
  • Empower real-time streaming applications with Kafka Streams or ksqlDB to correlate events across various source interfaces
  • Forward priority events to other systems such as the SIEM/SOAR with Kafka Connect or any other Kafka client (Java, C, C++, .NET, Go, JavaScript, HTTP via REST Proxy, etc.)

Example: Situational Awareness with Kafka and SIEM/SOAR

SIEM/SOAR modernization is its own blog post of this series. But the following picture depicts how Kafka enables true decoupling between applications in a domain-driven design (DDD):

Deliver Contextually Rich Data To Reduce False Positives

A traditional data store like a data lake is NOT the right spot for implementing situational awareness as it is data at rest. Data at rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) require this approach. Real-time beats slow data. Hence, event streaming with the de facto standard Apache Kafka is the right fit for situational awareness. 

Event streaming and data lake technologies are complementary, not competitive. The blog post “Serverless Kafka in a Cloud-native Data Lake Architecture” explores this discussion in much more detail by looking at AWS’ lake house strategy and its relation to event streaming.

The Data

Situational awareness requires data. A lot of data. Across various interfaces and communication paradigms. A few examples:

  • Text files (TXT)
  • Firewalls and network devices
  • Binary files
  • Antivirus
  • Databases
  • Access
  • APIs
  • Audit logs
  • Network flows
  • Intrusion detection
  • Syslog
  • And many more…

Let’s look at the three steps of implementing situational awareness: Data producers, data normalization and enrichment, and data consumers.

Data Producers

Data comes from various sources. This includes real-time systems, batch systems, files, and much more. The data includes high-volume logs (including Netflow and indirectly PCAP) and low-volume transactional events:

Data Producers

Data Normalization and Enrichment

The key success factor to implementing situational awareness is data correlation in real-time at scale. This includes data normalization and processing such as filter, aggregate, transform, etc.:

Data Normalization and Enrichment for Situational Awareness with Kafka

With Kafka, end-to-end data integration and continuous stream processing are possible with a single scalable and reliable platform. This is very different from the traditional MQ/ETL/ESB approach. Data governance concepts for enforcing data structures and ensuring data quality are crucial on the client-side and server-side. For this reason, the Schema Registry is a mandatory component in most Kafka architectures.
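A typical enrichment step is a stream-table join: a high-volume event stream is joined against slowly changing context data. The following Kafka Streams sketch assumes hypothetical topics where both the event stream and the context table are keyed by the same identifier (for instance, a host name or IP address):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class EnrichmentSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "network-event-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Slowly changing context data, e.g., an asset inventory keyed by host name
        KTable<String, String> assetInfo = builder.table("asset-inventory");

        // High-volume normalized events, keyed by the same host name
        KStream<String, String> events = builder.stream("normalized-network-events");

        // Enrich each event with the matching asset context and forward it downstream
        events
                .join(assetInfo, (event, asset) -> event + " | asset=" + asset)
                .to("enriched-network-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```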

Data Consumers

A single system cannot implement cyber situational awareness. Different questions, challenges, and problems require different tools. Hence, most Kafka deployments run various Kafka consumers using different communication paradigms and different speeds:

Data Consumers

Some workloads require data correlation in real-time to detect anomalies or even prevent threats as soon as possible. Kafka-native applications come into play. The client technology is flexible depending on the infrastructure, use case, and developer experience. Java, C, C++, Go are some coding options. Kafka Streams or ksqlDB provide out-of-the-box stream processing capabilities. The latter is the recommended option as it provides many features built-in such as sliding windows to build stateful aggregations.

A SIEM, SOAR, or data lake is complementary to run other analytics use cases for threat detection, intrusion detection, or reporting. The SIEM/SOAR modernization blog post of this series explores this combination in more detail.

Situational Awareness with Kafka and Sigma

Let’s take a look at a concrete example. A few of my colleagues built a great implementation for cyber situational awareness: Confluent Sigma. So, what is it?

Sigma – An Open Signature Format for Cyber Detection

Sigma is a generic and open signature format that allows you to describe relevant log events straightforwardly. The rule format is very flexible, easy to write, and applicable to any log file. The main purpose of this project is to provide a structured form in which cybersecurity engineers can describe their developed detection methods and make them shareable with others – either within the company or with the broader community.

A few characteristics that describe Sigma:

  • Open-source framework
  • A domain-specific language (DSL)
  • Specify patterns in cyber data
  • Sigma is for log files what Snort is for network traffic, and YARA is for files

Sigma provides integration with various tools such as ArcSight, QRadar, Splunk, Elasticsearch, Sumo Logic, and many more. However, as you learned in this post, many scenarios for cyber situational awareness require real-time data correlation at scale. That’s where Kafka comes into play. Having said this, a huge benefit is that you can specify a Sigma signature once and then use it with all the mentioned tools.

Confluent Sigma

Confluent Sigma is an open-source project implemented by a few of my colleagues. Kudos to Michael Peacock, William LaForest, and a few more. The project integrates Sigma into Kafka by embedding the Sigma rules into stream processing applications powered by Kafka Streams and ksqlDB:

Confluent Sigma for Situational Awareness powered by Apache Kafka

Situational Awareness with Zeek, Kafka Streams, KSQL, and Sigma

Here is a concrete event streaming architecture for situational awareness:

Confluent Sigma for Kafka powered Cybersecurity and Situational Awareness

A few notes on the above picture:

  • Sigma defines the signature rules
  • Zeek provides incoming IDS log events at high volume in real-time
  • Confluent Platform processes and correlates the data in real-time
  • The stream processing part built with Kafka Streams and ksqlDB includes stateless functions such as filtering and stateful functions such as aggregations
  • The calculated detections get ingested into a Zeek dashboard and other Kafka consumers

Here is an example of a Sigma Rule for windowing and aggregation of logs:

Sigma Rule with Aggregation

The Kafka Streams topology of the example looks like this:

Sigma Stream Topology with Kafka Streams
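The actual Confluent Sigma engine interprets Sigma rules dynamically. Purely as an illustration of the underlying pattern, a hardcoded windowed detection in Kafka Streams could look like the sketch below; the topic names and the threshold are made up:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class WindowedDetectionSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sigma-style-detection");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topic with IDS log events keyed by source host
        KStream<String, String> logs = builder.stream("ids-log-events");

        logs.groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))   // tumbling one-minute window
                .count()                                             // events per host per window
                .toStream()
                .filter((windowedKey, count) -> count != null && count > 100) // made-up threshold
                .map((windowedKey, count) -> KeyValue.pair(
                        windowedKey.key(),
                        "detection: " + count + " events from " + windowedKey.key() + " in one minute"))
                .to("sigma-detections");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```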

My colleagues will do a webinar to demonstrate Confluent Sigma in more detail, including a live demo. I will update and share the on-demand link here as soon as it is available. Some demo code is available on GitHub.

Cisco ThousandEyes Endpoint Agents

Let’s take a look at a concrete Kafka-native example to implement situational awareness at scale in real-time. Cisco ThousandEyes Endpoint Agents is a monitoring tool to gain visibility from any employee to any application over any network. It provides proactive and real-time monitoring of application experience and network connectivity.

The platform leverages the whole Kafka ecosystem for data integration and stream processing:

  • Kafka Streams for stateful network tests
  • Interactive queries for fetching results
  • Kafka Streams for windowed aggregations for alerting use cases
  • Kafka Connect for integration with backend systems such as MySQL, Elastic, MongoDB

ThousandEyes’ tech blog is a great resource to understand the implementation in more detail.

Kafka is a Key Piece to Implement Cyber Situational Awareness

Cyber situational awareness is mandatory to defend against cybersecurity attacks. A successful implementation requires continuous real-time analytics at scale. This is why the Kafka ecosystem is a perfect fit.

The Confluent Sigma implementation shows how to build a flexible but scalable and reliable architecture for realizing situational awareness. Event streaming is a key piece of the puzzle.

However, it is not a replacement for other tools such as Zeek for network analysis or Splunk as SIEM. Instead, event streaming complements these technologies and provides a central nervous system that connects and truly decouples these other systems. Additionally, the Kafka ecosystem provides the right tools for real-time stream processing.

How did you implement cyber situational awareness? Is it already real-time and scalable? What technologies and architectures do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka in the Insurance Industry https://www.kai-waehner.de/blog/2021/06/07/apache-kafka-insurance-industry-use-cases-architectures-event-streaming/ Mon, 07 Jun 2021 12:54:20 +0000

The rise of data in motion in the insurance industry is visible across all lines of business, including life, healthcare, travel, vehicle, and others. Apache Kafka changes how enterprises rethink data. This blog post explores use cases and architectures for event streaming. Real-world examples from Generali, Centene, Humana, and Tesla show innovative insurance-related data integration and stream processing in real-time.

Apache Kafka in the Insurance Industry

Digital Transformation in the Insurance Industry

Most insurance companies have similar challenges:

  • Challenging market environments
  • Stagnating economy
  • Regulatory pressure
  • Changes in customer expectations
  • Proprietary and monolithic legacy applications
  • Emerging competition from innovative insurtechs
  • Emerging competition from other verticals that add insurance products

Only a good transformation strategy guarantees a successful future for traditional insurance companies. Nobody wants to become the next Nokia (mobile phones), Kodak (photo cameras), or Blockbuster (video rental). If you fail to innovate in time, you are done.

Real-time beats slow data. Automation beats manual processes. The combination of these two game changers creates completely new business models in the insurance industry. Some examples:

  • Claims processing including review, investigation, adjustment, remittance or denial of the claim
  • Claim fraud detection by leveraging analytic models trained with historical data
  • Omnichannel customer interactions including a self-service portal and automated tools like NLP-powered chatbots
  • Risk prediction based on lab testing, biometric data, claims data, patient-generated health data (depending on the laws of a specific country)

These are just a few examples.

The shift to real-time data processing and automation is key for many other use cases, too. Machine learning and deep learning enable the automation of many manual and error-prone processes like document and text processing.

The Need for Brownfield Integration

Traditional insurance companies usually (have to) start with brownfield integration before building new use cases. The integration of legacy systems with modern application infrastructures and the replication between data centers and public or private cloud-native infrastructures are key pieces of the puzzle.

Common integration scenarios use traditional middleware that is already in place. This includes MQ, ETL, ESB, and API tools. Kafka is complementary to these middleware tools:

Kafka and other Middleware like MQ ETL ESB API in the Enterprise Architecture

More details about this topic are available in the following two posts:

Greenfield Applications at Insurtech Companies

Insurtechs have a huge advantage: They can start greenfield. There is no need to integrate with legacy applications and monolithic architectures. Hence, some traditional insurance companies go the same way. They start from scratch with new applications instead of trying to integrate old and new systems.

This setup has a huge architectural advantage: There is no need for traditional middleware as only modern protocols and APIs need to be integrated. No monolithic and proprietary interfaces such as Cobol, EDI, or SAP BAPI/iDoc exist in this scenario. Kafka makes new applications agile, scalable, and flexible with open interfaces and real-time capabilities.

Here is an example of an event streaming architecture for claim processing and fraud detection with the Kafka ecosystem:

Real-time decision making for claim processing and fraud detection in insurance with Apache Kafka
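The following Kafka Streams sketch shows the routing idea behind such an architecture: suspicious claims go to a fraud review topic, while the rest continue into automated processing. It is a generic illustration, not an implementation from any of the companies discussed below; the topic names, the simplistic CSV payload, and the threshold are assumptions:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class ClaimRoutingSketch {

    // Claims arrive as simple "claimId,amount" strings in this sketch
    private static double amountOf(String value) {
        return Double.parseDouble(value.split(",")[1]);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "claim-routing");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> claims = builder.stream("insurance-claims");

        // High-value claims are routed to manual fraud review
        claims.filter((key, value) -> amountOf(value) > 10_000.0)
                .to("claims-fraud-review");

        // Everything else flows into automated claims processing
        claims.filter((key, value) -> amountOf(value) <= 10_000.0)
                .to("claims-auto-processing");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

In production, a trained analytic model would typically replace the static threshold for scoring each claim in real-time.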

Real-World Deployments of Kafka in the Insurance Industry

Let’s take a look at a few examples of real-world deployments of Kafka in the insurance industry.

Generali – Kafka as Integration Platform

Generali is one of the top ten largest insurance companies in the world. The digital transformation of Generali Switzerland started with Confluent as a strategic integration platform. They started their journey by integrating hundreds of legacy systems like relational databases. Change Data Capture (CDC) pushes changes into Kafka in real-time. Kafka is the central nervous system and integration platform for the data. Other real-time and batch applications consume the events.

From here, other applications consume the data for further processing. All applications are decoupled from each other. This is one of the unique benefits of Kafka compared to other messaging systems. Real decoupling and domain-driven design (DDD) are not possible with traditional MQ systems or SOAP / REST web services.
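Generali's concrete connector setup is not public. As a generic illustration of the CDC pattern, here is a sketch of a Debezium MySQL source connector configuration for Kafka Connect; all hostnames, credentials, and table names are placeholders:

```json
{
  "name": "legacy-policy-db-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "legacy-db.internal",
    "database.port": "3306",
    "database.user": "cdc-user",
    "database.password": "********",
    "database.server.id": "184054",
    "database.server.name": "policydb",
    "table.include.list": "insurance.policies,insurance.claims",
    "database.history.kafka.bootstrap.servers": "broker:9092",
    "database.history.kafka.topic": "schema-changes.policydb"
  }
}
```

Each captured table change lands in its own Kafka topic in real-time, from where downstream consumers process the events independently.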

Design Principles of Generali’s Cloud-Native Architecture

The key design principles for the next-generation platform at Generali include agility, scalability, cloud-native, governance, data, and event processing. Hence, Generali’s architecture is powered by a cloud-native infrastructure leveraging Kubernetes and Apache Kafka:

Event Streaming and Integration Platform Generali powered by Apache Kafka, Kubernetes and Change Data Capture

The following integration flow shows the scalable microservice architecture of Generali. The streaming ETL process includes data integration and data processing in decoupled environments:

Generali Kafka Meta Model

Centene – Integration and Data Processing at Scale in Real-Time

Centene is the largest Medicaid and Medicare Managed Care Provider in the US. Their mission statement is “transforming the health of the community, one person at a time”. The healthcare insurer acts as an intermediary for both government-sponsored and privately insured health care programs.

Centene’s key challenge is growth. Many mergers and acquisitions require a scalable and reliable data integration platform. Centene chose Kafka due to the following capabilities:

  • highly scalable
  • high autonomy and decoupling
  • high availability and data resiliency
  • real-time data transfer
  • complex stream processing

Centene’s architecture uses Kafka for data integration and orchestration. Legacy databases, MongoDB, and other applications and APIs leverage the data in real-time, batch, and request-response:

Centene – Integration and Data Processing at Scale in Real-Time in the Insurance Industry with Kafka

Swiss Mobiliar – Decoupling and Orchestration

Swiss Mobiliar (Schweizerische Mobiliar aka Die Mobiliar) is the oldest private insurer in Switzerland.

Event Streaming with Kafka supports various use cases at Swiss Mobiliar:

  • Orchestrator application to track the state of a billing process
  • Kafka as database and Kafka Streams for data processing
  • Complex stateful aggregations across contracts and re-calculations
  • Continuous monitoring in real-time

Their architecture shows the decoupling of applications and orchestration of events:

Swiss Mobiliar – Decoupling and Orchestration with Kafka

Also, check out the on-demand webinar with Mobiliar and Spoud to learn more about their Kafka usage.

Humana – Real-Time Integration and Analytics

Humana Inc. is a for-profit American health insurance company. In 2020, the company ranked 52 on the Fortune 500 list.

Humana leverages Kafka for real-time integration and analytics. They built an interoperability platform to transition from an insurance company with elements of health to truly a health company with elements of insurance.

Here are the key characteristics of their Kafka-based platform:

  • Consumer-centric
  • Health plan agnostic
  • Provider agnostic
  • Cloud resilient and elastic
  • Event-driven and real-time

Kafka integrates conversations between the users and the AI platform powered by IBM Watson. The platform captures conversational flows and processes them with natural language processing (NLP) – a deep learning concept.

Some benefits of the platform:

  • Adoption of open standards
  • Standardized integration partners
  • In-workflow integration
  • Event-driven for real-time patient interactions
  • Highly scalable

freeyou – Stateful Streaming Analytics

freeyou is an insurtech for vehicle insurance. Streaming analytics for real-time price adjustments powered by Kafka and ksqlDB enable new business models. Their marketing slogan shows how they innovate and differentiate from traditional competitors:

“We make insurance simple. With our car insurance, we make sure that you stay mobile in everyday life – always and everywhere. You can take out the policy online in just a few minutes and manage it easily in your freeyou customer account. And if something should happen to your vehicle, we’ll take care of it quickly and easily.”

A key piece of freeyou’s strategy is a great user experience and automatic price adjustments in real-time in the backend. Obviously, Kafka and its stream processing ecosystem are a perfect fit here.

As discussed above, the huge advantage of an insurtech is the possibility to start from the greenfield. No surprise that freeyou’s architecture leverages cutting-edge design and technology. Kafka and ksqlDB enable streaming analytics within the pricing engine, recalculation modules, and other applications:

Kafka and ksqlDB at freeyou car insurance for real time price adjustments


Tesla – Carmaker and Utility Company, now also Car Insurer

Everybody knows: Tesla is an automotive company that sells cars, maintenance, and software upgrades.

More and more people know: Tesla is a utility company that sells energy infrastructure, solar panels, and smart home integration.

Almost nobody knows: Tesla is a car insurer for their car fleet (limited to a few regions in the early phase). That is the obvious next step if you already collect all the telemetry data from all your cars on the street.

Tesla has built a Kafka-based data platform infrastructure “to support millions of devices and trillions of data points per day”. Tesla showed an interesting history and evolution of their Kafka usage at a Kafka Summit in 2019:

History of Kafka Usage at Tesla

Tesla’s infrastructure heavily relies on Kafka.

There is no public information about Tesla using Kafka for their specific insurance applications. But at a minimum, the data collection from the cars and parts of the data processing rely on Kafka. Hence, I thought this is a great example to think about innovation in car insurance.

Tesla: “Much Better Feedback Loop”

Elon Musk made clear: “We have a much better feedback loop” instead of being statistical like other insurers. This is a key differentiator!

There is no doubt that many vehicle insurers will use fleet data to calculate insurance quotes and provide better insurance services. For sure, some traditional insurers will partner with vehicle manufacturers and fleet providers. This is similar to smart city development, where several enterprises partner to build new innovative use cases.

Connected vehicles and V2X (Vehicle to X) integrations are the starting point for many new business models. No surprise: Kafka plays a key role in the connected vehicles space (not just for Tesla).

Many benefits are created by a real-time integration pipeline:

  • Shift from human experts to automation driven by big data and machine learning
  • Real-time telematics data from all its drivers’ behavior and the performance of its vehicle technology (cameras, sensors, …)
  • Better risk estimation of accidents and repair costs of vehicles
  • Evaluation of risk reduction through new technologies (autopilot, stability control, anti-theft systems, bullet-resistant steel)

For these reasons, event streaming should be a strategic component of any next-generation insurance platform.

Slide Deck: Kafka in the Insurance Industry

The following slide deck covers the above discussion in more detail:

Kafka Changes How to Think Insurance

Apache Kafka changes how enterprises rethink data in the insurance industry. This includes brownfield data integration scenarios and greenfield cutting-edge applications. The success stories from traditional insurance companies such as Generali and insurtechs such as freeyou prove that Kafka is the right choice everywhere.

Kafka and its ecosystem enable data processing at scale in real-time. Real decoupling allows the integration between monolith legacy systems and modern cloud-native infrastructure. Kafka runs everywhere, from edge deployments to multi-cloud scenarios.

What are your experiences and plans for low latency use cases? What use case and architecture did you implement? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka in the Airline, Aviation and Travel Industry https://www.kai-waehner.de/blog/2021/02/19/apache-kafka-aviation-airline-aerospaceindustry-airport-gds-loyalty-customer/ Fri, 19 Feb 2021 11:21:55 +0000

Aviation and travel are notoriously vulnerable to social, economic, and political events, as well as the ever-changing expectations of consumers. Coronavirus is just a piece of the challenge. This post explores use cases, architectures, and references for Apache Kafka in the aviation industry, including airlines, airports, global distribution systems (GDS), aircraft manufacturers, and more. Kafka was relevant pre-Covid and will become even more important post-Covid.

Apache Kafka in Aviation Industry including Airlines Airports Manufacturing Retail GDS

Airlines and Aviation are Changing – Beyond Covid-19!

Aviation and travel are notoriously vulnerable to social, economic, and political events. These months have been a particularly testing time due to the global pandemic with Covid-19. But change is coming not just because of the Coronavirus but because of the ever-changing expectations of consumers.

Right now is the time to lay the ground for the future of the aviation and travel industry.

Consumer behaviors and expectations are changing. Whole industries are being disrupted, and the aviation industry is not immune to these sweeping forces of change.

The future business of airlines and airports will be digitally integrated into the ecosystem of partners and suppliers. Companies will provide more personalized customer experiences and be enabled by a new suite of the latest technologies, including automation, robotics, and biometrics.

For instance, new customer notification mobile apps provide customers with relevant and timely updates throughout their journeys. Other major improvements support the front-line service teams at various touchpoints throughout the airports and the end-to-end travel journey.

Apache Kafka in the Airline Industry

Apache Kafka is the de facto standard for event streaming use cases across industries. Many use cases can be applied to the aviation industry, too. Concepts like payment, customer experience, and manufacturing differ in detail. But in the end, it is about integrating systems and processing data in real-time at scale.

For instance, omnichannel retail with Apache Kafka applies to airlines, airports, global distribution systems (GDS), and other aviation industry sectors.

However, it is always easier to learn from other companies in the same industry. Therefore, the following explores a few public Apache Kafka success stories from the aviation industry.

Lufthansa – Kafka Unified Streaming Cloud Operations

Lufthansa talks about the benefits of using Apache Kafka instead of traditional message queues (TIBCO EMS, IBM MQ) for data processing.

The journey started with the question of whether Lufthansa could do data processing better, cheaper, and faster.

Lufthansa’s Kafka architecture does not have any surprises. A key lesson learned from many companies: The real added value is created when you leverage Kafka not just for messaging, but its entire ecosystem, including different clients/proxies, connectors, stream processing, and data governance.

The result at Lufthansa: A better, cheaper, and faster infrastructure for real-time data processing at scale.

My two favorite statements (once again: not really a surprise, as I see the same at many other customers):

  • “Scaling Kafka is really inexpensive”
  • “Kafka adopted and integrated within 3 months”

Watch the full talk from Marcos Carballeira Rodríguez from Lufthansa recorded at the Confluent Streaming Days 2020 to see all the architectures and quotes from Lufthansa.

And check out this exciting video recording of Lufthansa discussing their Kafka use cases for middleware modernization and machine learning:

Data Streaming with Apache Kafka at Airlines - Lufthansa Case Study

Singapore Airlines – Predictive Maintenance with Kafka Connect, Kafka Streams, and ksqlDB

Singapore Airlines is an early adopter of KSQL to continuously process sensor data and apply analytic models to the events. They already talked about their Kafka ecosystem usage (including Kafka Connect, Kafka Streams, and KSQL) back in 2018. The use case is predictive maintenance with a scalable real-time infrastructure, as you can see in my summary slide:

Singapore Airlines leveraging Apache Kafka Connect Streams ksqlDB for Predictive Maintenance

Check out the complete slide deck from Singapore Airlines for more details.

Air France Hop – Scalable Real-Time Microservices

I really like the Kafka Summit talk title: “Hop! Airlines Jets to Real-Time”. Air France Hop leverages Change Data Capture (CDC) with HVR and Kafka for real-time data processing and integration with legacy monoliths. A pretty common pattern to integrate the old and the new software and IT world:

AirFrance Hop leveraging Apache Kafka for Real Time Event Streaming

The complete slide deck and on-demand video recording about this case study are available on the Kafka Summit page.

Amadeus – Real-Time and Batch Log Processing

As I said initially, Kafka is not just relevant for airlines, airports, and aircraft manufacturers. The global distribution system (GDS) from Amadeus is one of the world’s biggest (competing mainly with Sabre). The passenger name record (PNR) is a record in the computer reservation system (CRS) and a crucial part of any GDS offering. While many end-users don’t even know about Amadeus, the aviation industry could not survive without them. Their workloads are mission-critical and need to run 24/7 in real-time, plus connect to their partners’ systems (like an airline) in a very stable and mature manner!

Amadeus is relying on Apache Kafka for both real-time and batch data processing, as they explain on the official Apache Kafka website:

Amadeus GDS powered by Apache Kafka

Streaming Data Exchange for the Travel Industry

After looking at some examples, let’s now cover one more key topic: Data integration and correlation between partners in the aviation industry. Airlines, airports, GDSs, travel companies, and many others need to integrate very well. Obviously, this is already implemented; otherwise, there would be no way to operate flights with passengers and cargo. At least in theory. Honestly, one of the most significant pain points of the travel industry for customers is bad integration across companies. Some examples:

  • Late or (even worse) no notification about a delay or cancellation
  • Issues with the display of available seats or upgrade
  • Broken booking process on the website because of different flight numbers, connecting flights, etc.
  • Booking class issues for upgrades or rebookings
  • Display of technical error messages instead of business information (for instance, I can’t count how often I had seen an “IBM WebSphere” error message when I tried to book a flight on the website of my most commonly used airline)
  • The list goes on and on and on… No matter which airline you pick. That’s my experience as a frequent traveler across all continents and timezones.

There are reasons for these issues. The aviation network is very complex. For instance, the Lufthansa Group sells tickets for all its own brands (like Swiss or Austrian Airlines), plus tickets from Star Alliance partners (such as United or Singapore Airlines). Hence, airlines, airports, GDSs, and many partner systems have to work together. 24/7. In real-time. For this reason, more and more companies in the aviation industry rely on Kafka internally.

But that’s only half of the story… How do you integrate with partners?

Event Streaming vs. REST / HTTP APIs

I explored the discussion around event streaming with Kafka vs. RESTful web services with HTTP in much more detail in another article: “Comparison: Apache Kafka vs. API Management / API Gateway tools like Mulesoft or Kong“. In short: Kafka and REST APIs have their trade-offs. Both are complementary and used together in many architectures. API Management is a great add-on for many applications and microservices, no matter if they are built with HTTP or Kafka under the hood.

But one point is clear: If you need a scalable real-time integration with a partner system, then HTTP is not the right choice. You can either pick gRPC as a request-response alternative or use Kafka natively for the integration with partners, as you use it internally already anyway:

Streaming Aviation Data Exchange for Airlines Airports GDS with Apache Kafka

Kafka-native replication between partners works very well, no matter what Kafka vendor and version you and your partner are running. Obviously, the biggest challenge is security (not from a technical but an organizational perspective). Kafka requires TCP connectivity, and opening TCP ports to a partner is much harder to get approved than HTTP ports.

But from a technical point of view, streaming replication often makes much more sense. I have seen the first customers implementing integration via tools like Confluent Replicator. I am sure that we will see this pattern much more in the future and with better out-of-the-box tool support from vendors.

Data Integration and Correlation at an Airport with Airline Data using KSQL

So, let’s assume that you have the data streams connected at an airport. No matter if just internal data or also partner data. Data correlation adds the business value. Sönke Liebau from OpenCore presented a great airport demo with Kafka and KSQL at a Kafka Summit.

Let’s take a look at some events at an airport:


Events at an Airport

These events exist in various structures and with different technologies and formats. Some data streams arrive in real-time. However, some other data sets come from a monolithic mainframe in batch via a file integration. Kafka Connect is a Kafka-native middleware to implement this integration.

Afterward, all this data needs to be correlated with historical data from a loyalty system or relational database. This is where stream processing comes into play: This concept enables the continuous data correlation in real-time at scale. Kafka-native technologies like Kafka Streams or ksqlDB exist to build streaming ETL pipelines or business applications.

The following example correlates the gate information from the airport with the airline flight information to send a delay notification to the customer who is waiting for the connection flight:

Event Streaming with KSQL at an Airport
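The demo uses KSQL; a roughly equivalent correlation expressed with Kafka Streams might look like the sketch below. The topic names, the keying by flight number, and the string payloads are assumptions for illustration only:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class DelayNotificationSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "airport-delay-notifications");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Airline flight status, keyed by flight number (hypothetical topic)
        KTable<String, String> flightStatus = builder.table("airline-flight-status");

        // Gate events from the airport, keyed by flight number (hypothetical topic)
        KStream<String, String> gateEvents = builder.stream("airport-gate-events");

        // Correlate gate information with flight status and notify passengers on delays
        gateEvents
                .join(flightStatus, (gate, status) -> gate + " | " + status)
                .filter((flightNumber, combined) -> combined.contains("DELAYED"))
                .to("passenger-delay-notifications");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```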

Tons of use cases exist to leverage event streams from different systems (and partners) in real-time. Some examples from an airport perspective:

  • Location-based services while the customer is walking through the airport and waiting for the flight. Example: Coupons for a restaurant (with many empty seats or surplus food that would be trashed if not sold during the day)
  • Airline services such as free or points-based discounted lounge entrance (because the lounge tracking system knows that it is almost empty right now anyway)
  • Partner services like notifying the airport hotel that the guest can stay longer in the room because of a long delay of the upcoming flight

The list of opportunities is almost endless. However, most use cases are only possible if all systems are integrated and data is continuously correlated in real-time. If you need some more inspiration, check out the two blogs “Kafka at the Edge in a Smart Retail Store” and “Kafka in a Train for Improved Customer Experience“. All these use cases are a perfect fit for airline, airports, and their partner ecosystem.

Slides – Apache Kafka in the Aviation, Airline and Travel Industry

The following slide deck goes into more detail:

Kafka for Improved Operations and Customer Experience in the Aviation Industry

This post explored various use cases for event streaming with Apache Kafka in the aviation industry. Airlines, airports, aerospace, flight safety, manufacturing, GDS, retail, and many more partners rely on Apache Kafka.

No question: Kafka is getting mainstream these months in the aviation industry. Serverless and consumption-based offerings such as Confluent Cloud boost the adoption even more. A streaming data exchange between partners is the next step I see on the horizon. I am looking forward to Kafka-native interfaces in the Open APIs of enterprises, better support for streaming interfaces in API Management tools, and COTS solutions from software vendors.

What are your experiences and plans for event streaming in the aviation industry? Did you already build applications with Apache Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

A Hybrid Streaming Architecture for Smart Retail Stores with Apache Kafka https://www.kai-waehner.de/blog/2021/02/01/retail-store-apache-kafka-from-edge-to-cloud-hybrid-streaming-architecture/ Mon, 01 Feb 2021 07:54:27 +0000

Event Streaming with Apache Kafka disrupts the retail industry. Walmart’s real-time inventory system and Target’s omnichannel distribution and logistics are two great examples. This blog post explores a concrete use case as part of the overall story: A hybrid streaming architecture to build smart retail stores for autonomous or disconnected edge computing and replication to the cloud with Apache Kafka.

Smart Retail Store with Apache Kafka at the Edge

Disruption of the Retail Industry with Apache Kafka

Various deployments across the globe leverage event streaming with Apache Kafka for very different use cases. Consequently, Kafka is the right choice, whether you need to optimize the supply chain, disrupt the market with innovative business models, or build a context-specific customer experience.

I explored the use cases of Apache Kafka in retail in a dedicated blog post: “The Disruption of Retail with Event Streaming and Apache Kafka“. Learn about the real-time inventory system from Walmart, omnichannel distribution and logistics at Target, context-specific customer 360 at AO.com, and much more.

This post shows a specific example: The smart retail store and its connection to cloud applications. The example uses AWS. Of course, any other cloud infrastructure can be used instead, such as Azure, GCP, or Alibaba. Walgreens is a great real-world example for building smart retail with 5G and mobile edge computing (MEC) deployments to their 9000 stores.

A Hybrid Streaming Architecture with Apache Kafka for the Smart Retail Store

Multiple Kafka clusters are the norm, not an exception! Hybrid architecture requires Kafka clusters in one or more clouds and in local data centers. In the meantime, the trend goes even further: Plenty of use cases exist for Kafka at the edge (i.e., outside the data center).

In retail, the best customer experience and increased revenue require edge processing with low latency. Often, the internet connection is bad, too. Hence, hybrid Kafka architectures make a lot of sense:

Hybrid Edge to Global Retail Architecture with Apache Kafka

The bi-directional communication between each edge site and a central Kafka cluster is possible with Kafka-native tools such as MirrorMaker 2 or Confluent’s Cluster Linking.

The cloud is best for aggregation use cases, data lakes, data warehouses, integration with 3rd party SaaS, etc. However, many retail use cases need to run at the edge even if there is no internet connection.

Edge Processing and Analytics in the Retail Store

Many retail stores have a bad internet connection that is not stable and has low bandwidth. Hence, the digital transformation in retail requires data processing at the edge:

Event Streaming with Apache Kafka at the Edge in the Smart Retail Store

Kafka at the edge includes various use cases in a retail store:

The Autonomous (or Disconnected) Edge: An Offline Retail Store

Many architectures don’t do real edge processing. They just connect the clients at the edge to the backends in the cloud. This is fine for some use cases. However, several good reasons exist to deploy Kafka at the edge beyond replication to the cloud:

  • Always on – process edge data even if you don’t have a (good) internet connection
  • Backpressure handling – decouple the edge from the cloud if there is no stable connection to the cloud
  • Reduced traffic costs – it does not make sense to replicate all sensor data etc. to the cloud
  • Low latency and edge data processing are key for some use cases – for instance, context-specific and location-based customer notifications don’t make sense if the person already walked away from a product or even out of your store already (please note that Kafka is NOT hard real-time, though!)
  • Analytics – Machine Learning in the cloud is great to train models (and Kafka is a key piece of the ML story, too), but the model inference at scale in real-time (with Kafka) can only happen at the edge

With Kafka at the edge, you can solve all these scenarios with a single technology, including non-real-time use cases:

Disconnected Edge - Apache Kafka Offline in a Retail Store

Real-World Example: Swimming Retail Stores at Royal Caribbean

Royal Caribbean is a cruise line. It operates the four largest passenger ships in the world. As of January 2021, the line operates twenty-four ships and has six additional ships on order.

Royal Caribbean implemented one of the most famous use cases for Kafka at the edge. Each cruise ship has a Kafka cluster running locally for use cases such as payment processing, loyalty information, customer recommendations, etc.:

Swimming Retail Stores at Royal Caribbean with Apache Kafka

All the reasons I described above apply for Royal Caribbean:

  • Bad and costly connectivity to the internet
  • The requirement to do edge computing in real-time for a seamless customer experience and increased revenue
  • Aggregation of all the cruise trips in the cloud for analytics and reporting to improve the customer experience, upsell opportunities, and many other business processes

Hence, a Kafka cluster on each ship enables local processing and reliable, mission-critical workloads. The Kafka storage guarantees durability, no data loss, and guaranteed ordering of events – even though they are processed later. Only very critical data is sent directly to the cloud (if there is connectivity at all). All other data is replicated to the central Kafka cluster in the cloud when the ship arrives in a harbor for a few hours. A stable internet connection and high bandwidth are available before leaving for the next trip again.

Obviously, the same architecture can be applied to traditional retail stores on land in malls or other buildings.

Kafka at the Edge is the New Black!

The “Kafka at the edge” story is coming up more and more. Obviously, it is not just relevant for retail stores but also for bank branches, restaurants, factories, cell towers, stadiums, and hospitals.

5G will be a key reason for Kafka’s success at the edge (and edge computing in general). The better you can connect things at the edge, the more you can do with it there. The example of building a smart factory with Kafka and a private 5G campus network goes into more detail. Streaming machine learning with Apache Kafka at the edge is the new black! This is true for many use cases, including advanced planning, payment and fraud detection, or customer recommendations.

What are your experiences and plans for event streaming in the retail industry or with Kafka at the edge (outside the data center)? Did you already build applications with Apache Kafka? Check out the “Infrastructure Checklist for Apache Kafka at the Edge” if you plan to go that direction!

Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka in Gaming (Games Industry, Bookmaker, Betting, Gambling, Video Streaming) https://www.kai-waehner.de/blog/2020/07/16/apache-kafka-gaming-games-industry-bookmaker-betting-gambling-video-streaming/ Thu, 16 Jul 2020 06:13:54 +0000

This blog post explores how event streaming with Apache Kafka provides a scalable, reliable, and efficient infrastructure to make gamers happy and Gaming companies successful. Various use cases and architectures in the gaming industry are discussed, including online and mobile games, betting, gambling, and video streaming.

Learn about:

  • Real-time analytics and data correlation of game telemetry
  • Monetization network for real-time advertising and in-app purchases
  • Payment engine for betting
  • Detection of financial fraud and cheating
  • Chat function in games and cross-games
  • Monitoring the results of live operations like weekend events or limited-time offers
  • Real-time analytics on metadata and chat data for marketing campaigns

The Evolution of the Gaming Industry

The gaming industry must process billions of events per day in real-time and ensure consistent and reliable data processing and correlation across gameplay interactions and backend analytics. Deployments must run globally and work for millions of users 24/7, 365 days a year.

These requirements apply to hardcore games and blockbusters, including massively multiplayer online role-playing games (MMORPG), first-person shooters, and multiplayer online battle arenas (MOBA), as well as mid-core and casual games. Reliable and scalable real-time integration with consumer devices like smartphones and game consoles is as essential as cooperating with online streaming services like Twitch and betting providers.

The Evolution of the Games Industry

Business Models in the Gaming Industry

Gaming is not just about games anymore. Even within the games industry, the options for playing have diversified from consoles and PCs to mobile games, casino games, online games, and various other formats. In addition to the games themselves, people also engage via professional eSports, $$$ tournaments, live video streaming, and real-time betting.

This is a crazy evolution, isn’t it? Here are some of the business models relevant today in the gaming industry:

  • Hardware sales
  • Game sales
  • Free-to-play + in-game purchases, such as skins or champions
  • Gambling (Loot boxes)
  • Game-as-a-service (Subscription)
  • Seasonal in-game purchases like passes for theme events, mid-season invitational & world championship, passes for competitive play
  • Game-Infrastructure-as-a-Service
  • Merchandise sales
  • Communities including eSports broadcast, ticket sales, franchising fees
  • Live betting
  • Video streaming, including ads, rewards, etc.

Evolution of “AI” (Artificial Intelligence) in Gaming

Artificial Intelligence (business rules, statistical models, machine learning, deep learning) is vital for many use cases in Gaming. These use cases include:

  • In-game AI: Non-playable characters (NPC), environments, features
  • Fraud detection: Cheating, financial fraud, child abuse
  • Game analytics: Retention, game changes (real-time delivery or via next patch/update)
  • Research: Find new algorithms, improve AI, adapt to business problems

Evolution of Artificial Intelligence in Gaming

Many of the use cases explored in the following sections use AI in conjunction with event streaming and Kafka.

Hybrid Gaming Architectures for Event Streaming with Apache Kafka

The vast demand for an open, flexible, scalable platform with real-time processing is the reason why so many gaming-related projects use Apache Kafka. I will not discuss Kafka itself here and assume you know why it became the de facto standard for event streaming.

What’s more interesting are the different deployments and architectures I have seen in the wild. Infrastructures in the gaming industry are often global. Sometimes cloud-only, sometimes hybrid with local on-premises installations. Betting is usually regional (mainly for legal and compliance reasons). Games are typically global. If a game is excellent, it gets deployed and rolled out across the world.

Hybrid Kafka Architectures and Infrastructures in Gaming Games Betting Gambling - On Premise vs Public Cloud

Let’s now take a look at several different use cases and architectures in the gaming industry. Most of these examples are relevant in all gaming-related use cases, including games, mobile, betting, gambling, and video streaming.

Infrastructure Operations – Live Monitoring and Troubleshooting

Monitoring the results of live operations is essential for every mission-critical infrastructure. Use cases include:

  • Game clients, game servers, game services
  • Service health 24/7
  • Special events such as weekend tournaments, limited-time offers, and user acquisition campaigns

Immediate and correct troubleshooting requires real-time monitoring. You need to be able to answer questions like: Who causes the problem? The client? The ISP? The game itself?

Live Operations of Kafka Applications in Gaming

Let’s take a look at a typical example in the gaming industry: A new marketing campaign:

  • “Play for free over the weekend”
  • Scalability – Huge extra traffic
  • Monitoring – Was the marketing campaign successful? How profitable is the game/business?
  • Real-time (e.g., alerting)
  • Batch (e.g., analytics and reporting of success with Snowflake)

A lot of diverse data has to be integrated, correlated, and monitored to keep the infrastructure running and to troubleshoot issues.

Elasticity Is the Key for Success in the Games Industry

A key challenge in infrastructure monitoring is the required elasticity. You cannot just provision some hardware, deploy the software, and run it statically 24 hours a day, 365 days a year. Gaming infrastructures must scale up and down with demand, no matter if you care about online games, betting, or video streaming.

Chris Dyl, Director of Platform at Epic Games, pointed this out well at AWS Summit 2018: “We have an almost ten times difference in workloads between peak and low-peak. Elasticity is really, really important for us in any particular region at the cloud providers”.

Confluent provides elasticity for any Kafka deployment, no matter if the event streaming platform runs self-managed at the edge or fully managed in the cloud. Check out “Scaling Apache Kafka to 10+ GB Per Second in Confluent Cloud” to see how Kafka can be scaled automatically in the cloud. Self-managed Kafka gets elastic by using tools such as Self-Balancing Kafka, Tiered Storage, and Confluent Operator for Kubernetes.

Game Telemetry – Real-time Analytics and Data Correlation with Kafka

Game telemetry describes how the player plays the game. Player information includes business events such as user actions (button clicks, shooting, item usage) or game environment metrics (quests, leveling up), and technical information like the login server, IP address, and location.

Global gaming requires proxies all over the world to guarantee low regional latency for millions of clients. In addition, a central analytics cluster correlates (anonymized) data from across the globe. Here are some use cases for game telemetry:

  • Game monitoring
  • Insight into how well players progress through the game and which problems occur
  • Live operations – Adjust the gameplay
  • Server-side changes while the player is playing the game (e.g., time-limited event, give reward)
  • Real-time updates to improve the game or align with audience needs (or in other words: recommend an item, upgrade, skin, or additional in-game purchase)

Most use cases require processing big data streams in real-time:

Real time game telemetry analytics with Apache Kafka ksqlDB Kafka Connect
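
As an illustration, here is a minimal Kafka Streams sketch of such a real-time telemetry aggregation. The topic names (`game-telemetry`, `player-activity-per-minute`) and the per-minute action count are hypothetical examples, not a specific vendor's implementation:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;

public class TelemetryAggregation {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Telemetry events (button clicks, shooting, item usage), keyed by player ID.
        builder.stream("game-telemetry", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               // Count actions per player in one-minute tumbling windows - the
               // basis for live-ops dashboards and in-game recommendations.
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .count()
               .toStream()
               // Unwrap the window to get back to a plain player ID key.
               .map((windowedPlayerId, count) -> KeyValue.pair(windowedPlayerId.key(), count))
               .to("player-activity-per-minute", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}
```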

Big Fish Games

Big Fish Games is an excellent example of live operations leveraging Apache Kafka and its ecosystem. They develop casual and mid-core games. Their catalog of over 450 unique mobile games and over 3,500 unique PC games has been installed 2.5 billion times on smartphones and computers in 150 countries.

Live operations use real-time analytics of game telemetry data. For instance, Big Fish Games increases revenue while the player plays the game by making context-specific recommendations for in-game purchases in real-time. Kafka Streams is used for continuous data correlation in real-time at scale.

Live Operations of Kafka events at Big Fish Games with Kafka Streams

Check out the details in the Kafka Summit Talk “How Big Fish Games developed real-time analytics“.

Monetization Network

Monetization networks are a fundamental component in most gaming companies. Use cases include:

  • In-game advertising
  • Micro-transactions and in-game purchases: Sell Skins, Upgrade to the next level…
  • Game-Infrastructure-as-a-Service: Multi-platform and store integration, matchmaking, advertising, player identity and friends, cross-play, lobbies, leaderboards, achievements, game analytics, …
  • Partner network: Cross-sell game data, game SDK, game analytics, …

A monetization network looks like the following:

Monetization network with Apache Kafka for In-Game Transactions and Bookmaker Gambling Payments

Unity Ads – Monetization network

Unity is a fantastic example. In 2019, Unity-made content was installed 33 billion times, reaching 3 billion devices worldwide. The company provides a real-time 3D development platform.

Unity operates one of the largest monetization networks in the world:

  • Reward players for watching ads
  • Incorporate banner ads
  • Incorporate Augmented Reality (AR) ads
  • Playable ads
  • Cross-Promotions

Unity is a data-driven company:

  • Processes an average of about half a million events per second
  • Handles millions of dollars of monetary transactions
  • Data infrastructure based on Confluent Platform, Confluent Cloud and Apache Kafka

A single data pipeline provides the foundational infrastructure for analytics, R&D, monetization, cloud services, etc. for real-time and batch processing leveraging Apache Kafka:

  • Real-time monetization network
  • Feed machine learning models in real-time
  • Data lake went from two-day latency down to 15 minutes

If you want to learn about their success story migrating this platform from self-managed Kafka to fully-managed Confluent Cloud, read Unity’s post on the Confluent Blog: “How Unity uses Confluent for real-time event streaming at scale“.

Chat Function within Games and Cross-Platform

Building a chat platform is not a trivial task in today’s world. Chatting means sending text, in-game screenshots, in-game items, and more. Millions of events have to be processed in real-time. Cross-platform chat platforms need to support various technologies, programming languages, and communication paradigms such as real-time, batch, and request-response:

Real-time Chat function at scale within games and cross-platform using Apache Kafka

The characteristics of Kafka make it the perfect infrastructure for chat platforms due to high scalability, real-time processing, and real decoupling, including backpressure handling.
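
To make the ordering argument concrete, here is a minimal producer sketch. The topic name `chat-messages` and the message format are hypothetical; the key point is that keying by chat room ID routes all messages of one room to the same partition, which gives per-room ordering for free:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ChatMessageProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Idempotence avoids duplicate chat messages on producer retries.
        props.put("enable.idempotence", "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The chat room ID is the record key: all messages of this room
            // land on the same partition and therefore stay in order.
            String chatRoomId = "guild-42";
            String message = "{\"user\":\"player1\",\"text\":\"gg\",\"ts\":1594876800}";
            producer.send(new ProducerRecord<>("chat-messages", chatRoomId, message));
        }
    }
}
```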

Payment Engine

Payment infrastructure needs to be real-time, scalable, reliable, and technology-independent. No matter if your solution is built for games, betting, casinos, 3D game engines, video streaming, or any other 3rd party services.

Most payment engines in the gaming industry are built on top of Apache Kafka. Many of these companies provide public information about their real-time betting infrastructure. Here is one example of an architecture:

Real time betting infrastructure with Apache Kafka


One example use case is the implementation of a betting delay and approval system in live bets. Stateful streaming analytics is required to improve the margin:


Betting delay and approval in live bets using streaming analytics


Kafka-native technologies like Kafka Streams or ksqlDB enable a straightforward implementation of these scenarios.
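
As a minimal illustration, the following Kafka Streams sketch checks whether the odds of a market moved too much between bet placement and approval. All topic names and the 5% threshold are hypothetical, and a production system would additionally implement the configurable delay itself (for example with a windowed state store or a punctuator in the Processor API):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class BetApproval {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Incoming bets, keyed by market ID; the value is the odds at placement time.
        KStream<String, Double> bets =
            builder.stream("incoming-bets", Consumed.with(Serdes.String(), Serdes.Double()));

        // Continuously updated odds per market, also keyed by market ID.
        KTable<String, Double> currentOdds =
            builder.table("market-odds", Consumed.with(Serdes.String(), Serdes.Double()));

        // Approve the bet only if the odds have not moved by more than 5%
        // between placement and the end of the delay window.
        bets.join(currentOdds,
                  (betOdds, latestOdds) ->
                      Math.abs(betOdds - latestOdds) / betOdds <= 0.05
                          ? "APPROVED" : "REJECTED")
            .to("bet-decisions", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```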

William Hill – A Secure and Reliable Real-time Microservice Architecture

William Hill went from a monolith to a flexible, scalable microservice architecture:

  • Kafka as central, reliable streaming infrastructure
  • Kafka for messaging, storage, cache and processing of data
  • Independent decoupled microservices
  • Decoupling and replayability
  • Technology independence
  • High throughput + low latency + real-time

William Hill’s trading platform leverages Kafka as the heart of all events and transactions:

  • “process-to-process” execution in real-time
  • Integration with analytic models for real-time machine learning
  • Various data sources and data sinks (real-time, batch, request-response)

William Hill Kafka Betting Engine

Bookmaker business == Banking Business (including Legacy Middleware and Mainframes)

Not everyone can start from a greenfield. Legacy middleware and mainframe integration, offloading, and replacement are common scenarios.

Betting is usually a regulated market. PII data is often processed on-premises in a regional data center. Non-PII data can be offloaded to the cloud for analytics.

Legacy technologies like mainframes are a crucial cost factor, monolithic, and inflexible. I covered the relation between Kafka and mainframes in detail in a dedicated post, as well as the story about Kafka vs. legacy middleware (MQ, ETL, ESB).

Streaming Analytics for Retention, Compliance, and Customer Experience

Data quality is critical for legal compliance, including responsible gaming regulations. Client retention is vital to keep engagement and revenue growing.

Plenty of real-time streaming analytics use cases exist in this environment. Some examples where Kafka-native frameworks like Kafka Streams or ksqlDB can provide the foundation for a reliable and scalable solution (a minimal code sketch follows the list):

  • Player winning / losing streak
  • Player conversion – from registration to wager (within x minutes)
  • Game achievement of the player
  • Fraud detection – e.g., payment windows
  • Long-running windows per player over days/months
  • Tournaments
  • Incentivize unhappy players with additional free credit
  • Reports to regulator – replay old events in a guaranteed order
  • Geolocation to enable features, limitations or commissions

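Here is a minimal Kafka Streams sketch for one of these examples, the losing-streak detection with a free-credit incentive. The topic names and the threshold of five consecutive losses are hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class StreakDetector {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Game results per player: "WIN" or "LOSS", keyed by player ID.
        KTable<String, Integer> lossStreaks = builder
            .stream("game-results", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey()
            .aggregate(
                () -> 0,
                // Consecutive losses increment the counter; a win resets it.
                (playerId, result, streak) -> "LOSS".equals(result) ? streak + 1 : 0,
                Materialized.with(Serdes.String(), Serdes.Integer()));

        // Five losses in a row trigger an incentive, e.g., free credit.
        lossStreaks.toStream()
                   .filter((playerId, streak) -> streak >= 5)
                   .mapValues(streak -> "OFFER_FREE_CREDIT")
                   .to("player-incentives", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```
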
Stream processing is also relevant for many other use cases, including fraud detection, as you will see in the next section.

Fraud Detection in Gaming with Kafka

Real-time analytics for detecting anomalies is a widespread scenario in any payment infrastructure. In gaming, two different kinds of fraud exist:

  • Cheating: Fake accounts, bots, …
  • Financial fraud: match-fixing, stolen credit cards, …

Here is an example of doing streaming analytics for fraud detection with Kafka, its ecosystem, and machine learning:

Streaming Analytics for Instant Payment and Fraud Detection at Scale with Apache Kafka


Here is an example of detecting financial fraud and cheating with Jupyter notebooks and Python to analyze data pre-processed with ksqlDB:

Streaming Analytics for Instant Payment and Fraud Detection at Scale with Apache Kafka
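
The pre-processing step can equally be expressed with Kafka Streams. Here is a minimal sketch of a simple payment velocity check that flags cards with suspiciously many payments in a short window and hands them over to a downstream scoring pipeline; the topic names and thresholds are hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;

public class PaymentVelocityCheck {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Payment events keyed by payment instrument (e.g., hashed card number).
        builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
               .count()
               .toStream()
               // More than 10 payments from one card within 5 minutes is
               // suspicious and handed over to the scoring pipeline (e.g.,
               // the Python-based analysis mentioned above).
               .filter((windowedCard, count) -> count > 10)
               .map((windowedCard, count) -> KeyValue.pair(windowedCard.key(), count))
               .to("suspicious-payments", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}
```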

Customer 360 – Recommendations, Loyalty System, Social Integration

Customer 360 is critical for real-time and context-specific acquisition, engagement, and retention. Use cases include:

  • Real-Time Event Streaming
    • Game event triggers
    • Personalized statistics and odds
    • Player segmentation
    • Campaign orchestration (“player journey”)
  • Loyalty system
    • Rewards, e.g., upgrades, exclusive in-game content, beta keys for an announcement event
    • Avoid customer churn
    • Cross-selling
  • Social Network integration
    • Twitter, Facebook, …
    • Example: Candy Crush (I guess every Facebook user has seen ads for this game)
  • Partner integration
    • API Management

The following architecture depicts the relation between various internal and external components of a customer 360 solution:

Customer 360, loyalty and rewards with Apache Kafka


Customer 360 at Sky Betting & Gaming

Sky Betting & Gaming has built a real-time streaming architecture for customer 360 use cases with Kafka’s ecosystem.

Here is a quote on why they chose Kafka-native frameworks like Kafka Streams instead of a zoo of technologies like Hadoop, Spark, Storm, and others:

“Most of our streaming data is in the form of topics on a Kafka cluster. This means we can use tooling designed around Kafka instead of general streaming solutions with Kafka plugins/connectors.

Kafka itself is a fast-moving target, with client libraries constantly being updated; waiting for these new libraries to be included in an enterprise distribution of Hadoop or any off the shelf tooling is not really an option. Finally, the data in our first use-case is user-generated and needs to be presented back to the user as quickly as possible.”

Disney+ Hotstar – Telco-OTT for millions of cricket fans in India

In India, people love cricket. Millions of users watch live streams on their smartphones. But they are not just watching it. Instead, gambling is also part of the story. For instance, you can bet on the result of the next play. People compete with each other and can win rewards.

This infrastructure has to run at extreme scale. Millions of actions have to be processed each second. No surprise that Disney+ Hotstar chose Kafka as the heart of this infrastructure:

Hotstar Telco OTT for millions of cricket fans in India with Apache Kafka

IoT Integration is often also part of such a customer 360 implementation. Use cases include:

  • Live eSports events, TV, video streaming and news stations
  • Fan engagement
  • Audience communication
  • Entertaining features for Alexa, Google Home or sports-specific hardware

Cross-Company Kafka Integration

Last but not least, let’s talk about a trend I see in many industries: Streaming replication across departments and companies.

Most companies in the gaming industry use event streaming with Kafka at the heart of their business. However, connecting to the outside world (i.e., other departments, partners, 3rd party services) is typically done via HTTP / REST APIs. A total anti-pattern! Not scalable! Why not stream the data directly?

Cross-Company Apache Kafka Integration - Streaming Replication and API Management

I see more and more companies moving to this approach.

API Management is an elaborate discussion on its own. Therefore, I have written a dedicated blog post about the relationship between Kafka and API Management.

Slides and Video – Kafka in the Gaming Industry

Here are the slides and on-demand video recording discussing Apache Kafka in the gaming industry in more detail:

Kafka and Big Data Streaming Use Cases in the Gaming Industry

As you learned in this post, Kafka is used everywhere in the gaming industry. No matter if you focus on games, betting, or video streaming.

What are your experiences with modernizing the infrastructure and applications in the gaming industry? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss it!


The post Apache Kafka in Gaming (Games Industry, Bookmaker, Betting, Gambling, Video Streaming) appeared first on Kai Waehner.

Apache Kafka for Telco-OTT (Telecom Sector) and Media Applications https://www.kai-waehner.de/blog/2020/07/03/telco-ott-applications-with-apache-kafka-telecom-sector-oss-bss-hybrid-cloud/ Fri, 03 Jul 2020 07:20:53 +0000 https://www.kai-waehner.de/?p=2420 Current IT architectures in the telecom and media sector are not able to satisfy business needs because of…

The post Apache Kafka for Telco-OTT (Telecom Sector) and Media Applications appeared first on Kai Waehner.

Current IT architectures in the telecom and media sector are not able to satisfy business needs because of their high complexity, lack of flexibility, and low level of automation. The biggest hurdle to overcome with digital transformation is to understand that it isn’t just a simple technology challenge – it covers every part of the telco business! This blog post explores next-generation architectures for the telecom and media sector with the Apache Kafka ecosystem to build Telco-OTT (Over the Top) services.

Apache Kafka in the Telecom Sector (OSS, BSS, Middleware, OTT)

A previous blog post covered the use cases of Event Streaming and Apache Kafka in the telecom industry: “Event Streaming and Apache Kafka in Telco Industry“. Follow that post or the related “whiteboard on YouTube about Apache Kafka and the telecom sector” to learn about various use cases like the following:

Apache Kafka and Event Streaming in the Telecom Sector

Telco-OTT (Over-The-Top) – 3rd Party Telecom and Media Services

Telco-OTT (Over-The-Top, OTT) is a particular field in the overall architecture in Telco and Media enterprises:

  • OSS (Operations Support System), as part of the telecommunication infrastructure, ensures that services are working.
  • BSS (Business Support System), as part of the telecommunication infrastructure, ensures that these services can be provided to actual customers.
  • OTT applications use the existing telecommunication infrastructure and provide better cost, features, and/or convenience.

An example is the easiest way to understand this.

OTT Applications for Messaging and Chat (SMS, RCS, WhatsApp, WeChat)

Messaging applications are a great example of Telco-OTT. Easy to understand for everybody, but pretty powerful tools.

Here are some examples of messaging apps:

  • SMS (short message service) by telco providers: Plain text messages.
  • WhatsApp by Facebook: Above + group chat, gif/stickers, photos, videos, audio, location, contact information, and ‘walkie-talkie’ services.
  • WeChat by Tencent: Above + payment + various other partner integrations

The telecom sector had huge revenue coming from simple text messages via SMS. No competition existed for a long time. Telcos were able to charge ~20 cents per message. A billion-dollar business emerged.

The internet and app economy changed this massively:

Global SMS messaging volume

People-to-People (P2P) SMS shrank significantly. Application-to-People (A2P) volume is lower today, too.

The telecom sector has the following options to act when a competing service becomes successful:

  • Partnership: No control over the service, plus potential damage to their reputation and customer relationships.
  • Developing their own services: Often lacking the skills to build the service, and usually too late anyway.
  • Blocking OTT services: Losing traffic revenue and customers.

Not good options, right? The best answer is to innovate early, before competitors take over.

In this case, the answer from the telco providers was RCS (rich communication services). RCS added features beyond simple text, such as sending images and videos. But it came “a little bit late” and was much more expensive compared to “almost free” apps like WhatsApp and WeChat. (Yes, I know, if the product is free, then you are the product – but this is another discussion…)

Next-Generation Telecom Architecture

The above example shows the main problem in the telecommunications industry. The telecom sector needs to completely change its strategy and overall infrastructure to stay competitive and innovative. This change is the only way to “bridge the gap with the Over-the-Top (OTT) internet providers and survive in the digital age“:

Requirements for the next generation telco architecture

Microservices Architecture for the Telecom Sector to Compete with 3rd Party OTT Applications

The telco infrastructure grew over decades. It is inflexible, monolithic, and complex. The trend at most telco companies I talked to across the world in the last 12 months points in only one direction for the future: build a flexible, open, scalable telco architecture to process massive volumes of data in real-time.

Domain Driven Design (DDD) with Kafka in the Telco Industry

The capability to decouple applications is where Apache Kafka shines with its distributed architecture providing a combination of messaging and storage to enable real domain-driven design (DDD):

Domain-Driven Design (DDD) and decoupled Telco services with Apache Kafka

A microservices architecture decouples applications and allows the integration between legacy and modern applications, infrastructure, and technologies. This is once again where Apache Kafka shines: as scalable real-time data integration and middleware for any telco infrastructure, supporting low-level interfaces such as TCP, Syslog, or SNMP, business integrations with CRM, EMS, NMS, or IMS systems, and even mainframe offloading and replacement.
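
The decoupling is visible in the code: a new domain service simply subscribes to an existing event stream with its own consumer group, without any coordination with the producing domain. Here is a minimal sketch, assuming a hypothetical `network-usage-events` topic consumed by a billing microservice:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class BillingEventConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Each domain service uses its own consumer group and reads the same
        // event stream independently of all other consumers and producers.
        props.put("group.id", "billing-domain");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        // Replayability: a newly deployed service can start from the earliest event.
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("network-usage-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Rate the usage event within the billing domain.
                    System.out.printf("Rating usage for subscriber %s%n", record.key());
                }
            }
        }
    }
}
```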

Hybrid Architectures as Key for Success in the Telco Industry

Hybrid cloud infrastructure is key to success in the telecommunications industry. The network infrastructure will always be on-premise and at the edge, while big data analytics, customer relations, and other business applications can run in a central data center or at a cloud provider:

Hybrid Cloud Architectures in the Telco Industry with Apache Kafka, Event Streaming and Replication

Apache Kafka and its ecosystem provide various architecture patterns for distributed, hybrid, edge, and global deployments.

Telco architectures often combine Kafka with other Telco-specific applications and services from leading providers in the telecom sector, such as Amdocs, Ericsson, or Huawei.

Disney Hotstar – OTT in Real-Time for Millions of Cricket Fans in India

Let’s now take a look at a specific example for building an OTT service with requirements for real-time processing at a massive scale.

Disney+ Hotstar (known as Hotstar outside India) is an Indian over-the-top streaming service. In 2018, during the Indian Premier League, Hotstar introduced “Watch N Play”, a real-time cricket prediction game in which over 33 million unique users answered over 2 billion questions and won more than 100 million rewards, all built with Kafka as the backbone.

In the game, the user guesses the outcome of the next ball. If they guess right before the actual result, they score points to climb up the ladder and receive rewards along the way. To support potentially millions of users with differing stream times and device latencies, Hotstar used topics to separate logical streams and partitions to scale, handling 1M requests/second and more.

Disney Plus Hotstar Watch n Play with Apache Kafka

This is an impressive OTT service built with Apache Kafka and its ecosystem, isn’t it? Check out more details in Hotstar’s Kafka Summit talk “Scaling for India’s Cricket Hungry Population“.

OTT at Netflix and Tencent with Kafka

In addition to the above Disney+ Hotstar example, I want to quickly mention two more OTT and media players that run Kafka at scale: Netflix and Tencent.

Time to Change for Traditional Players in the Telecom and Media Sector

Deloitte has defined four scenarios for 2030. Every company in the telco industry needs to think about where they want to go:

Time to Change for Traditional Players in the Telecom Sector

This necessary change affects any company working in the telco industry. It is time to change – no matter if you work on projects in OSS, BSS, OTT, IMS, Middleware, 3rd party services, or any other telecom domain.

Open Source MANO (OSM) for NFV and OSS Interoperability

A great example of this change in the telecom sector is the project “Open Source MANO (OSM)” for management and orchestration in OSS environments. This open-source project was formed by ETSI (European Telecommunications Standards Institute) to “address challenges associated with orchestration, interoperability and performance optimization between different NFV (Network Functions Virtualization) and OSS systems”:

OSM Open Source MANO - Telco OSS NFV with Apache Kafka

“The unified message bus of Open Source MANO is implemented with Apache Kafka. This bus allows asynchronous communication between OSM components and enables the introduction of new modules that can be easily pluggable.”

(please note that I don’t like the term ‘bus’ because it depicts Kafka as a messaging bus even though it is an event streaming platform including messaging, storage, processing, and integration capabilities)

The Evolution of the Telco Industry with Apache Kafka

The following slide deck and video recording cover the evolution of Kafka in the telco industry, including use cases, architectures, and technologies (OSS, BSS, OTT, IMS, NFV, Middleware, Mainframe, etc.):

Video Recording - Event Streaming with Apache Kafka in the Telecom Sector and Telco Industry

Apache Kafka and Event Streaming for Innovation in the Telecom Sector

Current IT architectures in the telecom sector are not able to satisfy business needs because of their high complexity, lack of flexibility, and low level of automation.

Event Streaming with Apache Kafka and its ecosystem provides a scalable, reliable, and flexible infrastructure to process massive volumes of data in real-time. It enables true decoupling, plus powerful data integration and processing capabilities. Many telco enterprises build new infrastructures around Kafka.

This blog post focused on next-generation architectures to build Telco-OTT services. Disney+ Hotstar is an impressive example. But no matter in which part of the telco industry you work, change and innovation are essential to staying competitive in the telecom sector.

What are your experiences with modernizing the infrastructure and applications in the telco industry? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss it!

The post Apache Kafka for Telco-OTT (Telecom Sector) and Media Applications appeared first on Kai Waehner.
