Streaming Analytics Archives - Kai Waehner (Technology Evangelist - Big Data Analytics - Middleware - Apache Kafka)
https://www.kai-waehner.de/blog/tag/streaming-analytics/

Fraud Detection with Apache Kafka, KSQL and Apache Flink
Tue, 25 Oct 2022 - https://www.kai-waehner.de/blog/2022/10/25/fraud-detection-with-apache-kafka-ksql-and-apache-flink/

Fraud detection is becoming increasingly challenging in a digital world across all industries. Real-time data processing with Apache Kafka has become the de facto standard to correlate data continuously and prevent fraud before it happens. This blog post explores case studies for fraud prevention from companies such as PayPal, Capital One, ING Bank, Grab, and Kakao Games that leverage stream processing technologies like Kafka Streams, KSQL, and Apache Flink.

Stream Processing with Apache Kafka, KSQL and Apache Flink across Industries

Fraud detection and the need for real-time data

Fraud detection and prevention is the adequate response to fraudulent activities in companies, such as embezzlement and the loss of assets caused by employee actions.

An anti-fraud management system (AFMS) comprises fraud auditing, prevention, and detection tasks. Larger companies use it as a company-wide system to prevent, detect, and adequately respond to fraudulent activities. These distinct elements can be interconnected or exist independently. An integrated solution is usually more effective if the architecture considers their interdependencies during planning.

Real-time data beats slow data across business domains and industries in almost all use cases. But there are few better examples than fraud prevention and fraud detection. It does not help to detect fraud in your data warehouse or data lake hours or even minutes later, as the money is already lost. This “too late architecture” increases risk, causes revenue loss, and leads to a poor customer experience.

It is no surprise that most modern payment platforms and anti-fraud management systems implement real-time capabilities with streaming analytics technologies for these transactional and analytical workloads. The Kappa architecture, powered by Apache Kafka, has become the de facto standard, replacing the Lambda architecture.

A stream processing example in payments

Stream processing is the foundation for implementing fraud detection and prevention while the data is in motion (and relevant) instead of just storing data at rest for analytics (too late).

No matter what modern stream processing technology you choose (e.g., Kafka Streams, KSQL, Apache Flink), it enables continuous real-time processing and correlation of different data sets. Often, the combination of real-time and historical data helps find the right insights and correlations to detect fraud with a high probability.

Let’s look at a few examples of stateless and stateful stream processing for real-time data correlation with the Kafka-native tools Kafka Streams and ksqlDB. Similarly, Apache Flink or other stream processing engines can be combined with the Kafka data stream. Each option has pros and cons. While Flink might be the better fit for some projects, it is another engine and infrastructure you need to combine with Kafka.

Ensure you understand your end-to-end SLAs and requirements regarding latency, exactly-once semantics, potential data loss, etc. Then use the right combination of tools for the job.

Stateless transaction monitoring with Kafka Streams

A Kafka Streams application, written in Java, processes each payment event in a stateless fashion one by one:

Transaction Monitoring for Fraud Detection with Kafka Streams
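To make this more concrete, here is a minimal sketch of such a stateless check with the Kafka Streams DSL. It is not the code behind the diagram above; the topic names, the plain-string message format, and the 10,000 threshold are assumptions for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StatelessTransactionMonitor {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Each record is one payment event; the value is assumed to be the amount as plain text.
        KStream<String, String> payments = builder.stream("payments");

        // Stateless rule: every event is evaluated on its own, no state store or window required.
        payments
            .filter((customerId, amount) -> Double.parseDouble(amount) > 10_000.0)
            .to("fraud-alerts");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stateless-transaction-monitor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```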

Stateful anomaly detection with Kafka and KSQL

A ksqlDB application, written with SQL code, continuously analyzes the transactions of the last hour per customer ID to identify malicious behavior:

Anomaly Detection with Kafka and KSQL
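The blog illustrates this with a ksqlDB query; as a rough equivalent, the same stateful logic can be sketched with the Kafka Streams DSL. The topic names, the assumption that events are keyed by customer ID, and the threshold of 100 transactions per hour are invented for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;

public class StatefulAnomalyDetection {

    public static void buildTopology(StreamsBuilder builder) {
        // Payment events, assumed to be keyed by customer ID.
        KStream<String, String> payments =
            builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()));

        payments
            // Stateful part: count transactions per customer in a tumbling one-hour window.
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
            .count()
            .toStream()
            // Flag customers with a suspicious number of transactions in the current hour.
            .filter((windowedCustomerId, count) -> count > 100)
            .map((windowedCustomerId, count) ->
                KeyValue.pair(windowedCustomerId.key(), "suspicious: " + count + " transactions/hour"))
            .to("anomaly-alerts", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```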

Kafka and Machine Learning with TensorFlow for real-time scoring for fraud detection

A KSQL UDF (user-defined function) embeds an analytic model trained with TensorFlow for real-time fraud prevention:

Fraud Detection with Apache Kafka, KSQL and Machine Learning using TensorFlow
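KSQL/ksqlDB user-defined functions are implemented in Java and registered via annotations. The following sketch only shows the general shape of such a UDF; the function name, the features, and the embedded model are placeholders. The real project embeds a trained TensorFlow model where the dummy heuristic below stands in:

```java
import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;

@UdfDescription(name = "anomaly_score",
                description = "Scores a payment event with an embedded analytic model")
public class AnomalyScoreUdf {

    // Stand-in for the embedded model; the real project loads a trained TensorFlow model instead.
    private static final class EmbeddedModel {
        double score(double amount, double accountAgeDays) {
            // Dummy heuristic for illustration only.
            return amount > 10_000 && accountAgeDays < 30 ? 0.9 : 0.1;
        }
    }

    private final EmbeddedModel model = new EmbeddedModel();

    @Udf(description = "Returns a fraud probability for the given payment features")
    public double anomalyScore(final double amount, final double accountAgeDays) {
        // Scoring happens inside the stream processing engine, no remote model server call needed.
        return model.score(amount, accountAgeDays);
    }
}
```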

Case studies across industries

Several case studies exist for fraud detection with Kafka. It is usually combined with stream processing technologies, such as Kafka Streams, KSQL, and Apache Flink. Here are a few real-world deployments across industries, including financial services, gaming, and mobility services:

  • PayPal processes billions of messages with Kafka for fraud detection.
  • Capital One looks at events as running its entire business (powered by Confluent), where stream processing prevents $150 of fraud per customer on average per year by catching personally identifiable information (PII) violations in in-flight transactions.
  • ING Bank started many years ago by implementing real-time fraud detection with Kafka, Flink, and embedded analytic models.
  • Grab is a mobility service in Asia that leverages fully managed Confluent Cloud, Kafka Streams, and ML for stateful stream processing in its internal GrabDefence SaaS service.
  • Kakao Games, a South Korean gaming company, uses data streaming to detect and handle anomalies with 300+ patterns through KSQL.

Let’s explore the last case study in more detail.

Deep dive into fraud prevention with Kafka and KSQL in mobile gaming

Kakao Games is a South Korea-based global video game publisher specializing in games across various genres for PC, mobile, and VR platforms. The company presented at Current 2022 – The Next Generation of Kafka Summit in Austin, Texas.

Here is a detailed summary of their compelling use case and architecture for fraud detection with Kafka and KSQL.

Use case: Detect malicious behavior by gamers in real-time

The challenge is evident when you understand the company’s history: Kakao Games has many outsourced games purchased via third-party game studios. Each game produces its own logs with its own structure and message format. Reliable real-time data integration at scale is required as a foundation for analytical business processes like fraud detection.

The goal is to analyze game logs and telemetry data in real-time. This capability is critical for preventing and remediating threats or suspicious actions from users.

Architecture: Change data capture and streaming analytics for fraud prevention

The Confluent-powered event streaming platform supports game log standardization. ksqlDB analyzes the incoming telemetry data for in-game abuse and anomaly detection.

Gaming Telemetry Analytics with CDC, KSQL and Data Lake at Kakao Games
Source: Kakao Games (Current 2022 in Austin, Texas)

Implementation: SQL recipes for data streaming with KSQL

Kakao Games detects anomalies and prevents fraud with 300+ patterns through KSQL. Use cases include bonus abuse, multiple account usage, account takeover, chargeback fraud, and affiliate fraud.

Here are a few code examples written with SQL code using KSQL:

SQL recipes for fraud detection with Apache Kafka and KSQL at Kakao Games
Source: Kakao Games (Current 2022 in Austin, Texas)

Results: Reduced risk and improved customer experience

Kakao Games can do real-time data tracking and analysis at scale. Business benefits are faster time to market, increased active users, and more revenue thanks to a better gaming experience.

Fraud detection only works in real-time

Ingesting data with Kafka into a data warehouse or a data lake is only part of a good enterprise architecture. Tools like Apache Spark, Databricks, Snowflake, or Google BigQuery enable finding insights within historical data. But real-time fraud prevention is only possible if you act while the data is in motion. Otherwise, the fraud already happened when you detect it.

Stream processing provides a scalable and reliable infrastructure for real-time fraud prevention. The choice of the right technology is essential. However, all major frameworks, like Kafka Streams, KSQL, or Apache Flink, are very good. Hence, the architectures in the case studies of PayPal, Capital One, ING Bank, Grab, and Kakao Games look different. Still, they share the same foundation: data streaming powered by the de facto standard Apache Kafka to reduce risk, increase revenue, and improve customer experience.

If you want to learn more about streaming analytics with the Kafka ecosystem, check out how Apache Kafka helps create situational awareness and threat intelligence in cybersecurity, and learn from a concrete fraud detection example with Apache Kafka in the crypto and NFT space.

How do you leverage data streaming for fraud prevention and detection? What does your architecture look like? What technologies do you combine? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Is Apache Kafka an iPaaS or is Event Streaming its own Software Category?
Wed, 03 Nov 2021 - https://www.kai-waehner.de/blog/2021/11/03/apache-kafka-cloud-native-ipaas-versus-mq-etl-esb-middleware/

Enterprise integration is more challenging than ever before. The IT evolution requires the integration of more and more technologies. Applications are deployed across edge, hybrid, and multi-cloud architectures. Traditional middleware such as MQ, ETL, and ESB does not scale well enough or only processes data in batches instead of in real time. This post explores why Apache Kafka is the new black for integration projects, how Kafka fits into the discussion around cloud-native iPaaS solutions, and why event streaming is a new software category. A concrete real-world example shows the difference between event streaming and traditional integration platforms or iPaaS.

Apache Kafka as iPaaS or Event Streaming as new Software Category

What is iPaaS (Enterprise Integration Platform as a Service)?

iPaaS (Integration Platform as a Service) is a term coined by Gartner. Here is the official Gartner definition: “Integration Platform as a Service (iPaaS) is a suite of cloud services enabling development, execution, and governance of integration flows connecting any combination of on-premises and cloud-based processes, services, applications, and data within individual or across multiple organizations.” The acronym eiPaaS (Enterprise Integration Platform as a Service) is used in some reports as a replacement for iPaaS.

The Gartner Magic Quadrant for iPaaS shows various vendors:

Gartner Magic Quadrant for iPaaS 2021

Three points stand out for me:

  • Many very different vendors provide a broad spectrum of integration solutions.
  • Many vendors (have to) list various products to provide an iPaaS; this means different technologies, codebases, support teams, etc.
  • No Kafka offering (like Confluent, Cloudera, Amazon MSK) is in the magic quadrant.

The last bullet point makes me wonder: should Kafka-based solutions be considered an iPaaS or not?

Is Apache Kafka an iPaaS?

I don’t know. It depends on the definition of the term “iPaaS”. Yes, Kafka solutions fit into the iPaaS category, but that is just one piece of the event streaming success story.

Kafka is an event streaming platform. Use cases differ from traditional middleware like MQ, ETL, ESB, or iPaaS. Check out real-world Kafka deployments across industries if you don’t know the use cases yet.

Kafka does not directly compete with ETL tools like Talend or Informatica, MQ frameworks like IBM MQ or RabbitMQ, API Management platforms like MuleSoft, or cloud-based iPaaS like Boomi or TIBCO. At least not if people understand the differences between Kafka and traditional middleware. For that reason, many people (including me) think that Event Streaming should have its own Magic Quadrant.

Having said this, all these very different vendors are in the iPaaS Magic Quadrant. So, should Kafka and its vendors be in there? I think so, because I have seen hundreds of customers leverage the Kafka ecosystem as a cloud-native, scalable, event-driven integration platform, often in hybrid and multi-cloud architectures. And that’s an iPaaS.

What’s different with Kafka as an Integration Platform?

If you are new to this discussion, check out the article “Apache Kafka vs. MQ, ETL, ESB” or the related slides and video. Here is my summary on why Kafka is unique for integration scenarios and therefore adopted everywhere:

Why Kafka as iPaaS instead of Traditional Middleware like MQ ETL ESB

A unique selling point for event streaming is the ability to leverage a single platform. In contrast, other iPaaS solutions require different products (including codebases, support teams, integration between the additional tech, etc.).

Kafka as Cloud-native and Serverless iPaaS

Fundamental differences exist between modern iPaaS solutions and traditional legacy middleware; this includes the software architecture, the scalability and operations of the platform, and data processing capabilities. On a high level, a “Kafka iPaaS” requires the following characteristics:

  • Cloud-native Infrastructure: Elasticity is vital for success in the modern IT world. The ability to scale up and down (= expanding and shrinking Kafka clusters) is mandatory. This capability enables starting with a small pilot project and scaling up or handling load spikes (like Christmas business in retail).
  • Automated Operations: Truly serverless SaaS should always be the preferred option if the software runs in a public cloud. Orchestration tools like Kubernetes and related tools (like a Kafka operator for Kubernetes) are the next best option in a self-managed data center or at the edge outside a data center.
  • Complete Platform: An integration platform requires real-time messaging and storage for backpressure handling and long-running processes. Data integration and continuous data processing are mandatory, too. Hence, a “Kafka iPaaS” is only possible if you have access to various pre-built Kafka-native connectors to open standards, legacy systems, and modern SaaS interfaces. Otherwise, Kafka needs to be combined with another middleware like an iPaaS or an ETL tool like Apache NiFi.
  • Single Solution: It sounds trivial, but most other middleware solutions use several codebases and products under the hood. Just look at stacks from traditional players such as IBM and Oracle or open-source-driven Cloudera. The complex software stack makes it much harder to provide end-to-end integration, 24/7 operations, and mission-critical support. Don’t get me wrong: Kafka-native solutions like Confluent Cloud also include different products with additional pricing (like a fully-managed connector or data governance add-ons), but all run on a single Kafka-native platform.

From this perspective, some Kafka solutions are a modern, cloud-native, scalable iPaaS. Having said this, would I be happy if you consider some Kafka solutions as an iPaaS on your technology radar? No, not really!

Event Streaming is its own Software Category!

While some Kafka solutions can be used as an iPaaS, this is only one of many usage scenarios for event streaming. However, as explained above, Kafka-based solutions differ greatly from other iPaaS solutions in the Gartner Magic Quadrant. Hence, event streaming deserves its own software category.

If you still wonder what I mean, check out event streaming use cases across industries to understand the difference between Kafka and traditional iPaaS, MQ, ETL, ESB, API tools. Here is a relatively old but still fantastic diagram that summarizes the broad spectrum of use cases for event streaming:

Use Cases for Apache Kafka and Event Streaming

TL;DR: Kafka provides capabilities for various use cases, not just integration scenarios. Many new innovative business applications are built with Kafka. It is not just an integration platform but a unique suite of capabilities for end-to-end data processing in real-time at scale.

New concepts like Data Mesh also prove the evolution. The basic principles are not unique: Domain-driven design, microservices, true decoupling of services, but now with much more focus on data as a product. The latter means data turns from a cost center into a profit center and enables innovative new services. Event streaming is a critical component of a data mesh-powered enterprise architecture, as real-time almost always beats slow data across use cases.

The Non-Existing Event Streaming Gartner Magic Quadrant or Forrester Wave

Unfortunately, a Gartner Magic Quadrant or Forrester Wave for Event Streaming does NOT exist today. While some event streaming solutions fit into some of these reports (like the Gartner iPaaS Magic Quadrant or the Forrester Wave for Streaming Analytics), it is still an apples-to-oranges comparison.

Event Streaming is its own category. Many software vendors built their entire business around this category. Confluent is the leader in this space – note that I am biased as a Confluent employee, but I guess there is no question around this statement 🙂 Many other companies emerge around Kafka, around related offerings using the Kafka protocol, or around competitive event streaming offerings such as Amazon Kinesis or Apache Pulsar.

The following Event Streaming Landscape 2021 summarizes the current status:

Event Streaming Landscape with Apache Kafka Pulsar Kinesis Event Hubs

I hope to see a Gartner Magic Quadrant for Event Streaming and a Forrester Wave for Event Streaming soon, too.

Open Source vs. Partially Managed vs. Fully-Managed Event Streaming

One more aspect to point out: You might have noticed that I said, “some event streaming solutions can be considered an iPaaS”. The word “some” is a crucial detail. Just providing an open-source framework is not enough.

iPaaS requires a complete offering, ideally as fully-managed services. Many vendors for event streaming use Kafka, Pulsar, or another framework but do NOT provide a complete offering with operations tooling, commercial 24/7 support, user interfaces, etc. The following resources should help you learn more about the event streaming landscape in 2021:

TL;DR: Evaluate the various offerings. A lot of capabilities are just marketing! Many “fully-managed services” are only partially managed instead of serverless, and come with very limited SLAs and support. Some other offerings provide plenty of cool features but are more an alpha version and overselling than a mature, battle-tested solution. A counterexample is the complexity in T-Mobile’s report about upgrading Amazon MSK. This shows the difference between “promoting and selling a fully managed service” and the “not at all fully-managed reality”. A truly fully-managed offering does NOT require the end user to upgrade the infrastructure.

Kafka as Event Streaming iPaaS at Deutsche Bahn (German Railway)

Let’s now look at a practical example to understand why a traditional iPaaS cannot help in use cases that require event streaming and why this combination of capabilities in a single technology defines a new software category.

This section explores a real-world use case with the journey of Deutsche Bahn (German Railway) providing a great experience to its customers. This example uses Event Streaming as an iPaaS (per my definition of these terms).

Use Case: Improve Customer Experience with Real-time Notifications

The use case sounds very simple: Improve the customer experience by providing real-time information and notifications to customers across various channels like websites, smartphones, and displays at train stations.

Delays and cancellations happen in a complex rail network like Germany’s. Frequent travelers accept this downside. Nobody can do anything about lousy weather, suicides on the tracks, or technical defects.

However, customers at least expect real-time information and notifications so that they can wait in a coffee shop or lounge instead of freezing at the station platform for minutes or even hours. The reality at Deutsche Bahn used to be a frustrating sequence of notifications: 10min delay, 20min delay, 30min delay, train canceled – take the next train.

The goal of the project Reisendeninformation (= traveler information system) was to improve the day-to-day experience of millions of travelers across Germany by delivering up-to-date, accurate, and consistent travel information in any location.

Initial Project: A mess of different Integration Technologies

Again, the use case sounds simple. Let’s send real-time notifications to customers if a train is delayed or canceled. Every magic black-box iPaaS can do this:

Deutsche Bahn Reisendeninformation Kafka

Is this true or just the marketing of all the integration vendors? Can each black box integrate all the different systems to correlate events in real-time?

Deutsche Bahn started with an open-source messaging platform to send real-time notifications. Unfortunately, the team quickly found out that not every piece of information was coming in in real-time. So, a caching and storage system was added to the project to handle the late-arriving information from some legacy batch or file-based systems. Now, the data from the legacy systems needed to be integrated. Hence, an integration framework was installed to connect to files, databases, and other applications. Now, the data needed to be processed, correlating real-time and non-real-time data from various systems. A stream processing engine can do this.

The pilot project included several different frameworks. A few skeptical questions came up:

  • How to scale this combination of very different technologies?
  • How to get end-to-end support across various platforms?
  • Is this cost-efficient?
  • What is the time-to-market for new features?

Deutsche Bahn re-evaluated their tech stack and found out that Apache Kafka provides all the required capabilities out-of-the-box within one platform.

The Migration to Cloud-native Kafka

The team at Deutsche Bahn re-architected their pilot project. The new solution leverages Kafka as the single point of truth between various systems, technologies, and communication paradigms.

A traditional iPaaS can implement this scenario, too – but only with several codebases, technologies, and clusters, even if you select a single software vendor! Some iPaaS might even do well in the beginning but struggle to scale up. Only event streaming allows you to start small and scale up without the need to re-architect the infrastructure.

Today, the project is in production. Check your DB Navigator mobile app to get real-time updates about all trains in Germany. Living in Germany, I appreciate this new service, as it provides a much better traveling experience.

Learn more about the project from the Deutsche Bahn team. They gave several public talks at different conferences and wrote on the Confluent Blog about their Kafka journey. Though, the journey did not end here 🙂 As described in their blog post, Deutsche Bahn is now evaluating the migration from a self-managed Confluent Platform deployment in the public cloud to the fully-managed, truly serverless Confluent Cloud offering to reduce TCO and improve time-to-market.

Complete Platform: Kafka is more than just Messaging

A project like the one described above is only possible with a complete platform. Many people still think about Kafka as an ingestion layer into a data lake or data warehouse, as this was one of the first prominent Kafka use cases. Data ingestion is still an excellent use case today. Many projects already use more than just the core of Kafka to implement this. Kafka Connect provides out-of-the-box connectivity between Kafka and the data store. If you are in the public cloud, you even get this integration in a fully-managed, serverless manner, whether you need to integrate with a 1st-party cloud service like Amazon S3, Google Cloud BigQuery, Azure Cosmos DB, or a 3rd-party SaaS like MongoDB Atlas, Snowflake, or Databricks.

Continuous Kafka-native stream processing is the next level of a complete platform. For instance, Deutsche Bahn leverages Kafka Streams a lot for their real-time data correlations at scale. Other companies use ksqlDB as a fully-managed function in Confluent Cloud. The enormous benefit is that you don’t need yet another platform or service for streaming analytics. A complete platform makes the architecture more cost-efficient, and end-to-end integration is easier from an SLA and support perspective.
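As a simplified illustration of such a real-time correlation (not Deutsche Bahn's actual code; the topics, keys, and string payloads are invented), a Kafka Streams stream-table join can enrich each delay event with the latest schedule information for that train:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Joined;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class TravelerInformationCorrelation {

    public static void buildTopology(StreamsBuilder builder) {
        // Continuous stream of delay events, keyed by train number (hypothetical topic).
        KStream<String, String> delays =
            builder.stream("train-delays", Consumed.with(Serdes.String(), Serdes.String()));

        // Changelog of the current schedule, also keyed by train number (hypothetical topic).
        KTable<String, String> schedule =
            builder.table("train-schedule", Consumed.with(Serdes.String(), Serdes.String()));

        // Correlate each delay event with the latest schedule entry for that train and
        // publish one notification event that apps, websites, and displays can consume.
        delays
            .join(schedule,
                  (delay, plannedRoute) -> "delay=" + delay + ", planned=" + plannedRoute,
                  Joined.with(Serdes.String(), Serdes.String(), Serdes.String()))
            .to("traveler-notifications", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```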

A complete platform requires many additional services “on top”, like visualization of data flows, data governance, role-based access control, audit logs, self-service capabilities, user interfaces, visual coding/code generation tools, etc. Visual coding is the point where traditional middleware and iPaaS tools are stronger today than event streaming offerings.

3rd Party integration via Open API and non-Kafka Tools

So far, you learned why event streaming is its software category and how Deutsche Bahn is an excellent example to show this. However, event streaming is NOT the silver bullet for every problem! When exploring if Kafka and MQ/ETL/ESB are friends, enemies, or frenemies, I already pointed this out. For instance, MQ or an ESB can complement event streaming in an integration project, depending on your project requirements.

Let’s go back to Deutsche Bahn. As mentioned, their real-time traveler information platform is live, with Kafka as the single point of truth. Recently, Deutsche Bahn announced a partnership with Google and 3rd Party Integration with Google Maps:

Deutsche Bahn Google Maps Open API Integration

Real-time Schedule Updates to 3rd Party Google Maps API

The integration provides real-time train schedule updates to Google Maps users:

Google Maps Deutsche Bahn Real-time Integration

The integration allows Deutsche Bahn to reach new people and expand the business. Users can buy train tickets via one click from the Google Maps page.

I don’t know what technology or product this 3rd party integration uses. The heart of Deutsche Bahn’s real-time infrastructure enables new innovative business models and collaboration with partners.

Likely, this integration between Deutsche Bahn and Google Maps does not directly use the Kafka protocol (even though this is done sometimes, for instance, see Here Technologies Open API for their mapping service).

Event Streaming is complementary to other services. In this example, the project team might have used an API Management platform to provide internal APIs to external consumers, including access control, billing, and reporting. The article “Apache Kafka and API Management / API Gateway – Friends, Enemies or Frenemies?” explores the relationship between event streaming and API Management.

Event Streaming Everywhere – Cloud, Data Center, Edge

Real-time beats slow data everywhere. This is a new software category because we don’t just send events into another database via a messaging system. Instead, we use and correlate data from different data sources in real-time. That’s the real added value and game changer in innovative projects.

Hence, event streaming must be possible in every location. While cloud-first is a viable strategy for many IT projects, edge and hybrid scenarios are and will continue to be very relevant.

Think about a project related to the Deutsche Bahn example above (but completely fictional): A hybrid architecture with real-time applications in the cloud and edge computing within the trains:

Hybrid Architecture with Kafka at the edge and in the cloud

I covered this in other articles, including “Edge Use Cases for Apache Kafka Across Industries“. TL;DR: Leverage the open architecture of event streaming for real-time data processing everywhere, including multi-cloud, data centers, and edge deployments (i.e., outside a data center). The enterprise architecture does not need various technologies and products to implement real-time data processing and integration with separate iPaaS, ETL tools, ESBs, MQ systems.

However, once again, it is crucial to understand how event streaming fits into the enterprise architecture. For instance, Kafka is often combined with IoT technologies such as MQTT for the last mile integration with IoT devices in these edge scenarios.

Slide Deck and Video for “Apache Kafka vs. Cloud-native iPaaS Middleware”

Here are the related slides for this topic:

And the video recording of this presentation:

Kafka is a cloud-native iPaaS, and much more!

Kafka is the new black for integration projects across industries because of its unique combination of capabilities. Some Kafka solutions are part of the iPaaS category, with trade-offs like any other integration platform.

However, event streaming is its own software category. Hence, iPaaS is just one usage of Kafka or other similar event streaming platforms. Real-time data beats slow data. For that reason, event streaming is the backbone for many projects to process data in motion (but also integrate with other systems that store data at rest for reporting, model training, and other use cases).

How do you leverage event streaming and Kafka as an integration platform? What projects did you already work on or are in the planning? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Kappa Architecture is Mainstream Replacing Lambda
Thu, 23 Sep 2021 - https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/

Real-time data beats slow data. That’s true for almost every use case. Nevertheless, enterprise architects build new infrastructures with the Lambda architecture that includes separate batch and real-time layers. This blog post explores why a single real-time pipeline, called Kappa architecture, is the better fit. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa but also show how batch processing fits into this discussion positively without the need for Lambda.

This post is heavily inspired by Jay Kreps’ article “Questioning the Lambda Architecture” from 2014 (!) and maps his thoughts to the real-world situation in 2021. Today, almost every business solution, data storage and analytics provider, and business application leverages event streaming and asynchronous, truly decoupled event-based communication paradigms for data processing. For that reason, many move from Lambda to Kappa architectures.

Kappa Architecture vs Lambda Architecture for Apache Kafka Pulsar Data Lakes

A Modern Enterprise Architecture

A modern enterprise architecture offers cloud-native characteristics: Flexibility, elasticity, automation, true decoupling between different applications, and real-time capabilities (where needed).

Microservices, Data Mesh, and Domain-driven Design for True Decoupling

Let’s quickly explore the buzzwords to understand how most people build modern enterprise architectures today:

  • Domain-driven Design (DDD) enforces strict boundaries between service communication and a decentralized application landscape.
  • Microservices enable building flexible, decoupled applications with different programming languages and communication paradigms.
  • Data Mesh allows you to architect services around data. Data is the product in a data mesh. Self-service capabilities and federation enable business units to focus on their business problem.

My blog post “Microservices, Apache Kafka, and Domain-Driven Design” explored this discussion in more detail (even though the buzzword “data mesh” did not exist at the time of writing). TL;DR: An event-driven streaming infrastructure such as Apache Kafka uniquely enables proper decoupling and real-time data processing, contrary to traditional web service / REST / HTTP-based microservice architectures and traditional messaging systems (MQ, ESB). The blog post about Kafka vs. MQ/ETL/ESB might also be helpful to learn more.

Real-time Data Beats Slow Data, but NOT Always!

Think about your industry, business units, problems you solve, and innovative new applications you build. Real-time data beats slow data. This statement is almost always true. Either to increase revenue, reduce cost, reduce risk, or improve the customer experience.

Data at rest means storing data in a database, data warehouse, or data lake. This way, data is processed too late in many use cases – even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) work very well with this approach. But real-time beats batch in almost all other use cases.

I analyzed the relation between data at rest and data in motion and how this point of view regarding the enterprise architecture changed with the cloud-first strategy of most companies in the blog post “Serverless Kafka in a Cloud-native Data Lake Architecture“.

The de facto standard for real-time data processing is Apache Kafka. Hence, the covered real-world examples in this post use Kafka.

With this context in mind, let’s revisit Lambda architecture.

The Lambda Architecture

Nathan Marz coined the Lambda architecture: A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods.

Lambda architecture includes batch, speed, and serving layers. This approach enables processing data in real-time but also easy re-processing of batched static datasets. The problem with out-of-order data is also solved.

This approach attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data while simultaneously using real-time stream processing to provide views of online data. The rise of lambda architecture is correlated with the steady growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.

Two Options for a Lambda Architecture

The web discusses two different approaches to Lambda architecture.

The initial approach provided a unified serving layer. A unified serving layer joins the real-time and batch layer:

Lambda Architecture with Unified Serving Layer

Another alternative is two separate serving layers. One layer is for real-time consumption, the other one for batch consumption:

Lambda Architecture with Two Separate Serving Layers

I see the second option much more in the field. In the end, both have the same concept of building two separate layers for data ingestion and processing.

Issues with the Lambda Architecture

The Hadoop vendors heavily pitched the Lambda architecture to deploy and operate a super complex infrastructure with many big data frameworks. Today, I only hear the pain of enterprises complaining about this complexity and the missing business value. No surprise that most of these vendors did not survive or have a very confusing and unclear future product strategy.

Disney has summarized the concerns with the Lambda architecture on one slide:

Disney Concerns with the Lambda Architecture

The batch and streaming sides each require a different codebase that must be maintained and kept in sync so that processed data produces the same result from both paths. Additionally, with batch, speed, and serving layers, everything needs to be processed (at least) twice. That increases the cost and operations efforts of storage, network, and compute.

Jay Kreps had similar arguments when he proposed the Kappa architecture in 2014 (!), already: “The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be”.

So, what’s different in Kappa architecture?

The Kappa Architecture

The Kappa architecture is a software architecture that is event-based and able to handle all data at any scale in real-time for transactional AND analytical workloads.

The central premise behind the Kappa architecture is that you can perform both real-time and batch processing with a single technology stack. The heart of the infrastructure is a streaming architecture. First, the event streaming platform log stores incoming data. From there, a stream processing engine processes the data continuously in real-time or ingests the data into any other analytics database or business application via any communication paradigm and speed, including real-time, near real-time, batch, or request-response.

Kappa Architecture with one Pipeline for Real Time and Batch

Unlike the Lambda Architecture, in this approach, you only do re-processing when your processing code changes, and you need to recompute your results. And, of course, the job doing the re-computation is just an improved version of the same code, running on the same framework, taking the same input data.

Benefits of the Kappa architecture

The Kappa architecture has several benefits:

  • Handle all the use cases (streaming, batch, RPC) with a single architecture
  • One codebase that is always in sync
  • One set of infrastructure and technology
  • The heart of the infrastructure is real-time, scalable, and reliable
  • Improved data quality with guaranteed ordering and no mismatches
  • No need to re-architect for new use cases

TL;DR: The Kappa architecture leverages a single source of truth focusing on simplicity in the enterprise architecture. People can develop, test, debug, and operate their systems on a single processing framework for BOTH real-time and batch systems. To be clear: The leading system for some applications can still be another system. For instance, the leading system for ERP is still SAP, while the source of truth for consumers is the Kafka log.

Kappa for Transactional and Analytical Workloads

Contrary to a data lake, event-streaming-powered Kappa architectures enable transactional workloads in addition to analytical workloads.

Kappa Architecture for Transactional and Analytical Workloads powered by Apache Kafka

For instance, Kafka and its ecosystem support exactly-once semantics so that you can build your next payment platform for aftersales or customer interactions with mission-critical SLAs, low latency, and fault-tolerance built-in. Independently, the data science team consumes historical events for finding insights in a batch process using machine learning.
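As a reference point, here is a minimal sketch of what exactly-once writes look like on the producer side with Kafka's transactional API. The broker address, topic names, and payloads are placeholders:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ExactlyOncePaymentProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotence plus a transactional ID enable exactly-once writes across topics and partitions.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payment-producer-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("payments", "customer-42", "{\"amount\": 99.90}"));
                producer.send(new ProducerRecord<>("payment-audit", "customer-42", "charged"));
                producer.commitTransaction(); // both records become visible atomically
            } catch (Exception e) {
                producer.abortTransaction(); // neither record is exposed to read_committed consumers
                throw e;
            }
        }
    }
}
```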

Kappa is NOT a free lunch!

The Kappa architecture sounds too good to be true? Well, a basic rule of thumb is still valid: Use the right tool for the job!

Event streaming is a paradigm shift. A big bang migration will not work. Here are a few lessons learned from Disney about introducing the Kappa architecture:

Challenges of the Kappa Architecture

As a big bang does not work, a good way is to rethink data and databases. Martin Kleppmann called it “turning the database inside out“. Let’s look at this approach and how it helps to leverage the Kappa architecture in combination with other databases and analytics platforms.

The Inside and Outside Perspective to Solve the Kappa Challenges

Turning the database inside out is a new way of thinking about the enterprise architecture. The heart of the infrastructure is event-based and real-time. Where needed, downstream systems consume the events in batch or store them in additional storage and analytics tools with their own concepts and paradigms.

The inner perspective of Kappa: The central nervous system

Think of an event streaming platform like Kafka:

  • Data availability/retention: Compacted Topics, Tiered Storage
  • Data consistency and fault-tolerance: Exactly-once semantics, Multi-Region Clusters, Cluster Linking
  • Handling late-arriving data: Event time and processing time are different. State management in the streaming application, proper data sinks, replay with guaranteed ordering, and timestamps.
  • Data reprocessing and backfill: Dynamic clusters (ideally a serverless cloud offering or at least a cloud-native self-managed cluster), stateful applications (Kafka Streams, ksqlDB, external stream processing framework like Apache Flink).
  • Data integration: Kafka Connect for sources and sinks, clients for any language, REST Proxy (real-time, but also batch and RPC)

An event streaming platform provides many characteristics to build a Kappa architecture. However, it is not a silver bullet. Additional databases and analytics tools are mandatory for some use cases. For instance, Kafka does not scale well for dynamic bursty workloads. Complex SQL queries and joins also need another database.

The outer perspective of Kappa: The applications and data stores

Think of any business application, data storage, or analytics platform:

  • Data Consumption: Consume the data from the central nervous system. Consume the data at your speed (real-time, near real-time, batch, RPC).
  • Data Storage: Store the data in your storage as long as you need it (in-memory, short-term storage, long-term storage).
  • Data Processing: Process the data for your use case (real-time notification, indexing into your query engine, a batch process for reporting or model training, etc.). Complex processing is not doable in the event streaming platform (e.g., complex joins, intensive compute with batch algorithms).

The discussion “Can Apache Kafka be used as a database?” is also helpful to understand both perspectives and the trade-offs of using Kafka as data storage.

Cost-Efficient and Scalable Kappa Architectures

A huge problem with realizing the Kappa architecture in the real world was storing vast volumes of data in an event streaming platform. This approach was costly and had scalability issues at the terabyte or petabyte scale. In contrast, data lakes were designed for vast volumes from the beginning. Hadoop and HDFS were used on-premise in the early phases. The public cloud enabled the migration to fully-managed object storage such as AWS S3 or Google Cloud Storage to make data lakes even more scalable and cost-efficient for big data.

One approach is to reduce the data stored in the event streaming platform. Infinite retention leveraging log compaction is a viable approach to reduce the storage size. However, compacted topics shrink data sets and only store the latest value for each message key. Hence, this workaround is not applicable for every use case.
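For reference, a compacted topic is a regular Kafka topic with a different cleanup policy. Here is a minimal sketch using the Kafka Admin API; the topic name, partition count, and replication factor are placeholders:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            NewTopic customerState = new NewTopic("customer-state", 6, (short) 3)
                // Keep only the latest value per key instead of deleting data by age.
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));

            admin.createTopics(List.of(customerState)).all().get();
        }
    }
}
```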

Another workaround I have seen a lot in practice is building a “streaming data lake” with Kafka as a streaming layer and object storage for long-term storage. The bi-directional integration was built with Kafka Connect and sink and source connectors. This was actually the main reason why Confluent built an S3 Source connector for Kafka Connect in addition to its heavily used S3 Sink connector.

Tiered Storage for Event Streaming

The good news is that streaming platforms evolved. Tiered Storage allows decoupling storage from computing in event streaming platforms such as Kafka or Pulsar.

Tiered Storage is a game-changer for Kappa architectures. It manages the storage without a performance impact on real-time consumers. Additionally, this enables a very cost-efficient and elastic Kappa architecture without the need for a traditional data lake. Uber talks about the motivation and benefits of Tiered Storage for Kafka (KIP-405) in a recent Kafka Summit talk.

Kappa architectures are very flexible regarding the underlying storage technology. While Uber uses Hadoop’s HDFS as storage, Confluent went another way: Confluent Tiered Storage for Kafka is based on the S3 interface to leverage object storage and works for both public cloud provider object stores such as AWS S3 or GCS, and on-premise object stores such as PureStorage or MinIO for Kubernetes.

Confluent Tiered Storage for Kafka for Digital Forensics of Historical Data

In other words: Tiered Storage for Kafka can leverage the same modern data storage as modern cloud data lakes (or as AWS calls it today: Lake House). Hence, the Kappa architecture provides the best of both worlds: Real-time data processing plus cost-efficient and scalable long-term storage for replaying historical data.

Real-World Examples for a Kappa Architecture

The above was a lot of theory. Let’s recap: Real-time data beats slow data in most use cases. But batch processing is still needed and will not go away.

Let’s now look at a few real-world examples of Kappa architectures at Uber, Shopify, and Disney.

Kappa at Uber for Trillions of Messages and Petabytes per Day

Uber is a very prominent tech giant. They regularly talk about their software architectures and deployments in public. Uber is one of the most significant Kafka users in the world. In the meantime, they process over four trillion messages and three petabytes of data per day.

As a perfect fit for this blog post, Uber presented at a recent Kafka Summit about their Kappa architecture:

Kappa instead of Lambda Architecture with Kafka at Uber

As you can see, Uber’s architecture evolved precisely to what I described in the above sections. The central nervous system is a Kafka-based real-time infrastructure. Uber still has batch pipelines. Uber also provides APIs (e.g., to mobile apps). And – no surprise – they also have traditional SQL and NoSQL databases, business intelligence reporting tools, dashboards, and much more.
Uber’s architecture shows the massive benefits of Kappa: The heart of the infrastructure is real-time, scalable, fault-tolerant, and reliable. A single pipeline for everything. No need for a Lambda architecture! Kappa enables transactional and analytical workloads. Each microservice in the data mesh can use its technology and communication paradigm for each application.

Kappa at Shopify for Stateless and Stateful Data Streaming

Shopify presented their Kappa architecture in a recent Kafka Summit talk: “It’s Time To Stop Using Lambda Architecture”. The session covered the concerns of Kappa architecture and how Shopify solved them with different building blocks. The three key components are the log (Kafka), the streaming framework (Kafka Streams and Apache Flink), and the data sinks (any real-time consumer or data store).

Here is one example of a stateful Kappa scenario at Shopify:

Kappa Architecture with Kafka at Shopify

 

Shopify discussed the core building blocks of their Kappa architecture:

The Log (Kafka)

  • Durability with Topic Compaction and Tiered Storage
  • Consistency via Exactly-Once Semantics (EOS)
  • Data Integration via Kafka Connect
  • Elasticity via dynamic Kafka clusters

Streaming Framework (Kafka Streams / Flink)

  • Reliability and scalability
  • Fault tolerance
  • State management

Data Sinks

  • Real-time consumers
  • Update/upsert for simplified design, for instance, RDBMS, NoSQL, Compacted Kafka Topics
  • Append-only storage (i.e., no update), for instance, regular Kafka Topics, Time Series databases

Kappa at Disney as Single Source of Truth

Disney’s Kafka Summit talk “Big Data Kappa” is very inspiring. It probably includes the most lessons learned and trade-offs of a real-world Kappa deployment. I encourage you to watch the on-demand video—many insights and guidance for building your own Kappa Architecture.

All data writes at Disney go through Kafka as the source of truth. The following screenshot shows the concept. The green box is the Kafka cluster, including Tiered Storage as the single source of truth. Any application consumes the data from Kafka for further processing and optional external storage.

Kappa Architecture with Apache Kafka at Disney

Disney solves the following problems with its Kafka-based Kappa architecture:

  • Keep it simple (KISS)
  • Reduce Code Duplication
  • Decreasing End To End Latency
  • Full System Immutability
  • Avoiding Data Discrepancies
  • Ability to move laterally between storage systems
  • Everyone wants their answers faster

Kappa at Twitter for Migration from Lambda Architecture

Twitter processes approximately 400 billion events in real-time and generates petabyte (PB) scale data every day. The on-premise architecture with Hadoop and Kafka using the Lambda architecture was not efficient enough:

Old Twitter Lambda Architecture with Hadoop and Kafka

Therefore, Twitter migrated to the cloud on GCP with Kafka using the Kappa architecture:

New Twitter Kappa Architecture with Kafka and GCP Dataflow

With the new hybrid architecture on both Twitter Data Center and Google Cloud Platform, they “are able to process billions of events in real-time and achieve low latency, high accuracy, stability, architecture simplicity, and reduced operation cost for engineers” as Twitter quotes in their detailed blog post about their Lambda to Kappa migration.

Example Project: Kappa for Machine Learning including Model Training, Scoring, and Monitoring

After real-world examples from Uber, Shopify, and Disney, I want to share one more practical code example: A technical demo connecting to 100,000 IoT devices to do streaming machine learning.

The use case is about integrating tens or hundreds of thousands of IoT devices and processing the data in real-time. The demo use case is predictive maintenance (i.e., anomaly detection) in a connected car infrastructure to predict motor engine failures:

Kappa Architecture with Apache Kafka MQTT Kubernetes and Tensorflow for Streaming Machine Learning

The implemented Kappa architecture provides a single real-time infrastructure for various very different use cases and processing paradigms:

  • Real-time data ingestion at high throughput from IoT devices via an MQTT proxy: Integration with millions of interfaces, in this case, simulated vehicles.
  • Batch processing for model training: The TensorFlow Python application from the data scientist consumes historical data from the Kafka log to train analytic models.
  • Real-time stream processing for model scoring: The Java-based streaming application is powered by Kafka Streams / ksqlDB and operated by the production engineer with mission-critical SLAs and low latency.
  • Near real-time ingestion into the digital twin for analytics: Kafka Connect ingests the data into different databases and applications, in this case, a MongoDB Atlas cloud service.
  • Synchronous request-response / RPC communication for mobile app integration and transactional workloads: The Confluent REST Proxy (or any other web / mobile proxy) sends real-time alerts to humans.

The whole infrastructure is cloud-native. It runs on Kubernetes and can be deployed in a data center or on any hyperscaler. The following blog post explains the demo in detail: IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow. The code is available in the GitHub project.
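In the demo, the training side is a Python/TensorFlow application; the replay mechanism itself is plain Kafka consumer functionality. Here is a hedged Java sketch of that pattern: a consumer reads the complete history of a topic from the earliest offset as input for a batch training job. The topic name, group ID, and the naive end-of-history check are assumptions, not part of the original demo:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class HistoricalReplayForTraining {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "model-training-replay");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Start from the earliest retained offset so the complete history is replayed.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sensor-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                // Naive end-of-history check for this sketch; a real job would compare against end offsets.
                if (records.isEmpty()) {
                    break;
                }
                for (ConsumerRecord<String, String> record : records) {
                    // Feed each historical event into the (offline) training pipeline.
                    train(record.value());
                }
            }
        }
    }

    // Placeholder for handing the event to the actual model training job.
    private static void train(String event) {
        System.out.println("training sample: " + event);
    }
}
```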

Kappa under the Hood of Next Generation Software Products and SaaS Offerings

Software companies have the same challenges as end-users like Uber, Shopify, or Disney. Hence, no surprise that software vendors move to Kappa architectures and real-time capabilities as the heart of their infrastructures.

This section shows a few examples of software vendors that moved to event-based architectures, event streaming, and asynchronous external interfaces within their next-generation software offerings.

Once again: This does NOT mean that everything within these products is real-time or event-based. But if the related components provide real-time capabilities, then you can provide a real-time interface for internal or external consumers.

Business Solutions (Salesforce, SAP, Slack, et al.)

Business solutions provide customer interactions, logistics, manufacturing, internal communication, and many other use cases. No surprise that real-time data beats slow data. For this reason, most modern business solutions moved from less flexible and less scalable communication paradigms to event-based interfaces. Instead of using files, web service APIs, or manual changes, communication happens via event-driven APIs internally and externally.

A few examples across different business solutions:

  • Salesforce: The internal “platform events” architecture heavily relies on Apache Kafka for decoupled real-time data processing at scale. External APIs like the integration with Salesforce’s proprietary sObject datatype moved from SOAP and REST web service to Streaming API PushTopics, Enterprise Messaging Platform Events, and Change Data Capture Events.
  • SAP: Instead of relying on its legacy proprietary interfaces such as BAPI and iDoc, SAP moved to event-based APIs in their next-generation SAP S/4 Hana ERP platform. The blog “SAP integration options for Apache Kafka” shows the mess of numerous legacy interfaces and alternative modern event-based integration options.
  • Slack: Being a messaging platform by nature, it is no surprise that the heart of their core backend infrastructure leverages event streaming. Slack’s data streaming team focuses on providing Kafka as a Service for the company at the scale of trillions of messages per day across dozens of clusters in Amazon data centers. For the front-end, Slack’s current architecture leverages a service mesh built with Envoy and WebSockets.

Databases, Data Warehouses, Log Analytics

Data storage and analytics vendors traditionally provide batch technologies for long-term storage, dashboards, reporting, and interactive queries. The heart of most solutions is still a batch system for analytics workloads. That’s the core business of these products and services.

Nevertheless, almost all of these vendors went into (near) real-time business due to customer demand. Hence, event-based integration capabilities and near real-time ingestion, processing, and analytics are becoming more prevalent. Some examples:

  • MongoDB: “Change Streams” allow applications to access real-time data changes from the document-based NoSQL datastore.
  • Snowflake: “Snowpipe” can help organizations seamlessly load continuously generated data into the cloud data warehouse.
  • Elasticsearch: “Data Streams” lets you store append-only time series data across multiple indices while giving you a single named resource for requests. Data streams are well-suited for logs, events, metrics, and other continuously generated data to ingest data into the Elastic search engine.

These solutions have in common that they move from batch to near real-time ingestion into their data store or data lake. Nevertheless, they still store and analyze data at rest. Hence, this is complementary but not an alternative to event streaming.

New entrants into the market try to differentiate from the above data storage vendors by providing a real-time infrastructure at its core. A great example is Rockset, a scalable real-time analytics platform in the cloud. As it is a native real-time solution, Rockset natively integrates with event streaming platforms such as Apache Kafka.

Event Streaming

Event Streaming platforms are event-based by nature. They process data in motion continuously. Therefore, the central nervous system of a Kappa architecture has to be an event streaming platform. Period.

For a comparison of frameworks like Kafka and Pulsar, plus reviewing the differentiators from platform vendors and SaaS providers such as Confluent, Cloudera, Red Hat, Amazon MSK, Azure Event Hubs, etc., please check out this comparison of event streaming platforms.

Event streaming will be one serverless component in a cloud-native data lake architecture in many future enterprise architectures.

It is worth noting that event streaming and the above-discussed business solutions and data storage and analytics vendors are complementary, not competitive! For instance, Confluent partners with business solutions such as Salesforce, database vendors such as MongoDB and Elastic, data-warehouses such as Snowflake, and cloud providers such as AWS or Azure to provide Source, Sink, and Change Data Capture (CDC) connectors powered by Kafka Connect. The fully managed Confluent Cloud service even provides the end-to-end integration as part of the serverless offering in the public cloud.

Video Recording: Kappa vs. Lambda Architecture

I covered the discussion around “Kappa vs. Lambda” in a 40-minute video recording, too. Enjoy:

Kappa is the New Black for the Enterprise Architecture

Real-time data beats slow data. After reading this article, think about your industry, business unit, and projects again. If real-time data processing improves your customer experiences, increases your revenue, or reduces your cost and risk, then why wait? The Kappa architecture provides enormous benefits and a much simpler infrastructure than the Lambda architecture.

Having said this, batch processing and other data storage and analytics services are not going away. Kappa and event streaming are complementary, and no silver bullet for every problem. For more details, check out the article “Can Apache Kafka replace a database?” – that article emphasizes this statement and explores the trade-offs.

Event streaming is the foundation of Kappa architecture. There is no way around this. Apache Kafka is the de facto standard for event streaming and the choice in real-world Kappa architectures. If you still need or want to evaluate your own event streaming platform, continue with the Kafka vs. Pulsar comparison or the general comparison of competitive event streaming vendors and cloud solutions.

Do you already use the Kappa architecture? Or do you still rely on or even prefer Lambda architectures? What are your experiences and opinions? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kappa Architecture is Mainstream Replacing Lambda appeared first on Kai Waehner.

Kafka for Cybersecurity (Part 3 of 6) – Cyber Threat Intelligence https://www.kai-waehner.de/blog/2021/07/15/kafka-cybersecurity-siem-soar-part-3-of-6-cyber-threat-intelligence/ Thu, 15 Jul 2021 06:39:35 +0000 https://www.kai-waehner.de/?p=3555 This blog series explores use cases and architectures for Apache Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part three: Cyber Threat Intelligence.

The post Kafka for Cybersecurity (Part 3 of 6) – Cyber Threat Intelligence appeared first on Kai Waehner.

Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part three: Cyber Threat Intelligence.

Cyber Threat Intelligence with Apache Kafka and SIEM SOAR Machine Learning

Blog series: Apache Kafka for Cybersecurity

This blog series explores why security features such as RBAC, encryption, and audit logs are only the foundation of a secure event streaming infrastructure. Learn about use cases,  architectures, and reference deployments for Kafka in the cybersecurity space:

Subscribe to my newsletter to get updates immediately after the publication. Besides, I will also update the above list with direct links to this blog series’ posts as soon as they are published.

Cyber Threat Intelligence

Threat intelligence, or cyber threat intelligence, reduces harm by improving decision-making before, during, and after cybersecurity incidents, reducing operational mean time to recovery and reducing adversary dwell time in information technology environments.

Threat intelligence is evidence-based knowledge, including context, mechanisms, indicators, implications, and action-oriented advice about an existing or emerging menace or hazard to assets. This intelligence can be used to inform decisions regarding the subject’s response to that menace or hazard.

Threat intelligence solutions gather raw data about emerging or existing threat actors & threats from various sources. This data is then analyzed and filtered to produce threat intel feeds and management reports that contain information that automated security control solutions can use.

Threat intelligence keeps organizations informed of the risks of advanced persistent threats, zero-day threats and exploits, and how to protect against them.

Situational Awareness is Not Enough…

… but it is the foundation for collecting and pre-processing data in real time at scale. Only real-time situational awareness enables real-time threat intelligence to provide huge benefits to the enterprise:

  • Mitigate harmful events in cyberspace
  • Proactive cybersecurity posture that is predictive, not just reactive
  • Bolster overall risk management policies
  • Improved detection of threats
  • Better decision-making during and following the detection of a cyber intrusion

In summary, threat intelligence allows to:

  • See the whole board. And see it more quickly.
  • See around corners.
  • See the enemy before they see you.

Threat Intelligence for Prevention or Mitigation across the Cyber Kill Chain

Threat intelligence is the knowledge that allows you to prevent or mitigate cyberattacks. It covers all the phases of the so-called “Cyber Kill Chain“:

Intrusion Kill Chain for InfoSec

Threat intelligence provides several benefits:

  • Empowers organizations to develop a proactive cybersecurity posture and to bolster overall risk management policies
  • Drives momentum toward a cybersecurity posture that is predictive, not just reactive
  • Enables improved detection of threats
  • Informs better decision-making during and following the detection of a cyber intrusion

Transactional Data vs. Analytics Data

Most use cases around data-in-motion are about all the data. This is true for all transactional use cases and even for many analytical use cases. Each event is valuable: A sale, an order, a payment, an alert from a machine, etc.

However, data is often full of noise. As I discussed earlier in this blog series, the goal in the cybersecurity space is to find the needle in the haystack and to reduce false-positive alerts.

SIEM, SOAR, OT, and ICS are almost always analytic processing regimes, BUT knowing when they are not is important. Kafka topics can be configured and tuned for transactional or analytical workloads. That is unprecedented in the history of data processing. Threat intelligence (= awareness-in-motion) assumes the PATTERN is valuable, not the data.

Analytics in Motion powered by Kafka Streams / ksqlDB

As you can hopefully imagine from the above requirements and characteristics, event streaming with Apache Kafka and its streaming analytics ecosystem is a perfect fit for the technical infrastructure for threat intelligence.

Threat detection makes sense of the signal and the noise of the data by continuously processing signatures. This enables organizations to detect, contain, and neutralize threats proactively:

Threat Intelligence with Kafka Streams ksqlDB and Machine Learning

Analytics can mean many different things in such a scenario.

On a high level, the advantages of using Kafka Streams or ksqlDB for threat intelligence can be described as follows:

  • A single scalable and reliable real-time infrastructure for end-to-end data integration and data processing
  • Flexibility to write custom rules and embed other rules engines, frameworks, or trained models
  • Integration with other threat detection systems like IDS, SIEM, SOAR

The business logic for cyber threat detection looks different for every use case. Known attack patterns like MITRE ATT&CK help with the implementation. However, situational awareness and threat detection also need to detect unknown anomalies.
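To make the “flexibility to write custom rules” point a bit more concrete, here is a minimal Kafka Streams sketch that flags events whose source IP matches a known threat indicator. The topic names, serdes, and the hard-coded indicator set are illustrative assumptions only; a real deployment would load indicators from a threat intel feed and combine them with anomaly detection:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;
import java.util.Set;

public class ThreatIndicatorMatcher {
    public static void main(String[] args) {
        // Hypothetical indicator-of-compromise list; in practice this would come
        // from a threat intel feed (e.g., another Kafka topic or a lookup table).
        Set<String> knownBadIps = Set.of("203.0.113.7", "198.51.100.23");

        StreamsBuilder builder = new StreamsBuilder();
        // Assumed input topic with raw events keyed by source IP.
        KStream<String, String> events = builder.stream("network-events");

        events.filter((sourceIp, payload) -> knownBadIps.contains(sourceIp))
              .to("threat-alerts"); // downstream SIEM/SOAR consumers subscribe here

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "threat-indicator-matcher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```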

Let’s now take a look at a concrete example.

Intel’s Cyber Intelligence Platform

Let me quote Intel themselves:

As cyber threats continuously grow in sophistication and frequency, companies need to quickly acclimate to effectively detect, respond, and protect their environments. At Intel, we’ve addressed this need by implementing a modern, scalable Cyber Intelligence Platform (CIP) based on Splunk and Confluent. We believe that CIP positions us for the best defense against cyber threats well into the future.

Our CIP ingests tens of terabytes of data each day and transforms it into actionable insights through streams processing, context-smart applications, and advanced analytics techniques. Kafka serves as a massive data pipeline within the platform. It provides us the ability to operate on data in-stream, enabling us to reduce Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR). Faster detection and response ultimately leads to better prevention.

Let’s explore Intel’s CIP for threat intelligence in more detail.

Detecting Vulnerabilities with Stream Processing

Intel’s CIP leverages the whole Kafka ecosystem provided by Confluent:

  • Ingestion: Kafka producer clients for various sources such as databases, scanning engines, IP address management, asset management inventory, etc.
  • Streaming analytics: Kafka Streams for filtering vulnerabilities by business unit, joining asset ownership with vulnerable assets, etc.
  • Egress: Kafka Connect sink connectors for data lakes, IT partners, other business units, SIEM, SOAR, etc.
  • High availability: Multi-Region Clusters (MRC) for high availability across regions
  • And much more…

Here is a high-level architecture:

Stream Processing with Kafka at Intel
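As a rough illustration of the “joining asset ownership with vulnerable assets” step listed above, the following Kafka Streams sketch enriches a stream of scan findings with an owner looked up from a compacted topic. Topic names and the string-based join are assumptions for illustration; Intel has not published its implementation at this level of detail:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class VulnerabilityEnricher {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Assumed topics: scan results keyed by asset id, and a compacted topic
        // mapping asset id -> owning business unit.
        KStream<String, String> vulnerabilities = builder.stream("vulnerability-scans");
        KTable<String, String> assetOwners = builder.table("asset-ownership");

        // Enrich each finding with its owner so it can be routed per business unit.
        vulnerabilities
            .join(assetOwners, (finding, owner) -> owner + "|" + finding)
            .to("vulnerabilities-by-owner");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "vulnerability-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```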

Intel’s Kafka Maturity Timeline

Building a cybersecurity infrastructure is not a big bang. A step-by-step approach starts with integrating the first sources and sinks, some simple stream processing, and deployment as a pilot project. Over time, more and more data sources and sinks are added, the business logic gets more powerful, and the scale increases.

Intel’s Kafka maturity timeline shows their learning curve:

Intel Kafka Maturity Timeline

Kafka Benefits to Intel

Intel describes their benefits for leveraging event streaming as follows:

  • Economies of scale
  • Operate on data in the stream
  • Reduce technical debt and downstream costs
  • Generates contextually rich data
  • Global scale and reach
  • Always on
  • Modern architecture with a thriving community
  • Kafka leadership through Confluent expertise

Those are pretty much the same reasons I use in many of my other blog posts to explain the rise of data in motion powered by Apache Kafka across industries and use cases. 🙂

For more intel on Intel’s Cyber Intelligence Platform powered by Confluent and Splunk, check out their whitepaper and Kafka Summit talk.

Scalable Real-time Cyber Threat Intelligence with Kafka

Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to implement cyber threat intelligence.

The Cyber Intelligence Platform from Intel is a great example of a Kafka-powered cybersecurity solution. It leverages the whole Kafka ecosystem to build a scalable and reliable real-time integration and processing layer. The streaming analytics logic depends on the use case. It can cover simple business logic but also external rules engines or analytic models.

How do you fight against cybersecurity risks? What technologies and architectures do you use to implement cyber threat intelligence? How does Kafka complement other tools? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka for Cybersecurity (Part 3 of 6) – Cyber Threat Intelligence appeared first on Kai Waehner.

Pulsar vs Kafka – Comparison and Myths Explored https://www.kai-waehner.de/blog/2020/06/09/apache-kafka-versus-apache-pulsar-event-streaming-comparison-features-myths-explored/ Tue, 09 Jun 2020 15:32:29 +0000 https://www.kai-waehner.de/?p=2338 Pulsar vs Kafka – which one is better? This blog post explores pros and cons, popular myths, and…

The post Pulsar vs Kafka – Comparison and Myths Explored appeared first on Kai Waehner.

Pulsar vs Kafka – which one is better? This blog post explores pros and cons, popular myths, and non-technical criteria to find the best tool for your business problem.

My discussions are usually around Apache Kafka and its ecosystem as I work for Confluent. The only questions I got about Pulsar in the last years came from Pulsar committers and contributors. They asked me deep technical questions so as to be able to explain where Kafka sucks and why Pulsar is the much better option. Discussions about this topic on platforms like Reddit are typically very opinionated, often inaccurate, and brutal. The following is my point of view based on years of experience with open source streaming platforms.

Tech comparisons are the new black: Kafka vs. Middleware, Event Streaming and API Platforms

Tech comparisons are meant to guide people to choose the right solution and architecture for their business problem. There is no all-rounder, and there should be no bias. Choose the right tool for the problem.

However, technical comparisons are almost always biased. Even if the author does not work for a vendor and is an “independent” consultant, he or she is still likely to have a biased opinion from past experiences and knowledge, whether purposely or unknowingly. Still, comparisons from different perspectives are useful, and we’ve seen Apache Pulsar discussed in a few places on the internet, so I wanted to share my personal views of how Kafka and Pulsar compare. I work for Confluent, the leading experts behind Apache Kafka and its ecosystem, so keep that in mind, but the aim of this post is not to provide opinion, it’s to weigh up facts rather than myths.

Technical comparisons of open source frameworks and commercial software products happen all the time. I did several comparisons in the past on my blog or other platforms like InfoQ, including a Comparison of integration frameworks, Choosing the right ESB for your integration needs, Kafka vs. ETL / ESB / MQ, Kafka vs. Mainframe and Apache Kafka and API Management / API Gateway. All these comparisons were done because customers wanted to understand when to use which tool.

For Pulsar vs. Kafka, the situation is a little bit different.

Why compare Pulsar and Kafka?

Talking to prospects or customers, I rarely get asked about Pulsar. To be fair, this increased slightly in the last months. I guess the question comes up in every ~15th or ~20th meeting due to the overlapping feature set and use cases. However, this seems to be mostly due to a few posts on the internet that claim Pulsar is in some ways better than Kafka. There is no fact-checking and very little material, if any, for the opposing view.

I have not talked to a single organization that seriously considered deploying Pulsar in production, although I know there are a large number of users out there in the world who need a distributed messaging technology like Kafka or Pulsar. But I also think that Pulsar’s alleged reference users are not particularly accurate.

For example, their flagship user is Tencent, a large Chinese tech company, but Tencent is a huge Kafka user, whereas Pulsar’s use is limited to just one project. Tencent processes ten trillion messages per day (in digits: 10,000,000,000,000) with Kafka. As it turns out, Tencent uses Kafka 1000x more than Pulsar (ten trillion msg/day vs. tens of billion msg/day). The Tencent team discussed their Kafka deployment in more detail: How Tencent PCG Uses Apache Kafka to Handle 10 Trillion+ Messages Per Day.

Comparison of two competitive open source frameworks

Apache Kafka and Apache Pulsar are two exciting and competing technologies. Therefore, it makes a lot of sense to compare them. Period.

Both Apache Kafka and Apache Pulsar have very similar feature sets. I recommend that you evaluate both frameworks for available features, maturity, market adoption, open source tools and projects, training material, availability of local meetups, videos, blog posts, etc. Reference use cases from your industry or business problems help you make the right decision.

Confluent published such a comparison of “Kafka vs. Pulsar vs. RabbitMQ: Performance, Architecture, and Features Compared“. I was involved in creating this comparison. So we have that comparison already…

What is this blog post here about then?

I want to explore the myths from some ‘Kafka vs. Pulsar’ arguments which I see regularly in blog posts and forum discussions. Afterwards, I will give a more comprehensive comparison beyond just technical aspects because most Pulsar discussions focus purely on tech features.

Apache Kafka vs Apache Pulsar Comparison and Myths Explored

Kafka vs Pulsar – Technology myths explored

The following discusses some myths I have come across. I agree with some of them, but also counter some others with hard facts. Of course, different opinions can exist for some of these statements. Again, this is totally fine. The following is my point of view.

Myth 1: “Pulsar has differentiating built-in features compared to Kafka”?

True.

If you compare Apache Kafka to Apache Pulsar, features like its tiered architecture, queuing, and multi-tenancy are mentioned as differentiators.

But:

Kafka has many differentiating features, too:

  • Half as many servers to run
  • Data saved to disk only once
  • Data cached in memory only once
  • Battle-tested replication protocol
  • Zero copy performance
  • Transactions
  • Built-in stream processing
  • Long term storage
  • In the works: ZooKeeper removal (KIP-500), which makes Kafka even simpler to operate and deploy than Pulsar (which has a four-component architecture of Pulsar, ZooKeeper, BookKeeper, and RocksDB), apart from making Kafka more scalable, more resilient, etc.
  • In the works: Tiered Storage (KIP-405), which makes Kafka more elastic and cost-efficient.

Also ask yourself: Should you really compare just the open source frameworks or products and vendors with their complete offering?

It is easy to add new features if you don’t have to provide mission-critical support for it. Don’t just evaluate features in a checklist, but also evaluate how they are battle-tested in production scenarios. How many “differentiating features” are low-quality and implemented quickly vs. high-quality implementations?

For instance: It took a few years to implement and battle-test Kafka Streams as Kafka-native stream processing engine. Do you really want to compare this to Pulsar Functions? The latter is a feature to add user-defined functions (UDFs), without any relation to “real stream processing”. Or is this more like Single Message Transformations (SMT), a core feature of Kafka Connect? Just be sure to a) compare apples to apples (instead of apples to oranges) and b) think about the maturity of a feature. The more powerful and critical, the more mature it should be…

The Kafka community spends a large amount of effort improving the core project and its ecosystem. Confluent alone has over 200 full-time engineers working on the Kafka project, additional community components, commercial products, and the SaaS offering on major cloud providers.

Myth 2: “Pulsar has a few very big users like Tencent in China”?

True.

But: Tencent actually uses Kafka more than Pulsar. The billing department, which uses Pulsar, is only a small fraction at Tencent, whereas a large portion of the core business is using Kafka, and they have a Global-Kafka like architecture that combines 1000+ brokers into a single logical cluster.

Always be cautious with open source projects. Check out the success at “normal companies”. Just because a tech giant uses it does not mean it will work well for your company. How many Fortune 2000 companies shared their success stories around Pulsar in the past?

Look for proof points beyond tech giants!

Proof points beyond the tech giants are helpful to get insights and lessons learned from other people, not from the software vendors. The Kafka website gives many examples of mission-critical deployments. Even more impressive: At the past Kafka Summit conferences in San Francisco, New York, and London, various enterprises from different industries present their use cases and success stories every year, including Fortune 2000 companies, mid-size enterprises, and startups.

Just to give you one specific example in the Kafka world: Various different implementations exist for replication of data in real time between separate Kafka clusters, including MirrorMaker 1 (part of the Apache Kafka project), MirrorMaker 2 (part of the Apache Kafka project), Confluent Replicator (built by Confluent and only available as part of Confluent Platform or Confluent Cloud), uReplicator (open sourced by Uber), Mirus (open sourced by Salesforce), Brooklin (open sourced by LinkedIn).

In practice, only two options are reasonable if you don’t want to maintain and improve the code by yourself: MirrorMaker 2 (very new, not mature yet, but a great option mid and long term) and Confluent Replicator (battle-tested in many mission-critical deployments, but not open source). All the other options work, too. But who maintains the projects? Who solves bugs and security issues? Who do you call when you have a problem in production? Deployment in production for mission-critical deployments is different from evaluating and trying out an open source project.

Myth 3: “Pulsar provides message queuing and event streaming in a single solution”?

Partly.

Message queues are used for point-to-point communication. They provide an asynchronous communications protocol, meaning that the sender and receiver of the message do not need to interact with the message queue at the same time.

Pulsar has only limited support for message queuing, and limited support for event streaming. If it wants to compete in either area, it still has a long way to go for two reasons:

1) Pulsar has only limited support for message queuing because it misses popular messaging features like message XA transactions, routing, message filtering, etc. that are commonly used with messaging systems like IBM MQ, RabbitMQ, and ActiveMQ. Pulsar’s “adapters” for messaging systems are similarly limited. While they may look nice on paper, they are less useful in practice.

2) Pulsar has only limited support for event streaming. For example, it does not support exactly-once delivery and processing semantics, which disqualifies it for most use cases in practice – you would never implement, say, a payment processing system with Pulsar as it may cause duplicate payments, or lose payments. It also lacks functionality to perform stream processing with features like joins, aggregations, windowing, fault-tolerant state management, and event-time based processing. Pulsar’s “topics” functionality is also different to Kafka’s, and suffers from BookKeeper’s origins, as it was conceived and designed in 2008 as a write ahead log for Hadoop’s HDFS namenode, with only short-lived data storage in mind.

Side note: Pulsar’s “Kafka adapter”, like its messaging siblings, is similarly limited. While it may look nice on paper, it is less useful in practice because it supports only a small subset of Kafka functionality.

Like Pulsar, Kafka has only limited support for message queuing.

In Kafka, different workarounds can be used to realize “real queuing” behavior. If you want to use separate message queues instead of shared Kafka topics for:

  • Security? => Use Kafka’s ACLs (and optional tools like Confluent’s role-based access control aka RBAC).
  • Semantics (i.e. separate applications)? => Use Kafka’s consumer groups.
  • Load balancing? => Use Kafka’s partitions.

I typically ask customers what exactly they want to do with queuing. Often, Kafka provides out-of-the-box solutions for use cases which simply require thinking of the solution in new terms. Also, the number of high throughput use cases that need queuing is relatively small.
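As a small illustration of the consumer group workaround mentioned above: every instance of the worker below joins the same group, so Kafka assigns each partition to exactly one instance at a time and messages are load-balanced like competing consumers on a queue. Topic and group names are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class QueueLikeWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // All workers share this group id, so Kafka assigns each partition to
        // exactly one worker at a time -> queue-like load balancing.
        props.put("group.id", "order-workers");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("worker handling order %s%n", record.key());
                }
            }
        }
    }
}
```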

Having explained all these workarounds and limitations of Pulsar and Kafka for messaging, let’s be clear: Neither Kafka nor Pulsar provide a “real messaging solution”.

If you really need a messaging solution, shouldn’t you rather choose a “real messaging framework” like RabbitMQ or NATS for a messaging problem anyway?

There is no ‘yes or no’ answer to this. I see many customers replacing existing messaging systems like IBM MQ with Kafka (for scalability and cost reasons). Know the options, their trade-offs, and do an evaluation to solve your problem the best way…

Myth 4: “Pulsar provides stream processing”?

False.

Or to be fair: It depends on your definition of stream processing. Is it only rudimentary features, or full-fledged stream processing?

In one sentence, I typically explain stream processing as continuous consumption, processing, and aggregation of events from different data sources. In real time. At scale. And, of course, in a fault-tolerant manner, including (and especially) for any stateful processing operations.

Event Streaming for Continuous Analysis of Data while it is Hot

Pulsar provides only rudimentary functionality for stream processing, using its Pulsar Functions interface. This is suited for simple callbacks, but it isn’t a true stream processing offering like you get with Kafka Streams or ksqlDB for building streaming applications that include stateful information, sliding windows, and other stream processing concepts. Use cases exist in every industry. For instance, check out the Kafka Streams website for examples from the New York Times, Pinterest, Trivago, Zalando, and others.

Streaming analytics examples with Pulsar typically use Pulsar in conjunction with another “proper” stream processing framework like Apache Spark or Apache Flink, which of course means you now need to operate even more pieces of distributed infrastructure and to understand their complex interactions.

Myth 5: “Pulsar provides exactly-once semantics like Kafka”?

False.

Pulsar provides a deduplication feature that ensures that a message will not be stored in the Pulsar broker twice, but nothing prevents a consumer from reading this message multiple times. This is insufficient for any form of stream processing use case where both input and output are from Pulsar.

Also, unlike Kafka’s Transactions feature, it is not possible to accurately tie messages committed to Pulsar to state recorded inside a stream processor.

Exactly-Once Semantics (EOS) are available since Kafka 0.11 (released three years ago) and used in many production deployments. Kafka’s EOS supports the whole Kafka ecosystem, including Kafka Connect, Kafka Streams, ksqlDB and clients like Java, C, C++, Go or Python. Kafka Summit had several talks about Kafka’s EOS functionality, including this great intro for everybody, with slides and video recording.
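For reference, here is a minimal sketch of Kafka’s transactional producer API, one of the building blocks behind exactly-once pipelines; Kafka Streams applications usually just set the processing.guarantee configuration instead of hand-rolling transactions. Topic names and the transactional id are placeholders, and error handling is simplified:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalPaymentsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Enables idempotence + transactions; the id must be stable per producer instance.
        props.put("transactional.id", "payments-producer-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes become visible to read_committed consumers atomically, or not at all.
                producer.send(new ProducerRecord<>("payments", "order-42", "99.90 EUR"));
                producer.send(new ProducerRecord<>("payment-audit", "order-42", "captured"));
                producer.commitTransaction();
            } catch (Exception e) {
                // Simplified error handling for the sketch; fatal errors would require closing the producer.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```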

Myth 6: “Pulsar’s performance is much better than Kafka’s”?

False.

I am not a fan of most “benchmarks” of performance and throughput. Benchmarks are almost always opinionated and configured for a specific problem (no matter if a vendor, independent consultant, or researcher conducts them).

For example, there is one benchmark published by GIGAOM, which compares the latency and performance of Kafka versus Pulsar. But this benchmark deliberately slowed Kafka down by forcing it to synchronize-to-disk on every single message by setting the Kafka config ‘flush.messages = 1’ (this makes every request cause an fsync). The benchmark also forces the Kafka consumer to acknowledge synchronously while the Pulsar consumer acknowledges asynchronously. Unsurprisingly, this benchmark setup makes Pulsar the seemingly clear “winner”. But this benchmark does not mention or explain this significant configuration difference in the setup and measurements. This is what some people call an apples-to-oranges comparison.

Pulsar’s architecture actually requires higher network utilization (due to the Pulsar broker tier which acts as a proxy in front of BookKeeper bookies) as well as twice the I/O (as BookKeeper writes data to a write ahead log as well as to the main segment).

Confluent did some benchmarks, too – more of an apples-to-apples comparison. Not surprisingly, the results were different. But should you really care about these benchmark fights between software vendors?

Think about your performance requirements. Do a proof of concept (POC) with Kafka and Pulsar, if you must. I bet that in 99% of scenarios, both will show acceptable performance for your use case. Don’t trust opinionated benchmarks from others! Your use case will have different requirements and characteristics anyway, and typically performance is just one of many evaluation dimensions.

Myth 7: “Pulsar is easier to operate than Kafka”?

False.

Both Kafka and Pulsar are hard to operate if you don’t use additional tooling.

Kafka includes two distributed systems: Kafka itself and Apache ZooKeeper.

But: Pulsar includes three distributed systems and an additional storage technology: Pulsar, ZooKeeper, and Apache BookKeeper. Like Pulsar, BookKeeper uses ZooKeeper, too. And lastly, RocksDB is used for certain storage tasks. This means that Pulsar has a significantly higher complexity to understand, tweak, and tune than Kafka. Additionally, Pulsar also has more configuration parameters than Kafka.

Kafka is firmly going in the opposite direction and is removing ZooKeeper (see KIP-500) so that you have just one distributed system to deploy, operate, scale and monitor:

Apache Kafka ZooKeeper Removal KIP 500

ZooKeeper is Kafka’s biggest scalability bottleneck and comes with operational challenges — This is true for Kafka but even more so for Pulsar!

One of the key issues of my customers is how to run ZooKeeper in mission-critical deployments at scale. Therefore I am really looking forward to Kafka’s simplified architecture, where you will deploy Kafka brokers only. This also establishes a unified security model, as ZooKeeper’s security no longer needs to be separately configured. This is a huge benefit, especially for larger organizations and regulated industries. Compliance and information security departments will thank you for this simplified architecture.

Operations is NOT just about Architecture!

Kafka is significantly better documented, has a tremendously larger community of experts, and a vast array of supporting tooling that make operations easier.

Additionally, there are many options for local and online Kafka training, including online courses, books, meetups, and conferences. You won’t find much for Pulsar, unfortunately.

Myth 8: “An architecture with three tiers is better than two tiers”?

It depends.

Personally, I am skeptical that Pulsar’s three tier architecture (using Pulsar brokers, ZooKeeper and BookKeeper) is an advantage for most projects. It is a trade-off!

Twitter described their move away from BookKeeper + DistributedLog (the latter a system very similar to Pulsar, with comparable architecture and design) just over a year ago, citing the advantages of Kafka’s single-tier architecture, such as cost efficiency and better performance, over a two-tier architecture that decouples storage and serving.

Like Pulsar, DistributedLog is built on top of BookKeeper and adds streaming-like functionality with an architecture and concepts similar to Pulsar (e.g., using decoupled storage and serving tiers). DistributedLog was originally a standalone project but eventually became a sub-project of BookKeeper, though nowadays it appears to be no longer actively developed (only a few commits in the past 12 months). The main reasons Twitter cited for switching to Kafka were (1) significant cost savings and performance gains and (2) Kafka’s huge community and adoption. For example, they concluded: “For single consumer use cases, we saw a 68% resource savings, and for fanout cases with multiple consumers, we saw a 75% resource savings.”

There are benefits from a three-tier architecture to build a scalable infrastructure. But the extra layer also increases network utilization by (at least) 33%, and data held in Pulsar’s brokers must additionally be cached in both layers for equivalent performance, and also written to disk twice because the storage format of BookKeeper is not based on a log.

On the cloud, where most Kafka deployments are being run, the best backing storage tier is in fact not a niche technology like BookKeeper, but a widely used and battle-tested object store like AWS S3 or GCP GCS.

Tiered Storage in Confluent Platform, which is backed by the likes of AWS S3 and GCP GCS, provides the same benefits without Pulsar’s extra layer of BookKeeper and the resulting extra network transfer cost and latency that this architecture incurs. It took Confluent two years to build and make Tiered Storage for Kafka generally available, including global 24/7 support for your most mission-critical data. Tiered Storage is not available yet for open source Apache Kafka, but Confluent is working with the rest of the Kafka community (including some major tech companies like Uber) on KIP-405 to add Tiered Storage to Kafka with different storage options.

There are always pros and cons for both architectures. Personally, I think that 95% of projects do not need a complex three-tier architecture. Where it does make sense is when you need the advantages of external, cost-efficient storage. You should care about 24/7 service level agreements (SLAs), scalability, and throughput, plus integration into your ecosystem as well as security, management tooling, and support. If your requirements require a three-tier architecture, then of course give it a go!

Sub-Myth: “Pulsar is better for lagging consumers because of its caching layer and storage layer”?

False.

The main problem with lagging consumers is that they exhaust the page cache, i.e., recent messages are already cached. Reads from older segments replace these, reducing the performance of consumers reading from the head of the log.

Pulsar’s architecture is actually worse in this regard. It retains the same issue around cache-flushing, but now the reads must do an extra network hop and extra I/O rather than just reading from the local media.

Myth 9: “Kafka does not scale as well as Pulsar”?

False.

This is one of the key arguments by the Pulsar community. As I said before, this always depends on the chosen benchmark. For example, I have seen tests with equivalent computing resources where Kafka did significantly better at high throughputs than Pulsar. Here is a “Pulsar vs. Kafka benchmark” where Kafka is much faster than Pulsar:

Pulsar vs Kafka Benchmark

Scalability is not a problem for most use cases. You can easily scale up Kafka to process several gigabytes per second, as you can see in a demo to “Scale Apache Kafka to 10+ GB Per Second in Confluent Cloud“:

Scalability 10 GB per second with Confluent Cloud

Honestly speaking, less than 1% of users should be worried about this discussion at all. If you have requirements like Netflix (processing Petabytes per day) or LinkedIn (processing trillions of messages), let’s talk about and discuss the best architecture, hardware, and configuration for such a deployment. For anybody else, don’t be worried.

Sub-Myth: “Kafka’s current approach means it can only store ~ 500K partitions per cluster”?

True.

Kafka today does not yet have the best architecture for large-scale deployments with hundreds of thousands of Kafka topics and partitions.

But: Pulsar, too, does not allow for unlimited scale. It just has different limits.

Kafka’s partition limit is imposed by Zookeeper. Removing Zookeeper from Kafka through the work in KIP-500 removes this upper bound.

As a side note:

The right design of your architecture is critical for success!

Most of the customers I have seen in trouble with Kafka partition counts and scalability are because they designed their architecture and applications in the wrong way (they’d run into the same issues if they were using Pulsar)!

Kafka is an event streaming platform, and not the next IBM MQ. If you try to recreate your favorite MQ solution and architecture with Kafka, you will likely fail. I have seen several customers failing here and then succeeding by re-architecting their setup with our help.

Chances are very high that you will not have any issues with partition numbers and scalability, even today with Kafka’s usage of ZooKeeper, if you design your use case right and understand Kafka’s basic concepts. This experience of customers is a common theme for any technology, like Kafka, that introduces a new technology level and paradigm well beyond what was done before (a prime example is the adoption hurdles faced by companies when they first began to move their use cases to the cloud).

Sub-Myth: “Pulsar supports a practically infinite number of partitions”?

False.

BookKeeper has the same 1-file-per-ledger limitation Kafka has, but there are multiple ledgers in one partition. Pulsar’s broker layer groups partitions into bundles, but its storage layer, BookKeeper, stores data in segments with many segments for each partition.

Like for Kafka, the metadata for these segments is stored in ZooKeeper, which imposes a limit on the total number that can be stored. Kafka is removing this dependency, thus allowing it to scale significantly further. I am really looking forward to seeing KIP-500 implemented by ~ the end of 2020. “Apache Kafka Needs No Keeper: Removing the Apache ZooKeeper Dependency” walks you through the implementation details and planned timelines.

Sub-Myth: “Kafka scaling needs to be defined when creating a Kafka Topic”?

Partly true.

If more scalability is needed, Kafka topics can either be over-partitioned (i.e., you configure a topic with more partitions than you initially need for a use case; see Streams and Tables in Apache Kafka: Topics, Partitions, and Storage Fundamentals), or they can be re-configured to use more partitions if there are requirements to scale in the future. This is not perfect, but a consequence of how distributed event streaming works (and why it scales much better than traditional messaging systems like IBM MQ).

Best practices for creating topics and procedures for changing topic configurations during production are available. So no worries!
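As a small sketch of what over-partitioning a topic at creation time can look like with Kafka’s AdminClient; the partition count of 24 and the replication factor of 3 are arbitrary illustrative choices, not recommendations:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class TopicProvisioning {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 24 partitions even if only a handful of consumers exist today,
            // leaving headroom to scale consumer instances later without repartitioning.
            NewTopic payments = new NewTopic("payments", 24, (short) 3);
            admin.createTopics(List.of(payments)).all().get();
        }
    }
}
```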

But: Pulsar topics have this restriction, too!

Write throughput is based on the number of partitions allocated in a Pulsar topic in the exact same way it is in a Kafka topic, so Pulsar topics must be over-provisioned for exactly the same reasons. That’s because, for each partition, only a single ledger (of the partition’s potentially many ledgers) is writable at the same time. Also, increasing the number of partitions dynamically impacts message ordering just like it does in Kafka (i.e. the message order is lost).

Both Kafka and Pulsar scale like crazy. This is sufficient for almost all use cases!

If you need even more extreme scale, I think a ZooKeeper-free implementation is the best choice. KIP-500 is thus the most anticipated Kafka change I see in the community and in Confluent’s customer base.

Myth 10: “Pulsar recovers from machine failure instantly but Kafka has to reload data”?

True and false.

Killing a Pulsar broker is indeed seamless, but (in contrast to a Kafka broker) the Pulsar broker doesn’t store any data but is only a proxy fronting the actual storage layer, which is BookKeeper. So highlighting that a Pulsar broker failure can easily be resolved is a marketing distraction, because actually one must talk about what happens when a BookKeeper node (a “bookie”) fails.

Killing and restarting a BookKeeper bookie requires the same redistribution of data seen in Kafka’s case. This is the nature of distributed systems, with concepts like replication and partitions.

Elastic Kafka is here already!

Elasticity is important. Confluent’s founder Jay Kreps has recently blogged about this topic: Elastic Apache Kafka Clusters in Confluent Cloud. In a SaaS cloud service like Confluent Cloud, the end user shouldn’t have to care at all about machine failure. 24/7 uptime is expected and should be guaranteed with 99.xx SLAs. Consumption-based pricing (i.e., pay as you go) means you do not have to worry about issues like broker management, sizing broker nodes, expanding or shrinking clusters, etc. under the hood at all.

Self-managed Kafka clusters also need similar capabilities. Tiered Storage for Kafka is huge because most of the data is not stored on the broker anymore to allow almost instant recovery from failures. In conjunction with tools like Self-Balancing Kafka (a Confluent feature coming in Q3 and discussed in the above link blog post), users don’t have to worry about elasticity in their self-managed clusters at all.

Unfortunately, if you are looking for such a modern offering for Pulsar, there is none available.

Myth 11: “Pulsar has better Inter-Cluster (Geo) Replication than Kafka”?

False.

Every distributed system has to solve problems like the CAP theorem and quorum in distributed computing. The quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system. A quorum-based technique is implemented to enforce consistent operation in a distributed system.

Kafka requires ZooKeeper to solve the quorum problem. Even after KIP-500 and ZooKeeper removal, the universal laws of real-world physics are still the same: There are latency issues deploying a distributed system over regions like the US East, Central and West or even globally. That’s because the speed of light, though very high, does have a limit.

Various deployment options exist to work around this problem, including real time replication tools like Apache Kafka’s MirrorMaker 2, Confluent’s Replicator or Confluent’s Multi-Region-Clusters. Check out “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” for various different deployment options and best practices:

Global Event Streaming and Replication between Cloud and Data Center

There is no single pattern or implementation to provide global replication AND zero downtime + zero data loss! For the most critical applications, Confluent’s Multi-Region-Clusters allows RTO=0 and RPO=0 (i.e. zero downtime and zero data loss) with automatic disaster recovery and client fail-over even if a complete data center or cloud region goes down.

Here, Pulsar’s architecture requires even more complexity than a “basic” Pulsar deployment. That’s because, for geo-replication, Pulsar requires an additional “global” Zookeeper cluster, which makes Pulsar inappropriate for geo-distribution over large distances. There is a workaround, but the problem around CAP theorem and physics do not go away.

No matter if you use Kafka or Pulsar, you need a battle-tested design to fight the laws of physics in your global deployments!

Myth 12: “Pulsar is compatible with Kafka’s interface and API”?

Partially True.

Pulsar provides a very basic implementation that is compatible with only minor parts of the Kafka v2.0 protocol.

Pulsar has a converter for basic parts of the Kafka protocol.

So, while alleged “Kafka compatibility” sounds nice on paper, one shouldn’t seriously consider it for migrating a running Kafka infrastructure to Pulsar. I doubt someone will take the risk…

We have seen “Kafka compatibility” claims in other examples such as the much more mature Azure Event Hubs service. Check out the limiting factors of their Kafka API, and be surprised! No support for core Kafka features like transactions (and thus exactly-once semantics), compression, or log compaction.

As it is not Kafka under the hood, also expect further diverging and unexpected behaviors when you connect your existing Kafka applications against such a “compatible” setup. No matter if Azure Event Hubs, Pulsar, or any other wrapper.

Kafka vs. Pulsar – Comprehensive Comparison

The last sections explored various technology myths we find in many other blog posts. I think I brought some clarity into these discussions.

Now, let’s not forget to take a look beyond the technical details of Kafka and Pulsar. Non-functional aspects are as important when choosing a technology.

I will cover three critical aspects in the following: Market traction, enterprise support and cloud offerings.

Market Traction of Apache Kafka and Apache Pulsar

Taking a look at Google Trends from the last five years confirms my personal experience: the interest in Apache Pulsar is very limited compared to Apache Kafka:

Apache Kafka versus Apache Pulsar Comparison and Trends

The picture looks very similar when you take a look at Stack Overflow and similar platforms, number and size of supporting vendors, the open ecosystem (tool integrations, wrapper frameworks like Spring Kafka), and similar characteristics for technology trends.

Job openings are another very good indicator of technology adoption. Not many job openings for Pulsar means not many companies are using it. Search in your favorite job search engine. If you search globally, you will find <100 job openings for Pulsar, but thousands of jobs for Kafka. Additionally, most of the ones showing Pulsar say something like “looking for experience with Kafka, Pulsar, Kinesis or similar technologies”.

In most cases, these characteristics are much more relevant for the success of your next project than the subtle technical differences. The key goal is to solve your business problem, isn’t it?

So with the lack of adoption, why is Pulsar coming up in conversations at all? One reason is that independent consulting companies, research analysts, and bloggers (including me) need to talk about new cutting-edge technologies to keep their audience interested… And to be honest, it makes a good story.

Enterprise Support for Kafka and Pulsar

There is enterprise support for Kafka and Pulsar!

Though, the situation is not what you might expect. Here are the vendors you can call and ask for a meeting to discuss the potential next steps for working together on your Pulsar journey:

  • Streamlio (now acquired by Splunk), the former company behind Apache Pulsar. Splunk did not yet announce a future Pulsar strategy to support people working on their own Pulsar-based projects. Splunk is well-known for their widely-adopted analytics platform. That’s their core business (~ $1.8B in 2019). The only thing people complain about Splunk is the pricing. Splunk is a heavy Kafka user under the hood and now incorporates Pulsar into their Splunk Data Stream Processor (DSP). It is very doubtful that Splunk will jump on the open source bandwagon to support your next standalone Pulsar project (but a broader-scope DSP might be coming, of course). The future will show us…
  • StreamNative, founded by one of the original developers of Apache Pulsar, provides an event streaming platform based on Pulsar. At the time of writing this in June 2020, StreamNative has 13 (!) employees on LinkedIn. I am not sure if this is the right scale to support your next mission-critical deployment in 2020 but they do offer it.
  • TIBCO announced support for Pulsar in December 2019. Their core strategy moved from integration to analytics in the last years. TIBCO’s middleware customers are migrating away in high numbers. Their middleware team had to make some desperate strategy decisions: supporting other platforms despite having zero contributions to, and experience with, the projects. You are right, this might be a myth. But hey, a fact is that TIBCO also does the same for Kafka. And here is a nice bit of trivia: TIBCO provides Kafka and ZooKeeper to you on Windows! Something nobody else does – because others know that this is not stable and creates inconsistencies all the time. But hey, TIBCO can support you now with Kafka and Pulsar. Why evaluate these two frameworks if one single vendor allows you to use both? Even on Windows, with .exe downloads and .bat scripts for starting the server components:

tibco kafka support on windows

The number of vendors supporting Kafka grows every quarter!

Kafka has achieved incredibly broad market adoption in the meantime. The best proof of this is when the biggest software vendors provide support and tools around it. IBM, Oracle, Amazon, Microsoft, and many other software companies support Kafka and build integration capabilities and their own products around it.

The latest “wake-up call” for me was at Oracle OpenWorld 2019 in San Francisco where I attended a roadmap session from the Oracle product manager for GoldenGate (Oracle’s well-known great but also very expensive CDC tool). Most of the talk focused on opening GoldenGate to make it the data integration platform for everything. Half the talk was about event streaming, Kafka and how GoldenGate will provide integration with different databases / data lakes and Kafka in both directions.

Fully-Managed Cloud Offerings for Kafka and Pulsar

Let’s take a look at the cloud offerings available for Kafka and Pulsar.

There is a cloud service available for Apache Pulsar. It has a very innovative name:

Kafkaesque.

No kidding. Check the link… [Update: On ~June 17th, they rebranded the service: KAFKAESQUE is now KESQUE – probably they realized how embarrassing the name was.]

Maybe you also check out the various cloud offerings for Apache Kafka to find out which offering fits you better:

  • Confluent Cloud (SaaS) is a fully-managed service providing consumption-based pricing, 24/7 SLAs and elastic, serverless characteristics for Apache Kafka and its ecosystem (e.g. Schema Registry, Kafka Connect connectors and ksqlDB for stream processing).
  • Amazon MSK (PaaS) provisions ZooKeeper and Kafka Brokers so that the end user can operate it, fix bugs, do rolling upgrades, etc. One important fact everybody should be aware of: AWS excludes Kafka issues from its 99.95% SLA and support!
  • Azure Event Hubs (SaaS) provides a Kafka endpoint (with a proprietary implementation under the hood) to interact with Kafka applications. It is very scalable and performant. As it is not really Kafka, but just an emulation, it misses several core features of Kafka like exactly-once semantics, log compaction, and compression, not to mention the surrounding capabilities like Kafka Connect and Kafka Streams.
  • Big Blue (IBM) and Big Red (Oracle) have cloud offerings around Kafka and its APIs. I have no idea if anyone is using them and how good they are. Never seen them in the wild by myself.
  • Plenty of smaller players like Aiven, CloudKarafka, Instaclustr, and others.

As you can see, the current cloud offerings show relatively clearly what the market adoption of Kafka and Pulsar looks like.

Conclusion – Apache Kafka or Apache Pulsar?

TL;DR: Pulsar is still a long way from Kafka’s level of maturity in terms of being proven for high scale use cases and building a community.

You should also question whether Pulsar is actually better.

Evaluate Kafka and Pulsar if you are going the purely open source way. Find out which fits you best. In your evaluation, include the technical feature set, maturity, vendors, developer community, and other relevant factors. Which one fits your situation best?

If you need an enterprise solution that covers much more than what both of these two open source systems offer, Kafka is the only option: Choose a Kafka-based offering from one of the various vendors or a suitable cloud offering. Pulsar, unfortunately, is not ready for this today and the foreseeable future.

How do you think about Apache Kafka vs. Apache Pulsar? What is your strategy? Let’s connect on LinkedIn and discuss! Stay informed about new blog posts by subscribing to my newsletter.

The post Pulsar vs Kafka – Comparison and Myths Explored appeared first on Kai Waehner.

Apache Kafka and Machine Learning in Banking and Finance Industry https://www.kai-waehner.de/blog/2020/04/15/apache-kafka-machine-learning-banking-finance-industry/ Wed, 15 Apr 2020 10:16:28 +0000 https://www.kai-waehner.de/?p=2176 The combination of Apache Kafka and Machine Learning / Deep Learning are the new black in Banking and…

The post Apache Kafka and Machine Learning in Banking and Finance Industry appeared first on Kai Waehner.

The combination of Apache Kafka and Machine Learning / Deep Learning are the new black in Banking and Finance Industry. This blog post covers use cases, architectures and a fraud detection example.

Event Streaming in the Finance Industry

Various different (typically mission-critical) use cases emerged to deploy event streaming in the finance industry. Here are a few companies leveraging Apache Kafka for banking projects:

Event Streaming Apache Kafka and Confluent in Finance Industry

Check past Kafka Summit video recordings and slides for details about use cases and architectures of these companies from the finance sector.

Here are a few concrete examples:

  • Capital One: Becoming truly event driven – offering a service other parts of the bank can use.
  • ING: Significantly improved customer experience – as a differentiator + Fraud detection and cost savings.
  • Nordea: Able to meet strict regulatory requirements around real-time reporting + cost savings.
  • Paypal: Processing 400+ Billion events per day for user behavioral tracking, merchant monitoring, risk & compliance, fraud detection, and other use cases.
  • Royal Bank of Canada (RBC): Mainframe off-load, better CX & fraud detection – brought many parts of the bank together

This is just a very short list of companies in the financial sector using Apache Kafka as event streaming platform for the heart of their business. Plenty of other examples are available by tens of global banks leveraging Apache Kafka for many use cases.

Apache Kafka as Middleware in Banking

One of the key use cases I see in banking is actually NOT the reason for this post: Apache Kafka as modern, scalable, reliable middleware:

  • Building a scalable 24/7 middleware infrastructure with real time processing, zero downtime, zero data loss and integration to legacy AND modern technologies, databases and applications
  • Integration with existing legacy middleware (ESB, ETL, MQ)
  • Replacement of proprietary integration platforms
  • Offloading from expensive systems like Mainframes

For these scenarios, please check out my blogs, slides and videos about Apache Kafka vs. Middleware (MQ, ETL, ESB).

Let’s now focus on another very interesting topic I see more and more in the finance industry: Apache Kafka + Event Streaming + Machine Learning. Let’s discuss how this fits together from use case and technical perspective…

Machine Learning in Banking and Financial Services

Machine Learning (ML) allows computers to find hidden insights without being explicitly programmed where to look. Different algorithms are applied to historical data to find insights and patterns. These insights are then stored in an analytic model to do predictions on new events. Some examples of ML algorithms:

  • Linear Regression
  • Decision Trees
  • Naïve Bayes
  • Clustering
  • Neural Networks (aka Deep Learning) like CNN, RNN, Transformer, Autoencoder

If you need more details about Machine Learning / Deep Learning, check out other resources on the web.

We want to find out how to leverage Machine Learning together with Kafka to improve traditional and to build new innovative use cases in the finance industry:

Machine Learning to Improve Traditional and to Build New Use Cases in the Finance Industry

Machine Learning Pipelines for Model Training, Scoring and Monitoring

In general, you need to create a pipeline for model training, model scoring (aka predictions) and monitoring, like the following:

Use Case - Streaming Analytics for Fraud Detection at Scale

Some key requirements for most ML pipelines:

  • Real time processing
  • Ingestion and data processing at scale (often very high throughput)
  • Scalability and reliability (often zero downtime AND zero data loss)
  • Integration with various technologies, databases and applications (typically legacy and modern interfaces)
  • Decoupling of data producers and data consumers (some are real time, but others are near real time, batch or request-response)

Does this sound familiar to you? I guess you can imagine why Apache Kafka comes into play here…

Apache Kafka and Machine Learning / Deep Learning

I talked about the relation between Apache Kafka and Machine Learning / Deep Learning more than enough in the past two years. As a recap, here is what a Kafka + ML architecture could look like:

 

Apache Kafka Open Source Ecosystem as Infrastructure for Machine Learning

I recommend the following blog posts to learn more about building a scalable and reliable real time infrastructure for ML:

Let’s now take a look at a concrete example…

Fraud Detection – Helping the Finance Industry with Event Streaming and AI / Machine Learning

Fraud is a billion-dollar business and it is increasing every year. The PwC global economic crime survey of 2018 found that half (49 percent) of the 7,200 companies they surveyed had experienced fraud of some kind.

Traditional methods of data analysis have long been used to detect fraud. They require complex and time-consuming investigations that deal with different domains of knowledge like finance, economics, business practices and law. Fraud often consists of many instances or incidents involving repeated transgressions using the same method. Fraud instances can be similar in content and appearance but usually are not identical.

This is where machine learning and artificial intelligence (AI) come into play: ML algorithms look for accounts, customers, suppliers, etc. that behave ‘unusually’ and output suspicion scores, rules or visual anomalies, depending on the method.

Streaming Analytics for Real Time Fraud Detection

Let’s take a look at a possible architecture to implement real time streaming analytics for fraud detection at scale:

Technical Architecture - Streaming Analytics for Fraud Detection at Scale

 

You can always choose between traditional and cutting-edge methods and algorithms for each use case you deploy to the event streaming platform.

Sometimes, existing business rules or statistical models work fine. Not everything needs to be a neural network – a neural network also has drawbacks like high requirements for computing power and big data sets. Not everything needs to be real time. Cutting-edge technologies are awesome, but not required for everything. BUT all of these technologies and concepts have to work with each other seamlessly and reliably.

Fraud Detection with Apache Kafka, Kafka Streams, Kafka Connect, ksqlDB and TensorFlow

The following shows the above use case mapped to technologies:

Apache Kafka with Streams Connect ksqlDB TensorFlow for Fraud Detection in Finance Industry

We leverage the following technologies:

  • Apache Kafka as central, scalable and reliable Event Streaming platform
  • Kafka Connect to integrate with data sources and data sinks (some real time, some batch, some request-response via REST / HTTP)
  • Kafka Streams / ksqlDB to implement continuous data processing (streaming ETL, business rules, model scoring)
  • MQTT over WebSockets to connect to millions of users via web and mobile apps
  • TensorFlow for model training and model scoring
  • TensorFlow IO for streaming ingestion from a Kafka topic for model training (another approach not shown here is to ingest the data from Kafka into a data lake like Hadoop or AWS S3 to do model training with ML cloud services, Spark or any other ML tool)

Any other technology can be added / replaced / removed depending on your use case. For instance, it is totally fine and common to complement the above architecture

  • with legacy middleware already in place
  • by re-processing the data from the commit log with different ML technologies like open source H2O.ai or the AutoML tool DataRobot
  • by integrating with other stream processing engines like Apache Flink
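
To make the model scoring part more tangible, here is a minimal Kafka Streams sketch in Java. It is an illustration only – the topic names ('payments', 'fraud-alerts') and the score() method are hypothetical placeholders; in a real deployment the scoring function would call the trained TensorFlow model (embedded or via a model server):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class FraudScoringApp {

    // Hypothetical placeholder: in the real pipeline this would call the trained
    // TensorFlow model (embedded or via TensorFlow Serving) and return a fraud score.
    private static double score(String payment) {
        return payment.contains("suspicious") ? 0.99 : 0.01;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-detection-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");

        // Score every payment event continuously and forward likely fraud to an alert topic.
        payments.filter((key, payment) -> score(payment) > 0.9)
                .to("fraud-alerts");

        new KafkaStreams(builder.build(), props).start();
    }
}

ksqlDB can express the same filter-and-forward logic as a continuous SQL query without writing any Java code.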

Sounds pretty cool? Want to try it out yourself?

Demos and Code Examples for Apache Kafka + Machine Learning

Please check out my Github page. The repositories provide various demos and code examples for Apache Kafka + Machine Learning using different ML technologies like TensorFlow, DL4J and H2O.

These examples do not focus on the finance industry. But you can easily map them to your use cases; I have seen various deployments in the financial sector doing exactly what the examples below implement.

Here are some highlights of my Github repository:

Slides / Video Recording: Kafka + ML in Finance Industry

Take a look at these slides and the video recording to understand in more detail how to build an ML infrastructure with the Kafka ecosystem and your favorite ML tools, cloud services, and other infrastructure components.

I explain in more detail how and why Apache Kafka has become the de facto standard for reliable and scalable streaming infrastructures in the finance industry.

AI / Machine learning and the Apache Kafka ecosystem are a great combination for training, deploying and monitoring analytic models at scale in real time. They are showing up more and more in projects, but still feel like buzzwords and hype for science projects. Therefore, I discuss in detail:

  • Connecting the dots! How are Kafka and Machine Learning related?
  • How can these concepts and technologies be combined to productionize analytic models in mission-critical and scalable real time applications?
  • A step-by-step approach to build a scalable and reliable real time infrastructure for fraud detection in an instant-payment application using Deep Learning and an Autoencoder for anomaly detection

We build a hybrid ML architecture using technologies such as Apache Kafka, Kafka Connect, Kafka Streams, ksqlDB, TensorFlow, TF Serving, TF IO, Confluent Tiered Storage, Google Cloud Platform (GCP), Google Cloud Storage (GCS), and more.
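
To give you an idea of the anomaly detection logic at the heart of this example, here is a tiny plain-Java sketch. It is just an illustration – reconstruct() is a hypothetical placeholder for the trained TensorFlow autoencoder, and the feature vector and threshold are made up:

public class AnomalyCheck {

    // Hypothetical placeholder: a real implementation would feed the payment features
    // into the trained TensorFlow autoencoder and return its reconstruction.
    static double[] reconstruct(double[] features) {
        return features.clone();
    }

    // An autoencoder trained on "normal" payments reconstructs them well;
    // a high reconstruction error (MSE) therefore indicates a potential fraud case.
    static boolean isSuspicious(double[] features, double threshold) {
        double[] reconstructed = reconstruct(features);
        double mse = 0.0;
        for (int i = 0; i < features.length; i++) {
            double diff = features[i] - reconstructed[i];
            mse += diff * diff;
        }
        mse /= features.length;
        return mse > threshold;
    }

    public static void main(String[] args) {
        double[] payment = {120.0, 3.5, 0.0, 1.0};   // made-up feature vector
        System.out.println(isSuspicious(payment, 0.05));
    }
}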

Check out the slide deck:

Here is the video recording walking through the above slides:

Multi-Region Disaster Recovery

The most critical infrastructures require 24/7 availability and zero data loss even in case of disaster, i.e. outage of a complete data center or cloud region.

From an ML perspective, an outage might be acceptable for model training. However, model scoring and monitoring (e.g. for fraud detection in your instant payment application) should run continuously with zero downtime and zero data loss (aka RPO = 0 and RTO = 0).

Machine Learning Architecture with Zero Down Time and Zero Data Loss

The banking and finance industry is where I see the highest number of critical use cases across all industries. Kafka is highly available by nature. However, disaster recovery without downtime and without data loss is still not easy to achieve. Tools like MirrorMaker 2 or Confluent Replicator are good enough for some scenarios. If you need guaranteed zero downtime and zero data loss, additional tooling is required.

Confluent provides Multi-Region Clusters (MRC) to solve this problem. Let’s take a look at the architecture of a large FinServ customer:

Example of a Multi-Region Cluster in a Bank

This architecture provides business continuity with RPO = 0 and RTO = 0 as the data is replicated synchronously between US East and US West regions with Kafka-native technologies. This architecture has various advantages, including:

  • No need for additional tools
  • Automatic fail-over handling of clients (no need for custom client logic)
  • No data loss or downtime even in case of disaster

This banking use case was about reducing the clearing time from ‘deposit’ to ‘available’ to real time. Exactly the same architecture can be deployed for your mission-critical ML use cases (like fraud detection).
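
Broker-side replication (whether via MRC or replication tools) is only one half of zero data loss; the clients also have to be configured for durability. Here is a minimal sketch of typical producer settings – these are generic Kafka settings, not MRC-specific, and the broker addresses are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ZeroDataLossProducerConfig {

    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-us-east:9092,broker-us-west:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Only consider a write successful once all in-sync replicas have acknowledged it.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Retry transparently on transient errors without producing duplicates.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // On the topic side, min.insync.replicas should be set accordingly (e.g. 2 or more).
        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) {
        create().close();
    }
}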

Pretty cool, isn’t it? There is much more to learn about “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments“.

Kafka + Machine Learning = 24/7 + Real Time

No matter if financial services or any other industry, here are the key lessons learned for implementing scalable, reliable real time ML infrastructures:

  • Don’t underestimate the hidden technical debt in Machine Learning systems
  • Leverage the Apache Kafka open source ecosystem to build a scalable and flexible real time ML platform
  • The market provides some cutting-edge solutions that help to deploy Kafka and ML together, like Tiered Storage for Kafka and TensorFlow’s IO Plugin for streaming ingestion into TensorFlow, which simplify your big data architecture

What use cases, challenges, and architectures do you have? Please share your insights and let’s discuss… And let’s connect on LinkedIn to stay in touch!

 

The post Apache Kafka and Machine Learning in Banking and Finance Industry appeared first on Kai Waehner.

Big Data Spain: Talk about KSQL – The Streaming SQL Engine for Apache Kafka https://www.kai-waehner.de/blog/2018/11/15/big-data-spain-ksql-the-streaming-sql-engine-apache-kafka/ Thu, 15 Nov 2018 05:35:24 +0000 http://www.kai-waehner.de/blog/?p=1382 KSQL - The Open Source Streaming SQL Engine for Apache Kafka => Slides from my talk at Big Data Spain 2018 are online. Check it out!

In November 2018, I was back in Madrid to speak at Big Data Spain – a great event all about big data, analytics and machine learning, and one of the largest tech conferences in Spain. A perfect event to talk about KSQL – The Streaming SQL Engine for Apache Kafka.

Big Data Spain is held at Kinepolis, a big cinema – one of my favorite locations for a tech conference, for both speakers and the audience.

All talks at Big Data Spain are recorded. Video recording and slides below.

KSQL – The Open Source SQL Streaming Engine for Apache Kafka

My talk was an update about KSQL. The slide deck describes various use cases for KSQL. I also included some advanced topics such as User Defined Functions (UDF). Here is the abstract:

The rapidly expanding world of stream processing can be daunting, with new concepts such as various types of time semantics, windowed aggregates, changelogs, and programming frameworks to master.
KSQL is an open-source, Apache 2.0 licensed streaming SQL engine on top of Apache Kafka which aims to simplify all this and make stream processing available to everyone. Even though it is simple to use, KSQL is built for mission-critical and scalable production deployments (using Kafka Streams under the hood).
Benefits of using KSQL include: No coding required; no additional analytics cluster needed; streams and tables as first-class constructs; access to the rich Kafka ecosystem. This session introduces the concepts and architecture of KSQL. Use cases such as Streaming ETL, Real Time Stream Monitoring or Anomaly Detection are discussed. A live demo shows how to set up and use KSQL quickly and easily on top of your Kafka ecosystem.

Key takeaways:

– KSQL includes access to the rich Apache Kafka ecosystem and is suitable for various use cases, including Streaming ETL, Real Time Stream Monitoring and Anomaly Detection

– KSQL allows you to realize stream processing without coding and without an additional analytics cluster

Slide Deck: KSQL Introduction

Here is the slide deck:

 

Video Recording: Intro to KSQL

Here is the video recording from my talk:

The post Big Data Spain: Talk about KSQL – The Streaming SQL Engine for Apache Kafka appeared first on Kai Waehner.

Model Serving: Stream Processing vs. RPC / REST with Java, gRPC, Apache Kafka, TensorFlow https://www.kai-waehner.de/blog/2018/07/09/model-serving-java-grpc-tensorflow-apache-kafka-streams-deeplearning-stream-processing-rpc-rest/ Mon, 09 Jul 2018 01:13:45 +0000 http://www.kai-waehner.de/blog/?p=1303 Machine Learning / Deep Learning models can be used in different ways to do predictions. Natively in the application or hosted in a remote model server. Then you combine stream processing with RPC / Request-Response paradigm. This blog post shows examples of stream processing vs. RPC model serving using Java, Apache Kafka, Kafka Streams, gRPC and TensorFlow Serving.

Machine Learning / Deep Learning models can be used in different ways to do predictions. My preferred way is to deploy an analytic model directly into a stream processing application (like Kafka Streams or KSQL), e.g. using the TensorFlow for Java API. This allows the best latency and independence from external services. Several examples can be found in my Github project: Model Inference within Kafka Streams Microservices using TensorFlow, H2O.ai, Deeplearning4j (DL4J).

However, direct deployment of models is not always a feasible approach. Sometimes it makes sense or is needed to deploy a model in another serving infrastructure like TensorFlow Serving for TensorFlow models. Model inference is then done via RPC / request-response communication. Organizational or technical reasons might force this approach. Or you might want to leverage the built-in features for managing and versioning different models in the model server.

So you combine stream processing with the RPC / request-response paradigm. The architecture looks like the following:

Model Serving: Stream Processing vs. Request Response with Java, gRPC, Apache Kafka, TensorFlow

Pros of an external model serving infrastructure like TensorFlow Serving:

  • Simple integration with existing technologies and organizational processes
  • Easier to understand if you come from non-streaming world
  • Later migration to real streaming is also possible
  • Model management built-in for different models and versioning

Cons:

  • Worse latency due to a remote call instead of local inference
  • No offline inference (devices, edge processing, etc.)
  • Coupling the availability, scalability, and latency / throughput of your Kafka Streams application with the SLAs of the RPC interface
  • Side-effects (e.g. in case of failure) not covered by Kafka processing (e.g. Exactly Once)

Combination of Stream Processing and Model Server using Apache Kafka, Kafka Streams and TensorFlow Serving

I created the Github Java project “TensorFlow Serving + gRPC + Java + Kafka Streams” to demo how to do model inference with Apache Kafka, Kafka Streams and a TensorFlow model deployed using TensorFlow Serving. The concepts are very similar for other ML frameworks and Cloud Providers, e.g. you could also use Google Cloud ML Engine for TensorFlow (which uses TensorFlow Serving under the hood) or Apache MXNet and AWS model server.

Most ML servers for model serving are also extensible to serve other types of models and data, e.g. you could also deploy non-TensorFlow models to TensorFlow Serving. Many ML servers are available as a cloud service and for local deployment.

TensorFlow Serving

Let’s discuss TensorFlow Serving quickly. It can be used to host your trained analytic models. Like with most model servers, you can do inference via the request-response paradigm. gRPC and REST / HTTP are the two common technologies and concepts used.

The blog post “How to deploy TensorFlow models to production using TF Serving” is a great explanation of how to export and deploy trained TensorFlow models to a TensorFlow Serving infrastructure. You can either deploy your own infrastructure anywhere or leverage a cloud service like Google Cloud ML Engine. A SavedModel is TensorFlow’s recommended format for saving models, and it is the required format for deploying trained TensorFlow models using TensorFlow Serving or deploying on Google Cloud ML Engine.

The core architecture is described in detail in TensorFlow Serving’s architecture overview:

TensorFlow Serving's architecture overview

This architecture allows deployment and management of different models and versions of these models, including additional features like A/B testing. In the following demo, we just deploy one single TensorFlow model for image recognition (based on the famous Inception neural network).

Demo: Mixing Stream Processing with RPC: TensorFlow Serving + Kafka Streams

Disclaimer: The following is a shortened version of the steps to do. For full example including source code and scripts, please go to my Github project “TensorFlow Serving + gRPC + Java + Kafka Streams“.

Things to do

  1. Install and start a ML Serving Engine
  2. Deploy prebuilt TensorFlow Model
  3. Create Kafka Cluster
  4. Implement Kafka Streams application
  5. Deploy Kafka Streams application (e.g. locally on laptop or to a Kubernetes cluster)
  6. Generate streaming data to test the combination of Kafka Streams and TensorFlow Serving

Step 1: Create a TensorFlow model and export it to ‘SavedModel’ format

I simply added an existing pretrained Image Recognition model built with TensorFlow. You just need to export a model using TensorFlow’s API and then use the exported folder. TensorFlow uses Protobuf to store the model graph and adds variables for the weights of the neural network.

Google ML Engine shows how to create a simple TensorFlow model for census predictions using the “ML Engine getting started guide“. In a second step, you can build a more advanced example for image recognition using transfer learning by following the guide “Image Classification using Flowers dataset“.

You can also combine cloud and local services, e.g. build the analytic model with Google ML Engine and then deploy it locally using TensorFlow Serving as we do.

Step 2: Install and start TensorFlow Serving server + deploy model

Different options are available. Installing TensorFlow Serving on a Mac is still a pain as of mid-2018. apt-get works much more easily on Linux operating systems. Unfortunately, there is nothing like a ‘brew’ command or a simple zip file you can use on a Mac. Alternatives:

  • You can build the project and compile everything using the Bazel build system – which literally takes forever (on my laptop), i.e. many hours.
  • Install and run TensorFlow Serving via a Docker container. This also requires building the project. In addition, the documentation is not very good and outdated.
  • Preferred option for beginners => Use a prebuilt Docker container with TensorFlow Serving. I used an example from Thamme Gowda. Kudos to him for building a project which not only contains the TensorFlow Serving Docker image, but also shows an example of how to do gRPC communication between a Java application and TensorFlow Serving.

If you want to deploy your own model, read the guide “Deploy TensorFlow model to TensorFlow serving“. Or to use a cloud service, e.g. take a look at “Getting Started with Google ML Engine“.

Step 3: Create Kafka Cluster and Kafka topics

Create a local Kafka environment (Apache Kafka broker + Zookeeper). The easiest way is the open source Confluent CLI – which is also part of Confluent Open Source and Confluent Enterprise Platform. Just type “confluent start kafka“.

You can also create a cluster using Kafka as a Service. The best option is Confluent Cloud – Apache Kafka as a Service. You can choose between Confluent Cloud Professional for “playing around” or Confluent Cloud Enterprise on AWS, GCP or Azure for mission-critical deployments including a 99.95% SLA and very large scale up to 2 GByte/second throughput. The third option is to connect to your existing Kafka cluster on premise or in the cloud (note that you need to change the broker URL and port in the Kafka Streams Java code before building the project).

Next, create the two Kafka topics for this example (‘ImageInputTopic’ for URLs to the images and ‘ImageOutputTopic’ for the prediction results):
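
If you prefer to stay in Java instead of using the CLI, topic creation is also just a few lines with Kafka’s AdminClient – a sketch, assuming a local broker and single-broker replication:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // One partition and replication factor 1 are fine for a local demo.
            admin.createTopics(Arrays.asList(
                    new NewTopic("ImageInputTopic", 1, (short) 1),
                    new NewTopic("ImageOutputTopic", 1, (short) 1)))
                 .all().get();
        }
    }
}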

Step 4: Build and deploy Kafka Streams app + send test messages

The Kafka Streams microservice (i.e. Java class) “Kafka Streams TensorFlow Serving gRPC Example” is the Kafka Streams Java client. The microservice uses gRPC and Protobuf for request-response communication with the TensorFlow Serving server to do model inference to predict the content of the image. Note that the Java client does not need any TensorFlow APIs, but just gRPC interfaces.

This example executes a Java main method, i.e. it starts a local Java process running the Kafka Streams microservice. It waits continuously for new events arriving at ‘ImageInputTopic’, does a model inference (via a gRPC call to TensorFlow Serving), and then sends the prediction to ‘ImageOutputTopic’ – all in real time within milliseconds.
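
A simplified sketch of what such a topology looks like – note that the gRPC call to TensorFlow Serving is reduced to a hypothetical placeholder method here; the real Protobuf / gRPC client code is in the Github project:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ImageRecognitionStream {

    // Hypothetical wrapper around the gRPC request-response call to TensorFlow Serving;
    // the real stub code (Protobuf request, blocking gRPC call) is in the Github project.
    private static String predictViaTensorFlowServing(String imageUrl) {
        return "prediction for " + imageUrl;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "image-recognition-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // For each incoming image URL, call the model server and publish the prediction.
        builder.<String, String>stream("ImageInputTopic")
               .mapValues(ImageRecognitionStream::predictViaTensorFlowServing)
               .to("ImageOutputTopic");

        new KafkaStreams(builder.build(), props).start();
    }
}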

In the same way, you could deploy this Kafka Streams microservice anywhere – including Kubernetes (e.g. an on-premise OpenShift cluster or Google Kubernetes Engine), Mesosphere, Amazon ECS or even in a Java EE app – and scale it up and down dynamically.

Now send messages, e.g. with kafkacat, and use kafka-console-consumer to consume the predictions.
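
If you rather want to stay in Java than use kafkacat, a minimal producer sketch for sending a test message could look like this (the image path is just an example value):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendTestImage {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send the URL (or local path) of an image to be classified by the pipeline.
            producer.send(new ProducerRecord<>("ImageInputTopic", "src/main/resources/example.jpg"));
        }
    }
}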

Once again, if you want to see source code and scripts, then please go to my Github project “TensorFlow Serving + gRPC + Java + Kafka Streams“.

The post Model Serving: Stream Processing vs. RPC / REST with Java, gRPC, Apache Kafka, TensorFlow appeared first on Kai Waehner.

Apache Kafka + Kafka Streams + Mesos / DCOS = Scalable Microservices https://www.kai-waehner.de/blog/2017/10/27/mesos-kafka-streams-scalable-microservices/ Fri, 27 Oct 2017 08:05:16 +0000 http://www.kai-waehner.de/blog/?p=1208 Apache Kafka + Kafka Streams + Apache Mesos = Highly Scalable Microservices. Mission-critical deployments via DC/OS and Confluent on premise or public cloud.

I gave a talk at MesosCon 2017 Europe in Prague about building highly scalable, mission-critical microservices with Apache Kafka, Kafka Streams and Apache Mesos / DCOS. I would like to share the slides and a video recording of the live demo.

Abstract

Microservices bring many benefits, like agile, flexible development and deployment of business logic. However, a microservice architecture also creates many new challenges. These include increased communication between distributed instances, the need for orchestration, new fail-over requirements, and resiliency design patterns.

This session discusses how to build a highly scalable, performant, mission-critical microservice infrastructure with Apache Kafka, Kafka Streams and Apache Mesos / DC/OS. Apache Kafka brokers are used as a powerful, scalable, distributed message backbone. Kafka’s Streams API allows you to embed stream processing directly into any external microservice or business application, without the need for a dedicated streaming cluster. Apache Mesos can be used as a scalable infrastructure for both the Apache Kafka brokers and external applications using the Kafka Streams API, to leverage the benefits of a cloud-native platform like service discovery, health checks, or fail-over management.

A live demo shows how to develop real time applications for your core business with Kafka messaging brokers and the Kafka Streams API. You see how to deploy / manage / scale them on a DC/OS cluster using different deployment options.

Key takeaways

  • Successful microservice architectures require a highly scalable messaging infrastructure combined with a cloud-native platform which manages distributed microservices
  • Apache Kafka offers a highly scalable, mission critical infrastructure for distributed messaging and integration
  • Kafka’s Streams API allows you to embed stream processing into any external application or microservice
  • Mesos and DC/OS allow management of both Kafka brokers and external applications using the Kafka Streams API, to leverage many built-in benefits like health checks, service discovery or fail-over control of microservices
  • See a live demo which combines the Apache Kafka streaming platform and DC/OS

Architecture: Kafka Brokers + Kafka Streams on Kubernetes and DC/OS

The following picture shows the architecture. You can either run Kafka brokers and Kafka Streams microservices natively on DC/OS via Marathon or leverage Kubernetes as a Docker container orchestration tool (which is also supported by Mesosphere in the meantime).

 

Architecture - Kafka Streams, Kubernetes and Mesos / DCOS

Slides

Here are the slides from my talk:

Live Demo

The following video shows the live demo. It is built on AWS using Mesosphere’s CloudFormation script to set up a DC/OS cluster in ten minutes.

Here, I deployed both – Kafka brokers and Kafka Streams microservices – directly to DC/OS without leveraging Kubernetes. I expect to see many people continue to deploy Kafka brokers directly on DC/OS. For microservices many teams might move to the following stack: Microservice –> Docker –> Kubernetes –> DC/OS.

Do you also use Apache Mesos respectively DC/OS to run Kafka? Only the brokers or also Kafka clients (producers, consumers, Streams, Connect, KSQL, etc)? Or do you prefer another tool like Kubernetes (maybe on DC/OS)?

 

The post Apache Kafka + Kafka Streams + Mesos / DCOS = Scalable Microservices appeared first on Kai Waehner.

Deep Learning in Real Time with TensorFlow, H2O.ai and Kafka Streams (Slides from JavaOne 2017) https://www.kai-waehner.de/blog/2017/10/04/kafka-streams-deep-learning-tensorflow-h2o-ai/ Wed, 04 Oct 2017 16:59:04 +0000 http://www.kai-waehner.de/blog/?p=1201 Apache Kafka + Kafka Streams to Productionize Neural Networks (Deep Learning). Models built with TensorFlow, DeepLearning4J, H2O. Slides from JavaOne 2017.

Early October… Like every year in October, it is time for JavaOne and Oracle Open World in San Francisco… I am glad to be back at this huge event again. My talk at JavaOne 2017 was all about deployment of analytic models to scalable production systems leveraging Apache Kafka and Kafka Streams. Let’s first look at the abstract. After that I attach the slides and refer to further material around this topic.

Abstract “Deep Learning in Real Time with TensorFlow, H2O.ai and Kafka Streams”

Intelligent real time applications are a game changer in any industry. Deep Learning is one of the hottest buzzwords in this area. New technologies like GPUs combined with elastic cloud infrastructure enable the sophisticated usage of artificial neural networks to add business value in real world scenarios. Tech giants use it e.g. for image recognition and speech translation. This session discusses some real-world scenarios from different industries to explain when and how traditional companies can leverage deep learning in real time applications.

This session shows how to deploy Deep Learning models into real time applications to do predictions on new events. Apache Kafka will be used to do inference with the analytic models in a highly scalable and performant way.

The first part introduces the use cases and concepts behind Deep Learning. It discusses how to build Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Autoencoders leveraging open source frameworks like TensorFlow, DeepLearning4J or H2O.

The second part shows how to deploy the built analytic models to real time applications leveraging Apache Kafka as streaming platform and Apache Kafka’s Streams API to embed the intelligent business logic into any external application or microservice.

Apache Kafka, Kafka Streams and Deep Learning

Key Takeaways for the Audience: Kafka Streams + Deep Learning

Here are the takeaways of this talk:

  • The focus of this talk is to discuss and show how to productionize analytic models built by data scientists – the key challenge in most companies.
  • Deep Learning allows you to build different neural networks to solve complex classification and regression scenarios and can add business value in any industry
  • Deep Learning is used to build analytic models using open source frameworks like TensorFlow, DeepLearning4J or H2O.ai.
  • Apache Kafka’s Streams API allows you to embed the intelligent business logic into any application or microservice
  • Apache Kafka’s Streams API leverages these Deep Learning models (without redeveloping them) to act on new events in real time

Slides and Further Material around Apache Kafka and Machine Learning

Here are the slides of my talk:

Some further material around Apache Kafka, Kafka Streams and Machine Learning:

I will post more examples and use cases around Apache Kafka and Machine Learning in the upcoming months… Stay tuned!

The post Deep Learning in Real Time with TensorFlow, H2O.ai and Kafka Streams (Slides from JavaOne 2017) appeared first on Kai Waehner.
