KSQL Archives - Kai Waehner
https://www.kai-waehner.de/blog/tag/ksql/

Fraud Detection with Apache Kafka, KSQL and Apache Flink
https://www.kai-waehner.de/blog/2022/10/25/fraud-detection-with-apache-kafka-ksql-and-apache-flink/
Tue, 25 Oct 2022

Fraud detection becomes increasingly challenging in a digital world across all industries. Real-time data processing with Apache Kafka became the de facto standard to correlate and prevent fraud continuously before it happens. This blog post explores case studies for fraud prevention from companies such as Paypal, Capital One, ING Bank, Grab, and Kakao Games that leverage stream processing technologies like Kafka Streams, KSQL, and Apache Flink.

Stream Processing with Apache Kafka, KSQL and Apache Flink across Industries

Fraud detection and the need for real-time data

Fraud detection and prevention is the adequate response to fraudulent activities in companies, such as embezzlement and the loss of assets because of employee actions.

An anti-fraud management system (AFMS) comprises fraud auditing, prevention, and detection tasks. Larger companies use it as a company-wide system to prevent, detect, and adequately respond to fraudulent activities. These distinct elements are interconnected or exist independently. An integrated solution is usually more effective if the architecture considers the interdependencies during planning.

Real-time data beats slow data across business domains and industries in almost all use cases. But there are few better examples than fraud prevention and fraud detection. It is not helpful to detect fraud in your data warehouse or data lake hours or even minutes later, as the money is already lost. This “too late architecture” increases risk, causes revenue loss, and creates a lousy customer experience.

It is no surprise that most modern payment platforms and anti-fraud management systems implement real-time capabilities with streaming analytics technologies for these transactional and analytical workloads. The Kappa architecture powered by Apache Kafka became the de facto standard replacing the Lambda architecture.

A stream processing example in payments

Stream processing is the foundation for implementing fraud detection and prevention while the data is in motion (and relevant) instead of just storing data at rest for analytics (too late).

No matter what modern stream processing technology you choose (e.g., Kafka Streams, KSQL, Apache Flink), it enables continuous real-time processing and correlation of different data sets. Often, the combination of real-time and historical data helps find the right insights and correlations to detect fraud with a high probability.

Let’s look at a few examples of stateless and stateful stream processing for real-time data correlation with the Kafka-native tools Kafka Streams and ksqlDB. Similarly, Apache Flink or other stream processing engines can be combined with the Kafka data stream. Each option has pros and cons. While Flink might be the better fit for some projects, it is another engine and infrastructure you need to combine with Kafka.

Ensure you understand your end-to-end SLAs and requirements regarding latency, exactly-once semantics, potential data loss, etc. Then use the right combination of tools for the job.

Stateless transaction monitoring with Kafka Streams

A Kafka Streams application, written in Java, processes each payment event in a stateless fashion one by one:

Transaction Monitoring for Fraud Detection with Kafka Streams
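In a Kafka Streams topology, such a stateless check is typically just a `filter()` over the payments stream. The core predicate can be sketched in plain Python; the field names and thresholds below are invented for illustration, not taken from the article's example:

```python
# Minimal sketch of the stateless checks a Kafka Streams filter() might apply.
# Field names and thresholds are hypothetical.
def is_suspicious(payment: dict) -> bool:
    """Evaluate one payment event in isolation -- no state from earlier events."""
    return (
        payment["amount"] > 10_000  # unusually large transaction
        or payment["country"] != payment["card_country"]  # geo mismatch
    )

payments = [
    {"amount": 50, "country": "US", "card_country": "US"},
    {"amount": 25_000, "country": "US", "card_country": "US"},
    {"amount": 80, "country": "BR", "card_country": "US"},
]
flagged = [p for p in payments if is_suspicious(p)]
```

Because each event is evaluated independently, this kind of check scales trivially across partitions without any local state stores.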

Stateful anomaly detection with Kafka and KSQL

A ksqlDB application, written in SQL, continuously analyzes the transactions of the last hour per customer ID to identify malicious behavior:

Anomaly Detection with Kafka and KSQL
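The logic behind such a windowed query can be sketched in plain Python. The one-hour window matches the description above, while the threshold and event shape are assumptions for illustration:

```python
from collections import defaultdict

WINDOW_SECONDS = 3600   # one hour, matching the ksqlDB window described above
THRESHOLD = 3           # hypothetical limit on transactions per customer per hour

def detect_anomalies(events):
    """events: iterable of (epoch_seconds, customer_id).
    Emits (customer_id, count) whenever a customer exceeds THRESHOLD
    transactions within the trailing one-hour window."""
    history = defaultdict(list)  # customer_id -> recent timestamps
    alerts = []
    for ts, customer in events:
        # keep only timestamps that are still inside the sliding window
        history[customer] = [t for t in history[customer]
                             if ts - t < WINDOW_SECONDS] + [ts]
        if len(history[customer]) > THRESHOLD:
            alerts.append((customer, len(history[customer])))
    return alerts

alerts = detect_anomalies([
    (0, "A"), (600, "A"), (1200, "A"), (1800, "A"),  # 4 events within one hour
    (0, "B"), (4000, "B"),                           # spread out, no alert
])
```

In ksqlDB, the state store and window expiry are managed by the engine; this sketch only shows the per-customer counting logic.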

Kafka and Machine Learning with TensorFlow for real-time scoring for fraud detection

A KSQL UDF (user-defined function) embeds an analytic model trained with TensorFlow for real-time fraud prevention:

Fraud Detection with Apache Kafka, KSQL and Machine Learning using TensorFlow
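Conceptually, the UDF receives the features of one event and returns a score between 0 and 1. The sketch below uses an invented logistic formula as a stand-in for the trained TensorFlow model the article describes; feature names and coefficients are not from the original example:

```python
import math

def fraud_score(amount: float, merchant_risk: float, velocity: float) -> float:
    """Toy stand-in for a trained model. In the setup described above, a KSQL
    UDF would instead load and invoke a TensorFlow model; the coefficients
    here are invented for illustration only."""
    z = 0.0001 * amount + 2.0 * merchant_risk + 0.5 * velocity - 3.0
    return 1.0 / (1.0 + math.exp(-z))  # map to a 0..1 probability

low = fraud_score(amount=50, merchant_risk=0.1, velocity=1)
high = fraud_score(amount=20_000, merchant_risk=0.9, velocity=8)
```

The benefit of the UDF approach is that the model is applied inline as each event flows through the stream, with no remote scoring call in the hot path.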

Case studies across industries

Several case studies exist for fraud detection with Kafka. It is usually combined with stream processing technologies, such as Kafka Streams, KSQL, and Apache Flink. Here are a few real-world deployments across industries, including financial services, gaming, and mobility services:

Paypal processes billions of messages with Kafka for fraud detection.

Capital One looks at events as running its entire business (powered by Confluent), where stream processing prevents $150 of fraud per customer on average per year by catching personally identifiable information (PII) violations in in-flight transactions.

ING Bank started many years ago by implementing real-time fraud detection with Kafka, Flink, and embedded analytic models.

Grab is a mobility service in Asia that leverages fully managed Confluent Cloud, Kafka Streams, and ML for stateful stream processing in its internal GrabDefence SaaS service.

Kakao Games, a South Korean gaming company, uses data streaming to detect and act on anomalies with 300+ patterns through KSQL.

Let’s explore the latter case study in more detail.

Deep dive into fraud prevention with Kafka and KSQL in mobile gaming

Kakao Games is a South Korea-based global video game publisher specializing in games across various genres for PC, mobile, and VR platforms. The company presented at Current 2022 – The Next Generation of Kafka Summit in Austin, Texas.

Here is a detailed summary of their compelling use case and architecture for fraud detection with Kafka and KSQL.

Use case: Detect malicious behavior by gamers in real-time

The challenge is evident when you understand the company’s history: Kakao Games publishes many outsourced games purchased from third-party game studios. Each game produces its own logs with a unique structure and message format. Reliable real-time data integration at scale is required as a foundation for analytical business processes like fraud detection.

The goal is to analyze game logs and telemetry data in real-time. This capability is critical for preventing and remediating threats or suspicious actions from users.

Architecture: Change data capture and streaming analytics for fraud prevention

The Confluent-powered event streaming platform supports game log standardization. ksqlDB analyzes the incoming telemetry data for in-game abuse and anomaly detection.

Gaming Telemetry Analytics with CDC, KSQL and Data Lake at Kakao Games
Source: Kakao Games (Current 2022 in Austin, Texas)

Implementation: SQL recipes for data streaming with KSQL

Kakao Games detects anomalies and prevents fraud with 300+ patterns through KSQL. Use cases include bonus abuse, multiple account usage, account takeover, chargeback fraud, and affiliate fraud.

Here are a few code examples written in SQL using KSQL:

SQL recipes for fraud detection with Apache Kafka and KSQL at Kakao Games
Source: Kakao Games (Current 2022 in Austin, Texas)

Results: Reduced risk and improved customer experience

Kakao Games can do real-time data tracking and analysis at scale. Business benefits are faster time to market, increased active users, and more revenue thanks to a better gaming experience.

Fraud detection only works in real-time

Ingesting data with Kafka into a data warehouse or a data lake is only part of a good enterprise architecture. Tools like Apache Spark, Databricks, Snowflake, or Google BigQuery enable finding insights within historical data. But real-time fraud prevention is only possible if you act while the data is in motion. Otherwise, the fraud already happened when you detect it.

Stream processing provides a scalable and reliable infrastructure for real-time fraud prevention. The choice of the right technology is essential. However, all major frameworks, like Kafka Streams, KSQL, or Apache Flink, are very good. Hence, the case studies of Paypal, Capital One, ING Bank, Grab, and Kakao Games look different. Still, they share the same foundation: data streaming powered by the de facto standard Apache Kafka to reduce risk, increase revenue, and improve customer experience.

If you want to learn more about streaming analytics with the Kafka ecosystem, check out how Apache Kafka helps in cybersecurity to create situational awareness and threat intelligence and how to learn from a concrete fraud detection example with Apache Kafka in the crypto and NFT space.

How do you leverage data streaming for fraud prevention and detection? What does your architecture look like? What technologies do you combine? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Fraud Detection with Apache Kafka, KSQL and Apache Flink appeared first on Kai Waehner.

Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk)
https://www.kai-waehner.de/blog/2022/04/06/disaster-recovery-kafka-across-edge-hybrid-cloud-qcon-talk/
Wed, 06 Apr 2022

I spoke at QCon London in April 2022 about building disaster recovery and resilient real-time enterprise architectures with Apache Kafka. This blog post summarizes the use cases, architectures, and real-world examples. The slide deck and video recording of the presentation are included as well.

What is QCon?

QCon is a leading software development conference that has been held across the globe for 16 years. It provides a realistic look at what is trending in tech. The QCon events are organized by InfoQ, a well-known website for professional software development with over two million unique visitors per month.

QCon in 2022 uncovers emerging software trends and practices. Developers and architects learn how to solve complex engineering challenges without the product pitches.

There is no Call for Papers (CfP) for QCon. The organizers invite trusted speakers to talk about trends, best practices, and real-world stories. This makes QCon so strong and respected in the software development community.

QCon London 2022

Disaster Recovery and Resiliency with Apache Kafka

Apache Kafka is the de facto data streaming platform for analytical AND transactional workloads. Multiple options exist to design Kafka for resilient applications. For instance, MirrorMaker 2 and Confluent Replicator enable uni- or bi-directional real-time replication between independent Kafka clusters in different data centers or clouds.

Cluster Linking is a more advanced and straightforward option from Confluent leveraging the native Kafka protocol instead of additional infrastructure and complexity using Kafka Connect (like MirrorMaker 2 and Replicator).

Stretching a single Kafka cluster across multiple regions is the best option to guarantee zero downtime and seamless failover in the case of a disaster. However, such a deployment is hard to operate and is only consistent, stable, and mission-critical across long distances with enhanced add-ons to open-source Kafka:

Disaster Recovery and Resiliency across Multi Region with Apache Kafka
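MirrorMaker 2, mentioned above, is configured via a properties file that declares the clusters and the replication flows between them. A minimal, illustrative sketch for uni-directional replication to a DR site (hostnames and cluster names are placeholders):

```properties
# connect-mirror-maker.properties -- illustrative values only
clusters = primary, dr
primary.bootstrap.servers = primary-broker:9092
dr.bootstrap.servers = dr-broker:9092

# replicate everything uni-directionally from primary to the DR site
primary->dr.enabled = true
primary->dr.topics = .*
dr->primary.enabled = false
```

Note that MirrorMaker 2 renames replicated topics with a cluster prefix by default, which is one of the operational differences compared to Cluster Linking's offset-preserving replication.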

QCon Presentation: Disaster Recovery with Apache Kafka

In my QCon talk, I intentionally showed the broad spectrum of real-world success stories across industries for data streaming with Apache Kafka from companies such as BMW, JPMorgan Chase, Robinhood, Royal Caribbean, and Devon Energy.

Best practices explored how to build resilient enterprise architectures with disaster recovery, keeping RPO (Recovery Point Objective) and RTO (Recovery Time Objective) in mind. The audience learned how to get SLAs and requirements for downtime and data loss right.

The examples looked at serverless cloud offerings integrating to the IoT edge, hybrid retail architectures, and the disconnected edge in military scenarios.

The agenda looks like this:

  1. Resilient enterprise architectures
  2. Real-time data streaming with the Apache Kafka ecosystem
  3. Cloud-first and serverless Industrial IoT in automotive
  4. Multi-region infrastructure for core banking
  5. Hybrid cloud for customer experiences in retail
  6. Disconnected edge for safety and security in the public sector

Slide Deck from QCon Talk:

Here is the slide deck of my presentation from QCon London 2022:

We also had a great panel that discussed lessons learned from building resilient applications on the code and infrastructure level, plus the organizational challenges and best practices:

QCon Panel about Resilient Architectures

Video Recording from QCon Talk:

With the risk of Covid in mind, InfoQ decided not to record QCon sessions live.

Kai Waehner speaking at QCon London April 2022 about Resiliency with Apache Kafka

Instead, a pre-recorded video had to be submitted by the speakers. The video recording is already available for QCon attendees (no matter if on-site in London or at the QCon Plus virtual event):

Disaster Recovery at the Edge and in Hybrid Data Streaming Architectures with Apache Kafka (QCon Talk)

QCon makes conference talks available for free some time after the event. I will update this post with the free link as soon as it is available.

Disaster Recovery with Apache Kafka across all Industries

I hope you enjoyed the slides and video on this exciting topic. Hybrid and global Kafka infrastructures for disaster recovery and other use cases are the norm, not exceptions.

Real-time data beats slow data. That is true in almost every use case. Hence, data streaming with the de facto standard Apache Kafka gets adopted more and more across all industries.

How do you leverage data streaming with Apache Kafka for building resilient applications and enterprise architectures? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Disaster Recovery with Kafka across the Edge and Hybrid Cloud (QCon Talk) appeared first on Kai Waehner.

Streaming ETL with Apache Kafka in the Healthcare Industry
https://www.kai-waehner.de/blog/2022/04/01/streaming-etl-with-apache-kafka-healthcare-pharma-industry/
Fri, 01 Apr 2022

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part three: Streaming ETL. Examples include Babylon Health and Bayer.

Streaming ETL with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains. Use cases often overlap. I tried to categorize a few real-world deployments into different technical scenarios and added a few real-world examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Streaming ETL with Apache Kafka

Streaming ETL is similar to concepts you might know from traditional ETL tools. I have already explored how data streaming with Kafka differs from data integration tools and iPaaS cloud services. The critical difference is that you leverage a single platform for data integration and processing at scale in real-time. There is no need to combine several platforms to achieve this. The result is a Kappa architecture that enables real-time but also batch workloads with a single integration architecture.

Streaming ETL with Apache Kafka Streams Connect ksqlDB

Streaming ETL with Kafka combines different components and features:

  • Kafka Connect as Kafka-native integration framework
  • Kafka Connect source and sink connectors to consume and produce data from/to any other database, application, or API
  • Single Message Transform (SMT) – an optional Kafka Connect feature – to process (filter, change, remove, etc.) incoming or outgoing messages within the connector deployment
  • Kafka Streams or ksqlDB for continuous data processing in real-time at scale for stateless or stateful ETL jobs
  • Data governance via schema management, enforcement and versioning using the Schema Registry
  • Security and access control using features like role-based access control, audit logs, and end-to-end encryption
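As an illustration of the SMT bullet above, Kafka Connect ships a built-in `MaskField` transform that blanks out sensitive fields inside the connector deployment, before the data ever reaches a topic or sink. A connector config fragment might look like this (the transform alias and field names are hypothetical):

```json
{
  "transforms": "maskPii",
  "transforms.maskPii.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.maskPii.fields": "credit_card_number,ssn"
}
```

SMTs are meant for simple per-message operations like this; anything stateful or involving joins belongs in Kafka Streams or ksqlDB instead.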

In the cloud, you can leverage a serverless Kafka offering for the whole Streaming ETL pipeline. Confluent Cloud fully manages Kafka’s end-to-end infrastructure, including connectors, ksqlDB workloads, data governance, and security.

One last general note: Don’t Design for Data at Rest to Reverse it! Learn more here: “When to Use Reverse ETL and when it is an Anti-Pattern“. Instead, use real-time Streaming ETL for Data in Motion and the Kappa architecture from scratch.

Let’s look at a few real-world deployments in the healthcare sector.

Babylon Health – PII and GDPR-compliant Security

Babylon Health is a digital-first health service provider and value-based care company that combines an artificial intelligence-powered platform with virtual clinical operations for patients. Patients are connected with health care professionals through its web and mobile application.

Babylon’s mission is to put an accessible and affordable health service in the hands of every person on earth. For that mission, Babylon built an agile microservice architecture with the Kafka ecosystem:

Kafka for Streaming ETL at Babylon Health

Here are the “wonders of working” in Healthcare for Babylon (= reasons to choose Kafka):

  • Real-time data processing
  • Replayability of historical information
  • Order matters and is ensured with guaranteed ordering
  • GDPR and data ownership for PII compliant security
  • Data governance via the schema registry to provide true decoupling and access via many programming languages like Java, Python, and Ruby

Bayer – Data Integration and Processing in R&D

Bayer AG is a German multinational pharmaceutical and life sciences company and one of the largest pharmaceutical companies in the world. They leverage Kafka in various use cases and business domains.

The following scenario is from the research and development department of the pharma business unit. Their focus areas are cardiovascular diseases, oncology, and women’s health. The division employs over 7,500 people in R&D and spends over 2.75 billion euros on R&D.

The use case Bayer presented at a recent Kafka Summit is about analyzing clinical trials, patents, reports, news, and literature leveraging the Kafka ecosystem. The R&D team processes 250 Million documents from 30+ individual data sources. The data includes 7 TB of raw text-rich data with daily updates, additions, and deletions. Algorithms and data evolve. Bayer needs to completely reprocess the data regularly. Various document streams with different formats and schemas flow through several text processing and enrichment steps.

Research and Development from Molecules to Medicine at Bayer

Scalable, reliable Kafka pipelines with Kafka Streams (Java) and Faust (Python) replaced custom, error-prone, non-scalable scripts. Schemas are used as the data interface to ensure data governance. Avro is the first-class citizen data format to enable compression and better throughput.

Streaming ETL Pipeline with Apache Kafka at Bayer

The true decoupling of Kafka in conjunction with the Schema Registry guarantees interoperability among different components and technologies (Java, Python, commercial tools, open-source, scientific, proprietary).

Streaming ETL with Kafka for Real-Time Data Integration at any Scale

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka Ecosystem for Streaming ETL. Real-world deployments from Babylon Health and Bayer showed how enterprises successfully deploy Kafka for different enterprise architecture use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Streaming ETL with Apache Kafka in the Healthcare Industry appeared first on Kai Waehner.

Kafka for Cybersecurity (Part 2 of 6) – Cyber Situational Awareness
https://www.kai-waehner.de/blog/2021/07/08/kafka-cybersecurity-part-2-of-6-cyber-situational-awareness-real-time-scalability/
Thu, 08 Jul 2021

Apache Kafka became the de facto standard for processing data in motion across enterprises and industries. Cybersecurity is a key success factor across all use cases. Kafka is not just used as a backbone and source of truth for data. It also monitors, correlates, and proactively acts on events from various real-time and batch data sources to detect anomalies and respond to incidents. This blog series explores use cases and architectures for Kafka in the cybersecurity space, including situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization. This post is part two: Cyber Situational Awareness.

Cyber Situational Awareness with Apache Kafka

Blog series: Apache Kafka for Cybersecurity

This blog series explores why security features such as RBAC, encryption, and audit logs are only the foundation of a secure event streaming infrastructure. Learn about use cases, architectures, and reference deployments for Kafka in the cybersecurity space:

Subscribe to my newsletter to get updates immediately after publication. I will also update the above list with direct links to this blog series’ posts as soon as they are published.

The Four Stages of an Adaptive Security Architecture

Gartner defines four stages of adaptive security architecture to prevent, detect, respond and predict cybersecurity incidents:

The Four Stages of an Adaptive Security Architecture by Gartner

Continuous monitoring and analytics are the keys to building a successful proactive cybersecurity solution. It should be obvious: continuous monitoring and analytics require a scalable real-time infrastructure. Batch processes over data at rest, i.e., data stored in a database, data warehouse, or data lake, cannot monitor data continuously in real-time.

Situational Awareness

“Situation awareness is the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future.” Source: Endsley, M. R. SAGAT: A methodology for the measurement of situation awareness (NOR DOC 87-83). Hawthorne, CA: Northrop Corp.

Here is a theoretical view on situational awareness and the relation between humans and computers:

Human – Computer Interface for Decision Making
Endsley, M. R. Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors, 1995

Cyber Situational Awareness = Continuous Real-Time Analytics

Cyber Situational Awareness is the subset of all situational awareness necessary to support taking action in the cyber domain. It is the mandatory key concept to defend against cybersecurity attacks.

Automation and analytics in real-time are key characteristics:

Situational Awareness and Automated Analytics
Endsley, M. R. Toward a Theory of Situation Awareness in Dynamic Systems. Human Factors, 1995

No matter how good the threat detection algorithms and security platforms are, prevention or at least detection of attacks needs to happen in real-time. And predictions with cutting-edge machine learning models do not help if they are executed in a batch process overnight.

Situational awareness covers various levels beyond the raw network events. It includes all environments, including application data, logs, people, and processes.

I covered the challenges in the first post of this blog series. In summary, cybersecurity experts’ key challenge is finding the needle(s) in the haystack. The haystack is typically huge, i.e., massive volumes of data. Often, it is not just one haystack but many. Hence, a key task is to reduce false positives.

Situational awareness is not just about viewing a dashboard but understanding what’s going on in real-time. Situational awareness finds the relevant data to create critical (rare) alerts automatically. No human can handle the huge volumes of data.

Situational Awareness in Motion with Kafka

The Kafka ecosystem provides the components to correlate massive volumes of data in real-time, providing situational awareness across the enterprise and finding all the needles in the haystack:

The Confluent Curation Fabric for Cybersecurity powered by Apache Kafka and KSQL

Event streaming powered by the Kafka ecosystem delivers contextually rich data to reduce false positives:

  • Collect all events from data sources with Kafka Connect
  • Filter event streams with Kafka Connect’s Single Message Transforms (SMT) so that only relevant data gets into the Kafka topic
  • Empower real-time streaming applications with Kafka Streams or ksqlDB to correlate events across various source interfaces
  • Forward priority events to other systems such as the SIEM/SOAR with Kafka Connect or any other Kafka client (Java, C, C++, .NET, Go, JavaScript, HTTP via REST Proxy, etc.)

Example: Situational Awareness with Kafka and SIEM/SOAR

SIEM/SOAR modernization is its own blog post of this series. But the following picture depicts how Kafka enables true decoupling between applications in a domain-driven design (DDD):

Deliver Contextually Rich Data To Reduce False Positives

A traditional data store like a data lake is NOT the right spot for implementing situational awareness as it is data at rest. Data at rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) require this approach. Real-time beats slow data. Hence, event streaming with the de facto standard Apache Kafka is the right fit for situational awareness. 

Event streaming and data lake technologies are complementary, not competitive. The blog post “Serverless Kafka in a Cloud-native Data Lake Architecture” explores this discussion in much more detail by looking at AWS’ lake house strategy and its relation to event streaming.

The Data

Situational awareness requires data. A lot of data. Across various interfaces and communication paradigms. A few examples:

  • Text files (TXT)
  • Firewalls and network devices
  • Binary files
  • Antivirus
  • Databases
  • Access
  • APIs
  • Audit logs
  • Network flows
  • Intrusion detection
  • Syslog
  • And many more…

Let’s look at the three steps of implementing situational awareness: Data producers, data normalization and enrichment, and data consumers.

Data Producers

Data comes from various sources. This includes real-time systems, batch systems, files, and much more. The data includes high-volume logs (including NetFlow and, indirectly, PCAP) and low-volume transactional events:

Data Producers

Data Normalization and Enrichment

The key success factor to implementing situational awareness is data correlation in real-time at scale. This includes data normalization and processing such as filter, aggregate, transform, etc.:

Data Normalization and Enrichment for Situational Awareness with Kafka
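The normalization step can be sketched as a mapping from source-specific field names onto one shared schema. The source names and field mappings below are invented for illustration; in a real deployment this logic would run in Kafka Streams or ksqlDB with schemas enforced by the Schema Registry:

```python
def normalize(record: dict, source: str) -> dict:
    """Map source-specific field names onto a common event schema.
    The source formats and mappings here are hypothetical examples."""
    field_maps = {
        "firewall": {"src": "source_ip", "dst": "dest_ip", "act": "action"},
        "proxy":    {"client": "source_ip", "target": "dest_ip", "verdict": "action"},
    }
    mapping = field_maps[source]
    return {common: record[raw] for raw, common in mapping.items()}

event = normalize({"src": "10.0.0.5", "dst": "203.0.113.9", "act": "deny"}, "firewall")
```

Once every source speaks the same schema, downstream consumers can correlate events without caring which device or vendor produced them.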

With Kafka, end-to-end data integration and continuous stream processing are possible with a single scalable and reliable platform. This is very different from the traditional MQ/ETL/ESB approach. Data governance concepts for enforcing data structures and ensuring data quality are crucial on the client-side and server-side. For this reason, the Schema Registry is a mandatory component in most Kafka architectures.

Data Consumers

A single system cannot implement cyber situational awareness. Different questions, challenges, and problems require different tools. Hence, most Kafka deployments run various Kafka consumers using different communication paradigms and different speeds:

Data Consumers

Some workloads require data correlation in real-time to detect anomalies or even prevent threats as soon as possible. This is where Kafka-native applications come into play. The client technology is flexible depending on the infrastructure, use case, and developer experience. Java, C, C++, and Go are some coding options. Kafka Streams or ksqlDB provide out-of-the-box stream processing capabilities. The latter is the recommended option as it provides many built-in features, such as sliding windows to build stateful aggregations.

A SIEM, SOAR, or data lake is complementary to run other analytics use cases for threat detection, intrusion detection, or reporting. The SIEM/SOAR modernization blog post of this series explores this combination in more detail.

Situational Awareness with Kafka and Sigma

Let’s take a look at a concrete example. A few of my colleagues built a great implementation for cyber situational awareness: Confluent Sigma. So, what is it?

Sigma – An Open Signature Format for Cyber Detection

Sigma is a generic and open signature format that allows you to describe relevant log events in a straightforward manner. The rule format is very flexible, easy to write, and applicable to any log file. The main purpose of the project is to provide a structured form in which cybersecurity engineers can describe their detection methods and make them shareable with others, either within the company or with the wider community.

A few characteristics that describe Sigma:

  • Open-source framework
  • A domain-specific language (DSL)
  • Specify patterns in cyber data
  • Sigma is for log files what Snort is for network traffic, and YARA is for files
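To make the DSL idea concrete, here is a deliberately simplified, hypothetical Sigma-style rule and matcher sketched in Python. Real Sigma rules are written in YAML and support far more field modifiers and condition logic than shown here:

```python
# A simplified, hypothetical Sigma-style rule: a field/value selection plus a
# condition. Real Sigma rules are YAML documents with much richer semantics.
rule = {
    "selection": {"event_type": "dns", "query|endswith": ".xyz"},
    "condition": "selection",
}

def matches(rule, log_event):
    """Return True if the log event satisfies every field in the selection."""
    for key, expected in rule["selection"].items():
        field, _, modifier = key.partition("|")
        value = str(log_event.get(field, ""))
        if modifier == "endswith":
            if not value.endswith(expected):
                return False
        elif value != expected:
            return False
    return True

print(matches(rule, {"event_type": "dns", "query": "malware-c2.xyz"}))  # True
print(matches(rule, {"event_type": "dns", "query": "example.com"}))     # False
```

The point of the format is that the rule dictionary above could be authored once and evaluated by any compatible engine.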

Sigma provides integration with various tools such as ArcSight, QRadar, Splunk, Elasticsearch, Sumo Logic, and many more. However, as you learned in this post, many scenarios for cyber situational awareness require real-time data correlation at scale. That’s where Kafka comes into play. A huge benefit is that you can specify a Sigma signature once and then reuse it across all the mentioned tools.

Confluent Sigma

Confluent Sigma is an open-source project implemented by a few of my colleagues. Kudos to Michael Peacock, William LaForest, and a few more. The project integrates Sigma into Kafka by embedding the Sigma rules into stream processing applications powered by Kafka Streams and ksqlDB:

Confluent Sigma for Situational Awareness powered by Apache Kafka

Situational Awareness with Zeek, Kafka Streams, KSQL, and Sigma

Here is a concrete event streaming architecture for situational awareness:

Confluent Sigma for Kafka powered Cybersecurity and Situational Awareness

A few notes on the above picture:

  • Sigma defines the signature rules
  • Zeek provides incoming IDS log events at high volume in real-time
  • Confluent Platform processes and correlates the data in real-time
  • The stream processing part built with Kafka Streams and ksqlDB includes stateless functions such as filtering and stateful functions such as aggregations
  • The calculated detections get ingested into a Zeek dashboard and other Kafka consumers
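The combination of stateless filtering and stateful aggregation described above can be sketched as follows. This in-memory Python simulation stands in for the Kafka Streams/ksqlDB topology; the Zeek-like event fields and the scan threshold are assumptions for illustration:

```python
def detect_port_scans(zeek_conn_events, threshold=3):
    """Stateless filter (rejected connections only) followed by a stateful
    aggregation (distinct destination ports per source). The event shape and
    threshold are hypothetical; a real topology would keep this state in Kafka."""
    rejected = (e for e in zeek_conn_events if e["conn_state"] == "REJ")
    ports_per_src = {}
    for e in rejected:
        ports_per_src.setdefault(e["src"], set()).add(e["dst_port"])
    return [src for src, ports in ports_per_src.items() if len(ports) >= threshold]

events = [
    {"src": "10.0.0.9", "dst_port": p, "conn_state": "REJ"} for p in (22, 80, 443)
] + [{"src": "10.0.0.5", "dst_port": 443, "conn_state": "SF"}]
print(detect_port_scans(events))  # ['10.0.0.9']
```

In the real architecture, the resulting detections would be produced to an output topic and picked up by the dashboard and other consumers.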

Here is an example of a Sigma Rule for windowing and aggregation of logs:

Sigma Rule with Aggregation

The Kafka Streams topology of the example looks like this:

Sigma Stream Topology with Kafka Streams

My colleagues will host a webinar to demonstrate Confluent Sigma in more detail, including a live demo. I will update and share the on-demand link here as soon as it is available. Some demo code is available on GitHub.

Cisco ThousandEyes Endpoint Agents

Let’s take a look at a concrete Kafka-native example to implement situational awareness at scale in real-time. Cisco ThousandEyes Endpoint Agents is a monitoring tool to gain visibility from any employee to any application over any network. It provides proactive and real-time monitoring of application experience and network connectivity.

The platform leverages the whole Kafka ecosystem for data integration and stream processing:

  • Kafka Streams for stateful network tests
  • Interactive queries for fetching results
  • Kafka Streams for windowed aggregations for alerting use cases
  • Kafka Connect for integration with backend systems such as MySQL, Elastic, MongoDB
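The interactive-queries pattern from the list above — updating aggregates on the streaming side and reading them directly, without a separate database — can be sketched like this. This is a toy in-memory analogue, not ThousandEyes’ actual implementation; all names and fields are invented:

```python
class QueryableStore:
    """Toy analogue of a Kafka Streams state store with interactive queries:
    the streaming side updates aggregates, the query side reads them directly.
    Agent IDs and the latency metric are illustrative only."""

    def __init__(self):
        self._latencies = {}

    def update(self, agent_id, latency_ms):
        # streaming side: called per incoming measurement event
        self._latencies[agent_id] = latency_ms

    def query(self, agent_id):
        # interactive-query side: fetch the current aggregate
        return self._latencies.get(agent_id)

store = QueryableStore()
store.update("agent-eu-1", 42)
store.update("agent-us-1", 97)
print(store.query("agent-eu-1"))  # 42
```

In Kafka Streams, the store would be backed by a changelog topic so the state survives restarts and rebalances.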

ThousandEyes’ tech blog is a great resource to understand the implementation in more detail.

Kafka is a Key Piece to Implement Cyber Situational Awareness

Cyber situational awareness is mandatory to defend against cybersecurity attacks. A successful implementation requires continuous real-time analytics at scale. This is why the Kafka ecosystem is a perfect fit.

The Confluent Sigma implementation shows how to build a flexible but scalable and reliable architecture for realizing situational awareness. Event streaming is a key piece of the puzzle.

However, it is not a replacement for other tools such as Zeek for network analysis or Splunk as SIEM. Instead, event streaming complements these technologies and provides a central nervous system that connects and truly decouples these other systems. Additionally, the Kafka ecosystem provides the right tools for real-time stream processing.

How did you implement cyber situational awareness? Is it already real-time and scalable? What technologies and architectures do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka for Cybersecurity (Part 2 of 6) – Cyber Situational Awareness appeared first on Kai Waehner.

]]>
Apache Kafka in the Insurance Industry https://www.kai-waehner.de/blog/2021/06/07/apache-kafka-insurance-industry-use-cases-architectures-event-streaming/ Mon, 07 Jun 2021 12:54:20 +0000 https://www.kai-waehner.de/?p=3450 The rise of data in motion in the insurance industry is visible across all lines of business including life, healthcare, travel, vehicle, and others. Apache Kafka changes how enterprises rethink data. This blog post explores use cases and architectures for insurance-related event streaming. Real-world examples from Generali, Centene, Humana, and Tesla show innovative data integration and stream processing in real-time.  

The post Apache Kafka in the Insurance Industry appeared first on Kai Waehner.

]]>
The rise of data in motion in the insurance industry is visible across all lines of business, including life, healthcare, travel, vehicle, and others. Apache Kafka changes how enterprises rethink data. This blog post explores use cases and architectures for event streaming. Real-world examples from Generali, Centene, Humana, and Tesla show innovative insurance-related data integration and stream processing in real-time.

Apache Kafka in the Insurance Industry

Digital Transformation in the Insurance Industry

Most insurance companies have similar challenges:

  • Challenging market environments
  • Stagnating economy
  • Regulatory pressure
  • Changes in customer expectations
  • Proprietary and monolithic legacy applications
  • Emerging competition from innovative insurtechs
  • Emerging competition from other verticals that add insurance products

Only a good transformation strategy guarantees a successful future for traditional insurance companies. Nobody wants to become the next Nokia (mobile phones), Kodak (photo cameras), or Blockbuster (video rental). If you fail to innovate in time, you are done.

Real-time beats slow data. Automation beats manual processes. The combination of these two game changers creates completely new business models in the insurance industry. Some examples:

  • Claims processing including review, investigation, adjustment, remittance or denial of the claim
  • Claim fraud detection by leveraging analytic models trained with historical data
  • Omnichannel customer interactions including a self-service portal and automated tools like NLP-powered chatbots
  • Risk prediction based on lab testing, biometric data, claims data, patient-generated health data (depending on the laws of a specific country)

These are just a few examples.

The shift to real-time data processing and automation is key for many other use cases, too. Machine learning and deep learning enable the automation of many manual and error-prone processes like document and text processing.
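As a toy illustration of real-time claim scoring, the following sketch stands in for an analytic model trained on historical data. The features, weights, and threshold are entirely invented; a real deployment would apply a trained model to each claim event in the stream:

```python
def fraud_score(claim):
    """Invented stand-in for a trained model scoring a claim event in real
    time. Features and weights are for illustration only."""
    score = 0.0
    if claim["amount"] > 10_000:
        score += 0.4  # unusually large claim
    if claim["days_since_policy_start"] < 30:
        score += 0.3  # claim shortly after signing the policy
    if claim["previous_claims"] >= 3:
        score += 0.3  # history of frequent claims
    return score

claim = {"amount": 15_000, "days_since_policy_start": 12, "previous_claims": 1}
print(fraud_score(claim) >= 0.5)  # True -> route to manual review
```

The streaming aspect is what matters: the score is computed the moment the claim event arrives, not in a nightly batch.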

The Need for Brownfield Integration

Traditional insurance companies usually (have to) start with brownfield integration before building new use cases. The integration of legacy systems with modern application infrastructures and the replication between data centers and public or private cloud-native infrastructures are a key piece of the puzzle.

Common integration scenarios use traditional middleware that is already in place. This includes MQ, ETL, ESB, and API tools. Kafka is complementary to these middleware tools:

Kafka and other Middleware like MQ ETL ESB API in the Enterprise Architecture

More details about this topic are available in the following two posts:

Greenfield Applications at Insurtech Companies

Insurtechs have a huge advantage: They can start greenfield. There is no need to integrate with legacy applications and monolithic architectures. Hence, some traditional insurance companies go the same way. They start from scratch with new applications instead of trying to integrate old and new systems.

This setup has a huge architectural advantage: There is no need for traditional middleware, as only modern protocols and APIs need to be integrated. No monolithic and proprietary interfaces such as COBOL, EDI, or SAP BAPI/iDoc exist in this scenario. Kafka makes new applications agile, scalable, and flexible with open interfaces and real-time capabilities.

Here is an example of an event streaming architecture for claim processing and fraud detection with the Kafka ecosystem:

Real-time decision making for claim processing and fraud detection in insurance with Apache Kafka

Real-World Deployments of Kafka in the Insurance Industry

Let’s take a look at a few examples of real-world deployments of Kafka in the insurance industry.

Generali – Kafka as Integration Platform

Generali is one of the top ten largest insurance companies in the world. The digital transformation of Generali Switzerland started with Confluent as a strategic integration platform. They began their journey by integrating hundreds of legacy systems like relational databases. Change Data Capture (CDC) pushes changes into Kafka in real-time. Kafka is the central nervous system and integration platform for the data. Other real-time and batch applications consume the events.
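Conceptually, such a CDC pipeline turns row-level database changes into events that downstream consumers understand. The sketch below is loosely modeled on a Debezium-style change envelope, heavily simplified — the table, fields, and mapping are hypothetical, and no Kafka broker is involved:

```python
def to_domain_event(change):
    """Map a CDC change event (loosely modeled on a Debezium-style envelope,
    simplified here) to a domain event other consumers can understand."""
    op = {"c": "created", "u": "updated", "d": "deleted"}[change["op"]]
    row = change["after"] if change["op"] != "d" else change["before"]
    return {"entity": change["table"], "action": op, "data": row}

change = {"table": "policy", "op": "u",
          "before": {"id": 7, "status": "draft"},
          "after": {"id": 7, "status": "active"}}
print(to_domain_event(change))
# {'entity': 'policy', 'action': 'updated', 'data': {'id': 7, 'status': 'active'}}
```

The transformed event would then be published to a Kafka topic so every consuming application sees the change in real-time without touching the source database.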

From here, other applications consume the data for further processing. All applications are decoupled from each other. This is one of the unique benefits of Kafka compared to other messaging systems. Real decoupling and domain-driven design (DDD) are not possible with traditional MQ systems or SOAP / REST web services.

Design Principles of Generali’s Cloud-Native Architecture

The key design principles for the next-generation platform at Generali include agility, scalability, cloud-native, governance, data, and event processing. Hence, Generali’s architecture is powered by a cloud-native infrastructure leveraging Kubernetes and Apache Kafka:

Event Streaming and Integration Platform Generali powered by Apache Kafka, Kubernetes and Change Data Capture

The following integration flow shows the scalable microservice architecture of Generali. The streaming ETL process includes data integration and data processing decoupled environments:

Generali Kafka Meta Model

Centene – Integration and Data Processing at Scale in Real-Time

Centene is the largest Medicaid and Medicare Managed Care Provider in the US. Their mission statement is “transforming the health of the community, one person at a time”. The healthcare insurer acts as an intermediary for both government-sponsored and privately insured health care programs.

Centene’s key challenge is growth. Many mergers and acquisitions require a scalable and reliable data integration platform. Centene chose Kafka due to the following capabilities:

  • highly scalable
  • high autonomy and decoupling
  • high availability and data resiliency
  • real-time data transfer
  • complex stream processing

Centene’s architecture uses Kafka for data integration and orchestration. Legacy databases, MongoDB, and other applications and APIs leverage the data in real-time, batch, and request-response:

Centene – Integration and Data Processing at Scale in Real-Time in the Insurance Industry with Kafka

Swiss Mobiliar – Decoupling and Orchestration

Swiss Mobiliar (Schweizerische Mobiliar, aka Die Mobiliar) is the oldest private insurer in Switzerland.

Event Streaming with Kafka supports various use cases at Swiss Mobiliar:

  • Orchestrator application to track the state of a billing process
  • Kafka as database and Kafka Streams for data processing
  • Complex stateful aggregations across contracts and re-calculations
  • Continuous monitoring in real-time

Their architecture shows the decoupling of applications and orchestration of events:

Swiss Mobiliar – Decoupling and Orchestration with Kafka

Also, check out the on-demand webinar with Mobiliar and Spoud to learn more about their Kafka usage.

Humana – Real-Time Integration and Analytics

Humana Inc. is a for-profit American health insurance company. In 2020, the company ranked 52nd on the Fortune 500 list.

Humana leverages Kafka for real-time integration and analytics. They built an interoperability platform to transition from an insurance company with elements of health to truly a health company with elements of insurance.

Here are the key characteristics of their Kafka-based platform:

  • Consumer-centric
  • Health plan agnostic
  • Provider agnostic
  • Cloud resilient and elastic
  • Event-driven and real-time

Kafka integrates conversations between the users and the AI platform powered by IBM Watson. The platform captures conversational flows and processes them with natural language processing (NLP) – a deep learning concept.

Some benefits of the platform:

  • Adoption of open standards
  • Standardized integration partners
  • In-workflow integration
  • Event-driven for real-time patient interactions
  • Highly scalable

freeyou – Stateful Streaming Analytics

freeyou is an insurtech for vehicle insurance. Streaming analytics for real-time price adjustments powered by Kafka and ksqlDB enable new business models. Their marketing slogan shows how they innovate and differentiate from traditional competitors:

“We make insurance simple. With our car insurance, we make sure that you stay mobile in everyday life – always and everywhere. You can take out the policy online in just a few minutes and manage it easily in your freeyou customer account. And if something should happen to your vehicle, we’ll take care of it quickly and easily.”

A key piece of freeyou’s strategy is a great user experience and automatic price adjustments in real-time in the backend. Obviously, Kafka and its stream processing ecosystem are a perfect fit here.

As discussed above, the huge advantage of an insurtech is the possibility to start from a greenfield. No surprise that freeyou’s architecture leverages cutting-edge design and technology. Kafka and ksqlDB enable streaming analytics within the pricing engine, recalculation modules, and other applications:

Kafka and ksqlDB at freeyou car insurance for real time price adjustments
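freeyou’s actual pricing logic is not public, so the following is an invented toy recalculation that merely illustrates the pattern of adjusting a premium per incoming event:

```python
def adjust_price(base_premium, telemetry):
    """Invented toy pricing rule: discount for a low claim frequency observed
    in the stream, surcharge otherwise. Not freeyou's real logic."""
    rate = telemetry["claims_last_year"] / max(telemetry["policies"], 1)
    factor = 0.9 if rate < 0.05 else 1.1
    return round(base_premium * factor, 2)

print(adjust_price(500.0, {"claims_last_year": 2, "policies": 100}))  # 450.0
```

In the streaming architecture, such a function would run continuously over aggregated events so quotes reflect the latest data rather than last month’s batch.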

 

Tesla – Carmaker and Utility Company, now also Car Insurer

Everybody knows: Tesla is an automotive company that sells cars, maintenance, and software upgrades.

More and more people know: Tesla is a utility company that sells energy infrastructure, solar panels, and smart home integration.

Almost nobody knows: Tesla is a car insurer for their car fleet (limited to a few regions in the early phase). That is the obvious next step if you already collect all the telemetry data from all your cars on the street.

Tesla has built a Kafka-based data platform infrastructure “to support millions of devices and trillions of data points per day”. Tesla showed an interesting history and evolution of their Kafka usage at a Kafka Summit in 2019:

History of Kafka Usage at Tesla

Tesla’s infrastructure heavily relies on Kafka.

There is no public information about Tesla using Kafka for their specific insurance applications. But at a minimum, the data collection from the cars and parts of the data processing rely on Kafka. Hence, I thought this is a great example to think about innovation in car insurance.

Tesla: “Much Better Feedback Loop”

Elon Musk made clear: “We have a much better feedback loop” instead of being statistical like other insurers. This is a key differentiator!

There is no doubt that many vehicle insurers will use fleet data to calculate insurance quotes and provide better insurance services. For sure, some traditional insurers will partner with vehicle manufacturers and fleet providers. This is similar to smart city development, where several enterprises partner to build new innovative use cases.

Connected vehicles and V2X (Vehicle to X) integrations are the starting point for many new business models. No surprise: Kafka plays a key role in the connected vehicles space (not just for Tesla).

Many benefits are created by a real-time integration pipeline:

  • Shift from human experts to automation driven by big data and machine learning
  • Real-time telematics data from all its drivers’ behavior and the performance of its vehicle technology (cameras, sensors, …)
  • Better risk estimation of accidents and repair costs of vehicles
  • Evaluation of risk reduction through new technologies (autopilot, stability control, anti-theft systems, bullet-resistant steel)

For these reasons, event streaming should be a strategic component of any next-generation insurance platform.

Slide Deck: Kafka in the Insurance Industry

The following slide deck covers the above discussion in more detail:

Kafka Changes How to Think Insurance

Apache Kafka changes how enterprises rethink data in the insurance industry. This includes brownfield data integration scenarios and greenfield cutting-edge applications. The success stories from traditional insurance companies such as Generali and insurtechs such as freeyou prove that Kafka is the right choice everywhere.

Kafka and its ecosystem enable data processing at scale in real-time. Real decoupling allows the integration between monolith legacy systems and modern cloud-native infrastructure. Kafka runs everywhere, from edge deployments to multi-cloud scenarios.

What are your experiences and plans for low latency use cases? What use case and architecture did you implement? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Insurance Industry appeared first on Kai Waehner.

]]>
Apache Kafka in the Airline, Aviation and Travel Industry https://www.kai-waehner.de/blog/2021/02/19/apache-kafka-aviation-airline-aerospaceindustry-airport-gds-loyalty-customer/ Fri, 19 Feb 2021 11:21:55 +0000 https://www.kai-waehner.de/?p=3170 Aviation and travel are notoriously vulnerable to social, economic, and political events, as well as the ever-changing expectations of consumers. Coronavirus is just a piece of the challenge. This post explores use cases, architectures, and references for Apache Kafka in the aviation industry, including airline, airports, global distribution systems (GDS), aircraft manufacturers, and more.

The post Apache Kafka in the Airline, Aviation and Travel Industry appeared first on Kai Waehner.

]]>
Aviation and travel are notoriously vulnerable to social, economic, and political events, as well as the ever-changing expectations of consumers. Coronavirus is just a piece of the challenge. This post explores use cases, architectures, and references for Apache Kafka in the aviation industry, including airline, airports, global distribution systems (GDS), aircraft manufacturers, and more. Kafka was relevant pre-covid and will become even more important post-covid.

Apache Kafka in Aviation Industry including Airlines Airports Manufacturing Retail GDS

Airlines and Aviation are Changing – Beyond Covid-19!

Aviation and travel are notoriously vulnerable to social, economic, and political events. Recent months have been particularly testing due to the global Covid-19 pandemic. But the upcoming change is coming not just because of the Coronavirus but because of the ever-changing expectations of consumers.

Right now is the time to lay the ground for the future of the aviation and travel industry.

Consumer behaviors and expectations are changing. Whole industries are being disrupted, and the aviation industry is not immune to these sweeping forces of change.

The future business of airlines and airports will be digitally integrated into the ecosystem of partners and suppliers. Companies will provide more personalized customer experiences and be enabled by a new suite of the latest technologies, including automation, robotics, and biometrics.

For instance, new customer notification mobile apps provide customers with relevant and timely updates throughout their journeys. Other major improvements support the front-line service teams at various touchpoints throughout the airports and the end-to-end travel journey.

Apache Kafka in the Airline Industry

Apache Kafka is the de facto standard for event streaming use cases across industries. Many use cases can be applied to the aviation industry, too. Concepts like payment, customer experience, and manufacturing differ in detail. But in the end, it is about integrating systems and processing data in real-time at scale.

For instance, omnichannel retail with Apache Kafka applies to airlines, airports, global distribution systems (GDS), and other aviation industry sectors.

However, it is always easier to learn from other companies in the same industry. Therefore, the following explores a few public Apache Kafka success stories from the aviation industry.

Lufthansa – Kafka Unified Streaming Cloud Operations

Lufthansa talks about the benefits of using Apache Kafka instead of traditional messaging queues (TIBCO EMS, IBM MQ) for data processing.

The journey started with the question if Lufthansa can do data processing better, cheaper, and faster.

Lufthansa’s Kafka architecture does not have any surprises. A key lesson learned from many companies: The real added value is created when you leverage Kafka not just for messaging, but its entire ecosystem, including different clients/proxies, connectors, stream processing, and data governance.

The result at Lufthansa: A better, cheaper, and faster infrastructure for real-time data processing at scale.

My two favorite statements (once again: not really a surprise, as I see the same at many other customers):

  • “Scaling Kafka is really inexpensive”
  • “Kafka adopted and integrated within 3 months”

Watch the full talk from Marcos Carballeira Rodríguez from Lufthansa recorded at the Confluent Streaming Days 2020 to see all the architectures and quotes from Lufthansa.

And check out this exciting video recording of Lufthansa discussing their Kafka use cases for middleware modernization and machine learning:

Data Streaming with Apache Kafka at Airlines - Lufthansa Case Study

Singapore Airlines – Predictive Maintenance with Kafka Connect, Kafka Streams, and ksqlDB

Singapore Airlines is an early adopter of KSQL to continuously process sensor data and apply analytic models to the events. They already talked about their Kafka ecosystem usage (including Kafka Connect, Kafka Streams, and KSQL) back in 2018. The use case is predictive maintenance with a scalable real-time infrastructure, as you can see in my summary slide:

Singapore Airlines leveraging Apache Kafka Connect Streams ksqlDB for Predictive Maintenance

Check out the complete slide deck from Singapore Airlines for more details.
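The pattern of applying analytic models to a stream of sensor events can be sketched as below. This is not Singapore Airlines’ model — the sensor fields, the deviation heuristic, and the threshold are invented stand-ins:

```python
def engine_alerts(sensor_readings, limit=3.0):
    """Flag readings whose vibration deviates strongly from the running mean —
    an invented stand-in for the analytic models applied to the sensor stream."""
    alerts, total, n = [], 0.0, 0
    for r in sensor_readings:
        if n >= 2 and abs(r["vibration"] - total / n) > limit:
            alerts.append(r["engine"])  # anomalous reading for this engine
        total += r["vibration"]
        n += 1
    return alerts

readings = [{"engine": "E1", "vibration": 1.0},
            {"engine": "E1", "vibration": 1.2},
            {"engine": "E1", "vibration": 9.5}]
print(engine_alerts(readings))  # ['E1']
```

In a KSQL-based deployment, the equivalent logic would run continuously per engine key, emitting alerts to a topic that maintenance systems consume.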

Air France Hop – Scalable Real-Time Microservices

I really like the Kafka Summit talk title: “Hop! Airlines Jets to Real-Time“. Air France Hop leverages Change Data Capture (CDC) with HVR and Kafka for real-time data processing and integration with legacy monoliths. This is a pretty common pattern to integrate the old and the new software and IT world:

AirFrance Hop leveraging Apache Kafka for Real Time Event Streaming

The complete slide deck and on-demand video recording about this case study are available on the Kafka Summit page.

Amadeus – Real-Time and Batch Log Processing

As I said initially, Kafka is not just relevant for airlines, airports, and aircraft manufacturers. The global distribution system (GDS) from Amadeus is one of the world’s biggest (competing mainly with Sabre). The passenger name record (PNR) is a record in the computer reservation system (CRS) and a crucial part of any GDS vendor’s offering. While many end-users don’t even know about Amadeus, the aviation industry could not survive without them. Their workloads are mission-critical, need to run 24/7 in real-time, and must connect to their partners’ systems (like an airline’s) in a very stable and mature manner!

Amadeus is relying on Apache Kafka for both real-time and batch data processing, as they explain on the official Apache Kafka website:

Amadeus GDS powered by Apache Kafka

Streaming Data Exchange for the Travel Industry

After looking at some examples, let’s now cover one more key topic: Data integration and correlation between partners in the aviation industry. Airline, airports, GDS, travel companies, and many other companies need to integrate very well. Obviously, this is already implemented. Otherwise, there is no way to operate flights with passengers and cargo. At least in theory. Honestly, one of the most significant pain points of the travel industry for customers is bad integration across companies. Some examples:

  • Late or (even worse) no notification about a delay or cancellation
  • Issues with the display of available seats or upgrade
  • Broken booking process on the website because of different flight numbers or connecting flights
  • Booking class issues for upgrades or rebookings
  • Display of technical error messages instead of business information (for instance, I can’t count how often I had seen an “IBM WebSphere” error message when I tried to book a flight on the website of my most commonly used airline)
  • The list goes on and on, no matter which airline you pick. That’s my experience as a frequent traveler across all continents and time zones.

There are reasons for these issues. The aviation network is very complex. For instance, the Lufthansa Group sells tickets for all its own brands (like Swiss or Austrian Airlines), plus tickets from Star Alliance partners (such as United or Singapore Airlines). Hence, airlines, airports, GDS, and many partner systems have to work together. 24/7. In real-time. For this reason, more and more companies in the aviation industry rely on Kafka internally.

But that’s only half of the story… How do you integrate with partners?

Event Streaming vs. REST / HTTP APIs

I explored the discussion around event streaming with Kafka vs. RESTful web services with HTTP in much more detail in another article: “Comparison: Apache Kafka vs. API Management / API Gateway tools like Mulesoft or Kong“. In short: Kafka and REST APIs have their trade-offs. Both are complementary and used together in many architectures. API Management is a great add-on for many applications and microservices, no matter if they are built with HTTP or Kafka under the hood.

But one point is clear: If you need a scalable real-time integration with a partner system, then HTTP is not the right choice. You can either pick gRPC as a request-response alternative or use Kafka natively for the integration with partners, as you use it internally already anyway:

Streaming Aviation Data Exchange for Airlines Airports GDS with Apache Kafka

Kafka-native replication between partners works very well, no matter what Kafka vendor and version you and your partner are running. Obviously, the biggest challenge is security (not from a technical but from an organizational perspective). Kafka requires TCP connectivity, and getting approval to open TCP ports to a partner is much harder than for HTTP ports.

But from a technical point of view, streaming replication often makes much more sense. I have seen the first customers implementing integration via tools like Confluent Replicator. I am sure that we will see this pattern much more in the future and with better out-of-the-box tool support from vendors.

Data Integration and Correlation at an Airport with Airline Data using KSQL

So, let’s assume that you have the data streams connected at an airport. No matter if just internal data or also partner data. Data correlation adds the business value. Sönke Liebau from OpenCore presented a great airport demo with Kafka and KSQL at a Kafka Summit.

Let’s take a look at some events at an airport:

 

Events at an Airport

These events exist in various structures and with different technologies and formats. Some data streams arrive in real-time. However, other data sets come from a monolithic mainframe in batch via file integration. Kafka Connect is Kafka-native middleware to implement this integration.

Afterward, all this data needs to be correlated with historical data from a loyalty system or relational database. This is where stream processing comes into play: This concept enables the continuous data correlation in real-time at scale. Kafka-native technologies like Kafka Streams or ksqlDB exist to build streaming ETL pipelines or business applications.

The following example correlates the gate information from the airport with the airline flight information to send a delay notification to the customer who is waiting for the connection flight:

Event Streaming with KSQL at an Airport
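In ksqlDB terms, this correlation is essentially a join between a stream of gate events and a table of flight status. A minimal in-memory Python analogue, with invented flight numbers and field names, looks like this:

```python
# Minimal analogue of the stream-table join behind the delay notification:
# a table of flight status keyed by flight number, joined against a stream of
# gate events. All identifiers are invented for illustration.
flight_status = {"LH123": {"delay_min": 45}, "LH456": {"delay_min": 0}}

def notifications(gate_events):
    for event in gate_events:
        status = flight_status.get(event["flight"])
        if status and status["delay_min"] > 15:
            yield (f"Passenger {event['passenger']}: "
                   f"{event['flight']} delayed {status['delay_min']} min")

gates = [{"flight": "LH123", "passenger": "P42"},
         {"flight": "LH456", "passenger": "P43"}]
print(list(notifications(gates)))
# ['Passenger P42: LH123 delayed 45 min']
```

In a real pipeline, the flight-status table would itself be materialized from a Kafka topic so the join always reflects the latest airline data.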

Tons of use cases exist to leverage event streams from different systems (and partners) in real-time. Some examples from an airport perspective:

  • Location-based services while the customer is walking through the airport and waiting for the flight. Example: Coupons for a restaurant (with many empty seats or food reserves to thrash if not sold during the day)
  • Airline services such as free or points-based discounted lounge entrance (because the lounge tracking system knows that it is almost empty right now anyway)
  • Partner services like notifying the airport hotel that the guest can stay longer in the room because of a long delay of the upcoming flight

The list of opportunities is almost endless. However, most use cases are only possible if all systems are integrated and data is continuously correlated in real-time. If you need some more inspiration, check out the two blogs “Kafka at the Edge in a Smart Retail Store” and “Kafka in a Train for Improved Customer Experience“. All these use cases are a perfect fit for airline, airports, and their partner ecosystem.

Slides – Apache Kafka in the Aviation, Airline and Travel Industry

The following slide deck goes into more detail:

Kafka for Improved Operations and Customer Experience in the Aviation Industry

This post explored various use cases for event streaming with Apache Kafka in the aviation industry. Airline, airports, aerospace, flight safety, manufacturing, GDS, retail, and many more partners rely on Apache Kafka.

No question: Kafka is becoming mainstream in the aviation industry these months. Serverless and consumption-based offerings such as Confluent Cloud boost the adoption even more. A streaming data exchange between partners is the next step I see on the horizon. I am looking forward to Kafka-native interfaces in enterprises’ Open APIs, better support for streaming interfaces in API Management tools, and COTS solutions from software vendors.

What are your experiences and plans for event streaming in the aviation industry? Did you already build applications with Apache Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka in the Airline, Aviation and Travel Industry appeared first on Kai Waehner.

]]>
Confluent Platform 5.4 + Apache Kafka 2.4 (RBAC, Tiered Storage, Multi-Region Clusters, Audit Logs, and more) https://www.kai-waehner.de/blog/2020/04/02/confluent-platform-5-4-apache-kafka-2-4-overview-rbac-tiered-storage-multi-region-clusters-audit-logs/ Thu, 02 Apr 2020 07:46:22 +0000 https://www.kai-waehner.de/?p=2165 Introducing new features in Confluent Platform 5.4 and Apache Kafka 2.4… I just want to share a presentation…

The post Confluent Platform 5.4 + Apache Kafka 2.4 (RBAC, Tiered Storage, Multi-Region Clusters, Audit Logs, and more) appeared first on Kai Waehner.

]]>
Introducing new features in Confluent Platform 5.4 and Apache Kafka 2.4… I just want to share a presentation I did recently as part of the “Confluent Kitchen Tour” in response to the global travel ban and Corona crisis.

Confluent Kitchen Tour 2020

Confluent Platform 5.4 (with Apache Kafka 2.4)

CP 5.4 (based on AK 2.4) brings some awesome new features to build scalable, reliable and secure event streaming deployments:

  • Security: Role-Based Access Control (RBAC) and Structured Audit Logs
  • Resilience: Multi-Region Clusters (MRC)
  • Data Compatibility: Server-side Schema Validation for Schema Registry
  • Management & Monitoring: Control Center enhancements, RBAC management, Replicator monitoring
  • Performance & Elasticity: Tiered Storage (preview)
  • Stream Processing: New ksqlDB features like Pull Queries and Kafka Connect Integration (preview)

Rapid Pace of Innovation for Mission-Critical Event Streaming

Confluent Platform (CP), the commercial offering from #1 Kafka vendor Confluent, gets more powerful and mature with every release (no surprise). CP releases are tightly coupled to the latest Apache Kafka release to leverage bug fixes and new features from the open source community.

Let’s take a look at my favorite new features: RBAC, Multi-Region Stretched Clusters and Tiered Storage for Apache Kafka.

Role Based Access Control (RBAC) for Apache Kafka

Kafka provides Access Control Lists (ACLs). This is a low-level feature and lacks higher-level, scalable configuration options.

Role-Based Access Control (RBAC) provides platform-wide security with fine-grained authorization:

  • Granular control of access permissions
  • Clusters, topics, consumer groups, connectors
  • Efficient management at large scale
  • Delegate authorization management to true resource owners
  • Platform-wide standardization
  • Enforced via GUI, CLI and APIs
  • Enforced across all CP components: Connect, KSQL, Schema Registry, REST Proxy, Control Center and MQTT Proxy
  • Integration with LDAP and Active Directory

Role Based Access Control (RBAC) for Apache Kafka and Confluent Platform

Multi-Region Replication (MRC) for Reliable Stretched Deployments

Stretched clusters are a common deployment option for Kafka. However, this setup is hard to operate and cannot span geographically distant regions.

Multi-Region Clusters (MRC) change the game for disaster recovery for Kafka:

  • Zero downtime and zero data loss for critical Kafka Topics
  • Automated client fail-over
  • Streamlined DR operations
  • Leverages Kafka’s internal replication
  • No separate Connect clusters
  • Single multi-region cluster with high write throughput
  • Asynchronous replication using “Observer” replicas
  • Low bandwidth costs and high read throughput
  • Remote consumers read data locally, directly from Observers

Multi Region Stretched Cluster (MRC) for Apache Kafka and Confluent Platform

Check out “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” to understand the trade-offs between a “normal” stretched cluster and Confluent’s Multi-Region Clusters (MRC).

MRC is really a game-changer for mission-critical Kafka deployments to guarantee zero downtime and zero data loss (also known as RPO=0 and RTO=0).
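As a hedged sketch of how the observer concept above is configured (the regions, rack names and replica counts are illustrative assumptions, not from this post): a multi-region topic in Confluent Platform is created with a replica placement file that distinguishes synchronous replicas from asynchronous observers, roughly like this:

```json
{
  "version": 1,
  "replicas": [
    { "count": 2, "constraints": { "rack": "us-east" } },
    { "count": 2, "constraints": { "rack": "us-central" } }
  ],
  "observers": [
    { "count": 2, "constraints": { "rack": "us-west" } }
  ]
}
```

Such a file would be passed when creating the topic (e.g. via `kafka-topics --create ... --replica-placement placement.json`); consumers in the west region can then read locally from the observers while writes are replicated synchronously only within the east/central racks.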

Tiered Storage for Better Scalability and Long Term Storage

Tiered Storage (preview release) enables cost-effective infinite retention and elastic scalability / data rebalancing for Kafka:

  • Infinite retention
  • Older data is offloaded to inexpensive object storage, accessible at any time
  • Reduced storage costs
  • Storage limitations, like capacity and duration, are effectively uncapped
  • Elastic scalability
  • “Lighter” Kafka brokers enable instantaneous load balancing when scaling up

Tiered Storage for Apache Kafka and Confluent Platform

“Streaming Machine Learning with Tiered Storage and Without a Data Lake” shows a concrete example of leveraging Tiered Storage for real-time analytics and machine learning using TensorFlow.
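To make the cost argument from the bullet list above concrete, here is a back-of-the-envelope sketch. All prices, sizes and the replication factor are illustrative assumptions (not Confluent or cloud list prices):

```python
# Why offloading older segments to object storage reduces the cost of long
# retention: hot data stays on (replicated) broker disks, cold data moves to
# cheaper, non-replicated object storage. All numbers are made up for illustration.

def monthly_storage_cost(total_tb, hot_tb, disk_price_gb=0.10,
                         object_price_gb=0.02, replication=3):
    """Hot data lives on broker disks (replicated), the rest in object storage."""
    hot_gb = hot_tb * 1024
    cold_gb = (total_tb - hot_tb) * 1024
    return hot_gb * replication * disk_price_gb + cold_gb * object_price_gb

# 100 TB of retained events: everything on broker disks vs. only 5 TB hot
all_on_disk = monthly_storage_cost(total_tb=100, hot_tb=100)
tiered = monthly_storage_cost(total_tb=100, hot_tb=5)

print(round(all_on_disk), round(tiered))
```

With these (made-up) prices, the tiered setup is roughly an order of magnitude cheaper per month, and the "lighter" brokers also rebalance faster because far less data has to move between them.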

Confluent Platform 5.4 (Slides + Video)

Here are the slides with an overview about the new features:

 

Here is a link to the video recording. Please note that this was delivered in German language.

Confluent Cloud – The Fully-Managed Alternative

Oh, and if you prefer a cloud service instead of self-managed Kafka deployments, please check out Confluent Cloud – the only really fully-managed Kafka-as-a-Service offering on the market with consumption-based pricing and mission critical SLAs.

It is available on all major cloud providers (AWS, GCP and Azure), leverages the latest release of Kafka, and provides tons of additional features (like fully managed connectors, Schema Registry and ksqlDB).

 

The post Confluent Platform 5.4 + Apache Kafka 2.4 (RBAC, Tiered Storage, Multi-Region Clusters, Audit Logs, and more) appeared first on Kai Waehner.

]]>
IoT Architectures for Digital Twin with Apache Kafka https://www.kai-waehner.de/blog/2020/03/25/architectures-digital-twin-digital-thread-apache-kafka-iot-platforms-machine-learning/ Wed, 25 Mar 2020 15:47:30 +0000 https://www.kai-waehner.de/?p=2144 A digital twin is a virtual representation of something else. This can be a physical thing, process or…

The post IoT Architectures for Digital Twin with Apache Kafka appeared first on Kai Waehner.

]]>
A digital twin is a virtual representation of something else. This can be a physical thing, process or service. This post covers the benefits and IoT architectures of a Digital Twin in various industries and its relation to Apache Kafka, IoT frameworks and Machine Learning. Kafka is often used as central event streaming platform to build a scalable and reliable digital twin and digital thread for real time streaming sensor data.

I already blogged about this topic recently in detail: Apache Kafka as Digital Twin for Open, Scalable, Reliable Industrial IoT (IIoT). That post covers the relation to Event Streaming and why people choose Apache Kafka to build an open, scalable and reliable digital twin infrastructure.

This article here extends the discussion about building an open and scalable digital twin infrastructure:

  • Digital Twin vs. Digital Thread
  • Relation between Event Streaming, Digital Twin and AI / Machine Learning
  • IoT Architectures for a Digital Twin with Apache Kafka and other IoT Platforms
  • Extensive slide deck and video recording

Key Take-Aways for Building a Digital Twin

Key Take-Aways:

  • A digital twin merges the real world (often physical things) and the digital world
  • Apache Kafka enables an open, scalable and reliable infrastructure for a Digital Twin
  • Event Streaming complements IoT platforms and other backend applications / databases.
  • Machine Learning (ML) and statistical models are used in most digital twin architectures to do simulations, predictions and recommendations.

Digital Thread vs. Digital Twin

The term ‘Digital Twin’ usually means the copy of a single asset. In the real world, many digital twins exist. The term ‘Digital Thread’ spans the entire life cycle of one or more digital twins. Eurostep has a great graphic explaining this:

Digital Thread and Digital Twin

When we talk about ‘Digital Twin’ use cases, we almost always mean a ‘Digital Thread’.

Honestly, the same is true in my material. Both terms overlap, but ‘Digital Twin’ is the “agreed buzzword”. It is important to understand the relation and definition of both terms, though.

Use Cases for Digital Twin and Digital Thread

Use cases exist in many industries. Think about some examples:

  • Downtime reduction
  • Inventory management
  • Fleet management
  • What-if simulations
  • Operational planning
  • Servitization
  • Product development
  • Healthcare
  • Customer experience

The slides and lecture (YouTube video) go into more detail, discussing four use cases from different industries:

  • Virtual Singapore: A Digital Twin of the Smart City
  • Smart Infrastructure: Digital Solutions for Entire Building Lifecycle
  • Connected Car Infrastructure
  • Twinning the Human Body to Enhance Medical Care

The key message here is that digital twins are not just for the automation industry. Instead, many industries and projects can add business value and innovation by building a digital twin.

Relation between Event Streaming, Digital Twin and AI / Machine Learning

Digital Twin (respectively Digital Thread) and AI / Machine Learning (ML) are complementary concepts. You need to apply ML to make accurate predictions using a digital twin.

Digital Twin and AI

Melda Ulusoy from MathWorks shows in a YouTube video how different Digital Twin implementations leverage statistical methods and analytic models:

Digital Twin Example Implementations

Examples include physics-based modeling to simulate what-if scenarios and data-driven modeling to estimate the RUL (Remaining Useful Life).

Digital Twin and Machine Learning both have the following in common:

  • Continuous learning, monitoring and acting
  • (Good) data is key for success
  • The more data the better
  • Real time, scalability and reliability are key requirements

Digital Twin, Machine Learning and Event Streaming with Apache Kafka

Real time, scalability and reliability are key requirements to build a digital twin infrastructure. This makes clear how Event Streaming and Apache Kafka fit into this discussion. I won’t cover what Kafka is or the relation between Kafka and Machine Learning in detail here because there are so many other blog posts and videos about it. To recap, let’s take a look at a common Kafka ML architecture providing openness, real-time processing, scalability and reliability for model training, deployment / scoring and monitoring:

Apache Kafka Open Source Ecosystem as Infrastructure for Machine Learning

To get more details about Kafka + Machine learning, start with the blog post “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka” and google for slides, demos, videos and more blog posts.
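The train-then-score pattern from the architecture above can be sketched with plain Python lists standing in for Kafka topics. No broker and no ML framework are involved; the "model" (a mean plus a threshold) and all values are illustrative stand-ins:

```python
# Simulated Kafka ML pipeline: consume a historical event stream for model
# training, then score new events in real time. Lists stand in for topics.

sensor_topic = [12.1, 12.3, 35.0, 12.2, 11.9, 40.5]  # historical "training" events

# Model training: consume the stream and fit a trivial model (the mean);
# a real pipeline would train a TensorFlow model here instead.
mean = sum(sensor_topic) / len(sensor_topic)

def score(event, threshold=1.5):
    """Model scoring: flag events that deviate strongly from the learned mean."""
    return "ANOMALY" if event > mean * threshold else "OK"

# Model deployment / scoring: apply the model to newly arriving events
# and publish the predictions to an output "topic".
predictions_topic = [(event, score(event)) for event in [12.0, 39.9]]
print(predictions_topic)
```

The point of the sketch is the data flow, not the model: training data, the scoring application, and the predictions are all just event streams, which is exactly why Kafka fits as the backbone.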

Characteristics of Digital Twin Technology

The following five characteristics describe common Digital Twin implementations:

  • Connectivity
    • Physical assets, enterprise software, customers
    • Bidirectional communication to ingest, command and control
  • Homogenization
    • Decoupling and standardization
    • Virtualization of information
    • Shared with multiple agents, unconstrained by physical location or time
    • Lower cost and easier testing, development and predictions
  • Reprogrammable and smart
    • Adjust and improve characteristics and develop new version of a product
  • Digital traces
    • Go back in time and analyse historical events to diagnose problems
  • Modularity
    • Design and customization of products and production modules
    • Tweak modules of models and machines

There are plenty of options to implement these characteristics. Let’s take a look at some IoT platforms and how Event Streaming and Apache Kafka fit into the discussion.

IoT Platforms, Frameworks, Standards and Cloud Services

Plenty of IoT solutions are available on the market. IoT Analytics Research talks about over 600 IoT Platforms in 2019. All have their “right to exist” 🙂 In most cases, some of these tools are combined with each other. There is no need or good reason to choose just one single solution.

Let’s take a quick look at some offerings and their trade-offs.

Proprietary IoT Platforms

  • Sophisticated integration for related IIoT protocols (like Siemens S7, Modbus, etc.) and standards (like OPC-UA)
  • Not a single product (plenty of acquisitions, OEMs and different code bases are typically the foundation)
  • Typically very expensive
  • Proprietary (just open interfaces)
  • Often limited scalability
  • Examples: Siemens MindSphere, Cisco Kinetic, GE Digital and Predix

IoT Offerings from Cloud Providers

  • Sophisticated tools for IoT management (devices, shadowing, …)
  • Good integration with other cloud services (storage, analytics, …)
  • Vendor lock-in
  • No key focus on hybrid and edge (but some on premises products)
  • Limited scalability
  • Often high cost (beyond ‘hello world’)
  • Examples: All major cloud providers have IoT services, including AWS, GCP, Azure and Alibaba

Standards-based / Open Source IoT Platforms

  • Open and standards-based (e.g. MQTT)
  • Open source / open core business model
  • Infrastructure-independent
  • Different vendors contribute and compete behind the core technologies (competition means innovation)
  • Sometimes less mature or non-existent connectivity (especially to legacy and proprietary protocols)
  • Examples: Open source frameworks like Eclipse IoT, Apache PLC4X or Node-RED and standards like MQTT and related vendors like HiveMQ
  • Trade-off: Solid offering for one standard (e.g. HiveMQ for MQTT) or diversity but not for mission-critical scale (e.g. Node-RED)

IoT Architectures for a Digital Twin / Digital Thread with Apache Kafka and other IoT Platforms

So, we learned that there are hundreds of IoT solutions available. Consequently, how does Apache Kafka fit into this discussion?

As discussed in the other blog post and in the below slides / video recording: There is huge demand for an open, scalable and reliable infrastructure for Digital Twins. This is where Kafka comes into play to provide a mission-critical event streaming platform for real time messaging, integration and processing.

Kafka and the 5 Characteristics of a Digital Twin

Let’s take a look at a few architectures in the following. Keep in mind the five characteristics of Digital Twins discussed above and its relation to Kafka:

  • Connectivity – Kafka Connect provides connectivity at scale, in real time, to IoT interfaces, big data solutions and cloud services. The Kafka ecosystem is complementary, NOT competitive, to other middleware and IoT platforms.
  • Homogenization – Real decoupling between clients (i.e. producers and consumers) is one of the key strengths of Kafka. Schema management and enforcement leveraging different technologies (JSON Schema, Avro, Protobuf, etc.) enables data awareness and standardization.
  • Reprogrammable and smart – Kafka is the de facto standard for microservice architectures for exactly this reason: Separation of concerns and domain-driven design (DDD). Deploy new decoupled applications and do versioning, A/B testing and canary releases.
  • Digital traces – Kafka is a distributed commit log. Events are appended, stored as long as you want (potentially forever with retention time = -1) and immutable. Seriously, what other technology could be used better to build a digital trace for a digital twin?
  • Modularity – The Kafka infrastructure itself is modular and scalable. This includes components like Kafka brokers, Connect, Schema Registry, REST Proxy and client applications in different languages like Java, Scala, Python, Go, .NET, C++ and others. With this modularity, you can easily build the right Digital Twin architecture for your edge, hybrid or global scenarios and also combine the Kafka components with any other IoT solutions.
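The "digital traces" characteristic above can be sketched in a few lines: an immutable, append-only log (like a Kafka topic) from which the twin's current state is materialized by replaying events, and from which historical state can be recovered by replaying only up to an earlier offset. The event fields and values are made up for illustration:

```python
# An append-only event log as the digital trace of a machine. State is never
# mutated in place; it is derived by replaying the immutable log.

event_log = [
    {"offset": 0, "machine": "press-1", "temp": 60},
    {"offset": 1, "machine": "press-1", "temp": 72},
    {"offset": 2, "machine": "press-1", "temp": 95},
]

def twin_state(log, up_to_offset=None):
    """Replay the log (optionally only up to an offset) to materialize twin state."""
    state = {}
    for event in log:
        if up_to_offset is not None and event["offset"] > up_to_offset:
            break
        state[event["machine"]] = event["temp"]
    return state

print(twin_state(event_log))                  # current state of the twin
print(twin_state(event_log, up_to_offset=1))  # "go back in time" for diagnosis
```

This is exactly the property a Kafka topic gives you for free: events are appended, immutable, and retained as long as configured, so any consumer can rebuild the twin's state at any point in its history.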

Each of the following IoT architectures for a Digital Twin has its pros and cons. Depending on your overall enterprise architecture, project situation and many other aspects, pick and choose the right one:

Scenario 1: Digital Twin Monolith

An IoT Platform is good enough for integration and building the digital twin. No need to use another database or integrate with the rest of the enterprise.

1 - Digital Twin Monolith

Scenario 2: Digital Twin as External Database

An IoT Platform is used for integration with the IoT endpoints. The Digital Twin data is stored in an external database. This can be something like MongoDB, Elastic, InfluxDB or a Cloud Storage. The database could be used just for storage and for additional tasks like processing, dashboards and analytics.

2 - Digital Twin as External Database

 

A combination with yet another product is also very common. For instance, a Business Intelligence (BI) tool like Tableau, Qlik or Power BI can use the SQL interface of a database for interactive queries and reports.

Scenario 3: Kafka as Backbone for the Digital Twin and the Rest of the Enterprise

The IoT Platform is used for integration with the IoT endpoints. Kafka is the central event streaming platform to provide decoupling between the other components. As a result, the central layer is open, scalable and reliable. The database is used for the digital twin (storage, dashboards, analytics). Other applications also consume parts of the data from Kafka (some real time, some batch, some request-response communication).

3 -Kafka as Backbone for the Digital Twin and the Rest of the Enterprise

Scenario 4: Kafka as IoT Platform

Kafka is the central event streaming platform to provide a mission-critical real time infrastructure and the integration layer to the IoT endpoints and other applications. The digital twin is implemented in its own solution. In this example, it does not use a database like in the examples above, but a Cloud IoT Service like Azure Digital Twins.

4 - Kafka as IoT Platform

Scenario 5: Kafka as Digital Twin

Kafka is used to implement the digital twin. No other components or databases are involved. Other consumers consume the raw data and the digital twin data.

5 - Kafka as Digital Twin

 

Like all the other architectures, this has pros and cons. The main question with this approach is whether Kafka can really replace a database and how you can query the data. First of all, Kafka can be used as a database (check out the detailed discussion in the linked blog post), but it will not replace other databases like Oracle, MongoDB or Elasticsearch.

Having said this, I have already seen several deployments of Kafka for Digital Twin infrastructures in the automation and aviation industries, and even in banking.

Especially with “Tiered Storage” in mind (a Kafka feature currently discussed in KIP-405 and already implemented by Confluent), Kafka gets more and more powerful for long-term storage.

Slides and Video Recording – IoT Architectures for a Digital Twin with Apache Kafka

This section provides a slide deck and video recording to discuss Digital Twin use cases, technologies and architectures in much more detail.

The agenda for the deck and lecture:

  • Digital Twin – Merging the Physical and the Digital World
  • Real World Challenges
  • IoT Platforms
  • Apache Kafka as Event Streaming Solution for IoT
  • Spoilt for Choice for a Digital Twin
  • Global IoT Architectures
  • A Digital Twin for 100000 Connected Cars

Slides

Here is the long version of the slides (with more content than the slides used for the video recording):

Video Recording

The video recording covers a “lightweight version” of the above slides:

The post IoT Architectures for Digital Twin with Apache Kafka appeared first on Kai Waehner.

]]>
Can Apache Kafka Replace a Database? https://www.kai-waehner.de/blog/2020/03/12/can-apache-kafka-replace-database-acid-storage-transactions-sql-nosql-data-lake/ Thu, 12 Mar 2020 14:47:36 +0000 https://www.kai-waehner.de/?p=2069 Can and should Apache Kafka replace a database? How long can and should I store data in Kafka?…

The post Can Apache Kafka Replace a Database? appeared first on Kai Waehner.

]]>
Can and should Apache Kafka replace a database? How long can and should I store data in Kafka? How can I query and process data in Kafka? These are common questions that come up more and more. Short answers like “Yes” or “It depends” are not good enough for you? Then this read is for you! This blog post explains the idea behind databases and different features like storage, queries and transactions to evaluate when Kafka is a good fit and when it is not.

Jay Kreps, the co-founder of Apache Kafka and Confluent, explained already in 2017 why “It’s okay to store data in Apache Kafka”. However, many things have improved and new components and features have been added in the last three years. This update covers the core concepts of Kafka from a database perspective. It includes Kafka-native add-ons like Tiered Storage for long-term, cost-efficient storage and ksqlDB as an event streaming database. The relation and trade-offs between Kafka and other databases are explored so that they complement each other instead of thinking about a replacement. This discussion includes different options for pull- and push-based bidirectional integration.

I also created a slide deck and video recording about this topic.

What is a Database? Oracle? NoSQL? Hadoop?

Let’s think about the term “database” on a very high level. According to Wikipedia,

“A database is an organized collection of data, generally stored and accessed electronically from a computer system. 

The database management system (DBMS) is the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS software additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a “database system”. Often the term “database” is also used to loosely refer to any of the DBMS, the database system or an application associated with the database.

Computer scientists may classify database-management systems according to the database models that they support. Relational databases became dominant in the 1980s. These model data as rows and columns in a series of tables, and the vast majority use SQL for writing and querying data. In the 2000s, non-relational databases became popular, referred to as NoSQL because they use different query languages.”

Based on this definition, we know that there are many databases on the market. Oracle. MySQL. Postgres. Hadoop. MongoDB. Elasticsearch. AWS S3. InfluxDB. Kafka.

Hold on. Kafka? Yes, indeed. Let’s explore this in detail…

Can Apache Kafka Replace a Database like Oracle Hadoop NoSQL MongoDB Elastic MySQL et al

Storage, Transactions, Processing and Querying Data

A database infrastructure is used for storage, queries and processing of data, often with specific delivery and durability guarantees (aka transactions).

There is not just one database, as we all should know from all the NoSQL and Big Data products on the market. For each use case, you (should) choose the right database. It depends on your requirements. How long to store data? What structure should the data have? Do you need complex queries or just retrieval of data via key and value? Do you require ACID transactions, exactly-once semantics, or “just” at-least-once delivery guarantees?

These and many more questions have to be answered before you decide if you need a relational database like MySQL or Postgres, a big data batch platform like Hadoop, a document store like MongoDB, a key-value store like RocksDB, a time series database like InfluxDB, an in-memory cache like Memcached, or something else.

Every database has different characteristics. Thus, when you ask yourself if you can replace a database with Kafka, which database and what requirements are you talking about?

What is Apache Kafka?

Obviously, it is also crucial to understand what Kafka is to decide if Kafka can replace your database. Otherwise, it is really hard to proceed with this evaluation… 🙂

Kafka is an Event Streaming Platform => Messaging! Stream Processing! Database! Integration!

First of all, Kafka is NOT just a pub/sub messaging system to send data from A to B. This is what some unaware people typically respond to such a question when they think Kafka is the next IBM MQ or RabbitMQ. Nope. Kafka is NOT a messaging system. A comparison with other messaging solutions is an apples-to-oranges comparison (but still valid to decide when to choose Kafka or a messaging system).

Kafka is an event streaming platform. Companies from various industries presented hundreds of use cases where they use Kafka successfully for much more than just messaging. Just check out all the talks from the Kafka Summits (including free slide decks and video recordings).

One of the main reasons why Apache Kafka became the de facto standard for so many different use cases is its combination of four powerful concepts:

  • Publish and subscribe to streams of events, similar to a message queue or enterprise messaging system
  • Store streams of events in a fault-tolerant storage as long as you want (hours, days, months, forever)
  • Process streams of events in real time, as they occur
  • Integration of different sources and sinks (no matter if real time, batch or request-response)

Apache Kafka - The De-facto Standard for Real-Time Event Streaming

Decoupled, Scalable, Highly Available Streaming Microservices

With these four pillars built into one distributed event streaming platform, you can decouple various applications (i.e., producers and consumers) in a reliable, scalable, and fault-tolerant way.

As you can see, storage is one of the key principles of Kafka. Therefore, depending on your requirements and definition, Kafka can be used as a database.

Is “Kafka Core” a Database with ACID Guarantees?

I won’t cover the whole discussion about how “Kafka Core” (meaning Kafka brokers and its concepts like distributed commit log, replication, partitions, guaranteed ordering, etc.) fits into the ACID (Atomicity, Consistency, Isolation, Durability) transaction properties of databases. This was discussed already by Martin Kleppmann at Kafka Summit San Francisco 2018 (“Is Kafka a Database?”) and a little bit less technically by Tim Berglund (“Dissolving the Problem – Making an ACID-Compliant Database Out of Apache Kafka”).

TL;DR: Kafka is a database and provides ACID guarantees. However, it works differently than other databases. Kafka is also not replacing other databases; but a complementary tool in your toolset.

The Client Side of Kafka

In messaging systems, the client API provides producers and consumers to send and read messages. All other logic is implemented using low level programming or additional frameworks.

In databases, the client API provides a query language to create data structures and enables the client to store and retrieve data. All other logic is implemented using low level programming or additional frameworks.

In an event streaming platform, the client API is used for sending and consuming data like in a messaging system. However, in contrast to messaging and databases, the client API provides much more functionality.

Independent, scalable, reliable applications can be built with the Kafka APIs. Therefore, a Kafka client application is a distributed system that queries, processes and stores continuous streams of data. Many applications can be built without the need for an additional framework.

The Kafka Ecosystem – Kafka Streams, ksqlDB, Spring Kafka, and Much More…

The Kafka ecosystem provides various different components to implement applications.

Kafka itself includes a Java and Scala client API, Kafka Streams for stream processing with Java, and Kafka Connect to integrate with different sources and sinks without coding.

Many additional Kafka-native client APIs and frameworks exist. Here are some examples:

  • librdkafka: A C library implementation of the Apache Kafka protocol, providing Producer, Consumer and Admin clients. It was designed with message delivery reliability and high performance in mind, current figures exceed 1 million msgs/second for the producer and 3 million msgs/second for the consumer. In addition to the C library, it is often used as wrapper to provide Kafka clients from other programming languages such as C++, Golang, Python and JavaScript.
  • REST Proxy: Provides a RESTful interface to a Kafka cluster. It makes it easy to produce and consume messages, view the state of the cluster, and perform administrative actions without using the native Kafka protocol or clients.
  • ksqlDB: An event streaming database for Apache Kafka that enables you to build event streaming applications leveraging your familiarity with relational databases.
  • Spring for Kafka: Applies core Spring concepts to the development of Kafka-based messaging and streaming solutions. It provides a “template” as a high-level abstraction for sending messages. Includes first-class support for Kafka Streams. Additional Spring frameworks like Spring Cloud Stream and Spring Cloud Data Flow also provide native support for event streaming with Kafka.
  • Faust: A library for building streaming applications in Python, similar to the original Kafka Streams library (but more limited functionality and less mature).
  • TensorFlow I/O + Kafka Plugin: A native integration into TensorFlow for streaming machine learning (i.e. directly consuming events from Kafka for model training and model scoring instead of using another data lake).
  • Many more…

Domain-Driven Design (DDD), Dumb Pipes, Smart Endpoints

The importance of Kafka’s client side is crucial for the discussion of potentially replacing a database because Kafka applications can be stateless or stateful; the latter keeping state in the application instead of using an external database. The storage section below contains more details about how the client application can store data long term and is highly available.

With this, you understand that Kafka has a powerful server and a powerful client side. Many people are not aware of this when they evaluate Kafka versus other messaging solutions or storage systems.

This, in conjunction with the capability to do real decoupling between producers and consumers by leveraging the underlying storage of Kafka, makes clear why Apache Kafka became the de facto standard and backbone for microservice architectures – not just replacing other traditional middleware but also building the client applications using Domain-Driven Design (DDD) for decoupled applications, dumb pipes and smart endpoints:

Apache Kafka Domain Driven Design DDD

Check out the blog post “Microservices, Apache Kafka, and Domain-Driven Design (DDD)” for more details on this discussion.

Again, why is this important for the discussion around Kafka being a database? For every new microservice you create, you should ask yourself: Do I really need a “real database” backend in my microservice, with all its complexity and cost regarding development, testing, operations and monitoring?

Often, the answer is yes, of course. But I see more and more applications where keeping the state directly in the Kafka application is better, easier or both.

Storage – How long can you store Data in Kafka? And what is the Database under the Hood?

The short answer: Data can be stored in Kafka as long as you want. Kafka even provides the option to use a retention time of -1. This means “forever”.
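As a hedged sketch of what this looks like in practice (the broker address and topic name are placeholders, not from the post), the retention can be set as a broker-wide default or per topic:

```properties
# Broker-wide default in server.properties: keep data forever
log.retention.ms=-1
```

```shell
# Per topic, using the standard CLI (broker address and topic name are placeholders)
kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-forever-topic \
  --alter --add-config retention.ms=-1
```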

The longer answer is much more complex. You need to think about the cost and scalability of the Kafka brokers. Should you use HDDs or SSDs? Or maybe even flash-based technology? Pure Storage wrote a nice example leveraging flash storage to write 5 million messages per second with three producers and 3x replication. It depends on how much data you need to store, how fast you need to access data, and how quickly you need to recover from failures.

Publishing with Apache Kafka at The New York Times is a famous example of storing data in Kafka forever. Kafka is used for storing all the articles ever published by The New York Times, replacing their API-based approach. The Streams API is used to feed published content in real time to the various applications and systems that make it available to their readers.

Another great example is the Account Activity Replay API from Twitter: It uses Kafka as storage to provide “a data recovery tool that lets developers retrieve events from as far back as five days. This API recovers events that weren’t delivered for various reasons, including inadvertent server outages during real-time delivery attempts.”

Until now, we have just talked about the most commonly used Kafka features: log-based storage with retention time and disks attached to the broker. However, you also need to consider the additional capabilities of Kafka to have a complete discussion about long-term storage in a Kafka infrastructure: compacted topics, tiered storage and client-side storage. All of these features can quickly change how you think about Kafka, its use cases and its architecture.

Compacted Topics – Log Compaction and “Event Updates”

Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition. It addresses use cases and scenarios such as restoring state after application crashes or system failures, or reloading caches after application restarts during operational maintenance. Therefore, log compaction does not have a retention time.

Obviously, the big trade-off is that log compaction does not keep all events and the full order of changes. For that, you need regular Kafka topics with a specific retention time.

Alternatively, you can set the retention time to -1 to store all data forever. The big trade-offs here are higher disk cost and more complex operations and scalability.
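The effect of log compaction can be illustrated with a small simulation (plain Python, no Kafka involved): given a changelog of keyed events, compaction keeps only the latest value per key and discards the full history.

```python
# Simulate Kafka log compaction: for each message key, only the
# latest value survives; the full history of changes is lost.
def compact(log):
    """log is a list of (key, value) events in append order."""
    latest = {}
    for key, value in log:
        latest[key] = value  # later events overwrite earlier ones
    return latest

# A changelog of account balances ("event updates"):
log = [
    ("alice", 100),
    ("bob", 50),
    ("alice", 80),   # update for alice
    ("alice", 120),  # another update for alice
]

print(compact(log))  # only the last known value per key remains
```

This is why a compacted topic is great for restoring the latest state, but a regular topic with retention (or retention -1) is needed if you must replay every change in order.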

Tiered Storage – Long-Term Storage in Apache Kafka

KIP-405 (Kafka Improvement Proposal) was created to standardize the interface for Tiered Storage in Kafka and enable different implementations by different contributors and vendors. The KIP is not implemented yet, and the interface is still under discussion by contributors from different companies.

Confluent already provides a (commercial) implementation for Kafka to use Tiered Storage for storing data long term in Kafka at low cost. The blog post “Infinite Storage in Confluent Platform” talks about the motivation and implementation of this game-changing Kafka add-on.

Tiered Storage for Kafka reduces cost (due to cheaper object storage), increases scalability (due to the separation between storage and processing), and eases operations (due to simplified and much faster rebalancing).

Here are some examples for storing the complete log in Kafka long-term (instead of leveraging compacted topics):

  • New consumer, e.g. a completely new microservice or a replacement of an existing application
  • Error handling, e.g. re-processing of data in case of an error to fix it and process the events again
  • Compliance / regulatory processing: Reprocessing of already processed data for legal reasons; this could be very old data (e.g. pharma: 10 years old)
  • Query and analysis of existing events without the need for another data store or data lake; e.g. with ksqlDB (a natural first choice, but be aware of its various limitations) or a Kafka-native analytics tool (e.g. Rockset with its Kafka connector and full SQL support for Tableau et al.)
  • Machine learning and model training: Consume events for model training with a) one machine learning framework and different hyperparameters or b) different machine learning frameworks

The last example is discussed in more detail here: Streaming Machine Learning with Tiered Storage (no need for a Data Lake).

Client-Side Database – Stateful Kafka Client Applications and Microservices

As discussed above, Kafka is not just the server side. You can build highly available and scalable real-time applications on the Kafka client side.

RocksDB for Stateful Kafka Applications

Often, these Kafka client applications (have to) keep important state. Kafka Streams and ksqlDB leverage RocksDB for this (you could also just use in-memory storage or replace RocksDB with another store; I have never seen the latter option in the real world, though). RocksDB is a key-value store for running mission-critical workloads, optimized for fast, low-latency storage.

In Kafka Streams applications, this solves the problem of accessing local, stable storage instead of using an external database. Using an external database would require external communication / RPC every time an event is processed; a clear anti-pattern in event streaming architectures.

RocksDB allows software engineers to focus their energy on the design and implementation of other areas of their systems, with the peace of mind of relying on RocksDB for access to stable storage. It is battle-tested at several Silicon Valley companies and used under the hood of many famous databases like Apache Cassandra, CockroachDB and MySQL (MyRocks). “RocksDB Is Eating the Database World” covers the history and use cases in more detail.

ksqlDB as Event Streaming Database

Kafka Streams and ksqlDB – the event streaming database for Kafka – allow building stateful streaming applications, including powerful concepts like joins, sliding windows and interactive queries of the state. The following example shows how you build a stateful payment application:

Apache Kafka Stateful Client Microservice Applications Stream Table

The client application keeps the data in its own application for real-time joins and other data correlations. It combines the concepts of a STREAM (immutable events) and a TABLE (updated information, like in a relational database). Keep in mind that this application is highly scalable. It is typically not just a single instance; instead, it is a distributed cluster of client instances that provides high availability and parallelizes data processing. Even if something goes down (VM, container, disk, network), the overall system will not lose any data and continues running 24/7.
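The STREAM/TABLE duality behind such a payment application can be sketched in a few lines. This is a simplified simulation, not actual Kafka Streams code; the field names and events are made up for illustration:

```python
# Simplified stream-table join, mimicking what Kafka Streams / ksqlDB
# do with a changelog-backed state table and an event stream.
customers_table = {}  # TABLE: latest state per key (like a compacted topic)

def on_customer_update(event):
    # Each update replaces the previous state for that customer id.
    customers_table[event["id"]] = event

def on_payment(payment):
    # STREAM: each payment is an immutable event, enriched by a
    # lookup into the local state table (no external database call).
    customer = customers_table.get(payment["customer_id"], {})
    return {**payment, "customer_name": customer.get("name", "unknown")}

on_customer_update({"id": 1, "name": "Alice"})
enriched = on_payment({"customer_id": 1, "amount": 42})
print(enriched["customer_name"])  # Alice
```

In a real deployment, the table would be backed by RocksDB and a changelog topic, so each instance of the distributed application can restore its state after a failure.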

As you can see, many questions have to be answered and various features have to be considered to make the right decision about how long and where you want to store data in Kafka.

One good reason to store data long-term in Kafka is to be able to use the data at a later point in time for processing, correlations or analytics.

Query and Processing – Can you Consume and Analyze the Kafka Storage?

Kafka provides different options to consume and query data.

Queries in Kafka can be either PUSH (i.e. continuously process and forward events) or PULL (i.e. the client requests events like you know it from your favorite SQL database).

Apache Kafka PULL and PUSH Queries in Kafka Streams and ksqlDB

I will show you different options in the following sections.

Consumer Applications Pull Events

Kafka clients pull the data from the brokers. This decouples producers, brokers and consumers and makes the infrastructure scalable and reliable.

Kafka itself includes a Java and Scala client to consume data. However, Kafka clients are available for almost any other programming language, including widespread languages like C, C++, Python, JavaScript or Golang and more exotic languages like Rust. Check out Yeva Byzek’s examples to see your favorite programming language in action. Additionally, Confluent provides a REST Proxy. This allows the consumption of events via HTTP(S) from any language or tool supporting this standard.

Applications have different options to consume events from the Kafka broker:

  • Continuous consumption of the latest events (in real time or batch)
  • Just specific time frames or partitions
  • All data from the beginning

Stream Processing Applications / Microservices Pull and Push Events

Kafka Streams and ksqlDB pull events from the brokers, process the data, and then push the result back into another Kafka topic. These queries run continuously. Powerful queries are possible, including joins and stateful aggregations.

These features are used for streaming ETL and real time analytics at scale, but also to build mission-critical business applications and microservices.
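The shape of such a continuous PUSH query can be sketched as a running aggregation that emits an updated result after every input event, similar in spirit to a ksqlDB persistent query with GROUP BY (a conceptual simulation, not the actual API):

```python
# Continuous (PUSH-style) stream processing: a stateful running
# aggregation whose updated result is emitted downstream after
# every input event.
from collections import defaultdict

totals = defaultdict(float)  # state: running revenue per product
emitted = []                 # stands in for the output Kafka topic

def process(event):
    totals[event["product"]] += event["price"]
    emitted.append({"product": event["product"],
                    "revenue": totals[event["product"]]})

for e in [{"product": "a", "price": 10.0},
          {"product": "b", "price": 5.0},
          {"product": "a", "price": 2.5}]:
    process(e)

print(emitted[-1])  # latest update for product "a"
```

The output topic is itself a changelog, which downstream consumers or materialized views can read like any other stream.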

This discussion needs much more detail and cannot be covered in this blog post, which focuses on the database perspective. Get started, for example, with my intro to event streaming with ksqlDB from Big Data Spain in Madrid, covering use cases and more technical details. “Confluent Developer” is another great starting point for building event streaming applications. The site provides plenty of tutorials, videos, demos, and more around Apache Kafka.

The “interactive queries” feature allows querying values from the client application’s state store (typically implemented with RocksDB under the hood). The events are pulled via technologies like REST / HTTP or pushed via intermediaries like a WebSockets proxy. Kafka Streams provides the core functionality; the interactive query interface has to be implemented on top by yourself. Pro: flexibility. Con: not provided out of the box.
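A hand-rolled interactive query layer usually boils down to exposing the local state store through a lookup function that a REST handler calls. A minimal sketch (the store contents and endpoint shape are hypothetical; in Kafka Streams the store would be a RocksDB-backed state store):

```python
# Minimal sketch of "interactive queries": a local key-value state
# store (standing in for RocksDB) exposed through a query function
# that a GET /state/<key> REST handler would call (PULL query).
state_store = {
    "order-1": {"status": "shipped"},
    "order-2": {"status": "open"},
}

def query_state(key):
    """Return (body, http_status) for a state lookup."""
    value = state_store.get(key)
    if value is None:
        return {"error": "key not found"}, 404
    return value, 200

body, status = query_state("order-1")
print(status, body)  # 200 {'status': 'shipped'}
```

In a multi-instance deployment you additionally need metadata-based routing, since each instance only holds the partitions (and thus keys) assigned to it.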

Kafka as Query Engine and its Limitations

None of the Kafka query capabilities described above are as powerful as your beloved Oracle database or Elasticsearch!

Therefore, Kafka will not replace other databases. It is complementary. The main idea behind Kafka is to continuously process streaming data; with additional options to query stored data.

Kafka is good enough as a database for some use cases. However, its query capabilities are not sufficient for others.

Kafka is then often used as a central streaming platform, where one or more databases (and other applications) build their own materialized real-time views leveraging their own technology.

This principle is often called “turning the database inside out”. This design pattern allows using the right database for the right problem. In these scenarios, Kafka is used:

  • as scalable event streaming platform for data integration
  • for decoupling between different producers and consumers
  • to handle backpressure
  • to continuously process and correlate incoming events
  • for enabling the creation and updating of materialized views within other databases
  • to allow interactive queries directly to Kafka (depending on the use case and used technology)

Kafka as Single Source of Truth and Leading System?

For many scenarios, it is great if the central event streaming platform is the central single source of truth. Kafka provides an event-based real-time infrastructure that is scalable and decouples all producers and consumers. However, in the real world, something like an ERP system will often stay the leading system, even if it pushes its data via Kafka to the rest of the enterprise.

That’s totally fine! Kafka being the central streaming platform does not force you to make it the leading system for every event. For some applications and databases, the existing source of truth is still the source of truth after the integration with Kafka.

The key point here is that your single source of truth should not be a database that stores data at rest, e.g. a data lake like Hadoop or AWS S3. That way, your central storage would be a slow batch system, and you cannot simply connect a real-time consumer to it. On the other hand, if an event streaming platform is your central layer, you can still ingest the data into a data lake for processing at rest, but you can also easily add another real-time consumer.

Native ANSI SQL Query Layer to Pull Events? Tableau, Qlik, Power BI et al to analyze Kafka?

Access to massive volumes of event streaming data through Kafka has sparked a strong interest in interactive, real-time dashboards and analytics. The idea is similar to what was built on top of traditional databases like Oracle or MySQL using Tableau, Qlik or Power BI, and on batch frameworks like Hadoop using Impala, Presto or BigQuery: The user wants to ask questions and get answers quickly.

Leveraging Rockset, a scalable SQL search and analytics engine based on RocksDB, in conjunction with BI and analytics tools like Tableau, you can directly query the Kafka log. With ANSI SQL. No limitations. At scale. In real time. This is a good time to question your data lake strategy, isn’t it? 🙂


Check out the details about the technical implementation and use cases here: “Real-Time Analytics and Monitoring Dashboards with Apache Kafka and Rockset”. Bosch Power Tools is a great real-world example of using Kafka as long-term storage and Rockset for real-time analytics and powerful interactive queries.

Transactions – Delivery and Processing Guarantees in Kafka

TL;DR: Kafka provides end-to-end processing guarantees, durability and high availability to build the most critical business applications. I have seen many mission-critical infrastructures built on Kafka in various industries, including banking, telco, insurance, retail, automotive and many others.

I want to focus on delivery guarantees and correctness as critical characteristics of messaging systems and databases. Transactions are required in many applications to guarantee no data loss and deterministic behavior.

Transaction processing in databases is information processing that is divided into individual, indivisible operations called transactions. Each transaction must succeed or fail as a complete unit; it can never be only partially complete. Many databases with transactional capabilities (including Two-Phase-Commit / XA Transactions) do not scale well and are hard to operate.

Therefore, many distributed systems provide just at-least-once semantics.

Exactly-Once Semantics (EOS) in Kafka

In contrast, Kafka is a distributed system that provides various delivery guarantees. Different configuration options allow at-least-once, at-most-once and exactly-once semantics (EOS).

Exactly-once semantics is what people compare to database transactions. The idea is similar: You need to guarantee that each produced message is consumed and processed exactly once. Many people argued that this is impossible to implement with Kafka because individual, indivisible operations can fail in a distributed system! In the Kafka world, many people referred to the famous Hacker News discussion “You Cannot Have Exactly-Once Delivery” and similar Twitter conversations.

In mid-2017, the “unbelievable thing” happened: Apache Kafka 0.11 added support for Exactly-Once Semantics (EOS). Note that the name intentionally avoids the term “transaction”: it is not a transaction in the classical sense, because such a transaction is not possible in a distributed system. However, the result is the same: Each consumer consumes each produced message exactly once. “Exactly-once Semantics are Possible: Here’s How Kafka Does it” covers the details of the implementation. In short, EOS includes three features:

  • Idempotence: Exactly-once in order semantics per partition
  • Transactions: Atomic writes across multiple partitions
  • Exactly-once stream processing in Apache Kafka
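The idempotence part of this list can be sketched with a simplified model: the broker remembers the highest sequence number per producer id and silently drops duplicate retries of the same send (real Kafka tracks this per partition and handles producer epochs as well):

```python
# Simplified model of Kafka's idempotent producer: the broker
# remembers the highest sequence number per producer and drops
# duplicate retries, so network retries cannot create duplicates.
class Broker:
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence seen

    def append(self, producer_id, seq, message):
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate retry, already written: drop it
        self.last_seq[producer_id] = seq
        self.log.append(message)
        return True

broker = Broker()
broker.append("p1", 0, "payment-123")
broker.append("p1", 0, "payment-123")  # network retry of the same send
broker.append("p1", 1, "payment-456")
print(broker.log)  # the duplicate was dropped
```

Atomic writes across partitions (the "transactions" feature) build on top of this, adding commit/abort markers so consumers only see completed writes.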

Matthias J. Sax did a great job explaining EOS at Kafka Summit 2018 in London. Slides and video recording are available for free.

EOS works differently than transactions in databases but provides the same result in the end. It can be turned on by configuration. By the way: The performance penalty compared to at-least-once semantics is not big; typically somewhere between 10 and 25% slower end-to-end processing.

Exactly-Once Semantics in the Kafka Ecosystem (Kafka Connect, Kafka Streams, ksqlDB, non-Java Clients)

EOS is not just part of Kafka core and the related Java / Scala client. Most Kafka components support exactly-once delivery guarantees, including:

  • Some (but not all) Kafka Connect connectors. For example AWS S3 and Elasticsearch.
  • Kafka Streams and ksqlDB to process data exactly once for streaming ETL or in business applications.
  • Non-Java clients. librdkafka – the core foundation of many Kafka clients in various programming languages – added support for EOS recently.

Will Kafka Replace your existing Database?

In general, no! But you should always ask yourself: Do you need another data store in addition to Kafka? Sometimes yes, sometimes no. We discussed the characteristics of a database and when Kafka is sufficient.

Each database has specific features, guarantees and query options. Use MongoDB as document store, Elasticsearch for text search, Oracle or MySQL for traditional relational use cases, or Hadoop for a big data lake to run map/reduce jobs for reports.

This blog post hopefully helps you make the right decision for your next project.

However, hold on: The question is not always “Kafka vs database XYZ”. Often, Kafka and databases are complementary! Let’s discuss this in the following section…

Kafka Connect – Integration between Kafka and other Databases

Apache Kafka includes Kafka Connect: A framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems. Using Kafka Connect you can use existing connector implementations for common data sources and sinks to move data into and out of Kafka:

Kafka Connect to integrate with Source and Sink Database Systems

This includes many connectors to various databases. To get data from a source system, events can either be pulled (e.g. with the JDBC connector) or pushed via Change Data Capture (CDC, e.g. with the Debezium connector). Kafka Connect can also write into any sink data store, including various relational, NoSQL and big data infrastructures like Oracle, MongoDB, Hadoop HDFS or AWS S3.
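Connectors are configured declaratively and submitted to the Connect REST API. As an illustration, a JDBC source connector configuration might look like the following (the connection details, table and topic names are placeholders; the config keys follow the Confluent JDBC source connector):

```python
import json

# Example Kafka Connect configuration for a JDBC source connector,
# as it would be POSTed to the Connect REST API. Connection details
# are placeholders, not a real endpoint.
connector = {
    "name": "postgres-orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db.example.com:5432/shop",
        "mode": "incrementing",                  # pull only new rows
        "incrementing.column.name": "order_id",  # monotonically increasing PK
        "table.whitelist": "orders",
        "topic.prefix": "pg-",                   # events land in topic "pg-orders"
        "tasks.max": "1",
    },
}

print(json.dumps(connector, indent=2))
```

With `mode=incrementing`, the connector polls the table and only emits rows whose id is higher than the last one seen, which is the pull-based alternative to log-based CDC with Debezium.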

Confluent Hub is a great resource to find the available source and sink connectors for Kafka Connect. The hub includes open source connectors and commercial offerings from different vendors. To learn more about Kafka Connect, check out Robin Moffatt’s blog posts. He has implemented and explained tens of fantastic examples leveraging Kafka Connect to integrate with many different source and sink databases.

Apache Kafka is a Database with ACID Guarantees, but Complementary to other Databases!

Apache Kafka is a database. It provides ACID guarantees and is used in hundreds of companies for mission-critical deployments. However, in many cases, Kafka does not compete with other databases. Kafka is an event streaming platform for messaging, storage, processing and integration at scale in real time with zero downtime and zero data loss.

With these characteristics, Kafka is often used as a central streaming integration layer. Other databases can build materialized views for their specific use cases, like real-time time series analytics, near-real-time ingestion into a text search infrastructure, or long-term storage in a data lake.

In summary, if you get asked if Kafka can replace a database, then here are different answers:

  • Kafka can store data forever in a durable and highly available manner, providing ACID guarantees
  • Different options to query historical data are available in Kafka
  • Kafka-native add-ons like ksqlDB or Tiered Storage make Kafka more powerful than ever for data processing and event-based long-term storage
  • Stateful applications can be built leveraging Kafka clients (microservices, business applications) without the need for an external database
  • Kafka is not a replacement for existing databases like MySQL, MongoDB, Elasticsearch or Hadoop
  • Other databases and Kafka complement each other; the right solution has to be selected for each problem; often, purpose-built materialized views are created and updated in real time from the central event-based infrastructure
  • Different options are available for bi-directional pull- and push-based integration between Kafka and databases to complement each other

Please let me know what you think and connect on LinkedIn… Stay informed about new blog posts by subscribing to my newsletter.

The post Can Apache Kafka Replace a Database? appeared first on Kai Waehner.

Smart City with an Event Streaming Platform like Apache Kafka https://www.kai-waehner.de/blog/2020/02/24/building-smart-city-event-streaming-platform-apache-kafka/ Mon, 24 Feb 2020 13:10:29 +0000 https://www.kai-waehner.de/?p=2050

A smart city is an urban area that uses different types of electronic Internet of Things (IoT) sensors to collect data and then use insights gained from that data to manage assets, resources and services efficiently. This includes data collected from citizens, devices, and assets that is processed and analyzed to monitor and manage traffic and transportation systems, power plants, utilities, water supply networks, waste management, crime detection, information systems, schools, libraries, hospitals, and other community services.

Smart City - Event Streaming with Apache Kafka

I recently did a Confluent webinar about this topic together with my colleague Robert Cowart. Rob has deep experience in this area from several projects over the last few years. I regularly face this discussion from different perspectives in customer meetings all over the world:

  1. Cities / governments: Increase safety, improve planning, increase efficiency, reduce cost
  2. Automotive / vehicle vendors: Improve customer experience, cross-sell
  3. Third party companies (ride sharing, ticket-less parking, marketplaces, etc.): Provide innovative new services and business models

This blog post covers an extended version of the webinar. I give a quick overview and share the slides + video recording.

Benefits of a Smart City

A smart city provides many benefits for the civilization and the city management. Some of the goals are:

  • Improved Pedestrian Safety
  • Improved Vehicle Safety
  • Proactively Engaged First Responders
  • Reduced Traffic Congestion
  • Connected / Autonomous Vehicles
  • Improved Customer Experience
  • Automated Business Processes

“Smart city” is a very generic term and often used as a buzzword. It includes many different use cases and stakeholders. In summary, a smart city provides the right insights (enriched and analyzed) at the right time (increasingly in real time) to the right people, processes and systems.

Innovative New Business Models Emerging…

A smart city establishes exciting new business models. Many of these make the experience much better for the end user. For instance, I am glad that I no longer have to pay for parking with pocket change; a simple app really makes me happy.

Here are a few arbitrary examples of innovative projects and services related to building a smart city:

  • wejo is offering a platform designed specifically for connected car data.
  • Park Now provides cities and operators a cashless mobile parking and payment solution.
  • Scheidt & Bachmann offers a ticketless parking management system.
  • The government of Singapore created Virtual Singapore; an authoritative 3D digital platform intended for use by the public, private, people and research sectors for urban planning, collaboration and decision-making, communication, visualization and other use cases.

The latter is a great example of building a digital twin outside of manufacturing. I covered the topic “Event Streaming with Apache Kafka for Building a Digital Twin” in detail in another post.

Technical Challenges for Building a Smart City

Many cities are investing in technologies to transform their cities into smart city environments in which data collection and analysis is utilized to manage assets and resources efficiently.

The key challenges are:

  • Integration with different data sources and technologies
  • Data transformation and correlation to provide multiple perspectives
  • Real-time processing to act while the information is important
  • High scalability and zero downtime to run continuously, even in case of hardware failure (life in a city never stops)

Challenges - Data Integration, Correlation, Real Time

Modern technology can help connect the right data, at the right time, to the right people, processes and systems.

Learn How to Build a Smart City with Apache Kafka

Innovations around smart cities and the Internet of Things give cities the ability to improve motor safety, unify and manage transportation systems and traffic, save energy and provide a better experience for the residents.

By utilizing an event streaming platform like Apache Kafka, cities are able to process data in real time from thousands of sources, such as sensors. By aggregating that data and analyzing real-time data streams, more informed decisions can be made and fine-tuned operations developed for a positive impact on everyday challenges faced by cities.

Event Streaming and Event Driven Architecture for a Smart City with Apache Kafka

Learn how to:

  • Overcome challenges for building a smart city
  • Connect thousands of devices, machines, and people
  • Build a real time infrastructure to correlate relevant events
  • Leverage open source and fully managed solutions from the Apache Kafka ecosystem
  • Plan and deploy hybrid architectures with edge, on premise and cloud infrastructure

The following two sections share a slide deck and video recording with lots of use cases, technical information and best practices.

Slide Deck – Smart City with the Apache Kafka Ecosystem

Here is the slide deck:

Video Recording – Event Streaming and Real Time Analytics At Scale

The video recording is an extended version of the recent Confluent webinar:

Further readings…

Some recommended material to dig deeper into event streaming to build a smart city.

Zero Downtime and Disaster Recovery

High scalability and zero downtime are crucial in a smart city. A deep dive into zero downtime and disaster recovery with the Apache Kafka ecosystem is available here: Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments.

Real Time Analytics and Machine Learning at Scale

If you are curious how to build a smart city infrastructure, check out the following demo. It provides a scalable deployment of Apache Kafka for real-time analytics with 100,000 connected cars:

The post Smart City with an Event Streaming Platform like Apache Kafka appeared first on Kai Waehner.
