Fraud Detection in Mobility Services (Ride-Hailing, Food Delivery) with Data Streaming using Apache Kafka and Flink
https://www.kai-waehner.de/blog/2025/04/28/fraud-detection-in-mobility-services-ride-hailing-food-delivery-with-data-streaming-using-apache-kafka-and-flink/ (April 28, 2025)

Mobility services like Uber, Grab, FREE NOW (Lyft), and DoorDash are built on real-time data. Every trip, delivery, and payment relies on accurate, instant decision-making. But as these services scale, they become prime targets for sophisticated fraud—GPS spoofing, fake accounts, payment abuse, and more. Traditional, batch-based fraud detection can’t keep up. It reacts too late, misses complex patterns, and creates blind spots that fraudsters exploit. To stop fraud before it happens, mobility platforms need data streaming technologies like Apache Kafka and Apache Flink for fraud detection. This blog explores how leading platforms are using real-time event processing to detect and block fraud as it happens—protecting revenue, user trust, and platform integrity at scale.

Fraud Prevention in Mobility Services with Data Streaming using Apache Kafka and Flink with AI Machine Learning

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, follow me on LinkedIn or X (formerly Twitter) to stay in touch, and make sure to download my free book about data streaming use cases.

The Business of Mobility Services (Ride-Hailing, Food Delivery, Taxi Aggregators, etc.)

Mobility services have become an essential part of modern urban life. They offer convenience and efficiency through ride-hailing, food delivery, car-sharing, e-scooters, taxi aggregators, and micro-mobility options. Companies such as Uber, Lyft, FREE NOW (formerly MyTaxi; recently acquired by Lyft), Grab, Careem, and DoorDash connect millions of passengers, drivers, restaurants, retailers, and logistics partners to enable seamless transactions through digital platforms.

Taxis and Delivery Services in a Modern Smart City

These platforms operate in highly dynamic environments where real-time data is crucial for pricing, route optimization, customer experience, and fraud detection. However, this very nature of mobility services also makes them prime targets for fraudulent activities. Fraud in this sector can lead to financial losses, reputational damage, and deteriorating customer trust.

To effectively combat fraud, mobility services must rely on real-time data streaming with technologies such as Apache Kafka and Apache Flink. These technologies enable continuous event processing and allow platforms to detect and prevent fraud before transactions are finalized.

Why Fraud is a Major Challenge in Mobility Services

Fraudsters continually exploit weaknesses in digital mobility platforms. Some of the most common fraud types include:

  1. Fake Rides and GPS Spoofing: Drivers manipulate GPS data to simulate trips that never occurred. Passengers use location spoofing to receive cheaper fares or exploit promotions.
  2. Payment Fraud and Stolen Credit Cards: Fraudsters use stolen payment methods to book rides or order food.
  3. Fake Drivers and Passengers: Fraudsters create multiple accounts and pretend to be both the driver and passenger to collect incentives. Some drivers manipulate fares by manually adjusting distances in their favor.
  4. Promo Abuse: Users create multiple fake accounts to exploit referral bonuses and promo discounts.
  5. Account Takeovers and Identity Fraud: Hackers gain access to legitimate accounts, misusing stored payment information. Fraudsters use fake identities to bypass security measures.

Fraud not only impacts revenue but also creates risks for legitimate users and drivers. Without proper fraud prevention measures, ride-hailing and delivery companies could face serious losses, both financially and operationally.

The Unseen Enemy: Core Challenges in Mobility Fraud Detection

Traditional fraud detection relies on batch processing and manual rule-based systems. However, these approaches can no longer keep up with the speed and complexity of modern mobile apps, real-time user experiences, and constantly evolving fraud schemes.

Payment Fraud – The Hidden Enemy in a Digital World

Key challenges in mobility fraud detection include:

  • Fraud occurs in real-time, requiring instant detection and prevention before transactions are completed.
  • Millions of events per second must be processed, requiring scalable and efficient systems.
  • Fraud patterns constantly evolve, making static rule-based approaches ineffective.
  • Platforms operate across hybrid and multi-cloud environments, requiring seamless integration of fraud detection systems.

To overcome these challenges, real-time streaming analytics powered by Apache Kafka and Apache Flink provide an effective solution.

Event-driven Architecture for Mobility Services with Data Streaming using Apache Kafka and Flink

Apache Kafka: The Backbone of Event-Driven Fraud Detection

Kafka serves as the core event streaming platform. It captures and processes real-time data from multiple sources (a minimal producer sketch follows the list below), such as:

  • GPS location data
  • Payment transactions
  • User and driver behavior analytics
  • Device fingerprints and network metadata
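
To make this concrete, here is a minimal Java producer sketch that publishes a single GPS ping to Kafka. Everything specific in it (the mobility.gps topic, the broker address, the JSON payload, and the trip ID) is a hypothetical assumption for illustration, not the setup of any particular platform:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class GpsEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // durable writes for fraud-relevant events

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by trip ID so all events of one trip land in the same partition
            // and are consumed in order by the fraud detection application.
            String tripId = "trip-4711"; // hypothetical identifier
            String gpsEvent =
                "{\"tripId\":\"trip-4711\",\"lat\":48.1371,\"lon\":11.5754,\"ts\":1714286400000}";
            producer.send(new ProducerRecord<>("mobility.gps", tripId, gpsEvent));
        }
    }
}
```

In practice, the same pattern applies to payment, user behavior, and device metadata events; each event type typically gets its own topic and schema.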

Kafka provides:

  • High-throughput data streaming, capable of processing millions of events per second to support real-time decision-making.
  • An event-driven architecture that enables decoupled, flexible systems—ideal for scalable and maintainable mobility platforms.
  • Seamless scalability across hybrid and multi-cloud environments to meet growing demand and regional expansion.
  • Always-on reliability, ensuring 24/7 data availability and consistency for mission-critical services such as fraud detection, pricing, and trip orchestration.

An excellent success story about the transition to data streaming comes from DoorDash: Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink.

Apache Flink: Real-Time Fraud Detection with Stream Processing and AI

Apache Flink enables real-time fraud detection through advanced event correlation and applied AI (a simplified sketch follows the list below):

  • Detects anomalies in GPS data, such as sudden jumps, route manipulation, or unrealistic movement patterns.
  • Analyzes historical user behavior to surface signs of account takeovers or other forms of identity misuse.
  • Joins multiple real-time streams—including payment events, location updates, and account interactions—to generate accurate, low-latency fraud scores.
  • Applies machine learning models in-stream, enabling the system to flag and stop suspicious transactions before they are processed.
  • Continuously adapts to new fraud patterns, updating models with fresh data in near real-time to reflect evolving user behavior and emerging threats.
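
As a simplified illustration of the first bullet, the following Flink DataStream sketch keys GPS pings by trip and flags physically implausible jumps between consecutive points. The GpsEvent type, the 200 km/h threshold, and the inline test data are assumptions; a production job would consume from Kafka and typically apply trained models rather than a fixed speed limit:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class GpsJumpDetector {

    // Hypothetical event type: one GPS ping of a trip.
    public static class GpsEvent {
        public String tripId;
        public double lat, lon;
        public long tsMillis;
        public GpsEvent() {}
        public GpsEvent(String tripId, double lat, double lon, long tsMillis) {
            this.tripId = tripId; this.lat = lat; this.lon = lon; this.tsMillis = tsMillis;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                new GpsEvent("trip-1", 48.137, 11.575, 0L),
                new GpsEvent("trip-1", 48.300, 11.900, 10_000L)) // ~30 km in 10 s: suspicious
           .keyBy(e -> e.tripId)
           .process(new KeyedProcessFunction<String, GpsEvent, String>() {
               private transient ValueState<GpsEvent> last; // previous ping per trip

               @Override
               public void open(Configuration parameters) {
                   last = getRuntimeContext().getState(
                           new ValueStateDescriptor<>("lastGps", GpsEvent.class));
               }

               @Override
               public void processElement(GpsEvent e, Context ctx, Collector<String> out)
                       throws Exception {
                   GpsEvent prev = last.value();
                   if (prev != null) {
                       double km = haversineKm(prev.lat, prev.lon, e.lat, e.lon);
                       double hours = (e.tsMillis - prev.tsMillis) / 3_600_000.0;
                       // Assumption: faster than 200 km/h within a city is an anomaly.
                       if (hours > 0 && km / hours > 200.0) {
                           out.collect("GPS jump detected for " + e.tripId);
                       }
                   }
                   last.update(e);
               }
           })
           .print();

        env.execute("gps-jump-detection");
    }

    // Great-circle distance between two coordinates in kilometers.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1), dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }
}
```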

With Kafka and Flink, fraud detection can shift from reactive to proactive to stop fraudulent transactions before they are completed.

I already covered various data streaming success stories from financial services companies such as PayPal, Capital One, and ING Bank in a dedicated blog post. And there is a separate case study about “Fraud Prevention in Under 60 Seconds with Apache Kafka: How A Bank in Thailand is Leading the Charge“.

Real-World Fraud Prevention Stories from Mobility Leaders

Fraud is not just a technical issue—it’s a business-critical challenge that impacts trust, revenue, and operational stability in mobility services. The following real-world examples from industry leaders like FREE NOW (Lyft), Grab, and Uber show how data streaming with advanced stream processing and AI is used around the world to detect and stop fraud in real time, at massive scale.

FREE NOW (Lyft): Detecting Fraudulent Trips in Real Time by Analyzing GPS Data of Cars

FREE NOW operates in more than 150 cities across Europe with 48 million users. It integrates multiple mobility services, including taxis, private vehicles, car-sharing, e-scooters, and bikes.

The company was recently acquired by Lyft, the U.S.-based ride-hailing giant known for its focus on multimodal urban transport and strong presence in North America. This acquisition marks Lyft’s strategic entry into the European mobility ecosystem, expanding its footprint beyond the U.S. and Canada.

FREE NOW - former MyTaxi - Company Overview
Source: FREE NOW

Fraud Prevention Approach leveraging Data Streaming (presented at Kafka Summit)

  • Uses Kafka Streams and Kafka Connect to analyze GPS trip data in real-time.
  • Deploys fraud detection models that identify anomalies in trip routes and fare calculations.
  • Operates data streaming on fully managed Confluent Cloud and applications on Kubernetes for scalable fraud detection.

Fraud Prevention in Mobility Services with Data Streaming using Kafka Streams and Connect at FREE NOW
Source: FREE NOW

Example: Detecting Fake Rides

  1. A driver inputs trip details into the app.
  2. Kafka Streams predicts expected trip fare based on distance and duration.
  3. GPS anomalies and unexpected route changes are flagged.
  4. Fraud alerts are triggered for suspicious transactions, as sketched below.
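
A minimal Kafka Streams sketch of steps 2 to 4 could look like the following. The topic names (trips, fraud-alerts), the CSV record layout, and the naive linear pricing formula are illustrative assumptions, not FREE NOW’s actual fare model:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class FareAnomalyApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fare-anomaly-detector");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Trip records as CSV for brevity: tripId,distanceKm,durationMin,chargedFare
        builder.<String, String>stream("trips")
               .filter((tripId, csv) -> {
                   String[] f = csv.split(",");
                   double distanceKm  = Double.parseDouble(f[1]);
                   double durationMin = Double.parseDouble(f[2]);
                   double chargedFare = Double.parseDouble(f[3]);
                   // Naive expected fare: base + per-km + per-minute rate (illustrative only).
                   double expected = 3.50 + 1.80 * distanceKm + 0.40 * durationMin;
                   // Flag trips charged more than 50 percent above the expectation.
                   return chargedFare > expected * 1.5;
               })
               .to("fraud-alerts"); // downstream systems consume and act on the alerts

        new KafkaStreams(builder.build(), props).start();
    }
}
```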

By implementing real-time fraud detection with Kafka Streams and Kafka Connect, FREE NOW (Lyft) has significantly reduced fraudulent trips and improved platform security.

Grab: AI-Powered Fraud Detection for Ride-Hailing and Delivery with Data Streaming and AI/ML

Grab is a leading mobility platform in Southeast Asia, handling millions of transactions daily. Fraud accounts for 1.6 percent of total revenue loss in the region.

To address these significant fraud numbers, Grab developed GrabDefence—an AI-powered fraud detection engine that leverages real-time data and machine learning to detect and block suspicious activity across its platform.

Fraud Detection and Prevention with Kafka and AI/ML at Grab in Asia
Source: Grab

Fraud Detection Approach

  • Uses Kafka Streams and machine learning for fraud risk scoring.
  • Leverages Flink for feature aggregation and anomaly detection.
  • Detects fraudulent transactions before they are completed.

GrabDefence - Fraud Prevention with Data Streaming and AI / Machine Learning in Grab Mobility Service
Source: Grab

Example: Fake Driver and Passenger Fraud

  1. Fraudsters create accounts as both driver and passenger to claim rewards.
  2. Kafka ingests device fingerprints, payment transactions, and ride data.
  3. Flink aggregates historical fraud behavior and assigns risk scores (a simplified sketch of one such signal follows this list).
  4. High-risk transactions are blocked instantly.
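
Grab’s pipeline uses Flink for this aggregation. Purely for illustration, and to stay in one language across the examples in this post, here is a Kafka Streams sketch of a single such signal: counting how many accounts share one device fingerprint. The topic names, the input encoding, and the threshold of three accounts per device are assumptions, not GrabDefence internals:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;

import java.util.Properties;

public class DeviceReuseScorer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "device-reuse-scorer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Input: account signups keyed by account ID, value = device fingerprint.
        builder.<String, String>stream("signups")
               // Re-key by device fingerprint to count accounts sharing one device.
               .groupBy((accountId, fingerprint) -> fingerprint,
                        Grouped.with(Serdes.String(), Serdes.String()))
               .count()
               .toStream()
               // Assumption: more than three accounts per device is a high-risk signal.
               .filter((fingerprint, accounts) -> accounts > 3)
               .mapValues((fingerprint, accounts) ->
                       "{\"fingerprint\":\"" + fingerprint + "\",\"accounts\":" + accounts + "}")
               .to("high-risk-devices");

        new KafkaStreams(builder.build(), props).start();
    }
}
```

A real risk engine would combine many such signals (payment history, ride patterns, network metadata) into a single score before blocking a transaction.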

With GrabDefence built with data streaming, Grab reduced fraud rates to 0.2 percent, well below the industry average. Learn more about GrabDefence in the Kafka Summit talk.

Uber: Project RADAR – AI-Powered Fraud Detection with Human Oversight

Uber processes millions of payments per second globally. Fraud detection is complex due to chargebacks and uncollected payments.

To combat this, Uber launched Project RADAR—a hybrid system that combines machine learning with human reviewers to continuously detect, investigate, and adapt to evolving fraud patterns in near real time. Low latency is not required in this scenario, and humans are in the loop of the business process, so Apache Spark is sufficient for Uber.

Uber Project Radar for Scam Detection with Humans in the Loop
Source: Uber

Fraud Prevention Approach

  • Uses Kafka and Spark for multi-layered fraud detection.
  • Implements machine learning models to detect chargeback fraud.
  • Incorporates human analysts for rule validation.

Uber Project RADAR with Apache Kafka and Spark for Scam Detection with AI and Machine Learning
Source: Uber

Example: Chargeback Fraud Detection

  1. Kafka collects all ride transactions in real time.
  2. Stream processing detects anomalies in payment patterns and disputes.
  3. AI-based fraud scoring identifies high-risk transactions.
  4. Uber’s RADAR system allows human analysts to validate fraud alerts.

Uber’s combination of AI-driven detection and human oversight has significantly reduced chargeback-related fraud.

Fraud in mobility services is a real-time challenge that requires real-time solutions working 24/7, even at the extreme scale of millions of events. Traditional batch processing systems are too slow, and static rule-based approaches cannot keep up with evolving fraud tactics.

By leveraging data streaming with Apache Kafka in conjunction with Kafka Streams or Apache Flink, mobility platforms can:

  • Process millions of events per second to detect fraud in real time.
  • Prevent fraudulent transactions before they occur.
  • Use AI-driven real-time fraud scoring for accurate risk assessment.
  • Adapt dynamically through continuous learning to evolving fraud patterns.

Mobility platforms such as Uber, Grab, and FREE NOW (Lyft) are leading the way in using real-time streaming analytics to protect their platforms from fraud. By implementing similar approaches, other mobility businesses can enhance security, reduce financial losses, and maintain customer trust.

Real-time fraud prevention in mobility services is not an option; it is a necessity. The ability to detect and stop fraud in real time will define the future success of ride-hailing, food delivery, and urban mobility platforms.

Stay ahead of the curve! Subscribe to my newsletter for insights into data streaming and connect with me on LinkedIn to continue the conversation. And download my free book about data streaming use cases.

Cathay: From Premium Airline to Integrated Travel Ecosystem with Data Streaming
https://www.kai-waehner.de/blog/2025/03/10/cathay-from-premium-airline-to-integrated-travel-ecosystem-with-data-streaming/ (March 10, 2025)

Cathay Pacific is no longer just an airline. With its rebrand to Cathay, the company is expanding beyond flights to build a comprehensive travel ecosystem, where customers not only book flights but also shop, earn loyalty rewards, and experience seamless end-to-end travel services. To enable this transformation, real-time data streaming with Apache Kafka in the cloud has become the backbone of Cathay’s data strategy. Data streaming helps integrate systems, optimize customer experience, and unlock new business opportunities.

From Airline to Travel Ecosystem with Data Streaming using Apache Kafka at Cathay Pacific

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, follow me on LinkedIn or X (formerly Twitter) to stay in touch, and make sure to download my free book about data streaming use cases, including success stories around Kafka and Flink at Lufthansa and Schiphol Group (Amsterdam Airport).

The aviation and travel industry is undergoing a digital transformation, driven by real-time data streaming. From flight operations and passenger experience to predictive maintenance and dynamic pricing, streaming technologies like Apache Kafka and Apache Flink enable airlines, airports, and travel platforms to process and act on data instantly. This shift enhances operational efficiency, safety, and customer satisfaction, making travel more seamless, intelligent, and data-driven.

Event-driven Architecture with Data Streaming using Apache Kafka and Flink in Aviation, Airlines, Airports

Learn more about data streaming with Kafka and Flink in the airline and travel industry in my other blog posts and case studies.

Cathay – A Global Premium Airline and Lifestyle Brand

Founded in 1946 in Hong Kong, Cathay Pacific started as a regional airline and grew into a global premium carrier, connecting Asia with the world. Over decades, it expanded long-haul routes, acquired Dragonair, and became a major player in international aviation.

Now, Cathay is undergoing a strategic transformation beyond just aviation. The company has rebranded as ‘Cathay’, introducing a premium travel master brand that integrates flights, holidays, shopping, dining, wellness, and payments, all unified under the Asia Miles loyalty programme. While Cathay Pacific remains the airline brand, ‘Cathay’ now represents a holistic approach to premium travel experiences.

CEO Ronald Lam emphasized that modern travelers seek more than just transportation—they seek seamless and personalized experiences. This evolution reflects Cathay’s 77-year journey and renewed commitment to customer-centric innovation.

The Future of Travel with Real-Time Data Streaming

To power this transformation, Cathay is leveraging real-time data streaming to enable:

✅ Personalized customer experiences to offer tailored holiday packages and rewards.

✅ Seamless digital integration to allow customers to book, shop, and earn rewards effortlessly.

✅ Smarter operations to optimize airline logistics, inventory, and real-time customer engagement.

This article summarizes a conversation between Marc Keng (Integration Architect Lead at Cathay Pacific) and me at the Data in Motion Tour in Singapore.

Cathay Airline talking about Data Streaming with Apache Kafka at Confluent Data in Motion Tour Singapore
Source: Confluent

Cathay’s first global campaign in three years, ‘Feels Good To Move’, embodies this shift—redefining premium travel as an integrated lifestyle. By harnessing data-driven intelligence and real-time streaming technologies, Cathay is shaping the future of travel, offering a truly connected, customer-first experience.

Why Cathay Pacific Moved from Traditional Middleware to Data Streaming with Apache Kafka

Like many enterprises, Cathay Pacific previously relied on traditional middleware using APIs, file transfers, and message queues (MQ). While these technologies worked well for certain integrations, they often lacked real-time capabilities and couldn’t scale efficiently to support the airline’s expanding digital services.

Why Apache Kafka?

  1. Industry Standard for Data Streaming:
    • Apache Kafka is the de facto standard for data streaming, with a broad ecosystem of connectors, tooling, and expertise to build on.
  2. Seamless SaaS Integration:
    • As Cathay expanded its ecosystem, integrating with cloud-based services like AWS, payment platforms, and booking engines became essential.
    • Kafka’s widespread support made integration with third-party travel and retail solutions much easier.

Why a Cloud-Native SaaS for Data Streaming?

Migrating to serverless Confluent Cloud was a strategic decision that accelerated Cathay’s ability to innovate.

✅ Security and Compliance: Private Link ensures secure, direct connectivity between cloud services.

✅ Managed Connectors: Prebuilt AWS S3 and Lambda connectors reduced integration time and effort.

✅ Faster Time to Market: Fully managed Kafka infrastructure eliminated operational overhead, letting engineers focus on building new customer experiences.

✅ Schema Registry for Data Governance: Standardized schemas ensure data quality and consistency, preventing errors when streaming data across multiple systems. A minimal producer configuration with Schema Registry is sketched below.
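
For illustration, a producer configured against Schema Registry could look like the sketch below. The Booking schema, topic name, and URLs are hypothetical; the point is that the Avro serializer registers and validates the schema on write, so malformed events never reach downstream consumers:

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BookingEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumption
        props.put("schema.registry.url", "http://localhost:8081"); // assumption
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", KafkaAvroSerializer.class.getName());

        // Illustrative schema; events violating it fail at serialization time.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Booking\",\"fields\":["
              + "{\"name\":\"bookingId\",\"type\":\"string\"},"
              + "{\"name\":\"memberId\",\"type\":\"string\"},"
              + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        GenericRecord booking = new GenericData.Record(schema);
        booking.put("bookingId", "B-1001");
        booking.put("memberId", "AM-42");
        booking.put("amount", 1234.56);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("bookings", "B-1001", booking));
        }
    }
}
```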

Driving the Business Value with Data Streaming

Cathay Pacific has embraced real-time event streaming to connect legacy systems, cloud services, and business applications across its expanding ecosystem.

Business Value of Data Streaming with Apache Kafka and Flink in the free Confluent eBook

Real-Time Analytics for a Personalized Customer Experience

By combining booking data, shopping transactions, and loyalty program activity, Cathay can generate real-time customer insights to:

🔹 Optimize in-flight services based on customer preferences.
🔹 Offer personalized promotions while customers shop via Cathay’s retail channels.
🔹 Improve demand forecasting for better inventory and pricing decisions.

With Kafka-powered real-time analytics, business teams can instantly access and act on data—no need to wait for batch ETL jobs.

Empowering Business Users with a Self-Service Data Catalog

To make real-time data easily accessible across departments, Cathay has built an internal data catalog using Kafka streams.

✅ Business users can discover and access real-time datasets without needing IT intervention.
✅ Faster innovation as teams can combine different data sources (e.g., booking & shopping data) to create new customer engagement strategies.

Moving from Batch Processing to Streaming

Cathay is gradually phasing out traditional batch ETL pipelines in favor of event-driven architectures:

🚀 Replacing legacy file-based processing and MQ systems with real-time Kafka streaming.
🚀 Exploring Apache Flink for real-time stream processing, enabling advanced data transformations.
🚀 Reducing reliance on Informatica, whose batch ETL approach and clunky connectors limit real-time capabilities.

The long-term vision? A unified event streaming platform where all operational data flows in real time.

Data Streaming: The Foundation for Cathay’s Future Growth

As Cathay Pacific continues its digital transformation, real-time data streaming will be essential in supporting new services and revenue streams beyond air travel.

By leveraging Apache Kafka and its ecosystem in Confluent Cloud, Cathay has turned its data into a strategic asset, enabling:

🔹 Seamless integration between cloud-native and legacy applications
🔹 Faster time to market for new customer experiences
🔹 A scalable, event-driven infrastructure for future innovations

The transition from a traditional airline to a full-scale travel & retail ecosystem requires real-time, scalable, and flexible data infrastructure.

With Confluent Cloud, Cathay has built a central integration platform where data is treated as a product—accessible, trustworthy, and reusable across the organization.

As Cathay expands its partner ecosystem, introduces AI-driven personalization, and enhances loyalty programs, its investment in data streaming will be critical to unlocking new business opportunities.

Cathay is proving that in today’s digital world, an airline is much more than just flights—it’s an integrated travel experience, powered by real-time data. 🚀

For more success stories in the airline and travel industry, make sure to download my free ebook that covers success stories around Kafka and Flink at Lufthansa and Schiphol Group (Amsterdam Airport). Please let me know your thoughts, feedback and use cases on LinkedIn and stay in touch via my newsletter.

Why DoorDash migrated from Cloud-native Amazon SQS and Kinesis to Apache Kafka and Flink
https://www.kai-waehner.de/blog/2022/08/18/why-doordash-migrated-from-cloud-native-amazon-sqs-and-kinesis-to-apache-kafka-and-flink/ (August 18, 2022)

Even digital natives – companies that started their business in the cloud without legacy applications in their own data centers – need to modernize their cloud-native enterprise architecture to improve business processes, reduce costs, and provide real-time information to their downstream applications. This blog post explores the benefits of an open and flexible data streaming platform compared to proprietary message queue and data ingestion cloud services. A concrete example shows how DoorDash replaced cloud-native AWS SQS and Kinesis with Apache Kafka and Flink.

Migration from Amazon Kinesis and SQS to Apache Kafka and Flink in the Cloud on AWS

Message queue and ETL vs. data streaming with Apache Kafka

A message queue like IBM MQ, RabbitMQ, or Amazon SQS enables sending and receiving messages. This works great for point-to-point communication. However, additional tools like Apache NiFi, Amazon Kinesis Data Firehose, or other ETL tools are required for data integration and data processing.

A data streaming platform like Apache Kafka provides many capabilities:

  • Producing and consuming messages for real-time messaging at any scale
  • Data integration to avoid spaghetti architectures with plenty of middleware components in the end-to-end pipeline
  • Stream processing to continuously process and correlate data from different systems
  • Distributed storage for true decoupling, backpressure handling, and replayability of events – all in a single platform

I covered the apples-to-oranges comparison between message queues and data streaming in the article “Comparison: JMS Message Queue vs. Apache Kafka“. In conclusion, data streaming with Kafka provides a data hub to easily access events from different downstream applications (no matter what technology, API, or communication paradigm they use).

Apache Kafka as cloud-native real-time Data Hub for Data Streaming and Integration

If you need to mix a messaging infrastructure with ETL platforms, a spaghetti architecture is the consequence. This is true whether you are on-premise using ETL or ESB tools, or in the cloud leveraging iPaaS platforms. The more platforms you (have to) combine in a single data pipeline, the higher the costs and operational complexity, and the weaker the end-to-end SLA guarantees. That’s one of the top arguments for why Apache Kafka became the de facto standard for data integration.

Cloud-native is the new black for infrastructure and applications

Most modern enterprise architectures leverage cloud-native infrastructure, applications, and SaaS, no matter if the deployment happens in the public or private cloud. A cloud-native infrastructure provides:

  • automation via DevOps and continuous delivery
  • elastic scale with containers and orchestration tools like Kubernetes
  • agile development with domain-driven design and decoupled microservices

The Benefits and Concepts of Cloud Native Infrastructure

In the public cloud, fully-managed SaaS offerings are the preferred way to deploy infrastructure and applications. This includes services like Amazon SQS, Amazon Kinesis, and Confluent Cloud for fully-managed Apache Kafka. Scarce engineering experts can focus on solving business problems and building innovative new applications instead of operating the infrastructure.

However, not everything can run as a SaaS. Cost, security, and latency are the key arguments why applications are deployed in their own cloud VPC, an on-premise data center, or at the edge. Operators and CRDs for Kubernetes or Ansible scripts are common solutions to deploy and operate your own cloud-native infrastructure if using a serverless cloud product is not possible or feasible.

DoorDash: From multiple pipelines to data streaming for Snowflake integration

DoorDash is an American company that operates an online food ordering and food delivery platform. With a 50+% market share, it is the largest food delivery company in the United States.

Obviously, such a service requires scalable real-time pipelines to be successful. Otherwise, the business model does not work. For similar reasons, all the other mobility services like Uber and Lyft in the US, FREE NOW in Europe, or Grab in Asia leverage data streaming as the foundation of their data pipelines.

The following success story is based on DoorDash’s blog post “Building scalable real time event processing with Kafka and Flink“.

Challenges with multiple integration pipelines using SQS and Kinesis instead of Apache Kafka

Events are generated from many DoorDash services and user devices. They need to be processed and transported to different destinations, including:

  • OLAP data warehouse for business analysis
  • Machine Learning (ML) platform to generate real-time features like recent average wait times for restaurants
  • Time series metric backend for monitoring and alerting so that teams can quickly identify issues in the latest mobile application releases

The integration pipelines and downstream consumers leverage different technologies, APIs, and communication paradigms (real-time, near real-time, batch).

Each pipeline is built differently and can only process one kind of event. It involves multiple hops before the data finally gets into the data warehouse.

It is cost inefficient to build multiple pipelines that are trying to achieve similar purposes. DoorDash used cloud-native AWS messaging and streaming systems like Amazon SQS and Amazon Kinesis for data ingestion into the Snowflake data warehouse:

Legacy data pipeline at DoorDash

Mixing different kinds of data transport and going through multiple messaging/queueing systems without carefully designed observability around it leads to difficulties in operations.

These issues resulted in high data latency, significant cost, and operational overhead at DoorDash. Therefore, DoorDash moved to a cloud-native streaming platform powered by Apache Kafka and Apache Flink for continuous stream processing before ingesting data into Snowflake:

Cloud-native data streaming powered by Apache Kafka and Apache Flink for Snowflake integration at Doordash

The move to a data streaming platform provides many benefits to DoorDash:

  • Heterogeneous data sources and destinations, including REST APIs via the Confluent REST Proxy
  • Easily accessible from any downstream application (no matter which technology, API, or communication paradigm)
  • End-to-end data governance with schema enforcement and schema evolution with Confluent Schema Registry
  • Scalable, fault-tolerant, and easy to operate for a small team

REST/HTTP is complementary to data streaming with Kafka

Not all communication is real-time and streaming. HTTP/REST APIs are crucial for many integrations. DoorDash leverages the Confluent REST Proxy to produce and consume via HTTP to/from Kafka.
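
For example, producing a JSON record through the REST Proxy is nothing more than an HTTP POST. This sketch uses Java’s built-in HttpClient; the proxy URL, topic name, and payload are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProduceExample {
    public static void main(String[] args) throws Exception {
        // One record in the REST Proxy v2 JSON envelope.
        String body = "{\"records\":[{\"value\":{\"orderId\":\"o-123\",\"status\":\"CREATED\"}}]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8082/topics/orders"))
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The proxy responds with the partition and offset of the written record.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

This lets request-response clients participate in the event streaming architecture without running a native Kafka client.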

Integration of Kafka and API Management Tools using REST HTTP

Learn more about this combination, its use cases, and trade-offs in my blog post “Request-Response with REST/HTTP vs. Data Streaming with Apache Kafka – Friends, Enemies, Frenemies?“.

All the details about this cloud-native infrastructure optimization are in DoorDash’s engineering blog post: “Building Scalable Real-Time Event Processing with Kafka and Flink“.

Don’t underestimate vendor lock-in and cost of proprietary SaaS offerings

One of the key reasons I see customers migrating away from proprietary serverless cloud services like Kinesis is cost. While it looks fine initially, it can get crazy when the data workloads scale. Very limited retention time and missing data integration capabilities are other reasons.

The DoorDash example shows how even cloud-native greenfield projects require modernization of the enterprise architecture to simplify the pipelines and reduce costs.

A side benefit is the independence of a specific cloud provider. With open-source powered engines like Kafka or Flink, the whole integration pipeline can be deployed everywhere. Possible deployments include:

  • Cluster linking across countries or even continents (including filtering, anonymization, and other data privacy-relevant processing before data sharing and replication)
  • Multiple cloud providers (e.g., if GCP is cheaper than AWS, or because only Alibaba Cloud is available in Mainland China)
  • Low-latency workloads or zero-trust security environments at the edge (e.g., in a factory, stadium, or train)

How do you see the trade-offs between open source frameworks like Kafka and Flink versus proprietary cloud services like AWS SQS or Kinesis? What are your decision criteria to make the right choice for your project? Did you already migrate services from one to the other? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
https://www.kai-waehner.de/blog/2022/07/05/data-streaming-for-data-ingestion-into-data-warehouse-and-data-lake/ (July 5, 2022)

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 2: Data Streaming for Data Ingestion into the Data Warehouse and Data Lake.

Apache Kafka Confluent for Data Ingestion into Snowflake Databricks BigQuery Data Warehouse

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using a data warehouse, data lake, and data streaming together:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. THIS POST: Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Reliable and scalable data ingestion with data streaming

Reliable and scalable data ingestion is crucial for any analytics platform, whether you build a data warehouse, data lake, or lakehouse.

Almost every major player in the analytics space re-engineered the existing platform to enable data ingestion in (near) real-time instead of just batch integration.

Data ingestion = data integration

Just getting data from A to B is usually not enough. Data integration with Extract, Transform, Load (ETL) workflows connects to various source systems and processes incoming events before ingesting them into one or more data sinks. However, most people think about a message queue when discussing data ingestion. Instead, a good data ingestion technology provides additional capabilities:

  • Connectivity: Connectors enable a quick but reliable integration with any data source and data sink. Don’t build your own connector if a connector is available!
  • Data processing: Integration logic filters, enriches, or aggregates data from one or more data sources.
  • Data sharing: Replicate data across regions, data centers, and multi-cloud.
  • Cost-efficient storage: Truly decouple data sources from various downstream consumers (analytics platforms, business applications, SaaS, etc.)

Kafka is more than just data ingestion or a message queue. Kafka is a cloud-native middleware:

Apache Kafka as cloud-native middleware for data ingestion into data warehouse data lake lakehouse

One common option is to run ETL workloads in the data warehouse or data lake; in that case, a message queue is sufficient for data ingestion. Even then, most projects use the data streaming platform Apache Kafka as the data ingestion layer, even though many great Kafka features are disregarded this way.

Instead, running ETL workloads in the data ingestion pipeline has several advantages:

  • True decoupling between the processing and storage layer enables a scalable and cost-efficient infrastructure.
  • ETL workloads are processed in real-time, no matter what pace each downstream application provides.
  • Each consumer (data warehouse, data lake, NoSQL database, message queue) has the freedom of choice for the technology/API/programming language/communication paradigm to decide how and when to read, process, and store data.
  • Downstream consumers choose between raw data and curated relevant data sets.

Data integration at rest or in motion?

ETL is nothing new. Tools from Informatica, TIBCO, and other vendors provide great visual coding for building data pipelines. The evolution in the cloud brought us cloud-native alternatives like SnapLogic or Boomi. API Management tools like MuleSoft are another common option for building an integration layer based on APIs and point-to-point integration.

The enormous benefit of these tools is a much better time-to-market for building and maintaining pipelines. A few drawbacks exist, though:

  • Separate infrastructure to operate and pay for
  • Often limited scalability and/or higher latency when using dedicated middleware compared to a native stream ingestion layer (like Kafka or Kinesis)
  • Data stored at rest in separate storage systems
  • Tight coupling with web services and point-to-point connections
  • Integration logic lies in the middleware platform; expertise and proprietary tooling without freedom of choice for implementing data integration

The great news is that you have the freedom of choice. If the heart of the infrastructure is real-time and scalable, then you can add not just scalable real-time applications but any batch or API-based middleware:

Kafka and other Middleware like MQ ETL ESB API in the Enterprise Architecture

Point-to-point integration via API Management is complementary to data streaming, not competitive.

Data mesh for multi-cloud data ingestion

Data Mesh is the latest buzzword in the software industry. It combines domain-driven design, data marts, microservices, data streaming, and other concepts. Like Lakehouse, Data Mesh is a logical concept, NOT physical infrastructure built with a single technology! I explored what role data streaming with Apache Kafka plays in a data mesh architecture.

In summary, the characteristics of Kafka, such as true decoupling, backpressure handling, data processing, and connectivity to real-time and non-real-time systems, enable the building of a distributed global data architecture.

Here is an example of a data mesh built with Confluent Cloud and Databricks across cloud providers and regions:

Global Multi Cloud Data Mesh with Apache Kafka Confluent Databricks

Data streaming as the data ingestion layer plays many roles in a multi-cloud and/or multi-region environment:

  • Data replication between clouds and/or regions
  • Data integration with various source applications (either directly or via 3rd party ETL/iPaaS middleware)
  • Preprocessing in the local region to reduce data transfer costs and improve latency between regions
  • Data ingestion into one or more data warehouses and/or data lakes
  • Backpressure handling for slow consumers or between regions in case of disconnected internet

The power of data streaming becomes more apparent in this example: Data ingestion or a message queue alone is NOT good enough for most “data ingestion projects“!

Apache Kafka – The de facto standard for data ingestion into the lakehouse

Apache Kafka is the de facto standard for data streaming, and a critical use case is data ingestion. Kafka Connect enables a reliable integration in real-time at any scale. It automatically handles failure, network issues, downtime, and other operations issues. Search for your favorite analytics platform and check the availability of a Kafka Connect connector. The chances are high that you will find one.

Critical advantages of Apache Kafka compared to other data ingestion engines like AWS Kinesis include:

  • True decoupling: Data ingestion into a data warehouse is usually only one of the data sinks. Most enterprises ingest data into various systems and build new real-time applications with the same Kafka infrastructure and APIs.
  • Open API across multi-cloud and hybrid environments: Kafka can be deployed everywhere, including any cloud provider, data center, or edge infrastructure.
  • Cost-efficiency: The more you scale, the more cost-efficient Kafka workloads become compared to Kinesis and similar tools.
  • Long-term storage in the streaming platform: Replayability of historical data and backpressure handling for slow consumers are built into Kafka (and cost-efficient if you leverage Tiered Storage).
  • Pre-processing and ETL in a single infrastructure: Instead of ingesting vast volumes of raw data into various systems at rest, the incoming data can be processed in motion with tools like Kafka Streams or ksqlDB once. Each consumer chooses the curated or raw data it needs for further analytics – in real-time, near real-time, batch, or request-response.

Kafka as data ingestion and ETL middleware with Kafka Connect and ksqlDB

Kafka provides two capabilities that many people often underestimate in the beginning (because they think about Kafka just as a message queue):

  • Data integration: Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems. Kafka Connect makes it simple to quickly define connectors that move large data sets into and out of Kafka.
  • Stream Processing: Kafka Streams is a client library for building stateless or stateful applications and microservices, where the input and output data are stored in Kafka clusters. ksqlDB is built on top of Kafka Streams to build data streaming applications leveraging your familiarity with relational databases and SQL. A minimal Kafka Streams ETL sketch follows below.
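
As a minimal sketch of such in-stream ETL with Kafka Streams, the following topology filters and normalizes raw events once and publishes a curated topic that every consumer can read at its own pace. The topic names and the filter rule are assumptions for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class CuratedOrdersEtl {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "curated-orders-etl");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("orders.raw")
               // Drop test and malformed events once, in motion ...
               .filter((key, value) -> value != null && !value.contains("\"test\":true"))
               // ... and normalize the payload for all downstream consumers.
               .mapValues(String::trim)
               .to("orders.curated");

        new KafkaStreams(builder.build(), props).start();
    }
}
```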

The key difference from other ETL tools is that Kafka eats its own dog food:

Kafka vs. ESB ETL Nifi - Eat your own Dog Food

Please note that Kafka Connect is more than a set of connectors. The underlying framework provides many additional middleware features. Look at single message transforms (SMT), dead letter queue, schema validation, and other capabilities of Kafka Connect.

Kafka as cloud-native middleware; NOT iPaaS

Data streaming with Kafka often replaces ETL tools and ESBs. Here are the key reasons:

Why Kafka as iPaaS instead of Traditional Middleware like MQ ETL ESB

Some consider data streaming an integration platform as a service (iPaaS). I can’t entirely agree with that. The arguments are very similar to saying Kafka is an ETL tool. These are very different technologies that shine with other characteristics (and trade-offs). Check out my blog post explaining why data streaming is a new software category and why Kafka is NOT an iPaaS.

Reference architectures for data ingestion with Apache Kafka

Let’s explore two example architectures for data ingestion with Apache Kafka and Kafka Connect: Elasticsearch Data Streams and Databricks Delta Lake. Both products serve very different use cases and require a different data ingestion strategy.

The developer can use the same Kafka Connect APIs for different connectors. Under the hood, the implementation looks very different to serve different needs and SLAs.

Both are using the dedicated Kafka Connect connector on the high-level architecture diagram. Under the hood, ingestion speed, indexing strategy, delivery guarantees, and many other factors must be configured depending on the project requirements and data sink capabilities. 

Data ingestion example: Real-time indexing with Kafka and Elasticsearch Data Streams

One of my favorite examples is Elasticsearch. Naturally, the search engine built indices in batch mode. Hence, the first Kafka Connect connector ingested events in batch mode. However, Elastic created a new real-time indexing strategy called Elasticsearch Data Streams to offer its end users faster query and response times.

Elasticsearch Data Streams and Apache Kafka with Confluent for Data Ingestion

An Elastic Data Stream lets you store append-only time series data across multiple indices while giving you a single named resource for requests. Data streams are well-suited for logs, events, metrics, and other continuously generated data. You can submit indexing and search requests directly to a data stream. The stream automatically routes the request to backing indices that store the stream’s data. A Kafka data stream is the ideal data source for Elastic Data Streams.
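
For illustration, a sink connector is created declaratively with a single call against the Kafka Connect REST API; no ingestion code is written at all. The connector class and option names below match the Confluent Elasticsearch sink connector as I know it, but treat the exact configuration as an assumption to verify against the documentation of your connector version:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateElasticsearchSink {
    public static void main(String[] args) throws Exception {
        // Connector name, topic, and URLs are illustrative.
        String config = """
            {
              "name": "elasticsearch-sink",
              "config": {
                "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
                "topics": "logs",
                "connection.url": "http://localhost:9200",
                "key.ignore": "true"
              }
            }""";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // Kafka Connect REST API
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```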

Data ingestion example: Confluent and Databricks

Confluent in conjunction with Databricks is another excellent example. Many people struggle to explain both vendors’ differences and unique selling points. Why? Because the marketing looks very similar for both: Process big data, store it forever, analyze it in real-time, deploy across multi-cloud and multi-region, and so on. The reality is that Confluent and Databricks overlap only ~10%. Both complement each other very well.

Confluent’s focus is to process data in motion. Databricks’ primary business is storing data at rest for analytics and reporting. Yes, you can store data long term in Kafka (especially with Tiered Storage). Yes, you can process data with Databricks in (near) real-time with Spark Streaming. That’s fine for some use cases, but in most scenarios, you (should) choose and combine the best technology for a problem.

Here is a reference architecture for data streaming and analytics with Confluent and Databricks:

Reference Architecture for Data Streaming and Analytics with Confluent and Databricks

Confluent and Databricks are a perfect combination for a modern machine learning architecture:

  • Data ingestion with data streaming
  • Model training in the data lake
  • Streaming or batch ETL, wherever it fits the use case best
  • Model deployment in the data lake close to the data science environment or a data streaming app for mission-critical SLAs and low latency
  • Monitoring the end-to-end pipeline (ETL, model scoring, etc.) in real-time with data streaming

I hope it is clear that data streaming vendors like Confluent have very different capabilities, strengths, and weaknesses than data warehouse vendors like Databricks or Snowflake.

How to build a cloud-native lakehouse with Kafka and Spark?

Databricks coined Lakehouse to talk about real-time data in motion and batch workloads at rest in a single platform. From my point of view, the Lakehouse is a great logical view. But there is no silver bullet! The technical implementation of a Lakehouse requires different tools in a modern data stack.

In June 2022, I presented my detailed perspective on this discussion at Databricks’ Data + AI Summit in San Francisco. Check out my slide deck “Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture“.

Data streaming is much more than data ingestion into a lakehouse

Data streaming technologies like Apache Kafka are perfect for data ingestion into one or more data warehouses and/or data lakes. BUT data streaming is much more: Integration with various data sources, data processing, replication across regions or clouds, and finally, data ingestion into the data sinks.

Examples with vendors like Confluent, Databricks, and Elasticsearch showed how data streaming helps solve many data integration challenges with a single technology.

Nevertheless, there is no silver bullet. A modern lakehouse leverages best-of-breed technologies. No single technology can be the best at handling all kinds of data sets and communication paradigms (like real-time, batch, request-response).

For more details, browse other posts of this blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. THIS POST: Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

How do you combine data warehouse and data streaming today? Is Kafka just your ingestion layer into the data lake? Do you already leverage data streaming for additional real-time use cases? Or is Kafka already the strategic component in the enterprise architecture for decoupled microservices and a data mesh? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka as Data Hub for Crypto, DeFi, NFT, Metaverse – Beyond the Buzz
https://www.kai-waehner.de/blog/2022/02/04/apache-kafka-as-data-hub-for-crypto-defi-nft-metaverse-beyond-the-buzz/ (February 4, 2022)

Decentralized finance with crypto and NFTs is a huge topic these days. It becomes a powerful combination with the coming metaverse platforms from social networks, cloud providers, gaming vendors, sports leagues, and fashion retailers. This blog post explores the relationship between crypto technologies like Ethereum, blockchain, NFTs, and modern enterprise architecture. I discuss how event streaming and Apache Kafka help build innovation and scalable real-time applications of a future metaverse. Let’s skip the buzz (and NFT bubble) and instead review practical examples and existing real-world deployments in the crypto and blockchain world powered by Kafka and its ecosystem.

Event Streaming as the Data Hub for Crypto NFT and Metaverse

What are crypto, NFT, DeFi, blockchain, smart contracts, metaverse?

I assume that most readers of this blog post have a basic understanding of the crypto market and event streaming with Apache Kafka. The target audience should be interested in the relationship between crypto technologies and a modern enterprise architecture powered by event streaming. Nevertheless, let’s explain each buzzword in a few words to have the same understanding:

  • Blockchain: Foundation for cryptocurrencies and decentralized applications (dApp) powered by digital distributed ledgers with immutable records
  • Smart contracts: dApps running on a blockchain like Ethereum
  • DeFi (Decentralized Finance): Group of dApps to provide financial services without intermediaries
  • Cryptocurrency (or crypto): Digital currency that works as an exchange through a computer network that is not reliant on any central authority, such as a government or bank
  • Crypto coin: Native coin of a blockchain to trade currency or store value (e.g., Bitcoin in the Bitcoin blockchain or Ether for the Ethereum platform)
  • Crypto token: Similar to a coin, but uses another coin’s blockchain to provide digital assets – the functionality depends on the project (e.g., developers built plenty of tokens for the Ethereum smart contract platform)
  • NFT (Non-fungible Token): Non-interchangeable, uniquely identifiable unit of data stored on a blockchain (that’s different from Bitcoin where you can “replace” one Bitcoin with another one), covering use cases such as identity, arts, gaming, collectibles, sports, media, etc.
  • Metaverse: A network of 3D virtual worlds focused on social connections. This is not just about Meta (formerly Facebook). Many platforms and gaming vendors are creating their own metaverses these days. Hopefully, open protocols will enable interoperability between different metaverses, platforms, APIs, and AR/VR technologies. Crypto and NFTs will probably be critical factors in the metaverse.

Cryptocurrency and DeFi marketplaces and brokers are used for trading between fiat and crypto or between two cryptocurrencies. Other use cases include long-term investments and staking. The latter compensates you for locking your assets in a Proof-of-Stake consensus network, which most modern blockchains use instead of the resource-hungry Proof-of-Work used in Bitcoin. Some solutions focus on providing services on top of crypto (monitoring, analytics, etc.).

Don’t worry if you don’t know what all these terms mean. The key takeaway: a scalable real-time data hub is required to integrate crypto and non-crypto technologies and to build innovative solutions for crypto and the metaverse.

What’s my relation to crypto, blockchain, and Kafka?

It might help to share my background with blockchain and cryptocurrencies before I explore the actual topic about its relation to event streaming and Apache Kafka.

I worked with blockchain technologies 5+ years ago at TIBCO. I implemented and deployed smart contracts with Solidity (the smart contract programming language for Ethereum). I integrated blockchains such as Hyperledger and Ethereum with ESB middleware. And yes, I bought some Bitcoin for ~500 dollars. Unfortunately, I sold them afterward for ~1000 dollars as I was curious about the technology, not the investment part. 🙂

Middleware in real-time is vital for integration

I gave talks at international conferences and published a few articles about blockchain. For instance, “Blockchain – The Next Big Thing for Middleware” at InfoQ in 2016. The article is still pretty accurate from a conceptual perspective. The technologies, solutions, and vendors developed, though.

I thought about joining a blockchain startup. Coincidentally, one company I was talking to was building a “next-generation blockchain platform for transactions in real-time at scale”, powered by Apache Kafka. No joke.

However, I joined Confluent in 2017. I thought that processing data in motion at any scale for transactional and analytics workloads was the more significant paradigm shift. “Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing” describes my decision in 2017. I think I was right. 🙂 Today, most enterprises leverage Kafka as an alternative to MQ, ETL, and ESB tools or implement a cloud-native iPaaS with serverless Kafka.

Blockchain, crypto, and NFT are here to stay (for some use cases)

Blockchain and crypto are here to stay, but they are not needed for every problem. Blockchain is a niche technology. For that niche, it is excellent. TL;DR: You only need a blockchain in untrusted environments. The famous example of supply chain management is valid. Cryptocurrencies and smart contracts are also here to stay – partly for investment, partly for building innovative new applications.

Today, I work with customers across the globe. Crypto marketplaces, blockchain monitoring infrastructure, and custodian banking platforms are built on Kafka for good reasons (scale, reliability, real-time). The key to success for most customers is integrating crypto and blockchain platforms and the rest of the IT infrastructure, like business applications, databases, and data lakes.

Trust is important, and trustworthy marketplaces (= banks?) are needed (for some use cases)

Privately, I own several cryptocurrencies. I am not a day-trader. My strategy is a long-term investment (but only a fraction of my total investment; only money I am okay with losing 100%). I invest in several coins and platforms, including Bitcoin, Ethereum, Solana, Polkadot, Chainlink, and a few even riskier ones. I firmly believe that crypto and NFTs are a game-changer for some use cases like gaming and the metaverse, but I also think paying hundreds of thousands of dollars for a digital ape is insane (and just an investment bubble).

I sold my Bitcoins in 2016 because trustworthy marketplaces were missing. I don’t care too much about the decentralization of my long-term investment. I do not want to manage my own cold storage, write a long and complex recovery code on paper, and put it into my safe. I want a secure, trustworthy custodian that takes over this burden.

For that reason, I use compliant German banks for crypto investments. If a coin is unavailable, I go to an international marketplace that feels trustworthy. For instance, the exchange Crypto.com and the NFT marketplace OpenSea recently did a great job of letting their insurance pay for a hack and the loss of customer coins and NFTs, respectively. That’s what I expect as a customer and why I am happy to pay a small fee for buying and selling cryptocurrencies or NFTs on such a platform.

“The False Promise of Web3” is a great read to understand why many crypto and blockchain discussions are not indeed about decentralization. As the article says, “the advertised decentralization of power out of the hands of a few has, in fact, been a re-centralization of power into the hands of fewer“. I am a firm believer in the metaverse, crypto, NFT, DeFi, and blockchain. However, I am fine if some use cases are centralized and provide regulation, compliance, security, and other guarantees.

With this long background story, let’s explore the current crypto and blockchain world and how this relates to event streaming and Apache Kafka.

Use cases for the metaverse and non-fungible tokens (NFT)

Let’s start with some history:

  • 1995: Amazon was just an online shop for books. I bought my books in my local store.
  • 2005: Netflix was just a DVD-by-mail service. Technology changed over time, but I rented HD DVDs, Blu-rays, and other media in a similar way.
  • 2022: I will not use Zuckerberg’s Metaverse. Never. It will fail like Second Life (which was launched in 2003). And I don’t get the noise around NFT. It is just a jpeg file that you can right-click and save for free.

Well, we are still in the early stages of cryptocurrencies, and in the very early stages of Metaverse, DeFi, and NFT use cases and business models. However, this is not just hype like the Dot-com bubble in the early 2000s. Tech companies have exciting business models. And software is eating every industry today. Profit margins are enormous, too.

Let’s explore a few use cases where Metaverse, DeFi, and NFTs make sense:

Billions of players and massive revenues in the gaming industry

The gaming industry is already bigger than all other media categories combined, and this is still just the beginning of a new era. Millions of new players join the gaming community every month across the globe.

Connectivity improves, and cheap smartphones are sold in less wealthy countries. New business models like “play to earn” change how the next generation of gamers plays a game. More scalable and low-latency technologies like 5G enable new use cases. Blockchain and NFTs (Non-Fungible Tokens) are changing the monetization and collection market forever.

NFTs for identity, collections, ticketing, swaggering, and tangible things

Let’s forget that Justin Bieber recently purchased a Bored Ape Yacht Club (BAYC) NFT for $1.29 million. That’s insane and likely a bubble. However, many use cases make a lot of sense for NFTs, not just in a (future) virtual metaverse but also in the real world. Let’s look at a few examples:

  • Sports and other professional events: Ticketing for controlled pricing, avoiding fraud, etc.
  • Luxury goods: Transparency, traceability, and security against tampering; the handbag, the watch, the necklace: for such luxury goods, NFTs can be used as certificates of authenticity and ownership.
  • Arts: NFTs can be created in such a way that royalty and license fees are donated every time they are resold
  • Carmakers: The first manufacturers link new cars with NFTs. Initially for marketing, but also to increase their resale value in the medium term. With the help of the blockchain, I can prove the number of previous owners of a car and provide reliable information about the repair and accident history.
  • Tourism: Proof of attendance. Climbed Mount Everest? Walked the Grand Canyon? Visited Hawaii for your wedding? With a badge on the blockchain, this can be proven once and for all.
  • Non-Profit: The technology is ideal for fundraising in charities – not only because it is transparent and decentralized. Every NFT can be donated or auctioned off for a good cause so that this new way of creating value generates new funds for the benefit of charitable projects. Vodafone has demonstrated this by selling the world’s first SMS.
  • Swaggering: Twitter already allows you to connect your account to your crypto wallet to set up an NFT as your profile picture. Steph Curry of the NBA’s Golden State Warriors presents his 55 ETH (~USD 180,000) Bored Ape Yacht Club NFT as his Twitter profile picture at the time of writing this blog.

With this in mind, I can think of plenty of other significant use cases for NFTs.

Metaverse for new customer experiences

If I think about the global metaverse (not just the Zuckerberg one), I see so many use cases that even I could imagine using:

  • How many (real) dollars would you pay for a (virtual) house next to your favorite actor or musician in a virtual world, where you can speak with their digital twin (trained by the actual human) every day?
  • How many (real) dollars would you pay if this actor visits his house once a week so that virtual neighbors or lottery winners can talk to the real person via virtual reality?
  • How many (real) dollars would you pay if you could bring your Fortnite shotgun (from a game from another gaming vendor) to this meeting and use it in the clay pigeon shooting competition with the neighbor?
  • Give me a day, and I will add 1000 other items to this wish list…

I think you get the point. NFTs and metaverse make sense for many use cases. This statement is valid from the perspective of a great customer experience and to build innovative business models (with ridiculous profit margins).

So, finally, we come to the point of talking about the relation to event streaming and Apache Kafka.

Kafka inside a crypto platform

First, let’s understand how to qualify whether you need a truly distributed, decentralized ledger or blockchain. Kafka is sufficient most of the time.

Kafka is NOT a blockchain, but a distributed ledger for crypto

Kafka is not a blockchain, but a distributed commit log. Many concepts and foundations of Kafka are very similar to a blockchain. It provides many characteristics required for real-world “enterprise blockchain” projects:

  • Real-Time
  • High Throughput
  • Distributed database
  • Distributed log of records
  • Immutable log
  • Replication
  • High availability
  • Decoupling of applications/clients
  • Role-based access control to data

I explored this in more detail in my post “Apache Kafka and Blockchain – Comparison and a Kafka-native Implementation“.

Do you need a blockchain? Or just Kafka and crypto integration?

A blockchain increases the complexity significantly compared to traditional IT projects. Do you need a blockchain or distributed ledger (DLT) at all? Qualify out early and choose the right tool for the job!

Use Kafka for

  • Enterprise infrastructure
  • Real-time data hub for transactional and analytical workloads
  • Open, scalable, real-time data integration and processing
  • True decoupling between applications and databases with backpressure handling and data replayability
  • Flexible architectures for many use cases
  • Encrypted payloads

Use a real blockchain / DLTs like Hyperledger, Ethereum, Cardano, Solana, et al. for

  • Deployment over various independent organizations (where participants verify the distributed ledger contents themselves)
  • Specific use cases
  • Server-side managed and controlled by multiple organizations
  • Scenarios where the business value outweighs the added complexity and project risk

Use Kafka and Blockchain together to combine the benefits of both for

  • Blockchain for secure communication over various independent organizations
  • Reliable data processing at scale in real-time with Kafka as a side chain or off-chain from a blockchain
  • Integration between blockchain / DLT technologies and the rest of the enterprise, including CRM, big data analytics, and any other custom business applications

The last section shows that Kafka and blockchain (respectively crypto) technologies are complementary. For instance, many enterprises use Kafka as the data hub between crypto APIs and enterprise software.

It is pretty straightforward to build a metaverse without a blockchain (if you don’t want or need to offer true decentralization). Look at this augmented reality demo powered by Apache Kafka to understand how the metaverse is built with modern technologies.

Kafka as a component of blockchain vs. Kafka as middleware for blockchain integration

Some powerful DLTs or blockchains are built on top of Kafka. See the example of R3’s Corda in the next section.

In some other use cases, Kafka is used to implement a side chain or off-chain platform, as the original blockchain does not scale well enough (the data on the blockchain itself is known as on-chain data). Bitcoin is not the only blockchain with the problem of processing just single-digit (!) transactions per second. Most modern blockchain solutions cannot scale even close to the workloads Kafka processes in real-time.

Having said this, I see more and more companies using Kafka within their crypto trading platforms, market exchanges, and token trading marketplaces to integrate between the crypto and the traditional IT world.

Here are both options:

Apache Kafka and Blockchain - DLT - Use Cases and Architectures

R3 Corda – A distributed ledger for banking and the finance industry powered by Kafka

R3’s Corda is a scalable, permissioned peer-to-peer (P2P) distributed ledger technology (DLT) platform. It enables the building of applications that foster and deliver digital trust between parties in regulated markets.

Corda is designed for the banking and financial industry. The primary focus is on financial services transactions. The architectural designs are simple when compared to true blockchains. Evaluate requirements such as time to market, flexibility, and use case (in)dependence to decide if Corda is sufficient or not.

Corda’s architectural history looks like many enterprise architectures: A messaging system (in this case, RabbitMQ) was introduced years ago to provide a real-time infrastructure. Unfortunately, the messaging solution does not scale as needed. It does not provide all essential features like data integration, data processing, or storage for true decoupling, backpressure handling, or replayability of events.

Therefore, Corda 5 replaces RabbitMQ and migrates to Kafka.

R3 Corda DLT Blockchain powered by Apache Kafka

Here are a few reasons for the need to migrate R3’s Corda from RabbitMQ to Kafka:

  • High availability for critical services
  • A cost-effective way to scale (horizontally) to deal with bursty and high-volume throughputs
  • Fully redundant, worker-based architecture
  • True decoupling and backpressure handling to facilitate communication between the node services, including the process engine, database integration, crypto integration, RPC service (HTTP), monitoring, and others
  • Compacted topics (logs) as the mechanism to store and retrieve the most recent states (see the sketch below)
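The last point deserves a short illustration for readers new to Kafka: log compaction makes a topic behave like a continuously updated table of the latest state per key. Here is a minimal sketch of creating such a compacted topic with the Java AdminClient – the topic name, partition count, and replication factor are just assumptions for illustration:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;

public class CompactedStateTopic {
    public static void main(String[] args) throws Exception {
        try (AdminClient admin = AdminClient.create(
                Map.of("bootstrap.servers", "localhost:9092"))) {
            // cleanup.policy=compact keeps only the most recent record per key,
            // so consumers reading from the beginning rebuild the latest state
            NewTopic stateTopic = new NewTopic("node-states", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(stateTopic)).all().get();
        }
    }
}
```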

Kafka in the crypto enterprise architecture

While Kafka is used within some DLT or blockchain implementations, the more prevalent use cases leverage Kafka as the scalable real-time data hub between cryptocurrencies or blockchains and enterprise applications. Let’s explore a few use cases and real-world examples for that.

Example crypto architecture: Kafka as data hub in the metaverse

My recent post about live commerce powered by event streaming and Kafka transforming the retail metaverse shows how the retail and gaming industry connects virtual and physical things. The retail business process and customer communication happen in real-time, no matter if you want to sell clothes, a smartphone, or a blockchain-based NFT token for your collectible or video game.

The following architecture shows what an NFT sales play could look like by integrating and orchestrating the information flow between various crypto and non-crypto applications in real-time at any scale:

Event Streaming with Apache Kafka as Data Hub for Crypto Blockchain and Metaverse

Kafka’s role as the data hub in crypto trading, marketplaces, and the metaverse

Let’s now explore the combination of Kafka and blockchains, respectively cryptocurrencies and decentralized finance (DeFi).

Once again, Kafka is neither the blockchain nor the cryptocurrency. The blockchain powers a cryptocurrency like Bitcoin or provides a smart contract platform like Ethereum, where people build new decentralized applications (dApps) like NFTs for the gaming or art industry. Kafka is the data hub in between that connects these blockchains with oracles (= the non-blockchain apps = traditional IT infrastructure) like the CRM system, data lake, data warehouse, business applications, and so on.

Let’s look at an example and explore a few technical use cases where Kafka helps:

Kafka for data processing in the crypto world

A Bitcoin transaction is executed from the mobile wallet. A real-time application monitors the data off-chain, correlates it, shows it in a dashboard, and sends push notifications. Another completely independent department replays historical events from the Kafka log in a batch process for a compliance check with dedicated analytics tools.
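To make the monitoring part concrete, here is a minimal Kafka Streams sketch. The topic names and the simplistic value format (the transaction amount as a plain string) are assumptions for illustration, not a real platform’s data model:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class LargeTransactionAlerting {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "btc-tx-alerting");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // key = wallet id, value = transaction amount in BTC as a plain string
        KStream<String, String> transactions = builder.stream("btc-transactions");
        transactions
                .filter((walletId, amount) -> Double.parseDouble(amount) > 10.0)
                .to("btc-push-notifications"); // consumed by the notification service

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The compliance department from the second scenario would simply attach a new consumer group with `auto.offset.reset=earliest` to replay the full history from the Kafka log.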

The Kafka ecosystem provides so many capabilities to use the data from blockchains and the crypto world with other data from traditional IT.

Holistic view in a data mesh across typical enterprise IT and blockchains

  • Measuring the health of blockchain infrastructure, cryptocurrencies, and dApps to avoid downtime, secure the infrastructure, and make the blockchain data accessible.
  • Kafka provides an agentless and scalable way to present that data to the parties involved and ensure that the relevant information is exposed to the right teams before a node is lost. This is relevant for innovative Web3 IoT projects like Helium or simpler closed distributed ledgers (DLT) like R3 Corda.
  • Stream processing via Kafka Streams or ksqlDB to interpret the data to get meaningful information
  • Processors that focus on helpful block metrics – with information related to average gas price (gas refers to the cost necessary to perform a transaction on the network), the number of successful or failed transactions, and transaction fees
  • Monitoring blockchain infrastructure and telemetry log events in real-time
  • Regulatory monitoring and compliance
  • Real-time cybersecurity (fraud, situational awareness, threat intelligence)

Continuous data integration at any scale in real-time

  • Integration + true decoupling + backpressure handling + replayability of events
  • Kafka as Oracle integration point (e.g. Chainlink -> Kafka -> Rest of the IT infrastructure)
  • Kafka Connect connectors that incorporate blockchain client APIs like OpenEthereum (leveraging the same concept/architecture for all clients and blockchain protocols)
  • Backpressure handling via throttling and nonce management over the Kafka backbone to stream transactions into the chain
  • Processing multiple chains at the same time (e.g., monitoring and correlating transactions on Ethereum, Solana, and Cardano blockchains in parallel)

Continuous stateless or stateful data processing

  • Stream processing with Kafka Streams or ksqlDB enables real-time data processing in DeFi / trading / NFT / marketplaces
  • Most blockchain and crypto use cases require more than just data ingestion into a database or data lake – continuous stream processing adds enormous value to many problems
  • Aggregate chain data (like Bitcoin or Ethereum), for instance, smart contract states or price feeds like the price of cryptos against USD
  • Specialized ‘processors’ that take advantage of Kafka Streams’ utilities to perform aggregations, reductions, filtering, and other practical stateless or stateful operations (see the sketch after this list)
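To illustrate the last two points, here is a minimal Kafka Streams sketch that computes a one-minute average price per trading pair. Topic names and serde choices are assumptions; a real implementation would use proper (e.g., Avro) value types instead of the string-encoded accumulator used here to keep the sketch serde-free:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;

public class PriceFeedAggregation {
    public static void build(StreamsBuilder builder) {
        // key = trading pair (e.g., "ETH-USD"), value = price tick
        KStream<String, Double> prices = builder.stream("price-ticks",
                Consumed.with(Serdes.String(), Serdes.Double()));

        prices.groupByKey()
              .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
              // accumulate "sum:count" as a string for simplicity
              .aggregate(() -> "0.0:0",
                      (pair, price, agg) -> {
                          String[] parts = agg.split(":");
                          double sum = Double.parseDouble(parts[0]) + price;
                          long count = Long.parseLong(parts[1]) + 1;
                          return sum + ":" + count;
                      },
                      Materialized.with(Serdes.String(), Serdes.String()))
              .toStream()
              .map((windowedPair, agg) -> {
                  String[] parts = agg.split(":");
                  double avg = Double.parseDouble(parts[0]) / Long.parseLong(parts[1]);
                  return KeyValue.pair(
                          windowedPair.key() + "@" + windowedPair.window().startTime(), avg);
              })
              .to("price-averages-1m", Produced.with(Serdes.String(), Serdes.Double()));
    }
}
```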

Building new business models and solutions on top of crypto and blockchain infrastructure

  • Custody for crypto investments in a fully integrated, end-to-end solution
  • Deployment and management of smart contracts via a blockchain API and user interface
  • Customer 360 and loyalty platforms, for instance, NFT integration into retail, gaming, social media for new customer experiences by sending a context-specific AirDrop to a wallet of the customer
  • These are just a few examples – the list goes on and on…

The following section shows a few real-world examples. Some are relatively simple monitoring tools. Others are complex and powerful banking platforms.

Real-world examples of Kafka in the crypto and DeFi world

I have already explored how some blockchain and crypto solutions (like R3’s Corda) use event streaming with Kafka under the hood of their platform. In contrast, the following focuses on several public real-world solutions that leverage Kafka as the data hub between blockchains / crypto / NFT markets and new business applications:

  • TokenAnalyst: Visualization of crypto markets
  • EthVM: Blockchain explorer and analytics engine
  • Kaleido: REST API Gateway for blockchain and smart contracts
  • Nash: Cloud-native trading platform for cryptocurrencies
  • Swisscom’s Custodigit: Crypto banking platform
  • Chainlink: Oracle network for connecting smart contracts from blockchains to the real world

TokenAnalyst – Visualization of crypto markets

TokenAnalyst is an analytics tool to visualize and analyze the crypto market. TokenAnalyst is an excellent example that leverages the Kafka stack (Connect, Streams, ksqlDB, Schema Registry) to integrate blockchain data from Bitcoin and Ethereum with their analytics tools.

Kafka Connect helps with integrating databases and data lakes. The integration with Ethereum and other cryptocurrencies is implemented via a combination of the official crypto APIs and the Kafka producer client API.
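A minimal sketch of that producer-side integration pattern could look as follows. The fetch method stands in for a call to an official blockchain client API (e.g., JSON-RPC), and the topic name is made up:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class BlockIngestion {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // fetchLatestBlockJson() is a hypothetical placeholder for a call
            // to an official blockchain client API (e.g., an Ethereum client)
            String blockJson = fetchLatestBlockJson();
            producer.send(new ProducerRecord<>("ethereum-blocks", blockJson));
        }
    }

    private static String fetchLatestBlockJson() {
        return "{\"number\": 14000000, \"hash\": \"0x...\"}"; // placeholder data
    }
}
```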

Kafka Streams provides a stateful streaming application to prevent invalid blocks in downstream aggregate calculations. For example, TokenAnalyst developed a block confirmer component that resolves reorganization scenarios by temporarily keeping blocks and only propagating them once a threshold of confirmations (children mined on top of that block) is reached.
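TokenAnalyst’s exact implementation is not public, but conceptually, such a block confirmer can be sketched with the Kafka Streams Processor API. In this simplified version (store name, threshold, and key/value types are assumptions), each block is buffered by height and only forwarded once six children have been mined on top of it; a re-mined block at the same height simply overwrites the buffered one:

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

public class BlockConfirmer implements Processor<Long, String, Long, String> {
    private static final long CONFIRMATIONS = 6;
    private KeyValueStore<Long, String> pending;
    private ProcessorContext<Long, String> context;

    @Override
    public void init(ProcessorContext<Long, String> context) {
        this.context = context;
        // the store must be registered with the topology under this name
        this.pending = context.getStateStore("pending-blocks");
    }

    @Override
    public void process(Record<Long, String> block) {
        long height = block.key();
        // a reorganized block at the same height overwrites the old candidate
        pending.put(height, block.value());
        long confirmedHeight = height - CONFIRMATIONS;
        String confirmed = pending.get(confirmedHeight);
        if (confirmed != null) {
            // enough children mined on top: propagate downstream and clean up
            context.forward(new Record<>(confirmedHeight, confirmed, block.timestamp()));
            pending.delete(confirmedHeight);
        }
    }
}
```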

TokenAnalyst - Kafka Connect and Kafka Streams for Blockchain and Ethereum Integration

EthVM – A blockchain explorer and analytics engine

The beauty of public, decentralized blockchains like Bitcoin and Ethereum is transparency. The tamper-proof log enables blockchain explorers to monitor and analyze all transactions.

EthVM is an open-source Ethereum blockchain data processing and analytics engine powered by Apache Kafka. The tool enables blockchain auditing and decision-making. EthVM verifies the execution of transactions and smart contracts, checks balances, and monitors gas prices. The infrastructure is built with Kafka Connect, Kafka Streams, and Schema Registry. A client-side visual block explorer is included, too.

EthVM - A Kafka based Crypto and Blockchain Explorer

Kaleido – A Kafka-native Gateway for crypto and smart contracts

Kaleido provides enterprise-grade blockchain APIs to deploy and manage smart contracts, send Ethereum transactions, and query blockchain data. It hides the blockchain complexities of Ethereum transaction submission, thick Web3 client libraries, nonce management, RLP encoding, transaction signing, and smart contract management.

Kaleido offers REST APIs for on-chain logic and data. It is backed by a fully-managed high throughput Apache Kafka infrastructure.

Kaleido - REST API for Crypto like Ethereum powered by Apache Kafka

One exciting aspect of the above architecture: Kaleido also provides a native, direct Kafka connection from the client side besides the API (= HTTP) gateway. This is a clear trend I have discussed before.

Nash – Cloud-native trading platform for cryptocurrencies

Nash is an excellent example of a modern trading platform for cryptocurrencies using blockchain under the hood. The heart of Nash’s platform leverages Apache Kafka. The following quote from their community page says:

“Nash is using Confluent Cloud, google cloud platform to deliver and manage its services. Kubernetes and apache Kafka technologies will help it scale faster, maintain top-notch records, give real-time services which are even hard to imagine today.”

Nash - Finserv Cryptocurrency Blockchain exchange and wallet leveraging Apache Kafka and Confluent Cloud

Nash provides the speed and convenience of traditional exchanges and the security of non-custodial approaches. Customers can invest in, make payments, and trade Bitcoin, Ethereum, NEO, and other digital assets. The exchange is the first of its kind, offering non-custodial cross-chain trading with the full power of a real order book. The distributed, immutable commit log of Kafka enables deterministic replayability in its exact order.
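Replayability is a built-in Kafka primitive rather than Nash-specific functionality. Here is a minimal sketch of a consumer that re-reads the full history of a hypothetical order event topic in its exact order (topic and group names are made up):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrderBookReplay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-book-replay");
        // a fresh consumer group with 'earliest' starts from the first retained event
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("order-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("offset=%d key=%s value=%s%n",
                        r.offset(), r.key(), r.value()));
            }
        }
    }
}
```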

Swisscom’s Custodigit – A crypto banking platform powered by Kafka Streams

Custodigit is a modern banking platform for digital assets and cryptocurrencies. It provides crucial features and guarantees for serious, regulated crypto investments:

  • Secure storage of wallets
  • Sending and receiving on the blockchain
  • Trading via brokers and exchanges
  • Regulated environment (a key aspect and no surprise, as this product comes from Switzerland – a very regulated market)

Kafka is the central nervous system of Custodigit’s microservice architecture and stateful Kafka Streams applications. Use cases include workflow orchestration with the “distributed saga” design pattern for the choreography between microservices. Kafka Streams was selected because of:

  • lean, decoupled microservices
  • metadata management in Kafka
  • unified data structure across microservices
  • transaction API (aka exactly-once semantics; see the configuration sketch after this list)
  • scalability and reliability
  • real-time processing at scale
  • a higher-level domain-specific language for stream processing
  • long-running stateful processes
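To illustrate the transaction API item from this list: in Kafka Streams, exactly-once semantics is enabled with a single configuration switch. A minimal sketch with placeholder values:

```java
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class OrchestratorConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "custody-saga-orchestrator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // enables the transaction API: each input event is processed, its state
        // updated, and its output written atomically (exactly-once semantics)
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        return props;
    }
}
```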

Architecture diagrams are only available in German, unfortunately. But I think you get the points:

  • Custodigit microservice architecture – some microservices integrate with brokers and stock markets, others with blockchain and crypto:

Custodigit microservice architecture

  • Custodigit Saga pattern for stateful orchestration – stateless business logic is truly decoupled, while the saga orchestrator keeps the state for choreography between the other services:

Custodigit Saga pattern for stateful orchestration

Chainlink – Oracle network for connecting smart contracts to the real world

Chainlink is the industry standard oracle network for connecting smart contracts to the real world. “With Chainlink, developers can build hybrid smart contracts that combine on-chain code with an extensive collection of secure off-chain services powered by Decentralized Oracle Networks. Managed by a global, decentralized community of hundreds of thousands of people, Chainlink introduces a fairer contract model. Its network currently secures billions of dollars in value for smart contracts across decentralized finance (DeFi), insurance, and gaming ecosystems, among others. The full vision of the Chainlink Network can be found in the Chainlink 2.0 white paper.”

Unfortunately, I could not find any public blog post or conference talks about Chainlink’s architecture. Hence, I can only let Chainlink’s job offering speak about their impressive Kafka usage for real-time observability at scale in a critical, transactional financial environment.

Chainlink is transitioning from traditional time series-based monitoring toward an event-driven architecture and alerting approach.

Chainlink Job Role - Blockchain Oracle Integration powered by Apache Kafka

This job offer sounds very interesting, doesn’t it? And it is a colossal task to solve cybersecurity challenges in this industry. If you look for a blockchain-based Kafka role, this might be for you.

Serverless Kafka enables focusing on the business logic in your crypto data hub infrastructure!

This article explored practical use cases for crypto marketplaces and the coming metaverse. Many enterprise architectures already leverage Apache Kafka and its ecosystem to build a scalable real-time data hub for crypto and non-crypto technologies.

This combination is the foundation for a metaverse ecosystem and innovative new applications, customer experiences, and business models. Don’t fear the metaverse. This discussion is not just about Meta (former Facebook), but about interoperability between many ecosystems to provide fantastic new user experiences (of course, with its drawbacks and risks, too).

A clear trend across all these fancy topics and buzzwords is the usage of serverless cloud offerings. This way, project teams can spend their time on the business logic instead of operating the infrastructure. Check out my articles about “serverless Kafka and its relation to cloud-native data lakes and lake houses” and my “comparison of Kafka offerings on the market” to learn more.

How do you use Apache Kafka with cryptocurrencies, blockchain, or DeFi applications? Do you deploy in the public cloud and leverage a serverless Kafka SaaS offering? What other technologies do you combine with Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka as Data Hub for Crypto, DeFi, NFT, Metaverse – Beyond the Buzz appeared first on Kai Waehner.

]]>
IoT Analytics with Kafka for Real Estate and Smart Building https://www.kai-waehner.de/blog/2021/11/25/iot-analytics-apache-kafka-smart-building-real-estate-smart-city/ Thu, 25 Nov 2021 13:59:25 +0000 https://www.kai-waehner.de/?p=3973 This blog post explores how event streaming with Apache Kafka enables IoT analytics for cost savings, better consumer experience, and reduced risk in real estate and smart buildings. Examples include improved real estate maintenance and operations, smarter energy consumptions, optimized space usage, better employee experience, and better defense against cyber attacks.

The post IoT Analytics with Kafka for Real Estate and Smart Building appeared first on Kai Waehner.

]]>
Smart building and real estate generate enormous opportunities for governments and private enterprises in the smart city sector. This blog post explores how event streaming with Apache Kafka enables IoT analytics for cost savings, better consumer experience, and reduced risk. Examples include improved real estate maintenance and operations, smarter energy consumptions, optimized space usage, better employee experience, and better defense against cyber attacks.

Apache Kafka Smart Building Real Estate Smart City Energy Consumption IoT Analytics

This post results from many customer conversations in this space, inspired by the article “5 Examples of IoT and Analytics at Work in Real Estate” from IT Business Edge.

Data in Motion for Smart City and Real Estate

A smart city is an urban area that uses different electronic Internet of Things (IoT) sensors to collect data and then uses the insights gained from that data to efficiently manage assets, resources, and services. Apache Kafka fits into the smart city architecture as the backbone for real-time streaming data integration and processing. Kafka is the de facto standard for Event Streaming.

The Government-owned Event Streaming platform from the Ohio Department of Transportation (ODOT) is a great example. Many smart city architectures are hybrid and require the combination of various technologies and communication paradigms like data streaming, fire-and-forget, and request-response. For instance, Kafka and MQTT enable the last-mile integration and data correlation of IoT data in real-time at scale.

Event Streaming is possible everywhere, in the traditional data center, the public cloud, or at the edge (outside a data center):

Smart City with Smart Buildings and Real Estate leveraging Apache Kafka

IoT Analytics Use Cases for Event Streaming with Smart Building and Real Estate

Real estate and buildings are crucial components of a smart city. This post explores various use cases for IoT analytics with event streaming to improve the citizen experience and reduce maintenance and operations costs using smart buildings.

The following sections explore these use cases:

  • Optimized Space Usage within a Smart Building
  • Predictive Analytics and Preventative Maintenance
  • Smart Home Energy Consumption
  • Real Estate Maintenance and Operations
  • Employee Experience in a Smart Building
  • Cybersecurity for Situational Awareness and Threat Intelligence

Optimized Space Usage within a Smart Building

Optimized space usage within buildings is crucial from an economic perspective. It enables sizing space according to rental demand and reducing building maintenance costs.

A few examples of data processing related to space optimization:

  • Count people entering and leaving the premises with real-time alerting
  • Track the walking behavior of visitors with continuous real-time aggregation of various data sources
  • Adjusted space usage during an event to optimize the customer experience; e.g., rearranging the chairs, tables, signs, etc. in a conference ballroom during the conference (as the next conference will have different people, requirements, and challenges)
  • Plan future building, room, or location constructions with batch analytics on historical information

Optimized Space Usage within Smart Buildings

Predictive Analytics and Preventative Maintenance

Predictive analytics and preventative maintenance require real-time data processing. The monitoring of critical building assets and equipment such as air conditioning, elevators, and lighting prevents breakdowns and improves efficiency:

Predictive Analytics and Preventative Maintenance in a Smart Building using Kafka Streams and ksqlDB

Continuous data processing is possible either in a stateless or stateful way. Here are two examples:

  • Stateful preventive maintenance: Continuous tilt and shock detection calculating an average value
  • Stateless condition monitoring: Temperature and humidity spikes, filtering above-threshold events (both variants are sketched below)
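Both variants map to a few lines of Kafka Streams code. Here is a minimal sketch – topic names and thresholds are made up, and for simplicity, the stateful part tracks the peak shock value per window instead of a true running average:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;

public class BuildingMonitoring {
    public static void build(StreamsBuilder builder) {
        // stateless condition monitoring: forward only above-threshold readings
        KStream<String, Double> temperature = builder.stream("room-temperature",
                Consumed.with(Serdes.String(), Serdes.Double()));
        temperature.filter((roomId, celsius) -> celsius > 30.0)
                   .to("temperature-alerts", Produced.with(Serdes.String(), Serdes.Double()));

        // stateful preventive maintenance: peak shock value per elevator per 5-minute window
        KStream<String, Double> shocks = builder.stream("elevator-shock",
                Consumed.with(Serdes.String(), Serdes.Double()));
        shocks.groupByKey()
              .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
              .reduce(Double::max)
              .toStream()
              .map((windowedId, peak) -> KeyValue.pair(windowedId.key(), peak))
              .to("elevator-shock-peaks", Produced.with(Serdes.String(), Serdes.Double()));
    }
}
```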

My blog post about “Streaming Analytics for Condition Monitoring and Predictive Maintenance with Event Streaming and Apache Kafka” goes into more detail. A Kafka-native Digital Twin plays a key role in some IoT projects, too.

Smart Home Energy Consumption

The energy industry is undergoing significant change. The increased use of digital tools supports the expected structural changes in the energy system to become green and less wasteful.

Smart energy consumption is a powerful and reasonable approach to reduce waste and save costs. Monitoring energy consumption in real-time enables the improvement of current business usage patterns:

Smart Home Energy Consumption with Kafka and Event Streaming

A few examples that require real-time data integration and data processing for sensor analytics:

  • Analyze malfunctioning equipment for its excessive energy use
  • Turn lights on and off automatically in a context-driven way instead of a time-based configuration
  • Monitor air conditioning for overloads

Real Estate Maintenance and Operations

The maintenance and operations of buildings and real estate require on-site and remote work. Hence, the public administration can perform administrative tasks and data analytics in a remote data center or cloud that aggregates information across locations. On the other hand, some use cases require edge computing for real-time monitoring and analytics:

Real Estate Maintenance and Operations

It always depends on the point of view. A manager for a smart building might work on-site, while a regional manager monitors all facilities in a region. A global manager oversees many regional managers. Technology needs to support the needs of all stakeholders. All of them can do a better job with real-time information and real-time applications.

Employee Experience in a Smart Building

Satisfied employees are crucial for a decent smart city and real estate strategy. Real-time applications can help here, too:

Employee Experience

A few examples to improve the experience of the employees:

  • Ambiance: Adjust noise and light level to reduce distractions
  • Air quality: Control air to enhance morale and productivity
  • Feedback Device: Improve layout, equipment, and office supplies

Cybersecurity for Situational Awareness and Threat Intelligence

Continuous data correlation has become essential to defend against cyber attacks. Monitoring, alerting, and proactive actions are only possible if data integration and data correlation happen reliably in real-time at scale:

Cybersecurity for Situational Awareness and Threat Intelligence in Smart Buildings and Smart City

Plenty of use cases require event streaming as the scalable real-time backbone for cybersecurity. Kafka’s cybersecurity examples include situational awareness, threat intelligence, forensics, air-gapped and zero trust environments, and SIEM / SOAR modernization.

Smart City, Real Estate, and Smart Building require Real-Time IoT Analytics

Plenty of use cases exist to add business value to real estate and smart buildings. Data-driven correlation and analytics with data from any IoT interface in real-time are a game-changer to improve the consumer experience, save costs, and reduce risks.

Apache Kafka is the de-facto standard for event streaming. No matter if you are on-premise, in the public cloud, at the edge, or in a hybrid scenario, evaluate and compare the available Kafka offerings on the market to start your project the right way.

How do you optimize data usage in real estate and smart buildings? What technologies and architectures do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post IoT Analytics with Kafka for Real Estate and Smart Building appeared first on Kai Waehner.

]]>
Streaming Data Exchange with Kafka and a Data Mesh in Motion https://www.kai-waehner.de/blog/2021/11/14/streaming-data-exchange-data-mesh-apache-kafka-in-motion/ Sun, 14 Nov 2021 13:25:45 +0000 https://www.kai-waehner.de/?p=3412 Data Mesh is a new architecture paradigm that gets a lot of buzz these days. This blog post looks deeper into this principle to explore why no single technology is the perfect fit to build a Data Mesh. Examples show why an open and scalable decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms to solve business problems.

The post Streaming Data Exchange with Kafka and a Data Mesh in Motion appeared first on Kai Waehner.

]]>
Data Mesh is a new architecture paradigm that gets a lot of buzz these days. Every data and platform vendor describes how to build the best Data Mesh with their platform. The Data Mesh story includes cloud providers like AWS, data analytics vendors like Databricks and Snowflake, and Event Streaming solutions like Confluent. This blog post looks deeper into this principle to explore why no single technology is the perfect fit to build a Data Mesh. Examples show why an open and scalable decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms, to solve business problems.

Streaming Data Exchange with Apache Kafka and Data Mesh in Motion

Data at Rest vs. Data in Motion

Before we get into the Data Mesh discussion, it is crucial to clarify the difference and relevance of Data at Rest and Data in Motion:

  • Data at Rest: Data is ingested and stored in a storage system (database, data warehouse, data lake). Business logic and queries execute against the storage. Everyday use cases include reporting with business intelligence tools, model training in machine learning, and complex batch analytics like map/reduce and shuffling. As the data is at rest, the processing is too late for real-time use cases.
  • Data in Motion: Data is processed and correlated continuously while new events are fed into the platform. Business logic and queries execute in real-time. Common real-time use cases include inventory management, order processing, fraud detection, predictive maintenance, and many other use cases.

Real-time Data Beats Slow Data

Real-time beats slow data in almost all use cases across industries. Hence, ask yourself or your business team how they want or need to consume and process the data in the next project. Data at Rest and Data in Motion have trade-offs. Therefore, both concepts are complementary. For this reason, modern cloud infrastructures leverage both in their architecture. Serverless Event Streaming with Kafka combined with the AWS Lakehouse is a great resource to learn more.

However, while connecting a batch system to a real-time nervous system is possible, the other way round – connecting a real-time consumer to batch storage – is not possible. The Kappa vs. Lambda Architecture discussion gives more insights into this.

Kafka is a database. So, it is also possible to use it for data at rest. For instance, the replayability of historical events in guaranteed ordering is essential and helpful for many use cases. However, long-term storage in Kafka has several limitations, like limited query capabilities. Hence, for many use cases, event streaming and other storage systems are complementary, not competitive.

Data Mesh – An Architecture Paradigm

Data mesh is an implementation pattern (not unlike microservices or domain-driven design) but applied to data. Thoughtworks coined the term. You will find tons of resources on the web. Zhamak Dehghani gave a great introduction about “How to build the Data Mesh Foundation and its Relation to Event Streaming” at the Kafka Summit Europe 2021.

Domain-driven + Microservices + Event Streaming

Data Mesh is not an entirely new paradigm. It has several historical influences:

Data Mesh Architecture with Microservices Domain-driven Design Data Marts and Event Streaming

The architectural paradigm unlocks analytical data at scale, rapidly enabling access to an ever-growing number of distributed domain data sets for a proliferation of consumption scenarios such as machine learning, analytics, or data-intensive applications across the organization. A data mesh addresses the common failure modes of the traditional centralized data lake or data platform architecture.

Data Mesh is a Logical View, not Physical!

Data mesh shifts to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create a self-serve data infrastructure, treating data as a product, and implementing open standardization to enable an ecosystem of interoperable distributed data products.

Here is an example of a Data Mesh:

Data Mesh with Apache Kafka

TL;DR: Data Mesh combines existing paradigms, including Domain-driven Design, Data Marts, Microservices, and Event Streaming.

Data as the Product

However, the differentiating aspect focuses on product thinking (“Microservice for Data”) with data as a first-class product. Data products are a perfect fit for Event Streaming with Data in Motion to build innovative new real-time use cases.

Data Product - The Domain Driven Microservice for Data

A Data Mesh with Event Streaming

Why is Event Streaming a good fit for data mesh?

Streams are real-time, so you can propagate data throughout the mesh immediately, as soon as new information is available. Streams are also persisted and replayable, so they let you capture both real-time AND historical data with one infrastructure. And because they are immutable, they make for a great source of record, which is helpful for governance.

Data in Motion is crucial for most innovative use cases. And as discussed before, real-time data beats slow data in almost all scenarios. Hence, it makes sense that the heart of a Data Mesh architecture is an Event Streaming platform. It provides true decoupling, scalable real-time data processing, and highly reliable operations across the edge, data center, and multi-cloud.

Kafka Streaming API – The De Facto Standard for Data in Motion

The Kafka API is the de facto standard for Event Streaming. I won’t explore this discussion again and again here. Let’s move straight to the “Kafka + Data Mesh” content…

A Kafka-powered Data Mesh

I highly recommend watching Ben Stopford’s and Michael Noll’s talk about “Apache Kafka and the Data Mesh“. Several of the screenshots in this post are from that presentation, too. Kudos to my two colleagues! The talk explores the key concepts of a Data Mesh and how they are related to Event Streaming:

  • Domain-driven Decentralization
  • Data as a Self-serve Product
  • First-class Data Platform
  • Federated Governance

Let’s now explore how Event Streaming with Kafka fits into the Data Mesh architecture and how other solutions like a database or data lake complement it.

Data Exchange for Input and Output within a Data Mesh using Kafka

Data product, a “microservice for the data world”:

  • A node on the data mesh, situated within a domain.
  • Produces and possibly consumes high-quality data within the mesh.
  • Encapsulates all the elements required for its function, namely data plus code plus infrastructure.

A Data Mesh is not just one Technology!

The heart of a Data Mesh infrastructure must be real-time, decoupled, reliable, and scalable. Kafka is a modern cloud-native enterprise integration platform (also often called iPaaS today). Therefore, Kafka provides all the capabilities for the foundation of a Data Mesh.

However, not all components can or should be Kafka-based. Choose the right tool for a problem. Let’s explore in the following subsections how Kafka-native technologies and other solutions are used in a Data Mesh together.

Stream Processing within the Data Product with Kafka Streams and ksqlDB

An event-based data product aggregates and correlates information from one or more data sources in real-time. Stateless and stateful stream processing is implemented with Kafka-native tools such as Kafka Streams or ksqlDB:

Event Streaming within the Data Product with Stream Processing Kafka Streams and ksqlDB

Variety of Protocols and Communication Paradigms within the Data Product – HTTP, gRPC, MQTT, and more

Obviously, not every application uses just Event Streaming as a technology and communication paradigm. The above picture shows how one consumer application could also use a request/response technology like HTTP or gRPC to execute a pull query. In contrast, another application continuously consumes the streaming push query with a native Kafka consumer in any programming language, such as Java, Scala, C, C++, Python, Go, etc.

The data product often includes complementary technologies. For instance, if you build a connected car infrastructure, you likely use MQTT for the last-mile integration, ingest the data into Kafka, and do further processing with Event Streaming (a minimal bridge sketch follows below). The “Kafka + MQTT Blog Series” is an excellent example from the IoT space to learn about building a data product with complementary technologies.
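For illustration, here is a minimal last-mile bridge sketch combining the Eclipse Paho MQTT client with the Kafka producer. Broker addresses, topics, and the telemetry handling are assumptions – in production, a Kafka Connect MQTT connector or an MQTT proxy is usually the better choice, as the blog series discusses:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.eclipse.paho.client.mqttv3.MqttClient;

import java.util.Properties;

public class MqttToKafkaBridge {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        MqttClient mqtt = new MqttClient("tcp://localhost:1883", "car-telemetry-bridge");
        mqtt.connect();
        // forward every last-mile telemetry message into the event streaming platform,
        // keeping the MQTT topic as the Kafka record key for later correlation
        mqtt.subscribe("car/+/telemetry", (topic, message) ->
                producer.send(new ProducerRecord<>("car-telemetry", topic,
                        new String(message.getPayload()))));
    }
}
```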

Variety of Solutions within the Data Product – Event Streaming, Data Warehouse, Data Lake, and more

The beauty of microservice architectures is that every application can choose the right technologies. An application might or might not include databases, analytics tools, or other complementary components. The input and output data ports of the data product should be independent of the chosen solutions:

Data Stores within the Data Product with Snowflake MongoDB Oracle et al

Kafka Connect is the right Kafka-native technology to connect other technologies and communication paradigms with the Event Streaming platform. Evaluate if you need another integration middleware (like an ETL or ESB) or if the Kafka infrastructure is the better enterprise integration platform (iPaaS) for your data product within the data mesh.

A Global Streaming Data Exchange

The Data Mesh concept is relevant for global deployments, not just within a single project or region. Multiple Kafka clusters are the norm, not an exception. I wrote about customers using Event Streaming with Kafka in global architectures a long time ago.

Various architectures exist to deploy Kafka across data centers and multiple clouds. Some use cases require low latency and deploy some Kafka instances at the edge or in a 5G zone. Other use cases replicate data between regions, countries, or continents across the globe for disaster recovery, aggregation, or analytics use cases.

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

Hybrid Cloud Streaming Data Mesh powered by Apache Kafka and Cluster Linking

This example shows all the characteristics discussed in the above sections for a Data Mesh:

  • Decentralized real-time infrastructure across domains and infrastructures
  • True decoupling between domains within and between the clouds
  • Several communication paradigms, including data streaming, RPC, and batch
  • Data integration with legacy and cloud-native technologies
  • Continuous stream processing where it adds value, and batch processing in some analytics sinks

Example: A Streaming Data Exchange across Domains in the Automotive Industry

The following example from the automotive industry shows how independent stakeholders (= domains in different enterprises) use a cross-company streaming data exchange:

Streaming Data Exchange with Data Mesh in Motion using Apache Kafka and Cluster Linking

Innovation does not stop at the own border. Streaming replication is relevant for all use cases where real-time is better than slow data (valid for most scenarios). A few examples:

  • End-to-end supply chain optimization from suppliers to the OEM to the intermediary to the aftersales
  • Track and trace across countries
  • Integration of 3rd party add-on services to the own digital product
  • Open APIs for embedding and combining external services to build a new product

I could go on and on with the list. Many data products need to be accessible by 3rd party in real-time at scale. Some API gateway or API management tool comes into play in such a situation.

A real-world example of a streaming data exchange powered by Kafka is the mobility service HERE Technologies. They expose the Kafka API to directly consume streaming data from their mapping services (as an alternative option to their HTTP API):

Here Technologies Mobility Service Apache Kafka Open API

However, even if all collaborating partners use Kafka under the hood in their architecture, exposing the Kafka API directly to the outside world does not always make sense. Some technical capabilities (e.g., access control or connectivity to thousands of devices) and missing business functions (e.g., for monetization or reporting) of the Kafka ecosystem bring an API layer on top of the Event Streaming infrastructure into play in many real-world deployments.

Open API for 3rd Party Integration and Streaming API Management

API Gateways and API Management tools exist in many varieties, including open-source frameworks, commercial products, and SaaS cloud offerings. Features include technical routing, access control, monetization, and reporting.

However, most people still implement the Open API concept with RPC in mind. I guess 95+% still use HTTP(S) to make APIs accessible to other stakeholders (e.g., other business units or external parties). RPC makes little sense in a streaming Data Mesh architecture if the data needs to be processed at scale in real-time.

There is still an impedance mismatch between Event Streaming and API Management. But it gets better these days. Specifications like AsyncAPI, calling itself the “industry standard for defining asynchronous APIs”, and similar approaches bring Open API to the data streaming world. My post “Kafka versus API Management with tools like MuleSoft, Kong, or Apigee” is still pretty much accurate if you want to dive deeper into this discussion. IBM API Connect was one of the first vendors that integrated Kafka via Async API.

A great example of the evolution from RPC to streaming APIs is the machine learning space. “Streaming Machine Learning with Kafka-native Model Deployment” explores how model servers such as Seldon enhance their product with a native Kafka API besides HTTP and gRPC request-response communication:

Kafka-native Machine Learning Model Server Seldon

Journey to the Streaming Data Mesh with Kafka

The paradigm shift is enormous. Data Mesh is not a free lunch. The same was and still is true for microservice architectures, domain-driven design, Event Streaming, and other modern design principles.

In analogy to Confluent’s maturity model for Event Streaming, our team has described the journey for deploying a streaming Data Mesh:

Data Mesh Journey with Event Streaming and Apache Kafka

The efforts likely take a few years in most scenarios. The shift is not just about technologies; adjustments to organizations and business processes are just as necessary. I guess most companies are still in a very early stage. Let me know where you are on this journey!

Streaming Data Exchange as Foundation for a Data Mesh

A Data Mesh is an implementation pattern, not a specific technology. However, most modern enterprise architectures require a decentralized streaming data infrastructure to build valuable and innovative data products in independent, truly decoupled domains. Hence, Kafka, being the de facto standard for Event Streaming, comes into play in many Data Mesh architectures.

Many Data Mesh architectures span many domains in various regions or even continents. The deployments run at the edge, on-premise, and multi-cloud. The integration connects to many solutions and technologies with different communication paradigms.

A cloud-native Event Streaming infrastructure with the capability to link clusters with each other out-of-the-box enables building a modern Data Mesh. No Data Mesh will use just one technology or vendor. Learn from the inspiring posts from your favorite data product vendors like AWS, Snowflake, Databricks, Confluent, and many more to define and build your custom Data Mesh successfully. Data Mesh is a journey, not a big bang.

Did you already start building your Data Mesh? How does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Streaming Data Exchange with Kafka and a Data Mesh in Motion appeared first on Kai Waehner.

]]>
Mainframe Integration, Offloading and Replacement with Apache Kafka https://www.kai-waehner.de/blog/2020/04/24/mainframe-offloading-replacement-apache-kafka-connect-ibm-db2-mq-cdc-cobol/ Fri, 24 Apr 2020 14:55:10 +0000 https://www.kai-waehner.de/?p=2232 Time to get more innovative; even with the mainframe! This blog post covers the steps I have seen…

The post Mainframe Integration, Offloading and Replacement with Apache Kafka appeared first on Kai Waehner.

]]>
Time to get more innovative; even with the mainframe! This blog post covers the steps I have seen in projects where enterprises started offloading data from the mainframe to Apache Kafka with the final goal of replacing the old legacy systems.

“Mainframes are still hard at work, processing over 70 percent of the world’s most important computing transactions every day. Organizations like banks, credit card companies, medical facilities, stock brokerages, and others that can absolutely not afford downtime and errors depend on the mainframe to get the job done. Nearly three-quarters of all Fortune 500 companies still turn to the mainframe to get the critical processing work completed” (BMC).

Mainframe Offloading and Replacement with Apache Kafka and CDC

Cost, monolithic architectures, and missing experts are the key challenges for mainframe applications. Mainframes are used in several industries, but I want to start this post by looking at the current situation at banks in Germany – to understand why so many enterprises ask for help with mainframe offloading and replacement…

Finance Industry in 2020 in Germany

Germany is where I live, so I get most of my news from the local newspapers. But the situation is very similar in other countries across the world…

Here is the current situation in Germany.

Challenges for Traditional Banks

  • Traditional banks are in trouble. More and more branches are getting closed every year.
  • Financial numbers are pretty bad. Market capitalization went down significantly (before Corona!). Deutsche Bank and Commerzbank – former flagships of the German finance industry – are in the news every week. Usually for bad news.
  • Commerzbank had to leave the DAX in 2019 (the blue chip stock market index consisting of the 30 major German companies trading on the Frankfurt Stock Exchange); replaced by Wirecard, a modern global internet technology and financial services provider offering its customers electronic payment transaction services and risk management. UPDATE Q3 2020: Wirecard is now insolvent due to the Wirecard scandal. A really horrible scandal for the German financial market and government.
  • Traditional banks talk a lot about creating a “cutting edge” blockchain consortium and distributed ledger to improve their processes in the future and to be “innovative”. For example, some banks plan to improve the ‘Know Your Customer’ (KYC) process with a joint distributed ledger to reduce costs. Unfortunately, this is years away from being implemented; and not clear if it makes sense and adds real business value. Blockchain has still not proven its added value in most scenarios.

Neobanks and FinTechs Emerging

  • Neobanks like Revolut or N26 provide an innovative mobile app and great customer experience, including a great KYC process and experience (without the need for a blockchain). In my personal experience, international bank transactions typically take less than 24 hours. In contrast, the mobile app and website of my private German bank are a real pain in the ass. And it feels like they are getting worse instead of better. To be fair: Neobanks do not shine with customer service if there are problems (like fraud; they sometimes simply freeze your account and do not provide good support).
  • International FinTechs are also arriving in Germany to change the game. PayPal is the norm everywhere already. German initiatives for a “German PayPal alternative” failed. Local retail stores already accept Apple Pay (the local banks are “forced” to support it in the meantime). Even the Chinese service Alipay is supported by more and more shops.

Traditional banks should be concerned. The same is true for other industries, of course. However, as said above, many enterprises still rely heavily on the 50+ year old mainframe while competing with innovative competitors. It feels like the companies relying on mainframes have a harder job changing than, let’s say, airlines or the retail industry.

Digital Transformation with(out) the Mainframe

I had a meeting with an insurance company a few months ago. The team presented an internal paper to me quoting their CEO: “This has to be the last 20M mainframe contract with IBM! In five years, I expect to not renew this.” That’s what I call ‘clear expectations and goals’ for the IT team… 🙂

Why does Everybody Want to Get Rid of the Mainframe?

Many large, established companies, especially those in the financial services and insurance space still rely on mainframes for their most critical applications and data. Along with reliability, mainframes come with high operational costs, since they are traditionally charged by MIPS (million instructions per second). Reducing MIPS results in lowering operational expense, sometimes dramatically.

Many of these same companies are currently undergoing architecture modernization, including cloud migration, moving from monolithic applications to microservices, and embracing open systems.

Modernization with the Mainframe and Modern Technologies

This modernization doesn’t easily embrace the mainframe, but would benefit from being able to access this data. Kafka can be used to keep a more modern data store in real-time sync with the mainframe, while at the same time persisting the event data on the bus to enable microservices, and deliver the data to other systems such as data warehouses and search indexes.

This not only reduces operational expenses but also provides a path for architecture modernization and agility that the mainframe alone cannot offer.
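To make the “keep a modern data store in sync” part tangible, here is a minimal Kafka Connect sketch, assuming the Confluent JDBC sink connector. All topic names, URLs and credentials are placeholders, and property details vary by connector version:

```properties
# Hypothetical sink: keep a modern relational store in sync with
# offloaded mainframe events. All names and credentials are placeholders.
name=modern-datastore-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=2
topics=mainframe.customer-interactions
connection.url=jdbc:postgresql://postgres:5432/customers
connection.user=kafka
connection.password=secret
# Upsert by the Kafka record key so replays and updates stay idempotent;
# pk.fields maps the key to a (hypothetical) column name
insert.mode=upsert
pk.mode=record_key
pk.fields=customer_id
auto.create=true
```

The same pattern works for search indexes or data warehouses; only the sink connector changes.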

As a final step, and the ultimate vision of most enterprises, the mainframe might be replaced by new applications built with modern technologies.

Kafka in Financial Services and Insurance Companies

I recently wrote about how payment applications can be improved leveraging Kafka and Machine Learning for real time scoring at scale. Here are a few concrete examples from banking and insurance companies leveraging Apache Kafka and its ecosystem for innovation, flexibility and reduced costs:

  • Capital One: Becoming truly event driven – offering a service other parts of the bank can use.
  • ING: Significantly improved customer experience – as a differentiator + Fraud detection and cost savings.
  • Freeyou: Real time risk and claim management in their auto insurance applications.
  • Generali: Connecting legacy databases and the modern world with a modern integration architecture based on event streaming.
  • Nordea: Able to meet strict regulatory requirements around real-time reporting + cost savings.
  • PayPal: Processing 400+ billion events per day for user behavioral tracking, merchant monitoring, risk & compliance, fraud detection, and other use cases.
  • Royal Bank of Canada (RBC): Mainframe off-load, better user experience and fraud detection – brought many parts of the bank together.

The last example brings me back to the topic of this blog post: Companies can save millions of dollars “just” by offloading data from their mainframes to Kafka for further consumption and processing. Mainframe replacement might be the long term goal; but just offloading data is a huge $$$ win.

Why Apache Kafka for Mainframe Offloading and Replacement?

So why use Kafka for mainframe offloading and replacement? There are hundreds of middleware tools available on the market for mainframe integration and migration.

Well, in addition to being open and scalable (I won’t cover all the characteristics of Kafka in this post), one key characteristic is that Kafka decouples applications better than any other messaging or integration middleware.
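A minimal sketch of what this decoupling looks like in practice (topic and group names are hypothetical): every downstream application subscribes to the same offloaded mainframe events with its own consumer group, keeps its own position in the log, and can be added or removed without touching the producer side or the mainframe.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class FraudCheckConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Each application uses its own group.id and therefore its own offsets
        props.put("group.id", "fraud-check");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Hypothetical topic fed by the mainframe offloading pipeline
            consumer.subscribe(Collections.singletonList("mainframe.payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("offset=%d key=%s value=%s%n",
                        r.offset(), r.key(), r.value()));
            }
        }
    }
}
```

A second application (say, a reporting service) would run the identical code with a different group.id, completely independent of this one.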

Legacy Migration and Cloud Journey

Legacy migration is a journey. Mainframes cannot be replaced in a single project. A big bang will fail. This has to be planned long-term. The implementation happens step by step.

Almost every company on this planet has a cloud strategy by now. The cloud has many benefits, like scalability, elasticity, innovation, and more. Of course, there are trade-offs: Cloud is not always cheaper. New concepts need to be learned. Security is very different (I am NOT saying worse or better, just different). And hybrid will be the norm for most enterprises. Only enterprises younger than 10 years are cloud-only.

Therefore, I use the term ‘cloud’ in the following. But the same phases would exist if you “just” want to move away from mainframes but stay in your own on-premises data centers.

On a very high level, mainframe migration could contain three phases:

  • Phase 1 – Cloud Adoption: Replicate data to the cloud for further analytics and processing. Mainframe offloading is part of this.
  • Phase 2 – Hybrid Cloud: Bidirectional replication of data in real time between the cloud and application in the data center, including mainframe applications.
  • Phase 3 – Cloud-First Development: All new applications are built in the cloud. Even the core banking system (or at least parts of it) runs in the cloud (or in a modern infrastructure in the data center).

Journey from Legacy to Hybrid and Cloud Infrastructure

How do we get there? As I said, a big bang will fail. Let’s take a look at a very common approach I have seen at various customers…

Status Quo: Mainframe Limitations and $$$

The following is the status quo: Applications consume data from the mainframe, creating expensive MIPS:

Direct Legacy Mainframe Communication to App

The mainframe is running and core business logic is deployed. It works well (24/7, mission-critical). But it is really tough or even impossible to make changes or add new features. Scalability is often already becoming an issue. And well, let’s not talk about cost. Mainframes cost millions.

Mainframe Offloading

Mainframe offloading means data is replicated to Kafka for further analytics and processing by other applications. Data continues to be written to the mainframe via the existing legacy applications:

Mainframe Offloading

Offloading data is the easy part as you do not have to change the code on the mainframe. Therefore, the first project is often just about reading data.

Writing to the mainframe is much harder because the business logic on the mainframe must be understood to make changes. Therefore, many projects avoid this huge (and sometimes unsolvable) challenge by keeping the writes as they are… This is not ideal, of course: You cannot add new features or changes to the existing application.

People are not able, or do not want to take the risk, to change it. There are no DevOps or CI/CD pipelines and no A/B testing infrastructure behind your mainframe, dear university graduates! 🙂

Mainframe Replacement

Everybody wants to replace mainframes due to their technical limitations and cost. But enterprises cannot simply shut down the mainframe because of all the mission-critical business applications.

COBOL, the main programming language for the mainframe, is therefore still widely used in applications for large-scale batch and transaction processing jobs. Read the blog post ‘Brush up your COBOL: Why is a 60 year old language suddenly in demand?‘ for a nice ‘year 2020’ introduction to COBOL.

Enterprises have the following options:

Option 1: Continue to develop actively on the mainframe

Train or hire more (expensive) COBOL developers to extend and maintain your legacy applications. Well, this is exactly what you want to move away from. If you forgot why: Cost, cost, cost. And technical limitations. I find it amazing how mainframes even support “modern technologies” such as Web Services. Mainframes do not stand still. Having said this, even if you use more modern technologies and standards, you are still running on the mainframe with all its drawbacks.

Option 2: Replace the COBOL code with a modern application using a migration and code generation tool

Various vendors provide tools which take COBOL code and automatically migrate it to generated source code in a modern programming language (typically Java). The new source code can be refactored and changed (as Java uses static typing, errors show up in the IDE and changes can be applied to all related dependencies in the code structure). This is the theory. In practice, the more complex your COBOL code is, the harder it gets to migrate it, and the more custom coding is required for the migration. This results in the same problem the COBOL developer had on the mainframe: If you don’t understand the code routines, you cannot change them without risking errors, data loss or downtime in the new version of the application.

Option 3: Develop a new future-ready application with modern technologies to provide an open, flexible and scalable architecture

This is the ideal approach. Keep the old legacy mainframe running as long as needed. A migration is not a big bang. Replace use cases and functionality step by step. The new applications will have very different feature requirements anyway; take a look at the FinTech and InsurTech companies. Starting greenfield has many advantages. The big drawback is high development and migration costs.

In reality, a mix of these three options can also happen. No matter what works for you, let’s now think how Apache Kafka fits into this story.

Domain-Driven Design for your Integration Layer

Kafka is not just a messaging system. It is an event streaming platform that provides storage capabilities. In contrast to MQ systems or Web Service / API based architectures, Kafka really decouples producers and consumers:

Domain-Driven Design for the Core Banking Integration Layer with Apache Kafka

With Kafka and a domain-driven design, every application / service / microservice / database / “you-name-it” is completely independent and loosely coupled from the others, yet scalable, highly available and reliable! The blog post “Apache Kafka, Microservices and Domain-Driven Design (DDD)” goes much deeper into this topic.

Sberbank, the biggest bank in Russia, is a great example of this domain-driven approach: Sberbank built its new core banking system on top of the Apache Kafka ecosystem. This infrastructure is open and scalable, ready for mission-critical payment use cases like instant payment or fraud detection, but also for innovative applications like chat bots in customer service or robo-advisory for trading markets.

Why is this decoupling so important? Because mainframe integration, migration and replacement is a (long!) journey.

Event Streaming Platform and Legacy Middleware

An event streaming platform gives applications independence, no matter whether an application uses a modern technology or legacy, proprietary, monolithic components. Apache Kafka provides the freedom to tap into and manage shared data, no matter if the interface is real time messaging, a synchronous REST / SOAP web service, a file-based system, a SQL or NoSQL database, a data warehouse, a data lake or anything else:

Integration between Kafka and Legacy Middleware

Apache Kafka as Integration Middleware

The Apache Kafka ecosystem is a highly scalable, reliable infrastructure and allows high throughput in real time. Kafka Connect provides integration with any modern or legacy system, be it Mainframe, IBM MQ, Oracle Database, CSV Files, Hadoop, Spark, Flink, TensorFlow, or anything else. More details here:

Kafka Ecosystem for Security and Data Governance

The Apache Kafka ecosystem provides additional capabilities. This is important as Kafka is not just used as a messaging or ingestion layer, but as a platform for your most mission-critical use cases. Remember the examples from the banking and insurance industries at the beginning of this blog post. These are just a few of many more mission-critical Kafka deployments across the world. Check past Kafka Summit video recordings and slides for many more mission-critical use cases and architectures.

I will not go into detail in this blog post, but the Apache Kafka ecosystem provides features for security and data governance like Schema Registry (+ Schema Evolution), Role Based Access Control (RBAC), Encryption (on message or field level), Audit Logs, Data Flow analytics tools, etc…

Mainframe Offloading and Replacement in the Next 5 Years

With all the theory in mind, let’s now take a look at a practical example. The following is a journey many enterprises walk through these days. Some enterprises are just in the beginning of this journey, while others already saved millions by offloading or even replacing the mainframe.

I will walk you through a realistic approach which takes ~5 years in this example (but this can easily take 10+ years depending on the complexity of your deployments and organization). What matters is quick wins and a successful step-by-step approach, no matter how long it takes.

Year 0: Direct Communication between App and Mainframe

Let’s get started. Here is again the current situation: Applications directly communicate with the mainframe.

Direct Legacy Mainframe Communication to App

Let’s reduce cost and dependencies between the mainframe and other applications.

Year 1: Kafka for Decoupling between Mainframe and App

Offloading data from the mainframe enables existing applications to consume the data from Kafka instead of creating high load and cost for direct mainframe access:

Kafka for Decoupling between Mainframe and App

This saves a lot of MIPS (million instructions per second). MIPS is a way to measure the cost of computing on mainframes: Fewer MIPS, less cost.

After offloading the data from the mainframe, new applications can also consume the mainframe data from Kafka without additional MIPS cost. This allows building new applications on the mainframe data. Gone are the days when developers could not build new innovative applications because MIPS cost was the main blocker to accessing the mainframe data.

Kafka can store the data as long as it needs to be stored. This might be just 60 minutes for one Kafka topic for log analytics, or 10 years for customer interactions in another Kafka topic. With Tiered Storage for Kafka, you can even reduce costs and increase elasticity by separating processing from storage with a back-end store like AWS S3. This discussion is out of scope for this article; check out “Is Apache Kafka a Database?” for more details.
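Retention is configured per topic. A small AdminClient sketch (topic names and sizing are hypothetical) shows how one cluster can hold short-lived log data next to long-lived customer data:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;

public class TopicRetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Short-lived log analytics data: keep for 60 minutes
            NewTopic logs = new NewTopic("mainframe.log-analytics", 6, (short) 3)
                    .configs(Collections.singletonMap("retention.ms",
                            String.valueOf(60L * 60 * 1000)));
            // Long-lived customer interactions: keep for ~10 years
            NewTopic customers = new NewTopic("mainframe.customer-interactions", 6, (short) 3)
                    .configs(Collections.singletonMap("retention.ms",
                            String.valueOf(10L * 365 * 24 * 60 * 60 * 1000)));
            admin.createTopics(Arrays.asList(logs, customers)).all().get();
        }
    }
}
```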

Change Data Capture (CDC) for Mainframe Offloading to Kafka

The most common option for mainframe offloading is Change Data Capture (CDC): Transaction log-based CDC pushes data changes (insert, update, delete) from the mainframe to Kafka. The advantages:

  • Real time push updates to Kafka
  • Eliminate disruptive full loads, i.e. minimize production impact
  • Reduce MIPS consumption
  • Integrate with any mainframe technology (DB2 z/OS, VSAM, IMS/DB, CICS, etc.)
  • Fully supported, commercial tooling

On the first connection, the CDC tool reads a consistent snapshot of all of the tables that are whitelisted. When that snapshot is complete, the connector continuously streams the changes that were committed to the DB2 database for all whitelisted tables in capture mode. This generates corresponding insert, update and delete events. All of the events for each table are recorded in a separate Kafka topic, where they can be easily consumed by applications and services.
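As a concrete illustration, here is what such a CDC connector configuration could look like with the open-source Debezium connector for Db2 (one of the integration options discussed below). This is a minimal sketch: hostnames, credentials and table names are placeholders, and exact property names differ per CDC tool and connector version.

```properties
# Hypothetical CDC connector config for offloading Db2 tables to Kafka.
# Property names follow the open-source Debezium Db2 connector.
name=mainframe-db2-cdc
connector.class=io.debezium.connector.db2.Db2Connector
tasks.max=1
database.hostname=zos-db2.example.internal
database.port=50000
database.user=cdcuser
database.password=secret
database.dbname=CORE
# Logical name; becomes the prefix of the per-table Kafka topics
database.server.name=mainframe
# Tables must be put into capture mode on Db2 first; the connector then takes
# a consistent snapshot and streams subsequent inserts, updates and deletes
table.include.list=BANK.ACCOUNTS,BANK.TRANSACTIONS
```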

The big disadvantage of CDC is high licensing costs. Therefore, other integration options exist…

Integration Options for Kafka and Mainframe

While CDC is typically the preferred choice, there are more alternatives. Here are the integration options I have seen being discussed in the field for mainframe offloading:

  1. IBM InfoSphere Data Replication (IIDR) CDC solution for Mainframe
  2. 3rd party commercial CDC solution (e.g. Attunity or HVR)
  3. Open-source CDC solution (e.g. Debezium; but you still need an IIDR license, the same challenge as with Oracle and GoldenGate CDC)
  4. Create interface tables + Kafka Connect + JDBC connector
  5. IBM MQ interface + Kafka Connect’s IBM MQ connector
  6. Confluent REST Proxy and HTTP(S) calls from the mainframe
  7. Kafka Clients on the mainframe

Evaluate the trade-offs and make your decision. Requiring a fully supported solution often eliminates several options quickly. 🙂 In my experience, most people use CDC with IIDR or a 3rd party tool, followed by IBM MQ. The other options are more theoretical in most cases.
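To give an idea of option 7: Java runs on z/OS, so a plain Kafka producer can publish events directly from a mainframe batch job or started task. A minimal sketch, with a made-up topic name and payload:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class MainframeEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Wait for all in-sync replicas; mainframe transactions must not get lost
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("mainframe.payments",
                    "account-4711", "{\"amount\": 42.50, \"currency\": \"EUR\"}"));
            producer.flush();
        }
    }
}
```

The catch, as discussed above: this requires touching the application code on the mainframe, which is exactly what most projects try to avoid at first.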

Year 2 to 4: New Projects and Applications

Now it is time to build new applications:

New Projects and Applications


This can be agile, lightweight microservices using any technology. Or it can be external solutions like a data lake or cloud service. Pick what you need. You are flexible. The heart of your infrastructure allows this: It is open, flexible and scalable. Welcome to a new modern IT world! 🙂

Year 5: Mainframe Replacement

At some point, you might wonder: What about the old mainframe applications? Can we finally replace them (or at least some of them)? When the new application based on a modern technology is ready, switch over:

Mainframe Replacement

As this is a step-by-step approach, the risk is limited. First of all, the mainframe can keep running. If the new application does not work, switch back to the mainframe (which is still up-to-date, as you did not stop applying the updates to its database in parallel to the new application).

As soon as the new application is battle-tested and proven, the mainframe application can be shut down. The $$$ budgeted for the next mainframe renewal can be used for other innovative projects. Congratulations!

Mainframe Offloading and Replacement is a Journey… Kafka can Help!

Here is the big picture of our mainframe offloading and replacement story:

Mainframe Offloading and Replacement with Apache Kafka

Hybrid Cloud Infrastructure with Apache Kafka and Mainframe

Remember the three phases I discussed earlier: Phase 1 – Cloud Adoption, Phase 2 – Hybrid Cloud, Phase 3 – Cloud-First Development. I did not tackle this in detail in this blog post. Having said this, Kafka is an ideal candidate for this journey, which makes it a perfect fit for offloading and replacing mainframes in a hybrid and multi-cloud strategy. More details here:

Slides and Video Recording for Kafka and Mainframe Integration

Here are the slides:

And the video walking you through the slides:

What are your experiences with mainframe offloading and replacement? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss! Stay informed about new blog posts by subscribing to my newsletter.

The post Mainframe Integration, Offloading and Replacement with Apache Kafka appeared first on Kai Waehner.

Smart City with an Event Streaming Platform like Apache Kafka https://www.kai-waehner.de/blog/2020/02/24/building-smart-city-event-streaming-platform-apache-kafka/ Mon, 24 Feb 2020 13:10:29 +0000 https://www.kai-waehner.de/?p=2050 A smart city is an urban area that uses different types of electronic Internet of Things (IoT) sensors…

The post Smart City with an Event Streaming Platform like Apache Kafka appeared first on Kai Waehner.

A smart city is an urban area that uses different types of electronic Internet of Things (IoT) sensors to collect data and then use insights gained from that data to manage assets, resources and services efficiently. This includes data collected from citizens, devices, and assets that is processed and analyzed to monitor and manage traffic and transportation systems, power plants, utilities, water supply networks, waste management, crime detection, information systems, schools, libraries, hospitals, and other community services.

Smart City - Event Streaming with Apache Kafka

I did a Confluent webinar about this topic recently together with my colleague Robert Cowart. Rob has deep experience in this topic from several projects in recent years. I face this discussion regularly from different perspectives in customer meetings all over the world:

  1. Cities / governments: Increase safety, improve planning, increase efficiency, reduce cost
  2. Automotive / vehicle vendors: Improve customer experience, cross-sell
  3. Third party companies (ride sharing, ticket-less parking, marketplaces, etc.): Provide innovative new services and business models

This blog post covers an extended version of the webinar. I give a quick overview and share the slides + video recording.

Benefits of a Smart City

A smart city provides many benefits for the civilization and the city management. Some of the goals are:

  • Improved Pedestrian Safety
  • Improved Vehicle Safety
  • Proactively Engaged First Responders
  • Reduced Traffic Congestion
  • Connected / Autonomous Vehicles
  • Improved Customer Experience
  • Automated Business Processes

“Smart City” is a very generic term and often used as a buzzword. It includes many different use cases and stakeholders. In summary, a smart city provides the right insights (enriched and analyzed) at the right time (increasingly “real-time”) to the right people, processes and systems.

Innovative New Business Models Emerging…

A smart city establishes exciting new business models. Many of these make the experience for the end user much better. For instance, I am so glad that I don’t have to pay for parking with pocket change anymore. A simple app really makes me very happy.

Here are a few arbitrary examples of innovative projects and services related to building a smart city:

  • wejo is offering a platform designed specifically for connected car data.
  • Park Now provides cities and operators a cashless mobile parking and payment solution.
  • Scheidt & Bachmann offers a ticketless parking management system.
  • The government of Singapore created Virtual Singapore; an authoritative 3D digital platform intended for use by the public, private, people and research sectors for urban planning, collaboration and decision-making, communication, visualization and other use cases.

The latter is a great example of building a digital twin outside of manufacturing. I covered the topic “Event Streaming with Apache Kafka for Building a Digital Twin” in detail in another post.

Technical Challenges for Building a Smart City

Many cities are investing in technologies to transform into smart city environments in which data collection and analysis are utilized to manage assets and resources efficiently.

The key challenges are:

  • Integration with different data sources and technologies…
  • Data transformation and correlation to provide multiple perspectives…
  • Real time processing to act while the information is important…
  • High scalability and zero downtime to run continuously even in case of hardware failure (life in a city never stops)…

Challenges - Data Integration, Correlation, Real Time

Modern technology can help connect the right data, at the right time, to the right people, processes and systems.

Learn How to Build a Smart City with Apache Kafka

Innovations around smart cities and the Internet of Things give cities the ability to improve motor safety, unify and manage transportation systems and traffic, save energy and provide a better experience for the residents.

By utilizing an event streaming platform like Apache Kafka, cities are able to process data in real time from thousands of sources, such as sensors. By aggregating that data and analyzing real-time data streams, more informed decisions can be made and operations can be fine-tuned for a positive impact on everyday challenges faced by cities.
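As a small, hypothetical example of such an aggregation, the following Kafka Streams sketch counts events per traffic sensor in five-minute windows. Topic names and the metric are illustrative:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class TrafficSensorAggregation {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "smart-city-traffic");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Key = sensor id, value = raw event payload
        builder.stream("traffic.sensor-events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
               .count()
               .toStream()
               // Flatten the windowed key back to the plain sensor id
               .map((windowedId, count) -> KeyValue.pair(windowedId.key(), count))
               .to("traffic.sensor-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

A downstream service, for example a traffic light controller or a dashboard, would simply consume the counts topic; the aggregation logic stays decoupled from both the sensors and the consumers.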

Event Streaming and Event Driven Architecture for a Smart City with Apache Kafka

Learn how to:

  • Overcome challenges for building a smart city
  • Connect thousands of devices, machines, and people
  • Build a real time infrastructure to correlate relevant events
  • Leverage open source and fully managed solutions from the Apache Kafka ecosystem
  • Plan and deploy hybrid architectures with edge, on premise and cloud infrastructure

The following two sections share a slide deck and video recording with lots of use cases, technical information and best practices.

Slide Deck – Smart City with the Apache Kafka Ecosystem

Here is the slide deck:

Video Recording – Event Streaming and Real Time Analytics At Scale

The video recording is an extended version of the recent Confluent webinar:

Further readings…

Some recommended material to dig deeper into event streaming to build a smart city.

Zero Downtime and Disaster Recovery

High scalability and zero downtime are crucial in a smart city. A deep dive into zero downtime and disaster recovery with the Apache Kafka ecosystem is available here: Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments.

Real Time Analytics and Machine Learning at Scale

If you are curious how to build a smart city infrastructure, check out the following demo. It provides a scalable deployment of Apache Kafka for real time analytics with 100,000 connected cars:

The post Smart City with an Event Streaming Platform like Apache Kafka appeared first on Kai Waehner.
