            Handling Large Messages with Apache Kafka (CSV, XML, Image, Video, Audio, Files)

            • By Kai Waehner
            • 7. August 2020

            Kafka was not built for large messages. Period. Nevertheless, more and more projects send and process 1 MB, 10 MB, and even much bigger files and other large payloads via Kafka. One reason is that Kafka was designed for high volume and throughput – which is exactly what large messages require. This blog post covers the use cases, architectures, and trade-offs for handling large messages with Kafka.

            Use Cases for Large (Kafka) Message Payloads

            Various use cases for large message payloads exist: Image recognition, video analytics, audio analytics, and file processing are widespread examples.

            Image Recognition and Video Analytics

            Image recognition and video analytics (also known as computer vision) is probably the number one use case. Many examples require the analysis of videos in real-time, including:

            • Security and surveillance (access control, intrusion detection, motion detection)
            • Transport monitoring (vehicle traffic detection, incident detection, pedestrian monitoring)
            • Healthcare (health status monitoring, telemedicine, surgical video analysis)
            • Manufacturing (machine vision for quality assurance, augmented support and training)

            Image and video processing with computer vision libraries (e.g., OpenCV) or deep learning / neural networks (e.g., TensorFlow) reduces time, cost, and human effort, and it makes these industries more secure, reliable, and consistent.

            Audio Analytics

            Audio analytics is an interesting use case, coming up more and more:

            • In conjunction with video analytics: See the use cases above. Often video and audio need to be processed together.
            • Consumer IoT (CIoT): Alerting, informing, advising people, e.g., using Audio Analytic.
            • Industrial IoT (IIoT): Machine diagnostics and predictive maintenance using advanced sound analysis, e.g., using Neuron Soundware
            • Natural Language Processing (NLP): Chatbots and other modern systems use text and speech translation, e.g., using the fully-managed services from the major cloud providers

            Big Data File Processing

            Last but not least, the processing of big files received in batch mode will not go away any time soon. But big files can be incorporated into a modern event streaming workflow to gain decoupling and separation of concerns, connectivity to various sinks, and the ability to process the data in real time and in batch simultaneously.

            Legacy systems provide data sources like big CSV files, proprietary files, or snapshots/exports from databases that need to be integrated. Data processing includes streaming applications (such as Kafka Streams, ksqlDB, or Apache Flink) that continuously process, correlate, and analyze events from different data sources. Frameworks such as Hadoop or Spark process incoming data in batch mode (e.g., map/reduce, shuffling). Other sinks such as a data warehouse (e.g., Snowflake) or text search (e.g., Elasticsearch) ingest data in near real-time.

            What Kafka is NOT

            After exploring use cases for large message payloads, let’s clarify what Kafka is not:

            Kafka is usually not the right technology to store and process large files (images, videos, proprietary files, etc.) as a whole. Products were built specifically for these use cases.

            For instance, a Content Delivery Network (CDN) such as Akamai, Limelight Networks, or Amazon CloudFront distributes video streams and other software downloads across the globe. Other products focus on “big file editing and processing”: video editing tools from Adobe, Autodesk, Camtasia, and many other vendors are used to structure and present video content, including films and television shows, video advertisements, and video essays.

            Let’s take a look at one example that combines Kafka and these other tools:

            Netflix processes over 6 Petabytes per day with Kafka. However, this is “just” for message orchestration, coordination, data integration, data preprocessing, ingestion into data lakes, building stateless and stateful business applications, and other use cases. Kafka is not used to share and store all the shows and movies you watch on your TV or tablet. A Content Delivery Network (CDN) like Akamai is used in conjunction with other tools and products to provide the excellent video streaming experience you know.

            Okay, so Kafka is not the right tool to store and process large files as a whole the way a CDN or video editing tool does. Why, when, and how should you handle large message payloads with Kafka, then? And what is a “large message” in Kafka terms?

            Features and Limitations of using Kafka for Large Messages

            Originally, Kafka was not built for processing large messages and files. This does not mean that you cannot do it!

            Kafka limits the maximum size of messages. The default value of the broker configuration ‘message.max.bytes’ is 1 MB.

            Why does Kafka limit the message size by default?

            • Large message handling requires different sizing, configuration, and tuning than a mission-critical real-time cluster with low latency.
            • Large messages increase the memory pressure on the broker JVM.
            • Large messages are expensive to handle and could slow down the brokers.
            • A reasonable message size limit can meet the requirements of most use cases.
            • Good workarounds exist if you need to handle large messages.
            • Most cloud offerings don’t allow large messages.

            There are noticeable performance impacts from increasing the allowable message size.

            Hence, understand all alternatives discussed below before sending messages larger than 1 MB through your Kafka cluster. Depending on your SLAs for uptime and latency, a separate Kafka cluster should be considered for processing large messages.

            Having said this, I have seen customers processing messages far bigger than 10 MB with Kafka. It is valid to evaluate Kafka for processing large messages instead of using another tool for that (often in conjunction with Kafka).
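
            If you do decide to raise the limit, several settings on the broker, topic, producer, and consumer side have to line up. Here is a minimal sketch of the standard Kafka configuration parameters involved; the 10 MB value and the bootstrap address are placeholders for illustration only:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class LargeMessageConfig {

    // Broker / topic side (server.properties or a topic-level override):
    //   message.max.bytes=10485760        (broker default is ~1 MB)
    //   replica.fetch.max.bytes=10485760  (followers must be able to replicate large records)
    //   topic-level equivalent: max.message.bytes

    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Allow client requests up to 10 MB so a single large record is not rejected client-side
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 10 * 1024 * 1024);
        return props;
    }

    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Make sure a single fetch can hold the biggest expected record
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 10 * 1024 * 1024);
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 10 * 1024 * 1024);
        return props;
    }
}
```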

            LinkedIn talked a long time ago about the pros and cons of two different approaches: using ‘Kafka only’ vs. ‘Kafka in conjunction with another data store’. Especially outside the public cloud, most enterprises cannot simply use an S3 object store for big data. Therefore, the question comes up whether one system (Kafka) is good enough, or whether you should invest in two systems (Kafka and external storage).

            Let’s take a look at the trade-offs for using Kafka for large messages.

            Kafka for Large Messages – Alternatives and Trade-Offs

            There is no single best solution. The decision on how to handle large messages with Kafka depends on your use cases, SLAs, and already existing infrastructure.

            Three alternatives exist to handle large messages with Kafka:

            • Reference-based messaging in Kafka and external storage
            • In-line large message support in Kafka without external storage
            • In-line large message support and tiered storage in Kafka

            Here are the characteristics and pros/cons of each approach (this is an extension of a LinkedIn presentation from 2016):

            Apache Kafka for large message payloads and files - Alternatives and Trade-offs

            Also, don’t underestimate the power of compression for large messages. Some big files like CSV or XML shrink significantly just by setting the compression parameter to GZIP, Snappy, or LZ4.
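
            As a quick illustration, here is what enabling compression looks like on the producer side. This is a minimal sketch: LZ4 is chosen arbitrarily, and the right codec depends on your CPU vs. size trade-off.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompressedProducerConfig {

    public static Properties create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Compress record batches before they leave the producer;
        // text formats like CSV or XML typically shrink a lot
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // alternatives: gzip, snappy
        return props;
    }
}
```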

            Even a 1 GB file could be sent via Kafka, but this is undoubtedly not what Kafka was designed for. In both the client and the broker, a 1 GB chunk of memory needs to be allocated in the JVM for every 1 GB message. Hence, in most cases, for really large files, it is better to externalize them into an object store and use Kafka just for the metadata.

            You need to define for yourself what ‘a large message’ is and when to use which of the design patterns discussed in this blog post. That’s why I am writing this up here… 🙂

            The following sections explore these alternatives in more detail. Before we start, let’s explain the general concept of Tiered Storage for Kafka mentioned in the above table. Many readers might not be aware of this yet.

            Tiered Storage for Kafka

            Kafka data is mostly consumed in a streaming fashion using tail reads. Tail reads leverage the OS’s page cache to serve the data instead of disk reads. Older data is typically read from disk for backfill or failure recovery purposes, and such reads are infrequent.

            In the tiered storage approach, the Kafka cluster is configured with two tiers of storage – local and remote. The local tier is the same as today’s Kafka: it uses the local disks on the Kafka brokers to store the log segments. The new remote tier uses an external storage system such as AWS S3, GCS, or MinIO to store completed log segments. A separate retention period is defined for each tier.

            With the remote tier enabled, the retention period for the local tier can be significantly reduced from days to a few hours. The retention period for the remote tier can be much longer – months or even years.

            Tiered Storage for Kafka allows scaling storage independent of memory and CPUs in a Kafka cluster, enabling Kafka to be a long-term storage solution. This also reduces the amount of data stored locally on Kafka brokers and hence the amount of data that needs to be copied during recovery and rebalancing.

            The consumer API does not change at all. Kafka applications consume data as before. They don’t even know if Tiered Storage is used under the hood.

            Confluent Tiered Storage

            Confluent Tiered Storage is available today in Confluent Platform and used under the hood in Confluent Cloud:

            Confluent Tiered Storage for Kafka

            From an infrastructure perspective, Confluent Tiered Storage requires an external (object) storage such as AWS S3, GCS, or MinIO. From an operations and development perspective, however, the complexity of end-to-end communication and the separation of messages and files is handled out-of-the-box under the hood.
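
            On the broker side, enabling it boils down to a handful of properties. The following sketch uses the property names documented for Confluent Platform at the time of writing; the bucket and region are placeholders, and you should verify the exact names against the documentation of your release:

```properties
# server.properties (Confluent Platform broker) - illustrative values only
confluent.tier.feature=true
confluent.tier.enable=true
confluent.tier.backend=S3
confluent.tier.s3.bucket=my-tiered-storage-bucket
confluent.tier.s3.region=us-east-1
```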

            KIP-405 – Add Tiered Storage to Kafka

            KIP-405 – Add Tiered Storage Support to Kafka is also in the works. Confluent is actively working on this with the open-source community. Uber is leading this initiative.

            Kafka + Tiered Storage is an exciting option (in some use cases) for handling large messages. It not only gives the operator a single infrastructure to manage, but also provides cost savings and better elasticity.

            We now understand the technical feasibility of handling large message payloads with Kafka. Let’s now discuss the different use cases and architectures in more detail.

            Use Cases and Architectures using Kafka for Large Messages

            The processing of the content of your large message payload depends on the technical use case. Do you want to

            • Send an image to analyze or enhance it?
            • Stream a video to a remote consumer application?
            • Analyze audio noise in real-time?
            • Process a structured (i.e., splittable) file line-by-line?
            • Send an unstructured (i.e., non-splittable) file to a consumer tool to process it?

            I cover a few use cases for handling large messages:

            • Manufacturing: Quality assurance in production lines deployed at the edge in the factory
            • Retailing: Augmented reality for better customer experience and cross/up-selling
            • Pharma and Life Sciences: Image processing and machine learning for drug discovery
            • Public sector: Security and surveillance
            • Media: Content delivery of large video files
            • Banking: Attachments in a chat application for customer service

            The following sections explore these use cases with different architectural approaches for processing large message payloads with Apache Kafka and discuss their pros and cons:

            1. Kafka-native payload processing
            2. Chunk and re-assemble
            3. Metadata in Kafka and linking to external storage
            4. Externalizing large payloads on-the-fly

            Kafka for Large Message Payloads – Image Processing

            Computer vision and image recognition are used in many industries, including automotive, manufacturing, healthcare, retailing, and innovative “Silicon Valley use cases”. Image processing includes tools such as OpenCV but also technologies implementing deep learning algorithms such as Convolutional Neural Networks (CNNs).

            Let’s take a look at a few examples from different industries.

            Kafka-native Image Processing for Machine Vision in Manufacturing

            Machine Vision is the technology and methods used to provide imaging-based automatic inspection and analysis for such applications as automated inspection, process control, and robot guidance, usually in industry.

            A Kafka-native machine vision implementation sends images from cameras to Kafka. Preprocessing adds metadata and correlates it with data from other backend systems. The message is then consumed by one or more applications:

            Kafka-native Image Processing
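
            To make the first step concrete, here is a minimal sketch of a Kafka-native producer that publishes raw image bytes and carries the metadata as record headers. The topic name, camera id, and file path are assumptions for illustration:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class CameraImageProducer {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Images above the default 1 MB limit require the size settings discussed earlier
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 10 * 1024 * 1024);

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            byte[] image = Files.readAllBytes(Path.of("frame-0001.jpg")); // placeholder file
            ProducerRecord<String, byte[]> record =
                new ProducerRecord<>("camera-images", "camera-42", image); // placeholder topic/key
            // Metadata travels in headers so consumers can route and correlate without parsing the payload
            record.headers().add("camera-id", "42".getBytes());
            record.headers().add("content-type", "image/jpeg".getBytes());
            producer.send(record).get();
        }
    }
}
```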

            Image processing and machine learning for drug discovery in Pharma and Life Sciences

            “On average, it takes at least ten years for a new medicine to complete the journey from initial discovery to the marketplace” said PhRMA.

            Here is one example where event streaming at scale in real-time speeds up this process significantly.

            Recursion faced several technical challenges: their drug discovery process was manual, slow, ran in bursty batch mode, and did not scale:

            Drug Discovery in manual and slow, bursty batch mode, not scalable

            To solve these challenges, Recursion leveraged Kafka and its ecosystem to build a massively parallel system that combines experimental biology, artificial intelligence, automation, and real-time event streaming to accelerate drug discovery:

            Drug Discovery in automated, scalable, reliable real time Mode

            Check out Recursion’s Kafka Summit talk to learn more details.

            I see plenty of customers in various industries implementing scalable real-time machine learning infrastructures with the Kafka ecosystem. Related to the above use case, I explored more details in the blog post “Apache Kafka and Event Streaming in Pharma and Life Sciences“. The following shows a potential ML infrastructure:

            Streaming Analytics for Drug Discovery in Real Time at Scale with Apache Kafka

            Kafka-native Image Recognition for Augmented Reality in Retailing

            Augmented reality (AR) is an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information. AR applications are usually built with engines such as Unity or Unreal. Use cases exist in various industries. Industry 4.0 is the most prominent one today, but other industries are starting to build fascinating applications. Just think about Pokémon Go on your smartphone.

            The following shows an example of AR in the telco industry for providing an innovative retailing service. The customer takes a picture of their home, sends the picture to an OTT service of the telco provider, and receives an enhanced picture (e.g., with a new couch to buy for the living room):

            Augmented Reality with Apache Kafka and Deep Learning for Picture Enhancement

            Kafka is used for orchestration, integration with backend services, and sending the original and enhanced image between the smartphone and the OTT Telco service.

            Machine Vision at the Edge in Industrial IoT (IIoT) with Confluent and Hivecell

            Kafka comes up at the edge more and more. Here is an example of machine vision at the edge with Kafka in Industrial IoT (IIoT) / Industry 4.0 (I4):

            Hivecell and Confluent Platform for Image Processing at the Edge

            A Hivecell node is equipped with

            • Confluent MQTT Proxy: Integration with the cameras
            • Kafka Broker and ZooKeeper: Event streaming platform
            • Kafka Streams: Data processing, such as filtering, transformations, aggregations, etc.
            • Nvidia’s Triton Inference server: Image recognition using trained analytic models
            • Kafka Connect and Confluent Replicator:  Replication of the machine vision results to the cloud

            Video Streaming with Apache Kafka

            Streaming media is the process of delivering and consuming media continuously: data is received by and presented to one or more consumers while it is being delivered by a provider. Buffering the split-up data packets of a video on the consumer side ensures a continuous flow.

            The implementation of video streaming with Kafka-native technologies is pretty straightforward:

            Video Streaming with Apache Kafka via Chunk + Re-Assemble Large Kafka Message Payloads

            This architecture leverages the Composed Message Processor Enterprise Integration Pattern (EIP):

            Composed Message Processor Enterprise Integration Pattern

            Our use case is even more straightforward because we don’t need a content-based router. We just combine the Splitter and Aggregator EIPs.
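
            A minimal sketch of the Splitter side could look as follows: the file is cut into fixed-size chunks that share the same key (and therefore the same partition), and a header carries the chunk index so the Aggregator on the consumer side can re-assemble them in order. The chunk size, topic name, and file are assumptions:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class VideoChunkSplitter {

    private static final int CHUNK_SIZE = 512 * 1024; // 512 KB chunks (assumption)

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        String fileId = UUID.randomUUID().toString(); // correlation id for the Aggregator
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
             InputStream in = Files.newInputStream(Path.of("movie.mp4"))) { // placeholder file
            byte[] buffer = new byte[CHUNK_SIZE];
            int read;
            int chunkIndex = 0;
            while ((read = in.read(buffer)) > 0) {
                byte[] chunk = Arrays.copyOf(buffer, read);
                ProducerRecord<String, byte[]> record =
                    new ProducerRecord<>("video-chunks", fileId, chunk); // placeholder topic
                record.headers().add("chunk-index", Integer.toString(chunkIndex++).getBytes());
                producer.send(record);
            }
            producer.flush();
            // In practice, also send a total chunk count or an end-of-file marker so the
            // Aggregator (consumer side) knows when the re-assembly of a file is complete.
        }
    }
}
```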

            The blog post “Building a distributed multi-video processing pipeline with Python and Confluent Kafka” shows a great demo of processing a stream of video frames with Kafka.

            Split and aggregate video streams for security and surveillance in the public sector

            The following shows a use case for video streaming with Kafka for security and surveillance:

            Video Streaming with Apache Kafka for Security and Surveillance as part of Modernized SIEM

            In this case, video streaming is part of a modernized SIEM (security information and event management). Audio streaming works in a very similar way. Hence, I will not cover it separately.

            A smart city is another example where video, image, and audio processing with Kafka come into play.

            Kafka for Large Message Payloads – Big Data Files (CSV, XML, Video, Proprietary)

            So far, we have seen examples of processing specific large messages: images, video, and audio. In many use cases, other kinds of files need to be processed. Large files include:

            • Structured data, e.g., big CSV or XML files
            • Unstructured data, e.g., complete videos (not continuous video streaming) or other binary files such as an analytic model

            Big Aggregated Files from Offline Devices in IoT

            Big CSV, XML, or JSON files are still the norm, even in modern real-time, event-based architectures. Connected cars are one example. In theory, every event is sent in real time (check out the Audi keynote at Kafka Summit for this use case). In practice, cars (or other IoT devices) are offline regularly. Many firmware implementations then send the data as one big aggregated file, as Tesla described in a Confluent blog:

            “One of the unique challenges with device data streams, as opposed to web server log data streams, is that it is normal for devices to become disconnected for a long while – easily months – and to then send a huge amount of data as they come back online. Depending on how your device firmware is written, this could mean many, many small messages (which can be wasteful for storage and bandwidth considerations), or, more likely, the backlog of small messages batched together as a couple of large messages. If we were to just blindly write these into the same Kafka topics that we were planning to use for regular ingest, it could easily overwhelm our cluster.”

            Similar to the video streaming use case discussed above, Enterprise Integration Patterns such as Splitter, Router, and Aggregator are perfect for this scenario: download the big file at once, split it up, process it event by event, and, if necessary, route or aggregate the events again.
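
            For a structured, splittable file, the Splitter can simply emit one event per row. The sketch below assumes a CSV backlog file whose first column is the device id and routes the rows to a dedicated backlog topic, exactly to avoid overwhelming the regular ingest topic; file name, topic, and column layout are assumptions:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.stream.Stream;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CsvBacklogSplitter {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Stream<String> lines = Files.lines(Path.of("device-backlog.csv"))) { // placeholder file
            lines.skip(1) // skip the CSV header row
                 .forEach(line -> {
                     // Assumes the device id is the first column; keying by it preserves per-device ordering
                     String deviceId = line.substring(0, line.indexOf(','));
                     // Dedicated topic so the backlog cannot overwhelm the real-time ingest topic
                     producer.send(new ProducerRecord<>("device-backlog-events", deviceId, line));
                 });
        }
    }
}
```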

            Claim Check Enterprise Integration Pattern for non-splittable Large Messages

            As I said before, Kafka is not the right technology to store big files. Specific tools were built for this, including object stores such as AWS S3 or MinIO.

            The Claim Check EIP is the perfect solution for this problem:

             

            Claim Check Pattern EIP

            Metadata in Kafka and linking to external storage for content delivery of large video files in the media industry

            The media industry produces many large video files, which are handled by specialized storage and video editing tools. Kafka does not send these big files, but it controls the orchestration in a flexible, decoupled, real-time architecture:

            Metadata in Kafka + Link to Object Storage for Large Files
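
            Here is a hedged sketch of the Claim Check flow with an S3-compatible object store: the producer uploads the file first and then publishes only a small reference event to Kafka. The bucket, topic, and the JSON layout of the reference are assumptions for illustration:

```java
import java.nio.file.Path;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ClaimCheckProducer {

    public static void main(String[] args) throws Exception {
        Path videoFile = Path.of("raw-footage.mov");   // placeholder file
        String bucket = "media-assets";                // placeholder bucket
        String objectKey = "incoming/" + videoFile.getFileName();

        // 1. Store the large file in the object store
        try (S3Client s3 = S3Client.create()) {
            s3.putObject(PutObjectRequest.builder().bucket(bucket).key(objectKey).build(),
                         RequestBody.fromFile(videoFile));
        }

        // 2. Publish only the claim check (metadata + reference) to Kafka
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        String referenceEvent = "{\"bucket\":\"" + bucket + "\",\"key\":\"" + objectKey
                + "\",\"contentType\":\"video/quicktime\"}";
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("media-asset-events", objectKey, referenceEvent)).get();
        }
        // Consumers read the small event and fetch the file from the object store only when needed.
    }
}
```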

            Externalizing large payloads on-the-fly for legacy integration from proprietary systems in Financial Services

            Big files have to be processed in many industries. In financial services, I saw several use cases where large proprietary files have to be shared between different legacy applications.

            Similar to the Claim Check EIP used above, you can also leverage Kafka Connect and its Single Message Transformations (SMT) feature:

            Externalizing Large Payloads on the fly with Kafka Connect SMT
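
            The SMT intercepts each record inside the Connect worker. The following skeleton is only a sketch of a custom transformation that externalizes oversized payloads and forwards a reference instead; the upload helper and the size threshold are hypothetical, and only the Transformation interface is the actual Kafka Connect API:

```java
import java.util.Map;
import java.util.UUID;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

public class ExternalizeLargePayload<R extends ConnectRecord<R>> implements Transformation<R> {

    private static final int THRESHOLD_BYTES = 1024 * 1024; // externalize payloads above ~1 MB (assumption)

    @Override
    public R apply(R record) {
        Object value = record.value();
        // This sketch assumes raw byte[] values; small messages pass through unchanged
        if (!(value instanceof byte[]) || ((byte[]) value).length < THRESHOLD_BYTES) {
            return record;
        }
        String reference = uploadToObjectStore((byte[]) value);
        // Forward a record that carries only the reference instead of the large payload
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(),
                null, reference, record.timestamp());
    }

    private String uploadToObjectStore(byte[] payload) {
        // Hypothetical helper: write the payload to S3/MinIO/... and return its location
        return "s3://large-payloads/" + UUID.randomUUID();
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}
```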

            Here is a detailed example of processing large XML files with Kafka Connect and the open-source File Pulse connector: “Streaming data into Kafka – Loading XML file“.

            Natural Language Processing (NLP) using Kafka and Machine Learning for Large Text Files

            Machine Learning and Kafka are a perfect fit. I have covered this topic in many articles and talks in the past. Just google it or start with this blog post to get an idea of the approach: “Using Apache Kafka to Drive Cutting-Edge Machine Learning“.

            Natural Language Processing (NLP) using Kafka and machine learning for large text files is a great example. “Continuous NLP Pipelines with Python, Java, and Apache Kafka” shows how to implement the above design pattern using Kafka Streams, Kafka Connect, and an S3 Serializer / Deserializer.

            I really like this example because it also solves the impedance mismatch between the data scientist (who loves Python) and the production engineer (who loves Java). “Machine Learning with Python, Jupyter, KSQL and TensorFlow” explores this challenge in more detail.

            Large Messages in a Chat Application for Customer Service in Banking

            You just learned how to handle large files with Kafka by externalizing them into an object store and only sending the metadata via Kafka. In some use cases, this is too much effort or cost. Sending large files directly via Kafka is possible and sometimes easier to implement. The architecture is much simpler and more cost-effective.

            I already discussed the trade-offs above. But here is an excellent use case of sending large files natively with Kafka: Attachments in a chat application for customer service.  

            An example of a financial firm using Kafka for a chat system is Goldman Sachs. They led the development of Symphony, an industry initiative to build a cloud-based platform for instant communication and content sharing that securely connects market participants. Symphony is based on an open-source business model that is cost-effective, extensible, and customizable to suit end-user needs. Many other FinServ companies invested in Symphony, including Bank of America, BNY Mellon, BlackRock, Citadel, Citi, Credit Suisse, Deutsche Bank, Goldman Sachs, HSBC, Jefferies, JPMorgan, Maverick, Morgan Stanley, Nomura, and Wells Fargo.

            Kafka is a perfect fit for chat applications. The broker’s storage and decoupling are a great match for multi-platform and multi-technology infrastructures. Offline capabilities and consuming old messages are built into Kafka, too. Here is an example from a chat platform in the gaming industry:

            Real-time Chat function at scale within games and cross-platform using Apache Kafka

            Attachments like files, images, or any other binary content can be part of this implementation. Different architectures are possible. For instance, you could use dedicated Kafka topics for handling large messages. Or you just put them into your ‘chat message’ event. With Confluent Schema Registry, the schema could have an attribute ‘attachment’, as the sketch below shows. Or you externalize the attachment using the Claim Check EIP discussed above.
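
            As an illustration of the schema-based option, here is a hedged sketch of a ‘chat message’ Avro schema built with Avro’s SchemaBuilder, with an optional binary ‘attachment’ field; the field names and namespace are made up for this example:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class ChatMessageSchema {

    // 'attachment' is nullable, so regular text messages stay small;
    // alternatively, store a Claim Check reference (URI) here instead of the raw bytes.
    public static final Schema SCHEMA = SchemaBuilder.record("ChatMessage")
            .namespace("com.example.chat") // placeholder namespace
            .fields()
            .requiredString("messageId")
            .requiredString("userId")
            .requiredString("text")
            .optionalString("attachmentName")
            .optionalBytes("attachment")
            .endRecord();
}
```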

            Kafka-native handling of large messages has its use cases!

            As you learned in this post, plenty of use cases exist for handling large messages and files with Apache Kafka and its ecosystem. Kafka was built for high volume and throughput – which is exactly what large messages require. ‘Scaling Apache Kafka to 10+ GB Per Second in Confluent Cloud‘ is an impressive example.

            However, not all large messages should be processed with Kafka. Often you should use the right storage system and just leverage Kafka for the orchestration. Know the different design patterns and choose the right technology for each problem.

            A common scenario for Kafka-native processing of large messages is at the edge, where other data stores are often not available or would increase the cost and complexity of provisioning the infrastructure.

            What are your experiences with handling large messages with the Kafka ecosystem? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.
