Deep Learning Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/deep-learning/

Apache Kafka for Conversational AI, NLP and Chatbot
https://www.kai-waehner.de/blog/2021/12/08/apache-kafka-for-conversational-ai-nlp-chatbot/
Wed, 08 Dec 2021 00:48:54 +0000

Natural Language Processing (NLP) helps many projects in the real world for service desk automation, customer conversation with a chatbot, content moderation in social networks, and many other use cases. Apache Kafka became the predominant orchestration layer in these machine learning platforms for integrating various data sources, processing at scale, and real-time model inference. This article shows how companies across different industries such as the carmaker BMW, the online travel and booking portal Expedia, and the dating app Tinder leverage the combination of event streaming with machine learning for reliable real-time conversational AI, NLP, and chatbots.

Conversational AI NLP and Chatbot with Apache Kafka

Natural Language Processing (NLP)

Natural language processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, mainly how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of “understanding” the contents of text, documents, and speech, including the contextual nuances of the language within them. The technology can then accurately extract information and insights from the text and categorize and organize the text itself.

Use cases for natural language processing frequently involve speech recognition, natural language understanding, translation, and natural language generation.

Like most other ML concepts, NLP can draw on an ocean of different algorithms. NLP took off in the 2010s, when representation learning and deep neural network-style machine learning methods became widespread. Modern ML frameworks such as TensorFlow, in conjunction with the elastic compute power of the public cloud, enabled NLP usage for companies of any size.

Conversational AI has become mainstream because of Deep Learning

A chatbot is a software application used to conduct an online chat conversation via text or text-to-speech instead of direct contact with a live human agent. Designed to simulate how a human behaves as a conversational partner convincingly, chatbot systems typically require continuous tuning and testing, and many in production remain unable to converse adequately. Chatbots are used in dialog systems for various purposes, including customer service, request routing, or information gathering. 

discover.bot published an excellent article explaining how a chatbot works behind the scenes using NLP. It uses a combination of Natural Language Understanding (NLU) and Natural Language Generation (NLG):

Chatbot Architecture with NLU and NLG

Like any machine learning/deep learning application, a chatbot requires model training (= teaching a chatbot) and model scoring (=applying the chatbot in a dialog with a human). Therefore, building a chatbot is a machine learning problem with related tools, APIs, and cloud services. How does event streaming with Kafka fit into this story?

Machine Learning / NLP and Kafka – An Impedance Mismatch

I wrote about Machine Learning and Kafka a lot in the past. TL;DR: Machine Learning requires data integration and processing at scale, and model predictions often require reliable and robust real-time applications. That’s where Kafka and its ecosystem fit into the story:

Kafka Machine Learning Architecture for Java Python Kafka Connect

For more details, check out my past articles on this topic. They describe different workarounds to solve the impedance mismatch. A few examples:

  • Some organizations let the data science team deploy Python in Docker containers in production and integrate them with applications built on other programming platforms like Java, Go, or .NET / C++.
  • A few projects use Faust as a Kafka-native stream processing library for Python (with several limitations compared to Kafka Streams or ksqlDB) – a minimal sketch follows this list.
  • Embedding NLP models (trained with any machine learning framework using any programming language, including Python) into a native Kafka application for robust model scoring is a well-known option.
  • Last but not least, many model servers added Kafka-native streaming interfaces directly using the Kafka protocol as an alternative to RPC communication via HTTP or gRPC.
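
To illustrate the Faust option mentioned above, here is a minimal, hypothetical sketch of a Python agent that scores chat messages with a pre-trained NLP model. The topic names, the record schema, and the `moderation-model.joblib` file are assumptions for the example, not part of any of the referenced projects.

```python
import faust
import joblib

# Hypothetical broker address, topics, and model file – adjust to your environment.
app = faust.App('nlp-scoring', broker='kafka://localhost:9092')

class ChatMessage(faust.Record):
    user_id: str
    text: str

messages = app.topic('chat-messages', value_type=ChatMessage)
predictions = app.topic('chat-predictions')

model = joblib.load('moderation-model.joblib')  # e.g. a scikit-learn text classifier

@app.agent(messages)
async def score(stream):
    # Each incoming chat message is scored locally; the result is published
    # to another Kafka topic for downstream consumers.
    async for message in stream:
        label = int(model.predict([message.text])[0])
        await predictions.send(key=message.user_id,
                               value={'user_id': message.user_id,
                                      'text': message.text,
                                      'label': label})

if __name__ == '__main__':
    app.main()
```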

Let’s now look at a few real-world examples for machine learning, NLP, chatbots, and the Kafka ecosystem in companies such as BMW, Expedia, and Tinder.

BMW – Kafka as Orchestration Layer for NLP and Chatbots

The automotive company BMW presented innovative NLP services at Kafka Summit as early as 2019. It is no surprise that a carmaker has various NLP scenarios. These include digital contract intelligence, workplace assistants, machine translation, and customer conversations. The latter contains multiple use cases for conversational AI:
  • Service desk automation
  • Speech analysis of customer interaction center (CIC) calls to improve the quality
  • Self-service using smart knowledge bases
  • Agent support
  • Chatbots
The text and speech data is structured, enriched, contextualized, summarized, and translated to build real-time decision support applications. Kafka is a crucial component of BMW’s ML and NLP architecture. The real-time integration and data correlation enable interactive and interoperable data consumption and usage:
NLP Service Framework Based on Kafka at BMW
BMW explained the key advantages of leveraging Kafka and its stream processing library Kafka Streams as the real-time integration and orchestration platform:
  • Flexible integration: Multiple supported interfaces for different deployment scenarios, including various machine learning technologies, programming languages, and cloud providers
  • Modular end-to-end pipelines: Services can be connected to provide full-fledged NLP applications.
  • Configurability: High agility for each deployment scenario

Expedia – Conversations Platform powered by Cloud-native Kafka

Expedia is a leading online travel and booking portal. They have many use cases for machine learning. One of my favorite examples is their Conversations Platform built on Kafka and Confluent Cloud to provide an elastic cloud-native application.

The goal of Expedia’s Conversations Platform was simple: Enable millions of travelers to have natural language conversations with an automated agent via text, Facebook, or their channel of choice. Let them book trips, make changes or cancellations, and ask questions:

  • “How long is my layover?”
  • “Does my hotel have a pool?”
  • “How much will I get charged if I want to bring my golf clubs?”

Then take all that is known about that customer across all of Expedia’s brands and apply machine learning models to give customers what they are looking for immediately in real-time and automatically, whether a straightforward answer or a complex new itinerary.

Real-time Orchestration realized in four Months

Such a platform is no place for batch jobs, back-end processing, or offline APIs. To quickly make decisions that incorporate contextual information, the platform needs data in near real-time, and it needs it from a wide range of services and systems. Meeting these needs meant architecting the Conversations Platform around a central nervous system based on Confluent Cloud and Apache Kafka. Kafka made it possible to orchestrate data from loosely coupled systems, enrich data as it flows between them so that by the time it reaches its destination, it is ready to be acted upon, and surface aggregated data for analytics and reporting.

Expedia built this platform from zero to production in four months. That’s the tremendous advantage of using a fully managed serverless event streaming platform as the foundation. The project team can focus on the business logic.

The Covid pandemic proved the value of an elastic platform: Companies were hit with a tidal wave of customer questions, cancellations, and re-bookings. Throughout this once-in-a-lifetime event, the Conversations Platform proved up to the challenge, auto-scaling as necessary and taking much of the load off live agents.

Expedia’s Migration from MQ to Kafka as Foundation for Real-time Machine Learning and Chatbots

As part of their conversations platform, Expedia needed to modernize their IT infrastructure, as Ravi Vankamamidi, Director of Technology at Expedia Group, explained in a Kafka Summit keynote.
Expedia’s legacy chatbot service relied on an outdated messaging system. It was a question-and-answer board with a very limited scope for booking scenarios and could only handle two-party conversations. It could not scale to bring all the different systems into one architecture to build a powerful chatbot that is helpful for customer conversations.

I have explained several times that event streaming is more than just a (scalable) message queue. Check out my old (but still accurate and relevant) comparison between MQ and Kafka, or the newer comparison between cloud-native iPaaS and Kafka.

Expedia needed a service that was closer to travel assistance. It needed to handle context-specific, multi-party, multi-channel conversations. Hence, features such as natural language processing, translation, and real-time analytics were required. The full service needed to be scalable across multiple brands. Therefore, a fast and highly scalable platform with ordering guarantees, exactly-once semantics (EOS), and real-time data processing was needed.
The Kafka-native event streaming platform powered by Confluent was the best choice and met all requirements. One year after the rollout, the new Conversations Platform doubled the Net Promoter Score (NPS), quickly proving its business value.
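
As a side note, the ordering and exactly-once requirements mentioned above map directly to Kafka producer settings. The following is a minimal sketch with Python's confluent-kafka client; the topic, key, and payload are invented for illustration and are not Expedia's actual implementation:

```python
import json
from confluent_kafka import Producer

# Idempotent writes preserve per-partition ordering and avoid duplicates on retries.
producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'enable.idempotence': True,
    'acks': 'all',
})

event = {'intent': 'modify_booking', 'text': 'Can I bring my golf clubs?'}

# Keying by conversation ID keeps all events of one conversation in one partition,
# so downstream NLP consumers see them in order.
producer.produce('conversation-events', key='conversation-4711', value=json.dumps(event))
producer.flush()
```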

Tinder – Content Moderation with Kafka Streams and ML

The dating app Tinder is a great example for which I can think of dozens of NLP use cases. Tinder talked at a past Kafka Summit about their Kafka-powered machine learning platform.

Tinder is a massive user of Kafka and its ecosystem for various use cases, including content moderation, matching, recommendations, reminders, and user reactivation. They used Kafka Streams as a Kafka-native stream processing engine for metadata processing and correlation in real-time at scale:

Impact of Apache Kafka at Tinder

A critical use case in any dating or social platform is content moderation for detecting fakes, filtering sexual content, and other inappropriate material. Content moderation combines NLP and text processing (e.g., for chat messages) with image processing (e.g., for selfie uploads). Often, only the metadata is processed with Kafka, while the linked content is stored in a data lake. Both leverage Deep Learning to process high volumes of text and images. Here is what content moderation looks like in Tinder’s Kafka architecture:
Content Moderation at Tinder with Kafka and Machine Learning
Plenty of ways exist to process text, images, and videos with the Kafka ecosystem. I wrote a detailed article about handling large messages and files with Apache Kafka to explore the options and trade-offs.
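
One common way to handle such large content is the claim-check pattern: the heavyweight payload goes to object storage, and only a small reference event travels through Kafka. Here is a hypothetical Python sketch of that pattern; the bucket, topic, and field names are invented for illustration and are not Tinder's implementation:

```python
import json
import uuid
import boto3
from confluent_kafka import Producer

s3 = boto3.client('s3')
producer = Producer({'bootstrap.servers': 'localhost:9092'})

def publish_upload(user_id: str, image_bytes: bytes) -> None:
    # The large payload (e.g. a selfie) goes to the data lake / object storage.
    object_key = f'uploads/{user_id}/{uuid.uuid4()}.jpg'
    s3.put_object(Bucket='moderation-content', Key=object_key, Body=image_bytes)

    # Only lightweight metadata with a reference travels through Kafka.
    event = {'user_id': user_id, 'content_type': 'selfie', 's3_key': object_key}
    producer.produce('content-moderation-events', key=user_id, value=json.dumps(event))
    producer.flush()
```
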
Chatbots could also play a key role “the other way round”. More and more dating apps (and other social networks) fight against spam, fraud, and automated chatbots. Similar to building a chatbot, a bot detection system can analyze the data streams to identify and block chatbots in a dating app.
Let’s now explore how a developer can build a Kafka-native NLP application.

Telegram Bot API – ML in a Streaming App

Many project teams build their own chatbot or other NLP services. Unfortunately, this is a considerable effort and often not cost-efficient. Another simplified and more cost-efficient alternative is integrating an NLP or chatbot API as a service. My colleague Robin Moffatt wrote a great post about building a Telegram-powered chatbot with Kafka and ksqlDB, where the chatbot API is integrated into the real-time Kafka application. This way, like in the Expedia example above, the NLP application integrates in real time in a truly decoupled fashion with other applications in the enterprise architecture.
Telegram Chatbot integrated with ksqlDB
This example uses Telegram, a messaging platform similar in concept to WhatsApp or Facebook Messenger. A nice Bot API offers integration via REST. While the example uses Telegram, the same approach would work just fine with a bot on your platform of choice (Slack, etc.) or within your standalone application that wants to look up the state that is populated and maintained from a stream of events in Kafka.
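
To make the integration concrete, here is a stripped-down Python sketch (not Robin's ksqlDB implementation) that polls the Telegram Bot API and forwards each incoming message to a Kafka topic for downstream NLP processing. The bot token and topic name are placeholders:

```python
import json
import requests
from confluent_kafka import Producer

BOT_TOKEN = '<your-bot-token>'                      # placeholder
API = f'https://api.telegram.org/bot{BOT_TOKEN}'

producer = Producer({'bootstrap.servers': 'localhost:9092'})
offset = None

while True:
    # Long-poll Telegram for new updates.
    resp = requests.get(f'{API}/getUpdates', params={'timeout': 30, 'offset': offset}).json()
    for update in resp.get('result', []):
        offset = update['update_id'] + 1
        message = update.get('message')
        if message and 'text' in message:
            # Forward the raw message into Kafka, keyed by chat ID for per-chat ordering.
            producer.produce('telegram-messages',
                             key=str(message['chat']['id']),
                             value=json.dumps(message))
    producer.poll(0)
```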

Reddit Text Processing – NLP with Streaming Model Predictions

The drawback of the above Telegram example is the integration of a REST API for the chatbot. Remote procedure calls (RPC) are still predominant in machine learning. However, RPC is an anti-pattern in the event streaming world, as it creates challenges concerning robustness, latency, scalability, and error handling. RPC integration is fine for many use cases, but with other options available, there is often no need for RPC calls at all.

Kafka-native streaming machine learning is an alternative for model predictions. The two deployment options are embedding an analytic model into the Kafka application or using a model server that supports event streaming besides RPC (HTTP/gRPC) calls. I wrote a detailed article with the pros and cons of both approaches for model predictions using Kafka applications.

The following shows an example of a Kafka-native integration between a model server and other applications using the Kafka protocol:

Kafka-native Machine Learning Model Server Seldon

The Seldon model server is an example that already supports the Kafka interface. The Seldon team demoed how they train and deploy a machine learning model leveraging a scalable stream processing architecture for an automated text prediction use case. They use scikit-learn (sklearn) and spaCy to train an ML model from the Reddit content moderation dataset and deploy that model using Seldon Core for real-time processing of text data from Kafka streams:

Seldon Model Server with Kafka Support using Python scikit-learn and SpaCy
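
To give an idea of the training side, here is a simplified Python sketch combining spaCy preprocessing with a scikit-learn text classifier. It is not the exact code of the Seldon demo; the tiny inline training data stands in for the Reddit content moderation dataset, and the model file name is made up:

```python
import joblib
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize(text: str):
    # spaCy handles tokenization and lemmatization; stop words and non-words are dropped.
    return [token.lemma_.lower() for token in nlp(text) if token.is_alpha and not token.is_stop]

# Placeholder training data; the real demo trains on the Reddit moderation dataset.
texts = ['this comment is perfectly fine', 'remove this abusive garbage']
labels = [0, 1]  # 0 = keep, 1 = remove

model = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=lemmatize)),
    ('clf', LogisticRegression()),
])
model.fit(texts, labels)

# The serialized model can later be served, e.g. behind Seldon's Kafka interface.
joblib.dump(model, 'moderation-model.joblib')
```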

Kafka-native NLP to build the next Conversational AI and Chatbot

Apache Kafka became the de facto standard for event streaming. One pillar of Kafka use cases is ML platforms, including various NLP-related concepts such as conversational AI, chatbots, and speech translation for improving and automating service desks, content moderation, and plenty of other use cases.

A Kafka-based orchestration and integration layer provides true decoupling in a scalable real-time platform. The benefits for ML platforms include back-pressure handling, pre-processing, and aggregations at any scale in real time. Another benefit is the capability to connect different communication paradigms and technologies. The examples from BMW, Expedia, and Tinder showed what Kafka-based NLP infrastructure can look like.

How do you build conversational AI, chatbots, and other NLP applications? What technologies and architectures do you use? Are event streaming and Kafka part of the architecture? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Streaming Machine Learning with Kafka-native Model Deployment
https://www.kai-waehner.de/blog/2020/10/27/streaming-machine-learning-kafka-native-model-server-deployment-rpc-embedded-streams/
Tue, 27 Oct 2020 15:07:50 +0000

Apache Kafka became the de facto standard for event streaming across the globe and industries. Machine Learning (ML) includes model training on historical data and model deployment for scoring and predictions. While training is mostly a batch process, scoring usually requires real-time capabilities at scale with high reliability. Apache Kafka plays a key role in modern machine learning infrastructures. The next-generation architecture leverages a Kafka-native streaming model server instead of RPC (HTTP/gRPC) calls:

Kafka-native Model Server for Machine Learning and Model Deployment

This blog post explores the architectures and trade-offs of three options for model deployment with Kafka: embedding the model into the Kafka application, a model server with RPC, and a model server with Kafka-native communication.

Kafka and Machine Learning Architecture

Model deployment is usually completely separated from model training (from the process and the technology perspective). The model training is often executed in elastic cloud infrastructure or data lakes such as Hadoop or Spark. Model scoring (i.e., doing the predictions) is usually a mission-critical workload with different uptime SLAs and latency requirements:

Learn more about this architecture and the relation to modern ML approaches such as Hybrid ML architectures or AutoML in the blog post “Using Apache Kafka to Drive Cutting-Edge Machine Learning“.

There are two alternatives for model deployment in Kafka infrastructures: The model can either be embedded into the Kafka application, or it can be deployed into a separate model server. The blog post “Model Server and Model Deployment Options in a Kafka Infrastructure” covers the use cases and architectures in detail and explores some code examples.

The following explores both options’ trade-offs and introduces a third option: A Kafka-native streaming model server.

RPC between Kafka App and Model Server

The analytic model is deployed into a model server. The streaming Kafka application does a request-response call to send the input data to the model and to receive the prediction in return:

Kafka Machine Learning with Model Server and HTTP RPC

Almost every ML product or framework provides a model server. This includes open-source frameworks such as TensorFlow and the related model server TF Serving, but also proprietary tools such as SAS, AutoML vendors such as DataRobot, and cloud ML services from all major cloud providers such as AWS, Azure, and GCP.

Pros and Cons of a Model Server with RPC

Trade-offs using Kafka in conjunction with an RPC-based model server and HTTP/gRPC:

  • (+) Simple integration with existing technologies and organizational processes
  • (+) Easiest to understand if you come from a non-streaming world
  • (-) Tight coupling of the availability, scalability, and latency/throughput between application and model server
  • (+) Separation of concerns (e.g., Python model + Java streaming app)
  • (-) Limited scalability and robustness
  • (+) Later migration to real streaming is also possible
  • (+) Model management built-in for different models, versioning, and A/B testing
  • (+) Model monitoring built-in

Example: TensorFlow Serving with gRPC and Kafka Streams

An example of this approach is available on Github: “TensorFlow Serving + gRPC + Java + Kafka Streams“. In this case, the TensorFlow model is exported and deployed into a TensorFlow Serving infrastructure.
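
The linked example uses Java, Kafka Streams, and gRPC. As an alternative illustration of the same RPC pattern, here is a minimal Python sketch that consumes events from Kafka, calls TensorFlow Serving's REST API, and publishes the prediction. The topic names, consumer group, and model name are assumptions:

```python
import json
import requests
from confluent_kafka import Consumer, Producer

consumer = Consumer({'bootstrap.servers': 'localhost:9092',
                     'group.id': 'scoring-app',
                     'auto.offset.reset': 'earliest'})
producer = Producer({'bootstrap.servers': 'localhost:9092'})
consumer.subscribe(['input-events'])

# TensorFlow Serving's REST endpoint: POST /v1/models/<name>:predict
TF_SERVING_URL = 'http://localhost:8501/v1/models/my_model:predict'

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    features = json.loads(msg.value())  # expected shape: {"instances": [[...]]}
    # Remote call to the model server – this is the coupling the article describes.
    response = requests.post(TF_SERVING_URL, json={'instances': features['instances']})
    producer.produce('predictions', key=msg.key(), value=json.dumps(response.json()))
    producer.poll(0)
```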

Embedded Model into Kafka Application

An analytic model is just a binary, i.e., a file stored in memory or persistent storage.

The file format differs depending on the ML framework. But it does not matter whether it is a Java object (e.g., with H2O), Protobuf (e.g., with TensorFlow), or a proprietary format (as with many commercial ML tools).

Any model can be embedded into a Kafka application (if the ML solution provides programmatic APIs – but almost every tool does):

Kafka Machine Learning with Embedded TensorFlow Model

The Kafka application for embedding the model can either be a Kafka-native stream processing engine such as Kafka Streams or ksqlDB, or a “regular” Kafka application using any Kafka client such as Java, Scala, Python, Go, C, C++, etc.
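
Here is a minimal sketch of the embedded option with a plain Python Kafka client: the serialized model is loaded into the application's memory, so every prediction is a local function call without a remote hop. The model file, topics, and message format are placeholders:

```python
import json
import joblib
from confluent_kafka import Consumer, Producer

# The model binary is loaded once into the application's memory.
model = joblib.load('model.joblib')

consumer = Consumer({'bootstrap.servers': 'localhost:9092',
                     'group.id': 'embedded-scoring',
                     'auto.offset.reset': 'earliest'})
producer = Producer({'bootstrap.servers': 'localhost:9092'})
consumer.subscribe(['sensor-features'])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    features = json.loads(msg.value())
    prediction = model.predict([features['values']])[0]  # local inference, no RPC
    producer.produce('sensor-predictions', key=msg.key(),
                     value=json.dumps({'prediction': float(prediction)}))
    producer.poll(0)
```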

Pros and Cons of Embedding an Analytic Model into a Kafka Application

Trade-offs of embedding analytic models into a Kafka application:

  • (+) Best latency, as inference is local instead of a remote call
  • (+) No coupling of the availability, scalability, and latency/throughput of your Kafka Streams application
  • (+) Offline inference (devices, edge processing, etc.)
  • (+) No side effects (e.g., in case of failure); everything is covered by Kafka processing (e.g., exactly-once)
  • (-) No built-in model management and monitoring

Example: Kafka Python Application with Embedded TensorFlow Model

A robust and scalable example of the embedded model approach is presented in the GitHub project “Streaming Machine Learning with Kafka, MQTT, and TensorFlow for 100000 Connected Cars“. This demo uses Python for both model training and model deployment in separate, scalable containers in a Kubernetes infrastructure.

Several other (simpler) demos of this approach are available here: “Machine Learning + Kafka Streams Examples“. The examples use TensorFlow, H2O.ai, and DeepLearning4J (DL4J) in conjunction with Kafka Streams and ksqlDB.

Kafka-native Streaming Model Server

A Kafka-native streaming model server combines characteristics from both approaches discussed above. It enables the separation of concerns by providing a model server with all the expected features. But the model server does not use RPC communication via HTTP/gRPC, avoiding all the drawbacks this creates for a streaming architecture. Instead, the model server communicates with the client application via the native Kafka protocol and Kafka topics:

Kafka-native Machine Learning Model Server Seldon
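
From the client application's point of view, talking to a Kafka-native model server means producing to the server's input topic and consuming from its output topic; there is no direct call to the server at all. The following Python sketch illustrates the idea with invented topic names (consult your model server's documentation for its actual topic conventions):

```python
import json
from confluent_kafka import Consumer, Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})
consumer = Consumer({'bootstrap.servers': 'localhost:9092',
                     'group.id': 'prediction-consumer',
                     'auto.offset.reset': 'earliest'})
consumer.subscribe(['model-output'])

# Send the scoring request as an event; the model server consumes 'model-input' on its own.
producer.produce('model-input', key='req-42', value=json.dumps({'text': 'is this spam?'}))
producer.flush()

# Asynchronously pick up the prediction whenever the model server has published it.
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print(msg.key(), json.loads(msg.value()))
```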

Pros and Cons of a Kafka-native Streaming Model Server

Trade-offs of a Kafka-native streaming model server:

  • (+) Good latency via Kafka-native streaming communication
  • (-) Some coupling of the availability, scalability, and latency/throughput of your Kafka Streams application
  • (-) Some side effects (e.g., in case of failure), but most potential issues are covered by Kafka processing (e.g., decoupling and persistence via Kafka topics)
  • (+) Model management built-in for different models, versioning, and A/B testing
  • (+) Model monitoring built-in
  • (+) Separation of concerns (e.g., Python model + Java streaming app)
  • (-) Scalability and robustness of the model server are not necessarily Kafka-like (because the underlying implementation is often not Kafka-native yet)

A Kafka-native streaming model server provides many advantages of a streaming architecture and the features of a model server. Just be aware that a Kafka-native interface does NOT mean that the model server itself is implemented with Kafka under the hood. Hence, test your scalability, robustness, and latency requirements to decide if an embedded model might be a better approach.

Example: Streaming Model Deployment with Kafka and Seldon

Seldon is a robust and scalable open-source model server. It allows us to manage, serve, and scale models in any language or framework on Kubernetes.

In mid-2020, Seldon added a key differentiator compared to many other model servers on the market: support for Kafka. Hence, Seldon combines the advantages of a separate model server with the streaming paradigm of Kafka:

Seldon Model Server with Kafka Support using Python scikit-learn and SpaCy

The Jupyter notebook demonstrates this example using scikit-learn, the NLP framework spaCy, Seldon, and Kafka-native integration with producer and consumer applications. Check out the blog post “Real-Time Machine Learning at Scale using SpaCy, Kafka & Seldon Core” for more details.

All Model Deployment Options have Trade-Offs in Streaming Machine Learning Architectures

This blog post covered three alternatives for model deployment in a Kafka infrastructure: A model server with RPC, embedding models into the Kafka application, and a Kafka-native model server. All three have their trade-offs. Know them, and evaluate the right choice for your project. The good news is that it is also pretty straightforward to change from one approach to another one.

UPDATE May 2021: Dataiku also provides a native Kafka interface in the meantime (including support for Schema Registry). Great to see different model servers adopting this architecture.

If you want to learn more details about Kafka-native model deployment, check out the following video recording and slide deck from Kafka Summit:

Event-Driven Model Serving - Stream Processing vs RPC with Kafka and TensorFlow

The talk does not cover the “streaming model server” approach (because no model server provided a Kafka-native interface in 2019). But you can still learn a lot about the different architectures and best practices.

If you want to learn more about “Streaming Machine Learning with Kafka – without another Data Lake“, check out the following video recording and slide deck. It explores a simplified architecture and the advantages of Tiered Storage for Kafka:

Apache Kafka Tiered Storage and TensorFlow for Streaming Machine Learning without a Data Lake

What are your experiences with Kafka and model deployment? What are your use cases? Which approach works best for you? What is your strategy? Let’s connect on LinkedIn and discuss it!

Apache Kafka as Digital Twin for Open, Scalable, Reliable Industrial IoT (IIoT)
https://www.kai-waehner.de/blog/2019/11/28/apache-kafka-industrial-iot-iiot-build-an-open-scalable-reliable-digital-twin/
Thu, 28 Nov 2019 12:53:48 +0000

This blog post discusses the benefits of a Digital Twin in Industrial IoT (IIoT) and its relation to Apache Kafka. Kafka is often used as the central event streaming platform to build a scalable and reliable digital twin for real-time streaming sensor data.

In November 2019, I attended the SPS Conference in Nuremberg. This is one of the most important events about Industrial IoT (IIoT). Vendors and attendees from all over the world fly in to do business and discuss new products. Hotel prices in this region rise from the usual 80-100€ to over 300€ per night. Germany is still known for its excellent engineering and manufacturing industry. German companies drive a lot of innovation and standardization around the Internet of Things (IoT) and Industry 4.0.

The article discusses:

  • The relation between Operational Technology (OT) and Information Technology (IT)
  • The advantages and architecture of an open event streaming platform for edge and global IIoT infrastructures
  • Examples and use cases for Digital Twin infrastructures leveraging an event streaming platform for storage and processing in real time at scale

SPS – A Trade Show in Germany for Global Industrial IoT Vendors

“SPS covers the entire spectrum of smart and digital automation – from simple sensors to intelligent solutions, from what is feasible today to the vision of a fully digitalized industrial world“. No surprise that almost all vendors showed software and hardware solutions for innovative use cases like

  • Condition monitoring in real time
  • Predictive maintenance using artificial intelligence (AI) – I prefer the more realistic term “machine learning”, a subset of AI
  • Integration of legacy machines and proprietary protocols in the shop floors
  • Robotics
  • New digital services
  • Other related buzzwords.

Digital Twin as Huge Value Proposition

New software and hardware needs to improve business processes, increase revenue, or cut costs. Many booths showed solutions for a digital twin as part of this buzzword bingo and value proposition. Most vendors exhibited complete solutions as hardware or software products. Nobody talked about the underlying implementation, and not all vendors could explain in detail how the infrastructure really scales and performs under the hood.

This post starts from a different direction. It begins with the definition and use cases of a digital twin infrastructure. The challenges and requirements are discussed in detail. Afterwards, possible architectures and combinations of solutions show the benefits of an open and scalable event streaming platform as part of the puzzle.

The value proposition of a digital twin always discusses the combination of OT and IT:

  • OT: Operation Technology; dealing with machines and devices
  • IT: Information Technology, dealing with information

Excursus: SPS == PLC —> A Core Component in each IoT Infrastructure

A funny but relevant marginal note for non-German readers: The acronym “SPS” in the trade fair’s name stands for “smart production solutions”. However, in Germany, “SPS” actually stands for “SpeicherProgrammierbare Steuerung”. The English translation might be very familiar to you: “Programmable Logic Controller”, or “PLC“ for short.

PLC is a core component in any industrial infrastructure. This industrial digital computer has been ruggedized and adapted for the control of manufacturing processes, such as:

  • Assembly lines
  • Robotic devices
  • Any activity that requires high reliability control and ease of programming and process fault diagnosis

PLCs are built to withstand extreme temperatures, strong vibrations, high humidity, and more. Furthermore, since it is not reliant on a PC or network, a PLC will continue to function independently without any connectivity. A PLC is OT. This is very different from the IT hardware and software an engineer knows from developing and deploying “normal” Java, .NET, or Golang applications.

Digital Twin – Merging the Physical and the Digital World

A digital twin is a digital replica of a living or non-living physical entity. By bridging the physical and the virtual world, data is transmitted seamlessly allowing the virtual entity to exist simultaneously with the physical entity. The digital twin therefore interconnects OT and IT.

Digital Replica of Potential and Actual Physical Assets

Digital twin refers to a digital replica of potential and actual physical assets (physical twin), processes, people, places, systems and devices. The digital replica can be used for various purposes. The digital representation provides both the elements and the dynamics of how an IoT device operates and lives throughout its life cycle.

Definitions of digital twin technology used in prior research emphasize two important characteristics. Firstly, each definition emphasizes the connection between the physical model and the corresponding virtual model or virtual counterpart. Secondly, this connection is established by generating real time data using sensors.

Digital Twin with Apache Kafka – Simulation of car manufacturing by robots at Siemens

Digital twins connect internet of things, artificial intelligence, machine learning and software analytics to create living digital simulation models. These models update and change as their physical counterparts change. A digital twin continuously learns and updates itself from multiple sources to represent its near real-time status, working condition or position.

This learning system learns from

  • itself, using sensor data that conveys various aspects of its operating condition.
  • human experts, such as engineers with deep and relevant industry domain knowledge.
  • other similar machines
  • other similar fleets of machines
  • the larger systems and environment of which it may be a part.

A digital twin also integrates historical data from past machine usage to factor into its digital model.

Use Cases for a Digital Twin in Various Industries

In various industrial sectors, digital twins are used to optimize the operation and maintenance of physical assets, systems and manufacturing processes. They are a formative technology for the IIoT. A digital twin in the workplace is often considered part of Robotic Process Automation (RPA). Per Industry-analyst firm Gartner, a digital twin is part of the broader and emerging hyperautomation category.

Some industries that can leverage digital twins include:

  • Manufacturing Industry: Physical manufacturing objects are virtualized and represented as digital twin models seamlessly and closely integrated in both the physical and cyber spaces. Physical objects and twin models interact in a mutually beneficial manner. Therefore, the IT infrastructure also sends control commands back to the actuators of the machines (OT).
  • Automotive Industry: Digital twins in the automobile industry use existing data in order to facilitate processes and reduce marginal costs. They can also suggest incorporating new features in the car that can reduce car accidents on the road.
  • Healthcare Industry: Lives can be improved in terms of medical health, sports, and education by taking a more data-driven approach to healthcare. The biggest benefit of the digital twin for the healthcare industry is the fact that healthcare can be tailored to anticipate the responses of individual patients.

Examples for Digital Twins in the Industrial IoT

Digital twins are used in various scenarios. For example, they enable the optimization of the maintenance of power generation equipment such as power generation turbines, jet engines and locomotives. Further examples of industry applications are aircraft engines, wind turbines, large structures (e.g. offshore platforms, offshore vessels), heating, ventilation, and air conditioning (HVAC) control systems, locomotives, buildings, utilities (electric, gas, water, waste water networks).

Let’s take a look at some use cases in more detail:

Monitoring, Diagnostics and Prognostics

A digital twin can be used for monitoring, diagnostics and prognostics to optimize asset performance and utilization. In this field, sensory data can be combined with historical data, human expertise and fleet and simulation learning to improve the outcome of prognostics. Therefore, complex prognostics and intelligent maintenance system platforms can use digital twins in finding the root cause of issues and improve productivity.

Digital twins of autonomous vehicles and their sensor suites embedded in a traffic and environment simulation have also been proposed as a means to overcome the significant development, testing, and validation challenges for automotive applications, in particular when the related algorithms are based on artificial intelligence approaches that require extensive training and validation data sets.

3D Modeling for the Creation of Digital Companions

Digital twins are used for 3D modeling to create digital companions for the physical objects. It can be used to view the status of the actual physical object. This provides a way to project physical objects into the digital world. For instance, when sensors collect data from a connected device, the sensor data can be used to update a “digital twin” copy of the device’s state in real time.

The term “device shadow” is also used for the concept of a digital twin. The digital twin is meant to be an up-to-date and accurate copy of the physical object’s properties and states. This includes information such as shape, position, gesture, status and motion.

Embedded Digital Twin

Some manufacturers embed a digital twin into their device. This improves quality, allows earlier fault detection and gives better feedback on product usage to the product designers.

How to Build a Digital Twin Infrastructure?

TL;DR: You need to have the correct information in real time at the right location to be able to analyze the data and act properly.

Challenges and Requirements for Building a Scalable, Reliable Digital Twin

The following challenges have to be solved to implement a successful digital twin infrastructure:

  • Connectivity: Different machines and sensors typically do not provide the same interface. Some use a modern standard like MQTT or OPC-UA. However, many (older) machines use proprietary interfaces. Furthermore, you also need to integrate with the rest of the enterprise.
  • Ingestion and Processing: End-to-end pipelines and correlation of all relevant information from all machines, devices, MES, ERP, SCM, and any other related enterprise software in the factories, data center or cloud.
  • Real Time: Ingestion of all information in real time (this is a tough term; in this case “real time” typically means milliseconds, sometimes even seconds or minutes are fine).
  • Long-term Storage: Storage of the data for reporting, batch analytics, correlation of data from different data sources, and other use cases you did not think of at the time the data was created.
  • Security: Trusted data pipelines using authentication, authorization and encryption.
  • Scalability: Ingestion and processing of sensor data from one or more shop floors creates a lot of data.
  • Decoupling: Sensors produce data continuously. They don’t ask if the consumers are available and if they can keep up with the input. Handling backpressure and decoupling producers from consumers is mandatory.
  • Multi-Region or Global Deployment: Analysis and correlation of data in one plant is great. But it is even better if you can correlate and analyze the data from all your plants. Maybe even plants deployed all over the world.
  • High Availability: Building a pipeline from one shop floor to the cloud for analytics is great. However, even if you just integrate a single plant, the pipeline typically has to run 24/7. In some cases without data loss and with order guarantees of the sensor events.
  • Role-Based Access Control (RBAC): Integration of tens or hundreds of machines requires capabilities to manage the relation between the hardware and its digital twin. This includes the technical integration, role-based access control, and more.

The above list covers the technical requirements. On top of that, business services have to be provided. This includes device management, analytics, web UI / mobile app, and other capabilities depending on your use cases.

So, how do you get there? Let’s discuss three alternatives in the following sections.

Solution #1 => IoT / IIoT COTS Product

COTS (commercial off-the-shelf) refers to software or hardware products that are ready-made and available for sale. IIoT COTS products are built for exactly one problem: the development, deployment, and operations of IoT use cases.

Many IoT COTS solutions are available on the market. This includes products like Siemens MindSphere, PTC IoT, GE Predix, Hitachi Lumada or Cisco Kinetic for deployments in the data center or in different clouds.

Major cloud providers like AWS, GCP, Azure and Alibaba provide IoT-specific services. Cloud providers are a potential alternative if you are fine with the trade-offs of vendor lock-in. Main advantage: Native integration into the ecosystem of the cloud provider. Main disadvantage: Lock-in into the ecosystem of the cloud provider.

At the SPS trade show, 100+ vendors presented their IoT solution to connect shop floors, ingest data into the data center or cloud, and do analytics. Often, many different tools, products and cloud services are combined to solve a specific problem. This can either be obvious or just under the hood with OEM partners. I already covered the discussion around using one single integration infrastructure versus many different middleware components in much more detail.

Read Gartner’s Magic Quadrant for Industrial IoT Platforms 2019. The analyst report describes the mess under the hood of most IIoT products. Plenty of acquisitions and different code bases are the foundation of many commercial products. Scalability and automated rollouts are key challenges. Download of the report is possible with a paid Gartner account or via free download from any vendor with distribution rights, like PTC.

Proprietary and Vendor-specific Features and Products

The automation industry typically uses proprietary and expensive products. All of them are marketed as flexible cloud products. The truth is that most of them are not really cloud-native. This means they are not scalable and not extendible as you would expect from a cloud-native service. Some problems I have heard from end users in the past:

  • Scalability is not given. Many products don’t use a scalable architecture. Instead, additional black boxes or monoliths are added if you need to scale. This challenges uptime SLAs and increases the operations burden and cost.
  • Some vendors just deploy their legacy on-premise solution onto AWS EC2 instances and call it a cloud product. This is far from being cloud-native.
  • Even the IoT solutions from some global cloud providers do not scale well enough to implement large IoT projects (e.g., connecting one million cars). I have seen several companies evaluate such a cloud-native MQTT solution and then go to a dedicated MQTT provider afterwards (but still use the cloud provider for its other great services, like compute resources, analytics services, etc.).
  • Most complete IoT solutions use many different components and code bases under the hood. Even if you buy one single product, the infrastructure usually uses various technologies and (OEM) vendors. This makes development, testing, roll-out and 24/7 end-to-end operations much harder and more costly.

Long Product Lifecycles with Proprietary Protocols

In the automation industry, product lifecycles are very long (tens of years). Simple changes or upgrades are not possible.

IIoT usually uses incompatible protocols. Typically, not just the products, but also these protocols are proprietary. They are just built for hardware and machines of one specific vendor. Siemens S7, Modbus, Allen Bradley, Beckhoff ADS, to name a few of these protocols and “standards”.

OPC-UA is supported more and more. This is a real standard. However, it has all the pros and cons of an open standard. It is often poorly implemented by the vendors and requires an app server on top of the PLC.

In most scenarios I have seen, connectivity to machines and sensors from different vendors is required, too. Even in one single plant, you typically have many different technologies and communication paradigms to integrate.

There is More than just the Machines and PLCs…

No matter how you integrate with the shop floor machines, this is just the first step to get data correlated with other systems in your enterprise. Connectivity to all the machines on the shop floor is not sufficient. Integration and combination of the IoT data with other enterprise applications is crucial. You also want to provide and sell additional services in the data center or cloud. In addition, you might even want to integrate with partners and suppliers. This adds additional value to your product: You can offer features you could not provide on your own.

Therefore, an IoT COTS solution does not solve all the challenges. There is huge demand to build an open, flexible, scalable platform. This is the reason why Apache Kafka comes into play in many projects as event streaming platform.

Solution #2 => Apache Kafka as Event Streaming Platform for the Digital Twin Infrastructure

Apache Kafka provides many business and technical characteristics out-of-the-box:

  • Cost reduction due to open core principle
  • No vendor lock-in
  • Flexibility and Extensibility
  • Scalability
  • Standards-based Integration
  • Infrastructure-, vendor- and technology-independent
  • Decoupling of applications and machines

Apache Kafka – An Immutable, Distributed Log for Real Time Processing and Long Term Storage

I assume you already know Apache Kafka: The de-facto standard for real-time event streaming. Apache Kafka provides the following characteristics:

  • Open-source (Apache 2.0 License)
  • Global-scale
  • High volume messaging
  • Real-time
  • Persistent storage for backup and decoupling of sources and sinks
  • Connectivity (via Kafka Connect)
  • Continuous Stream processing (via Kafka Streams)

Apache Kafka with Connect and Streams API

This is all included within the Apache Kafka framework. Fully independent of any vendor. Vibrant community with thousands of contributors from many hundreds of companies. Adoption all over the world in any industry.

If you need more details about Apache Kafka, check out the Kafka website, the extensive Confluent documentation, or hundreds of free video recordings and slide decks from all Kafka Summit events to learn about the technology and use cases.

How is Kafka related to Industrial IoT, shop floors and building digital twins? Let’s take a quick look at Kafka’s capabilities for integration and continuous data processing at scale.

Connectivity to Industrial Control Systems (ICS)

Kafka can connect to different functional levels of a manufacturing control operation:

Functional levels of a Distributed Control System (DCS)

  • Integrate directly to Level 1 (PLC / DCS): Kafka Connect or any other Kafka client (Java, Python, .NET, Go, JavaScript, etc.) can connect directly to PLCs or an OPC-UA server. The data gets ingested directly from the machines and devices (a simplified Python sketch follows this list). Check out “Apache Kafka, KSQL and Apache PLC4X for IIoT Data Integration and Processing” for an example of integrating with different PLC protocols like Siemens S7 and Modbus in real time at scale.
  • Integrate on level 2 (Plant Supervisory) or even above on 3 (Production Control) or 4 (Production Scheduling): For instance, you integrate with MES / ERP / SCM systems via REST / HTTP interfaces or any MQTT-supported plant supervisory or production control infrastructure. The blog post “IoT and Event Streaming at Scale with MQTT” discusses a few different options.
  • What about Level 0 (field level)? Today, IT typically only connects to the PLC or DCS. IT does not directly integrate with sensors and actuators in the field bus, switch, or end node. This will probably change in the future. Sensors get smarter and more powerful, and new standards for the “last mile” of the network emerge, like 10BASE-T1L.
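
As referenced above, here is an illustrative Python stand-in for the Level-1 integration: it publishes machine readings to Kafka, keyed by machine ID. In production this role is typically played by Kafka Connect with Apache PLC4X or an OPC-UA client rather than hand-written polling; the machine IDs, topic name, and payload shape are invented for the sketch.

```python
import json
import random
import time
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})

while True:
    for machine_id in ('press-01', 'press-02', 'welder-07'):
        # Simulated sensor values standing in for real PLC/OPC-UA reads.
        reading = {'machine': machine_id,
                   'temperature_c': round(random.gauss(70, 5), 2),
                   'vibration_mm_s': round(random.gauss(2.0, 0.3), 2),
                   'ts': int(time.time() * 1000)}
        # Keying by machine ID keeps each machine's events ordered in one partition,
        # which a digital twin needs to track state transitions correctly.
        producer.produce('plc-sensor-readings', key=machine_id, value=json.dumps(reading))
    producer.poll(0)
    time.sleep(1)
```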

We discussed different integration levels between IT and OT infrastructure. However, connecting from IT to the PLC and the shop floor is just half of the story.

Connectivity and Data Ingestion to MES, ERP, SCM, Big Data, Cloud and the Rest of the Enterprise

Kafka Connect enables reliable and scalable integration of Kafka with any other system. This is important as you need to integrate and correlate sensor data from PLCs with applications and databases from the rest of the enterprise. Kafka Connect provides sources and sinks to

  • Databases like Oracle or SAP Hana
  • Big Data Analytics like Hadoop, Spark or Google Cloud Machine Learning
  • Enterprise Applications like ERP, CRM, SCM
  • Any other application using custom connectors or Kafka clients

Continuous Stream Processing and Streaming Analytics in Real Time on the Digital Twin with Kafka

A pipeline from machines to other systems in real time at scale is just part of the full story. You also need to continuously process the data. For instance, you implement streaming ETL, real time predictions with analytic models, or execute any other business logic.

Kafka Streams allows you to write standard Java apps and microservices to continuously process your data in real time with a lightweight stream processing Java API. You could also use ksqlDB to do stream processing with SQL semantics. Both frameworks use Kafka natively under the hood. Therefore, you leverage all Kafka-native features like high throughput, high availability, and zero downtime out of the box.

The integration with other streaming solutions like Apache Flink, Spark Streaming, AWS Kinesis or other commercial products like TIBCO StreamBase, IBM Streams or Software AG’s Apama is possible, of course.

Kafka is so great and widely adopted because it decouples all sources and sinks from each other leveraging its distributed messaging and storage infrastructure. Smart endpoints and dumb pipes is a key design principle applied with Kafka automatically to decouple services through a Domain-Driven Design (DDD):

Kafka Domain Driven Design (DDD) with Kafka Streams KSQL and Flink Spark Streaming

Kafka + Machine Learning / Deep Learning = Real Time Predictions in Industrial IoT

The combination of Kafka and Machine Learning for digital twins is not different from any other industry. However, IoT projects usually generate big data sets through real-time streaming sensor data. Therefore, Kafka + Machine Learning makes even more sense in IoT projects for building digital twins than in many other projects where big data sets are lacking.

Analytic models need to be applied at scale. Predictions often need to happen in real time. Data preprocessing, feature engineering and model training also need to happen as fast as possible. Monitoring the whole infrastructure in real time at scale is a key requirement, too. This includes not just the infrastructure for the machines and sensors in the shop floor, but also the overall end-to-end edge-to-cloud infrastructure.

With this in mind, you quickly understand that Machine Learning / Deep Learning and Apache Kafka are very complementary. I have covered this in detail in many other blog posts and presentations.

Example for Kafka + ML + IoT: Embedded System at the Edge using an Analytic Model in the Firmware

Let’s discuss a quick example for Kafka + ML at the IoT edge: Embedded systems are very inflexible. It is hard to change code or business rules. The code in the firmware, which implements the business rules, sits between the input and output of the hardware. However, new code or business rules cannot simply be deployed with a script or continuous delivery as you know it from your favorite Java / .NET / Go / Python application and tools like Maven and Jenkins. Instead, each change requires a new, long development lifecycle including tests, certifications, and a manufacturing process.

There is another option instead of writing and embedding business rules in code through this complex and costly process: Analytic models can be trained on historical data. This can happen anywhere. For instance, you can ingest sensor data into a data lake in the cloud via Kafka. The model is trained in the elastic and flexible cloud infrastructure. Finally, this model is deployed to an embedded system, i.e., a new firmware version is created with this model (using the long, expensive process). However, updating (i.e., improving) the model that is already deployed on the embedded system gets much easier because no code has to be changed. The model is “just” re-trained and improved.

This way, business rules can be updated and improved by improving the already deployed model in the embedded system. No new development lifecycle, testing, certification, or manufacturing process is required. DNP/AISS1 from SSV Software is one example of a hardware starter kit with pre-installed ML algorithms.

Solution #3 => Kafka + IIoT COTS Product as Complementary Solutions for the Digital Twin

The above sections described how to use either an IoT COTS product or an event streaming platform like Apache Kafka and its ecosystem for building a digital twin infrastructure. Interestingly, Kafka and IoT COTS are actually combined in most deployments I have seen so far.

Most IoT COTS products provide out-of-the-box Kafka connectors because end users are asking for this feature all the time. Let’s discuss the combination of Kafka and other IoT products in more detail.

Kafka is Complementary to Industry Solutions such as Siemens MindSphere or Cisco Kinetic

Kafka can be used in very different scenarios. It is best suited for building a scalable, reliable, and open infrastructure to integrate IoT at the edge with the rest of the enterprise.

However, Kafka is not the silver bullet for every problem. A specific IoT product might be the better choice for integration and processing of IoT interfaces and shop floors. This depends on the specific requirements, existing ecosystem and already used software. Complexity and cost of solutions need to be evaluated. “Build vs. buy” is always a valid question. Often, the best choice and solution is a mix of building an open, flexible, self-built central streaming infrastructure and buying COTS for specific integration and processing scenarios.

Different Combinations of Kafka and IoT Solutions

The combination of an event streaming platform like Kafka with one or more other IoT products or frameworks is very common:

Apache Kafka and IIoT COTS Solutions MindSphere Kinetic Azure IoT

Sometimes, the shop floor connects to an IoT solution that is used as a gateway or proxy. This can be a broad, powerful (but also more complex and expensive) solution like Siemens MindSphere, or the choice is to deploy “just” a specific, more lightweight solution to solve one problem. For instance, HiveMQ could be deployed as a scalable MQTT cluster to connect to machines and devices. This IoT gateway or proxy connects to Kafka. Kafka is either deployed in the same infrastructure or in another data center or cloud. The Kafka cluster connects to the rest of the enterprise.
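
To illustrate the "MQTT gateway in front of Kafka" pattern, here is a hypothetical Python bridge: devices publish to an MQTT broker (e.g. a HiveMQ cluster), and a small process forwards the messages into Kafka. In practice you would rather use Confluent's MQTT Proxy or an MQTT source connector; broker addresses and topic names here are placeholders.

```python
import json
import paho.mqtt.client as mqtt          # paho-mqtt 1.x API
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})

def on_message(client, userdata, message):
    # Wrap the raw MQTT message and forward it into Kafka, keyed by MQTT topic.
    event = {'mqtt_topic': message.topic, 'payload': message.payload.decode('utf-8')}
    producer.produce('iot-machine-events', key=message.topic, value=json.dumps(event))
    producer.poll(0)

mqtt_client = mqtt.Client()
mqtt_client.on_message = on_message
mqtt_client.connect('mqtt-broker.local', 1883)
mqtt_client.subscribe('factory/+/sensors/#')
mqtt_client.loop_forever()
```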

In other scenarios, Kafka is used as IoT gateway or proxy to connect to the PLCs or Distributed Control System (DCS) directly. Kafka then connects to an IoT Solution like AWS IoT or Google Cloud’s MQTT Bridge where further processing and analytics happen.

Communication is often bidirectional, no matter what architecture you choose. This means the data is ingested from the shop floor, processed and correlated, and finally, control events are sent back to the machines. For instance, in predictive analytics, you first train analytic models with tools like TensorFlow in the cloud. Then you deploy the analytic model at the edge for real-time predictions.

Eclipse ditto – An Open Source Framework Dedicated to the Digital Twin; with Kafka Integration

There are not just commercial IoT solutions on the market, of course. Kafka is complementary to COTS IoT solutions and to open source IoT frameworks. Eclipse IoT alone provides various different IoT frameworks. Let’s take a look at one of them, which fits perfectly into this blog post:

Eclipse ditto is an open source framework for building a digital twin. Thanks to Kafka’s decoupling principle, it is straightforward to leverage other frameworks for specific tasks.

ditto was created to help realize digital twins with an open source API. It provides features like Device-as-a-Service, state management for digital twins, and APIs to organize your set of digital twins. Kafka integration is built into ditto out of the box. Therefore, ditto and Kafka complement each other very well to build a digital twin infrastructure.

The World is Hybrid and Polyglot

The world is hybrid and polyglot in terms of technologies and standards. Different machines use different technologies and protocols. Each plant uses its own frameworks, products, and standards. Business units use different analytics tools and not always the same cloud provider. And so on…

Global Kafka Architecture for Edge / On Premises / Hybrid / Multi Cloud Deployments of the Digital Twin

Kafka is often not just the backbone to integrate IoT devices with the rest of the enterprise. More and more companies deploy multiple Kafka clusters in different regions, facilities, and cloud providers.

The right architecture for Kafka deployments depends on the use cases, SLAs and many other requirements. If you want to build a digital twin architecture, you typically have to think about edge AND data centers / cloud infrastructure:

Global Kafka Architecture with Edge Deployments

Apache Kafka as Global Nervous System for Streaming Data

Using Kafka as a global nervous system for streaming data typically means you spin up different Kafka clusters. The following scenarios are very common:

  • Local edge Kafka clusters on the shop floor: Each factory has its own Kafka cluster to integrate with the machines, sensors and assembly lines, but also with ERP systems, SCADA monitoring tools, and the workers’ mobile devices. This is typically a very small Kafka cluster with e.g. three brokers (which can still process ~100+ MB/sec). Sometimes, a single Kafka broker is deployed. This is fine if you do not need high availability and prefer low cost and very simple operations.
  • Central regional Kafka clusters: Kafka clusters are deployed in different regions. Each Kafka cluster is used to ingest, process and aggregate data from different factories in that region or from all cars within a region. These Kafka clusters are bigger than the local Kafka clusters as they need to integrate data from several edge Kafka clusters. The integration can be realized easily and reliably with Confluent Replicator or, in the future, maybe with MirrorMaker 2 (if it matures over time). Don’t use MirrorMaker 1 at all – you can find many good reasons on the web. Another option is to directly integrate Kafka clients deployed at the edge with a central regional Kafka cluster: either with a Kafka client using Java, C, C++, Python, Go or another programming language, or using a proxy in the middle, like Confluent REST Proxy, Confluent MQTT Proxy, or any MQTT broker outside the Kafka environment (see the REST Proxy sketch after this list). Find out more details about comparing different MQTT and HTTP-based IoT integration options for Kafka here.
  • Multi-region or global Kafka clusters: You can deploy one Kafka cluster in each region or continent and replicate the data between them (uni- or bidirectionally) in real time using Confluent Replicator. Or you can leverage the multi-data center replication feature from Confluent Platform to spin up one logical cluster over different regions. The latter provides automatic fail-over, zero data loss and much simpler operations on both the server and client side.
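
As a rough sketch of the “proxy in the middle” option mentioned above, the following Python snippet sends a single sensor reading from an edge device to a central regional Kafka cluster through Confluent REST Proxy. Host name, topic name and payload fields are hypothetical examples, not taken from a real deployment:

            # Sketch only: an edge device forwards one sensor reading to a central regional
            # Kafka cluster through Confluent REST Proxy over HTTP.
            # Host name, topic name and payload fields are hypothetical.
            import requests

            REST_PROXY = "http://central-rest-proxy.example.com:8082"   # assumed endpoint
            TOPIC = "factory42.machine-sensors"                         # assumed topic name

            record = {"records": [{"value": {"machine_id": "m-17", "temperature": 81.4}}]}

            response = requests.post(
                f"{REST_PROXY}/topics/{TOPIC}",
                json=record,
                headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
            )
            response.raise_for_status()
            print(response.json())  # partition and offset of the written record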

This is just a quick summary of deployment options for Kafka clusters at the edge, on premises or in the cloud. You typically combine different options to deploy a hybrid and global Kafka infrastructure. Often, you start small with one pipeline and a single Kafka cluster. Scaling up and rolling out the global expansion should be part of the planning from the beginning. I will speak in more detail about “architecture patterns and best practices for distributed, hybrid and global Apache Kafka deployments” at DevNexus in Atlanta in February 2020. This is a good topic for another blog post in 2020.

Polyglot Infrastructure – There is no AWS, GCP or Azure Cloud in China and Russia!

For global deployments, you need to choose the right cloud providers or build your own data centers in different regions.

For example, you might leverage Confluent Cloud, a fully managed Kafka service with usage-based pricing and enterprise-ready SLAs, available in Europe and the US on Azure, AWS or GCP. Confluent Cloud is a truly serverless approach: no need to think about Kafka brokers, operations, scalability, rebalancing, security, fine-tuning or upgrades.

US cloud providers do not offer cloud services in China; Alibaba is the leading cloud provider there. Kafka can be deployed on Alibaba Cloud. Or choose a generic cloud-native infrastructure like Kubernetes. Confluent Operator, a Kubernetes operator including CRDs and Helm charts, is a tool to support and automate provisioning and operations on any Kubernetes infrastructure.

No public cloud is available in Russia at all. The reason is mainly legal restrictions. Kafka has to be deployed on premises.

In some scenarios, the data from different Kafka clusters in different regions is replicated and aggregated. Anonymized sensor data from all continents can be correlated to find new insights, while specific user data might always stay in the country or region of origin.

Standardized Infrastructure Templates and Automation

Many companies build one general infrastructure template at a specific abstraction level. This template can then be deployed to different data centers and cloud providers in the same way. This standardizes and simplifies operations in global Kafka deployments.

Cloud-Native Kubernetes for Robust IoT and Self-Healing, Scalable Kafka Cluster

Today, Kubernetes is often chosen as the abstraction layer. Kubernetes is deployed and managed by the cloud provider (e.g. GKE on GCP) or by an operations team on premises. All required infrastructure on top of Kubernetes is scripted and automated with a template framework (e.g. Terraform). This can then be rolled out to different regions and infrastructures in the same standardized and automated way.

The same is applicable to the Kafka infrastructure on top of Kubernetes. Either you leverage existing tools like Confluent Operator or you build your own scripts and custom resource definitions. The Kafka Operator for Kubernetes has several features built in, like automated handling of failover, rolling upgrades and security configuration.

Find the right abstraction level for your Digital Twin infrastructure:

Abstraction Level VMware Cloud Kubernetes Confluent Kafka KSQL

You can also use tools like the open source Kafka Ansible scripts to deploy and operate the Kafka ecosystem (including components like Schema Registry or Kafka Connect), of course.

However, the beauty of a cloud-native infrastructure like Kubernetes is its self-healing and robust characteristics. Failure is expected. This means that your infrastructure continues to run without downtime or data loss in case of node or network failures.

This is quite important if you deploy a digital twin infrastructure and roll it out to different regions, countries and business units. Many failures are handled automatically (in terms of continuous operations without downtime or data loss).

Not every failure requires a P1 ticket and a call to the support hotline. The system continues to run while an ops team can replace defective infrastructure in the background without time pressure. This is exactly what you need to deploy robust IoT solutions to production at scale.

Apache Kafka as the Digital Twin for 100000 Connected Cars

Let’s conclude with a specific example to build a Digital Twin infrastructure with the Kafka ecosystem. We use an implementation from the automotive industry. But this is applicable to any scenario where you want to build and leverage digital twins. Read more about “Use Cases for Apache Kafka in Automotive Industry” here.

Honestly, this demo was not built with the idea of creating a digital twin infrastructure in mind. However, think about it (and take a look at some definitions, architectures and solutions on the web): Digital Twin is just a concept and software design pattern. Remember our definition from the beginning of the article: A digital twin is a digital replica of a living or non-living physical entity. Therefore, the decision of the right architecture and technology depends on the use case(s).

We built a demo which shows how you can integrate with tens or hundreds of thousands of IoT devices and process the data in real time. The demo use case is predictive maintenance (i.e. anomaly detection) in a connected car infrastructure to predict motor engine failures: “Building a Digital Twin for Streaming Machine Learning at Scale from 100000 IoT Devices with HiveMQ, Apache Kafka and TensorFlow“.

Streaming Machine Learning - Digital Twin for IIoT with Apache Kafka and TensorFlow

In this example, the data from 100000 cars is ingested and stored in the Kafka cluster, i.e. the digital twin, for further processing and real time analytics.

Kafka client applications consume the data for different use cases and at different speeds:

  1. Real time data pre-processing and data engineering using the data from the digital twin with Kafka Streams and KSQL / ksqlDB.
  2. Streaming model training (i.e. without a data lake in the middle) with the Machine Learning / Deep Learning framework TensorFlow and its Kafka plugin (part of TensorFlow I/O). In our example, we train two neural networks: an unsupervised Autoencoder for anomaly detection and a supervised LSTM.
  3. Model deployment for inference in real time on new car sensor events to predict potential failures in the motor engine (see the consumer sketch after this list).
  4. Ingestion of the data into another batch system, database or data lake (Oracle, HDFS, Elastic, AWS S3, Google Cloud Storage, whatever).
  5. Another consumer could be a real-time time series database like InfluxDB or TimescaleDB.
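
As a minimal sketch of such a consumer (use case 3 above), the following Python snippet reads car sensor events and scores them with a pre-trained autoencoder. It assumes a topic named car-sensor with JSON payloads, a saved Keras model file autoencoder.h5 and a made-up reconstruction-error threshold; the real demo code is in the linked Github project:

            # Sketch only: score car sensor events with a pre-trained Keras autoencoder.
            # Assumptions: topic 'car-sensor' with a JSON 'features' field, a model file
            # 'autoencoder.h5', and a made-up reconstruction-error threshold.
            import json
            import numpy as np
            from confluent_kafka import Consumer
            from tensorflow import keras

            model = keras.models.load_model("autoencoder.h5")
            THRESHOLD = 0.05

            consumer = Consumer({"bootstrap.servers": "kafka:9092",
                                 "group.id": "anomaly-scorer",
                                 "auto.offset.reset": "earliest"})
            consumer.subscribe(["car-sensor"])

            while True:
                msg = consumer.poll(1.0)
                if msg is None or msg.error():
                    continue
                features = np.array([json.loads(msg.value())["features"]])
                reconstruction = model.predict(features, verbose=0)
                error = float(np.mean(np.square(features - reconstruction)))
                if error > THRESHOLD:
                    print(f"Potential engine failure detected (reconstruction error {error:.4f})")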

Build your own Digital Twin Infrastructure with Kafka and its Open Ecosystem!

I hope this post gave you some insights and ideas. I described three options to build the infrastructure for a digital twin: 1) IoT COTS, 2) Kafka, 3) Kafka + IoT COTS. Many companies leverage Apache Kafka as the central nervous system for their IoT infrastructure in one way or the other.

Digital Twin is just one of many possible IoT use cases for an event streaming platform. Often, Kafka is “just” part of the solution. Pick and choose the right tools for your use cases. Evaluate the Kafka ecosystem and different IoT frameworks / solutions to find the best combination for your project. Don’t forget to include the vision and long-term planning in your decisions! If you plan it right from the beginning, it is (relatively) straightforward to start with a pilot or MVP. Then roll it out to production in the first plant. Over time, deploy a global Digital Twin infrastructure… 🙂

The post Apache Kafka as Digital Twin for Open, Scalable, Reliable Industrial IoT (IIoT) appeared first on Kai Waehner.

]]>
Apache Kafka in the Automotive Industry https://www.kai-waehner.de/blog/2019/11/22/apache-kafka-automotive-industry-industrial-iot-iiot/ Fri, 22 Nov 2019 05:51:28 +0000 https://www.kai-waehner.de/?p=1917 In November 2019, I had the pleasure to visit “Motor City” Detroit. I met with several automotive companies,…

The post Apache Kafka in the Automotive Industry appeared first on Kai Waehner.

]]>
In November 2019, I had the pleasure to visit “Motor City” Detroit. I met with several automotive companies, suppliers, startups and cloud providers to discuss use cases and architectures around Apache Kafka. I have worked with companies related to the German automotive industry for many years. It was great to see the ideas and current status of projects running overseas in the US.

Kai in Detroit

I am really excited about the role of Apache Kafka and its ecosystem in the automotive industry. Kafka became the central nervous system of many applications in various areas related to the automotive industry. Machine Learning also has more and more impact on these use cases.

A Long Journey – From Car Production and Sales to Digital Services

My trip to Detroit started with some private events. First, I watched the amazing college football game of the Michigan Wolverines hosting the Michigan State Spartans in front of 111000 fans in Ann Arbor. The next day I went to the Ford Piquette Plant in Detroit where Henry Ford and his team manufactured the first cars. The Henry Ford Museum was another great visit to learn more about the mastermind Henry Ford and the history of car manufacturing in the US.

The Car Industry is Changing…

I talked to many local people. The main excitement was around three topics in the state of Michigan: universities, sports and cars. I have to admit that talking about car companies created a much more frustrated and negative mood for many people who work in the factories and offices of car companies. They were happier discussing sports and universities. Many people are worried about the future of the automotive industry.

Thus, let’s think in more detail about the car business… A car company produces and sells cars. A supplier produces parts for the car. A vendor shop or independent company provides maintenance and repair services.

That was the situation for many decades. Welcome to a new era where car companies have to do much more than just manufacturing and selling cars to stay competitive.

Car business - From Manufacturing and Sales to Digital Services

I think people should not be worried, though. This is very exciting. While many old jobs will be done by machines and robots in the future, new tasks and job roles will be created for everybody.

Customer Behaviour has Changed…

Customer behaviour has changed significantly:

  • Customers expect digital services and integration with their own IT ecosystem (smartphone, music streaming service, smart home, and many other apps)
  • Customers look for repair services away from the car vendors because they are much cheaper but still provide good service
  • Repair shops use 3rd party suppliers for car parts because the original supplier is too expensive and does not add any value or better product quality
  • Tesla, Uber, Volkswagen, Daimler and many other automotive and non-automotive companies work on autonomous driving (in several steps and releases: automatic distance measurement on the highway —> automatic parking in the parking lot —> driving back to the entrance of the supermarket when you call your car via your smartphone —> real autonomous driving to the final destination)
  • Zipcar, car2go, drivenow and other companies provide car sharing services where car usage is paid by minutes and miles: You use your smartphone to locate the next car at the airport, drive to your home in the city, and leave the car in front of your house so that the next customer can pick it up
  • Uber, Lyft, Grab, Free Now (the most ridiculous company name you could have for ride sharing) and other services provide cheap ride sharing with fantastic user experience: You order a taxi via mobile app, select guest services (like choice of music, or whether you want a conversation with the driver), see the estimated time of arrival and expected cost, monitor the taxi’s location in real time, get transported, pay automatically via the integrated payment service, and collect points in the integrated and partnered loyalty systems (like Hilton in the case of Lyft or Miles & More in the case of Free Now, to name a few examples)

In short, companies in the automotive industry have to change their business model fundamentally to stay competitive. Otherwise, they will become the next “hardware supplier” for a tech giant. This would reduce profit, establish tough dependencies and create market pressure. Even the existence of the company is at risk, no matter if it has been on the market for 10 or 100 years already.

Digital Use Cases in Automotive Industry

The background about the history and new requirements of the automotive industry is important to understand. Every car-related company can add more value and innovation to its own business model.

The use cases can be separated into infrastructure, manufacturing and customer interaction (with some overlaps, of course).

Infrastructure for Connected Cars

A fully connected infrastructure with real time communication is required to build innovative use cases such as:

  • Connected cars: The “Hello World” example for discussing cutting edge use cases in the automotive industry. It includes more or less all challenges you have to solve in Internet of Things (IoT) scenarios: connectivity to millions of devices, large scale throughput, reliable communication over bad and low-bandwidth network connections, and real time processing requirements. Ingest and process sensor information from millions of cars in real time to correlate events. Provide bi-directional communication between cars and other applications. Do big data analytics in the backend to get new insights. Build additional services for the manufacturer, customers and partners.
  • Fleet Management: A specific example of real time correlation of events from various different systems like mobile apps, hardware integrated into cars and trucks, and backend systems like CRMs and payment systems. Various examples exist in different industries, including logistics of truck fleets, track&trace of package delivery, ridesharing, food services, and many more.
  • Emergency system: If your car shows abnormal behaviour or crashes into a tree, it automatically sends a crash notification (including details about the crash scenario) to the police and the nearest hospital so that help can be sent immediately.
  • Smart City and Smart Driving: Connect and correlate data from cars with other devices like traffic lights. Automotive companies partner with cities and other data providers to offer better security and a more comfortable driving experience. Collect and share basic safety messages. Monitor cars and send speed and slow-down warnings, wrong-way driver detection, and congestion alerts. Provide crowdsourced data to applications such as Waze, Google Maps, Twitter, Nextdoor.

Manufacturing and Industrial IoT (IIoT)

Producing cars is a very complex task. This includes the production of the car itself (the “hardware”) and the intelligent logic (the “software”) of the car. Both integrate deeply with each other to implement use cases such as:

  • Industrial IoT (IIoT) and smart manufacturing: Predictive maintenance, early part scrapping and improved product quality are some examples that improve the quality of the manufactured parts and reduce costs and risks. The analysis of car sensor data also allows implementing additional digital services on top of the plain car parts.
  • Autonomous Driving: One of the most interesting use cases. Real time processing of big data sets and cutting edge technologies such as neural networks allow car companies to build cars with autonomous driving features. This is still at an early stage and will probably take many years before it is available in Europe. But some US states and China are already pressing ahead with self-driving cars on public streets. This will take some more years in most places. In the meantime, you can already use less powerful but still impressive features like Tesla’s “summon car” feature. It does not work perfectly today, but this is just a matter of time.

Customer Interactions and Data Analytics

Connected infrastructures and self-driving cars are awesome. Another added value comes through the possibility to communicate directly with the drivers to provide additional services, for instance:

  • Remote Diagnostics: Analytics can happen at the edge (e.g. in the car) or in a data center / cloud. Autonomous driving needs to detect persons in real time, so this needs to happen in the car. Cross selling is an example where you correlate data between different backend systems and remote users or apps; therefore, this should happen in the backend. You just send the result of the correlation back to the car or the driver’s mobile app. Predictive maintenance can happen directly in the car or machine, or you correlate the events in the backend and just send alerts to the car or machine to stop it before the engine breaks. No matter where you deploy the analytics logic, it typically has to happen in real time: usually within milliseconds or seconds, but sometimes minutes or even hours are sufficient.
  • Identity Management: The days are over where you use a key (a piece of hardware) to open your car and start your engine. Car sharing services provide a mobile app. This is used for registration, payments and handing over the key (for the time of usage). Security (authentication, authorization and encryption of data) is getting more and more important in IoT. This is even true in Industrial IoT where security discussions did not exist at all in the past. Car sharing is a great example where this is a key requirement to deploy an infrastructure and service successfully. The next step several companies are working on (long live China!) is image recognition instead of using a physical key or smartphone to open your car and start the engine. Other use cases include improving the customer experience in the car. For instance, the default configuration of seat, radio and air conditioning can be based on the driver’s historical behaviour (stored forever in your favorite database) and the outside ecosystem (e.g. integration with a weather service to know how warm it is). You will probably also not need to carry your driver’s license with you in the future because the police will also just scan your face. I was pretty impressed when I just had to scan my face at Global Entry entering the US from Germany last week, instead of scanning my passport and entering all the annoying details about my travel and private information.
  • Aftersales: Car companies and manufacturers need to sell additional services to make more money and / or to make customers happy and loyal. This includes use cases like upselling (provide 100 extra horsepower via digital download for 24 hours to have some fun on the German Autobahn) and cross-selling (provide a 20% coupon for the partnered steak house at the next stop 10 miles away from your current location).
  • Payment Integration: In the US, it is common to pay for everything by credit card. In Europe, many shops still only accept cash. In the future, payments will be done automatically. You just go to the gas station or electric charging unit, refuel or charge your car, and leave, similar to exiting an Uber or Lyft car today. The payment is done via the integrated partner payment system of your choice. Once again, loyalty systems are integrated, too. Authentication and authorization are done via cameras and neural networks doing image recognition.
  • Data Monetization: Connected car infrastructures produce a lot of data. Depending on law, privacy options and other regulations, automotive companies can and will do their own analytics, but also sell the data or parts of it to partners (hopefully anonymized properly). Like Google or Facebook today, the automotive industry will be able to make a fortune with the data from the cars and all the connected partner systems. Most customers will probably agree to this “feature” because they get added value out of it, too (like discounts or “free” additional services).

Wow, this is a lot of use cases for digitalization of the automotive business, isn’t it? And there are many more to come…

Challenges and Requirements to Build Automotive Infrastructure at Scale

Let’s now quickly discuss the challenges and requirements to solve with all those exciting use cases:

  • Integration between many different consumer apps and backend systems (including many legacy and proprietary interfaces, and different communication paradigms like real time streaming, request-response, batch processing)
  • Big Data – sensors and millions of mobile apps from users produce a large set of (structured and unstructured) data continuously
  • Hybrid integration and deployments at the edge (e.g. cars), local proxies in different regions, backend applications in data centers or cloud, and SaaS applications
  • Global scale with zero downtime, rolling out the applications on different infrastructures (there is no AWS / Azure / GCP in China, only Alibaba, and there is no public cloud at all in Russia, just private clouds on premises).
  • Real time information is required in most scenarios to provide a good digital experience and added business value

This is a lot of challenges to solve. If you know Apache Kafka already well (and maybe also other traditional middleware), then you can probably guess why Kafka is such a good fit here.

Apache Kafka in the Automotive Industry

Now that we understand the various innovative use cases and their related challenges in the automotive industry, let’s (quickly) discuss why so many automotive companies and suppliers use Apache Kafka as the central nervous system for the discussed use cases.

Apache Kafka as Event Stream Platform and Central Nervous System

Kafka for Real Time Streaming at Scale

Car sensors produce data continuously. The more cars you connect, the more data you get. Kafka provides messaging and streaming at scale on a reliable, highly scalable infrastructure with zero downtime. Rolling upgrades, backwards compatibility, up- and downscaling at runtime, and other features are built in.

Kafka for Handling Backpressure and Decoupling of Services

Cars are decoupled from the backend systems and mobile apps. Kafka provides storage for backpressure handling and decoupling of services. It is not just a messaging system! Kafka also stores data as long as you want. For instance, a Kafka topic for log analytics is stored for a few days, while another Kafka topic is stored for years to analyze customer and payment transactions of the past. Order guarantees, timestamps and high availability are provided out-of-the-box. Connected applications and data stores can build a materialized view to leverage the data.
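
As a small illustration of this per-topic retention, the following sketch creates two topics with different retention periods using the Python AdminClient. Broker address, topic names and retention values are illustrative assumptions:

            # Sketch only: create two topics with different retention periods via the
            # Python AdminClient. Broker address, topic names and retention values are
            # illustrative assumptions.
            from confluent_kafka.admin import AdminClient, NewTopic

            admin = AdminClient({"bootstrap.servers": "kafka:9092"})

            topics = [
                # short-lived log analytics data: keep for ~3 days
                NewTopic("car-logs", num_partitions=6, replication_factor=3,
                         config={"retention.ms": str(3 * 24 * 60 * 60 * 1000)}),
                # payment transactions: keep for ~5 years for later reprocessing
                NewTopic("payments", num_partitions=6, replication_factor=3,
                         config={"retention.ms": str(5 * 365 * 24 * 60 * 60 * 1000)}),
            ]

            for topic, future in admin.create_topics(topics).items():
                future.result()  # raises an exception if topic creation failed
                print(f"Created topic {topic}")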

Kafka for Integration with Legacy and Modern Technologies

Car data needs to be integrated. Kafka is the backbone for backend systems (legacy like ERP and mainframe, plus modern SaaS like cloud CRMs and big data analytics). Connectivity via Kafka Connect provides Kafka-native integration at scale with any legacy or modern technology and communication paradigm: either directly from Kafka clients or via other technologies as a proxy (like MQTT to cars, WebSockets to mobile apps, proprietary protocols like Siemens S7 or open standards like OPC-UA to machines in plants). Check out my comparison between Kafka and Middleware (MQ, ETL, ESB) for more details.
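
As a minimal sketch of Kafka-native integration, the following snippet registers a connector via the Kafka Connect REST API from Python. It uses the FileStreamSource connector that ships with Apache Kafka; the Connect host, source file and target topic are assumptions:

            # Sketch only: register a connector via the Kafka Connect REST API.
            # FileStreamSource ships with Apache Kafka; host, file and topic are assumptions.
            import requests

            connector = {
                "name": "demo-file-source",
                "config": {
                    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                    "tasks.max": "1",
                    "file": "/var/log/machine-data.txt",   # assumed source file
                    "topic": "machine-data",               # assumed target topic
                },
            }

            resp = requests.post("http://connect:8083/connectors", json=connector)
            resp.raise_for_status()
            print(resp.json())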

Kafka for Continuous Stream Processing with Kafka Streams and ksqlDB

Car data needs to be processed and correlated in real time at scale with other backend databases and applications. Kafka Streams and ksqlDB provide Kafka-native capabilities to process data continuously, for use cases like streaming ETL, stateful aggregations (e.g. sliding windows) and building business applications with their own state (leveraging RocksDB under the hood, so there is no need for another database).
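
A minimal sketch of such a continuous query, assuming a car_sensor_stream stream with car_id and engine_temp columns already exists, could submit a windowed aggregation to the ksqlDB REST API like this:

            # Sketch only: submit a continuous windowed aggregation to the ksqlDB REST API.
            # The stream 'car_sensor_stream' and its columns are assumed to exist already.
            import requests

            statement = """
              CREATE TABLE engine_temp_stats AS
                SELECT car_id, AVG(engine_temp) AS avg_temp
                FROM car_sensor_stream
                WINDOW HOPPING (SIZE 5 MINUTES, ADVANCE BY 1 MINUTE)
                GROUP BY car_id;
            """

            resp = requests.post("http://ksqldb-server:8088/ksql",
                                 json={"ksql": statement, "streamsProperties": {}})
            resp.raise_for_status()
            print(resp.json())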

Machine Learning and Kafka in Automotive Use Cases

Kafka is used more and more in Machine Learning infrastructures. Some examples from tech giants are Uber, Netflix and Paypal. The big challenge in Machine Learning is the deployment at scale in a reliable way (for both model training and predictions). I covered this topic in detail in various blog posts, videos and demos. Check out this blog post to get started: “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka“.

Apache Kafka Open Source Ecosystem as Infrastructure for Machine Learning

Kafka for Data Engineering, Model Training + Deployment and Monitoring

In the Automotive Industry, Machine Learning needs to be applied at scale. Predictions often need to happen in real time. Therefore, more and more automotive-related use cases discussed in the above section leverage Kafka together with Machine Learning for different aspects like:

  • Data ingestion and processing with Kafka from cars and other systems, for both model training on historical data and real time predictions on new events, ideally using the same ingestion and pre-processing pipeline for both.
  • Streaming model training with Kafka, i.e. without an additional database in the middle (e.g. leveraging TensorFlow I/O and its native Kafka integration) instead of storing the data in a data lake (e.g. HDFS, AWS S3) for model training (see the sketch after this list).
  • Model deployment with Kafka at the edge (e.g. in autonomous driving for stopping the car via image recognition because a person is on the street in front of the car) or in the data center / cloud (e.g. predictions for cross selling by correlating real time behaviour with information from the loyalty and CRM systems in the backend). Check out the video and slides discussing “Event-driven Model Serving: Stream Processing vs. RPC” from Kafka Summit for more details.
  • Monitoring with Kafka (e.g. real time alerting, or distributed tracing to detect anomalies and the root cause of problems).
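
The following sketch illustrates the streaming model training idea with the Kafka integration of tensorflow-io. The exact API differs between tensorflow-io versions, and the topic name, payload format (comma-separated floats) and toy autoencoder are assumptions:

            # Sketch only: train a toy autoencoder directly on a Kafka topic with the
            # tensorflow-io Kafka integration. The exact API differs between versions;
            # topic name, payload format (4 comma-separated floats) and model are assumptions.
            import tensorflow as tf
            import tensorflow_io.kafka as kafka_io

            def parse(message):
                fields = tf.io.decode_csv(message, record_defaults=[[0.0]] * 4)
                return tf.stack(fields)

            dataset = (kafka_io.KafkaDataset(topics=["car-sensor-train"],
                                             servers="kafka:9092",
                                             group="tf-trainer", eof=True)
                       .map(parse)
                       .batch(32))

            model = tf.keras.Sequential([
                tf.keras.layers.Dense(8, activation="relu"),
                tf.keras.layers.Dense(2, activation="relu"),
                tf.keras.layers.Dense(8, activation="relu"),
                tf.keras.layers.Dense(4),
            ])
            model.compile(optimizer="adam", loss="mse")
            model.fit(dataset.map(lambda x: (x, x)), epochs=1)  # autoencoder: reconstruct input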

One huge advantage of using Kafka for Machine Learning infrastructures is solving the impedance mismatch between the data scientist (who loves rapid prototyping with Python and Jupyter) and the software developer / production engineer (who loves scalable and reliable Java applications).

Kafka Architecture – From Edge Deployments to Global Replication

The architecture for Kafka deployments depends on the use cases, SLAs and many other requirements. Automotive companies and suppliers typically have to think globally:

Global Kafka Architecture with Edge Deployments

Using Kafka as a global nervous system for streaming data typically means you spin up different Kafka clusters. The following scenarios are very common in the automotive industry:

  • Local edge Kafka clusters in the factories: Each factory has its own Kafka cluster to integrate with the machines, sensors and assembly lines, but also with ERP systems, SCADA monitoring tools, and the workers’ mobile devices. This is typically a very small Kafka cluster with e.g. three brokers (which can still process ~100+ MB/sec). Sometimes, just one single Kafka broker is deployed. This is fine if you do not need high availability and prefer low cost and very simple operations.
  • Central regional Kafka clusters: Kafka clusters are deployed in different regions. Each Kafka cluster is used to ingest, process and aggregate data from different factories in that region or from all cars within a region. These Kafka clusters are bigger than the local Kafka clusters as they need to integrate data from several edge Kafka clusters. The integration can be realized easily and reliably with Confluent Replicator or, in the future, maybe with MirrorMaker 2. Don’t use MirrorMaker 1 at all (you can find many good reasons on the web). Another option is to directly integrate Kafka clients deployed at the edge with a central regional Kafka cluster: either with a Kafka client using Java, C, C++, Python, Go or another programming language, or using a proxy in the middle, like Confluent REST Proxy, Confluent MQTT Proxy, or any MQTT broker outside the Kafka environment. Find out more details about comparing different MQTT and HTTP-based IoT integration options for Kafka here.
  • Global Kafka clusters: You can deploy one Kafka cluster in each region or continent and replicate the data between them (uni- or bidirectionally) in real time using Confluent Replicator. Or you can leverage the multi-data center replication feature from Confluent to spin up one logical cluster over different regions.

This is just a quick summary of deployment options for Kafka clusters at the edge, on premises or in the cloud. You typically combine different options to deploy a hybrid and global Kafka infrastructure. I will speak about different “architecture patterns for distributed, hybrid and global Apache Kafka deployments” at DevNexus in Atlanta in February 2020.

For example, you might leverage Confluent Cloud, a fully managed Kafka service with usage-based pricing and enterprise-ready SLAs, available in Europe and the US on Azure, AWS or GCP. But you need to use Alibaba Cloud in China and run self-managed Kafka there, deployed on Kubernetes and operated with Confluent Operator. In Russia, no public cloud is available at all; you need to deploy on premises and manage the cluster yourself. Then you replicate and aggregate anonymized sensor data from all continents to get new insights, while some specific user data might always stay in the country of origin and local region.

Live Demo – 100000 Connected Cars

We built a live demo (available on Github) to show how easy it is to set up a connected car infrastructure at scale (including a cutting edge streaming machine learning example for predictive maintenance in real time): Streaming Machine Learning at Scale from 100000 IoT Devices with HiveMQ, Apache Kafka and TensorFlow:

Machine Learning at Scale in IoT with Kafka, MQTT, TensorFlow and Kubernetes

Such a setup can be the foundation for most of the discussed use cases in this blog post.

You can also take a look at the following link to see the slide deck and video recording. They discuss the demo architecture and walk you through the live demo:

Please let me know your thoughts about Apache Kafka in the automotive industry. Try out the demo and share your feedback. Let’s build some exciting new and innovative use cases for the automotive industry.

The post Apache Kafka in the Automotive Industry appeared first on Kai Waehner.

]]>
IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow https://www.kai-waehner.de/blog/2019/11/08/live-demo-iot-100-000-connected-cars-kubernetes-kafka-mqtt-tensorflow/ Fri, 08 Nov 2019 13:38:34 +0000 https://www.kai-waehner.de/?p=1890 Live Demo - 100.000 Connected Cars - Real Time Processing and Analytics with Kubernetes, Kafka, MQTT and TensorFlow leveraging Confluent and HiveMQ.

The post IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow appeared first on Kai Waehner.

]]>
You want to see an Internet of Things (IoT) example at huge scale? Not just 100 or 1000 devices producing data, but a really scalable demo with millions of messages from tens of thousands of devices? This is the right demo for you! We leverage Kubernetes, Apache Kafka, MQTT and TensorFlow.

The demo shows how you can integrate with tens or hundreds of thousands IoT devices and process the data in real time. The demo use case is predictive maintenance (i.e. anomaly detection) in a connected car infrastructure to predict motor engine failures:

IoT Use Case - Kafka MQTT TensorFlow and Kubernetes

IoT Infrastructure – MQTT and Kafka on Kubernetes

We deploy Kubernetes, Kafka, MQTT and TensorFlow in a scalable, cloud-native infrastructure to integrate and analyse sensor data from 100000 cars in real time. The infrastructure is built with Terraform. We use GCP, but you could do the same on AWS, Azure, Alibaba or on premises.

Data processing and analytics is done in real time at scale with GCP GKE, HiveMQ, Confluent and TensorFlow I/O for streaming machine learning / deep learning and bi-directional communication in a scalable, elastic and reliable infrastructure:

IoT Architecture - Kafka MQTT TensorFlow and Kubernetes

Github Project – 100000 Connected Cars

The project is available on Github. You can set the demo up in ~30min by just installing a few CLI tools and executing two or three shell scripts.

Check out the Github project “Streaming Machine Learning at Scale from 100000 IoT Devices with HiveMQ, Apache Kafka and TensorFlow“.

Please try out the demo. Feedback and PRs are welcome.

20min Live Demo – IoT at Scale on GCP with GKE, Confluent, HiveMQ and TensorFlow IO

Here is the video recording of the live demo:

If your area of interest is Industrial IoT (IIoT), you might also check out the following example. It covers the integration of machines and PLCs like Siemens S7, Modbus or Beckhoff in factories and shop floors:

Apache Kafka, KSQL and Apache PLC4X for IIoT Data Integration and Processing

The post IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow appeared first on Kai Waehner.

]]>
Apache Kafka and Machine Learning for Real Time Supply Chain Optimization in IIoT https://www.kai-waehner.de/blog/2019/08/23/apache-kafka-machine-learning-for-real-time-supply-chain-iiot-opcua-modbus/ Fri, 23 Aug 2019 06:59:43 +0000 http://www.kai-waehner.de/blog/?p=1548 Apache Kafka and Machine Learning for Real Time Supply Chain Optimization: Integrate in real time with the legacy world and proprietary IIoT protocols (like Siemens S7, Modbus, Beckhoff ADS, OPC-UA, et al). You can process the data at scale and then ingest it into a modern database or analytic / machine  learning framework.

The post Apache Kafka and Machine Learning for Real Time Supply Chain Optimization in IIoT appeared first on Kai Waehner.

]]>
I did a webinar with Confluent‘s partner Expero about “Apache Kafka and Machine Learning for Real Time Supply Chain Optimization“. This is a great example for anybody in the automation industry / Industrial IoT (IIoT), like automotive, manufacturing, logistics, etc.

We explain how a real time event streaming platform can integrate in real time with the legacy world and proprietary IIoT protocols (like Siemens S7, Modbus, Beckhoff ADS, OPC-UA, et al). You can process the data at scale and then ingest it into a modern database (like AWS S3, Snowflake or MongoDB) or an analytics / machine learning framework (like TensorFlow, PyTorch or Azure Machine Learning Service).

Here is the architecture we use to discuss and implement the supply chain optimization use case leveraging real time stream processing and machine learning:

Apache Kafka TensorFlow IIoT Supply Chain Optimization

We leverage various components from the Apache Kafka ecosystem. This includes:

  • Kafka Connect as scalable and reliable integration framework
  • Kafka Connect connectors like PLC4X – a great Apache framework to integrate with IIoT protocols
  • KSQL for continuous processing (filter, transform, aggregate) of the sensor data
  • Kafka Streams to deploy the trained analytic models to a real time streaming scoring engine

Use Case: Supply Chain Optimization using Real Time and Batch Processes

Automating multifaceted, complex workflows requires hybrid solutions that include streaming analytics of IoT data as well as batch analytics. This includes machine learning solutions and real time visualization. Leaders in organizations who are responsible for global supply chain planning have to work with and integrate data from disparate sources around the world. Many of these data sources output information in real time. This assists planners in operationalizing plans and interacting with manufacturing output. IoT sensors on manufacturing equipment and inventory control systems feed real time processing pipelines to match actual production figures against planned schedules to calculate yield efficiency.

Using information from both real time systems and batch optimization, supply chain managers are able to economize operations and automate tedious inventory and manufacturing accounting processes. Sitting on top of all of these systems is a supply chain visualization tool. This gives users visibility over the global supply chain. If you are responsible for key data integration initiatives, join us for a detailed walkthrough of a customer’s use of this system built with Confluent and Expero tools.

What will you learn?

  • See different use cases in automation industry and Industrial IoT (IIoT) where an event streaming platform adds business value
  • Understand different architecture options to leverage Apache Kafka and Confluent Platform in IoT scenarios in the cloud, on premise data centers and at the edge
  • Learn how to leverage different analytics tools and machine learning frameworks in a flexible and scalable way
  • How real time visualization ties together streaming and batch analytics for business users, interpreters, and analysts
  • Understand how streaming and batch analytics optimize the supply chain planning workflow.
  • Conceptualize the intersection between resource utilization and manufacturing assets with long term planning and supply chain optimization.

Industrial Internet of Things (IIoT) in Real Time at Scale with Apache Kafka

Here is the slide deck and video recording. Have fun watching it. Please let me know if you have any feedback or questions:

Slide Deck:

Video Recording:

Apache Kafka IIoT Supply Chain Video Recording

 

The post Apache Kafka and Machine Learning for Real Time Supply Chain Optimization in IIoT appeared first on Kai Waehner.

]]>
Apache Kafka + KSQL + TensorFlow for Data Scientists via Python + Jupyter Notebook https://www.kai-waehner.de/blog/2019/01/18/apache-kafka-ksql-data-scientists-python-jupyter-notebook/ Fri, 18 Jan 2019 14:15:22 +0000 http://www.kai-waehner.de/blog/?p=1401 Streaming Processing with Apache Kafka and KSQL for Data Scientists via Python and Jupyter Notebooks to build analytic models with TensorFlow and Keras.

The post Apache Kafka + KSQL + TensorFlow for Data Scientists via Python + Jupyter Notebook appeared first on Kai Waehner.

]]>
Why would a data scientist use Kafka, Jupyter, Python, KSQL and TensorFlow all together in a single notebook?

There is an impedance mismatch between model development using Python and its Machine Learning tool stack and a scalable, reliable data platform. The former is what you need for quick and easy prototyping to build analytic models. The latter is what you need for data ingestion, preprocessing, model deployment and monitoring at scale. It comes with low latency, high throughput, zero data loss and 24/7 availability requirements.

This is the main reason I see in the field why companies struggle to bring analytic models into production to add business value. Python in practice is not the best-known technology for large scale, performant and reliable environments. However, it is a great tool for data scientists and a great client of a data platform like Apache Kafka.

Therefore, I created a project to demonstrate how this impedance mismatch can be solved. A much more detailed blog post about this topic will come on Confluent Blog soon. In this blog post here, I want to discuss and share my Github project:

“Making Machine Learning Simple and Scalable with Python, Jupyter Notebook, TensorFlow, Keras, Apache Kafka and KSQL“. This project includes a complete Jupyter demo which combines:

  • Simplicity of data science tools (Python, Jupyter notebooks, NumPy, Pandas)
  • Powerful Machine Learning / Deep Learning frameworks (TensorFlow, Keras)
  • Reliable, scalable event-based streaming technology for production deployments (Apache Kafka, Kafka Connect, KSQL).

If you want to learn more about the relation between the Apache Kafka open source ecosystem and Machine Learning, please check out these two blog posts:

Let’s quickly describe these components and then take a look at the combination of them in a Jupyter notebook.

Python, Jupyter Notebook, Machine Learning / Deep Learning

Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages. Therefore, it is a great tool to build analytic models using Python and machine learning / deep learning frameworks like TensorFlow.

Using Jupyter notebooks (or similar tools like Google’s Colab or Hortonworks’ Zeppelin) together with Python and your favorite ML framework (TensorFlow, PyTorch, MXNet, H2O, “you-name-it”) is the best and easiest way to do prototyping and building demos.

However, building prototypes or even sophisticated analytic models in a Jupyter notebook with Python is a different challenge than building a scalable, reliable and performant machine learning infrastructure. I always refer to the great paper Hidden Technical Debt in Machine Learning Systems for this discussion:

Hidden Technical Debt in Machine Learning Systems

Think about use cases where you CANNOT go into production without large scale. For instance, connected car infrastructures, payment and fraud detection systems or global web applications with millions of users. This is where the Apache Kafka ecosystem comes into play.

Apache Kafka and KSQL

Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation. It is written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency streaming platform for handling and processing real-time data feeds.

Confluent KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka. It provides an easy-to-use, yet powerful interactive SQL interface for stream processing on Kafka; without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic, fault-tolerant. It supports a wide range of streaming operations, for example data filtering, transformations, aggregations, joins, windowing, and sessionization.

Check out these slides and video recording from my talk at Big Data Spain 2018 in Madrid if you want to learn more about KSQL.

Kafka + Jupyter + Python to solve the hidden technical debt in Machine Learning

To solve the hidden technical debt in Machine Learning infrastructures, you can combine the benefits of ML-related tools and the Apache Kafka ecosystem:

  • Python tool stack like Jupyter, Pandas or scikit-learn
  • Machine Learning frameworks like TensorFlow, H2O or DeepLearning4j
  • Apache Kafka ecosystem including components like Kafka Connect for integration and Kafka Streams or KSQL for real time stream processing and model inference

The following diagram depicts an example of such an architecture:

Apache Kafka + Python + Jupyter + Machine Learning / Deep Learning

If you want to get a better understanding of the relation between the Apache Kafka ecosystem and Machine Learning / Deep Learning, check out the following material:

Example: Kafka + Jupyter + Python + KSQL + TensorFlow

Let’s now take a look at an example which combines all these technologies like Python, Jupyter, Kafka, KSQL and TensorFlow to build a scalable but easy-to-use environment for machine learning.

This Jupyter notebook is not meant to be perfect in terms of coding and ML best practices, but rather a simple guide on how to build your own notebooks where you can combine Python APIs with Kafka and KSQL.

Use Case: Fraud Detection for Credit Card Payments

We use a test data set of credit card payments from Kaggle as the foundation to train an unsupervised autoencoder to detect anomalies and potential fraud in payments.

Focus of this project is not just model training, but the whole Machine Learning infrastructure including data ingestion, data preprocessing, model training, model deployment and monitoring. All of this needs to be scalable, reliable and performant.

Leveraging Python + KSQL + Keras / TensorFlow from a Jupyter Notebook

The notebook walks you through the following steps:

  • Integrate with events from a Kafka stream,
  • Preprocess data with KSQL (transformations, aggregations, filtering, etc.)
  • Prepare data for model training with Python libraries, i.e. preprocess data with Numpy, Pandas and scikit-learn
  • Train an analytic model with Keras and TensorFlow using Python API
  • Predict data using the analytic model with Keras and TensorFlow using Python API
  • Deploy the analytic model to a scalable Kafka environment leveraging Kafka Streams or KSQL (not part of the Jupyter notebook, but links to demos are shared)

Here is a screenshot of the Jupyter notebook where we use the ksql-python API to

  • Connect to KSQL server
  • Create first KSQL STREAM based on Kafka topic
  • Do first SELECT query

Apache Kafka + KSQL + Python + Jupyter Notebook
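
A minimal sketch of these three steps with ksql-python could look like the following; the server URL, stream, topic and column names are placeholders (see the linked notebook for the real code):

            # Sketch only: the three steps with the ksql-python client. Server URL, stream,
            # topic and column names are placeholders.
            from ksql import KSQLAPI

            client = KSQLAPI("http://localhost:8088")          # 1. connect to the KSQL server

            client.create_stream(                              # 2. create a STREAM on a Kafka topic
                table_name="creditcard_payments",
                columns_type=["id bigint", "amount double", "customer varchar"],
                topic="creditcard_payments",
                value_format="JSON",
            )

            query = client.query("SELECT * FROM creditcard_payments LIMIT 5")   # 3. first SELECT
            for row in query:
                print(row)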

Check out the complete Jupyter Notebook to see how to combine Kafka, KSQL, NumPy, Pandas, TensorFlow and Keras to integrate and preprocess data and then train your analytic model.

Why should a Data Scientist use Kafka and KSQL at all?

Yes, you can also use Pandas, scikit-learn, TensorFlow transform, and other Python libraries in your Jupyter notebook. Please do so where it makes sense! This is not an “either … or” question. Pick the right tool for the right problem.

The key point is that the Kafka integration and KSQL statements allow you to

  • Use the existing environment the data scientist loves (including Python and Jupyter) and combine it with Kafka and KSQL to integrate and continuously process real time streaming data by using a simple Python wrapper API to execute KSQL queries.
  • Easily connect to streaming data instead of just historical batches of data (maybe from last day, week or month, e.g. coming in via CSV files).
  • Merge different concepts like streaming event-based sensor data coming from Kafka with Python programming concepts like generators or dictionaries which you can use with your Python data tools or ML frameworks like NumPy, Pandas or scikit-learn
  • Reuse the same logic for integration, preprocessing and monitoring and move it from your Jupyter notebook to large scale test and production systems.

Check out the complete Jupyter notebook to see a full example which combines Python, Kafka, KSQL, NumPy, Pandas, TensorFlow and Keras. In my opinion, this is a great combination and valuable for both data scientists and software engineers.

I would like to get your feedback. Do you see any value in this? Or does it not make any sense in your scenarios and use cases?

 

The post Apache Kafka + KSQL + TensorFlow for Data Scientists via Python + Jupyter Notebook appeared first on Kai Waehner.

]]>
Deep Learning Example: Apache Kafka + Python + Keras + TensorFlow + Deeplearning4j https://www.kai-waehner.de/blog/2018/11/27/deep-learning-example-apache-kafka-python-keras-tensorflow-deeplearning4j/ Tue, 27 Nov 2018 16:29:04 +0000 http://www.kai-waehner.de/blog/?p=1390 "Python + Keras + TensorFlow + DeepLearning4j + Apache Kafka + Kafka Streams" => New example added to my "Machine Learning + Kafka Streams Examples" Github project.

The post Deep Learning Example: Apache Kafka + Python + Keras + TensorFlow + Deeplearning4j appeared first on Kai Waehner.

]]>
I added a new example to my “Machine Learning + Kafka Streams Examples” Github project:

Python + Keras + TensorFlow + DeepLearning4j + Apache Kafka + Kafka Streams“.

This blog post discusses the motivation and why this is a great combination of technologies for scalable, reliable Machine Learning infrastructures. For more details about building Machine Learning / Deep Learning infrastructures leveraging the Apache Kafka open source ecosystem, check out these two blog posts:

Deep Learning with Python and Keras

The goal was to show how you can easily deploy a model developed with Python and Keras to a Java / Kafka ecosystem. Keras allows you to use different Deep Learning backends under the hood: TensorFlow, CNTK, or Theano. It acts as a high-level wrapper and is very easy to use, even for people new to Machine Learning. For these reasons, Keras is getting a lot of traction these days.

Deployment of Keras Models to Java / Kafka Ecosystem with Deeplearning4J (DL4J)

Machine Learning frameworks have different options for combining them with the Java platform (and therefore with the Apache Kafka ecosystem), like native Java APIs to load models or RPC interfaces to their framework-specific model servers.

Deeplearning4j seems to be becoming the de facto standard for deployment of Keras models (if you trust Google search). The Deep Learning framework specifically built for the Java platform added many features and bug fixes to its Keras Model Import and already supports many Keras concepts, getting better and better with every new (beta) release. I used 1.0.0-beta3 for my Github example. Please check here for a complete list of supported Keras features in Deeplearning4j, like layers, loss functions, activation functions, initializers and optimizers.

You can either fully train a model with Keras and “just” embed it into a Java application for model inference, or re-use the model or parts of it to improve the model with DL4J’s Java API (aka transfer learning).
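
As a minimal sketch of the Python side, the following snippet trains a toy Keras model and saves it as an HDF5 file, which is the format DL4J’s Keras model import loads on the Java side. The model and data are made up for illustration:

            # Sketch only: train a toy Keras model in Python and save it as an HDF5 file,
            # the format that DL4J's Keras model import can load on the Java side.
            import numpy as np
            from tensorflow import keras

            model = keras.Sequential([
                keras.layers.Dense(8, activation="relu", input_shape=(4,)),
                keras.layers.Dense(1, activation="sigmoid"),
            ])
            model.compile(optimizer="adam", loss="binary_crossentropy")

            x = np.random.rand(100, 4)
            y = (x.sum(axis=1) > 2).astype("float32")
            model.fit(x, y, epochs=5, verbose=0)

            model.save("model.h5")  # load this file from Java with DL4J's Keras model import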

Example: Python + Keras + TensorFlow + Apache Kafka + DL4J

I implemented a simple but still impressive example:

Development of an analytic model trained with Python, Keras and TensorFlow and deployment to Java and Kafka ecosystem. Simple to implement (as you see in the source code), but powerful, scalable and reliable.

This is no business case this time, but just a technical demonstration of a simple ‘Hello World’ Keras model. Feel free to replace the model with any other Keras model trained with your backend of choice. You just need to replace the model binary (and use a model which is compatible with DeepLearning4J’s model importer). Then you can embed it into your Kafka application for real time model inference.

Machine Learning Technologies

  • Python
  • DeepLearning4J
  • Keras – a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
  • TensorFlow – used as backend under the hood of Keras
  • DeepLearning4J’s KerasModelImport feature is used for importing the Keras / TensorFlow model into Java. The used model is its ‘Hello World’ model example.
  • The Keras model was trained with this Python script.

Apache Kafka and DL4J for Real Time Predictions

The trained model is embedded into a Kafka Streams application for real time predictions. Here is the core Kafka Streams logic where I use the Deeplearning4j API to do predictions:

Kafka Streams and Deeplearning4j Deployment of Keras / TensorFlow Model

The full source code of my unit test is here: Kafka Streams + DeepLearning4j + Keras + TensorFlow. Just do a “git clone” of the Github project and run the Maven build and it should work out-of-the-box without any configuration.

The post Deep Learning Example: Apache Kafka + Python + Keras + TensorFlow + Deeplearning4j appeared first on Kai Waehner.

]]>
Deep Learning KSQL UDF for Streaming Anomaly Detection of MQTT IoT Sensor Data https://www.kai-waehner.de/blog/2018/08/02/deep-learning-kafka-ksql-udf-anomaly-detection-mqtt-iot-sensor/ Thu, 02 Aug 2018 08:12:02 +0000 http://www.kai-waehner.de/blog/?p=1327 KSQL UDF for sensor analytics. Leverages the new API features of KSQL to build UDF / UDAF functions easily with Java to do continuous stream processing with Apache Kafka. Use Case: Connected Cars - Real Time Streaming Analytics using Deep Learning.

The post Deep Learning KSQL UDF for Streaming Anomaly Detection of MQTT IoT Sensor Data appeared first on Kai Waehner.

]]>
I built a scenario for a hybrid machine learning infrastructure leveraging Apache Kafka as the scalable central nervous system. The public cloud is used for training analytic models at extreme scale (e.g. using TensorFlow and TPUs on Google Cloud Platform (GCP) via Google ML Engine). The predictions (i.e. model inference) are executed on premise at the edge in a local Kafka infrastructure (e.g. leveraging Kafka Streams or KSQL for streaming analytics).

This post focuses on the on premise deployment. I created a Github project with a KSQL UDF for sensor analytics. It leverages the new API features of KSQL to build UDF / UDAF functions easily with Java to do continuous stream processing on incoming events.

Use Case: Connected Cars – Real Time Streaming Analytics using Deep Learning

Continuously process millions of events from connected devices (sensors of cars in this example):

Connected_Cars_IoT_Deep_Learning

I built different analytic models for this. They are trained in the public cloud leveraging TensorFlow, H2O and Google ML Engine. Model creation is not the focus of this example. The final model is ready for production already and can be deployed for doing predictions in real time.

Model serving can be done via a model server or natively embedded into the stream processing application. See the trade-offs of RPC vs. Stream Processing for model deployment and a “TensorFlow + gRPC + Kafka Streams” example here.

Demo: Model Inference at the Edge with MQTT, Kafka and KSQL

The Github project generates car sensor data and forwards it via Confluent MQTT Proxy to a Kafka cluster for KSQL processing and real time analytics.

This project focuses on the ingestion of data into Kafka via MQTT and processing of data via KSQL: MQTT_Proxy_Confluent_Cloud

A great benefit of Confluent MQTT Proxy is its simplicity for realizing IoT scenarios without the need for an MQTT broker. You can forward messages directly from the MQTT devices to Kafka via the MQTT Proxy. This reduces efforts and costs significantly. This is a perfect solution if you “just” want to communicate between Kafka and MQTT devices.
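
A minimal sketch of such an MQTT device could look like the following Python snippet, which publishes one simulated car sensor reading; the MQTT Proxy host, topic and payload fields are assumptions (the actual demo uses Mosquitto, as described below):

            # Sketch only: a simulated car publishes one sensor reading over MQTT; the
            # Confluent MQTT Proxy (or any MQTT broker) forwards it into Kafka.
            # Host, topic and payload fields are assumptions.
            import json
            import time
            import paho.mqtt.client as mqtt

            client = mqtt.Client(client_id="car-4711")
            client.connect("mqtt-proxy.example.com", 1883)
            client.loop_start()

            reading = {"car_id": "car-4711", "engine_temp": 92.3, "rpm": 3150, "ts": time.time()}
            client.publish("car/engine/sensor", json.dumps(reading), qos=1)

            client.loop_stop()
            client.disconnect()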

If you want to see the other part of the story (integration with sink applications like Elasticsearch / Grafana), please take a look at the Github project “KSQL for streaming IoT data“. This realizes the integration with ElasticSearch and Grafana via Kafka Connect and the Elastic connector.

KSQL UDF – Source Code

It is pretty easy to develop UDFs. Just implement the function in one Java method within a UDF class:

            @Udf(description = "apply analytic model to sensor input")
            public String anomaly(String sensorinput) {
                return "YOUR LOGIC"; // apply the imported analytic model to the sensor event here
            }

Here is the full source code for the Anomaly Detection KSQL UDF.

How to run the demo with Apache Kafka and MQTT Proxy?

All steps to execute the demo are described in the Github project.

You just need to install Confluent Platform and then follow these steps to deploy the UDF, create MQTT events and process them via KSQL leveraging the analytic model.

I use Mosquitto to generate MQTT messages. Of course, you can use any other MQTT client, too. That is the great benefit of an open and standardized protocol.

Hybrid Cloud Architecture for Apache Kafka and Machine Learning

If you want to learn more about the concepts behind a scalable, vendor-agnostic Machine Learning infrastructure, take a look at my presentation on Slideshare or watch the recording of the corresponding Confluent webinar “Unleashing Apache Kafka and TensorFlow in the Cloud“.

 

Please share any feedback! Do you like it, or not? Any other thoughts?

The post Deep Learning KSQL UDF for Streaming Anomaly Detection of MQTT IoT Sensor Data appeared first on Kai Waehner.

]]>
Model Serving: Stream Processing vs. RPC / REST with Java, gRPC, Apache Kafka, TensorFlow https://www.kai-waehner.de/blog/2018/07/09/model-serving-java-grpc-tensorflow-apache-kafka-streams-deeplearning-stream-processing-rpc-rest/ Mon, 09 Jul 2018 01:13:45 +0000 http://www.kai-waehner.de/blog/?p=1303 Machine Learning / Deep Learning models can be used in different ways to do predictions. Natively in the application or hosted in a remote model server. Then you combine stream processing with RPC / Request-Response paradigm. This blog post shows examples of stream processing vs. RPC model serving using Java, Apache Kafka, Kafka Streams, gRPC and TensorFlow Serving.

The post Model Serving: Stream Processing vs. RPC / REST with Java, gRPC, Apache Kafka, TensorFlow appeared first on Kai Waehner.

]]>
Machine Learning / Deep Learning models can be used in different ways to do predictions. My preferred way is to deploy an analytic model directly into a stream processing application (like Kafka Streams or KSQL). You could e.g. use the TensorFlow for Java API. This allows the best latency and independence from external services. Several examples can be found in my Github project: Model Inference within Kafka Streams Microservices using TensorFlow, H2O.ai, Deeplearning4j (DL4J).
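To illustrate this pattern (this is a sketch, not the exact code from the Github project), here is a minimal Kafka Streams application that loads a TensorFlow SavedModel once at startup and scores every incoming event locally. The model path, topic names, broker address and the tensor names "input" / "output" are assumptions; they depend on how your model was exported:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;

public class EmbeddedModelStreamsApp {

    public static void main(String[] args) {
        // Load the trained model once when the application starts (path is hypothetical)
        SavedModelBundle model = SavedModelBundle.load("/models/anomaly/1", "serve");

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> sensorEvents = builder.stream("SensorInputTopic");

        // Score every event locally - no remote call to a model server is involved
        sensorEvents
            .mapValues(value -> score(model, value))
            .to("PredictionOutputTopic");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "embedded-model-inference");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }

    private static String score(SavedModelBundle model, String csv) {
        String[] parts = csv.split(",");
        float[][] features = new float[1][parts.length];
        for (int i = 0; i < parts.length; i++) {
            features[0][i] = Float.parseFloat(parts[i].trim());
        }
        // The tensor names ("input" / "output") depend on how the model was exported
        try (Tensor<?> input = Tensor.create(features);
             Tensor<?> output = model.session().runner()
                     .feed("input", input).fetch("output").run().get(0)) {
            float[][] prediction = new float[1][(int) output.shape()[1]];
            output.copyTo(prediction);
            return Float.toString(prediction[0][0]);
        }
    }
}
```

Because the model is embedded in the application process, each prediction is just a local method call, which is exactly where the latency advantage over a remote model server comes from.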

However, direct deployment of models is not always a feasible approach. Sometimes it makes sense or is needed to deploy a model in another serving infrastructure like TensorFlow Serving for TensorFlow models. Model inference is then done via RPC / request-response communication. Organizational or technical reasons might force this approach. Or you might want to leverage the built-in features for managing and versioning different models in the model server.

So you combine stream processing with RPC / Request-Response paradigm. The architecture looks like the following:

(Figure: Model Serving – Stream Processing vs. Request-Response with Java, gRPC, Apache Kafka, TensorFlow)

Pros of an external model serving infrastructure like TensorFlow Serving:

  • Simple integration with existing technologies and organizational processes
  • Easier to understand if you come from a non-streaming world
  • Later migration to real streaming is also possible
  • Built-in model management and versioning for different models

Cons:

  • Worse latency, since a remote call is made instead of local inference
  • No offline inference (devices, edge processing, etc.)
  • Coupling the availability, scalability, and latency / throughput of your Kafka Streams application with the SLAs of the RPC interface
  • Side-effects (e.g. in case of failure) not covered by Kafka processing (e.g. Exactly Once)

Combination of Stream Processing and Model Server using Apache Kafka, Kafka Streams and TensorFlow Serving

I created the Github Java project “TensorFlow Serving + gRPC + Java + Kafka Streams” to demo how to do model inference with Apache Kafka, Kafka Streams and a TensorFlow model deployed using TensorFlow Serving. The concepts are very similar for other ML frameworks and Cloud Providers, e.g. you could also use Google Cloud ML Engine for TensorFlow (which uses TensorFlow Serving under the hood) or Apache MXNet and AWS model server.

Most ML servers for model serving are also extensible to serve other types of models and data, e.g. you could also deploy non-TensorFlow models to TensorFlow Serving. Many ML servers are available as a cloud service and for local deployment.

TensorFlow Serving

Let’s discuss TensorFlow Serving quickly. It can be used to host your trained analytic models. Like with most model servers, you can do inference via the request-response paradigm. gRPC and REST / HTTP are the two common technologies and concepts used.

The blog post “How to deploy TensorFlow models to production using TF Serving” is a great explanation of how to export and deploy trained TensorFlow models to a TensorFlow Serving infrastructure. You can either deploy your own infrastructure anywhere or leverage a cloud service like Google Cloud ML Engine. A SavedModel is TensorFlow’s recommended format for saving models, and it is the required format for deploying trained TensorFlow models using TensorFlow Serving or deploying on Google Cloud ML Engine.

The core architecture is described in detail in TensorFlow Serving’s architecture overview:

(Figure: TensorFlow Serving's architecture overview)

This architecture allows the deployment and management of different models and versions of these models, including additional features like A/B testing. In the following demo, we just deploy one single TensorFlow model for Image Recognition (based on the famous Inception neural network).

Demo: Mixing Stream Processing with RPC: TensorFlow Serving + Kafka Streams

Disclaimer: The following is a shortened version of the steps to do. For full example including source code and scripts, please go to my Github project “TensorFlow Serving + gRPC + Java + Kafka Streams“.

Things to do

  1. Install and start an ML Serving Engine
  2. Deploy prebuilt TensorFlow Model
  3. Create Kafka Cluster
  4. Implement Kafka Streams application
  5. Deploy Kafka Streams application (e.g. locally on laptop or to a Kubernetes cluster)
  6. Generate streaming data to test the combination of Kafka Streams and TensorFlow Serving

Step 1: Create a TensorFlow model and export it to ‘SavedModel’ format

I simply added an existing pretrained Image Recognition model built with TensorFlow. You just need to export a model using TensorFlow’s API and then use the exported folder. TensorFlow uses Protobuf to store the model graph and adds variables for the weights of the neural network.

Google ML Engine shows how to create a simple TensorFlow model for census predictions using the “ML Engine getting started guide“. In a second step, you can build a more advanced example for image recognition using Transfer Learning, following the guide “Image Classification using Flowers dataset“.

You can also combine cloud and local services, e.g. build the analytic model with Google ML Engine and then deploy it locally using TensorFlow Serving as we do.

Step 2: Install and start TensorFlow Serving server + deploy model

Different options are available. Installing TensorFlow Serving on a Mac is still a pain as of mid-2018; apt-get works much more easily on Linux operating systems. Unfortunately, there is nothing like a ‘brew’ command or a simple zip file you can use on a Mac. Alternatives:

  • You can build the project and compile everything using the Bazel build system – which literally takes forever (on my laptop), i.e. many hours.
  • Install and run TensorFlow Serving via a Docker container. This also requires building the project. In addition, the documentation is not very good and is outdated.
  • Preferred option for beginners => Use a prebuilt Docker container with TensorFlow Serving. I used an example from Thamme Gowda. Kudos to him for building a project which not only contains the TensorFlow Serving Docker image, but also shows an example of how to do gRPC communication between a Java application and TensorFlow Serving.

If you want to deploy your own model, read the guide “Deploy TensorFlow model to TensorFlow serving”. Or, to use a cloud service, take a look at e.g. “Getting Started with Google ML Engine”.

Step 3: Create Kafka Cluster and Kafka topics

Create a local Kafka environment (Apache Kafka broker + Zookeeper). The easiest way is the open source Confluent CLI – which is also part of Confluent Open Source and Confluent Enterprise Platform. Just type “confluent start kafka“.

You can also create a cluster using Kafka as a Service. The best option is Confluent Cloud – Apache Kafka as a Service. You can choose between Confluent Cloud Professional for “playing around” or Confluent Cloud Enterprise on AWS, GCP or Azure for mission-critical deployments, including a 99.95% SLA and very large scale of up to 2 GByte/second throughput. The third option is to connect to your existing Kafka cluster on premise or in the cloud (note that you need to change the broker URL and port in the Kafka Streams Java code before building the project).

Next, create the two Kafka topics for this example (‘ImageInputTopic’ for the URLs to the images and ‘ImageOutputTopic’ for the prediction results).
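The concrete commands are part of the Github project. As an alternative illustration, the topics could also be created programmatically with Kafka's AdminClient; the broker address and the single-partition, replication-factor-1 settings below are assumptions for a local demo:

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateImageTopics {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // One partition and replication factor 1 are enough for a local demo
            admin.createTopics(Arrays.asList(
                    new NewTopic("ImageInputTopic", 1, (short) 1),
                    new NewTopic("ImageOutputTopic", 1, (short) 1)))
                 .all().get();
        }
    }
}
```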

Step 4: Build and deploy Kafka Streams app + send test messages

The Kafka Streams microservice (i.e. Java class) “Kafka Streams TensorFlow Serving gRPC Example” is the Kafka Streams Java client. The microservice uses gRPC and Protobuf for request-response communication with the TensorFlow Serving server to do model inference to predict the content of the image. Note that the Java client does not need any TensorFlow APIs, but just gRPC interfaces.

This example executes a Java main method, i.e. it starts a local Java process running the Kafka Streams microservice. It waits continuously for new events arriving at ‘ImageInputTopic’, does a model inference (via a gRPC call to TensorFlow Serving) and then sends the prediction to ‘ImageOutputTopic’ – all in real time within milliseconds.
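Conceptually, the topology boils down to a single mapValues step that performs the remote call. The following sketch shows that shape; `ImageClassificationClient` is a hypothetical wrapper around the generated TensorFlow Serving gRPC stubs (the real project has its own client class), and the broker address and port 8500 are assumptions:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class TensorFlowServingStreamsApp {

    public static void main(String[] args) {
        // Hypothetical wrapper around the TensorFlow Serving gRPC stubs; it builds a
        // PredictRequest from the image reference and extracts the predicted label.
        ImageClassificationClient servingClient =
                new ImageClassificationClient("localhost", 8500);

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("ImageInputTopic")
               // One blocking gRPC round trip to TensorFlow Serving per event
               .mapValues(imageUrl -> servingClient.classify(imageUrl))
               .to("ImageOutputTopic");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "tensorflow-serving-inference");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```

The important design point is that the availability and latency of the remote model server now directly affect the throughput of this stream processing application, which is exactly the trade-off listed in the cons above.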

In the same way, you could deploy this Kafka Streams microservice anywhere – including Kubernetes (e.g. an on-premise OpenShift cluster or Google Kubernetes Engine), Mesosphere, Amazon ECS or even in a Java EE app – and scale it up and down dynamically.

Now send messages, e.g. with kafkacat, and use kafka-console-consumer to consume the predictions.

Once again, if you want to see source code and scripts, then please go to my Github project “TensorFlow Serving + gRPC + Java + Kafka Streams“.

The post Model Serving: Stream Processing vs. RPC / REST with Java, gRPC, Apache Kafka, TensorFlow appeared first on Kai Waehner.

]]>