Microservices Archives - Kai Waehner

Policy Enforcement and Data Quality for Apache Kafka with Schema Registry

Kai Waehner — Mon, 16 Oct 2023 06:18:22 +0000

Good data quality is one of the most critical requirements in decoupled architectures, like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry enforces message structures. This blog post looks at enhancements to leverage data contracts for policies and rules to enforce good data quality on field-level and advanced use cases like routing malicious messages to a dead letter queue.

From point-to-point and spaghetti to decoupled microservices with Apache Kafka

Point-to-point HTTP / REST APIs create tightly couple services. Data lakes and lakehouses enforce a monolithic architecture instead of open-minded data sharing and choice of the best technology for a problem. Hence, Apache Kafka became the de facto standard for microservice and data mesh architectures. And data streaming with Kafka complementary (not competitive!) to APIs, data lakes / lakehouses, and other data platforms.

What makes Apache Kafka so popular

A scalable and decoupled architecture as a single source of record for high-quality, self-service access to real-time data streams, but also batch and request-response communication.

Difference between Kafka and ETL / ESB / iPaaS

Enterprise integration is more challenging than ever before. The IT evolution requires the integration of more and more technologies. Companies deploy applications across the edge, hybrid, and multi-cloud architectures.

Point-to-point integration is not good enough. Traditional middleware such as MQ, ETL, ESB does not scale well enough or only processes data in batch instead of real-time. Integration Platform as a Service (iPaaS) solutions are cloud-native but only allow point-to-point integration.

Apache Kafka is the new black for integration projects. Data streaming is a new software category.

Domain-driven design, microservices, data mesh…

The approaches use different principles and best practices. But reality is that the key for a long-living and flexible enterprise architecture is decoupled, independent applications. However, these applications need to share data in good quality with each other.

Apache Kafka shines here. It decouples applications because of its event store. Consumers don’t need to know producers. Domains build independent applications with its own technologies, APIs and cloud services:

Replication between different Kafka clusters enables a global data mesh across data centres and multiple cloud providers or regions. But unfortunately, Apache Kafka itself misses data quality capabilities. That’s where the Schema Registry comes into play.

The need for good data quality and data governance in Kafka Topics

To ensure data quality in a Kafka architecture, organizations need to implement data quality checks, data cleansing, data validation, and monitoring processes. These measures help in identifying and rectifying data quality issues in real time, ensuring that the data being streamed is reliable, accurate, and consistent.

Why you want good data quality in Kafka messages

Data quality is crucial for most Kafka-based data streaming use cases for several reasons:

Real-time decision-making: Data streaming involves processing and analyzing data as it is generated. This real-time aspect makes data quality essential because decisions or actions based on faulty or incomplete data can have immediate and significant consequences.
Data accuracy: High-quality data ensures that the information being streamed is accurate and reliable. Inaccurate data can lead to incorrect insights, flawed analytics, and poor decision-making.
Timeliness: In data streaming, data must be delivered in a timely manner. Poor data quality can result in delays or interruptions in data delivery, affecting the effectiveness of real-time applications.
Data consistency: Inconsistent data can lead to confusion and errors in processing. Data streaming systems must ensure that data adheres to a consistent schema and format to enable meaningful and accurate analysis. No matter if a producer or consumer uses real-time data streaming, batch processing, or request-response communication with APIs.
Data integration: Data streaming often involves combining data from various sources, such as sensors, databases, and external feeds. High-quality data is essential for seamless integration and for ensuring that data from different sources can be harmonized for analysis.
Regulatory compliance: In many industries, compliance with data quality and data governance regulations is mandatory. Failing to maintain data quality in data streaming processes can result in legal and financial repercussions.
Cost efficiency: Poor data quality can lead to inefficiencies in data processing and storage. Unnecessary processing of low-quality data can strain computational resources and lead to increased operational costs.
Customer satisfaction: Compromised data quality in applications directly impacts customers, it can lead to dissatisfaction, loss of trust, and even attrition.

Rules engine and policy enforcement in Kafka Topics with Schema Registry

Confluent designed the Schema Registry to manage and store the schemas of data that are shared between different systems in a Kafka-based data streaming environment. Messages from Kafka producers are validated against the schema.

The Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving your Avro, JSON Schema, and Protobuf schemas. It stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, and allows evolution of schemas according to the configured compatibility settings and expanded support for these schema types.

Schema Registry provides serializers that plug into Apache Kafka clients that handle schema storage and retrieval for Kafka messages that are sent in any of the supported formats.

Schema Registry is available on GitHub under the Confluent Community License that allows deployment in production scenarios with no licensing costs. It became the de facto standard for ensuring data quality and governance in Kafka projects across all industries.

Enforcing the message structure as the foundation of good data quality

Confluent Schema Registry enforces message structure by serving as a central repository for schemas in a Kafka-based data streaming ecosystem. Here’s how it enforces message structure and rejects invalid messages:

Data messages produced by Kafka producers must adhere to the registered schema. A message is rejected if a message doesn’t match the schema. This behaviour ensures that only well-structured data are published and processes.

Schema Registry even supports schema evolution for data interoperability using different schema versions in producers and consumers. Find a detailed explanation and the limitations in the Confluent documentation.

Validation of schemas happens on the client side in Schema Registry. This is not good enough for some scenarios, like regulated markets, where the infrastructure provider cannot trust each data producer. Hence, Confluent’s commercial offering added broker-side schema validation.

Attribute-based policies and rules in data contracts

The validation of message schema is a great first step. However, many use cases require schema validation and policy enforcement on field level, i.e. validating each attribute of the message by itself with custom rules. Welcome to Data Contracts:

Disclaimer: The following add-on for Confluent Schema Registry is only available for Confluent Platform and Confluent Cloud. If you use any other Kafka service and schema registry, take this solution as an inspiration for building your own data governance suite – or migrate to Confluent

Data contracts support various rules, including data quality rules, field-level transformations, event-condition-action rules, and complex schema evolution. Look at the Confluent documentation “Data Contracts for Schema Registry” to learn all the details.

Data contracts and data quality rules for Kafka messages

As described in the Confluent documentation, a data contract specifies and supports the following aspects of an agreement:

Structure: This is the part of the contract that is covered by the schema, which defines the fields and their types.
Integrity constraints: This includes declarative constraints or data quality rules on the domain values of fields, such as the constraint that an age must be a positive integer.
Metadata: Metadata is additional information about the schema or its constituent parts, such as whether a field contains sensitive information. Metadata can also include documentation for a data contract, such as who created it.
Rules or policies: These data rules or policies can enforce that a field that contains sensitive information must be encrypted, or that a message containing an invalid age must be sent to a dead letter queue.
Change or evolution: This implies that data contracts are versioned, and can support declarative migration rules for how to transform data from one version to another, so that even changes that would normally break downstream components can be easily accommodated.

Example: PII data enforcing encryption and error-handling with a dead letter queue

One of the built-in rule types is Google Common Expression (CEL), which supports data quality rules.

Here is an example where a specific field is tagged as PII data. Rules can enforce good data quality or encryption of an attribute like the credit card number:

You can also configure advanced routing logic. For instance, error handling: If the expression “size(message.id) == 9” is not validated, then the streaming platform forwards the message to a dead letter queue for further processing with the configuration: “dlq.topic”: “bad-data”.

Dead letter queue (DLQ) is its own complex (but very important) topic. Check out the article “Error Handling via Dead Letter Queue in Apache Kafka” to learn from real-world implementations of Uber, CrowdStrike, Santander Bank, and Robinhood.

Data contracts as a foundation for new data streaming products and integration with Apache Flink

Schema Registry should be the foundation of any Kafka project. Data contracts enforce good data quality and interoperability between independent microservices. Each business unit and its data products can choose any technology or API. But data sharing with others works only with good (enforced) data quality.

No matter if you use Confluent Cloud or not, you can learn from this SaaS offering how schemas and data contracts enable data consistency and faster time to market for innovation. Products like Data Catalog, Data Lineage, Confluent Stream Sharing, or the out-of-the-box integration with serverless Apache Flink rely on a good internal data governance strategy with schemas and data contracts.

Do you already leverage data contracts in your Confluent environment? If you are not a Confluent user, how do you solve data consistency issues and enforce good data quality? Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Policy Enforcement and Data Quality for Apache Kafka with Schema Registry appeared first on Kai Waehner.

Apache Kafka for Data Consistency (and Real-Time Data Streaming)

Kai Waehner — Tue, 27 Dec 2022 08:46:04 +0000

Real-time data beats slow data in almost all use cases. But as essential is data consistency across all systems, including non-real-time legacy systems and modern request-response APIs. Apache Kafka’s most underestimated feature is the storage component based on the append-only commit log. It enables loose coupling for domain-driven design with microservices and independent data products in a data mesh. This blog post explores how Kafka enables data consistency with a real-world case study from financial services.

Apache Kafka = Real-time data streaming

Real-time beats slow data. It is that easy in almost all use cases. Ask any executive or business person: What’s better? If you can consume and use information now or later? The value of data goes down over time:

Apache Kafka is the de facto standard for real-time data streaming. Check out the data streaming landscape 2023 to learn more about Kafka-related products and cloud services.

So far, so good. However, one valid question always comes up: “Why is Apache Kafka different from a real-time message broker like RabbitMQ, IBM MQ, NATS, or Amazon SQS?”

TL;DR: Message brokers provide real-time messaging capabilities to produce and consume messages. Apache Kafka is a data streaming platform that combines messaging, storage, data integration, and stream processing capabilities.

My comparison of message brokers and data streaming explored the differences using ten characteristics. It is all about the storage component of Apache Kafka in the discussion of data consistency. Let’s explore why…

Real-time means many things…

… from deterministic systems with hard real-time up to minutes or even hours. Always define your requirements for real-time data processing and the end-to-end latency (not just the messaging component).

I clarified when to use Apache Kafka for real-time workloads in separate blog posts:

Data consistency = The biggest challenge of the enterprise architecture

Data consistency refers to whether the same data kept at different places does or does not match. The data is processed in many ways across the enterprise architecture:

Real-time: Message brokers or data streaming platforms transfer or process data when it is in motion.
Near real-time: Platforms ingest data into data lakes and data warehouses in seconds or minutes.
Batch: Reporting and analytics of historical data.
Request-response: Interactive API or SQL queries to collect specific information.
A point-in-time replay of historical data: Troubleshooting, incident management, regulatory reporting, and similar scenarios.

The applications and data platforms use very different (old and new) technologies, products, cloud services, and APIs. Integration and data consistency across the different communication paradigms is a massive challenge within the spaghetti architecture:

The consequence of inconsistent data is obvious:

Bad customer experience, e.g., late notification about flight delays or cancellations.
Revenue loss, e.g., inventory not up-to-date, missed or too late detection of fraud.
Increased cost, e.g., slow or wrong decisions in logistics across the supply chain.
Increased risk, e.g., unrecognized data breaches, compliance issues.

This is where the storage component of Apache Kafka makes the difference…

Apache Kafka = Streaming platform to decouple any application and communication paradigm

Kafka is an append-only commit log. Consumers are independent of each other and independent of producers. They interact at their own pace with their own communication paradigm and pull the information from the Kafka log.

This enables independent consumption and processing of data consistently. It does not matter what technology or communication paradigm the downstream consumer application uses:

This example of a real-time locating system (RTLS) for asset tracking can be built with Kafka straightforwardly. Downstream applications consume events in real-time, near real-time, batch, or via request-response HTTP/REST APIs. With other technologies, like real-time message queues, you need to add additional platforms for storage, integration, and data processing. Data streaming provides a single (scalable and reliable) platform.

Apache Kafka enables domain-driven design and data mesh architectures

Domain-driven Design (DDD) needs decoupled applications. Push-based message brokers or HTTP/REST web services enforce tight coupling between the systems. This creates the above spaghetti architecture.

On the other side, Apache Kafka truly decouples the domains, no matter what technologies or communication paradigms each domain uses. Loose coupling is the norm with Kafka:

This is why the storage component using the combination of real-time messaging and a distributed commit log enables data consistency across technologies and communication paradigms.

With Tiered Storage for Kafka, storage and compute are separated to enable long-term storage in Kafka for the replayability of historical data. This is not needed for every use case. Kafka does not replace your favorite database. But it is beneficial for some use cases, e.g., model training with TensorFlow and Kafka to leverage machine learning with a Kappa architecture.

Domain-driven design is the foundation of modern microservice architecture or data mesh. For that reason, many cloud-native enterprise architectures are built using Kafka as the heart of the infrastructure for real-time data sharing plus loose coupling between the data products.

Let’s look at a practical real-world example where the added value of Kafka was not its real-time capability but enabling data consistency across systems…

Erste Group Bank: A case study for data consistency with Apache Kafka

Erste Group Bank AG (Erste Group) is an Austrian financial service provider in Central and Eastern Europe serving 15.7 million clients in over 2,700 branches in seven countries. The bank presented its data streaming journey at Confluent’s Data in Motion 2022 tour in Zurich.

Having a strong mission to increase its customer experience, Erste Group is putting more and more data into action. Making this happen can be summarized as a race for consistency across our channels, squaring the circle between latency and data volumes.

The digital transformation at Erste Group required challenging integration across different technologies and communication paradigms. The following sections describe why Apache Kafka was chosen.

Hyper-personalized mobile banking

A great user experience in the new mobile app “Georg” is a crucial strategic component of Erste Group’s digital transformation to increase customer experience and revenue.

Here is how Erste Group promotes its mobile app: “For 8 million people in 6 countries, banking has a name. George. George empowers everyone to understand, manage, and improve their financial health. Simple. Intelligent. Personal. Unique.”

The following diagram shows the positive and negative reviews of customers, including Erste Group’s mobile app “George” and competitive banking apps:

Source: Erste Group Bank AG

A great mobile app user experience requires the combination of many technologies in the backend. Data streaming enables the foundation in the backend to build an intuitive mobile app across various European countries.

Scalability and accurate information at the right time in the right context make the difference:

Source: Erste Group Bank AG

Fully managed data streaming for omnichannel data consistency

Here comes the surprising part of why I chose this case study for the blog post: While the real-time capability of Apache Kafka at any scale is essential, the critical aspect of the technology choice was data consistency across platforms and communication paradigms:

Source: Erste Group Bank AG

Erste Group built an enterprise architecture with data streaming to enable asynchronous decoupled domains with event sourcing. The responsibility is split across cross-functional teams like Digital and Business Intelligence. However, consistent data is served in different ways for various downstream consumers:

Stream processing for real-time subscriptions
A serving layer for API integration via request-response protocols like HTTP/REST
Tiered Storage for long-term replayability of historical data to enable analytical queries

Source: Erste Group Bank AG

The infrastructure is fully managed in Confluent Cloud to enable focusing on business problems and innovation. DevOps and MLOps automate the development and monitoring lifecycle of the applications.

Data consistency is as critical as real-time data

Apache Kafka is the de facto standard for real-time data streaming. In addition, most enterprise architectures leverage the append-only commit log for loosely coupling to enable agile and elastic microservice architectures. The vision of building data products in a decentralized data mesh is made possible with Apache Kafka.

This post showed the case study of Erste Group to enable data consistency across domains and technologies with fully managed Kafka in the cloud. Obviously, we just explored the foundation. Data sharing across organizations and enforced data governance, including access control, encryption, and audit logging, are mandatory to realize a data mesh in the real world. I discussed these topics in my overview of the top 5 trends for data streaming in 2023.

Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The post Apache Kafka for Data Consistency (and Real-Time Data Streaming) appeared first on Kai Waehner.

Decentralized Data Mesh with Data Streaming in Financial Services

Kai Waehner — Fri, 28 Oct 2022 02:55:52 +0000

Digital transformation requires agility and fast time to market as critical factors for success in any enterprise. The decentralization with a data mesh separates applications and business units into independent domains. Data sharing in real-time with data streaming helps to provide information in the proper context to the correct application at the right time. This blog post explores a case study from the financial services sector where a data mesh was built across countries for loosely coupled data sharing but standardized enterprise-wide data governance.

Data mesh and the need for real-time data streaming

If there were a buzzword of the hour, it would undoubtedly be “data mesh“! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable:

The digital transformation in financial services

The new enterprise reality in the financial services sector: Innovate or be disrupted!

A few initiatives I have seen in banks around the world with real-time data leveraging data streaming:

Legacy modernization, e.g., mainframe offloading and replacement with Apache Kafka
Middleware modernization with scalable, open infrastructures replacing ETL, ESB, and iPaaS platforms
Hybrid cloud data replication for disaster recovery, migration, and other scenarios
Transactions and analytics in real-time at any scale
Fraud detection in real-time with Kafka and Flink to prevent fraud before it happens
Cloud-native core banking to enable modern business processes
Digital payment innovation with Apache Kafka as the data hub for Cryptocurrency, DeFi, NFT, and Metaverse (beyond the buzz)

Let’s look at a practical example from the real world.

Raiffeisen Bank International – A bank transformation across 12 countries

Raiffeisen Bank International (RBI) is scaling an event-driven architecture across the group as part of a bank-wide transformation program. This includes the creation of a reference architecture and the re-use of technology and concepts across 12 countries.

The universal bank is headquartered in Vienna, Austria. It has decades of experience (and related legacy infrastructure) in retail, corporate and markets, and investment banking.

Let’s explore the journey of Raiffeisen Bank’s digital transformation. If you want to listen to the story told by themselves, watch the free on-demand webinar.

Building a data mesh without knowing it…

Raiffeisen Bank, operating across 12 countries, has all the apparent challenges and requirements for data sharing across applications, platforms, and governments.

Raiffeisen Bank built a decentralized data mesh enterprise architecture with real-time data sharing as the fundamental key to its digital transformation. They did not even know about it because the buzzword did not exist when they started making it… But there are good reasons for using data streaming as the data hub:

The enterprise architecture of RBI’s data mesh with data streaming

The reference architecture includes data streaming as the heart of the infrastructure. It is real-time, scalable, and decoupled independent domains and applications. Open Banking APIs exist for request-response communication:

Source: Raiffeisen Bank International

The three core principles of the enterprise architecture ensure an agile, scalable, and future-ready infrastructure across the countries:

API: Internal APIs standardized based on domain-driven design
Group integration: Live, connected with 11 countries, 320 APIs available, constantly increasing
EDA: Event-driven reference architecture created and roll-out ongoing, group Layer live with the first use cases

The combination of data streaming with Apache Kafka and request-response with REST / HTTP is prevalent in enterprise architectures. Having said that, more and more use cases directly leverage a stream data exchange for data sharing across business units or organizations.

Decoupling with decentralized data streaming as the integration layer

The whole IT platform and technology stack is built for re-use in the group:

Source: Raiffeisen Bank International

Raiffeisen Bank’s reference architecture has all the characteristics that define a data mesh:

Loose coupling between applications, databases, and business units with domain-driven design
Independent microservices and data products (like different core banking platforms or individual analytics in the countries)
Data sharing in real-time via a decentralized data streaming platform (fully-managed in the cloud where possible, but freedom of choice for each country
Enterprise-wide API contacts (= schemas in the Kafka world)

Data governance in regulated banking across the data mesh

Financial service is a regulated market around the world. PCI, GDPR, and other compliance requirements are mandatory, whether you build monoliths or a decentralized data mesh.

Raiffeisenbank international built its data mesh with data governance, legal compliance, and data privacy in mind from the beginning:

Source: Raiffeisen Bank International

Here are the fundamental principles of Raiffeisen Bank’s data governance strategy:

Central integration layer for data sharing across the independent groups in real-time for transactional and analytical workloads
Cloud-first strategy (when it makes sense) with fully-managed Confluent Cloud for data streaming
Group-wide standardized event taxonomy and API contracts with Schema Registry
Group-wide governance with event product owners across the group
Platform as a service for self-service for internal customers within the different groups

Combining these paradigms and rules enables independent data processing and innovation while still being compliant and enabling data sharing across different groups.

The heart of a data mesh beats in real-time

Independent applications, domains, and organizations built separate data products in a data mesh. Real-time data sharing across these units with standardized and loosely coupled events is a critical success factor. Each downstream consumer gets the data as needed: Real-time, near real-time, batch, or request-response.

The case study from Raiffeisen Bank International showed how to build a powerful and flexible data mesh leveraging cloud-native data streaming powered by Apache Kafka. While this example comes from financial services, the principles and architectures apply to any vertical. The business objects and interfaces look different. But the significant challenges are very similar across industries.

How do you build a data mesh? Do you use batch technology like ETL tools and data lakes or rely on real-time data streaming for data sharing and integration? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Decentralized Data Mesh with Data Streaming in Financial Services appeared first on Kai Waehner.

Streaming Data Exchange with Kafka and a Data Mesh in Motion

Kai Waehner — Sun, 14 Nov 2021 13:25:45 +0000

Data Mesh is a new architecture paradigm that gets a lot of buzzes these days. Every data and platform vendor describes how to build the best Data Mesh with their platform. The Data Mesh story includes cloud providers like AWS, data analytics vendors like Databricks and Snowflake, and Event Streaming solutions like Confluent. This blog post looks into this principle deeper to explore why no single technology is the perfect fit to build a Data Mesh. Examples show why an open and scalable decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms, to solve business problems.

Data at Rest vs. Data in Motion

Before we get into the Data Mesh discussion, it is crucial to clarify the difference and relevance of Data at Rest and Data in Motion:

Data at Rest: Data is ingested and stored in a storage system (database, data warehouse, data lake). Business logic and queries execute against the storage. Everyday use cases include reporting with business intelligence tools, model training in machine learning, and complex batch analytics like shuffling or map and reduce. As the data is at rest, the processing is too late for real-time use cases.
Data in Motion: Data is processed and correlated continuously while new events are fed into the platform. Business logic and queries execute in real-time. Common real-time use cases include inventory management, order processing, fraud detection, predictive maintenance, and many other use cases.

Real-time Data Beats Slow Data

Real-time beats slow data in almost all use cases across industries. Hence, ask yourself or your business team how they want or need to consume and process the data in the next project. Data at Rest and Data in Motion have trade-offs. Therefore, both concepts are complementary. For this reason, modern cloud infrastructures leverage both in their architecture. Serverless Event Streaming with Kafka combined with the AWS Lakehouse is a great resource to learn more.

However, while connecting a batch system to a real-time nervous system is possible, the other way round – connecting a real-time consumer to batch storage – is not possible. The Kappa vs. Lambda Architecture discussion gives more insights into this.

Kafka is a database. So, it is also possible to use it for data at rest. For instance, the replayability of historical events in guaranteed ordering is essential and helpful for many use cases. However, long-term storage in Kafka has several limitations, like limited query capabilities. Hence, for many use cases, event streaming and other storage systems are complementary, not competitive.

Data Mesh – An Architecture Paradigm

Data mesh is an implementation pattern (not unlike microservices or domain-driven design) but applied to data. Thoughtworks coined the term. You will find tons of resources on the web. Zhamak Dehghani gave a great introduction about “How to build the Data Mesh Foundation and its Relation to Event Streaming” at the Kafka Summit Europe 2021.

Domain-driven + Microservices + Event Streaming

Data Mesh is not an entirely new paradigm. It has several historical influences:

The architectural paradigm unlocks analytical data at scale, rapidly unlocking access to an ever-growing number of distributed domain data sets for a proliferation of consumption scenarios such as machine learning, analytics, or data-intensive applications across the organization. A data mesh addresses the common failure modes of the traditional centralized data lake or data platform architecture.

Data Mesh is a Logical View, not Physical!

Data mesh shifts to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create a self-serve data infrastructure, treating data as a product, and implementing open standardization to enable an ecosystem of interoperable distributed data products.

Here is an example of a Data Mesh:

TL;DR: Data Mesh combines existing paradigms, including Domain-driven Design, Data Marts, Microservices, and Event Streaming.

Data as the Product

However, the differentiating aspect focuses on product thinking (“Microservice for Data”) with data as a first-class product. Data products are a perfect fit for Event Streaming with Data in Motion to build innovative new real-time use cases.

A Data Mesh with Event Streaming

Why is Event Streaming a good fit for data mesh?

Streams are real-time, so you can propagate data throughout the mesh immediately, as soon as new information is available. Streams are also persisted and replayable, so they let you capture both real-time AND historical data with one infrastructure. And because they are immutable, they make for a great source of record, which is helpful for governance.

Data in Motion is crucial for most innovative use cases. And as discussed before, real-time data beats slow data in almost all scenarios. Hence, it makes sense that the heart of a Data Mesh architecture is an Event Streaming platform. It provides true decoupling, scalable real-time data processing, and highly reliable operations across the edge, data center, and multi-cloud.

Kafka Streaming API – The De Facto Standard for Data in Motion

The Kafka API is the de facto standard for Event Streaming. I won’t explore this discussion again and again. Here are a few references before we move to the “Kafka + Data Mesh” content…

A Kafka-powered Data Mesh

I highly recommend watching Ben Stopford’s and Michael Noll’s talk about “Apache Kafka and the Data Mesh“. Several of the screenshots in this post are from that presentation, too. Kudos to my two colleagues! The talk explores the key concepts of a Data Mesh and how they are related to Event Streaming:

Domain-driven Decentralization
Data as a Self-serve Product
First-class Data Platform
Federated Governance

Let’s now explore how Event Streaming with Kafka fits into the Data Mesh architecture and how other solutions like a database or data lake complement it.

Data product, a “microservice for the data world”:

A node on the data mesh, situated within a domain.
Produces and possibly consumes high-quality data within the mesh.
Encapsulates all the elements required for its function, namely data plus code plus infrastructure.

A Data Mesh is not just one Technology!

The heart of a Data Mesh infrastructure must be real-time, decoupled, reliable, and scalable. Kafka is a modern cloud-native enterprise integration platform (also often called iPaaS today). Therefore, Kafka provides all the capabilities for the foundation of a Data Mesh.

However, not all components can or should be Kafka-based. Choose the right tool for a problem. Let’s explore in the following subsections how Kafka-native technologies and other solutions are used in a Data Mesh together.

Stream Processing within the Data Product with Kafka Streams and ksqlDB

An event-based data product aggregates and correlates information from one or more data sources in real-time. Stateless and stateful stream processing is implemented with Kafka-native tools such as Kafka Streams or ksqlDB:

Variety of Protocols and Communication Paradigms within the Data Product – HTTP, gRPC, MQTT, and more

Obviously, not every application uses just Event Streaming as a technology and communication paradigm. The above picture shows how one consumer application could also be a request/response technology like HTTP or gRPC to do a pull query. In contrast, another application continuously consumes the streaming push query with a native Kafka consumer in any programming language, such as Java, Scala, C, C++, Python, Go, etc.

The data product often includes complementary technologies. For instance, if you built a connected car infrastructure, you likely use MQTT for the last-mile integration, ingest the data into Kafka, and further processing with Event Streaming. The “Kafka + MQTT Blog Series” is an excellent example from the IoT space to learn about building a data product with complementary technologies.

Variety of Solutions within the Data Product – Event Streaming, Data Warehouse, Data Lake, and more

The beauty of microservice architectures is that every application can choose the right technologies. An application might or might not include databases, analytics tools, or other complementary components. The input and output data ports of the data product should be independent of the chosen solutions:

Kafka Connect is the right Kafka-native technology to connect other technologies and communication paradigms with the Event Streaming platform. Evaluate if you need another integration middleware (like an ETL or ESB) or if the Kafka infrastructure is the better enterprise integration platform (iPaaS) for your data product within the data mesh.

A Global Streaming Data Exchange

The Data Mesh concept is relevant for global deployments, not just within a single project or region. Multiple Kafka clusters are the norm, not an exception. I wrote about customers using Event Streaming with Kafka in global architectures a long time ago.

Various architectures exist to deploy Kafka across data centers and multiple clouds. Some use cases require low latency and deploy some Kafka instances at the edge or in a 5G zone. Other use cases replicate data between regions, countries, or continents across the globe for disaster recovery, aggregation, or analytics use cases.

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

This example shows all the characteristics discussed in the above sections for a Data Mesh:

Decentralized real-time infrastructure across domains and infrastructures
True decoupling between domains within and between the clouds
Several communication paradigms, including data streaming, RPC, and batch
Data integration with legacy and cloud-native technologies
Continuous stream processing where it adds value, and batch processing in some analytics sinks

Example: A Streaming Data Exchange across Domains in the Automotive Industry

The following example from the automotive industry shows how independent stakeholders (= domains in different enterprises) use a cross-company streaming data exchange:

Innovation does not stop at the own border. Streaming replication is relevant for all use cases where real-time is better than slow data (valid for most scenarios). A few examples:

End-to-end supply chain optimization from suppliers to the OEM to the intermediary to the aftersales
Track and trace across countries
Integration of 3rd party add-on services to the own digital product
Open APIs for embedding and combining external services to build a new product

I could go on and on with the list. Many data products need to be accessible by 3rd party in real-time at scale. Some API gateway or API management tool comes into play in such a situation.

A real-world example of a streaming data exchange powered by Kafka is the mobility service Here Technologies. They expose the Kafka API to directly consume streaming data from their mapping services (as an alternative option to their HTTP API):

However, even if all collaborating partners use Kafka under the hood in their architecture, exposing the Kafka API directly to the outside world does not always make sense. Some technical capabilities (e.g., access control or connectivity to thousands of devices) and missing business functions (e.g., for monetization or reporting) of the Kafka ecosystem bring an API layer on top of the Event Streaming infrastructure into play in many real-world deployments.

Open API for 3rd Party Integration and Streaming API Management

API Gateways and API Management tools exist in many varieties, including open-source frameworks, commercial products, and SaaS cloud offerings. Features include technical routing, access control, monetization, and reporting.

However, most people still implement the Open API concept with RPC in mind. I guess 95+% still use HTTP(S) to make APIs accessible to other stakeholders (e.g., other business units or external parties). RPC makes little sense in a streaming Data Mesh architecture if the data needs to be processed at scale in real-time.

There is still an impedance mismatch between Event Streaming and API Management. But it gets better these days. Specifications like AsyncAPI, calling itself the “industry standard for defining asynchronous APIs”, and similar approaches bring Open API to the data streaming world. My post “Kafka versus API Management with tools like MuleSoft, Kong, or Apigee” is still pretty much accurate if you want to dive deeper into this discussion. IBM API Connect was one of the first vendors that integrated Kafka via Async API.

A great example of the evolution from RPC to streaming APIs is the machine learning space. “Streaming Machine Learning with Kafka-native Model Deployment” explores how model servers such as Seldon enhance their product with a native Kafka API besides HTTP and gRPC request-response communication:

Journey to the Streaming Data Mesh with Kafka

The paradigm shift is enormous. Data Mesh is not a free lunch. The same was and still is true for microservice architectures, domain-driven design, Event Streaming, and other modern design principles.

In analogy to Confluent’s maturity model for Event Streaming, our team has described the journey for deploying a streaming Data Mesh:

The efforts likely take a few years in most scenarios. The shift is not just about technologies, but, as necessary are adjustments to organizations and business processes. I guess most companies are still in a very early stage. Let me know where you are on this journey!

Streaming Data Exchange as Foundation for a Data Mesh

A Data Mesh is an implementation pattern, not a specific technology. However, most modern enterprise architectures require a decentralized streaming data infrastructure to build valuable and innovative data products in independent, truly decoupled domains. Hence, Kafka, being the de facto standard for Event Streaming, comes into play in many Data Mesh architectures.

Many Data Mesh architectures span across many domains in various regions or even continents. The deployments run at the edge, on-prem, and multi-cloud. The integration connects to many solutions, technologies with different communication paradigms.

A cloud-native Event Streaming infrastructure with the capability to link clusters with each other out-of-the-box enables building a modern Data Mesh. No Data Mesh will use just one technology or vendor. Learn from the inspiring posts from your favorite data products vendors like AWS, Snowflake, Databricks, Confluent, and many more to define and build your custom Data Mesh successfully. Data Mesh is a journey, not a big bang.

Did you already start building your Data Mesh? How does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Streaming Data Exchange with Kafka and a Data Mesh in Motion appeared first on Kai Waehner.

Mainframe Integration, Offloading and Replacement with Apache Kafka

Kai Waehner — Fri, 24 Apr 2020 14:55:10 +0000

Time to get more innovative; even with the mainframe! This blog post covers the steps I have seen in projects where enterprises started offloading data from the mainframe to Apache Kafka with the final goal of replacing the old legacy systems.

“Mainframes are still hard at work, processing over 70 percent of the world’s most important computing transactions every day. Organizations like banks, credit card companies, medical facilities, stock brokerages, and others that can absolutely not afford downtime and errors depend on the mainframe to get the job done. Nearly three-quarters of all Fortune 500 companies still turn to the mainframe to get the critical processing work completed” (BMC).

Cost, monolithic architectures and missing experts are the key challenges for mainframe applications. Mainframes are used in several industries, but I want to start this post by thinking about the current situation at banks in Germany; to understand the motivation why so many enterprises ask for help with mainframe offloading and replacement…

Finance Industry in 2020 in Germany

Germany is where I live, so I get the most news from the local newspapers. But the situation is very similar in other countries across the world…

Here is the current situation in Germany.

Challenges for Traditional Banks

Traditional banks are in trouble. More and more branches are getting closed every year.
Financial numbers are pretty bad. Market capitalization went down significantly (before Corona!). Deutsche Bank and Commerzbank – former flagships of the German finance industry – are in the news every week. Usually for bad news.
Commerzbank had to leave the DAX in 2019 (the blue chip stock market index consisting of the 30 major German companies trading on the Frankfurt Stock Exchange); replaced by Wirecard, a modern global internet technology and financial services provider offering its customers electronic payment transaction services and risk management. UPDATE Q3 2020: Wirecard is now insolvent due to the Wirecard scandal. A really horrible scandal for the German financial market and government.
Traditional banks talk a lot about creating a “cutting edge” blockchain consortium and distributed ledger to improve their processes in the future and to be “innovative”. For example, some banks plan to improve the ‘Know Your Customer’ (KYC) process with a joint distributed ledger to reduce costs. Unfortunately, this is years away from being implemented; and not clear if it makes sense and adds real business value. Blockchain has still not proven its added value in most scenarios.

Neobanks and FinTechs Emerging

Neobanks like Revolut or N26 provide an innovative mobile app and great customer experience, including a great KYC process and experience (without the need for a blockchain). In my personal International bank transactions typically take less than 24 hours. In contrary, the mobile app and website of my private German bank is a real pain in the ass. And it feels like getting worse instead of better. To be fair: Neobanks do not shine with customers service if there are problems (like fraud; they sometimes simply freeze your account and do not provide good support).
International Fintechs are also arriving in Germany to change the game. Paypal is the normal everywhere, already. German initiatives for a “German Paypal alternative” failed. Local retail stores already accept Apple Pay (the local banks are “forced” to support it in the meantime). Even the Chinese service Alipay is supported by more and more shops.

Traditional banks should be concerned. The same is true for other industries, of course. However, as said above, many enterprises still rely heavily on the 50+ year old mainframe while competing with innovative competitors. It feels like these companies relying on mainframes have a harder job to change than let’s say airlines or the retail industry.

Digital Transformation with(out) the Mainframe

I had a meeting with an insurance company a few months ago. The team presented me an internal paper quoting their CEO: “This has to be the last 20M mainframe contract with IBM! In five years, I expect to not renew this.” That’s what I call ‘clear expectations and goals’ for the IT team…

Why does Everybody Want to Get Rid of the Mainframe?

Many large, established companies, especially those in the financial services and insurance space still rely on mainframes for their most critical applications and data. Along with reliability, mainframes come with high operational costs, since they are traditionally charged by MIPS (million instructions per second). Reducing MIPS results in lowering operational expense, sometimes dramatically.

Many of these same companies are currently undergoing architecture modernization including cloud migration, moving from monolithic applications to micro services and embracing open systems.

Modernization with the Mainframe and Modern Technologies

This modernization doesn’t easily embrace the mainframe, but would benefit from being able to access this data. Kafka can be used to keep a more modern data store in real-time sync with the mainframe, while at the same time persisting the event data on the bus to enable microservices, and deliver the data to other systems such as data warehouses and search indexes.

This not only will reduce operational expenses, but will provide a path for architecture modernization and agility that wasn’t available from the mainframe alone.

As final step and the ultimate vision of most enterprises, the mainframe might be replaced by new applications using modern technologies.

Kafka in Financial Services and Insurance Companies

I recently wrote about how payment applications can be improved leveraging Kafka and Machine Learning for real time scoring at scale. Here are a few concrete examples from banking and insurance companies leveraging Apache Kafka and its ecosystem for innovation, flexibility and reduced costs:

Capital One: Becoming truly event driven – offering a service other parts of the bank can use.
ING: Significantly improved customer experience – as a differentiator + Fraud detection and cost savings.
Freeyou: Real time risk and claim management in their auto insurance applications.
Generali: Connecting legacy databases and the modern world with a modern integration architecture based on event streaming.
Nordea: Able to meet strict regulatory requirements around real-time reporting + cost savings.
Paypal: Processing 400+ Billion events per day for user behavioral tracking, merchant monitoring, risk & compliance, fraud detection, and other use cases.
Royal Bank of Canada (RBC): Mainframe off-load, better user experience and fraud detection – brought many parts of the bank together.

The last example brings me back to the topic of this blog post: Companies can save millions of dollars “just” by offloading data from their mainframes to Kafka for further consumption and processing. Mainframe replacement might be the long term goal; but just offloading data is a huge $$$ win.

Domain-Driven Design (DDD) for Your Integration Layer

So why use Kafka for mainframe offloading and replacement? There are hundreds of middleware tools available on the market for mainframe integration and migration.

Well, in addition to being open and scalable (I won’t cover all the characteristics of Kafka in this post), one key characteristic is that Kafka enables decoupling applications better than any other messaging or integration middleware.

Legacy Migration and Cloud Journey

Legacy migration is a journey. Mainframes cannot be replaced in a single project. A big bang will fail. This has to be planned long-term. The implementation happens step by step.

Almost every company on this planet has a cloud strategy in the meantime. The cloud has many benefits, like scalability, elasticity, innovation, and more. Of course, there are trade-offs: Cloud is not always cheaper. New concepts need to be learned. Security is very different (I am NOT saying worse or better, just different). And hybrid will be the normal for most enterprises. Only enterprises younger than 10 years are cloud-only.

Therefore, I use the term ‘cloud’ in the following. But the same phases would exist if you “just” want to move away from mainframes but stay in your own on premises data centers.

On a very high level, mainframe migration could contain three phases:

Phase 1 – Cloud Adoption: Replicate data to the cloud for further analytics and processing. Mainframe offloading is part of this.
Phase 2 – Hybrid Cloud: Bidirectional replication of data in real time between the cloud and application in the data center, including mainframe applications.
Phase 3 – Cloud-First Development: All new applications are build in the cloud. Even the core banking system (or at least parts of it) is running in the cloud (or in a modern infrastructure in the data center).

How do we get there? As I said, a big bang will fail. Let’s take a look at a very common approach I have seen at various customers…

Status Quo: Mainframe Limitations and $$$

The following is the status quo: Applications are consuming data from the mainframe; creating expensive MIPS:

The mainframe is running and core business logic is deployed. It works well (24/7, mission-critical). But it is really tough or even impossible to make changes or add new features. Scalability is often becoming an issue, already. And well, let’s not talk about cost. Mainframes cost millions.

Mainframe Offloading

Mainframe offloading means data is replicated to Kafka for further analytics and processing by other applications. Data continues to be written to the mainframe via the existing legacy applications:

Offloading data is the easy part as you do not have to change the code on the mainframe. Therefore, the first project is often just about reading data.

Writing to the mainframe is much harder because the business logic on the mainframe must be understood to make changes. Therefore, many projects avoid this huge (and sometimes not solvable) challenge by keeping the writes as it is… This is not ideal, of course. You cannot add new features or changes to the existing application.

People are not able or do not want to take the risk to change it. There is no DevOps or CI/CD pipelines and no A/B testing infrastructure behind your mainframe, dear university graduates!

Mainframe Replacement

Everybody wants to replace Mainframes due to its technical limitations and cost. But enterprises cannot simply shut down the mainframe because of all the mission-critical business applications.

Therefore, COBOL, the main programming language for the mainframe is still widely used in applications to do large-scale batch and transaction processing jobs. Read the blog post ‘Brush up your COBOL: Why is a 60 year old language suddenly in demand?‘ for a nice ‘year 2020 introduction to COBOL’.

Enterprises have the following options:

Option 1: Continue to develop actively on the mainframe

Train or hire more (expensive) Cobol developers to extend and maintain your legacy applications. Well, this is where you want to go away from. If you forgot why: Cost, cost, cost. And technical limitations. I find it amazing how mainframes even support “modern technologies” such as Web Services. Mainframes do not stand still. Having said this, even if you use more modern technologies and standards, you are still running on the mainframe with all its drawbacks.

Option 2: Replace the Cobol code with a modern application using a migration and code generation tool

Various vendors provide tools which take COBOL code and automatically migrate it to generated source code in a modern programming language (typically Java). The new source code can be refactored and changed (as Java uses static typing so that any errors can be shown in the IDE and changes can be apply to all related dependencies in the code structure). This is the theory. The more complex your COBOL code is, the harder it gets to migrate it and the more custom coding is required for the migration (which results in the same problem the COBOL developer had on the mainframe: If you don’t understand the code routines, you cannot change it without risking errors, data loss or downtime in the new version of the application).

Option 3: Develop a new future-ready application with modern technologies to provide an open, flexible and scalable architecture

This is the ideal approach. Keep the old legacy mainframe running as long as needed. A migration is not a big bang. Replace use cases and functionality step-by-step. The new applications will have very different feature requirements anyway. Therefore, take a look at the Fintech and Insurtech companies. Starting green field has many advantages. The big drawback is high development and migration costs.

In reality, a mix of these three options can also happen. No matter what works for you, let’s now think how Apache Kafka fits into this story.

Domain-Driven Design for your Integration Layer

Kafka is not just a messaging system. It is an event streaming platform that provides storage capabities. In contrary to MQ systems or Web Service / API based architectures, Kafka really decouples producers and consumers:

With Kafka and its Domain-driven Design, every application / service / microservice / database / “you-name-it” is completely independent and loosely coupled from each other. But scalable, highly available and reliable! The blog post “Apache Kafka, Microservices and Domain-Driven Design (DDD)” goes much deeper into this topic.

Sberbank, the biggest bank in Russia, is a great example for their domain-driven approach: Sberbank built their new core banking system on top of the Apache Kafka ecosystem. This infrastructure is open and scalable, but also ready for mission-critical payment use cases like instant-payment or fraud detection, but also for innovative applications like Chat Bots in customer service or Robo-advising for trading markets.

Why is this decoupling so important? Because the mainframe integration, migration and replacement is a (long!) journey…

Event Streaming Platform and Legacy Middleware

An Event Streaming Platform gives applications independence. No matter if an application uses a new modern technology or legacy, proprietary, monolithic components. Apache Kafka provides the freedom to tap into and manage shared data, no matter if the interface is real time messaging, a synchronous REST / SOAP web service, a file based system, a SQL or NoSQL database, a Data Warehouse, a data lake or anything else:

Apache Kafka as Integration Middleware

The Apache Kafka ecosystem is a highly scalable, reliable infrastructure and allows high throughput in real time. Kafka Connect provides integration with any modern or legacy system, be it Mainframe, IBM MQ, Oracle Database, CSV Files, Hadoop, Spark, Flink, TensorFlow, or anything else. More details here:

Kafka Ecosystem for Security and Data Governance

The Apache Kafka ecosystem provides additional capabilities. This is important as Kafka is not just used as messaging or ingestion layer, but a platform for your most mission-critical use cases. Remember the examples from banking and insurance industry I showed in the beginning of this blog post. These are just a few of many more mission-critical Kafka deployments across the world. Check past Kafka Summit video recordings and slides for many more mission-critical use cases and architectures.

I will not go into detail in this blog post, but the Apache Kafka ecosystem provides features for security and data governance like Schema Registry (+ Schema Evolution), Role Based Access Control (RBAC), Encryption (on message or field level), Audit Logs, Data Flow analytics tools, etc…

Mainframe Offloading and Replacement in the Next 5 Years

With all the theory in mind, let’s now take a look at a practical example. The following is a journey many enterprises walk through these days. Some enterprises are just in the beginning of this journey, while others already saved millions by offloading or even replacing the mainframe.

I will walk you through a realistic approach which takes ~5 years in this example (but this can easily take 10+ years depending on the complexity of your deployments and organization). The importance is quick wins and a successful step-by-step approach; no matter how long it takes.

Year 0: Direct Communication between App and Mainframe

Let’s get started. Here is again the current situation: Applications directly communicate with the mainframe.

Let’s reduce cost and dependencies between the mainframe and other applications.

Year 1: Kafka for Decoupling between Mainframe and App

Offloading data from the mainframe enables existing applications to consume the data from Kafka instead of creating high load and cost for direct mainframe access:

This saves a lot of MIPS (million instructions per second). MIPS is a way to measure the cost of computing on mainframes. Less MIPS, less cost.

After offloading the data from the mainframe, new applications can also consume the mainframe data from Kafka. No additional MIPS cost! This allows building new applications on the mainframe data. Gone is the time where developers cannot build new innovative applications because the MIPS cost was the main blocker to access the mainframe data.

Kafka can store the data as long as it needs to be stored. This might be just 60 minutes for one Kafka Topic for log analytics or 10 years for customer interactions for another Kafka Topic. With Tiered Storage for Kafka, you can even reduce costs and increase elasticity by separating processing from storage with a back-end storage like AWS S3. This discussion is out of scope of this article. Check out “Is Apache Kafka a Database?” for more details.

Change Data Capture (CDC) for Mainframe Offloading to Kafka

The most common option for mainframe offloading is Change Data Capture (CDC): Transaction log-based CDC pushes data changes (insert, update, delete) from the mainframe to Kafka. The advantages:

Real time push updates to Kafka
Eliminate disruptive full loads, i.e. minimize production impact
Reduce MIPS consumption
Integrate with any mainframe technology (DB2 Z/OS, VSAM, IMS/DB, CISC, etc.)
Full support

On the first connection, the CDC tool reads a consistent snapshot of all of the tables that are whitelisted. When that snapshot is complete, the connector continuously streams the changes that were committed to the DB2 database for all whitelisted tables in capture mode. This generates corresponding insert, update and delete events. All of the events for each table are recorded in a separate Kafka topic, where they can be easily consumed by applications and services.

The big disadvantage of CDC is high licensing costs. Therefore, other integration options exist…

Integration Options for Kafka and Mainframe

While CDC is typically the preferred choice, there are more alternatives. Here are the integration options I have seen being discussed in the field for mainframe offloading:

IBM InfoSphere Data Replication (IIDR) CDC solution for Mainframe
3rd Party commercial CDC solution (e.g. Attunity or HLR)
Open-source CDC solution (e.g Debezium – but you still need an IIDC license, this is the same challenge as with Oracle and GoldenGate CDC)
Create interface tables + Kafka Connect + JDBC connector
IBM MQ interface + Kafka Connect’s IBM MQ connector
Confluent REST Proxy and HTTP(S) calls from the mainframe
Kafka Clients on the mainframe

Evaluate the trade-offs and make your decision. Requiring a fully supported solution often eliminates several options quickly In my experience, most people use CDC with IIDR or a 3rd party tool, followed by IBM MQ. Other options are more theoretical in most cases.

Year 2 to 4: New Projects and Applications

Now it is time to build new applications:

This can be agile, lightweight Microservices using any technology. Or this can be external solutions like a data lake or cloud service. Pick what you need. Your are flexible. The heart of your infrastructure allows this. It is open, flexible and scalable. Welcome to a new modern IT world!

Year 5: Mainframe Replacement

At some point, you might wonder: What about the old mainframe applications? Can we finally replace them (or at least some of them)? When the new application based on a modern technology is ready, switch over:

As this is a step-by-step approach, the risk is limited. First of all, the mainframe can keep running. If the new application does not work, switch back to the mainframe (which is still up-to-date as you did not stop inserting the updates into its database in parallel to the new application).

As soon as the new application is battle-tested and proven, the mainframe application can be shut down. The $$$ budgeted for the next mainframe renewal can be used for other innovative projects. Congratulations!

Mainframe Offloading and Replacement is a Journey… Kafka can Help!

Here is the big picture of our mainframe offloading and replacement story:

Hybrid Cloud Infrastructure with Apache Kafka and Mainframe

Remember the three phases I discussed earlier: Phase 1 – Cloud Adoption, Phase 2 – Hybrid Cloud, Phase 3 – Cloud-First Development. I did not tackle this in detail in this blog post. Having said this, Kafka is an ideal candidate for this journey. Therefore, this is a perfect combination for offloading and replacing mainframes in a hybrid and multi-cloud strategy. More details here:

Slides and Video Recording for Kafka and Mainframe Integration

Here are the slides:

And the video walking you through the slides:

What are your experiences with mainframe offloading and replacement? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss! Stay informed about new blog posts by subscribing to my newsletter.

The post Mainframe Integration, Offloading and Replacement with Apache Kafka appeared first on Kai Waehner.

Event Streaming and Apache Kafka in Telco Industry

Kai Waehner — Fri, 06 Mar 2020 12:58:37 +0000

Event Streaming is a hot topic in Telco Industry. In the last few months, I have seen various projects leveraging Apache Kafka and its ecosystem to implement scalable real time infrastructure in OSS and BSS scenarios. This blog post covers the reasons for this trend. The end shows a whiteboard video recording exploring the different use cases for event streaming in telcos in detail.

The Evolution of the Telecommunications Industry

The telecommunications industries within the sector of information and communication technology is made up of all telecommunications / telephone companies and internet service providers. It plays the crucial role in the evolution of mobile communications and the information society.

Traditional telephone calls continue to be the industry’s biggest revenue generator in Telecommunications Industry. But thanks to advances in network technology, Telecom today is less about voice and increasingly about text (messaging, email) and images (e.g. video streaming).

High-speed internet access for computer-based data applications such as broadband information services and interactive entertainment, is pervasive. Digital subscriber line (DSL) is the main broadband telecom technology. The fastest growth comes from (value-added) services delivered over mobile networks.

Telco Architecture – OSS, BSS and Cloud

The traditional telco landscape is separated into OSS (Operations Support Systems), BSS (Business Support Systems) and OSS-BSS-Integration.

OSS track network inventory, assets and provisioning of services. BSS deal with customer relationship management (CRM) and processes such as taking orders, processing bills, and collecting payments. Here is an example of such a Telecommunications Industry landscape:

(Overview about OSS / BSS Landscape)

Modern architectures (have to) include one more aspect: Cloud. Like in almost all other traditional industries, telcos move towards cloud. The consequence is a hybrid architecture. Telco have to stay hybrid forever because a core part of their business is infrastructure. A combination of on premises data centers, edge processing and multi cloud architectures is the new normal in Telecommunications Industry.

Telco Business – CDR (Call Detail Record), Customer 360, Media and more

The business in Telecommunications Industry did not change that much from an end user perspective yet. Many basic concepts are still the same. OSS and BSS work together. Call Detail Records (CDR) metadata from phone calls is still processed and analysed to monitor the infrastructure and sell services to customers. Smartphone contracts and media offerings are still pretty similar like ten years ago (you just get higher data limits per month).

However, the telecom business for the traditional telecommunications industry is changing significantly these days. It has to change. Otherwise, competitors like Apple or Netflix take over most of the business.

I like how this picture from Huawei describes these changes of the digital transformation in the telco industry:

The trend goes towards an open ecosystem, partnering, context-specific personalized offerings, multichannel, improved customer service and new innovative products.

These trends create a lot of challenges for the technical implementation. Traditional technologies which were used in the last 20 years do not work to realize this digital transformation. Requirements include real time processing at scale, decoupled and flexible applications and infrastructure, and still a reliable system with zero down time and zero data loss.

The combination of these requirements makes clear why more and more Telco projects and infrastructure are built with event streaming in mind and leveraging the de facto standard Apache Kafka and its ecosystem.

Use Cases for Apache Kafka in Telco

Apache Kafka is used in various very different telco projects in OSS, BSS, OSS-BSS-Integration. Many different architectures are used, included edge, data center, hybrid and multi cloud. Let’s take a look at some examples for use cases and architectures.

Middleware / Central Event Streaming Platform

Central Event Streaming Platform as core infrastructure for integration and new applications
Integration on all levels: OSS, BSS-OSS-Integration, BSS
OSS – Separate clusters are used in each country / OpCo for integration with syslog etc.
Integration with legacy and modern systems (including integration to existing legacy middleware)
Move Data from OSS Fixed, Mobile, Cable into various systems
Spanning different business units (TV, Analytics, Video Platform, Sales and Services)
Integration with external partners (today via APIs, future: Kafka-native)
Merger after acquisitions

Monitoring (Hardware, Performance, Security)

Real time monitoring of their network devices (routers, switches, other network devices)
Outside infrastructure: Million miles of fiber and coax, and over 40 million in home devices (severe weather to power grid outages to construction-related disruptions)
Proactive monitoring: Real time monitoring, problem analysis, metrics reporting and action response
Nationwide distributed IT infrastructure for all districts -> Fully-managed Security Information and Event Management
Smarter network planning and deployment, as well as more precise ROI-based investment decisions, taking into account the profitability of each radio site based on customers’ actual and predicted profitability
SIEM / SOAR: Mission critical cyber defence (typically a combination with other products like Elastic, Splunk, etc.)

Data Distribution

Event Hub across the whole telco organisation for automation and virtualization of transports to enable data-driven infrastructure decisions
Abstracted customer layer for all customer facing apps (Mobile App, Web Site, other CRM Systems)
Next generation media distribution network. Provides alarm correlation to allow identifying network faults and reroute traffic in real time.
Share information about network performance back to external customers
API Gateway for network providers
Accelerate the simplification and automation of standard processes, in both operational and support areas. These include IT and network operations, customer management back office functions and all other administrative activities
Message archiving service for customers (traders specifically) that want fully searchable archive of all messages (delivered through all channels).

Data Processing

CDR (Call Detail Record) processing: Technical ETL and business applications (range of scenarios like including upsell, churn analysis)
Data consolidation issues with many data sources, late arriving events, data out of sync and high SLA guarantees
Text messaging service to let you send and receive text and multimedia messages seamlessly on your smartphone, tablet, computer or the web using your mobile number; including rate limiting of SMS, MMS and RCS (Rich Communication Services) requests specifically for seasonal spikes; including redirect of RCS via SMS when delivery ability for the RCS message is not present
Personalized messages and recommendations are built and then delivered to customer inboxes

Business Applications

Microservices: Infrastructure for agile and decoupled app development
BSS: Data engine, customer engagement engine, access broker, dynamic business engine, information backbone, back office systems, customer 360, master data management, …
Customer 360: Track every interaction that the customer has with any channel
Supply Chain Management (SCM): Real Time Inventory. Support real time information down to the store level
Billing, Accounts, Lines of Service Data. Relieve Ops Burden
Make the mobile app and digital marketing channels over time become the main customer acquisition and management platform
B2B products like Inventory / Asset Tracking, Fleet Management
Connected Car solutions to retailers for aftersales, including vehicle analytics for drivers, Infotainment, vehicle diagnostics, personal assistance, SOS, remote access, convenience features
Connected Car solution platform for auto manufacturers to automate vehicle analytics for insurance companies and fleet management, tracking sales; vehicle performance; consumer demographics, etc.

Hybrid Infrastructure and Cloud Migration

Consume data from on premises and monitor in cloud
Aggregation cluster: Hub-and-spoke architecture with several regional data centers capturing data to local Kafka clusters and then sending metrics info to the main cluster
Edge processing (on different levels: OpCo, data center, retail store, …)

Media

Pass all data through a single source of truth (real time, highly scalable) for decoupling and quick innovation of building new services
IP-to-MAC: Handle all incoming MAC address registration requests from set-top boxes (to check account entitlements to make sure they are a valid customer)
Managing content metadata / epg
Processing pay per view bookings
Fraud use cases (MAC address cloning, sharing of streaming account details, SMS Spam detection, etc); leveraging machine learning in real time at scale for predictions
Digital natives (like Netflix) built completely new businesses that were born digital. Everything in a digitally-native business is an event
Context-specific advertising in real time based on user data (including product upsell, coupons, and other revenue creation with ad partners)

10 Reasons for Event Streaming (i.e. Apache Kafka) in Telco Industry

10 Reasons why telco projects leverage Event Streaming and Apache Kafka:

Real Time
Scalable
Cost Reduction
24/7 – Zero downtime, zero data loss
Decoupling – Storage, Domain Driven Design (DDD), Reprocessing of events
Data processing and stateful client applications
Integration
Hybrid Architecture – On Premises, multi cloud, edge computing
Fully managed cloud
No vendor locking

These characteristics are not just for telco business but relevant in general, of course. However, in the video below, I cover how they map to the telco industry and OSS / BSS use cases.

Obviously, Apache Kafka alone has a lot of strong characteristics. But to realize the digital transformation successfully, you also need expertise (services and support) and tools (development, operations and monitoring) to set up, operate and scale such an infrastructure 24/7 mission-critical.

Slide Deck

The following slide deck covers the evolution of Apache Kafka in the telecom sector, including use cases, architectures and technologies (OSS, BSS, OTT, IMS, NFV, Middleware, Mainframe, etc.):

Live Whiteboard

The following whiteboard explains why and how different companies in the telco industry leverage Event Streaming and Apache Kafka in OSS, BSS, OSS-BSS-Integration and Cloud / Hybrid Architectures. I cover in detail how the above 10 characteristics can used in your next telco project:

Video Recording – Telco Use Cases and Architectures

Here is a webinar where I talk about use cases, architectures and technologies around Apache Kafka in the telecom sector:

What are your experiences with modernizing the infrastructure and applications in the telco industry? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss it!

The post Event Streaming and Apache Kafka in Telco Industry appeared first on Kai Waehner.

The Rise Of Event Streaming – Why Apache Kafka Changes Everything

Kai Waehner — Thu, 06 Feb 2020 17:32:27 +0000

I had the pleasure to deliver the keynote at OOP 2020 in Munich, Germany. This is a well-known international conference around topics like agility, architecture, security, programming languages and soft skill. My keynote had the title “The Rise Of Event Streaming – Why Apache Kafka Changes Everything“. Here are share some impressions and details of the talk…

Abstract of the Keynote Presentation

Business digitalization covers trends like microservices, the Internet of Things or Machine Learning. This is driving the need to process events at a whole new scale, speed and efficiency. Traditional solutions like ETL / data integration or messaging are not build to serve these needs.

Today, the open source project Apache Kafka is being used by thousands of companies including over 60% of the Fortune 100. These companies power and innovate their businesses by focusing their data strategies around event-driven architectures leveraging event streaming.

We will discuss the market and technology changes that have given rise to Kafka and to event-driven architectures. The audience learns the key aspects of building a platform for stream processing with Kafka. Examples of productive use cases from the automotive, manufacturing and transportation sector will showcase the power of event streaming.

Event Streaming Whiteboard – A Fantastic Live Drawing

My session was live-drawn during my presentation by so called “graphic recorders” from remarker. Here is the impressive outcome:

Slide Deck

Here is the slide deck of my presentation. It covers

the history of Apache Kafka and Event Streaming
core design principles and architecture of Apache Kafka
use cases from Lyft, Audi, Deutsche Bahn, Bosch, EON, and more

A specific focus of the presentation was on “stream processing”; a core feature and design concept as part of event streaming platforms. It allows to continuously process massive volumes of data in stateless or stateful applications in real time:

You can checkout my presentation and video recording about KSQL at Big Data Spain 2018 (and many other resources on the web) for more details about stream processing with the Apache Kafka ecosystem.

The post The Rise Of Event Streaming – Why Apache Kafka Changes Everything appeared first on Kai Waehner.

Apache Kafka as Digital Twin for Open, Scalable, Reliable Industrial IoT (IIoT)

Kai Waehner — Thu, 28 Nov 2019 12:53:48 +0000

This blog post discusses the benefits of a Digital Twin in Industrial IoT (IIoT) and its relation to Apache Kafka. Kafka is often used as central event streaming platform to build a scalable and reliable digital twin for real time streaming sensor data.

In November 2019, I attended the SPS Conference in Nuremberg. This is one of the most important events about Industrial IoT (IIoT). Vendors and attendees from all over the world fly in to make business and discuss new products. Hotel prices in this region go up from usually 80-100€ to over 300€ per night. Germany is still known for its excellent engineering and manufacturing industry. German companies drive a lot of innovation and standardization around Internet of Things (IoT) and Industry 4.0.

The article discusses:

The relation between Operational Technology (OT) and Information Technology (IT)
The advantages and architecture of an open event streaming platform for edge and global IIoT infrastructures
Examples and use cases for Digital Twin infrastructures leveraging an event streaming platform for storage and processing in real time at scale

SPS – A Trade Show in Germany for Global Industrial IoT Vendors

“SPS covers the entire spectrum of smart and digital automation – from simple sensors to intelligent solutions, from what is feasible today to the vision of a fully digitalized industrial world“. No surprise almost all vendors show software and hardware solutions for innovative use cases like

Condition monitoring in real time
Predictive maintenance using artificial intelligence (AI) – I prefer the more realistic term “machine learning”, a subset of AI
Integration of legacy machines and proprietary protocols in the shop floors
Robotics
New digital services
Other related buzzwords.

Digital Twin as Huge Value Proposition

New software and hardware needs to improve business processes, increase revenue or cut costs. Many booths showed solution for a digital twin as part of this buzzword bingo and value proposition. Most vendors exhibited complete solutions as hardware or software products. Nobody talked about the underlying implementation. Not all vendors could explain in detail how the infrastructure really scales and performs under the hood.

This post starts from a different direction. It begins with definition and use cases of a digital twin infrastructure. The challenges and requirements are discussed in detail. Afterwards, possible architectures and combinations of solutions show the benefits of an open and scalable event streaming platform as part of the puzzle.

The value proposition of a digital twin always discusses the combination of OT and IT:

OT: Operation Technology; dealing with machines and devices
IT: Information Technology, dealing with information

Excursus: SPS == PLC —> A Core Component in each IoT Infrastructure

A funny, but relevant marginal note for non-German people: The event name of the trade fair and acronym “SPS” stands for “smart production solutions”. However, in Germany “SPS” actually stands for “SpeicherProgrammierbare Steuerung”. The English translation might be very familiar for you: “Programmable Logic Controller” or shortened “PLC“.

PLC is a core component in any industrial infrastructure. This industrial digital computer has been ruggedized and adapted for the control of manufacturing processes, such as:

Assembly lines
Robotic devices
Any activity that requires high reliability control and ease of programming and process fault diagnosis

PLCs are built to withstand extreme temperatures, strong vibrations, high humidity, and more. Furthermore, since they are not reliant on a PC or network, a PLC will continue to independently function without any connectivity. PLC is OT. This is very different from the IT hardware a software engineers knows from developing and deploying “normal” Java, .NET or Golang applications.

Digital Twin – Merging the Physical and the Digital World

A digital twin is a digital replica of a living or non-living physical entity. By bridging the physical and the virtual world, data is transmitted seamlessly allowing the virtual entity to exist simultaneously with the physical entity. The digital twin therefore interconnects OT and IT.

Digital Replica of Potential and Actual Physical Assets

Digital twin refers to a digital replica of potential and actual physical assets (physical twin), processes, people, places, systems and devices. The digital replica can be used for various purposes. The digital representation provides both the elements and the dynamics of how an IoT device operates and lives throughout its life cycle.

Definitions of digital twin technology used in prior research emphasize two important characteristics. Firstly, each definition emphasizes the connection between the physical model and the corresponding virtual model or virtual counterpart. Secondly, this connection is established by generating real time data using sensors.

Digital Twins Link Internet of Things and Artificial Intelligence

Digital twins connect internet of things, artificial intelligence, machine learning and software analytics to create living digital simulation models. These models update and change as their physical counterparts change. A digital twin continuously learns and updates itself from multiple sources to represent its near real-time status, working condition or position.

This learning system learns from

itself, using sensor data that conveys various aspects of its operating condition.
human experts, such as engineers with deep and relevant industry domain knowledge.
other similar machines
other similar fleets of machines
the larger systems and environment in which it may be a part of.

A digital twin also integrates historical data from past machine usage to factor into its digital model.

Use Cases for a Digital Twin in Various Industries

In various industrial sectors, digital twins are used to optimize the operation and maintenance of physical assets, systems and manufacturing processes. They are a formative technology for the IIoT. A digital twin in the workplace is often considered part of Robotic Process Automation (RPA). Per Industry-analyst firm Gartner, a digital twin is part of the broader and emerging hyperautomation category.

Some industries that can leverage digital twins include:

Manufacturing Industry: Physical manufacturing objects are virtualized and represented as digital twin models seamlessly and closely integrated in both the physical and cyber spaces. Physical objects and twin models interact in a mutually beneficial manner. Therefore, the IT infrastructure also sends control commands back to the actuators of the machines (OT).
Automotive Industry: Digital twins in the automobile industry use existing data in order to facilitate processes and reduce marginal costs. They can also suggest incorporating new features in the car that can reduce car accidents on the road.
Healthcare Industry: Lives can be improved in terms of medical health, sports and education by taking a more data-driven approach to healthcare. The biggest benefit of the digital twin on the healthcare industry is the fact that healthcare can be tailored to anticipate on the responses of individual patients.

Examples for Digital Twins in the Industrial IoT

Digital twins are used in various scenarios. For example, they enable the optimization of the maintenance of power generation equipment such as power generation turbines, jet engines and locomotives. Further examples of industry applications are aircraft engines, wind turbines, large structures (e.g. offshore platforms, offshore vessels), heating, ventilation, and air conditioning (HVAC) control systems, locomotives, buildings, utilities (electric, gas, water, waste water networks).

Let’s take a look at some use cases in more detail:

Monitoring, Diagnostics and Prognostics

A digital twin can be used for monitoring, diagnostics and prognostics to optimize asset performance and utilization. In this field, sensory data can be combined with historical data, human expertise and fleet and simulation learning to improve the outcome of prognostics. Therefore, complex prognostics and intelligent maintenance system platforms can use digital twins in finding the root cause of issues and improve productivity.

Digital twins of autonomous vehicles and their sensor suite embedded in a traffic and environment simulation have also been proposed as a means to overcome the significant development, testing and validation challenges for the automotive application. In particular when the related algorithms are based on artificial intelligence approaches that require extensive training data and validation data sets.

3D Modeling for the Creation of Digital Companions

Digital twins are used for 3D modeling to create digital companions for the physical objects. It can be used to view the status of the actual physical object. This provides a way to project physical objects into the digital world. For instance, when sensors collect data from a connected device, the sensor data can be used to update a “digital twin” copy of the device’s state in real time.

The term “device shadow” is also used for the concept of a digital twin. The digital twin is meant to be an up-to-date and accurate copy of the physical object’s properties and states. This includes information such as shape, position, gesture, status and motion.

Embedded Digital Twin

Some manufacturers embed a digital twin into their device. This improves quality, allows earlier fault detection and gives better feedback on product usage to the product designers.

How to Build a Digital Twin Infrastructure?

TL;DR: You need to have the correct information in real time at the right location to be able to analyze the data and act properly. Otherwise you will have conversations like the following:

Challenges and Requirements for Building a Scalable, Reliable Digital Twin

The following challenges have to be solved to implement a successful digital twin infrastructure:

Connectivity: Different machines and sensors typically do not provide the same interface. Some use a modern standard like MQTT or OPC-UA. Though, many (older) machines use proprietary interfaces. Furthermore, you also need to integrate to the rest of the enterprise.
Ingestion and Processing: End-to-end pipelines and correlation of all relevant information from all machines, devices, MES, ERP, SCM, and any other related enterprise software in the factories, data center or cloud.
Real Time: Ingestion of all information in real time (this is a tough term; in this case “real time” typically means milliseconds, sometimes even seconds or minutes are fine).
Long-term Storage: Storage of the data for reporting, batch analytics, correlations of data from different data sources, and other use cases you did not think at the time when the data was created.
Security: Trusted data pipelines using authentication, authorization and encryption.
Scalability: Ingestion and processing of sensor data from one or more shop floors creates a lot of data.
Decoupling: Sensors produce data continuously. They don’t ask if the consumers are available and if they can keep up with the input. The handling of backpressure and decoupling if producers and consumers is mandatory.
Multi-Region or Global Deployment: Analysis and correlation of data in one plant is great. But it is even better if you can correlate and analyze the data from all your plants. Maybe even plants deployed all over the world.
High Availability: Building a pipeline from one shop floor to the cloud for analytics is great. However, even if you just integrate a single plant, the pipeline typically has to run 24/7. In some cases without data loss and with order guarantees of the sensor events.
Role-Based Access Control (RBAC): Integration of tens or hundreds of machines requires capabilities to manage the relation between the hardware and its digital twin. This includes the technical integration, role-based access control, and more.

The above list covers technical requirements. On top, business services have to be provided. This includes device management, analytics, web UI / mobile app, and other capabilities depending on your use cases.

So, how do you get there? Let’s discuss three alternatives in the following sections.

Solution #1 => IoT / IIoT COTS Product

COTS (commercial off-the-shelf) is software or hardware products that are ready-made and available for sale. IIoT COTS products are built for exactly one problem: Development, Deployment and Operations of IoT use cases.

Many IoT COTS solutions are available on the market. This includes products like Siemens MindSphere, PTC IoT, GE Predix, Hitachi Lumada or Cisco Kinetic for deployments in the data center or in different clouds.

Major cloud providers like AWS, GCP, Azure and Alibaba provide IoT-specific services. Cloud providers are a potential alternative if you are fine with the trade-offs of vendor lock-in. Main advantage: Native integration into the ecosystem of the cloud provider. Main disadvantage: Lock-in into the ecosystem of the cloud provider.

At the SPS trade show, 100+ vendors presented their IoT solution to connect shop floors, ingest data into the data center or cloud, and do analytics. Often, many different tools, products and cloud services are combined to solve a specific problem. This can either be obvious or just under the hood with OEM partners. I already covered the discussion around using one single integration infrastructure versus many different middleware components in much more detail.

Read Gartner’s Magic Quadrant for Industrial IoT Platforms 2019. The analyst report describes the mess under the hood of most IIoT products. Plenty of acquisitions and different code bases are the foundation of many commercial products. Scalability and automated rollouts are key challenges. Download of the report is possible with a paid Gartner account or via free download from any vendor with distribution rights, like PTC.

Proprietary and Vendor-specific Features and Products

Automation industry typically uses proprietary and expensive products. All of them are marketed as flexible cloud products. The truth is that most of them are not really cloud-native. This means they are not scalable and not extendible like you would expect it from a cloud-native service. Some problems I have heard in the past from end users:

Scalability is not given. Many products don’t use a scalable architecture. Instead, additional black boxes or monoliths are added if you need to scale. This challenges uptime SLAs, burden of operations and cost.
Some vendors just deploy their legacy on premise solution into AWS EC2 instances and call it a cloud product. This is far away from being cloud native.
Even IoT solutions from some global cloud providers do not scale good enough to implement large IoT projects (e.g. to connect to one million cars). I saw several companies which evaluate a cloud-native MQTT solution. They then went to a specific MQTT provider afterwards (but still used the cloud provider for its other great services, like computing resources, analytics services, etc).
Most complete IoT solutions use many different components and code bases under the hood. Even if you buy one single product, the infrastructure usually uses various technologies and (OEM) vendors. This makes development, testing, roll-out and 24/7 end-to-end operations much harder and more costly.

Long Product Lifecycles with Proprietary Protocols

In automation industry, product lifecycles are very long (tens of years). Simple changes or upgrades are not possible.

IIoT usually uses incompatible protocols. Typically, not just the products, but also these protocols are proprietary. They are just built for hardware and machines of one specific vendor. Siemens S7, Modbus, Allen Bradley, Beckhoff ADS, to name a few of these protocols and “standards”.

OPC-UA is supported more and more. This is a real standard. However, it has all the pros and cons of an open standard. It is often poorly implemented by the vendors and requires an app server on top of the PLC.

In most scenarios I have seen, connectivity to machines and sensors from different vendors is required, too. Even in one single plant, you typically have many different technologies and communication paradigms to integrate.

There is More than just the Machines and PLCs…

No matter how you integrate to the shop floor machines; this is just the first step to get data correlated with other systems from your enterprise. Connectivity to all the machines in the shop floors is not sufficient. Integration and combination of the IoT data with other enterprise applications is crucial. You also want to provide and sell additional services in the data center or cloud. In addition, you might even want to integrate with partners and suppliers. This adds additional value to your product. You can provide features you don’t sell by yourself.

Therefore, an IoT COTS solution does not solve all the challenges. There is huge demand to build an open, flexible, scalable platform. This is the reason why Apache Kafka comes into play in many projects as event streaming platform.

Solution #2 => Apache Kafka as Event Streaming Platform for the Digital Twin Infrastructure

Apache Kafka provides many business and technical characteristics out-of-the-box:

Cost reduction due to open core principle
No vendor lock-in
Flexibility and Extensibility
Scalability
Standards-based Integration
Infrastructure-, vendor- and technology-independent
Decoupling of applications and machines

Apache Kafka – An Immutable, Distributed Log for Real Time Processing and Long Term Storage

I assume you already know Apache Kafka: The de-facto standard for real-time event streaming. Apache Kafka provides the following characteristics:

Open-source (Apache 2.0 License)
Global-scale
High volume messaging
Real-time
Persistent storage for backup and decoupling of sources and sinks
Connectivity (via Kafka Connect)
Continuous Stream processing (via Kafka Streams)

This is all included within the Apache Kafka framework. Fully independent of any vendor. Vibrant community with thousands of contributors from many hundreds of companies. Adoption all over the world in any industry.

If you need more details about Apache Kafka, check out the Kafka website, the extensive Confluent documentation, or hundreds of free video recordings and slide decks from all Kafka Summit events to learn about the technology and use cases.

How is Kafka related to Industrial IoT, shop floors and building digital twins? Let’s take a quick look at Kafka’s capabilities for integration and continuous data processing at scale.

Connectivity to Industrial Control Systems (ICS)

Kafka can connect to different functional levels of a manufacturing control operation:

Integrate directly to Level 1 (PLC / DCS): Kafka Connect or any other Kafka Clients (Java, Python, .NET, Go, JavaScript, etc.) can connect directly to PLCs or an OPC-UA server. The the data gets directly ingested from the machines and devices. Check out “Apache Kafka, KSQL and Apache PLC4X for IIoT Data Integration and Processing” for an example to integrate with different PLC protocols like Siemens S7 and Modbus in real time at scale.
Integrate on level 2 (Plant Supervisory) or even above on 3 (Production Control) or 4 (Production Scheduling): For instance, you integrate with MES / ERP / SCM systems via REST / HTTP interfaces or any MQTT-supported plant supervisory or production control infrastructure. The blog post “IoT and Event Streaming at Scale with MQTT” discusses a few different options.
What about level 0 (Field level)? Today, IT typically only connects to PLCs respectively DCS. IT does not directly integrate to sensors and actuators in the field bus, switch or end node. This will probably change in the future. Sensors get smarter and more powerful. And new standards for the “last mile” of the network emerge, like 10BASE-T1L.

We discussed different integration levels between IT and OT infrastructure. However, connecting from the IT to the PLC and shop floor is just half of the story…

Connectivity and Data Ingestion to MES, ERP, SCM, Big Data, Cloud and the Rest of the Enterprise

Kafka Connect enables reliable and scalable integration of Kafka with any other system. This is important as you need to integrate and correlate sensor data from PLCs with applications and databases from the rest of the enterprise. Kafka Connect provides sources and sinks to

Databases like Oracle or SAP Hana
Big Data Analytics like Hadoop, Spark or Google Cloud Machine Learning
Enterprise Applications like ERP, CRM, SCM
Any other application using custom connectors or Kafka clients

Continuous Stream Processing and Streaming Analytics in Real Time on the Digital Twin with Kafka

A pipeline from machines to other systems in real time at scale is just part of the full story. You also need to continuously process the data. For instance, you implement streaming ETL, real time predictions with analytic models, or execute any other business logic.

Kafka Streams allows to write standard Java apps and microservices to continuously process your data in real-time with a lightweight stream processing Java API. You could alos use ksqlDB to do stream processing using SQL semantics. Both frameworks use Kafka natively under the hood. Therefore, you leverage all Kafka-native features like high throughput, high availability and zero downtime out-of-the box.

The integration with other streaming solutions like Apache Flink, Spark Streaming, AWS Kinesis or other commercial products like TIBCO StreamBase, IBM Streams or Software AG’s Apama is possible, of course.

Kafka is so great and widely adopted because it decouples all sources and sinks from each other leveraging its distributed messaging and storage infrastructure. Smart endpoints and dumb pipes is a key design principle applied with Kafka automatically to decouple services through a Domain-Driven Design (DDD):

Kafka + Machine Learning / Deep Learning = Real Time Predictions in Industrial IoT

The combination of Kafka and Machine Learning for digital twins is not different from any other industry. However, IoT projects usually generate big data sets throught real time streaming sensor data. Therefore, Kafka + Machine Learning makes even more sense in IoT projects for building digital twins than in many other projects where you lack big data sets.

Analytic models need to be applied at scale. Predictions often need to happen in real time. Data preprocessing, feature engineering and model training also need to happen as fast as possible. Monitoring the whole infrastructure in real time at scale is a key requirement, too. This includes not just the infrastructure for the machines and sensors in the shop floor, but also the overall end-to-end edge-to-cloud infrastructure.

With this in mind, you quickly understand that Machine Learning / Deep Learning and Apache Kafka are very complementary. I have covered this in detail in many other blog posts and presentations. Get started here for more details:

Blog Post: How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka
Slide Deck: Apache Kafka + Machine Learning => Intelligent Real Time Applications
Video Recording: Deep Learning in Mission Critical and Scalable Real Time Applications with Open Source Frameworks

Example for Kafka + ML + IoT: Embedded System at the Edge using an Analytic Model in the Firmware

Let’s discuss a quick example for Kafka + ML + IoT Edge: Embedded systems are very inflexible. It is hard to change code or business rules. The code in the firmware applies between input and out of the hardware. The code is embedded which implements business rules. However, new code or business rules cannot be simply deployed with a script or continuous delivery like you know it from your favorite Java / .NET / Go / Python application and tools like Maven and Jenkins. Instead, each change requires a new, long development lifecycle including tests, certifications and a manufacturing process.

There is another option Instead of writing and embedding business rules in code with a complex and costly process: Analytic models can be trained on historical data. This can happen anywhere. For instance, you can ingest sensor data into a data lake in the cloud via Kafka. The model is trained in the elastic and flexible cloud infrastructure. Finally, this model is deployed to an embedded system respectively a new firmware version is created with this model (using the long, expensive process). However, updating (i.e. improving) the model (which is already deployed on the embedded system) gets much easier because no code has to be changed. The mode is “just” re-trained and improved.

This way, business rules can be updated and improved by improving the already deployed model in the embedded system. No new development lifecycle, testing and certification and manufacturing process are required. DNP/AISS1 from SSV Software is one example of a hardware starter kit with pre-installed ML algorithms.

Solution #3 => Kafka + IIoT COTS Product as Complementary Solutions for the Digital Twin

The above sections described how to use either an IoT COTS product or an event streaming platform like Apache Kafka and its ecosystem for building a digital twin infrastructure. Interestingly, Kafka and IoT COTS are actually combined in most deployments I have seen so far.

Most IoT COTS products provide out-of-the-box Kafka connectors because end users are asking for this feature all the time. Let’s discuss the combination of Kafka and other IoT products in more detail.

Kafka is Complementary to Industry Solutions such as Siemens MindSphere or Cisco Kinetic

Kafka can be used in very different scenarios. It is s best for building a scalable, reliable and open infrastructure to integrate IoT at the edge and the rest of the enterprise.

However, Kafka is not the silver bullet for every problem. A specific IoT product might be the better choice for integration and processing of IoT interfaces and shop floors. This depends on the specific requirements, existing ecosystem and already used software. Complexity and cost of solutions need to be evaluated. “Build vs. buy” is always a valid question. Often, the best choice and solution is a mix of building an open, flexible, self-built central streaming infrastructure and buying COTS for specific integration and processing scenarios.

Different Combinations of Kafka and IoT Solutions

The combination of an even streaming platform like Kafka with one or more other IoT products or frameworks is very common:

Sometimes, the shop floor is connecting to an IoT solution. The IoT solution is used as gateway or proxy. This can be a broad, powerful (but also more complex and expensive) solution like Siemens MindSphere. Or the choice is to deploy “just” a specific solution, more lightweight solution to solve one problem. For instance, HiveMQ could be deployed as scalable MQTT cluster to connect to machines and devices. This IoT gateway or proxy connects to Kafka. Kafka is either deployed in the same infrastructure or in another data center or cloud. The Kafka cluster connects to the rest of the enterprise.

In other scenarios, Kafka is used as IoT gateway or proxy to connect to the PLCs or Distributed Control System (DCS) directly. Kafka then connects to an IoT Solution like AWS IoT or Google Cloud’s MQTT Bridge where further processing and analytics happen.

Communication is often bidirectional. No matter what architecture you choose. This means the data is ingested from the shop floor, processed and correlated, and finally control events are sent back to the machines. For instance, in predictive analytics, you first train analytic models with tools like TensorFlow in the cloud. Then you deploy the analytic model at the edge for real time predictions.

Eclipse ditto – An Open Source Framework Dedicated to the Digital Twin; with Kafka Integration

There is not just commercial IoT solutions on the market, of course. Kafka is complementary to COTS IoT solutions and to open source IoT frameworks. Eclipse IoT alone provides various different IoT frameworks. Let’s take a look at one of them, which fits perfectly into this blog post:

Eclipse ditto – an open source framework for building a digital twin. With the decoupled principle of Kafka, it is straightforward to leverage other frameworks for specific tasks.

ditto was created to help realizing digital twins with an open source API. It provides features like like Device-as-a-Service, state management for digital twins, and APIs to organize your set of digital twins. Kafka integration is built-in into ditto out-of-the-box. Therefore, ditto and Kafka complement each other very well to build a digital twin infrastructure.

The World is Hybrid and Polyglot

The world is hybrid and polyglot in terms of technologies and standards. Different machines use different technologies and protocols. Each plant uses its own frameworks, products and standards. Business units use different analytics tools and not always the same clouds provider. And so on…

Global Kafka Architecture for Edge / On Premises / Hybrid / Multi Cloud Deployments of the Digital Twin

Kafka is often not just the backbone to integrate IoT devices and the rest of the enterprise. More and more companies deploy multiple Kafka clusters in different regions, facilities and cloud providers.

The right architecture for Kafka deployments depends on the use cases, SLAs and many other requirements. If you want to build a digital twin architecture, you typically have to think about edge AND data centers / cloud infrastructure:

Apache Kafka as Global Nervous System for Streaming Data

Using Kafka as global nervous system for streaming data typically means you spin up different Kafka clusters. The following scenarios are very common:

Local edge Kafka clusters in the shop floors: Each factory has its own Kafka cluster to integrate with the machines, sensors and assembly lines. But also with ERP systems, SCADA monitoring tools, and mobile devices from the workers. This is typically a very small Kafka cluster with e.g. three Brokers (which still can process ~100+ MB/sec). Sometimes, one single Kafka broker is deployed. This is fine if you do not need high availability and prefer low cost and very simple operations.
Central regional Kafka clusters: Kafka clusters are deployed in different regions. Each Kafka cluster is used to ingest, process and aggregate data from different factories in that region or from all cars within a region. These Kafka clusters are bigger than the local Kafka clusters as they need to integrate data from several edge Kafka clusters. The integration can be realized easily and reliable with Confluent Replicator or in the future maybe with MirrorMaker 2 (if it matures over time). Don’t use Mirrormaker 1 at all – you can find many good reasons on the web. Another option is to directly integrate Kafka clients deployed at the edge to a central regional Kafka cluster. Either with a Kafka Client using Java, C, C++, Python, Go or another programming language. Or using a proxy in the middle, like Confluent REST Proxy, Confluent MQTT Proxy, or any MQTT Broker outside the Kafka environment. Find out more details about comparing different MQTT and HTTP-based IoT integration options for Kafka here.
Multi-region or global Kafka clusters: You can deploy one Kafka cluster in each region or continent. Then replicate the data between each other (one- or bidirectional) in real time using Confluent Replicator. Or you can leverage the multi-data center replication Kafka feature from Confluent Platform to spin up one logical cluster over different regions. The latter provides automatic fail-over, zero data loss and much easier operations of server and client side.

This is just a quick summary of deployment options for Kafka clusters at the edge, on premises or in the cloud. You typically combine different options to deploy a hybrid and global Kafka infrastructure. Often, you start small with one pipeline and a single Kafka cluster. Scaling up and rolling out the global expansion should be included into the planning from the beginning. I will speak in more detail about different “architecture patterns and best practices for distributed, hybrid and global Apache Kafka deployments” at DevNexus in Atlanta in February 2020. This is a good topic for another blog post in 2020.

Polyglot Infrastructure – There is no AWS, GCP or Azure Cloud in China and Russia!

For global deployments, you need to choose the right cloud providers or build your own data centers in some different regions.

For example, you might leverage Confluent Cloud. This is a fully managed Kafka service with usage-based pricing and enterprise-ready SLAs in Europe and the US on Azure, AWS or GCP. Confluent Cloud is a real serverless approach. No need to think about Kafka Brokers, operations, scalability, rebalancing, security, fine tuning, upgrades.

US cloud providers do not provide cloud services in China. Alibaba is the leading cloud provider. Kafka can be deployed on Alibaba cloud. Or choose a generic cloud-native infrastructure like Kubernetes. Confluent Operator, a Kubernetes Operator including CRD and Helm charts, is a tool to support and automate provisioning and operations on any Kubernetes infrastructure.

No public cloud is available in Russia at all. The reasons is mainly legal restrictions. Kafka has to be deployed on premises.

In some scenarios, the data from different Kafka clusters in different regions is replicated and aggregated. Some anonymous sensor data from all continents can be correlated to find new insights. But some specific user data might always just stay in the country of origin and local region.

Standardized Infrastructure Templates and Automation

Many companies build one general infrastructure template on a specific abstraction level. This template can then be deployed to different data centers and cloud providers the same way. This standardizes and eases operations in global Kafka deployments.

Cloud-Native Kubernetes for Robust IoT and Self-Healing, Scalable Kafka Cluster

Today, Kubernetes is often choosen as the abstraction layer. Kubernetes is deployed and managed by the cloud provider (e.g. GKE on GCP) or an operations team on premises. All required infrastructure on top of Kubernetes is scripted and automated with a template framework (e.g. Terraform). This can then be rolled out to different regions and infrastructures in the same standardized and automated way.

The same is applicable for the Kafka infrastructure on top of Kubernetes. Either you leverage existing tools like Confluent Operator or build your own scripts and custom resource definitions. The Kafka Operator for Kubernetes has several features built-in, like automated handling of fail-over, rolling upgrades and security configuration.

Find the right abstraction level for your Digital Twin infrastructure:

You can also use tools like the open source Kafka Ansible scripts to deploy and operate the Kafka ecosystem (including components like Schema Registry or Kafka Connect), of course.

However, the beauty of a cloud-native infrastructure like Kubernetes is its self-healing and robust characteristics. Failure is expected. This means that your infrastructure continues to run without downtime or data loss in case of node or network failures.

This is quite important if you deploy a digital twin infrastructure and roll it out to different regions, countries and business units. Many failures are handled automatically (in terms of continuous operations without downtime or data loss).

Not every failure requires a P1 and call to the support hotline. The system continues to run while an ops team can replace defect infrastructure in the background without time pressure. This is exactly what you need to deploy robust IoT solutions to production at scale.

Apache Kafka as the Digital Twin for 100000 Connected Cars

Let’s conclude with a specific example to build a Digital Twin infrastructure with the Kafka ecosystem. We use an implementation from the automotive industry. But this is applicable to any scenario where you want to build and leverage digital twins. Read more about “Use Cases for Apache Kafka in Automotive Industry” here.

Honestly, this demo was not built with the idea of creating a digital twin infrastructure in mind. However, think about it (and take a look at some definitions, architectures and solutions on the web): Digital Twin is just a concept and software design pattern. Remember our definition from the beginning of the article: A digital twin is a digital replica of a living or non-living physical entity. Therefore, the decision of the right architecture and technology depends on the use case(s).

We built a demo which shows how you can integrate with tens or hundreds of thousands IoT devices and process the data in real time. The demo use case is predictive maintenance (i.e. anomaly detection) in a connected car infrastructure to predict motor engine failures: “Building a Digital Twin for Streaming Machine Learning at Scale from 100000 IoT Devices with HiveMQ, Apache Kafka and TensorFlow“.

In this example, the data from 100000 cars is ingested and stored in the Kafka cluster, i.e. the digital twin, for further processing and real time analytics.

Kafka client applications consume the data for different use cases and in different speed:

Real time data pre-processing and data engineering using the data from the digital twin with Kafka Streams and KSQL / ksqlDB.
Streaming model training (i.e. without a data lake in the middle) with the Maschine Learning / Deep Learning framework TensorFlow and its Kafka plugin (part of TensorFlow I/O). In our example, we train two neural networks: An unsupervised Autoencoder for anomaly detection and a supervised LSTM.
Model deployment for inference in real time on new car sensor events to predict potential failures in the motor engine.
Ingestion of the data into another batch system, database or data lake (Oracle, HDFS, Elastic, AWS S3, Google Cloud Storage, whatever).
Another consumer could be a real time time series database like InfluxDB or TimescaleDB.

Build your own Digital Twin Infrastructure with Kafka and its Open Ecosystem!

I hope this post gave you some insights and ideas. I described three options to build the infrastructure for a digital twin: 1) IoT COTS, 2) Kafka, 3) Kafka + Iot COTS. Many companies leverage Apache Kafka as central nervous system for their IoT infrastructure in one way or the other.

Digital Twin is just one of many possible IoT use cases for an event streaming platform. Often, Kafka is “just” part of the solution. Pick and choose the right tools for your use cases. Evaluate the Kafka ecosystem and different IoT frameworks / solutions to find the best combination for your project. Don’t forget to include the vision and long-term planning into your decisions! If you plan it right from the beginning, it is (relative) straightforward to start with a pilot or MVP. Then roll it out to the first plant into production. Over time, deploy a global Digital Twin infrastructure…

The post Apache Kafka as Digital Twin for Open, Scalable, Reliable Industrial IoT (IIoT) appeared first on Kai Waehner.

Apache Kafka in the Automotive Industry

Kai Waehner — Fri, 22 Nov 2019 05:51:28 +0000

In November 2019, I had the pleasure to visit “Motor City” Detroit. I met with several automotive companies, suppliers, startups and cloud providers to discuss use cases and architectures around Apache Kafka. I work with companies related to the German automotive industry for many years. It was great to see the ideas and current status of projects running overseas in the US.

I am really excited about the role of Apache Kafka and its ecosystem in the automotive industry. Kafka became the central nervous system of many applications in various different areas related to automotive industry. Machine Learning also got more and more impact on these use cases.

A Long Journey – From Car Production and Sales to Digital Services

My trip to Detroit started with some private events. First, I watched the amazing football college game of the Michigan Wolverines hosting the Michigan State Spartans in front of 111000 fans in Ann Arbor. The next day I went to the Ford Piquette Plant in Detroit where Henry Ford and his team manufactured the first cars. The Henry Ford Museum was another great visit to learn more about the mastermind Henry Ford and the history of car manufacturing in the US.

The Car Industry is Changing…

I talked to many local people. The main excitement was around three topics in the state of Michigan: Universities, sports and cars. I have to admit that talking about car companies created a much more frustrating and negative mood for many people. They work in the factories and offices of car companies. They were more happy about discussions around sports and universities. Many people are worried about the future of the automotive industry.

Thus, let’s think in more detail about the car business… A car company produces and sells cars. A supplier produces parts for the car. A vendor shop or independent company provides maintenance and repair services.

That was the situation for many decades. Welcome to a new era where car companies have to do much more than just manufacturing and selling cars to stay competitive.

I think people should not be worried, though. This is very exciting. While many old jobs will be done my machines and robots in the future, new tasks and job roles will be created for everybody.

Customer Behaviour has Changed…

Customer behaviour has changed significantly:

Customers expect digital services and integration with their own IT ecosystem (smartphone, music streaming service, smart home, and many other apps)
Customers look for cheap repair services away from the car vendors because it is much cheaper but still a good service
Repair shops use 3rd party suppliers for car parts because the original supplier is too expensive and does not add any value or better product quality
Tesla, Uber, Volkswagen, Daimler and many other automotive and non-automotive companies work on autonomous driving (in several steps and releases: Automatic distance measurement on the highway —> Automatic parking in the parking lot —> Driving back to the entrance of the supermarket when you call your call via your smartphone —> Real autonomous driving to the final destination)
Zipcar, car2go, drivenow and other companies provide car sharing services where car usage is paid by minutes and miles: You use your smartphone to locate the next car at the airport, drive to your home in the city, and leave the car in front of your house so that the next customer can pick it up
Uber, Lyft, Grab, Free Now (the most ridiculous company name you could have for ride sharing) and other services provide cheap ride sharing with fantastic user experience: You order a taxi via mobile app, select guest services (like choice of music, wish of conversations with the driver or not), see expected estimated time of arrival and expected cost, monitor the status of the taxi location in real time, get transported, pay automatically by the integrated payment service, collect points at the integrated and partnered loyalty systems (like Hilton in case of Lyft or MilesAndMore like Free Now to name a few examples)

In short, companies in the automotive industry have to change their business model fundamentally to stay competitive. Otherwise, they will become the next “hardware supplier” for a tech giant. This would reduce profit, establish tough dependencies and create market pressure. Even the existence of the company is at risk. No matter if you are on the market for 10 or 100 years, already.

Digital Use Cases in Automotive Industry

The background about the history and new requirements for the automotive industry is important to understand. Every car-related company can add more value and innovation to the own business model.

The use cases can be separated in infrastructure, manufacturing and customer interaction (while there are some overlaps, of course).

Infrastructure for Connected Cars

A fully connected infrastructure with real time communication is required to build innovative use cases such as:

Connected cars: The “Hello World” example for discussing cutting edge use cases in automotive industry. It includes more or less all challenges you have to solve in Internet of Things (IoT) scenarios, including connectivity to millions of devices, large scale throughput, reliable communication with bad and low network connectivity, real time processing requirements. Ingest and process sensor information from millions of cars in real time to correlate events and. Provide bi-directional communication between cars and other applications. Do big data analytics in the backend to get new insights. Build additional services to the manufacturer, customer and partners.
Fleet Management: A specific example of real time correlation of events from various different systems like mobile apps, hardware integrated into cars and trucks, and backend systems like CRMs and payment systems. Various examples exist in different industries, including logistics of truck fleets, track&trace of package delivery, ridesharing, food services, and many more.
Emergency system: If your car shows abnormal behaviour or crashed into a tree, the car automatically sends a crash notification (including details about the crash scenario) to the police and hospital next to you to send help immediately.
Smart City and Smart Driving: Connect and correlate data of cars with other devices like traffic lights. Automotive companies partner with cities and other data providers to offer better security and more comfortable driving experience. Collect and share basic safety messages. Monitor cars and send speed / slow-down warnings, wrong-way detection, congestion. Provide crowdsourced data with applications such as Waze, Google Maps, Twitter, Nextdoor.

Manufacturing and Industrial IoT (IIoT)

Producing cars is a very complex task. This includes the production of the car itself (the “hardware”) and the intelligent logic (the “software”) of the car. Both integrate deeply with each other to implement use cases such as:

Industrial IoT (IIoT) and smart manufacturing: Predictive maintenance, early part scrapping and improved product quality are some examples to improve the quality of the manufactured parts, reduce costs and risks. The analysis of car sensor data also allows to implement additional digital services on top of the plain car parts.
Autonomous Driving: One of the most interesting use cases. Real time processing of big data sets and using cutting edge technologies such as Neural Networks allows car companies to build car with autonomous driving features. This is still in early stage and will probably take many years before it is available in Europe. But some US states and China are pressing ahead with self-driving cars on public streets, already. This will take some more years in most places. In the meantime, you can already use less powerful but still impressive features like Tesla’s “summon car” feature. This is not working perfectly today, but this is just a matter of time.

Customer Interactions and Data Analytics

Connected infrastructures and self-driving cars are awesome. Another added value comes through the possibility to communicate directly with the drivers to provide additional services, for instance:

Remote Diagnostics: Analytics can happen at the edge (e.g. in the car) or in a data center / cloud. Autonomous driving needs to detect persons in real time. Thus, this needs to be applied in the car. Cross selling is an example where you correlate data between different backend systems and remote users or apps. Therefore, this should happen in the backend. You just send the result of the correlation back to the car or mobile app of the driver. Predictive maintenance can happen directly in the car or machine. Or you correlate the events in the backend and just send the alerts to the car or machine to stop it before the engine breaks. No matter where you deploy the analytics logic, it typically has to happen in real time. Usually within milliseconds or seconds, but sometimes minutes or even hours are sufficient.
Identity Management: The days are over where you use a key (a piece of hardware) to open your car and start your engine. Car sharing services provide a mobile app. This is used for registration, payments and handing over the key (for the time of usage). Security (Authentication, authorization and encryption of data) are getting more and more important in IoT. This is even true in Industrial IoT where security discussions did not exist at all in the past. Car sharing is a great example where this is a key requirement to deploy an infrastructure and service successfully. The next step several companies are working on (long live China!) is image recognition instead of using a physical key, or smartphone to open your car and start the engine. Other use cases include improving the customer experience in the car. For instance, default configuration of seat, radio and air conditioning can be based on the driver’s historical behaviour (stored forever in your favorite database) and the outside ecosystem (e.g. integration with a weather service to know how warm it is). You will probably also not need to carry your driver’s license with you in the future because police will also just scan your face. I was pretty impressed when I just had to scan my face at Global Entry entering the US border coming from Germany last week, instead of scanning my passport and entering all the annoying details about my travel and private information.
Aftersales: Car companies and manufacturers need to sell additional services to make more money and / or to make customers happy and loyal. This includes use cases like upselling (provide 100 extra horsepower via digital download for 24 hours to have some fun on German Autobahn) and cross-selling (provide a 20% coupon for the partnered steak house coming at the next stop 10 miles always from your current driving location).
Payment Integration: In the US, it is common to pay everything by credit card. In Europe, many shops still just accept cash. In the future, payments will be done automatically. You just go to the gas station or electric charging unit and refuel respectively load your car and leave. Similar to existing an Uber or Lyft car today. The payment is done via the integrated partner payment system of your choice. Once again, loyalty systems are integrated, too. Authentication and authorization are done via cameras and neural networks doing image recognition.
Data Monetization: Connected car infrastructures produce a lot of data. Depending on law, privacy options and other regulations, automotive companies can and will do its own analytics, but also sell the data or parts of the data to partners (hopefully anonymized properly). Like Google or Facebook today, the automotive industry will be able to make a fortune with the data of the cars and all the connected partner systems. Most customers will probably agree to this “feature” because they get added value out of this, too (like discounts or “free” additional services).

Wow, this is a lot of use cases for digitalization of the automotive business, isn’t it? And there are many more to come…

Challenges and Requirements to Build Automotive Infrastructure at Scale

Let’s now quickly discuss the challenges and requirements to solve with all those exciting use cases:

Integration between many different consumer apps and backend systems (including many legacy and proprietary interfaces, and different communication paradigms like real time streaming, request-response, batch processing)
Big Data – sensors and millions of mobile apps from users produce a large set of (structured and unstructured) data continuously
Hybrid integration and deployments at the edge (e.g. cars), local proxies in different regions, backend applications in data centers or cloud, and SaaS applications
Global scale with zero downtime, rolling out the applications on different infrastructures (there is no AWS / Azure / GCP) in China but only Alibaba, and there is no cloud at all in Russia (just private clouds on premises).
Real time information is required in most scenarios to provide a good digital experience and added business value

This is a lot of challenges to solve. If you know Apache Kafka already well (and maybe also other traditional middleware), then you can probably guess why Kafka is such a good fit here.

Apache Kafka in the Automotive Industry

After we understood the various innovative use cases and its related challenges in the automotive industry, let’s now discuss (quickly) why so many automotive companies and suppliers use Apache Kafka as central nervous system for the discussed use cases.

Kafka for Real Time Streaming at Scale

Car sensors produce data continuously. The more cars you connect, the more data you get. Kafka provides messaging and streaming at scale, providing a reliable, highly scalable infrastructure with zero downtime. Rolling upgrades, backwards compatibility, up- and downscale at runtime, and other features are built-in.

Kafka for Handling Backpressure and Decoupling of Services

Cars are decoupled from the backed systems and mobile apps. Kafka provides storage for backpressure and decoupling of services. It is not just a messaging system! Kafka also stores data as long as you want. For instance, a Kafka topic is stored for a few days for log analytics. Another Kafka topic is stored for years to analyze customer and payment transactions of the past. Order guarantees, timestamps and high availability are provided out of-the-box. Connected applications and data stores can build a materialized view to leverage the data.

Kafka for Integration with Legacy and Modern Technologies

Car data needs to be integrated. Kafka is the backbone for backend systems (legacy like ERP, Mainframe + modern SaaS like Cloud CRM, big data analytics). Connectivity via Kafka Connect provides Kafka-native integration at scale to any legacy or modern technology and communication paradigm. Either directly from Kafka clients or via other technologies as proxy like MQTT to cars, Websockets to mobile apps, via proprietary protocols like Siemens S7 or open standards like OPC-UA to machines in plants). Check out my comparison between Kafka and Middleware (MQ, ETL, ESB) for more details.

Kafka for Continuous Stream Processing with Kafka Streams and ksqlDB

Car data needs to be processed and correlated in real time at scale with other backend databases and applications. Kafka Streams or ksqlDB provide Kafka-native capabilities to do process data continuously, for use cases like Streaming ETL, stateful aggregations (sliding windows) and building business applications with their own state (leveraging rocksdb under the hood – no need for another database.

Machine Learning and Kafka in Automotive Use Cases

Kafka is used more and more in Machine Learning infrastructures. Some examples from tech giants are Uber, Netflix or Paypal. The big challenge about Machine Learning is the deploy at scale in a reliable way (for both model training and predictions). I covered this topic in details in various blog posts, videos and demos. Check out this blog post to get started: “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka“.

Kafka for Data Engineering, Model Training + Deployment and Monitoring

In the Automotive Industry, Machine Learning needs to be applied at scale. Predictions often need to happen in real time. Therefore, more and more automotive-related use cases discussed in the above section leverage Kafka together with Machine Learning for different aspects like:

Data ingestion and processing of data with Kafka from cars and other systems for both model training of historical data and real time predictions on new events – ideally using the same ingestion and pre-processing pipeline for both.
Streaming model training with Kafka without an additional database in the middle (e.g. leveraging TensorFlow I/O and its native Kafka integration) or the storage in a data lake (e.g. HDFS, AWS S3) for model training.
Model deployment with Kafka at the edge (e.g. in autonomous driving for stopping the car via image recognition because a person is on the street in front of the car) or in the data center / cloud (e.g. predictions for cross selling by correlation of real time behaviour together with correlated information from the loyalty and CRM systems in backend). Check out the video and slides discussing “Event-driven Model Serving: Stream Processing vs. RPC” from Kafka Summit for more details.
Monitoring with Kafka (e.g. real time alerting, or distributed tracing to detect anomalies and the root cause of problems).

One huge advantage of using Kafka for Machine Learning infrastructures is solving the impedance mismatch between the data scientist (loving rapid prototyping with Python and Jyipter) and the software developer / production engineer (loving scalable and reliable Java applications).

Kafka Architecture – From Edge Deployments to Global Replication

The architecture for Kafka deployments depends on the use cases, SLAs and many other requirements. Automotive companies and suppliers typically have to think globally:

Using Kafka as global nervous system for streaming data typically means you spin up different Kafka clusters. The following scenarios are very common in the automotive industry:

Local edge Kafka clusters in the factories: Each factory has its own Kafka cluster to integrate with the machines, sensors and assembly lines. But also with ERP systems, SCADA monitoring tools, and mobile devices from the workers. This is typically a very small Kafka cluster with e.g. three Brokers (which still can process ~100+ MB/sec). Sometimes, just one single Kafka broker is deployed. This is fine if you do not need high availability and prefer low cost and very simple operations.
Central regional Kafka clusters: Kafka clusters are deployed in different regions. Each Kafka cluster is used to ingest, process and aggregate data from different factories in that region or from all cars within a region. These Kafka clusters are bigger than the local Kafka clusters as they need to integrate data from several edge Kafka clusters. The integration can be realized easily and reliable with Confluent Replicator or in the future maybe with MirrorMaker 2. Don’t use Mirrormaker 1 at all – you can find many good reasons on the web). Another option is to directly integrate Kafka clients deployed at the edge to a central regional Kafka cluster. Either with a Kafka Client using Java, C, C++, Python, Go or another programming language. Or using a proxy in the middle, like Confluent REST Proxy, Confluent MQTT Proxy, or any MQTT Broker outside the Kafka environment. Find out more details about comparing different MQTT and HTTP-based IoT integration options for Kafka here.
Global Kafka clusters: You can deploy one Kafka cluster in each region or continent. Then replicate the data between each other (one- or bidirectional) in real time using Confluent Replicator. Or you can leverage the multi-data center replication Kafka feature from Confluent to spin up one logical cluster over different regions.

This is just a quick summary of deployment options for Kafka clusters at the edge, on premises or in the cloud. You typically combine different options to deploy a hybrid and global Kafka infrastructure. I will speak about different “architecture patterns for distributed, hybrid and global Apache Kafka deployments” at DevNexus in Atlanta in February 2020.

For example, you might leverage Confluent Cloud, a fully managed Kafka service with usage-based pricing and enterprise-ready SLAs in Europe and the US on Azure, AWS or GCP. But you need to use Alibaba cloud in China and used self-managed Kafka there deployed to Kubernetes and operated by Confluent Operator. In Russia, no public cloud is available at all. You need to deploy on premises and manage the cluster by yourself. Then you replicate and aggregate some anonymous sensor data from all continents to get new aggregated insights. But some specific user data might always just stay in the country of origin and local region.

Live Demo – 100000 Connected Cars

We built a live demo (available on Github) to show how easy it is to set up a connected car infrastructure at scale (including a cutting edge streaming machine learning example for predictive maintenance in real time): Streaming Machine Learning at Scale from 100000 IoT Devices with HiveMQ, Apache Kafka and TensorfLow:

Such a setup can be the foundation for most of the discussed use cases in this blog post.

You can also take a look at the following link to see the slide deck and video recording. They discuss the demo architecture and walking you through the live demo:

Please let me know your thoughts about Apache Kafka in the automotive industry. Try out the demo and share your feedback. Let’s build some exciting new and innovative use cases for the automotive industry.

The post Apache Kafka in the Automotive Industry appeared first on Kai Waehner.

IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow

Kai Waehner — Fri, 08 Nov 2019 13:38:34 +0000

You want to see an Internet of Things (IoT) example at huge scale? Not just 100 or 1000 devices producing data, but a really scalable demo with millions of messages from tens of thousands of devices? This is the right demo for you! we leveraging Kubernetes, Apache Kafka, MQTT and TensorFlow.

The demo shows how you can integrate with tens or hundreds of thousands IoT devices and process the data in real time. The demo use case is predictive maintenance (i.e. anomaly detection) in a connected car infrastructure to predict motor engine failures:

IoT Infrastructure – MQTT and Kafka on Kubernetes

We deploy Kubernetes, Kafka, MQTT and TensorFlow in a scalable, cloud-native infrastructure to integrate and analyse sensor data from 100000 cars in real time. The infrastructure is built with Terraform. We use GCP, but you could do the same on AWS, Azure, Alibaba or on premises.

Data processing and analytics is done in real time at scale with GCP GKE, HiveMQ, Confluent and TensorFlow I/O for streaming machine learning / deep learning and bi-directional communication in a scalable, elastic and reliable infrastructure:

Github Project – 100000 Connected Cars

The project is available on Github. You can set the demo up in ~30min by just installing a few CLI tools and executing two or three shell scripts.

Check out the Github project “Streaming Machine Learning at Scale from 100000 IoT Devices with HiveMQ, Apache Kafka and TensorFlow“.

Please try out the demo. Feedback and PRs are welcome.

20min Live Demo – IoT at Scale on GCP with GKE, Confluent, HiveMQ and TensorFlow IO

Here is the video recording of the live demo:

If your area of interest is Industrial IoT (IIoT), you might also check out the following example. It covers the integration of machines and PLCs like Siemens S7, Modbus or Beckhoff in factories and shop floors:

Apache Kafka, KSQL and Apache PLC4X for IIoT Data Integration and Processing

The post IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow appeared first on Kai Waehner.