Blockchain Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/blockchain/

Apache Kafka in Crypto and FinServ for Cybersecurity and Fraud Detection
https://www.kai-waehner.de/blog/2022/04/29/apache-kafka-crypto-finserv-cybersecurity-fraud-detection-real-time/ (Fri, 29 Apr 2022)

The insane growth of the crypto and fintech market brings many unknown risks and successful cyberattacks to steal money and crypto coins. This post explores how data streaming with the Apache Kafka ecosystem enables real-time situational awareness and threat intelligence to detect and prevent hacks, money loss, and data breaches. Enterprises stay compliant with the law and keep customers happy in any innovative Fintech or Crypto application.

Cybersecurity in Crypto and FinTech with Data Streaming and Apache Kafka

The Insane Growth of Crypto and FinTech Markets

The crypto and fintech markets are growing like crazy. Not every new crypto coin or blockchain is successful, and only a few fintechs like Robinhood in the US or Trade Republic in Europe have made it big. In recent months, the crypto market has been a bear market (I am writing this in April 2022).

Nevertheless, the overall global interest, investment, and growth in this market are unbelievable. Here is just one of many impressive statistics:

One in five US adults has invested in, traded, or used a cryptocurrency like Bitcoin or Ethereum in 2022

This survey came from NBC News, but you can find similar information in many other news portals across the globe.

The Threat is Real: Data Breaches, Hacks, Stolen Crypto!

With the growth of cryptocurrencies, blockchains, crypto + NFT markets in conjunction with very intuitive crypto trading mobile apps, and popular “normal” trading apps adding crypto support, cyberattacks are more dangerous than ever before.

Let’s look at two of the many recent successful cyberattacks against crypto markets to steal cryptocurrencies and explain why any crypto marketplace or trading app can be the next victim.

Supply Chain Attacks for Cyberattacks

While it feels safer to trust a well-known crypto exchange (say Binance, Coinbase, or Crypto.com), appearances are deceiving. Many successful cyberattacks these days, in the crypto and non-crypto world alike, happen via supply chain attacks:

Supply Chain Attack for Data Breaches in Cybersecurity

A supply chain attack means that even if your own infrastructure and applications are secure, attackers can still get in via your certified B2B partners (like your CRM system or a 3rd-party payment integration). If your software or hardware partner gets hacked, the attacker gains access to you as well.

Hence, a continuous internal cybersecurity strategy with real-time data processing and a zero-trust approach is the only suitable option to provide your customers with a trustworthy and secure environment.

Examples of Successful Crypto Cyberattacks

There are so many successful hacks in the crypto space. Many don’t even make it into the prominent newspapers, even though coins worth millions of dollars are usually stolen.

Let’s look at two examples of successful supply chain attacks:

  • Hubspot CRM was hacked. Consequently, the crypto companies BlockFi, Swan Bitcoin, and Pantera had to advise users on how to stay safe. (source: Crypto News)
  • A MailChimp “insider” carried out a phishing attack by sending malicious links to users of the platform. This included a successful phishing campaign to steal funds stored in Trezor, a popular cryptocurrency hardware wallet company. (source: Crypto Potato)

Obviously, this is not just a problem for crypto and fintech enterprises. Any other customer of hacked software needs to act the same way. For context, I chose crypto companies in the above examples.

Cybersecurity: Situational Awareness and Threat Intelligence with Apache Kafka

Real-time cybersecurity is mandatory to fight cyberattacks successfully. I wrote a blog series about how data streaming with Apache Kafka helps secure any infrastructure. Learn about use cases, architectures, and reference deployments for Kafka in the cybersecurity space:

Cybersecurity with Apache Kafka for Crypto Markets

Many crypto markets today use data streaming with Apache Kafka for various use cases. If done right, Kafka provides a secure, tamper-proof, encrypted data hub for processing events in real-time and for doing analytics of historical events with one scalable infrastructure:

Kafka for data processing in the crypto world
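
To give a feel for what such a secured data hub means for client applications: encryption in transit and authentication are plain client configuration in Kafka. The following Java sketch shows a minimal set of security properties – hostname, credentials, truststore path, and the chosen SASL mechanism are placeholders, not a recommendation for a specific deployment:

```java
import java.util.Properties;

public class SecureClientConfig {

    // Minimal sketch of an encrypted, authenticated Kafka client connection
    // (TLS in transit + SASL/SCRAM authentication). All values are placeholders.
    public static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.example.com:9093");
        props.put("security.protocol", "SASL_SSL");   // TLS encryption on the wire
        props.put("sasl.mechanism", "SCRAM-SHA-512"); // one common authentication option
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"svc-fraud-detection\" password=\"<secret>\";");
        props.put("ssl.truststore.location", "/etc/kafka/secrets/truststore.jks");
        props.put("ssl.truststore.password", "<secret>");
        return props;
    }
}
```

Authorization (e.g., ACLs) and payload-level encryption come on top of this, depending on the deployment and the Kafka distribution in use.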

If you want to learn more about “Kafka and Crypto” use cases, architectures, and success stories, check out this blog: Apache Kafka as Data Hub for Crypto, DeFi, NFT, Metaverse – Beyond the Buzz.

Kafka Architecture for Real-Time Cybersecurity in a Crypto Infrastructure

Let’s now look at a concrete example for integrating, correlating, and applying transactional and analytical information in a crypto environment with the power of the Kafka ecosystem. Here is the overview:

Real-Time Cyber Analytics in the Crypto Backbone with Kafka

Data Producers from Blockchains, Crypto Markets, and the CRM system

Data comes from various sources:

  • Back-end applications include internal payment processors, fraud applications, customer platforms, and loyalty apps.
  • Third-party crypto and trading marketplaces like Coinbase, Binance, and Robinhood and direct transaction data from blockchains like Bitcoin or Ethereum.
  • External data and customer SaaS such as Salesforce or Snowflake.

The data includes business information, transactional workloads, and technical logs at different volumes and is integrated via various technologies, communication paradigms, and APIs:

Data Producers from Blockchains, Crypto Markets, and the CRM system

Streaming ETL at any scale is a vital strength of the Kafka ecosystem and is often the first choice in data integration, ETL, and iPaaS evaluations. It is also widespread to combine transactional and analytical workloads within Kafka as the event data hub.
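
To make the producer side of such a streaming ETL pipeline concrete, here is a minimal Java sketch that publishes a transaction event into the data hub. The topic name, key, and JSON payload are assumptions for illustration, not part of a reference architecture:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait for all in-sync replicas - no data loss for transactional events

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String walletId = "wallet-4711"; // hypothetical key
            String event = "{\"wallet\":\"wallet-4711\",\"asset\":\"BTC\",\"amount\":0.05,\"ts\":1651221600}";
            // keying by wallet keeps all events of one customer ordered within a partition
            producer.send(new ProducerRecord<>("crypto.transactions", walletId, event));
        } // close() flushes any pending records
    }
}
```

Keying by wallet (or account) ID keeps all events of one customer in order within a partition, which simplifies the downstream correlation described next.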

Real-Time Data Processing for Crypto Threat Intelligence with Machine Learning

The key benefit is not sending data from A to B in real-time but correlating the data from different sources. This enables detecting suspicious events that might be the consequence of a cyberattack:

Real-Time Data Processing for Crypto Threat Intelligence

AI and Machine Learning help build more advanced use cases and are very common in the Kafka world.
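
To make the correlation idea tangible, here is a hedged Kafka Streams sketch (assuming Kafka Streams 3.x and String-serialized JSON values) that joins withdrawal events with suspicious login events for the same wallet within a ten-minute window and emits an alert. The topic names and the detection rule are hypothetical:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

import java.time.Duration;
import java.util.Properties;

public class ThreatCorrelation {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // both topics are assumptions; keys are wallet IDs, values are JSON strings
        KStream<String, String> withdrawals = builder.stream("crypto.withdrawals");
        KStream<String, String> suspiciousLogins = builder.stream("auth.suspicious-logins");

        // flag withdrawals that happen within 10 minutes of a suspicious login on the same wallet
        KStream<String, String> alerts = withdrawals.join(
                suspiciousLogins,
                (withdrawal, login) -> "{\"alert\":\"withdrawal-after-suspicious-login\","
                        + "\"withdrawal\":" + withdrawal + ",\"login\":" + login + "}",
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(10)));

        alerts.to("cyber.alerts");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "crypto-threat-intelligence");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```

In practice, such static rules are usually combined with machine learning models that score events, as mentioned above.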

Data Consumers for Alerting and Regulatory Reporting

Real-time situational awareness and threat intelligence are the most crucial application of data streaming in the cybersecurity space. Additionally, many other data sinks consume the data, for instance, for compliance, regulatory reporting, and batch analytics in a data lake or lakehouse:

Data Consumers for Alerting and Regulatory Reporting with Kafka

Kafka enables a Kappa architecture that simplifies real-time AND batch architectures compared to the much more complex and costly Lambda architecture.
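
The “batch” side of the Kappa architecture can be as simple as a consumer that replays the retained log with a fresh consumer group, for example for a compliance report. A minimal sketch (topic and group names are assumptions):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ComplianceReplay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("group.id", "compliance-report-2022-04"); // fresh group per report run
        props.put("auto.offset.reset", "earliest");          // no committed offsets -> start at the oldest retained event
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("crypto.transactions")); // topic name is an assumption
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
                // a real report job would stop once it reaches the current end offsets
                // and write its results to the reporting sink
            }
        }
    }
}
```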

Data Streaming with Kafka to Fight Cyberattacks in the Crypto and FinTech Space

Supply chain attacks require not just a secure environment but continuous threat intelligence. Data streaming with the Apache Kafka ecosystem builds the foundation. The example architecture showed how to integrate with internal systems, external blockchains, and crypto markets to correlate data in motion.

Kafka is not a silver bullet but the backbone to provide a scalable real-time data hub for your mission-critical cybersecurity infrastructure. If you deploy cloud-native applications (like most fintech and crypto companies), check out serverless data architectures around Kafka and Data Lakes and compare Kafka alternatives in the cloud, like Amazon MSK, Confluent Cloud, or Azure Event Hubs.

How do you use Apache Kafka with cryptocurrencies, blockchain, or other fintech applications? Do you deploy in the public cloud and leverage a serverless Kafka SaaS offering? What other technologies do you combine with Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka as Data Hub for Crypto, DeFi, NFT, Metaverse – Beyond the Buzz
https://www.kai-waehner.de/blog/2022/02/04/apache-kafka-as-data-hub-for-crypto-defi-nft-metaverse-beyond-the-buzz/ (Fri, 04 Feb 2022)

Decentralized finance with crypto and NFTs is a huge topic these days. It becomes a powerful combination with the coming metaverse platforms from social networks, cloud providers, gaming vendors, sports leagues, and fashion retailers. This blog post explores the relationship between crypto technologies like Ethereum, blockchain, NFTs, and modern enterprise architecture. I discuss how event streaming and Apache Kafka help build innovation and scalable real-time applications of a future metaverse. Let’s skip the buzz (and NFT bubble) and instead review practical examples and existing real-world deployments in the crypto and blockchain world powered by Kafka and its ecosystem.

Event Streaming as the Data Hub for Crypto NFT and Metaverse

What are crypto, NFT, DeFi, blockchain, smart contracts, metaverse?

I assume that most readers of this blog post have a basic understanding of the crypto market and event streaming with Apache Kafka. The target audience should be interested in the relationship between crypto technologies and a modern enterprise architecture powered by event streaming. Nevertheless, let’s explain each buzzword in a few words to have the same understanding:

  • Blockchain: Foundation for cryptocurrencies and decentralized applications (dApp) powered by digital distributed ledgers with immutable records
  • Smart contracts: dApps running on a blockchain like Ethereum
  • DeFi (Decentralized Finance): Group of dApps to provide financial services without intermediaries
  • Cryptocurrency (or crypto): Digital currency that works as an exchange through a computer network that is not reliant on any central authority, such as a government or bank
  • Crypto coin: Native coin of a blockchain to trade currency or store value (e.g., Bitcoin in the Bitcoin blockchain or Ether for the Ethereum platform)
  • Crypto token: Similar to a coin, but uses another coin’s blockchain to provide digital assets – the functionality depends on the project (e.g., developers built plenty of tokens for the Ethereum smart contract platform)
  • NFT (Non-fungible Token): Non-interchangeable, uniquely identifiable unit of data stored on a blockchain (that’s different from Bitcoin where you can “replace” one Bitcoin with another one), covering use cases such as identity, arts, gaming, collectibles, sports, media, etc.
  • Metaverse: A network of 3D virtual worlds focused on social connections. This is not just about Meta (formerly Facebook). Many platforms and gaming vendors are creating their metaverses these days. Hopefully, open protocols enable interoperability between different metaverses, platforms, APIs, and AR/VR technologies. Crypto and NFTs will probably be critical factors in the metaverse.

Cryptocurrency and DeFi marketplaces and brokers are used for trading between fiat and crypto or between two cryptocurrencies. Other use cases include long-term investments and staking. The latter compensates for locking your assets in a Proof-of-Stake consensus network that most modern blockchains use instead of the resource-hungry Proof-of-Work used in Bitcoin. Some solutions focus on providing services on top of crypto (monitoring, analytics, etc.).

Don’t worry if you don’t know what all these terms mean. A scalable real-time data hub is required to integrate crypto and non-crypto technologies to build innovative solutions for crypto and metaverse.

What’s my relation to crypto, blockchain, and Kafka?

It might help to share my background with blockchain and cryptocurrencies before I explore the actual topic about its relation to event streaming and Apache Kafka.

I worked with blockchain technologies 5+ years ago at TIBCO. I implemented and deployed smart contracts with Solidity (the smart contract programming language for Ethereum). I integrated blockchains such as Hyperledger and Ethereum with ESB middleware. And yes, I bought some Bitcoin for ~500 dollars. Unfortunately, I sold them afterward for ~1000 dollars as I was curious about the technology, not the investment part. 🙂

Middleware in real-time is vital for integration

I gave talks at international conferences and published a few articles about blockchain. For instance, “Blockchain – The Next Big Thing for Middleware” at InfoQ in 2016. The article is still pretty accurate from a conceptual perspective. The technologies, solutions, and vendors have evolved since then, though.

I thought about joining a blockchain startup. Coincidentally, one company I was talking to was building a “next-generation blockchain platform for transactions in real-time at scale”, powered by Apache Kafka. No joke.

However, I joined Confluent in 2017. I thought that processing data in motion at any scale for transactional and analytics workloads is the more significant paradigm shift. “Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing” describes my decision in 2017. I think I was right. 🙂 Today, most enterprises leverage Kafka as an alternative middleware for MQ, ETL, and ESB tools or implement cloud-native iPaaS with serverless Kafka.

Blockchain, crypto, and NFT are here to stay (for some use cases)

Blockchain and crypto are here to stay, but they are not needed for every problem. Blockchain is a niche technology. Within that niche, it is excellent. TL;DR: You only need a blockchain in untrusted environments. The famous example of supply chain management is valid. Cryptocurrencies and smart contracts are also here to stay. Partly for investment, partly for building innovative new applications.

Today, I work with customers across the globe. Crypto marketplaces, blockchain monitoring infrastructure, and custodian banking platforms are built on Kafka for good reasons (scale, reliability, real-time). The key to success for most customers is integrating crypto and blockchain platforms and the rest of the IT infrastructure, like business applications, databases, and data lakes.

Trust is important, and trustworthy marketplaces (= banks?) are needed (for some use cases)

Privately, I own several cryptocurrencies. I am not a day-trader. My strategy is a long-term investment (but only a fraction of my total investment; only money I am okay with losing 100% of). I invest in several coins and platforms, including Bitcoin, Ethereum, Solana, Polkadot, Chainlink, and a few even more risky ones. I firmly believe that crypto and NFTs are a game-changer for some use cases like gaming and metaverse, but I also think paying hundreds of thousands of dollars for a digital ape is insane (and just an investment bubble).

I sold my Bitcoins in 2016 because of the missing trustworthy marketplaces. I don’t care too much about the decentralization of my long-term investment. I do not want to manage my own cold storage, write a long and complex recovery code on paper, and put it into my safe. I want a secure, trustworthy custodian that takes over this burden.

For that reason, I use compliant German banks for crypto investments. If a coin is unavailable there, I go to an international marketplace that feels trustworthy. For instance, the exchange Crypto.com and the NFT marketplace OpenSea recently did a great job letting their insurance pay for a hack and the loss of customer coins and NFTs, respectively. That’s what I expect as a customer and why I am happy to pay a small fee for buying and selling cryptocurrencies or NFTs on such a platform.

“The False Promise of Web3” is a great read to understand why many crypto and blockchain discussions are not really about decentralization. As the article says, “the advertised decentralization of power out of the hands of a few has, in fact, been a re-centralization of power into the hands of fewer”. I am a firm believer in the metaverse, crypto, NFT, DeFi, and blockchain. However, I am fine if some use cases are centralized and provide regulation, compliance, security, and other guarantees.

With this long background story, let’s explore the current crypto and blockchain world and how this relates to event streaming and Apache Kafka.

Use cases for the metaverse and non-fungible tokens (NFT)

Let’s start with some history:

  • 1995: Amazon was just an online shop for books. I bought my books in my local store.
  • 2005: Netflix was just a DVD-by-mail service. Technology changed over time, but I also rented HD DVDs, Blu-rays, and other media the same way.
  • 2022: I will not use Zuckerberg’s Metaverse. Never. It will fail like Second Life (which was launched in 2003). And I don’t get the noise around NFT. It is just a jpeg file that you can right-click and save for free.

Well, we are still in the early stages of cryptocurrencies, and in the very early stages of Metaverse, DeFi, and NFT use cases and business models. However, this is not just hype like the Dot-com bubble in the early 2000s. Tech companies have exciting business models. And software is eating every industry today. Profit margins are enormous, too.

Let’s explore a few use cases where Metaverse, DeFi, and NFTs make sense:

Billions of players and massive revenues in the gaming industry

The gaming industry is already bigger than all other media categories combined, and this is still just the beginning of a new era. Millions of new players join the gaming community every month across the globe.

Connectivity and cheap smartphones are sold in less wealthy countries. New business models like “play to earn” change how the next generation of gamers plays a game. More scalable and low latency technologies like 5G enable new use cases. Blockchain and NFT (Non-Fungible Token) are changing the monetization and collection market forever.

NFTs for identity, collections, ticketing, swaggering, and tangible things

Let’s forget that Justin Bieber recently purchased a Bored Ape Yacht Club (BAYC) NFT for $1.29 million. That’s insane and likely a bubble. However, many use cases make a lot of sense for NFTs, not just in a (future) virtual metaverse but also in the real world. Let’s look at a few examples:

  • Sports and other professional events: Ticketing for controlled pricing, avoiding fraud, etc.
  • Luxury goods: Transparency, traceability, and security against tampering; the handbag, the watch, the necklace: for such luxury goods, NFTs can be used as certificates of authenticity and ownership.
  • Arts: NFTs can be created in such a way that royalty and license fees are donated every time they are resold
  • Carmakers: The first manufacturers link new cars with NFTs. Initially for marketing, but also to increase their resale value in the medium term. With the help of the blockchain, I can prove the number of previous owners of a car and provide reliable information about the repair and accident history.
  • Tourism: Proof of attendance. Climbed Mount Everest? Walked at the Grand Canyon? Visited Hawaii for marriage? With a badge on the blockchain, this can be proven once and for all.
  • Non-Profit: The technology is ideal for fundraising in charities – not only because it is transparent and decentralized. Every NFT can be donated or auctioned off for a good cause so that this new way of creating value generates new funds for the benefit of charitable projects. Vodafone has demonstrated this by selling the world’s first SMS.
  • Swaggering: Twitter already allows you to connect your account to your crypto wallet to set up an NFT as your profile picture. Steph Curry of the NBA team Golden State Warriors presents his 55 ETH (~USD 180,000) Bored Ape Yacht Club NFT as his Twitter profile picture at the time of writing this blog.

With this in mind, I can think about plenty of other significant use cases for NFTs.

Metaverse for new customer experiences

If I think about the global metaverse (not just the Zuckerberg one), I see so many use cases that even I could imagine using them:

  • How many (real) dollars would you pay if your (virtual) house were next to your favorite actor or musician in a virtual world and you could speak with their digital twin (trained by the actual human) every day?
  • How many (real) dollars would you pay if this actor visited his house once a week so that virtual neighbors or lottery winners could talk to the real person via virtual reality?
  • How many (real) dollars would you pay if you could bring your Fortnite shotgun (from a game from another gaming vendor) to this meeting and use it in the clay pigeon shooting competition with the neighbor?
  • Give me a day, and I will add 1000 other items to this wish list…

I think you get the point. NFTs and metaverse make sense for many use cases. This statement is valid from the perspective of a great customer experience and to build innovative business models (with ridiculous profit margins).

So, finally, we come to the point of talking about the relation to event streaming and Apache Kafka.

Kafka inside a crypto platform

First, let’s understand how to qualify if you need a truly distributed, decentralized ledger or blockchain. Kafka is sufficient most times.

Kafka is NOT a blockchain, but a distributed ledger for crypto

Kafka is not a blockchain, but a distributed commit log. Many concepts and foundations of Kafka are very similar to a blockchain. It provides many characteristics required for real-world “enterprise blockchain” projects:

  • Real-Time
  • High Throughput
  • Decentralized database
  • Distributed log of records
  • Immutable log
  • Replication
  • High availability
  • Decoupling of applications/clients
  • Role-based access control to data

I explored this in more detail in my post “Apache Kafka and Blockchain – Comparison and a Kafka-native Implementation“.

Do you need a blockchain? Or just Kafka and crypto integration?

A blockchain increases the complexity significantly compared to traditional IT projects. Do you need a blockchain or distributed ledger (DLT) at all? Qualify out early and choose the right tool for the job!

Use Kafka for

  • Enterprise infrastructure
  • Real-time data hub for transactional and analytical workloads
  • Open, scalable, real-time data integration and processing
  • True decoupling between applications and databases with backpressure handling and data replayability
  • Flexible architectures for many use cases
  • Encrypted payloads

Use a real blockchain / DLTs like Hyperledger, Ethereum, Cardano, Solana, et al. for

  • Deployment over various independent organizations (where participants verify the distributed ledger contents themselves)
  • Specific use cases
  • Server-side managed and controlled by multiple organizations
  • Scenarios where the business value overturns the added complexity and project risk

Use Kafka and Blockchain together to combine the benefits of both for

  • Blockchain for secure communication over various independent organizations
  • Reliable data processing at scale in real-time with Kafka as a side chain or off-chain from a blockchain
  • Integration between blockchain / DLT technologies and the rest of the enterprise, including CRM, big data analytics, and any other custom business applications

The last section shows that Kafka and blockchain or crypto technologies are complementary. For instance, many enterprises use Kafka as the data hub between crypto APIs and enterprise software.

It is pretty straightforward to build a metaverse without a blockchain (if you don’t want or need to offer true decentralization). Look at this augmented reality demo powered by Apache Kafka to understand how the metaverse is built with modern technologies.

Kafka as a component of blockchain vs. Kafka as middleware for blockchain integration

Some powerful DLTs or blockchains are built on top of Kafka. See the example of R3’s Corda in the next section.

In some other use cases, Kafka is used to implement a side chain or off-chain platform, as the original blockchain does not scale well enough (the data on the blockchain itself is known as on-chain data). Bitcoin is not the only chain with the problem of processing just single-digit (!) transactions per second. Most modern blockchain solutions cannot scale even close to the workloads Kafka processes in real-time.

Having said this, more interestingly, I see more and more companies using Kafka within their crypto trading platforms, market exchanges, and token trading marketplaces to integrate between the crypto and the traditional IT world.

Here are both options:

Apache Kafka and Blockchain - DLT - Use Cases and Architectures

R3 Corda – A distributed ledger for banking and the finance industry powered by Kafka

R3’s Corda is a scalable, permissioned peer-to-peer (P2P) distributed ledger technology (DLT) platform. It enables the building of applications that foster and deliver digital trust between parties in regulated markets.

Corda is designed for the banking and financial industry. The primary focus is on financial services transactions. The architectural designs are simple when compared to true blockchains. Evaluate requirements such as time to market, flexibility, and use case (in)dependence to decide if Corda is sufficient or not.

Corda’s architectural history looks like many enterprise architectures: A messaging system (in this case, RabbitMQ) was introduced years ago to provide a real-time infrastructure. Unfortunately, the messaging solution does not scale as needed. It does not provide all essential features like data integration, data processing, or storage for true decoupling, backpressure handling, or replayability of events.

Therefore, Corda 5 replaces RabbitMQ and migrates to Kafka.

R3 Corda DLT Blockchain powered by Apache Kafka

Here are a few reasons for the need to migrate R3’s Corda from RabbitMQ to Kafka:

  • High availability for critical services
  • A cost-effective way to scale (horizontally) to deal with bursty and high-volume throughputs
  • Fully redundant, worker-based architecture
  • True decoupling and backpressure handling to facilitate communication between the node services, including the process engine, database integration, crypto integration, RPC service (HTTP), monitoring, and others
  • Compacted topics (logs) as the mechanism to store and retrieve the most recent states (see the sketch after this list)
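
For readers unfamiliar with log compaction: a compacted topic retains the latest value per key, which is what makes it usable as a simple “current state” store. Here is a hedged sketch of creating such a topic with the Kafka AdminClient – the topic name, partition count, and replication factor are arbitrary example values, not Corda’s actual configuration:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CompactedStateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps only the newest record per key instead of deleting by time
            NewTopic stateTopic = new NewTopic("flow-states", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(stateTopic)).all().get();
        }
    }
}
```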

Kafka in the crypto enterprise architecture

While using Kafka within a DLT or blockchain, the more prevalent use cases leverage Kafka as the scalable real-time data hub between cryptocurrencies or blockchains and enterprise applications. Let’s explore a few use cases and real-world examples for that.

Example crypto architecture: Kafka as data hub in the metaverse

My recent post about live commerce powered by event streaming and Kafka transforming the retail metaverse shows how the retail and gaming industry connects virtual and physical things. The retail business process and customer communication happen in real-time, no matter if you want to sell clothes, a smartphone, or a blockchain-based NFT token for your collectible or video game.

The following architecture shows what an NFT sales play could look like by integrating and orchestrating the information flow between various crypto and non-crypto applications in real-time at any scale:

Event Streaming with Apache Kafka as Data Hub for Crypto Blockchain and Metaverse

 

Kafka’s role as the data hub in crypto trading, marketplaces, and the metaverse

Let’s now explore the combination of Kafka and blockchains, respectively cryptocurrencies and decentralized finance (DeFi).

Once again, Kafka is neither the blockchain nor the cryptocurrency. The blockchain is a cryptocurrency like Bitcoin or a platform providing smart contracts like Ethereum, where people build new distributed applications (dApps) like NFTs for the gaming or art industry. Kafka is the data hub in between, connecting these blockchains with the oracles (= the non-blockchain apps = traditional IT infrastructure) like the CRM system, data lake, data warehouse, business applications, and so on.

Let’s look at an example and explore a few technical use cases where Kafka helps:

Kafka for data processing in the crypto world

 

A Bitcoin transaction is executed from the mobile wallet. A real-time application monitors the data off-chain, correlates it, shows it in a dashboard, and sends push notifications. Another completely independent department replays historical events from the Kafka log in a batch process for a compliance check with dedicated analytics tools.

The Kafka ecosystem provides so many capabilities to use the data from blockchains and the crypto world with other data from traditional IT.

Holistic view in a data mesh across typical enterprise IT and blockchains

  • Measuring the health of blockchain infrastructure, cryptocurrencies, and dApps to avoid downtime, secure the infrastructure, and make the blockchain data accessible.
  • Kafka provides an agentless and scalable way to present that data to the parties involved and ensure that the relevant information is exposed to the right teams before a node is lost. This is relevant for innovative Web3 IoT projects like Helium or simpler closed distributed ledgers (DLT) like R3 Corda.
  • Stream processing via Kafka Streams or ksqlDB to interpret the data to get meaningful information
  • Processors that focus on helpful block metrics – with information related to average gas price (gas refers to the cost necessary to perform a transaction on the network), the number of successful or failed transactions, and transaction fees (see the sketch after this list)
  • Monitoring blockchain infrastructure and telemetry log events in real-time
  • Regulatory monitoring and compliance
  • Real-time cybersecurity (fraud, situational awareness, threat intelligence)
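
As an illustration of such a block-metrics processor, the following Kafka Streams sketch computes the highest observed gas price per one-minute window. It assumes Kafka Streams 3.x and a hypothetical input topic keyed by block hash with the gas price (in Gwei) as a string value – real deployments would typically use structured Avro or Protobuf records instead:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class BlockMetrics {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // assumption: "eth.blocks" carries one record per block, key = block hash, value = gas price in Gwei
        builder.stream("eth.blocks", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(Double::parseDouble)
               // single group to compute a chain-wide metric
               .groupBy((blockHash, gasPrice) -> "ethereum", Grouped.with(Serdes.String(), Serdes.Double()))
               // tumbling one-minute windows
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               // keep the highest gas price seen in each window
               .reduce(Double::max, Materialized.with(Serdes.String(), Serdes.Double()))
               .toStream()
               .map((window, maxGas) -> KeyValue.pair(
                       window.key() + "@" + window.window().startTime(), maxGas.toString()))
               .to("eth.gas-price.max-per-minute", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "block-metrics");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        new KafkaStreams(builder.build(), props).start();
    }
}
```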

Continuous data integration at any scale in real-time

  • Integration + true decoupling + backpressure handling + replayability of events
  • Kafka as Oracle integration point (e.g. Chainlink -> Kafka -> Rest of the IT infrastructure)
  • Kafka Connect connectors that incorporate blockchain client APIs like OpenEthereum (leveraging the same concept/architecture for all clients and blockchain protocols)
  • Backpressure handling via throttling and nonce management over the Kafka backbone to stream transactions into the chain
  • Processing multiple chains at the same time (e.g., monitoring and correlating transactions on Ethereum, Solana, and Cardano blockchains in parallel)

Continuous stateless or stateful data processing

  • Stream processing with Kafka Streams or ksqlDB enables real-time data processing in DeFi / trading / NFT / marketplaces
  • Most blockchain and crypto use cases require more than just data ingestion into a database or data lake – continuous stream processing adds enormous value to many problems
  • Aggregate chain data (like Bitcoin or Ethereum), for instance, smart contract states or price feeds like the price of cryptos against USD (see the sketch after this list)
  • Specialized ‘processors’ that take advantage of Kafka Streams’ utilities to perform aggregations, reductions, filtering, and other practical stateless or stateful operations
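
A hedged sketch of the price-feed case mentioned in the list above: a KTable holds the latest USD price per asset, and a stream-table join enriches every trade with its current USD value. Topic names, keying, and the string-encoded values are assumptions chosen for brevity:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class UsdEnrichment {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // assumption: compacted topic, key = asset symbol (e.g., "BTC"), value = latest USD price as a string
        KTable<String, String> latestPrice =
                builder.table("crypto.prices.usd", Consumed.with(Serdes.String(), Serdes.String()));

        // assumption: trades keyed by asset symbol, value = traded amount as a string
        KStream<String, String> trades =
                builder.stream("crypto.trades", Consumed.with(Serdes.String(), Serdes.String()));

        // stream-table join: enrich each trade with the current USD value of the traded amount
        trades.join(latestPrice, (amount, price) ->
                        String.valueOf(Double.parseDouble(amount) * Double.parseDouble(price)))
              .to("crypto.trades.usd-value", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "usd-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
        new KafkaStreams(builder.build(), props).start();
    }
}
```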

Building new business models and solutions on top of crypto and blockchain infrastructure

  • Custody for crypto investments in a fully integrated, end-to-end solution
  • Deployment and management of smart contracts via a blockchain API and user interface
  • Customer 360 and loyalty platforms, for instance, NFT integration into retail, gaming, social media for new customer experiences by sending a context-specific AirDrop to a wallet of the customer
  • These are just a few examples – the list goes on and on…

The following section shows a few real-world examples. Some are relatively simple monitoring tools. Others are complex and powerful banking platforms.

Real-world examples of Kafka in the crypto and DeFi world

I have already explored how some blockchain and crypto solutions (like R3’s Corda) use event streaming with Kafka under the hood of their platform. In contrast, the following focuses on several public real-world solutions that leverage Kafka as the data hub between blockchains / crypto / NFT markets and new business applications:

  • TokenAnalyst: Visualization of crypto markets
  • EthVM: Blockchain explorer and analytics engine
  • Kaleido: REST API Gateway for blockchain and smart contracts
  • Nash: Cloud-native trading platform for cryptocurrencies
  • Swisscom’s Custodigit: Crypto banking platform
  • Chainlink: Oracle network for connecting smart contracts from blockchains to the real world

TokenAnalyst – Visualization of crypto markets

TokenAnalyst is an analytics tool to visualize and analyze the crypto market. TokenAnalyst is an excellent example that leverages the Kafka stack (Connect, Streams, ksqlDB, Schema Registry) to integrate blockchain data from Bitcoin and Ethereum with their analytics tools.

Kafka Connect helps with integrating databases and data lakes. The integration with Ethereum and other cryptocurrencies is implemented via a combination of the official crypto APIs and the Kafka producer client API.

Kafka Streams provides a stateful streaming application to prevent invalid blocks in downstream aggregate calculations. For example, TokenAnalyst developed a block confirmer component that resolves reorganization scenarios by temporarily keeping blocks and only propagating them once a threshold of confirmations (children mined on top of that block) is reached.
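
TokenAnalyst’s actual implementation is not shown here, but the idea of such a block confirmer can be sketched with the Kafka Streams Processor API and a state store: buffer every block and only forward an ancestor once enough children have been mined on top of it. Everything below – the Block record, the store name, and the confirmation threshold – is a simplified assumption:

```java
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

// Placeholder block representation - a real implementation would use the structured
// block data coming from the chain integration.
record Block(String hash, String parentHash, long number) {}

// Simplified "block confirmer": buffer every block in a state store and only forward an
// ancestor block downstream once REQUIRED_CONFIRMATIONS children have been mined on top of it.
// Orphaned branches never reach the threshold and are simply not forwarded.
public class BlockConfirmer implements Processor<String, Block, String, Block> {

    private static final int REQUIRED_CONFIRMATIONS = 6; // assumption, tune per chain

    private ProcessorContext<String, Block> context;
    private KeyValueStore<String, Block> pendingBlocks;

    @Override
    public void init(ProcessorContext<String, Block> context) {
        this.context = context;
        // the store ("pending-blocks") must be created with matching serdes and attached in the topology
        this.pendingBlocks = context.getStateStore("pending-blocks");
    }

    @Override
    public void process(Record<String, Block> record) {
        Block block = record.value();
        pendingBlocks.put(block.hash(), block);

        // walk up the parent chain; the ancestor that now has enough descendants is safe to emit
        String parentHash = block.parentHash();
        for (int depth = 1; parentHash != null && depth <= REQUIRED_CONFIRMATIONS; depth++) {
            Block ancestor = pendingBlocks.get(parentHash);
            if (ancestor == null) {
                break; // ancestor already confirmed (and removed) or never seen
            }
            if (depth == REQUIRED_CONFIRMATIONS) {
                context.forward(new Record<>(ancestor.hash(), ancestor, record.timestamp()));
                pendingBlocks.delete(ancestor.hash());
            }
            parentHash = ancestor.parentHash();
        }
    }
}
```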

TokenAnalyst - Kafka Connect and Kafka Streams for Blockchain and Ethereum Integration

EthVM – A blockchain explorer and analytics engine

The beauty of public, decentralized blockchains like Bitcoin and Ethereum is transparency. The tamper-proof log enables Blockchain explorers to monitor and analyze all transactions.

EthVM is an open-source Ethereum blockchain data processing and analytics engine powered by Apache Kafka. The tool enables blockchain auditing and decision-making. EthVM verifies the execution of transactions and smart contracts, checks balances, and monitors gas prices. The infrastructure is built with Kafka Connect, Kafka Streams, and Schema Registry. A client-side visual block explorer is included, too.

EthVM - A Kafka-based Crypto and Blockchain Explorer

Kaleido – A Kafka-native Gateway for crypto and smart contracts

Kaleido provides enterprise-grade blockchain APIs to deploy and manage smart contracts, send Ethereum transactions, and query blockchain data. It hides the blockchain complexities of Ethereum transaction submission, thick Web3 client libraries, nonce management, RLP encoding, transaction signing, and smart contract management.

Kaleido offers REST APIs for on-chain logic and data. It is backed by a fully-managed high throughput Apache Kafka infrastructure.

Kaleido - REST API for Crypto like Ethereum powered by Apache Kafka

One exciting aspect of the above architecture: besides the API (= HTTP) gateway, Kaleido also provides a native, direct Kafka connection from the client side. This is a clear trend I have discussed before in other blog posts.

Nash – Cloud-native trading platform for cryptocurrencies

Nash is an excellent example of a modern trading platform for cryptocurrencies using blockchain under the hood. The heart of Nash’s platform leverages Apache Kafka. The following quote from their community page says:

“Nash is using Confluent Cloud, google cloud platform to deliver and manage its services. Kubernetes and apache Kafka technologies will help it scale faster, maintain top-notch records, give real-time services which are even hard to imagine today.”

Nash - Finserv Cryptocurrency Blockchain exchange and wallet leveraging Apache Kafka and Confluent Cloud

Nash provides the speed and convenience of traditional exchanges and the security of non-custodial approaches. Customers can invest in, make payments, and trade Bitcoin, Ethereum, NEO, and other digital assets. The exchange is the first of its kind, offering non-custodial cross-chain trading with the full power of a real order book. The distributed, immutable commit log of Kafka enables deterministic replayability in its exact order.

Swisscom’s Custodigit – A crypto banking platform powered by Kafka Streams

Custodigit is a modern banking platform for digital assets and cryptocurrencies. It provides crucial features and guarantees for seriously regulated crypto investments:

  • Secure storage of wallets
  • Sending and receiving on the blockchain
  • Trading via brokers and exchanges
  • Regulated environment (a key aspect and no surprise as this product is coming from the Swiss – a very regulated market)

Kafka is the central nervous system of Custodigit’s microservice architecture and stateful Kafka Streams applications. Use cases include workflow orchestration with the “distributed saga” design pattern for the choreography between microservices. Kafka Streams was selected because of:

  • lean, decoupled microservices
  • metadata management in Kafka
  • unified data structure across microservices
  • transaction API (aka exactly-once semantics)
  • scalability and reliability
  • real-time processing at scale
  • a higher-level domain-specific language for stream processing
  • long-running stateful processes

Unfortunately, the architecture diagrams are only available in German. But I think you get the points:

  • Custodigit microservice architecture – some microservices integrate with brokers and stock markets, others with blockchain and crypto:

Custodigit microservice architecture

  • Custodigit Saga pattern for stateful orchestration – stateless business logic is truly decoupled, while the saga orchestrator keeps the state for choreography between the other services:

Custodigit Saga pattern for stateful orchestration

 

Chainlink – Oracle network for connecting smart contracts to the real world

Chainlink is the industry standard oracle network for connecting smart contracts to the real world. “With Chainlink, developers can build hybrid smart contracts that combine on-chain code with an extensive collection of secure off-chain services powered by Decentralized Oracle Networks. Managed by a global, decentralized community of hundreds of thousands of people, Chainlink introduces a fairer contract model. Its network currently secures billions of dollars in value for smart contracts across decentralized finance (DeFi), insurance, and gaming ecosystems, among others. The full vision of the Chainlink Network can be found in the Chainlink 2.0 white paper.”

Unfortunately, I could not find any public blog post or conference talks about Chainlink’s architecture. Hence, I can only let Chainlink’s job offering speak about their impressive Kafka usage for real-time observability at scale in a critical, transactional financial environment.

Chainlink is transitioning from traditional time series-based monitoring toward an event-driven architecture and alerting approach.

Chainlink Job Role - Blockchain Oracle Integration powered by Apache Kafka

This job offer sounds very interesting, doesn’t it? And it is a colossal task to solve cybersecurity challenges in this industry. If you look for a blockchain-based Kafka role, this might be for you.

Serverless Kafka enables focusing on the business logic in your crypto data hub infrastructure!

This article explored practical use cases for crypto marketplaces and the coming metaverse. Many enterprise architectures already leverage Apache Kafka and its ecosystem to build a scalable real-time data hub for crypto and non-crypto technologies.

This combination is the foundation for a metaverse ecosystem and innovative new applications, customer experiences, and business models. Don’t fear the metaverse. This discussion is not just about Meta (formerly Facebook), but about interoperability between many ecosystems to provide fantastic new user experiences (of course, with their drawbacks and risks, too).

A clear trend across all these fancy topics and buzzwords is the usage of serverless cloud offerings. This way, project teams can spend their time on the business logic instead of operating the infrastructure. Check out my articles about “serverless Kafka and its relation to cloud-native data lakes and lake houses” and my “comparison of Kafka offerings on the market” to learn more.

How do you use Apache Kafka with cryptocurrencies, blockchain, or DeFi applications? Do you deploy in the public cloud and leverage a serverless Kafka SaaS offering? What other technologies do you combine with Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

When NOT to use Apache Kafka?
https://www.kai-waehner.de/blog/2022/01/04/when-not-to-use-apache-kafka/ (Tue, 04 Jan 2022)

Apache Kafka is the de facto standard for event streaming to process data in motion. With its significant adoption growth across all industries, I get a very valid question every week: When NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How to qualify Kafka out as it is not the right tool for the job? This blog post explores the DOs and DONTs. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.

When not to use Apache Kafka

Let’s begin with understanding why Kafka comes up everywhere these days. This clarifies the huge market demand for event streaming, but also shows that there is no silver bullet solving all problems. Kafka is NOT the silver bullet for a connected world, but it is a crucial component!

The world gets more and more connected. Vast volumes of data are generated and need to be correlated in real-time to increase revenue, reduce costs, and reduce risks. I could pick almost any industry. Some are faster. Others are slower. But the connected world is coming everywhere. Think about manufacturing, smart cities, gaming, retail, banking, insurance, and so on. If you look at my past blogs, you can find relevant Kafka use cases for any industry.

I picked two market trends that show this insane growth of data and the creation of innovation and new cutting-edge use cases (and why Kafka’s adoption is insane across industries, too).

Connected Cars – Insane volume of telemetry data and aftersales

Here is the “Global Opportunity Analysis and Industry Forecast, 2020–2027” by Allied Market Research:

Connected Car Market Statistics – 2027

The Connected Car market includes a much wider variety of use cases and industries than most people think. A few examples: Network infrastructure and connectivity, safety, entertainment, retail, aftermarket, vehicle insurance, 3rd party data usage (e.g., smart city), and so much more.

Gaming – Billions of players and massive revenues

The gaming industry is already bigger than all other media categories combined, and this is still just the beginning of a new era – as Bitkraft depicts:

The growing Gaming market

Millions of new players join the gaming community every month across the globe. Connectivity and cheap smartphones are sold in less wealthy countries. New business models like “play to earn” change how the next generation of gamers plays a game. More scalable and low latency technologies like 5G enable new use cases. Blockchain and NFT (Non-Fungible Token) are changing the monetization and collection market forever.

These market trends across industries clarify why the need for real-time data processing increases significantly quarter by quarter. Apache Kafka established itself as the de facto standard for processing analytical and transactional data streams at scale. However, it is crucial to understand when (not) to use Apache Kafka and its ecosystem in your projects.

What is Apache Kafka, and what is it NOT?

Kafka is often misunderstood. For instance, I still hear way too often that Kafka is a message queue. Part of the reason is that some vendors only pitch it for a specific problem (such as data ingestion into a data lake or data warehouse) to sell their products. So, in short:

Kafka is…

  • a scalable real-time messaging platform to process millions of messages per second.
  • an event streaming platform for massive volumes of big data analytics and small volumes of transactional data processing.
  • a distributed storage layer that provides true decoupling for backpressure handling, support of various communication protocols, and replayability of events with guaranteed ordering.
  • a data integration framework for streaming ETL.
  • a data processing framework for continuous stateless or stateful stream processing.

This combination of characteristics in a single platform makes Kafka unique (and successful).

Kafka is NOT…

  • a proxy for millions of clients (like mobile apps) – but Kafka-native proxies (like REST or MQTT) exist for some use cases.
  • an API Management platform – but these tools are usually complementary and used for the creation, life cycle management, or the monetization of Kafka APIs.
  • a database for complex queries and batch analytics workloads – but good enough for transactional queries and relatively simple aggregations (especially with ksqlDB).
  • an IoT platform with features such as device management – but direct Kafka-native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the appropriate approach for (some) use cases.
  • a technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different kind of software!

For these reasons, Kafka is complementary, not competitive, to these other technologies. Choose the right tool for the job and combine them!

Case studies for Apache Kafka in a connected world

This section shows a few examples of fantastic success stories where Kafka is combined with other technologies because it makes sense and solves the business problem. The focus here is case studies that need more than just Kafka for the end-to-end data flow.

No matter whether you follow my blog, Kafka Summit conferences, online platforms like Medium or DZone, or any other tech-related news, you will find plenty of success stories around real-time data streaming with Apache Kafka for high volumes of analytics and transactional data from connected cars, IoT edge devices, or gaming apps on smartphones.

A few examples across industries and use cases:

  • Audi: Connected car platform rolled out across regions and cloud providers
  • BMW: Smart factories for the optimization of the supply chain and logistics
  • SolarPower: Complete solar energy solutions and services across the globe
  • Royal Caribbean: Entertainment on cruise ships with disconnected edge services and hybrid cloud aggregation
  • Disney+ Hotstar: Interactive media content and gaming/betting for millions of fans on their smartphone
  • The list goes on and on and on.

So what is the problem with all these great IoT success stories? Well, there is no problem. But some clarification is needed to explain when to use event streaming with the Apache Kafka ecosystem and where other complementary solutions usually complement it.

When to use Apache Kafka?

Before we discuss when NOT to use Kafka, let’s understand where to use it. This makes it much clearer how and when to complement it with other technologies if needed.

I will add real-world examples to each section. In my experience, this makes it much easier to understand the added value.

Kafka consumes and processes high volumes of IoT and mobile data in real-time

Processing massive volumes of data in real-time is one of the critical capabilities of Kafka.

Tesla is not just a car maker. Tesla is a tech company writing a lot of innovative and cutting-edge software. They provide an energy infrastructure for cars with their Tesla Superchargers, solar energy production at their Gigafactories, and much more. Processing and analyzing the data from their vehicles, smart grids, and factories and integrating with the rest of the IT backend services in real-time is a crucial piece of their success.

Tesla has built a Kafka-based data platform infrastructure “to support millions of devices and trillions of data points per day”. Tesla showed an exciting history and evolution of their Kafka usage at a Kafka Summit in 2019:

Keep in mind that Kafka is much more than just messaging. I repeat this in almost every blog post as too many people still don’t get it. Kafka is a distributed storage layer that truly decouples producers and consumers. Additionally, Kafka-native processing tools like Kafka Streams and ksqlDB enable real-time processing.

Kafka correlates IoT data with transactional data from the MES and ERP systems

Data integration in real-time at scale is relevant for analytics and the usage of transactional systems like an ERP or MES system. Kafka Connect and non-Kafka middleware complement the core of event streaming for this task.

BMW operates mission-critical Kafka workloads across the edge (i.e., in the smart factories) and public cloud. Kafka enables decoupling, transparency, and innovation. The products and expertise from Confluent add stability. The latter is vital for success in manufacturing. Each minute of downtime costs a fortune. Read my related article “Apache Kafka as Data Historian – an IIoT / Industry 4.0 Real-Time Data Lake” to understand how Kafka improves the Overall Equipment Effectiveness (OEE) in manufacturing.

BMW optimizes its supply chain management in real-time. The solution provides information about the right stock in place, both physically and in transactional systems like BMW’s ERP powered by SAP. “Just in time, just in sequence” is crucial for many critical applications. The integration between Kafka and SAP is required for almost 50% of customers I talk to in this space. Beyond the integration, many next-generation transactional ERP and MES platforms are powered by Kafka, too.

Kafka integrates with all the non-IoT IT in the enterprise at the edge and hybrid or multi-cloud

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Learn about several scenarios that may require multi-cluster solutions and see real-world examples with their specific requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments, and global Kafka.

The true decoupling between different interfaces is a unique advantage of Kafka vs. other messaging platforms such as IBM MQ, RabbitMQ, or MQTT brokers. I also explored this in detail in my article about Domain-driven Design (DDD) with Kafka.

Infrastructure modernization and hybrid cloud architectures with Apache Kafka are typical across industries.

One of my favorite examples is the success story from Unity. The company provides a real-time 3D development platform focusing on gaming and getting into other industries like manufacturing with their Augmented Reality (AR) / Virtual Reality (VR) features.

The data-driven company already had content installed 33 billion times in 2019, reaching 3 billion devices worldwide. Unity operates one of the largest monetization networks in the world. They migrated this platform from self-managed Kafka to fully-managed Confluent Cloud. The cutover was executed by the project team without downtime or data loss. Read Unity’s post on the Confluent Blog: “How Unity uses Confluent for real-time event streaming at scale”.

Kafka is the scalable real-time backend for mobility services and gaming/betting platforms

Many gaming and mobility services leverage event streaming as the backbone of their infrastructure. Use cases include the processing of telemetry data, location-based services, payments, fraud detection, user/player retention, loyalty platform, and so much more. Almost all innovative applications in this sector require real-time data streaming at scale.

A few examples:

  • Mobility services: Uber, Lyft, FREE NOW, Grab, Otonomo, Here Technologies, …
  • Gaming services: Disney+ Hotstar, Sony Playstation, Tencent, Big Fish Games, …
  • Betting services: William Hill, Sky Betting, …

Just look at the job portals of any mobility or gaming service. Not everybody is talking about their Kafka usage in public. But almost everyone is looking for Kafka experts to develop and operate their platform.

These use cases are just as critical as a payment process in a core banking platform. Regulatory compliance and zero data loss are crucial. Multi-Region Clusters (i.e., a Kafka cluster stretched across regions like US East, Central, and West) enable high availability with zero downtime and no data loss even in the case of a disaster.

Multi Region Kafka Cluster in Gaming for Automated Disaster Recovery

Vehicles, machines, or IoT devices embed a single Kafka broker

The edge is here to stay and grow. Some use cases require the deployment of a Kafka cluster or single broker outside a data center. Reasons for operating a Kafka infrastructure at the edge include low latency, cost efficiency, cybersecurity, or no internet connectivity.

Examples for Kafka at the edge:

  • Disconnected edge in logistics to store logs, sensor data, and images while offline (e.g., a truck on the street or a drone flying around a ship) until a good internet connection is available in the distribution center
  • Vehicle-to-Everything (V2X) communication in a small local data center like AWS Outposts (via a gateway like MQTT in the case of a large area, a considerable number of vehicles, or a lousy network, or via a direct Kafka client connection for a few hundred machines, e.g., in a smart factory)
  • Offline mobility services like integrating a car infrastructure with gaming, maps, or a recommendation engine with locally processed partner services (e.g., the next McDonald’s is 10 miles ahead, here is a coupon).

The cruise line Royal Caribbean is a great success story for this scenario. It operates the four largest passenger ships in the world. As of January 2021, the line operates twenty-four ships and has six additional ships on order.

Royal Caribbean implemented one of Kafka’s most famous use cases at the edge. Each cruise ship has a Kafka cluster running locally for use cases such as payment processing, loyalty information, customer recommendations, etc.:

Swimming Retail Stores at Royal Caribbean with Apache Kafka

I covered this example and other Kafka edge deployments in various blogs. I talked about use cases for Kafka at the edge, showed architectures for Kafka at the edge, and explored low latency 5G deployments powered by Kafka.

When NOT to use Apache Kafka?

Finally, we are coming to the section everybody was looking for, right? However, it is crucial first to understand when to use Kafka. Now, it is easy to explain when NOT to use Kafka.

For this section, let’s assume we are talking about production scenarios, not quick-and-dirty workarounds that connect Kafka directly to something for a proof of concept; there is always a hacky option to test an idea, and that’s fine for that goal. But things change when you need to scale and roll out your infrastructure globally, comply with the law, and guarantee no data loss for transactional workloads.

With this in mind, it is relatively easy to qualify out Kafka as an option for some use cases and problems:

Kafka is NOT hard real-time

The definition of the term “real-time” is difficult. It is often a marketing term. Real-time programs must guarantee a response within specified time constraints.

Kafka – and all other frameworks, products, and cloud services used in this context – is only soft real-time and built for the IT world. Many OT and IoT applications require hard real-time with zero latency spikes.

Apache Kafka is NOT hard real time for cars and robots

Soft real-time is used for applications such as

  • Point-to-point messaging between IT applications
  • Data ingestion from various data sources into one or more data sinks
  • Data processing and data correlation (often called event streaming or event stream processing)

If your application requires sub-millisecond latency, Kafka is not the right technology. For instance, high-frequency trading is usually implemented with purpose-built proprietary commercial solutions.

Always keep in mind: The lowest latency would be to not use a messaging system at all and just use shared memory. In a race to the lowest latency, Kafka will lose every time. However, for the audit log, transaction log, or persistence engine parts of an exchange, zero data loss matters more than latency – and that is where Kafka wins.
Most real-time use cases “only” require data processing in the millisecond to the second range. In that case, Kafka is a perfect solution. Many FinTechs, such as Robinhood, rely on Kafka for mission-critical transactional workloads, even financial trading. Multi-access edge computing (MEC) is another excellent example of low latency data streaming with Apache Kafka and cloud-native 5G infrastructure.
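
To make the millisecond-to-second range tangible, here is a minimal, hypothetical consumer sketch that measures end-to-end latency by comparing each record's producer timestamp with the consumption time (broker address and topic name are made up for illustration):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LatencyProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "latency-probe");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sensor-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    // Producer timestamp vs. consumption time: typically a few milliseconds,
                    // but soft real-time only - there is no deterministic upper bound.
                    long latencyMs = System.currentTimeMillis() - record.timestamp();
                    System.out.printf("end-to-end latency: %d ms%n", latencyMs);
                }
            }
        }
    }
}
```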

Kafka is NOT deterministic for embedded and safety-critical systems

This one is pretty straightforward and related to the above section. Kafka is not a deterministic system. Safety-critical applications cannot use it for a car engine control system, a medical system such as a heart pacemaker, or an industrial process controller.

A few examples where Kafka CANNOT be used for:

  • Safety-critical data processing in the car or vehicle. That’s AUTOSAR / MISRA C / Assembler and similar technologies.
  • CAN Bus communication between ECUs.
  • Robotics. That’s C / C++ or similar low-level languages combined with frameworks such as Industrial ROS (Robot Operating System).
  • Safety-critical machine learning / deep learning (e.g., for autonomous driving)
  • Vehicle-to-Vehicle (V2V) communication. That’s 5G sidelink without an intermediary like Kafka.

My post “Apache Kafka is NOT Hard Real-Time BUT Used Everywhere in Automotive and Industrial IoT” explores this discussion in more detail.

TL;DR: Safety-related data processing must be implemented with dedicated low-level programming languages and solutions. That’s not Kafka! The same is true for any other IT software, too. Hence, don’t replace Kafka with IBM MQ, Flink, Spark, Snowflake, or any other similar IT software.

Kafka is NOT built for bad networks

Kafka requires good stable network connectivity between the Kafka clients and the Kafka brokers. Hence, if the network is unstable and clients need to reconnect to the brokers all the time, then operations are challenging, and SLAs are hard to reach.

There are some exceptions, but the basic rule of thumb is that other technologies are built specifically to solve the problem of bad networks. MQTT is the most prominent example. Hence, Kafka and MQTT are friends, not enemies. The combination is super powerful and used a lot across industries. For that reason, I wrote a whole blog series about Kafka and MQTT.

We built a connected car infrastructure that processes 100,000 data streams for real-time predictions using MQTT, Kafka, and TensorFlow in a Kappa architecture.
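
In practice, this combination is usually built with Kafka-native MQTT connectors or an MQTT proxy rather than hand-written glue code. Still, a tiny bridge sketch (using the Eclipse Paho MQTT client together with a Kafka producer; broker addresses and topic names are hypothetical) shows the basic idea of forwarding MQTT telemetry into Kafka:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.eclipse.paho.client.mqttv3.MqttClient;

public class MqttToKafkaBridge {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);

        MqttClient mqtt = new MqttClient("tcp://mqtt-broker:1883", "car-bridge"); // hypothetical MQTT broker
        mqtt.connect();

        // Forward every telemetry message from MQTT into a Kafka topic, keyed by the MQTT topic
        mqtt.subscribe("cars/+/telemetry", (topic, message) ->
                producer.send(new ProducerRecord<>("car-telemetry", topic, message.getPayload())));

        Thread.currentThread().join(); // keep the bridge running
    }
}
```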

Kafka does NOT provide connectivity to tens of thousands of client applications

Another specific point to qualify Kafka out as an integration solution is that Kafka cannot connect to tens of thousands of clients. If you need to build a connected car infrastructure or gaming platform for mobile players, the clients (i.e., cars or smartphones) will not directly connect to Kafka.

A dedicated proxy such as an HTTP gateway or MQTT broker is the right intermediary between thousands of clients and Kafka for real-time backend processing and the integration with further data sinks such as a data lake, data warehouse, or custom real-time applications.

Where are the limits of Kafka client connections? As so often, this is hard to say. I have seen customers connect directly from their shop floor in the plant via .NET and Java Kafka clients via a direct connection to the cloud where the Kafka cluster is running. Direct hybrid connections usually work well if the number of machines, PLCs, IoT gateways, and IoT devices is in the hundreds. For higher numbers of client applications, you need to evaluate if you a) need a proxy in the middle or b) deploy “edge computing” with or without Kafka at the edge for lower latency and cost-efficient workloads.

When to MAYBE use Apache Kafka?

The last section covered scenarios where it is relatively easy to qualify Kafka out because it simply cannot provide the required capabilities. Now I want to explore a few less apparent topics where it depends on several factors whether Kafka is a good choice or not.

Kafka does (usually) NOT replace another database

Apache Kafka is a database. It provides ACID guarantees and is used in hundreds of companies for mission-critical deployments. However, most times, Kafka is not competitive with other databases. Kafka is an event streaming platform for messaging, storage, processing, and integration at scale in real-time with zero downtime or data loss.

Kafka is often used as a central streaming integration layer with these characteristics. Other databases can build materialized views for their specific use cases like real-time time-series analytics, near real-time ingestion into a text search infrastructure, or long-term storage in a data lake.

In summary, when you get asked if Kafka can replace a database, then there are several answers to consider:

  • Kafka can store data forever in a durable and highly available manner, providing ACID guarantees
  • Further options to query historical data are available in Kafka
  • Kafka-native add-ons like ksqlDB or Tiered Storage make Kafka more potent than ever before for data processing and event-based long-term storage
  • Stateful applications can be built leveraging Kafka clients (microservices, business applications) with no other external database (see the sketch after this list)
  • Not a replacement for existing databases, data warehouses, or data lakes like MySQL, MongoDB, Elasticsearch, Hadoop, Snowflake, Google BigQuery, etc.
  • Other databases and Kafka complement each other; the right solution has to be selected for a problem; often, purpose-built materialized views are created and updated in real-time from the central event-based infrastructure
  • Different options are available for bi-directional pull and push-based integration between Kafka and databases to complement each other
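
As a minimal sketch of such a stateful application (illustrative only; topic, store, and key names are hypothetical), a Kafka Streams topology can maintain a queryable materialized view without any external database:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class OrderCountView {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count-view");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        StreamsBuilder builder = new StreamsBuilder();
        // Continuously maintain a materialized view (orders per customer), backed by a local
        // RocksDB store plus a changelog topic in Kafka - no external database involved.
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("orders-per-customer"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // The local state store is queryable like a key-value database (interactive queries);
        // a real application would wait until the instance reaches the RUNNING state first.
        ReadOnlyKeyValueStore<String, Long> view = streams.store(
                StoreQueryParameters.fromNameAndType("orders-per-customer", QueryableStoreTypes.keyValueStore()));
        System.out.println("orders for customer-42: " + view.get("customer-42"));
    }
}
```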

My blog post “Can Apache Kafka replace a database, data warehouse, or data lake?” discusses the usage of Kafka as a database in much more detail.

Kafka does (usually) NOT process large messages

Kafka was not built for large messages. Period.

Nevertheless, more and more projects send and process 1 MB, 10 MB, and even much bigger files and other large payloads via Kafka. One reason is that Kafka was designed for high volume and throughput – which is required for large messages. A very common example is the ingestion and processing of large files from legacy systems with Kafka before loading the processed data into a data warehouse.

However, not all large messages should be processed with Kafka. Often you should use the right storage system and just leverage Kafka for the orchestration. Reference-based messaging (i.e. storing the file in another storage system and sending the link and metadata) is often the better design pattern:

Apache Kafka for large message payloads and files - Alternatives and Trade-offs

Know the different design patterns and choose the right technology for your problem.
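
A minimal sketch of this reference-based (claim check) pattern could look like the following. The ObjectStore interface is a hypothetical stand-in for S3, GCS, MinIO, or any other object store; only the reference and some metadata are sent through Kafka:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClaimCheckProducer {

    /** Hypothetical abstraction over S3, GCS, MinIO, or any other object store. */
    interface ObjectStore {
        String upload(String key, byte[] payload); // returns a URI such as s3://bucket/key
    }

    public static void main(String[] args) throws Exception {
        ObjectStore store = (key, payload) -> "s3://video-bucket/" + key; // placeholder implementation

        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        byte[] video = Files.readAllBytes(Path.of("dashcam.mp4")); // the large payload stays out of Kafka
        String uri = store.upload(UUID.randomUUID() + ".mp4", video);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Only the reference and a bit of metadata travel through Kafka
            String event = String.format("{\"uri\":\"%s\",\"sizeBytes\":%d}", uri, video.length);
            producer.send(new ProducerRecord<>("video-events", uri, event));
        }
    }
}
```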

For more details and use cases about handling large files with Kafka, check out this blog post: “Handling Large Messages with Apache Kafka (CSV, XML, Image, Video, Audio, Files)“.

Kafka is (usually) NOT the IoT gateway for the last-mile integration of industrial protocols…

The last-mile integration with IoT interfaces and mobile apps is a tricky space. As discussed above, Kafka cannot connect to thousands of Kafka clients. However, many IoT and mobile applications only require tens or hundreds of connections. In that case, a Kafka-native connection is straightforward using one of the various Kafka clients available for almost any programming language on the planet.

Suppose a connection on TCP level with a Kafka client makes little sense or is not possible. In that case, a very prevalent workaround is the REST Proxy as the intermediary between the clients and the Kafka cluster. The clients communicate via synchronous HTTP(S) with the streaming platform.

Use cases for HTTP and REST APIs with Apache Kafka include the control plane (= management), the data plane (= produce and consume messages), and automation, respectively DevOps tasks.
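
Assuming a Confluent REST Proxy is deployed in front of the cluster (host, port, and topic below are hypothetical), producing a record from any HTTP-capable client is a single POST with the REST Proxy's JSON envelope:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProduceExample {
    public static void main(String[] args) throws Exception {
        // The REST Proxy v2 produce API wraps one or more records in a JSON envelope
        String body = "{\"records\":[{\"value\":{\"deviceId\":\"sensor-7\",\"temperature\":21.5}}]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://rest-proxy:8082/topics/iot-telemetry")) // hypothetical host and topic
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```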

Unfortunately, many IoT projects require much more complex integrations. I am not just talking about a relatively straightforward integration via an MQTT or OPC-UA connector. Challenges in Industrial IoT projects include:

  • The automation industry often does not use open standards; its technology is frequently slow, insecure, not scalable, and proprietary.
  • Product Lifecycles are very long (tens of years), with no simple changes or upgrades.
  • IIoT usually uses incompatible protocols, typically proprietary and built for one specific vendor.
  • Proprietary and expensive monoliths that are not scalable and not extensible.

Therefore, many IoT projects complement Kafka with a purpose-built IoT platform. Most IoT products and cloud services are proprietary but provide open interfaces and architectures. The open-source space is small in this industry. A great alternative (for some use cases) is Apache PLC4X. The framework integrates with many proprietary legacy protocols, such as Siemens S7, Modbus, Allen Bradley, Beckhoff ADS, etc. PLC4X also provides a Kafka Connect connector for native and scalable Kafka integration.

A modern data historian is open and flexible. The foundation of many strategic IoT modernization projects across the shop floor and hybrid cloud is powered by event streaming:

Apache Kafka as Data Historian in Industrial IoT IIoT

Kafka is NOT a blockchain (but relevant for web3, crypto trading, NFT, off-chain, sidechain, oracles)

Kafka is a distributed commit log. The concepts and foundations are very similar to a blockchain. I explored this in more detail in my post “Apache Kafka and Blockchain – Comparison and a Kafka-native Implementation“.

A blockchain should be used ONLY if different untrusted parties need to collaborate. For most enterprise projects, a blockchain is unnecessary added complexity. A distributed commit log (= Kafka) or a tamper-proof distributed ledger (= enhanced Kafka) is sufficient.

Having said this, more interestingly, I see more and more companies using Kafka within their crypto trading platforms, market exchanges, and NFT token trading marketplaces.

To be clear: Kafka is NOT the blockchain on these platforms. The blockchain is a cryptocurrency like Bitcoin or a platform providing smart contracts like Ethereum where people build new distributed applications (dApps) like NFTs for the gaming or art industry. Kafka is the streaming platform to connect these blockchains with other Oracles (= the non-blockchain apps) like the CRM, data lake, data warehouse, and so on:

Apache Kafka and Blockchain - DLT - Use Cases and Architectures

TokenAnalyst is an excellent example that leverages Kafka to integrate blockchain data from Bitcoin and Ethereum with their analytics tools. Kafka Streams provides a stateful streaming application to prevent using invalid blocks in downstream aggregate calculations. For example, TokenAnalyst developed a block confirmer component that resolves reorganization scenarios by temporarily holding blocks and only propagating them once a threshold number of confirmations (children of that block mined) is reached.
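
The following is not TokenAnalyst's actual implementation, but a simplified Kafka Streams sketch of the same idea: count confirmations per parent block and only forward blocks that reach a configurable threshold. Topic names and the threshold value are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class BlockConfirmer {
    static final long REQUIRED_CONFIRMATIONS = 6; // assumed threshold

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "block-confirmer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        StreamsBuilder builder = new StreamsBuilder();
        // Input: one record per observed child block, keyed by the parent block hash.
        // Count confirmations per parent and only forward blocks that reach the threshold,
        // so downstream aggregations never consume blocks from a chain reorganization.
        builder.stream("observed-child-blocks", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count()
               .toStream()
               .filter((blockHash, confirmations) -> confirmations == REQUIRED_CONFIRMATIONS)
               .mapValues((blockHash, confirmations) -> blockHash)
               .to("confirmed-blocks", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```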

In some advanced use cases, Kafka is used to implement a sidechain or off-chain platform, as the original blockchain does not scale well enough (blockchain data is known as on-chain data). Bitcoin is not the only chain with the problem of processing only single-digit (!) transactions per second. Most modern blockchain solutions cannot scale even close to the workloads Kafka processes in real-time.

From DAOs to blue-chip companies, measuring the health of blockchain infrastructure and IoT components is still necessary, even in a distributed network, to avoid downtime, secure the infrastructure, and make the blockchain data accessible. Kafka provides an agentless and scalable way to present that data to the parties involved and to ensure that the relevant data is exposed to the right teams before a node is lost. This is relevant for cutting-edge Web3 IoT projects like Helium, or simpler closed distributed ledgers (DLT) like R3 Corda.

My recent post about live commerce powered by event streaming and Kafka transforming the retail metaverse shows how the retail and gaming industry connects virtual and physical things. The retail business process and customer communication happen in real-time; no matter if you want to sell clothes, a smartphone, or a blockchain-based NFT token for your collectible or video game.

TL;DR: Kafka is NOT…

… a replacement for your favorite database or data warehouse.

… hard real-time for safety-critical embedded workloads.

… a proxy for thousands of clients in bad networks.

… an API Management solution.

… an IoT gateway.

… a blockchain.

It is easy to qualify Kafka out for some use cases and requirements.

However, analytical and transactional workloads across all industries use Kafka. It is the de-facto standard for event streaming everywhere. Hence, Kafka is often combined with other technologies and platforms.

Where do you (not) use Apache Kafka? What other technologies do you combine Kafka with? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When NOT to use Apache Kafka? appeared first on Kai Waehner.

]]>
Apache Kafka and Blockchain – Comparison and a Kafka-native Implementation https://www.kai-waehner.de/blog/2020/07/17/apache-kafka-blockchain-dlt-comparison-kafka-native-vs-hyperledger-ethereum-ripple-iota-libra/ Fri, 17 Jul 2020 08:14:40 +0000 https://www.kai-waehner.de/?p=2460 This blog post discusses the concepts, use cases, and architectures behind Event Streaming, Apache Kafka, Distributed Ledger (DLT),…

The post Apache Kafka and Blockchain – Comparison and a Kafka-native Implementation appeared first on Kai Waehner.

]]>
This blog post discusses the concepts, use cases, and architectures behind Event Streaming, Apache Kafka, Distributed Ledger (DLT), and Blockchain. A comparison of different technologies such as Confluent, AIBlockchain, Hyperledger, Ethereum, Ripple, IOTA, and Libra explores when to use Kafka, a Kafka-native blockchain, a dedicated blockchain, or Kafka in conjunction with another blockchain.

Comparison Kafka Blockchain DLT Hyperledger Ethereum Libra Ripple IOTA AIBlockchain Confluent

Use Cases for Secure and Tamper-Proof Data Processing with a Blockchain

Blockchain is a hype topic for many years. While many companies talk about the buzzword, it is tough to find use cases where Blockchain is the best solution. The following examples show the potential of Blockchain. Here it might make sense:

  • Supply Chain Management (SCM): Manufacturing, transportation, logistics, and retailing see various use cases, including transaction settlement, tracking social responsibility, accurate cost information, precise shipping, and logistics data, automated purchasing and planning using integrated ERP and CRM systems, enforcement of partner contracts, food safety.
  • Healthcare: Management of patient data (especially access control), management and use of academic research data, prevention of compliance violations, reducing human errors, cross-cutting up-to-date patient information, secure identity management.
  • Telecom Industry: Blockchain solution for the settlement of roaming discount agreements to transact seamlessly with an ecosystem of partners and to reach significant growth of operators’ business relationships and business models.
  • Financial Services: Instant global payments with cryptocurrencies (including a digital US Dollar or EURO), reduced cost for end-users, audit transparency, data provenance, data lineage, fraud reduction, savings on business operations/finance reporting/compliance.

These are great, valid use cases. But you should ask yourself some critical questions:

Do you need a blockchain? What parts of “blockchain” do you need? Tamper-proof storage and encrypted data processing? Consortium deployments with access from various organizations? Does it add business value? Is it worth the added efforts, complexity, cost, risk?

Challenges and Concerns of Blockchain Technologies

Why am I so skeptical about Blockchain? I love the concepts and technologies! But I see the following concerns and challenges in the real world:

  • Technical complexity: Blockchain technology is still very immature, and the efforts to implement a blockchain project are massive and often underestimated
  • Organizational complexity: Deploying a blockchain over multiple organizations requires enormous efforts, including compliance and legal concepts.
  • Transaction speed: Bitcoin is slow. But also, most other blockchain frameworks are not built for millions of messages. Some solutions are in the works to improve from hundreds to thousands of messages per second. This scale is still not good enough for financial trading, supply chain, Internet of Things, and many other use cases.
  • Energy consumption: Bitcoin and some other cryptocurrencies use ‘Proof-of-Work’ (POW) as a consensus mechanism. The resources used for Bitcoin mining are crazy. Period! (This issue is typically only relevant in public blockchains; private blockchains use other concepts for this reason.)
  • Security: Providing a tamper-proof deployment is crucial. Deploying this over different organizations (permissioned / consortium blockchain) or public (permissionless) is hard. Security is much more than just a feature or API.
  • Data tenancy: Data privacy, compliance (across countries), and ownership of the data (who is allowed to use it for creating added value) are hard to define.
  • Lifecycle costs: Blockchain infrastructure is complex. It is not just an application but a cross-company distributed infrastructure. Development, operations, and management are very different from hosting your monolith or microservices.

Hence, it is important to choose a blockchain technology only when it makes sense. Most projects I have seen in the last years were about evaluating technologies and doing technical POCs – similar to Hadoop data lakes 5-10 years ago. I fear the same result as in the Hadoop era: the technical proof worked out, but no real business value was added – only enormous additional and unnecessary complexity and cost.

Distributed Ledger and Blockchain Technologies

It is crucial to understand that people often don’t mean ‘Blockchain’ when they say ‘Blockchain’. What they actually mean is ‘Distributed Ledger Technology’ (DLT). What is the relation between Blockchain and DLT?

Blockchain vs. Distributed Ledger Technology (DLT)

Blockchain: A Catchall Phrase for Distributed Ledger Technology (DLT)

The following explores Blockchain and DLT in more detail:

Distributed Ledger Technology (DLT):

  • Decentralized database
  • Managed by various participants. No central authority acts as an arbitrator or monitor
  • Distributed log of records, greater transparency
  • Fraud and manipulation more difficult, more complicated to hack the system

Blockchain:

  • Blockchain is nothing else but a DLT with a specific set of features
  • Shared database – a log of records – but in this case, shared using blocks that form a chain
  • Blocks are closed by a type of cryptographic signature called a ‘hash’; the next block begins with that same ‘hash’, a kind of wax seal
  • Hashing verifies that the encrypted information has not been manipulated and that it can’t be manipulated

Blockchain Concepts

Blockchain Concepts

A blockchain is either permissionless (i.e., public and accessible by everyone, like Bitcoin) or permissioned (using a group of companies or consortium).

Different consensus algorithms exist, including Proof-of-Work (POW), Proof-of-Stake (POS), or voting systems.

Blockchain uses global consensus across all nodes. In contrast, DLT can build consensus without having to validate across the entire blockchain.

A Blockchain is a growing list of records, called blocks, linked using cryptography. Each block contains a cryptographic hash of the previous block, a timestamp, and transaction data.

Blockchain is a catchall phrase in most of the success stories I have seen in the last years. Many articles you read on the internet say ‘Blockchain’, but the underlying implementation is a DLT. Be aware of this when evaluating these technologies to find the best solution for your project.

The Relation between Kafka and Blockchain

Now you have a good understanding of use cases and concepts of Blockchains and DLTs.

How are blockchain and DLT related to event streaming and the Apache Kafka ecosystem?

I will not give an introduction to Apache Kafka here. Just mentioning a few reasons why Event Streaming with Apache Kafka is used for various use cases in any industry today:

  • Real-time
  • Scalable
  • High throughput
  • Cost reduction
  • 24/7 – zero downtime, zero data loss
  • Decoupling – storage, domain-driven design (DDD)
  • Data (re-)processing and stateful client applications
  • Integration – connectivity to IoT, legacy, big data, everything
  • Hybrid architecture – On-premises, multi-cloud, edge computing
  • Fully managed cloud
  • No vendor lock-in

Single Kafka clusters can even be deployed over multiple regions and globally. Mission-critical deployments without downtime or data loss are crucial for Blockchain and many other use cases. ‘Architecture patterns for distributed, hybrid, edge, and global Apache Kafka deployments‘ explores these topics in detail. High availability is the critical foundation for thinking about using Kafka in the context of blockchain projects!

Kafka is NOT a Blockchain!

Kafka is not a blockchain. But it provides many characteristics required for real-world “enterprise blockchain” projects:

  • Real-Time
  • High Throughput
  • Decentralized database
  • Distributed log of records
  • Immutable log
  • Replication
  • High availability
  • Decoupling of applications/clients
  • Role-based access control to data

Three essential requirements are missing in Kafka for “real blockchain projects”:

  • Tamper-Proof
  • Encrypted payloads
  • Deployment over various independent organizations

These missing pieces are crucial for implementing a blockchain. Having said this, do you need all of them? Or just some of them? Evaluating this question clarifies if you should choose just Kafka, Kafka as native blockchain implementation, or Kafka in conjunction with a blockchain technology like Hyperledger or Ethereum.

Kafka in conjunction with Blockchain – Use Cases and Architectures

I wrote about “Blockchain – The Next Big Thing for Middleware” on InfoQ a long time ago in 2016 already. There is a need to integrate Blockchain and the rest of the enterprise.

Interestingly, many projects from that time don’t exist anymore today, for instance, Microsoft’s Project Bletchley initiative on Github to integrate blockchains with the rest of the enterprise is dead. A lot of traditional middleware (MQ, ETL, ESB) is considered legacy and replaced by Kafka at enterprises across the globe.

Today, blockchain projects use Kafka in two different ways. Either you connect Kafka to one or more blockchains, or you implement a Kafka-native blockchain:

Apache Kafka and Blockchain - DLT - Use Cases and Architectures

Let’s take a look at a few different examples for Kafka in conjunction with blockchain solutions.

Kafka AND Blockchains – A Financial Services Platform

Nash is an excellent example of a modern trading platform for cryptocurrencies using Blockchain under the hood. The heart of Nash’s platform leverages Apache Kafka. The following quote from their community page says:

“Nash is using Confluent Cloud, google cloud platform to deliver and manage its services. Kubernetes and apache Kafka technologies will help it scale faster, maintain top-notch records, give real-time services which are even hard to imagine today.”

Nash - Finserv Cryptocurrency Blockchain exchange and wallet leveraging Apache Kafka and Confluent Cloud

Nash provides the speed and convenience of traditional exchanges and the security of non-custodial approaches. Customers can invest in, make payments with, and trade Bitcoin, Ethereum, NEO, and other digital assets. The exchange is the first of its kind, offering non-custodial cross-chain trading with the full power of a real order book. The distributed, immutable commit log of Kafka enables deterministic replayability in its exact order at any time.

Kafka-Native Blockchain – A Tamper-Proof Blockchain implemented with Kafka

Kafka can be combined with blockchains, as described above. Another option is to use or build a Kafka-native blockchain infrastructure. High scalability and high volume data processing in real-time for mission-critical deployments is a crucial differentiator of using Kafka compared to “traditional blockchain deployments” like Bitcoin or Ethereum.

Kafka enables a blockchain for real-time processing and analysis of historical data with one single platform:

Tamper-proof encrypted event streaming with Apache Kafka and Kafka-native Blockchain

The following sections describe two examples of Kafka-native blockchain solutions:

  • Hyperledger Fabric: A complex, powerful framework used for deployment over various independent organizations
  • AIBlockchain: A flexible and straightforward approach to building a blockchain infrastructure within the enterprise

Hyperledger Fabric leveraging Apache Kafka under the Hood for Transaction Ordering

Hyperledger Fabric is a great (but also very complex) blockchain solution using Kafka under the hood.

Ordering in Hyperledger Fabric is what you might know from other blockchains as a ‘consensus algorithm’. It guarantees the integrity of transactions:

Hyperledger Fabric Blockchain with Apache Kafka

Hyperledger Fabric provides three ordering mechanisms for transactions: SOLO, Apache Kafka, and Simplified Byzantine Fault Tolerance (SBFT). Hyperledger Fabric and Apache Kafka share many characteristics. Hence, this combination is a natural fit. Choosing Kafka for the transaction ordering enables a fault-tolerant, highly scalable, and performant infrastructure.

AIBlockchain – A Tamper-proof Kafka-native Blockchain Implementation

AIBlockchain has implemented and patented a Kafka-native blockchain.

The project KafkaBlockchain is available on Github. It provides a Java library for tamper-evidence using Kafka. Messages are optionally encrypted and hashed sequentially. The library methods are called within a Kafka application’s message producer code to wrap messages and called within the application’s consumer code to unwrap messages.

Because blockchains must be strictly sequentially ordered, Kafka blockchain topics must either have a single partition, or consumers for each partition must cooperate to sequence the records:

KafkaBlockchain

Kafka already implements checksums for message streams to detect data loss. However, an attacker can provide false records that have correct checksums. Cryptographic hashes such as the standard SHA-256 algorithm are very difficult to falsify, which makes them ideal for tamper-evidence despite being a bit more computationally expensive than checksums:

KafkaBlockchain Producer and Consumer for a Kafka-native Blockchain

The provided sample in the Github project stores the first (genesis) message SHA-256 hash for each blockchain topic in ZooKeeper. In production, secret-keeping facilities such as Vault can be used.
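
To illustrate the chaining idea only (this is an illustrative sketch, not the KafkaBlockchain library's API), each payload can be hashed together with the hash of its predecessor, so that modifying any historical record invalidates every later hash:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class HashChainSketch {

    /** Chain a payload to the hash of the previous record using SHA-256. */
    static String chain(String previousHash, String payload) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        digest.update(previousHash.getBytes(StandardCharsets.UTF_8));
        digest.update(payload.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest.digest());
    }

    public static void main(String[] args) throws Exception {
        String genesisHash = chain("", "genesis");        // kept in ZooKeeper, Vault, etc.
        String hash1 = chain(genesisHash, "{\"tx\": 1}"); // each record carries its predecessor's hash
        String hash2 = chain(hash1, "{\"tx\": 2}");
        // A consumer recomputes the chain; any tampered record breaks every subsequent hash.
        System.out.println(hash2);
    }
}
```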

Please contact the AIBlockchain team for more information about the open-source KafkaBlockchain library, their blockchain patent, and the blockchain projects at customers. The on-demand webinar discussed below also covers this in much more detail.

When to Choose Kafka and/or Blockchain?

Price vs. value is the fundamental question to answer when deciding if you should choose a blockchain technology.

Price vs Value - Does Blockchain and DLT add business value to your Apache Kafka project

If you consider blockchain / DLT, it is pretty straightforward to make a comparison and make the right choice:

Use a Kafka-native blockchain such as AIBlockchain for

  • Enterprise infrastructure
  • Open, scalable, real-time requirements
  • Flexible architectures for many use cases

Use a real blockchain / DLTs like Hyperledger, Ethereum, Ripple, Libra, IOTA, et al. for

  • Deployment over various independent organizations (where participants verify the distributed ledger contents themselves)
  • Specific use cases
  • Server-side managed and controlled by multiple organizations
  • Scenarios where the business value overturns the added complexity and project risk

Use Kafka and Blockchain together to combine the benefits of both for

  • Integration between blockchain / DLT technologies and the rest of the enterprise, including CRM, big data analytics, and any other custom business applications
  • Reliable data processing at scale in real-time with Kafka for internal use cases
  • Blockchain for secure communication over various independent organizations

Use only Kafka if you don’t need a blockchain! This is probably true for 95+% of use cases! Period!

Infinite Long-Term Storage in Apache Kafka with Tiered Storage

Today, Kafka works well for recent events, short-horizon storage, and manual data balancing. Kafka’s present-day design offers extraordinarily low messaging latency by storing topic data on fast disks that are collocated with brokers. This concept of combining memory and storage is usually good. But sometimes, you need to store a vast amount of data for a long time.

Blockchain is such a use case!

Therefore let’s talk about long-term storage in Kafka leveraging Tiered Storage.

Confluent Tiered Storage for Kafka

(Disclaimer: The following was written in July 2020 – at a later time, make sure to check the status of Tiered Storage for Kafka as it is expected to evolve soon)

“KIP-405 – Add Tiered Storage support to Kafka” is the work-in-progress proposal for implementing Tiered Storage in open-source Kafka. Confluent is actively working on this with the open-source community. Uber is leading this initiative.

Confluent Tiered Storage is already available today in Confluent Platform and used under the hood in Confluent Cloud:

Confluent Tiered Storage for Kafka

Read about the details and motivation here. Tiered Storage for Kafka creates many benefits:

  • Use Cases for Reprocessing Historical Data: New consumer application, error-handling, compliance / regulatory processing, query and analyze existing events, model training using Machine Learning / Deep Learning frameworks like TensorFlow.
  • Store Data Forever in Kafka: Older data is offloaded to inexpensive object storage, permitting it to be consumed at any time.  Using AiB, storage can be made tamper-proof and immutable.
  • Save $$$: Storage limitations, like capacity and duration, are effectively uncapped leveraging cheap object stores like S3, GCS, MinIO or PureStorage.
  • Instantaneously scale up and down: Your Kafka clusters will be able to automatically self-balance load and hence elastically scale.

A Blockchain needs to store infinite data forever. Event-based with timestamps, encrypted, and tamper-proof. AiB’s tamper-proof Blockchain with KafkaBlockchain is a great example that could leverage Tiered Storage for Kafka.
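
As a small illustration (broker address, topic name, and replication factor are hypothetical), such a long-term topic can simply be created with unlimited retention via the standard AdminClient; with Tiered Storage enabled on the cluster, old segments are then offloaded to object storage transparently:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class ForeverTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        try (Admin admin = Admin.create(props)) {
            // One partition keeps the event log strictly ordered; retention.ms = -1 keeps it forever.
            // With Tiered Storage enabled on the cluster, old segments are offloaded to object storage.
            NewTopic topic = new NewTopic("blockchain-events", 1, (short) 3)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "-1"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```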

Slides and Video: Kafka + Blockchain

I discussed this topic in more detail with Confluent’s partner AIBlockchain.

Check out the slide deck:

Here is the on-demand video recording:

Confluent AIBlockchain webinar

As you learned in this post, Kafka is used in various blockchain scenarios; either complementary to integrate with blockchains / distributed ledger technology and the rest of the enterprise, or Kafka-native implementation. 

What are your experiences with blockchain infrastructures? Does Blockchain add value? Is this worth the added technical and organizational complexity? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss it!

The post Apache Kafka and Blockchain – Comparison and a Kafka-native Implementation appeared first on Kai Waehner.

]]>
Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing https://www.kai-waehner.de/blog/2017/05/01/why-apache-kafka-confluent-open-source-messaging-integration-stream-processing/ Mon, 01 May 2017 13:24:08 +0000 http://www.kai-waehner.de/blog/?p=1166 After three great years at TIBCO Software, I move back to open source and join Confluent, the company behind the open source project Apache Kafka to build mission-critical, scalable infrastructures for messaging, integration and stream processsing. In this blog post, I want to share why I see the future for middleware and big data analytics in open source technologies, why I really like Confluent, what I will focus on in the next months, and why I am so excited about this next step in my career.

The post Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing appeared first on Kai Waehner.

]]>
After three great years at TIBCO Software, I move back to open source and join Confluent, a company focusing on the open source project Apache Kafka, to build mission-critical, scalable infrastructures for messaging, integration and streaming analytics. Confluent is a Silicon Valley startup, still at the beginning of its journey, with a business that grew 700% in 2016 and is expected to grow significantly again in 2017.

In this blog post, I want to share why I see the future for middleware and big data analytics in open source technologies, why I really like Confluent, what I will focus on in the next months, and why I am so excited about this next step in my career.

Let’s talk shortly about three cutting-edge topics which get important in any industry and small, medium and big enterprises these days:

  • Big Data Analytics: Find insights and patterns in big historical datasets.
  • Real Time Streaming Platforms: Apply insights and patterns to new events in real time (e.g. for fraud detection, cross selling or predictive maintenance).
  • Machine Learning (and its hot subtopic Deep Learning): Leverage algorithms and let machines learn by themselves without programming everything explicitly.

These three topics disrupt every industry these days. Note that Machine Learning is related to the other two topics, though today we often see it as an independent topic; many data science projects actually use only very small datasets (often less than a gigabyte of input data). Fortunately, all three topics will be combined more and more to add additional business value.

Some industries are just in the beginning of their journey of disruption and digital transformation (like banks, telcos, insurance companies), others already realized some changes and innovation (e.g. retailers, airlines). In addition to the above topics, some other cutting edge success stories emerge in a few industries, like Internet of Things (IoT) in manufacturing or Blockchain in banking.

With all these business trends on the market, we also see a key technology trend for all these topics: The adoption of open source projects.

Key Technology Trend: Adoption of “Real” Open Source Projects

When I say “open source”, I mean some specific projects. I do not talk about very new, immature projects, but frameworks which have been deployed successfully in production for many years and are used by various different developers and organizations. For example, Confluent’s Docker images like the Kafka REST Proxy or Kafka Schema Registry have each been downloaded over 100,000 times already.

A “real”, successful middleware or analytics open source project has the following characteristics:

  • Openness: Available for free and really open under a permissive license, i.e. you can use it in production and scale it out without any need to purchase a license or subscription (of course, there can be commercial, proprietary add-ons – but they need to be on top of the project, and not change the license for the used open source project under the hood)
  • Maturity: Used in business-relevant or even mission critical environments for at least one year, typically much longer
  • Adoption: Various vendors and enterprises support a project, either by contributing (source code, documentation, add-ons, tools, commercial support) or realizing projects
  • Flexibility: Deployment on any infrastructure, including on premise, public cloud, hybrid. Support for various application environments (Windows, Linux, Virtual Machine, Docker, Serverless, etc.), APIs for several programming languages (Java, .Net, Go, etc.)
  • Integration: Independent and loosely coupled, but also highly integrated (via connectors, plugins, etc.) to other relevant open source and commercial components

After defining key characteristics for successful open source projects, let’s take a look some frameworks with huge momentum.

Cutting Edge Open Source Technologies: Apache Hadoop, Apache Spark, Apache Kafka

I defined three key trends above which are relevant for any industry and many (open source and proprietary) software vendors. There is a clear trend towards some open source projects as de facto standards for new projects:

  • Big Data Analytics: Apache Hadoop (and its zoo of sub projects like Hive, Pig, Impala, HBase, etc.) and Apache Spark (which is often separated from Hadoop in the meantime) to store, process and analyze huge historical datasets
  • Real Time Streaming Platforms: Apache Kafka – not just for highly scalable messaging, but also for integration and streaming analytics. Platforms either use Kafka Streams to build stream processing applications / microservices or an “external” framework like Apache Flink, Apex, Storm or Heron.
  • Machine Learning: No clear “winner” here (and that is a good thing in my opinion as it is so multifaceted). Many great frameworks are available – for example, R, Python and Scala offer various great implementations of Machine Learning algorithms, and specific frameworks like Caffe, Torch, TensorFlow or MXNet focus on Deep Learning and Artificial Neural Networks.

On top of these frameworks, various vendors build open source or proprietary tooling and offer commercial support. Think about the key Hadoop / Spark vendors: Hortonworks, Cloudera, MapR and others, or KNIME, RapidMiner or H2O.ai as specialized open source tools for machine learning in a visual coding environment.

Of course, there are many other great open source frameworks not mentioned here but also relevant on the market, for example RabbitMQ and ActiveMQ for messaging or Apache Camel for integration. In addition, new “best practice stacks” are emerging, like the SMACK Stack which combines Spark, Mesos, Akka, and Kafka.

I am so excited about Apache Kafka and Confluent because they are already used in every industry, by small and big enterprises alike. Apache Kafka production deployments accelerated in 2016, and it is now used by one-third of the Fortune 500. And this is just the beginning. Apache Kafka is not an all-rounder to solve all problems, but it is awesome at the things it is built for – as the huge and growing number of users, contributors and production deployments proves. It is highly integrated with many other frameworks and tools. Therefore, I will not just focus on Apache Kafka and Confluent in my new job, but also on many other technologies as discussed later.

Let’s next think about the relation of Apache Kafka and Confluent to proprietary software.

Open Source vs. Proprietary Software – A Never-ending War?

The trend is moving towards open source technologies these days. This is not a question, but a fact. I have not seen a single customer in the last years who does not have projects and huge investments around Hadoop, Spark and Kafka. In the meantime, this has changed from labs and first small projects to enterprise de facto standards and company-wide deployments and strategies. Closed legacy software is being replaced more and more – to reduce costs, but even more importantly to be more flexible, up-to-date and innovative.

What does this mean for proprietary software?

For some topics, I do not see much traction or demand for proprietary solutions. Two very relevant examples where closed software ruled the last ~20 years: Messaging solutions and analytics platforms. Open frameworks seem to replace them almost everywhere in any industry and enterprise in new projects (for good reasons).

New messaging projects are based on standards like MQTT or frameworks like Apache Kafka. Analytics is done with R and Python in conjunction with frameworks like scikit-learn or TensorFlow. These options leverage flexible, but also very mature implementations. Often, there is no need for a lot of proprietary, inflexible, complex or expensive tooling on top of it. Even IBM, the mega vendor, focuses on offerings around open source in the meantime.

IBM invests millions into Apache Spark for big data analytics and puts over 3500 researchers and developers to work on Spark-related projects instead of just pushing towards its various own proprietary analytics offerings like IBM SPSS. If you search for “IBM Messaging”, you find offerings based on AMQP standard and cloud services based on Apache Kafka instead of pushing towards new proprietary solutions!

I think IBM is a great example of how the software market is changing these days. Confluent (just in the beginning of its journey) or Cloudera (just went public with a successful IPO) are great examples for Silicon Valley startups going the same way.

In my opinion, a good proprietary software leverages open source technologies like Apache Kafka, Apache Hadoop or Apache Spark. Vendors should integrate natively with these technologies. Some opportunities for vendors:

  • Visual coding (for developers) to generate code (e.g. graphical components, which generate framework-compliant source code for a Hadoop or Spark job)
  • Visual tooling (e.g. for business analysts or data scientists), like Visual Analytics tools which connect to big data stores to find insights and patterns
  • Infrastructure tooling for management and monitoring of open source infrastructure (e.g. tools to monitor and scale Apache Kafka messaging infrastructures)
  • Add-ons which are natively integrated with open source frameworks (e.g. instead of requiring own proprietary runtime and messaging infrastructures, an integration vendor should deploy its integration microservices on open cloud-native platforms like Kubernetes or Cloud Foundry and leverage open messaging infrastructures like Apache Kafka instead of pushing towards proprietary solutions)

Open Source and Proprietary Software Complement Each Other

Therefore, I do not see this as a discussion of “open source software” versus “proprietary software”. Both complement each other very well. You should always ask the following questions before making a decision for open source software, proprietary software or a combination of both:

  • What is the added value of the proprietary solution? Does it increase the complexity and the footprint of runtime and tooling?
  • What is the expected total cost of ownership of a project (TCO), i.e. license / subscription + project lifecycle costs?
  • How to realize the project? Who will support you, how do you find experts for delivery (typically consulting companies)? Integration and analytics projects are often huge with big investments, so how to make sure you can deliver (implementation, test, deployment, operations, change management, etc.)? Can we get commercial support for our mission-critical deployments (24/7)?
  • How to use this project with the rest of the enterprise architecture? Do you deploy everything on the same cluster? Do we want to set some (open) de facto standards in different departments and business units?
  • Do we want to use the same technology in new environments without limited 30-day trials or annoying sales cycles, and maybe even deploy it to production without any license / subscription costs?
  • Do we want to add changes, enhancements or fixes to the platform by ourselves (e.g. if we need a specific feature immediately, not in six months)?

Let’s think about a specific example with these questions in mind:

Example: Do you still need an Enterprise Service Bus (ESB) in a World of Cloud-Native Microservices?

I faced this question a lot in the last 24 months, especially with the trend moving to flexible, agile microservices (not just for building business applications, but also for integration and analytics middleware). See my article “Do Microservices Spell the End of the ESB?”. The short summary: You still need middleware (call it ESB, integration platform, iPaaS, or something else), though the requirements are different today. This is true for open source and proprietary ESB products. However, something else has changed in the last 24 months…

In the past, open source and proprietary middleware vendors offered an ESB as integration platform. The offering included a runtime (to guarantee scalable, mission-critical operation of integration services) and a development environment (for visual coding and faster time-to-market). The last two years changed how we think about building new applications. We now (want to) build microservices, which run in a Docker container. The scalable, mission-critical runtime is managed by a cloud-native platform like Kubernetes or Cloud Foundry. Ideally, DevOps automates builds, tests and deployment. These days, ESB vendors adopt these technologies. So far, so good.

Now, you can deploy your middleware microservice to these cloud-native platforms like any other Java, .NET or Go microservice. However, this completely changes the added value of the ESB platform. Now, its benefit is just about visual coding and the key argument is time-to-market (you should always question and doublecheck if it is a valid argument). The runtime is not really part of the ESB anymore. In most scenarios, this completely changes the view on deciding if you still need an ESB. Ask yourself about time-to-market, license / subscription costs and TCO again! Also think about the (typically) increased resource requirements (Memory, CPU, Disk) of tooling-built services (typically some kind of big .ear file), compared to plain source code (e.g. Java’s .jar files).

Is the ESB still enough added value, or should you just use a cloud-native platform and a messaging infrastructure? Is it easier to write a few lines of source code instead of setting up the ESB tooling, where you often struggle with importing your REST / Swagger or WSDL files and many other configuration environments before you can actually begin to leverage the visual coding environment? In very big, long-running projects, you might finally end up with a win. Though, in an agile, ever-changing world with a fail-fast ideology, various different technologies and frameworks, and automated CI/CD stacks, you might only add new complexity and not get the expected value anymore, unlike in the old world where the ESB was also the mission-critical runtime. The same is true for other middleware components like stream processing, analytics platforms, etc.

ESB Alternative: Apache Kafka and Confluent Open Source Platform

As alternative, you could use for example Kafka Connect, which is a very lightweight integration library based on Apache Kafka to build large-scale low-latency data pipelines. The beauty of Kafka Connect is that all the challenges around scalability, fail-over and performance are leveraged from the Kafka infrastructure. You just have to use the Kafka Connect connectors to realize very powerful integrations with a few lines of configuration for sources and sinks. If you use Apache Kafka as messaging infrastructure anyway, you need to find some very compelling reasons to use an ESB on top instead of the much less complex and much less heavyweight Kafka Connect library.
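
For illustration, registering such a pipeline is just a small configuration submitted to the Kafka Connect REST API. The sketch below uses the FileStreamSinkConnector that ships with Apache Kafka; host, connector name, topic, and file path are hypothetical:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // A sink is just a few lines of configuration; Kafka Connect handles scaling and fail-over.
        String connectorConfig = """
            {
              "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
              "tasks.max": "1",
              "topics": "orders",
              "file": "/tmp/orders.txt"
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect:8083/connectors/orders-file-sink/config")) // hypothetical host
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(connectorConfig))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```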

I think this section explained why I think that open source and proprietary software are complementary in many use cases. But it does not make sense to add heavyweight, resource intensive and complex tooling in every scenario. Open source is not free (you still need to spend time and efforts on the project and probably pay money for some kind of commercial support), but often open source without too much additional proprietary tooling is the better choice regarding complexity, time-to-market and TCO. You can find endless success stories about open source projects; not just from tech giants like Google, Facebook or LinkedIn, but also from many small and big traditional enterprises. Of course, any project can fail. Though, in projects with frameworks like Hadoop, Spark or Kafka, it is probably not due to issues with technology…

Confluent + “Proprietary Mega Vendors”

On the other side, I really look forward to working together with “mostly proprietary” mega vendors like TIBCO, SAP, SAS and others where it makes sense to solve customer problems and build innovative, powerful, mission-critical solutions. For example, TIBCO StreamBase is a great tool if you want to develop stream processing applications via a visual editor instead of writing source code. Actually, it does not even compete with Kafka Streams, because the latter is a library which you embed into any other microservice or application (deployed anywhere, e.g. in a Java application, Docker container, Apache Mesos, “you-choose-it”), while StreamBase (like its competitors Software AG Apama, IBM Streams and all the other open source frameworks like Apache Flink, Storm, Apache Apex, Heron, etc.) focuses on building streaming applications on its own cluster (typically either deployed on Hadoop’s YARN or on a proprietary cluster). Therefore, you could use StreamBase and its Kafka connector to build streaming applications leveraging Kafka as messaging infrastructure.

Even Confluent itself does offer some proprietary tooling like Confluent Control Center for management and monitoring on top of open source Apache Kafka and open source Confluent Platform, of course. This is the typical business model you see behind successful open source vendors like Red Hat: Embrace open source projects, offer 24/7 support and proprietary add-ons for enterprise-grade deployments. Thus, not everything is or needs to be open source. That’s absolutely fine.

So, after all these discussions about business and technology trends, open source and proprietary software, what will I do in my new job at Confluent?

Confluent Platform in Conjunction with Analytics, Machine Learning, Blockchain, Enterprise Middleware

Of course, I will focus a lot on Apache Kafka and Confluent Platform in my new job where I will work mainly with prospects and customers in EMEA, but also continue as Technology Evangelist with publications and conference talks. Let’s get into a little bit more detail here…

My focus never was being a deep level technology expert or fixing issues in production environments (but I do hands-on coding, of course). Many other technology experts are way better in very technical discussions. As in the past, I will focus on designing mission-critical enterprise architectures, discussing innovative real world use cases with prospects and customers, and evaluating cutting edge technologies in conjunction with the Confluent Platform. Here are some of my ideas for the next months:

  • Apache Kafka + Cloud Native Platforms = Highly Scalable Streaming Microservices (leveraging platforms like Kubernetes, Cloud Foundry, Apache Mesos)
  • Highly Scalable Machine Learning and Deep Learning Microservices with Apache Kafka Streams (using TensorFlow, MXNet, H2O.ai, and other open source frameworks)
  • Online Machine Learning (i.e. updating analytics models in real time for every new event) leveraging Apache Kafka as infrastructure backbone
  • Open Source IoT Platform for Core and Edge Streaming Analytics (leveraging Apache Kafka, Apache Edgent, and other IoT frameworks)
  • Comparison of Open Source Stream Processing Frameworks (differences between Kafka Streams and other modern frameworks like Heron, Apache Flink, Apex, Spark Streaming, Edgent, Nifi, StreamSets, etc.)
  • Apache Kafka / Confluent and other Enterprise Middleware (discuss when to combine proprietary middleware with Apache Kafka, and when to simply “just” use Confluent’s open source platform)
  • Middleware and Streaming Analytics as Key for Success in Blockchain Projects

You can expect publications, conference and meetup talks, and webinars about these and other topics in 2017 like I did it in the last years. Please let me know what you are most interested in and what other topics you would like to hear about!

I am also really looking forward to work together with partners on scalable, mission-critical enterprise architectures and powerful solutions around Apache Kafka and Confluent Platform. This will include combined solutions and use cases with open source but also proprietary software vendors.

Last but not least the most important part, I am excited to work with prospects and customers from traditional enterprises, tech giants and startups to realize innovative use cases and success stories to add business value.

As you can see, I am really excited to start at Confluent in May 2017. I will visit Confluent’s London and Palo Alto offices in the first weeks and also be at Kafka Summit in New York. Thus, an exciting month to get started in this awesome Silicon Valley startup.

Please let me know your feedback. Do you see the same trends? Do you share my opinions or disagree? Looking forward to discuss all these topics with customers, partners and anybody else in upcoming meetings, workshops, publications, webinars, meetups and conference talks.

The post Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing appeared first on Kai Waehner.

]]>
Agile Cloud-to-Cloud Integration with iPaaS, API Management and Blockchain https://www.kai-waehner.de/blog/2017/04/23/agile-cloud-cloud-integration-ipaas-api-management-blockchain/ Sun, 23 Apr 2017 18:41:06 +0000 http://www.kai-waehner.de/blog/?p=1152 Agile Cloud-to-Cloud Integration with iPaaS, API Management and Blockchain. Scenario use case using IBM's open source Hyperledger Fabric on BlueMix, TIBCO Cloud Integration (TCI) and Mashery.

The post Agile Cloud-to-Cloud Integration with iPaaS, API Management and Blockchain appeared first on Kai Waehner.

]]>
Cloud-to-Cloud integration is part of a hybrid integration architecture. It enables you to implement quick and agile integration scenarios without the burden of setting up complex VM- or container-based infrastructures. One key use case for cloud-to-cloud integration is innovation using a fail-fast methodology where you realize new ideas quickly. You typically think in days or weeks, not in months. If an idea fails, you throw it away and start another new idea. If the idea works well, you scale it out and bring it into production on an on-premise, cloud or hybrid infrastructure. Finally, you expose the idea and make it easily available to any interested service consumer in your enterprise, to partners, or to public end users.

A great example of where you need agile, fail-fast development is blockchain: it is a very hot topic, but frameworks are immature and change very frequently these days. Note that blockchain is not just about Bitcoin and the finance industry. That is just the tip of the iceberg. Blockchain, the underlying infrastructure of Bitcoin, will change various industries, beginning with banking, manufacturing, supply chain management, energy, and others.

Middleware and Integration as Key for Success in Blockchain Projects

A key to successful blockchain projects is middleware, as it allows integration with the rest of the enterprise architecture. Blockchain only adds value if it works together with your other applications and services. See an example of how to combine streaming analytics with TIBCO StreamBase and Ethereum to correlate blockchain events in real time and act proactively, e.g. for fraud detection; a minimal sketch of the event-subscription side of this idea follows below.
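The linked example uses TIBCO StreamBase for the streaming analytics part. As a rough, code-level illustration of just the event-subscription side of the idea, here is a minimal Go sketch that listens to Ethereum contract events via the go-ethereum client and flags suspiciously large values. The WebSocket endpoint, contract address, and threshold are hypothetical placeholders, and real fraud detection would decode events with the contract ABI and correlate them with many other data streams.

```go
package main

import (
	"context"
	"log"
	"math/big"

	"github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// Hypothetical WebSocket endpoint of an Ethereum node.
	client, err := ethclient.Dial("ws://localhost:8546")
	if err != nil {
		log.Fatalf("cannot connect to Ethereum node: %v", err)
	}

	// Watch events emitted by one (hypothetical) smart contract.
	query := ethereum.FilterQuery{
		Addresses: []common.Address{
			common.HexToAddress("0x0000000000000000000000000000000000000000"),
		},
	}

	logs := make(chan types.Log)
	sub, err := client.SubscribeFilterLogs(context.Background(), query, logs)
	if err != nil {
		log.Fatalf("cannot subscribe to contract events: %v", err)
	}

	// Illustrative threshold only: real logic would be far more sophisticated.
	threshold := big.NewInt(1_000_000)

	for {
		select {
		case err := <-sub.Err():
			log.Fatalf("subscription error: %v", err)
		case ev := <-logs:
			// In a real system you would decode ev.Data with the contract ABI
			// and correlate it with other streams (CRM, payments, ...).
			value := new(big.Int).SetBytes(ev.Data)
			if value.Cmp(threshold) > 0 {
				log.Printf("possible fraud: tx %s carries value %s", ev.TxHash.Hex(), value)
			}
		}
	}
}
```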

The drawback of blockchain is its immaturity today. APIs change with every minor release, development tools are buggy and lack many features required for serious development, and new blockchain cloud services, frameworks, and programming languages come and go every quarter.

This blog post shows how to leverage cloud integration with iPaaS and API Management to realize innovative projects quickly, fail fast, and adapt to changing technologies and business requirements easily. We will use TIBCO Cloud Integration (TCI), TIBCO Mashery, and Hyperledger Fabric running as an IBM Bluemix cloud service.

IBM Hyperledger Fabric as Bluemix Cloud Service

Hyperledger is a vendor-independent open source blockchain project which consists of various components. Many enterprises and software vendors have committed to it to build different solutions for various problems and use cases. One example is IBM’s Hyperledger Fabric. The flip side of being a very flexible construction kit is that Hyperledger is much more complex to get started with than other blockchain platforms, like Ethereum, which are less flexible but therefore easier to set up and use.

This is the reason why I use the Bluemix BaaS (Blockchain as a Service) to get started quickly, without spending days on just setting up my own Hyperledger network. You can start a Hyperledger Fabric blockchain infrastructure with four peers and a membership service with just a few clicks. It takes around two minutes. Hyperledger Fabric is evolving fast, which makes it a great example of quickly changing features, evolving requirements, and (potentially) fast-failing projects.

My project uses Hyperledger Fabric 0.6 as a free beta cloud service. I leverage its Swagger-based REST API in my middleware microservice to interconnect other cloud services with the blockchain.

However, when I began the project, it was already clear that the REST interface is deprecated and will no longer be included in the upcoming 1.0 release. Thus, we knew from the start that a move to another API would be needed within a few weeks or months.

Cloud-to-Cloud Integration, iPaaS and API Management for Agile and Fail-Fast Development

As mentioned in the introduction, middleware is key to success in blockchain projects because it interconnects the blockchain with the existing enterprise architecture. This example leverages the following middleware components:

  • iPaaS: TIBCO Cloud Integration (TCI) is hosted and managed by TIBCO. It can be used without any setup or installation to quickly build a REST microservice that integrates with Hyperledger Fabric. TCI also allows you to configure caching, throttling, and security to ensure controlled communication between the blockchain and other applications and services.
  • API Management: TIBCO Mashery is used to expose the TCI REST microservice to other service consumers. These can be internal teams, partners, or public end users, depending on the use case.
  • On-Premise / Cloud-Native Integration Infrastructure: TIBCO BusinessWorks 6 (BW6) or TIBCO BusinessWorks Container Edition (BWCE) can be used to deploy and productionize successful TCI microservices in your existing infrastructure, either on premise or on a cloud-native platform like Cloud Foundry, Kubernetes, or any other Docker-based platform. Of course, you can also continue to run and scale the production services within TCI itself in the public cloud, hosted and managed by TIBCO.

Scenario: TIBCO Cloud Integration + Mashery + Hyperledger Fabric Blockchain + Salesforce

I implemented the following technical scenario with the goal of showing agile cloud-to-cloud integration with a fail-fast methodology from a technical perspective (rather than building a great business case): the scenario implements a new REST microservice with TCI via visual coding and configuration. This REST service connects to two cloud services: Salesforce and Hyperledger Fabric.

Here are the steps to realize this scenario:

  • Create a REST service which receives a customer ID and customer name as parameters.
  • Enhance the input data from the REST call with additional data from Salesforce CRM. In this case, we get a block hash which is stored in Salesforce as a reference to the blockchain data. The block hash allows us to double-check whether Salesforce has the most up-to-date information about a customer, while the blockchain itself ensures that customer updates from various systems (CRM, ERP, mainframe) are stored and updated in a correct, secure, distributed chain – which can be accessed by all related systems, not just Salesforce.
  • Use the Hyperledger REST API to validate, via the block hash from Salesforce, that the customer information in Salesforce is up to date. If Salesforce has an older block hash, update Salesforce with the current values (both the most recent block hash and the current customer data stored on the blockchain); see the Go sketch after this list.
  • Return the up-to-date customer data to the service consumer of the TCI REST service.
  • Configure caching, throttling, and security requirements so that service consumers cannot cause unexpected behavior.
  • Leverage API Management to expose the TCI REST service as a public API so that external service consumers can subscribe to a payment plan and use it in their own applications or microservices.
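To make the block-hash check more tangible, here is a minimal Go sketch of the decision logic that the TCI flow implements graphically (my own illustration, not TCI-generated code). The Customer type and the helpers readCustomerFromSalesforce, queryCustomerFromBlockchain, and updateSalesforce are hypothetical placeholders for the respective cloud calls.

```go
package integration

// Customer is a simplified, hypothetical view of the customer record.
type Customer struct {
	ID        string
	Name      string
	BlockHash string // reference to the blockchain entry, as stored in Salesforce
}

// Hypothetical connectors; in the TCI flow these are the Salesforce and Rest Client activities.
func readCustomerFromSalesforce(id string) (Customer, error)  { /* ... */ return Customer{}, nil }
func queryCustomerFromBlockchain(id string) (Customer, error) { /* ... */ return Customer{}, nil }
func updateSalesforce(c Customer) error                       { /* ... */ return nil }

// getUpToDateCustomer mirrors the steps of the scenario: read from Salesforce,
// validate its block hash against the blockchain, update Salesforce if stale,
// and return the current customer data to the REST consumer.
func getUpToDateCustomer(id string) (Customer, error) {
	sfCustomer, err := readCustomerFromSalesforce(id)
	if err != nil {
		return Customer{}, err
	}

	chainCustomer, err := queryCustomerFromBlockchain(id)
	if err != nil {
		return Customer{}, err
	}

	// If Salesforce references an older block hash, it is out of date:
	// push the current blockchain state (hash and data) back into Salesforce.
	if sfCustomer.BlockHash != chainCustomer.BlockHash {
		if err := updateSalesforce(chainCustomer); err != nil {
			return Customer{}, err
		}
		return chainCustomer, nil
	}

	return sfCustomer, nil
}
```

In the actual scenario, this logic lives in the graphical mappings between the Salesforce and Rest Client activities rather than in hand-written code.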

Implementation: Visual Coding, Web Configuration and Automation via DevOps

This section discusses the relevant steps of the implementation in more detail, including development and deployment of the chain code (Hyperledger’s term for smart contracts), implementation of the REST integration service with visual coding in TCI’s IDE, and exposing the service as an API via TIBCO Mashery.

Chain Code for Hyperledger Fabric on IBM Bluemix Cloud

Hyperledger Fabric has some detailed “getting started” tutorials. I used a “Hello World” tutorial and adapted it to our use case so that you can store, update, and query customer data on the blockchain network; a simplified sketch of such chain code follows below. Hyperledger Fabric uses Go for chain code and will allow other programming languages like Java in future releases. This is both a pro and a con. The benefit is that you can leverage your existing expertise in these programming languages. However, you also have to be careful not to use “wrong” features like threads, infinite loops, or other constructs which cause unexpected behavior on a blockchain. Personally, I prefer the concept of Ethereum: this platform uses Solidity, a programming language explicitly designed for developing smart contracts (i.e. chain code).
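For illustration, here is a heavily simplified Go sketch of such customer chain code, loosely following the Fabric 0.6-era shim interface (Init/Invoke/Query over a key-value world state). The exact shim types and method signatures changed between Fabric releases, so treat this as a sketch of the idea rather than copy-paste code for a specific version; the struct and function names are my own.

```go
package main

import (
	"errors"

	// NOTE: signatures below follow the Fabric 0.6-era shim; later releases changed the interface.
	"github.com/hyperledger/fabric/core/chaincode/shim"
)

// CustomerChaincode stores customer records as simple key-value pairs
// (customer ID -> customer data) in the world state.
type CustomerChaincode struct{}

// Init is called when the chain code is deployed; nothing to set up here.
func (c *CustomerChaincode) Init(stub shim.ChaincodeStubInterface, function string, args []string) ([]byte, error) {
	return nil, nil
}

// Invoke writes or updates a customer record: args[0] = customer ID, args[1] = customer data (e.g. JSON).
func (c *CustomerChaincode) Invoke(stub shim.ChaincodeStubInterface, function string, args []string) ([]byte, error) {
	if len(args) != 2 {
		return nil, errors.New("expecting customer ID and customer data")
	}
	if err := stub.PutState(args[0], []byte(args[1])); err != nil {
		return nil, err
	}
	return nil, nil
}

// Query reads a customer record by ID: args[0] = customer ID.
func (c *CustomerChaincode) Query(stub shim.ChaincodeStubInterface, function string, args []string) ([]byte, error) {
	if len(args) != 1 {
		return nil, errors.New("expecting customer ID")
	}
	return stub.GetState(args[0])
}

func main() {
	if err := shim.Start(new(CustomerChaincode)); err != nil {
		panic(err)
	}
}
```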

Cloud-to-Cloud Integration via REST Service with TIBCO Cloud Integration

The following steps were necessary to implement the REST microservice:

  • Create REST service interface with API Modeler web UI leveraging Swagger
  • Import Swagger interface into TCI’s IDE
  • Use the Salesforce activity to read the block hash from Salesforce’s cloud interface
  • Use the Rest Client activity to do an HTTP request to Hyperledger Fabric’s REST interface (a raw HTTP sketch of this call follows after this list)
  • Configure graphical mappings between the activities (REST request → Salesforce → Hyperledger Fabric → REST response)
  • Deploy microservice to TCI’s runtime in the cloud and test it from Swagger UI
  • Configure caching and throttling in TCI’s web UI
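Outside of TCI, the call that the Rest Client activity performs looks roughly like the following Go sketch. The /chaincode endpoint and the JSON-RPC-style payload follow the Fabric 0.6 REST API as I remember it; the endpoint URL, chaincode name, function name, and arguments are placeholders, so double-check the payload against the API documentation of your Fabric version.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
)

// chaincodeRequest models the JSON-RPC-style request body of the Fabric 0.6
// REST API (deprecated in 1.0). Field names are based on the 0.6 docs.
type chaincodeRequest struct {
	Jsonrpc string `json:"jsonrpc"`
	Method  string `json:"method"`
	Params  struct {
		Type        int `json:"type"`
		ChaincodeID struct {
			Name string `json:"name"`
		} `json:"chaincodeID"`
		CtorMsg struct {
			Function string   `json:"function"`
			Args     []string `json:"args"`
		} `json:"ctorMsg"`
	} `json:"params"`
	ID int `json:"id"`
}

func main() {
	// Build a "query" request for the customer chain code (names are placeholders).
	req := chaincodeRequest{Jsonrpc: "2.0", Method: "query", ID: 1}
	req.Params.Type = 1 // Go chain code
	req.Params.ChaincodeID.Name = "customer_cc"
	req.Params.CtorMsg.Function = "read"
	req.Params.CtorMsg.Args = []string{"customer-4711"}

	body, err := json.Marshal(req)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder URL of the Bluemix-hosted REST endpoint.
	resp, err := http.Post("https://example-fabric-peer.example.com/chaincode",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	result, _ := io.ReadAll(resp.Body)
	fmt.Println(string(result))
}
```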

The last bullet – caching and throttling – is very important in blockchain projects. A blockchain infrastructure does not have millisecond response latency. Furthermore, every blockchain update costs money – Bitcoin, Ether, or whatever currency you use to “pay for mining” and reach consensus in your blockchain infrastructure. Therefore, you can leverage caching and throttling in blockchain projects much more than in many other projects (only if it fits your business scenario, of course). The sketch below illustrates the idea of a simple read-through cache in front of a blockchain query.
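In TCI, caching and throttling are pure web-UI configuration. To illustrate why a cache in front of a blockchain matters, here is a minimal, hand-rolled Go sketch of a read-through cache with a time-to-live wrapped around a hypothetical blockchain query function; the query function and the TTL value are assumptions for illustration only.

```go
package cache

import (
	"sync"
	"time"
)

// entry holds a cached blockchain response and when it was fetched.
type entry struct {
	value     []byte
	fetchedAt time.Time
}

// BlockchainCache caches query results so that repeated reads do not hit
// the (slow, potentially costly) blockchain for every request.
type BlockchainCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
	// query is the real lookup, e.g. a call to the Fabric REST API (hypothetical).
	query func(key string) ([]byte, error)
}

func New(ttl time.Duration, query func(string) ([]byte, error)) *BlockchainCache {
	return &BlockchainCache{ttl: ttl, entries: make(map[string]entry), query: query}
}

// Get returns the cached value if it is still fresh, otherwise it queries
// the blockchain and refreshes the cache (kept simple: the lock is held during the query).
func (c *BlockchainCache) Get(key string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if e, ok := c.entries[key]; ok && time.Since(e.fetchedAt) < c.ttl {
		return e.value, nil
	}

	value, err := c.query(key)
	if err != nil {
		return nil, err
	}
	c.entries[key] = entry{value: value, fetchedAt: time.Now()}
	return value, nil
}
```

Every cache hit within the TTL avoids a slow and potentially costly blockchain round trip, which is exactly what the TCI caching configuration achieves without any code.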

The REST service is implemented in TCI’s graphical development environment, and the caching behavior is configured in TCI’s web user interface.

Exposing the REST Service to Service Consumers through API Management via TIBCO Mashery

The deployed TCI microservice can either be accessed directly by service consumers or exposed via API Management, e.g. with TIBCO Mashery. The latter option allows you to define rules, payment plans, and security configuration in a centralized manner with easy-to-use tooling.

Note that the deployment to TCI and exposing its API via Mashery can also be done in a single step. Both products are loosely coupled, but highly integrated. Also note that this is typically not done manually (as in this technical demonstration), but integrated into a DevOps infrastructure leveraging tools like Maven or Jenkins to automate the deployment and testing steps.

Agile Cloud-to-Cloud Integration with Fail-Fast Methodology

Our scenario is now implemented successfully. However, as mentioned above, the REST interface is deprecated and will no longer be included in the upcoming 1.0 release. In blockchain frameworks, you can expect changes like this very frequently.

While implementing this example, IBM announced at its big user conference, IBM InterConnect in Las Vegas, that Hyperledger Fabric 1.0 is now publicly available. Well, it has indeed been available on Bluemix since that day, but if you try to start the service, you get the following message: “Due to high demand of our limited beta, we have reached capacity.” A strange error, as I thought the whole point of public cloud infrastructures like Amazon AWS, Microsoft Azure, Google Cloud Platform, or IBM Bluemix was exactly this: to scale out easily… :-)

Anyway, the consequence is that we still have to work with our 0.6 version and do not yet know when we will be able to – or have to – migrate to version 1.0. The good news is that iPaaS and cloud-to-cloud integration make such changes quick to realize. In the case of our TCI REST service, we just need to replace the REST activity calling the Hyperledger REST API with a Java activity and leverage Hyperledger’s Java SDK – as soon as it is available. Right now, only a client SDK for Node.js is available – not really an option for “enterprise projects” where you want to leverage the JVM / Java platform instead of JavaScript. Side note: the topic of using Java vs. JavaScript in blockchain projects is also well discussed in “Static Type Safety for DApps without JavaScript”.

This blog post focuses on just a small example, and Hyperledger Fabric 1.0 will of course bring other new features and concept changes with it. The same is true for SaaS cloud offerings such as Salesforce: you cannot control when they change their API or what exactly they change, but you have to adapt within a relatively short timeframe or your service will stop working. An iPaaS is the perfect solution for these scenarios, as you do not need a lot of complex setup in your private data center or on a public cloud platform. You just use it “as a service”, change it, replace logic, or stop it and throw it away to start a new project. The built-in integration with API Management and DevOps support also allows you to expose new versions to your service consumers easily and quickly.

Conclusion: Be Innovative by Failing Fast with Cloud Solutions

Blockchain is at a very early stage. This is true for platforms, tooling, and even the basic concepts and theories (like consensus algorithms or security enforcement). Don’t trust the vendors when they say blockchain is now 1.0 and ready for prime time. It will still feel more like a 0.5 beta version these days. This is not just true for Hyperledger and its related projects such as IBM’s Hyperledger Fabric, but also for all others, including Ethereum and all the interesting frameworks and startups emerging around it.

Therefore, you need to be flexible and agile in blockchain development today. You need to be able to fail fast. This means setting up a project quickly, trying out new, innovative ideas, and throwing them away if they do not work, so you can start the next one. The same is true for other innovative ideas – not just for blockchain.

Middleware helps integrate new, innovative ideas with existing applications and enterprise architectures. It is used to interconnect everything, correlate events in real time, and find insights and patterns in correlated information to create new added value. To support innovative projects, middleware itself needs to be flexible, agile, and supportive of fail-fast methodologies. This post showed how you can leverage iPaaS with TIBCO Cloud Integration and API Management with TIBCO Mashery to build middleware microservices for innovative cloud-to-cloud integration projects.

The post Agile Cloud-to-Cloud Integration with iPaaS, API Management and Blockchain appeared first on Kai Waehner.

Blockchain, Integration, Streaming Analytics, Ethereum, Hyperledger https://www.kai-waehner.de/blog/2017/02/24/blockchain-integration-streaming-analytics-ethereum-hyperledger/ Fri, 24 Feb 2017 08:47:36 +0000 http://www.kai-waehner.de/blog/?p=1135 Blockchain, Integration, Streaming Analytics, Ethereum, Hyperledger => Why blockchain is the next big thing for middleware and analytics.

The post Blockchain, Integration, Streaming Analytics, Ethereum, Hyperledger appeared first on Kai Waehner.

In the past few weeks, I have published a few articles, slide decks, and videos around Blockchain, Middleware, Integration, Streaming Analytics, Ethereum, and Hyperledger. I want to share the links here…

Blockchain – The Next Big Thing for Middleware

InfoQ article: “Blockchain – The Next Big Thing for Middleware”

Key takeaways:

  • Blockchain is not just for Bitcoin
  • A blockchain is a protocol and ledger for building an immutable historical record of transactions
  • There is no new technology behind blockchain, just established components combined in a new way
  • Middleware is key for success to integrate blockchain with the rest of an enterprise architecture
  • Setting up your own blockchain is a complex process; Blockchain as a Service (BaaS) can allow faster adoption

Here is a video recording which summarizes why Blockchain is the Next Big Thing for Middleware.

Blockchain + Streaming Analytics = Smart Distributed Applications

Blog post at blockchainers: “Blockchain + Streaming Analytics = Smart Distributed Applications”

We are approaching Blockchain 3.0 these days. “3.0”, you ask? Seriously? Blockchain is still in the early adoption phase! Yes, that is true. Nevertheless, as we move toward real-world use cases in various industries, we see that you need more than just a blockchain infrastructure. You need to build decentralized applications. These include blockchains and smart contracts, plus other applications, plus integration and analytics on top of these solutions. This article shows how to leverage streaming analytics in combination with blockchain platforms like Hyperledger or Ethereum.

You can also take a look at the video recording about “Streaming Analytics with TIBCO StreamBase and Ethereum Blockchain”.

Upcoming Content for Blockchain, Middleware and Big Data Analytics

Right now, we are working on more very interesting blockchain content around these topics.

Stay tuned for more interesting blockchain content…

The post Blockchain, Integration, Streaming Analytics, Ethereum, Hyperledger appeared first on Kai Waehner.
