Fully Managed (SaaS) vs. Partially Managed (PaaS) Cloud Services for Data Streaming with Kafka and Flink https://www.kai-waehner.de/blog/2025/01/18/fully-managed-saas-vs-partially-managed-paas-cloud-services-for-data-streaming-with-kafka-and-flink/ (Sat, 18 Jan 2025)

The cloud revolution has reshaped how businesses deploy and manage data streaming with solutions like Apache Kafka and Flink. Distinctions between SaaS and PaaS models significantly impact scalability, cost, and operational complexity. Bring Your Own Cloud (BYOC) expands the options, giving businesses greater flexibility in cloud deployment. Misconceptions around terms like “serverless” highlight the need for deeper analysis to avoid marketing pitfalls. This blog explores deployment options, enabling informed decisions tailored to your data streaming needs.

The cloud revolution has transformed how businesses deploy, scale, and manage data streaming solutions. While Software-as-a-Service (SaaS) and Platform-as-a-Service (PaaS) cloud models are often used interchangeably in marketing, their distinctions have significant implications for operational efficiency, cost, and scalability. In the context of data streaming around Apache Kafka and Flink, understanding these differences and recognizing common misconceptions—such as the overuse of the term “serverless”—can help you make an informed decision. Additionally, the emergence of Bring Your Own Cloud (BYOC) offers yet another option, providing organizations with enhanced control and flexibility in their cloud environments.

SaaS vs PaaS Cloud Service for Data Streaming with Apache Kafka and Flink

Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter, and follow me on LinkedIn or X (formerly Twitter) to stay in touch.

The Data Streaming Landscape 2025 highlights how data streaming has evolved into a key software category, moving from niche adoption to a fundamental part of modern data architecture.

The Data Streaming Landscape 2025 with Kafka Flink Confluent Amazon MSK Cloudera Event Hubs and Other Platforms

With frameworks like Apache Kafka and Flink at its core, the landscape now spans self-managed, BYOC, and fully managed SaaS solutions, driving real-time use cases, unifying transactional and analytical workloads, and enabling innovation across industries.

If you’re still grappling with the fundamentals of stream processing, this article is a must-read: “Stateless vs. Stateful Stream Processing with Kafka Streams and Apache Flink“.

What is SaaS in Data Streaming?

SaaS data streaming solutions are fully managed services where the provider handles all aspects of deployment, maintenance, scaling, and updates. SaaS offerings are designed for ease of use, providing a serverless experience where developers focus solely on building applications rather than managing infrastructure.
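
To make the “focus solely on building applications” point concrete, here is a minimal sketch of a Python producer talking to a fully managed Kafka service. The endpoint, credentials, and topic name are hypothetical placeholders; the key point is that no broker provisioning or sizing appears anywhere in the code.

```python
from confluent_kafka import Producer

# Hypothetical endpoint and API credentials for a fully managed (SaaS) Kafka cluster
producer = Producer({
    "bootstrap.servers": "pkc-12345.eu-central-1.example-cloud.com:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "API_KEY",
    "sasl.password": "API_SECRET",
})

# The application only produces events; brokers, scaling, patching, and upgrades
# are handled entirely by the provider.
producer.produce("orders", key="order-1", value='{"status": "created"}')
producer.flush()
```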

Characteristics of SaaS in Data Streaming

  1. Serverless Architecture: Resources scale automatically based on demand. True SaaS solutions eliminate the need to provision or manage servers.
  2. Low Operational Overhead: The provider manages hardware, software, and runtime configurations, including upgrades and security patches.
  3. Pay-As-You-Go Pricing: Consumption-based pricing aligns costs directly with usage, reducing waste during low-demand periods.
  4. Rapid Deployment: SaaS enables users to start processing streams within minutes, accelerating time-to-value.

Examples of SaaS in Data Streaming:

  • Confluent Cloud: A fully managed Kafka platform offering serverless scaling, multi-tenancy, and a broad feature set for both stateless and stateful processing.
  • Amazon Kinesis Data Analytics: A managed service for real-time analytics with automatic scaling.

What is PaaS in Data Streaming?

PaaS offerings sit between fully managed SaaS and infrastructure-as-a-service (IaaS). These solutions provide a platform to deploy and manage applications but still require significant user involvement for infrastructure management.

Characteristics of PaaS in Data Streaming

  1. Partial Management: The provider offers tools and frameworks, but users must manage servers, clusters, and scaling policies.
  2. Manual Configuration: Deployment involves provisioning VMs or containers, tuning parameters, and monitoring resource usage.
  3. Complex Scaling: Scaling is not always automatic; users may need to adjust resource allocation based on workload changes.
  4. Higher Overhead: PaaS requires more expertise and operational involvement, making it less accessible to teams without dedicated DevOps resources.

PaaS offerings in data streaming, while simplifying some infrastructure tasks, still require significant user involvement compared to fully serverless SaaS solutions. Below are three common examples, along with their benefits and pain points compared to serverless SaaS:

  • Apache Flink (Self-Managed on Kubernetes Cloud Service like EKS, AKS or GKE)
    • Benefits: Full control over deployment and infrastructure customization.
    • Pain Points: High operational overhead for managing Kubernetes clusters, manual scaling, and complex resource tuning.
  • Amazon Managed Service for Apache Flink (Amazon MSF)
    • Benefits: Simplifies infrastructure management and integrates with some other AWS services.
    • Pain Points: Users still handle job configuration, scaling optimization, and monitoring, making it less user-friendly than serverless SaaS solutions.
  • Amazon MSK (Managed Streaming for Apache Kafka)
    • Benefits: Eases Kafka cluster maintenance and integrates with the AWS ecosystem.
    • Pain Points: Requires users to design and manage producers/consumers, manually configure scaling, and handle monitoring responsibilities. MSK also excludes Kafka itself from its support scope: operational issues within the Kafka layer of the infrastructure are not covered. A short provisioning sketch after this list illustrates the operational involvement PaaS still requires.
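
To illustrate that remaining operational involvement, here is a rough sketch of provisioning an Amazon MSK cluster with boto3. The cluster name, subnets, security group, and sizing values are hypothetical placeholders; the point is that broker count, instance type, storage, and networking are still the user's decisions.

```python
import boto3

# With a partially managed (PaaS) service, sizing and networking remain user decisions.
msk = boto3.client("kafka", region_name="eu-central-1")

response = msk.create_cluster(
    ClusterName="streaming-paas-demo",        # hypothetical name
    KafkaVersion="3.6.0",
    NumberOfBrokerNodes=3,                    # the user chooses the broker count
    BrokerNodeGroupInfo={
        "InstanceType": "kafka.m5.large",     # the user picks the instance type
        "ClientSubnets": ["subnet-111", "subnet-222", "subnet-333"],  # placeholders
        "SecurityGroups": ["sg-0123456789abcdef0"],                   # placeholder
        "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 100}},       # GiB per broker
    },
)
print(response["ClusterArn"])
```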

SaaS vs. PaaS: Key Differences

SaaS and PaaS differ in the level of management and user responsibility, with SaaS offering fully managed services for simplicity and PaaS requiring more user involvement for customization and control.

| Feature                | SaaS                             | PaaS                                       |
|------------------------|----------------------------------|--------------------------------------------|
| Infrastructure         | Fully managed by the provider    | Partially managed; user controls clusters  |
| Scaling                | Automatic and serverless         | Manual or semi-automatic scaling           |
| Deployment Speed       | Immediate, ready to use          | Slower; requires configuration             |
| Operational Complexity | Minimal                          | Moderate to high                           |
| Cost Model             | Consumption-based, no idle costs | May incur idle resource costs              |

The big benefit of PaaS over SaaS is greater flexibility and control, allowing users to customize the platform, integrate with specific infrastructure, and optimize configurations to meet unique business or technical requirements. This level of control is often essential for organizations with strict compliance, security, or data sovereignty requirements.

SaaS is NOT Always Better than PaaS!

Be careful: The limitations and pain points of PaaS do NOT mean that SaaS is always better.

A concrete example: Amazon MSK Serverless simplifies Apache Kafka operations with automated scaling and infrastructure management but comes with significant limitations, including the lack of fully-managed connectors, advanced data governance tools, and native multi-language client support.

Amazon MSK also excludes Kafka engine support, leading to potential operational risks and cost unpredictability, especially when integrating with additional AWS services for a complete data streaming pipeline. I explored these challenges in more detail in my article “When NOT to choose Amazon MSK Serverless for Apache Kafka?“.

Bring Your Own Cloud (BYOC) as Alternative to PaaS

BYOC (Bring Your Own Cloud) offers a middle ground between fully managed SaaS and self-managed PaaS solutions, allowing organizations to host applications in their own VPCs.

BYOC provides enhanced control, security, and compliance while reducing operational complexity. This makes BYOC a strong alternative to PaaS for companies with strict regulatory or cost requirements.

As an example, here are Confluent’s options for deploying the data streaming platform: serverless Confluent Cloud, self-managed Confluent Platform (some consider this a PaaS if you leverage Confluent’s Kubernetes Operator and other automation / DevOps tooling), and WarpStream as a BYOC offering:

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

While BYOC complements SaaS and PaaS, it can be a better choice when fully managed solutions don’t align with specific business needs. I wrote a detailed article about this topic: “Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud)“.

“Serverless” Claims: Don’t Trust the Marketing

Many cloud data streaming solutions are marketed as “serverless,” but this term is often misused. A truly serverless solution should:

  1. Abstract Infrastructure: Users should never need to worry about provisioning, upgrading, or cluster sizing.
  2. Scale Transparently: Resources should scale up or down automatically based on workload.
  3. Eliminate Idle Costs: There should be no cost for unused capacity.

However, many products marketed as serverless still require some degree of infrastructure management or provisioning, making them closer to PaaS. For example:

  • A so-called “serverless” PaaS solution may still require setting initial cluster sizes or monitoring node health.
  • Some products charge for pre-provisioned capacity, regardless of actual usage.

Do Your Own Research

When evaluating data streaming solutions, dive into the technical documentation and ask pointed questions:

  • Does the solution truly abstract infrastructure management?
  • Are scaling policies automatic, or do they require manual configuration?
  • Is there a minimum cost even during idle periods?

By scrutinizing these factors, you can avoid falling for misleading “serverless” claims and choose a solution that genuinely meets your needs.

Choosing the Right Model for Your Data Streaming Business: SaaS, PaaS, or BYOC

When adopting a data streaming platform, selecting the right model is crucial for aligning technology with your business strategy:

  • Use SaaS (Software as a Service) if you prioritize ease of use, rapid deployment, and operational simplicity. SaaS is ideal for teams looking to focus entirely on application development without worrying about infrastructure.
  • Use PaaS (Platform as a Service) if you require deep customization, control over resource allocation, or have unique workloads that SaaS offerings cannot address.
  • Use BYOC (Bring Your Own Cloud) if your organization demands full control over its data but sees benefits in fully managed services. BYOC enables you to run the data plane within your cloud VPC, ensuring compliance, security, and architectural flexibility while leveraging SaaS functionality for the control plane.

In the rapidly evolving world of data streaming around Apache Kafka and Flink, SaaS data streaming platforms like Confluent Cloud provide the best of both worlds: the advanced features of tools like Apache Kafka and Flink, combined with the simplicity of a fully managed serverless experience. Whether you’re handling stateless stream processing or complex stateful analytics, SaaS ensures you’re scaling efficiently without operational headaches.

What deployment option do you use today for Kafka and Flink? Any changes planned in the future? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

My Data Streaming Journey with Kafka & Flink: 7 Years at Confluent https://www.kai-waehner.de/blog/2024/05/03/my-data-streaming-journey-with-kafka-flink-7-years-at-confluent/ (Fri, 03 May 2024)

Time flies… I joined Confluent seven years ago when Apache Kafka was mainly used by a few tech giants and the company had ~100 employees. This blog post explores my data streaming journey, including Kafka becoming a de facto standard for over 100,000 organizations, Confluent doing an IPO on the NASDAQ stock exchange, 5000+ customers adopting a data streaming platform, and emerging new design approaches and technologies like data mesh, GenAI, and Apache Flink. I look at the past, present, and future of my personal data streaming journey: both the evolution of technology trends and my journey as a Confluent employee that started in a Silicon Valley startup and is now part of a global software and cloud company.

My Data Streaming Journey with Kafka and Flink - 7 Years at Confluent

Disclaimer: Everything in this article reflects my personal opinions. This is particularly important when you talk about the outlook of a publicly listed company.

PAST: Apache Kafka is pretty much unknown outside of Silicon Valley in 2017

When I joined Confluent in 2017, most people did not know about Apache Kafka. Confluent was in the early stage with ~100 employees.

Tech: Big data with Hadoop and Spark as “game changer”; Kafka only the ingestion layer

2017 was a time when most companies installed Cloudera or Hortonworks: Hadoop and Spark for analytics, with Kafka as the ingestion layer. That was the starting point for using Apache Kafka. The predominant use case for Kafka at that time was data ingestion into Hadoop’s storage system HDFS. MapReduce and later Apache Spark batch processes analyzed the big data sets.
“The cloud” was not that big of a thing yet, and the container wars were still going on (Kubernetes vs. Cloud Foundry vs. Mesosphere).
I announced my departure from TIBCO and that I would join Confluent in a blog post in May 2017: “Why I move (back) to open source for messaging, integration and stream processing“. Looking back at my predictions, my outlook was not too bad.
I was right about disruptive trends:
  • Massive adoption of open source (beyond Linux)
  • Companies moved from batch to real-time because real-time data beats slow data in almost all use cases across industries
  • Adoption of machine learning for improving existing business processes and innovation
  • From the Enterprise Service Bus (ESB) – called iPaaS in the cloud today – to more cloud-native middleware, i.e., Apache Kafka (many Kafka projects today are integration projects)
  • Kafka being complementary to other data platforms, including data warehouse, data lake, analytics engines, etc.

What I did not see coming:

  • The massive transition to the public cloud
  • Apache Kafka being an event store with out-of-the-box capabilities like true decoupling of applications (which is the foundation and de facto standard for event-based microservices and data mesh today) and replayability of historical events in guaranteed ordering with timestamps
  • Generative AI (GenAI) as a specific pillar of AI / ML (did anyone see this coming seven years ago?)

Company: Confluent is a Silicon Valley startup with ~100 people making Kafka enterprise-ready

Confluent was a traditional Silicon Valley startup in 2017 with ~100 employees and backed by venture capital. The initial HR conversation and first interview (and company pitch) were done by our CEO Jay Kreps. I still have Jay’s initial email in my mailbox. It started with this:
Hi Kai, I’m the CEO of Confluent and one of the co-creators of Apache Kafka. Given your experience at Tibco maybe you’ve run into Kafka before? Confluent is the company we created that is taking Kafka to the enterprise. […] We aim to go beyond just messaging and really make the infrastructure for real-time data streams a foundational element of a modern company’s data architecture.
In 2017, Confluent was just starting to kick off its global business. The United Kingdom and Germany are usually the first two countries outside the US for Silicon Valley startups because of their large economies and, in the UK’s case, the lack of a language barrier.
Out of respect for my former employer, I will not publish my full response to Jay’s email, but I can still quote one sentence from it: “I really like to hear from you that you want to go beyond just messaging and really make the infrastructure for real-time data streams a foundational element of a modern company’s data architecture“. Mission accomplished. That’s where Confluent is today.

Working in an international overlay role for Confluent…

I also pointed out in my very first email to Confluent that I was already in an overlay role and worked internationally. While I officially started as the first German employee for presales and solution consulting, I only signed the contract because everybody on Confluent’s executive team understood and agreed to support me in continuing my career in an overlay role, doing a mix of sales, marketing, enablement, evangelism, and so on, internationally.
A few weeks later, I was already in the Confluent headquarters in Palo Alto, California:
Headquarters Palo Alto 2017
I am still more or less in the same role today, however with much more focus on executive conversations and the business perspective instead of “just” technology. As the technology developed, so did the conversations. Check out my career analysis describing what I do as a Field CTO if you want to learn more.
While Confluent moved to a bigger office in Mountain View, California, the Kafka tree still exists today:
Kafka Tree at the Confluent Headquarters Mountain View California USA

PRESENT: Everyone knows Kafka (and Confluent); most companies use it already in 2024

Apache Kafka is used in most organizations in 2024. Many enterprises are still in the early stage of the adoption curve and are building some streaming data pipelines. However, several companies, not just in Silicon Valley but worldwide and across industries, have already matured and leverage stream processing with Kafka Streams or Apache Flink for advanced and critical use cases like fraud detection, context-specific customer personalization, or predictive maintenance.

Tech: Apache Kafka is the de facto standard for data streaming

Over 100,000 organizations use Apache Kafka in 2024. This is an insane number. Apache Kafka became the de facto standard for data streaming. Data streaming is much more than just real-time messaging for transactional and analytics workloads. Most customers leverage Kafka Connect for data integration scenarios to legacy and cloud-native data sources and sinks. Confluent provides an entire ecosystem of integration capabilities like 100+ connectors, clients for any programming language, Confluent Stream Sharing, and many other integration alternatives. Always with security and data governance in mind. Enterprise-ready, as most people call it. And in the cloud, all of this is fully managed.
In the meantime, Apache Flink has established itself as the de facto standard for stream processing. Here is an interesting diagram showing the analogous growth of the two projects:
The Rise of Open Source Streaming Processing with Apache Kafka and Apache Flink
Source: Confluent
Various vendors build products and cloud services around the two successful open source data streaming frameworks: Apache Kafka and/or Flink. Some vendors leverage the open source frameworks, while others only rely on the open protocol to implement their own solutions to differentiate:
  • All major cloud providers provide Kafka as a service (AWS, Azure, GCP, Alibaba).
  • Many of the largest traditional software players include a Kafka product (including IBM, Oracle, and many more).
  • Established data companies support Kafka and/or Flink, like Confluent, Cloudera, Red Hat, Ververica, etc.
  • New startups emerge, including Redpanda, WarpStream, Decodable, and so on.

Data Streaming Landscape (vs. Data Lake, Data Warehouse and Lakehouse)

The current “Data Streaming Landscape 2024” provides a much more detailed overview. There will probably be some consolidation in the market. But it is great to see such an adoption and growth in the data streaming market.

While new concepts (e.g., data mesh) and technologies (e.g., GenAI) emerged in the past few years, one thing is clear: The fundamental value of event-driven architectures using data streaming with Kafka and Flink does not change. For most use cases and businesses, data in motion is much more valuable than just storing and analyzing data at rest in a data warehouse, data lake, or, in 2024, an innovative lakehouse.
It is worth reading my blog series comparing data streaming with data lakes and data warehouses. These technologies are complementary, (mostly) not competitive.

I also published a blog series recently exploring how data streaming changes the view on Snowflake (and cloud data platforms in general) from an enterprise architecture perspective:

  1. Snowflake Integration Patterns: Zero ETL and Reverse ETL vs. Apache Kafka
  2. Snowflake Data Integration Options for Apache Kafka (including Iceberg)
  3. Kafka + Flink + Snowflake: Cost Efficient Analytics and Data Governance

Company: Confluent is a global player with 3000+ employees and listed on the NASDAQ

Confluent is a global player in the software and cloud industry, employing ~3,000 people and soon reaching $1 billion in ARR. As announced a few earnings calls ago, the company now also focuses on profit instead of just growth, and the last quarter was the first profitable quarter in the company’s history. This is a tremendous achievement looking into Confluent’s future.

Unfortunately, even in 2024, many people still struggle to understand event-driven architectures and stream processing. One of my major tasks in 2024 at Confluent is to educate people – internal, partners, customers/prospects, and the broader community – about data streaming:
  • What is data streaming?
  • What are the benefits, differences, and trade-offs compared to traditional software design patterns (like APIs, databases, message brokers, etc.) and related products/cloud services?
  • What use cases do companies use data streaming for?
  • What do industry-specific deployments look like (this differs a lot in financial services vs. retail vs. manufacturing vs. telco vs. gaming vs. public sector)?
  • What is the business value (reduced cost, increased revenue, reduced risk, improved customer experience, faster time to market)?

“The Past, Present and Future of Stream Processing” shows the evolution and looks at new concepts like emerging streaming databases and the unification of operational and analytical systems using Apache Iceberg or similar table formats.

Confluent is a well-known software and cloud company today. As part of my job, I present at international conferences, give press interviews, brief research analysts like Gartner/Forrester, and write public articles to let people know (in as simple as possible words) what data streaming is and why the adoption is so massive across all regions and industries.

Kai Waehner Publications Articles Podcasts

Confluent Partners: Cloud Service Providers, 3rd Party Vendors, System Integrators

Confluent strategically works with cloud service providers (AWS, Azure, GCP, Alibaba), software / cloud vendors (the list is too long to name everyone), and system integrators. While some people still think of a company like AWS as an enemy, it is much more of a friend, co-selling data streaming in combination with other cloud services via the Amazon marketplace.

The list of strategic partners grows year by year. One of the most exciting announcements of 2023 was the strategic partnership between SAP and Confluent to connect S/4Hana ERP and other systems with the rest of the software and cloud world using Confluent.

Confluent Customers: From Open Source Kafka to Hybrid Multi-Cloud

Confluent has over 5000 customers already. I talk about many of these customer journeys in my blog posts. Just search for your favorite industry to learn more. One exciting example is the evolution of the data streaming adoption at BMW. Coming from a wild zoo of deployments, BMW has standardized on Confluent, including self-service, data governance, and global rollouts for smart factory, logistics, direct-to-consumer sales and marketing, and many other use cases.

BMW hosts an internal Kafka Wiesn (= Oktoberfest) every year, where we sponsor some pretzels, and internal teams and external partners like Michelin present new projects, use cases, success stories, architectures, and best practices from the data streaming world for transactional and analytical workloads. Here is a picture from our event in 2023, when my colleague Evi Schneider and I visited the BMW headquarters:

Apache Kafka Wiesn Oktoberfest at BMW in Munich Germany

FUTURE: Data streaming is a new software category in 2024+

Thinking about Gartner’s famous hype cycle, we are reaching the “plateau of productivity”. Mature open source frameworks, sophisticated (but far from perfect) products, and fully managed SaaS cloud offerings make the mass adoption of data streaming possible in the next few years.

Tech: Standards and (multi-cloud) SaaS are the new black

Data streaming is much more than just a better or more scalable messaging solution, a new integration platform, or a cloud-native processing platform. Data streaming is a new software category. Period. Even open source Kafka provides many capabilities people don’t know about, for instance, exactly-once semantics (EOS) for transactions, the tiered storage API for separation of compute and storage, Kafka Connect for data integration and Kafka Streams for stream processing (both natively using the Kafka protocol), and much more.
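
As a small illustration of one of those lesser-known capabilities, here is a minimal sketch of exactly-once semantics via the transactional producer API, using the Python confluent-kafka client. The broker address, transactional ID, and topic names are placeholders.

```python
from confluent_kafka import Producer

# The transactional.id must be stable and unique per producer instance (placeholder values)
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "transactional.id": "payments-writer-1",
    "enable.idempotence": True,
})

producer.init_transactions()
producer.begin_transaction()
try:
    producer.produce("payments", key="user-42", value='{"amount": 19.99}')
    producer.produce("audit-log", key="user-42", value='{"event": "payment"}')
    producer.commit_transaction()  # both records become visible atomically, exactly once
except Exception:
    producer.abort_transaction()
    raise
```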

In December 2023, the research company Forrester published “The Forrester Wave™: Streaming Data Platforms, Q4 2023“. Get free access to the report here. The leaders are Microsoft, Google, and Confluent, followed by Oracle, Amazon, Cloudera, and a few others. IDC followed in early 2024 with a similar report. This is strong proof that data streaming is an established software category. A Gartner Magic Quadrant for Data Streaming will hopefully (and likely) follow soon, too… 🙂

Cloud services make mass adoption very easy and affordable. Consumption-based pricing allows cost-sensitive exploration and adoption. I won’t take a look at the different competing offerings in this blog post; check out the “Data Streaming Landscape 2024” and the “Comparison of Open Source Apache Kafka and Vendor Solutions / Cloud Service” for more details. I just want to say one thing: Make sure to evaluate open source frameworks and different products correctly. Read the terms and conditions. Understand the support agreements and expertise of a vendor. If a product offers you “Kafka as Windows .exe file download” or a cloud provider “excludes Kafka support in the support contract from its Kafka cloud offering”, then something is wrong with the offering. Both examples are true and available to be paid for in today’s Kafka landscape!

In the past years, Confluent transitioned from “the Kafka vendor” into a data streaming platform company. Confluent still does only one thing (data streaming), but better than everyone else regarding product, support, and expertise. I am a huge fan of this approach compared to vendors with a similar number of employees that try to (!) solve every (!) problem.

As Confluent is a public company, it is possible to attend the quarterly earnings calls to learn about the product strategy and revenue/growth.

From a career perspective, I still enjoy doing the same thing I did when I started at Confluent seven years ago. I transitioned into the job role of a Global Field CTO, focusing more on executive and business conversations, not just on the technology itself. This is a job role that comes up more and more in software companies. There is no standard definition for this job role. As I regularly get the question about what a Field CTO does, I summarized the tasks in my “Daily Life As A Field CTO“. The post concludes with the answer to how you can also become a Field CTO at a software company in your career.

Data streaming is still in an early stage…

Where are we today with data streaming as a paradigm and Confluent as a company? We are still early. This is comparable to the early days of Oracle, which started with a database. Data streaming is accepted as a new software category by many experts and research analysts. But education about the paradigm shift and business value is still one of the biggest challenges. Data streaming is a journey – learn from various companies across industries that already went through this in the past years.

I was really excited to start at Confluent in May 2017. I visited Confluent’s London and Palo Alto offices in the first weeks and also attended Kafka Summit in New York. It was an exciting month to get started in an outstanding Silicon Valley startup. Today, I still visit our headquarters regularly for executive briefings, and I attend Kafka Summit and similar Confluent events like Current and the Data in Motion Tour around the world.

I hope this was an interesting report about my past seven years in the data streaming world at Confluent. What is your opinion about the future of open source technologies like Apache Kafka and Flink, the transition to the cloud, and the outlook for Confluent as a company? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Quix Streams – Stream Processing with Kafka and Python https://www.kai-waehner.de/blog/2023/05/28/quix-streams-stream-processing-with-kafka-and-python/ (Sun, 28 May 2023)

Over 100,000 organizations use Apache Kafka for data streaming. However, there is a problem: The broad ecosystem lacks a mature client framework and managed cloud service for Python data engineers. Quix Streams is a new technology on the market trying to close this gap. This blog post discusses this Python library, its place in the Kafka ecosystem, and when to use it instead of Apache Flink or other Python- or SQL-based substitutes.

Python Kafka Quix Streams and Flink for Open Source Stream Processing

Why Python and Apache Kafka together?

Python is a high-level, general-purpose programming language. It has many use cases for scripting and development. But there is one fundamental purpose for its success: Data engineers and data scientists use Python. Period.

Yes, there is R as another excellent programming language for statistical computing. And many low-code/no-code visual coding platforms for machine learning (ML).

SQL usage is ubiquitous amongst data engineers and data scientists, but it’s a declarative formalism that isn’t expressive enough to specify all necessary business logic. When data transformation or non-trivial processing is required, data engineers and data scientists use Python.

Hence: Data engineers and data scientists use Python. If you don’t give them Python, you will find either shadow IT or Python scripts embedded into the coding box of a low-code tool.

Apache Kafka is the de facto standard for data streaming. It combines real-time messaging, storage for true decoupling and replayability of historical data, data integration with connectors, and stream processing for data correlation. All in a single platform. At scale for transactions and analytics.

Python and Apache Kafka for Data Engineering and Machine Learning

In 2017, I wrote a blog post about “How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka“. The article is still accurate and explores how data streaming and AI/ML are complementary:

Apache Kafka Open Source Ecosystem as Infrastructure for Machine Learning

Machine Learning requires a lot of infrastructure for data collection, data engineering, model training, model scoring, monitoring, and so on. Data streaming with the Kafka ecosystem enables these capabilities in real time, reliably, and at scale.

DevOps, microservices, and other modern deployment concepts merged the job roles of software developers and data engineers/data scientists. The focus is much more on data products solving a business problem, operated by the team that develops it. Therefore, the Python code needs to be production-ready and scalable.

As mentioned above, the data engineering and ML tasks are usually realized with Python APIs and frameworks. Here is the problem: The Kafka ecosystem is built around Java and the JVM. Therefore, it lacks good Python support.

Let’s explore the options and why Quix Streams is a brilliant opportunity for data engineering teams for machine learning and similar tasks.

What options exist for Python and Kafka?

Many alternatives exist for data engineers and data scientists to leverage Python with Kafka.

Python integration for Kafka

Here are a few common alternatives for integrating Python with Kafka and their trade-offs:

  • Python Kafka client libraries: Produce and consume via Python. I wrote an example of using a Python Kafka client and TensorFlow to replay historical data from the Kafka log and train an analytic model (a minimal replay sketch follows this list). This is solid but insufficient for advanced data engineering, as it lacks processing primitives such as filtering and joining operations found in Kafka Streams and other stream processing libraries.
  • Kafka REST APIs: Confluent REST Proxy and similar components enable producing and consuming to/from Kafka. They work well for gluing interfaces together but are not ideal for ML workloads with low latency and critical SLAs.
  • SQL: Stream processing engines like ksqlDB or FlinkSQL allow querying of data in SQL. I wrote an example of Model Training with Python, Jupyter, KSQL and TensorFlow. It works. But ksqlDB and Flink are other systems that need to be operated. And SQL isn’t expressive enough for all use cases.
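
The following is a minimal sketch of the first option from the list above: a plain Python Kafka consumer that replays a topic from the beginning of the log so the records can be fed into a training framework. The broker address, consumer group, and topic name are hypothetical.

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": "replay-training-job",
    "auto.offset.reset": "earliest",        # start from the beginning of the log
    "enable.auto.commit": False,
})
consumer.subscribe(["sensor-events"])       # hypothetical topic

records = []
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        break                               # stop once no more data arrives within the timeout
    if msg.error():
        raise RuntimeError(msg.error())
    records.append(msg.value())

consumer.close()
# 'records' can now be used to train a model, e.g., with TensorFlow
```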

Instead of just integrating Python and Kafka via APIs, native stream processing provides the best of both worlds: The simplicity and flexibility of dynamic Python code for rapid prototyping with Jupyter notebooks and serious data engineering AND stream processing for stateful data correlation at scale, whether for data ingestion or model scoring.

Stream processing with Python and Kafka

In the past, we had two suboptimal open-source options for stream processing with Kafka and Python:

  • Faust: A stream processing library, porting the ideas from Kafka Streams (a Java library and part of Apache Kafka) to Python. The feature set is much more limited compared to Kafka Streams. Robinhood open-sourced Faust. But it lacks maturity and community adoption. I saw several customers evaluating it but then moving to other options.
  • Apache Flink’s Python API (PyFlink): Flink’s adoption grows significantly every year in the stream processing community. This API is a Python version of the DataStream API, which allows Python users to write DataStream jobs in Python. Developers can also use the Table API, including SQL. It is an excellent option if you have a Flink cluster and some folks want to run Python instead of Java or SQL against it for data engineering. The Kafka-Flink integration is very mature and battle-tested. A small Table API sketch follows this list.
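
And here is a small sketch of the second option: PyFlink’s Table API with SQL reading from a Kafka topic. The topic, schema, and broker address are placeholders, and the Kafka SQL connector JAR must be available to the Flink runtime.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table API environment
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical Kafka-backed source table (topic, schema, and broker address are placeholders)
t_env.execute_sql("""
    CREATE TABLE payments (
        user_id STRING,
        amount  DOUBLE,
        ts      TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'payments',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'pyflink-demo',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# A simple continuous aggregation expressed in SQL
result = t_env.sql_query(
    "SELECT user_id, SUM(amount) AS total FROM payments GROUP BY user_id"
)
result.execute().print()
```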

As you see, all the alternatives for combining Kafka and Python have trade-offs. They work for some use cases but are imperfect for most data engineering and data science projects.

A new open-source framework to the rescue? Introducing a brand new stream processing library for Python: Quix Streams…

What is Quix Streams?

Quix Streams is a stream processing library focused on ease of use for Python data engineers. The library is open-source under Apache 2.0 license and available on GitHub.

Instead of a database, Quix Streams uses a data streaming platform such as Apache Kafka. You can process data with high performance and save resources without introducing a delay.

Some of the Quix Streams differentiators are defined as being lightweight, powerful, no JVM, and no need for separate clusters or orchestrators. It sounds like the pitch for why to use Kafka Streams in the Java ecosystem, minus the JVM – this is a positive comment! 🙂

Quix Streams does not use any domain-specific language or embedded framework. It’s a library that you can use in your code base. This means that with Quix Streams, you can use any external library for your chosen language. For example, data engineers can leverage Pandas, NumPy, PyTorch, TensorFlow, Transformers, and OpenCV in Python.

So far, so good. This was more or less the copy & paste of Quix Streams marketing (it makes sense to me)… Now let’s dig deeper into the technology.
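
Before digging deeper into the developer experience, here is a rough sketch of what a minimal Quix Streams application can look like. It follows the newer Application-style API, so method names may differ between library versions; the broker address and topic names are hypothetical placeholders.

```python
# Rough sketch based on the newer Quix Streams "Application" API.
# Method names may differ between library versions; broker and topics are placeholders.
from quixstreams import Application

app = Application(broker_address="localhost:9092", consumer_group="feature-engineering")

input_topic = app.topic("raw-sensor-data", value_deserializer="json")
output_topic = app.topic("enriched-sensor-data", value_serializer="json")

sdf = app.dataframe(input_topic)

# Plain Python (or Pandas/NumPy) logic can be applied per record
sdf = sdf.apply(lambda row: {**row, "temperature_f": row["temperature_c"] * 9 / 5 + 32})
sdf = sdf.to_topic(output_topic)

app.run(sdf)
```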

The Quix Streams API and developer experience

The following is the first feedback after playing around, doing code analysis, and speaking with some Confluent colleagues and the Quix Streams team.

The good

  • The Quix API and tooling persona is the data engineer (that’s at least my understanding). Hence, it does not directly compete with other offerings, say a Java developer using Kafka Streams. Again, the beauty of microservices and data mesh is the focus of an application or data product per use case. Choose the right tool for the job!
  • The API is mostly sleek, with some weirdness / unintuitive parts. But it is still in beta, so hopefully, it will get more refined in the subsequent releases. No worries at this early stage of the project.
  • The integration with other data engineering and machine learning Python frameworks is excellent. Being able to combine stream processing with Pandas, NumPy, and similar libraries is a massive benefit for the developer experience.
  • The Quix library and SaaS platform are compatible with open-source Kafka and commercial offerings and cloud services like Cloudera, Confluent Cloud, or Amazon MSK. Quix’s commercial UI provides out-of-the-box integration with self-managed Kafka and Confluent Cloud. The cloud platform also provides a managed Kafka for testing purposes (for a few cents per Kafka topic, and not meant for production).

The improvable

  • The stream processing capabilities (like powerful sliding windows) are still pretty limited and not comparable to advanced engines like Kafka Streams or Apache Flink. The roadmap includes enhanced features.
  • The architecture is complex since executing the Python API jumps through three languages: Python -> C# -> C++. Does it matter to the end user? It depends on the use case, security requirements, and more. The reasoning for this architecture is Quix’s background: the team comes from the McLaren F1 team and ultra-low latency use cases, and aims to build a polyglot platform for different programming environments.
  • It would be interesting to see a benchmark for throughput and latency versus Faust, which is Python top to bottom. There is a trade-off between inter-language marshaling/unmarshalling versus the performance boost of lower-level compiled languages. This should be fine if we trust Quix’s marketing and business model. I expect they will provide some public content soon, as this question will arise regularly.

The Quix Streams Data Pipeline Low Code GUI

The commercial product provides a user interface for building data pipelines and code, MLOps, and a production infrastructure for operating and monitoring the built applications.

Here is an example:

Quix Streams Data Pipeline for Stream Processing

  • Tiles are Kubernetes (K8s) containers; each purple (transformation) and orange (destination) node is backed by a Git project containing the application code.
  • The three blue (source) nodes on the left are replay services used to test the pipeline by replaying specific streams of data.
  • Arrows are individual Kafka topics in Confluent Cloud (green = live data).
  • The first visible pipeline node (bottom left) is joining data from different physical sites (see the three input topics, one was receiving data when I took the image).
  • There are three modular transformations in the visible pipeline (two rolling windows and one interpolation).
  • There are two real-time apps (one real-time Streamlit dashboard and the other is an integration with a Twilio SMS service).

The Quix team wrote a detailed comparison of Apache Flink and Quix Streams. I don’t think it’s an entirely fair comparison as it compares open-source Apache Flink to a Quix SaaS offering. Nevertheless, for the most part, it is a good comparison.

Flink was always Java-first and added support for Python for its DataStream and Table APIs at a later stage. In contrast, Quix Streams is brand new. Hence, it lacks maturity and customer case studies.

Having said all this, I think Quix Streams is a great choice for some stream processing projects in the Python ecosystem!

TL;DR: There is a place for both! Choose the right tool… Modern enterprise architectures built with concepts like data mesh, microservices, and domain-driven design allow this flexibility per use case and problem.

I recommend using Flink if the use case makes sense with SQL or Java. And if the team is willing to operate its own Flink cluster or has a platform team or a cloud service taking over the operational burden and complexity.

In contrast, I would use Quix Streams for Python projects if I want to go to production with a more microservice-like architecture building Python applications. However, beware that Quix currently only has a few built-in stateful functions or JOINs. More advanced stream processing use cases cannot be done with Quix (yet). This will likely change in the coming months as more capabilities are added.

Hence, make sure to read Quix’s comparison with Flink. But keep in mind whether you want to evaluate the open-source Quix Streams library or the Quix SaaS platform. If you are in the public cloud, you might combine the Quix Streams SaaS with other fully managed cloud services like Confluent Cloud for Kafka. On the other hand, in your own private VPC or on premise, you need to build your own platform with technologies like the Quix Streams library, Kafka or Confluent Platform, and so on.

The current state and future of Quix Streams

If you build a new framework or product for data streaming, you need to make sure that it does not overlap with existing established offerings. You need differentiators and/or innovation in a new domain that does not exist today.

Quix Streams accomplishes this essential requirement to be successful: The target audience is data engineers with Python backgrounds. No serious and mature tool or vendor exists in this space today. And the demand for Python will grow more and more with the focus on leveraging data for solving business problems in every company.

Maturity: Making the right (marketing) choices in the early stage

Quix Streams is in the early maturity stage. Hence, a lot of decisions can still be strengthened or revamped.

The following buzzwords come into my mind when I think about Quix Streams: Python, data streaming, stream processing, Python, data engineering, Machine Learning, open source, cloud, Python, .NET, C#, Apache Kafka, Apache Flink, Confluent, MSK, DevOps, Python, governance, UI, time series, IoT, Python, and a few more.

TL;DR: I see a massive opportunity for Quix Streams to become a great data engineering framework (and SaaS offering) for Python users.

I am not a fan of polyglot platforms. It requires finding the lowest common denominator. I was never a fan of Apache Beam for that reason. The Kafka Streams community did not choose to implement the Beam API because of too many limitations.

Similarly, most people do not care about the underlying technology. Yes, Quix Streams’ core is C++. But is the goal to roll out stream processing for various programming languages, only starting with Python, then going to .NET, and then to another one? I am skeptical.

Hence, I like to see a change in the marketing strategy already: Quix Streams started with the pitch of being designed for high-frequency telemetry services when you must process high volumes of time-series data with up to nanosecond precision. It is now being revamped to focus mainly on Python and data engineering.

Competition: Friends or enemies?

Getting market adoption is still hard. Intuitive use of the product, building a broad community, and the right integrations and partnerships (can) make a new product such as Quix Streams successful. Quix Streams is on a good path here. For instance, integrating serverless Confluent Cloud and other Kafka deployments works well:

Quix Streams Integration with Apache Kafka and Confluent Cloud

This is a native integration, not a connector. Everything in the pipeline image runs as a direct Kafka protocol connection using raw TCP/IP packets to produce and consume data to topics in Confluent Cloud. The Quix platform orchestrates the management of the Confluent Cloud Kafka cluster (creating/deleting topics, topic sizing, topic monitoring, etc.) using Confluent APIs.

However, one challenge of these kinds of startups is the decision to complement versus compete with existing solutions, cloud services, and vendors. For instance, how much time and money do you invest in data governance? Should you build this yourself, use the complementary streaming platform, or use a separate independent tool (like Collibra)? We will see where Quix Streams goes here: building its cloud platform to address Python engineers, or overlapping with other streaming platforms?

My advice is the proper integration with partners that lead in their space. Working with Confluent for over six years, I know what I am talking about: We do one thing, data streaming, but we are the best in that one. We don’t even try to compete with other categories. Yes, a few overlaps always exist, but instead of competing, we strategically partner and integrate with other vendors like Snowflake (data warehouse), MongoDB (transactional database), HiveMQ (IoT with MQTT), Collibra (enterprise-wide data governance), and many more. Additionally, we extend our offering with more data streaming capabilities, i.e., improving our core functionality and business model. The latest example is our integration of Apache Flink into the fully-managed cloud offering.

Kafka for Python? Look at Quix Streams!

In the end, a data engineer or developer has several options for stream processing deeply integrated into the Kafka ecosystem:

  • Kafka Streams: Java client library
  • ksqlDB: SQL service
  • Apache Flink: Java, SQL, Python service
  • Faust: Python client library
  • Quix Streams: Python client library

All have their pros and cons. The persona of the data engineer or developer is a crucial criterion. Quix Streams is a nice new open-source framework for the broader data streaming community. If you cannot or do not want to use just SQL, but native Python, then watch the project (and the company/cloud service behind it).

UPDATE – May 30th, 2023: bytewax is another open-source stream processing library for Python integrating with Kafka. It is implemented in Rust under the hood. I have not seen it in the field yet. But a few comments mentioned it after I shared this blog post on social networks, so I think it is worth a mention. Let’s see if it gets more traction in the coming months.

Do you already use stream processing, or is Kafka just your data hub and pipeline? How do you combine Python and Kafka today? Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

When to choose Redpanda instead of Apache Kafka? https://www.kai-waehner.de/blog/2022/11/16/when-to-choose-redpanda-instead-of-apache-kafka/ (Wed, 16 Nov 2022)

Data streaming emerged as a new software category. It complements traditional middleware, data warehouse, and data lakes. Apache Kafka became the de facto standard. New players enter the market because of Kafka’s success. One of those is Redpanda, a lightweight Kafka-compatible C++ implementation. This blog post explores the differences between Apache Kafka and Redpanda, when to choose which framework, and how the Kafka ecosystem, licensing, and community adoption impact a proper evaluation.

Apache Kafka vs Redpanda Comparison

Disclaimer: I work for Confluent. However, the post is not about comparing features but explaining the concepts behind the alternatives of using Apache Kafka (and related products, including Confluent) or Redpanda. I talk to enterprises across the globe every week. Below, I summarize common misunderstandings or missing knowledge about both technologies. I hope it helps you to make the right decision. Either choose to run open-source Apache Kafka, one of the various commercial Kafka offerings or cloud services, or Redpanda. All are great options with pros and cons…

Data streaming: A new software category

Data-driven applications are the new black. As part of this, data streaming is a new software category. If you don’t understand yet how it differs from other data management platforms like a data warehouse or data lake, check out the following blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

And if you wonder how Apache Kafka differs from other middleware, check out how Kafka compares with ETL, ESB, and iPaaS.

Apache Kafka: The de facto standard for data streaming

Apache Kafka became the de facto standard for data streaming, similar to how Amazon S3 is the de facto standard for object storage. Kafka is used across industries for many use cases.

The adoption curve of Apache Kafka

The growth of the Apache Kafka community in the last years is impressive:

  • >100,000 organizations using Apache Kafka
  • >41,000 Kafka Meetup attendees
  • >32,000 Stack Overflow Questions
  • >12,000 Jiras for Apache Kafka
  • >31,000 Open Job Listings Request Kafka Skills

And look at the increased number of active monthly unique users downloading the Kafka Java client library with Maven:

Sonatype Maven Kafka Client Downloads
Source: Sonatype

The numbers grow exponentially. That’s no surprise to me as the adoption pattern and maturity curve for Kafka are similar in most companies:

  1. Start with one or few use cases (that prove the business value quickly)
  2. Deploy the first applications to production and operate them 24/7
  3. Tap into the data streams from many domains, business units, and technologies
  4. Move to a strategic central nervous system with a decentralized data hub

Kafka use cases by business value across industries

The main reason for the incredible growth of Kafka’s adoption curve is the variety of potential use cases for data streaming. The potential is almost endless. Kafka’s characteristics of combining low latency, scalability, reliability, and true decoupling establish benefits across all industries and use cases:

Use Cases for Data Streaming by Business Value

Search my blog for your favorite industry to find plenty of case studies and architectures. Or to get started, read about use cases for Apache Kafka across industries.

The emergence of many Kafka vendors

The market for data streaming is enormous. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products. Most vendors use Kafka or implement its protocol because Kafka has become the de facto standard for data streaming.

Learn more about the various data streaming vendors in the following blog posts:

To be clear: An increasing number of Kafka vendors is a great thing! It proves the creation of a new software category. Competition pushes innovation. The market share is big enough for many vendors. And I am 100% convinced that we are still in a very early stage of the data streaming hype cycle…

After a lengthy introduction to set the context, let’s now review a new entrant into the Kafka market: Redpanda…

Introducing Redpanda: Kafka-compatible data streaming

Redpanda is a data streaming platform. Its website explains its positioning in the market and product strategy as follows (to differentiate it from Apache Kafka):

  • No Java: A JVM-free and ZooKeeper-free infrastructure.
  • Designed in C++: Built for better performance than Apache Kafka.
  • A single-binary architecture: No dependencies to other libraries or nodes.
  • Self-managing and self-healing: A simple but scalable architecture for on-premise and cloud deployments.
  • Kafka-compatible: Out-of-the-box support for the Kafka protocol with existing applications, tools, and integrations (see the client sketch after this list).
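
As a small sketch of what this compatibility means in practice, a standard Kafka client or admin tool can talk to a Redpanda broker without code changes; only the bootstrap address (a placeholder below) points at Redpanda instead of a Kafka broker.

```python
from confluent_kafka.admin import AdminClient

# The standard Kafka AdminClient speaks the Kafka wire protocol, so it works
# against Redpanda as well; only the broker address (placeholder) changes.
admin = AdminClient({"bootstrap.servers": "redpanda-broker:9092"})

metadata = admin.list_topics(timeout=10)
print(list(metadata.topics.keys()))  # topics look exactly like they would on Apache Kafka
```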

This sounds great. You need to evaluate whether Redpanda is the right choice for your next project or if you should stick with “real Apache Kafka”.

How to choose the proper “Kafka” implementation for your project?

A recommendation that some people find surprising: Qualify out first! That’s much easier. This is similar to how I explained when NOT to use Apache Kafka.

As part of the evaluation, the question is whether Kafka is the proper protocol for you. If it is, pick the different Kafka offerings and begin the comparison.

Start your evaluation with the business case requirements and define your most critical needs like uptime SLAs, disaster recovery strategy, enterprise support, operations tooling, self-managed vs. fully-managed cloud service, capabilities like messaging vs. data ingestion vs. data integration vs. applications, and so on. Based on your use cases and requirements, you can start qualifying out vendors like Confluent, Redpanda, Cloudera, Red Hat / IBM, Amazon MSK, Amazon Kinesis, Google Pub Sub, and others to create a shortlist.

The following sections compare the open-source project Apache Kafka versus the re-implementation of the Kafka protocol of Redpanda. You can use these criteria (and information from other blogs, articles, videos, and so on) to evaluate your options.

Similarities between Redpanda and Apache Kafka

The high-level value propositions are the same in Redpanda and Apache Kafka:

  • Data streaming to process data in real-time at scale continuously
  • Decouple applications and domains with a distributed storage layer
  • Integrate with various data sources and data sinks
  • Leverage stream processing to correlate data and take action in real-time
  • Self-managed operations or consuming a fully-managed cloud offering

However, the devil is in the details and facts. Don’t trust marketing, but look deeper into the various products and cloud services.

Deployment options: Self-managed vs. cloud service

Data streaming is required everywhere. While most companies across industries have a cloud-first strategy, some workloads must stay at the edge for different reasons: Cost, latency, or security requirements. My blog about use cases for Apache Kafka at the edge is still one of the most read articles I have written in recent years.

Besides operating Redpanda by yourself, you can buy it as a product and deploy it in your environment: either as a data plane running on Kubernetes in your infrastructure (supported by the vendor’s external control plane) or as a cloud service (fully managed by the vendor).

The different deployment options for Redpanda are great. Pick what you need. This is very similar to Confluent’s deployment options for Apache Kafka. Some other Kafka vendors only provide either self-managed (e.g., Cloudera) or fully managed (e.g., Amazon MSK Serverless) deployment options.

What I miss from Redpanda: official documentation about SLAs of the cloud service and enterprise support. I hope they do better than Amazon MSK (which excludes Kafka issues from its support terms). I am sure you will get that information if you reach out to the Redpanda team, which will probably add it to the website soon.

Bring your own Cluster (BYOC)

There is a third option besides self-managing a data streaming cluster and leveraging a fully managed cloud service: Bring your own Cluster (BYOC). This alternative deploys a solution in your own infrastructure (like your data center or your cloud VPC) that is partially managed by the vendor.

Here is Redpanda’s marketing slogan: “Redpanda clusters hosted on your cloud, fully managed by Redpanda, so that your data never leaves your environment!”

This sounds very appealing in theory. Unfortunately, it creates more questions and problems than it solves:

  • How does the vendor access your data center or VPC?
  • Who decides how and when to scale a cluster?
  • Who acts on issues, and when? How and when do you roll a cluster to incorporate bug fixes or version upgrades?
  • What about cost management? What is the total cost of ownership? How much value does the vendor solution bring?
  • How do you guarantee SLAs? Who has to guarantee them, you or the vendor?
  • For regulated industries, how are security controls and compliance supported? How can you be sure about what the vendor does in an environment you ostensibly control? How much harder will a bespoke third-party risk assessment be if you aren’t using pure SaaS?

For these reasons, cloud vendors only host fully managed services in their own environments. Look at Amazon MSK, Azure Event Hubs, Google Pub Sub, Confluent Cloud, etc. All of these fully managed cloud services run only in the vendor’s VPC.

There are only two options: Either you hand over the responsibility to a SaaS offering or control it yourself. Everything in the middle is still your responsibility in the end.

Community vs. commercial offerings

The sales approach of Redpanda looks almost identical to how Confluent sells data streaming. A free community edition is available, even for production usage. The enterprise edition adds enterprise features like tiered storage, automatic data balancing, or 24/7 enterprise support.

No surprise here. And a good strategy, as data streaming is required everywhere for different users and buyers.

Technical differences between Apache Kafka and Redpanda

There are plenty of technical and non-functional differences between Apache Kafka products and Redpanda. Keep in mind that Redpanda is NOT Kafka. Redpanda uses the Kafka protocol. This is a small but critical difference. Let’s explore these details in the following sections.

Apache Kafka vs. Kafka protocol compatibility

Redpanda is NOT an Apache Kafka distribution like Confluent Platform, Cloudera, or Red Hat. Instead, Redpanda re-implements the Kafka protocol to provide API compatibility. Being Kafka-compatible is not the same as using Apache Kafka under the hood, even if it sounds great in theory.

Two other examples of Kafka-compatible offerings:

  • Azure Event Hubs: A Kafka-compatible SaaS cloud service offering from Microsoft Azure. The service itself works and performs well. However, its Kafka compatibility has many limitations. Microsoft lists a lot of them on its website. Some limitations of the cloud service are the consequence of a different implementation under the hood, like limited retention time and message sizes.
  • Apache Pulsar: An open-source framework competing with Kafka. The feature sets overlap a lot. Unfortunately, Pulsar often only offers good marketing for advanced features to compete with Kafka or to differentiate. One example is its Kafka mapper for compatibility with the Kafka protocol. Contrary to Azure Event Hubs, which is a serious implementation (with some limitations), Pulsar’s compatibility wrapper is a basic implementation that covers only minor parts of the Kafka protocol. So, while the alleged “Kafka compatibility” sounds nice on paper, you shouldn’t seriously consider it for migrating your running Kafka infrastructure to Pulsar.

We have seen compatible products for open-source frameworks in the past. Re-implementations are usually far from complete and perfect. For instance, MongoDB compared its official open-source implementation to its competitor Amazon DocumentDB to pinpoint the fact that DocumentDB only passes ~33% of the MongoDB integration test suite.

In summary, it is totally fine to use these non-Kafka solutions like Azure Event Hubs, Apache Pulsar, or Redpanda for a new project if they fulfill your requirements better than Apache Kafka. But keep in mind that it is not Kafka. There is no guarantee that additional components from the Kafka ecosystem (like Kafka Connect, Kafka Streams, REST Proxy, and Schema Registry) behave the same when integrated with a non-Kafka solution that only uses the Kafka protocol with its own implementation.
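To make the compatibility discussion concrete, here is a minimal sketch of a standard Kafka producer in Java. The broker address and topic name are placeholders chosen for illustration; the point is that the only broker-specific setting in a Kafka client is bootstrap.servers, so wire compatibility decides whether such a client connects at all, while retention limits, quotas, and operational behavior remain implementation-specific:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompatibilityCheckProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The only broker-specific setting: point it at Apache Kafka, Redpanda,
        // or the Azure Event Hubs Kafka endpoint (the address below is a placeholder).
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker.example.com:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Standard Kafka client API call: protocol compatibility decides whether this works,
            // but it says nothing about retention, message size limits, or operational behavior.
            producer.send(new ProducerRecord<>("orders", "order-1", "{\"amount\": 42}"));
            producer.flush();
        }
    }
}
```

The same client could point at Apache Kafka, Redpanda, or Azure Event Hubs; whether the surrounding ecosystem behaves identically is a separate question.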

How good is Redpanda’s Kafka protocol compatibility?

Frankly, I don’t know. Probably and hopefully, Redpanda has better Kafka compatibility than Pulsar. The whole product is based on this value proposition. Hence, we can assume that the Redpanda team spends plenty of time on compatibility. Redpanda has NOT achieved 100% API compatibility yet.

Time will tell when we see more case studies from enterprises across industries that migrated some Apache Kafka projects to Redpanda and successfully operated the infrastructure for a few years. Why wait a few years to see? Well, I compare it to what I see from people starting with Amazon MSK. It is pretty easy to get started. However, after a few months, the first issues happen. Users find out that Amazon MSK is not a fully-managed product and does not provide serious Kafka SLAs. Hence, I see too many teams starting with Amazon MSK and then migrating to Confluent Cloud after some months.

But let’s be clear: If you run an application against Apache Kafka and migrate to a re-implementation supporting the Kafka protocol, you should NOT expect 100% the same behavior as with Kafka!

Some underlying behavior will differ even if the API is 100% compatible. This is sometimes a benefit. For instance, Redpanda focuses on performance optimization with C++. This is only possible in some workloads because of the re-implementation. C++ is superior to Java and the JVM in some performance- and memory-sensitive scenarios.

Redpanda = Apache Kafka – Kafka Connect – Kafka Streams

Apache Kafka includes Kafka Connect for data integration and Kafka Streams for stream processing.

Like most Kafka-compatible projects, Redpanda excludes these critical pieces from its offering. Hence, even 100 percent protocol compatibility would not mean a product re-implements everything in the Apache Kafka project.
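To illustrate what is part of Apache Kafka but outside the scope of a protocol re-implementation, here is a minimal Kafka Streams sketch; the topic names and broker address are placeholder assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");

        // Continuous, stateless filtering of the event stream. This logic lives in the
        // Kafka Streams client library, not in the broker, so it can technically run
        // against any Kafka-protocol-compatible cluster, but it is developed and
        // versioned as part of the Apache Kafka project.
        payments.filter((key, value) -> value.contains("\"currency\":\"EUR\""))
                .to("payments-eur");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```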

Lower latency vs. benchmarketing

Always think about your performance requirements before starting a project. If necessary, do a proof of concept (POC) with Apache Kafka, Apache Pulsar, and Redpanda. I bet that in 99% of scenarios, all three of them will show a good enough performance for your use case.

Don’t trust opinionated benchmarks from others! Your use case will have different requirements and characteristics. And performance is typically just one of many evaluation dimensions.

I am not a fan of most “benchmarks” of performance and throughput. Benchmarks are almost always opinionated and configured for a specific problem (whether a vendor, independent consultant or researcher conducts them).

My colleague Jack Vanlightly explained this concept of benchmarketing with excellent diagrams:

Benchmarks for Benchmarketing
Source: Jack Vanlightly

Here is one concrete example you will find in one of Redpanda’s benchmarks: Kafka was not optimized for a handful of producers each pushing extremely high throughput, and this is exactly what Redpanda exploits when it claims that Kafka’s throughput is inferior. Ask yourself this question: In a 1 GB/s use case, who would generate that throughput with just four producers? Benchmarketing at its finest.

Hence, once again, start with your business requirements. Then choose the right tool for the job. Benchmarks are always built for winning against others. Nobody will publish a benchmark where the competition wins.

Soft real-time vs. hard real-time

When we speak about real-time in the IT world, we mean end-to-end data processing pipelines that need at least a few milliseconds. This is called soft real-time. And this is where Apache Kafka, Apache Pulsar, Redpanda, Azure Event Hubs, Apache Flink, Amazon Kinesis, and similar platforms fit in. None of these can do hard real-time.

Hard real-time requires a deterministic network with zero latency and no spikes. Typical scenarios include embedded systems, field buses, and PLCs in manufacturing, cars, robots, securities trading, etc. Time-Sensitive Networking (TSN) is the right keyword if you want more research.

I wrote a dedicated blog post about why data streaming is NOT hard real-time. Hence, don’t try to use Kafka or Redpanda for these use cases. That’s OT (operational technology), not IT (information technology). OT means plain C or Rust running on embedded systems.

No ZooKeeper with Redpanda vs. no ZooKeeper with Kafka

Besides being implemented in C++ instead of running on the JVM, the second big differentiator of Redpanda is that it needs no ZooKeeper and, therefore, no second complex distributed system… Well, with Apache Kafka 3.3, this differentiator is gone. Kafka is now production-ready without ZooKeeper! KIP-500 was a multi-year journey and an operation at Kafka’s heart.

ZooKeeper Removal KIP 500 in Apache Kafka

To be fair, it will still take some time until the new ZooKeeper-less architecture goes into production. Also, today, it is only supported by new Kafka clusters. However, migration scenarios with zero downtime and without data loss will be supported in 2023, too. But that’s how a serious release cycle works for a mature software product: step-by-step implementation and battle-testing instead of starting with marketing and selling alpha and beta features.

ZooKeeper-less data streaming with Kafka is not just a massive benefit for the scalability and reliability of Kafka but also makes operations much more straightforward, similar to ZooKeeper-less Redpanda.

By the way, this was one of the major arguments why I did not see the value of Apache Pulsar. The latter requires not just two but three distributed systems: Pulsar broker, ZooKeeper, and BookKeeper. That’s nonsense and unnecessary complexity for virtually all projects and use cases.

Lightweight Redpanda + heavyweight ecosystem = middleweight data streaming?

Redpanda is very lightweight and efficient because of its C++ implementation. This can help in limited compute environments like edge hardware. As an additional consequence, Redpanda has fewer latency spikes than Apache Kafka. Those are significant arguments for Redpanda in some use cases!

However, you need to look at the complete end-to-end data pipeline. If you only use Redpanda as a message queue, you get these benefits compared to the JVM-based Kafka engine – but then you might as well pick a message queue like RabbitMQ or NATS instead. I don’t start this discussion here as I focus on the much more powerful and advanced data streaming use cases.

Even in edge use cases where you deploy a single Kafka broker, the hardware, like an industrial computer (IPC), usually provides at least 4GB or 8GB of memory. That is sufficient for deploying the whole data streaming platform around Kafka and other technologies.

Data streaming is more than messaging or data ingestion

My fundamental question is: what is the benefit of a C++ implementation of the data hub if all the surrounding systems are built with JVM technology or even slower technologies like Python?

Kafka-compatible tools like Redpanda integrate well with the Kafka ecosystem, as they use the same protocol. Hence, tools like Kafka Connect, Kafka Streams, KSQL, Apache Flink, Faust, and all other components from the Kafka ecosystem work with Redpanda. You will find such an example for almost every existing Kafka tool on the Redpanda blog.

However, these combinations kill almost all the benefits of having a C++ layer in the middle. All integration and processing components would also need to be as efficient as Redpanda and use C++ (or Go or Rust) under the hood. These tools do not exist today (likely because not many people need them). And here is an additional drawback: The debugging, testing, and monitoring infrastructure must combine C++, Python, and JVM platforms if you combine tools like Java-based Kafka Connect and Python-based Faust with C++-based Redpanda. So, I don’t get the value proposition here.

Data replication across clusters

Having more than one Kafka cluster is the norm, not an exception. Use cases like disaster recovery, aggregation, data sovereignty in different countries, or migration from on-premise to the cloud require multiple data streaming clusters.

Replication across clusters is part of open-source Apache Kafka. MirrorMaker 2 (based on Kafka Connect) supports these use cases. More advanced (proprietary) vendor tools like Confluent Replicator or Confluent Cluster Linking make these use cases easier and more reliable.

Data streaming with the Kafka ecosystem is perfect as the foundation of a decentralized data mesh:

Cluster Linking for data replication with the Kafka protocol

How do you build these use cases with Redpanda?

It is the same story as for data integration and stream processing: How much does it help to have a very lightweight and performant core if all other components rely on “3rd party” code bases and infrastructure? In the case of data replication, Redpanda uses Kafka’s MirrorMaker.

And make sure to compare MirrorMaker to Confluent Cluster Linking – the latter uses the Kafka protocol for replication and does not need additional infrastructure, operations, offset sync, etc.

Non-functional differences between Apache Kafka and Redpanda

Technical evaluations dominate the Redpanda vs. Apache Kafka discussion. However, the non-functional differences are just as crucial when making the strategic decision about the data streaming platform for your next project.

Licensing, adoption curve and the total cost of ownership (TCO) are critical for the success of establishing a data streaming platform.

Open source (Kafka) vs. source available (Redpanda)

As the name says, Apache Kafka is under the very permissive Apache License 2.0. Everyone, including cloud providers, can use the framework for building internal applications, commercial products, and cloud services. Committers and contributions are spread across various companies and individuals.

Redpanda is released under the more restrictive Business Source License (BSL), a source-available license. The intention is to deter cloud providers from offering Redpanda’s work as a service. For most companies, this is fine, but it limits broader adoption across different communities and vendors. The likelihood of external contributors, committers, or even other vendors picking up the technology is much smaller than in Apache projects like Kafka.

This has a significant impact on the (future) adoption curve.

Maturity, community and ecosystem

The introduction of this article showed the impressive adoption of Kafka. Just keep in mind: Redpanda is NOT Apache Kafka! It just supports the Kafka protocol.

Redpanda is a brand-new product and implementation. Operations are different. The behavior of the engine is different. Experts are not available. Job offerings do not exist. And so on.

Kafka is significantly better documented, has a tremendously larger community of experts, and has a vast array of supporting tooling that makes operations more straightforward.

There are many local and online Kafka training options, including online courses, books, meetups, and conferences. You won’t find much for Redpanda beyond the content of the vendor behind it.

And don’t trust marketing! That’s true for every vendor, of course. If you read a great feature list on the Redpanda website, double-check if the feature truly exists and in what shape it is. Example: RBAC (role-based access control) is available for Redpanda. The devil lies in the details. Quote from the Redpanda RBAC documentation: “This page describes RBAC in Redpanda Console and therefore manages access only for Console users but not clients that interact via the Kafka API. To restrict Kafka API access, you need to use Kafka ACLs.” There are plenty of similar examples today. Just try to use the Redpanda cloud service. You will find many things that are more alpha than beta today. Make sure not to fall for the same marketing myths around product features as some users did with Apache Pulsar a few years ago.

The total cost of ownership and business value

When you define your project’s business requirements and SLAs, ask yourself how much downtime or data loss is acceptable. The RTO (recovery time objective) and RPO (recovery point objective) impact a data streaming platform’s architecture and overall process to ensure business continuity, even in the case of a disaster.

The TCO is not just about the cost of a product or cloud service. Full-time engineers need to operate and integrate the data streaming platform. Expensive project leads, architects, and developers build applications.

Project risk includes the maturity of the product and the expertise you can bring in for consulting and 24/7 support.

Similar to benchmarketing regarding latency, vendors use the same strategy for TCO calculations! Here is one concrete example you always hear from Redpanda: “C++ does enable more efficient use of CPU resources.”

This statement is correct. However, the problem with that statement is that Kafka is rarely CPU-bound and much more often IO-bound. Redpanda has the same network and disk requirements as Kafka, which means Redpanda’s infrastructure TCO differs little from Kafka’s.

When to choose Redpanda instead of Apache Kafka?

You need to evaluate whether Redpanda is the right choice for your next project or if you should stick with the “real Apache Kafka” and related products or cloud offerings. Read articles and blogs, watch videos, search for case studies in your industry, talk to different competitive vendors, and build your proof of concept or pilot project. Qualifying out products is much easier than evaluating plenty of offerings.

When to seriously consider Redpanda?

  • You need C++ infrastructure because your ops team cannot handle and analyze JVM logs – but be aware that this is only the messaging core, not the data integration, data processing, or other capabilities of the Kafka ecosystem
  • The slight performance differences matter to you – and you still don’t need hard real-time
  • Simple, lightweight development on your laptop and in automated test environments – but you should then also run Redpanda in production (using different implementations of an API for TEST and PROD is a risky anti-pattern)

You should evaluate Redpanda against Apache Kafka distributions and cloud services in these cases.

This post explored the trade-offs Redpanda has from a technical and non-functional perspective. If you need an enterprise-grade solution or fully-managed cloud service, a broad ecosystem (connectors, data processing capabilities, etc.), and if 10ms latency is good enough and a few p99 spikes are okay, then I don’t see many reasons why you would take the risk of adopting Redpanda instead of an actual Apache Kafka product or cloud service.

The future will tell us if Redpanda becomes a serious competitor…

I didn’t even cover the fact that a startup always has challenges finding great case studies, especially with big enterprises like Fortune 500 companies. The first great logos are always the hardest to find. Sometimes, startups never get there. In other cases, a truly competitive technology and product are created. Such a journey takes years. Let’s revisit this blog post in one, two, and five years to see the evolution of Redpanda (and Apache Kafka).

What are your thoughts? When do you consider using Redpanda instead of Apache Kafka? Are you using Redpanda already? Why and for what use cases? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to choose Redpanda instead of Apache Kafka? appeared first on Kai Waehner.

]]>
When to use Apache Camel vs. Apache Kafka? https://www.kai-waehner.de/blog/2022/01/28/when-to-use-apache-camel-vs-apache-kafka-for-etl-application-integration-event-streaming/ Fri, 28 Jan 2022 06:31:02 +0000 https://www.kai-waehner.de/?p=4161 Should I use Apache Camel or Apache Kafka for my next integration project? The question is very valid and comes up regularly. This blog post explores both open-source frameworks and explains the difference between application integration and event streaming. The comparison discusses when to use Kafka or Camel, when to combine them, when not to use them at all. A decision tree shows how you can quickly qualify out one for the other.

The post When to use Apache Camel vs. Apache Kafka? appeared first on Kai Waehner.

]]>
Should I use Apache Camel or Apache Kafka for my next integration project? The question is very valid and comes up regularly. This blog post explores both open-source frameworks and explains the difference between application integration and event streaming. The comparison discusses when to use Kafka or Camel, when to combine them, when not to use them at all. A decision tree shows how you can quickly qualify out one for the other.

Apache Camel vs Apache Kafka Comparison

 

The history of application integration and event streaming

Here is my personal history and experience in application integration and event streaming. It shows my background and how I see the integration and data streaming markets.

A discussion that started over a decade ago…

With my background of work in the last decade at Talend, TIBCO, and Confluent, the comparison between Camel and Kafka is very exciting as I have spent a lot of time with both open-source frameworks:

Apache Camel powered Talend ESB. Talend had a visual coding tool to design Camel routes with code generation. Unfortunately, the tool’s primary focus was Talend Data Integration (ETL and batch). The Camel-powered ESB code was integrated, but it was neither perfect nor complete.

TIBCO BusinessWorks competed with Talend ESB while TIBCO StreamBase competed with other stream processing solutions. The Kafka ecosystem came up more and more in conversations with customers.

CamelOne Kai Waehner Conference Speaker Apache Camel Open Source

I posted about “When to use Apache Camel” in 2011 already. In 2012, I did my first talk at an international software conference in the US. The name of the conference? CamelOne! A forum only about Apache Camel. What an exciting time. Claus Ibsen, THE Camel guy, wrote an excellent summary of CamelOne 2012 in Boston.

In my conference summary, I talked about my two talks. One of them covered a comparison between Apache Camel, Spring Integration, and Mulesoft ESB. The presentation has over 35000 views, and the number still goes up today.

 

… from application integration to event streaming

Over time, the buzzword “big data” came up more and more. I spent some time at Talend and TIBCO to learn new programming concepts such as Map-Reduce and Shuffling, mainly powered by Apache Hadoop and Apache Spark. The big data ecosystem snowballed with tens of frameworks such as Hive, HBase, Pig, and many more.

However, people soon realized that real-time data beats slow data in almost all use cases. The Lambda architecture was invented to separate real-time workloads from batch workloads. Event Streaming was born. Apache Kafka became the de facto standard for data streaming. Like CamelOne a decade ago, Kafka Summit is the one-stop show for Kafka use cases, architectures, and success stories. Contrary to the small CamelOne, Kafka Summit is a global series with conferences across the globe, plus online events.

Data in motion with the Kappa architecture replacing Lambda

In 2014, a guy called Jay Kreps (few people knew him back then) was already questioning the Lambda architecture. Instead, he proposed a single real-time layer to provide data for both real-time and batch consumers. The Kappa architecture was born. Today, the Kappa architecture is mainstream, replacing Lambda. Various vendors have adopted Kafka in the meantime.

Kappa Architecture with one Pipeline for Real Time and Batch

Confluent became the clear leader in the event streaming software category. Confluent Platform is powered by Apache Kafka. The focus is on event streaming. That’s different from most other vendors like Cloudera; they focus on 10-20 frameworks or products and try to combine and integrate them somehow. Today, Confluent Cloud is a complete game-changer providing Apache Kafka and its ecosystem for application integration and stream processing as a serverless cloud offering.

This is where we are today in 2022. Application integration (= Camel) and event streaming (= Kafka) play a critical role in every modern enterprise architecture. Open-source is widely adopted and usually preferred compared to proprietary solutions for various reasons, including avoiding vendor lock-in. That’s true for self-managed and serverless cloud offerings.

Hence, the question arises: Should I use Apache Camel for application integration or Apache Kafka for event streaming? Or both? Or does one solve the other, too? These questions will be answered in the following sections, concluding with a decision tree to help you make the right choice for your project.

Let’s look at the similarities between Camel and Kafka, when to use which framework, when and how to combine them, and when not to use them at all.

Features in Apache Camel AND Apache Kafka

Camel and Kafka have many positive and negative characteristics in common. Hence, it is no surprise that people compare the two frameworks:

  • Open source under Apache 2.0 license
  • Vibrant community and adoption in the industry
  • Mature framework with deployments in enterprises across the globe
  • Fixing point-to-point spaghetti architectures with a central integration backbone
  • Open architecture and extensibility with custom functions and connectors
  • Small and big deployments possible, plus single-node deployments for non-mission-critical use cases
  • Re-engineered and optimized for cloud-native deployments (container, Kubernetes, cloud)
  • Connectivity to any technology, API, communication paradigm, and SaaS
  • Transformation of any data types and formats
  • Processes transactional and analytical workloads
  • Domain-specific language (DSL) for message-at-a-time processing, with similar logic such as aggregation, filtering, and conditional processing
  • Relatively complex frameworks because of their robust feature sets, hence not suitable for solving a minor problem
  • Not a replacement of a database, data warehouse, or data lake

Beyond the similarities, Kafka and Camel have very different sweet spots built to solve distinct problems. Hence, comparing these two tools is a bit like comparing apples and oranges. Some minor projects might use one or the other to solve the problem, but critical enterprise projects show the differences more quickly.

When to use Apache Camel?

The mission of Camel

Apache Camel is an integration framework. It solves a particular problem: Data integration between different applications, APIs, protocols, and communication paradigms. This concept is often called application integration or enterprise integration. Camel implements the famous Enterprise Integration Patterns (EIP). EIPs are based on messaging principles.
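To give a feel for the programming model, here is a minimal sketch of a Camel route in the Java DSL implementing the content-based router EIP. The endpoints, directory, and queue names are hypothetical examples, not from any specific project:

```java
import org.apache.camel.builder.RouteBuilder;

public class OrderRoutingRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Content-based router, one of the classic Enterprise Integration Patterns:
        // pick up files from a directory and route each message based on its payload.
        from("file:data/inbox?noop=true")
            .choice()
                .when(body().contains("PRIORITY"))
                    .to("jms:queue:priorityOrders")
                .otherwise()
                    .to("jms:queue:standardOrders")
            .end()
            .log("Routed order file: ${header.CamelFileName}");
    }
}
```

Each from()/to() pair wires a consumer endpoint to one or more producer endpoints, and EIPs like choice(), filter(), or split() shape the routing logic in between.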

Camel’s strengths

  • Event-based backbone based on well-known and adopted EIP concepts
  • Connectivity to almost any API
  • Integration, processing, and routing of information with an intuitive domain-specific language (DSL) focused on integration; composability in code provides fine-grained control for conditional logic, transformations, and reformatting
  • Powerful routing capabilities with many built-in EIPs
  • Many deployment options (standalone, web container, application server, Spring, OSGi, Kubernetes via the Camel K sub-project) – okay, I guess some options are not relevant in this decade anymore 🙂
  • Lightweight alternative to proprietary ETL and ESB tools

Camel’s weaknesses

  • Only a “routing machine”, i.e., not built for long-term storage (an additional cache or storage layer is needed); for that reason, Camel is not the right choice for a central nervous system like Kafka
  • No stream processing (like you know it from Kafka Streams or Apache Flink)
  • Limited scalability, not built for massive volumes of data
  • No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
  • No serverless cloud offering, and therefore it does not compete with iPaaS offerings either
  • Red Hat is the only vendor supporting it
  • Built to be deployed in a single data center or cloud region, not across hybrid or multi-cloud scenarios

The evolution of Apache Camel

Camel is widely adopted and has a strong community. Unfortunately, from a vendor and support perspective, the offerings declined in the last few years. One of the most significant pain points: I still don’t see a serverless cloud offering anywhere today:

The Evolution of Apache Camel 2

Camel TL;DR

Camel is an application integration framework to connect different applications and interfaces. Camel is NOT built for processing data in motion continuously, i.e., stream processing. Hence, it should be compared to ETL and ESB tools, not data streaming technologies like Kafka, Kinesis, or Flink. If you look for a serverless cloud offering, you are out of luck. If you look for vendor support, Red Hat is the only option.

When to use Apache Kafka?

The mission of Kafka

Real-time data beats slow data at any scale. The event streaming platform enables processing data in motion. Kafka is the de facto standard for event streaming, including messaging, data integration, stream processing, and storage. Kafka provides all capabilities in one infrastructure at scale. It is reliable and allows you to process analytical and transactional workloads.

Kafka’s strengths

  • Event-based streaming platform
  • A unique combination of pub/sub messaging, data processing, data integration, and storage in a single framework
  • Built for massive volumes of data and extreme scale from the beginning, so that a single framework can be used for transactional (low volume) and analytics (high volume) use cases
  • True decoupling between producers and consumers because of its storage component makes it the de facto standard for microservice architectures
  • Guaranteed ordering of events in the distributed commit log
  • Distributed data processing with fault-tolerance and recoverability built-in
  • Replayability of events (see the consumer sketch after this list)
  • The de facto standard for event streaming
  • Built with hybrid and multi-cloud data replication in mind (with included tools like MirrorMaker and separate, more advanced, and more straightforward tools like Confluent Cluster Linking)
  • Support from many vendors, including Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and many more
  • Paradigm shift: Built to process data in motion end-to-end from source to one or more sinks
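As a small illustration of true decoupling and replayability, here is a sketch of a plain Kafka consumer in Java. The topic, group id, and broker address are placeholders; the key idea is that a brand-new consumer group with auto.offset.reset=earliest re-reads the retained log from the beginning without affecting any other consumer of the same topic:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        // A new consumer group combined with 'earliest' replays the full retained log,
        // completely decoupled from every other consumer of the same topic.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics-replay-" + System.currentTimeMillis());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```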

Kafka’s weaknesses

  • Paradigm shift: Enterprises need to learn and understand the added value of event streaming, a new software category that enables new use cases but also requires different design patterns and operations approaches
  • No powerful visual coding like you know it from proprietary ETL/ESB/iPaaS tools
  • Limited out-of-the-box routing capabilities (Kafka Connect SMT or Kafka Streams / ksqlDB app do the job very well, but not as simple as Camel)
  • Complex operations (if you run it by yourself instead of using 3rd party tools or even better a serverless cloud offering)

The evolution of Apache Kafka

Kafka was built at LinkedIn to process high volumes of data, as no other open-source framework could do this. Kafka found quick adoption after LinkedIn open-sourced it. Several vendors adopted Kafka and added it to their product portfolio. Some vendors just added Kafka for the sake of having it. Others innovated and used additional tools to make Kafka cloud-native for the next generation of event streaming. Kafka as a serverless cloud offering is a critical piece of many modern enterprise architectures today:

The Evolution of Apache Kafka 2

Kafka TL;DR

Kafka is an event streaming platform to process data in motion continuously. If you “just” need an integration framework to route data from a source to one or more sinks (= ETL / ESB), then Camel can be used, too. However, Kafka kills two birds with one stone (= integrating data AND processing it in motion where needed).

Plenty of Kafka offerings are available on the market. Check out the Apache Kafka landscape and comparison to understand the differences between offerings from Confluent, Cloudera, IBM, Red Hat, Amazon, Microsoft, and others.

Decision tree – Camel or Kafka?

The above sections explored when to use Camel and Kafka. So far, so good. Nevertheless, both frameworks overlap with their capabilities. Let’s get some help to choose the right one in that case.

Qualify out – the easiest way to start an evaluation!

The easiest way to decide on a specific option is to qualify out other frameworks that cannot fulfill the requirements.

Therefore, do you need

  • Big data processing?
  • A storage component for true decoupling and replayability of events?
  • Stateless or stateful stream processing?
  • A serverless cloud offering?

The above section discussed these differentiators of Kafka. In all these cases, you can qualify out Camel. It does not fulfill these requirements. These requirements are not necessarily a complete list. And you might also find a few aspects to qualify out Kafka from the beginning. Hence, you could also start from the Camel perspective and ask yourself: When should I not use Kafka? But I think it is easier the other way round.

Qualifying out solutions because of their limitations makes the decision tree and evaluation process much easier from the beginning.

Decision Tree for Camel and Kafka

Here is my decision tree to find out if Camel or Kafka is the right choice and what vendors you could evaluate:

Decision Tree Apache Camel vs Apache Kafka Comparison

When to use Camel and Kafka together?

It is possible to use Camel and Kafka together in a single integration architecture. Should you do that? Two options exist. One makes more sense than the other:

Kafka for event streaming and Camel for ETL

Camel and Kafka integrate well with each other. The native Kafka component of Camel is the best native integration point as a bridge between both environments:

Apache Camel and Apache Kafka in the Enterprise Architecture

The above architecture shows how Camel and Kafka live next to each other. Camel is used in a business domain for application integration. Kafka is the central nervous system between the Camel integration application and many other applications. I also added Kong as API Gateway to clarify that Camel or Kafka is not a silver bullet to solve every problem.
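A minimal sketch of such a bridge, using Camel’s native Kafka component, could look as follows. The broker address, topics, and queue names are hypothetical:

```java
import org.apache.camel.builder.RouteBuilder;

public class CamelKafkaBridgeRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Publish events from a legacy JMS-based domain into the Kafka backbone.
        from("jms:queue:legacy.orders")
            .to("kafka:orders?brokers=kafka-broker:9092");

        // Consume curated events from Kafka back into the Camel-integrated domain.
        from("kafka:order-confirmations?brokers=kafka-broker:9092&groupId=camel-bridge")
            .to("jms:queue:legacy.confirmations");
    }
}
```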

Once again, the huge advantage of Kafka as the central integration layer is its unique combination of characteristics within a single infrastructure, including:

  • Real-time messaging at any scale
  • Storage for true decoupling between different applications and communication paradigms
  • Built-in backpressure handling and replayability of events
  • Data integration
  • Stream processing

Real-time data replication across hybrid and multi-cloud is not shown in the above picture but is also part of the enterprise architecture out-of-the-box by leveraging the Kafka protocol.

With true decoupling within modern microservice architecture, each business team can decide whether they need application integration (using Camel) or event streaming (using Kafka). Often, both could be used. Additional questions around single vs. multi frameworks and APIs, vendor support, scalability needs, and other characteristics need to be evaluated to make the right choice for your business problem.

Camel connectors embedded into Kafka Connect

There is another way to combine Kafka and Camel: the “Camel Kafka Connector” sub-project of Apache Camel. Don’t get confused. This feature is not the Kafka component (= connector) of Camel! Instead, it is a relatively new initiative to deploy Camel components into the Kafka Connect infrastructure.

The obvious benefit: This way, you get hundreds of new connectors “for free” within the Kafka ecosystem. This capability sounds excellent. And it is!

However, consider the total cost of ownership and the overall efforts using this approach. Application integration is one of the most challenging problems in computer science – especially if you talk about transactional data sets that require zero data loss, exactly-once semantics, and no downtime. The more components you combine in the end-to-end data flow, the harder it gets to keep your performance and reliability SLAs.

Hence, using Camel components within Kafka Connect has a considerable disadvantage: Combining two frameworks with complexities and different design concepts. Just a few examples:

  • Kafka world: Partitions, Offsets, Leader and Follower, Key/Value/Header, connectors (based on Kafka Connect), Bootstrap Server, ConsumerRecord, Retention Time, etc.
  • Camel world: Routes, RouteBuilder CamelContext, Exchange, Processor, components (Camel connectors), Endpoints, Type Converters, Registry, etc.

Please think twice before mixing two integration tools that are powerful but complex on their own. Getting this running is just one piece of the puzzle (the simple part). Don’t forget end-to-end testing, resiliency, SLAs, support across technologies and APIs. Even buying support for Camel and Kafka from Red Hat (i.e., a single vendor) does not improve this approach.

It is likely better to take the business logic and API calls out of the Camel component and copy it into a Kafka Connect connector template to run the integration natively with only Kafka code. This workaround allows a clean architecture, end-to-end integration with a single framework, a single vendor behind it, and much easier testing / debugging / monitoring.
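As a rough sketch of what such a Kafka-native connector template looks like, here is a skeleton Kafka Connect SourceTask in Java. The class, topic, and configuration names are hypothetical, and the corresponding SourceConnector class plus error handling are omitted for brevity:

```java
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class LegacyApiSourceTask extends SourceTask {

    private String topic;

    @Override
    public void start(Map<String, String> config) {
        // Configuration comes from the connector config, e.g., target topic and API credentials.
        topic = config.get("topic");
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // Here you would call the external system (the logic previously hidden in a
        // Camel component) and map each result to a SourceRecord.
        String payload = fetchFromLegacyApi();
        SourceRecord record = new SourceRecord(
                Map.of("source", "legacy-api"),  // source partition
                Map.of("position", 0L),          // source offset used for resuming
                topic,
                Schema.STRING_SCHEMA,
                payload);
        return List.of(record);
    }

    @Override
    public void stop() {
        // Close connections to the external system here.
    }

    @Override
    public String version() {
        return "0.1.0";
    }

    private String fetchFromLegacyApi() {
        // Placeholder for the actual API call.
        return "{\"order\":\"42\"}";
    }
}
```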

TL;DR: I recommend only using the “Camel Kafka Connector” sub-project if the following options do not work:

  • Use only Apache Camel for application integration
  • Leverage Apache Kafka for event streaming and application integration
  • Choose separate deployments of Camel and Kafka and use the Camel-Kafka-Bridge

When NOT to use Camel or Kafka at all?

Once again, the easiest way for your evaluation to start is qualifying out tools that do not work to solve the problem.

Both Camel and Kafka are NOT built for the following scenarios:

  • A proxy for millions of clients (like mobile apps) – but native proxies (like a REST or MQTT Proxy for Kafka) exist for some use cases.
  • An API Management platform – but these tools are usually complementary and used to create life cycle management or monetize APIs deployed with Camel or Kafka.
  • A database for complex queries and batch analytics workloads
  • An IoT platform with features such as device management – but direct native integration with (some) IoT protocols such as MQTT or OPC-UA is possible and the right approach for (some) use cases.
  • A technology for hard real-time applications such as safety-critical or deterministic systems – but that’s true for any other IT framework, too. Embedded systems are a different software than Camel or Kafka!

I wrote a very detailed post about this topic from a Kafka perspective. It maps almost 1:1 to the Camel world, too (and any related technology such as Flink, Spark, Pulsar, etc.): “When NOT to use Apache Kafka?”

Apache Camel vs. Apache Kafka – Who is the winner?

Simple answer: Both!

When you compare apples and oranges, both can make you happy when you are hungry, as both are good to eat. The same is true for Camel and Kafka. Both can do application integration. But they serve very different needs.

Many integration scenarios can use Camel or Kafka.

Camel is the right tool if you need to integrate data within an application context or business unit (with no need for stream processing, true decoupling, replayability, large scale, replication across data centers or cloud regions).

Kafka is the central event-based nervous system across business units, regions, and hybrid clouds. Kafka is all about event streaming. Application integration is just a piece of this puzzle. On the other side, I have seen plenty of integration projects powered by Apache Kafka. It is often replacing other middleware. That’s true for ETL/ESB legacy modernization and in discussions about using a cloud-native iPaaS.

Do you use Camel or Kafka today? What use cases? How do you decide which one to choose? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to use Apache Camel vs. Apache Kafka? appeared first on Kai Waehner.

]]>
Panel Discussion about Kafka, Edge, Networking and 5G in Oil and Gas and Mining Industry https://www.kai-waehner.de/blog/2021/08/20/panel-discussion-apache-kafka-edge-networking-5g-oil-and-gas-mining-industry/ Fri, 20 Aug 2021 07:40:43 +0000 https://www.kai-waehner.de/?p=3698 The oil & gas and mining industries require edge computing for low latency and zero trust use cases. Most IT architectures are hybrid with big data analytics in the cloud and safety-critical data processing in disconnected and often air-gapped environments. This blog post shares a panel discussion that explores the challenges, use cases, and hardware/software/network technologies to reduce cost and innovate. A key focus is on the open-source framework Apache Kafka, the de facto standard for processing data in motion at the edge and in the cloud.

The post Panel Discussion about Kafka, Edge, Networking and 5G in Oil and Gas and Mining Industry appeared first on Kai Waehner.

]]>
The oil & gas and mining industries require edge computing for low latency and zero trust use cases. Most IT architectures are hybrid with big data analytics in the cloud and safety-critical data processing in disconnected and often air-gapped environments. This blog post shares a panel discussion that explores the challenges, use cases, and hardware/software/network technologies to reduce cost and innovate. A key focus is on the open-source framework Apache Kafka, the de facto standard for processing data in motion at the edge and in the cloud.

Apache Kafka and Edge Networks in Oil and Gas and Mining

Apache Kafka at the Edge and in Hybrid Cloud

I explored the usage of event streaming at the edge and in hybrid cloud scenarios in lots of detail in the past. Hence, instead of yet another description, check out the following posts to learn about use cases and architectures:

Panel Discussion: Kafka, Network Infrastructure, Edge, and Hybrid Cloud in Oil and Gas

Here is the panel discussion. The conversation includes both the software and the hardware/infrastructure/networking perspective. It is a great mix of use cases from the oil & gas and mining industries for processing data in motion and technical facts about communication/radio/telco infrastructures. The topics are heavily related and depend on each other to deploy a project successfully.

Speakers:

  • Andrew Duong (Confluent): Moderator
  • Kai Waehner (Confluent): Expert on hybrid software architectures and data in motion
  • Dion Stevenson (Tait Communications): Expert on hardware and network infrastructure
  • Sohan Domingo (Tait Communications): Expert on hardware and network infrastructure

Now enjoy the discussion and feel free to share any thoughts or feedback:

Kafka in the Energy Sector including Oil and Gas, Mining, Smart Grids

An example architecture for hybrid event streaming in the oil and gas industry can look like the following:

Data in Motion for Energy Production - Upstream Midstream Downstream - at the Edge with Kafka in Oil and Gas and Mining

 

If you want to learn more about event streaming with Apache Kafka in the energy industry (including oil and gas, mining, smart grids), check out the following blog post:

Notes about the Kafka, Edge, Oil and Gas, and Mining Conversation

If you prefer reading or just listening to a few of the sections, here are some notes about the flow of the panel discussion:

0:00-4:20 – Introduction to Tait Communications
4:45-7:20 – Introduction to Confluent and a high-level definition of edge and IoT
7:30-10:10 – Voice communication discussion about connectivity and the importance of context at the point in time through data, so the right response can be determined sooner. No matter where people are or what they are doing, communication at the edge suits the needs of a modern workforce.
10:15-12:10 – ML/AI at the edge. Continuous monitoring of all infrastructure and sensors for safety purposes. Event streaming helps send alerts in real-time and also supports post-event analysis. There is a process to get into AI: infrastructure first, then pipelines, then AI, not the other way around.
12:15-14:42 – 5G can’t solve all problems: security, privacy, and compliance considerations determine where to process the data, and beyond this, cost is also a factor. Considerations for cloud and on-premise.
14:50-16:03 – 5G discussion. There are real-world limitations like cell towers. You also need contextual awareness at the edge and the ability to make decisions there (local awareness), e.g., gas on a vehicle that is disconnected from the backend.
16:15-20:10 – Manufacturing and supply chain, radios and communications today, and what is possible in the future. IoT at the edge enables manufacturing optimizations with low latency requirements where cloud-first doesn’t make sense. On the flip side, if it is not safety-critical, or for things like an ERP system, workloads can be pushed into the cloud.
20:10-23:35 – The mining side of things: lacking connectivity and a preference for edge-based usage. Autonomous trucks make decisions at the edge rather than accepting delays of even milliseconds by going to the cloud. Doing it locally at the edge is more efficient in some cases. All sensors on the trucks (temperatures, etc.) are collected even while disconnected; once the connection is re-established at the base, that data can be uploaded. ‘Last mile’ analytics. Confluent is IT, not OT: we integrate with IT systems, but the OT world is separated.
23:38-26:25 – Digital mobile radios and voice communications; with autonomous trucks, you don’t have that. This is where Tait’s Unified Vehicle comes in: a combination of Digital Mobile Radio (DMR) and LTE, where intelligent algorithms help with failover from DMR to LTE if there are connectivity issues. Voice is still important despite the amount of technology in use and the focus on data.
27:03-31:15 – Where to start with data exploration: start with your requirements. Does the problem really need computing at the edge, or can the cloud work? Event streaming at the edge and the use cases where it makes sense. How customers get started: solve simple use cases first before the more advanced ones (building the foundations, data pipelines, simple rules, and tests). Collaboration with the AWS Wavelength team; the edge makes sense with low latency and security requirements.
31:15-32:54 – Consider your bandwidth and latency to decide whether edge computing makes sense. Driverless cars.
33:15-37:49 – Where to go from here with existing customers. How do they upgrade, what customers come to Tait for, and the use of video as part of all this for public safety. Health and safety, monitoring driver alertness in New Zealand. Truck performance, driver performance, and when to take a break. That decision needs to be made as a combination of edge and cloud.
37:50-40:55 – Connected vehicles and cars: it is not as hard as it looks. Gas stations with edge computing, loyalty systems, etc., and the importance of after-sales for connected vehicles. GDPR and compliance through aggregation of data, as some countries have strict privacy requirements.
41:00-44:10 – Joint project with Tait in the law enforcement space. Voice to text, use of metadata, and combining voice and video with event streaming.

Kafka for Next-Generation Edge Computing

The energy industry, including oil & gas and mining, is super interesting from a technical perspective. It requires edge and cloud computing. Upstream, midstream, and downstream form a complex and safety-critical supply chain. Processing data in motion with Apache Kafka leveraging various network infrastructures is a great opportunity to innovate and reduce costs across various use cases.

Do you already leverage Apache Kafka for processing data in motion in the oil and gas, mining, or any other industry? How does your (future) edge or hybrid architecture look like? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

 

The post Panel Discussion about Kafka, Edge, Networking and 5G in Oil and Gas and Mining Industry appeared first on Kai Waehner.

]]>
Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage https://www.kai-waehner.de/blog/2021/05/09/kafka-api-de-facto-standard-event-streaming-like-amazon-s3-object-storage/ Sun, 09 May 2021 14:32:49 +0000 https://www.kai-waehner.de/?p=3380 Real-time beats slow data in most use cases across industries. The rise of event-driven architectures and data in motion powered by Apache Kafka enables enterprises to build real-time infrastructure and applications. This blog post explores why the Kafka API became the de facto standard API for event streaming like Amazon S3 for object storage, and the tradeoffs of these standards and corresponding frameworks, products, and cloud services.

The post Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage appeared first on Kai Waehner.

]]>
Real-time beats slow data in most use cases across industries. The rise of event-driven architectures and data in motion powered by Apache Kafka enables enterprises to build real-time infrastructure and applications. This blog post explores why the Kafka API became the de facto standard API for event streaming like Amazon S3 for object storage, and the tradeoffs of these standards and corresponding frameworks, products, and cloud services.

De Facto Standard API - Amazon S3 for Object Storage and Apache Kafka for Event Streaming

Event-Driven Architecture: This Time It’s Not A Fad

The Forbes article “Event-Driven Architecture: This Time It’s Not A Fad” from April 2021 explained why enterprises are not just talking about event-driven real-time applications, but finally building them. Here are some arguments:

  • REST limitations can limit your business strategy
  • Data needs to be fluid and real-time
  • Microservices and serverless need event-driven architectures

Real-time Data in Motion beats Slow Data

Use cases for event-driven architectures exist across industries. Some examples:

  • Transportation: Real-time sensor diagnostics, driver-rider match, ETA updates
  • Banking: Fraud detection, trading, risk systems, mobile applications/customer experience
  • Retail: Real-time inventory, real-time POS reporting, personalization
  • Entertainment: Real-time recommendations, a personalized news feed, in-app purchases
  • The list goes on across verticals…

Real-time data in motion beats data at rest in databases or data lakes in most scenarios. There are a few exceptions that require batch processing:

  • Reporting (traditional business intelligence)
  • Batch analytics (processing high volumes of data in a bundle, for instance, Hadoop and Spark’s map-reduce, shuffling, and other data processing only make sense in batch mode)
  • Model training as part of a machine learning infrastructure (while model scoring and monitoring often requires real-time predictions, the model training is batch in almost all currently available ML algorithms)

Beyond these exceptions, almost everything is better in real-time than batch.

Be aware that real-time data processing is more than just sending data from A to B in real-time (aka messaging or pub/sub). Real-time data processing requires integration and processing capabilities. If you send data into a database or data lake in real-time but have to wait until it is processed there in batch, it does not solve the problem.

With the ideas around real-time in mind, let’s explore what a de facto standard API is.

What is a (De Facto) Standard API?

The answer is longer than you might expect and needs to be separated into three sections:

  • API
  • Standard API
  • De facto standard API

What is an API?

An application programming interface (API) is an interface that defines interactions between multiple software applications or mixed hardware-software intermediaries. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used, the conventions to follow, etc. It can also provide extension mechanisms so that users can extend existing functionality in various ways and to varying degrees.

An API can be entirely custom, specific to a component, or designed based on an industry-standard to ensure interoperability. Through information hiding, APIs enable modular programming, allowing users to use the interface independently of the implementation.

What is a Standard API?

Industry consortiums or other industry-neutral (often global) groups or organizations specify standard APIs. A few characteristics show the trade-offs:

  • Vendor-agnostic interfaces
  • Slow evolution and long specification process
  • Most vendors add proprietary features because a) the standard specification process is too slow or, more often, b) they want to differentiate their commercial offering
  • Acceptance and success depend on the complexity and added value (this sounds obvious but is often the key blocker for success)

Examples for Standard APIs

Here are some examples of standard APIs. I also add my thoughts on whether I think they are successful or not (but I fully understand that there are good arguments against my opinion).

Generic Standards
  • SQL: Domain-specific language used in programming and designed for managing data held in a relational database management system. Successful as almost every database somehow supports SQL or tries to build a similar syntax. A good example is ksqlDB, the Kafka-native streaming SQL engine. ksqlDB (like most other streaming SQL engines) is not ANSI SQL, but still understood easily by people that know SQL.
  • J2EE / Java EE / Jakarta EE: Successful as most vendors adopted at least parts of it for Java frameworks. While early versions were very heavyweight and complex, the current APIs and implementations are much more lightweight and user-friendly. JMS is a great example where vendors added proprietary add-ons to add features and differentiate. “No vendor lock-in” is only true in theory!
  • HTTP: Successful as application layer protocol for distributed, collaborative, hypermedia information systems. While not 100% correct, people typically interpret HTTP as REST Web Services. HTTP is often misused for things it is not built for.
  • SOAP / WSDL: Partly successful in providing XML-based web service standard specifications. Some vendors built good tooling around it. However, this is typically only true for the basic standards such as SOAP and WSDL, not so much for all the other complex add-ons (often called WS-* hell).
Standards for a Specific Problem or Industry
  • OPC-UA for Industrial IoT (IIoT): Partly successful machine-to-machine communication protocol developed for industrial automation. Adopted by almost every vendor in the industrial space. The drawback (similar to HTTP) is that it is often misused. For instance, MQTT is a much better and more lightweight choice in some scenarios. OPC-UA is a great example where the core is successful, but the industry-specific add-ons are not prevalent and not supported by tools. Also, OPC-UA is too heavyweight for many of the use cases it is used in.
  • PMML for Machine Learning: Not successful as an XML-based predictive model interchange format. The idea is great: Train an analytic model once and then deploy it across platforms and programming languages. In practice, it did not work: too many limitations and too much unnecessary complexity for most projects. Most real-world machine learning deployments I have seen in the wild avoid it and deploy models to production with a standard wrapper. ONNX and other successors are not much more prevalent yet either.

In summary, some standard APIs are successful and adopted well; many others are not. Contrary to these standards specified by consortiums, there is another category emerging: De Facto Standard APIs.

What is a De Facto Standard API?

De facto standard APIs originate from an existing successful solution (that can be an open-source framework, a commercial product, or a cloud service). These de facto standard APIs emerge in one of two ways:

  • Driven by a single vendor (often proprietary), for example: Amazon S3 for object storage.
  • Driven by a huge community around a successful open-source project, for example: Apache Kafka for event streaming.

No matter how de facto standard APIs originate, they typically have a few characteristics in common:

  • Creation of a new category of software, something that did not exist before
  • Adoption of the API by other frameworks, products, or cloud services because it became the de facto standard
  • No complex, formal, long-running standard processes; hence innovation is possible in a relatively flexible and agile way
  • Practical processes and rules are in place to ensure good quality and consensus (either controlled by the owner company for a proprietary standard API or across the open source community)

Let’s now explore two de facto standard APIs: Amazon S3 and Apache Kafka. Both are very successful but very different regarding being a standard. Hence, the trade-offs are very different.

Amazon S3: De Facto Standard API for Object Storage

Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface in the public AWS cloud. It uses the same scalable storage infrastructure that Amazon.com uses to run its global e-commerce network. Amazon S3 can be employed to store any type of object, which allows for uses like storage for internet applications, backup and recovery, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage. Additionally, S3 on Outposts provides on-premises object storage for on-premises applications that require high-throughput local processing.

“Amazon CTO on Past, Present, Future of S3” is a great read about the evolution of this fully managed cloud service. While the public API was kept stable, the internal backend architecture under the hood changed significantly several times. Plus, new features were developed on top of the API, for instance, AWS Athena for analytics and interactive queries using standard SQL. I really like how Werner Vogels describes his understanding of a good cloud service:

Vogels doesn’t want S3 users to even think for a moment about spindles or magnetic hardware. He doesn’t want them to care about understanding what’s happening in those data centers at all. It’s all about the services, the interfaces, and the flexibility of access, preferably with the strongest consistency and lowest latency when it really matters.

So, we are talking about a very successful proprietary cloud service by AWS. Hence, what’s the point?

Most Object Storage Vendors Support the Amazon S3 API

Many enterprises use the Amazon S3 API. Hence, it became the de facto standard. If other storage vendors want to sell object storage, supporting the S3 interface is often crucial to get through the evaluations and RFPs. If you don’t support the S3 API, it is much harder for companies to adopt the storage and implement the integration (as most companies already use Amazon S3 and have built tools, scripts, testing around this API).

For this reason, many applications have been built to support the Amazon S3 API natively. This includes applications that write data to Amazon S3 and Amazon S3-compatible object stores.

S3-compatible solutions include client backup, file browser, server backup, cloud storage, cloud storage gateway, sync & share, hybrid storage, on-premises storage, and more.

Many vendors sell S3-compatible products: Oracle, EMC, Microsoft, NetApp, Western Digital, MinIO, Pure Storage, and many more. Check out the Amazon S3 site from Wikipedia for a more detailed and complete list.
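To see why this compatibility matters in practice, here is a minimal sketch (my addition, not from the original article) using the boto3 Python client. The endpoint URL, bucket name, and credentials are placeholder assumptions; the point is that the exact same client code talks to AWS S3 or to an S3-compatible store such as MinIO simply by changing the endpoint configuration.

```python
import boto3

# The same S3 client API works against AWS S3 and S3-compatible stores.
# Endpoint, bucket, and credentials are placeholder assumptions for a local MinIO instance.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # omit this parameter to talk to AWS S3 instead
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# Standard S3 API calls - identical code regardless of the vendor behind the endpoint.
s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"Hello, S3 API!")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```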

So why has the S3 API become so ubiquitous?

The creation of a new software category is a dream for every vendor! Let’s understand how and why Amazon was successful in establishing S3 for object storage. The following is a quote from Chris Evans’ great article from 2016: “Has S3 become the de facto API standard?

So why has the S3 API become so ubiquitous?  I suspect there are a number of reasons.  These include:

  • First to market – When S3 was launched in 2006, most enterprises were familiar with object storage as “content addressable storage” through EMC’s Centera platform.  Other than that, applications were niche and not widely adopted except for specific industries like High Performance Computing where those users were used to coding to and for the hardware.  S3 quickly became a platform everyone could use with very little investment.  That made it easy to consume and experiment with.  By comparison, even today the leaders in object storage (as ranked by the major analysts) still don’t make it easy (or possible) to download and evaluate their products, even though most are software only implementations.
  • Documentation – following on from the previous point, S3 has always been well documented, with examples on how to run API commands.  There’s a document history listing changes over the past 6-7 years that shows exactly how the API has evolved.
  • A Single Agenda – the S3 API was designed to fit a single agenda – that of storing and retrieving objects from S3.  As such, Amazon didn’t have to design by committee and could implement the features they required and evolve from there.  Contrast that with the CDMI (Cloud Data Management Interface) from SNIA.  The SNIA website is difficult to navigate, the standard itself is only on the 4th published iteration in six years, while the documentation runs to 264 pages! (Note that the S3 API runs into more pages, but is infinitely more consumable, with simple examples from page 11 onwards).

Cons of a Proprietary De Facto Standard like Amazon S3

Many people might say: “Better a proprietary standard than no standard.” I partly agree with this. The possibility to learn one API and use it across multi-cloud and on-premise systems and vendors is great. However, Amazon S3 has several disadvantages as it is NOT an open standard:

  • Other vendors (have to) build their implementation on a best guess about the behavior of the API. There is no official standard specification they can rely on.
  • Customers cannot be sure what they buy. At least, they should not expect the same behavior of 3rd party S3 implementations that they get from their experiences using Amazon S3 on AWS.
  • Amazon can change APIs and features as it likes. Other vendors need to “reverse engineer the API” and adjust their products.
  • Amazon could sue competitors for using S3 API branding – even though this is not likely to happen as the benefits are probably bigger (I am not a lawyer; hence this statement might be wrong and is just my personal opinion)

Let’s now look at an open-source de facto standard: Kafka.

Kafka API: De Facto Standard API for Event Streaming

Apache Kafka is mainstream today! The Kafka API became the de facto standard for event-driven architectures and event streaming. Two proof points follow: the openness of the Kafka protocol itself and the broad adoption of the Kafka API by other frameworks, products, and cloud services.

The Kafka API (aka Kafka Protocol)

Kafka became the de facto event streaming API, similar to how the S3 API became the de facto standard for object storage. Actually, the situation is even better for the Kafka API: the S3 API is a proprietary protocol from AWS, whereas the Kafka API and protocol are open source under the Apache 2.0 license.

The Kafka protocol covers the wire protocol implemented in Kafka. It defines the available requests, their binary format, and the proper way to make use of them to implement a client.

One of my favorite characteristics of the Kafka protocol is backward compatibility. Kafka has a “bidirectional” client compatibility policy. In other words, new clients can talk to old servers, and old clients can talk to new servers. This allows users to upgrade either clients or servers without experiencing any downtime or data loss. This makes Kafka ideal for microservice architectures and domain-driven design (DDD): Kafka really decouples the applications from each other, in contrast to web service/REST-based architectures.
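As a minimal illustration (my sketch, not part of the original post), the standard Kafka clients speak this open protocol, so the only environment-specific detail in an application is typically the bootstrap server address; the broker address and topic name below are assumptions. The same code runs against open-source Kafka, a self-managed distribution, or a fully managed Kafka service.

```python
from confluent_kafka import Producer, Consumer

BOOTSTRAP = "localhost:9092"  # the main detail that changes between Kafka-compatible services
TOPIC = "demo-events"         # assumed topic name

# Produce one event via the open Kafka protocol.
producer = Producer({"bootstrap.servers": BOOTSTRAP})
producer.produce(TOPIC, key="user-1", value='{"action": "login"}')
producer.flush()

# Consume it back with the same protocol and client library.
consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "demo-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```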

Pros of an Open Source De Facto Standard like the Kafka API

The huge benefit of an open-source de facto standard API is that it is open and usually follows a collaborative standardized process to make changes to the API. This brings various benefits to the community and software vendors.

The following facts about the Kafka API make many developers and enterprises happy:

  • Changes occur in a visible process enforced by a committee. For Apache Kafka, the Apache Software Foundation (ASF) is the relevant organization. Apache projects are managed using a collaborative, consensus-based process with members from various countries and enterprises. Check out how it works if you don’t know it yet.
  • Frameworks and vendors can implement against the open protocol and validate the implementation. That is significantly different from proprietary de facto standards like Amazon S3. Having said this, not every product that claims to use the Kafka API is 100% compatible; such products are often limited in feature set and behave differently.
  • Developers can test the underlying behavior against the same API. Hence, unit and performance tests for different implementations can use the same code.
  • The Apache 2.0 license makes sure that the user does not have to worry about infringing any patents by using the software.

Frameworks, Products, and Cloud Services using the Kafka API

Many frameworks and vendors adopted the Kafka API. Let’s take a look at a few very different alternatives available today that use the Kafka API:

  • Open-source Apache Kafka from the Apache website
  • Self-managed Kafka-based vendor solutions for on-premises or cloud deployments from Confluent, Cloudera, Red Hat
  • Partially managed Kafka-based cloud offerings from Amazon MSK, Red Hat, Azure HDInsight’s Kafka, Aiven, CloudKarafka, Instaclustr.
  • Fully managed Kafka cloud offerings such as Confluent Cloud – actually, there is no other serverless, fully compatible Kafka SaaS offering on the market today (even though many marketing departments try to sell it like this)
  • Partly protocol-compatible, self-managed solutions such as Apache Pulsar (with a simple, very limited Kafka wrapper class) or Redpanda for embedded / WebAssembly (WASM) use cases
  • Partly protocol-compatible, fully managed offerings like Azure EventHubs

Just be aware that the devil is in the details. Many offerings only implement a fraction of the Kafka API. Additionally, many offerings only support the core messaging concept, but exclude key features such as Kafka Connect for data integration, Kafka Streams for stream processing, or exactly-once semantics (EOS) for building transactional systems.
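One pragmatic way to check feature completeness is to exercise such features explicitly during an evaluation. The following sketch (an illustration with assumed broker address, topic, and IDs, not a statement about any specific vendor) enables idempotence and transactions with the confluent-kafka Python client; an offering that only supports core messaging will typically reject these settings or calls.

```python
from confluent_kafka import Producer

# Transactional (exactly-once) producer settings; broker address and IDs are assumptions.
producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,
    "transactional.id": "demo-transactional-producer",
})

# Offerings that only cover the core messaging API often fail at this point.
producer.init_transactions()
producer.begin_transaction()
producer.produce("payments", key="order-42", value='{"amount": 99.90}')
producer.commit_transaction()
```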

The Kafka API Dominates the Event Streaming Landscape

If you look at the current event streaming landscape, you see that more and more frameworks and products adopt the Kafka API. Even though the following is not a complete list (and other non-Kafka offerings exist), it is impressive:

Event Streaming Landscape with Apache Kafka Related Products and Cloud Services including Confluent Cloudera Red Hat IBM Amazon MSK Redpanda Pulsar Azure Event Hubs

If you want to learn more about the different Kafka offerings on the market, check out my Kafka vendor comparison. It is crucial to understand what Kafka offering is right for you. Do you want to focus on business logic and consume the Kafka infrastructure as a service? Or do you want to implement security, integration, monitoring, etc., by yourself?

The Kafka API is here to stay…

The Kafka API became the de facto standard API for event streaming. The usage of an open protocol creates huge benefits for corresponding frameworks, products, and cloud services leveraging the Kafka API.

Vendors can implement against the open standard and validate their implementation. End users can choose the best solution for their business problem. Migration between different Kafka services is also possible relatively easily – as long as each vendor is compliant with the Kafka protocol and implements it completely and correctly.

Are you using the Kafka API today? Open source Kafka (“car engine”), a commercial self-managed offering (“complete car”), or the serverless Confluent Cloud (“self-driving car”) to focus on business problems? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka API is the De Facto Standard API for Event Streaming like Amazon S3 for Object Storage appeared first on Kai Waehner.

]]>
Top 5 Event Streaming Use Cases for 2021 with Apache Kafka https://www.kai-waehner.de/blog/2020/12/16/top-5-event-streaming-apache-kafka-use-cases-2021-edge-hybrid-cloud-cybersecurity-machine-learning-service-mesh/ Wed, 16 Dec 2020 08:16:58 +0000 https://www.kai-waehner.de/?p=2885 TOP 5 Event Streaming Architectures and Use Cases for 2021: Edge deployments, hybrid and multi-cloud architectures, service mesh-based microservices, streaming machine learning, and cybersecurity.

The post Top 5 Event Streaming Use Cases for 2021 with Apache Kafka appeared first on Kai Waehner.

]]>
Apache Kafka and Event Streaming are two of the most relevant buzzwords in tech these days. Ever wonder what the predicted TOP 5 Event Streaming Architectures and Use Cases for 2021 are? Check out the following presentation. Learn about edge deployments, hybrid and multi-cloud architectures, service mesh-based microservices, streaming machine learning, and cybersecurity.
Apache Kafka and Event Streaming Top Use Cases for 2021

After preparing the presentation, I also discovered Gartner’s top strategic technology trends for 2021:

Gartner Top Strategic Technology Trends for 2021

It is really funny (but not surprising) how much these Gartner predictions overlap and complement the five use cases I focus on for Event Streaming and Apache Kafka. The tech industry’s key trends are all about data correlation, real-time processing, analytics, and integration between various systems and technologies—all of that globally and securely. Hence, here you go with the top 5 trends around Apache Kafka for 2021…

Top 5 Apache Kafka Use Cases for 2021

Here are the five use cases I see coming up more and more in conversations with customers across the globe:
  • Edge deployments outside the data center: It’s time to challenge the normality of limited hardware and disconnected infrastructure.
  • Hybrid and multi-cloud architectures: Discover how these span multiple sites across regions, continents, data centers, and clouds with real-time information at scale to connect legacy and modern infrastructures.
  • Service mesh-based microservices architectures: Learn what becomes possible when organizations can provide a cloud-native event-based infrastructure.
  • Streaming machine learning: In 2021, many companies will move streaming machine learning into production without the need for a data lake, enabling scalable real-time analytics.
  • Cybersecurity: While security never goes out of style, in 2021, we will see cybersecurity in real-time at scale with openness and flexibility at its core.

Hybrid and Global Apache Kafka and Event Streaming Use Case

Slides and Video for Event Streaming Use Cases in 2021

Here are the slides and the on-demand video recording for this talk.

Slides

On-Demand Video Recording

Kafka Use Cases Top 5 2021 including Cybersecurity Hybrid Edge Multi Cloud Machine Learning Service Mesh

What are your most relevant and interesting use cases for Event Streaming and Apache Kafka in 2021? What are your strategy and timeline? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Top 5 Event Streaming Use Cases for 2021 with Apache Kafka appeared first on Kai Waehner.

]]>
Streaming Machine Learning with Kafka-native Model Deployment https://www.kai-waehner.de/blog/2020/10/27/streaming-machine-learning-kafka-native-model-server-deployment-rpc-embedded-streams/ Tue, 27 Oct 2020 15:07:50 +0000 https://www.kai-waehner.de/?p=2779 Apache Kafka became the de facto standard for event streaming across the globe and industries. Machine Learning (ML)…

The post Streaming Machine Learning with Kafka-native Model Deployment appeared first on Kai Waehner.

]]>
Apache Kafka became the de facto standard for event streaming across the globe and industries. Machine Learning (ML) includes model training on historical data and model deployment for scoring and predictions. While training is mostly a batch process, scoring usually requires real-time capabilities at scale with high reliability. Apache Kafka plays a key role in modern machine learning infrastructures. The next-generation architecture leverages a Kafka-native streaming model server instead of RPC (HTTP/gRPC) calls:

Kafka-native Model Server for Machine Learning and Model Deployment

This blog post explores the architectures and trade-offs of three options for model deployment with Kafka: embedding the model into the Kafka application, a model server with RPC, and a model server with Kafka-native communication.

Kafka and Machine Learning Architecture

Model deployment is usually completely separated from model training (from the process and the technology perspective). The model training is often executed in elastic cloud infrastructure or data lakes such as Hadoop or Spark. Model scoring (i.e., doing the predictions) is usually a mission-critical workload with different uptime SLAs and latency requirements:

Learn more about this architecture and the relation to modern ML approaches such as Hybrid ML architectures or AutoML in the blog post “Using Apache Kafka to Drive Cutting-Edge Machine Learning“.

Two alternatives exist for model deployment in Kafka infrastructures: the model can either be embedded into the Kafka application, or it can be deployed into a separate model server. The blog post “Model Server and Model Deployment Options in a Kafka Infrastructure” covers the use cases and architectures in detail and explores some code examples.

The following explores both options’ trade-offs and introduces a third option: A Kafka-native streaming model server.

RPC between Kafka App and Model Server

The analytic model is deployed into a model server. The streaming Kafka application does a request-response call to send the input data to the model and to receive the prediction in return:

Kafka Machine Learning with Model Server and HTTP RPC

Almost every ML product or framework provides a model server. This includes open-source frameworks such as TensorFlow and the related model server TF Serving, but also proprietary tools such as SAS, AutoML vendors such as DataRobot, and cloud ML services from all major cloud providers such as AWS, Azure, and GCP.

Pros and Cons of a Model Server with RPC

Trade-offs using Kafka in conjunction with an RPC-based model server and HTTP/gRPC:

  • Simple integration with existing technologies and organizational processes
  • Easiest to understand if you come from a non-streaming world
  • Tight coupling of the availability, scalability, and latency/throughput between application and model server
  • Separation of concerns (e.g. Python model + Java streaming app)
  • Limited scalability and robustness
  • Later migration to real streaming is also possible
  • Model management built-in for different models, versioning, and A/B testing
  • Model monitoring built-in

Example: TensorFlow Serving with GRPC and Kafka Streams

An example of this approach is available on GitHub: “TensorFlow Serving + gRPC + Java + Kafka Streams“. In this case, the TensorFlow model is exported and deployed into a TensorFlow Serving infrastructure.
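The linked example uses gRPC, Java, and Kafka Streams. As a rough sketch of the same RPC pattern (my illustration; topic names, model name, and server address are assumptions), a Python application can consume events from Kafka and call TensorFlow Serving’s REST predict endpoint for each record:

```python
import json
import requests
from confluent_kafka import Consumer, Producer

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "scoring-app", "auto.offset.reset": "earliest"})
consumer.subscribe(["input-features"])  # assumed input topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

TF_SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"  # assumed model name

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    features = json.loads(msg.value())
    # Remote request-response call: the streaming app is coupled to the model server.
    response = requests.post(TF_SERVING_URL, json={"instances": [features]})
    prediction = response.json()["predictions"][0]
    producer.produce("predictions", key=msg.key(), value=json.dumps(prediction))
    producer.flush()
```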

Embedded Model into Kafka Application

An analytic model is just a binary, i.e., a file stored in memory or persistent storage.

The data type differs and depends on the ML framework. But it does not matter if it is Java (e.g., with H2O), Protobuf (e.g., with TensorFlow), or proprietary (with many commercial ML tools).

Any model can be embedded into a Kafka application (if the ML solution provides programmatic APIs – but almost every tool does):

Kafka Machine Learning with Embedded TensorFlow Model

The Kafka application for embedding the model can either be a Kafka-native stream processing engine such as Kafka Streams or ksqlDB, or a “regular” Kafka application using any Kafka client such as Java, Scala, Python, Go, C, C++, etc.
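A rough sketch of the embedded approach in Python (my illustration; the model path, topic names, and feature format are assumptions): the model is loaded once into the streaming application’s memory, and inference happens locally without any remote call.

```python
import json
import numpy as np
import tensorflow as tf
from confluent_kafka import Consumer, Producer

# Load the trained model once at startup; it lives inside the streaming application.
model = tf.keras.models.load_model("models/my_model.keras")  # assumed local path

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "embedded-scoring", "auto.offset.reset": "earliest"})
consumer.subscribe(["input-features"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    features = np.array([json.loads(msg.value())])
    prediction = model.predict(features)  # local inference, no RPC to a model server
    producer.produce("predictions", value=json.dumps(prediction.tolist()))
    producer.flush()
```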

Pros and Cons of Embedding an Analytic Model into a Kafka Application

Trade-offs of embedding analytic models into a Kafka application:

  • Best latency as local inference instead of remote call
  • No coupling of the availability, scalability, and latency/throughput of your Kafka Streams application
  • Offline inference (devices, edge processing, etc.)
  • No side-effects (e.g., in case of failure), all covered by Kafka processing (e.g., exactly once)
  • No built-in model management and monitoring

Example: Kafka Python Application with Embedded TensorFlow Model

A robust and scalable example of the embedded model approach is presented in the GitHub project “Streaming Machine Learning with Kafka, MQTT, and TensorFlow for 100000 Connected Cars“. This demo uses Python for both model training and model deployment in separate, scalable containers in a Kubernetes infrastructure.

Several other (simpler) demos to try out this approach are available here: “Machine Learning + Kafka Streams Examples“. The examples use TensorFlow, H2O.ai, and DeepLearning4J (DL4J) in conjunction with Kafka Streams and ksqlDB.

Kafka-native Streaming Model Server

A Kafka-native streaming model server combines some characteristics from both approaches discussed above. It enables the separation of concerns by providing a model server with all the expected features. But the model server does not use RPC communication via HTTP/gRPC and all the drawbacks this creates for a streaming architecture. Instead, the model server communicates via the native Kafka protocol and Kafka topics with the client application:

Kafka-native Machine Learning Model Server Seldon

Pros and Cons of a Kafka-native Streaming Model Server

Trade-offs of a Kafka-native streaming model server:

  • Good latency via Kafka-native streaming communication
  • Some coupling of the availability, scalability, and latency/throughput of your Kafka Streams application
  • Some side effects (e.g., in case of failure), but most potential issues are covered by Kafka processing (e.g., decoupling and persistence via Kafka topics)
  • Model management built-in for different models, versioning, and A/B testing
  • Model monitoring built-in
  • Separation of concerns (e.g. Python model + Java streaming app)
  • Scalability and robustness of the model server not necessarily Kafka-like (because the underlying implementation is often not Kafka-native yet)

A Kafka-native streaming model server provides many advantages of a streaming architecture and the features of a model server. Just be aware that a Kafka-native interface does NOT mean that the model server itself is implemented with Kafka under the hood. Hence, test your scalability, robustness, and latency requirements to decide if an embedded model might be a better approach.

Example: Streaming Model Deployment with Kafka and Seldon

Seldon is a robust and scalable open-source model server. It allows us to manage, serve, and scale models in any language or framework on Kubernetes.

In mid-2020, Seldon added a key differentiator compared to many other model servers on the market: support for Kafka. Hence, Seldon combines the advantages of a separate model server with the streaming paradigm of Kafka:

Seldon Model Server with Kafka Support using Python scikit-learn and SpaCy

The Jupyter notebook demonstrates this example using scikit-learn, the NLP framework spaCy, Seldon, and Kafka-native integration with producer and consumer applications. Check out the blog post “Real-Time Machine Learning at Scale using SpaCy, Kafka & Seldon Core” for more details.
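From the client application’s point of view, a Kafka-native model server is just another pair of Kafka topics. The following sketch is my illustration (topic names and payload are assumptions, not taken from the Seldon notebook): the application produces the raw input to the model server’s input topic and consumes the prediction from its output topic, with no HTTP or gRPC involved.

```python
import json
from confluent_kafka import Producer, Consumer

BOOTSTRAP = "localhost:9092"

# Send the input record to the model server's input topic (topic name is an assumption).
producer = Producer({"bootstrap.servers": BOOTSTRAP})
producer.produce("model-input", value=json.dumps({"text": "This product is great!"}))
producer.flush()

# Receive the prediction from the model server's output topic via the Kafka protocol.
consumer = Consumer({"bootstrap.servers": BOOTSTRAP,
                     "group.id": "prediction-reader", "auto.offset.reset": "earliest"})
consumer.subscribe(["model-output"])
msg = consumer.poll(30.0)
if msg is not None and msg.error() is None:
    print("Prediction:", json.loads(msg.value()))
consumer.close()
```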

All Model Deployment Options have Trade-Offs in Streaming Machine Learning Architectures

This blog post covered three alternatives for model deployment in a Kafka infrastructure: a model server with RPC, embedding models into the Kafka application, and a Kafka-native model server. All three have their trade-offs. Know them, and evaluate the right choice for your project. The good news is that it is also pretty straightforward to change from one approach to another.

UPDATE May 2021: In the meantime, Dataiku also provides a native Kafka interface (including support for Schema Registry). Great to see different model servers adopting this architecture.

If you want to learn more details about Kafka-native model deployment, check out the following video recording and slide deck from Kafka Summit:

Event-Driven Model Serving - Stream Processing vs RPC with Kafka and TensorFlow

The talk does not cover the “streaming model server” approach (because no model server provided a Kafka-native interface in 2019). But you can still learn a lot about the different architectures and best practices.

If you want to learn more about “Streaming Machine Learning with Kafka – without another Data Lake“, check out the following video recording and slide deck. It explores a simplified architecture and the advantages of Tiered Storage for Kafka:

Apache Kafka Tiered Storage and TensorFlow for Streaming Machine Learning without a Data Lake

What are your experiences with Kafka and model deployment? What are your use cases? Which approach works best for you? What is your strategy? Let’s connect on LinkedIn and discuss it!

The post Streaming Machine Learning with Kafka-native Model Deployment appeared first on Kai Waehner.

]]>
Use Cases and Architectures for Apache Kafka across Industries https://www.kai-waehner.de/blog/2020/10/20/apache-kafka-event-streaming-use-cases-architectures-examples-real-world-across-industries/ Tue, 20 Oct 2020 09:12:08 +0000 https://www.kai-waehner.de/?p=2761 Event Streaming is happening all over the world. This blog post explores real-life examples across industries for use…

The post Use Cases and Architectures for Apache Kafka across Industries appeared first on Kai Waehner.

]]>
Event Streaming is happening all over the world. This blog post explores real-life examples across industries for use cases and architectures leveraging Apache Kafka. Learn about architectures for real-world deployments from Audi, BMW, Disney, Generali, Paypal, Tesla, Unity, Walmart, William Hill, and more. Use cases include fraud detection, mainframe offloading, predictive maintenance, cybersecurity, edge computing, track&trace, live betting, and much more.

Kafka Examples Use Cases and Architectures

The following sections show a few of the use cases and architectures. Check out the slide deck and video recording at the end for all examples and the architectures from the companies mentioned above.

Use Cases for Event Streaming with Apache Kafka

Apache Kafka is an event streaming platform. It provides messaging, persistence, data integration, and data processing capabilities. High scalability for millions of messages per second, high availability including backward-compatibility and rolling upgrades for mission-critical workloads, and cloud-native features are some of the capabilities.

Event Streaming and Event Driven Architecture for a Smart City with Apache Kafka

Hence, the number of different use cases is almost endless. If you learn one thing from the examples in this blog post, then remember that Kafka is not just a messaging system! While data ingestion into a Hadoop data lake was the first prominent use case, this accounts for less than 5% of actual Kafka deployments today. Business applications, streaming ETL middleware, real-time analytics, and edge/hybrid scenarios are some of the other examples:

Use Case Examples for Apache Kafka

 

Examples: SIEM, Streaming Machine Learning, Stateful Stream Processing

The following covers a few architectures and use cases. The presentation afterward goes into much more detail, with examples from various companies about these and other use cases across industries:

Modernized security information and event management (SIEM)

SIEM and cybersecurity are getting more important across industries. Kafka is used as an open and scalable data integration and processing platform, often combined with other SIEM solutions:

Modernized security information and event management (SIEM) with Kafka Splunk QRadar Arcsight
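A simplified sketch of this integration pattern (my illustration; topic names, the event schema, and the severity rule are assumptions): Kafka consumes raw security events, filters and enriches them in real time, and forwards only the curated subset to the topic that the SIEM tool ingests.

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "siem-curation", "auto.offset.reset": "earliest"})
consumer.subscribe(["raw-security-events"])  # assumed topic fed by syslog/firewall connectors
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Forward only high-severity events to the topic the SIEM tool (e.g., Splunk) consumes.
    if event.get("severity", 0) >= 7:
        event["enriched_by"] = "kafka-streaming-pipeline"
        producer.produce("siem-curated-events", value=json.dumps(event))
        producer.flush()
```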

Streaming Model Training without the Need for Another Data Lake

Kafka and Machine Learning are a great combination for data integration, data processing, model training, model deployment, online monitoring, and other ML tasks. The most recent innovation was discussed in detail at the latest Kafka Summit: a simplified Kafka ML architecture for streaming model training without the need for another data lake like HDFS, S3, or Spark:

Streaming Ingestion and Model Training with Kafka without another Data Lake like Hadoop S3 Spark
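As a hedged sketch of what streaming model training can look like (my illustration, not the architecture from the Kafka Summit talk; topic name, record format, and the choice of an online learner are assumptions), a model that supports incremental updates can be trained directly from a Kafka topic instead of first landing the data in HDFS or S3:

```python
import json
import numpy as np
from confluent_kafka import Consumer
from sklearn.linear_model import SGDClassifier

# Linear classifier that supports incremental training via partial_fit.
model = SGDClassifier()
classes = np.array([0, 1])

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "streaming-training", "auto.offset.reset": "earliest"})
consumer.subscribe(["training-events"])  # assumed topic with labeled events

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    X = np.array([record["features"]])
    y = np.array([record["label"]])
    # Update the model incrementally - no intermediate data lake required.
    model.partial_fit(X, y, classes=classes)
```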

Stateful Stream Processing and Streaming Analytics with Kafka Streams / ksqlDB

Data aggregation and correlation at scale in real-time are key concepts for building innovative business applications and adding business value. The following is one example of doing continuous calculations for betting. It includes a synthetic delay to “adjust the live betting odds”:

Stateful Stream Processing and Streaming Analytics with Apache Kafka

While this is a controversial example, it shows the power of stateful stream processing very well. I am sure you already have ideas on how to apply this to your industry.
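In production, this would typically be implemented with Kafka Streams or ksqlDB and fault-tolerant state stores. A toy Python sketch (my illustration; topic names, fields, and the naive in-memory state are assumptions) still shows the core idea of continuously aggregating state per key and publishing the updated result:

```python
import json
from collections import defaultdict
from confluent_kafka import Consumer, Producer

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "betting-aggregator", "auto.offset.reset": "earliest"})
consumer.subscribe(["bet-events"])  # assumed input topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

# Naive in-memory state per match; Kafka Streams would keep this in fault-tolerant state stores.
totals = defaultdict(float)

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    match_id = event["match_id"]
    totals[match_id] += event["stake"]
    # Continuously publish the updated aggregate, e.g., as input for recalculating live odds.
    producer.produce("bet-totals", key=str(match_id),
                     value=json.dumps({"match_id": match_id, "total_stake": totals[match_id]}))
    producer.flush()
```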

Slides – Use Cases and Architectures for Kafka

Here are the slides from my presentation about Kafka examples across industries:

Video Recording – Examples of Kafka Deployments Across Industries

Here is the video recording with all the use cases and examples from various companies across the globe and industries:

Use Cases and Examples for Apache Kafka

Use Cases and Examples for Event Streaming with Apache Kafka Exist in Every Industry

Kafka is used everywhere across industries for event streaming, data processing, data integration, and building business applications / microservices. It is deployed successfully in mission-critical deployments at scale at Silicon Valley tech giants, startups, and traditional enterprises. Scenarios include cloud, multi-cloud, hybrid, and edge infrastructures.

What are your experiences with Apache Kafka and its ecosystem for event streaming? Which use cases and architectures did you deploy? What are your status quo and future strategy? Let’s connect on LinkedIn and discuss it!

The post Use Cases and Architectures for Apache Kafka across Industries appeared first on Kai Waehner.

]]>