Cloud-Native Archives - Kai Waehner https://www.kai-waehner.de/blog/category/cloud-native/

Deployment Options for Apache Kafka: Self-Managed, Fully-Managed / Serverless and BYOC (Bring Your Own Cloud) https://www.kai-waehner.de/blog/2024/09/12/deployment-options-for-apache-kafka-self-managed-fully-managed-serverless-and-byoc-bring-your-own-cloud/ Thu, 12 Sep 2024

BYOC (Bring Your Own Cloud) is an emerging deployment model for organizations that want to maintain greater control over their cloud environments. Unlike traditional SaaS models, BYOC allows businesses to host applications within their own VPCs to provide enhanced data privacy, security, and compliance. The approach leverages existing cloud infrastructure and offers more flexibility for custom configurations, particularly for companies with stringent security needs. In the data streaming sector around Apache Kafka, BYOC is changing how platforms are deployed: organizations get more control and adaptability for various use cases. But it is clearly NOT the right choice for everyone!

Apache Kafka Deployment Options - Serverless vs Self-Managed vs BYOC Bring Your Own Cloud

BYOC (Bring Your Own Cloud) – A New Deployment Model for Cloud Infrastructure

BYOC (Bring Your Own Cloud) is a deployment model where organizations choose their preferred cloud infrastructure to host applications or services, rather than using a serverless / fully managed cloud solution operated by a software vendor, typically known as Software as a Service (SaaS). This model gives businesses the flexibility to leverage their existing cloud services (like AWS, Google Cloud, Microsoft Azure, or Alibaba Cloud) while integrating third-party applications that are compatible with multiple cloud environments.

BYOC helps companies maintain control over their cloud infrastructure, optimize costs, and ensure compliance with security standards. BYOC is typically implemented within an organization’s own cloud VPC. Unlike SaaS models, BYOC offers enhanced privacy and compliance by maintaining control over network architecture and data management.

However, BYOC also has some serious drawbacks! The main challenge is scaling a fleet of co-managed clusters running in customer environments with all the reliability expectations of a cloud service. Confluent has shied away from offering a BYOC deployment model for Apache Kafka based on Confluent Platform because doing BYOC at scale requires a different architecture. WarpStream has built this architecture, with a BYOC-native platform that was designed from the ground up to avoid the pitfalls of traditional BYOC. 

The Data Streaming Landscape

Data Streaming is a separate software category of data platforms. Many software vendors built their entire businesses around this category. The data streaming landscape shows that most vendors use Kafka or implement its protocol because Apache Kafka has become the de facto standard for data streaming.

New software companies have emerged in this category in the last few years. And several mature players in the data market added support for data streaming in their platforms or cloud service ecosystem. Most software vendors use Kafka for their data streaming platforms. However, there is more than Kafka. Some vendors only use the Kafka protocol (Azure Event Hubs) or utterly different APIs (like Amazon Kinesis).

The following Data Streaming Landscape 2024 summarizes the current status of relevant products and cloud services.

Data Streaming Landscape 2024 around Kafka Flink and Cloud

The Data Streaming Landscape evolves. Last year, I added WarpStream as a new entrant into the market. WarpStream uses the Kafka protocol and provides a BYOC offering for Kafka in the cloud. In my next update of the data streaming landscape, I need to do yet another update: WarpStream is now part of Confluent. There are also many other new entrants. Stay tuned for a new “Data Streaming Landscape 2025” in a few weeks (subscribe to my newsletter to stay up-to-date with all the things data streaming).

Confluent Acquisition of WarpStream

Confluent had two product offerings:

  • Confluent Platform: A self-managed data streaming platform powered by Kafka, Flink, and much more that you can deploy everywhere (on-premise data center, public cloud VPC, edge like factory or retail store, and even stretched across multiple regions or clouds).
  • Confluent Cloud: A fully managed data streaming platform powered by Kafka, Flink, and much more that you can leverage as a serverless offering in all major public cloud providers (Amazon AWS, Microsoft Azure, Google Cloud Platform).

Why did Confluent acquire WarpStream? Because many customers requested a third deployment option: BYOC for Apache Kafka.

As Jay Kreps described in the acquisition announcement: “Why add another flavor of streaming? After all, we’ve long offered two major form factors–Confluent Cloud, a fully managed serverless offering, and Confluent Platform, a self-managed software offering–why complicate things? Well, our goal is to make data streaming the central nervous system of every company, and to do that we need to make it something that is a great fit for a vast array of use cases and companies.”

Read more details about the acquisition of WarpStream by Confluent in Jay’s blog post: Confluent + WarpStream = Large-Scale Streaming in your Cloud. In summary, WarpStream is not dead. The WarpStream team clarified the status quo and roadmap of this BYOC product for Kafka in its blog post: “WarpStream is Dead, Long Live WarpStream“.

Let’s dig deeper into the three deployment options and their trade-offs.

Deployment Options for Apache Kafka

Apache Kafka can be deployed in three primary ways: self-managed, fully managed/serverless, and BYOC (Bring Your Own Cloud). From a client application's point of view, all three speak the Kafka protocol, as the sketch after the following list shows.

  • In self-managed deployments, organizations handle the entire infrastructure, including setup, maintenance, and scaling. This provides full control but requires significant operational effort.
  • Fully managed or serverless Kafka is offered by providers like Confluent Cloud or Azure Event Hubs. The service is hosted and managed by a third-party, reducing operational overhead but with limited control over the underlying infrastructure.
  • BYOC deployments allow organizations to host Kafka within their own cloud VPC. BYOC combines some of the benefits of cloud flexibility with enhanced security and control, while it outsources most of Kafka’s management to specialized vendors.
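
Because all three options expose the Kafka protocol, the choice is an operational one rather than a programming one. The following minimal sketch (a hedged example using the confluent-kafka Python client; broker address, credentials, and topic name are placeholders and not taken from any specific offering) shows that only the connection configuration changes between self-managed, fully managed, and BYOC deployments:

```python
from confluent_kafka import Producer

# Only this configuration block differs between the deployment models.
# Self-managed or BYOC: brokers/agents run inside your own VPC or data center.
# Fully managed / serverless: point at the vendor endpoint with SASL credentials.
config = {
    "bootstrap.servers": "broker-1.my-own-vpc.internal:9092",  # placeholder
    # Typical additions for a fully managed cloud service:
    # "security.protocol": "SASL_SSL",
    # "sasl.mechanisms": "PLAIN",
    # "sasl.username": "<api-key>",
    # "sasl.password": "<api-secret>",
}

producer = Producer(config)

# The application logic stays identical, no matter who operates the cluster.
producer.produce("payments", key="order-4711", value=b'{"amount": 42.0}')
producer.flush()
```

The decision is therefore less about APIs and more about who carries the operational burden for the data plane and control plane, which the rest of this post discusses.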

Confluent’s Kafka Products: Self-Managed Platform vs. BYOC vs. Serverless Cloud

Using the example of Confluent’s product offerings, we can see why there are three product categories for data streaming around Apache Kafka.

There is no silver bullet. Each deployment option for Apache Kafka has its pros and cons. The key differences are related to the trade-offs between “ease of management” and “level of control”.

Cloud-Native BYOC for Apache Kafka with WarpStream in the Public Cloud
Source: Confluent

If we go into more detail, we see that different use cases require different configurations, security setups, and levels of control while also focusing on being cost effective and providing the right SLA and latency for each use case.

Trade-Offs of Confluent’s Deployment Options for Apache Kafka

On a high level, you need to figure out if you want to or have to manage the data plane(s) and control plane of your data streaming infrastructure:

Confluent Deployment Types for Apache Kafka On Premise Edge and Public Cloud
Source: Confluent

If you follow my blog, you know that a key focus is exploring various use cases, architectures, and success stories across all industries. Use cases such as log aggregation or IoT sensor analytics require very different deployment characteristics than an instant payment platform or fraud detection and prevention.

Choose the right Kafka deployment model for your use case. Even within one organization, you will probably need different deployments because of security, data privacy, and compliance requirements, but also to stay cost-efficient for high-volume workloads.

BYOC for Apache Kafka with WarpStream

Self-managed Kafka and fully managed Kafka are well understood by now. However, why is BYOC needed as a third option, and how do you do it right?

I have had plenty of customer conversations across industries. The common feedback is that most organizations have a cloud-first strategy, but many also (have to) stay hybrid for security, latency, or cost reasons.

And let’s be clear: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option, as it is much easier to manage and operate, letting teams focus on fast time to market and business innovation. Having said that, sometimes security, cost, or other reasons require BYOC.

How Is BYOC Implemented in WarpStream?

Let’s explore why WarpStream is an excellent option for Kafka as BYOC deployment and when to use it instead of serverless Kafka in the cloud:

  • WarpStream provides BYOC as a single-tenant service, so each customer has its own “instance” of Kafka (to be precise, it implements the Kafka protocol but is not Apache Kafka under the hood).
  • However, under the hood, the system still uses cloud-native serverless systems like Amazon S3 for scalability, cost-efficiency and high availability (but the customer does not see this complexity and does not have to care about it).
  • As a result, the data plane is still customer-managed (which is what these customers need for security or other reasons), but in contrast to self-managed Kafka, the customer does not need to worry about the complexity under the hood (like rebalancing, rolling upgrades, or backups) – that is what S3 and the WarpStream service take care of.
  • The magic is the stateless agents in the customer VPC (see the conceptual sketch after this list). They make the solution scalable and still easy to operate (compared to the self-managed deployment option) while the customer keeps its own instance.
  • Many use cases are around lift and shift of existing Kafka deployments (like self-managed Apache Kafka or another vendor like Kafka as part of Cloudera or Red Hat). Some companies want to “lift and shift” and keep the feeling of control they are used to, while still offloading most of the management to the vendor.
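
To make the “stateless agent plus object storage” idea more tangible, here is a deliberately simplified toy sketch. It is not WarpStream’s actual implementation or API; it only assumes the boto3 library, a hypothetical bucket name, and a naive in-memory batching scheme to illustrate why an agent with zero local disks is easy to scale, replace, and co-manage:

```python
import json
import time
import uuid

import boto3

# Toy illustration of a "zero disks" agent: all durable state lives in object
# storage, so the process itself can be killed, replaced, or scaled at will.
s3 = boto3.client("s3")
BUCKET = "my-streaming-segments"  # hypothetical bucket in the customer account

buffer = []

def append(topic, value):
    """Buffer a record in memory; nothing is ever written to a local disk."""
    buffer.append({"topic": topic, "value": value, "ts": time.time()})

def flush_segment():
    """Write the buffered records as one immutable object and clear the buffer."""
    key = f"segments/{int(time.time())}-{uuid.uuid4().hex}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(buffer).encode("utf-8"))
    buffer.clear()
    return key

append("sensor-readings", {"device": "d-17", "temperature": 21.3})
append("sensor-readings", {"device": "d-18", "temperature": 19.8})
print("wrote segment:", flush_segment())
```

In a real product, the hard part is the metadata and consensus layer that sequences such segments – exactly the responsibility that stays in the vendor-operated control plane, as described in the next section.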

I wrote this summary after reading the excellent article by my colleague Jack Vanlightly: BYOC, Not “The Future Of Cloud Services” But A Pillar Of An Everywhere Platform. The article goes into much more technical detail and is a highly recommended read for any architect and developer.

Benefits of WarpStream’s BYOC Implementation for Kafka

Most vendors have dubious BYOC implementations.

For instance, if the vendor needs to access the customer’s VPC, this raises security concerns and creates headaches around responsibilities in the case of failures.

WarpStream’s BYOC-native implementation differs from other vendors and provides various benefits because of its novel architecture:

  • WarpStream does not need access to the customer VPC. The data plane (i.e., the brokers in the customer VPC) is stateless. The metadata/consensus is handled in the control plane (i.e., the cloud service in the WarpStream VPC).
  • The architecture solves sovereignty challenges and is a great fit for security and compliance requirements.
  • The cost of WarpStream’s BYOC offering is cheaper than self-managed Apache Kafka because it is built with cloud-native concepts and technologies in mind (e.g., zero disks and zero interzone networking fees, leveraging cloud object storage such as Amazon S3).
  • The stateless architecture in the customer VPC makes autoscaling and elasticity very easy to implement/configure.

The Main Drawbacks of BYOC for Apache Kafka

BYOC is an excellent choice if you have specific security, compliance or cost requirements that need this deployment option. However, there are some drawbacks:

  • The latency is worse than with self-managed or serverless Kafka because WarpStream writes directly to Amazon S3 object storage (in contrast to “normal” Kafka).
  • Kafka using BYOC is NOT fully managed like, e.g., Confluent Cloud, so you have to put more effort into operating it. Also, keep in mind that most Kafka cloud services are NOT serverless; they just provision Kafka for you, and you still need to operate it.
  • Additional components of the data streaming platform (such as Kafka Connect connectors and stream processors such as Kafka Streams or Apache Flink) are not part of the BYOC offering (yet). This adds some complexity to operations and development.

Therefore, once again, I recommend only looking at BYOC options for Apache Kafka in the public cloud if a fully managed and serverless data streaming platform does NOT work for you because of cost, security, or compliance reasons!

BYOC Complements Self-Managed and Serverless Apache Kafka – But BYOC Should NOT be the First Choice!

BYOC (Bring Your Own Cloud) offers a flexible and powerful deployment model, particularly beneficial for businesses with specific security or compliance needs. By allowing organizations to manage applications within their own cloud VPCs, BYOC combines the advantages of cloud infrastructure control with the flexibility of third-party service integration.

But once again: If a data streaming project goes to the cloud, fully managed Kafka (and Flink) should always be the first option, as it is much easier to manage and operate, letting teams focus on fast time to market and business innovation. Choose BYOC only if fully managed does not work for you, e.g., because of security requirements.

In the data streaming domain around Apache Kafka, the BYOC model complements existing self-managed and fully managed options. It offers a middle ground that balances ease of operation with enhanced privacy and security. Ultimately, BYOC helps companies tailor their cloud environments to meet diverse and developing business requirements.

What is your deployment option for Apache Kafka? A self-managed deployment in the data center or at the edge? Serverless Cloud with a service such as Confluent Cloud? Or did you (have to) choose BYOC? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The State of Data Streaming for the Public Sector https://www.kai-waehner.de/blog/2023/08/02/the-state-of-data-streaming-for-the-public-sector-in-2023/ Wed, 02 Aug 2023

This blog post explores the state of data streaming for the public sector. The evolution of government digitalization, citizen expectations, and cybersecurity risks requires optimized end-to-end visibility into information, comfortable mobile apps, and integration with legacy platforms like the mainframe in conjunction with pioneering technologies like social media. Data streaming provides consistency across all layers and allows integrating and correlating data in real-time at any scale. I look at public sector trends to explore how data streaming with Apache Kafka helps as a business enabler, including customer stories from the US Department of Defense (DoD), NASA, Deutsche Bahn (German Railway), and others. A complete slide deck and on-demand video recording are included.

The State of Data Streaming for the Public Sector in 2023

The public sector covers many different areas. Examples include defense, law enforcement, national security, healthcare, public administration, police, judiciary, finance and tax, research, aerospace, agriculture, etc. Many of these terms and sectors overlap, and many of the use cases apply across several of them.

Several disruptive trends impact innovation in the public sector to automate processes, provide a better experience for citizens, and strengthen cybersecurity defense tactics.

The two critical pillars across departments in the public sector are IT modernization and data-driven applications.

IT modernization in the government

The research company Gartner identified the following technology trends for the government to accelerate the digital transformation as they prepare for post-digital government:

Gartner Top Technology Trends in Government for 2023

These trends do not differ much from those at traditional companies in the private sector, like banking or insurance. Data consistency across monolithic legacy infrastructure and cloud-native applications matters.

Accelerating data maturity in the public sector

The public sector is often still very slow in innovation. Time-to-market is crucial. IT modernization requires up-to-date technologies and development principles. Data sharing across applications, departments, or states requires a data-driven enterprise architecture.

McKinsey & Company says “Government entities have created real-time pandemic dashboards, conducted geospatial mapping for drawing new public transportation routes, and analyzed public sentiment to inform economic recovery investment.

While many of these examples were born out of necessity, public-sector agencies are now embracing the impact that data-driven decision making can have on residents, employees, and other agencies. Embedding data and analytics at the core of operations can help optimize government resources by targeting them more effectively and enable civil servants to focus their efforts on activities that deliver the greatest results.”

McKinsey and Company - Accelerating Data Maturity in the Government

AI and Machine Learning help with automation. Chatbots and other conversational AI improve the total experience of citizens and public sector employees.

Data streaming in the government and public sector

Real-time data beats slow data in almost all use cases. No matter which agency or department you look at in the government and public sector:

Real-Time Data Streaming in the Government and Public Sector

Data streaming combines the power of real-time messaging at any scale with storage for true decoupling, data integration, and data correlation capabilities. Apache Kafka is the de facto standard for data streaming.

Check out the links below for a broad spectrum of examples and best practices. Additionally, here are a few new customer stories from the last few months.

New customer stories for data streaming in the public sector and government

So much innovation is happening worldwide, even in the “slow” public sector. Automation and digitalization change how we search and buy products and services, communicate with partners and customers, provide hybrid shopping models, and more.

More and more governments and non-profit organizations use a cloud-first approach to improve time-to-market, increase flexibility, and focus on business logic instead of operating IT infrastructure.

Here are a few customer stories from worldwide organizations in the public sector and government:

  • University of California, San Diego: Integration Platform as a Service (iPaaS) as “Swiss army knife” of integration.
  • U.S. Citizenship and Immigration Services (USCIS): Real-time inter-agency data sharing.
  • Deutsche Bahn (German Railway): Customer data platform for real-time notification about delays and cancellations, plus B2B integration with Google Maps.
  • NASA: General Coordinates Network (GCN) for multi-messenger astronomy alerts between space- and ground-based observatories, physics experiments, and thousands of astronomers worldwide.
  • US Department of Defense (DOD): Joint All Domain Command and Control (JADC2), a strategic warfighting concept that connects the data sensors, shooters, and related communications devices of all U.S. military services. DOD uses the ride-sharing service Uber as an analogy to describe its desired end state for JADC2 leveraging data streaming.

Resources to learn more

This blog post is just the starting point.

I wrote a blog series exploring why many governments and public infrastructure sectors leverage data streaming for various use cases. Learn about real-world deployments and different architectures for data streaming with Apache Kafka in the public sector:

  1. Life is a Stream of Events
  2. Smart City
  3. Citizen Services
  4. Energy and Utilities
  5. National Security

Learn more about data streaming for the government and public sector in the following on-demand webinar recording, the related slide deck, and further resources, including pretty cool lightboard videos about use cases. I presented together with Will La Forest, my colleague and subject matter expert for the public sector and governments.

On-demand video recording

The video recording explores public sector trends and architectures for data streaming leveraging Apache Kafka and other modern and cloud-native technologies. The primary focus is the data streaming case studies. Check out our on-demand recording:

The State of Data Streaming for Public Sector and Government in 2023

Slides

If you prefer learning from slides, check out the deck used for the above recording:


Case studies and lightboard videos for data streaming in the public sector and government

The state of data streaming for the public sector in 2023 is fascinating. New use cases and case studies come up every month. Mission-critical deployments at governments in the United States and Germany prove the maturity of data streaming concerning security and data privacy. The success stories prove better data governance across the entire organization, secure data collection and processing in real-time, data sharing and cross-agency partnerships with Open APIs for new business models, and many more scenarios.

We recorded lightboard videos showing the value of data streaming simply and effectively. These five-minute videos explore the business value of data streaming, related architectures, and customer stories. Stay tuned; I will update the links in the next few weeks and publish a separate blog post for each story and lightboard video.

And this is just the beginning. Every month, we will talk about the status of data streaming in a different industry. Manufacturing was the first. Financial services second, then retail, telcos, gaming, and so on…

Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The State of Data Streaming for Digital Natives in 2023 https://www.kai-waehner.de/blog/2023/07/18/the-state-of-data-streaming-for-digital-natives-in-2023/ Tue, 18 Jul 2023

This blog post explores the state of data streaming in 2023 for digital natives born in the cloud. The evolution of digital services and new business models requires real-time end-to-end visibility, fancy mobile apps, and integration with pioneering technologies like fully managed cloud services for fast time-to-market, 5G for low latency, or augmented reality for innovation. Data streaming allows integrating and correlating data in real-time at any scale to improve the most innovative applications leveraging Apache Kafka.

I look at trends for digital natives to explore how data streaming helps as a business enabler, including customer stories from New Relic, Wix, Expedia, Apna, Grab, and more. A complete slide deck and on-demand video recording are included.

The State of Data Streaming for Digital Natives in 2023

Digital natives are data-driven tech companies born in the cloud. Their SaaS solutions are built on cloud-native infrastructure that provides elastic and flexible operations and scale. AI and Machine Learning improve business processes while the data flows through the backend systems.

The data-driven enterprise in 2023

McKinsey & Company published an excellent article on seven characteristics that will define the data-driven enterprise:

1) Data embedded in every decision, interaction, and process
2) Data is processed and delivered in real-time
3) Flexible data stores enable integrated, ready-to-use data
4) Data operating model treats data like a product
5) The chief data officer’s role is expanded to generate value
6) Data-ecosystem memberships are the norm
7) Data management is prioritized and automated for privacy, security, and resiliency

This quote from McKinsey & Company precisely maps to the value of data streaming for using data at the right time in the right context. The below success stories are all data-driven, leveraging these characteristics.

Digital natives born in the cloud

A digital native enterprise can have many meanings. IDC has a great definition:

“IDC defines Digital native businesses (DNBs) as companies built based on modern, cloud-native technologies, leveraging data and AI across all aspects of their operations, from logistics to business models to customer engagement. All core value or revenue-generating processes are dependent on digital technology.”

These companies are born in the cloud, leverage fully managed services, and consequently innovate with a fast time to market.

AI and machine learning (beyond the buzz)

“ChatGPT, while cool, is just the beginning; enterprise uses for generative AI are far more sophisticated,” says Gartner. I could not agree more. But even more interesting: Machine Learning (the part of AI that is enterprise-ready) is already used in many companies.

While everybody talks about Generative AI (GenAI) these days, I prefer talking about real-world success stories that have leveraged analytic models for many years to detect fraud, upsell to customers, or predict machine failures. GenAI is “just” another, more advanced kind of model that you can embed into your IT infrastructure and business processes in the same way.

Data streaming at digital native tech companies

Adopting trends across industries is only possible if enterprises can provide and correlate information correctly in the proper context. Real-time, which means using the information in milliseconds, seconds, or minutes, is almost always better than processing data later (whatever later means):

Real-time Data beats Slow Data in Retail

Digital natives combine all the power of data streaming: Real-time messaging at any scale with storage for true decoupling, data integration, and data correlation capabilities. Apache Kafka is the de facto standard for data streaming.

Data streaming with the Apache Kafka ecosystem and cloud services is used throughout the supply chain of any industry. Here are just a few examples:

Digital Natives Born in the Cloud with Data Streaming

Search my blog for various articles related to this topic in your industry: Search Kai’s blog.

Elastic scale with cloud-native infrastructure

One of the most significant benefits of cloud-native SaaS offerings is elastic scalability out of the box. Tech companies can start new projects with a small footprint and pay-as-you-go. If the project is successful or industry peaks come (like Black Friday or the Christmas season in retail), the cloud-native infrastructure scales up, and back down after the peak:

Elastic Scale with Cloud-native Data Streaming

There is no need to change the architecture from a proof of concept to an extreme scale. Confluent’s fully managed SaaS for Apache Kafka is an excellent example. Learn how to scale Apache Kafka to 10+ GB per second in Confluent Cloud without the need to re-architect your applications.

Data streaming + AI / machine learning = real-time intelligence

The combination of data streaming with Kafka and machine learning with TensorFlow or other ML frameworks is nothing new. I explored how to “Build and Deploy Scalable Machine Learning in Production with Apache Kafka” in 2017, i.e., six years ago.

Since then, I have written many further articles and supported various enterprises deploying data streaming and machine learning. Here is an example of such an architecture:

Apache Kafka and Machine Learning - Kappa Architecture

Data mesh for decoupling, flexibility and focus on data products

Digital natives don’t (have to) rely on monolithic, proprietary, and inflexible legacy infrastructure. Instead, tech companies start from scratch with modern architecture. Domain-driven design and microservices are combined in a data mesh, where business units focus on solving business problems with data products:

Cluster Linking for data replication with the Kafka protocol

Digital natives leverage trends for enterprise architectures to improve cost, flexibility, security, and latency. Four essential topics I see these days at tech companies are:

  • Decentralization with a data mesh
  • Kappa architecture replacing Lambda
  • Global data streaming
  • AI/Machine Learning with data streaming

Let’s look deeper into some enterprise architectures that leverage data streaming.

Decentralization with a data mesh

There is no single technology or product for a data mesh!  But the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable.

Data streaming with Apache Kafka is the perfect foundation for a data mesh: Dumb pipes and smart endpoints truly decouple independent applications. This domain-driven design allows teams to focus on data products:

Domain Driven Design DDD with Kafka for Industrial IoT MQTT and OPC UA

Contrary to a data lake or data warehouse, the data streaming platform is real-time, scalable, and reliable – a unique advantage for building a decentralized data mesh.

Kappa architecture replacing Lambda

The Kappa architecture is an event-based software architecture that can handle all data at any scale in real-time for transactional AND analytical workloads.

The central premise behind the Kappa architecture is that you can perform real-time and batch processing with a single technology stack. The heart of the infrastructure is a streaming architecture.

Kappa Architecture with one Pipeline for Real Time and Batch

Unlike with the Lambda architecture, in this approach you only re-process data when your processing code changes and you need to recompute your results.
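
As a minimal sketch of what “re-processing” means in a Kappa architecture (assuming the confluent-kafka Python client; topic name, group id, and broker address are placeholders): deploying changed processing code under a fresh consumer group simply replays the same immutable log from the beginning, so no separate batch layer is needed.

```python
from confluent_kafka import Consumer

# Kappa-style reprocessing: a new consumer group with changed logic re-reads
# the same append-only log from the start to recompute its results.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "group.id": "pageview-aggregator-v2",    # new group id => fresh offsets
    "auto.offset.reset": "earliest",         # replay the topic from the beginning
})
consumer.subscribe(["pageviews"])            # placeholder topic

counts = {}
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    user = msg.key().decode("utf-8") if msg.key() else "anonymous"
    counts[user] = counts.get(user, 0) + 1   # the "new" processing logic
```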

I wrote a detailed article with several real-world case studies exploring why “Kappa Architecture is Mainstream Replacing Lambda“.

Global data streaming

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception.

Hybrid and Global Apache Kafka and Event Streaming Use Case

Several scenarios require multi-cluster Kafka deployments with specific requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments, and global Kafka.

Learn all the details in my article “Architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments“.

Natural language processing (NLP) with data streaming for real-time Generative AI (GenAI)

Natural Language Processing (NLP) helps many projects in the real world for service desk automation, customer conversation with a chatbot, content moderation in social networks, and many other use cases. Generative AI (GenAI) is “just” the latest generation of these analytic models. Many enterprises have combined NLP with data streaming for many years for real-time business processes.

Apache Kafka became the predominant orchestration layer in these machine learning platforms for integrating various data sources, processing at scale, and real-time model inference.

Here is an architecture that shows how teams easily add Generative AI and other machine learning models (like large language models, LLM) to their existing data streaming architecture:

Chatbot with Apache Kafka and Machine Learning for Generative AI
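
A hedged sketch of this pattern (again with the confluent-kafka Python client; broker address and topic names are placeholders, and call_llm() is a hypothetical stand-in for whatever hosted LLM API or self-hosted model a team actually uses): the chatbot service is just another consumer/producer pair plugged into the existing event flow.

```python
from confluent_kafka import Consumer, Producer

# Request/response chatbot as a stream processor: consume user prompts,
# call a model, and publish the answers to a response topic.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder
    "group.id": "chatbot-service",
    "auto.offset.reset": "latest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["chat-requests"])       # placeholder topic

def call_llm(prompt):
    """Hypothetical placeholder for a real model inference call."""
    return "echo: " + prompt

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    answer = call_llm(msg.value().decode("utf-8"))
    producer.produce("chat-responses", key=msg.key(), value=answer.encode("utf-8"))
    producer.flush()
```

Because the model sits behind a topic, it can be swapped or upgraded without touching the upstream producers or downstream consumers.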

Time to market is critical. AI does not require a completely new enterprise architecture. True decoupling allows adding new applications and technologies and embedding them into existing business processes.

An excellent example is Expedia: The online travel company added a chatbot to the existing call center scenario to reduce costs, increase response times, and make customers happier. Learn more about Expedia and other case studies in my blog post “Apache Kafka for Conversational AI, NLP, and Generative AI“.

New customer stories of digital natives using data streaming

So much innovation is happening with data streaming. Digital natives lead the race. Automation and digitalization change how tech companies create entirely new business models.

Most digital natives use a cloud-first approach to improve time-to-market, increase flexibility, and focus on business logic instead of operating IT infrastructure. And elastic scalability gets even more critical when you start small but think big and global from the beginning.

Here are a few customer stories from digital natives around the world:

  • New Relic: Observability platform ingesting up to 7 billion data points per minute for real-time and historical analysis.
  • Wix: Web development services with online drag & drop tools built with a global data mesh.
  • Apna: India’s largest hiring platform powered by AI to match client needs with applications.
  • Expedia: Online travel platform leveraging data streaming for a conversational chatbot service incorporating complex technologies such as fulfillment, natural-language understanding, and real-time analytics.
  • Alex Bank: A 100% digital and cloud-native bank using real-time data to enable a new digital banking experience.
  • Grab: Asian mobility service that built a cybersecurity platform for monitoring 130M+ devices and generating 20M+ AI-driven risk verdicts daily.

Resources to learn more

This blog post is just the starting point. Learn more about data streaming and digital natives in the following on-demand webinar recording, the related slide deck, and further resources, including pretty cool lightboard videos about use cases.

On-demand video recording

The video recording explores trends and architectures for data streaming at digital natives. The primary focus is the data streaming case studies. Check out our on-demand recording:

The State of Data Streaming for Digital Natives (Video)

Slides

If you prefer learning from slides, check out the deck used for the above recording:


Data streaming case studies and lightboard videos of digital natives

The state of data streaming for digital natives in 2023 is fascinating. New use cases and case studies come up every month. This includes better data governance across the entire organization, real-time data collection and processing data from network infrastructure and mobile apps, data sharing and B2B partnerships with new business models, and many more scenarios.

We recorded lightboard videos showing the value of data streaming simply and effectively. These five-minute videos explore the business value of data streaming, related architectures, and customer stories. Stay tuned; I will update the links in the next few weeks and publish a separate blog post for each story and lightboard video.

And this is just the beginning. Every month, we will talk about the status of data streaming in a different industry. Manufacturing was the first. Financial services second, then retail, telcos, digital natives, gaming, and so on…

Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

The State of Data Streaming for Telco https://www.kai-waehner.de/blog/2023/06/02/the-state-of-data-streaming-for-telco-in-2023/ Fri, 02 Jun 2023

This blog post explores the state of data streaming for the telco industry. The evolution of telco infrastructure, customer services, and new business models requires real-time end-to-end visibility, fancy mobile apps, and integration with pioneering technologies like 5G for low latency or augmented reality for innovation. Data streaming allows integrating and correlating data in real-time at any scale to improve most telco workloads.

I look at trends in the telecommunications sector to explore how data streaming helps as a business enabler, including customer stories from Dish Network, British Telecom, Globe Telecom, Swisscom, and more. A complete slide deck and on-demand video recording are included.

The State of Data Streaming for Telco in 2023

The Telco industry is fundamental for growth and innovation across all industries.

The global spending on telecom services is expected to reach 1.595 trillion U.S. dollars by 2024 (Source: Statista, Jul 2022).

Cloud-native infrastructure and digitalization of business processes are critical enablers. 5G network capabilities and telco marketplaces enable entirely new business models.

5G enables new business models

Presentation by Amdocs / Mavenir:

5G Use Cases with Amdocs and Mavenir

A report from McKinsey & Company says, “74 percent of customers have a positive or neutral feeling about their operators offering different speeds to mobile users with different needs”. The potential for increasing the average revenue per user (ARPU) with 5G use cases is enormous for telcos:

Potential from 5G monetization

Telco marketplace

Many companies across industries are trying to build a marketplace these days. But especially the telecom sector might shine here because of its interface between infrastructure, B2B, partners, and end users for sales and marketing.

tmforum has a few good arguments for why communication service providers (CSP) should build a marketplace for B2C and B2B2X:

  • Operating the marketplace keeps CSPs in control of the relationship with customers
  • A marketplace is a great sales channel for additional revenue
  • Operating the marketplace helps CSPs monetize third-party (over-the-top) content
  • The only other option is to be relegated to connectivity provider
  • Enterprise customers have decided this is their preferred method of engagement
  • CSPs can take a cut of all sales
  • Participating in a marketplace prevents any one company from owning the customer

Data streaming in the telco industry

Adopting trends like network monitoring, personalized sales and cybersecurity is only possible if enterprises in the telco industry can provide and correlate information at the right time in the proper context. Real-time, which means using the information in milliseconds, seconds, or minutes, is almost always better than processing data later (whatever later means):

Real-Time Data Streaming in the Telco Industry

Data streaming combines the power of real-time messaging at any scale with storage for true decoupling, data integration, and data correlation capabilities. Apache Kafka is the de facto standard for data streaming.

“Use Cases for Apache Kafka in Telco” is a good article to start with for an industry-specific point of view on data streaming. “Apache Kafka for Telco-OTT and Media Applications” explores over-the-top B2B scenarios.

Data streaming with the Apache Kafka ecosystem and cloud services is used throughout the supply chain of the telco industry. Search my blog for various articles related to this topic: Search Kai’s blog.

From Telco to TechCo: Next-generation architecture

Deloitte describes the target architecture for telcos very well:

Requirements for the next generation telco architecture

Data streaming provides these characteristics: Open, scalable, reliable, and real-time. This unique combination of capabilities made Apache Kafka so successful and widely adopted.

Kafka decouples applications and is the perfect technology for microservices across a telco’s enterprise architecture. Deloitte’s diagram shows this transition across the entire telecom sector:

Cloud-native Microservices and Data Mesh in the Telecom Sector

This is a massive shift for telcos:

  • From purpose-built hardware to generic hardware and elastic scale
  • From monoliths to decoupled, independent services

Digitalization with modern concepts helps a lot in designing the future of telcos.

Open Digital Architecture (ODA)

tmforum describes Open Digital Architecture (ODA) as follows:

“Open Digital Architecture is a standardized cloud-native enterprise architecture blueprint for all elements of the industry from Communication Service Providers (CSPs), through vendors to system integrators. It accelerates the delivery of next-gen connectivity and beyond – unlocking agility, removing barriers to partnering, and accelerating concept-to-cash.

ODA replaces traditional operations and business support systems (OSS/BSS) with a new approach to building software for the telecoms industry, opening a market for standardized, cloud-native software components, and enabling communication service providers and suppliers to invest in IT for new and differentiated services instead of maintenance and integration.”

Open Data Architecture ODA tmforum

If you look at the architecture trends and customer stories for data streaming in the next section, you realize that real-time data integration and processing at scale is required to provide most modern use cases in the telecommunications industry.

The telco industry applies various trends for enterprise architectures for cost, flexibility, security, and latency reasons. The three major topics I see these days at customers are:

  • Hybrid architectures with synchronization between edge and cloud in real-time
  • End-to-end network and infrastructure monitoring across multiple layers
  • Proactive service management and context-specific customer interactions

Let’s look deeper into some enterprise architectures that leverage data streaming for telco use cases.

Hybrid 5G architecture with data streaming

Most telcos have a cloud-first strategy to set up modern infrastructure for network monitoring, sales and marketing, loyalty, innovative new OTT services, etc. However, edge computing gets more relevant for use cases like pre-processing for cost reduction, innovative location-based 5G services, and other real-time analytics scenarios:

Hybrid 5G Telco Infrastructure with Data Streaming

Learn about architecture patterns for Apache Kafka that may require multi-cluster solutions and see real-world examples with their specific requirements and trade-offs. That blog explores scenarios such as disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments, and global Kafka.

Edge deployments for data streaming come with their own challenges. In separate blog posts, I covered use cases for Kafka at the edge and provided an infrastructure checklist for edge data streaming.

End-to-end network and infrastructure monitoring

Data streaming enables unifying telemetry data from various sources such as Syslog, TCP, files, REST, and other proprietary application interfaces:

Telemetry Network Monitoring with Data Streaming
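
As a simplified sketch of one such ingestion path (assuming the confluent-kafka Python client; log file path, topic name, and broker address are placeholders, and in practice this job is typically handled by Kafka Connect or a log shipper rather than a custom script), a small forwarder can tail a syslog file and publish each line for central, real-time analysis:

```python
import time

from confluent_kafka import Producer

# Tail a syslog file and forward every new line to a Kafka topic.
producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder

with open("/var/log/syslog", "r") as log:  # placeholder source
    log.seek(0, 2)  # jump to the end of the file, like `tail -f`
    while True:
        line = log.readline()
        if not line:
            time.sleep(0.5)       # wait for new telemetry
            continue
        producer.produce("network-telemetry", value=line.strip().encode("utf-8"))
        producer.poll(0)          # serve delivery callbacks without blocking
```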

End-to-end visibility into telco networks enables massive cost reductions and, as a bonus, a better customer experience. For instance, proactive service management informs customers about a network outage:

Proactive Service Management across OSS and BSS

Context-specific sales and digital lifestyle services

Customers expect a great customer experience across devices (like a web browser or mobile app) and human interactions (e.g., in a telco store). Data streaming enables a context-specific omnichannel sales experience by correlating real-time and historical data at the right time in the proper context:

Omnichannel Retail in the Telco Industry with Data Streaming

“Omnichannel Retail and Customer 360 in Real Time with Apache Kafka” goes into more detail. But one thing is clear: Most innovative use cases require both historical and real-time data. In summary, correlating historical and real-time information is possible with data streaming out-of-the-box because of the underlying append-only commit log and replayability of events. A cloud-native Tiered Storage Kafka infrastructure to separate compute from storage makes such an enterprise architecture more scalable and cost-efficient.
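
To show what this replayability looks like in code, here is a hedged sketch (confluent-kafka Python client; topic name, partition count, and broker address are placeholders): a consumer rewinds to the offsets of 24 hours ago via offsets_for_times() and then reads forward, combining historical context with the live stream from the same log.

```python
import time

from confluent_kafka import Consumer, TopicPartition

# Rewind a consumer 24 hours and read forward: historical and real-time data
# come from the same append-only log, without a separate batch system.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "group.id": "customer-360-backfill",
    "auto.offset.reset": "earliest",
})

topic = "customer-interactions"              # placeholder topic with 3 partitions
yesterday_ms = int((time.time() - 24 * 3600) * 1000)
wanted = [TopicPartition(topic, p, yesterday_ms) for p in range(3)]

# Translate the timestamp into concrete offsets and start reading from there.
offsets = consumer.offsets_for_times(wanted, timeout=10)
consumer.assign(offsets)

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    # Correlate each (historical or live) event with other context here.
    print(msg.topic(), msg.partition(), msg.offset(), msg.value())
```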

The article “Fraud Detection with Apache Kafka, KSQL and Apache Flink” explores stream processing for real-time analytics in more detail, shows an example with embedded machine learning, and covers several real-world case studies.

New customer stories for data streaming in the telco industry

So much innovation is happening in the telecom sector. Automation and digitalization change how telcos monitor networks, build customer relationships, and create completely new business models.

Most telecommunication service providers use a cloud-first approach to improve time-to-market, increase flexibility, and focus on business logic instead of operating IT infrastructure. And elastic scalability gets even more critical with all the growing networks and 5G workloads.

Here are a few customer stories from worldwide telecom companies:

  • Dish Network: Cloud-native 5G Network with Kafka as the central communications hub between all the infrastructure interfaces and IT applications. The standalone 5G infrastructure in conjunction with data streaming enables new business models for customers across all industries, like retail, automotive, or energy sector.
  • Verizon: MEC use cases for low-latency 5G stream processing, such as autonomous drone-in-a-box-based monitoring and inspection solutions or vehicle-to-Everything (V2X).
  • Swisscom: Network monitoring and incident management with real-time data at scale to inform customers about outages, root cause analysis, and much more. The solution relies on Apache Kafka and Apache Druid for real-time analytics use cases.
  • British Telecom (BT): Hybrid multi-cloud data streaming architecture for proactive service management. BT extracts more value from its data and prioritizes real-time information and better customer experiences.
  • Globe Telecom: Industrialization of event streaming for various use cases. Two examples: Digital personalized rewards points based on customer purchases. Airtime loans are made easier to operationalize (vs. batch, where top-up cash is already spent again).

Resources to learn more

This blog post is just the starting point. Learn more about data streaming in the telco industry in the following on-demand webinar recording, the related slide deck, and further resources, including pretty cool lightboard videos about use cases.

On-demand video recording

The video recording explores the telecom industry’s trends and architectures for data streaming. The primary focus is the data streaming case studies. Check out our on-demand recording:

The State of Data Streaming for Telco in 2023

Slides

If you prefer learning from slides, check out the deck used for the above recording:


Case studies and lightboard videos for data streaming in telco

The state of data streaming for telco is fascinating. New use cases and case studies come up every month. This includes better data governance across the entire organization, real-time data collection and processing data from network infrastructure and mobile apps, data sharing and B2B partnerships with OTT players for new business models, and many more scenarios.

We recorded lightboard videos showing the value of data streaming simply and effectively. These five-minute videos explore the business value of data streaming, related architectures, and customer stories. Stay tuned; I will update the links in the next few weeks and publish a separate blog post for each story and lightboard video.

And this is just the beginning. Every month, we will talk about the status of data streaming in a different industry. Manufacturing was the first. Financial services second, then retail, telcos, gaming, and so on…

Let’s connect on LinkedIn and discuss it! Join the data streaming community and stay informed about new blog posts by subscribing to my newsletter.

When to choose Redpanda instead of Apache Kafka? https://www.kai-waehner.de/blog/2022/11/16/when-to-choose-redpanda-instead-of-apache-kafka/ Wed, 16 Nov 2022

Data streaming emerged as a new software category. It complements traditional middleware, data warehouse, and data lakes. Apache Kafka became the de facto standard. New players enter the market because of Kafka’s success. One of those is Redpanda, a lightweight Kafka-compatible C++ implementation. This blog post explores the differences between Apache Kafka and Redpanda, when to choose which framework, and how the Kafka ecosystem, licensing, and community adoption impact a proper evaluation.

Apache Kafka vs Redpanda Comparison

Disclaimer: I work for Confluent. However, the post is not about comparing features but explaining the concepts behind the alternatives of using Apache Kafka (and related products, including Confluent) or Redpanda. I talk to enterprises across the globe every week. Below, I summarize common misunderstandings or missing knowledge about both technologies. I hope it helps you to make the right decision. Either choose to run open-source Apache Kafka, one of the various commercial Kafka offerings or cloud services, or Redpanda. All are great options with pros and cons…

Data streaming: A new software category

Data-driven applications are the new black. As part of this, data streaming is a new software category. If you don’t understand yet how it differs from other data management platforms like a data warehouse or data lake, check out the following blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

And if you wonder how Apache Kafka differs from other middleware, check out how Kafka compares with ETL, ESB, and iPaaS.

Apache Kafka: The de facto standard for data streaming

Apache Kafka became the de facto standard for data streaming, similar to how Amazon S3 is the de facto standard for object storage. Kafka is used across industries for many use cases.

The adoption curve of Apache Kafka

The growth of the Apache Kafka community in the last years is impressive:

  • >100,000 organizations using Apache Kafka
  • >41,000 Kafka Meetup attendees
  • >32,000 Stack Overflow Questions
  • >12,000 Jiras for Apache Kafka
  • >31,000 Open Job Listings Request Kafka Skills

And look at the increased number of active monthly unique users downloading the Kafka Java client library with Maven:

Sonatype Maven Kafka Client Downloads
Source: Sonatype

The numbers grow exponentially. That’s no surprise to me as the adoption pattern and maturity curve for Kafka are similar in most companies:

  1. Start with one or few use cases (that prove the business value quickly)
  2. Deploy the first applications to production and operate them 24/7
  3. Tap into the data streams from many domains, business units, and technologies
  4. Move to a strategic central nervous system with a decentralized data hub

Kafka use cases by business value across industries

The main reason for the incredible growth of Kafka’s adoption curve is the variety of potential use cases for data streaming. The potential is almost endless. Kafka’s characteristics of combining low latency, scalability, reliability, and true decoupling establish benefits across all industries and use cases:

Use Cases for Data Streaming by Business Value

Search my blog for your favorite industry to find plenty of case studies and architectures. Or to get started, read about use cases for Apache Kafka across industries.

The emergence of many Kafka vendors

The market for data streaming is enormous. With so many potential use cases, it is no surprise that more and more software vendors add Kafka support to their products. Most vendors use Kafka or implement its protocol because Kafka has become the de facto standard for data streaming.

Learn more about the various data streaming vendors in the following blog posts:

To be clear: An increasing number of Kafka vendors is a great thing! It proves the creation of a new software category. Competition pushes innovation. The market share is big enough for many vendors. And I am 100% convinced that we are still in a very early stage of the data streaming hype cycle…

After a lengthy introduction to set the context, let’s now review a new entrant into the Kafka market: Redpanda…

Introducing Redpanda: Kafka-compatible data streaming

Redpanda is a data streaming platform. Its website explains its positioning in the market and product strategy as follows (to differentiate it from Apache Kafka):

  • No Java: A JVM-free and ZooKeeper-free infrastructure.
  • Designed in C++: Designed for a better performance than Apache Kafka.
  • A single-binary architecture: No dependencies to other libraries or nodes.
  • Self-managing and self-healing: A simple but scalable architecture for on-premise and cloud deployments.
  • Kafka-compatible: Out-of-the-box support for the Kafka protocol with existing applications, tools, and integrations.

This sounds great. You need to evaluate whether Redpanda is the right choice for your next project or if you should stick with “real Apache Kafka”.
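
To see what Kafka protocol compatibility means in practice: an existing Kafka client application typically only needs a different bootstrap.servers endpoint to talk to any broker that speaks the Kafka protocol, whether that is Apache Kafka or a re-implementation. Here is a minimal sketch with the standard Apache Kafka Java producer; the broker address and topic name are placeholders, not a real deployment:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompatibilityCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The only line that changes between clusters: point it at Apache Kafka,
        // Redpanda, or any other endpoint that speaks the Kafka protocol.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker.example.com:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "hello")); // hypothetical topic
            producer.flush();
        }
    }
}
```

The question your evaluation has to answer is whether the broker behind that endpoint behaves the same way for your workloads, not just whether the client connects.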

How to choose the proper “Kafka” implementation for your project?

A recommendation that some people find surprising: Qualify out first! That’s much easier. It is the same approach I described for deciding when NOT to use Apache Kafka.

As part of the evaluation, ask whether the Kafka protocol is even the right fit for you. If it is, pick a few offerings and start the comparison.

Start your evaluation with the business case requirements and define your most critical needs like uptime SLAs, disaster recovery strategy, enterprise support, operations tooling, self-managed vs. fully-managed cloud service, capabilities like messaging vs. data ingestion vs. data integration vs. applications, and so on. Based on your use cases and requirements, you can start qualifying out vendors like Confluent, Redpanda, Cloudera, Red Hat / IBM, Amazon MSK, Amazon Kinesis, Google Pub Sub, and others to create a shortlist.

The following sections compare the open-source project Apache Kafka with Redpanda’s re-implementation of the Kafka protocol. You can use these criteria (and information from other blogs, articles, videos, and so on) to evaluate your options.

Similarities between Redpanda and Apache Kafka

The high-level value propositions are the same in Redpanda and Apache Kafka:

  • Data streaming to process data in real-time at scale continuously
  • Decouple applications and domains with a distributed storage layer
  • Integrate with various data sources and data sinks
  • Leverage stream processing to correlate data and take action in real-time
  • Self-managed operations or consuming a fully-managed cloud offering

However, the devil is in the details and facts. Don’t trust marketing, but look deeper into the various products and cloud services.

Deployment options: Self-managed vs. cloud service

Data streaming is required everywhere. While most companies across industries have a cloud-first strategy, some workloads must stay at the edge for different reasons: Cost, latency, or security requirements. My blog about use cases for Apache Kafka at the edge is still one of the most read articles I have written in recent years.

Besides operating Redpanda entirely by yourself, you can buy it as a product and deploy it in your environment: either as a data plane running in your infrastructure on Kubernetes (supported by the vendor’s external control plane) or as a cloud service fully managed by the vendor.

The different deployment options for Redpanda are great. Pick what you need. This is very similar to Confluent’s deployment options for Apache Kafka. Some other Kafka vendors only provide either self-managed (e.g., Cloudera) or fully managed (e.g., Amazon MSK Serverless) deployment options.

What I miss from Redpanda: official documentation about SLAs of the cloud service and the scope of enterprise support. I hope they do better than Amazon MSK (which excludes Kafka itself from the support scope of its cloud offering). I am sure you will get that information if you reach out to the Redpanda team, and they will probably add it to their website soon.

Bring your own Cluster (BYOC)

There is a third option besides self-managing a data streaming cluster and leveraging a fully managed cloud service: Bring your own Cluster (BYOC). This alternative allows end users to deploy a solution partially managed by the vendor in their own infrastructure (like their data center or their cloud VPC).

Here is Redpanda’s marketing slogan: “Redpanda clusters hosted on your cloud, fully managed by Redpanda, so that your data never leaves your environment!”

This sounds very appealing in theory. Unfortunately, it creates more questions and problems than it solves:

  • How does the vendor access your data center or VPC?
  • Who decides how and when to scale a cluster?
  • When to act on issues? How and when do you roll a cluster to incorporate bug fixes or version upgrades?
  • What about cost management? What is the total cost of ownership? How much value does the vendor solution bring?
  • How do you guarantee SLAs? Who has to guarantee them, you or the vendor?
  • For regulated industries, how are security controls and compliance supported?  How are you sure about what the vendor does in an environment you ostensibly control?  How much harder will a bespoke third-party risk assessment be if you aren’t using pure SaaS?

For these reasons, cloud vendors only host their managed services in their own environments. Look at Amazon MSK, Azure Event Hubs, Google Pub/Sub, Confluent Cloud, etc. All of these fully managed cloud services run only in the vendor’s VPC for the above reasons.

There are only two options: Either you hand over the responsibility to a SaaS offering or control it yourself. Everything in the middle is still your responsibility in the end.

Community vs. commercial offerings

The sales approach of Redpanda looks almost identical to how Confluent sells data streaming. A free community edition is available, even for production usage. The enterprise edition adds enterprise features like tiered storage, automatic data balancing, or 24/7 enterprise support.

No surprise here. And a good strategy, as data streaming is required everywhere for different users and buyers.

Technical differences between Apache Kafka and Redpanda

There are plenty of technical and non-functional differences between Apache Kafka products and Redpanda. Keep in mind that Redpanda is NOT Kafka. Redpanda uses the Kafka protocol. This is a small but critical difference. Let’s explore these details in the following sections.

Apache Kafka vs. Kafka protocol compatibility

Redpanda is NOT an Apache Kafka distribution like Confluent Platform, Cloudera, or Red Hat. Instead, Redpanda re-implements the Kafka protocol to provide API compatibility. Being Kafka-compatible is not the same as using Apache Kafka under the hood, even if it sounds great in theory.

Two other examples of Kafka-compatible offerings:

  • Azure Event Hubs: A Kafka-compatible SaaS cloud service offering from Microsoft Azure. The service itself works and performs well. However, its Kafka compatibility has many limitations. Microsoft lists a lot of them on its website. Some limitations of the cloud service are the consequence of a different implementation under the hood, like limited retention time and message sizes.
  • Apache Pulsar: An open-source framework competing with Kafka. The feature set overlaps a lot. Unfortunately, Pulsar often only has good marketing for advanced features to compete with Kafka or to differentiate itself. One example is its Kafka protocol wrapper. Contrary to Azure Event Hubs, which is a serious implementation (with some limitations), Pulsar’s compatibility wrapper provides a basic implementation that covers only a minor part of the Kafka protocol. So, while alleged “Kafka compatibility” sounds nice on paper, one shouldn’t seriously consider it for migrating a running Kafka infrastructure to Pulsar.

We have seen compatible products for open-source frameworks in the past. Re-implementations are usually far from complete and perfect. For instance, MongoDB compared the official open-source protocol to its competitor Amazon DocumentDB to pinpoint the fact that DocumentDB only passes ~33% of the MongoDB integration test suite.

In summary, it is totally fine to use these non-Kafka solutions like Azure Event Hubs, Apache Pulsar, or Redpanda for a new project if they fulfill your requirements better than Apache Kafka. But keep in mind that it is not Kafka. There is no guarantee that additional components from the Kafka ecosystem (like Kafka Connect, Kafka Streams, REST Proxy, and Schema Registry) behave the same when integrated with a non-Kafka solution that only uses the Kafka protocol with its own implementation.

How good is Redpanda’s Kafka protocol compatibility?

Frankly, I don’t know. Probably and hopefully, Redpanda has better Kafka compatibility than Pulsar. The whole product is based on this value proposition. Hence, we can assume that the Redpanda team spends plenty of time on compatibility. Redpanda has NOT achieved 100% API compatibility yet.

Time will tell when we see more case studies from enterprises across industries that migrated some Apache Kafka projects to Redpanda and successfully operated the infrastructure for a few years. Why wait a few years to see? Well, I compare it to what I see from people starting with Amazon MSK. It is pretty easy to get started. However, after a few months, the first issues happen. Users find out that Amazon MSK is not a fully-managed product and does not provide serious Kafka SLAs. Hence, I see too many teams starting with Amazon MSK and then migrating to Confluent Cloud after some months.

But let’s be clear: If you run an application against Apache Kafka and migrate to a re-implementation supporting the Kafka protocol, you should NOT expect 100% the same behavior as with Kafka!

Some underlying behavior will differ even if the API is 100% compatible. Sometimes this is a benefit. For instance, Redpanda focuses on performance optimization with C++, which is only possible because of the re-implementation. C++ is superior to Java and the JVM in some performance- and memory-sensitive scenarios.

Redpanda = Apache Kafka – Kafka Connect – Kafka Streams

Apache Kafka includes Kafka Connect for data integration and Kafka Streams for stream processing.

Like most Kafka-compatible projects, Redpanda excludes these critical pieces from its offering. Hence, even 100 percent protocol compatibility would not mean a product re-implements everything in the Apache Kafka project.
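
To make tangible what is part of the Apache Kafka project but not of the wire protocol: Kafka Streams is a client library that runs inside your own application’s JVM and only talks to the cluster via the Kafka protocol. Here is a minimal sketch of a Streams topology that filters one topic into another; the application id, broker address, topic names, and filter predicate are placeholders for illustration:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-filter");     // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");        // hypothetical input topic
        payments.filter((key, value) -> value.contains("\"amount\""))         // trivial example predicate
                .to("validated-payments");                                    // hypothetical output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Whether such a JVM-based application behaves identically against a protocol-compatible broker is exactly one of the things you need to verify in your evaluation.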

Lower latency vs. benchmarketing

Always think about your performance requirements before starting a project. If necessary, do a proof of concept (POC) with Apache Kafka, Apache Pulsar, and Redpanda. I bet that in 99% of scenarios, all three of them will show a good enough performance for your use case.

Don’t trust opinionated benchmarks from others! Your use case will have different requirements and characteristics. And performance is typically just one of many evaluation dimensions.

I am not a fan of most “benchmarks” of performance and throughput. Benchmarks are almost always opinionated and configured for a specific problem (whether a vendor, independent consultant or researcher conducts them).

My colleague Jack Vanlightly explained this concept of benchmarketing with excellent diagrams:

Benchmarks for Benchmarketing
Source: Jack Vanlightly

Here is one concrete example you will find in one of Redpanda’s benchmarks: Kafka was not built for very high throughput per single producer, and this is what Redpanda exploits when claiming that Kafka’s throughput is inferior to Redpanda’s. Ask yourself this question: In a 1 GB/s use case, who would create that throughput with just 4 producers? Benchmarketing at its finest.

Hence, once again, start with your business requirements. Then choose the right tool for the job. Benchmarks are always built for winning against others. Nobody will publish a benchmark where the competition wins.

Soft real-time vs. hard real-time

When we speak about real-time in the IT world, we mean end-to-end data processing pipelines that need at least a few milliseconds. This is called soft real-time. This is where Apache Kafka, Apache Pulsar, Redpanda, Azure Event Hubs, Apache Flink, Amazon Kinesis, and similar platforms fit in. None of these can do hard real-time.

Hard real-time requires a deterministic network with zero latency and no spikes. Typical scenarios include embedded systems, field buses, and PLCs in manufacturing, cars, robots, securities trading, etc. Time-Sensitive Networking (TSN) is the right keyword if you want more research.

I wrote a dedicated blog post about why data streaming is NOT hard real-time. Hence, don’t try to use Kafka or Redpanda for these use cases. That’s OT (operational technology), not IT (information technology). OT means plain C or Rust running on embedded systems.

No ZooKeeper with Redpanda vs. no ZooKeeper with Kafka

Besides being implemented in C++ instead of running on the JVM, the second big differentiator of Redpanda is that it needs no ZooKeeper, so you do not have to operate two complex distributed systems… Well, with Apache Kafka 3.3, this differentiator is gone. Kafka is now production-ready without ZooKeeper! KIP-500 was a multi-year journey and an operation at Kafka’s heart.

ZooKeeper Removal KIP 500 in Apache Kafka

To be fair, it will still take some time until the new ZooKeeper-less architecture is broadly adopted in production. Also, today, it is only supported for new Kafka clusters. However, migration scenarios with zero downtime and without data loss will be supported in 2023, too. But that’s how a serious release cycle works for a mature software product: step-by-step implementation and battle-testing instead of starting with marketing and selling alpha and beta features.

ZooKeeper-less data streaming with Kafka is not just a massive benefit for the scalability and reliability of Kafka but also makes operations much more straightforward, similar to ZooKeeper-less Redpanda.

By the way, this was one of the major arguments why I did not see the value of Apache Pulsar. The latter requires not just two but three distributed systems: Pulsar broker, ZooKeeper, and BookKeeper. That’s nonsense and unnecessary complexity for virtually all projects and use cases.

Lightweight Redpanda + heavyweight ecosystem = middleweight data streaming?

Redpanda is very lightweight and efficient because of its C++ implementation. This can help in limited compute environments like edge hardware. As an additional consequence, Redpanda has fewer latency spikes than Apache Kafka. Those are significant arguments for Redpanda in some use cases!

However, you need to look at the complete end-to-end data pipeline. If you only use Redpanda as a message queue, you get these benefits compared to the JVM-based Kafka engine. But then you might as well pick a message queue like RabbitMQ or NATS instead. I won’t start that discussion here, as I focus on the much more powerful and advanced data streaming use cases.

Even in edge use cases where you deploy a single Kafka broker, the hardware, like an industrial computer (IPC), usually provides at least 4GB or 8GB of memory. That is sufficient for deploying the whole data streaming platform around Kafka and other technologies.

Data streaming is more than messaging or data ingestion

My fundamental question is: What is the benefit of a C++ implementation of the data hub if all the surrounding systems are built with JVM technology, or with even slower technologies like Python?

Kafka-compatible tools like Redpanda integrate well with the Kafka ecosystem, as they use the same protocol. Hence, tools like Kafka Connect, Kafka Streams, KSQL, Apache Flink, Faust, and all other components from the Kafka ecosystem work with Redpanda. You will find such an example for almost every existing Kafka tool on the Redpanda blog.

However, these combinations kill almost all the benefits of having a C++ layer in the middle. All integration and processing components would also need to be as efficient as Redpanda and use C++ (or Go or Rust) under the hood. These tools do not exist today (likely because not many people need them). And here is an additional drawback: the debugging, testing, and monitoring infrastructure must combine C++, Python, and JVM platforms if you combine tools like Java-based Kafka Connect and Python-based Faust with C++-based Redpanda. So, I don’t see the value proposition here.

Data replication across clusters

Having more than one Kafka cluster is the norm, not an exception. Use cases like disaster recovery, aggregation, data sovereignty in different countries, or migration from on-premise to the cloud require multiple data streaming clusters.

Replication across clusters is part of open-source Apache Kafka. MirrorMaker 2 (based on Kafka Connect) supports these use cases. More advanced (proprietary) tools from vendors like Confluent Replicator or Cluster Linking make these use cases more effortless and reliable.

Data streaming with the Kafka ecosystem is perfect as the foundation of a decentralized data mesh:

Cluster Linking for data replication with the Kafka protocol

How do you build these use cases with Redpanda?

It is the same story as for data integration and stream processing: How much does it help to have a very lightweight and performant core if all other components rely on “3rd party” code bases and infrastructure? In the case of data replication, Redpanda relies on Kafka’s MirrorMaker.

And make sure to compare MirrorMaker to Confluent Cluster Linking – the latter uses the Kafka protocol for replication and does not need additional infrastructure, operations, offset sync, etc.

Non-functional differences between Apache Kafka and Redpanda

Technical evaluations are dominant when talking about Redpanda vs. Apache Kafka. However, the non-functional differences are as crucial before making the strategic decision to choose the data streaming platform for your next project.

Licensing, the adoption curve, and the total cost of ownership (TCO) are critical for the success of establishing a data streaming platform.

Open source (Kafka) vs. source available (Redpanda)

As the name says, Apache Kafka is under the very permissive Apache License 2.0. Everyone, including cloud providers, can use the framework for building internal applications, commercial products, and cloud services. Committers and contributions are spread across various companies and individuals.

Redpanda is released under the more restrictive, source-available Business Source License (BSL). The intention is to deter cloud providers from offering Redpanda’s work as a service. For most companies, this is fine, but it limits broader adoption across different communities and vendors. The likelihood of external contributors, committers, or even other vendors picking up the technology is much smaller than in Apache projects like Kafka.

This has a significant impact on the (future) adoption curve.

Maturity, community and ecosystem

The introduction of this article showed the impressive adoption of Kafka. Just keep in mind: Redpanda is NOT Apache Kafka! It just supports the Kafka protocol.

Redpanda is a brand-new product and implementation. Operations are different. The behavior of the engine is different. Experts are not available. Job offerings do not exist. And so on.

Kafka is significantly better documented, has a tremendously larger community of experts, and has a vast array of supporting tooling that makes operations more straightforward.

There are many local and online Kafka training options, including online courses, books, meetups, and conferences. You won’t find much for Redpanda beyond the content of the vendor behind it.

And don’t trust marketing! That’s true for every vendor, of course. If you read a great feature list on the Redpanda website, double-check if the feature truly exists and in what shape it is. Example: RBAC (role-based access control) is available for Redpanda. The devil lies in the details. Quote from the Redpanda RBAC documentation: “This page describes RBAC in Redpanda Console and therefore manages access only for Console users but not clients that interact via the Kafka API. To restrict Kafka API access, you need to use Kafka ACLs.” There are plenty of similar examples today. Just try to use the Redpanda cloud service. You will find many things that are more alpha than beta today. Make sure not to fall into the same myths around the marketing of product features as some users did with Apache Pulsar a few years ago.
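
For reference, restricting access at the Kafka protocol level is done with ACLs, which can be managed with the standard Apache Kafka Admin client. The sketch below assumes a cluster that has an authorizer enabled; the bootstrap address, principal, and topic name are placeholders:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class CreateTopicAcl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the (hypothetical) principal "User:analytics" to read the "payments" topic.
            ResourcePattern topic =
                    new ResourcePattern(ResourceType.TOPIC, "payments", PatternType.LITERAL);
            AccessControlEntry entry = new AccessControlEntry(
                    "User:analytics", "*", AclOperation.READ, AclPermissionType.ALLOW);
            admin.createAcls(List.of(new AclBinding(topic, entry))).all().get();
        }
    }
}
```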

The total cost of ownership and business value

When you define your project’s business requirements and SLAs, ask yourself how much downtime or data loss is acceptable. The RTO (recovery time objective) and RPO (recovery point objective) impact a data streaming platform’s architecture and overall process to ensure business continuity, even in the case of a disaster.

The TCO is not just about the cost of a product or cloud service. Full-time engineers need to operate and integrate the data streaming platform. Expensive project leads, architects, and developers build applications.

Project risk includes the maturity of the product and the expertise you can bring in for consulting and 24/7 support.

Similar to benchmarketing regarding latency,  vendors use the same strategy for TCO calculations! Here is one concrete example you always hear from Redpanda: “C++ does enable more efficient use of CPU resources.”

This statement is correct. However, the problem with that statement is that Kafka is rarely CPU-bound; it is much more often IO-bound. Redpanda has the same network and disk requirements as Kafka, which means the infrastructure-related TCO differences between Redpanda and Kafka are limited.

When to choose Redpanda instead of Apache Kafka?

You need to evaluate whether Redpanda is the right choice for your next project or if you should stick with the “real Apache Kafka” and related products or cloud offerings. Read articles and blogs, watch videos, search for case studies in your industry, talk to different competitive vendors, and build your proof of concept or pilot project. Qualifying out products is much easier than evaluating plenty of offerings.

When to seriously consider Redpanda?

  • You need C++ infrastructure because your ops team cannot handle and analyze JVM logs – but be aware that this is only the messaging core, not the data integration, data processing, or other capabilities of the Kafka ecosystem
  • The slight performance differences matter to you – and you still don’t need hard real-time
  • Simple, lightweight development on your laptop and in automated test environments – but you should then also run Redpanda in production (using different implementations of an API for TEST and PROD is a risky anti-pattern)

You should evaluate Redpanda against Apache Kafka distributions and cloud services in these cases.

This post explored the trade-offs Redpanda has from a technical and non-functional perspective. If you need an enterprise-grade solution or fully-managed cloud service, a broad ecosystem (connectors, data processing capabilities, etc.), and if 10ms latency is good enough and a few p99 spikes are okay, then I don’t see many reasons why you would take the risk of adopting Redpanda instead of an actual Apache Kafka product or cloud service.

The future will tell us if Redpanda is a severe competitor…

I didn’t even cover the fact that a startup always has challenges finding great case studies, especially with big enterprises like Fortune 500 companies. The first great logos are always the hardest to find. Sometimes, startups never get there. In other cases, a truly competitive technology and product are created. Such a journey takes years. Let’s revisit this blog post in one, two, and five years to see the evolution of Redpanda (and Apache Kafka).

What are your thoughts? When do you consider using Redpanda instead of Apache Kafka? Are you using Redpanda already? Why and for what use cases? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post When to choose Redpanda instead of Apache Kafka? appeared first on Kai Waehner.

]]>
Best Practices for Building a Cloud-Native Data Warehouse or Data Lake https://www.kai-waehner.de/blog/2022/07/21/best-practices-for-building-a-cloud-native-data-warehouse-data-lake-lakehouse/ Thu, 21 Jul 2022 09:40:51 +0000 https://www.kai-waehner.de/?p=4666 The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let's explore this dilemma in a blog series. This is part 5: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake.

The post Best Practices for Building a Cloud-Native Data Warehouse or Data Lake appeared first on Kai Waehner.

]]>
The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 5: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake.

Best Practices for Data Analytics with AWS Azure Google BigQuery Spark Kafka Confluent Databricks

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. THIS POST: Best Practices for Building a Cloud-Native Data Warehouse or Data Lake

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Best practices for building a cloud-native data warehouse or data lake

Let’s explore the following lessons learned from building cloud-native data analytics infrastructure with data warehouses, data lakes, data streaming, and lakehouses:

  • Lesson 1: Process and store data in the right place.
  • Lesson 2: Don’t design for data at rest to reverse it.
  • Lesson 3: There is no need for a lambda architecture to separate batch and real-time workloads
  • Lesson 4: Understand the trade-offs between data sharing at rest and a streaming data exchange.
  • Lesson 5: Data mesh is not a single product or technology.

Let’s get started…

Lesson 1: Process and store data in the right place.

Ask yourself: What is the use case for your data?

Here are a few use case examples for data and exemplary tools to implement the business case:

  • Recurring reporting for management =>  Data warehouse and its out-of-the-box reporting tools
  • Interactive analysis of structured and unstructured data => Business intelligence tools like Tableau, Power BI, Qlik, or TIBCO Spotfire on top of the data warehouse or another data store
  • Transactional business workloads => Custom Java application running in a Kubernetes environment or serverless cloud infrastructure
  • Advanced analytics to find insights in historical data => Raw data sets stored in a data lake for applying powerful algorithms with AI / Machine Learning technologies such as TensorFlow
  • Real-time actions on new events => Streaming applications to process and correlate data continuously while it is relevant

Real-time or batch compute on the right platform as needed

Batch workloads run best in an infrastructure that was built for this. For instance, Hadoop or Apache Spark. Real-time workloads run best in an infrastructure that was built for this. For example, Apache Kafka.

However, sometimes, both platforms could be used. Understand the underlying infrastructure to leverage it the best way. Apache Kafka can replace a database! Nevertheless, it should only be done in the few scenarios where it makes sense (i.e., simplifies the architecture or adds business value).

For example, replayability of a sequence of events (with guaranteed ordering per partition, plus timestamps) is built into the immutable Kafka log. Replaying and re-processing historical data from Kafka is straightforward and a perfect fit for many scenarios (see the sketch after this list), including:

  • New consumer application
  • Error-handling
  • Compliance / regulatory processing
  • Query and analyze existing events
  • Schema changes in the analytics platform
  • Model training
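
Here is a minimal sketch of such a replay with the standard Apache Kafka Java consumer: it looks up the offsets for a chosen point in time via offsetsForTimes() and seeks there before re-processing. The broker address, consumer group, topic name, and time window are placeholders:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromTimestamp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-orders");           // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            String topic = "orders"; // hypothetical topic
            List<TopicPartition> partitions = new ArrayList<>();
            consumer.partitionsFor(topic).forEach(p ->
                    partitions.add(new TopicPartition(topic, p.partition())));
            consumer.assign(partitions);

            // Find the offsets matching "7 days ago" and seek there.
            long sevenDaysAgo = Instant.now().minus(Duration.ofDays(7)).toEpochMilli();
            Map<TopicPartition, Long> timestamps = new HashMap<>();
            partitions.forEach(tp -> timestamps.put(tp, sevenDaysAgo));
            consumer.offsetsForTimes(timestamps).forEach((tp, offsetAndTs) -> {
                if (offsetAndTs != null) consumer.seek(tp, offsetAndTs.offset());
            });

            // Re-process the historical events like any live stream.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("offset=%d ts=%d value=%s%n",
                        r.offset(), r.timestamp(), r.value()));
            }
        }
    }
}
```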

On the other side, you might need complex analytics: map-reduce or shuffling, SQL queries with tens of JOINs, a robust time-series analysis of sensor events, a search index based on ingested log information, and so on. In that case, you are better off choosing Spark, Rockset, Apache Druid, or Elasticsearch for the job.

Tiered storage with cloud-native object storage for cost-efficiency

A single storage infrastructure cannot solve all these problems, even if the “lakehouse vendors” tell you so. Hence, ingesting all data into a single system will not serve the above use cases well. Choose a best-of-breed approach instead, with the right tool for each job.

Modern, cloud-native systems decouple storage and compute. This is true for data streaming platforms like Apache Kafka and analytics platforms like Apache Spark, Snowflake, or Google BigQuery. SaaS solutions implement innovative tiered storage solutions (under the hood so you don’t see them) for cost-efficient separation between storage and compute.

Even modern data streaming services leverage tiered storage:

Confluent Tiered Storage for Kafka for Digital Forensics of Historical Data

Lesson 2: Don’t design for data at rest to reverse it.

Ask yourself: Is there any added business value if you process data now instead of later (whatever later means)?

If yes, then don’t store the data in a database, data lake, or data warehouse as the first step. The data is then stored at rest and no longer available for real-time processing. A data streaming platform like Apache Kafka is the right choice if real-time data beats slow data in your use case!

I find it amazing how many people put all their raw data into storage at rest just to find out later that they could leverage the data in real-time. Reverse ETL tools are then spun up to access the data in the lakehouse again via change data capture (CDC) or similar approaches. Or they use Spark Structured Streaming (= “real-time”), but the first step to get the data for “real-time stream processing” is reading it from S3 object storage (= at rest and too late). That defeats the purpose.

Reverse ETL is NOT the right approach for real-time use cases…

If you store data in a data warehouse or data lake, you cannot process it in real-time anymore as it is already stored at rest. These data stores are built for indexing, search, batch processing, reporting, model training, and other use cases that make sense in the storage system. But you cannot consume the data in real-time in motion from storage at rest:

Data at Rest and Reverse ETL

… data streaming is built for continuously processing data in real-time

That’s where event streaming comes into play. Platforms like Apache Kafka enable processing data in motion in real-time for transactional and analytical workloads.

Reverse ETL is not needed in a modern event-driven architecture! It is built into the architecture out of the box. Each consumer directly consumes the data in real-time, if appropriate and technically feasible. And data warehouses or data lakes still consume it at their own pace in near-real-time or batch:

Data in Motion and Data at Rest

Again, this does not mean you should not put data at rest in your data warehouse or data lake. But only do that if you need to analyze the data later. The data storage at rest is NOT appropriate for real-time workloads.
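
To make that decoupling concrete, here is a minimal sketch, assuming a hypothetical payments topic: two consumer groups read the same events independently, one as a (near) real-time application and one as a data warehouse ingestion job, each committing its own offsets. No Reverse ETL is involved because nothing has to be pulled back out of a store at rest:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DecoupledConsumerGroups {

    // Each consumer group commits its own offsets, so different downstream systems
    // read the same topic completely independently of each other.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("payments")); // hypothetical topic
        return consumer;
    }

    public static void main(String[] args) {
        // A real-time application (e.g., fraud detection) polls continuously ...
        try (KafkaConsumer<String, String> realtimeApp = consumerFor("fraud-detection");
             // ... while a warehouse loader reads the very same events with its own offsets and lag.
             KafkaConsumer<String, String> warehouseLoader = consumerFor("dwh-ingestion")) {

            ConsumerRecords<String, String> realtime = realtimeApp.poll(Duration.ofMillis(100));
            ConsumerRecords<String, String> batch = warehouseLoader.poll(Duration.ofSeconds(10));
            System.out.printf("realtime=%d, warehouse=%d records%n", realtime.count(), batch.count());
        }
    }
}
```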

Learn more about this topic in my blog post “When to Use Reverse ETL and when it is an Anti-Pattern“.

Lesson 3: There is no need for a lambda architecture to separate batch and real-time workloads

Ask yourself: What is the easiest way to consume and process incoming data with my favorite data analytics technology?

Real-time data beats slow data, but NOT always!

Think about your industry, business units, problems you solve, and innovative new applications you build. Real-time data beats slow data. This statement is almost always true. Either to increase revenue, reduce cost, reduce risk, or improve the customer experience.

Data at rest means to store data in a database, data warehouse, or data lake. This way, data is processed too late in many use cases – even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at rest is not a bad thing. Several use cases, such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) work very well with this approach. But real-time beats batch in almost all other use cases.

The Kappa architecture simplifies the infrastructure for batch AND real-time workloads

The Kappa architecture is an event-based software architecture that can handle all data at any scale in real-time for transactional AND analytical workloads.

The central premise behind the Kappa architecture is that you can perform both real-time and batch processing with a single technology stack. That’s a very different approach than the well-known Lambda architecture. The latter separates batch and real-time workloads in separate infrastructures and technology stacks.

The heart of a Kappa infrastructure is streaming architecture. First, the event streaming platform log stores incoming data. From there, a stream processing engine processes the data continuously in real-time or ingests the data into any other analytics database or business application via any communication paradigm and speed, including real-time, near real-time, batch, and request-response.

Kappa Architecture with one Pipeline for Real Time and Batch

Learn more about the benefits and trade-offs of the Kappa architecture in my article “Kappa Architecture is Mainstream Replacing Lambda“.

Lesson 4: Understand the trade-offs between data sharing at rest and a streaming data exchange.

Ask yourself: How do I need to share data with other internal business units or external organizations?

Use cases for hybrid and multi-cloud replication with data streaming, data lakes, data warehouses, and lakehouses

Many good reasons exist to replicate data across data centers, regions, or cloud providers:

  • Disaster recovery and high availability: Create a disaster recovery cluster and failover during an outage.
  • Global and multi-cloud replication: Move and aggregate data across regions and clouds.
  • Data sharing: Share data with other teams, lines of business, or organizations.
  • Data migration: Migrate data and workloads from one cluster to another (like from a legacy on-premise data warehouse to a cloud-native data lakehouse).

Real-time data replication beats slow data sharing

The story around internal or external data sharing is not different from other applications. Real-time replication beats slow data exchanges. Hence, storing data at rest and then replicating it to another data center, region, or cloud provider is an anti-pattern if real-time information adds business value.

The following example shows how independent stakeholders (= domains in different enterprises) use a cross-company streaming data exchange:

Stream Data Exchange and Sharing with Data Mesh in Motion

Innovation does not stop at your own organization’s borders. Streaming replication is relevant for all use cases where real-time is better than slow data (which is valid for most scenarios). A few examples:

  • End-to-end supply chain optimization from suppliers to the manufacturer to the intermediary to the aftersales
  • Track and trace across countries
  • Integration of 3rd party add-on services into your own digital product
  • Open APIs for embedding and combining external services to build a new product

Read the details about a “Streaming Data Exchange with Kafka and a Data Mesh in Motion vs. Data Sharing at Rest in the Data Warehouse or Data Lake” for more details.

Also, understand why APIs (= REST / HTTP) and data streaming (= Apache Kafka) are complementary, not competitive!

Lesson 5: Data mesh is not a single product or technology.

Ask yourself: How do I create a flexible and agile enterprise architecture to innovate more efficiently and solve my business problems faster?

Data Mesh is a Logical View, not Physical!

Data mesh shifts to a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create a self-serve data infrastructure, treating data as a product, and implementing open standardization to enable an ecosystem of interoperable distributed data products.

Here is an example of a Data Mesh:

Data Mesh with Apache Kafka

TL;DR: Data Mesh combines existing paradigms, including Domain-driven Design, Data Marts, Microservices, and Event Streaming.

A data warehouse or data lake is NOT and CAN NOT BECOME the entire data mesh!

The heart of a Data Mesh infrastructure should be real-time, decoupled, reliable, and scalable. Kafka is a modern cloud-native enterprise integration platform (also often called iPaaS today). Therefore, Kafka provides all the capabilities for the foundation of a Data Mesh.

However, not all components can or should be Kafka-based. The beauty of microservice architectures is that every application can choose the right technologies. An application might or might not include databases, analytics tools, or other complementary components. The input and output data ports of the data product should be independent of the chosen solutions:

Data Stores within the Data Product with Snowflake MongoDB Oracle et al

Kafka can be a strategic component of a cloud-native data mesh, not more and not less. But even if you do not use data streaming and build a data mesh only with data at rest, there is still no silver bullet. Don’t try to build a data mesh with a single product, technology, or vendor. No matter if that tool focuses on real-time data streaming, batch processing and analytics, or API-based interfaces. Tools like Starburst – a SQL-based MPP query engine powered by open source Trino (formerly Presto) – enable analytics on top of different data stores.

Best practices for a cloud-native data warehouse go beyond a SaaS product

Building a cloud-native data warehouse or data lake is an enormous project. It requires data ingestion, data integration, connectivity to analytics platforms, data privacy and security patterns, and much more. All of that is needed before the actual tasks like reporting or analytics can even begin.

The complete enterprise architecture beyond the scope of the data warehouse or data lake is even more complex. Best practices must be applied to build a resilient, scalable, elastic, and cost-efficient data analytics infrastructure. SLAs, latencies, and uptime have very different requirements across business domains. Best of breed approaches choose the right tool for the job. True decoupling between business units and applications allows focusing on solving specific business problems.

Separation of storage and compute, unified real-time pipelines instead of separating batch and real-time, avoiding anti-patterns like Reverse ETL, and appropriate data sharing concepts enable a successful journey to cloud-native data analytics.

For more details, browse other posts of this blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. THIS POST: Lessons Learned from Building a Cloud-Native Data Warehouse

How did you modernize your data warehouse or data lake? Do you agree with the lessons I learned? What other experiences did you have? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Best Practices for Building a Cloud-Native Data Warehouse or Data Lake appeared first on Kai Waehner.

]]>
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization https://www.kai-waehner.de/blog/2022/07/18/case-studies-cloud-native-data-streaming-for-data-warehouse-modernization/ Mon, 18 Jul 2022 08:48:48 +0000 https://www.kai-waehner.de/?p=4663 The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let's explore this dilemma in a blog series. This is part 4: Case Studies for cloud-native data streaming and data warehouses.

The post Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization appeared first on Kai Waehner.

]]>
The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 4: Case Studies for cloud-native data streaming and data warehouse modernization.

Case Studies for Cloud Native Analytics with Data Warehouse Data Lake Data Streaming Lakehouse

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. THIS POST: Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Case studies: Cloud-native data streaming for data warehouse modernization

Every project is different. This is true for data streaming, analytics, and other software development. The following shows three case studies with significantly different architectures and technologies for data warehouse modernization. The examples come from various verticals: Software and cloud business, financial services, logistics and transportation, and the travel and accommodation industry.

Confluent: Data warehouse modernization from batch ETL with Stitch to streaming ETL with Kafka

The article “Streaming ETL SFDC Data for Real-Time Customer Analytics” explores how Confluent eats its own dog food to modernize the internal data warehouse pipeline.

The use case is straightforward and standard across most organizations: Extract, transform, and load (ETL) Salesforce data into a Google BigQuery data warehouse, so that the business can use the data. But it is more complex than it sounds:

Organizations often rely on a third-party ETL tool to periodically load data from a CRM and other applications to their data warehouse. These batch tools introduce a lag between when the business events are captured in Salesforce and when they are made available for consumption and processing. The batch workloads commonly result in discrepancies between Salesforce reports and internal dashboards, leading to concerns about the integrity and reliability of the data.

Confluent used Talend’s Stitch batch ETL tool in the beginning. The old architecture looked like this:

Batch ETL with Salesforce and Talend Stitch

The consequence of batch ETL and a 3rd party tool in the middle was insufficient and inconsistent information updates.

Over the past few years, Confluent has invested in building stream processing capabilities into the internal data warehouse pipeline. Confluent leverages its own fully managed Confluent Cloud connectors (in this case, the Salesforce CDC source and BigQuery sink connectors), Schema Registry for data governance, and ksqlDB + Kafka Streams for reliable streaming ETL to send SFDC data to BigQuery. Here is the modernized architecture:

Real time streaming ETL architecture with Salesforce CDC and Apache Kafka Connect to Google BigQuery

Paypal: Reducing the time for readouts from 12 hours to a few seconds for 30 billion events per day

Paypal has plenty of Kafka projects for many critical and analytical workloads. In this use case, it scales its Kafka consumers to 30-35 billion events per day to migrate its analytical workloads to the Google Cloud Platform (GCP).

Paypal Cloud-native Data Warehouse with Apache Kafka and Google BigQuery

A streaming application ingests the events from Kafka directly to BigQuery. This is a critical project for PayPal as most of the analytical readouts are based on this. The outcome of the data warehouse modernization and building a cloud-native architecture: Reduce the time for readouts from 12 hours to a few seconds.

Read more about this success story in the PayPal Technology Blog.

Shippeo: From on-premise databases to multiple cloud-native data lakes

Shippeo provides real-time and multimodal transportation visibility for logistics providers, shippers, and carriers. Its software uses automation and artificial intelligence to share real-time insights, enable better collaboration, and unlock your supply chain’s full potential. The platform can give instant access to predictive, real-time information for every delivery.

Shippeo described how they integrated traditional databases (MySQL and PostgreSQL) and cloud-native data warehouses (Snowflake and BigQuery) with Apache Kafka and Debezium:

From Mysql and Postgresql to Snowflake and BigQuery with Kafka and Debezium at Shippeo

This is an excellent example of cloud-native enterprise architecture leveraging a “best of breed” approach for data warehousing and analytics. Kafka decouples the analytical workloads from the transactional systems and handles the backpressure for slow consumers.

Sykes Cottages: Fully-managed end-to-end pipeline with Confluent Cloud, Kafka Connect, Snowflake

Sykes Holiday Cottages is one of the UK’s leading and fastest-growing independent holiday cottage rental agencies, representing over 19,000 cottages across the UK, Ireland, and New Zealand.

The experience of its customers on the web is a top priority and is one way to stay competitive. The goal is to match customers to their perfect holiday cottage experience and delight at each stage along the way. Getting the data pipeline to fuel this innovation is critical. Data warehouse modernization and data streaming enabled new ways to further innovate the web experience through a data-driven approach.

From inconsistent and slow batch workloads…

While serving its purpose for several years, the existing pipeline had problems impairing this cycle. Very early in this pipeline, the ETL process turned the data into rows and columns (structured data). Various copies were made, and the results were presented via a static report. Data engineers were needed for changes, such as new events or contextual information. Scaling was also challenging, as this mostly had to be done manually.

Sykes Cottages Modern Cloud Native Data Pipeline

Critically, by keeping the data in a semi-structured format until it is ingested into the warehouse and then using ELT to do a single transformation of the data, Sykes Holiday Cottages can simplify the pipeline and make it much more agile.

… to event-based real-time updates and continuous stream processing

New web events (and any context that goes with it) can be wrapped up within a message and can flow all the way to the warehouse without a single code change. The new events are then available to the web teams either through a query or the visualization tool.

The current throughput is around 50k (peaking at over 300k) messages per minute. As new events are captured, this will grow considerably. Additionally, each of the above components must scale accordingly.

The new architecture enables the web teams to capture new events and analyze the data using self-service tools with no dependency on data engineering.

Sykes Cottages Legacy Data Pipeline

In conclusion, the business case for doing this is compelling: based on Sykes’ testing and projections, they expect at least 10x ROI over three years for this investment.

In Sykes Holiday Cottages’ blog post, learn more details: Why Sykes Cottages partnered with Snowflake and Confluent to drive enhanced customer experience.

Doordash: From multiple pipelines to data streaming for Snowflake integration

Even digital natives – that started their business in the cloud without legacy applications in their own data centers – need to modernize the enterprise architecture to improve business processes, reduce costs, and provide real-time information to its downstream applications.

It is cost inefficient to build multiple pipelines that are trying to achieve similar purposes. Doordash used cloud-native AWS messaging and streaming systems like Amazon SQS and Amazon Kinesis for data ingestion into the Snowflake data warehouse:

Legacy data pipeline at DoorDash

Mixing different kinds of data transport and going through multiple messaging/queueing systems without carefully designed observability around it leads to difficulties in operations.

These issues resulted in high data latency, significant cost, and operational overhead at Doordash. Therefore, Doordash moved to a cloud-native streaming platform powered by Apache Kafka and Apache Flink for continuous stream processing before ingesting data into Snowflake:

Cloud-native data streaming powered by Apache Kafka and Apache Flink for Snowflake integration at Doordash

The move to a data streaming platform provides many benefits to Doordash:

  • Heterogeneous data sources and destinations, including REST APIs using the Confluent rest proxy
  • Easily accessible
  • End-to-end data governance with schema enforcement and schema evolution with Confluent Schema Registry
  • Scalable, fault-tolerant, and easy to operate for a small team

All the details about this cloud-native infrastructure optimization are in Doordash’s engineering blog post: “Building Scalable Real Time Event Processing with Kafka and Flink“.

Real-world case studies for cloud-native projects prove the business value

Data warehouse and data lake modernization only make sense if there is a business value. Elastic scale, reduced operations complexity, and faster time to market are significant advantages of cloud services like Snowflake, Databricks, or Google BigQuery.

Data streaming plays a vital role in these initiatives to integrate with legacy and cloud-native data sources, continuous streaming ETL, true decoupling between the data sources, and multiple data sinks (lakes, warehouses, business applications).

The case studies of Confluent, Paypal, Shippeo, Sykes Cottages, and Doordash showed different success stories of moving to cloud-native infrastructure to gain real-time visibility and analytics capabilities. Elastic scale and fully-managed end-to-end pipelines are crucial success factors in gaining business value with consistently up-to-date information.

For more details, browse other posts of this blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. THIS POST: Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Do you have another success story to share? Or are your projects for data lake and data warehouse modernization still ongoing? Do you use separate infrastructure for specific use cases or build a monolithic lakehouse instead? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization appeared first on Kai Waehner.

]]>
Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure https://www.kai-waehner.de/blog/2022/07/15/data-warehouse-data-lake-modernization-from-legacy-on-premise-to-cloud-native-saas-with-data-streaming/ Fri, 15 Jul 2022 06:03:28 +0000 https://www.kai-waehner.de/?p=4649 The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let's explore this dilemma in a blog series. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

The post Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure appeared first on Kai Waehner.

]]>
The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

Data Warehouse and Data Lake Modernization with Data Streaming

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Data warehouse modernization: From legacy on-premise to cloud-native infrastructure

Many people talk about data warehouse modernization when they move to a cloud-native data warehouse. Though, what does data warehouse modernization mean? Why do people move away from their on-premise data warehouse? What are the benefits?

Many projects I have seen in the wild went through the following steps:

  1. Select a cloud-native data warehouse
  2. Get data into the new data warehouse
  3. [Optional] Migrate from the old to the new data warehouse

Let’s explore these steps in more detail and understand the technology and architecture options.

1. Selection of a cloud-native data warehouse

Many years ago, cloud computing was a game-changer for operating infrastructure. AWS innovated by providing not just EC2 virtual machines but also storage as a service, like AWS S3.

Cloud-native data warehouse offerings are built on the same fundamental change. Cloud providers brought their analytics cloud services, such as AWS Redshift, Azure Synapse, or GCP BigQuery. Independent vendors rolled out a cloud-native data warehouse or data lake SaaS such as Snowflake, Databricks, and many more. While each solution has its trade-offs, a few general characteristics are true for most of them:

  • Cloud-native: A modern data warehouse is elastic, scales for small up to extreme workloads, and automates most business processes around development, operations, and monitoring.
  • Fully managed: The vendor takes over the operations burden. This includes scaling, failover handling, upgrades, and performance tuning. Some offerings are truly serverless, while many services require capacity planning and manual or automated scaling up and down.
  • Consumption-based pricing: Pay-as-you-go enables getting started in minutes and scaling costs with broader software usage. Most enterprise deployments allow commitments in exchange for price discounts.
  • Data sharing: Replicating data sets across regions and environments is a common feature to offer data locality, privacy, lower latency, and regulatory compliance.
  • Multi-cloud and hybrid deployments: While cloud providers usually only offer the 1st party service on their cloud infrastructure, 3rd party vendors provide a multi-cloud strategy. Some vendors even offer hybrid environments, including on-premise and edge deployments.

Plenty of comparisons exist in the community, plus analyst research from Gartner, Forrester, et al. Looking at vendor information and trying out the various cloud products using free credits is crucial, too. Finding the right cloud-native data warehouse is its own challenge and not the focus of this blog post.

2. Data streaming as (near) real-time and hybrid integration layer

Data ingestion into data warehouses and data lakes was already covered in part two of this blog series. The more real-time, the better for most business applications. Near real-time ingestion is possible with specific tools (like AWS Kinesis or Kafka) or as part of the data fabric (the streaming data hub where a tool like Kafka plays a bigger role than just data ingestion).

The often more challenging part is data integration. Most data warehouse and data lake pipelines require ETL to ingest data. As the next-generation analytics platform is crucial for making the right business decisions, the data ingestion and integration platform must also be cloud-native! Tools like Kafka provide the reliable and scalable integration layer to get all required data into the data warehouse.
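
To make this more concrete, here is a minimal Java producer sketch (not from a specific customer project; the topic name, bootstrap server, and JSON payload are illustrative assumptions). It writes events into a Kafka topic from which a managed sink connector can later ingest them into the cloud data warehouse:

```java
// Minimal sketch: a producer writing events into a Kafka topic that a fully managed
// sink connector later ingests into the cloud data warehouse.
// Topic name, bootstrap servers, and the JSON payload are illustrative assumptions.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class WarehouseIngestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");          // assumption: your Kafka endpoint
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");                                // wait for all in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"orderId\":\"42\",\"amount\":99.90,\"currency\":\"EUR\"}";
            // The sink connector consumes from this topic and writes into the warehouse.
            producer.send(new ProducerRecord<>("orders.raw", "42", event));
            producer.flush();
        }
    }
}
```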

Integration of legacy on-premise data into the cloud-native data warehouse

In a greenfield project, the project team is lucky. Data sources run in the same cloud, using open and modern APIs, and scale as well as the cloud-native data warehouse.

Unfortunately, the reality is almost always brownfield, even if all applications run on public cloud infrastructure. Therefore, the integration and replication of data from legacy and on-premise applications is a general requirement.

Data is typically consumed from legacy databases, data warehouses, applications, and other proprietary platforms. The replication into the cloud data warehouse usually needs to be near real-time and reliable.

A data streaming platform like Kafka is perfect for replicating data across data centers, regions, and environments because of its elastic scalability and true decoupling capabilities. Kafka enables connectivity to modern AND legacy systems via connectors, proprietary APIs, programming languages, or open REST interfaces:

Accelerate modernization from on-prem to AWS with Kafka and Confluent Cluster Linking

A common scenario in such a brownfield project is the clear separation of concerns and true decoupling between legacy on-premise and modern cloud workloads. Here, Kafka is deployed on-premise to connect to legacy applications.

Tools like MirrorMaker, Replicator, or Confluent Cluster Linking replicate events in real-time into the Kafka cluster in the cloud. The Kafka brokers provide access to the incoming events. Downstream consumers read the data into the data sinks at their own pace: real-time, near real-time, batch, or request-response via API. Streaming ETL is possible at any site – where it makes the most sense from a business or security perspective and is the most cost-efficient.
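
As a simple illustration of "reading at their own pace", here is a hedged Java consumer sketch (topic, group, and endpoint names are assumptions). It consumes the replicated events from the cloud Kafka cluster and could write them to the data warehouse in micro-batches or one by one; the pace is entirely up to the consumer, because Kafka stores the events:

```java
// Minimal sketch: a downstream consumer reading replicated events from the cloud Kafka
// cluster at its own pace. Topic, group, and endpoint names are illustrative assumptions.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class WarehouseSinkConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "cloud-broker:9092");     // assumption: cloud cluster endpoint
        props.put("group.id", "dwh-sink");                        // each sink uses its own consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders.raw"));            // topic mirrored from on-premise
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Write to the data warehouse in micro-batches or one by one.
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```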

Example: Confluent Cloud + Snowflake = Cloud-native Data Warehouse Modernization

Here is a concrete example of data warehouse modernization using cloud-native data streaming and data warehousing with Confluent Cloud and Snowflake:

Cloud-native Data Warehouse Modernization with Apache Kafka Confluent Snowflake

For modernizing the data warehouse, data is ingested from various legacy and modern data sources using different communication paradigms, APIs, and integration strategies. The data is transmitted in motion from data sources via Kafka (and optional preprocessing) into the Snowflake data warehouse. The whole end-to-end pipeline is scalable, reliable, and fully managed, including the connectivity and ingestion between the Kafka and Snowflake clusters.

However, there is more to the integration and ingestion layer: The data streaming platform stores the data for true decoupling and slow downstream applications; not every consumer is or can be real-time. Most enterprise architectures do not ingest data into a single data warehouse or data lake or lakehouse. The reality is that different downstream applications need access to the same information; even though vendors of data warehouses and data lakes tell you differently, of course 🙂

By consuming events from the streaming data hub, each application domain decides for itself (see the stream processing sketch after this list) whether it

  • processes events within Kafka with stream processing tools like Kafka Streams or ksqlDB
  • builds its own downstream applications with its preferred code and technologies (like Java, .NET, Golang, Python)
  • integrates with 3rd party business applications like Salesforce or SAP
  • ingests the raw or preprocessed and curated data from Kafka into the sink system (like a data warehouse or data lake)
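
Here is a minimal Kafka Streams sketch for the first option (topic names and the filter logic are illustrative assumptions, not a recommendation for your domain). It curates raw events inside Kafka before any sink consumes them:

```java
// Minimal Kafka Streams sketch: streaming ETL inside Kafka that curates raw events into a
// cleaned topic, which downstream sinks such as a data warehouse connector can then consume.
// Topic names, endpoint, and filter logic are illustrative assumptions.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class CurationStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-curation");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");          // assumption
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("orders.raw");
        raw.filter((key, value) -> value != null && !value.isBlank())  // drop empty events
           .mapValues(String::trim)                                    // trivial cleansing step
           .to("orders.curated");                                      // curated topic for the sinks

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```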

3. Data warehouse modernization and migration from legacy to cloud-native

An often misunderstood concept is the buzz around data warehouse modernization: Companies rarely take the data from the existing on-premise data warehouse or data lake, write a few ETL jobs, and put the data into a cloud-native data warehouse just for the sake of doing it.

If you think about a one-time lift-and-shift from an on-premise data warehouse to the cloud, a traditional ETL tool or a replication cloud service might be the easiest option. However, data warehouse modernization is usually more than that!

What is data warehouse modernization?

A data warehouse modernization can mean many things, including replacing and migrating the existing data warehouse, building a new cloud-native data warehouse from scratch, or optimizing a legacy ETL pipeline that feeds a cloud-native data warehouse.

In all these cases, data warehouse modernization requires business justification, for instance:

  • Application issues in the legacy data warehouse, such as data processing that is too slow with legacy batch workloads, resulting in wrong or conflicting information for business users.
  • Scalability issues in the on-premise data warehouse as the data volume grows too much.
  • Cost issues because the legacy data warehouse does not offer reasonable pricing with pay-as-you-go models.
  • Connectivity issues as legacy data warehouses were not built with an open API and data sharing in mind. Cloud-native data warehouses run on cost-efficient and scalable object storage, separate storage from computing, and allow data consumption and sharing (but keep in mind that Reverse ETL is often an anti-pattern!).
  • A strategic move to the cloud with all infrastructure. The analytics platform is no exception if all old and new applications go to the cloud.

Cloud-native applications usually come with innovation, i.e., new business processes, data formats, and data-driven decision-making. From a data warehouse perspective, the best modernization is to start from scratch. Consume data directly from the existing data sources, ETL it, and do business intelligence on top of the new data structures.

I have seen many more projects where customers use change data capture (CDC) from the Oracle databases (i.e., the leading core systems) instead of trying to replicate data from the legacy data warehouse (i.e., the analytics platform), because scalability, cost, and the later shutdown of legacy infrastructure all benefit from this approach.
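
To illustrate the CDC approach, here is a hedged Java sketch of a consumer processing change events from a CDC topic. The envelope with "op" and "after" fields mirrors common CDC formats, but the exact structure depends on your connector and converter configuration, so treat the field and topic names as assumptions:

```java
// Hedged sketch: consuming CDC change events (e.g., produced by a CDC connector from the
// source database) and applying the "after" state to the warehouse staging area.
// Field names, topic name, and endpoint are assumptions that depend on your connector setup.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CdcToWarehouse {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");                 // assumption
        props.put("group.id", "cdc-to-dwh");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("oracle.core.customers"));      // illustrative CDC topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    JsonNode change = mapper.readTree(record.value());
                    String op = change.path("op").asText();            // e.g., create / update / delete
                    JsonNode after = change.path("after");             // row state after the change
                    // Here you would upsert 'after' into the warehouse staging table.
                    System.out.printf("op=%s after=%s%n", op, after);
                }
            }
        }
    }
}
```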

Data warehouse migration: Continuous vs. cut-over

A real modernization (i.e., migration) from a legacy data warehouse to a cloud-native one usually ends in a cut-over, but only after a longer transition: the first project phase integrates the legacy data sources with the new data warehouse. The old and new data warehouse platforms operate in parallel so that old and new business processes keep running. After some time (months or years later), when the business is ready to move, the old data warehouse is shut down once legacy applications are either migrated to the new data warehouse or replaced with new applications:

Data Warehouse Offloading Integration and Replacement with Data Streaming 

My article “Mainframe Integration, Offloading and Replacement with Apache Kafka” illustrates this offloading and long-term migration process. Just scroll to the section “Mainframe Offloading and Replacement in the Next 5 Years” in that post and replace the term ‘mainframe’ with ‘legacy data warehouse’ in your mind.

A migration and cut-over is its own project and may or may not include the legacy data warehouse. Data lake modernization (e.g., from a self- or partially managed Cloudera cluster running on-premise in the data center to a fully managed Databricks or Snowflake cloud infrastructure) follows the same principles. And mixing the data warehouse (reporting) and the data lake (big data analytics) into a single infrastructure does not change this either.

Data warehouse modernization is NOT a big bang and NOT a single tool approach!

Most data warehouse modernization projects are ongoing efforts over a long period. You must select a cloud-native data warehouse, get data into the new data warehouse from various sources, and optionally migrate away from legacy on-premise infrastructure.

Data streaming for data ingestion, business applications, or data sharing in real-time should always be a separate component in the enterprise architecture. It has very different requirements regarding SLAs, uptime, throughput, latency, etc. Putting all real-time and analytical workloads into the same cluster makes little sense from a cost, risk, or business value perspective. The idea of a modern data flow and building a data mesh is the separation of concerns with domain-driven design and focusing on data products (using different, independent APIs, technologies, and cloud services).

For more details, browse other posts of this blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

What cloud-native data warehouse(s) do you use? How does data streaming fit into your journey? Did you integrate or replace your legacy on-premise data warehouse(s); or start from greenfield in the cloud? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure appeared first on Kai Waehner.

]]>
Cloud-native Core Banking Modernization with Apache Kafka https://www.kai-waehner.de/blog/2022/03/16/core-banking-mainframe-modernization-with-cloud-native-apache-kafka/ Wed, 16 Mar 2022 07:52:11 +0000 https://www.kai-waehner.de/?p=4336 Most financial service institutions operate their core banking platform on legacy mainframe technologies. The monolithic, proprietary, inflexible architecture creates many challenges for innovation and cost-efficiency. This blog post explores three open, elastic, and scalable banking solutions powered by Apache Kafka to solve these problems.

The post Cloud-native Core Banking Modernization with Apache Kafka appeared first on Kai Waehner.

]]>
Most financial service institutions operate their core banking platform on legacy mainframe technologies. The monolithic, proprietary, inflexible architecture creates many challenges for innovation and cost-efficiency. This blog post explores what an open, elastic, and scalable solution powered by Apache Kafka can look like to solve these problems. Three cloud-native real-world banking solutions show how transactional and analytical workloads can be built at any scale in real-time for traditional business processes like payments or regulatory reporting and innovative new applications like crypto trading.

Cloud Native Core Banking Platform powered by Apache Kafka Data Streaming

What is core banking?

Core banking is a banking service provided by networked bank branches. Customers may access their bank accounts and perform basic transactions from member branch offices or connected software applications like mobile apps. Core banking is often associated with retail banking, and many banks treat retail customers as their core banking customers. Wholesale banking is a business conducted between banks. Securities trading involves the buying and selling of stocks, shares, and so on.

Businesses are usually managed via the corporate banking division of the institution. Core banking covers the basic depositing and lending of money. Standard core banking functions include transaction accounts, loans, mortgages, and payments.

Typical business processes of the banking operating system include KYC (“Know Your Customer”), product opening, credit scoring, fraud, refunds, collections, etc.

Banks make these services available across multiple channels like automated teller machines, Internet banking, mobile banking, and branches. Banking software and network technology allow a bank to centralize its record-keeping and allow access from any location.

Automatic analytics for regulatory reporting, flexible configuration to adjust workflows and innovate, and an open API to integrate with 3rd party ecosystems are crucial for modern banking platforms.

Transactional vs. analytical workloads in core banking

Workloads for analytics and transactions have very different characteristics and requirements. The use cases differ significantly. Most core banking workflows are transactional.

SLAs are very different and crucial to understand in order to guarantee proper behavior in case of infrastructure issues and for disaster recovery:

  • RPO (Recovery Point Objective): The data loss you can live with in the case of a disaster (e.g., hourly backups imply an RPO of up to one hour of lost data)
  • RTO (Recovery Time Objective): The recovery period (i.e., downtime) you can live with in the case of a disaster

Disaster Recovery - RPO Recovery Point Objective and RTO Recovery Time Objective

While downtime or data loss is never good, it is often acceptable in analytics use cases when compared to the cost and risk of guaranteeing an RTO and RPO close to zero. Hence, if the reporting function for end users is down for an hour or, even worse, a few reports are lost, life goes on.

In transactional workloads within a core banking platform, RTO and RPO need to be as close to zero as possible, even in the case of a disaster (e.g., if a complete data center or cloud region is down). If the core banking platform loses payment transactions or other events required for compliance processing, then the bank is in huge trouble.

Legacy core banking platforms

Advancements in the Internet and information technology reduced manual work in banks and increased efficiency. Computer software was developed decades ago to perform core banking operations like recording transactions, passbook maintenance, interest calculations on loans and deposits, customer records, the balance of payments, and withdrawal.

Banking software running on a mainframe

Most core banking platforms of traditional financial services companies still run on mainframes. The machines, operating systems, and applications still do a great job. SLAs like RPO and RTO are not new. If you look at IBM’s mainframe products and docs, the core concepts are similar to cutting-edge cloud-native technologies. Downtime, data loss, and similar requirements need to be defined.

The resulting architecture provided the needed guarantees. IBM DB2, IMS, CICS, and Cobol code still operate transactional workloads very reliably. A modern IBM z15 mainframe, announced in 2019, provides up to 40 TB of RAM and 190 cores. That’s very impressive.

Monolithic, proprietary, inflexible mainframe applications

So, what’s the problem with legacy core banking platforms running on a mainframe or similar other infrastructure?

  • Monolithic: Legacy mainframe applications are extremely monolithic. This is not comparable to a monolithic web application from the 2000s running on IBM WebSphere or another J2EE / Java EE application server. It is much worse.
  • Proprietary: IMS, CICS, MQ, DB2, et al. are very mature technologies. However, next-generation decision makers, cloud architects, and developers expect open APIs, open-source core infrastructure, best-of-breed solutions and SaaS with independent freedom of choice for each problem.
  • Inflexible: Most legacy core banking applications have done their job for decades. The Cobol code runs. However, it is no longer understood or documented. Cobol developers are scarce, too. The only option is to extend existing applications. Also, the infrastructure is not elastic; it cannot scale up and down in a software-defined manner. Instead, companies have to buy hardware for millions of dollars (and still pay an additional fortune for the transactions).

Yes, the mainframe supports up-to-date technologies such as DB2, MQ, WebSphere, Java, Linux, Web Services, Kubernetes, Ansible, Blockchain! Nevertheless, this does not solve the existing problems. This only helps when you build new applications. However, new applications are usually made with modern cloud-native infrastructure and frameworks instead of relying on legacy concepts.

Optimization and cost reduction of existing mainframe applications

The above sections looked at the enterprise architecture with RPO/RTO in mind to guarantee uptime and no data loss. This is crucial for decision-makers responsible for the business unit’s risk and revenue.

However, the third aspect besides revenue and risk is cost. Beyond providing an elastic and flexible infrastructure for the next-generation core banking platform, companies also move away from legacy solutions for cost reasons.

Enterprises save millions of dollars by just offloading data from a mainframe to modern event streaming:

Mainframe Offloading from Cobol to Apache Kafka and Java

For instance, streaming data empowers the Royal Bank of Canada (RBC) to save millions of dollars by offloading data from the mainframe to Kafka. Here is a quote from RBC:

… rescue data off of the mainframe, in a cloud-native, microservice-based fashion … [to] … significantly reduce the reads on the mainframe, saving RBC fixed infrastructure costs (OPEX). RBC stayed compliant with bank regulations and business logic, and is now able to create new applications using the same event-based architecture.

Read my dedicated blog post if you want to learn more about mainframe offloading, integration, and migration to Apache Kafka.
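
As a simplified illustration of the offloading pattern (the topic name and record format are my assumptions, not RBC's actual implementation), a consumer can build a local, continuously updated view from the offloaded events instead of querying the mainframe again and again:

```java
// Simplified sketch: applications consume the offloaded change events once from Kafka and keep
// a local, continuously updated view - cutting expensive repeated mainframe reads.
// Topic name, record format, and endpoint are illustrative assumptions.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

public class MainframeOffloadCache {
    // Local materialized view of account records offloaded from the mainframe.
    private static final Map<String, String> accountsByKey = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");              // assumption
        props.put("group.id", "account-view");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");                  // rebuild the view from the beginning

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("mainframe.accounts"));       // illustrative offloading topic
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    accountsByKey.put(record.key(), record.value()); // upsert the latest account state
                }
                // New microservices read from accountsByKey instead of hitting the mainframe.
            }
        }
    }
}
```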

Modern cloud-native core banking platforms

This post is not just about offloading and integration. Instead, we look at real-world examples where cloud-native core banking replaced existing legacy mainframes or enabled new FinTech companies to start from scratch in a cutting-edge, real-time cloud environment to compete with the traditional FinServ incumbents.

Requirements for a modern banking platform?

Here are the requirements I hear regularly on the wish lists of executives and lead architects from financial services companies for new banking infrastructure and applications:

  • Real-time data processing
  • Scalable infrastructure
  • High availability
  • True decoupling and backpressure handling
  • Cost-efficient cost model
  • Flexible architecture for agile development
  • Elastic scalability
  • Standards-based interfaces and open APIs
  • Extensible functions and domain-driven separation of concerns
  • Secure authentication, authorization, encryption, and audit logging
  • Infrastructure-independent deployments across an edge, hybrid, and multi-region / multi-cloud environments

What are cloud-native infrastructure and applications?

And here are the capabilities of a genuinely cloud-native infrastructure to build a next-generation core banking system:

  • Real-time data processing
  • Scalable infrastructure
  • High availability
  • True decoupling and backpressure handling
  • Cost-efficient cost model
  • Flexible architecture for agile development
  • Elastic scalability
  • Standards-based interfaces and open APIs
  • Extensible functions and domain-driven separation of concerns
  • Secure authentication, authorization, encryption, and audit logging
  • Infrastructure-independent deployments across an edge, hybrid, and multi-region / multi-cloud environments

I think you get my point here: Adopting cloud-native infrastructure is critical for success in building next-generation banking software. No matter if you

  • are on-premise or in the cloud
  • are a traditional player or a startup
  • focus on a specific country or language, or operate across regions or even globally

Apache Kafka = cloud-native infrastructure for real-time transactional workloads

Many people think that Apache Kafka is not built for transactions and should only be used for big data analytics. This blog post explores when and how to use Kafka in resilient, mission-critical architectures and when to use the built-in Transaction API.
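
For readers who have not used it yet, here is a minimal sketch of the Transaction API in Java (topic names and amounts are illustrative). The producer writes a debit and a credit event atomically; consumers configured with isolation.level=read_committed see either both events or neither:

```java
// Hedged sketch of Kafka's built-in Transaction API. Topic names, key/value content, and the
// broker endpoint are illustrative assumptions; error handling is simplified.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PaymentTransfer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");               // assumption
        props.put("transactional.id", "payments-service-1");         // enables idempotence + transactions
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("payments.debit", "account-A", "-100.00"));
                producer.send(new ProducerRecord<>("payments.credit", "account-B", "+100.00"));
                producer.commitTransaction();                         // both events become visible, or none
            } catch (KafkaException e) {
                producer.abortTransaction();                          // roll back the whole write
                throw e;
            }
        }
    }
}
```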

Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). Zero downtime and zero data loss can be guaranteed, just as with your favorite database, mainframe, or other core platform.

Elastic scalability and rolling upgrades allow a flexible and reliable data streaming infrastructure for transactional workloads to guarantee business continuity. The architect can even stretch a cluster across regions with tools such as Confluent Multi-Region Clusters. This setup ensures zero data loss and zero downtime even in case of a disaster where a data center is entirely down.

The post “Global Kafka Deployments” explores the different deployment options and their trade-offs in more detail. Check out my blog post about transactional vs. analytical workloads with Apache Kafka for more information.

Apache Kafka in banking and financial services

Event streaming in financial services is growing like crazy. Continuous real-time data integration and processing are mandatory for many use cases. Many business departments in the financial services sector deploy Apache Kafka for mission-critical transactional workloads and big data analytics, including core banking. High scalability, high reliability, and an elastic open infrastructure are the critical reasons for Kafka’s success.

To learn more about different use cases, architectures, and real-world examples in the FinServ sector, check out the post “Apache Kafka in the Financial Services Industry“. Use cases include:

  • Wealth management and capital markets
  • Market and credit risk
  • Cybersecurity
  • IT Modernization
  • Retail and corporate banking
  • Customer experience

Modern cloud-native core banking solutions powered by Kafka

Now, let’s look at specific cloud-native core banking solutions built with Apache Kafka. The following subsections show three real-world examples.

Thought Machine – Correctness and scale in a single platform

Thought Machine provides an innovative and flexible core banking operating system. The core capabilities of Thought Machine’s solution include:

  • Cloud-native core banking software
  • Transactional workloads (24/7, zero data loss)
  • Flexible product engine powered by smart contracts (not blockchain)

The cloud-native core banking operating system enables a bank to achieve a wide scale of customization without having to change anything on the underlying platform. This is highly advantageous and a crucial part of how its architecture is a counterweight to the “spaghetti” that arises in other systems when customization and platform functionality are not separated.

Thought Machine Cloud Native Core Banking powered by Apache Kafka

Thought Machine’s Kafka Summit talk from 2021 explores how Thought Machine’s core banking system ‘Vault’ was built in a cloud-first manner with Apache Kafka at its heart. It leverages event streaming to enable asynchronous and parallel processing at scale, specifically focusing on the architectural patterns to ensure ‘correctness’ in such an environment.

10x Banking – Channel agnostic transactions in real-time

10x Banking provides a cloud-native core banking platform. In their Kafka Summit talk, they discussed the history of core banking and how they leverage Apache Kafka in conjunction with other open-source technologies within their commercial platform.

10X Core Banking powered by Apache Kafka

The 10x cloud-native approach provides flexible product lifecycles, and time-to-market is a key benefit. Organizations do not need to start from scratch. A unified data model and tooling allow teams to focus on the business problems.

The 10x platform is a secure, reliable, scalable, and regulatory-compliant SaaS platform that minimizes the regulatory burden and cost. It is built on a domain-driven design with true decoupling between transactional workloads and analytics/reporting.

Kafka is a data hub within the comprehensive platform for real-time analytics, transactions, and cybersecurity. Apache Kafka is not the silver bullet for every problem. Hence, 10x chose a best-of-breed approach to combine different open-source frameworks, commercial products, and SaaS offerings to build the cloud-native banking framework.

Here is how 10X Banking built a cloud-native core banking platform to enable real-time and batch analytics with a single data streaming pipeline leveraging the Kappa architecture:

Kappa Architecture at 10X Banking powered by Apache Kafka

The key components include Apache Kafka for data streaming, plus Apache Pinot and Trino for analytics.

Custodigit – Secure crypto investments with stateful data streaming and orchestration

Custodigit is a modern banking platform for digital assets and cryptocurrencies. It provides crucial features and guarantees for seriously regulated crypto investments:

  • Secure storage of wallets
  • Sending and receiving on the blockchain
  • Trading via brokers and exchanges
  • Regulated environment (a key aspect, and no surprise, as this product comes from Switzerland – a very regulated market)

Kafka is the central nervous system of Custodigit’s core banking microservice architecture. Stateful Kafka Streams applications provide workflow orchestration with the “distributed saga” design pattern for the choreography between microservices. Kafka Streams was selected for the following reasons (a simplified sketch follows the list below):

  • lean, decoupled microservices
  • metadata management in Kafka
  • unified data structure across microservices
  • transaction API (aka exactly-once semantics)
  • scalability and reliability
  • real-time processing at scale
  • a higher-level domain-specific language for stream processing
  • long-running stateful processes
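
The following is a simplified, hypothetical Kafka Streams sketch in this spirit (domain, topic names, and aggregation logic are my assumptions, not Custodigit's actual code). It enables exactly-once processing and keeps the event history per order in a fault-tolerant state store:

```java
// Simplified sketch: a stateful Kafka Streams topology with exactly-once processing that tracks
// the state of each trading order as events arrive. Topics, endpoint, and logic are assumptions.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

import java.util.Properties;

public class OrderWorkflowStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-workflow");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");                    // assumption
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2); // exactly-once semantics
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Maintain the event history per order key in a fault-tolerant, local state store.
        KTable<String, String> orderHistory = builder.<String, String>stream("order-events")
            .groupByKey()
            .aggregate(
                () -> "",
                (orderId, event, history) -> history.isEmpty() ? event : history + " -> " + event,
                Materialized.as("order-history-store"));
        orderHistory.toStream().to("order-status");   // other microservices consume the workflow status

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```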

I covered Custodigit and other blockchains/crypto platforms in a separate blog post: Apache Kafka as Data Hub for Crypto, DeFi, NFT, Metaverse – Beyond the Buzz.

Cloud-native core banking provides elastic scale for transactional workloads

Modern core banking software needs to be elastic, scalable, and real-time. This is true for transactional workloads like KYC or credit scoring as well as analytical workloads like regulatory reporting. Apache Kafka enables processing transactional and analytical workloads in many modern banking solutions.

Thought Machine, 10X Banking, and Custodigit are three cloud-native examples powered by the Apache Kafka ecosystem to enable the next generation of banking software in real-time. Open Banking is achieved with open APIs to integrate with other 3rd party services.

The integration, offloading, and later replacement of legacy technologies such as mainframe with modern data streaming technologies prove the business value in many organizations. Kafka is not a silver bullet, but the central and mission-critical data hub for real-time data integration and processing.

How do you leverage data streaming for analytical or transactional workloads? What architecture does your platform use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Cloud-native Core Banking Modernization with Apache Kafka appeared first on Kai Waehner.

]]>
Kafka for Real-Time Replication between Edge and Hybrid Cloud https://www.kai-waehner.de/blog/2022/01/26/kafka-cluster-linking-for-hybrid-replication-between-edge-cloud/ Wed, 26 Jan 2022 12:45:05 +0000 https://www.kai-waehner.de/?p=4131 Not all workloads should go to the cloud! Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration. This blog post explores hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell edge hardware and serverless Confluent Cloud.

The post Kafka for Real-Time Replication between Edge and Hybrid Cloud appeared first on Kai Waehner.

]]>
Not all workloads should go to the cloud! Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration. This blog post explores architectures, design patterns, and software and hardware considerations for deploying hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell edge hardware and serverless Confluent Cloud.

Real-Time Edge Computing and Hybrid Cloud with Apache Kafka Confluent and Hivecell

Not every workload should go into the cloud

By now, almost every company has a cloud-first strategy. Nevertheless, not all workloads should be deployed in the public cloud. A few reasons why IT applications still run at the edge or in a local data center:

  • Cost-efficiency: The more data produced at the edge, the more costly it is to transfer everything to the cloud. Such a transfer often makes no sense for high volumes of raw sensor and telemetry data.
  • Low latency: Some use cases require data processing and correlation in real-time in milliseconds. Communication with remote locations increases the response time significantly.
  • Bad, unstable internet connection: Some environments do not provide good connectivity to the cloud or are entirely disconnected, either permanently or for parts of the day.
  • Cybersecurity with air-gapped environments: The disconnected Edge is common in safety-critical environments. Controlled data replication is only possible via unidirectional hardware gateways or manual human copy tasks within the site.

Here is a great recent example of why not all workloads should go to the cloud: an AWS outage created enormous issues for visitors to Disney World because the mobile app features run online in the cloud. Business continuity is not possible if the connection to the cloud is down:

AWS Outage at Disney World

The Edge is…

To be clear: The term ‘edge’ needs to be defined at the beginning of every conversation. I define the edge as having the following characteristics and needs:

  • Edge is NOT a data center
  • Offline business continuity
  • Often 100+ locations
  • Low-footprint and low-touch
  • Hybrid integration

Hybrid cloud for Kafka is the norm; not an exception

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Several scenarios require multi-cluster solutions. Real-world examples have different requirements and trade-offs, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments, and global Kafka.

Global Event Streaming with Apache Kafka Confluent Multi Region Clusters Replication and Confluent Cloud

I posted about this in the past. Check out “architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments“.

Apache Kafka at the Edge

From a Kafka perspective, the edge can mean two things:

  • Kafka clients at the edge connecting directly to the Kafka cluster in a remote data center or public cloud, via a native client (Java, C++, Python, etc.) or a proxy (MQTT Proxy, HTTP / REST Proxy)
  • Kafka clients AND the Kafka broker(s) deployed at the edge, not just the client applications

Both alternatives are acceptable and have their trade-offs. This post is about the whole Kafka infrastructure at the edge (potentially replicating to another remote Kafka cluster via MirrorMaker, Confluent Replicator, or Cluster Linking).
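
To show how little the application code differs between the two alternatives, here is a hedged Java sketch (endpoints, topic, and credentials are placeholders). The same producer can point at a local edge broker or at a remote cloud cluster; only the connection configuration changes, while replication between edge and cloud is handled separately by tools like MirrorMaker or Cluster Linking:

```java
// Minimal sketch of the two client alternatives. Endpoints, topic name, and credentials are
// placeholders; the producer logic itself is identical in both deployment options.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class EdgeOrCloudClient {
    static Properties edgeConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "edge-broker.local:9092");   // broker running at the site
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        return props;                                                // replication to the cloud happens separately
    }

    static Properties cloudConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.cloud.example:9092"); // remote cluster endpoint (placeholder)
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"API_KEY\" password=\"API_SECRET\";");      // placeholder credentials
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        return props;
    }

    public static void main(String[] args) {
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(edgeConfig())) {
            producer.send(new ProducerRecord<>("sensor.telemetry", "pump-1", "{\"psi\":42}"));
        }
    }
}
```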

Check out my Infrastructure Checklist for Apache Kafka at the edge for more details.

I also covered various Apache Kafka use cases for the edge and hybrid cloud across industries like manufacturing, transportation, energy, mining, retail, entertainment, etc.

Hardware for edge computing and analytics

Edge hardware has some specific requirements to be successful in a project:

  • No special equipment for power, air conditioning, or networking
  • No technicians are required on-site to install, configure, or maintain the hardware and software
  • Start with the smallest footprint possible to show ROI
  • Easily add more compute power as workload expands
  • Deploy and operate simple or distributed software for containers, middleware, event streaming, business applications, and machine learning
  • Monitor, manage and upgrade centrally via fleet management and automation, even when behind a firewall

Devon Energy: Edge and Hybrid Cloud with Kafka, Confluent, and Hivecell

Devon Energy (formerly named WPX Energy) is a company in the oil & gas industry. The digital transformation creates many opportunities to improve processes and reduce costs in this vertical. WPX leverages Confluent Platform on Hivecell edge hardware to realize edge processing and replication to the cloud in real-time at scale.

The solution is designed for real-time decision-making and future closed-loop control optimization. Devon Energy conducts edge stream processing to enable real-time decision-making at the well sites. They also replicate business-relevant data streams produced by machine learning models and analytically preprocessed data from the well sites to the cloud, enabling Devon Energy to harness the full power of its real-time events:

Devon Energy Apache Kafka and Confluent at the Edge with Hivecell and Cluster Linking to the Cloud

A few interesting notes about this hybrid Edge to cloud deployment:

  • Improved drilling and well completion operations
  • Edge stream processing / analytics + closed-loop control ready
  • Vendor agnostic (pumping, wireline, coil, offset wells, drilling operations, producing wells)
  • Replication to the cloud in real-time at scale
  • Cloud agnostic (AWS, GCP, Azure)

Live Demo – How to deploy a Kafka Cluster in production on your desk or anywhere

Confluent and Hivecell delivered on the promise of bringing a piece of Confluent Cloud right to your desk and delivering managed Kafka on a cloud-native Kubernetes cluster at the edge. For the first time, Kafka deployments run at scale at the edge, enabling local Kafka clusters at oil drilling sites, on ships, in factories, or in quick-service restaurants.

In this webinar, we showed how it works during a live demo – where we deploy an edge Confluent cluster, stream edge data, and synchronize it with Confluent Cloud across regions and even continents:

Hybrid Kafka Edge to Cloud Replication with Confluent and Hivecell

Dominik and I had our Hivecell clusters at home in Texas, USA, and Bavaria, Germany, respectively. We synchronized events across continents to a central Confluent Cloud cluster and simulated errors and cloud-native self-healing by “killing” one of my Hivecell nodes in Germany.

The webinar covered the following topics:

  • Edge computing as the next significant paradigm shift in IT
  • Apache Kafka and Confluent use cases at the edge in IoT environments
  • An easy way of setting up Kafka clusters wherever you need them, including fleet management and error handling
  • Hands-on examples of Kafka cluster deployment and data synchronization

Slides and on-demand video recording

Here are the slides:

And the on-demand video recording:

Video Recording - Kafka, Confluent and Hivecell at the Edge and Hybrid Cloud


Real-time data streaming everywhere requires hybrid edge-to-cloud data streaming!

A cloud-first strategy makes sense. Elastic scaling, agile development, and cost-efficient infrastructure allow innovation. However, not all workloads should go to the cloud for latency, cost, or security reasons.

Apache Kafka can be deployed everywhere. For most projects, the successful deployment and management at the edge, plus uni- or bidirectional real-time synchronization between the edge and the cloud, are essential. This post showed how Confluent Cloud, Kafka at the edge on Hivecell hardware, and Cluster Linking enable hybrid streaming data exchanges.

How do you use Apache Kafka? Do you deploy in the public cloud, in your data center, or at the edge outside a data center? How do you process and replicate the data streams? What other technologies do you combine with Kafka? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post Kafka for Real-Time Replication between Edge and Hybrid Cloud appeared first on Kai Waehner.

]]>