Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
https://www.kai-waehner.de/blog/2022/07/15/data-warehouse-data-lake-modernization-from-legacy-on-premise-to-cloud-native-saas-with-data-streaming/ (Fri, 15 Jul 2022)

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

Data Warehouse and Data Lake Modernization with Data Streaming

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Data warehouse modernization: From legacy on-premise to cloud-native infrastructure

Many people talk about data warehouse modernization when they move to a cloud-native data warehouse. But what does data warehouse modernization actually mean? Why do people move away from their on-premise data warehouse? What are the benefits?

Many projects I have seen in the wild went through the following steps:

  1. Select a cloud-native data warehouse
  2. Get data into the new data warehouse
  3. [Optional] Migrate from the old to the new data warehouse

Let’s explore these steps in more detail and understand the technology and architecture options.

1. Selection of a cloud-native data warehouse

Many years ago, cloud computing was a game-changer for operating infrastructure. AWS innovated by providing not just EC2 virtual machines but also storage, like AWS S3 as a service.

Cloud-native data warehouse offerings are built on the same fundamental change. Cloud providers brought their analytics cloud services, such as AWS Redshift, Azure Synapse, or GCP BigQuery. Independent vendors rolled out a cloud-native data warehouse or data lake SaaS such as Snowflake, Databricks, and many more. While each solution has its trade-offs, a few general characteristics are true for most of them:

  • Cloud-native: A modern data warehouse is elastic, scales for small up to extreme workloads, and automates most business processes around development, operations, and monitoring.
  • Fully managed: The vendor takes over the operations burden. This includes scaling, failover handling, upgrades, and performance tuning. Some offerings are truly serverless, while many services require capacity planning and manual or automated scaling up and down.
  • Consumption-based pricing: Pay-as-you-go enables getting started in minutes and scaling costs with broader software usage. Most enterprise deployments also allow committing to a certain usage volume in exchange for price discounts.
  • Data sharing: Replicating data sets across regions and environments is a common feature to offer data locality, privacy, lower latency, and regulatory compliance.
  • Multi-cloud and hybrid deployments: While cloud providers usually only offer the 1st party service on their cloud infrastructure, 3rd party vendors provide a multi-cloud strategy. Some vendors even offer hybrid environments, including on-premise and edge deployments.

Plenty of comparisons exist in the community, plus analyst research from Gartner, Forrester, et al. Looking at vendor information and trying out the various cloud products using free credits is crucial, too. Finding the right cloud-native data warehouse is its own challenge and not the focus of this blog post.

2. Data streaming as (near) real-time and hybrid integration layer

Data ingestion into data warehouses and data lakes was already covered in part two of this blog series. The more real-time, the better for most business applications. Near real-time ingestion is possible with specific tools (like AWS Kinesis or Kafka) or as part of the data fabric (the streaming data hub where a tool like Kafka plays a bigger role than just data ingestion).

The often more challenging part is data integration. Most data warehouse and data lake pipelines require ETL to ingest data. As the next-generation analytics platform is crucial for making the right business decisions, the data ingestion and integration platform must also be cloud-native! Tools like Kafka provide the reliable and scalable integration layer to get all required data into the data warehouse.
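
To make this more tangible, here is a minimal sketch of how such an ingestion pipeline could be wired up programmatically: a plain Java snippet that registers a JDBC source connector against the Kafka Connect REST API. The connector name, database connection, and topic prefix are made-up placeholders, and the sketch assumes a self-managed Connect worker and a recent JDK; a fully managed service such as Confluent Cloud exposes equivalent functionality via its UI, CLI, and APIs.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSourceConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical Connect worker endpoint and connector configuration
        String connectUrl = "http://localhost:8083/connectors";
        String connectorConfig = """
            {
              "name": "legacy-oracle-source",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                "connection.url": "jdbc:oracle:thin:@legacy-db:1521/ORCL",
                "mode": "timestamp+incrementing",
                "timestamp.column.name": "UPDATED_AT",
                "incrementing.column.name": "ID",
                "topic.prefix": "legacy.",
                "tasks.max": "1"
              }
            }
            """;

        // POST the connector definition to the Kafka Connect REST API
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(connectUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorConfig))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Connect REST API responded with: " + response.statusCode());
    }
}
```

From there, a sink connector (for instance, for Snowflake or S3) pushes the same topics into the cloud-native data warehouse.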

Integration of legacy on-premise data into the cloud-native data warehouse

In a greenfield project, the project team is lucky. Data sources run in the same cloud, use open and modern APIs, and scale just as well as the cloud-native data warehouse.

Unfortunately, the reality is almost always brownfield, even if all applications run on public cloud infrastructure. Therefore, the integration and replication of data from legacy and on-premise applications is a general requirement.

Data is typically consumed from legacy databases, data warehouses, applications, and other proprietary platforms. The replication into the cloud data warehouse usually needs to be near real-time and reliable.

A data streaming platform like Kafka is perfect for replicating data across data centers, regions, and environments because of its elastic scalability and true decoupling capabilities. Kafka enables connectivity to modern AND legacy systems via connectors, proprietary APIs, programming languages, or open REST interfaces:

Accelerate modernization from on-prem to AWS with Kafka and Confluent Cluster Linking

A common scenario in such a brownfield project is the clear separation of concerns and true decoupling between legacy on-premise and modern cloud workloads. Here, Kafka is deployed on-premise to connect to legacy applications.

Tools like MirrorMaker, Replicator, or Confluent Cluster Linking replicate events in real-time into the Kafka cluster in the cloud. The Kafka brokers provide access to the incoming events. Downstream consumers read the data into the data sinks at their own pace; real-time, near real-time, batch, or request-response via API. Streaming ETL is possible at any site – where it makes the most sense from a business or security perspective and is the most cost-efficient.

Example: Confluent Cloud + Snowflake = Cloud-native Data Warehouse Modernization

Here is a concrete example of data warehouse modernization using cloud-native data streaming and data warehousing with Confluent Cloud and Snowflake:

Cloud-native Data Warehouse Modernization with Apache Kafka Confluent Snowflake

For modernizing the data warehouse, data is ingested from various legacy and modern data sources using different communication paradigms, APIs, and integration strategies. The data is transmitted in motion from data sources via Kafka (with optional preprocessing) into the Snowflake data warehouse. The whole end-to-end pipeline is scalable, reliable, and fully managed, including the connectivity and ingestion between the Kafka and Snowflake clusters.

However, there is more to the integration and ingestion layer: The data streaming platform stores the data for true decoupling and for slow downstream applications; not every consumer is or can be real-time. Most enterprise architectures do not ingest data into a single data warehouse, data lake, or lakehouse. The reality is that different downstream applications need access to the same information; even though vendors of data warehouses and data lakes tell you differently, of course 🙂

By consuming events from the streaming data hub, each application domain decides for itself whether it

  • processes events within Kafka with stream processing tools like Kafka Streams or ksqlDB
  • builds its own downstream applications with its own code and technologies (like Java, .NET, Golang, Python) – a minimal consumer sketch follows this list
  • integrates with 3rd party business applications like Salesforce or SAP
  • ingests the raw or preprocessed and curated data from Kafka into the sink system (like a data warehouse or data lake)
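
As a minimal illustration of the second option in this list, here is a sketch of a plain Java consumer that a downstream domain could run at its own pace. Broker address, consumer group, and topic name are hypothetical placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DownstreamDomainConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");           // hypothetical broker
        props.put("group.id", "marketing-domain");                  // each domain uses its own consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");                 // replay history when onboarding a new app

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-events"));          // hypothetical topic from the streaming hub
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each domain applies its own logic; Kafka decouples it from all other consumers
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```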

3. Data warehouse modernization and migration from legacy to cloud-native

An often misunderstood concept is the buzz around data warehouse modernization: Companies rarely take the data of the existing on-premise data warehouse or data lake, write a few ETL jobs, and put the data into a cloud-native data warehouse for the sake of doing it.

If you think about a one-time lift-and-shift from an on-premise data warehouse to the cloud, then a traditional ETL tool or a replication cloud service might be the easiest option. However, data warehouse modernization is usually more than that!

What is data warehouse modernization?

A data warehouse modernization can mean many things, including replacing and migrating the existing data warehouse, building a new cloud-native data warehouse from scratch, or optimizing a legacy ETL pipeline of a cloud-native data warehouse.

In all these cases, data warehouse modernization requires business justification, for instance:

  • Application issues in the legacy data warehouse, such as data processing that is too slow with legacy batch workloads, resulting in wrong or conflicting information for business users.
  • Scalability issues in the on-premise data warehouse as the data volume grows too much.
  • Cost issues because the legacy data warehouse does not offer reasonable pricing with pay-as-you-go models.
  • Connectivity issues as legacy data warehouses were not built with an open API and data sharing in mind. Cloud-native data warehouses run on cost-efficient and scalable object storage, separate storage from computing, and allow data consumption and sharing. (but keep in mind that Reverse ETL is often an anti-pattern!)
  • A strategic move of all infrastructure to the cloud. The analytics platform is no exception if all old and new applications move to the cloud.

Cloud-native applications usually come with innovation, i.e., new business processes, data formats, and data-driven decision-making. From a data warehouse perspective, the best modernization is to start from scratch. Consume data directly from the existing data sources, ETL it, and do business intelligence on top of the new data structures.

I have seen many more projects where customers use change data capture (CDC) from Oracle databases (i.e., the leading core system) instead of trying to replicate data from the legacy data warehouse (i.e., the analytics platform), because this approach improves scalability and cost and simplifies the later shutdown of the legacy infrastructure.

Data warehouse migration: Continuous vs. cut-over

A real modernization (i.e., migration) from a legacy data warehouse to a cloud-native one is usually a continuous journey rather than a hard cut-over. The first project phase integrates the legacy data sources with the new data warehouse. The old and new data warehouse platforms operate in parallel, so that old and new business processes keep running. After some time (months or years later), when the business is ready to move, the old data warehouse is shut down after legacy applications have either been migrated to the new data warehouse or replaced with new applications:

Data Warehouse Offloading Integration and Replacement with Data Streaming 

My article “Mainframe Integration, Offloading and Replacement with Apache Kafka” illustrates this offloading and long-term migration process. Just scroll to the section “Mainframe Offloading and Replacement in the Next 5 Years” in that post and replace the term ‘mainframe’ with ‘legacy data warehouse’ in your mind.

A migration and cut-over is its own project and may or may not include the legacy data warehouse. Data lake modernization (e.g., from a self- or partially managed Cloudera cluster running on-premise in the data center to a fully managed Databricks or Snowflake cloud infrastructure) follows the same principles. And mixing the data warehouse (reporting) and data lake (big data analytics) into a single infrastructure does not change this either.

Data warehouse modernization is NOT a big bang and NOT a single tool approach!

Most data warehouse modernization projects are ongoing efforts over a long period. You must select a cloud-native data warehouse, get data into the new data warehouse from various sources, and optionally migrate away from legacy on-premise infrastructure.

Data streaming for data ingestion, business applications, or data sharing in real-time should always be a separate component in the enterprise architecture. It has very different requirements regarding SLAs, uptime, throughput, latency, etc. Putting all real-time and analytical workloads into the same cluster makes little sense from a cost, risk, or business value perspective. The idea of a modern data flow and building a data mesh is the separation of concerns with domain-driven design and focusing on data products (using different, independent APIs, technologies, and cloud services).

For more details, browse other posts of this blog series:

  1. Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
  2. Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
  3. THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
  4. Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
  5. Lessons Learned from Building a Cloud-Native Data Warehouse

What cloud-native data warehouse(s) do you use? How does data streaming fit into your journey? Did you integrate or replace your legacy on-premise data warehouse(s); or start from greenfield in the cloud? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Analytics vs. Transactions in Data Streaming with Apache Kafka
https://www.kai-waehner.de/blog/2022/03/09/analytics-vs-transactions-api-data-streaming-with-apache-kafka/ (Wed, 09 Mar 2022)

Workloads for analytics and transactions have very different characteristics and requirements. The use cases differ significantly. SLAs are very different, too. Many people think that Apache Kafka is not built for transactions and should only be used for big data analytics. This blog post explores when and how to use Kafka in resilient, mission-critical architectures and when to use the built-in Transaction API.

Apache Kafka Transactions API vs Big Data Lake and Batch Analytics

Analytical and transactional workloads

Let’s begin by defining the terms. The YouTube channel ‘Databases Demystified’ has a great episode: Analytical vs. Transactional. I use and enhance its explanation in the following subsections.

Some people refer to this as an “OLTP vs. OLAP” discussion:
  • In OLTP (online transaction processing), information systems typically facilitate and manage transaction-oriented applications.
  • In OLAP (online analytical processing), information systems generally execute much more complex queries, in a smaller volume, for the purpose of business intelligence or reporting rather than to process transactions.

There are some overlaps in some use cases and products. Hence, I use the more generic terms “transactions” and “analytics” in this blog post.

Analytical workloads

Analytical workloads have the following characteristics:

  • Processing large amounts of information for creating aggregates
  • Read-only queries and (usually) batch-write data loads
  • Supporting complex queries with multiple steps of data processing, join conditions, and filtering
  • Highly variable ad hoc queries, many of which may only be run once, ever
  • Not mission-critical, meaning downtime or data loss is not good, but in most cases not a disaster for the core business

Analytics solutions

Analytics solutions exist on-premises and in all major clouds. The tools differ regarding their capabilities and sweet spots. Examples include:

  • Redshift (Amazon Web Services)
  • BigQuery (Google Cloud)
  • Snowflake
  • Hive / HDFS / Spark
  • And many more!

Transactional workloads

Transactional workloads have unique characteristics and SLAs compared to analytical workloads:

  • Manipulating one object at a time (often across different systems)
  • Create, Read, Update, and Delete (CRUD) operations, inserting data one object at a time or updating existing data (often across different systems)
  • Precisely managing state with guarantees about what has or hasn’t been written to disk
  • Supporting many operations per second in real-time with high throughput
  • Mission-critical SLAs for uptime, availability, and latency of the end-to-end data communication

Transactional solutions

Transactional solutions include applications, databases, messaging systems, and integration middleware:

  • IBM Mainframe (including CICS, IMS, DB2)
  • TIBCO EMS
  • PostgreSQL
  • Oracle Database
  • MongoDB
  • And many more!

Often, a transactional workload has to guarantee ACID principles (i.e., all or nothing writes to different applications and technologies).

A mix of transactional and analytical workloads

Many solutions support a mix of transactional and analytical workloads.

For instance, many enterprises store transactional data in MongoDB but also process complex queries for analytics use cases in the same database. MongoDB started as a document-based NoSQL database. In the meantime, it has become a general-purpose database platform that also supports other forms of database queries, such as graph and tree traversal:

MongoDB Database Query Capabilities
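
As a small, hypothetical illustration of such a traversal query, here is a sketch using the MongoDB Java driver and its $graphLookup aggregation stage; the database, collection, and field names are made up, and a recent driver version is assumed.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import java.util.List;
import org.bson.Document;

public class MongoGraphTraversalSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) { // hypothetical cluster
            MongoCollection<Document> employees = client.getDatabase("hr").getCollection("employees");

            // $graphLookup recursively follows 'reportsTo' references to build the full reporting chain
            employees.aggregate(List.of(
                    Aggregates.graphLookup("employees", "$reportsTo", "reportsTo", "name", "reportingChain")
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```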

Hence, focus on the business problem first. Then, you can decide if your existing infrastructure can solve the problem or if you need yet another one. But there is no silver bullet. A vendor-independent best of breed approach works best in most enterprise architectures I see in the success stories from the field.

Data at Rest vs. Data in Motion

Batch vs. real-time data processing is an important discussion you should have in every project. Statements like “batch processing is for analytics, real-time processing is for transactions” are not always correct. Real-time beats slow data in almost all use cases from a business value perspective. Nevertheless, batch processing is the better approach for some specific use cases.

Analytics platforms for batch processing

Data at Rest means storing data in a database, data warehouse, or data lake. This means that the data is processed too late in many use cases – even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at Rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) require this approach… If you do it right! Data at Rest can be used for transactional workloads, too!

Apache Kafka for real-time data streaming

The Kafka API is the De Facto Standard API for Data in Motion, just like Amazon S3 is for object storage. Why is Kafka so successful? Real-time beats slow data in most use cases across industries.

The same cloud-native approach is required for event streaming as for the modern data lake. Building a scalable and cost-efficient infrastructure is crucial for the success of a project. Event streaming and data lake technologies are complementary, not competitive.

I will not explore the reasons and use cases for the success of Kafka in this post. Instead, check out my overview about Kafka use cases across industries for more details. Or read some of my vertical-specific blog posts.

In short, most added value comes from processing Data in Motion while it is relevant instead of storing Data at Rest and processing it later (or too late). Many analytical and transactional workloads use Kafka for this reason.

Apache Kafka for analytics

Even in 2022, many people think about Kafka as a data ingestion layer into data stores. This is still a critical use case. Enterprises use Kafka as the ingestion layer for different analytics platforms:

  • Batch reporting and dashboards
  • Interactive queries (using Tableau, Qlik, and similar tools)
  • Data preparation for batch calculations, model training, and other analytics
  • Connectivity into different data warehouses, data lakes, and other data sinks using a best of breed approach

But Kafka is much more than a messaging and ingestion layer. Here are a few analytics examples using Kafka for analytics (often with other analytics tools to solve a specific problem together):

  • Data integration for various source systems using Kafka Connect and pre-built connectors (including real-time, near real-time, batch, web service, file, and proprietary interfaces)
  • Decoupling and backpressure handling as the sink systems are often not ready for vast volumes of real-time data. Domain-driven design (DDD) for true decoupling is a crucial differentiator of Kafka compared to other middleware and message queues.
  • Data processing at scale in real-time: filtering, transforming, generalizing, or aggregating incoming data sets before ingesting them into sink systems (see the sketch after this list).
  • Real-time analytics applied within the Kafka application. Many analytics platforms were designed for near real-time or batch workloads but not for resilient model scoring with low latency, especially at scale. An example could be an analytic model trained with batch machine learning algorithms in a data lake with Spark MLlib or TensorFlow and then deployed into a Kafka Streams or ksqlDB application.
  • Replay historical events in cases such as onboarding a new consumer application, error handling, compliance or regulatory processing, or schema changes in an analytics platform. This becomes especially relevant if Tiered Storage is used under the hood of Kafka for cost-efficient and scalable long-term storage.
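
Here is a minimal Kafka Streams sketch of the filtering and aggregation idea from the list above: it curates a raw topic into pre-aggregated results that a sink connector could then ingest. Topic names and the aggregation logic are hypothetical, a recent Kafka Streams version is assumed, and ksqlDB could express the same pipeline in SQL.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PaymentsAggregationSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("payments")
                // Filter out irrelevant or test events before they ever reach the sink system
                .filter((customerId, payment) -> payment != null && !payment.contains("TEST"))
                .groupByKey()
                // Pre-aggregate per customer in five-minute windows
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                .count()
                .toStream()
                .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), String.valueOf(count)))
                // The curated topic is what a sink connector ingests into the data warehouse or lake
                .to("payments-per-customer-5m");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-aggregation");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```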

Analytics example with Confluent Cloud and AWS services

Here is an illustration from an AWS architecture combining Confluent and its ecosystem including connectors, stream processing capabilities, and schema management together with several 1st party AWS cloud services:

Real-time analytics with Kafka Confluent Cloud and AWS

As you can see, Kafka is an excellent tool for analytical workloads. It is not a silver bullet but is used for the appropriate parts of the overall data management architecture. I have another blog post that explores the relationship between Kafka and other serverless analytics platforms.

However, Kafka is NOT just used for analytical workloads!

Apache Kafka for transactions

Around 60 to 70% of use cases and deployments I see at customers across the globe leverage the Kafka ecosystem for transactional workloads. Enterprises use Kafka for:

  • core banking platforms
  • fraud detection
  • global replication of order and inventory information
  • integration with business-critical platforms like CRM, ERP, MES, and many other transactional systems
  • supply chain management
  • customer communication like point-of-sale integration or context-specific upselling
  • and many other use cases where every single event counts.

Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). Zero downtime and zero data loss can be guaranteed, just like with your favorite database, mainframe, or other core platforms.

Elastic scalability and rolling upgrades allow building a flexible and reliable data streaming infrastructure for transactional workloads to guarantee business continuity. The architect can even stretch a cluster across regions to ensure zero data loss and zero downtime even in case of a disaster where a data center is completely down. The post “Global Kafka Deployments” explores the different deployment options and their trade-offs in more detail.

Kafka Transactions API example

And even better: Kafka’s Transaction API, i.e., Exactly-Once Semantics (EOS), has been available since Kafka 0.11 (which went GA a long time ago). EOS makes building transactional workloads even easier as you don’t need to handle duplicates anymore.

Kafka now supports atomic writes across multiple partitions through the transactions API. This allows a producer to send a batch of messages to multiple partitions. Either all messages in the batch are eventually visible to any consumer, or none are ever visible to consumers. Here is an example:

Transaction API in Apache Kafka
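
Here is a minimal sketch of the producer side of the Transactions API. Topic names, keys, and the transactional.id are hypothetical placeholders; consumers that should only see committed data additionally set isolation.level=read_committed.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // A stable transactional.id enables exactly-once semantics across producer restarts
        props.put("transactional.id", "payments-producer-1"); // hypothetical id
        props.put("enable.idempotence", "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes become visible to read_committed consumers atomically - or not at all
                producer.send(new ProducerRecord<>("payments", "customer-42", "debit:100"));
                producer.send(new ProducerRecord<>("audit-log", "customer-42", "debit-recorded"));
                producer.commitTransaction();
            } catch (KafkaException e) {
                // Non-fatal errors can be aborted; fatal errors require closing the producer
                producer.abortTransaction();
            }
        }
    }
}
```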

Kafka provides a built-in transactions API. And the performance impact (that many people are worried about) is minimal. Here is a simple rule of thumb: If you care about exactly-once semantics, simply activate it! If you run into performance issues, you can still fine-tune your application or disable the feature later. But most projects are fine with the minimal performance trade-offs versus the enormous benefit of handling transactional behavior out-of-the-box.

Nevertheless, to be clear: You don’t need to use Kafka’s Transaction API to build mission-critical, transactional workloads.

SAGA design pattern for transactional data in Kafka without transactions

The Kafka Transactions API is optional. As discussed above, Kafka is resilient without transactions. However, eliminating duplicates is then your task. Exactly-once semantics solve this problem out-of-the-box across all Kafka components. Kafka Connect, Kafka Streams, ksqlDB, and different clients like Java, C++, .NET, and Go support EOS.
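
Enabling exactly-once semantics is mostly a configuration concern. The following sketch shows the relevant settings for a plain consumer and a Kafka Streams application; group and application ids are hypothetical, and a recent Kafka version with exactly_once_v2 is assumed.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceConfigSketch {
    public static void main(String[] args) {
        // Plain consumer: only read messages from committed transactions
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-detection");          // hypothetical group
        consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        // Kafka Streams: end-to-end exactly-once processing with a single setting
        Properties streamsProps = new Properties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-detection-app");
        streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        streamsProps.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        System.out.println("Consumer isolation level: " + consumerProps.get(ConsumerConfig.ISOLATION_LEVEL_CONFIG));
        System.out.println("Streams guarantee: " + streamsProps.get(StreamsConfig.PROCESSING_GUARANTEE_CONFIG));
    }
}
```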

However, I am also not saying that you should always use the Kafka Transaction API or that it solves every transactional problem. Keep in mind that scalable distributed systems require other design patterns than a traditional “Oracle to IBM MQ transaction”.

Some business transactions span multiple services. Hence, you need a mechanism to implement transactions that span services. A familiar design pattern and implementation for such a transactional workload is the SAGA pattern with a stateful orchestration application.
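
A complete SAGA implementation is beyond the scope of this post, but the following highly simplified Kafka Streams sketch shows the core idea of stateful orchestration: the saga state is tracked per business key in a fault-tolerant state store, and the orchestrator emits the next (or a compensating) command as an event. Topic names, state values, and the decision logic are hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class SagaOrchestratorSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Participating services report their step results here (hypothetical topic and payload format)
        builder.<String, String>stream("saga-step-results")
                .groupByKey()
                // Track the saga state per order id in a fault-tolerant, changelog-backed state store
                .aggregate(
                        () -> "STARTED",
                        (orderId, stepResult, state) ->
                                stepResult.contains("FAILED") ? "COMPENSATING" : state + "|" + stepResult)
                .toStream()
                // Emit the next command - or a compensation command - as an event for the services
                .mapValues(state -> state.startsWith("COMPENSATING") ? "ROLLBACK" : "CONTINUE")
                .to("saga-commands");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "saga-orchestrator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Exactly-once processing keeps the state store and the output topic consistent
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        new KafkaStreams(builder.build(), props).start();
    }
}
```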

Swisscom’s Custodigit is an excellent example of such an implementation leveraging Kafka Streams. It is a modern banking platform for digital assets and cryptocurrencies that provides crucial features and guarantees for seriously regulated crypto investments – more details in my blog post about Blockchain, Crypto, NFTs, and Kafka.

And yes, there are always trade-offs between the Kafka Transaction API with exactly-once semantics, stateful orchestration in a separate application, and two-phase-commit transactions as used by Oracle DB and IBM MQ. Choose the right tool to define the appropriate enterprise architecture!

Kafka with other data stores and streaming engines

Most enterprises use Kafka as the central scalable real-time data hub. Hence, use cases include analytical and transactional workloads.

Most Kafka projects I see today also leverage Kafka Connect for data integration, Kafka Streams/ksqlDB for continuous data processing, and Schema Registry for data governance.

Thus, with Kafka, one (distributed and scalable) infrastructure enables messaging, storage, integration, and data processing. But of course, most Kafka clusters connect to other applications (like SAP or Salesforce) and data management systems (like MongoDB, Snowflake, Databricks, et al.) for analytics:

Real-time decision making for claim processing and fraud detection in insurance with Apache Kafka

In its own blog post, I explored in detail why Kafka is a database for some specific use cases but will NOT replace other databases and data lakes.

In addition to Kafka-native stream processing engines like Kafka Streams or ksqlDB, other streaming analytics frameworks like Apache Flink or Spark Streaming can easily be connected for transactional or analytical workloads. Just keep in mind that especially transactional workloads get harder end-to-end with every additional system and infrastructure you add to the enterprise architecture.

Kappa architecture for analytics AND transactions with Kafka as the data hub

Real-time data beats slow data. That’s true for almost every use case. Yet, enterprise architects build new infrastructures with the Lambda architecture that includes a separate batch layer for analytics and a real-time layer for transactional workloads.

A single real-time pipeline, called Kappa architecture, is the better fit. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa but also show how batch processing still fits into this discussion without a Lambda architecture. In its dedicated post, learn how a Kappa architecture can revolutionize how you build analytical and transactional workloads with the same scalable real-time data hub powered by Kafka.

How do you leverage data streaming for analytical or transactional workloads? Do you use exactly-once semantics to ease the implementation of transactions? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Serverless Kafka in a Cloud-native Data Lake Architecture
https://www.kai-waehner.de/blog/2021/06/25/serverless-kafka-data-lake-cloud-native-hybrid-lake-house-architecture/ (Fri, 25 Jun 2021)

Apache Kafka became the de facto standard for processing data in motion. Kafka is open, flexible, and scalable. Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use a serverless Kafka SaaS offering to focus on business logic. However, hybrid scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden. This blog post explores how to leverage cloud-native and serverless Kafka offerings in a hybrid cloud architecture. We start from the perspective of data at rest with a data lake and explore its relation to data in motion with Kafka.

Serverless Kafka for Data in Motion as Rescue for Data at Rest in the Data Lake

Data at Rest – Still the Right Approach?

Data at Rest means storing data in a database, data warehouse, or data lake. This means that the data is processed too late in many use cases – even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) require this approach… If you do it right!

The Wrong Approach of Cloudera’s Data Lake

A data lake was introduced into most enterprises by Cloudera and Hortonworks (plus several partners like IBM) years ago. Most companies had a big data vision (but they had no idea how to gain business value out of it). The data lake consists of 20+ different open source frameworks.

New frameworks are added when they emerge so that the data lake is up-to-date. The only problem: Still no business value. Plus vendors that have no good business model. Just selling support does not work, especially if two very similar vendors compete with each other. The consequence was the Cloudera/Hortonworks merger. And a few years later, the move to private equity.

In 2021, Cloudera still supports so many different frameworks, including many data lake technologies but also event streaming platforms such as Storm, Kafka, Spark Streaming, Flink. I am surprised how one relatively small company can do this. Well, honestly, I am not amazed. TL;DR: They can’t! They know each framework a little bit (and only the dying Hadoop ecosystem very well). This business model does not work. Also, in 2021, Cloudera still does not have a real SaaS product. This is no surprise either. It is not easy to build a true SaaS offering around 20+ frameworks.

Hence, my opinion is confirmed: It is better to do one thing right if you are a relatively small company instead of trying to do everything.

The Lake House Strategy from AWS

That’s why data lakes are built today with other vendors: The major cloud providers (AWS, GCP, Azure, Alibaba), MongoDB, Databricks, Snowflake. All of them have their specific use cases and trade-offs. However, all of them have in common that they have a cloud-first strategy and a serverless SaaS offering for their data lake.

Let’s take a deeper look at AWS to understand what a modern, cloud-native strategy with a good business model looks like in 2021.

AWS is the market leader for public cloud infrastructure. Additionally, AWS invents new infrastructure categories regularly. For example, EC2 instances started the cloud era and enabled agile and elastic compute power. S3 became the de facto standard for object storage. Today, AWS has hundreds of innovative SaaS services.

AWS’ data lake strategy is based on the new buzzword Lake House:

Lake House architecture

As you can see, the key message is that one solution cannot solve all problems. However, even more important, all of these problems are solved with cloud-native, serverless AWS solutions:

Lake House architecture on AWS

This is how a cloud-native data lake offering in the public cloud has to look – obviously, other hyperscalers like GCP and Azure head in the same direction with their serverless offerings.

Unfortunately, the public cloud is not the right choice for every problem due to latency, security, and cost reasons.

Hybrid and Multi-Cloud become the Norm

In recent years, many new innovative solutions have tackled another market: edge and on-premise infrastructure. Some examples: AWS Local Zones, AWS Outposts, AWS Wavelength. I really like AWS’ innovative approach of creating new infrastructure and software categories. Most cloud providers have very similar offerings, but AWS often rolls them out first, and others more or less copy them. Hence, I focus on AWS in this post.

Having said this, each cloud provider has specific strengths. GCP is well known for its leadership in open source-powered services around Kubernetes, TensorFlow, and others. IBM and Oracle are best in providing services and infrastructure for their own products.

I see demand for multiple cloud providers everywhere. Most customers I talk to have a multi-cloud strategy using AWS and other vendors like Azure, GCP, IBM, Oracle, Alibaba, etc. Good reasons exist to use different cloud providers, including cost, data locality, disaster recovery across vendors, vendor independence, historical reasons, dedicated cloud-specific services, etc.

Fortunately, the serverless Kafka SaaS Confluent Cloud is available on all major clouds. Similar examples are therefore available for using the fully-managed Kafka ecosystem with Azure and GCP. So, let’s finally get to the Kafka part of this post… 🙂

From “Data at Rest” to “Data in Motion”

This was a long introduction before we actually come back to serverless Kafka. But only with background knowledge is it possible to understand the rise of data in motion and the need for cloud-native and serverless services.

Let’s start with the key messages to point out:

  • Real-time beats slow data in most use cases across industries.
  • For event streaming, the same cloud-native approach is required as for the modern data lake.
  • Event streaming and data lake / lake house technologies are complementary, not competitive.

The rise of event-driven architectures and data in motion powered by Apache Kafka enables enterprises to build real-time infrastructure and applications.

Apache Kafka – The De Facto Standard for Data in Motion

I will not explore the reasons and use cases for the success of Kafka in this post. Instead, check out my overview about Kafka use cases across industries for more details. Or read some of my vertical-specific blog posts.

In short, most added value comes from processing data in motion while it is relevant instead of storing data at rest and processing it later (or too late). The following diagram from Forrester’s Mike Gualtieri shows this very well:

Forrester - Business Value of Real Time Data

The Kafka API is the De Facto Standard API for Data in Motion, just like Amazon S3 is for Object Storage:

Apache Kafka is the Open Source De Facto Standard API for Event Streaming and Data in Motion

While I understand that vendors such as Snowflake and MongoDB would like to get into the data in motion business, I doubt this makes much sense. As discussed earlier in the Cloudera section, it is best to focus on one thing and do that very well. That’s obviously why Confluent has strong partnerships not just with the cloud providers but also with Snowflake and MongoDB.

Apache Kafka is the battle-tested and scalable open-source framework for processing data in motion. However, it is more like a car engine.

A Complete, Serverless Kafka Platform

As I talk about cloud, serverless, AWS, etc., you might ask yourself: “Why even think about Kafka on AWS at all if you can simply use Amazon MSK?” – that’s the obligatory, reasonable question!

The short answer: Amazon MSK is a PaaS and NOT a fully managed, serverless Kafka SaaS offering.

Here is a simple counterquestion: Do you prefer to buy

  1. a battle-tested car engine (without wheels, brakes, lights, etc.)
  2. a complete car (including mature and automated security, safety, and maintenance)
  3. a self-driving car (including safe automated driving without the need to steer the car, refueling, changing brakes, product recalls, etc.)

Comparison of Apache Kafka Products and Services

In the Kafka world, you only get a self-driving car from Confluent. That’s not a sales or marketing pitch, but the truth. All other cloud offerings provide you with a largely self-managed product where you need to choose the brokers, fix bugs, do performance tuning, etc., by yourself. This is also true for Amazon MSK. Hence, I recommend evaluating the different offerings to understand if “fully managed” or “serverless” is just a marketing term or reality.

Check out “Comparison of Open Source Apache Kafka vs. Vendors including Confluent, Cloudera, Red Hat, Amazon MSK” for more details.

No matter if you want to build a data lake / lake house architecture, integrate with other 3rd party applications, or build new custom business applications: Serverless is the way to go in the cloud!

Serverless, fully-managed Kafka

A fully managed serverless offering is the best choice if you are in the public cloud. No need to worry about operations efforts. Instead, focus on business problems using a pay-as-you-go model with consumption-based pricing and mission-critical SLAs and support.

Serverless Fully Managed Apache Kafka

A truly fully managed and serverless offering does not give you access to the server infrastructure. Or do you get access to your AWS S3 object storage or Snowflake server configs? No, because then you would have to worry about operations and could potentially impact or even destroy the cluster.

Self-managed Cloud-native Kafka

Not every Kafka cluster runs in the public cloud. Therefore, some Kafka clusters require partial management by their own operations team. I have seen plenty of enterprises struggling with Kafka operations themselves, especially if the use case is not just data ingestion into a data lake but mission-critical transactional or analytical workloads.

A cloud-native Kafka supports the operations team with automation. This reduces risks and efforts. For example, self-balancing clusters take over the rebalancing of partitions. Automated rolling upgrades allow you to upgrade to every new release instead of running an expensive and risky migration project (often with the consequence of not migrating to new versions at all). Separation of compute and storage (with Tiered Storage) enables large but also cost-efficient Kafka clusters with Terabytes or even Petabytes of data. And so on.

Oh, by the way: A cloud-native Kafka cluster does not have to run on Kubernetes. Ansible or plain container/bare-metal deployments are other common options to deploy Kafka in your own data center or at the edge. But Kubernetes definitely provides the best cloud-native experience regarding automation with elastic scale. Hence, various Kafka operators (based on CRDs) were developed in the past years, for instance, Confluent for Kubernetes or Red Hat’s Strimzi.

Kafka is more than just Messaging and Data Ingestion

Last but not least, let’s be clear: Kafka is more than just messaging and data ingestion. I see most Kafka projects today also leverage Kafka Connect for data integration and/or Kafka Streams/ksqlDB for continuous data processing. Thus, with Kafka, one single (distributed and scalable) infrastructure enables messaging, storage, integration, and processing of data:

Real-time decision making for claim processing and fraud detection in insurance with Apache Kafka

A fully-managed Kafka platform does not just operate Kafka but the whole ecosystem. For instance, fully-managed connectors enable serverless data integration with native AWS services such as S3, Redshift or Lambda, and non-AWS systems such as MongoDB Atlas, Salesforce or Snowflake. In addition, fully managed streaming analytics with ksqlDB enables continuous data processing at scale.

A complete Kafka platform provides the whole ecosystem, including security (role-based access control, encryption, audit logs), data governance (schema registry, data quality, data catalog, data lineage), and many other features like global resilience, flexible DevOps automation, metrics, monitoring, etc.

Example 1: Event Streaming + Data Lake / Lake House

The following example shows how to use a complete platform to do real-time analytics with various Confluent components plus integration with AWS lake house services:

Real-time analytics with Confluent Cloud and AWS

Ingest & Process

Capture event streams with a consistent data structure using Schema Registry, develop real-time ETL pipelines with a lightweight SQL syntax using ksqlDB & unify real-time streams with batch processing using Kafka Connect Connectors.
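
As a minimal sketch of the “consistent data structure” part, here is a hypothetical Java producer that serializes events with Avro via Confluent’s KafkaAvroSerializer, which registers and validates the schema against Schema Registry. The topic, schema, and endpoints are placeholders, and a recent JDK plus the Confluent serializer dependency are assumed.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroSchemaRegistryProducer {
    private static final String ORDER_SCHEMA = """
        {"type":"record","name":"Order","fields":[
          {"name":"order_id","type":"string"},
          {"name":"amount","type":"double"}]}
        """;

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                       // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers/validates the schema against Schema Registry
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");              // hypothetical registry endpoint

        Schema schema = new Schema.Parser().parse(ORDER_SCHEMA);
        GenericRecord order = new GenericData.Record(schema);
        order.put("order_id", "order-42");
        order.put("amount", 99.95);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // Consumers and sink connectors can rely on the registered schema for compatibility checks
            producer.send(new ProducerRecord<>("orders", "order-42", order));
            producer.flush();
        }
    }
}
```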

Store & Analyze

Stream data with pre-built Confluent connectors into your AWS data lake or data warehouse to execute queries on vast amounts of streaming data for real-time and batch analytics.

This example shows very well how data lake or lake house services and event streaming complement each other. All services are SaaS. Even the integration (powered by Kafka Connect) is serverless.

Example 2: Serverless Application and Microservice Integration

The following example shows how to use a complete platform to integrate existing and build new applications and serverless microservices with various Confluent and AWS services:

Serverless Application and Microservice Integration with Kafka Confluent Cloud and AWS

Serverless integration

Connect existing applications and data stores in a repeatable way without having to manage and operate anything. Apache Kafka and Schema Registry ensure app compatibility is maintained. ksqlDB allows the development of real-time apps with SQL syntax. Kafka Connect provides effortless integrations with Lambda & data stores.

AWS serverless platform

Stop provisioning, maintaining, or administering servers for backend components such as compute, databases and storage so that you can focus on increasing agility and innovation for your developer teams.

Kafka Everywhere – Cloud, On-Premise, Edge

The public cloud is the future of the data center. However, two main reasons exist NOT to run everything in a public cloud infrastructure:

  • Brownfield architectures: Many enterprises have plenty of applications and infrastructure in data centers. Hybrid cloud architectures are the only option. Think about mainframes as one example.
  • Edge use cases: Some scenarios do not make sense in the public cloud due to cost, latency, security or legal reasons. Think about smart factories as one example.

Multi-cluster and cross-data center deployments of Apache Kafka have become the norm rather than an exception. Several scenarios require multi-cluster solutions, including disaster recovery, aggregation for analytics, cloud migration, mission-critical stretched deployments and global Kafka. Learn more about this in the blog post “Architecture patterns for distributed, hybrid, edge, and global Apache Kafka deployments“.

Various AWS infrastructures enable the deployment of Kafka outside the public cloud. Confluent Platform is certified on AWS Outposts and therefore runs on various AWS hardware offerings.

Example 3: Hybrid Integration with Kafka-native Cluster Linking

Here is an example for brownfield modernization:

Accelerate modernization from on-prem to AWS with Kafka and Confluent Cluster Linking

Connect

Pre-built connectors continuously bring valuable data from existing services on-prem, including enterprise data warehouses, databases, and mainframes. In addition, bi-directional communication is also possible where needed.

Bridge

Hybrid cloud streaming enables consistent, reliable replication in real-time to build a modern event-driven architecture for new applications and the integration with 1st and 3rd party SaaS interfaces.

Modernize

Public cloud infrastructure increases agility in getting applications to market and reduces TCO by freeing up resources to focus on value-generating activities instead of managing servers.

Example 4: Low-Latency Kafka with Cloud-native 5G Infrastructure on AWS Wavelength

Low Latency data streaming requires infrastructure that runs close to the edge machines, devices, sensors, smartphones, and other interfaces. AWS Wavelength was built for these scenarios. The enterprise does not have to install its own IT infrastructure at the edge.

The following architecture shows an example built by Confluent, AWS, and Verizon:

Low Latency 5G Use Cases with AWS Wavelength based on AWS Outposts and Confluent

Learn more about low latency use cases for data in motion in the blog post “Low Latency Data Streaming with Apache Kafka and Cloud-Native 5G Infrastructure“.

Live Demo – Hybrid Cloud Replication

Here is a live demo I recorded to show the streaming replication between an on-premise Kafka cluster and Confluent Cloud, including stream processing with ksqlDB and data integration with Kafka Connect (using the fully-managed AWS S3 connector):

Slides from Joint AWS + Confluent Webinar on Serverless Kafka

I recently did a webinar together with AWS to present the added value of combining Confluent and AWS services. This became a standard pattern in the last years for many transactional and analytical workloads.

Here are the slides:

Reverse ETL and its Relation to Data Lakes and Kafka

To conclude this post, I want to explore one more term you might have heard about. The buzzword is still in an early stage, but more and more vendors pitch a new trend: Reverse ETL. In short, this means storing data in your favorite long-term storage (database / data warehouse / data lake / lake house) and then getting the data out of there again to connect to other business systems.

In the Kafka world, this is the same as Change Data Capture (CDC). Hence, Reverse ETL is not a new thing. Confluent provides CDC connectors for the many relevant systems, including Oracle, MongoDB, and Salesforce.

As I mentioned earlier, data storage vendors try to get into the data in motion business. I am not sure where this will go. Personally, I am convinced that the event streaming platform is the right place in the enterprise architecture for processing data in motion. This way, every application can consume the data in real-time – decoupled from any other database or data lake. But the future will tell. I thought it was a good ending to this blog post to clarify this new buzzword and its relation to Kafka. I wrote more about this topic in its own article: “Apache Kafka and Reverse ETL“.

Serverless and Cloud-native Kafka with AWS and Confluent

Cloud-first strategies are the norm today. Whether the use case is a new greenfield project, a brownfield legacy integration architecture, or a modern edge scenario with hybrid replication. Kafka became the de facto standard for processing data in motion. However, Kafka is just a piece of the puzzle. Most enterprises prefer a complete, cloud-native service.

AWS and Confluent are a proven combination for various use cases across industries to deploy and operate Kafka environments everywhere, including truly serverless Kafka in the public cloud and cloud-native Kafka outside the public cloud. Of course, while this post focuses on the relation between Confluent and AWS, Confluent has similar strong partnerships with GCP and Azure to provide data in motion on these hyperscalers, too.

How do you run Kafka in the cloud today? Do you operate it by yourself? Do you use a partially managed service? Or do you use a truly serverless offering? Are your projects greenfield, or do you use hybrid architectures? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.
