Business Intelligence Archives - Kai Waehner

The Heart of the Data Mesh Beats Real-Time with Apache Kafka

Kai Waehner — Thu, 28 Jul 2022 06:08:38 +0000

If there were a buzzword of the hour, it would undoubtedly be “data mesh”! This new architectural paradigm unlocks analytic and transactional data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios. The data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. Learn how the de facto standard for data streaming, Apache Kafka, plays a crucial role in building a data mesh.

There is no single technology or product for a data mesh!

This post explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and – complemented by many other data platforms like a data warehouse, data lake, and lakehouse – solve real business problems.

There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products with the right tool for the job. A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.

What is a data mesh?

I won’t write yet another article describing the concepts of a data mesh. Zhamak Dehghani coined the term in 2019. The following data mesh architecture from a 30,000-foot view explains the basic idea well:

I summarize data mesh in the following four bullet points:

An architecture paradigm with several historical influences (domain-driven design, microservices, data marts, data streaming)
Not specific to a single technology or product; no single vendor can implement a data mesh alone
Handling data as a product is a fundamental change, enabling a more flexible architecture and independent solving of separate business problems
Decentralized services, not just analytics but also transactional workloads

Why handle data as a product?

Talking about innovative technology is insufficient to introduce a new architectural paradigm. Consequently, measuring the business value of the enterprise architecture is critical, too.

McKinsey finds that “when companies instead manage data like a consumer product—be it digital or physical—they can realize near-term value from their data investments and pave the way for quickly getting more value tomorrow. Creating reusable data products and patterns for piecing together data technologies enables companies to derive value from data today and tomorrow”:

For McKinsey, the benefits of this approach can be significant:

New business use cases can be delivered as much as 90 percent faster.
The total cost of ownership, including technology, development, and maintenance, can decline by 30 percent.
The risk and data-governance burden can be reduced.

What is data streaming with Apache Kafka and its relation to data mesh?

A data mesh enables flexibility through decentralization and best-of-breed data products. The heart of data sharing requires reliable real-time data at any scale between data producers and data consumers. Additionally, true decoupling between the decentralized data products is key to the success of the data mesh paradigm. Each domain must have access to shared data but also the ability to choose the right tool (i.e., technology, API, product, or SaaS) to solve its business problems.

That’s where data streaming fits into the data mesh story:

The de facto standard for data streaming is Apache Kafka. A cloud-native data streaming infrastructure that can link clusters with each other out-of-the-box enables building a modern data mesh. No Data Mesh will use just one technology or vendor. Learn from inspiring posts from your favorite data products vendors like AWS, Snowflake, Databricks, Confluent, and many more to successfully define and build your custom Data Mesh. Data Mesh is a journey, not a big bang. A data warehouse or data lake (or in modern days, a lakehouse) cannot be the only infrastructure for data mesh and data products.

I covered how to leverage the capabilities of Apache Kafka and its ecosystem like Kafka Connect, ksqlDB, Cluster Linking, etc. to build the heart of a data mesh in a separate blog post: Streaming Data Exchange with Kafka and a Data Mesh in Motion.

Example: Real-time data fabric in hybrid cloud

Here is one example spanning a streaming Data Mesh across multiple cloud providers like AWS, Azure, GCP, or Alibaba, and on-premise / edge sites:

This example shows all the characteristics discussed in the above sections for a Data Mesh:

Decentralized real-time infrastructure across domains and infrastructures
True decoupling between domains within and between the clouds
Several communication paradigms, including data streaming, RPC, and batch
Data integration with legacy and cloud-native technologies
Continuous stream processing where it adds value, and batch processing in some analytics sinks

Presentation: Building a decentralized data mesh with data streaming at its heart

The following slide deck walks you through the motivation, principles, and architectures of building a real-time data mesh powered by Apache Kafka using the Kappa architecture, hybrid cloud, and stream data sharing:

The data mesh provides flexibility and freedom of technology choice for each data product

The heart of a decentralized data mesh infrastructure must be real-time, reliable, and scalable. As the de facto standard for data streaming, Apache Kafka plays a crucial role in a cloud-native data mesh architecture. Nevertheless, data mesh is not bound to a specific technology. The beauty of the decentralized architecture is the freedom of technology choice for each business unit when building its data products.

Data sharing between domains within and across organizations is another aspect where data streaming helps in a data mesh. Real-time data beats slow data. That is not just true for most business problems across industries but also for replicating data between data centers, clouds, regions, or organizations. A streaming data exchange enables data sharing in real-time to build a data mesh in motion.

Did you already start building your Data Mesh? What does the enterprise architecture look like? What frameworks, products, and cloud services do you use? Is the heart of your data mesh real-time in motion or some lakehouse at rest? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Heart of the Data Mesh Beats Real-Time with Apache Kafka appeared first on Kai Waehner.

Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization

Kai Waehner — Mon, 18 Jul 2022 08:48:48 +0000

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 4: Case Studies for cloud-native data streaming and data warehouse modernization.

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
THIS POST: Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
Lessons Learned from Building a Cloud-Native Data Warehouse

Stay tuned for a dedicated blog post for each topic as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Case studies: Cloud-native data streaming for data warehouse modernization

Every project is different. This is true for data streaming, analytics, and other software development. The following shows five case studies with significantly different architectures and technologies for data warehouse modernization. The examples come from various verticals: Software and cloud business, financial services, logistics and transportation, and the travel and accommodation industry.

Confluent: Data warehouse modernization from batch ETL with Stitch to streaming ETL with Kafka

The article “Streaming ETL SFDC Data for Real-Time Customer Analytics” explores how Confluent eats its own dog food to modernize the internal data warehouse pipeline.

The use case is straightforward and standard across most organizations: Extract, transform, and load (ETL) Salesforce data into a Google BigQuery data warehouse, so that the business can use the data. But it is more complex than it sounds:

Organizations often rely on a third-party ETL tool to periodically load data from a CRM and other applications to their data warehouse. These batch tools introduce a lag between when the business events are captured in Salesforce and when they are made available for consumption and processing. The batch workloads commonly result in discrepancies between Salesforce reports and internal dashboards, leading to concerns about the integrity and reliability of the data.

Confluent used Talend’s Stitch batch ETL tool in the beginning. The old architecture looked like this:

Batch ETL with a third-party tool in the middle led to insufficient and inconsistent information updates.

Over the past few years, Confluent has invested in building stream processing capabilities into the internal data warehouse pipeline. Confluent leverages its own fully managed Confluent Cloud connectors (in this case, the Salesforce CDC source and BigQuery sink connectors), Schema Registry for data governance, and ksqlDB + Kafka Streams for reliable streaming ETL to send SFDC data to BigQuery. Here is the modernized architecture:

PayPal: Reducing the time for readouts from 12 hours to a few seconds for 30 billion events per day

PayPal has plenty of Kafka projects for many critical and analytical workloads. In this use case, it scales the Kafka consumer to 30-35 billion events per day to migrate its analytical workloads to the Google Cloud Platform (GCP).

A streaming application ingests the events from Kafka directly to BigQuery. This is a critical project for PayPal as most of the analytical readouts are based on this. The outcome of the data warehouse modernization and building a cloud-native architecture: Reduce the time for readouts from 12 hours to a few seconds.
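To get a feel for this scale, here is a back-of-the-envelope conversion of the daily volume into an average per-second rate. It is only a sketch: it assumes the events are spread evenly over the day (real traffic will peak higher), and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_events_per_second(events_per_day: float) -> float:
    """Average rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


low = average_events_per_second(30e9)   # lower bound from the case study
high = average_events_per_second(35e9)  # upper bound from the case study
print(f"{low:,.0f} to {high:,.0f} events per second on average")
# prints "347,222 to 405,093 events per second on average"
```

Even as an average, that is several hundred thousand events per second that the consumer side has to sustain.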

Read more about this success story in the PayPal Technology Blog.

Shippeo: From on-premise databases to multiple cloud-native data lakes

Shippeo provides real-time and multimodal transportation visibility for logistics providers, shippers, and carriers. Its software uses automation and artificial intelligence to share real-time insights, enable better collaboration, and unlock your supply chain’s full potential. The platform can give instant access to predictive, real-time information for every delivery.

Shippeo described how they integrated traditional databases (MySQL and PostgreSQL) and cloud-native data warehouses (Snowflake and BigQuery) with Apache Kafka and Debezium:

This is an excellent example of cloud-native enterprise architecture leveraging a “best of breed” approach for data warehousing and analytics. Kafka decouples the analytical workloads from the transactional systems and handles the backpressure for slow consumers.

Sykes Cottages: Fully-managed end-to-end pipeline with Confluent Cloud, Kafka Connect, Snowflake

Sykes Holiday Cottages are one of the UK’s leading and fastest-growing independent holiday cottage rental agencies representing over 19,000 cottages across the UK, Ireland, and New Zealand.

The experience of its customers on the web is a top priority and is one way to stay competitive. The goal is to match customers to their perfect holiday cottage experience and delight at each stage along the way. Getting the data pipeline to fuel this innovation is critical. Data warehouse modernization and data streaming enabled new ways to further innovate the web experience through a data-driven approach.

From inconsistent and slow batch workloads…

While serving its purpose for several years, the existing pipeline had problems impairing this cycle. Very early in this pipeline, the ETL process turned the data into rows and columns (structured data). Various copies were made, and the results were presented via a static report. Data engineers were needed for changes, such as new events or contextual information. Scaling was also challenging because most of this had to be done manually.

By keeping the data in a semi-structured format until it is ingested into the warehouse and then using ELT to do a single transformation of the data, Sykes Holiday Cottages can simplify the pipeline and make it much more agile.

… to event-based real-time updates and continuous stream processing

New web events (and any context that goes with it) can be wrapped up within a message and can flow all the way to the warehouse without a single code change. The new events are then available to the web teams either through a query or the visualization tool.
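The idea of wrapping any event in a generic message and only interpreting it at query time can be sketched in a few lines. This is not Sykes' actual implementation; the envelope field names (`event_type`, `captured_at`, `payload`) and both helper functions are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def wrap_event(event_type: str, payload: dict) -> str:
    """Wrap an arbitrary web event in a generic envelope (hypothetical field names)."""
    envelope = {
        "event_type": event_type,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,  # kept as-is; the pipeline needs no per-event schema
    }
    return json.dumps(envelope)


def extract_field(message: str, path: str, default=None):
    """ELT-style extraction at query time: walk a dotted path into the payload."""
    node = json.loads(message)["payload"]
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# An existing event and a brand-new one travel through the same code path.
search = wrap_event("search", {"region": "Cornwall", "guests": 4})
wishlist = wrap_event("wishlist_add", {"cottage": {"id": 123, "sleeps": 6}})

print(extract_field(search, "region"))           # Cornwall
print(extract_field(wishlist, "cottage.sleeps"))  # 6
print(extract_field(search, "cottage.sleeps"))    # None
```

The pipeline code never changes when a new event type shows up. Only the queries on top of the semi-structured data need to know about the new fields.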

The current throughput is around 50k (peaking at over 300k) messages per minute. As new events are captured, this will grow considerably. Additionally, each of the above components must scale accordingly.

The new architecture enables the web teams to capture new events and analyze the data using self-service tools with no dependency on data engineering.

In conclusion, the business case for doing this is compelling. Based on our testing and projections, we expect at least 10x ROI over three years for this investment.

Learn more details in Sykes Holiday Cottages’ blog post: Why Sykes Cottages partnered with Snowflake and Confluent to drive enhanced customer experience.

Doordash: From multiple pipelines to data streaming for Snowflake integration

Even digital natives, which started their business in the cloud without legacy applications in their own data centers, need to modernize the enterprise architecture to improve business processes, reduce costs, and provide real-time information to their downstream applications.

It is cost inefficient to build multiple pipelines that are trying to achieve similar purposes. Doordash used cloud-native AWS messaging and streaming systems like Amazon SQS and Amazon Kinesis for data ingestion into the Snowflake data warehouse:

Mixing different kinds of data transport and going through multiple messaging/queueing systems without carefully designed observability around it leads to difficulties in operations.

These issues resulted in high data latency, significant cost, and operational overhead at Doordash. Therefore, Doordash moved to a cloud-native streaming platform powered by Apache Kafka and Apache Flink for continuous stream processing before ingesting data into Snowflake:

The move to a data streaming platform provides many benefits to Doordash:

Heterogeneous data sources and destinations, including REST APIs using the Confluent REST Proxy
Easily accessible
End-to-end data governance with schema enforcement and schema evolution with Confluent Schema Registry
Scalable, fault-tolerant, and easy to operate for a small team
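Schema enforcement and evolution can be illustrated with a simplified backward-compatibility check. This is a sketch and not how Confluent Schema Registry is implemented: the dictionary-based schema format and the function name are assumptions, and real Avro rules (type promotion, unions, nested records) are ignored.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Can a reader using new_fields decode records written with old_fields?

    Each schema maps field name -> {"type": str, optional "default": value}.
    Simplified rules: shared fields must keep their type, fields added in
    the new schema need a default, and removed fields are simply ignored.
    """
    for name, spec in new_fields.items():
        if name in old_fields:
            if old_fields[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            return False
    return True


v1 = {"order_id": {"type": "string"}, "total": {"type": "double"}}
v2_ok = {**v1, "currency": {"type": "string", "default": "USD"}}
v2_bad = {**v1, "currency": {"type": "string"}}  # new field without default

print(is_backward_compatible(v1, v2_ok))   # True
print(is_backward_compatible(v1, v2_bad))  # False
```

A check like this at the producer side is what keeps a broken schema change from ever reaching downstream consumers such as the data warehouse.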

All the details about this cloud-native infrastructure optimization are in Doordash’s engineering blog post: “Building Scalable Real Time Event Processing with Kafka and Flink”.

Real-world case studies for cloud-native projects prove the business value

Data warehouse and data lake modernization only makes sense if there is business value. Elastic scale, reduced operations complexity, and faster time to market are significant advantages of cloud services like Snowflake, Databricks, or Google BigQuery.

Data streaming plays a vital role in these initiatives: it integrates legacy and cloud-native data sources, provides continuous streaming ETL, and truly decouples the data sources from multiple data sinks (lakes, warehouses, business applications).

The case studies of Confluent, PayPal, Shippeo, Sykes Cottages, and Doordash showed their different success stories of moving to cloud-native infrastructure to gain real-time visibility and analytics capabilities. Elastic scale and fully-managed end-to-end pipelines are crucial success factors in gaining business value with consistently up-to-date information.

Do you have another success story to share? Or are your projects for data lake and data warehouse modernization still ongoing? Do you use separate infrastructure for specific use cases or build a monolithic lakehouse instead?

The post Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization appeared first on Kai Waehner.

Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure

Kai Waehner — Fri, 15 Jul 2022 06:03:28 +0000

The concepts and architectures of a data warehouse, a data lake, and data streaming are complementary to solving business problems. Storing data at rest for reporting and analytics requires different capabilities and SLAs than continuously processing data in motion for real-time workloads. Many open-source frameworks, commercial products, and SaaS cloud services exist. Unfortunately, the underlying technologies are often misunderstood, overused for monolithic and inflexible architectures, and pitched for wrong use cases by vendors. Let’s explore this dilemma in a blog series. Learn how to build a modern data stack with cloud-native technologies. This is part 3: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure.

Blog Series: Data Warehouse vs. Data Lake vs. Data Streaming

This blog series explores concepts, features, and trade-offs of a modern data stack using data warehouse, data lake, and data streaming together:

Data Warehouse vs. Data Lake vs. Data Streaming – Friends, Enemies, Frenemies?
Data Streaming for Data Ingestion into the Data Warehouse and Data Lake
THIS POST: Data Warehouse Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Case Studies: Cloud-native Data Streaming for Data Warehouse Modernization
Lessons Learned from Building a Cloud-Native Data Warehouse

Data warehouse modernization: From legacy on-premise to cloud-native infrastructure

Many people talk about data warehouse modernization when they move to a cloud-native data warehouse. But what does data warehouse modernization mean? Why do people move away from their on-premise data warehouse? What are the benefits?

Many projects I have seen in the wild went through the following steps:

Select a cloud-native data warehouse
Get data into the new data warehouse
[Optional] Migrate from the old to the new data warehouse

Let’s explore these steps in more detail and understand the technology and architecture options.

1. Selection of a cloud-native data warehouse

Many years ago, cloud computing was a game-changer for operating infrastructure. AWS innovated by providing not just EC2 virtual machines but also storage, like AWS S3, as a service.

Cloud-native data warehouse offerings are built on the same fundamental change. Cloud providers brought their analytics cloud services, such as AWS Redshift, Azure Synapse, or GCP BigQuery. Independent vendors rolled out a cloud-native data warehouse or data lake SaaS such as Snowflake, Databricks, and many more. While each solution has its trade-offs, a few general characteristics are true for most of them:

Cloud-native: A modern data warehouse is elastic, scales for small up to extreme workloads, and automates most business processes around development, operations, and monitoring.
Fully managed: The vendor takes over the operations burden. This includes scaling, failover handling, upgrades, and performance tuning. Some offerings are truly serverless, while many services require capacity planning and manual or automated scaling up and down.
Consumption-based pricing: Pay-as-you-go enables getting started in minutes and scaling costs with broader software usage. Most enterprise deployments allow usage commitments in exchange for price discounts.
Data sharing: Replicating data sets across regions and environments is a common feature to offer data locality, privacy, lower latency, and regulatory compliance.
Multi-cloud and hybrid deployments: While cloud providers usually only offer the 1st party service on their cloud infrastructure, 3rd party vendors provide a multi-cloud strategy. Some vendors even offer hybrid environments, including on-premise and edge deployments.

Plenty of comparisons exist in the community, plus analyst research from Gartner, Forrester, et al. Looking at vendor information and trying out the various cloud products using free credits is crucial, too. Finding the right cloud-native data warehouse is its own challenge and not in the scope of this blog post.

2. Data streaming as (near) real-time and hybrid integration layer

Data ingestion into data warehouses and data lakes was already covered in part two of this blog series. The more real-time, the better for most business applications. Near real-time ingestion is possible with specific tools (like AWS Kinesis or Kafka) or as part of the data fabric (the streaming data hub where a tool like Kafka plays a bigger role than just data ingestion).

The often more challenging part is data integration. Most data warehouse and data lake pipelines require ETL to ingest data. As the next-generation analytics platform is crucial for making the right business decisions, the data ingestion and integration platform must also be cloud-native! Tools like Kafka provide the reliable and scalable integration layer to get all required data into the data warehouse.

Integration of legacy on-premise data into the cloud-native data warehouse

In a greenfield project, the project team is lucky. Data sources run in the same cloud, using open and modern APIs, and scale as well as the cloud-native data warehouse.

Unfortunately, the reality is almost always brownfield, even if all applications run in public cloud infrastructure. Therefore, the integration and replication of data from legacy and on-premise applications is a general requirement.

Data is typically consumed from legacy databases, data warehouses, applications, and other proprietary platforms. The replication into the cloud data warehouse usually needs to be near real-time and reliable.

A data streaming platform like Kafka is perfect for replicating data across data centers, regions, and environments because of its elastic scalability and true decoupling capabilities. Kafka enables connectivity to modern AND legacy systems via connectors, proprietary APIs, programming languages, or open REST interfaces:

A common scenario in such a brownfield project is the clear separation of concerns and true decoupling between legacy on-premise and modern cloud workloads. Here, Kafka is deployed on-premise to connect to legacy applications.

Tools like MirrorMaker, Replicator, or Confluent Cluster Linking replicate events in real-time into the Kafka cluster in the cloud. The Kafka brokers provide access to the incoming events. Downstream consumers read the data into the data sinks at their own pace; real-time, near real-time, batch, or request-response via API. Streaming ETL is possible at any site – where it makes the most sense from a business or security perspective and is the most cost-efficient.
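The "at their own pace" part is the key to decoupling. The following toy model shows the principle with an append-only log and one read position per consumer. It is an in-memory stand-in for a single topic partition, not Kafka client code; the class and the consumer names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Toy append-only log; stands in for a single topic partition."""
    events: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # consumer name -> next offset

    def append(self, event) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the appended event

    def poll(self, consumer: str, max_events: int) -> list:
        """Return up to max_events unread events and advance that consumer only."""
        start = self.offsets.get(consumer, 0)
        batch = self.events[start:start + max_events]
        self.offsets[consumer] = start + len(batch)
        return batch

    def lag(self, consumer: str) -> int:
        """Number of events this consumer has not read yet."""
        return len(self.events) - self.offsets.get(consumer, 0)


log = EventLog()
for i in range(10):
    log.append({"order_id": i})

realtime = log.poll("fraud-detection", max_events=10)  # keeps up with the producer
nightly = log.poll("warehouse-loader", max_events=3)   # slow, batch-style consumer

print(log.lag("fraud-detection"), log.lag("warehouse-loader"))  # 0 7
```

The slow consumer neither blocks the producer nor the fast consumer. The log simply retains the events until each consumer gets to them.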

Example: Confluent Cloud + Snowflake = Cloud-native Data Warehouse Modernization

Here is a concrete example of data warehouse modernization using cloud-native data streaming and data warehousing with Confluent Cloud and Snowflake:

For modernizing the data warehouse, data is ingested from various legacy and modern data sources using different communication paradigms, APIs, and integration strategies. The data is transmitted in motion from data sources via Kafka (and optional preprocessing) into the Snowflake data warehouse. The whole end-to-end pipeline is scalable, reliable, and fully managed, including the connectivity and ingestion between the Kafka and Snowflake clusters.

However, there is more to the integration and ingestion layer: The data streaming platform stores the data for true decoupling and slow downstream applications; not every consumer is or can be real-time. Most enterprise architectures do not ingest data into a single data warehouse, data lake, or lakehouse. The reality is that different downstream applications need access to the same information, even though vendors of data warehouses and data lakes tell you differently, of course.

By consuming events from the streaming data hub, each application domain decides by itself if it

processes events within Kafka with stream processing tools like Kafka Streams or ksqlDB
builds its own downstream applications with its own code and technologies (like Java, .NET, Golang, Python)
integrates with 3rd party business applications like Salesforce or SAP
ingests the raw or preprocessed and curated data from Kafka into the sink system (like a data warehouse or data lake)

3. Data warehouse modernization and migration from legacy to cloud-native

An often misunderstood concept is the buzz around data warehouse modernization: Companies rarely take the data of the existing on-premise data warehouse or data lake, write a few ETL jobs, and put the data into a cloud-native data warehouse for the sake of doing it.

If you think about a one-time lift-and-shift from an on-premise data warehouse to the cloud, then a traditional ETL tool or a replication cloud service might be the easiest. However, usually, data warehouse modernization is more than that!

What is data warehouse modernization?

A data warehouse modernization can mean many things, including replacing and migrating the existing data warehouse, building a new cloud-native data warehouse from scratch, or optimizing a legacy ETL pipeline of a cloud-native data warehouse.

In all these cases, data warehouse modernization requires business justification, for instance:

Application issues in the legacy data warehouse, such as too slow data processing with legacy batch workloads, result in wrong or conflicting information for the business people.
Scalability issues in the on-premise data warehouse as the data volume grows too much.
Cost issues because the legacy data warehouse does not offer reasonable pricing with pay-as-you-go models.
Connectivity issues as legacy data warehouses were not built with an open API and data sharing in mind. Cloud-native data warehouses run on cost-efficient and scalable object storage, separate storage from computing, and allow data consumption and sharing (but keep in mind that Reverse ETL is often an anti-pattern!).
A strategic move to the cloud with all infrastructure. The analytics platform is no exception if all old and new applications go to the cloud.

Cloud-native applications usually come with innovation, i.e., new business processes, data formats, and data-driven decision-making. From a data warehouse perspective, the best modernization is to start from scratch. Consume data directly from the existing data sources, ETL it, and do business intelligence on top of the new data structures.

I have seen many more projects where customers use change data capture (CDC) from Oracle databases (i.e., the leading core system) instead of trying to replicate data from the legacy data warehouse (i.e., the analytics platform), as scalability, cost, and the later shutdown of legacy infrastructure benefit from this approach.
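What a CDC consumer does with the change events can be sketched as follows. The event shape is an assumption that loosely follows the Debezium envelope (Debezium is the CDC tool mentioned in the Shippeo case study); the function name, the table layout, and the sample rows are made up for illustration.

```python
def apply_change(table: dict, event: dict, key: str = "id") -> None:
    """Apply one change event to an in-memory replica keyed by primary key.

    Assumed event shape: "op" is "c" (create), "u" (update), "d" (delete),
    or "r" (snapshot read), with "before" and "after" row images.
    """
    op = event["op"]
    if op in ("c", "r", "u"):
        row = event["after"]
        table[row[key]] = row  # upsert, so replaying an event is harmless
    elif op == "d":
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


replica = {}
changes = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "NEW"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "NEW"}},
    {"op": "u", "before": {"id": 1, "status": "NEW"},
     "after": {"id": 1, "status": "SHIPPED"}},
    {"op": "d", "before": {"id": 2, "status": "NEW"}, "after": None},
]
for change in changes:
    apply_change(replica, change)

print(replica)  # {1: {'id': 1, 'status': 'SHIPPED'}}
```

Because the changes come straight from the core system, the new data warehouse never depends on the legacy analytics platform, which makes the later shutdown much easier.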

Data warehouse migration: Continuous vs. cut-over

The project is usually a cut-over when you need to do a real modernization (i.e., migration) from a legacy data warehouse to a cloud-native one. This way, the first project phase integrates the legacy data sources with the new data warehouse. The old and new data warehouse platforms operate in parallel, so that old and new business processes go on. After some time (months or years later), when the business is ready to move, the old data warehouse will be shut down after legacy applications are either migrated to the new data warehouse or replaced with new applications:

My article “Mainframe Integration, Offloading and Replacement with Apache Kafka” illustrates this offloading and long-term migration process. Just scroll to the section “Mainframe Offloading and Replacement in the Next 5 Years” in that post and replace the term ‘mainframe’ with ‘legacy data warehouse’ in your mind.

A migration and cut-over is its own project and may or may not include the legacy data warehouse. Data lake modernization (e.g., from a self- or partially managed Cloudera cluster running on-premise in the data center to a fully managed Databricks or Snowflake cloud infrastructure) follows the same principles. And mixing the data warehouse (reporting) and data lake (big data analytics) into a single infrastructure does not change this either.

Data warehouse modernization is NOT a big bang and NOT a single tool approach!

Most data warehouse modernization projects are ongoing efforts over a long period. You must select a cloud-native data warehouse, get data into the new data warehouse from various sources, and optionally migrate away from legacy on-premise infrastructure.

Data streaming for data ingestion, business applications, or data sharing in real-time should always be a separate component in the enterprise architecture. It has very different requirements regarding SLAs, uptime, throughput, latency, etc. Putting all real-time and analytical workloads into the same cluster makes little sense from a cost, risk, or business value perspective. The idea of a modern data flow and building a data mesh is the separation of concerns with domain-driven design and focusing on data products (using different, independent APIs, technologies, and cloud services).

What cloud-native data warehouse(s) do you use? How does data streaming fit into your journey? Did you integrate or replace your legacy on-premise data warehouse(s); or start from greenfield in the cloud?


Apache Kafka and Machine Learning for Real Time Supply Chain Optimization in IIoT

Kai Waehner — Fri, 23 Aug 2019 06:59:43 +0000

I did a webinar with Confluent’s partner Expero about “Apache Kafka and Machine Learning for Real Time Supply Chain Optimization“. This is a great example for anybody in the automation industry / Industrial IoT (IIoT), such as automotive, manufacturing, or logistics.

We explain how a real time event streaming platform can integrate in real time with the legacy world and proprietary IIoT protocols (like Siemens S7, Modbus, Beckhoff ADS, OPC-UA, et al). You can process the data at scale and then ingest it into a modern data store (like AWS S3, Snowflake or MongoDB) or an analytics / machine learning framework (like TensorFlow, PyTorch or Azure Machine Learning Service).

Here is the architecture we use to discuss and implement the supply chain optimization use case leveraging real time stream processing and machine learning:

We leverage various components from the Apache Kafka ecosystem. This includes:

Kafka Connect as scalable and reliable integration framework
Kafka Connect connectors like PLC4X – a great Apache framework to integrate with IIoT protocols
KSQL for continuous processing (filter, transform, aggregate) of the sensor data
Kafka Streams to deploy the trained analytic models to a real time streaming scoring engine
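
To make the "filter, transform, aggregate" idea more concrete, here is a minimal sketch in plain Python. It is not KSQL and does not use any Kafka client; the event fields (machine, temp_c), the plausible temperature range, and all function names are assumptions made up for illustration. It only shows the shape of a continuous pipeline in which every incoming event immediately produces an updated result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical sensor events; in production they would arrive from a Kafka topic.
EVENTS = [
    {"machine": "press-1", "temp_c": 71.0},
    {"machine": "press-2", "temp_c": -999.0},  # faulty reading
    {"machine": "press-1", "temp_c": 75.0},
    {"machine": "press-2", "temp_c": 64.0},
]


def keep_plausible(events: Iterable[dict]) -> Iterator[dict]:
    """Filter: drop readings outside an assumed plausible range."""
    return (e for e in events if -50.0 <= e["temp_c"] <= 500.0)


def add_fahrenheit(events: Iterable[dict]) -> Iterator[dict]:
    """Transform: enrich each event with a derived field."""
    for e in events:
        yield {**e, "temp_f": e["temp_c"] * 9 / 5 + 32}


def running_average(events: Iterable[dict]) -> Iterator[Tuple[str, float]]:
    """Aggregate: emit the updated per-machine average (in Celsius) after every event."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["machine"]] += e["temp_c"]
        counts[e["machine"]] += 1
        yield e["machine"], totals[e["machine"]] / counts[e["machine"]]


# The three stages are chained lazily, so each event flows through all of them in turn.
updates = list(running_average(add_fahrenheit(keep_plausible(EVENTS))))
for machine, avg in updates:
    print(f"{machine}: running average {avg:.1f} degrees Celsius")
```

Because the stages are generators, nothing is buffered between them, which mirrors how a continuous query keeps emitting updated results instead of waiting for a batch to complete.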

Use Case: Supply Chain Optimization using Real Time and Batch Processes

Automating multifaceted, complex workflows requires hybrid solutions that combine streaming analytics of IoT data with batch analytics. This includes machine learning solutions and real time visualization. Leaders responsible for global supply chain planning must work with and integrate data from disparate sources around the world. Many of these data sources output information in real time. This assists planners in operationalizing plans and interacting with manufacturing output. IoT sensors on manufacturing equipment and inventory control systems feed real time processing pipelines that match actual production figures against planned schedules to calculate yield efficiency.
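
As a rough illustration of the last step, the following sketch computes yield efficiency as actual output divided by planned output per production line. The definition, the line names, and the figures are assumptions for illustration only; the webinar may use a different formula.

```python
from typing import Dict, Optional


def yield_efficiency(
    planned: Dict[str, int], actual: Dict[str, int]
) -> Dict[str, Optional[float]]:
    """Return actual / planned per production line (None if nothing was planned)."""
    result: Dict[str, Optional[float]] = {}
    for line, target in planned.items():
        produced = actual.get(line, 0)  # lines without sensor data count as zero output
        result[line] = produced / target if target > 0 else None
    return result


# Hypothetical planned schedule (batch side) and sensor-derived actuals (real time side).
planned = {"line-A": 1000, "line-B": 800, "line-C": 0}
actual = {"line-A": 950, "line-B": 840}

efficiency = yield_efficiency(planned, actual)
for line, value in efficiency.items():
    print(line, "n/a" if value is None else f"{value:.0%}")
```

The planned figures would typically come from a batch planning system, while the actuals are updated continuously from the sensor stream.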

Using information from both real time systems and batch optimization, supply chain managers are able to economize operations and automate tedious inventory and manufacturing accounting processes. Sitting on top of all of these systems is a supply chain visualization tool. It gives users visibility over the global supply chain. If you are responsible for key data integration initiatives, join for a detailed walkthrough of a customer’s use of this system built using Confluent and Expero tools.

What will you learn?

See different use cases in automation industry and Industrial IoT (IIoT) where an event streaming platform adds business value
Understand different architecture options to leverage Apache Kafka and Confluent Platform in IoT scenarios in the cloud, on premise data centers and at the edge
Learn how to leverage different analytics tools and machine learning frameworks in a flexible and scalable way
How real time visualization ties together streaming and batch analytics for business users, interpreters, and analysts
Understand how streaming and batch analytics optimize the supply chain planning workflow.
Conceptualize the intersection between resource utilization and manufacturing assets with long term planning and supply chain optimization.

Industrial Internet of Things (IIoT) in Real Time at Scale with Apache Kafka

Here is the slide deck and video recording. Have fun watching it. Please let me know if you have any feedback or questions:

Slide Deck:

Video Recording:


Machine Learning Trends of 2018 combined with the Apache Kafka Ecosystem

Kai Waehner — Tue, 13 Feb 2018 15:55:59 +0000

At OOP 2018 conference in Munich, I presented an updated version of my talk about building scalable, mission-critical microservices with the Apache Kafka ecosystem and Deep Learning frameworks like TensorFlow, DeepLearning4J or H2O. I want to share the updated slide deck and discuss a few updates about newest trends, which I incorporated into the talk.

The main story is the same as in my Confluent blog post about Apache Kafka Ecosystem and Machine Learning: How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka. But I focused more on Deep Learning / Neural Networks. I also discussed a few innovations in the ecosystem of Apache Kafka and trends in ML in the last months: KSQL, ONNX, AutoML, ML platforms from Uber and Netflix. Let’s take a look into these interesting topics and how this is related to each other.

KSQL – A Streaming SQL Language on top of Apache Kafka.

“KSQL is a streaming SQL engine for Apache Kafka. KSQL lowers the entry bar to the world of stream processing, providing a simple and completely interactive SQL interface for processing data in Kafka. You no longer need to write code in a programming language such as Java or Python! KSQL is open-source (Apache 2.0 licensed), distributed, scalable, reliable, and real-time. It supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more.” More details here: “Introducing KSQL: Open Source Streaming SQL for Apache Kafka“.

You can write SQL-like queries to deploy scalable, mission-critical stream processing apps (which leverage Kafka Streams under the hood). Definitely a highlight in the Kafka open source ecosystem.

KSQL and Machine Learning

KSQL is built on top of Kafka Streams and therefore allows you to build scalable, mission-critical services. Machine Learning models, including Neural Networks, can be embedded easily by building a User Defined Function (UDF). I am currently preparing an example where I apply a Neural Network (more precisely, an Autoencoder) for sensor analytics to detect anomalies, i.e. critical values in health checks of hospital patients, in real time and send an alert to the doctor.
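
The core of such a UDF is a plain function that takes one event and returns a score or a flag. The sketch below shows this idea in Python by thresholding a reconstruction error. Note the assumptions: the "model" is a trivial stand-in that always reconstructs the per-feature mean of normal training data (it is not an autoencoder), and the vital-sign values and the threshold are invented for illustration. A trained autoencoder would simply replace the reconstruct callable.

```python
from statistics import mean
from typing import Callable, List, Sequence

Vector = Sequence[float]


def fit_mean_reconstructor(normal: List[Vector]) -> Callable[[Vector], List[float]]:
    """Stand-in for a trained autoencoder: always returns the per-feature mean."""
    center = [mean(column) for column in zip(*normal)]
    return lambda _x: list(center)


def reconstruction_error(x: Vector, reconstruct: Callable[[Vector], List[float]]) -> float:
    """Mean squared error between the input and its reconstruction."""
    x_hat = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def is_anomaly(x: Vector, reconstruct: Callable[[Vector], List[float]], threshold: float) -> bool:
    """UDF-style function: one event in, one boolean out."""
    return reconstruction_error(x, reconstruct) > threshold


# Hypothetical training data: (heart rate, body temperature) of healthy patients.
normal = [(70.0, 36.6), (72.0, 36.8), (68.0, 36.7)]
reconstruct = fit_mean_reconstructor(normal)
threshold = 25.0  # assumed; in practice derived from validation data

for reading in [(71.0, 36.7), (120.0, 39.5)]:
    print(reading, "ALERT" if is_anomaly(reading, reconstruct, threshold) else "ok")
```

Because the function is stateless and works per event, it can be applied to every record of a stream without any coordination between instances.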

Let’s now talk about some interesting new developments in the machine learning ecosystem.

ONNX – An Open Format to Represent Deep Learning Models

“ONNX is a open format to represent deep learning models. With ONNX, AI developers can more easily move models between state-of-the-art tools and choose the combination that is best for them.”

This sounds similar to PMML (Predictive Model Markup Language, see “What is PMML” on KDnuggets) and PFA (Portable Format for Analytics), two other standards to define and share machine learning models. However, ONNX differs in a few aspects:

focuses on Deep Learning
has several huge tech companies (AWS, Microsoft, Facebook) and hardware vendors (AMD, NVIDIA, Intel, Qualcomm, etc.) behind it
already supports many leading open source frameworks (including TensorFlow, PyTorch, MXNet)

ONNX is already GA in version 1.0 and production ready (as announced by Amazon, Microsoft and Facebook in December 2017). There is also a nice getting started guide for different frameworks.

ONNX and the Apache Kafka ecosystem

Unfortunately, ONNX has no Java support yet. Therefore, you cannot yet embed it natively into the Kafka Streams Java API; this is only possible via a workaround such as a REST call or a JNI binding. But I am very sure this is only a matter of time, because the Java platform is so important in many enterprises to deploy mission-critical applications.

Right now, you could use Kafka’s Java API or other Kafka clients. Confluent provides official clients for several programming languages, e.g. for Python or Go, both of which are well suited for Machine Learning applications, too.

Automated Machine Learning (aka AutoML)

“Automated machine learning (AutoML) is a hot new field with the goal of making it easy to select different machine learning algorithms, their parameter settings, and the pre-processing methods that improve their ability to detect complex patterns in big data” as stated here.

With AutoML, you can build analytic models without any knowledge about Machine Learning. AutoML tools use different implementations of Decision Trees, Clustering, Neural Networks, etc. to build and compare different models out-of-the-box. You just upload or connect your historical data set and click a few buttons to start the process. Maybe not perfect for every use case, but you can easily improve many existing processes without the need for a rare and expensive data scientist.

DataRobot or Google’s AutoML are two of many well-known cloud offerings in this space. H2O’s AutoML is integrated into its open source ML framework, but they also offer a nice UI-focused commercial product called “Driverless AI“. I highly recommend spending 30 minutes with any AutoML tool. It is really fascinating to see how AI tools develop these days.

AutoML and the Apache Kafka ecosystem

Most AutoML tools offer deployment of their models. You can access the analytic models e.g. via a REST interface. That is not a perfect solution for a scalable, event-driven architecture like Kafka. The good news: Many AutoML solutions also allow you to export their generated models so that you can deploy them into your application. For example, AutoML in H2O’s open source frameworks is just one of many options. You just call another operation in the interface of your choice (R, Python, Scala, Web UI):

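library(h2o)
h2o.init()
# x: predictor column names, y: response column name; train and test are H2OFrames
aml <- h2o.automl(x = x, y = y,
                  training_frame = train,
                  leaderboard_frame = test,
                  max_runtime_secs = 30)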

This is similar to what you would do to build a Linear Regression, Decision Tree or Neural Network. The result is generated Java code which you can easily embed into your Kafka Streams microservice or any other Kafka application. AutoML enables you to build and deploy highly scalable machine learning without deep knowledge of ML.

ML Platforms: Uber’s Michelangelo; Netflix’ Meson

Tech giants are typically some years ahead of “traditional enterprises”. They already built years ago what you build today or tomorrow. ML Platforms are no different. Writing the ML source code to train an analytic model is just a very small part of a real world ML infrastructure. You need to think about the whole development process. The following picture shows the “Hidden Technical Debt in Machine Learning Systems“:

You will probably build several analytic models with different technologies. Not everything will be built in your Spark or Flink cluster or in a single cloud infrastructure. You might run TensorFlow on some big, expensive GPU in the public cloud to build powerful neural networks. Or use H2O to build some small, but very efficient and performant decision trees which do inference in a few microseconds… ML has many use cases.

That’s why many tech giants have built their own ML platforms, like Uber’s Michelangelo or Netflix’ Meson. These ML Platforms allow them to build and monitor powerful, scalable analytic models, but also to stay flexible to choose the right ML technology for each use case.

Apache Kafka ecosystem for ML Platforms

One of the reasons why Apache Kafka is so successful is the huge adoption by many tech giants. Almost all great Silicon Valley companies like LinkedIn, Netflix, Uber, eBay, “you-name-it” blog and speak about their usage of Kafka as the event-driven central nervous system for their mission-critical applications. Many focus on the distributed streaming platform for messaging, but we also see more and more adoption of add-ons like Kafka Connect, Kafka Streams, REST Proxy, Schema Registry or KSQL.

If you look at the above picture again, then think about Kafka: Isn’t it a perfect fit for an ML Platform? Training, monitoring, deployment, inference, configuration, A/B testing, etc. That’s probably why Uber, Netflix and many others already use Kafka as a central component in their ML infrastructure.

And again, you are not forced to use just one specific technology. One of the great design concepts of Kafka is that you can re-process data again and again from its distributed commit log. This means you can either build different models with one technology as Kafka sink (let’s say Apache Flink or Spark), or connect different technologies like scikit-learn for local testing, TensorFlow running on Google Cloud GPUs for powerful deep learning, an on premise installation of H2O nodes for AutoML, and some other Kafka Streams ML apps deployed in Docker containers or Kubernetes. All of these ML applications consume the data in parallel, at their own pace and as often as they need to.
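
The following toy model in plain Python illustrates why this works. It is a simplified in-memory sketch, not Kafka’s actual implementation or client API: the class CommitLog, its methods, and the group names are hypothetical. The key point is that the log itself is never modified by reads; each consumer group only keeps its own offset, so one group can replay everything while another continues at its own pace.

```python
from typing import Dict, List


class CommitLog:
    """Append-only log; each consumer group tracks its own read offset."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}

    def append(self, record: str) -> int:
        """Store a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> List[str]:
        """Return the next batch for this group and advance only this group's offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek_to_beginning(self, group: str) -> None:
        """Rewind one group so it can re-process the full history."""
        self._offsets[group] = 0


log = CommitLog()
for event in ["e0", "e1", "e2"]:
    log.append(event)

first_training_run = log.poll("model-training")              # reads everything
scoring_batch = log.poll("realtime-scoring", max_records=2)  # slower, independent consumer
log.seek_to_beginning("model-training")                      # e.g. retrain with another framework
second_training_run = log.poll("model-training")

print(first_training_run, scoring_batch, second_training_run)
```

In this sketch, rewinding the training group has no effect on the scoring group, which is the property that lets several ML technologies share the same data.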

Here is a great example of how to automate training and deployment of a scalable ML microservice with Kafka and Kafka Streams. No need to add another big data cluster. That’s one of the key differences of using Kafka Streams or KSQL for your ML applications instead of other Stream Processing frameworks.

Apache Kafka and Deep Learning – Slide Deck from OOP

Finally, after all these discussions about the Apache Kafka ecosystem and new trends in Machine Learning / Deep Learning, here are my updated slides from my talk at OOP 2018 conference:

I have also built a few examples using Apache Kafka, Kafka Streams and different open source ML frameworks like H2O, TensorFlow and DeepLearning4j (DL4J). The Github project shows how easy it is to deploy analytic models to a highly scalable, fault-tolerant, mission-critical Kafka microservice. A KSQL demo will also come soon.

Please share your feedback. Do you already use Kafka in the Machine Learning space? What components in addition to Kafka core do you use? Feel free to contact me to discuss this in more detail.


Deep Learning in Real Time with TensorFlow, H2O.ai and Kafka Streams (Slides from JavaOne 2017)

Kai Waehner — Wed, 04 Oct 2017 16:59:04 +0000

Early October… Like every year in October, it is time for JavaOne and Oracle Open World in San Francisco… I am glad to be back at this huge event again. My talk at JavaOne 2017 was all about deployment of analytic models to scalable production systems leveraging Apache Kafka and Kafka Streams. Let’s first look at the abstract. After that I attach the slides and refer to further material around this topic.

Abstract “Deep Learning in Real Time with TensorFlow, H2O.ai and Kafka Streams”

Intelligent real time applications are a game changer in any industry. Deep Learning is one of the hottest buzzwords in this area. New technologies like GPUs combined with elastic cloud infrastructure enable the sophisticated usage of artificial neural networks to add business value in real world scenarios. Tech giants use it e.g. for image recognition and speech translation. This session discusses some real-world scenarios from different industries to explain when and how traditional companies can leverage deep learning in real time applications.

This session shows how to deploy Deep Learning models into real time applications to do predictions on new events. Apache Kafka will be used to run inference with analytic models in a highly scalable and performant way.

The first part introduces the use cases and concepts behind Deep Learning. It discusses how to build Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Autoencoders leveraging open source frameworks like TensorFlow, DeepLearning4J or H2O.

The second part shows how to deploy the built analytic models to real time applications leveraging Apache Kafka as streaming platform and Apache Kafka’s Streams API to embed the intelligent business logic into any external application or microservice.

Key Takeaways for the Audience: Kafka Streams + Deep Learning

Here are the takeaways of this talk:

Focus of this talk is to discuss and show how to productionize analytic models built by data scientists – the key challenge in most companies.
Deep Learning allows you to build different neural networks to solve complex classification and regression scenarios and can add business value in any industry
Deep Learning is used to build analytics models using open source frameworks like TensorFlow, DeepLearning4J or H2O.ai.
Apache Kafka’s Streams API allows you to embed the intelligent business logic into any application or microservice
Apache Kafka’s Streams API leverages these Deep Learning models (without redeveloping them) to act on new events in real time

Slides and Further Material around Apache Kafka and Machine Learning

Here are the slides of my talk:

Some further material around Apache Kafka, Kafka Streams and Machine Learning:

I will post more examples and use cases around Apache Kafka and Machine Learning in the upcoming months… Stay tuned!


Comparison: Data Preparation vs. Inline Data Wrangling in Machine Learning and Deep Learning Projects

Kai Waehner — Mon, 13 Feb 2017 11:23:01 +0000

I want to highlight a new presentation about Data Preparation in Data Science projects:

“Comparison of Programming Languages, Frameworks and Tools for Data Preprocessing and (Inline) Data Wrangling in Machine Learning / Deep Learning Projects”

Data Preparation as Key for Success in Data Science Projects

A key task to create appropriate analytic models in machine learning or deep learning is the integration and preparation of data sets from various sources like files, databases, big data storages, sensors or social networks. This step can take up to 80% of the whole project.

This session compares different alternative techniques to prepare data, including extract-transform-load (ETL) batch processing (like Talend, Pentaho), streaming analytics ingestion (like Apache Storm, Flink, Apex, TIBCO StreamBase, IBM Streams, Software AG Apama), and data wrangling (DataWrangler, Trifacta) within visual analytics. Various options and their trade-offs are shown in live demos using different advanced analytics technologies and open source frameworks such as R, Python, Apache Hadoop, Spark, KNIME or RapidMiner. The session discusses how this is related to visual analytics tools (like TIBCO Spotfire). Therefore, it also shows best practices for how the data scientist and business analyst should work together to build good analytic models.

Key Takeaway: Inline Data Wrangling Within Visual Analytics Tooling

Key takeaways of this session:

–    Learn various options for preparing data sets to build analytic models
–    Understand the pros and cons and the targeted persona for each option
–    See different technologies and open source frameworks for data preparation
–    Understand the relation to visual analytics and streaming analytics, and how these concepts are actually leveraged to build the analytic model after data preparation

Slide Deck

The following shows the slide deck:

Video Recording: Data Preparation vs. (Inline) Data Wrangling

Here is the video recording:


Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Services

Kai Waehner — Tue, 15 Nov 2016 13:20:31 +0000

In November 2016, I am at Big Data Spain in Madrid for the first time. A great conference with many awesome speakers and sessions about very hot topics such as Apache Hadoop, Spark, Stream Processing / Streaming Analytics and Machine Learning. If you are interested in big data, then this conference is for you! My two talks:

“How to Apply Machine Learning to Real Time Processing” (see slides and video recording from a similar conference talk).
“Comparison of Streaming Analytics Options” (the reason for this blog post; an updated version of my talk from JavaOne 2015)

Here I want to share the slides and a video recording of the latter one…

Abstract: Comparison of Stream Processing Options

This session discusses the technical concepts of stream processing / streaming analytics and how it is related to big data, mobile, cloud and internet of things. Different use cases such as predictive fault management or fraud detection are used to show and compare alternative frameworks and products for stream processing and streaming analytics.
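
One building block that most stream processing engines share is windowing. As a minimal sketch of the fraud detection use case, the plain Python below groups card transactions into tumbling (fixed, non-overlapping) time windows and flags cards with unusually many transactions in one window. The event format, the 60-second window, and the limit of three transactions are assumptions for illustration, and the code runs over a finite list, whereas a real engine would evaluate windows continuously as events arrive.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, card id), hypothetical format


def tumbling_window_counts(events: Iterable[Event], window_seconds: int) -> Counter:
    """Count events per (window start, key); windows are fixed and do not overlap."""
    counts: Counter = Counter()
    for timestamp, key in events:
        window_start = timestamp - timestamp % window_seconds
        counts[(window_start, key)] += 1
    return counts


def suspicious(counts: Counter, max_per_window: int) -> List[Tuple[int, str]]:
    """Return the (window start, key) pairs that exceed the allowed count."""
    return sorted(k for k, n in counts.items() if n > max_per_window)


events = [
    (0, "card-1"), (10, "card-1"), (20, "card-1"), (59, "card-1"),  # four in window [0, 60)
    (61, "card-1"), (65, "card-2"),                                 # window [60, 120)
]

counts = tumbling_window_counts(events, window_seconds=60)
flagged = suspicious(counts, max_per_window=3)
print(flagged)
```

The frameworks and products compared in the session differ mainly in how they manage such windows at scale (state handling, late events, fault tolerance), not in the basic idea.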

The focus of the session lies on comparing

different open source frameworks such as Apache Apex, Apache Flink or Apache Spark Streaming
engines from software vendors such as IBM InfoSphere Streams, TIBCO StreamBase
cloud offerings such as AWS Kinesis
real time streaming UIs such as Striim, Zoomdata or TIBCO Live Datamart

Live demos will give the audience a good feeling about how to use these frameworks and tools.

The session will also discuss how stream processing is related to Apache Hadoop frameworks (such as MapReduce, Hive, Pig or Impala) and machine learning (such as R, Spark ML or H2O.ai).

Slides – Alternatives for Streaming Analytics

The following slide deck is a more extensive version of the talk at Big Data Spain (as the conference talks were only 30 minutes):

Video Recording: Apache Storm, Flink, Apex, Spark, StreamBase, Striim, et al

The video recording walks you through the above slide deck:

As always, I appreciate any comments, questions or other feedback.


Machine Learning Applied to Microservices

Kai Waehner — Thu, 20 Oct 2016 19:32:22 +0000

I had two sessions at O’Reilly Software Architecture Conference in London in October 2016. It is the first #OReillySACon in London. A very well-organized conference with plenty of great speakers and sessions. I can really recommend this conference and its siblings in other cities such as San Francisco or New York if you want to learn about good software architectures and new concepts, best practices and technologies. Some of the hot topics this year besides microservices are DevOps, serverless architectures, and big data analytics and machine learning.

Intelligent Microservices by Leveraging Big Data Analytics

One of the two sessions was about how to apply machine learning and big data analytics to real time event processing. I also included the relation to microservices, i.e. how to leverage microservice concepts such as 12 Factor Apps, Containers (e.g. Docker), Cloud Platforms (e.g. Kubernetes, Cloud Foundry), or DevOps to build agile, intelligent microservices.

Abstract: How to Apply Machine Learning to Microservices

The digital transformation is moving forward due to Mobile, Cloud and Internet of Things. Disruptive business models leverage Big Data Analytics and Machine Learning.

“Big Data” is currently a big hype. Large amounts of historical data are stored in Hadoop or other platforms. Business Intelligence tools and statistical computing are used to draw new knowledge and to find patterns from this data, for example for promotions, cross-selling or fraud detection. The key challenge is how these findings can be integrated from historical data into new transactions in real time to make customers happy, increase revenue or prevent fraud. “Fast Data” via stream processing is the solution to embed patterns – which were obtained from analyzing historical data – into future transactions in real-time.

This session uses several real world success stories to explain the concepts behind stream processing and its relation to Hadoop and other big data platforms. It discusses how patterns and statistical models of R, Spark MLlib, H2O, and other technologies can be integrated into real-time processing by using several different real world case studies. The session also points out why a Microservices architecture helps meet the agile requirements of these kinds of projects.

A brief overview of available open source frameworks and commercial products shows possible options for the implementation of stream processing, such as Apache Storm, Apache Flink, Spark Streaming, IBM InfoSphere Streams, or TIBCO StreamBase.

A live demo shows how to implement stream processing, how to integrate machine learning, and how human operations can be enabled in addition to the automatic processing via a Web UI and push events.

How to Build Intelligent Microservices – Slide Deck from O’Reilly Software Architecture Conference


Comparison Of Log Analytics for Distributed Microservices – Open Source Frameworks, SaaS and Enterprise Products

Kai Waehner — Thu, 20 Oct 2016 18:57:51 +0000

I want to share the slides of my session about comparing open source frameworks, SaaS and Enterprise products regarding log analytics for distributed microservices:

Monitoring Distributed Microservices with Log Analytics

IT systems and applications generate more and more distributed machine data due to millions of mobile devices, Internet of Things, social network users, and other new emerging technologies. However, organizations experience challenges when monitoring and managing their IT systems and technology infrastructure. They struggle with distributed Microservices and Cloud architectures, custom application monitoring and debugging, network and server monitoring / troubleshooting, security analysis, compliance standards, and others.

This session discusses how to solve the challenges of monitoring and analyzing Terabytes and more of different distributed machine data to leverage the “digital business”. The main part of the session compares different open source frameworks and SaaS cloud solutions for Log Management and operational intelligence, such as Graylog, the “ELK stack”, Papertrail, Splunk or TIBCO LogLogic. A live demo will demonstrate how to monitor and analyze distributed Microservices and sensor data from the “Internet of Things”.
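
At their core, all of these tools parse log lines from many services into structured fields and then aggregate them. The sketch below shows that idea in plain Python. The log line format (timestamp, service, level, message), the service names, and the sample lines are assumptions for illustration; real products support many formats and far richer queries.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Assumed line format: "<timestamp> <service> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<service>[\w-]+) (?P<level>[A-Z]+) (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Turn one raw log line into structured fields, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def errors_by_service(lines: Iterable[str]) -> Counter:
    """Count ERROR entries per service, skipping lines that cannot be parsed."""
    counts: Counter = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "ERROR":
            counts[record["service"]] += 1
    return counts


# Hypothetical log lines collected from several microservices.
LINES = [
    "2016-10-20T18:00:01Z orders ERROR payment gateway timeout",
    "2016-10-20T18:00:02Z orders INFO order 42 created",
    "2016-10-20T18:00:03Z inventory ERROR stock lookup failed",
    "2016-10-20T18:00:04Z orders ERROR payment gateway timeout",
    "malformed line",
]

errors = errors_by_service(LINES)
print(errors.most_common())
```

Skipping unparseable lines instead of failing is a deliberate choice here, since logs from distributed systems are rarely uniform.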

The session also explains how the discussed solutions differ from other big data components such as Apache Hadoop, Data Warehouse or Machine Learning and its application to real time processing, and how they can complement each other in a big data architecture.

The session concludes with an outlook to the new, advanced concept of IT Operations Analytics (ITOA).

Slide Deck from O’Reilly Software Architecture Conference
