Apache Iceberg – The Open Table Format for Lakehouse AND Data Streaming
https://www.kai-waehner.de/blog/2024/07/13/apache-iceberg-the-open-table-format-for-lakehouse-and-data-streaming/
Sat, 13 Jul 2024

Every data-driven organization has operational and analytical workloads. A best-of-breed approach emerges, with various data platforms including data streaming, data lake, data warehouse, and lakehouse solutions and cloud services. An open table format framework like Apache Iceberg is essential in the enterprise architecture to ensure reliable data management and sharing, seamless schema evolution, efficient handling of large-scale datasets and cost-efficient storage while providing strong support for ACID transactions and time travel queries. This blog post explores market trends, adoption of table format frameworks like Iceberg, Hudi, Paimon, Delta Lake and XTable, and the product strategy of leading data platform vendors such as Snowflake, Databricks (Apache Spark), Confluent (Apache Kafka / Flink), Amazon Athena and Google BigQuery.

Apache Iceberg Standard Table Format for Data Lake Lakehouse Streaming with Kafka Flink Databricks Snowflake AWS GCP Azure

What is an Open Table Format for a Data Platform?

An open table format helps in maintaining data integrity, optimizing query performance, and ensuring a clear understanding of the data stored within the platform.

The open table format for data platforms typically includes a well-defined structure with specific components that ensure data is organized, accessible, and easily queryable. A typical table format contains a table name, column names, data types, primary and foreign keys, indexes, and constraints.
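To make this concrete, here is a minimal sketch using Apache Iceberg’s Java API (assuming iceberg-api and iceberg-core on the classpath; the table and column names are purely illustrative). It defines the schema and partitioning that the table format tracks as metadata:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class TableFormatExample {
    public static void main(String[] args) {
        // Columns with stable field IDs, data types, and nullability constraints.
        Schema schema = new Schema(
            Types.NestedField.required(1, "order_id", Types.LongType.get()),
            Types.NestedField.required(2, "customer_id", Types.LongType.get()),
            Types.NestedField.required(3, "order_ts", Types.TimestampType.withZone()),
            Types.NestedField.optional(4, "amount", Types.DecimalType.of(10, 2)));

        // Partitioning is part of the table metadata, not of the query engine.
        PartitionSpec spec = PartitionSpec.builderFor(schema)
            .day("order_ts")
            .build();

        System.out.println(schema);
        System.out.println(spec);
    }
}
```

Because the schema and partition spec live in the table metadata, every engine that reads the table sees the same structure, which is exactly what makes the format “open”.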

This is not a new concept. Your favorite decades-old database, like Oracle, IBM DB2 (even on the mainframe) or PostgreSQL, uses the same principles. However, the requirements and challenges have changed a bit for cloud data warehouses, data lakes, and lakehouses regarding scalability, performance, and query capabilities.

Benefits of a “Lakehouse Table Format” like Apache Iceberg

Every part of an organization becomes data-driven. The consequence is extensive data sets, data sharing with data products across business units, and new requirements for processing data in near real-time.

Apache Iceberg provides many benefits for the enterprise architecture:

  • Single Storage: Storing data once (coming from various data sources) reduces cost and complexity
  • Interoperability: Access without integration efforts from any analytical engine
  • All Data: Unify operational and analytical workloads (transactional systems, big data logs/IoT/clickstream, mobile APIs, 3rd party B2B interfaces, etc.)
  • Vendor Independence: Work with any favorite analytics engine (no matter if it is near real-time, batch or API-based)

Apache Hudi and Delta Lake provide the same characteristics. However, Delta Lake is mainly driven by Databricks as a single vendor.

Table Format AND Catalog Interface

It is important to understand that discussions about Apache Iceberg or similar table format frameworks include two concepts: Table Format AND Catalog Interface! As an end user of the technology, you need both!

The Apache Iceberg project implements the table format but provides only a specification (not an implementation) for the catalog:

  • The table format defines how data is organized, stored, and managed within a table.
  • The catalog interface manages the metadata for tables and provides an abstraction layer for accessing tables in a data lake.

The Apache Iceberg documentation explores the concepts in much more detail, based on this diagram:

Apache Iceberg Metadata Architecture
Source: Apache Iceberg

Organizations use various implementations for Iceberg’s catalog interface. Each integrates with different metadata stores and services. Key implementations include:

  • Hadoop Catalog: Uses the Hadoop Distributed File System (HDFS) or other compatible file systems to store metadata. Suitable for environments already using Hadoop.
  • Hive Catalog: Integrates with Apache Hive Metastore to manage table metadata. Ideal for users leveraging Hive for their metadata management.
  • AWS Glue Catalog: Uses AWS Glue Data Catalog for metadata storage. Designed for users operating within the AWS ecosystem.
  • REST Catalog: Provides a RESTful interface for catalog operations via HTTP. Enables integration with custom or third-party metadata services.
  • Nessie Catalog: Uses Project Nessie, which provides a Git-like experience for managing data.

The momentum and growing adoption of Apache Iceberg motivate many data platform vendors to implement their own Iceberg catalog. I discuss a few strategies in the section below about data platform and cloud vendor strategies, including Snowflake’s Polaris, Databricks’ Unity, and Confluent’s Tableflow.
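To make the interplay between table format and catalog tangible, here is a minimal, hedged sketch using Iceberg’s Java RESTCatalog. The endpoint, warehouse location, and table names are assumptions for illustration only:

```java
import java.util.Map;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.rest.RESTCatalog;

public class CatalogExample {
    public static void main(String[] args) {
        RESTCatalog catalog = new RESTCatalog();
        // Endpoint and warehouse location are illustrative placeholders.
        catalog.initialize("demo", Map.of(
            CatalogProperties.URI, "http://localhost:8181",
            CatalogProperties.WAREHOUSE_LOCATION, "s3://my-bucket/warehouse"));

        // The catalog resolves the table name to its current metadata;
        // the table format then describes how the data files are organized.
        Table table = catalog.loadTable(TableIdentifier.of("sales", "orders"));
        System.out.println(table.location());
    }
}
```

Swapping the catalog implementation (Hive, AWS Glue, Nessie, etc.) changes only the initialization properties; the table format and the reading code stay the same.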

First-Class Iceberg Support vs. Iceberg Connector

Please note that supporting Apache Iceberg (or Hudi/Delta Lake) means much more than just providing a connector and integration with the table format via API. Vendors and cloud services differentiate by advanced features like automatic mapping between data formats, critical SLAs, time travel, intuitive user interfaces, and so on.

Let’s look at an example: the integration between Apache Kafka and Iceberg. Various Kafka Connect connectors have already been implemented. However, here are the benefits of using a first-class integration with Iceberg (e.g., Confluent’s Tableflow) compared to just using a Kafka Connect connector:

  • No connector config
  • No consumption through connector
  • Built-in maintenance (compaction, garbage collection, snapshot management)
  • Automatic schema evolution
  • External catalog service synchronization
  • Simpler operations (in a fully-managed SaaS solution, it is serverless, with no need for any scale or operations by the end-user)

Similar benefits apply to other data platforms and potential first-class integration compared to providing simple connectors.

Open Table Format for a Data Lake / Lakehouse using Apache Iceberg, Apache Hudi, Delta Lake

The general goal of table format frameworks such as Apache Iceberg, Apache Hudi, and Delta Lake is to enhance the functionality and reliability of data lakes by addressing common challenges associated with managing large-scale data. These frameworks help to:

  1. Improve Data Management
    • Facilitate easier handling of data ingestion, storage, and retrieval in data lakes.
    • Enable efficient data organization and storage, supporting better performance and scalability.
  2. Ensure Data Consistency
    • Provide mechanisms for ACID transactions, ensuring that data remains consistent and reliable even during concurrent read and write operations.
    • Support snapshot isolation, allowing users to view a consistent state of data at any point in time.
  3. Support Schema Evolution
    • Allow for changes in data schema (such as adding, renaming, or removing columns) without disrupting existing data or requiring complex migrations.
  4. Optimize Query Performance
    • Implement advanced indexing and partitioning strategies to improve the speed and efficiency of data queries.
    • Enable efficient metadata management to handle large datasets and complex queries effectively.
  5. Enhance Data Governance
    • Provide tools for better tracking and managing data lineage, versioning, and auditing, which are crucial for maintaining data quality and compliance.

By addressing these goals, table format frameworks like Apache Iceberg, Apache Hudi, and Delta Lake help organizations build more robust, scalable, and reliable data lakes and lakehouses. Data engineers, data scientists and business analysts leverage analytics, AI/ML or reporting/visualization tools on top of the table format to manage and analyze large volumes of data.
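As a small illustration of two of these goals, schema evolution and time travel, here is a hedged sketch with Iceberg’s Java API. The table is assumed to be loaded from a catalog as shown earlier; the column name is illustrative:

```java
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.types.Types;

public class EvolutionExample {

    static void evolveAndInspect(Table table) {
        // Schema evolution: add a column without rewriting existing data files.
        table.updateSchema()
            .addColumn("region", Types.StringType.get())
            .commit();

        // Time travel: every commit creates a snapshot that can be queried later.
        for (Snapshot snapshot : table.snapshots()) {
            System.out.printf("snapshot %d at %d%n",
                snapshot.snapshotId(), snapshot.timestampMillis());
        }
    }
}
```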

Comparison of Apache Iceberg, Hudi, Paimon, and Delta Lake

I won’t do a comparison of the different table format frameworks Apache Iceberg, Apache Hudi, Apache Paimon and Delta Lake here. Many experts wrote about this already. Each table format framework has unique strengths and benefits. But updates are required every month because of the fast evolution and innovation, adding new improvements and capabilities within these frameworks.

Here is a summary of what I see in various blog posts about the four alternatives:

  • Apache Iceberg: Excels in schema and partition evolution, efficient metadata management, and broad compatibility with various data processing engines.
  • Apache Hudi: Best suited for real-time data ingestion and upserts, with strong change data capture capabilities and data versioning.
  • Apache Paimon: A lake format that enables building a real-time lakehouse architecture with Flink and Spark for both streaming and batch operations.
  • Delta Lake: Provides robust ACID transactions, schema enforcement, and time travel features, making it ideal for maintaining data quality and integrity.

A key decision point might be that Delta Lake is not driven by a broad community like Iceberg and Hudi, but mainly by Databricks as a single vendor behind it.

Apache XTable as Interoperable Cross-Table Framework supporting Iceberg, Hudi and Delta Lake

Users have lots of choices. XTable is yet another incubating table framework under the Apache open source license; it enables seamless cross-table interoperability between Apache Hudi, Delta Lake, and Apache Iceberg.

Apache XTable:

  • provides cross-table omnidirectional interoperability between lakehouse table formats.
  • is NOT a new or separate format; instead, it provides abstractions and tools for the translation of lakehouse table format metadata.
  • is formerly known as OneTable.

Maybe Apache XTable is the answer to providing options for specific data platforms and cloud vendors while still enabling simple integration and interoperability.

But be careful: A wrapper on top of different technologies is not a silver bullet. We saw this years ago when Apache Beam emerged. Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data ingestion and data processing workflows. It supports a variety of stream processing engines, such as Flink, Spark, and Samza. The primary driver behind Apache Beam is Google, aiming to ease the migration of workflows to Google Cloud Dataflow. However, the limitations are huge, as such a wrapper needs to find the least common denominator of supported features. And the key benefit of most frameworks lies in the twenty percent that does not fit into such a wrapper. For these reasons, for instance, Kafka Streams intentionally does not support Apache Beam because it would have required too many design limitations.

Market Adoption of Table Format Frameworks

First of all, we are still in the early stages. In terms of the Gartner Hype Cycle, we are still at the innovation trigger, approaching the peak of inflated expectations. Most organizations are still evaluating these table formats but not yet adopting them in production across the organization.

Flashback: Container Wars – Kubernetes vs. Mesosphere vs. Cloud Foundry

The debate around Apache Iceberg reminds me of the container wars a few years ago. The term “Container Wars” refers to the competition and rivalry among different containerization technologies and platforms in the realm of software development and IT infrastructure.

The three competing technologies were Kubernetes, Mesosphere and Cloud Foundry. Here is how it went:

Google Trends for Kubernetes Mesosphere and Cloud Foundry

Cloud Foundry and Mesosphere were early to market. Kubernetes nevertheless won the battle. Why? I never understood all the technical details and differences. In the end, when three frameworks are pretty similar, it all comes down to community adoption, the right timing of feature releases, good marketing, luck, and a few other factors. But it is good for the software industry to have one leading open source framework to build solutions and business models on, instead of three competing ones.

Present: Table Format Wars – Apache Iceberg vs. Hudi vs. Delta Lake

Obviously, Google Trends is no statistical evidence or sophisticated research. But I used it a lot in the past as an intuitive, simple, free tool to analyze market trends. Therefore, I also use this tool to see if Google searches overlap with my personal experience of the market adoption of Apache Iceberg, Hudi and Delta Lake (Apache XTable is too small yet to be added):

Google Trends and Market Adoption of Apache Iceberg Hudi Delta Lake

We obviously see a pattern similar to the container wars of a few years ago. I have no idea where this is going, whether one technology wins, or whether the frameworks differentiate enough to prove that there is no silver bullet. The future will show us.

My personal opinion? I think Apache Iceberg will win the race. Why? I cannot argue with any technical reasons. I just see many customers across all industries talk about it more and more. And more and more vendors start supporting it. But we will see. I actually do NOT care who wins. However, similarly to the container wars, I think it is good to have a single standard and vendors differentiating with features around it, like it is with Kubernetes.

But with this in mind, let’s explore the current strategy of the leading data platforms and cloud providers regarding table format support in their platforms and cloud services.

Data Platform and Cloud Vendor Strategies for Apache Iceberg

I won’t do any speculation in this section. The evolution of the table format frameworks moves quickly. And vendor strategies change quickly. Please refer to the vendor’s websites for the latest information. But here is a status quo about the data platform and cloud vendor strategies regarding the support and integration of Apache Iceberg.

  • Snowflake:
    • Has supported Apache Iceberg for quite some time already
    • Adding better integrations and new features regularly
    • Internal and external storage options (with trade-offs) like Snowflake’s storage or Amazon S3
    • Announced Polaris, an open source catalog implementation for Iceberg, with commitment to support community-driven, vendor-agnostic bi-directional integration
  • Databricks:
    • Focuses on Delta Lake as the table format and (now open sourced) Unity as catalog
    • Acquired Tabular, the leading company behind Apache Iceberg
    • Unclear future strategy of supporting open Iceberg interface (in both directions) or only to feed data into its lakehouse platform and technologies like Delta Lake and Unity Catalog
  • Confluent:
    • Embeds Apache Iceberg as a first-class citizen into its data streaming platform (the product is called Tableflow)
    • Converts a Kafka Topic and related schema metadata (i.e., data contract) into an Iceberg table
    • Bi-directional integration between operational and analytical workloads
    • Analytics with embedded serverless Flink and its unified batch and streaming API or data sharing with 3rd party analytics engines like Snowflake, Databricks or Amazon Athena
  • More Data Platforms and Open Source Analytics Engines:
    • The list of technologies and cloud services supporting Iceberg grows every month
    • A few examples: Apache Spark, Apache Flink, ClickHouse, Dremio, Starburst using Trino (formerly PrestoSQL), Cloudera using Impala, Imply using Apache Druid, Fivetran
  • Cloud Service Providers (AWS, Azure, GCP, Alibaba):
    • Different strategies and integrations, but all cloud providers increase Iceberg support across their services these days, for instance:
    • Object Storage: Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS)
    • Catalogs: Cloud-specific like AWS Glue Catalog or vendor agnostic like Project Nessie or Hive Catalog
    • Analytics: Amazon Athena, Azure Synapse Analytics, Microsoft Fabric, Google BigQuery

The Shift Left Architecture moves data processing closer to the data source, leveraging real-time data streaming technologies like Apache Kafka and Flink to process data in motion directly after it is ingested. This approach reduces latency and improves data consistency and data quality.

Shift Left Architecture with Data Streaming into Data Lake Warehouse Lakehouse

Unlike ETL and ELT, which involve batch processing with the data stored at rest, the Shift Left Architecture enables real-time data capture and transformation. It aligns with the Zero ETL concept by making data immediately usable. But in contrast to Zero ETL, shifting data processing to the left side of the enterprise architecture avoids a complex, hard-to-maintain spaghetti architecture with many point-to-point connections.

The Shift Left Architecture also reduces the need for Reverse ETL by ensuring data is actionable in real-time for both operational and analytical systems. Overall, this architecture enhances data freshness, reduces costs, and speeds up the time-to-market for data-driven applications. Learn more about this concept in my blog post about “The Shift Left Architecture“.
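The following hedged sketch shows the shift-left idea with Flink’s Table API in Java: consume raw events from Kafka, clean them in motion, and write them to an Iceberg table. Topic names, catalog options, and the exact connector properties are illustrative and version-dependent; they assume the Flink Kafka and Iceberg connectors are on the classpath:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ShiftLeftJob {
    public static void main(String[] args) {
        TableEnvironment env = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Raw events from Kafka (connector options are illustrative).
        env.executeSql(
            "CREATE TABLE orders_raw (order_id BIGINT, amount DECIMAL(10,2), order_ts TIMESTAMP_LTZ(3)) "
            + "WITH ('connector' = 'kafka', 'topic' = 'orders', "
            + "'properties.bootstrap.servers' = 'localhost:9092', "
            + "'format' = 'json', 'scan.startup.mode' = 'earliest-offset')");

        // Iceberg sink via the Flink/Iceberg connector (catalog options are illustrative).
        env.executeSql(
            "CREATE TABLE orders_clean (order_id BIGINT, amount DECIMAL(10,2), order_ts TIMESTAMP_LTZ(3)) "
            + "WITH ('connector' = 'iceberg', 'catalog-name' = 'demo', "
            + "'catalog-type' = 'rest', 'uri' = 'http://localhost:8181', "
            + "'warehouse' = 's3://my-bucket/warehouse')");

        // Shift left: filter and clean in motion, directly after ingestion.
        env.executeSql(
            "INSERT INTO orders_clean "
            + "SELECT order_id, amount, order_ts FROM orders_raw WHERE amount > 0");
    }
}
```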

Shift Left Architecture with Apache Kafka Flink and Iceberg

Apache Iceberg as Open Table Format and Catalog for Seamless Data Sharing across Analytics Engines

An open table format and catalog introduces enormous benefits into the enterprise architecture: interoperability, freedom of choice of the analytics engines, faster time-to-market and reduced cost.

Apache Iceberg seems to be becoming the de facto standard across vendors and cloud providers. However, it is still at an early stage, and competing and wrapper technologies like Apache Hudi, Apache Paimon, Delta Lake and Apache XTable are trying to gain momentum, too.

Iceberg and other open table formats are not just a huge win for single storage and integration with multiple analytics / data / AI/ML platforms such as Snowflake, Databricks, Google BigQuery, et al., but also for the unification of operational and analytical workloads using data streaming with technologies such as Apache Kafka and Flink. The Shift Left Architecture is a significant benefit to reduce efforts, improve data quality and consistency, and enable real-time instead of batch applications and insights.

Finally, if you still wonder what the differences are between data streaming and lakehouses (and how they complement each other), check out this ten minute video:

What is your table format strategy? Which technologies and cloud services do you connect? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Kappa Architecture is Mainstream Replacing Lambda
https://www.kai-waehner.de/blog/2021/09/23/real-time-kappa-architecture-mainstream-replacing-batch-lambda/
Thu, 23 Sep 2021

Real-time data beats slow data. That’s true for almost every use case. Nevertheless, enterprise architects build new infrastructures with the Lambda architecture that includes separate batch and real-time layers. This blog post explores why a single real-time pipeline, called Kappa architecture, is the better fit. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa but also show how batch processing fits into this discussion positively without the need for Lambda.

This post is heavily inspired by Jay Kreps’ article “Questioning the Lambda Architecture” from 2014 (!) and maps his thoughts to the real-world situation in 2021. Today, almost every business solution, data storage and analytics provider, and business application leverages event streaming and asynchronous, truly decoupled event-based communication paradigms for data processing. For that reason, many move from Lambda to Kappa architectures.

Kappa Architecture vs Lambda Architecture for Apache Kafka Pulsar Data Lakes

A Modern Enterprise Architecture

A modern enterprise architecture offers cloud-native characteristics: Flexibility, elasticity, automation, true decoupling between different applications, and real-time capabilities (where needed).

Microservices, Data Mesh, and Domain-driven Design for True Decoupling

Let’s quickly explore the buzzwords to understand how most people build modern enterprise architectures today:

  • Domain-driven Design (DDD) enforces strict boundaries for service communication in a decentralized application landscape.
  • Microservices enable building flexible, decoupled applications with different programming languages and communication paradigms.
  • Data Mesh allows architecting services around data. Data is the product in a data mesh. Self-service capabilities and federation enable business units to focus on their business problem.

My blog post “Microservices, Apache Kafka, and Domain-Driven Design” explored this discussion in more detail (even though the buzzword “data mesh” did not exist at the time of writing). TL;DR: An event-driven streaming infrastructure such as Apache Kafka uniquely enables proper decoupling and real-time data processing (contrary to traditional web service / REST / HTTP-based microservice architectures and traditional messaging systems such as MQ and ESB). The blog post about Kafka vs. MQ/ETL/ESB might also be helpful to learn more.

Real-time Data Beats Slow Data, but NOT Always!

Think about your industry, business units, problems you solve, and innovative new applications you build. Real-time data beats slow data. This statement is almost always true. Either to increase revenue, reduce cost, reduce risk, or improve the customer experience.

Data at rest means storing data in a database, data warehouse, or data lake. This way, data is processed too late in many use cases, even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.

Don’t get me wrong. Data at rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) work very well with this approach. But real-time beats batch in almost all other use cases.

I analyzed the relation between data at rest and data in motion and how this point of view regarding the enterprise architecture changed with the cloud-first strategy of most companies in the blog post “Serverless Kafka in a Cloud-native Data Lake Architecture“.

The de facto standard for real-time data processing is Apache Kafka. Hence, the covered real-world examples in this post use Kafka.

With this context in mind, let’s revisit Lambda architecture.

The Lambda Architecture

Nathan Marz coined the Lambda architecture: A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods.

Lambda architecture includes batch, speed, and serving layers. This approach enables processing data in real-time but also easy re-processing of batched static datasets. The problem with out-of-order data is also solved.

This approach attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data while simultaneously using real-time stream processing to provide views of online data. The rise of lambda architecture is correlated with the steady growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.

Two Options for a Lambda Architecture

The web discusses two different approaches to Lambda architecture.

The initial approach provided a unified serving layer, which joins the real-time and batch layers:

Lambda Architecture with Unified Serving Layer

Another alternative is two separate serving layers. One layer is for real-time consumption, the other one for batch consumption:

Lambda Architecture with Two Separate Serving Layers

I see the second option much more often in the field. In the end, both have the same concept of building two separate layers for data ingestion and processing.

Issues with the Lambda Architecture

The Hadoop vendors heavily pitched the Lambda architecture to deploy and operate a super complex infrastructure with many big data frameworks. Today, I only hear the pain of enterprises complaining about this complexity and the missing business value. No surprise that most of these vendors did not survive or have a very confusing and unclear future product strategy.

Disney has summarized the concerns with the Lambda architecture on one slide:

Disney Concerns with the Lambda Architecture

The batch and streaming sides each require a different codebase that must be maintained and kept in sync so that processed data produces the same result from both paths. Additionally, with batch, speed, and serving layers, everything needs to be processed (at least) twice. That increases the cost and operations efforts of storage, network, and compute.

Jay Kreps had similar arguments when he proposed the Kappa architecture in 2014 (!), already: “The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be”.

So, what’s different in Kappa architecture?

The Kappa Architecture

The Kappa architecture is an event-based software architecture that can handle all data at any scale in real-time for transactional AND analytical workloads.

The central premise behind the Kappa architecture is that you can perform both real-time and batch processing with a single technology stack. The heart of the infrastructure is a streaming architecture. First, the event streaming platform log stores incoming data. From there, a stream processing engine processes the data continuously in real-time, or ingests the data into any other analytics database or business application via any communication paradigm and speed, including real-time, near real-time, batch, and request-response.
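Here is a minimal Kafka Streams sketch of such a single pipeline (topic names and the transformation are illustrative). The important point: the same code serves real-time consumers and re-processing, because re-running it against the retained log recomputes the results:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class KappaPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kappa-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("sensor-events");

        // Continuous processing: filter and normalize events as they arrive.
        events.filter((key, value) -> value != null && !value.isBlank())
              .mapValues(String::toUpperCase)
              .to("sensor-events-normalized");

        new KafkaStreams(builder.build(), props).start();
    }
}
```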

Kappa Architecture with one Pipeline for Real Time and Batch

Unlike the Lambda Architecture, in this approach, you only do re-processing when your processing code changes, and you need to recompute your results. And, of course, the job doing the re-computation is just an improved version of the same code, running on the same framework, taking the same input data.

Benefits of the Kappa architecture

The Kappa architecture has several benefits:

  • Handle all the use cases (streaming, batch, RPC) with a single architecture
  • One codebase that is always in sync
  • One set of infrastructure and technology
  • The heart of the infrastructure is real-time, scalable, and reliable
  • Improved data quality with guaranteed ordering and no mismatches
  • No need to re-architect for new use cases

TL;DR: The Kappa architecture leverages a single source of truth focusing on simplicity in the enterprise architecture. People can develop, test, debug, and operate their systems on a single processing framework for BOTH real-time and batch systems. To be clear: The leading system for some applications can still be another system. For instance, the leading system for ERP is still SAP, while the source of truth for consumers is the Kafka log.

Kappa for Transactional and Analytical Workloads

Contrary to a data lake, event-streaming-powered Kappa architectures enable transactional workloads in addition to analytical workloads.

Kappa Architecture for Transactional and Analytical Workloads powered by Apache Kafka

For instance, Kafka and its ecosystem support exactly-once semantics so that you can build your next payment platform for aftersales or customer interactions with mission-critical SLAs, low latency, and fault-tolerance built-in. Independently, the data science team consumes historical events for finding insights in a batch process using machine learning.
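As a hedged sketch of the exactly-once building block mentioned above, the following Java producer writes to two topics within one Kafka transaction; topic names and payloads are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalPayments {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // A transactional id enables exactly-once writes across topics.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-producer-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("payments", "order-4711", "{\"amount\": 99.90}"));
                producer.send(new ProducerRecord<>("payment-audit", "order-4711", "accepted"));
                producer.commitTransaction(); // both records become visible atomically
            } catch (Exception e) {
                producer.abortTransaction(); // neither record is exposed to consumers
                throw e;
            }
        }
    }
}
```

Consumers that set their isolation level to read_committed only see the records of committed transactions.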

Kappa is NOT a free lunch!

The Kappa architecture sounds too good to be true? Well, a basic rule of thumb is still valid: Use the right tool for the job!

Event streaming is a paradigm shift. A big bang migration will not work. Here are a few lessons learned from Disney about introducing the Kappa architecture:

Challenges of the Kappa Architecture

As a big bang does not work, a good way is to rethink data and databases. Martin Kleppmann called it “turning the database inside out“. Let’s look at this approach and how it helps to leverage the Kappa architecture in combination with other databases and analytics platforms.

The Inside and Outside Perspective to Solve the Kappa Challenges

Turning the database inside out is a new way of thinking about the enterprise architecture. The heart of the infrastructure is event-based and real-time. Where needed, applications consume the events in batch or store them in additional storage and analytics tools, with their own concepts and paradigms, after consuming the events.

The inner perspective of Kappa: The central nervous system

Think of an event streaming platform like Kafka:

  • Data availability/retention: Compacted Topics, Tiered Storage
  • Data consistency and fault-tolerance: Exactly-once semantics, Multi-Region Clusters, Cluster Linking
  • Handling late-arriving data: Event time and processing time are different. State management in the streaming application, proper data sinks, replay with guaranteed ordering, and timestamps.
  • Data reprocessing and backfill: Dynamic clusters (ideally a serverless cloud offering or at least a cloud-native self-managed cluster), stateful applications (Kafka Streams, ksqlDB, external stream processing framework like Apache Flink).
  • Data integration: Kafka Connect for sources and sinks, clients for any language, REST Proxy (real-time, but also batch and RPC)

An event streaming platform provides many characteristics to build a Kappa architecture. However, it is not a silver bullet. Additional databases and analytics tools are mandatory for some use cases. For instance, Kafka does not scale well for dynamic bursty workloads. Complex SQL queries and joins also need another database.
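To illustrate the retention building blocks from the list above, here is a hedged Java sketch using Kafka’s AdminClient: one compacted topic that keeps only the latest value per key, and one regular topic with long retention for replay and backfill. Topic names, partition counts, and retention values are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Compacted topic: keeps the latest value per key, shrinking history.
            NewTopic customers = new NewTopic("customers", 6, (short) 3)
                .configs(Map.of("cleanup.policy", "compact"));

            // Regular topic with 90 days of retention for replay and backfill.
            NewTopic clickstream = new NewTopic("clickstream", 12, (short) 3)
                .configs(Map.of("retention.ms", String.valueOf(90L * 24 * 60 * 60 * 1000)));

            admin.createTopics(List.of(customers, clickstream)).all().get();
        }
    }
}
```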

The outer perspective of Kappa: The applications and data stores

Think of any business application, data storage, or analytics platform:

  • Data Consumption: Consume the data from the central nervous system. Consume the data at your speed (real-time, near real-time, batch, RPC).
  • Data Storage: Store the data in your storage as long as you need it (in-memory, short-term storage, long-term storage).
  • Data Processing: Process the data for your use case (real-time notification, indexing into your query engine, a batch process for reporting or model training, etc.). Complex processing is not doable in the event streaming platform (e.g., complex joins, intensive compute with batch algorithms).

The discussion “Can Apache Kafka be used as a database?” is also helpful to understand both perspectives and the trade-offs of using Kafka as data storage.

Cost-Efficient and Scalable Kappa Architectures

A huge problem with realizing the Kappa architecture in the real world was storing vast volumes of data in an event streaming platform. This approach was costly and had scalability issues at terabyte or petabyte scale. On the contrary, data lakes were designed for vast volumes from the beginning. Hadoop and HDFS were used on-premise in the early phases. The public cloud enabled the migration to fully-managed object storage such as AWS S3 or Google Cloud Storage to make data lakes even more scalable and cost-efficient for big data.

One approach is to reduce the data stored in the event streaming platform. Infinite retention leveraging log compaction is a viable approach to reduce the storage size. However, compacted topics shrink data sets and only store the latest value for each message key. Hence, this workaround is not applicable for every use case.

Another workaround I have seen a lot in practice is building a “streaming data lake” with Kafka as a streaming layer and object storage for long-term storage. The bi-directional integration was built with Kafka Connect and sink and source connectors. This was actually the main reason why Confluent built an S3 Source connector for Kafka Connect in addition to its heavily used S3 Sink connector.

Tiered Storage for Event Streaming

The good news is that streaming platforms evolved. Tiered Storage allows decoupling storage from computing in event streaming platforms such as Kafka or Pulsar.

Tiered Storage is a game-changer for Kappa architectures. It manages the storage without a performance impact on real-time consumers. Additionally, this enables a very cost-efficient and elastic Kappa architecture without the need for a traditional data lake. Uber talks about the motivation and benefits of Tiered Storage for Kafka (KIP-405) in a recent Kafka Summit talk.

Kappa architectures are very flexible regarding the underlying storage technology. While Uber uses Hadoop’s HDFS as storage, Confluent went another way: Confluent Tiered Storage for Kafka is based on the S3 interface to leverage object storage and works for both public cloud provider object stores such as AWS S3 or GCS, and on-premise object stores such as PureStorage or MinIO for Kubernetes.

Confluent Tiered Storage for Kafka for Digital Forensics of Historical Data

In other words: Tiered Storage for Kafka can leverage the same modern data storage as modern cloud data lakes (or as AWS calls it today: Lake House). Hence, the Kappa architecture provides the best of both worlds: Real-time data processing plus cost-efficient and scalable long-term storage for replaying historical data.

Real-World Examples for a Kappa Architecture

The above was a lot of theory. Let’s recap: Real-time data beats slow data in most use cases. But batch processing is still needed and will not go away.

Let’s now look at a few real-world examples of Kappa architectures at Uber, Shopify, and Disney.

Kappa at Uber for Trillions of Messages and Petabytes per Day

Uber is a very prominent tech giant. They talk a lot about their software architectures and deployments regularly in public. Uber is one of the most significant Kafka users in the world. In the meantime, they process over 4 trillion messages and 3 PB of data per day.

As a perfect fit for this blog post, Uber presented at a recent Kafka Summit about their Kappa architecture:

Kappa instead of Lambda Architecture with Kafka at Uber

As you can see, Uber’s architecture evolved precisely to what I described in the above sections. The central nervous system is a Kafka-based real-time infrastructure. Uber still has batch pipelines. Uber also provides APIs (e.g., to mobile apps). And, no surprise, they also have traditional SQL and NoSQL databases, business intelligence reporting tools, dashboards, and much more.

Uber’s architecture shows the massive benefits of Kappa: The heart of the infrastructure is real-time, scalable, fault-tolerant, and reliable. A single pipeline for everything. No need for a Lambda architecture! Kappa enables transactional and analytical workloads. Each microservice in the data mesh can use its own technology and communication paradigm for each application.

Kappa at Shopify for Stateless and Stateful Data Streaming

Shopify presented their Kappa architecture in a recent Kafka Summit talk: “It’s Time To Stop Using Lambda Architecture”. The session covered the concerns of Kappa architecture and how Shopify solved them with different building blocks. The three key components are the log (Kafka), the streaming framework (Kafka Streams and Apache Flink), and the data sinks (any real-time consumer or data store).

Here is one example of a stateful Kappa scenario at Shopify:

Kappa Architecture with Kafka at Shopify

Shopify discussed the core building blocks of their Kappa architecture:

The Log (Kafka)

  • Durability with Topic Compaction and Tiered Storage
  • Consistency via Exactly-Once Semantics (EOS)
  • Data Integration via Kafka Connect
  • Elasticity via dynamic Kafka clusters

Streaming Framework (Kafka Streams / Flink)

  • Reliability and scalability
  • Fault tolerance
  • State management

Data Sinks

  • Real-time consumers
  • Update/upsert for simplified design, for instance, RDBMS, NoSQL, Compacted Kafka Topics
  • Append-only storage (i.e., no update), for instance, regular Kafka Topics, Time Series databases

Kappa at Disney as Single Source of Truth

Disney’s Kafka Summit talk “Big Data Kappa” is very inspiring. It probably includes the most lessons learned and trade-offs of a real-world Kappa deployment. I encourage you to watch the on-demand video; it offers many insights and much guidance for building your own Kappa architecture.

All data writes at Disney go through Kafka as the source of truth. The following screenshot shows the concept. The green box is the Kafka cluster, including Tiered Storage as the single source of truth. Any application consumes the data from Kafka for further processing and optional external storage.

Kappa Architecture with Apache Kafka at Disney

Disney solves the following problems with its Kafka-based Kappa architecture:

  • Keep it simple (KISS)
  • Reduce code duplication
  • Decrease end-to-end latency
  • Full system immutability
  • Avoid data discrepancies
  • Ability to move laterally between storage systems
  • Everyone wants their answers faster

Kappa at Twitter for Migration from Lambda Architecture

Twitter processes approximately 400 billion events in real-time and generates data at petabyte (PB) scale every day. The on-premise architecture with Hadoop and Kafka using the Lambda architecture was not efficient enough:

Old Twitter Lambda Architecture with Hadoop and Kafka

Therefore, Twitter migrated to the cloud on GCP with Kafka using the Kappa architecture:

New Twitter Kappa Architecture with Kafka and GCP Dataflow

With the new hybrid architecture on both Twitter Data Center and Google Cloud Platform, they “are able to process billions of events in real-time and achieve low latency, high accuracy, stability, architecture simplicity, and reduced operation cost for engineers” as Twitter quotes in their detailed blog post about their Lambda to Kappa migration.

Example Project: Kappa for Machine Learning including Model Training, Scoring, and Monitoring

After real-world examples from Uber, Shopify, and Disney, I want to share one more practical code example: A technical demo connecting to 100,000 IoT devices to do streaming machine learning.

The use case is about integrating tens or hundreds of thousands of IoT devices and processing the data in real-time. The demo use case is predictive maintenance (i.e., anomaly detection) in a connected car infrastructure to predict motor engine failures:

Kappa Architecture with Apache Kafka MQTT Kubernetes and Tensorflow for Streaming Machine Learning

The implemented Kappa architecture provides a single real-time infrastructure for various very different use cases and processing paradigms:

  • Real-time data ingestion at high throughput from IoT devices via an MQTT proxy: Integration with millions of interfaces, in this case, simulated vehicles.
  • Batch processing for model training: The TensorFlow Python application from the data scientist consumes historical data from the Kafka log to train analytic models.
  • Real-time stream processing for model scoring: The Java-based streaming application is powered by Kafka Streams / ksqlDB and operated by the production engineer with mission-critical SLAs and low latency.
  • Near real-time ingestion into the digital twin for analytics: Kafka Connect ingests the data into different databases and applications, in this case, a MongoDB Atlas cloud service.
  • Synchronous request-response / RPC communication for mobile app integration and transactional workloads: The Confluent REST Proxy (or any other web / mobile proxy) sends real-time alerts to humans.

The whole infrastructure is cloud-native. It runs on Kubernetes and can be deployed in a data center or on any hyperscaler. The following blog post explains the demo in detail: IoT Live Demo – 100.000 Connected Cars with Kubernetes, Kafka, MQTT, TensorFlow. The code is available in the GitHub project.

Kappa under the Hood of Next Generation Software Products and SaaS Offerings

Software companies have the same challenges as end-users like Uber, Shopify, or Disney. Hence, no surprise that software vendors move to Kappa architectures and real-time capabilities as the heart of their infrastructures.

This section shows a few examples of software vendors that moved to event-based architectures, event streaming, and asynchronous external interfaces within their next-generation software offerings.

Once again: This does NOT mean that everything within these products is real-time or event-based. But if the related components provide real-time capabilities, you can provide a real-time interface for internal or external consumers.

Business Solutions (Salesforce, SAP, Slack, et al.)

Business solutions provide customer interactions, logistics, manufacturing, internal communication, and many other use cases. No surprise that real-time data beats slow data. For this reason, most modern business solutions moved from less flexible and less scalable communication paradigms to event-based interfaces. Instead of using files, web service APIs, or manual changes, communication happens via event-driven APIs internally and externally.

A few examples across different business solutions:

  • Salesforce: The internal “platform events” architecture heavily relies on Apache Kafka for decoupled real-time data processing at scale. External APIs like the integration with Salesforce’s proprietary sObject datatype moved from SOAP and REST web services to Streaming API PushTopics, Enterprise Messaging Platform Events, and Change Data Capture Events.
  • SAP: Instead of relying on its legacy proprietary interfaces such as BAPI and iDoc, SAP moved to event-based APIs in their next-generation SAP S/4 Hana ERP platform. The blog “SAP integration options for Apache Kafka” shows the mess of numerous legacy interfaces and alternative modern event-based integration options.
  • Slack: Being a messaging platform by nature, it is no surprise that the heart of their core backend infrastructure leverages event streaming. Slack’s data streaming team focuses on providing Kafka as a Service for the company at the scale of trillions of messages per day across dozens of clusters in Amazon data centers. For the front-end, Slack’s current architecture leverages a service mesh built with Envoy and WebSockets.

Databases, Data Warehouses, Log Analytics

Data storage and analytics vendors are traditionally batch technologies for long-term storage, dashboards, reporting, and interactive queries. The heart of most solutions is still a batch system for analytics workloads. That’s the core business of these products and services.

Nevertheless, almost all of these vendors went into (near) real-time business due to customer demand. Hence, event-based integration capabilities and near real-time ingestion, processing, and analytics are becoming more prevalent. Some examples:

  • MongoDB: “Change Streams” allow applications to access real-time data changes from the document-based NoSQL datastore.
  • Snowflake: “Snowpipe” can help organizations seamlessly load continuously generated data into the cloud data warehouse.
  • Elasticsearch: “Data Streams” lets you store append-only time series data across multiple indices while giving you a single named resource for requests. Data streams are well-suited for logs, events, metrics, and other continuously generated data ingested into the Elasticsearch engine.

These solutions have in common that they move from batch to near real-time ingestion into their data store or data lake. Nevertheless, they still store and analyze data at rest. Hence, this is complementary but not an alternative to event streaming.
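As one concrete example of such a near real-time API, here is a minimal, hedged sketch of a MongoDB change stream consumer using the MongoDB Java driver (connection string, database, and collection names are illustrative; change streams require a replica set):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class ChangeStreamListener {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                client.getDatabase("shop").getCollection("orders");

            // Blocks and emits one event per insert/update/delete on the collection.
            orders.watch().forEach(change ->
                System.out.println(change.getOperationType() + ": " + change.getFullDocument()));
        }
    }
}
```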

New entrants into the market try to differentiate from the above data storage vendors by providing a real-time infrastructure at its core. A great example is Rockset, a scalable real-time analytics platform in the cloud. As it is a native real-time solution, Rockset natively integrates with event streaming platforms such as Apache Kafka.

Event Streaming

Event Streaming platforms are event-based by nature. They process data in motion continuously. Therefore, the central nervous system of a Kappa architecture has to be an event streaming platform. Period.

For a comparison of frameworks like Kafka and Pulsar, plus reviewing the differentiators from platform vendors and SaaS providers such as Confluent, Cloudera, Red Hat, Amazon MSK, Azure Event Hubs, etc., please check out this comparison of event streaming platforms.

Event streaming will be one serverless component in a cloud-native data lake architecture in many future enterprise architectures.

It is worth noting that event streaming and the above-discussed business solutions and data storage and analytics vendors are complementary, not competitive! For instance, Confluent partners with business solutions such as Salesforce, database vendors such as MongoDB and Elastic, data warehouses such as Snowflake, and cloud providers such as AWS or Azure to provide Source, Sink, and Change Data Capture (CDC) connectors powered by Kafka Connect. The fully managed Confluent Cloud service even provides the end-to-end integration as part of the serverless offering in the public cloud.

Video Recording: Kappa vs. Lambda Architecture

I covered the discussion around “Kappa vs. Lambda” in a 40-minute video recording, too. Enjoy:

Kappa is the New Black for the Enterprise Architecture

Real-time data beats slow data. After reading this article, think about your industry, business unit, and projects again. If real-time data processing improves your customer experiences, increases your revenue, or reduces your cost and risk, then why wait? The Kappa architecture provides enormous benefits and a much simpler infrastructure than the Lambda architecture.

Having said this, batch processing and other data storage and analytics services are not going away. Kappa and event streaming are complementary, and no silver bullet for every problem. For more details, check out the article “Can Apache Kafka replace a database?”, which emphasizes this statement and explores the trade-offs.

Event streaming is the foundation of Kappa architecture. There is no way around this. Apache Kafka is the de facto standard for event streaming and the choice in real-world Kappa architectures. If you still need or want to evaluate your own event streaming platform, continue with the Kafka vs. Pulsar comparison or the general comparison of competitive event streaming vendors and cloud solutions.

Did you already adopt the Kappa architecture? Or do you still rely on or even prefer Lambda architectures? What are your experiences and opinions? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Can Apache Kafka Replace a Database?
https://www.kai-waehner.de/blog/2020/03/12/can-apache-kafka-replace-database-acid-storage-transactions-sql-nosql-data-lake/
Thu, 12 Mar 2020

Can and should Apache Kafka replace a database? How long can and should I store data in Kafka? How can I query and process data in Kafka? These are common questions that come up more and more. Short answers like “Yes” or “It depends” are not good enough for you? Then this read is for you! This blog post explains the idea behind databases and different features like storage, queries, and transactions to evaluate when Kafka is a good fit and when it is not.

Jay Kreps, the co-founder of Apache Kafka and Confluent, explained already in 2017 why “It’s okay to store data in Apache Kafka”. However, many things have improved, and new components and features were added in the last three years. This update covers the core concepts of Kafka from a database perspective. It includes Kafka-native add-ons like Tiered Storage for long-term, cost-efficient storage and ksqlDB as an event streaming database. The relation and trade-offs between Kafka and other databases are explored, showing how they complement each other instead of replacing one another. This discussion includes different options for pull- and push-based bi-directional integration.

I also created a slide deck and video recording about this topic.

What is a Database? Oracle? NoSQL? Hadoop?

Let’s think about the term “database” on a very high level. According to Wikipedia,

“A database is an organized collection of data, generally stored and accessed electronically from a computer system. 

The database management system (DBMS) is the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS software additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a “database system”. Often the term “database” is also used to loosely refer to any of the DBMS, the database system or an application associated with the database.

Computer scientists may classify database-management systems according to the database models that they support. Relational databases became dominant in the 1980s. These model data as rows and columns in a series of tables, and the vast majority use SQL for writing and querying data. In the 2000s, non-relational databases became popular, referred to as NoSQL because they use different query languages.”

Based on this definition, we know that there are many databases on the market. Oracle. MySQL. Postgres. Hadoop. MongoDB. Elasticsearch. AWS S3. InfluxDB. Kafka.

Hold on. Kafka? Yes, indeed. Let’s explore this in detail…

Can Apache Kafka Replace a Database like Oracle Hadoop NoSQL MongoDB Elastic MySQL et al

Storage, Transactions, Processing and Querying Data

A database infrastructure is used for storage, queries and processing of data, often with specific delivery and durability guarantees (aka transactions).

There is not just one database, as we all should know from all the NoSQL and big data products on the market. For each use case, you (should) choose the right database. It depends on your requirements. How long to store data? What structure should the data have? Do you need complex queries or just retrieval of data via key and value? Do you require ACID transactions, exactly-once semantics, or “just” at-least-once delivery guarantees?

These and many more questions have to be answered before you decide if you need a relational database like MySQL or Postgres, a big data batch platform like Hadoop, a document store like MongoDB, a key-value store like RocksDB, a time series database like InfluxDB, an in-memory cache like Memcached, or something else.

Every database has different characteristics. Thus, when you ask yourself if you can replace a database with Kafka, which database and what requirements are you talking about?

What is Apache Kafka?

Obviously, it is also crucial to understand what Kafka is to decide if Kafka can replace your database. Otherwise, it is really hard to proceed with this evaluation… 🙂

Kafka is an Event Streaming Platform => Messaging! Stream Processing! Database! Integration!

First of all, Kafka is NOT just a pub/sub messaging system to send data from A to B. This is what some unaware people typically respond to such a question when they think Kafka is the next IBM MQ or RabbitMQ. Nope. Kafka is NOT a messaging system. A comparison with other messaging solutions is an apples-to-oranges comparison (but still valid to decide when to choose Kafka or a messaging system).

Kafka is an event streaming platform. Companies from various industries presented hundreds of use cases where they use Kafka successfully for much more than just messaging. Just check out all the talks from the Kafka Summits (including free slide decks and video recordings).

One of the main reasons why Apache Kafka became the de facto standard for so many different use cases is its combination of four powerful concepts:

  • Publish and subscribe to streams of events, similar to a message queue or enterprise messaging system
  • Store streams of events in a fault-tolerant storage as long as you want (hours, days, months, forever)
  • Process streams of events in real time, as they occur
  • Integration of different sources and sinks (no matter if real time, batch or request-response)

Apache Kafka - The De-facto Standard for Real-Time Event Streaming

Decoupled, Scalable, Highly Available Streaming Microservices

With these four pillars built into one distributed event streaming platform, you can decouple various applications (i.e., producers and consumers) in a reliable, scalable, and fault-tolerant way.

As you can see, storage is one of the key principles of Kafka. Therefore, depending on your requirements and definition, Kafka can be used as a database.
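As a small illustration of this storage principle, the following hedged Java sketch replays all retained events of a topic partition from the beginning of the Kafka log; the topic name and settings are illustrative:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class HistoricalReplay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-1");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("payments", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition)); // replay everything still retained

            consumer.poll(Duration.ofSeconds(5)).forEach(record ->
                System.out.printf("offset=%d key=%s value=%s%n",
                    record.offset(), record.key(), record.value()));
        }
    }
}
```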

Is “Kafka Core” a Database with ACID Guarantees?

I won’t cover the whole discussion about how “Kafka Core” (meaning Kafka brokers and its concepts like distributed commit log, replication, partitions, guaranteed ordering, etc.) fits into the ACID (Atomicity, Consistency, Isolation, Durability) transaction properties of databases. This was discussed already by Martin Kleppmann at Kafka Summit San Francisco 2018 (“Is Kafka a Database?”) and a little bit less technically by Tim Berglund (“Dissolving the Problem – Making an ACID-Compliant Database Out of Apache Kafka”).

TL;DR: Kafka is a database and provides ACID guarantees. However, it works differently than other databases. Kafka is also not replacing other databases; but a complementary tool in your toolset.

The Client Side of Kafka

In messaging systems, the client API provides producers and consumers to send and read messages. All other logic is implemented using low level programming or additional frameworks.

In databases, the client API provides a query language to create data structures and enables the client to store and retrieve data. All other logic is implemented using low level programming or additional frameworks.

In an event streaming platform, the client API is used for sending and consuming data like in a messaging system. However, in contrast to messaging systems and databases, the client API provides much more functionality.

Independent, scalable, and reliable applications can be built with the Kafka APIs. Therefore, a Kafka client application is a distributed system that queries, processes, and stores continuous streams of data. Many applications can be built without the need for another additional framework.

The Kafka Ecosystem – Kafka Streams, ksqlDB, Spring Kafka, and Much More…

The Kafka ecosystem provides various different components to implement applications.

Kafka itself includes a Java and Scala client API, Kafka Streams for stream processing with Java, and Kafka Connect to integrate with different sources and sinks without coding.
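
To make this concrete, here is a minimal Java producer sketch (a sketch only: the broker address localhost:9092 and the topic name “payments” are placeholder assumptions, not from a specific project):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The producer appends the event to the distributed commit log;
        // the broker persists it according to the topic's retention configuration.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "customer-42", "{\"amount\": 99.95}"));
        }
    }
}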

Many additional Kafka-native client APIs and frameworks exist. Here are some examples:

  • librdkafka: A C library implementation of the Apache Kafka protocol, providing Producer, Consumer and Admin clients. It was designed with message delivery reliability and high performance in mind; current figures exceed 1 million msgs/second for the producer and 3 million msgs/second for the consumer. In addition to the C library, it is often used as a wrapper to provide Kafka clients in other programming languages such as C++, Golang, Python and JavaScript.
  • REST Proxy: Provides a RESTful interface to a Kafka cluster. It makes it easy to produce and consume messages, view the state of the cluster, and perform administrative actions without using the native Kafka protocol or clients.
  • ksqlDB: An event streaming database for Apache Kafka that enables you to build event streaming applications leveraging your familiarity with relational databases.
  • Spring for Kafka: Applies core Spring concepts to the development of Kafka-based messaging and streaming solutions. It provides a “template” as a high-level abstraction for sending messages. Includes first-class support for Kafka Streams. Additional Spring frameworks like Spring Cloud Stream and Spring Cloud Data Flow also provide native support for event streaming with Kafka.
  • Faust: A library for building streaming applications in Python, similar to the original Kafka Streams library (but more limited functionality and less mature).
  • TensorFlow I/O + Kafka Plugin: A native integration into TensorFlow for streaming machine learning (i.e. directly consuming models from Kafka for model training and model scoring instead of using another data lake).
  • Many more…

Domain-Driven Design (DDD), Dumb Pipes, Smart Endpoints

The importance of Kafka’s client side is crucial for the discussion of potentially replacing a database because Kafka applications can be stateless or stateful; the latter keeping state in the application instead of using an external database. The storage section below contains more details about how the client application can store data long term and is highly available.

With this, you understand that Kafka has a powerful server and a powerful client side. Many people are not aware of this when they evaluate Kafka versus other messaging solutions or storage systems.

This, in conjunction with the ability to truly decouple producers and consumers via Kafka’s underlying storage, makes clear why Apache Kafka became the de facto standard and backbone for microservice architectures – not just replacing other traditional middleware, but also building the client applications using Domain-Driven Design (DDD) for decoupled applications, dumb pipes and smart endpoints:

Apache Kafka Domain Driven Design DDD

Check out the blog post “Microservices, Apache Kafka, and Domain-Driven Design (DDD)” for more details on this discussion.

Again, why is this important for the discussion around Kafka being a database? For every new microservice you create, you should ask yourself: Do I really need a “real database” backend in my microservice, with all its complexity and cost regarding development, testing, operations and monitoring?

Often, the answer is yes, of course. But I see more and more applications where keeping the state directly in the Kafka application is better, easier or both.

Storage – How long can you store Data in Kafka? And what is the Database under the Hood?

The short answer: Data can be stored in Kafka as long as you want. Kafka even provides the option to use a retention time of -1. This means “forever”.
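
As a small sketch of what a “retention time of -1” looks like in practice, the following Java snippet creates such a topic via the AdminClient (topic name, partition and replication counts are arbitrary example values):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class ForeverTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // retention.ms = -1 disables time-based deletion, i.e. "store forever"
            NewTopic topic = new NewTopic("payments", 6, (short) 3)
                    .configs(Map.of("retention.ms", "-1"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}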

The longer answer is much more complex. You need to think about the cost and scalability of the Kafka brokers. Should you use HDDs or SSDs? Or maybe even flash-based technology? Pure Storage wrote a nice example leveraging flash storage to write 5 million messages per second with three producers and 3x replication. It depends on how much data you need to store, how fast you need to access it, and how quickly you need to recover from failures.

Publishing with Apache Kafka at The New York Times is a famous example of storing data in Kafka forever. Kafka is used for storing all the articles ever published by The New York Times, replacing their API-based approach. The Streams API is used to feed published content in real-time to the various applications and systems that make it available to their readers.

Another great example is the Account Activity Replay API from Twitter: It uses Kafka as storage to provide “a data recovery tool that lets developers retrieve events from as far back as five days. This API recovers events that weren’t delivered for various reasons, including inadvertent server outages during real-time delivery attempts.”

Until now we have only talked about the most commonly used Kafka features: log-based storage with retention time and disks attached to the broker. However, you also need to consider Kafka’s additional capabilities for a complete discussion about long-term storage in a Kafka infrastructure: compacted topics, tiered storage and client-side storage. All of these features can quickly change how you think about Kafka, its use cases and its architecture.

Compacted Topics – Log Compaction and “Event Updates”

Log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition. It addresses use cases and scenarios such as restoring state after application crashes or system failure or reloading caches after application restarts during operational maintenance. Therefore, log compaction does not have a retention time.
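
Analogous to the retention example above, log compaction is just a topic-level configuration. A minimal sketch (the topic name is an illustrative assumption):

// cleanup.policy = compact keeps at least the latest value per message key
NewTopic customerProfiles = new NewTopic("customer-profiles", 6, (short) 3)
        .configs(Map.of("cleanup.policy", "compact"));
admin.createTopics(Collections.singletonList(customerProfiles)).all().get();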

Obviously, the big trade-off is that log compaction does not keep all events and the full order of changes. For this, you need to use the normal Kafka Topics with a specific retention time.

Or you can use -1 to store all data forever. The big trade-offs here are high cost for the disks and more complex operations and scalability.

Tiered Storage – Long-Term Storage in Apache Kafka

KIP 405 (Kafka Improvement Proposal) was created to standardize the interface for Tiered Storage in Kafka to provide different implementations by different contributors and vendors. The KIP is not implemented yet and the interface is still under discussion by contributors from different companies.

Confluent already provides a (commercial) implementation for Kafka to use Tiered Storage for storing data long term in Kafka at low cost. The blog post “Infinite Storage in Confluent Platform” talks about the motivation and implementation of this game-changing Kafka add-on.

Tiered Storage for Kafka reduces cost (due to cheaper object storage), increases scalability (due to the separation between storage and processing), and eases operations (due to simplified and much faster rebalancing).

Here are some examples for storing the complete log in Kafka long-term (instead of leveraging compacted topics):

  • New Consumers, e.g. a completely new microservice or a replacement of an existing application
  • Error Handling, e.g. re-processing data in case of errors to fix them and process the events again
  • Compliance / Regulatory Processing: Reprocessing of already processed data for legal reasons; this could be very old data (e.g. in pharma: 10 years old)
  • Query and Analysis of Existing Events: No need for another data store / data lake; ksqlDB is a good first option, but be aware of its various limitations; or a Kafka-native analytics tool (e.g. Rockset with its Kafka connector and full SQL support for Tableau et al)
  • Machine Learning and Model Training: Consume events for model training with a) one machine learning framework and different hyperparameters or b) different machine learning frameworks

The last example is discussed in more detail here: Streaming Machine Learning with Tiered Storage (no need for a Data Lake).

Client-Side Database – Stateful Kafka Client Applications and Microservices

As discussed above, Kafka is not just the server side. You build highly available and scalable real time applications on the Kafka client side.

RocksDB for Stateful Kafka Applications

Often, these Kafka client applications (have to) keep important state. Kafka Streams and ksqlDB leverage RocksDB for this (you could also just use in-memory storage or replace RocksDB with another storage; I have never seen the latter option in the real world, though). RocksDB is a key-value store for running mission-critical workloads. It is optimized for fast, low latency storage.

In Kafka Streams applications, RocksDB solves the problem of abstracting access to local, stable storage instead of using an external database. Using an external database would require external communication / RPC every time an event is processed – a clear anti-pattern in event streaming architectures.

RocksDB allows software engineers to focus their energies on the design and implementation of other areas of their systems with the peace of mind of relying on RocksDB for access to stable storage. It is battle-tested at several Silicon Valley companies and used under the hood of many famous databases like Apache Cassandra, CockroachDB or MySQL (MyRocks). “RocksDB Is Eating the Database World” covers the history and use cases in more detail.

ksqlDB as Event Streaming Database

Kafka Streams and ksqlDB – the event streaming database for Kafka – allow building stateful streaming applications, including powerful concepts like joins, sliding windows and interactive queries of the state. The following example shows how to build a stateful payment application:

Apache Kafka Stateful Client Microservice Applications Stream Table

The client application keeps the data in its own application for real time joins and other data correlations. It combines the concepts of a STREAM (unchangeable events) and a TABLE (updated information like in a relational database). Keep in mind that this application is highly scalable. It is typically not just a single instance. Instead, it is a distributed cluster of client instances to provide high availability and to parallelize data processing. Even if something goes down (VM, container, disk, network), the overall system will not lose any data and continues running 24/7.
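
A minimal Kafka Streams sketch of such a stateful STREAM / TABLE application could look like this (topic names and the enrichment logic are illustrative assumptions, not the exact application from the picture):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class PaymentEnrichmentApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-enrichment"); // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // TABLE: latest customer profile per key, kept in a local state store (RocksDB)
        KTable<String, String> customers = builder.table("customer-profiles");
        // STREAM: immutable payment events, keyed by customer id
        KStream<String, String> payments = builder.stream("payments");

        // Real time join of the event stream with the locally kept state - no external database call
        payments.join(customers, (payment, profile) -> payment + " | " + profile)
                .to("enriched-payments");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}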

As you can see, many questions have to be answered and various features have to be considered to make the right decision about how long and where you want to store data in Kafka.

One good reason to store data long-term in Kafka is to be able to use the data at a later point in time for processing, correlations or analytics.

Query and Processing – Can you Consume and Analyze the Kafka Storage?

Kafka provides different options to consume and query data.

Queries in Kafka can be either PUSH (i.e. continuously process and forward events) or PULL (i.e. the client requests events like you know it from your favorite SQL database).

Apache Kafka PULL and PUSH Queries in Kafka Streams and ksqlDB

I will show you different options in the following sections.

Consumer Applications Pull Events

Kafka clients pull the data from the brokers. This decouples producers, brokers and consumers and makes the infrastructure scalable and reliable.

Kafka itself includes a Java and Scala client to consume data. However, Kafka clients are available for almost any other programming language, including widespread languages like C, C++, Python, JavaScript or Golang and more exotic languages like Rust. Check out Yeva Byzek’s examples to see your favorite programming language in action. Additionally, Confluent provides a REST Proxy, which allows consuming events via HTTP(S) from any language or tool supporting this standard.

Applications have different options to consume events from the Kafka broker (see the consumer sketch after this list):

  • Continuous consumption of the latest events (in real time or batch)
  • Just specific time frames or partitions
  • All data from the beginning
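
A hedged consumer sketch for the last option – reading all data from the beginning (broker address, topic and group id are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "replay-consumer"); // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Start at the earliest offset if this group has no committed offset yet;
        // consumer.seekToBeginning(...) is an alternative for an explicit replay.
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
            }
        }
    }
}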

Stream Processing Applications / Microservices Pull and Push Events

Kafka Streams and ksqlDB pull events from the brokers, process the data and then push the result back into another Kafka topic. These queries run continuously. Powerful queries are possible, including JOINs and stateful aggregations.

These features are used for streaming ETL and real time analytics at scale, but also to build mission-critical business applications and microservices.

This discussion needs much more detail and cannot be covered in this blog post focusing on the database perspective. Get started e.g. with my intro to event streaming with ksqlDB from Big Data Spain in Madrid, covering use cases and more technical details. “Confluent Developer” is another great place to get started with building event streaming applications. The site provides plenty of tutorials, videos, demos, and more around Apache Kafka.

The feature “interactive queries” allows querying values from the client application’s state store (typically implemented with RocksDB under the hood). The events are pulled via technologies like REST / HTTP or pushed via intermediaries like a WebSockets proxy. Kafka Streams provides the core functionality; you have to implement the interactive query interface yourself on top. Pro: flexibility. Con: not provided out-of-the-box.
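
A minimal sketch of the query part, assuming a running Kafka Streams instance (“streams”) with a state store named “payments-per-customer” (both names are assumptions; the REST layer on top is omitted; newer Kafka versions provide the StoreQueryParameters API shown here):

import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// "streams" is the running KafkaStreams instance of your application
ReadOnlyKeyValueStore<String, Long> store = streams.store(
        StoreQueryParameters.fromNameAndType("payments-per-customer",
                QueryableStoreTypes.keyValueStore()));

// Interactive query: read the current aggregate directly from the local state store
Long totalPayments = store.get("customer-42");

You would then expose such lookups via your own REST endpoint, e.g. with Spring or JAX-RS.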

Kafka as Query Engine and its Limitations

None of the above described Kafka query capabilities are as powerful as your beloved Oracle database or Elasticsearch!

Therefore, Kafka will not replace other databases. It is complementary. The main idea behind Kafka is to continuously process streaming data; with additional options to query stored data.

Kafka is good enough as database for some use cases. However, the query capabilities of Kafka are not good enough for some other use cases.

Kafka is then often used as central streaming platform where one or more databases (and other applications) build their own materialized real time view leveraging their own technology.

The principle is often called “turning the database inside out”. This design pattern allows using the right database for the right problem. Kafka is used in these scenarios:

  • as scalable event streaming platform for data integration
  • for decoupling between different producers and consumers
  • to handle backpressure
  • to continuously process and correlate incoming events
  • for enabling the creation and updating of materialized views within other databases
  • to allow interactive queries directly to Kafka (depending on the use case and used technology)

Kafka as Single Source of Truth and Leading System?

For many scenarios, it is great if the central event streaming platform is the central single source of truth. Kafka provides an event-based real time infrastructure that is scalable and decouples all the producers and consumers. However, in the real world, something like an ERP system will often stay the leading system, even if it pushes the data via Kafka to the rest of the enterprise.

That’s totally fine! Kafka being the central streaming platform does not force you to make it the leading system for every event. For some applications and databases, the existing source of truth is still the source of truth after the integration with Kafka.

The key point here is that your single source of truth should not be a database that stores the data at rest, e.g. a data lake like Hadoop or AWS S3. Otherwise your central storage is a slow batch system, and you cannot simply connect a real time consumer to it. On the other hand, if an event streaming platform is your central layer, you can still ingest the data into your data lake for processing at rest, but you can also easily add another real time consumer.

Native ANSI SQL Query Layer to Pull Events? Tableau, Qlik, Power BI et al to analyze Kafka?

Access to massive volumes of event streaming data through Kafka has sparked strong interest in interactive, real-time dashboards and analytics. The idea is similar to what was built on top of traditional databases like Oracle or MySQL using Tableau, Qlik or Power BI, and on top of batch frameworks like Hadoop using Impala, Presto or BigQuery: the user wants to ask questions and get answers quickly.

Leveraging Rockset, a scalable SQL search and analytics engine based on RocksDB, and in conjunction with BI and analytics tools like Tableau, you can directly query the Kafka log. With ANSI SQL. No limitations. At scale. In real time. This is a good time to question your data lake strategy, isn’t it? 🙂


Check out details about the technical implementation and use cases here: “Real-Time Analytics and Monitoring Dashboards with Apache Kafka and Rockset“. Bosch Power Tools is a great real world example for using Kafka as long-term storage and Rockset for real time analytics and powerful interactive queries.

Transactions – Delivery and Processing Guarantees in Kafka

TL;DR: Kafka provides end-to-end processing guarantees, durability and high availability to build the most critical business applications. I have seen many mission-critical infrastructures built on Kafka in various industries, including banking, telco, insurance, retailing, automotive and many others.

I want to focus on delivery guarantees and correctness as critical characteristics of messaging systems and databases. Transactions are required in many applications to guarantee no data loss and deterministic behavior.

Transaction processing in databases is information processing that is divided into individual, indivisible operations called transactions. Each transaction must succeed or fail as a complete unit; it can never be only partially complete. Many databases with transactional capabilities (including Two-Phase-Commit / XA Transactions) do not scale well and are hard to operate.

Therefore, many distributed systems provide just “at least once semantics”.

Exactly-Once Semantics (EOS) in Kafka

On the contrary, Kafka is a distributed system that provides various delivery guarantees. Different configuration options allow at-least-once, at-most-once and exactly-once semantics (EOS).

Exactly-once semantics is what people compare to database transactions. The idea is similar: you need to guarantee that each produced message is consumed and processed exactly once. Many people argued that this is impossible to implement with Kafka because individual, indivisible operations can fail in a distributed system! In the Kafka world, many people referred to the famous Hacker News discussion “You Cannot Have Exactly-Once Delivery” and similar Twitter conversations.

In mid-2017, the “unbelievable thing” happened: Apache Kafka 0.11 added support for Exactly-Once Semantics (EOS). Note that the name intentionally avoids the term “transaction”: it is not a transaction in the classical sense, because such a transaction is not possible in a distributed system. However, the result is the same: each consumer consumes each produced message exactly once. “Exactly-once Semantics are Possible: Here’s How Kafka Does it” covers the details of the implementation. In short, EOS includes three features (see the producer sketch after this list):

  • Idempotence: Exactly-once in order semantics per partition
  • Transactions: Atomic writes across multiple partitions
  • Exactly-once stream processing in Apache Kafka
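
A hedged sketch of the producer-side API (topic names and the transactional id are placeholders; consumers additionally need isolation.level=read_committed to only see committed data):

// In addition to the usual bootstrap / serializer configuration:
props.put("enable.idempotence", "true");
props.put("transactional.id", "payment-processor-1"); // placeholder, unique per producer instance

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("payments", "customer-42", "debit"));
    producer.send(new ProducerRecord<>("balances", "customer-42", "update"));
    producer.commitTransaction(); // both events become visible atomically
} catch (org.apache.kafka.common.KafkaException e) {
    // For non-fatal errors; fatal errors (e.g. a fenced producer) require closing the producer instead.
    producer.abortTransaction();
}

In Kafka Streams, the same capability is a single configuration: processing.guarantee=exactly_once.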

Matthias J. Sax did a great job explaining EOS at Kafka Summit 2018 in London. Slides and video recording are available for free.

EOS works differently than transactions in databases but provides the same result in the end. It can be turned on by configuration. By the way: the performance penalty compared to at-least-once semantics is not big – typically somewhere between 10 and 25% slower end-to-end processing.

Exactly-Once Semantics in the Kafka Ecosystem (Kafka Connect, Kafka Streams, ksqlDB, non-Java Clients)

EOS is not just part of Kafka core and the related Java / Scala client. Most Kafka components support exactly-once delivery guarantees, including:

  • Some (but not all) Kafka Connect connectors. For example AWS S3 and Elasticsearch.
  • Kafka Streams and ksqlDB to process data exactly once for streaming ETL or in business applications.
  • Non-Java clients. librdkafka – the core foundation of many Kafka clients in various programming languages – added support for EOS recently.

Will Kafka Replace your existing Database?

In general, no! But you should always ask yourself: Do you need another data store in addition to Kafka? Sometimes yes, sometimes no. We discussed the characteristics of a database and when Kafka is sufficient.

Each database has specific features, guarantees and query options. Use MongoDB as document store, Elasticsearch for text search, Oracle or MySQL for traditional relational use cases, or Hadoop for a big data lake to run map/reduce jobs for reports.

This blog post hopefully helps you make the right decision for your next project.

However, hold on: The question is not always “Kafka vs database XYZ”. Often, Kafka and databases are complementary! Let’s discuss this in the following section…

Kafka Connect – Integration between Kafka and other Databases

Apache Kafka includes Kafka Connect: A framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems. Using Kafka Connect you can use existing connector implementations for common data sources and sinks to move data into and out of Kafka:

Kafka Connect to integrate with Source and Sink Database Systems

This includes many connectors to various databases. To query data from a source system, events can either be pulled (e.g. with the JDBC Connector) or pushed via Change Data Capture (CDC, e.g. with the Debezium Connector). Kafka Connect can also write into any sink data store, including various relational, NoSQL and big data infrastructures like Oracle, MongoDB, Hadoop HDFS or AWS S3.
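
To give an impression of how lightweight this is, here is a sketch of a JDBC source connector configuration in properties format (connector name, database URL and column names are assumptions for illustration):

# Placeholder connector name and database coordinates
name=jdbc-source-orders
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://localhost:5432/shop
# Poll the table incrementally based on a strictly increasing id column
mode=incrementing
incrementing.column.name=order_id
# Resulting topic name: prefix + table name
topic.prefix=db-

A Debezium CDC connector is configured similarly, but pushes changes from the database’s transaction log instead of polling.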

Confluent Hub is a great resource to find the available source and sink connectors for Kafka Connect. The hub includes open source connectors and commercial offerings from different vendors. To learn more about Kafka Connect, you might want to check out Robin Moffatt’s blog posts. He has implemented and explained tens of fantastic examples leveraging Kafka Connect to integrate with many different source and sink databases.

Apache Kafka is a Database with ACID Guarantees, but Complementary to other Databases!

Apache Kafka is a database. It provides ACID guarantees and is used in hundreds of companies for mission-critical deployments. However, in many cases Kafka does not compete with other databases. Kafka is an event streaming platform for messaging, storage, processing and integration at scale in real time with zero downtime and zero data loss.

With these characteristics, Kafka is often used as central streaming integration layer. Materialized views can be built by other databases for their specific use cases like real time time series analytics, near real time ingestion into a text search infrastructure, or long term storage in a data lake.

In summary, if you get asked if Kafka can replace a database, then here are different answers:

  • Kafka can store data forever in a durable and highly available manner, providing ACID guarantees
  • Different options to query historical data are available in Kafka
  • Kafka-native add-ons like ksqlDB or Tiered Storage make Kafka more powerful than ever before for data processing and event-based long-term storage
  • Stateful applications can be built leveraging Kafka clients (microservices, business applications) without the need for another external database
  • Not a replacement for existing databases like MySQL, MongoDB, Elasticsearch or Hadoop
  • Other databases and Kafka complement each other; the right solution has to be selected for a problem; often purpose-built materialized views are created and updated in real time from the central event-based infrastructure
  • Different options are available for bi-directional pull and push based integration between Kafka and databases to complement each other

Please let me know what you think and connect on LinkedIn… Stay informed about new blog posts by subscribing to my newsletter.

The post Can Apache Kafka Replace a Database? appeared first on Kai Waehner.

Machine Learning Trends of 2018 combined with the Apache Kafka Ecosystem https://www.kai-waehner.de/blog/2018/02/13/machine-learning-trends-of-2018-combined-with-apache-kafka-ecosystem/ Tue, 13 Feb 2018 15:55:59 +0000 http://www.kai-waehner.de/blog/?p=1234 At OOP 2018 conference in Munich, I presented an updated version of my talk about building scalable, mission-critical…

The post Machine Learning Trends of 2018 combined with the Apache Kafka Ecosystem appeared first on Kai Waehner.

At OOP 2018 conference in Munich, I presented an updated version of my talk about building scalable, mission-critical microservices with the Apache Kafka ecosystem and Deep Learning frameworks like TensorFlow, DeepLearning4J or H2O. I want to share the updated slide deck and discuss the newest trends I incorporated into the talk.

The main story is the same as in my Confluent blog post about the Apache Kafka ecosystem and Machine Learning: How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka. But this time I focused more on Deep Learning / Neural Networks. I also discussed a few innovations in the Apache Kafka ecosystem and ML trends of the last months: KSQL, ONNX, AutoML, and ML platforms from Uber and Netflix. Let’s take a look at these interesting topics and how they relate to each other.

KSQL – A Streaming SQL Language on top of Apache Kafka

“KSQL is a streaming SQL engine for Apache Kafka. KSQL lowers the entry bar to the world of stream processing, providing a simple and completely interactive SQL interface for processing data in Kafka. You no longer need to write code in a programming language such as Java or Python! KSQL is open-source (Apache 2.0 licensed), distributed, scalable, reliable, and real-time. It supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more.” More details here: “Introducing KSQL: Open Source Streaming SQL for Apache Kafka“.

You can write SQL-like queries to deploy scalable, mission-critical stream processing apps (which leverage Kafka Streams under the hood). Definitely a highlight in the Kafka open source ecosystem.

KSQL and Machine Learning

KSQL is built on top of Kafka Streams and therefore allows you to build scalable, mission-critical services. Machine Learning models, including Neural Networks, can easily be embedded by building a User Defined Function (UDF). These days I am preparing an example where I apply a Neural Network – more precisely an Autoencoder – to sensor analytics, detecting anomalies (i.e. critical values in health checks) of hospital guests in real time to send an alert to the doctor.
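
As a sketch of how such an embedding could look in Java (the model class AutoencoderModel and its methods are hypothetical placeholders; the annotations follow the KSQL UDF API):

import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;

@UdfDescription(name = "anomaly_score", description = "Autoencoder-based anomaly detection")
public class AnomalyScoreUdf {

    // Hypothetical wrapper around the trained autoencoder; in a real project you would
    // load a TensorFlow, DeepLearning4J or H2O model here once at startup.
    private final AutoencoderModel model = AutoencoderModel.load("autoencoder.bin");

    @Udf(description = "Returns the reconstruction error for one sensor value")
    public double anomalyScore(final double sensorValue) {
        return model.reconstructionError(sensorValue);
    }
}

The function can then be called from a continuous KSQL query like any built-in function.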

Let’s now talk about some interesting new developments in the machine learning ecosystem.

ONNX – An Open Format to Represent Deep Learning Models

“ONNX is an open format to represent deep learning models. With ONNX, AI developers can more easily move models between state-of-the-art tools and choose the combination that is best for them.”

This sounds similar to PMML (Predictive Model Markup Language, see “What is PMML” on KDnuggets) and PFA (Portable Format for Analytics), two other standards to define and share machine learning models. However, ONNX differs in a few aspects:

  • focuses on Deep Learning
  • has several huge tech companies (AWS, Microsoft, Facebook) and hardware vendors (AMD, NVidia, Intel, Qualcomm, etc.) behind it
  • already supports many leading open source frameworks (including TensorFlow, PyTorch, MXNet)

ONNX is already GA in version 1.0 and production ready (as announced by Amazon, Microsoft and Facebook in December 2017). There is also a nice getting started guide for different frameworks.

ONNX and the Apache Kafka ecosystem

Unfortunately, ONNX has no Java support yet. Therefore, there is no support yet for embedding it natively into Kafka Streams’ Java API – only via workarounds like doing a REST call or embedding a JNI binding. But I am very sure this is only a matter of time, because the Java platform is so important in many enterprises for deploying mission-critical applications.

Right now, you could use Kafka’s Java API or other Kafka Clients. Confluent provides official clients for several programming languages, e.g. for Python or Go, which both are perfect for Machine Learning applications, too.

Automated Machine Learning (aka AutoML)

“Automated machine learning (AutoML) is a hot new field with the goal of making it easy to select different machine learning algorithms, their parameter settings, and the pre-processing methods that improve their ability to detect complex patterns in big data” as stated here.

With AutoML, you can build analytic models without any knowledge about Machine Learning. AutoML implementations use different implementations of Decision Trees, Clustering, Neural Networks, etc. to build and compare different models out-of-the-box. You just upload or connect your historical data set and click a few buttons to start the process. Maybe not perfect for every use case, but you can easily improve many existing processes without a rare and expensive data scientist.

DataRobot or Google’s AutoML are two of many well-known cloud offerings in this space. H2O’s AutoML is integrated into its open source ML framework, but they also offer a nice UI-focused commercial product called “Driverless AI“. I highly recommend spending 30 minutes with any AutoML tool. It is really fascinating to see how AI tools develop these days.

AutoML and the Apache Kafka ecosystem

Most AutoML tools offer deployment of their models. You can access the analytic models e.g. via a REST interface – not a perfect solution for a scalable, event-driven architecture like Kafka. The good news: many AutoML solutions also allow exporting their generated models so that you can deploy them into your application. For example, AutoML in H2O’s open source framework is just one of many options. You just call another operation in the programming language of your choice (R, Python, Scala, Web UI):

# x: names of the predictor columns, y: name of the response column
# train / test: H2O frames with historical data, loaded e.g. via h2o.importFile()
aml <- h2o.automl(x = x, y = y,
                  training_frame = train,
                  leaderboard_frame = test,
                  max_runtime_secs = 30)

This is similar to what you would do to build a Linear Regression, Decision Tree or Neural Network. The result is generated Java code which you can easily embed into your Kafka Streams microservice or any other Kafka application. AutoML enables you to build and deploy highly scalable machine learning without deep knowledge of ML.
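
A hedged sketch of that embedding, using H2O’s generated model classes inside a Kafka Streams topology (the generated class name GeneratedAutoMLModel and the feature name are placeholders; “builder” is a StreamsBuilder as in any Kafka Streams application):

import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.BinomialModelPrediction;

// GeneratedAutoMLModel stands for the Java class that H2O generates for the AutoML leader model
EasyPredictModelWrapper model = new EasyPredictModelWrapper(new GeneratedAutoMLModel());

builder.<String, String>stream("sensor-events")
        .mapValues(value -> {
            try {
                RowData row = new RowData();
                row.put("sensor_value", value); // placeholder feature name
                BinomialModelPrediction prediction = model.predictBinomial(row);
                return prediction.label; // e.g. "ok" vs. "anomaly"
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        })
        .to("scored-events");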

ML Platforms: Uber’s Michelangelo; Netflix’ Meson

Tech giants are typically some years ahead of “traditional enterprises”. They already built years ago what you build today or tomorrow. ML platforms are no different: writing the ML source code to train an analytic model is just a very small part of a real-world ML infrastructure. You need to think about the whole development process. The following picture shows the “Hidden Technical Debt in Machine Learning Systems“:

Hidden Technical Debt in Machine Learning Systems

You will probably build several analytic models with different technologies. Not everything will be built in your Spark or Flink cluster or in a single cloud infrastructure. You might run TensorFlow on some big, expensive GPU in the public cloud to build powerful neural networks. Or use H2O to build some small, but very efficient and performant decision trees which do inference in a few microseconds… ML has many use cases.

That’s why many tech giants have built their own ML platforms, like Uber’s Michelangelo or Netflix’s Meson. These ML platforms allow them to build and monitor powerful, scalable analytic models, but also to stay flexible and choose the right ML technology for each use case.

Apache Kafka ecosystem for ML Platforms

One of the reasons why Apache Kafka is so successful is the huge adoption by many tech giants. Almost all great Silicon Valley companies like LinkedIn, Netflix, Uber, eBay, “you-name-it” blog and speak about their usage of Kafka as the event-driven central nervous system for their mission-critical applications. Many focus on the distributed streaming platform for messaging, but we also see more and more adoption of add-ons like Kafka Connect, Kafka Streams, REST Proxy, Schema Registry or KSQL.

If you look at the above picture again, then think about Kafka: Isn’t it a perfect fit for an ML platform? Training, monitoring, deployment, inference, configuration, A/B testing, etc. etc. etc. That’s probably why Uber, Netflix and many others already use Kafka as a central component in their ML infrastructure.

Apache Kafka Ecosystem for Machine Learning

And again, you are not forced to use just one specific technology. One of the great design concepts of Kafka is that you can re-process data again and again from its distributed commit log. This means you can either build different models with one technology as Kafka sink (let’s say Apache Flink or Spark), or connect different technologies like scikit-learn for local testing, TensorFlow running on Google Cloud GPUs for powerful deep learning, an on premise installation of H2O nodes for AutoML, and some other Kafka Streams ML apps deployed in Docker containers or Kubernetes. All of these ML applications consume the data in parallel, at their own pace and as often as they need to.

Here is a great example of how to automate training and deployment of a scalable ML microservice with Kafka and Kafka Streams. No need to add another big data cluster. That’s one of the key differences between using Kafka Streams or KSQL for your ML applications and using other stream processing frameworks.

Apache Kafka and Deep Learning – Slide Deck from OOP

Finally, after all these discussions about the Apache Kafka ecosystem and new trends in Machine Learning / Deep Learning, here are my updated slides from my talk at OOP 2018 conference:

I have also built a few examples using Apache Kafka, Kafka Streams and different open source ML frameworks like H2O, TensorFlow and DeepLearning4j (DL4J). The GitHub project shows how easy it is to deploy analytic models to a highly scalable, fault-tolerant, mission-critical Kafka microservice. A KSQL demo will also come soon.

Please share your feedback. Do you already use Kafka in the Machine Learning space? What components in addition to Kafka core do you use? Feel free to contact me to discuss this in more detail.

The post Machine Learning Trends of 2018 combined with the Apache Kafka Ecosystem appeared first on Kai Waehner.

Kafka Streams + H2O.ai + TensorFlow (Video Recording / Live Demo) https://www.kai-waehner.de/blog/2017/09/07/apache-kafka-streams-h2o-tensorflow-live-demo-video-recording/ Thu, 07 Sep 2017 06:22:07 +0000 http://www.kai-waehner.de/blog/?p=1191 I do a lot of presentations these days at meetups and conferences about how to leverage Apache Kafka and Kafka Streams to apply analytic models (built with H2O, TensorFlow, DeepLearning4J and other frameworks) to scalable, mission-critical environments. As many attendees have asked me, I created a video recording about this talk (focusing on live demos).

The post Kafka Streams + H2O.ai + TensorFlow (Video Recording / Live Demo) appeared first on Kai Waehner.

I do a lot of presentations these days at meetups and conferences with one focus: How to leverage Apache Kafka and Kafka Streams to apply analytic models (built with H2O, TensorFlow, DeepLearning4J and other frameworks) to scalable, mission-critical environments. As many attendees have asked me, I created a video recording about this talk (focusing on live demos).

I also see many Confluent customers talking about their challenges in deploying analytic models to a mission-critical, scalable production environment. This is a completely different story than “just” developing a great, accurate model in R or Python. Educating them on how Apache Kafka and Kafka Streams can help here is a key task for me these days… 🙂 This leads to many very interesting and disruptive use cases! I will blog more about this in the next months. For example, I will show an example where I train a neural network with the concept of autoencoders to build analytic models. Some use cases for this: anomaly detection for predictive maintenance, fraud, customer churn, etc. These neural networks will then be deployed and monitored with Apache Kafka and its Streams API.

Abstract of the Session: Apache Kafka + Machine Learning

Intelligent real time applications are a game changer in any industry. This session explains how companies from different industries build intelligent real time applications. The first part of this session explains how to build analytic models with R, Python or Scala. No matter which language you favor, you can leverage open source machine learning / deep learning frameworks like TensorFlow, DeepLearning4J or H2O.ai. The second part discusses the deployment of these built analytic models to your own applications or microservices. Here you leverage the Apache Kafka cluster and Kafka’s Streams API instead of setting up a new, complex stream processing cluster. The session focuses on live demos. It also teaches lessons learned for executing analytic models in a highly scalable, mission-critical and performant way.

Apache Kafka, Kafka Streams and Machine Learning

Key Takeaways for the Audience

  • Insights are hidden in Historical Data, e.g. on Big Data Platforms such as Hadoop
  • Machine Learning and Deep Learning find these Insights by building Analytics Models
  • Stream Processing uses these Models (without Redeveloping) to act in Real Time
  • See different open source frameworks for Machine Learning and Stream Processing like TensorFlow, DeepLearning4J or H2O.ai to build analytic models
  • Apache Kafka, its Streams API and Machine Learning can be combined to build, apply and monitor analytic models
  • Understand how to leverage Kafka Streams to use analytic models in your own streaming microservices. Learn best practices for building and deploying analytic models in real time leveraging the open source Apache Kafka Streams platform

Code Examples on Github (Java, Kafka Streams, TensorFlow, H2O.ai)

You can find the Java code examples and analytic models for H2O and TensorFlow in my Github project.

Just clone the repository and run “mvn clean package”. Then take a look at the unit tests to understand how to apply analytic models with Apache Kafka’s Streams API.

Video Recording: Apache Kafka + Kafka Streams + H2O.ai + TensorFlow

Finally, here we go with the video recording:

As always, I appreciate any comments (feedback, questions, criticism)… Have fun watching the video.

You can also see a corresponding slide deck:

The post Kafka Streams + H2O.ai + TensorFlow (Video Recording / Live Demo) appeared first on Kai Waehner.

Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing https://www.kai-waehner.de/blog/2017/05/01/why-apache-kafka-confluent-open-source-messaging-integration-stream-processing/ Mon, 01 May 2017 13:24:08 +0000 http://www.kai-waehner.de/blog/?p=1166 After three great years at TIBCO Software, I move back to open source and join Confluent, the company behind the open source project Apache Kafka to build mission-critical, scalable infrastructures for messaging, integration and stream processsing. In this blog post, I want to share why I see the future for middleware and big data analytics in open source technologies, why I really like Confluent, what I will focus on in the next months, and why I am so excited about this next step in my career.

The post Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing appeared first on Kai Waehner.

After three great years at TIBCO Software, I move back to open source and join Confluent, the company behind the open source project Apache Kafka, to build mission-critical, scalable infrastructures for messaging, integration and streaming analytics. Confluent is a Silicon Valley startup, still at the beginning of its journey, with a business that grew 700% in 2016 and is expected to grow significantly again in 2017.

In this blog post, I want to share why I see the future for middleware and big data analytics in open source technologies, why I really like Confluent, what I will focus on in the next months, and why I am so excited about this next step in my career.

Let’s talk briefly about three cutting-edge topics which are becoming important in every industry and in small, medium and large enterprises these days:

  • Big Data Analytics: Find insights and patterns in big historical datasets.
  • Real Time Streaming Platforms: Apply insights and patterns to new events in real time (e.g. for fraud detection, cross selling or predictive maintenance).
  • Machine Learning (and its hot subtopic Deep Learning): Leverage algorithms and let machines learn by themselves without programming everything explicitly.

These three topics disrupt every industry these days. Note that Machine Learning is related to the other two topics, even though today we often see it as an independent topic; many data science projects actually use only very small datasets (often less than a gigabyte of input data). Fortunately, all three topics will be combined more and more to add additional business value.

Some industries are just at the beginning of their journey of disruption and digital transformation (like banks, telcos, insurance companies), while others have already realized some changes and innovation (e.g. retailers, airlines). In addition to the above topics, some other cutting-edge success stories emerge in a few industries, like Internet of Things (IoT) in manufacturing or Blockchain in banking.

With all these business trends on the market, we also see a key technology trend for all these topics: The adoption of open source projects.

Key Technology Trend: Adoption of “Real” Open Source Projects

When I say “open source”, I mean some specific projects. I do not talk about very new, immature projects, but frameworks which have been deployed successfully in production for many years and are used by many different developers and organizations. For example, Confluent’s Docker images like the Kafka REST Proxy or Kafka Schema Registry have each been downloaded over 100,000 times already.

A “real”, successful middleware or analytics open source project has the following characteristics:

  • Openness: Available for free and really open under a permissive license, i.e. you can use it in production and scale it out without any need to purchase a license or subscription (of course, there can be commercial, proprietary add-ons – but they need to be on top of the project, and not change the license for the used open source project under the hood)
  • Maturity: Used in business-relevant or even mission critical environments for at least one year, typically much longer
  • Adoption: Various vendors and enterprises support a project, either by contributing (source code, documentation, add-ons, tools, commercial support) or realizing projects
  • Flexibility: Deployment on any infrastructure, including on premise, public cloud, hybrid. Support for various application environments (Windows, Linux, Virtual Machine, Docker, Serverless, etc.), APIs for several programming languages (Java, .Net, Go, etc.)
  • Integration: Independent and loosely coupled, but also highly integrated (via connectors, plugins, etc.) to other relevant open source and commercial components

After defining key characteristics for successful open source projects, let’s take a look some frameworks with huge momentum.

Cutting Edge Open Source Technologies: Apache Hadoop, Apache Spark, Apache Kafka

I defined three key trends above which are relevant for any industry and many (open source and proprietary) software vendors. There is a clear trend towards some open source projects as de facto standards for new projects:

  • Big Data Analytics: Apache Hadoop (and its zoo of sub projects like Hive, Pig, Impala, HBase, etc.) and Apache Spark (which is often separated from Hadoop in the meantime) to store, process and analyze huge historical datasets
  • Real Time Streaming Platforms: Apache Kafka – not just for highly scalable messaging, but also for integration and streaming analytics. Platforms either use Kafka Streams to build stream processing applications / microservices or an “external” framework like Apache Flink, Apex, Storm or Heron.
  • Machine Learning: No clear “winner” here (and that is a good thing in my opinion, as the field is so multifaceted). Many great frameworks are available – for example, R, Python and Scala offer various great implementations of Machine Learning algorithms, and specific frameworks like Caffe, Torch, TensorFlow or MXNet focus on Deep Learning and Artificial Neural Networks.

On top of these frameworks, various vendors build open source or proprietary tooling and offer commercial support. Think about the key Hadoop / Spark vendors: Hortonworks, Cloudera, MapR and others, or KNIME, RapidMiner or H2O.ai as specialized open source tools for machine learning in a visual coding environment.

Of course, there are many other great open source frameworks not mentioned here but also relevant on the market, for example RabbitMQ and ActiveMQ for messaging or Apache Camel for integration. In addition, new “best practice stacks” are emerging, like the SMACK Stack which combines Spark, Mesos, Akka, and Kafka.

I am so excited about Apache Kafka and Confluent, because Kafka is already used in every industry and in many small and big enterprises. Apache Kafka production deployments accelerated in 2016, and it is now used by one-third of the Fortune 500. And this is just the beginning. Apache Kafka is not an all-rounder to solve all problems, but it is awesome in the things it is built for – as the huge and growing number of users, contributors and production deployments prove. It is highly integrated with many other frameworks and tools. Therefore, I will not just focus on Apache Kafka and Confluent in my new job, but also on many other technologies, as discussed later.

Let’s next think about the relation of Apache Kafka and Confluent to proprietary software.

Open Source vs. Proprietary Software – A Never-ending War?

The trend is moving towards open source technologies these days. This is not a question, but a fact. I have not seen a single customer in the last years who does not have projects and huge investments around Hadoop, Spark and Kafka. In the meantime, these have moved from labs and first small projects to enterprise de facto standards, company-wide deployments and strategies. Closed legacy software is replaced more and more – to reduce costs, but even more importantly to be more flexible, up-to-date and innovative.

What does this mean for proprietary software?

For some topics, I do not see much traction or demand for proprietary solutions. Two very relevant examples where closed software ruled the last ~20 years: Messaging solutions and analytics platforms. Open frameworks seem to replace them almost everywhere in any industry and enterprise in new projects (for good reasons).

New messaging projects are based on standards like MQTT or frameworks like Apache Kafka. Analytics is done with R and Python in conjunction with frameworks like scikit-learn or TensorFlow. These options leverage flexible, but also very mature implementations. Often, there is no need for a lot of proprietary, inflexible, complex or expensive tooling on top of it. Even IBM, the mega vendor, focuses on offerings around open source in the meantime.

IBM invests millions into Apache Spark for big data analytics and puts over 3500 researchers and developers to work on Spark-related projects instead of just pushing its various own proprietary analytics offerings like IBM SPSS. If you search for “IBM Messaging”, you find offerings based on the AMQP standard and cloud services based on Apache Kafka instead of new proprietary solutions!

I think IBM is a great example of how the software market is changing these days. Confluent (just in the beginning of its journey) or Cloudera (just went public with a successful IPO) are great examples for Silicon Valley startups going the same way.

In my opinion, good proprietary software leverages open source technologies like Apache Kafka, Apache Hadoop or Apache Spark. Vendors should integrate natively with these technologies. Some opportunities for vendors:

  • Visual coding (for developers) to generate code (e.g. graphical components, which generate framework-compliant source code for a Hadoop or Spark job)
  • Visual tooling (e.g. for business analysts or data scientists), like a Visual Analytics tools which connect to big data stores to find insights and patterns
  • Infrastructure tooling for management and monitoring of open source infrastructure (e.g. tools to monitor and scale Apache Kafka messaging infrastructures)
  • Add-ons which are natively integrated with open source frameworks (e.g. instead of requiring own proprietary runtime and messaging infrastructures, an integration vendor should deploy its integration microservices on open cloud-native platforms like Kubernetes or Cloud Foundry and leverage open messaging infrastructures like Apache Kafka instead of pushing towards proprietary solutions)

Open Source and Proprietary Software Complement Each Other

Therefore, I do not see this as a discussion of “open source software” versus “proprietary software”. Both complement each other very well. You should always ask the following questions before making a decision for open source software, proprietary software or a combination of both:

  • What is the added value of the proprietary solution? Does it increase the complexity and the footprint of runtime and tooling?
  • What is the expected total cost of ownership of a project (TCO), i.e. license / subscription + project lifecycle costs?
  • How to realize the project? Who will support you, how do you find experts for delivery (typically consulting companies)? Integration and analytics projects are often huge with big investments, so how to make sure you can deliver (implementation, test, deployment, operations, change management, etc.)? Can we get commercial support for our mission-critical deployments (24/7)?
  • How to use this project with the rest of the enterprise architecture? Do you deploy everything on the same cluster? Do we want to set some (open) de facto standards in different departments and business units?
  • Do we want to use the same technology in new environments without limited 30-day trials or annoying sales cycles, and maybe even deploy it to production without any license / subscription costs?
  • Do we want to add changes, enhancements or fixes to the platform by ourselves (e.g. if we need a specific feature immediately, not in six months)?

Let’s think about a specific example with these questions in mind:

Example: Do you still need an Enterprise Service Bus (ESB) in a World of Cloud-Native Microservices?

I faced this question a lot in the last 24 months, especially with the trend moving to flexible, agile microservices (not just for building business applications, but also for integration and analytics middleware). See my article “Do Microservices Spell the End of the ESB?”. The short summary: You still need middleware (call it ESB, integration platform, iPaaS, or something else), though the requirements are different today. This is true for open source and proprietary ESB products. However, something else has changed in the last 24 months…

In the past, open source and proprietary middleware vendors offered an ESB as integration platform. The offering included a runtime (to guarantee scalable, mission-critical operation of integration services) and a development environment (for visual coding and faster time-to-market). The last two years changed how we think about building new applications. We now (want to) build microservices, which run in a Docker container. The scalable, mission-critical runtime is managed by a cloud-native platform like Kubernetes or Cloud Foundry. Ideally, DevOps automates builds, tests and deployment. These days, ESB vendors adopt these technologies. So far, so good.

Now, you can deploy your middleware microservice to these cloud-native platforms like any other Java, .NET or Go microservice. However, this completely changes the added value of the ESB platform. Now, its benefit is just about visual coding, and the key argument is time-to-market (you should always question and double-check whether this is a valid argument). The runtime is not really part of the ESB anymore. In most scenarios, this completely changes the view on deciding whether you still need an ESB. Ask yourself about time-to-market, license / subscription costs and TCO again! Also think about the (typically) increased resource requirements (memory, CPU, disk) of tooling-built services (typically some kind of big .ear file), compared to plain source code (e.g. Java’s .jar files).

Is the ESB still enough added value, or should you just use a cloud-native platform and a messaging infrastructure? Is it easier to write a few lines of source code instead of setting up the ESB tooling, where you often struggle importing your REST / Swagger or WSDL files and many other configurations before you can actually begin to leverage the visual coding environment? In very big, long-running projects, you might finally end up with a win. Though, in an agile, ever-changing world with fail-fast ideology, various different technologies and frameworks, and automated CI/CD stacks, you might only add new complexity and not get the expected value anymore, unlike in the old world where the ESB was also the mission-critical runtime. The same is true for other middleware components like stream processing, analytics platforms, etc.

ESB Alternative: Apache Kafka and Confluent Open Source Platform

As an alternative, you could use for example Kafka Connect, a very lightweight integration library based on Apache Kafka to build large-scale, low-latency data pipelines. The beauty of Kafka Connect is that all the challenges around scalability, fail-over and performance are handled by the Kafka infrastructure. You just use the Kafka Connect connectors to realize very powerful integrations with a few lines of configuration for sources and sinks. If you use Apache Kafka as messaging infrastructure anyway, you need to find some very compelling reasons to use an ESB on top instead of the much less complex and much less heavyweight Kafka Connect library.

I think this section explained why I think that open source and proprietary software are complementary in many use cases. But it does not make sense to add heavyweight, resource intensive and complex tooling in every scenario. Open source is not free (you still need to spend time and efforts on the project and probably pay money for some kind of commercial support), but often open source without too much additional proprietary tooling is the better choice regarding complexity, time-to-market and TCO. You can find endless success stories about open source projects; not just from tech giants like Google, Facebook or LinkedIn, but also from many small and big traditional enterprises. Of course, any project can fail. Though, in projects with frameworks like Hadoop, Spark or Kafka, it is probably not due to issues with technology…

Confluent + “Proprietary Mega Vendors”

On the other side, I really look forward to working together with “mostly proprietary” mega vendors like TIBCO, SAP, SAS and others where it makes sense to solve customer problems and build innovative, powerful, mission-critical solutions. For example, TIBCO StreamBase is a great tool if you want to develop stream processing applications via a visual editor instead of writing source code. Actually, it does not even compete with Kafka Streams, because the latter is a library which you embed into any other microservice or application (deployed anywhere, e.g. in a Java application, Docker container, Apache Mesos, “you-choose-it”), while StreamBase (like its competitors Software AG Apama and IBM Streams, and the open source frameworks like Apache Flink, Storm, Apache Apex, Heron, etc.) focuses on building streaming applications on its own cluster (typically deployed either on Hadoop’s YARN or on a proprietary cluster). Therefore, you could use StreamBase and its Kafka connector to build streaming applications leveraging Kafka as messaging infrastructure.

Even Confluent itself offers some proprietary tooling, like Confluent Control Center for management and monitoring, on top of open source Apache Kafka and the open source Confluent Platform. This is the typical business model behind successful open source vendors like Red Hat: embrace open source projects, and offer 24/7 support and proprietary add-ons for enterprise-grade deployments. Thus, not everything is or needs to be open source. That’s absolutely fine.

So, after all these discussions about business and technology trends, open source and proprietary software, what will I do in my new job at Confluent?

Confluent Platform in Conjunction with Analytics, Machine Learning, Blockchain, Enterprise Middleware

Of course, I will focus a lot on Apache Kafka and Confluent Platform in my new job, where I will work mainly with prospects and customers in EMEA, but I will also continue as Technology Evangelist with publications and conference talks. Let’s get into a little bit more detail here…

My focus was never on being a deep-level technology expert or fixing issues in production environments (though I do hands-on coding, of course). Many other technology experts are far better at very technical discussions. As in the past, I will focus on designing mission-critical enterprise architectures, discussing innovative real-world use cases with prospects and customers, and evaluating cutting-edge technologies in conjunction with the Confluent Platform. Here are some of my ideas for the next months:

  • Apache Kafka + Cloud Native Platforms = Highly Scalable Streaming Microservices (leveraging platforms like Kubernetes, Cloud Foundry, Apache Mesos)
  • Highly Scalable Machine Learning and Deep Learning Microservices with Apache Kafka Streams (using TensorFlow, MXNet, H2O.ai, and other open source frameworks; see the sketch after this list)
  • Online Machine Learning (i.e. updating analytics models in real time for every new event) leveraging Apache Kafka as infrastructure backbone
  • Open Source IoT Platform for Core and Edge Streaming Analytics (leveraging Apache Kafka, Apache Edgent, and other IoT frameworks)
  • Comparison of Open Source Stream Processing Frameworks (differences between Kafka Streams and other modern frameworks like Heron, Apache Flink, Apache Apex, Spark Streaming, Edgent, NiFi, StreamSets, etc.)
  • Apache Kafka / Confluent and other Enterprise Middleware (discuss when to combine proprietary middleware with Apache Kafka, and when to simply “just” use Confluent’s open source platform)
  • Middleware and Streaming Analytics as Key for Success in Blockchain Projects
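As promised in the list above, here is a minimal, hypothetical sketch of the machine learning idea: an analytic model trained offline is embedded into a Kafka Streams application and scores every event in real time. The Model interface, the dummy scoring logic and the topic names are placeholders, not the API of any particular framework:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ModelScoringApp {

    // Placeholder for a model trained offline (e.g. with TensorFlow or H2O.ai)
    // and loaded once at startup; a real project would wrap the Java API
    // of the exported model here.
    interface Model {
        double score(String event);
    }

    public static void main(String[] args) {
        Model model = event -> (event.length() % 100) / 100.0; // dummy stand-in

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "model-scoring-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("transactions")
               // Score every event in real time with the embedded model
               .mapValues(tx -> tx + ",score=" + model.score(tx))
               .to("scored-transactions");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The same pattern also enables the online machine learning topic from the list: instead of loading the model once, the application can consume model updates from a dedicated Kafka topic and swap the model at runtime.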

You can expect publications, conference and meetup talks, and webinars about these and other topics in 2017, just like in the last years. Please let me know what you are most interested in and what other topics you would like to hear about!

I am also really looking forward to working with partners on scalable, mission-critical enterprise architectures and powerful solutions around Apache Kafka and Confluent Platform. This will include combined solutions and use cases with both open source and proprietary software vendors.

Last but not least, and most importantly: I am excited to work with prospects and customers from traditional enterprises, tech giants and startups to realize innovative use cases and success stories that add business value.

As you can see, I am really excited to start at Confluent in May 2017. I will visit Confluent’s London and Palo Alto offices in my first weeks and will also be at Kafka Summit in New York. Thus, an exciting first month at this awesome Silicon Valley startup.

Please let me know your feedback. Do you see the same trends? Do you share my opinions or disagree? I look forward to discussing all these topics with customers, partners and anybody else in upcoming meetings, workshops, publications, webinars, meetups and conference talks.

The post Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing appeared first on Kai Waehner.

Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Services https://www.kai-waehner.de/blog/2016/11/15/streaming-analytics-comparison-open-source-frameworks-products-cloud-services/ Tue, 15 Nov 2016 13:20:31 +0000 http://www.kai-waehner.de/blog/?p=1119 Streaming Analytics Comparison of Open Source Frameworks, Products and Cloud Services. Includes Apache Storm, Flink, Spark, TIBCO, IBM, AWS Kinesis, Striim, Zoomdata, ...

In November 2016, I am at Big Data Spain in Madrid for the first time. A great conference with many awesome speakers and sessions about very hot topics such as Apache Hadoop, Apache Spark, Stream Processing / Streaming Analytics and Machine Learning. If you are interested in big data, then this conference is for you! My two talks:

  • “How to Apply Machine Learning to Real Time Processing” (see slides and video recording from a similar conference talk).
  • “Comparison of Streaming Analytics Options” (the reason for this blog post; an updated version of my talk from JavaOne 2015).

Here I want to share the slides and a video recording of the latter…

Abstract: Comparison of Stream Processing Options

This session discusses the technical concepts of stream processing / streaming analytics and how they relate to big data, mobile, cloud and the Internet of Things. Different use cases, such as predictive fault management or fraud detection, are used to show and compare alternative frameworks and products for stream processing and streaming analytics.

The session focuses on comparing

  • different open source frameworks such as Apache Apex, Apache Flink or Apache Spark Streaming
  • engines from software vendors such as IBM InfoSphere Streams or TIBCO StreamBase
  • cloud offerings such as AWS Kinesis
  • real-time streaming UIs such as Striim, Zoomdata or TIBCO Live Datamart

Live demos will give the audience a good feeling for how to use these frameworks and tools.


The session will also discuss how stream processing is related to Apache Hadoop frameworks (such as MapReduce, Hive, Pig or Impala) and machine learning (such as R, Spark ML or H2O.ai).

Slides – Alternatives for Streaming Analytics

The following slide deck is a more extensive version of the talk at Big Data Spain (as the conference talks were only 30 minutes):

The video recording walks you through the above slide deck:

As always, I appreciate any comments, questions or other feedback.

The post Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Services appeared first on Kai Waehner.

Machine Learning Applied to Microservices https://www.kai-waehner.de/blog/2016/10/20/machine-learning-applied-microservices/ Thu, 20 Oct 2016 19:32:22 +0000 http://www.kai-waehner.de/blog/?p=1102 Build intelligent Microservices by applying Machine Learning and Advanced Analytics. Leverage Apache Hadoop / Spark with Visual Analytics and Stream Processing.

I had two sessions at the O’Reilly Software Architecture Conference in London in October 2016. It was the first #OReillySACon in London: a very well-organized conference with plenty of great speakers and sessions. I can really recommend this conference and its siblings in other cities such as San Francisco or New York if you want to learn about good software architectures and new concepts, best practices and technologies. Some of the hot topics this year, besides microservices, are DevOps, serverless architectures, and big data analytics and machine learning.

Intelligent Microservices by Leveraging Big Data Analytics

One of the two sessions was about how to apply machine learning and big data analytics to real-time event processing. I also included the relation to microservices, i.e. how to leverage microservice concepts such as 12 Factor Apps, containers (e.g. Docker), cloud platforms (e.g. Kubernetes, Cloud Foundry) and DevOps to build agile, intelligent microservices.

Abstract: How to Apply Machine Learning to Microservices

The digital transformation is moving forward due to Mobile, Cloud and the Internet of Things. Disruptive business models leverage Big Data Analytics and Machine Learning.

“Big Data” is currently a huge buzzword. Large amounts of historical data are stored in Hadoop or other platforms. Business intelligence tools and statistical computing are used to draw new knowledge and find patterns in this data, for example for promotions, cross-selling or fraud detection. The key challenge is how to integrate these findings from historical data into new transactions in real time to make customers happy, increase revenue or prevent fraud. “Fast Data” via stream processing is the solution to embed patterns – which were obtained from analyzing historical data – into future transactions in real time.

This session uses several real-world success stories to explain the concepts behind stream processing and its relation to Hadoop and other big data platforms. It discusses how patterns and statistical models from R, Spark MLlib, H2O and other technologies can be integrated into real-time processing. The session also points out why a microservices architecture helps solve the agility requirements of these kinds of projects.

A brief overview of available open source frameworks and commercial products shows possible options for implementing stream processing, such as Apache Storm, Apache Flink, Spark Streaming, IBM InfoSphere Streams or TIBCO StreamBase.

A live demo shows how to implement stream processing, how to integrate machine learning, and how human operations can be enabled in addition to the automated processing via a web UI and push events.

How to Build Intelligent Microservices – Slide Deck from O’Reilly Software Architecture Conference

The post Machine Learning Applied to Microservices appeared first on Kai Waehner.

Comparison Of Log Analytics for Distributed Microservices – Open Source Frameworks, SaaS and Enterprise Products https://www.kai-waehner.de/blog/2016/10/20/comparison-log-analytics-distributed-microservices-open-source-frameworks-saas-enterprise-products/ Thu, 20 Oct 2016 18:57:51 +0000 http://www.kai-waehner.de/blog/?p=1097 Log Analytics is the right approach to monitor Distributed Microservices. Comparison of Open Source, SaaS and Enterprise Products. Plus the relation to big data components such as Apache Hadoop / Spark.

I had two sessions at the O’Reilly Software Architecture Conference in London in October 2016. It was the first #OReillySACon in London: a very well-organized conference with plenty of great speakers and sessions. I can really recommend this conference and its siblings in other cities such as San Francisco or New York if you want to learn about good software architectures and new concepts, best practices and technologies. Some of the hot topics this year, besides microservices, are DevOps, serverless architectures and big data analytics.

I want to share the slides of my session about comparing open source frameworks, SaaS and enterprise products regarding log analytics for distributed microservices:

Monitoring Distributed Microservices with Log Analytics

IT systems and applications generate more and more distributed machine data due to millions of mobile devices, the Internet of Things, social network users and other emerging technologies. However, organizations experience challenges when monitoring and managing their IT systems and technology infrastructure. They struggle with distributed Microservices and Cloud architectures, custom application monitoring and debugging, network and server monitoring / troubleshooting, security analysis, compliance standards, and more.

This session discusses how to solve the challenges of monitoring and analyzing terabytes and more of distributed machine data to leverage the “digital business”. The main part of the session compares different open source frameworks and SaaS cloud solutions for Log Management and operational intelligence, such as Graylog, the “ELK stack”, Papertrail, Splunk or TIBCO LogLogic. A live demo will demonstrate how to monitor and analyze distributed Microservices and sensor data from the “Internet of Things”.

The session also explains how the discussed solutions differ from other big data components such as Apache Hadoop, data warehouses or machine learning and their application to real-time processing, and how all of these can complement each other in a big data architecture.

The session concludes with an outlook on the new, advanced concept of IT Operations Analytics (ITOA).

Slide Deck from O’Reilly Software Architecture Conference

The post Comparison Of Log Analytics for Distributed Microservices – Open Source Frameworks, SaaS and Enterprise Products appeared first on Kai Waehner.

Streaming Analytics with Analytic Models (R, Spark MLlib, H20, PMML) https://www.kai-waehner.de/blog/2016/03/03/streaming-analytics-with-analytic-models-r-spark-mllib-h20-pmml/ Thu, 03 Mar 2016 15:51:01 +0000 http://www.kai-waehner.de/blog/?p=1019 Closed Big Data Loop: 1) Finding Insights with R, H20, Apache Spark MLlib, PMML and TIBCO Spotfire. 2) Putting Analytic Models into Action via Event Processing and Streaming Analytics.

In March 2016, I had a talk at Voxxed Zurich about “How to Apply Machine Learning and Big Data Analytics to Real Time Processing”.

[Photo: Kai Waehner speaking at Voxxed Zurich]

Finding Insights with R, H2O, Apache Spark MLlib, PMML and TIBCO Spotfire

“Big Data” is currently a huge buzzword. Large amounts of historical data are stored in Hadoop or other platforms. Business intelligence tools and statistical computing are used to draw new knowledge and find patterns in this data, for example for promotions, cross-selling or fraud detection. The key challenge is how to integrate these findings from historical data into new transactions in real time to make customers happy, increase revenue or prevent fraud.

Putting Analytic Models into Action via Event Processing and Streaming Analytics

“Fast Data” via stream processing is the solution to embed patterns – which were obtained from analyzing historical data – into future transactions in real time. The following slide deck uses several real-world success stories to explain the concepts behind stream processing and its relation to Apache Hadoop and other big data platforms. I discuss how patterns and statistical models from R, Apache Spark MLlib, H2O and other technologies can be integrated into real-time processing using open source stream processing frameworks (such as Apache Storm, Spark Streaming or Flink) or products (such as IBM InfoSphere Streams or TIBCO StreamBase). A live demo showed the complete development lifecycle, combining analytics with TIBCO Spotfire, machine learning via R and stream processing via TIBCO StreamBase and TIBCO Live Datamart.
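As a small illustration of the PMML part: a model trained in R or Spark MLlib can be exported as a PMML file and then evaluated inside any JVM-based stream processing application. The following sketch assumes the open source JPMML-Evaluator library (builder API as in its 1.5.x releases); the model file name, the input value and the surrounding plumbing are hypothetical:

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

import org.dmg.pmml.FieldName;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;

public class PmmlScoring {
    public static void main(String[] args) throws Exception {
        // Load a model that was trained offline (e.g. in R) and exported as PMML
        Evaluator evaluator = new LoadingModelEvaluatorBuilder()
                .load(new File("fraud-model.pmml")) // hypothetical model file
                .build();
        evaluator.verify();

        // Map one incoming event to the model's input fields; in a stream
        // processing application, the raw values would come from the event payload
        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : evaluator.getInputFields()) {
            Object rawValue = 42.0; // placeholder value
            arguments.put(inputField.getName(), inputField.prepare(rawValue));
        }

        // Score the event; the result map contains the target and output fields
        Map<FieldName, ?> results = evaluator.evaluate(arguments);
        System.out.println(results);
    }
}
```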

Slide Deck from Voxxed Zurich 2016

Here is the slide deck:

The post Streaming Analytics with Analytic Models (R, Spark MLlib, H20, PMML) appeared first on Kai Waehner.
