StreamBase Archives - Kai Waehner

The Past, Present and Future of Stream Processing

Kai Waehner — Wed, 20 Mar 2024 06:47:53 +0000

Stream processing has existed for decades. However, it really kicks off in the 2020s thanks to the adoption of open source frameworks like Apache Kafka and Flink. Fully managed cloud services make it easy to configure and deploy stream processing in a cloud-native way; even without the need to write any code. This blog post explores the past, present and future of stream processing. The discussion includes various technologies and cloud services, low code/ no code trade-offs, outlooks into the support of machine learning and GenAI, streaming databases, and the integration between data streaming and data lakes with Apache Iceberg.

In December 2023, the research company proved that data streaming is a new software category and not just yet another integration or data platform. Forrester published “The Forrester Wave: Streaming Data Platforms, Q4 2023“. Get free access to the report here. The leaders are Microsoft, Google and Confluent, followed by Oracle, Amazon, Cloudera, and a few others. A great time to review the past, present and future of stream processing as a key component in a data streaming architecture.

The Past of Stream Processing: The Move from Batch to Real-Time

The evolution of stream processing began as industries sought more timely insights from their data. Initially, batch processing was the norm. Data was collected over a period, stored, and processed at intervals. This method, while effective for historical analysis, proved inefficient for real-time decision-making.

In parallel to batch processing, message queues were created to provide real-time communication for transactional data. Message Brokers like IBM MQ or TIBCO EMS were a common way to decouple applications. Applications send data and receive data in an event-driven architecture without worrying about if the recipient was ready, how to handle backpressure, etc. The stream processing journey began.

Stream Processing is a Journey Over Decades…

… and we are still in a very early stage at most enterprises. Here is an excellent timeline of TimePlus about the journey of stream processing open source frameworks, proprietary platforms and SaaS cloud services:

Source: TimePlus

The stream processing journey started decades ago with research and first purpose-built proprietary products for specific use cases like stock trading.

Open source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Today, most enterprises at least get started understanding the value of stream processing for analytical and transactional use cases across industries. The cloud is a fundamental change as you can start streaming and processing data with a button click leveraging fully managed SaaS and simple UIs (if you don’t want to operate infrastructure or write low-level source code).

TIBCO StreamBase, Software AG Apama, IBM Streams

The advent of message queue technologies like IBM MQ and TIBCO EMS moved many critical applications to real-time message brokers. Real-time messaging enabled the consumption of data in real-time to store it in a database, mainframe, or application for further processing.

However, only true stream processing capabilities included in tools like TIBCO StreamBase, Software AG Apama or IBM (InfoSphere) Streams marked a significant shift towards real-time data processing. These products enabling businesses to react to information as it arrived by processing and correlating the data in motion.

Visual coding in tools like StreamBase or Apama represents an innovative approach to developing stream processing solutions. These tools provide a graphical interface that allows developers and analysts to design, build, and test applications by connecting various components and logic blocks visually, rather than writing code manually. Under the hood, the code generation worked with a Streaming SQL language.

Here is a screenshot of the TIBCO StreamBase IDE for visual drag & drop of streaming pipelines:

TIBCO StreamBase IDE

Some drawbacks of these early stream processing solutions include high cost, vendor lock-in, no flexibility regarding tools or APIs, and missing communities. These platforms are monolithic and were built far before cloud-native elasticity and scalability became a requirement for most RFIs and RFPs when evaluating vendors.

Open Source Event Streaming with Apache Kafka

The actual significant change for stream processing came with introducing Apache Kafka, a distributed streaming platform that allowed for high-throughput, fault-tolerant handling of real-time data feeds. Kafka, alongside other technologies like Apache Flink, revolutionized the landscape by providing the tools necessary to move from batch to real-time stream processing seamlessly.

The adoption of open source technologies changed all industries. Openness, flexibility, and community-driven development enabled easier influence on the features and faster innovation.

Over 100.000 organizations use Apache Kafka. The massive adoption came from a unique combination of capabilities: Messaging, storage, data integration, stream processing, all in one scalable and distributed infrastructure.

Various open source stream processing engines emerged. Kafka Streams was added to the Apache Kafka project. Other examples include Apache Storm, Spark Streaming, and Apache Flink.

The Present of Stream Processing: Architectural Evolution and Mass Adoption

The fundamental change to processing data in motion has enabled the development of data products and data mesh. Decentralizing data ownership and management with domain-driven design and technology-independent microservices promotes a more collaborative and flexible approach to data architecture. Each business unit can choose its own technology, API, cloud service, and communication paradigm like real-time, batch, or request-response.

From Lambda Architecture to Kappa Architecture

Today, stream processing is at the heart of modern data architecture, thanks in part to the emergence of the Kappa architecture. This model simplifies the traditional Lambda Architecture by using a single stream processing system to handle both real-time and historical data analysis, reducing complexity and increasing efficiency.

Lambda architecture with separate real-time and batch layers:

Kappa architecture with a single pipeline for real-time and batch processing:

For more details about the pros and cons of Kappa vs. Lambda, check out my “Kappa Architecture is Mainstream Replacing Lambda“. It explores case studies from Uber, Twitter, Disney and Shopify.

Kafka Streams and Apache Flink Become Mainstream

Apache Kafka has become synonymous with building scalable and fault-tolerant streaming data pipelines. Kafka facilitating true decoupling of domains and applications makes it integral to microservices and data mesh architectures.

Plenty of stream processing frameworks, products, and cloud services emerged in the past years. This includes open source frameworks like Kafka Streams, Apache Storm, Samza, Flume, Apex, Flink, Spark Streaming, and cloud services like Amazon Kinesis, Google Cloud Dataflow, Azure Stream Analytics. The “Data Streaming Landscape 2024” gives an overview of relevant technologies and vendors.

Apache Flink seems to become the de facto standard for many enterprises (and vendors). The adoption is like Kafka four years ago:

Source: Confluent

This does not mean other frameworks and solutions are bad. For instance, Kafka Streams is complementary to Apache Flink, as it suites different use cases.

No matter what technology enterprises choose, the mass adoption of stream processing is in progress right now. This includes modernizing existing batch processes AND building innovative new business models that only work in real time. As a concrete example, think about ride-hailing apps like Uber, Lyft, FREENOW, Grab. They are only possible because events are processed and correlated in real-time. Otherwise, everyone would still prefer a traditional taxi.

Stateless and Stateful Stream Processing

In data streaming, stateless and stateful stream processing are two approaches that define how data is handled and processed over time:

The choice between stateless and stateful processing depends on the specific requirements of the application, including the nature of the data, the complexity of the processing needed, and the performance and scalability requirements.

Stateless Stream Processing

Stateless Stream Processing refers to the handling of each data point or event independently from others. In this model, the processing of an event does not depend on the outcomes of previous events or require keeping track of the state between events. Each event is processed based on the information it contains, without the need for historical context or future data points. This approach is simpler and can be highly efficient for tasks that don’t require knowledge beyond the current event being processed.

The implementation could be a stream processor (like Kafka Streams or Flink), functionality in a connector (like Kafka Connect Single Message Transforms), or a Web Assembly (WASM) embedded into a streaming platform.

Stateful Stream Processing

Stateful Stream Processing involves keeping track of information (state) across multiple events to perform computations that depend on data beyond the current event. This model allows for more complex operations like windowing (aggregating events over a specific time frame), joining streams of data based on keys, and tracking sequences of events or patterns over time. Stateful processing is essential for scenarios where the outcome depends on accumulated knowledge or trends derived from a series of data points, not just on a single input.

The implementation is much more complex and challenging than stateless stream processing. A dedicated stream processing implementation is required. Dedicated distributed engines (like Apache Flink) handle stateful computionations, memory usage and scalability better than Kafka-native tools like Kafka Streams or KSQL (because the latter are bound to Kafka Topics).

Low Code, No Code, AND A Lot of Code!

No-code and low-code tools are software platforms that enable users to develop applications quickly and with minimal coding knowledge. These tools provide graphical user interfaces with drag-and-drop capabilities, allowing users to assemble and configure applications visually rather than writing extensive lines of code.

Common features and benefits of visual coding:

Rapid Development: Both types of platforms significantly reduce development time, enabling faster delivery of applications.
User-Friendly Interface: The graphical interface and drag-and-drop functionality make it easy for users to design, build, and iterate on applications.
Cost Reduction: By enabling quicker development with fewer resources, these platforms can lower the cost of software creation and maintenance.
Accessibility: They make application development accessible to a broader range of people, reducing the dependency on skilled developers for every project.

So far, the theory.

Disadvantages of Visual Coding Tools

Actually, StreamBase, Apama, et al., had great visual coding offerings. However, no-code / low-code tools have many drawbacks and disadvantages, too:

Limited Customization and Flexibility: While these platforms can speed up development for standard applications, they may lack the flexibility needed for highly customized solutions. Developers might find it challenging to implement specific functionalities that aren’t supported out of the box.
Dependency on Vendors: Using no-code/low-code platforms often means relying on third-party vendors for the platform’s stability, updates, and security. This dependency can lead to potential issues if the vendor cannot maintain the platform or goes out of business. And often the platform team is the bottleneck for implementing new business or integration logic.
Performance Concerns: Applications built with no-code/low-code platforms may not be as optimized as those developed with traditional coding, potentially leading to lower performance or inefficiencies, especially for complex applications.
Scalability Issues: As businesses grow, applications might need to scale up to support increased loads. No-code/low-code platforms might not always support this level of scalability or might require significant workarounds, affecting performance and user experience.
Over-reliance on Non-Technical Users: While empowering citizen developers is a key advantage of these platforms, it can also lead to governance challenges. Without proper oversight, non-technical users might create inefficient workflows or data structures, leading to technical debt and maintenance issues.
Cost Over Time: Initially, no-code/low-code platforms can reduce development costs. However, as applications grow and evolve, the ongoing subscription costs or fees for additional features and scalability can become significant.

Flexibility is King: Stream Processing for Everyone!

Microservices, domain-driven design, data mesh… All these modern design approaches taught us to provide flexible enterprise architectures. Each business unit and persona should be able to choose its own technology, API, or SaaS. And no matter if you do real-time, near real-time, batch or request response communication.

Apache Kafka provides the true decoupling out-of-the-box. Therefore, low-code or now-code tools is an option. However, a data scientist, data engineer, software developer or citizen integrator can choose its own technology for stream processing.

The past, present and future of stream processing shows different frameworks, visual coding tools and even applied generative AI. One solution does NOT replace but complement the other alternatives:

The Future of Stream Processing: Serverless SaaS, GenAI and Streaming Databases

Stream processing is set to grow exponentially in the future, thanks to advancements in cloud computing, SaaS, and AI. Let’s explore the future of stream processing and look at the expected short, mid and long-term developments.

SHORT TERM: Fully Managed Serverless SaaS for Stream Processing

The cloud’s scalability and flexibility offer an ideal environment for stream processing applications, reducing the overhead and resources required for on-premise solutions. As SaaS models continue to evolve, stream processing capabilities will become more accessible to a broader range of businesses, democratizing real-time data analytics.

For instance, look at the serverless Flink Actions in Confluent Cloud. You can configure and deploy stream processing for use cases like deduplication or masking without any code:

Source: Confluent

MID TERM: Automated Tooling and the Help of GenAI

Integrating AI and machine learning with stream processing will enable more sophisticated predictive analytics. This opens new frontiers for automated decision-making and intelligent applications while continuously processing incoming event streams. The full potential of embedding AI into stream processing has to be learned and implemented in the upcoming years.

For instance, automated data profiling is one instance of stream processing that GenAI can support significantly. Software tools analyze and understand the quality, structure, and content of a dataset without manual intervention as the events flow through the data pipeline in real-time. This process typically involves examining the data to identify patterns, anomalies, missing values, and inconsistencies. A perfect fit for stream processing!

Automated data profiling in the stream processor can provide insights into data types, frequency distributions, relationships between columns, and other metadata information crucial for data quality assessment, governance, and preparation for further analysis or processing.

MID TERM: Streaming Storage and Analytics with Apache Iceberg

Apache Iceberg is an open-source table format for huge analytic datasets that provides powerful capabilities in managing large-scale data in data lakes. Its integration with streaming data sources like Apache Kafka and analytics platforms, such as Snowflake, Starburst, Dremio, AWS Athena or Databricks, can significantly enhance data management and analytics workflows.

Integration between Streaming Data from Kafka and Analytics on Databricks or Snowflake using Apache Iceberg

Supporting the Apache Iceberg table format might be a crucial strategic move by streaming and analytics frameworks, vendors and cloud services. Here are some key benefits from the enterprise architecture perspective:

Unified Batch and Stream Processing: Iceberg tables can serve as a bridge between streaming data ingestion from Kafka and doxwnstream analytic processing. By treating streaming data as an extension of a batch-based table, Iceberg enables a seamless transition from real time to batch analytics, allowing organizations to analyze data with minimal latency.
Schema Evolution: Iceberg supports schema evolution without breaking downstream systems. This is useful when dealing with streaming data from Kafka, where the schema might evolve. Consumers can continue reading data using the schema they understand, ensuring compatibility and reducing the need for data pipeline modifications.
Time Travel and Snapshot Isolation: Iceberg’s time travel feature allows analytics on data as it looked at any point in time, providing snapshot isolation for consistent reads. This is crucial for reproducible reporting and debugging, especially when dealing with continuously updating streaming data from Kafka.
Cross-Platform Compatibility: Iceberg provides a unified data layer accessible by different compute engines, including those used by Databricks and Snowflake. This enables organizations to maintain a single copy of their data that is queryable across different platforms, facilitating a multi-tool analytics ecosystem without data silos.

LONG TERM: Transactional + Analytics = Streaming Database?

Streaming databases, like RisingWave or Materialize, are designed to handle real-time data processing and analytics. This offers a way to manage and query data that is continuously generated from sources like IoT devices, online transactions, and application logs. Traditional databases that are optimized for static data stored on disk. Instead, streaming databases are built to process and analyze data in motion. They provide insights almost instantaneously as the data flows through the system.

Streaming databases offer the ability to perform complex queries and analytics on streaming data, further empowering organizations to harness real-time insights.

The ongoing innovation in streaming databases will probably lead to more advanced, efficient, and user-friendly solutions, facilitating broader adoption and more creative applications of stream processing technologies.

Having said this, we are still in the very early stage. It is not clear yet when you really need a streaming database instead of a mature and scalable stream processor like Apache Flink. The future will show us and competition is great for innovation.

The Future of Stream Processing is Open Source and Cloud

The journey from batch to real-time processing has transformed how businesses interact with their data. The continued evolution couples technologies like Apache Kafka, Kafka Streams, and Apache Flink with the growth of cloud computing and SaaS. Stream processing will redefine the future of data analytics and decision-making.

As we look ahead, the future possibilities for stream processing are boundless, promising more agile, intelligent, and real-time insights into the ever-increasing streams of data.

If you want to learn more, listen to the following on-demand webinar about the past, present and future of stream processing with me joined by the two streaming industry veterans Richard Tibbets (founder of StreamBase) and Michael Benjamin (TimePlus). I had the please work with them for a few years at TIBCO where we deployed StreamBase at many Financial Services companies for stock trading and similar use cases:

How does your stream processing journey look like? In which decade did you join? Or are you just learning with the latest open-source frameworks or cloud services? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

The post The Past, Present and Future of Stream Processing appeared first on Kai Waehner.

Apache Kafka vs. Middleware (MQ, ETL, ESB) – Slides + Video

Kai Waehner — Thu, 07 Mar 2019 15:45:15 +0000

Learn the differences between an event-driven streaming platform like Apache Kafka and middleware like Message Queues (MQ), Extract-Transform-Load (ETL) and Enterprise Service Bus (ESB). Including best practices and anti-patterns, but also how these concepts and tools complement each other in an enterprise architecture.

This blog post shares my slide deck and video recording. I discuss the differences between Apache Kafka as Event Streaming Platform and integration middleware. Learn if they are friends, enemies or frenemies.

Problems of Legacy Middleware

Extract-Transform-Load (ETL) is still a widely-used pattern to move data between different systems via batch processing. Due to its challenges in today’s world where real time is the new standard, an Enterprise Service Bus (ESB) is used in many enterprises as integration backbone between any kind of microservice, legacy application or cloud service to move data via SOAP / REST Web Services or other technologies. Stream Processing is often added as its own component in the enterprise architecture for correlation of different events to implement contextual rules and stateful analytics. Using all these components introduces challenges and complexities in development and operations.

Apache Kafka – An Open Source Event Streaming Platform

This session discusses how teams in different industries solve these challenges by building a native event streaming platform from the ground up instead of using ETL and ESB tools in their architecture. This allows to build and deploy independent, mission-critical streaming real time application and microservices. The architecture leverages distributed processing and fault-tolerance with fast failover, no-downtime, rolling deployments and the ability to reprocess events, so you can recalculate output when your code changes. Integration and Stream Processing are still key functionality but can be realized in real time natively instead of using additional ETL, ESB or Stream Processing tools.

A concrete example architecture shows how to build a complete streaming platform leveraging the widely-adopted open source framework Apache Kafka to build a mission-critical, scalable, highly performant streaming platform. Messaging, integration and stream processing are all build on top of the same strong foundation of Kafka; deployed on premise, in the cloud or in hybrid environments. In addition, the open source Confluent projects, based on top of Apache Kafka, adds additional features like a Schema Registry, additional clients for programming languages like Go or C, or many pre-built connectors for various technologies.

Slides: Apache Kafka vs. Integration Middleware

Here is the slide deck:

Video Recording: Kafka vs. MQ / ETL / ESB – Friends, Enemies or Frenemies?

Here is the video recording where I walk you through the above slide deck:

Article: Apache Kafka vs. Enterprise Service Bus (ESB)

I also published a detailed blog post on Confluent blog about this topic in 2018:

Apache Kafka vs. Enterprise Service Bus (ESB)

Talk and Slides from Kafka Summit London 2019

The slides and video recording from Kafka Summit London 2019 (which are similar to above) are also available for free.

Why Apache Kafka instead of Traditional Middleware?

If you don’t want to spend a lot of time on the slides and recording, here is a short summary of the differences of Apache Kafka compared to traditional middleware:

Questions and Discussion…

Please share your thoughts, too!

Does your infrastructure see similar architectures? Do you face similar challenges? Do you like the concepts behind an Event Streaming Platform (aka Apache Kafka)? How do you combine legacy middleware with Kafka? What’s your strategy to integrate the modern and the old (technology) world? Is Kafka part of that architecture?

Please let me know either via a comment or via LinkedIn, Twitter, email, etc. I am curious about other opinions and experiences (and people who disagree with my presentation).

The post Apache Kafka vs. Middleware (MQ, ETL, ESB) – Slides + Video appeared first on Kai Waehner.

Apache Kafka Streams + Machine Learning (Spark, TensorFlow, H2O.ai)

Kai Waehner — Tue, 23 May 2017 17:11:40 +0000

I started at Confluent in May 2017 to work as Technology Evangelist focusing on topics around the open source framework Apache Kafka. I think Machine Learning is one of the hottest buzzwords these days as it can add huge business value in any industry. Therefore, you will see various other posts from me around Apache Kafka (messaging), Kafka Connect (integration), Kafka Streams (stream processing), Confluent’s additional open source add-ons on top of Kafka (Schema Registry, Replicator, Auto Balancer, etc.). I will explain how to leverage all this for machine learning and other big data technologies in real world production scenarios.

Read this, if you wonder why am so excited about moving (back) to open source for messaging, integration and stream processing in the big data world.

In the following blog post, I want to share my first slide deck from a conference talk representing Confluent: A software architecture user group in Leipzig, Germany organized a 2-day event to discuss big data in practice.

Apache Kafka Streams + Machine Learning / Deep Learning

This is the abstract of the slide deck:

Big Data and Machine Learning are key for innovation in many industries today. Large amounts of historical data are stored and analyzed in Hadoop, Spark or other clusters to find patterns and insights, e.g. for predictive maintenance, fraud detection or cross-selling.

This first part of the session explains how to build analytic models with R, Python and Scala leveraging open source machine learning / deep learning frameworks like Apache Spark, TensorFlow or H2O.ai.

The second part discusses how to leverage these built analytic models in your own real time streaming applications or microservices. It explains how to leverage the Apache Kafka cluster and Kafka Streams instead of building an own stream processing cluster. The session focuses on live demos and teaches lessons learned for executing analytic models in a highly scalable and performant way.

The last part explains how Apache Kafka can help to move from a manual build and deployment of analytic models to continuous online model improvement in real time.

Slide Deck: How to Build Analytic Models and Deployment to Real Time Processing

Here is the slide deck:

More blog posts with more details and specific code examples will follow in the next weeks. I will also do a web recording for this slide deck and post it on Youtube.

The post Apache Kafka Streams + Machine Learning (Spark, TensorFlow, H2O.ai) appeared first on Kai Waehner.

Visual Analytics + Open Source Deep Learning Frameworks

Kai Waehner — Mon, 24 Apr 2017 09:25:54 +0000

Deep Learning gets more and more traction. It basically focuses on one section of Machine Learning: Artificial Neural Networks. This article explains why Deep Learning is a game changer in analytics, when to use it, and how Visual Analytics allows business analysts to leverage the analytic models built by a (citizen) data scientist.

What is Deep Learning and Artificial Neural Networks?

Deep Learning is the modern buzzword for artificial neural networks, one of many concepts in machine learning to build analytics models. A neural network works similar to what we know from a human brain: You get non-linear interactions as input and transfer them to output. Neural networks leverage continuous learning and increasing knowledge in computational nodes between input and output. A neural network is a supervised algorithm in most cases, which uses historical data sets to learn parameters to predict outputs of future events, e.g. for cross selling or fraud detection. Unsupervised neural networks can be used to find new patterns and anomalies. In some cases, it makes sense to combine supervised and unsupervised algorithms.

Neural Networks are used in research for many decades and includes various sophisticated concepts like Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and Autoencoder. However, today’s powerful and elastic computing infrastructure in combination with other technologies like graphical processing units (GPU) with thousands of cores allows to do much more powerful computations with a much deeper number of layers. Hence the term “Deep Learning”.

The following picture from TensorFlow Playground shows an easy-to-use environment which includes various test data sets, configuration options and visualizations to learn and understand deep learning and neural networks:

If you want to learn more about the details of Deep Learning and Neural Networks, I recommend the following sources:

“The Anatomy of Deep Learning Frameworks” – an article about the basic concepts and components of neural networks
TensorFlow Playground to play around with neural networks by yourself hands-on without any coding, also available on Github to build your own customized offline playground
“Deep Learning Simplified” video series on Youtube with several short, simple explanations of basic concepts, alternative algorithms and some frameworks like H2O.ai or Tensorflow

While Deep Learning is getting more and more traction, it is not the silver bullet for every scenario.

When (not) to use Deep Learning?

Deep Learning enables many new possibilities which were not possible in “mass production” a few years ago, e.g. image classification, object recognition, speech translation or natural language processing (NLP) in much more sophisticated ways than without Deep Learning. A key benefit is the automated feature engineering, which costs a lot of time and efforts with most other machine learning alternatives.

You can also leverage Deep Learning to make better decisions, increase revenue or reduce risk for existing (“already solved”) problems instead of using other machine learning algorithms. Examples include risk calculation, fraud detection, cross selling and predictive maintenance.

However, note that Deep Learning has a few important drawbacks:

Very expensive, i.e. slow and compute-intensive; training a deep learning model often takes days or weeks, execution also takes more time than most other algorithms.
Hard to interpret: lack of understandability of the result of the analytic model; often a key requirement for legal or compliance regularities
Tends to overfit, and therefore needs regularization

Deep Learning is ideal for complex problems. It can also outperform other algorithms in moderate problems. Deep Learning should not be used for simple problems. Other algorithms like logistic regression or decision trees can solve these problems easier and faster.

Open Source Deep Learning Frameworks

Neural networks are mostly adopted using one of various open source implementations. Various mature deep learning frameworks are available for different programming languages.

The following picture shows an overview of open source deep learning frameworks and evaluates several characteristics:

These frameworks have in common that they are built for data scientists, i.e. personas with experience in programming, statistics, mathematics and machine learning. Note that writing the source code is not a big task. Typically, only a few lines of codes are needed to build an analytic model. This is completely different from other development tasks like building a web application, where you write hundreds or thousands of lines of code. In Deep Learning – and Data Science in general – it is most important to understand the concepts behind the code to build a good analytic model.

Some nice open source tools like KNIME or RapidMiner allow visual coding to speed up development and also encourage citizen data scientists (i.e. people with less experience) to learn the concepts and build deep networks. These tools use own deep learning implementations or other open source libraries like H2O.ai or DeepLearning4j as embedded framework under the hood.

If you do not want to build your own model or leverage existing pre-trained models for common deep learning tasks, you might also take a look at the offerings from the big cloud providers, e.g. AWS Polly for Text-to-Speech translation, Google Vision API for Image Content Analysis, or Microsoft’s Bot Framework to build chat bots. The tech giants have years of experience with analysing text, speech, pictures and videos and offer their experience in sophisticated analytic models as a cloud service; pay-as-you-go. You can also improve these existing models with your own data, e.g. train and improve a generic picture recognition model with pictures of your specific industry or scenario.

Deep Learning in Conjunction with Visual Analytics

No matter if you want to use “just” a framework in your favourite programming language or a visual coding tool: You need to be able to make decisions based on the built neural network. This is where visual analytics comes into play. In short, visual analytics allows any persona to make data-driven decisions instead of listening to gut feeling when analysing complex data sets. See “Using Visual Analytics for Better Decisions – An Online Guide” to understand the key benefits in more detail.

A business analyst does not understand anything about deep learning, but just leverages the integrated analytic model to answer its business questions. The analytic model is applied under the hood when the business analyst changes some parameters, features or data sets. Though, visual analytics should also be used by the (citizen) data scientist to build the neural network. See “How to Avoid the Anti-Pattern in Analytics: Three Keys for Machine Learning” to understand in more details how technical and non-technical people should work together using visual analytics to build neural networks, which help solving business problems. Even some parts of data preparation are best done within visual analytics tooling as describe in “Data Preprocessing vs. Data Wrangling in Machine Learning Projects”.

From a technical perspective, Deep Learning frameworks (and in a similar way any other Machine Learning frameworks, of course) can be integrated into visual analytics tooling in different ways. The following list includes a TIBCO Spotfire example for each alternative:

Embedded Analytics: Implemented directly within the analytics tool (self-implementation or “OEM”); can be used by the business analyst without any knowledge about machine learning (Spotfire: Clustering via some basic, simple configuration of an input and output data plus cluster size)
Native Integration: Connectors to directly access external deep learning clusters. (Spotfire: TERR to use R’s machine learning libraries, KNIME connector to directly integrate with external tooling)
Framework API: Access via a Wrapper API in different programming languages. For example, you could integrate MXNet via R or TensorFlow via Python into your visual analytics tooling. This option can always be used and is appropriate if no native integration or connector is available. (Spotfire: MXNet’s R interface via Spotfire’s TERR Integration for using any R library)
Integrated as Service via an Analytics Server: Connect external deep learning clusters indirectly via a server-side component of the analytics tool; different frameworks can be accessed by the analytics tool in a similar fashion (Spotfire: Statistics Server for external analytics tools like SAS or Matlab)
Cloud Service: Access pre-trained models for common deep learning specific tasks like image recognition, voice recognition or text processing. Not appropriate for very specific, individual business problems of an enterprise. (Spotfire: Call public deep learning services like image recognition, speech translation, or Chat Bot from AWS, Azure, IBM, Google via REST service through Spotfire’s TERR / R interface)

All options have in common that you need to add configuration of some hyper-parameters, i.e. “high level” parameters like problem type, feature selection or regularization level. Depending on the integration option, this can be very technical and low level, or simplified and less flexible using terms which the business analyst understands.

Deep Learning Example: Autoencoder Template for TIBCO Spotfire

Let’s take one specific category of neural networks as example: Autoencoders to find anomalies. Autoencoder is an unsupervised neural network used to replicate the input dataset by restricting the number of hidden layers in a neural network. A reconstruction error is generated upon prediction. The higher the reconstruction error, the higher is the possibility of that data point being an anomaly.

Use Cases for Autoencoders include fighting financial crime, monitoring equipment sensors, healthcare claims fraud, or detecting manufacturing defects. A generic TIBCO Spotfire template is available in the TIBCO Community for free. You can simply add your data set and leverage the template to find anomalies using Autoencoders – without any complex configuration or even coding. Under the hood, the template uses H2O.ai’s deep learning implementation and its R API. It runs in a local instance on the machine where to run Spotfire. You can also take a look at the R code, but this is not needed to use the template at all and therefore optional.

Real World Example: Anomaly Detection for Predictive Maintenance

Let’s use the Autoencoder for a real-world example. In telco, you have to analyse the infrastructure continuously to find problems and issues within the network. Best before the failure happens so that you can fix it before the customer even notices the problem. Take a look at the following picture, which shows historical data of a telco network:

The orange dots are spikes which occur as first indication of a technical problem in the infrastructure. The red dots show a constant failure where mechanics have to replace parts of the network because it does not work anymore.

Autoencoders can be used to detect network issues before they actually happen. TIBCO Spotfire is uses H2O’s Autoencoder in the background to find the anomalies. As discussed before, the source code is relative scarce. Here is the snipped of building the analytic model with H2O’s Deep Learning R API and detecting the anomalies (by finding out the reconstruction error of the Autoencoder):

This analytic model – built by the data scientist – is integrated into TIBCO Spotfire. The business analyst is able to visually analyse the historical data and the insights of the Autoencoder. This combination allows data scientists and business analysts to work together fluently. It was never easier to implement predictive maintenance and create huge business value by reducing risk and costs.

Apply Analytic Models to Real Time Processing with Streaming Analytics

This article focuses on building deep learning models with Data Science Frameworks and Visual Analytics. Key for success in projects is to apply the build analytic model to new events in real time to add business value like increasing revenue, reducing cost or reducing risk.

“How to Apply Machine Learning to Event Processing” describes in more detail how to apply analytic models to real time processing. Or watch the corresponding video recording leveraging TIBCO StreamBase to apply some H2O models in real time. Finally, I can recommend to learn about various streaming analytics frameworks to apply analytic models.

Let’s come back to the Autoencoder use case to realize predictive maintenance in telcos. In TIBCO StreamBase, you can easily apply the built H2O Autoencoder model without any redevelopment via StreamBase’ H2O connector. You just attach the Java code generated by H2O framework, which contains the analytic model and compiles to very performant JVM bytecode:

The most important lesson learned: Think about the execution requirements before building the analytic model. What performance do you need regarding latency? How many events do you need to process per minute, second or millisecond? Do you need to distribute the analytic model to a clusters with many nodes? How often do you have to improve and redeploy the analytic model? You need to answer these questions at the beginning of your project to avoid double efforts and redevelopment of analytic models!

Another important fact is that analytic models do not always need score very fast or frequently (like if you want to score every single event in a sensor analytics use case). In the above telco infrastructure example, these spikes and failures might happen in subsequent days or even weeks. Thus, in many use cases, it is fine to score an analytic model once an hour or even once a day.

Deep Learning + Visual Analytics + Streaming Analytics = Next Generation Big Data Success Stories

Deep Learning allows to solve many well understood problems like cross selling, fraud detection or predictive maintenance in a more efficient way. In addition, you can solve additional scenarios, which were not possible to solve before, like accurate and efficient object detection or speech-to-text translation.

Visual Analytics is a key component in Deep Learning projects to be successful. It eases the development of deep neural networks by (citizen) data scientists and allows business analysts to leverage these analytic models to find new insights and patterns.

Today, (citizen) data scientists use programming languages like R or Python, deep learning frameworks like Theano, TensorFlow, MXNet or H2O’s Deep Water and a visual analytics tool like TIBCO Spotfire to build deep neural networks. The analytic model is embedded into a view for the business analyst to leverage it without knowing the technology details.

In the future, visual analytics tools might embed neural network features like they already embed other machine learning features like clustering or logistic regression today. This will allow business analysts to leverage Deep Learning without the help of a data scientist and be appropriate for simpler use cases.

However, do not forget that building an analytic model to find insights is just the first part of a project. Deploying it to real time afterwards is as important as second step. Good integration between tooling for finding insights and applying insights to new events can improve time-to-market and model quality in data science projects significantly. The development lifecycle is a continuous closed loop. The analytic model needs to be validated and rebuild in certain sequences.

The post Visual Analytics + Open Source Deep Learning Frameworks appeared first on Kai Waehner.

Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Services

Kai Waehner — Tue, 15 Nov 2016 13:20:31 +0000

In November 2016, I am at Big Data Spain in Madrid for the first time. A great conference with many awesome speakers and sessions about very hot topics such as Apache Hadoop, Spark Spark, Streaming Processing / Streaming Analytics and Machine Learning. If you are interested in big data, then this conference is for you! My two talks:

“How to Apply Machine Learning to Real Time Processing” (see slides and video recording from a similar conference talk).
“Comparison of Streaming Analytics Options” (the reason for this blog post; an updated version of my talk from JavaOne 2015)

Here I wanna share the slides and a video recording of the latter one…

Abstract: Comparison of Stream Processing Options

This session discusses the technical concepts of stream processing / streaming analytics and how it is related to big data, mobile, cloud and internet of things. Different use cases such as predictive fault management or fraud detection are used to show and compare alternative frameworks and products for stream processing and streaming analytics.

The focus of the session lies on comparing

different open source frameworks such as Apache Apex, Apache Flink or Apache Spark Streaming
engines from software vendors such as IBM InfoSphere Streams, TIBCO StreamBase
cloud offerings such as AWS Kinesis.
real time streaming UIs such as Striim, Zoomdata or TIBCO Live Datamart. Live demos will give the audience a good feeling about how to use these frameworks and tools.

The session will also discuss how stream processing is related to Apache Hadoop frameworks (such as MapReduce, Hive, Pig or Impala) and machine learning (such as R, Spark ML or H2O.ai).

Slides – Alternatives for Streaming Analytics

The following slide deck is a more extensive version of the talk at Big Data Spain (as the conference talks were only 30 minutes):

Video Recording: Apache Storm, Flink, Apex, Spark, StreamBase, Striim, et al

The video recording walks you through the above slide deck:

As always, I appreciate any comments, questions or other feedback.

The post Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Services appeared first on Kai Waehner.

Comparison of Open Source IoT Integration Frameworks

Kai Waehner — Thu, 03 Nov 2016 11:20:09 +0000

In November 2016, I attended Devoxx conference in Casablanca. Around 1500 developers participated. A great event with many awesome speakers and sessions. Hot topics this year besides Java: Open Source Frameworks, Microservices (of course!), Internet of Things (including IoT Integration), Blockchain, Serverless Architectures.

I had three talks:

How to Apply Machine Learning to Real Time Processing
Comparison of Open Source IoT Integration Frameworks
Tools in Action – Live Demo of Open Source Project Flogo

In addition, I was interviewed by the Voxxed team about Big Data, Machine Learning and Internet of Things. The video will be posted on Voxxed website in the next weeks.

You can find several slides and video recordings about my big data / machine learning topic on my blog / website. Therefore, I will focus on the open source IoT frameworks in this post.

Open Source Integration Frameworks for the Internet of Things

Internet of Things (IoT) and edge integration are getting more important than ever before due to the massively growing number of connected devices year by year. In this talk, I showed open source frameworks built to develop very lightweight microservices. These can be deployed on small devices or in cloud native containers / serverless architectures with very low resource consumption to wire together all different kinds of hardware devices, APIs and online services.

The focus of this session was to discuss open source process engines such as Eclipse Kura (in conjunction with Apache Camel), Node-RED or Flogo. These offer a framework plus zero-code environment with Web IDE for building and deploying IoT integration and data processing directly onto IoT gateways and connected devices. For that, they leverage IoT standards such as MQTT, WebSockets or CoaP, but also other interfaces such as Twitter feeds or REST services.

The session also discussed the relation to other components in a IoT architecture including:

Open source dataflow pipelines (like Apache NiFi / MiNiFy, StreamSets or Cask Hydrator),
Open source stream processing frameworks (such as Apache Storm, Flink, Spark Streaming or Samza)
Cloud IoT platforms (such as AWS IoT, Cloud IoT Cloud Platform or IBM’s open source serverless framework OpenWhisk).

Slide Deck: Apache Nifi vs. StreamSets vs. Eclipse Kura vs. Node-RED vs. Flogo vs. Others

Here is the slide deck about different alternatives for IoT integration:

Video Recording: Integration Frameworks for the Internet of Things

And here is the video recording where I walk through the above slide deck and also show live demos of Node-Red and Flogo:

As always, I appreciate any comments or feedback…

The post Comparison of Open Source IoT Integration Frameworks appeared first on Kai Waehner.

Open Source Project Flogo – Overview, Architecture and Live Demo

Kai Waehner — Thu, 03 Nov 2016 11:09:22 +0000

In October 2016, the open source IoT integration framework Flogo was published as first developer preview. This blog post is intended to give a first overview about Flogo. You can either browse through the slide deck or watch the videos.

What is Project Flogo?

In short, Flogo is an ultra-lightweight integration framework powered by Go programming language. It is open source under the permissive BSD license and easily extendable for your own use cases. Flogo is used to develop IoT edge apps or cloud-native / serverless microservices. Therefore, it is complementary to other integration solutions and IoT cloud platforms.

Some key characteristics:

Ultra-light footprint (powered by Golang) for edge devices with zero dependency model, very low disk and memory footprint, and very fast startup time
Can be run on a variety of platforms (edge device, edge gateway, on premise, cloud, container)
Connectivity to IoT technologies (MQTT, CoaP, REST, …)
Highly optimized for unreliable IoT environments
Intended to be used by developers / integration specialists / citizen integrators either by writing source code or leveraging the Web UI for visual coding, testing and debugging
Includes some innovating features like a web-native step-back debugger to interactively design / debug your process, simulate sensor events, and change data / configuration without restarting the complete process

Overview, Architecture and Use Cases

The following slide deck shows an overview, architecture and use cases for Flogo:

You can also watch the following 45min video where I walk you through these slides and also show some live demos and source code:

Flogo Live Demo and Source Code

If you just want to see the live demo, watch the following 15min video:

Any feedback or questions are highly appreciated. Please use the Community Q&A for to ask whatever you want to know.

The post Open Source Project Flogo – Overview, Architecture and Live Demo appeared first on Kai Waehner.

Machine Learning Applied to Microservices

Kai Waehner — Thu, 20 Oct 2016 19:32:22 +0000

I had two sessions at O’Reilly Software Architecture Conference in London in October 2016. It is the first #OReillySACon in London. A very good organized conference with plenty of great speakers and sessions. I can really recommend this conference and its siblings in other cities such as San Francisco or New York if you want to learn about good software architectures and new concepts, best practices and technologies. Some of the hot topics this year besides microservices are DevOps, serverless architectures and big data analytics respectively machine learning.

Intelligent Microservices by Leveraging Big Data Analytics

One of the two sessions was about how to apply machine learning and big data analytics to real time event processing. I also included the relation to microservices, i.e. how to leverage microservice concepts such as 12 Factor Apps, Containers (e.g. Docker), Cloud Platforms (e.g. Kubernetes, Cloud Foundry), or DevOps to build agile, intelligent microservices.

Abstract: How to Apply Machine Learning to Microservices

The digital transformation is going forward due to Mobile, Cloud and Internet of Things. Disrupting business models leverage Big Data Analytics and Machine Learning.

“Big Data” is currently a big hype. Large amounts of historical data are stored in Hadoop or other platforms. Business Intelligence tools and statistical computing are used to draw new knowledge and to find patterns from this data, for example for promotions, cross-selling or fraud detection. The key challenge is how these findings can be integrated from historical data into new transactions in real time to make customers happy, increase revenue or prevent fraud. “Fast Data” via stream processing is the solution to embed patterns – which were obtained from analyzing historical data – into future transactions in real-time.

This session uses several real world success stories to explain the concepts behind stream processing and its relation to Hadoop and other big data platforms. It discusses how patterns and statistical models of R, Spark MLlib, H2O, and other technologies can be integrated into real-time processing by using several different real world case studies. The session also points out why a Microservices architecture helps solving the agile requirements for these kind of projects.

A brief overview of available open source frameworks and commercial products shows possible options for the implementation of stream processing, such as Apache Storm, Apache Flink, Spark Streaming, IBM InfoSphere Streams, or TIBCO StreamBase.

A live demo shows how to implement stream processing, how to integrate machine learning, and how human operations can be enabled in addition to the automatic processing via a Web UI and push events.

How to Build Intelligent Microservices – Slide Deck from O’Reilly Software Architecture Conference

The post Machine Learning Applied to Microservices appeared first on Kai Waehner.

Comparison Of Log Analytics for Distributed Microservices – Open Source Frameworks, SaaS and Enterprise Products

Kai Waehner — Thu, 20 Oct 2016 18:57:51 +0000

I want to share the slide of my session about comparing open source frameworks, SaaS and Enterprise products regarding log analytics for distributed microservices:

Monitoring Distributed Microservices with Log Analytics

IT systems and applications generate more and more distributed machine data due to millions of mobile devices, Internet of Things, social network users, and other new emerging technologies. However, organizations experience challenges when monitoring and managing their IT systems and technology infrastructure. They struggle with distributed Microservices and Cloud architectures, custom application monitoring and debugging, network and server monitoring / troubleshooting, security analysis, compliance standards, and others.

This session discusses how to solve the challenges of monitoring and analyzing Terabytes and more of different distributed machine data to leverage the “digital business”. The main part of the session compares different open source frameworks and SaaS cloud solutions for Log Management and operational intelligence, such as Graylog , the “ELK stack”, Papertrail, Splunk or TIBCO LogLogic). A live demo will demonstrate how to monitor and analyze distributed Microservices and sensor data from the “Internet of Things”.

The session also explains the distinction of the discussed solutions to other big data components such as Apache Hadoop, Data Warehouse or Machine Learning and its application to real time processing, and how they can complement each other in a big data architecture.

The session concludes with an outlook to the new, advanced concept of IT Operations Analytics (ITOA).

Slide Deck from O’Reilly Software Architecture Conference

The post Comparison Of Log Analytics for Distributed Microservices – Open Source Frameworks, SaaS and Enterprise Products appeared first on Kai Waehner.

Streaming Analytics with Analytic Models (R, Spark MLlib, H20, PMML)

Kai Waehner — Thu, 03 Mar 2016 15:51:01 +0000

In March 2016, I had a talk at Voxxed Zurich about “How to Apply Machine Learning and Big Data Analytics to Real Time Processing”.

Finding Insights with R, H20, Apache Spark MLlib, PMML and TIBCO Spotfire

“Big Data” is currently a big hype. Large amounts of historical data are stored in Hadoop or other platforms. Business Intelligence tools and statistical computing are used to draw new knowledge and to find patterns from this data, for example for promotions, cross-selling or fraud detection. The key challenge is how these findings can be integrated from historical data into new transactions in real time to make customers happy, increase revenue or prevent fraud.

Putting Analytic Models into Action via Event Processing and Streaming Analytics

“Fast Data” via stream processing is the solution to embed patterns – which were obtained from analyzing historical data – into future transactions in real-time. The following slide deck uses several real world success stories to explain the concepts behind stream processing and its relation to Apache Hadoop and other big data platforms. I discuss how patterns and statistical models of R, Apache Spark MLlib, H20, and other technologies can be integrated into real-time processing using open source stream processing frameworks (such as Apache Storm, Spark Streaming or Flink) or products (such as IBM InfoSphere Streams or TIBCO StreamBase). A live demo showed the complete development lifecycle combining analytics with TIBCO Spotfire, machine learning via R and stream processing via TIBCO StreamBase and TIBCO Live Datamart.

Slide Deck from Voxxed Zurich 2016

Here is the slide deck:

The post Streaming Analytics with Analytic Models (R, Spark MLlib, H20, PMML) appeared first on Kai Waehner.