Hadoop Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/hadoop/
Technology Evangelist - Big Data Analytics - Middleware - Apache Kafka

Apache Kafka + Kafka Streams + Mesos / DCOS = Scalable Microservices
https://www.kai-waehner.de/blog/2017/10/27/mesos-kafka-streams-scalable-microservices/
Fri, 27 Oct 2017 08:05:16 +0000

Apache Kafka + Kafka Streams + Apache Mesos = Highly Scalable Microservices. Mission-critical deployments via DC/OS and Confluent on premise or public cloud.

I gave a talk at MesosCon Europe 2017 in Prague about building highly scalable, mission-critical microservices with Apache Kafka, Kafka Streams and Apache Mesos / DC/OS. I would like to share the slides and a video recording of the live demo.

Abstract

Microservices bring many benefits, such as agile, flexible development and deployment of business logic. However, a microservice architecture also creates new challenges: increased communication between distributed instances, the need for orchestration, new fail-over requirements, and resiliency design patterns.

This session discusses how to build a highly scalable, performant, mission-critical microservice infrastructure with Apache Kafka, Kafka Streams and Apache Mesos / DC/OS. Apache Kafka brokers serve as a powerful, scalable, distributed message backbone. Kafka’s Streams API allows you to embed stream processing directly into any microservice or business application, without the need for a dedicated streaming cluster. Apache Mesos can be used as scalable infrastructure for both the Kafka brokers and the applications using the Kafka Streams API, leveraging the benefits of a cloud-native platform such as service discovery, health checks and fail-over management.
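To make the "no dedicated streaming cluster" point concrete, here is a minimal Kafka Streams sketch (not part of the original talk; the application id, topic names and bootstrap address are placeholders). The processing logic runs inside the microservice's own JVM, and only the Kafka brokers form a separate cluster:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentAlertService {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-alert-service");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker.kafka.example:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments");                  // placeholder topic
        // The stream processing logic is embedded in this microservice's JVM
        payments.filter((key, value) -> value != null && value.contains("ALERT"))
                .to("payment-alerts");                                                   // placeholder topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Scaling out simply means starting more instances of this application; Kafka's consumer group protocol distributes the partitions across them.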

A live demo shows how to develop real-time applications for your core business with Kafka brokers and the Kafka Streams API, and how to deploy, manage and scale them on a DC/OS cluster using different deployment options.

Key takeaways

  • Successful microservice architectures require a highly scalable messaging infrastructure combined with a cloud-native platform which manages the distributed microservices
  • Apache Kafka offers a highly scalable, mission-critical infrastructure for distributed messaging and integration
  • Kafka’s Streams API allows you to embed stream processing into any external application or microservice
  • Mesos and DC/OS can manage both Kafka brokers and the applications using the Kafka Streams API, providing built-in benefits such as health checks, service discovery and fail-over control for microservices
  • See a live demo which combines the Apache Kafka streaming platform and DC/OS

Architecture: Kafka Brokers + Kafka Streams on Kubernetes and DC/OS

The following picture shows the architecture. You can either run Kafka brokers and Kafka Streams microservices natively on DC/OS via Marathon (a deployment sketch follows below), or leverage Kubernetes as the Docker container orchestration tool (which Mesosphere now also supports).

 

Architecture - Kafka Streams, Kubernetes and Mesos / DCOS
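For the native option, a Kafka Streams microservice is just another Marathon app. The following is a rough sketch of what such an app definition can look like; the image name, resources, environment variable and health-check endpoint are placeholders, and the exact fields may vary with your Marathon / DC/OS version:

```json
{
  "id": "/kafka-streams-payment-alerts",
  "instances": 3,
  "cpus": 1,
  "mem": 1024,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "my-registry/payment-alert-service:1.0" }
  },
  "env": { "BOOTSTRAP_SERVERS": "broker.kafka.example:9092" },
  "healthChecks": [
    { "protocol": "COMMAND", "command": { "value": "curl -f http://localhost:8080/health" } }
  ]
}
```

Scaling the microservice then just means increasing the number of instances; Kafka's consumer group rebalancing spreads the work across them.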

Slides

Here are the slides from my talk:

Live Demo

The following video shows the live demo. It is built on AWS using Mesosphere’s CloudFormation script to set up a DC/OS cluster in ten minutes.

Here, I deployed both – Kafka brokers and Kafka Streams microservices – directly to DC/OS without leveraging Kubernetes. I expect many people to continue deploying Kafka brokers directly on DC/OS. For microservices, many teams might move to the following stack: Microservice –> Docker –> Kubernetes –> DC/OS.

Do you also use Apache Mesos or DC/OS to run Kafka? Only the brokers, or also Kafka clients (producers, consumers, Streams, Connect, KSQL, etc.)? Or do you prefer another tool like Kubernetes (maybe on top of DC/OS)?

 

Apache Kafka Streams + Machine Learning (Spark, TensorFlow, H2O.ai)
https://www.kai-waehner.de/blog/2017/05/23/apache-kafka-streams-machine-learning-spark-tensorflow-h2o-ai/
Tue, 23 May 2017 17:11:40 +0000

Apache Kafka Streams to build Real Time Streaming Microservices. Apply Machine Learning / Deep Learning using Spark, TensorFlow, H2O.ai, etc. to add AI. Embed Kafka Streams into Java App, Docker, Kubernetes, Mesos, anything else.

I started at Confluent in May 2017 to work as Technology Evangelist focusing on topics around the open source framework Apache Kafka. I think Machine Learning is one of the hottest buzzwords these days, as it can add huge business value in any industry. Therefore, you will see various other posts from me around Apache Kafka (messaging), Kafka Connect (integration), Kafka Streams (stream processing), and Confluent’s additional open source add-ons on top of Kafka (Schema Registry, Replicator, Auto Balancer, etc.). I will explain how to leverage all this for machine learning and other big data technologies in real-world production scenarios.

Read this if you wonder why I am so excited about moving (back) to open source for messaging, integration and stream processing in the big data world.

In this blog post, I want to share my first slide deck from a conference talk representing Confluent: a software architecture user group in Leipzig, Germany, organized a two-day event to discuss big data in practice.

Apache Kafka Streams + Machine Learning / Deep Learning

This is the abstract of the slide deck:

Big Data and Machine Learning are key for innovation in many industries today. Large amounts of historical data are stored and analyzed in Hadoop, Spark or other clusters to find patterns and insights, e.g. for predictive maintenance, fraud detection or cross-selling.

This first part of the session explains how to build analytic models with R, Python and Scala leveraging open source machine learning / deep learning frameworks like Apache Spark, TensorFlow or H2O.ai.

The second part discusses how to leverage these analytic models in your own real-time streaming applications or microservices. It explains how to use the Apache Kafka cluster and Kafka Streams instead of building your own stream processing cluster. The session focuses on live demos and shares lessons learned for executing analytic models in a highly scalable and performant way.
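As a rough idea of what executing an analytic model inside Kafka Streams looks like, here is a minimal sketch (not from the talk itself). The Model interface and its predict method are hypothetical placeholders for whatever a framework like TensorFlow or H2O generates, and the topic names are assumptions:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class FraudScoringTopology {

    /** Hypothetical wrapper around a pre-trained model (e.g. an exported H2O or TensorFlow model). */
    public interface Model {
        double predict(String features);
    }

    public static void build(StreamsBuilder builder, Model model) {
        KStream<String, String> transactions = builder.stream("transactions"); // placeholder topic
        // Score every event in flight; the model is just a local object inside the microservice
        transactions
                .mapValues(features -> features + ",score=" + model.predict(features))
                .to("scored-transactions");                                     // placeholder topic
    }
}
```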

The last part explains how Apache Kafka can help to move from a manual build and deployment of analytic models to continuous online model improvement in real time.
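One common pattern for such online model improvement (a sketch under assumptions, not necessarily the talk's exact implementation): newly trained models are published to a dedicated, compacted Kafka topic, and every streaming microservice swaps its in-memory model whenever a new version arrives. Topic name, group id and the deserialization step are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ModelUpdater implements Runnable {

    private final AtomicReference<byte[]> currentModel = new AtomicReference<>();

    public byte[] currentModel() {
        return currentModel.get();
    }

    @Override
    public void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.kafka.example:9092");           // placeholder
        props.put("group.id", "fraud-scoring-model-updater");                  // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("analytic-models"));  // compacted topic (assumption)
            while (!Thread.currentThread().isInterrupted()) {
                for (ConsumerRecord<String, byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
                    // In a real application, deserialize the bytes into a model object here
                    currentModel.set(record.value());
                }
            }
        }
    }
}
```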

Slide Deck: How to Build Analytic Models and Deployment to Real Time Processing

Here is the slide deck:

More blog posts with further details and specific code examples will follow in the next weeks. I will also do a web recording for this slide deck and post it on YouTube.

Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing
https://www.kai-waehner.de/blog/2017/05/01/why-apache-kafka-confluent-open-source-messaging-integration-stream-processing/
Mon, 01 May 2017 13:24:08 +0000

After three great years at TIBCO Software, I move back to open source and join Confluent, the company behind the open source project Apache Kafka, to build mission-critical, scalable infrastructures for messaging, integration and stream processing. In this blog post, I want to share why I see the future for middleware and big data analytics in open source technologies, why I really like Confluent, what I will focus on in the next months, and why I am so excited about this next step in my career.

After three great years at TIBCO Software, I am moving back to open source and joining Confluent, the company behind the open source project Apache Kafka, to build mission-critical, scalable infrastructures for messaging, integration and streaming analytics. Confluent is a Silicon Valley startup, still at the beginning of its journey, which grew its business by 700% in 2016 and is expected to grow significantly again in 2017.

In this blog post, I want to share why I see the future for middleware and big data analytics in open source technologies, why I really like Confluent, what I will focus on in the next months, and why I am so excited about this next step in my career.

Let’s talk briefly about three cutting-edge topics which are becoming important in every industry and in small, medium and large enterprises these days:

  • Big Data Analytics: Find insights and patterns in big historical datasets.
  • Real Time Streaming Platforms: Apply insights and patterns to new events in real time (e.g. for fraud detection, cross selling or predictive maintenance).
  • Machine Learning (and its hot subtopic Deep Learning): Leverage algorithms and let machines learn by themselves without programming everything explicitly.

These three topics disrupt every industry these days. Note that Machine Learning is related to the other two topics, though today it is often treated as an independent topic; many data science projects actually use only very small datasets (often less than a gigabyte of input data). Fortunately, all three topics will be combined more and more to add additional business value.

Some industries are just at the beginning of their journey of disruption and digital transformation (like banks, telcos, insurance companies), while others have already realized some changes and innovation (e.g. retailers, airlines). In addition to the above topics, some other cutting-edge success stories emerge in a few industries, like Internet of Things (IoT) in manufacturing or Blockchain in banking.

With all these business trends on the market, we also see a key technology trend for all these topics: The adoption of open source projects.

Key Technology Trend: Adoption of “Real” Open Source Projects

When I say “open source”, I mean specific projects. I am not talking about very new, immature projects, but about frameworks which have been deployed successfully in production for many years and are used by many different developers and organizations. For example, Confluent’s Docker images, such as the Kafka REST Proxy or the Kafka Schema Registry, have each been downloaded over 100,000 times already.

A “real”, successful middleware or analytics open source project has the following characteristics:

  • Openness: Available for free and really open under a permissive license, i.e. you can use it in production and scale it out without any need to purchase a license or subscription (of course, there can be commercial, proprietary add-ons – but they need to be on top of the project, and not change the license for the used open source project under the hood)
  • Maturity: Used in business-relevant or even mission critical environments for at least one year, typically much longer
  • Adoption: Various vendors and enterprises support a project, either by contributing (source code, documentation, add-ons, tools, commercial support) or realizing projects
  • Flexibility: Deployment on any infrastructure, including on premise, public cloud, hybrid. Support for various application environments (Windows, Linux, Virtual Machine, Docker, Serverless, etc.), APIs for several programming languages (Java, .Net, Go, etc.)
  • Integration: Independent and loosely coupled, but also highly integrated (via connectors, plugins, etc.) to other relevant open source and commercial components

After defining key characteristics for successful open source projects, let’s take a look at some frameworks with huge momentum.

Cutting Edge Open Source Technologies: Apache Hadoop, Apache Spark, Apache Kafka

I defined three key trends above which are relevant for any industry and many (open source and proprietary) software vendors. There is a clear trend towards some open source projects as de facto standards for new projects:

  • Big Data Analytics: Apache Hadoop (and its zoo of sub-projects like Hive, Pig, Impala, HBase, etc.) and Apache Spark (which is now often used independently of Hadoop) to store, process and analyze huge historical datasets
  • Real Time Streaming Platforms: Apache Kafka – not just for highly scalable messaging, but also for integration and streaming analytics. Platforms either use Kafka Streams to build stream processing applications / microservices or an “external” framework like Apache Flink, Apex, Storm or Heron.
  • Machine Learning: No clear “winner” here (and that is a good thing in my opinion, as the field is so multifaceted). Many great frameworks are available – for example, R, Python and Scala offer various great implementations of machine learning algorithms, and specific frameworks like Caffe, Torch, TensorFlow or MXNet focus on Deep Learning and artificial neural networks.

On top of these frameworks, various vendors build open source or proprietary tooling and offer commercial support. Think about the key Hadoop / Spark vendors: Hortonworks, Cloudera, MapR and others, or KNIME, RapidMiner or H2O.ai as specialized open source tools for machine learning in a visual coding environment.

Of course, there are many other great open source frameworks not mentioned here but also relevant on the market, for example RabbitMQ and ActiveMQ for messaging or Apache Camel for integration. In addition, new “best practice stacks” are emerging, like the SMACK Stack which combines Spark, Mesos, Akka, and Kafka.

I am so excited about Apache Kafka and Confluent because Kafka is already used in every industry, in small and large enterprises alike. Apache Kafka production deployments accelerated in 2016, and it is now used by one-third of the Fortune 500. And this is just the beginning. Apache Kafka is not an all-rounder that solves every problem, but it is awesome at the things it is built for – as the huge and growing number of users, contributors and production deployments proves. It is highly integrated with many other frameworks and tools. Therefore, I will not just focus on Apache Kafka and Confluent in my new job, but also on many other technologies, as discussed later.

Let’s next think about the relation of Apache Kafka and Confluent to proprietary software.

Open Source vs. Proprietary Software – A Never-ending War?

The trend is moving towards open source technologies these days. This is not a question, but a fact. I have not seen a single customer in recent years who does not have projects and huge investments around Hadoop, Spark and Kafka. These technologies have moved from labs and first small projects to enterprise de facto standards with company-wide deployments and strategies. Closed legacy software is being replaced more and more – to reduce costs, but even more importantly to become more flexible, up-to-date and innovative.

What does this mean for proprietary software?

For some topics, I do not see much traction or demand for proprietary solutions anymore. Two very relevant examples where closed software ruled for the last ~20 years are messaging solutions and analytics platforms. In new projects, open frameworks seem to replace them almost everywhere, in every industry and enterprise (for good reasons).

New messaging projects are based on standards like MQTT or frameworks like Apache Kafka. Analytics is done with R and Python in conjunction with frameworks like scikit-learn or TensorFlow. These options leverage flexible, but also very mature implementations. Often, there is no need for a lot of proprietary, inflexible, complex or expensive tooling on top. Even IBM, the mega vendor, now focuses on offerings around open source.

IBM invests millions into Apache Spark for big data analytics and puts over 3,500 researchers and developers to work on Spark-related projects instead of just pushing its various proprietary analytics offerings like IBM SPSS. If you search for “IBM Messaging”, you find offerings based on the AMQP standard and cloud services based on Apache Kafka, rather than new proprietary solutions!

I think IBM is a great example of how the software market is changing these days. Confluent (just at the beginning of its journey) and Cloudera (which just went public with a successful IPO) are great examples of Silicon Valley startups going the same way.

In my opinion, good proprietary software leverages open source technologies like Apache Kafka, Apache Hadoop or Apache Spark, and vendors should integrate natively with these technologies. Some opportunities for vendors:

  • Visual coding (for developers) to generate code (e.g. graphical components, which generate framework-compliant source code for a Hadoop or Spark job)
  • Visual tooling (e.g. for business analysts or data scientists), like a Visual Analytics tools which connect to big data stores to find insights and patterns
  • Infrastructure tooling for management and monitoring of open source infrastructure (e.g. tools to monitor and scale Apache Kafka messaging infrastructures)
  • Add-ons which are natively integrated with open source frameworks (e.g. instead of requiring own proprietary runtime and messaging infrastructures, an integration vendor should deploy its integration microservices on open cloud-native platforms like Kubernetes or Cloud Foundry and leverage open messaging infrastructures like Apache Kafka instead of pushing towards proprietary solutions)

Open Source and Proprietary Software Complement Each Other

Therefore, I do not see this as a discussion of “open source software” versus “proprietary software”. Both complement each other very well. You should always ask the following questions before making a decision for open source software, proprietary software or a combination of both:

  • What is the added value of the proprietary solution? Does it increase the complexity and the footprint of runtime and tooling?
  • What is the expected total cost of ownership (TCO) of the project, i.e. license / subscription plus project lifecycle costs?
  • How do you realize the project? Who will support you, and how do you find experts for delivery (typically consulting companies)? Integration and analytics projects are often huge, with big investments, so how do you make sure you can deliver (implementation, test, deployment, operations, change management, etc.)? Can you get commercial support for your mission-critical deployments (24/7)?
  • How does this project fit with the rest of the enterprise architecture? Do you deploy everything on the same cluster? Do you want to set some (open) de facto standards across different departments and business units?
  • Do you want to use the same technology in new environments without limited 30-day trials or annoying sales cycles, and maybe even deploy it to production without any license / subscription costs?
  • Do you want to add changes, enhancements or fixes to the platform yourself (e.g. if you need a specific feature immediately, not in six months)?

Let’s think about a specific example with these questions in mind:

Example: Do you still need an Enterprise Service Bus (ESB) in a World of Cloud-Native Microservices?

I have faced this question a lot in the last 24 months, especially with the trend moving to flexible, agile microservices (not just for building business applications, but also for integration and analytics middleware). See my article “Do Microservices Spell the End of the ESB?”. The short summary: you still need middleware (call it ESB, integration platform, iPaaS, or something else), though the requirements are different today. This is true for open source and proprietary ESB products. However, something else has changed in the last 24 months…

In the past, open source and proprietary middleware vendors offered an ESB as integration platform. The offering included a runtime (to guarantee scalable, mission-critical operation of integration services) and a development environment (for visual coding and faster time-to-market). The last two years changed how we think about building new applications. We now (want to) build microservices, which run in a Docker container. The scalable, mission-critical runtime is managed by a cloud-native platform like Kubernetes or Cloud Foundry. Ideally, DevOps automates builds, tests and deployment. These days, ESB vendors adopt these technologies. So far, so good.

Now, you can deploy your middleware microservice to these cloud-native platforms like any other Java, .NET or Go microservice. However, this completely changes the added value of the ESB platform. Now, its benefit is just about visual coding, and the key argument is time-to-market (you should always question and double-check whether that is a valid argument). The runtime is not really part of the ESB anymore. In most scenarios, this completely changes the view on deciding whether you still need an ESB. Ask yourself about time-to-market, license / subscription costs and TCO again! Also think about the (typically) increased resource requirements (memory, CPU, disk) of tooling-built services (typically some kind of big .ear file), compared to plain source code (e.g. Java’s .jar files).

Does the ESB still add enough value, or should you just use a cloud-native platform and a messaging infrastructure? Is it easier to write a few lines of source code instead of setting up ESB tooling, where you often struggle with importing REST / Swagger or WSDL files and many other configuration steps before you can actually begin to leverage the visual coding environment? In very big, long-running projects, you might still end up with a win. However, in an agile, ever-changing world with a fail-fast ideology, many different technologies and frameworks, and automated CI/CD stacks, you might only add new complexity and no longer get the value you did in the old world, where the ESB was also the mission-critical runtime. The same is true for other middleware components like stream processing, analytics platforms, etc.

ESB Alternative: Apache Kafka and Confluent Open Source Platform

As an alternative, you could use, for example, Kafka Connect, a very lightweight integration framework based on Apache Kafka for building large-scale, low-latency data pipelines. The beauty of Kafka Connect is that all the challenges around scalability, fail-over and performance are handled by the Kafka infrastructure. You just use Kafka Connect connectors to realize very powerful integrations with a few lines of configuration for sources and sinks. If you use Apache Kafka as your messaging infrastructure anyway, you need very compelling reasons to put an ESB on top instead of the much less complex and much less heavyweight Kafka Connect.
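To give a feeling for "a few lines of configuration", here is a sketch of a source connector definition based on the FileStream connector that ships with Apache Kafka (file path and topic name are placeholders; a real pipeline would more likely use a JDBC or CDC connector):

```properties
# file-source.properties - publish every new line of a file to a Kafka topic
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/var/log/app/orders.log
topic=orders
```

Running this with the standalone Connect worker is enough for experiments; in distributed mode the same configuration is submitted via Connect's REST API, and scalability and fail-over come from the worker cluster.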

I think this section explained why open source and proprietary software are complementary in many use cases. But it does not make sense to add heavyweight, resource-intensive and complex tooling in every scenario. Open source is not free (you still need to spend time and effort on the project and probably pay for some kind of commercial support), but open source without too much additional proprietary tooling is often the better choice regarding complexity, time-to-market and TCO. You can find endless success stories about open source projects, not just from tech giants like Google, Facebook or LinkedIn, but also from many small and large traditional enterprises. Of course, any project can fail. Though in projects built on frameworks like Hadoop, Spark or Kafka, failure is probably not due to issues with the technology…

Confluent + “Proprietary Mega Vendors”

On the other side, I really look forward to working together with “mostly proprietary” mega vendors like TIBCO, SAP, SAS and others where it makes sense to solve customer problems and build innovative, powerful, mission-critical solutions. For example, TIBCO StreamBase is a great tool if you want to develop stream processing applications via a visual editor instead of writing source code. Actually, it does not even compete with Kafka Streams, because the latter is a library which you embed into any other microservice or application (deployed anywhere, e.g. in a Java application, Docker container, Apache Mesos, “you-choose-it”), while StreamBase (like its competitors Software AG Apama and IBM Streams, and the open source frameworks Apache Flink, Storm, Apache Apex, Heron, etc.) focuses on building streaming applications on its own cluster (typically deployed either on Hadoop’s YARN or on a proprietary cluster). Therefore, you could use StreamBase and its Kafka connector to build streaming applications that leverage Kafka as the messaging infrastructure.

Even Confluent itself offers some proprietary tooling, like Confluent Control Center for management and monitoring, on top of open source Apache Kafka and the open source Confluent Platform. This is the typical business model behind successful open source vendors like Red Hat: embrace open source projects, and offer 24/7 support and proprietary add-ons for enterprise-grade deployments. Thus, not everything is or needs to be open source. That’s absolutely fine.

So, after all these discussions about business and technology trends, open source and proprietary software, what will I do in my new job at Confluent?

Confluent Platform in Conjunction with Analytics, Machine Learning, Blockchain, Enterprise Middleware

Of course, I will focus a lot on Apache Kafka and Confluent Platform in my new job, where I will work mainly with prospects and customers in EMEA, but I will also continue as Technology Evangelist with publications and conference talks. Let’s get into a little more detail here…

My focus has never been on being a deep-level technology expert or fixing issues in production environments (though I do hands-on coding, of course). Many other technology experts are far better in very technical discussions. As in the past, I will focus on designing mission-critical enterprise architectures, discussing innovative real-world use cases with prospects and customers, and evaluating cutting-edge technologies in conjunction with the Confluent Platform. Here are some of my ideas for the next months:

  • Apache Kafka + Cloud Native Platforms = Highly Scalable Streaming Microservices (leveraging platforms like Kubernetes, Cloud Foundry, Apache Mesos)
  • Highly Scalable Machine Learning and Deep Learning Microservices with Apache Kafka Streams (using TensorFlow, MXNet, H2O.ai, and other open source frameworks)
  • Online Machine Learning (i.e. updating analytics models in real time for every new event) leveraging Apache Kafka as infrastructure backbone
  • Open Source IoT Platform for Core and Edge Streaming Analytics (leveraging Apache Kafka, Apache Edgent, and other IoT frameworks)
  • Comparison of Open Source Stream Processing Frameworks (differences between Kafka Streams and other modern frameworks like Heron, Apache Flink, Apex, Spark Streaming, Edgent, Nifi, StreamSets, etc.)
  • Apache Kafka / Confluent and other Enterprise Middleware (discuss when to combine proprietary middleware with Apache Kafka, and when to simply “just” use Confluent’s open source platform)
  • Middleware and Streaming Analytics as Key for Success in Blockchain Projects

You can expect publications, conference and meetup talks, and webinars about these and other topics in 2017, as in previous years. Please let me know what you are most interested in and what other topics you would like to hear about!

I am also really looking forward to working together with partners on scalable, mission-critical enterprise architectures and powerful solutions around Apache Kafka and Confluent Platform. This will include combined solutions and use cases with open source as well as proprietary software vendors.

Last but not least, and most importantly, I am excited to work with prospects and customers from traditional enterprises, tech giants and startups to realize innovative use cases and success stories that add business value.

As you can see, I am really excited to start at Confluent in May 2017. I will visit Confluent’s London and Palo Alto offices in the first weeks and also be at Kafka Summit in New York. It will be an exciting month to get started at this awesome Silicon Valley startup.

Please let me know your feedback. Do you see the same trends? Do you share my opinions or disagree? I look forward to discussing all these topics with customers, partners and anybody else in upcoming meetings, workshops, publications, webinars, meetups and conference talks.

Visual Analytics + Open Source Deep Learning Frameworks
https://www.kai-waehner.de/blog/2017/04/24/visual-analytics-open-source-deep-learning-frameworks/
Mon, 24 Apr 2017 09:25:54 +0000

Deep Learning gets more and more traction. It basically focuses on one section of Machine Learning: Artificial Neural…

Deep Learning is getting more and more traction. It basically focuses on one area of Machine Learning: artificial neural networks. This article explains why Deep Learning is a game changer in analytics, when to use it, and how Visual Analytics allows business analysts to leverage the analytic models built by a (citizen) data scientist.

What are Deep Learning and Artificial Neural Networks?

Deep Learning is the modern buzzword for artificial neural networks, one of many concepts in machine learning for building analytic models. A neural network works similarly to what we know from the human brain: it takes non-linear interactions as input and transfers them to output. Neural networks learn continuously and accumulate knowledge in computational nodes between input and output. In most cases a neural network is a supervised algorithm, which uses historical data sets to learn parameters that predict the outputs of future events, e.g. for cross-selling or fraud detection. Unsupervised neural networks can be used to find new patterns and anomalies. In some cases, it makes sense to combine supervised and unsupervised algorithms.

Neural networks have been used in research for many decades and include various sophisticated concepts like the Recurrent Neural Network (RNN), Convolutional Neural Network (CNN) and Autoencoder. However, today’s powerful and elastic computing infrastructure, in combination with technologies like graphics processing units (GPUs) with thousands of cores, allows much more powerful computations with many more layers. Hence the term “Deep Learning”.

The following picture from TensorFlow Playground shows an easy-to-use environment which includes various test data sets, configuration options and visualizations to learn and understand deep learning and neural networks:

If you want to learn more about the details of Deep Learning and Neural Networks, I recommend the following sources:

While Deep Learning is getting more and more traction, it is not the silver bullet for every scenario.

When (not) to use Deep Learning?

Deep Learning enables many new possibilities which were not feasible in “mass production” a few years ago, e.g. image classification, object recognition, speech translation or natural language processing (NLP) in much more sophisticated ways than without Deep Learning. A key benefit is automated feature engineering, which costs a lot of time and effort with most other machine learning alternatives.

You can also leverage Deep Learning to make better decisions, increase revenue or reduce risk for existing (“already solved”) problems instead of using other machine learning algorithms. Examples include risk calculation, fraud detection, cross selling and predictive maintenance.

However, note that Deep Learning has a few important drawbacks:

  • Very expensive, i.e. slow and compute-intensive; training a deep learning model often takes days or weeks, and execution also takes more time than most other algorithms.
  • Hard to interpret: the results of the analytic model lack explainability, which is often a key requirement for legal or compliance regulations
  • Tends to overfit, and therefore needs regularization

Deep Learning is ideal for complex problems. It can also outperform other algorithms in moderate problems. Deep Learning should not be used for simple problems. Other algorithms like logistic regression or decision trees can solve these problems easier and faster.

Open Source Deep Learning Frameworks

Neural networks are mostly adopted using one of various open source implementations. Various mature deep learning frameworks are available for different programming languages.

The following picture shows an overview of open source deep learning frameworks and evaluates several characteristics:

These frameworks have in common that they are built for data scientists, i.e. personas with experience in programming, statistics, mathematics and machine learning. Note that writing the source code is not a big task. Typically, only a few lines of code are needed to build an analytic model. This is completely different from other development tasks like building a web application, where you write hundreds or thousands of lines of code. In Deep Learning – and Data Science in general – it is most important to understand the concepts behind the code to build a good analytic model.

Some nice open source tools like KNIME or RapidMiner allow visual coding to speed up development and also encourage citizen data scientists (i.e. people with less experience) to learn the concepts and build deep networks. These tools use their own deep learning implementations or embed other open source libraries like H2O.ai or DeepLearning4j as the framework under the hood.

If you do not want to build your own model or leverage existing pre-trained models for common deep learning tasks, you might also take a look at the offerings from the big cloud providers, e.g. AWS Polly for Text-to-Speech translation, Google Vision API for Image Content Analysis, or Microsoft’s Bot Framework to build chat bots. The tech giants have years of experience with analysing text, speech, pictures and videos and offer their experience in sophisticated analytic models as a cloud service; pay-as-you-go. You can also improve these existing models with your own data, e.g. train and improve a generic picture recognition model with pictures of your specific industry or scenario.

Deep Learning in Conjunction with Visual Analytics

No matter whether you use “just” a framework in your favourite programming language or a visual coding tool: you need to be able to make decisions based on the trained neural network. This is where visual analytics comes into play. In short, visual analytics allows any persona to make data-driven decisions instead of listening to gut feeling when analysing complex data sets. See “Using Visual Analytics for Better Decisions – An Online Guide” to understand the key benefits in more detail.

A business analyst does not need to understand anything about deep learning; they just leverage the integrated analytic model to answer their business questions. The analytic model is applied under the hood when the business analyst changes parameters, features or data sets. However, visual analytics should also be used by the (citizen) data scientist to build the neural network. See “How to Avoid the Anti-Pattern in Analytics: Three Keys for Machine Learning” to understand in more detail how technical and non-technical people should work together using visual analytics to build neural networks which help solve business problems. Even some parts of data preparation are best done within visual analytics tooling, as described in “Data Preprocessing vs. Data Wrangling in Machine Learning Projects”.

From a technical perspective, Deep Learning frameworks (and in a similar way any other Machine Learning frameworks, of course) can be integrated into visual analytics tooling in different ways. The following list includes a TIBCO Spotfire example for each alternative:

  • Embedded Analytics: Implemented directly within the analytics tool (self-implementation or “OEM”); can be used by the business analyst without any knowledge about machine learning (Spotfire: Clustering via some basic, simple configuration of an input and output data plus cluster size)
  • Native Integration: Connectors to directly access external deep learning clusters. (Spotfire: TERR to use R’s machine learning libraries, KNIME connector to directly integrate with external tooling)
  • Framework API: Access via a Wrapper API in different programming languages. For example, you could integrate MXNet via R or TensorFlow via Python into your visual analytics tooling. This option can always be used and is appropriate if no native integration or connector is available. (Spotfire: MXNet’s R interface via Spotfire’s TERR Integration for using any R library)
  • Integrated as Service via an Analytics Server: Connect external deep learning clusters indirectly via a server-side component of the analytics tool; different frameworks can be accessed by the analytics tool in a similar fashion (Spotfire: Statistics Server for external analytics tools like SAS or Matlab)
  • Cloud Service: Access pre-trained models for common deep learning specific tasks like image recognition, voice recognition or text processing. Not appropriate for very specific, individual business problems of an enterprise. (Spotfire: Call public deep learning services like image recognition, speech translation, or Chat Bot from AWS, Azure, IBM, Google via REST service through Spotfire’s TERR / R interface)

All options have in common that you need to configure some hyperparameters, i.e. “high-level” parameters like problem type, feature selection or regularization level. Depending on the integration option, this can be very technical and low-level, or simplified and less flexible, using terms which the business analyst understands.

Deep Learning Example: Autoencoder Template for TIBCO Spotfire

Let’s take one specific category of neural networks as an example: autoencoders for finding anomalies. An autoencoder is an unsupervised neural network that learns to reconstruct its input through a restricted (compressed) hidden representation. A reconstruction error is generated upon prediction: the higher the reconstruction error, the higher the probability that the data point is an anomaly.
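The scoring logic itself is simple once a trained autoencoder is available. Here is a minimal sketch; the AutoencoderModel interface and the threshold are hypothetical placeholders, not part of any specific framework:

```java
public class AnomalyDetector {

    /** Hypothetical wrapper around a trained autoencoder (e.g. exported from a deep learning framework). */
    public interface AutoencoderModel {
        double[] reconstruct(double[] features);
    }

    private final AutoencoderModel model;
    private final double threshold; // e.g. chosen from the error distribution of normal data

    public AnomalyDetector(AutoencoderModel model, double threshold) {
        this.model = model;
        this.threshold = threshold;
    }

    public boolean isAnomaly(double[] features) {
        double[] reconstructed = model.reconstruct(features);
        // Mean squared reconstruction error: a large error means the network cannot reproduce the input well
        double mse = 0.0;
        for (int i = 0; i < features.length; i++) {
            double diff = features[i] - reconstructed[i];
            mse += diff * diff;
        }
        mse /= features.length;
        return mse > threshold;
    }
}
```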

Use cases for autoencoders include fighting financial crime, monitoring equipment sensors, healthcare claims fraud, and detecting manufacturing defects. A generic TIBCO Spotfire template is available in the TIBCO Community for free. You can simply add your data set and leverage the template to find anomalies using autoencoders – without any complex configuration or even coding. Under the hood, the template uses H2O.ai’s deep learning implementation and its R API. It runs in a local instance on the machine where Spotfire runs. You can also take a look at the R code, but this is not needed to use the template and is therefore optional.

Real World Example: Anomaly Detection for Predictive Maintenance

Let’s use the autoencoder for a real-world example. In telco, you have to analyse the infrastructure continuously to find problems and issues within the network, ideally before a failure happens, so that you can fix it before the customer even notices the problem. Take a look at the following picture, which shows historical data of a telco network:

The orange dots are spikes which occur as a first indication of a technical problem in the infrastructure. The red dots show a constant failure, where technicians have to replace parts of the network because it does not work anymore.

Autoencoders can be used to detect network issues before they actually happen. TIBCO Spotfire uses H2O’s autoencoder in the background to find the anomalies. As discussed before, the source code is relatively short. Here is the snippet that builds the analytic model with H2O’s Deep Learning R API and detects the anomalies (by computing the reconstruction error of the autoencoder):

This analytic model – built by the data scientist – is integrated into TIBCO Spotfire. The business analyst is able to visually analyse the historical data and the insights of the autoencoder. This combination allows data scientists and business analysts to work together fluently. It has never been easier to implement predictive maintenance and create huge business value by reducing risk and costs.

Apply Analytic Models to Real Time Processing with Streaming Analytics

This article focuses on building deep learning models with data science frameworks and visual analytics. The key to success in such projects is to apply the built analytic model to new events in real time to add business value, like increasing revenue, reducing cost or reducing risk.

“How to Apply Machine Learning to Event Processing” describes in more detail how to apply analytic models to real-time processing. Or watch the corresponding video recording leveraging TIBCO StreamBase to apply some H2O models in real time. Finally, I recommend learning about the various streaming analytics frameworks available for applying analytic models.

Let’s come back to the autoencoder use case to realize predictive maintenance in telcos. In TIBCO StreamBase, you can easily apply the built H2O autoencoder model without any redevelopment via StreamBase’s H2O connector. You just attach the Java code generated by the H2O framework, which contains the analytic model and compiles to very performant JVM bytecode:

The most important lesson learned: think about the execution requirements before building the analytic model. What performance do you need regarding latency? How many events do you need to process per minute, second or millisecond? Do you need to distribute the analytic model to a cluster with many nodes? How often do you have to improve and redeploy the analytic model? You need to answer these questions at the beginning of your project to avoid duplicate effort and redevelopment of analytic models!

Another important fact is that analytic models do not always need to score very fast or frequently (unlike, say, scoring every single event in a sensor analytics use case). In the telco infrastructure example above, these spikes and failures might develop over subsequent days or even weeks. Thus, in many use cases, it is fine to score an analytic model once an hour or even once a day.

Deep Learning + Visual Analytics + Streaming Analytics = Next Generation Big Data Success Stories

Deep Learning allows you to solve many well-understood problems like cross-selling, fraud detection or predictive maintenance more efficiently. In addition, you can address scenarios which were not possible to solve before, like accurate and efficient object detection or speech-to-text translation.

Visual Analytics is a key component in Deep Learning projects to be successful. It eases the development of deep neural networks by (citizen) data scientists and allows business analysts to leverage these analytic models to find new insights and patterns.

Today, (citizen) data scientists use programming languages like R or Python, deep learning frameworks like Theano, TensorFlow, MXNet or H2O’s Deep Water and a visual analytics tool like TIBCO Spotfire to build deep neural networks. The analytic model is embedded into a view for the business analyst to leverage it without knowing the technology details.

In the future, visual analytics tools might embed neural network features like they already embed other machine learning features like clustering or logistic regression today. This will allow business analysts to leverage Deep Learning without the help of a data scientist and be appropriate for simpler use cases.

However, do not forget that building an analytic model to find insights is just the first part of a project. Deploying it to real-time processing afterwards is an equally important second step. Good integration between the tooling for finding insights and the tooling for applying insights to new events can significantly improve time-to-market and model quality in data science projects. The development lifecycle is a continuous closed loop: the analytic model needs to be validated and rebuilt at regular intervals.

Comparison: Data Preparation vs. Inline Data Wrangling in Machine Learning and Deep Learning Projects
https://www.kai-waehner.de/blog/2017/02/13/comparison-data-preparation-vs-inline-data-wrangling-machine-learning/
Mon, 13 Feb 2017 11:23:01 +0000

Data Preparation: Comparison of Programming Languages, Frameworks and Tools for Data Preprocessing and (Inline) Data Wrangling in Machine Learning / Deep Learning Projects.

I want to highlight a new presentation about Data Preparation in Data Science projects:

“Comparison of Programming Languages, Frameworks and Tools for Data Preprocessing and (Inline) Data Wrangling  in Machine Learning / Deep Learning Projects”

Data Preparation as Key for Success in Data Science Projects

A key task in creating appropriate analytic models in machine learning or deep learning is the integration and preparation of data sets from various sources like files, databases, big data storage, sensors or social networks. This step can take up to 80% of the whole project.

This session compares different alternative techniques to prepare data, including extract-transform-load (ETL) batch processing (like Talend, Pentaho), streaming analytics ingestion (like Apache Storm, Flink, Apex, TIBCO StreamBase, IBM Streams, Software AG Apama), and data wrangling (DataWrangler, Trifacta) within visual analytics. Various options and their trade-offs are shown in live demos using different advanced analytics technologies and open source frameworks such as R, Python, Apache Hadoop, Spark, KNIME or RapidMiner. The session discusses how this is related to visual analytics tools (like TIBCO Spotfire). It also shows best practices for how the data scientist and business analyst should work together to build good analytic models.

Key Takeaway: Inline Data Wrangling Within Visual Analytics Tooling

Key takeaways of this session:

–    Learn various options for preparing data sets to build analytic models
–    Understand the pros and cons and the targeted persona for each option
–    See different technologies and open source frameworks for data preparation
–    Understand the relation to visual analytics and streaming analytics, and how these concepts are actually leveraged to build the analytic model after data preparation

Slide Deck

The following shows the slide deck:

Video Recording: Data Preparation vs. (Inline) Data Wrangling

Here is the video recording:

Streaming Analytics Comparison of Open Source Frameworks, Products, Cloud Services
https://www.kai-waehner.de/blog/2016/11/15/streaming-analytics-comparison-open-source-frameworks-products-cloud-services/
Tue, 15 Nov 2016 13:20:31 +0000

Streaming Analytics Comparison of Open Source Frameworks, Products and Cloud Services. Includes Apache Storm, Flink, Spark, TIBCO, IBM, AWS Kinesis, Striim, Zoomdata, ...

In November 2016, I was at Big Data Spain in Madrid for the first time. A great conference with many awesome speakers and sessions about very hot topics such as Apache Hadoop, Apache Spark, Stream Processing / Streaming Analytics and Machine Learning. If you are interested in big data, then this conference is for you! My two talks:

  • “How to Apply Machine Learning to Real Time Processing” (see slides and video recording from a similar conference talk).
  • “Comparison of Streaming Analytics Options” (the reason for this blog post; an updated version of my talk from JavaOne 2015)

Here I want to share the slides and a video recording of the latter one…

Abstract: Comparison of Stream Processing Options

This session discusses the technical concepts of stream processing / streaming analytics and how it is related to big data, mobile, cloud and internet of things. Different use cases such as predictive fault management or fraud detection are used to show and compare alternative frameworks and products for stream processing and streaming analytics.

The focus of the session lies on comparing

  • different open source frameworks such as Apache Apex, Apache Flink or Apache Spark Streaming
  • engines from software vendors such as IBM InfoSphere Streams, TIBCO StreamBase
  • cloud offerings such as AWS Kinesis.
  • real time streaming UIs such as Striim, Zoomdata or TIBCO Live Datamart.

Live demos will give the audience a good feeling for how to use these frameworks and tools.


The session will also discuss how stream processing is related to Apache Hadoop frameworks (such as MapReduce, Hive, Pig or Impala) and machine learning (such as R, Spark ML or H2O.ai).

Slides – Alternatives for Streaming Analytics

The following slide deck is a more extensive version of the talk at Big Data Spain (as the conference talks were only 30 minutes):

The video recording walks you through the above slide deck:

As always, I appreciate any comments, questions or other feedback.

Machine Learning Applied to Microservices
https://www.kai-waehner.de/blog/2016/10/20/machine-learning-applied-microservices/
Thu, 20 Oct 2016 19:32:22 +0000

Build intelligent Microservices by applying Machine Learning and Advanced Analytics. Leverage Apache Hadoop / Spark with Visual Analytics and Stream Processing.

I had two sessions at the O’Reilly Software Architecture Conference in London in October 2016. It was the first #OReillySACon in London. A very well organized conference with plenty of great speakers and sessions. I can really recommend this conference and its siblings in other cities such as San Francisco or New York if you want to learn about good software architectures and new concepts, best practices and technologies. Some of the hot topics this year, besides microservices, are DevOps, serverless architectures and big data analytics / machine learning.

Intelligent Microservices by Leveraging Big Data Analytics

One of the two sessions was about how to apply machine learning and big data analytics to real time event processing. I also included the relation to microservices, i.e. how to leverage microservice concepts such as 12 Factor Apps, Containers (e.g. Docker), Cloud Platforms (e.g. Kubernetes, Cloud Foundry), or DevOps to build agile, intelligent microservices.

Abstract: How to Apply Machine Learning to Microservices

Digital transformation is moving forward due to mobile, cloud and the Internet of Things. Disruptive business models leverage Big Data Analytics and Machine Learning.

“Big Data” is currently a big hype. Large amounts of historical data are stored in Hadoop or other platforms. Business intelligence tools and statistical computing are used to draw new knowledge and find patterns from this data, for example for promotions, cross-selling or fraud detection. The key challenge is how these findings from historical data can be integrated into new transactions in real time to make customers happy, increase revenue or prevent fraud. “Fast Data” via stream processing is the solution to embed patterns – which were obtained from analyzing historical data – into future transactions in real time.

This session uses several real-world success stories to explain the concepts behind stream processing and its relation to Hadoop and other big data platforms. It discusses how patterns and statistical models of R, Spark MLlib, H2O, and other technologies can be integrated into real-time processing by using several different real-world case studies. The session also points out why a microservices architecture helps to meet the agility requirements of these kinds of projects.

A brief overview of available open source frameworks and commercial products shows possible options for the implementation of stream processing, such as Apache Storm, Apache Flink, Spark Streaming, IBM InfoSphere Streams, or TIBCO StreamBase.

A live demo shows how to implement stream processing, how to integrate machine learning, and how human operations can be enabled in addition to the automatic processing via a Web UI and push events.

How to Build Intelligent Microservices – Slide Deck from O’Reilly Software Architecture Conference

Comparison Of Log Analytics for Distributed Microservices – Open Source Frameworks, SaaS and Enterprise Products
https://www.kai-waehner.de/blog/2016/10/20/comparison-log-analytics-distributed-microservices-open-source-frameworks-saas-enterprise-products/
Thu, 20 Oct 2016 18:57:51 +0000

Log Analytics is the right framework or tool to monitor distributed Microservices. Comparison of open source frameworks, SaaS and enterprise products, plus the relation to big data components such as Apache Hadoop / Spark.

I had two sessions at the O’Reilly Software Architecture Conference in London in October 2016. It was the first #OReillySACon in London. A very well organized conference with plenty of great speakers and sessions. I can really recommend this conference and its siblings in other cities such as San Francisco or New York if you want to learn about good software architectures and new concepts, best practices and technologies. Some of the hot topics this year besides microservices are DevOps, serverless architectures and big data analytics.

I want to share the slides of my session about comparing open source frameworks, SaaS and enterprise products regarding log analytics for distributed microservices:

Monitoring Distributed Microservices with Log Analytics

IT systems and applications generate more and more distributed machine data due to millions of mobile devices, Internet of Things, social network users, and other new emerging technologies. However, organizations experience challenges when monitoring and managing their IT systems and technology infrastructure. They struggle with distributed Microservices and Cloud architectures, custom application monitoring and debugging, network and server monitoring / troubleshooting, security analysis, compliance standards, and others.

This session discusses how to solve the challenges of monitoring and analyzing terabytes and more of different distributed machine data to leverage the “digital business”. The main part of the session compares different open source frameworks and SaaS cloud solutions for log management and operational intelligence, such as Graylog, the “ELK stack”, Papertrail, Splunk or TIBCO LogLogic. A live demo will demonstrate how to monitor and analyze distributed microservices and sensor data from the “Internet of Things”.

The session also explains how the discussed solutions relate to other big data components such as Apache Hadoop, data warehouses or machine learning and its application to real-time processing, and how they can complement each other in a big data architecture.

The session concludes with an outlook to the new, advanced concept of IT Operations Analytics (ITOA).

Slide Deck from O’Reilly Software Architecture Conference

Characteristics of a Good Visual Analytics and Data Discovery Tool
https://www.kai-waehner.de/blog/2016/07/28/characteristics-good-visual-analytics-data-discovery-tool/
Thu, 28 Jul 2016 07:06:02 +0000

Several tools are available on the market for Visual Analytics and Data Discovery. Three of the most well known options are Tableau, Qlik and TIBCO Spotfire. This post shows important characteristics to compare and evaluate these tools.

Visual Analytics and Data Discovery allow the analysis of big data sets to find insights and valuable information. This is much more than just classical Business Intelligence (BI). See this article for more details and motivation: “Using Visual Analytics to Make Better Decisions: the Death Pill Example“. Let’s take a look at the important characteristics for choosing the right tool for your use cases.

Visual Analytics Tool Comparison and Evaluation

Several tools are available on the market for Visual Analytics and Data Discovery. Three of the most well known options are Tableau, Qlik and TIBCO Spotfire. Use the following list to compare and evaluate different tools to make the right decision for your project:

  • Ease-of use and an intuitive user interface for business users to create interactive visualizations
  • Various visualization components such as bar charts, pie charts, histogram, scatter plots, treemaps, trellis charts, and many more
  • Connectivity to various data sources (e.g. Oracle, NoSQL, Hadoop, SAP Hana, Cloud Services)
  • True ad-hoc data discovery: real interactive analysis via drag-and-drop interactions (e.g. restructure tables or link different data sets) instead of “just” visualizing data sets by drill-down / roll-up in tables.
  • Support for data loading and analysis with alternative approaches: in-memory (e.g. RDBMS, spreadsheets), in-database (e.g. Hadoop) or on-demand (e.g. event data streams)
  • In-line and ad-hoc data wrangling functionality to put data into the shape and quality that is needed for further analysis
  • Geoanalytics using geo-location features to enable location-based analysis beyond simple layer map visualizations (e.g. spatial search, location-based clustering, distance and route calculation)
  • Out-of-the-box functionality for “simple” analytics without coding (e.g. forecasting, clustering, classification)
  • Out-of-the-box capabilities to realize advanced analytics use cases without additional tools (e.g. an embedded R engine and corresponding tooling)
  • Support for integrating any additional advanced analytics and machine learning frameworks (such as R, Python, Apache Spark, H2O.ai, KNIME, SAS or MATLAB)
  • Extendibility and enhancement with custom components and features
  • Collaboration between business users, analysts and data scientists within the same tool without additional third-party tools (e.g. ability to work together in a team, share analysis with others, add comments and discussions)

Take a look at available visual analytics tools on the market with the above list in mind and select the right one for your use cases. Also keep in mind that you usually want to put the insights into action afterwards, e.g. for fraud detection, cross selling or predictive maintenance. Therefore, think about “How to Apply Insights and Analytic Models to Real Time Processing” when you start your data discovery journey.

Streaming Analytics with Analytic Models (R, Spark MLlib, H20, PMML)
https://www.kai-waehner.de/blog/2016/03/03/streaming-analytics-with-analytic-models-r-spark-mllib-h20-pmml/
Thu, 03 Mar 2016 15:51:01 +0000

Closed Big Data Loop: 1) Finding Insights with R, H20, Apache Spark MLlib, PMML and TIBCO Spotfire. 2) Putting Analytic Models into Action via Event Processing and Streaming Analytics.

In March 2016, I gave a talk at Voxxed Zurich about “How to Apply Machine Learning and Big Data Analytics to Real Time Processing”.


Finding Insights with R, H2O, Apache Spark MLlib, PMML and TIBCO Spotfire

“Big Data” is currently a big hype. Large amounts of historical data are stored in Hadoop or other platforms. Business Intelligence tools and statistical computing are used to draw new knowledge and to find patterns from this data, for example for promotions, cross-selling or fraud detection. The key challenge is how these findings from historical data can be integrated into new transactions in real time to make customers happy, increase revenue or prevent fraud.

Putting Analytic Models into Action via Event Processing and Streaming Analytics

“Fast Data” via stream processing is the solution to embed patterns – which were obtained from analyzing historical data – into future transactions in real time. The following slide deck uses several real-world success stories to explain the concepts behind stream processing and its relation to Apache Hadoop and other big data platforms. I discuss how patterns and statistical models of R, Apache Spark MLlib, H2O, and other technologies can be integrated into real-time processing using open source stream processing frameworks (such as Apache Storm, Spark Streaming or Flink) or products (such as IBM InfoSphere Streams or TIBCO StreamBase). A live demo showed the complete development lifecycle, combining analytics with TIBCO Spotfire, machine learning via R, and stream processing via TIBCO StreamBase and TIBCO Live Datamart.

Slide Deck from Voxxed Zurich 2016

Here is the slide deck:
