NoSQL Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/nosql-2/

Streaming Analytics with Analytic Models (R, Spark MLlib, H2O, PMML)
https://www.kai-waehner.de/blog/2016/03/03/streaming-analytics-with-analytic-models-r-spark-mllib-h20-pmml/
Thu, 03 Mar 2016

Closing the big data loop: 1) finding insights with R, H2O, Apache Spark MLlib, PMML and TIBCO Spotfire; 2) putting analytic models into action via event processing and streaming analytics.

In March 2016, I gave a talk at Voxxed Zurich titled “How to Apply Machine Learning and Big Data Analytics to Real Time Processing”.

(Photo: Kai Wähner speaking at Voxxed Zurich)

Finding Insights with R, H2O, Apache Spark MLlib, PMML and TIBCO Spotfire

“Big Data” is a big hype these days. Large amounts of historical data are stored in Hadoop or other platforms, and Business Intelligence tools and statistical computing are used to extract new knowledge and find patterns in this data, for example for promotions, cross-selling or fraud detection. The key challenge is integrating these findings from historical data into new transactions in real time to make customers happy, increase revenue or prevent fraud.
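As a minimal illustration of the “finding insights” step, the following Python sketch derives a trivial fraud rule (an outlier threshold) from historical transaction amounts. The data, the threshold logic and the function names are all hypothetical; in practice this step would be done with R, Spark MLlib or H2O rather than by hand:

```python
# Hypothetical sketch: derive a simple fraud "model" (an outlier threshold)
# from historical transaction data. Real projects would use R, Spark MLlib
# or H2O here; the statistics below only illustrate the idea.
from statistics import mean, stdev

historical_amounts = [12.5, 9.9, 15.0, 11.2, 14.8, 10.5, 13.3, 12.0, 9.5, 14.1]

mu = mean(historical_amounts)
sigma = stdev(historical_amounts)

# "Model": flag any transaction more than 3 standard deviations above the mean.
fraud_threshold = mu + 3 * sigma

def looks_fraudulent(amount: float) -> bool:
    return amount > fraud_threshold

print(looks_fraudulent(100.0))  # True: far outside the historical pattern
print(looks_fraudulent(13.0))   # False: within the normal range
```

The point is only that the “model” is computed offline from historical data; the next section is about getting such a model into the live transaction path.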

Putting Analytic Models into Action via Event Processing and Streaming Analytics

“Fast Data” via stream processing is the solution for embedding patterns – obtained from analyzing historical data – into future transactions in real time. The following slide deck uses several real-world success stories to explain the concepts behind stream processing and its relation to Apache Hadoop and other big data platforms. I discuss how patterns and statistical models from R, Apache Spark MLlib, H2O and other technologies can be integrated into real-time processing using open source stream processing frameworks (such as Apache Storm, Spark Streaming or Flink) or products (such as IBM InfoSphere Streams or TIBCO StreamBase). A live demo showed the complete development lifecycle, combining analytics with TIBCO Spotfire, machine learning via R, and stream processing via TIBCO StreamBase and TIBCO Live Datamart.
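To make the “putting models into action” idea concrete, here is a hypothetical Python sketch that applies a pre-trained model to each event in a stream. The fixed weights stand in for a model exported via something like PMML; the event fields, weights and threshold are invented for illustration, and a real deployment would run this per-event logic inside a Storm, Spark Streaming, Flink or StreamBase operator:

```python
# Hypothetical sketch: score every incoming event with a pre-trained model.
# The weights below stand in for a model exported from R / Spark MLlib / H2O
# (e.g. via PMML); the field names and threshold are invented.
WEIGHTS = {"amount": 0.004, "foreign_country": 0.5, "night_time": 0.3}
BIAS = -1.0

def score(event: dict) -> float:
    """Linear score clamped to [0, 1]; higher means more suspicious."""
    z = BIAS + sum(WEIGHTS[k] * event.get(k, 0) for k in WEIGHTS)
    return max(0.0, min(1.0, z))

def process_stream(events):
    """Emit an alert id for every event whose score crosses the threshold."""
    alerts = []
    for event in events:
        if score(event) > 0.5:
            alerts.append(event["id"])
    return alerts

stream = [
    {"id": "tx1", "amount": 30, "foreign_country": 0, "night_time": 0},
    {"id": "tx2", "amount": 900, "foreign_country": 1, "night_time": 1},
]
print(process_stream(stream))  # ['tx2']
```

The essential point: the model is trained offline, but evaluated per event with millisecond latency inside the stream, before the transaction completes.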

Slide Deck from Voxxed Zurich 2016

Here is the slide deck:

TIBCO BusinessWorks and StreamBase for Big Data Integration and Streaming Analytics with Apache Hadoop and Impala
https://www.kai-waehner.de/blog/2015/04/14/tibco-businessworks-and-streambase-for-big-data-integration-and-streaming-analytics-with-apache-hadoop-and-impala/
Tue, 14 Apr 2015

Apache Hadoop is becoming more and more relevant – not just for big data processing (e.g. MapReduce), but also for fast data processing (e.g. stream processing). Recently, I published two blog posts on the TIBCO blog that show how you can leverage TIBCO BusinessWorks 6 and TIBCO StreamBase to realize big data and fast data Hadoop use cases.

Real World Use Cases and Success Stories for In-Memory Data Grids
https://www.kai-waehner.de/blog/2014/11/24/real-world-use-cases-success-stories-memory-data-grids/
Mon, 24 Nov 2014

Use cases and success stories for in-memory data grids such as TIBCO ActiveSpaces, Oracle Coherence, Infinispan, IBM WebSphere eXtreme Scale, Hazelcast, Gigaspaces, GridGain and Pivotal Gemfire (presentation by Kai Wähner at NoSQL Matters 2014 in Barcelona) – and NOT SAP HANA :-)

NoSQL Matters Conference 2014

NoSQL Matters is a great conference covering a variety of NoSQL topics, with many NoSQL products and use cases presented. In November 2014, I gave a talk about “Real World Use Cases and Success Stories for In-Memory Data Grids” in Barcelona, Spain. I discussed several use cases that TIBCO customers have implemented using our in-memory data grid “TIBCO ActiveSpaces”. I will present the same content at data2day, a German big data conference in Karlsruhe.

In-Memory Data Grids: TIBCO ActiveSpaces, Oracle Coherence, Infinispan, IBM eXtreme Scale, Hazelcast, Gigaspaces, etc.

A lot of in-memory data grid products are available: TIBCO ActiveSpaces, Oracle Coherence, Infinispan, IBM WebSphere eXtreme Scale, Hazelcast, Gigaspaces, GridGain and Pivotal Gemfire, to name the most important ones. 451 Research publishes a great graphic showing different databases and how data grids fit into that landscape; you can always get the newest version of the 451 Database Landscape.

It is important to understand that an in-memory data grid offers much more than just caching and storing data in memory. Additional in-memory features include event processing, publish/subscribe messaging, ACID transactions, continuous queries and fault tolerance, to name a few. Let’s discuss one example in the next section to get a better understanding of what an in-memory data grid actually is.
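To make the “more than a cache” point concrete, here is a minimal, purely illustrative Python sketch of a key-value store with publish/subscribe eventing – the kind of feature a real data grid such as TIBCO ActiveSpaces or Hazelcast provides out of the box. All class and key names are hypothetical:

```python
# Hypothetical sketch of why a data grid is more than a cache: a tiny
# in-memory store that notifies subscribers on every put (publish/subscribe)
# instead of only answering gets.
from collections import defaultdict
from typing import Callable

class MiniGrid:
    def __init__(self):
        self._data = {}
        self._listeners = defaultdict(list)

    def subscribe(self, key: str, listener: Callable):
        """Register a callback fired whenever 'key' changes (eventing)."""
        self._listeners[key].append(listener)

    def put(self, key: str, value):
        self._data[key] = value
        for listener in self._listeners[key]:
            listener(key, value)

    def get(self, key: str):
        return self._data.get(key)

grid = MiniGrid()
events = []
grid.subscribe("stock/TIBX", lambda k, v: events.append((k, v)))
grid.put("stock/TIBX", 24.10)   # subscriber is notified immediately
grid.put("stock/TIBX", 24.35)
print(events)  # [('stock/TIBX', 24.1), ('stock/TIBX', 24.35)]
```

A plain cache would only answer the `get`; the eventing turns the store into an active component that drives application logic.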

TIBCO ActiveSpaces In-Memory Data Grid

TIBCO ActiveSpaces combines the best of NoSQL and in-memory features. The following description is taken from TIBCO’s website:

To lift the burden of big data, TIBCO ActiveSpaces provides a distributed in-memory data grid that can increase processing speed so you can reduce reliance on costly transactional systems.

ActiveSpaces EE provides an infrastructure for building highly scalable, fault-tolerant applications. It creates large virtual data caches from the aggregate memory of participating nodes, scaling automatically as nodes join and leave. Combining the features and performance of databases, caching systems, and messaging software, it supports very large, highly volatile data sets and event-driven applications. And it frees developers to focus on business logic rather than on the complexities of distributing, scaling, and making applications autonomously fault-tolerant.

ActiveSpaces EE supplies configurable replication of virtual shared memory. This means that the space autonomously re-replicates and re-distributes lost data, resulting in an active-active fault-tolerant architecture without resource overhead.

Benefits

  • Reduce Management Cost: Off-load slow, expensive, and hard-to-maintain transactional systems.
  • Deliver Ultra-Low, Predictable Latency: Use peer-to-peer communication, avoiding intervention by a central server.
  • Drastically Improve Performance: Create next-generation elastic applications including high performance computing, extreme transaction processing, and complex event processing.
  • Simplify Administration: Eliminate the complexity of implementing and configuring a distributed caching platform using a command-line administration tool with shell-like control keys that provide command history, syntax completion, and context-sensitive help.
  • Become Platform Independent: Store database rows and objects and use the system as middleware to exchange information between heterogeneous platforms.
  • Speed Development: Enable data virtualization and let developers focus on business logic rather than on the details of data implementation.
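The fault-tolerance behavior quoted above – autonomously re-replicating lost data when a node fails – can be sketched in a few lines of Python. This is a toy model with invented names, not how ActiveSpaces is implemented; it only illustrates the idea of keeping a primary and a replica copy of every entry and rebalancing after a failure:

```python
# Hypothetical sketch of active-active fault tolerance: each entry lives on
# a primary node plus one replica; when a node fails, a surviving node still
# holds every entry and the grid re-replicates it.
class ReplicatedGrid:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.storage = {n: {} for n in nodes}  # per-node key/value store

    def _owners(self, key):
        i = hash(key) % len(self.nodes)
        # primary plus the next node in the ring as replica
        return [self.nodes[i], self.nodes[(i + 1) % len(self.nodes)]]

    def put(self, key, value):
        for node in self._owners(key):
            self.storage[node][key] = value

    def get(self, key):
        for node in self._owners(key):
            if key in self.storage[node]:
                return self.storage[node][key]
        return None

    def fail(self, node):
        """Drop a node, then re-replicate every surviving entry."""
        self.storage.pop(node)
        self.nodes.remove(node)
        surviving = {}
        for store in self.storage.values():
            surviving.update(store)
        for key, value in surviving.items():
            self.put(key, value)  # re-establish two copies on survivors

grid = ReplicatedGrid(["n1", "n2", "n3"])
grid.put("order:42", "shipped")
grid.fail("n1")               # one node dies...
print(grid.get("order:42"))   # ...the data survives: shipped
```

Because every entry has two copies on distinct nodes, losing any single node never loses data, and the rebalance restores full redundancy without operator intervention.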

If you want to learn more about TIBCO ActiveSpaces, take a look at a great recording from QCon 2013 in which TIBCO Fellow Jean-Noel Moyne discusses in-memory data grids in more detail.

SAP HANA is not an In-Memory Data Grid

SAP HANA deserves an additional blog post of its own. Nevertheless, to make it clear: SAP HANA is not an in-memory data grid. This is important to mention because SAP HANA is what everybody thinks of first when talking about in-memory, right? Take a look at the 451 database landscape mentioned above: SAP HANA is placed in the “relational zone” under appliances (SAP HANA is only available as an appliance), whereas all the other products I named are placed in the “grid / cache zone”.

SAP HANA is primarily used to reduce dependency on other relational databases (e.g. Oracle). It is designed to make SAP applications run faster, not to speed up other (non-SAP) applications. SAP HANA is more like a traditional database that is meant to “run reports faster” by leveraging the large amount of RAM on the servers. It is great for some analytical use cases, e.g. faster reporting and after-the-fact analysis.

Compared to “real data grid” products such as TIBCO ActiveSpaces and the others mentioned above, SAP HANA lacks several features such as implicit eventing (publish/subscribe) or flexible, elastic deployment on commodity hardware. You can, of course, implement custom logic on SAP HANA with JavaScript or a proprietary SQL-like language (SQLScript). However, building several of the use cases in my presentation below is much more difficult with SAP HANA than with a “real data grid” product.
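To illustrate what “implicit eventing” and continuous queries mean in practice, here is a hypothetical Python sketch: a standing query is registered once and results are pushed to the client whenever matching data arrives, instead of a report being re-run on request. All class, field and query names are invented for illustration:

```python
# Hypothetical sketch of a "continuous query": instead of running a report
# on request (the classic request/response pattern), the query is registered
# once and pushes updated results whenever matching data changes.
class ContinuousQueryStore:
    def __init__(self):
        self._rows = {}
        self._queries = []          # (predicate, callback) pairs

    def register_query(self, predicate, callback):
        """Keep a standing query alive; fire callback on every new match."""
        self._queries.append((predicate, callback))

    def upsert(self, key, row):
        self._rows[key] = row
        for predicate, callback in self._queries:
            if predicate(row):
                callback(key, row)  # push, don't poll

store = ContinuousQueryStore()
hits = []
# standing query: "all rows WHERE amount > 1000"
store.register_query(lambda r: r["amount"] > 1000,
                     lambda k, r: hits.append(k))
store.upsert("tx1", {"amount": 250})
store.upsert("tx2", {"amount": 5000})   # matches -> pushed to the client
print(hits)  # ['tx2']
```

This push model is what makes data grids a natural fit for event-driven applications, and it is the pattern that a report-oriented in-memory database does not provide out of the box.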

To be clear: I am not saying that SAP HANA is a bad product. It simply serves different use cases than in-memory data grids such as TIBCO ActiveSpaces. For example, SAP HANA is great for replacing Oracle RACs as the database backend for SAP ERP systems to speed them up and improve the user experience.

Real World Use Cases and Success Stories for In-Memory Data Grids

My talk was not meant to be very technical. Instead, I discussed several real-world use cases and success stories for in-memory data grids. Here is the abstract:

NoSQL is not just about different storage alternatives such as document stores, key-value stores, graphs or column-based databases. The hardware is also getting much more important. Besides common disks and SSDs, enterprises increasingly use in-memory storage, because a distributed in-memory data grid provides very fast data access and updates. While its performance varies depending on multiple factors, it is not uncommon to be 100 times faster than a corresponding database implementation. For this reason and others described in this session, in-memory computing is a great solution for lifting the burden of big data, reducing reliance on costly transactional systems, and building highly scalable, fault-tolerant applications.

The session begins with a short introduction to in-memory computing. Afterwards, different frameworks and product alternatives for implementing in-memory solutions are discussed. Finally, the main part of the session shows several real-world use cases where in-memory computing delivers business value by supercharging the infrastructure.

Here is the slide deck:

As always, I appreciate any feedback. Please post a comment or contact me via email, Twitter, LinkedIn or Xing.
