Life Science Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/life-science/
Technology Evangelist - Big Data Analytics - Middleware - Apache Kafka

Data Streaming in Healthcare and Pharma: Use Cases and Insights from Cardinal Health https://www.kai-waehner.de/blog/2024/11/28/data-streaming-in-healthcare-and-pharma-use-cases-cardinal-health/ Thu, 28 Nov 2024 04:12:15 +0000 This blog explores Cardinal Health’s journey, showing how its event-driven architecture and data streaming power use cases like supply chain optimization and medical device and equipment management. By integrating Apache Kafka with platforms like Apigee, Dell Boomi, and SAP, Cardinal Health sets a benchmark for IT modernization and innovation in the healthcare and pharma sectors.

The post Data Streaming in Healthcare and Pharma: Use Cases and Insights from Cardinal Health appeared first on Kai Waehner.

Machine Learning and Data Science with Kafka in Healthcare https://www.kai-waehner.de/blog/2022/04/18/machine-learning-data-science-with-kafka-in-healthcare-pharma/ Mon, 18 Apr 2022 11:44:09 +0000

The post Machine Learning and Data Science with Kafka in Healthcare appeared first on Kai Waehner.

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part five: Machine Learning and Data Science. Examples include Recursion and Humana.

Machine Learning and Data Science with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains, and use cases often overlap. I categorized a few real-world deployments into different technical scenarios and added concrete examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Machine Learning and Data Science with Data Streaming using Apache Kafka

The relationship between Apache Kafka and machine learning (ML) is getting more and more traction for data engineering at scale and robust model deployment with low latency.

The Kafka ecosystem helps in different ML use cases for model training, model serving, and model monitoring. The core of most ML projects requires reliable and scalable data engineering pipelines across

  • different technologies
  • communication paradigms (REST, gRPC, data streaming)
  • programming languages (like Python for the data scientist or Java/Go/C++ for the production engineer)
  • APIs
  • commercial products
  • SaaS offerings
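To make that decoupling concrete, here is a minimal, self-contained sketch (my own illustration, not code from any referenced deployment): a scoring service consumes events from a topic, applies a model, and publishes predictions downstream. In-memory queues stand in for Kafka topics, and the "model" is a toy linear formula; in production this would be, for example, a TensorFlow model embedded in a Kafka Streams or Python application.

```python
from queue import Queue

def score(event):
    """A toy linear 'model' that turns an event into a risk score.
    In a real deployment, this would be a trained ML model loaded
    into the streaming application."""
    risk = 0.04 * event["heart_rate"] + 0.5 * event["lactate"]
    return {"patient_id": event["patient_id"], "risk": round(risk, 2)}

# In-memory queues stand in for the Kafka topics 'vitals' and 'risk-scores'.
vitals_topic, scores_topic = Queue(), Queue()

# Producer side: a device gateway publishes raw events.
for event in [
    {"patient_id": "p1", "heart_rate": 110, "lactate": 4.1},
    {"patient_id": "p2", "heart_rate": 72, "lactate": 1.0},
]:
    vitals_topic.put(event)

# Consumer side: the scoring service reads, scores, and publishes downstream.
while not vitals_topic.empty():
    scores_topic.put(score(vitals_topic.get()))

results = [scores_topic.get() for _ in range(scores_topic.qsize())]
print(results)
```

The point of the pattern is that producer and consumer never call each other directly; swapping the toy formula for a real model does not change the pipeline shape.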

Here is an architecture diagram that shows how Kafka helps in data science projects:

The beauty of Kafka is that it combines real-time data processing with extreme scalability and true decoupling between systems.

Tiered Storage adds cost-efficient storage of big data sets and replayability with guaranteed ordering.

I’ve written about this relationship between Kafka and Machine Learning in various articles:

Let’s look at a few real-world deployments for Apache Kafka and Machine Learning in the healthcare sector.

Humana – Real-Time Interoperability at the Point of Care

Humana Inc. is a for-profit American health insurance company. They leverage data streaming with Apache Kafka to improve real-time interoperability at the point of care.

The interoperability platform enables the transition from an insurance company with elements of health to a true health company with elements of insurance.

Their core principles include:

  • Consumer-centric
  • Health plan agnostic
  • Provider agnostic
  • Cloud resilient
  • Elastic scale
  • Event-driven and real-time

A critical characteristic is inter-organization data sharing (also known as data exchange).

Humana’s use cases include real-time updates of health information, for instance:

  • connecting health care providers to pharmacies
  • reducing pre-authorizations from 20-30 minutes to one minute
  • real-time home healthcare assistant communication

The Humana interoperability platform combines data streaming (= the Kafka ecosystem) with artificial intelligence and machine learning (= IBM Watson) to correlate data, train analytic models, and act on new events in real-time.

Humana’s data journey is described in this diagram from their Kafka Summit talk:

Real-Time Healthcare Insurance at Humana with Apache Kafka Data Streaming

Learn more details about Humana’s use cases and architecture in the keynote of another Kafka Summit session.

Recursion – Industrial Revolution of Drug Discovery with Kafka and Deep Learning

Recursion is a clinical-stage biotechnology company that built the “industrial revolution of drug discovery“. They decode biology by integrating technological innovations across biology, chemistry, automation, machine learning, and engineering to industrialize drug discovery.

Industrial pharma revolution - accelerate drug discovery at recursion

Kafka-powered data streaming speeds up the pharma processes significantly. Recursion has already made significant strides in accelerating drug discovery, with over 30 disease models in discovery, another nine in preclinical development, and two in clinical trials.

With serverless Confluent Cloud and the new data streaming approach, the company has built a platform that makes it possible to screen much larger experiments with thousands of compounds against hundreds of disease models in minutes, at lower cost than alternative discovery approaches.

From a technical perspective, Recursion finds drug treatments by processing biological images. A massively parallel system combines experimental biology, artificial intelligence, automation, and real-time data streaming:

Apache Kafka and Machine Learning at Recursion for Drug Discovery in Pharma

Recursion went from ‘drug discovery in manual and slow, not scalable, bursty BATCH MODE’ to ‘drug discovery in automated, scalable, reliable REAL-TIME MODE’.

Recursion leverages Dagger, an event-driven workflow and orchestration library for Kafka Streams that enables engineers to orchestrate services by defining workloads as high-level data structures. Dagger combines Kafka topics and schemas with external tasks for actions completed outside of the Kafka Streams applications.

Drug Discovery in automated, scalable, reliable real time Mode

In the meantime, Recursion did not just migrate from manual batch workloads to Kafka but also moved to serverless Kafka, leveraging Confluent Cloud to focus its resources on business problems instead of infrastructure operations.

Machine Learning and Data Science with Kafka for Intelligent Healthcare Applications

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka ecosystem for machine learning infrastructures. Real-world deployments from Humana and Recursion showed how enterprises successfully deploy Kafka together with Machine Learning frameworks like TensorFlow for use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Real Time Analytics with Apache Kafka in the Healthcare Industry https://www.kai-waehner.de/blog/2022/04/04/real-time-analytics-machine-learning-with-apache-kafka-in-the-healthcare-industry/ Mon, 04 Apr 2022 10:31:47 +0000

The post Real Time Analytics with Apache Kafka in the Healthcare Industry appeared first on Kai Waehner.

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part four: Real-Time Analytics. Examples include Cerner, Celmatix, CDC/Centers for Disease Control and Prevention.

Real Time Analytics and Machine Learning with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains, and use cases often overlap. I categorized a few real-world deployments into different technical scenarios and added concrete examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Real-Time Analytics with Apache Kafka

Real-time analytics (aka stream processing, streaming analytics, or complex event processing) is a data processing technology used to collect, store, and manage continuous data streams as they are produced or received.

Stream processing has many use cases. Examples include the backend process for claim processing, billing, logistics, manufacturing, fulfillment, or fraud detection. Data processing may need to be decoupled from the frontend, where users click buttons and expect things to happen.

The de facto standard for real-time analytics is Apache Kafka. Kafka is like a central data hub that holds shared events and keeps services in sync. Its distributed cluster technology provides availability, resiliency, and performance properties that strengthen the architecture. Programmers only need to write and deploy client applications, which run load-balanced and highly available.

Real Time Analytics with Data Streaming Stream Processing and Apache Kafka

Technologies for real-time analytics with the Kafka ecosystem include Kafka-native stream processing with Kafka Streams or ksqlDB, or 3rd party add-ons like Apache Flink, Spark Streaming, or commercial streaming analytics cloud services.

The critical difference with the Kafka ecosystem is that you leverage a single platform for data integration and processing at scale in real-time. There is no need to combine several platforms to achieve this. The result is a Kappa architecture that enables real-time but also batch workloads with a single integration architecture.
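As a minimal illustration of such stream processing, the following sketches tumbling-window counts per key, the kind of continuous aggregation that Kafka Streams' `windowedBy(...).count()` or a ksqlDB windowed query maintains. This is plain Python over a list of timestamped events standing in for the stream; event values and the window size are made up:

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def tumbling_window_counts(events):
    """Count events per (key, window start). Kafka Streams maintains the
    same state incrementally, record by record, instead of over a list."""
    counts = defaultdict(int)
    for ts_ms, key in events:
        window_start = (ts_ms // WINDOW_MS) * WINDOW_MS
        counts[(key, window_start)] += 1
    return dict(counts)

events = [
    (5_000, "claim"), (30_000, "claim"), (61_000, "claim"), (62_000, "billing"),
]
print(tumbling_window_counts(events))
# {('claim', 0): 2, ('claim', 60000): 1, ('billing', 60000): 1}
```

The same logic expressed in the Kafka Streams DSL or ksqlDB runs continuously on the unbounded stream rather than over a finite batch, which is exactly the Kappa idea described above.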

Let’s look at a few real-world deployments in the healthcare sector.

Cerner – Sepsis Alerting in Real-Time

Cerner is a supplier of health information technology services, devices, and hardware. Roughly 30% of all US healthcare data is managed in a Cerner solution.

Sepsis kills. In fact, it kills up to 52,000 people every year in the UK alone. With sepsis alerting, the key to saving lives is early identification, especially the need to administer antibiotics within that first critical ‘golden hour’. Quick alerts make a significant impact. Cerner’s sepsis alert, coupled with the care plans developed with the big room approach, means that patients are now 71% more likely to receive timely antibiotics.

Cerner leverages a Kafka-powered central event streaming platform for sepsis alerting in real-time to save lives. Legacy systems hit a wall that prevented going faster and caused missed SLAs. With Kafka, data processing progressed from minutes to seconds.
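The pattern behind such alerting can be sketched as a rule evaluated per event on the stream. The following is a hypothetical illustration with made-up SIRS-style thresholds, not Cerner's clinical logic:

```python
def sepsis_alert(vitals):
    """Fire an alert when at least two SIRS-style criteria are met.
    Thresholds are illustrative only, not Cerner's clinical logic."""
    criteria = [
        vitals["temp_c"] > 38.0 or vitals["temp_c"] < 36.0,
        vitals["heart_rate"] > 90,
        vitals["resp_rate"] > 20,
    ]
    return sum(criteria) >= 2

# A few events standing in for the vitals stream consumed from Kafka.
stream = [
    {"patient": "a", "temp_c": 38.6, "heart_rate": 104, "resp_rate": 18},
    {"patient": "b", "temp_c": 37.0, "heart_rate": 80, "resp_rate": 16},
]
alerts = [v["patient"] for v in stream if sepsis_alert(v)]
print(alerts)  # ['a']
```

In a streaming deployment, this check runs the moment each vitals event arrives, which is what turns minutes of batch latency into seconds.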

Real Time Sepsis Alerting at Cerner with Apache Kafka

Cerner is a long-term Kafka user and early adopter in the healthcare sector. Learn more about this use case in their Kafka Summit talk from 2016.

Celmatix – Reproductive Health Care

Celmatix is a preclinical-stage biotech company that provides digital tools and genetic insights focused on fertility. They offer personalized information to disrupt how women approach their lifelong reproductive health journey.

The streaming platform provides real-time aggregation of heterogeneous data collected from Electronic Medical Records (EMRs) and genetic data collected from partners through their Personalized Reproductive Medicine (PReM) Initiative.

Proactive reproductive health decisions are enabled by real-time genomics data and by applying technologies such as big data analytics, machine learning, AI, and whole-genome DNA sequencing.

Celmatix Reproductive Health Care Electronic Medical Records EMR Processing with Apache Kafka

Data governance for security and compliance is critical in such a healthcare application. “Apache Kafka and Confluent are invaluable investments to scale the way we want to and future-proof our business,” says the lead data architect at Celmatix. Learn more in the Confluent case study.

CDC – Covid-19 Electronic Lab Reporting

The Centers for Disease Control and Prevention (CDC) built Covid-19 Electronic Lab Reporting (CELR) with the Kafka ecosystem. Use cases include case notifications, lab reporting, and healthcare interoperability.

The threat of the COVID-19 virus is tracked in real-time to provide comprehensive data for local, state, and federal responses. The application allows them to better understand locations with an increase in incidence.

With the true decoupling of the data streaming platform, the CDC can rapidly aggregate, validate, transform, and distribute laboratory testing data submitted by public health departments and other partners:

Centers for Disease Control and Prevention CDC Covid Analytics with Kafka

Real-Time Analytics with Kafka for Smart Healthcare Applications at any Scale

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka ecosystem for real-time analytics. Real-world deployments from Cerner, Celmatix and the Centers for Disease Control and Prevention showed how enterprises successfully deploy Kafka for different enterprise architecture use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Legacy Modernization and Hybrid Multi-Cloud with Kafka in Healthcare https://www.kai-waehner.de/blog/2022/03/30/legacy-modernization-and-hybrid-multi-cloud-with-kafka-in-healthcare/ Wed, 30 Mar 2022 08:10:25 +0000

The post Legacy Modernization and Hybrid Multi-Cloud with Kafka in Healthcare appeared first on Kai Waehner.

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part two: Legacy modernization and hybrid multi-cloud. Examples include Optum / UnitedHealth Group, Centene, and Bayer.

Legacy Modernization and Hybrid Multi Cloud with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains, and use cases often overlap. I categorized a few real-world deployments into different technical scenarios and added concrete examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Legacy Modernization and Hybrid Multi-Cloud with Kafka

Application modernization benefits from the Apache Kafka ecosystem for hybrid integration scenarios.

Most enterprises require a reliable and scalable integration between legacy systems such as IBM Mainframe, Oracle, SAP ERP, and modern cloud-native applications like Snowflake, MongoDB Atlas, or AWS Lambda.

I already explored “architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” some time ago:

Hybrid Cloud Architecture with Apache Kafka

TL;DR: Various alternatives exist to deploy Apache Kafka across data centers, regions, and continents. There is no single best architecture. It always depends on characteristics such as RPO / RTO, SLAs, latency, throughput, etc.

Some deployments focus on on-prem to cloud integration. Others link Kafka clusters on multiple cloud providers. Technologies such as Apache Kafka’s MirrorMaker 2, Confluent Replicator, Confluent Multi-Region Clusters, and Confluent Cluster Linking help build such an infrastructure.
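For illustration, a minimal MirrorMaker 2 configuration that replicates all topics from an on-premises cluster to a cloud cluster could look like this (cluster aliases and bootstrap addresses are placeholders):

```properties
# Define the two clusters by alias
clusters = onprem, cloud
onprem.bootstrap.servers = onprem-broker:9092
cloud.bootstrap.servers = cloud-broker:9092

# Enable one-way replication from on-prem to cloud for all topics
onprem->cloud.enabled = true
onprem->cloud.topics = .*

# Replication factor for the mirrored topics on the target cluster
replication.factor = 3
```

Bi-directional setups simply enable the reverse flow (`cloud->onprem`) as well; the right topology depends on the RPO/RTO and latency requirements mentioned above.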

Let’s look at a few real-world deployments in the healthcare sector.

Optum (United Health Group) – Cloud-native Kafka-as-a-Service

Optum is an American pharmacy benefit manager and health care provider. It is a subsidiary of UnitedHealth Group. The Apache Kafka infrastructure is provided as an internal service, centrally managed, and used by over 200 internal application teams.

Optum built a repeatable, scalable, cost-efficient way to standardize data. They leverage the whole Kafka ecosystem:

  • Data ingestion from multiple resources (Kafka Connect)
  • Data enrichment (Table Joins & Streaming API)
  • Aggregation and metrics calculation (Kafka Streams API)
  • Sinking data to database (Kafka Connect)
  • Near real-time APIs to serve the data
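The enrichment step (a stream-table join) can be sketched in a few lines of plain Python. This is my own illustration with made-up fields, not Optum's code; in Kafka Streams, the lookup table would be a KTable backed by a compacted topic:

```python
# KTable-style lookup: member metadata, compacted by key.
members = {"m1": {"plan": "gold"}, "m2": {"plan": "silver"}}

# The claims stream, keyed by member id.
claims = [
    {"member_id": "m1", "amount": 120.0},
    {"member_id": "m2", "amount": 45.5},
]

# Stream-table join: enrich each claim record with member metadata,
# as a Kafka Streams stream.join(table) would do per incoming record.
enriched = [
    {**claim, "plan": members[claim["member_id"]]["plan"]}
    for claim in claims
]
print(enriched)
```

The aggregation and sink steps listed above follow the same shape: each is a small, independently deployable application reading from and writing back to Kafka topics.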

Optum’s Kafka Summit talk explored the journey and maturity curve for their data streaming evolution:

Optum United Healthcare Apache Kafka Journey

As you can see, the journey started with a self-managed Kafka cluster on-premises. Over time, they migrated to a cloud-native Kubernetes environment and built an internal Kafka-as-a-Service offering. Right now, Optum works on multi-cloud enterprise architecture to deploy across multiple cloud service providers.

Centene – Data Integration for M&A across Infrastructures

Centene is the largest Medicaid and Medicare Managed Care Provider in the US. The healthcare insurer acts as an intermediary for government-sponsored and privately insured healthcare programs. Centene’s mission is to “help people live healthier lives and to help make the health system work better for everyone”.

The critical challenge of Centene is interesting: Growth! Many mergers and acquisitions happened in the last decade: Envolve, HealthNet, Fidelis, and Wellcare.

Data integration and processing at scale in real-time between various systems, infrastructures, and cloud environments is a considerable challenge. Kafka provides Centene with valuable capabilities, as they explained in an online talk:

  • Highly scalable
  • High autonomy/decoupling
  • High availability & data resiliency
  • Real-time data transfer
  • Complex stream processing

The event-driven integration architecture leverages Apache Kafka with MongoDB:

Centene Cloud Architecture with Kafka MongoDB ETL Pipeline

Bayer – Hybrid Multi-Cloud Data Streaming

Bayer AG is a German multinational pharmaceutical and life sciences company and one of the largest pharmaceutical companies in the world. They leverage Kafka in various use cases and business domains. The following scenario is from Monsanto.

Bayer adopted a cloud-first strategy and started a multi-year transition to the cloud to provide real-time data flows across hybrid and multi-cloud infrastructures.

The Kafka-based cross-data center DataHub facilitated the migration and the shift to real-time stream processing. It has seen strong enterprise adoption and supports a myriad of use cases. The Apache Kafka ecosystem is the “middleware” to build a bi-directional streaming replication and integration architecture between on-premises data centers and multiple cloud providers:

From legacy on premise to hybrid multi cloud at Bayer with Apache Kafka

The Kafka journey of Bayer started on AWS. Afterward, some project teams worked on GCP. In parallel, DevOps and cloud-native technologies modernized the underlying infrastructure. Today, Bayer operates a multi-cloud infrastructure with mature, reliable, and scalable stream processing use cases:

Bayer AG using Apache Kafka for Hybrid Cloud Architecture and Integration

Learn about Bayer’s journey and how they built their hybrid and multi-cloud Enterprise DataHub with Apache Kafka and its ecosystem: Bayer’s Kafka Summit talk.

Data Streaming with Kafka across Hybrid and Multi-cloud Infrastructures

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the value of data streaming with Apache Kafka to modernize IT infrastructure and build hybrid multi-cloud architectures. Real-world deployments from Optum, Centene, and Bayer showed how enterprises deploy Kafka successfully for different use cases in the enterprise architecture.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka and Event Streaming in Pharma and Life Sciences https://www.kai-waehner.de/blog/2020/05/19/apache-kafka-event-streaming-pharmaceuticals-pharma-life-sciences-use-cases-architecture/ Tue, 19 May 2020 15:15:10 +0000

The post Apache Kafka and Event Streaming in Pharma and Life Sciences appeared first on Kai Waehner.

This blog post covers use cases and architectures for Apache Kafka and Event Streaming in Pharma and Life Sciences. The technical example explores drug development and discovery with real time data processing, machine learning, workflow orchestration and image / video processing.

Use Cases in Pharmaceuticals and Life Sciences for Event Streaming and Apache Kafka

The following shows some of the use cases I have seen in the field in pharma and life sciences:

Use Cases in Pharma and Life Sciences for Event Streaming and Apache Kafka

Many of them have in common that they are not new. But event streaming at scale in real time can help improve the processes and enable innovative new applications. Therefore, Apache Kafka is a perfect fit for the pharma and life science industry. Having said this, starting with a use case and goal is important to add business value:

Event Streaming in Pharma and Life Sciences - Use Cases Supporting Business Value

From a technical perspective, the next step is drilling down into technologies as you can see in the above picture. Typically, you combine different concepts like ‘data pipelines’ and ‘stream processing’ to implement the business solution.

Generate Added Value from your Data

The pharmaceutical and life science industry today has an unprecedented wealth of opportunities to generate added value from data.

These possibilities cover all relevant areas such as:

  • R&D / Engineering
  • Sales and Marketing
  • Manufacturing and Quality Assurance
  • Supply Chain
  • Product Monitoring / After Sales Support

Novel uses of data include:

  • Better therapies
  • Faster and more accurate diagnoses
  • Faster drug development
  • Improvement of clinical studies
  • Real-world data generation
  • Real-world evidence
  • Precision medicine
  • Support for remote health, etc.

Challenges:

  • Data silos
  • Integration between different technologies and communication paradigms
  • Data growth / explosion
  • Cloud / on-premises / hybrid
  • Use of new technologies like Artificial Intelligence (AI) / Machine Learning (ML)
  • Time to market
  • Regulatory affairs
  • Security
  • Performance (throughput, scale and speed)
  • Open API interfaces

Let’s now take a look at how to solve these challenges to add value from existing data…

Event Streaming for Data Processing at Scale in Real Time

Here are a few examples of Pharma and Life Sciences companies relying on event streaming with Kafka and its ecosystem:

These companies spoke on a past Kafka Summit about their use cases. Find more details in the linked slides and video recordings.

All of them have in common that the event streaming platform based on Apache Kafka is the heart of their integration and data processing infrastructure:

Event Streaming Platform for Pharma Healthcare Life Sciences

Let’s now take a look at a concrete example to go into more details.

Pharma Use Case: Drug Research and Development with Kafka

I want to cover one specific use case: Drug Discovery. Honestly, I am not an expert in this area. Therefore, I use examples from the company ‘Recursion Pharmaceutical’. They presented at a past Kafka Summit about “Drug Discovery at Scale in Real Time with Kafka Streams“.

Cost Reduction and Faster Time-to-Market

The motivation for improving the drug development process is pretty clear: Cost reduction and faster time-to-market.

Here are a few quotes from McKinsey & Company:

The Drug Research and Development Process

The process for drug discovery is long and complex:

Drug Research and Discovery Process

As you can see, the drug development process takes many years. Part of that reason is that drug discovery requires a lot of clinical studies doing data processing and analytics of big data sets.

Drug Discovery At Scale in Real Time with Kafka Streams

Recursion Pharmaceutical went from ‘drug discovery in manual and slow, not scalable, bursty BATCH MODE’ to ‘drug discovery in automated, scalable, reliable REAL TIME MODE’.

They created a massively parallel system that combines experimental biology, artificial intelligence, automation and real-time event streaming to accelerate drug discovery:

Drug Discovery in automated, scalable, reliable real time Mode

This hybrid event streaming architecture is explained in more detail in Recursion Pharmaceutical’s Kafka Summit talk.

Streaming Machine Learning in Pharma and Life Sciences with Kafka

While Recursion Pharmaceutical showed a concrete example, I want to share a more general view of such an architecture in the following…

Streaming Analytics for Drug Discovery in Real Time at Scale

The following is a possible solution to do data processing based on business rules (e.g. feature engineering or filtering) in conjunction with machine learning (e.g. image recognition using a convolutional neural network / CNN):

Streaming Analytics for Drug Discovery in Real Time at Scale

Such an infrastructure typically combines modern technologies with old, legacy interfaces (like file processing on an old Windows server). Different programming languages and tools are used in different parts of the process. It is not uncommon to see Python, Java, .NET and some proprietary tools in one single workflow.

Kafka + ksqlDB + Kafka Streams + .NET + TensorFlow

The following maps the above use case to concrete cutting-edge technologies:

Kafka, ksqlDB and TensorFlow for Drug Discovery in Real Time at Scale

What makes this architecture exciting?

  1. Data processing and correlation at scale in real time
  2. Integration with any data source and sink (no matter if real time, batch or request-response)
  3. Single machine learning pipeline for model training, scoring and monitoring
  4. No need for a data lake (but you can use one if you want or have to, of course)
  5. Combination of different technologies to solve the impedance mismatch between different teams (like Python for the data scientist, Java for the production engineer, and Tableau for the business expert)
  6. Compatibility with any external system (no matter if modern or legacy, no matter if open or proprietary, no matter if edge, on premise data center or cloud)

I did not have the time to implement this use case. But the good news is that there is a demo available showing exactly the same architecture and combination of technologies (showcasing a connected car infrastructure for data processing and analytics at scale in real time). Check out the blog and video or the GitHub project for more details.

Image / Video Processing, Workflow Orchestration, Middleware…

I want to cover a few more topics which come up regularly when I discuss Kafka use cases with customers from pharma, life sciences and other industries:

  • Image / video processing
  • Workflow orchestration
  • Middleware and integration with legacy systems

Each one is worth its own blog post, but the following will guide you into the right direction.

Image / Video Processing with Apache Kafka

Image and video processing is a very important topic in many industries. Many pharma and life sciences processes require it, too.

The key question: Can and should you do image / video processing with Kafka? Or how does this fit into the story at all?

Alternatives for Processing Large Payloads with Kafka

Several alternatives exist (and I have seen all of them in the field several times):

  1. Kafka-native image / video processing: This is absolutely doable! You can easily increase the maximum message size (from the 1 MB default to any value that makes sense for you) and stream images or videos through Kafka. I have seen a few deployments where Kafka was combined with a video buffering framework on the consumer side. Processing images from video cameras is a good example. It does not matter if you combine Kafka with image processing frameworks like OpenCV or with deep learning frameworks like TensorFlow.
  2. Split + re-assemble large messages: Big messages can be chunked into smaller messages on the producer side and aggregated again on the consumer side. This makes a lot of sense for binary data (like images). If the input is a delimited CSV file or JSON / XML, another option is to split the data up so that consumers process the individual records instead of one huge message.
  3. Metadata-only and object store: Kafka messages only contain the metadata and a link to the image / video. The actual data is stored in external storage (like an AWS S3 object store).
  4. Externalizing large payloads: Receive the large payload, but filter and externalize it before sending the data to Kafka. Kafka Connect’s SMTs (Single Message Transformations) are a great way to implement this. This enterprise integration pattern (EIP) is called the ‘Claim Check Pattern’. A source connector could receive the payload with the image, filter out and store the image in another data store, and send the payload (including an added link to the image) to Kafka. A sink connector can use an SMT similarly to load an image from the data store before sending it to another sink system.
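For option 1 (Kafka-native processing of large payloads), the size limits have to be raised consistently across broker, producer, and consumer. A sketch of the relevant Kafka configuration properties (the 10 MB value is an example, not a recommendation):

```properties
# Broker (or per topic via max.message.bytes): accept messages up to 10 MB
message.max.bytes=10485760
# Replication must be able to fetch the larger messages, too
replica.fetch.max.bytes=10485760

# Producer: allow requests up to 10 MB
max.request.size=10485760

# Consumer: allow fetching partitions containing large messages
max.partition.fetch.bytes=10485760
```

Raising only one of these settings is a common pitfall: the producer rejects the message, or the broker or consumer drops it, depending on which limit is still at its default.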

All approaches are valid and have their pros and cons.
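Option 2 (split + re-assemble) can be sketched in a few lines of plain Python. The list stands in for a Kafka topic, and the chunk envelope (payload id, sequence number, total count) is an assumed convention, not a built-in Kafka feature:

```python
# Sketch of the split + re-assemble pattern for large payloads.
# Chunks carry (id, seq, total) so the consumer side can put the
# payload back together; the list stands in for a Kafka topic.

CHUNK_SIZE = 4  # tiny on purpose; real deployments use e.g. 512 KB

def split(payload_id, data, chunk_size=CHUNK_SIZE):
    """Producer side: cut the payload into tagged chunk messages."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [{"id": payload_id, "seq": i, "total": len(chunks), "data": c}
            for i, c in enumerate(chunks)]

def reassemble(messages):
    """Consumer side: group chunks by payload id and stitch each
    payload back together once all of its chunks have arrived."""
    buffers, results = {}, {}
    for msg in messages:
        parts = buffers.setdefault(msg["id"], {})
        parts[msg["seq"]] = msg["data"]
        if len(parts) == msg["total"]:
            results[msg["id"]] = b"".join(parts[i] for i in range(msg["total"]))
    return results

topic = split("scan-42", b"a large binary image payload")
assert reassemble(topic)["scan-42"] == b"a large binary image payload"
```

A production version would additionally handle out-of-order delivery across partitions (e.g., by keying all chunks of one payload to the same partition) and time out incomplete payloads.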

Large Payload Handling at LinkedIn

LinkedIn gave a great presentation on this topic in 2016. Here are their trade-offs for sending large messages via Kafka vs. sending just the reference link:

Large Message Support vs Split Messages in Apache Kafka for Image and Video Processing

Please keep in mind that this presentation is from 2016. Kafka and its ecosystem have improved a lot since then. Infrastructures have also changed a lot regarding scalability and cost. Therefore, find the right architecture and cost structure for your use case!

UPDATE 2020: I wrote a blog post about the current status of processing large messages with Kafka. Check it out for the latest capabilities and use cases.
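Options 3 and 4 above boil down to the same idea: externalize the payload and ship only metadata plus a reference through Kafka. A minimal sketch of this claim-check flow, with plain dicts standing in for the object store (e.g., S3) and the Kafka topic; all names are illustrative:

```python
# Sketch of the Claim Check pattern: the large payload is stored in an
# object store, and only metadata plus a link travel through Kafka.

import uuid

object_store = {}   # stands in for S3 or another object store
topic = []          # stands in for a Kafka topic

def produce(image_bytes, metadata):
    """Source side: externalize the payload, publish metadata + link."""
    key = f"images/{uuid.uuid4()}"
    object_store[key] = image_bytes
    topic.append({**metadata, "image_ref": key})

def consume(message):
    """Sink side: resolve the link back to the actual payload."""
    return object_store[message["image_ref"]], message

produce(b"\x89PNG...fake bytes", {"sample": "S-17", "modality": "microscopy"})
payload, msg = consume(topic[0])
assert payload.startswith(b"\x89PNG")
```

With Kafka Connect, the same externalize / resolve steps would live in SMTs on the source and sink connectors, so neither the Kafka cluster nor the stream processing applications ever see the large binary payload.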

Workflow Orchestration of Pharma Processes with Apache Kafka

Business processes are often complex. Some can be fully automated. Others need human interaction. In short, there are two approaches for Workflow Orchestration in a Kafka infrastructure:

  • Kafka-native Workflow Orchestration: The orchestration is implemented within a Kafka application. This is pretty straightforward for streaming data, but more challenging for long-running processes with human interaction. For the latter, you should build a nice UI on top of the streaming app. Dagger is a dynamic real-time stream processing framework based on Kafka Streams for task assignment and workflow management. Swisscom also presented their own Kafka Streams based orchestration engine at a Kafka Meetup in Zurich in 2019.
  • External Business Process Management (BPM) Tool: Plenty of workflow orchestration tools and BPM engines exist on the market. Both open source and proprietary. Just to give one example, Zeebe is a modern, scalable open source workflow engine. It actually provides a Kafka Connect connector to easily combine event streaming with the orchestration engine.

The advantage of Kafka-native workflow orchestration is that there is only one infrastructure to operate 24/7. But if that is not sufficient, or you want a nice, pre-built UI, nothing prevents you from combining Kafka with an external workflow orchestration tool.
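The Kafka-native approach can be illustrated with a tiny event-driven state machine. This is a sketch, not a production workflow engine; the states, events, and the sample-approval process are invented for illustration:

```python
# Sketch of Kafka-native workflow orchestration: a small event-driven
# state machine for a (hypothetical) sample-approval process. Events
# from a Kafka topic drive the transitions; a human-interaction step
# simply waits until the corresponding event arrives.

TRANSITIONS = {
    ("RECEIVED", "quality_check_passed"): "QC_DONE",
    ("QC_DONE", "approved_by_scientist"): "APPROVED",   # human step
    ("QC_DONE", "rejected_by_scientist"): "REJECTED",   # human step
}

def run_workflow(events, state="RECEIVED"):
    """Consume events in order and advance the workflow state.
    Events that do not match the current state are ignored (in a
    real deployment they may belong to other workflow instances)."""
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state

print(run_workflow(["quality_check_passed", "approved_by_scientist"]))
# → APPROVED
```

In a Kafka Streams implementation, the per-instance state would live in a state store keyed by workflow id, which is what makes long-running processes with human interaction possible without an external database.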

Integration with Middleware and Legacy Systems like Mainframes or Old ERP Systems

I pointed this out above already, but want to highlight it again in its own section: Apache Kafka is a great technology for deploying a modern, scalable, reliable middleware. In pharma and life sciences, many different technologies, protocols, interfaces and communication paradigms have to be integrated with each other, from mainframe and batch systems to modern big data analytics platforms and real-time event streaming applications.

Kafka and its ecosystem are a perfect fit:

  • Building a scalable 24/7 middleware infrastructure with real time processing, zero downtime, zero data loss and integration to legacy AND modern technologies, databases and applications
  • Integration with existing legacy middleware (ESB, ETL, MQ)
  • Replacement of proprietary integration platforms
  • Offloading from expensive systems like Mainframes

The following shows how you can leverage the Strangler Design Pattern to integrate and (partly) replace legacy systems like mainframes:

Middleware and Legacy Integration with Apache Kafka and Event Streaming
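In code, the strangler idea reduces to a routing layer that sends each event either to the legacy system or to the new event-streaming pipeline, with the set of migrated functionality growing over time. A minimal sketch, with lists standing in for the two downstream systems (all names are illustrative):

```python
# Sketch of the Strangler pattern for legacy offloading: a router
# dispatches each event to the legacy system or to the new
# Kafka-based pipeline, depending on which record types have
# already been migrated off the mainframe.

migrated_types = {"orders"}  # grows as functionality is migrated

legacy_handled, kafka_handled = [], []

def route(event):
    if event["type"] in migrated_types:
        kafka_handled.append(event)   # new event-streaming pipeline
    else:
        legacy_handled.append(event)  # still served by the mainframe

for e in [{"type": "orders", "id": 1},
          {"type": "billing", "id": 2},
          {"type": "orders", "id": 3}]:
    route(e)

print(len(kafka_handled), len(legacy_handled))  # → 2 1
```

Each time another record type is migrated, it is simply added to `migrated_types`; the mainframe is "strangled" incrementally, with no big-bang cutover and no change visible to upstream producers.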

If you think about using the Kafka ecosystem in your pharma or life science projects, please check out my blogs, slides and videos about Apache Kafka vs. Middleware (MQ, ETL, ESB) and “Mainframe Offloading and Replacement with Apache Kafka”.

Slides and Video Recording for Kafka in Pharma and Life Sciences

I created some slides and a video recording discussing Apache Kafka and Machine Learning in Pharma and Life Sciences. Check it out:

Slides:

Video recording:


Generate Added Value with Kafka in Pharma and Life Sciences Industry

The pharmaceutical and life science industry today has an unprecedented wealth of opportunities to generate added value from data. Apache Kafka and Event Streaming are a perfect fit. This includes scalable big data pipelines, machine learning for real time analytics, image / video processing, and workflow orchestration.

What are your experiences in pharma and life science projects? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss!

The post Apache Kafka and Event Streaming in Pharma and Life Sciences appeared first on Kai Waehner.
