Life Science Archives - Kai Waehner
https://www.kai-waehner.de/blog/category/life-science/
Technology Evangelist - Big Data Analytics - Middleware - Apache Kafka

Data Streaming in Healthcare and Pharma: Use Cases and Insights from Cardinal Health https://www.kai-waehner.de/blog/2024/11/28/data-streaming-in-healthcare-and-pharma-use-cases-cardinal-health/ Thu, 28 Nov 2024 04:12:15 +0000 This blog explores Cardinal Health’s journey, showing how its event-driven architecture and data streaming power use cases like supply chain optimization and medical device and equipment management. By integrating Apache Kafka with platforms like Apigee, Dell Boomi, and SAP, Cardinal Health sets a benchmark for IT modernization and innovation in the healthcare and pharma sectors.

The post Data Streaming in Healthcare and Pharma: Use Cases and Insights from Cardinal Health appeared first on Kai Waehner.

Machine Learning and Data Science with Kafka in Healthcare https://www.kai-waehner.de/blog/2022/04/18/machine-learning-data-science-with-kafka-in-healthcare-pharma/ Mon, 18 Apr 2022 11:44:09 +0000

The post Machine Learning and Data Science with Kafka in Healthcare appeared first on Kai Waehner.

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part five: Machine Learning and Data Science. Examples include Recursion and Humana.

Machine Learning and Data Science with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains, and use cases often overlap. I categorized a few real-world deployments into different technical scenarios and added concrete examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Machine Learning and Data Science with Data Streaming using Apache Kafka

The relationship between Apache Kafka and machine learning (ML) is getting more and more traction for data engineering at scale and robust model deployment with low latency.

The Kafka ecosystem helps in different ML use cases for model training, model serving, and model monitoring. The core of most ML projects requires reliable and scalable data engineering pipelines across

  • different technologies
  • communication paradigms (REST, gRPC, data streaming)
  • programming languages (like Python for the data scientist or Java/Go/C++ for the production engineer)
  • APIs
  • commercial products
  • SaaS offerings
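To make that decoupling concrete, here is a minimal, self-contained sketch (my own illustration, not code from any referenced deployment): a scoring service consumes events from a topic, applies a model, and publishes predictions downstream. In-memory queues stand in for Kafka topics, and the "model" is a toy linear formula; in production this would be, for example, a TensorFlow model embedded in a Kafka Streams or Python application.

```python
from queue import Queue

def score(event):
    """A toy linear 'model' that turns an event into a risk score.
    In a real deployment, this would be a trained ML model loaded
    into the streaming application."""
    risk = 0.04 * event["heart_rate"] + 0.5 * event["lactate"]
    return {"patient_id": event["patient_id"], "risk": round(risk, 2)}

# In-memory queues stand in for the Kafka topics 'vitals' and 'risk-scores'.
vitals_topic, scores_topic = Queue(), Queue()

# Producer side: a device gateway publishes raw events.
for event in [
    {"patient_id": "p1", "heart_rate": 110, "lactate": 4.1},
    {"patient_id": "p2", "heart_rate": 72, "lactate": 1.0},
]:
    vitals_topic.put(event)

# Consumer side: the scoring service reads, scores, and publishes downstream.
while not vitals_topic.empty():
    scores_topic.put(score(vitals_topic.get()))

results = [scores_topic.get() for _ in range(scores_topic.qsize())]
print(results)
```

The point of the pattern is that producer and consumer never call each other directly; swapping the toy formula for a real model does not change the pipeline shape.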

Here is an architecture diagram that shows how Kafka helps in data science projects:

The beauty of Kafka is that it combines real-time data processing with extreme scalability and true decoupling between systems.

Tiered Storage adds cost-efficient storage of big data sets and replayability with guaranteed ordering.

I’ve written about this relationship between Kafka and Machine Learning in various articles:

Let’s look at a few real-world deployments for Apache Kafka and Machine Learning in the healthcare sector.

Humana – Real-Time Interoperability at the Point of Care

Humana Inc. is a for-profit American health insurance company. They leverage data streaming with Apache Kafka to improve real-time interoperability at the point of care.

The interoperability platform enables the transition from an insurance company with elements of health to a true health company with elements of insurance.

Their core principles include:

  • Consumer-centric
  • Health plan agnostic
  • Provider agnostic
  • Cloud resilient
  • Elastic scale
  • Event-driven and real-time

A critical characteristic is inter-organization data sharing (also known as data exchange).

Humana’s use cases include real-time updates of health information, for instance:

  • connecting health care providers to pharmacies
  • reducing pre-authorizations from 20-30 minutes to one minute
  • real-time home healthcare assistant communication

The Humana interoperability platform combines data streaming (= the Kafka ecosystem) with artificial intelligence and machine learning (= IBM Watson) to correlate data, train analytic models, and act on new events in real-time.

Humana’s data journey is described in this diagram from their Kafka Summit talk:

Real-Time Healthcare Insurance at Humana with Apache Kafka Data Streaming

Learn more details about Humana’s use cases and architecture in the keynote of another Kafka Summit session.

Recursion – Industrial Revolution of Drug Discovery with Kafka and Deep Learning

Recursion is a clinical-stage biotechnology company that built the “industrial revolution of drug discovery“. They decode biology by integrating technological innovations across biology, chemistry, automation, machine learning, and engineering to industrialize drug discovery.

Industrial pharma revolution - accelerate drug discovery at recursion

Kafka-powered data streaming speeds up the pharma processes significantly. Recursion has already made significant strides in accelerating drug discovery, with over 30 disease models in discovery, another nine in preclinical development, and two in clinical trials.

With serverless Confluent Cloud and the new data streaming approach, the company has built a platform that makes it possible to screen much larger experiments with thousands of compounds against hundreds of disease models in minutes, at lower cost than alternative discovery approaches.

From a technical perspective, Recursion finds drug treatments by processing biological images. A massively parallel system combines experimental biology, artificial intelligence, automation, and real-time data streaming:

Apache Kafka and Machine Learning at Recursion for Drug Discovery in Pharma

Recursion went from ‘drug discovery in manual and slow, not scalable, bursty BATCH MODE’ to ‘drug discovery in automated, scalable, reliable REAL-TIME MODE’.

Recursion leverages Dagger, an event-driven workflow and orchestration library for Kafka Streams that enables engineers to orchestrate services by defining workloads as high-level data structures. Dagger combines Kafka topics and schemas with external tasks for actions completed outside of the Kafka Streams applications.

Drug Discovery in automated, scalable, reliable real time Mode

In the meantime, Recursion did not just migrate from manual batch workloads to Kafka but also moved to serverless Kafka, leveraging Confluent Cloud to focus its resources on business problems instead of infrastructure operations.

Machine Learning and Data Science with Kafka for Intelligent Healthcare Applications

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka ecosystem for machine learning infrastructures. Real-world deployments from Humana and Recursion showed how enterprises successfully deploy Kafka together with Machine Learning frameworks like TensorFlow for use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Real Time Analytics with Apache Kafka in the Healthcare Industry https://www.kai-waehner.de/blog/2022/04/04/real-time-analytics-machine-learning-with-apache-kafka-in-the-healthcare-industry/ Mon, 04 Apr 2022 10:31:47 +0000

The post Real Time Analytics with Apache Kafka in the Healthcare Industry appeared first on Kai Waehner.

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part four: Real-Time Analytics. Examples include Cerner, Celmatix, CDC/Centers for Disease Control and Prevention.

Real Time Analytics and Machine Learning with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains, and use cases often overlap. I categorized a few real-world deployments into different technical scenarios and added concrete examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Real-Time Analytics with Apache Kafka

Real-time analytics (aka stream processing, streaming analytics, or complex event processing) is a data processing technology used to collect, store, and manage continuous data streams as they are produced or received.

Stream processing has many use cases. Examples include the backend process for claim processing, billing, logistics, manufacturing, fulfillment, or fraud detection. Data processing may need to be decoupled from the frontend, where users click buttons and expect things to happen.

The de facto standard for real-time analytics is Apache Kafka. Kafka is like a central data hub that holds shared events and keeps services in sync. Its distributed cluster technology provides availability, resiliency, and performance properties that strengthen the architecture. Programmers only need to write and deploy client applications, which run load-balanced and highly available.

Real Time Analytics with Data Streaming Stream Processing and Apache Kafka

Technologies for real-time analytics with the Kafka ecosystem include Kafka-native stream processing with Kafka Streams or ksqlDB, or 3rd party add-ons like Apache Flink, Spark Streaming, or commercial streaming analytics cloud services.

The critical difference with the Kafka ecosystem is that you leverage a single platform for data integration and processing at scale in real-time. There is no need to combine several platforms to achieve this. The result is a Kappa architecture that enables real-time but also batch workloads with a single integration architecture.
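As a minimal illustration of such stream processing, the following sketches tumbling-window counts per key, the kind of continuous aggregation that Kafka Streams' `windowedBy(...).count()` or a ksqlDB windowed query maintains. This is plain Python over a list of timestamped events standing in for the stream; event values and the window size are made up:

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def tumbling_window_counts(events):
    """Count events per (key, window start). Kafka Streams maintains the
    same state incrementally, record by record, instead of over a list."""
    counts = defaultdict(int)
    for ts_ms, key in events:
        window_start = (ts_ms // WINDOW_MS) * WINDOW_MS
        counts[(key, window_start)] += 1
    return dict(counts)

events = [
    (5_000, "claim"), (30_000, "claim"), (61_000, "claim"), (62_000, "billing"),
]
print(tumbling_window_counts(events))
# {('claim', 0): 2, ('claim', 60000): 1, ('billing', 60000): 1}
```

The same logic expressed in the Kafka Streams DSL or ksqlDB runs continuously on the unbounded stream rather than over a finite batch, which is exactly the Kappa idea described above.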

Let’s look at a few real-world deployments in the healthcare sector.

Cerner – Sepsis Alerting in Real-Time

Cerner is a supplier of health information technology services, devices, and hardware. Roughly 30% of all US healthcare data is managed in a Cerner solution.

Sepsis kills. In fact, it kills up to 52,000 people every year in the UK alone. With sepsis alerting, the key to saving lives is early identification, especially the need to administer antibiotics within that first critical ‘golden hour’. Quick alerts make a significant impact. Cerner’s sepsis alert, coupled with the care plans developed with the big room approach, means that patients are now 71% more likely to receive timely antibiotics.

Cerner leverages a Kafka-powered central event streaming platform for sepsis alerting in real-time to save lives. Legacy systems hit a wall that prevented going faster and caused missed SLAs. With Kafka, data processing progressed from minutes to seconds.
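The pattern behind such alerting can be sketched as a rule evaluated per event on the stream. The following is a hypothetical illustration with made-up SIRS-style thresholds, not Cerner's clinical logic:

```python
def sepsis_alert(vitals):
    """Fire an alert when at least two SIRS-style criteria are met.
    Thresholds are illustrative only, not Cerner's clinical logic."""
    criteria = [
        vitals["temp_c"] > 38.0 or vitals["temp_c"] < 36.0,
        vitals["heart_rate"] > 90,
        vitals["resp_rate"] > 20,
    ]
    return sum(criteria) >= 2

# A few events standing in for the vitals stream consumed from Kafka.
stream = [
    {"patient": "a", "temp_c": 38.6, "heart_rate": 104, "resp_rate": 18},
    {"patient": "b", "temp_c": 37.0, "heart_rate": 80, "resp_rate": 16},
]
alerts = [v["patient"] for v in stream if sepsis_alert(v)]
print(alerts)  # ['a']
```

In a streaming deployment, this check runs the moment each vitals event arrives, which is what turns minutes of batch latency into seconds.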

Real Time Sepsis Alerting at Cerner with Apache Kafka

Cerner is a long-term Kafka user and early adopter in the healthcare sector. Learn more about this use case in their Kafka Summit talk from 2016.

Celmatix – Reproductive Health Care

Celmatix is a preclinical-stage biotech company that provides digital tools and genetic insights focused on fertility. They offer personalized information to disrupt how women approach their lifelong reproductive health journey.

The streaming platform provides real-time aggregation of heterogeneous data collected from Electronic Medical Records (EMRs) and genetic data collected from partners through their Personalized Reproductive Medicine (PReM) Initiative.

Proactive reproductive health decisions are enabled by real-time genomics data and by applying technologies such as big data analytics, machine learning, AI, and whole-genome DNA sequencing.

Celmatix Reproductive Health Care Electronic Medical Records EMR Processing with Apache Kafka

Data governance for security and compliance is critical in such a healthcare application. “Apache Kafka and Confluent are invaluable investments to scale the way we want to and future-proof our business,” says the lead data architect at Celmatix. Learn more in the Confluent case study.

CDC – Covid-19 Electronic Lab Reporting

The Centers for Disease Control and Prevention (CDC) built Covid-19 Electronic Lab Reporting (CELR) with the Kafka ecosystem. Use cases include case notifications, lab reporting, and healthcare interoperability.

The threat of the COVID-19 virus is tracked in real-time to provide comprehensive data for local, state, and federal responses. The application allows them to better understand locations with an increase in incidence.

With the true decoupling of the data streaming platform, the CDC can rapidly aggregate, validate, transform, and distribute laboratory testing data submitted by public health departments and other partners:

Centers for Disease Control and Prevention CDC Covid Analytics with Kafka

Real-Time Analytics with Kafka for Smart Healthcare Applications at any Scale

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the capabilities of the Apache Kafka ecosystem for real-time analytics. Real-world deployments from Cerner, Celmatix and the Centers for Disease Control and Prevention showed how enterprises successfully deploy Kafka for different enterprise architecture use cases.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Legacy Modernization and Hybrid Multi-Cloud with Kafka in Healthcare https://www.kai-waehner.de/blog/2022/03/30/legacy-modernization-and-hybrid-multi-cloud-with-kafka-in-healthcare/ Wed, 30 Mar 2022 08:10:25 +0000

The post Legacy Modernization and Hybrid Multi-Cloud with Kafka in Healthcare appeared first on Kai Waehner.

IT modernization and innovative new technologies change the healthcare industry significantly. This blog series explores how data streaming with Apache Kafka enables real-time data processing and business process automation. Real-world examples show how traditional enterprises and startups increase efficiency, reduce cost, and improve the human experience across the healthcare value chain, including pharma, insurance, providers, retail, and manufacturing. This is part two: Legacy modernization and hybrid multi-cloud. Examples include Optum / UnitedHealth Group, Centene, and Bayer.

Legacy Modernization and Hybrid Multi Cloud with Apache Kafka in Healthcare

Blog Series – Kafka in Healthcare

Many healthcare companies leverage Kafka today. Use cases exist in every domain across the healthcare value chain. Most companies deploy data streaming in different business domains, and use cases often overlap. I categorized a few real-world deployments into different technical scenarios and added concrete examples:

Stay tuned for a dedicated blog post for each of these topics as part of this blog series. I will link the blogs here as soon as they are available (in the next few weeks). Subscribe to my newsletter to get an email after each publication (no spam or ads).

Legacy Modernization and Hybrid Multi-Cloud with Kafka

Application modernization benefits from the Apache Kafka ecosystem for hybrid integration scenarios.

Most enterprises require a reliable and scalable integration between legacy systems such as IBM Mainframe, Oracle, SAP ERP, and modern cloud-native applications like Snowflake, MongoDB Atlas, or AWS Lambda.

I already explored “architecture patterns for distributed, hybrid, edge and global Apache Kafka deployments” some time ago:

Hybrid Cloud Architecture with Apache Kafka

TL;DR: Various alternatives exist to deploy Apache Kafka across data centers, regions, and continents. There is no single best architecture. It always depends on characteristics such as RPO / RTO, SLAs, latency, throughput, etc.

Some deployments focus on on-prem to cloud integration. Others link Kafka clusters on multiple cloud providers. Technologies such as Apache Kafka’s MirrorMaker 2, Confluent Replicator, Confluent Multi-Region Clusters, and Confluent Cluster Linking help build such an infrastructure.
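For illustration, a minimal MirrorMaker 2 configuration that replicates all topics from an on-premises cluster to a cloud cluster could look like this (cluster aliases and bootstrap addresses are placeholders):

```properties
# Define the two clusters by alias
clusters = onprem, cloud
onprem.bootstrap.servers = onprem-broker:9092
cloud.bootstrap.servers = cloud-broker:9092

# Enable one-way replication from on-prem to cloud for all topics
onprem->cloud.enabled = true
onprem->cloud.topics = .*

# Replication factor for the mirrored topics on the target cluster
replication.factor = 3
```

Bi-directional setups simply enable the reverse flow (`cloud->onprem`) as well; the right topology depends on the RPO/RTO and latency requirements mentioned above.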

Let’s look at a few real-world deployments in the healthcare sector.

Optum (United Health Group) – Cloud-native Kafka-as-a-Service

Optum is an American pharmacy benefit manager and health care provider. It is a subsidiary of UnitedHealth Group. The Apache Kafka infrastructure is provided as an internal service, centrally managed, and used by over 200 internal application teams.

Optum built a repeatable, scalable, cost-efficient way to standardize data. They leverage the whole Kafka ecosystem:

  • Data ingestion from multiple resources (Kafka Connect)
  • Data enrichment (Table Joins & Streaming API)
  • Aggregation and metrics calculation (Kafka Streams API)
  • Sinking data to database (Kafka Connect)
  • Near real-time APIs to serve the data
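The enrichment step (a stream-table join) can be sketched in a few lines of plain Python. This is my own illustration with made-up fields, not Optum's code; in Kafka Streams, the lookup table would be a KTable backed by a compacted topic:

```python
# KTable-style lookup: member metadata, compacted by key.
members = {"m1": {"plan": "gold"}, "m2": {"plan": "silver"}}

# The claims stream, keyed by member id.
claims = [
    {"member_id": "m1", "amount": 120.0},
    {"member_id": "m2", "amount": 45.5},
]

# Stream-table join: enrich each claim record with member metadata,
# as a Kafka Streams stream.join(table) would do per incoming record.
enriched = [
    {**claim, "plan": members[claim["member_id"]]["plan"]}
    for claim in claims
]
print(enriched)
```

The aggregation and sink steps listed above follow the same shape: each is a small, independently deployable application reading from and writing back to Kafka topics.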

Optum’s Kafka Summit talk explored the journey and maturity curve for their data streaming evolution:

Optum United Healthcare Apache Kafka Journey

As you can see, the journey started with a self-managed Kafka cluster on-premises. Over time, they migrated to a cloud-native Kubernetes environment and built an internal Kafka-as-a-Service offering. Right now, Optum works on multi-cloud enterprise architecture to deploy across multiple cloud service providers.

Centene – Data Integration for M&A across Infrastructures

Centene is the largest Medicaid and Medicare Managed Care Provider in the US. The healthcare insurer acts as an intermediary for government-sponsored and privately insured healthcare programs. Centene’s mission is to “help people live healthier lives and to help make the health system work better for everyone”.

The critical challenge of Centene is interesting: Growth! Many mergers and acquisitions happened in the last decade: Envolve, HealthNet, Fidelis, and Wellcare.

Data integration and processing at scale in real-time between various systems, infrastructures, and cloud environments is a considerable challenge. Kafka provides Centene with valuable capabilities, as they explained in an online talk:

  • Highly scalable
  • High autonomy/decoupling
  • High availability & data resiliency
  • Real-time data transfer
  • Complex stream processing

The event-driven integration architecture leverages Apache Kafka with MongoDB:

Centene Cloud Architecture with Kafka MongoDB ETL Pipeline

Bayer – Hybrid Multi-Cloud Data Streaming

Bayer AG is a German multinational pharmaceutical and life sciences company and one of the largest pharmaceutical companies in the world. They leverage Kafka in various use cases and business domains. The following scenario is from Monsanto.

Bayer adopted a cloud-first strategy and started a multi-year transition to the cloud to provide real-time data flows across hybrid and multi-cloud infrastructures.

The Kafka-based cross-data center DataHub facilitated the migration and the shift to real-time stream processing. It has seen strong enterprise adoption and supports a myriad of use cases. The Apache Kafka ecosystem is the “middleware” to build a bi-directional streaming replication and integration architecture between on-premises data centers and multiple cloud providers:

From legacy on premise to hybrid multi cloud at Bayer with Apache Kafka

The Kafka journey of Bayer started on AWS. Afterward, some project teams worked on GCP. In parallel, DevOps and cloud-native technologies modernized the underlying infrastructure. Today, Bayer operates a multi-cloud infrastructure with mature, reliable, and scalable stream processing use cases:

Bayer AG using Apache Kafka for Hybrid Cloud Architecture and Integration

Learn about Bayer’s journey and how they built their hybrid and multi-cloud Enterprise DataHub with Apache Kafka and its ecosystem: Bayer’s Kafka Summit talk.

Data Streaming with Kafka across Hybrid and Multi-cloud Infrastructures

Think about IoT sensor analytics, cybersecurity, patient communication, insurance, research, and many other domains. Real-time data beats slow data in the healthcare supply chain almost everywhere.

This blog post explored the value of data streaming with Apache Kafka to modernize IT infrastructure and build hybrid multi-cloud architectures. Real-world deployments from Optum, Centene, and Bayer showed how enterprises deploy Kafka successfully for different use cases in the enterprise architecture.

How do you leverage data streaming with Apache Kafka in the healthcare industry? What architecture does your platform use? Which products do you combine with data streaming? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.

Apache Kafka and Event Streaming in Pharma and Life Sciences https://www.kai-waehner.de/blog/2020/05/19/apache-kafka-event-streaming-pharmaceuticals-pharma-life-sciences-use-cases-architecture/ Tue, 19 May 2020 15:15:10 +0000

The post Apache Kafka and Event Streaming in Pharma and Life Sciences appeared first on Kai Waehner.

This blog post covers use cases and architectures for Apache Kafka and Event Streaming in Pharma and Life Sciences. The technical example explores drug development and discovery with real time data processing, machine learning, workflow orchestration and image / video processing.

Use Cases in Pharmaceuticals and Life Sciences for Event Streaming and Apache Kafka

The following shows some of the use cases I have seen in the field in pharma and life sciences:

Use Cases in Pharma and Life Sciences for Event Streaming and Apache Kafka

Many of them have in common that they are not new. But event streaming at scale in real time can help improve the processes and enable innovative new applications. Therefore, Apache Kafka is a perfect fit for the pharma and life science industry. Having said this, starting with a use case and goal is important to add business value:

Event Streaming in Pharma and Life Sciences - Use Cases Supporting Business Value

From a technical perspective, the next step is drilling down into technologies as you can see in the above picture. Typically, you combine different concepts like ‘data pipelines’ and ‘stream processing’ to implement the business solution.

Generate Added Value from your Data

The pharmaceutical and life science industry today has an unprecedented wealth of opportunities to generate added value from data.

These possibilities cover all relevant areas such as:

  • R&D / Engineering
  • Sales and Marketing
  • Manufacturing and Quality Assurance
  • Supply Chain
  • Product Monitoring / After Sales Support

Novel uses of data include:

  • Better therapies
  • Faster and more accurate diagnoses
  • Faster drug development
  • Improvement of clinical studies
  • Real-world data generation
  • Real-world evidence
  • Precision medicine
  • Support for remote health, etc.

Challenges:

  • Data silos
  • Integration between different technologies and communication paradigms
  • Data growth / explosion
  • Cloud / on-premises / hybrid
  • Use of new technologies like Artificial Intelligence (AI) / Machine Learning (ML)
  • Time to market
  • Regulatory affairs
  • Security
  • Performance (throughput, scale and speed)
  • Open API interfaces

Let’s now take a look at how to solve these challenges to add value from existing data…

Event Streaming for Data Processing at Scale in Real Time

Here are a few examples of Pharma and Life Sciences companies relying on event streaming with Kafka and its ecosystem:

These companies spoke on a past Kafka Summit about their use cases. Find more details in the linked slides and video recordings.

All of them have in common that the event streaming platform based on Apache Kafka is the heart of their integration and data processing infrastructure:

Event Streaming Platform for Pharma Healthcare Life Sciences

Let’s now take a look at a concrete example to go into more details.

Pharma Use Case: Drug Research and Development with Kafka

I want to cover one specific use case: Drug Discovery. Honestly, I am not an expert in this area. Therefore, I use examples from the company ‘Recursion Pharmaceutical’. They presented at a past Kafka Summit about “Drug Discovery at Scale in Real Time with Kafka Streams“.

Cost Reduction and Faster Time-to-Market

The motivation for improving the drug development process is pretty clear: Cost reduction and faster time-to-market.

Here are a few quotes from McKinsey & Company:

The Drug Research and Development Process

The process for drug discovery is long and complex:

Drug Research and Discovery Process

As you can see, the drug development process takes many years. Part of that reason is that drug discovery requires a lot of clinical studies doing data processing and analytics of big data sets.

Drug Discovery At Scale in Real Time with Kafka Streams

Recursion Pharmaceutical went from ‘drug discovery in manual and slow, not scalable, bursty BATCH MODE’ to ‘drug discovery in automated, scalable, reliable REAL TIME MODE’.

They created a massively parallel system that combines experimental biology, artificial intelligence, automation and real-time event streaming to accelerate drug discovery:

Drug Discovery in automated, scalable, reliable real time Mode

This hybrid event streaming architecture is explained in more detail in Recursion Pharmaceutical’s Kafka Summit talk.

Streaming Machine Learning in Pharma and Life Sciences with Kafka

While Recursion Pharmaceutical showed a concrete example, I want to share a more general view of such an architecture in the following…

Streaming Analytics for Drug Discovery in Real Time at Scale

The following is a possible solution to do data processing based on business rules (e.g. feature engineering or filtering) in conjunction with machine learning (e.g. image recognition using a convolutional neural network / CNN):

Streaming Analytics for Drug Discovery in Real Time at Scale

Such an infrastructure typically combines modern technologies with old, legacy interfaces (like file processing on an old Windows server). Different programming languages and tools are used in different parts of the process. It is not uncommon to see Python, Java, .NET and some proprietary tools in one single workflow.

Kafka + ksqlDB + Kafka Streams + .NET + TensorFlow

The following maps the above use case to concrete cutting-edge technologies:

Kafka, ksqlDB and TensorFlow for Drug Discovery in Real Time at Scale

What makes this architecture exciting?

  1. Data processing and correlation at scale in real time
  2. Integration with any data source and sink (no matter if real time, batch or request-response)
  3. Single machine learning pipeline for model training, scoring and monitoring
  4. No need for a data lake (but you can use one if you want or have to, of course)
  5. Combination of different technologies to solve the impedance mismatch between different teams (like Python for the data scientist, Java for the production engineer, and Tableau for the business expert)
  6. Compatibility with any external system (no matter if modern or legacy, no matter if open or proprietary, no matter if edge, on premise data center or cloud)

I did not have the time to implement this use case. But the good news is that there is a demo available showing exactly the same architecture and combination of technologies (showcasing a connected car infrastructure for data processing and analytics at scale in real time). Check out the blog and video or the GitHub project for more details.

Image / Video Processing, Workflow Orchestration, Middleware…

I want to cover a few more topics which come up regularly when I discuss Kafka use cases with customers from pharma, life sciences and other industries:

  • Image / video processing
  • Workflow orchestration
  • Middleware and integration with legacy systems

Each one is worth its own blog post, but the following will guide you into the right direction.

Image / Video Processing with Apache Kafka

Image and video processing is a very important topic in many industries. Many pharma and life sciences processes require it, too.

The key question: Can and should you do image / video processing with Kafka? Or how does this fit into the story at all?

Alternatives for Processing Large Payloads with Kafka

Several alternatives exist (and I have seen all of them in the field several times):

  1. Kafka-native image / video processing: This is absolutely doable! You can easily increase the maximum message size (from the 1 MB default to any value that makes sense for you) and stream images or videos through Kafka. I have seen a few deployments where Kafka was combined with a video buffering framework on the consumer side. Processing images from video cameras is a good example. It does not matter if you combine Kafka with image processing frameworks like OpenCV or with deep learning frameworks like TensorFlow.
  2. Split + re-assemble large messages: Big messages can be chunked into smaller messages on the producer side and aggregated again on the consumer side. This makes a lot of sense for binary data (like images). If the input is a delimited CSV file or JSON / XML, another option is to split the data up so that consumers process the individual records instead of one huge message.
  3. Metadata-only and object store: Kafka messages only contain the metadata and a link to the image / video. The actual data is stored in external storage (like an AWS S3 object store).
  4. Externalizing large payloads: Receive the large payload, but filter and externalize it before sending the data to Kafka. Kafka Connect’s SMTs (Single Message Transformations) are a great way to implement this. This enterprise integration pattern (EIP) is called the ‘Claim Check Pattern’. A source connector could receive the payload with the image, filter out and store the image in another data store, and send the payload (including an added link to the image) to Kafka. A sink connector can use an SMT similarly to load an image from the data store before sending it to another sink system.
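For option 1 (Kafka-native processing of large payloads), the size limits have to be raised consistently across broker, producer, and consumer. A sketch of the relevant Kafka configuration properties (the 10 MB value is an example, not a recommendation):

```properties
# Broker (or per topic via max.message.bytes): accept messages up to 10 MB
message.max.bytes=10485760
# Replication must be able to fetch the larger messages, too
replica.fetch.max.bytes=10485760

# Producer: allow requests up to 10 MB
max.request.size=10485760

# Consumer: allow fetching partitions containing large messages
max.partition.fetch.bytes=10485760
```

Raising only one of these settings is a common pitfall: the producer rejects the message, or the broker or consumer drops it, depending on which limit is still at its default.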

All approaches are valid and have their pros and cons.
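Option 2 (split + re-assemble) can be sketched in a few lines of plain Python. The list stands in for a Kafka topic, and the chunk envelope (payload id, sequence number, total count) is an assumed convention, not a built-in Kafka feature:

```python
# Sketch of the split + re-assemble pattern for large payloads.
# Chunks carry (id, seq, total) so the consumer side can put the
# payload back together; the list stands in for a Kafka topic.

CHUNK_SIZE = 4  # tiny on purpose; real deployments use e.g. 512 KB

def split(payload_id, data, chunk_size=CHUNK_SIZE):
    """Producer side: cut the payload into tagged chunk messages."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [{"id": payload_id, "seq": i, "total": len(chunks), "data": c}
            for i, c in enumerate(chunks)]

def reassemble(messages):
    """Consumer side: group chunks by payload id and stitch each
    payload back together once all of its chunks have arrived."""
    buffers, results = {}, {}
    for msg in messages:
        parts = buffers.setdefault(msg["id"], {})
        parts[msg["seq"]] = msg["data"]
        if len(parts) == msg["total"]:
            results[msg["id"]] = b"".join(parts[i] for i in range(msg["total"]))
    return results

topic = split("scan-42", b"a large binary image payload")
assert reassemble(topic)["scan-42"] == b"a large binary image payload"
```

A production version would additionally handle out-of-order delivery across partitions (e.g., by keying all chunks of one payload to the same partition) and time out incomplete payloads.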

Large Payload Handling at LinkedIn

LinkedIn gave a great presentation on this topic in 2016. Here are their trade-offs for sending large messages via Kafka vs. sending just the reference link:

Large Message Support vs Split Messages in Apache Kafka for Image and Video Processing

Please keep in mind that this presentation is from 2016. Kafka and its ecosystem have improved a lot since then. Infrastructures have also changed a lot regarding scalability and cost. Therefore, find the right architecture and cost structure for your use case!

UPDATE 2020: I wrote a blog post about the current status of processing large messages with Kafka. Check it out for the latest capabilities and use cases.
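Options 3 and 4 above boil down to the same idea: externalize the payload and ship only metadata plus a reference through Kafka. A minimal sketch of this claim-check flow, with plain dicts standing in for the object store (e.g., S3) and the Kafka topic; all names are illustrative:

```python
# Sketch of the Claim Check pattern: the large payload is stored in an
# object store, and only metadata plus a link travel through Kafka.

import uuid

object_store = {}   # stands in for S3 or another object store
topic = []          # stands in for a Kafka topic

def produce(image_bytes, metadata):
    """Source side: externalize the payload, publish metadata + link."""
    key = f"images/{uuid.uuid4()}"
    object_store[key] = image_bytes
    topic.append({**metadata, "image_ref": key})

def consume(message):
    """Sink side: resolve the link back to the actual payload."""
    return object_store[message["image_ref"]], message

produce(b"\x89PNG...fake bytes", {"sample": "S-17", "modality": "microscopy"})
payload, msg = consume(topic[0])
assert payload.startswith(b"\x89PNG")
```

With Kafka Connect, the same externalize / resolve steps would live in SMTs on the source and sink connectors, so neither the Kafka cluster nor the stream processing applications ever see the large binary payload.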

Workflow Orchestration of Pharma Processes with Apache Kafka

Business processes are often complex. Some can be fully automated. Others need human interaction. In short, there are two approaches for Workflow Orchestration in a Kafka infrastructure:

  • Kafka-native Workflow Orchestration: The orchestration is implemented within a Kafka application. This is pretty straightforward for streaming data, but more challenging for long-running processes with human interaction. For the latter, you should build a nice UI on top of the streaming app. Dagger is a dynamic real-time stream processing framework based on Kafka Streams for task assignment and workflow management. Swisscom also presented their own Kafka Streams based orchestration engine at a Kafka Meetup in Zurich in 2019.
  • External Business Process Management (BPM) Tool: Plenty of workflow orchestration tools and BPM engines exist on the market. Both open source and proprietary. Just to give one example, Zeebe is a modern, scalable open source workflow engine. It actually provides a Kafka Connect connector to easily combine event streaming with the orchestration engine.

The advantage of Kafka-native workflow orchestration is that there is only one infrastructure to operate 24/7. But if that is not sufficient, or you want a nice, pre-built UI, nothing prevents you from combining Kafka with an external workflow orchestration tool.
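The Kafka-native approach can be illustrated with a tiny event-driven state machine. This is a sketch, not a production workflow engine; the states, events, and the sample-approval process are invented for illustration:

```python
# Sketch of Kafka-native workflow orchestration: a small event-driven
# state machine for a (hypothetical) sample-approval process. Events
# from a Kafka topic drive the transitions; a human-interaction step
# simply waits until the corresponding event arrives.

TRANSITIONS = {
    ("RECEIVED", "quality_check_passed"): "QC_DONE",
    ("QC_DONE", "approved_by_scientist"): "APPROVED",   # human step
    ("QC_DONE", "rejected_by_scientist"): "REJECTED",   # human step
}

def run_workflow(events, state="RECEIVED"):
    """Consume events in order and advance the workflow state.
    Events that do not match the current state are ignored (in a
    real deployment they may belong to other workflow instances)."""
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state

print(run_workflow(["quality_check_passed", "approved_by_scientist"]))
# → APPROVED
```

In a Kafka Streams implementation, the per-instance state would live in a state store keyed by workflow id, which is what makes long-running processes with human interaction possible without an external database.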

Integration with Middleware and Legacy Systems like Mainframes or Old ERP Systems

I pointed this out above already, but want to highlight it again in its own section: Apache Kafka is a great technology for deploying a modern, scalable, reliable middleware. In pharma and life sciences, many different technologies, protocols, interfaces and communication paradigms have to be integrated with each other, from mainframe and batch systems to modern big data analytics platforms and real-time event streaming applications.

Kafka and its ecosystem are a perfect fit:

  • Building a scalable 24/7 middleware infrastructure with real time processing, zero downtime, zero data loss and integration to legacy AND modern technologies, databases and applications
  • Integration with existing legacy middleware (ESB, ETL, MQ)
  • Replacement of proprietary integration platforms
  • Offloading from expensive systems like Mainframes

The following shows how you can leverage the Strangler Design Pattern to integrate and (partly) replace legacy systems like mainframes:

Middleware and Legacy Integration with Apache Kafka and Event Streaming
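In code, the strangler idea reduces to a routing layer that sends each event either to the legacy system or to the new event-streaming pipeline, with the set of migrated functionality growing over time. A minimal sketch, with lists standing in for the two downstream systems (all names are illustrative):

```python
# Sketch of the Strangler pattern for legacy offloading: a router
# dispatches each event to the legacy system or to the new
# Kafka-based pipeline, depending on which record types have
# already been migrated off the mainframe.

migrated_types = {"orders"}  # grows as functionality is migrated

legacy_handled, kafka_handled = [], []

def route(event):
    if event["type"] in migrated_types:
        kafka_handled.append(event)   # new event-streaming pipeline
    else:
        legacy_handled.append(event)  # still served by the mainframe

for e in [{"type": "orders", "id": 1},
          {"type": "billing", "id": 2},
          {"type": "orders", "id": 3}]:
    route(e)

print(len(kafka_handled), len(legacy_handled))  # → 2 1
```

Each time another record type is migrated, it is simply added to `migrated_types`; the mainframe is "strangled" incrementally, with no big-bang cutover and no change visible to upstream producers.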

If you think about using the Kafka ecosystem in your pharma or life science projects, please check out my blogs, slides and videos about Apache Kafka vs. Middleware (MQ, ETL, ESB) and “Mainframe Offloading and Replacement with Apache Kafka”.

Slides and Video Recording for Kafka in Pharma and Life Sciences

I created some slides and a video recording discussing Apache Kafka and Machine Learning in Pharma and Life Sciences. Check it out:

Slides:

Video recording:


Generate Added Value with Kafka in Pharma and Life Sciences Industry

The pharmaceutical and life science industry today has an unprecedented wealth of opportunities to generate added value from data. Apache Kafka and Event Streaming are a perfect fit. This includes scalable big data pipelines, machine learning for real time analytics, image / video processing, and workflow orchestration.

What are your experiences in pharma and life science projects? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss!

The post Apache Kafka and Event Streaming in Pharma and Life Sciences appeared first on Kai Waehner.
