Pharma’s Data Problem



Drs David Higgins and Marco F Schmidt highlight the gap between the abundant discussion of technological innovation in healthcare and its implementation, and explain why pharma needs to develop its own proprietary approaches to crunching ‘wide data’.


Data is the new oil, but biomedical data – despite its abundance – is closer to fracking than to the Arabian sands. Pharmaceutical companies have abundant access to data and a nuanced in-house understanding of it, but struggle to leverage that knowledge cost-effectively.


The world’s largest companies are converging on the data-in-healthcare business. From pharma, Novartis CEO Vas Narasimhan wants to turn his company into a medicine and data science company. Meanwhile, numerous companies, such as Roche, have been highly acquisitive in the data science space. From the consumer tech side, Apple wants to become a healthcare company in the next 10 years. Alphabet subsidiary Verily Life Sciences is scaling up massively. And market analysts predict that Amazon will shortly try to occupy the role currently served by Walgreens, CVS and local health centers, capturing the US consumer healthcare space.


Rather than discussing whether data is the new oil, perhaps the more pertinent question is, why is there still a gap between discussions of tech innovation and actual tech innovation in healthcare?


Are data and artificial intelligence (AI) in pharma just hype? No. Because pharma operates in a highly regulated environment, unlike the consumer tech market, its core business model of selling drugs is unlikely to be “disrupted”. There is, however, a paradigm shift in healthcare: from “break & fix” to “predict & prevent”. Pharmaceutical companies therefore need to invest more in disease prediction, such as diagnostics or clinical decision support, to get reimbursement for their therapeutics. And this can only be done with new AI methods. However, keep in mind: prediction and decision support are just add-ons to pharma’s core product.


For pharma, the real benefit of data and current AI techniques, more accurately called machine learning (ML) or statistical AI, lies in drug development. Drug development is expensive. The success rate of a drug candidate from clinical Phase I to approval is less than 10 percent. In particular, the efficacy of a drug candidate in clinical Phase II or III cannot be predicted. Nelson et al. from GSK reported in 2015 that molecules that bind to a drug target with genetic support are twice as likely to be approved. Thus, biomedical data and ML may help predict the clinical success of a drug molecule.


Nevertheless, there is still a problem: biomedical data is often unstructured and high-dimensional. Data from different clinical trials, collected under different patient consents, remains poorly consolidated and harmonized. Most importantly, biomedical data is rarely big data.


Current machine learning, or statistical AI, works extremely well for applications with large numbers of subjects and low numbers of features that are measured and trained against – the more users Facebook or Netflix has, the better its content recommendations become. It would be more accurate to categorize this as deep data, in keeping with deep neural networks, rather than big data. Biomedical data in general, and genomic data in particular, is by comparison more appropriately characterized as high-dimensional or wide data. You typically see hundreds to thousands of subjects (or fewer), each of whom has millions of features associated with them. Statistical AI cannot reliably infer anything from such a data set. It is almost guaranteed that any discovery will in fact be a false positive. Pharmaceutical companies do not actually have a data problem; they have a wide data problem.
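The false-positive risk in wide data can be illustrated with a small simulation. The sketch below is not taken from any real biomedical data set: the subject count, feature count and significance level are assumptions chosen for illustration, and every feature is pure random noise by construction. It tests each feature for correlation with a random outcome, using the Fisher z approximation, and counts how many look "significant" with and without a Bonferroni correction.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_hits(n_subjects=100, n_features=2000, alpha=0.05, seed=0):
    """Count noise features that pass a naive and a Bonferroni threshold.

    All sizes are illustrative assumptions. No feature is truly related
    to the outcome, so every hit is a false positive.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]

    # Two-sided critical values on the standard normal scale.
    z_naive = NormalDist().inv_cdf(1 - alpha / 2)
    z_bonf = NormalDist().inv_cdf(1 - alpha / (2 * n_features))

    naive = bonf = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_subjects)]
        r = pearson_r(feature, outcome)
        # Fisher z-transform: approximately standard normal under no correlation.
        z = abs(math.atanh(r)) * math.sqrt(n_subjects - 3)
        naive += z > z_naive
        bonf += z > z_bonf
    return naive, bonf


naive_hits, bonferroni_hits = count_hits()
print(f"naive 'discoveries' among noise features: {naive_hits}")
print(f"after Bonferroni correction: {bonferroni_hits}")
```

With an uncorrected 5 percent threshold, roughly 5 percent of the noise features are expected to pass, so the number of spurious hits grows in proportion to the number of features. Correcting for multiple testing removes most of them, but at the price of a threshold so strict that genuine weak signals in a small cohort would be missed as well.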


For this reason, pharma cannot use established statistical AI approaches from consumer tech 1-to-1. Pharma needs to develop its own approaches for wide data. Biomedical AI will involve hybrid modeling approaches. Specific knowledge of the biomedical systems generating the data will be incorporated into the backbone of models (e.g. Physics Informed Neural Networks). This will boost the signal in the data, allowing statistical methods to be used for much lower sample sizes and much broader feature sets (wide data). Learning from financial modeling, a low-risk approach is to take an ensemble of such approaches. Later, such models will be coupled in decision support systems to provide systems-level analytics of pharmacological drug function. Importantly, such solutions will contribute to platform approaches; the days of isolated data scientists, whose direct impact is not measurable, are coming to an end.
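A toy example can show why building domain knowledge into a model helps at low sample sizes. The sketch below is not a Physics Informed Neural Network and not the authors' method. It assumes a hypothetical drug whose concentration follows first-order elimination, C(t) = C0 · exp(−k·t), with made-up parameter values, five noisy samples, and a later time point to predict. It compares a model that encodes the exponential form with a generic polynomial that knows nothing about the system.

```python
import math
import random

# Hypothetical setup: all values below are illustrative assumptions.
TIMES = [1.0, 2.0, 4.0, 6.0, 8.0]  # sampling times
C0_TRUE, K_TRUE = 10.0, 0.3        # "true" initial concentration and rate
T_NEW = 12.0                       # unseen time point to predict


def true_conc(t):
    """First-order elimination: C(t) = C0 * exp(-k * t)."""
    return C0_TRUE * math.exp(-K_TRUE * t)


def fit_mechanistic(ts, cs):
    """Knowledge-informed fit: least-squares line through (t, log c)."""
    logs = [math.log(c) for c in cs]
    n = len(ts)
    mt, ml = sum(ts) / n, sum(logs) / n
    slope = (sum((t - mt) * (l - ml) for t, l in zip(ts, logs))
             / sum((t - mt) ** 2 for t in ts))
    intercept = ml - slope * mt
    return lambda t: math.exp(intercept + slope * t)


def fit_polynomial(ts, cs):
    """Generic fit: Lagrange polynomial passing through every sample."""
    def predict(t):
        total = 0.0
        for i, (ti, ci) in enumerate(zip(ts, cs)):
            basis = 1.0
            for j, tj in enumerate(ts):
                if j != i:
                    basis *= (t - tj) / (ti - tj)
            total += ci * basis
        return total
    return predict


def mean_abs_errors(n_trials=200, noise_sd=0.05, seed=0):
    """Average absolute prediction error at T_NEW over simulated studies."""
    rng = random.Random(seed)
    target = true_conc(T_NEW)
    err_mech = err_poly = 0.0
    for _ in range(n_trials):
        # Multiplicative (log-normal) measurement noise keeps values positive.
        obs = [true_conc(t) * math.exp(rng.gauss(0, noise_sd)) for t in TIMES]
        err_mech += abs(fit_mechanistic(TIMES, obs)(T_NEW) - target)
        err_poly += abs(fit_polynomial(TIMES, obs)(T_NEW) - target)
    return err_mech / n_trials, err_poly / n_trials


mech_error, poly_error = mean_abs_errors()
print(f"mechanistic model, mean abs error at t={T_NEW}: {mech_error:.3f}")
print(f"generic polynomial, mean abs error at t={T_NEW}: {poly_error:.3f}")
```

The mechanistic model has only two free parameters because the known structure does the rest of the work, so a handful of samples is enough to pin it down. The polynomial spends all five samples on fitting the noise and extrapolates poorly. Real hybrid models are far more elaborate, but the trade-off they exploit is the same.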


Today, the greatest use of AI-type approaches in pharma is at either end of the drug pipeline. Efforts are being made to consolidate different datasets to increase the sample size. Nonetheless, the wide data problem remains: even a consolidated dataset is highly unlikely to contain enough labeled data for a deep data approach to deliver reliable performance. However, approaches which are conscious of the requirements of a wide data context will shine.

