Drug development has been plummeting in productivity over the last 70 years. Learn how improving preclinical models is the key to solving the challenge at the heart of drug development.

“The more positive anyone is about the past several decades of progress [in pharmaceutical development], the more negative they should be about the strength of countervailing forces.”  These foreboding words were penned in a seminal 2012 article by Jack Scannell, author of Eroom’s Law, in an effort to illuminate the drug development industry’s productivity crisis.

Productivity measures how efficient drug development is, often presented as the number of drugs that can be brought to market given a set amount of effort or investment. Consider pharmaceutical development in the 1950s, for example: Data presented by Scannell and his co-authors showed that, with the contemporary equivalent of US$1 billion, the industry was able to produce around 30 new drugs. In contrast, that same $1 billion investment in 2023 would not produce even one new therapeutic (see Figure 1).

Figure 1: Graph showing the change in R&D efficiency since 1950.

That is a substantial decrease in productivity and one that many in the industry are concerned about. Whether it’s to treat neurodegenerative disease, stem the spread of infectious agents, or fight cancer, there is a persistent need among patients for new and innovative therapeutics. When productivity is low, developers face steeper costs, and progress slows. As a result, drug costs increase for patients, who are left waiting in desperate need of therapeutic relief.

Diagnosing the various factors that have led to our current productivity crisis (the “countervailing forces”) has been a challenge that Scannell and many others have worked to overcome. Through their efforts, many potential causes have been identified, one of which stands out as profoundly impactful: the limited accuracy of preclinical models.

Preclinical Drug Development 

Preclinical drug development is highly speculative: Researchers are tasked with foretelling how compounds will behave in the human body and ultimately identifying the select few that are both safe and therapeutic. To do this, they rely on model systems that serve as proxies for the human body, with none serving a more prominent role than non-human animal models.

Rodents, primates, and other non-human animal models have long been the gold standard in preclinical toxicology screening. With complex and interconnected tissues, these animals allow researchers to test the effect of their drug in a dynamic system that resembles the human body. As such, animals have been positioned as the last filter in the drug development process, charged with the difficult job of weeding out toxic drugs before they enter clinical trials.

Despite their ubiquity in drug development, ample evidence indicates that animal models are far from perfect. Approximately 90% of drugs entering clinical trials fail, with roughly 30% of those failures attributed to unforeseen toxicity. Such abundant failure indicates that, at minimum, animal models alone are insufficient decision-making tools—too often, they get it wrong, and both patients and drug developers pay the price.

The cost of this failure plays a central role in the current productivity crisis. Though researchers now have access to next-generation sequencing, combinatorial chemistry, and automation, drug development costs have increased nearly 80-fold since 1950 to a staggering $2.3B per approved drug. And approximately 75% of the pharmaceutical industry’s drug development costs can be attributed to development failures.

It stands to reason that reducing clinical trial failure rates will improve the efficiency of drug development. Not only are failed trials expensive, but they also take up clinical resources that could otherwise be used to advance successful drugs. Since clinical trial failure rates reflect the quality of drugs that enter trials, improving the quality of these drugs should improve the industry’s overall productivity.

So how should researchers go about doing this? Revisiting his seminal work a decade later, Scannell provided powerful guidance: The quality of the compounds that enter clinical trials is a consequence of the preclinical models used to select them, and even small improvements in the quality of the preclinical models (more specifically, their predictive validity) can have a substantial impact on productivity. Enter more human-relevant preclinical models like the Liver-Chip.

Improving Productivity with Organ-Chips

In a recently published study, Emulate scientists showed that the Liver-Chip—a specialized Organ-Chip that mimics the human liver—can identify compounds’ potential to cause drug-induced liver injury (DILI) far more accurately than traditional in vitro and animal models.

Briefly, Organ-Chips are three-dimensional culture systems that combine heterogeneous cell culture, fluid flow, and several features of the tissue microenvironment to mimic human organ function in an in vitro setting. Evidence indicates that human cells cultured in Organ-Chips behave remarkably similarly to their in vivo counterparts. Among many promising applications, these chips are particularly well suited for preclinical toxicology screening.

In their study, the Emulate researchers found the Liver-Chip to be a highly sensitive and specific tool for detecting hepatotoxic compounds. In particular, the Liver-Chip showed a sensitivity of 87% and specificity of 100% against a series of drugs that had progressed into the clinic after being tested in animal models, only to later be revealed as toxic when given to patients. Therefore, these drugs well represent the current gap in preclinical toxicology testing, through which some hepatotoxic drug candidates evade detection and advance into clinical trials.

If the Liver-Chip can fill the gap left by animal models, Scannell’s framework suggests that the Liver-Chip could profoundly affect the industry’s productivity by reducing the number of safety-related clinical trial failures.

To calculate how this reduction may impact industry productivity, Emulate researchers teamed up with Jack Scannell to build an economic value model. This analysis showed that applying the Liver-Chip in all small-molecule drug development programs could generate $3 billion annually for the industry as a result of improved productivity. This is approximately $150M per top pharmaceutical company. And that’s just for the Liver-Chip. In addition to hepatotoxicity, cardiovascular, neurological, immunological, and gastrointestinal toxicities are among the most common reasons clinical trials fail. If Organ-Chips can be developed to reduce these clinical trial failures with a similar 87% sensitivity, the resulting uplift in productivity could generate $24 billion for the industry annually, or roughly $750M to $1B per top pharmaceutical company.

It is immediately evident that, even when the cost of integrating and running Liver-Chip experiments is accounted for, the cost savings from reducing clinical trial failures are substantial. Moreover, the freed-up clinical bandwidth would permit advancing other, more promising compounds. The Emulate researchers’ work demonstrates that improving productivity in drug development is possible, and it starts with developing better models. As the industry embraces the potential of Organ-on-a-Chip technology and continues to explore its application in various areas of drug development, there is hope for a future with improved productivity and faster delivery of life-saving therapeutics.

It isn’t enough that models identify toxic drugs; they must also avoid mistaking safe drugs for dangerous ones. Read on to learn about the importance of specificity in the preclinical stages of drug development.

Drug-induced liver injury (DILI) has been a persistent threat to drug development for decades. Animals like rats, dogs, and monkeys are meant to be a last line of defense against DILI, catching the toxic effects that drugs could have before they reach humans. Yet, differences between species severely limit these models, and the consequences of this gap are borne out in halted clinical trials and even patient deaths.  

Put simply, there is a translational gap between our current preclinical models and the patients who rely on them.

The Emulate Liver-Chip—an advanced, three-dimensional culture system that mimics human liver tissue—was designed to help fill this gap. The Liver-Chip allows scientists to observe potential drug effects on human liver tissue and, in turn, better predict which drug candidates are likely to cause DILI (see Figure 1).  

Figure 1: Liver-Chip cross-section.

Such a preclinical model could be extremely useful in preventing candidate drugs with a hepatic liability from reaching patients, but exactly how useful depends on how sensitive and specific it is.  

Measuring Preclinical Model Accuracy 

In preclinical drug development, a wide range of model systems—including animals, spheroids, and Organ-Chips—are used for decision-making. Researchers rely on these models to help them determine which drug candidates should advance into clinical trials. Whether or not scientists make the right decision depends largely on the quality of the models they use. And that quality is measured as both sensitivity and specificity. 

In this context, “sensitivity” describes how often a model correctly identifies a toxic drug candidate as toxic. So, a model with 100% sensitivity would flag every harmful drug candidate. In contrast, “specificity” describes how often a model correctly identifies a non-toxic drug candidate as non-toxic. A 100%-specific model would never claim that a non-toxic candidate is toxic. It’s important to note that a model can be 100% sensitive without being very specific. For example, an overeager model that calls most candidates toxic may capture all toxic candidates (100% sensitivity) but also mislabel many non-toxic candidates as toxic (mediocre specificity).
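The two definitions above can be made concrete with a short sketch. The counts below are invented for illustration and are not taken from any study; they simply reproduce the "overeager model" scenario, in which every toxic candidate is caught but many safe ones are flagged as well.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly toxic candidates that the model flags as toxic."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of truly non-toxic candidates that the model clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical screen of 10 toxic and 10 non-toxic candidates (invented numbers).
# The "overeager" model flags all 10 toxic candidates but also 6 of the safe ones.
overeager_sens = sensitivity(true_pos=10, false_neg=0)
overeager_spec = specificity(true_neg=4, false_pos=6)

print(f"sensitivity = {overeager_sens:.0%}")  # sensitivity = 100%
print(f"specificity = {overeager_spec:.0%}")  # specificity = 40%
```

In this toy case the model looks perfect on sensitivity alone, yet it would wrongly stop 6 of 10 safe candidates.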

In an ideal world, preclinical drug development would use models that are 100% sensitive and 100% specific. Unfortunately, no model is perfect. Approximately 90% of drugs entering clinical trials fail, with many failing due to toxicity issues. This alone suggests that there is a strong need for more accurate decision-making tools.  

The Give and Take of Sensitivity and Specificity

Researchers want preclinical toxicology models with the best sensitivity possible, as higher sensitivity means more successful clinical trials, safer patients, and better economics. However, this cannot come at the cost of failing good drugs. An overly sensitive model with a low threshold for what it considers “toxic” would catch all toxic drugs, but it may also catch drugs that are, in reality, safe and effective in humans. Good drugs are rare, and a lot of effort and investment goes into their development. Even one drug that fails to make it to the clinic can end up costing pharmaceutical companies billions and leave a patient population without treatment. Models should do their utmost to classify non-toxic compounds as such—that is, to have 100% specificity. 

But how can drug development insist on perfect specificity when no model is perfect? Fortunately, there is a give-and-take between sensitivity and specificity that model developers can take advantage of: one can be traded for the other. 

In decision analysis, sensitivity and specificity can be “dialed in” for the model in question. In most cases, this involves setting a threshold in the analysis of the model’s output. In a recent study published in Communications Medicine, part of Nature Portfolio, the Emulate team set a threshold of 375 on the Liver-Chip’s quantitative output; in the case of hepatic spheroids, an older model system, researchers have set a threshold of 50. In both cases, the higher the thresholds, the more sensitive and less specific the model tends to be. These thresholds were selected precisely to dial the systems into 100% specificity. 
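A minimal sketch of how such a threshold trades sensitivity against specificity is shown here. It rests on stated assumptions: each compound gets one numeric score, a compound is flagged as toxic when its score falls below the threshold (a rule chosen so that raising the threshold increases sensitivity, matching the direction described above), and all scores and labels are invented. None of this is the study's actual scoring method or data.

```python
def sens_spec(scores, is_toxic, threshold):
    """Return (sensitivity, specificity) for one threshold.

    Assumed rule: a compound is flagged as toxic when score < threshold.
    """
    flagged = [s < threshold for s in scores]
    pairs = list(zip(flagged, is_toxic))
    tp = sum(f and t for f, t in pairs)
    fn = sum((not f) and t for f, t in pairs)
    tn = sum((not f) and (not t) for f, t in pairs)
    fp = sum(f and (not t) for f, t in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Invented scores for illustration only; these are not study data.
scores = [5, 20, 40, 90, 150, 60, 120, 300, 500]
is_toxic = [True, True, True, True, True, False, False, False, False]

for threshold in (30, 60, 100, 200):
    sens, spec = sens_spec(scores, is_toxic, threshold)
    print(f"threshold={threshold:>3}  sensitivity={sens:.0%}  specificity={spec:.0%}")
# threshold= 30  sensitivity=40%  specificity=100%
# threshold= 60  sensitivity=60%  specificity=100%
# threshold=100  sensitivity=80%  specificity=75%
# threshold=200  sensitivity=100%  specificity=50%
```

With these toy numbers, 60 is the highest tested threshold that still keeps specificity at 100%, which is the kind of operating point the text describes as "dialing in" perfect specificity.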

This is why Ewart et al.’s findings are so striking. Even while maintaining such a strict specificity, the Liver-Chip achieved a staggering 87% sensitivity. This means that, on top of correctly identifying most of the toxic drugs, the Liver-Chip never misidentified a non-toxic drug in the study as toxic. For drug developers, this means that no good drugs—nor the considerable resources poured into their development—would be wasted. Using models like the Liver-Chip that achieve high sensitivity alongside perfect specificity would allow drug developers to deprioritize potentially dangerous drugs without sacrificing good drugs. In all, this could lead to more productive drug development pipelines, safer drugs progressing to clinical trials, and more patient lives saved.


Learn about the importance of sensitivity in drug development and why researchers can improve their pipelines by using more sensitive models.

In a recently published study in Communications Medicine, part of Nature Portfolio, Emulate scientists reported that the human Liver-Chip (an advanced, three-dimensional culture system that mimics human liver tissue) could have prevented more than 240 deaths and 10 liver transplants caused by the drugs in its test set. Specifically, the study demonstrated that the Emulate human Liver-Chip could be used to identify a candidate drug’s likelihood of causing drug-induced liver injury (DILI), a leading cause of safety-related clinical trial failure and market withdrawals around the world [1,2].

Any preclinical model that helps prevent toxic drugs from reaching patients would be extremely useful, but exactly how much so depends in part on its sensitivity. In the context of predictive toxicology, sensitivity is a measure of how well a model can identify a drug candidate’s toxicity. For example, a model with 0% sensitivity would miss every toxic drug it encounters, whereas one with perfect 100% sensitivity would never miss.

Ewart et al. concluded that the Emulate human Liver-Chip could profoundly affect preclinical drug development because it showed a high sensitivity of 87%—specifically for identifying drugs that were cleared for clinical trials after being tested in both animals and in vitro systems, but ultimately proved hepatotoxic in humans. In other words, the Liver-Chip identified DILI risk for nearly 7 out of every 8 hepatotoxic drugs it encountered.  

That’s striking—particularly given the drugs that were being tested: Each had previously made it into the clinic, meaning they were deemed safe enough to administer to humans based on rigorous preclinical testing. While the specifics of preclinical testing may differ for each drug, each includes testing in at least two animal species per regulatory guidelines to progress a drug candidate into clinical testing. Despite this, 22 of the drug candidates went on to be proven toxic in patients. Because animal models failed to adequately forecast the harm that these drugs could—and did—bring to human patients, more than 240 people lost their lives, and 10 were forced to undergo emergency liver transplantation.  

Expressed in the language of preclinical toxicology, because the tested drugs made their way through animal testing, animals served as the “reference” in the study, meaning a 0% sensitivity for this drug set—a strong contrast to the human Liver-Chip’s 87%. While it’s important to bear in mind that sensitivity will depend upon the reference set of drugs—as it did here—the juxtaposition of these numbers tells an important story about modern drug development and its ability to protect patients. But to appreciate this story, it’s necessary to scrutinize the concept of model sensitivity and the assumptions underlying it.

The Assumption of Sensitivity 

As mentioned above, sensitivity measures how often a model system correctly identifies a toxic drug as toxic or, conversely, how rarely it incorrectly marks a toxic drug as non-toxic. If the model system allows a toxic drug to pass through without a strong indication of harm, it will support a false conclusion that the drug is non-toxic, a result known as a “false negative.”

False-negatives can have dire consequences, enabling harmful drug candidates to reach the patient’s bedside through clinical trials. To avoid this, researchers have long sought to test candidate drugs in model systems that closely approximate the human body and prevent bad drugs from reaching the market. And for more than 80 years, animals have filled that need. However, animals are far from perfect. Genetic and physiological differences can produce nuanced yet significant discrepancies in how animals respond to drugs—a drug that appears safe in rats may turn out to be lethal in humans. Such a result would be described as a false-negative.  

As roughly 90% of drugs that enter clinical trials fail, many due to safety concerns, it is clear that animal models are far from 100% sensitive [3].

So how sensitive are they? The short answer is that, although animals have been used in testing for the better part of a century, researchers have yet to produce robust data on the sensitivity of animal models, particularly with respect to preclinical toxicology screening [4]. Perhaps the largest hurdle preventing such an assessment is the assumption that animals are as good as it gets, a belief that eclipses many researchers’ desire for proof.

Because there is a lack of firm data, it’s difficult to make general estimates of animal models’ sensitivity. However, about 90% of clinical trials fail, and 30–40% of these failures are due to toxicity responses. From this, it can be presumed that animals did not provide sufficient evidence to forecast drug toxicity in humans and that including more sensitive models could have helped prevent toxic drugs from reaching humans.
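As a rough back-of-the-envelope check, the two percentages quoted above can be multiplied to estimate what share of all drugs entering trials fail for toxicity. This is only a sketch: it assumes both figures describe the same population of drugs, which the text does not state explicitly.

```python
failure_rate = 0.90            # share of drugs entering clinical trials that fail (from the text)
toxicity_share = (0.30, 0.40)  # share of those failures attributed to toxicity (from the text)

# Share of all drugs entering trials that fail because of toxicity.
low, high = (failure_rate * share for share in toxicity_share)

print(f"{low:.0%} to {high:.0%} of drugs entering trials fail for toxicity")
# 27% to 36% of drugs entering trials fail for toxicity
```

Under that assumption, roughly one in four to one in three drugs entering trials would be toxicity failures that preclinical screening did not prevent.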

The Story in The Numbers 

Animal models are meant to be a protective barrier, the last line of defense that prevents toxic drug candidates from reaching humans. But, as described above, this barrier is imperfect—it has gaps in sensitivity that allow toxic drugs to advance into clinical trials far too often.   

In their study, Ewart et al. evaluated whether the Emulate human Liver-Chip could fortify this barrier. The team selected 27 drugs, 22 of which were known to be hepatotoxic. Importantly, many of these drugs were included in a set of guidelines that the IQ Consortium, an affiliate of the International Consortium for Innovation and Quality in Pharmaceutical Development, has designated as a baseline researchers should use to measure a liver model’s ability to predict DILI risk. 

Each of these drugs had previously advanced to clinical trials, and some were market approved. In other words, animal models in preclinical testing served as the reference for the study and therefore had a 0% sensitivity for this group of drugs.  

To measure a model’s sensitivity, researchers use it to screen a set of test drug candidates. To effectively gauge the model’s sensitivity, it’s essential that the set of drugs be carefully selected. For example, one could bias the test drugs towards “easy” drugs, ones that are very toxic in ways that even simple models would identify; however, showing that a new model is excellent at capturing “easy” drugs likely does not demonstrate its utility in capturing the difficult drugs that slip through animal testing. Establishing sensitivity based on such drugs is akin to claiming that a telescope’s ability to spot the sun makes it sensitive to observing stars: it may be technically true but is irrelevant to real-world challenges. That said, it’s not uncommon for preclinical models to be simply tested using highly toxic drugs that never made it to clinical trials [5,6].

Stated plainly, a model’s sensitivity changes depending on the drugs that are being tested. If you use clearly toxic drug candidates—ones that would never make it past animal models to begin with—your model will likely appear very sensitive because these drugs are easy to detect. But, if one achieves a high sensitivity when using more challenging drugs—ones that slip through and only show their toxic potential in humans—then the sensitivity of that model will be far more valuable.  
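The dependence on the reference set can be illustrated with a small sketch. All counts are invented for illustration and are not taken from any study. The "hard" set stands for drugs that passed animal testing and later proved toxic in humans, so animal models detect none of them by construction.

```python
def detection_rate(detected: int, toxic_total: int) -> float:
    """Sensitivity on one reference set: detected toxic drugs / all toxic drugs."""
    return detected / toxic_total


# Invented counts for illustration; they are not taken from any study.
easy_set = {"toxic_total": 20, "new_model": 20, "animal_models": 20}
hard_set = {"toxic_total": 20, "new_model": 8, "animal_models": 0}

for name, ref in (("easy", easy_set), ("hard", hard_set)):
    new = detection_rate(ref["new_model"], ref["toxic_total"])
    old = detection_rate(ref["animal_models"], ref["toxic_total"])
    print(f"{name} set: new model {new:.0%}, animal models {old:.0%}")
# easy set: new model 100%, animal models 100%
# hard set: new model 40%, animal models 0%
```

The same hypothetical model scores 100% on the easy set and 40% on the hard one, so a sensitivity figure is only informative alongside a description of the drugs used to measure it.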

This is precisely why Ewart et al.’s study stands out. Instead of testing Liver-Chips with drugs that are too toxic for animal use, the researchers relied on drugs that had already undergone animal testing and appeared safe. This particular set of drugs not only passed animal testing, but they also went on to kill 242 patients. The Liver-Chip successfully identified all of them (see Figure 2). As an additional challenge, they also tested a set of drugs that hepatic spheroid models missed. Even though these drugs were especially difficult for current models to detect, the Liver-Chip achieved an impressive sensitivity of 87%.

Figure 2: Drugs Ewart et al. tested in their study.

By producing these results, Ewart et al. have shown that the human Liver-Chip can, at a minimum, help fill the sensitivity gaps that plague animal models. Stated another way, these data strongly suggest that Liver-Chips can help to identify toxic drugs that animal models miss, which could greatly reduce the number of harmful drugs that advance into clinical trials.

These numbers tell a story about modern and future drug development. Today, drug developers rely on models that, while useful, are imperfect. This imperfection can be catastrophic for patients, as it was for the patients who lost their lives to the drugs Ewart et al. tested. But, in recognizing these imperfections, we can work to prevent future harm by building a better drug development process—one fortified by modern technologies like the Liver-Chip.  


  1. Craveiro, Nuno Sales, et al. “Drug Withdrawal due to Safety: A Review of the Data Supporting Withdrawal Decision.” Current Drug Safety, vol. 15, no. 1, 3 Feb. 2020, pp. 4–12, https://doi.org/10.2174/1574886314666191004092520.
  2. Center for Drug Evaluation and Research. “Drug-Induced Liver Injury: Premarketing Clinical Evaluation.” U.S. Food and Drug Administration, 17 Oct. 2019, www.fda.gov/regulatory-information/search-fda-guidance-documents/drug-induced-liver-injury-premarketing-clinical-evaluation.
  3. Fogel, David B. “Factors Associated with Clinical Trials That Fail and Opportunities for Improving the Likelihood of Success: A Review.” Contemporary Clinical Trials Communications, vol. 11, Sept. 2018, pp. 156–164, www.ncbi.nlm.nih.gov/pmc/articles/PMC6092479/, https://doi.org/10.1016/j.conctc.2018.08.001.
  4. Bailey, Jarrod, et al. “An Analysis of the Use of Animal Models in Predicting Human Toxicology and Drug Safety.” Alternatives to Laboratory Animals, vol. 42, no. 3, June 2014, pp. 181–199, https://doi.org/10.1177/026119291404200306.
  5. Zhou, Yitian, et al. “Comprehensive Evaluation of Organotypic and Microphysiological Liver Models for Prediction of Drug-Induced Liver Injury.” Frontiers in Pharmacology, vol. 10, 24 Sept. 2019, https://doi.org/10.3389/fphar.2019.01093. Accessed 22 Nov. 2020.
  6. Bircsak, Kristin M., et al. “A 3D Microfluidic Liver Model for High Throughput Compound Toxicity Screening in the OrganoPlate®.” Toxicology, vol. 450, Feb. 2021, p. 152667, https://doi.org/10.1016/j.tox.2020.152667.