Predictive models are designed to remove some of the subjectivity inherent in medical decision-making and to automate certain health-related services, with the goal of improving diagnostic accuracy, providing personalized treatment options, and streamlining the health care industry overall. More and more of these models, many built with machine-learning approaches, are showing up in doctors’ offices and hospitals, as well as in telemedicine applications, which have become prevalent with the growing demand for online alternatives to office visits.
Predictive models, however, do not always live up to the hype, according to physician and researcher Ziad Obermeyer of the University of California-Berkeley, and medical statistician Maarten van Smeden of the University Medical Center Utrecht in the Netherlands. Both have recently published studies exposing some of the problems with predictive models, while also providing recommendations to help predictive modeling reach its potential and improve health care.
Cutting on the bias
One of the biggest hurdles for predictive models is that health is a tough nut to crack. “When developing an algorithm to identify photos, it is a cat or it isn’t, and we can all agree on that. Or for something like a self-driving car, it is a pedestrian or it isn’t, and it is a stop sign or it isn’t. But in medicine, we rarely have anything remotely like ground truth,” said Obermeyer, M.D. (Figure 1), who is an associate professor of health policy and management at the University of California-Berkeley, and one of ten National Academy of Medicine 2020 Emerging Leaders in Health and Medicine Scholars. “We don’t have straightforward definitions for health: How do I measure that someone is healthy, is going to be healthy, or is going to have a deterioration in their health in the next year? So the measurement process in health is a huge part of what makes this hard, and that’s what we saw in our work.”
In one study of health care models [1], Obermeyer and colleagues in Boston and Chicago dissected an algorithm that a large number of U.S. medical centers employ to determine which groups of patients are more likely to need medical care in the future, so they can invest preventative resources most effectively. “We know that humans are not good at this, especially when they have hundreds of patients that they need to keep track of, so we really desperately want algorithms to be doing this job,” he said. The idea was to build an algorithm that would eliminate subjectivity, including ethnic, geographic, or other prejudices. “Unfortunately, the algorithm ended up reinforcing those structural inequalities and biases, and all of the things that we don’t like about our health system,” he said.
The model’s flaw lay in the choice of measurement used to predict health outcomes. In this case, the developers chose health expenditures, because the thinking was that people who spend the most on care are those who need the most care, so preventative resources should go to those who spend more. “The designers told the algorithm to predict expenditures, and the algorithm was very good at it,” Obermeyer said. “The problem is that the label was biased, because black patients are poorer than white patients on average, they have greater barriers to access in getting health care when they need it, and because doctors treat them differently, they end up getting less care even when they see a doctor. So for all those reasons, two patients with the same level of health and health needs could have very different expenditures, and that was the bias that the algorithm was automating.”
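To make the mechanism concrete, here is a minimal, synthetic sketch (in Python) of how a proxy label can encode disparity. The group labels, the 30% access penalty, and the 10% outreach cutoff are illustrative assumptions, not figures from the study [1].

```python
# Hypothetical illustration of label bias: two groups with identical health
# needs, but one group spends less on care because of access barriers.
# Ranking patients by spending then under-selects that group for outreach,
# even though its true need is the same.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

group = rng.integers(0, 2, n)                    # 0 = group A, 1 = group B
need = rng.gamma(shape=2.0, scale=1.0, size=n)   # true (unobserved) health need,
                                                 # same distribution in both groups

# Observed expenditures: proportional to need, but group B faces access
# barriers and spends ~30% less at the same level of need (an assumption).
spending = need * np.where(group == 1, 0.7, 1.0) + rng.normal(0, 0.1, n)

# "Algorithm": flag the top 10% of patients by spending for preventative
# outreach (a well-fit model trained to predict spending would do the same).
threshold = np.quantile(spending, 0.90)
flagged = spending >= threshold

print("Share of group flagged, A vs. B:",
      flagged[group == 0].mean().round(3),
      flagged[group == 1].mean().round(3))
print("Mean true need among flagged, A vs. B:",
      need[flagged & (group == 0)].mean().round(2),
      need[flagged & (group == 1)].mean().round(2))
# Group B is flagged less often, and only its sickest members clear the bar,
# despite identical need distributions: the proxy label bakes in the disparity.
```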
Using health expenditures as a predictor of health needs is actually nothing new, Obermeyer said. “A report by the Society of Actuaries compared the 10 most widely used algorithms for doing this task, which is called population health management forecasting, and they all did the same thing, so if you look at hospital systems across the U.S., they are all using algorithms like this,” he said. Although health providers are attuned to and work diligently against disparities, nobody translated that abstract knowledge into a realization that expenditures carried a bias that would be scaled up by the algorithm, he explained. “I think it illustrates why models are hard, because it wasn’t a problem that you would have caught by running some basic diagnostics about algorithmic performance.”
The research group found the same thing in algorithms the U.S. government has been using to determine how to distribute relief funding for coronavirus disease 2019 (COVID-19). The algorithm uses COVID-19 infection rates, but since diagnostic testing is unevenly distributed—much less common in poorer urban areas than in wealthier suburbs, for instance—the algorithm directs relief funds inequitably [2], he explained.
Both studies showcase the need to tread carefully. “We get a lot of training on how to build an algorithm, optimize it, and make sure it has certain properties, and yet we don’t get a lot of training on what questions you ask the algorithm to answer,” he said. That is a particular problem when it comes to health. “The difficulty is that we’re interested in this variable called health that we don’t measure directly in our datasets. Instead, we measure a bunch of proxy measures for health (and) if we don’t pay very close attention to the variable we ask the algorithm to predict, the sample it is measured in, and what biases we have introduced into the measurement by selecting a particular sample—and leaving some other people out—we are going to get a biased algorithm that has a lot of unintended consequences.” He added, “All of those are fundamental, statistical questions, and they turn out to be really crucial for distinguishing an algorithm that is doing fundamentally what it is supposed to do from one that is introducing bias and error.”
Validate, test, verify
Details about the decision-making processes used by machine-learning/deep-learning models are obscured, which leads to valid concerns about whether they are keying on appropriate data, as well as user uncertainty about whether they can trust the models’ conclusions. One of the best ways to instill confidence in models is to run them through extensive evaluation, but that is rarely done, said van Smeden, Ph.D., assistant professor in the Julius Center for Health Sciences and Primary Care at the University Medical Center Utrecht (Figure 2). “A lot of the attention goes to developing new algorithms because it’s fun, and I understand that. But in the end, most of these algorithms are used to help in medical decisions, so after you have developed your model, you have to validate it and also somehow show that it actually improves medical care,” he asserted.
He and a broad research group reviewed the flood of new models—mainly employing machine learning—designed to identify COVID-related pneumonia on computed tomography (CT) scans [3]. After reviewing 105 articles on such models, van Smeden said, the researchers found that none had what they felt was sufficient external validation or testing on new patients and data; for that reason, they listed all of the models as having a “high risk of bias” and noted that their “reported performance is probably optimistic.”
External validation and testing are imperative because any model can go off-track in a variety of ways [4], van Smeden said. The size and quality of the dataset are a good example. “It’s easy to get a small dataset of relatively high quality, and it’s relatively easy to get very large datasets of low quality, but machine learning models are data-hungry and need large, high-quality datasets that are very difficult to get, especially if you want a representative sample.”
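As a rough illustration of the external validation step van Smeden advocates, the sketch below fits a model on data from one source and then checks its discrimination and calibration on patients from a different source. The file names, predictor columns, and outcome label are hypothetical placeholders, not anything taken from the review [3].

```python
# Hedged sketch of external validation: develop on one site's data, then
# evaluate on patients from a different site and compare performance.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss

FEATURES = ["age", "sex", "crp", "lymphocyte_count"]   # assumed predictor columns
OUTCOME = "covid_pneumonia"                            # assumed binary label

dev = pd.read_csv("development_site.csv")   # data used to build the model
ext = pd.read_csv("external_site.csv")      # new patients, different hospital

model = LogisticRegression(max_iter=1000)
model.fit(dev[FEATURES], dev[OUTCOME])

for name, df in [("apparent (development)", dev), ("external", ext)]:
    p = model.predict_proba(df[FEATURES])[:, 1]
    print(f"{name:25s}  AUC={roc_auc_score(df[OUTCOME], p):.3f}  "
          f"Brier={brier_score_loss(df[OUTCOME], p):.3f}")
# A large drop from apparent to external performance is exactly the
# "optimistic" reporting the review warns about; external data exposes it.
```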
Besides adding external validation and patient/data tests to the model-development process before a model is put into clinical use, van Smeden also recommended following up with another layer of evaluation to determine whether the model actually leads to improved diagnosis, care, and patient outcomes. Such evaluation provides necessary confirmation, he said, especially with machine-learning models, which are designed to seek out patterns in datasets and use them to draw conclusions. Such models are often described as “black boxes,” because they don’t reveal the patterns they used in decision-making. “Understanding black boxes is very difficult, and sometimes you don’t have to as long as you test after you’ve finished developing your models,” he remarked. “We know that bad models that make bad predictions can make things worse. That is a real possibility, and one that we don’t take seriously enough.”
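One established way to probe whether acting on a model’s predictions could plausibly improve care (though not a method prescribed by van Smeden here) is decision-curve, or net-benefit, analysis, which weighs true positives against false positives at a chosen risk threshold. The sketch below uses synthetic predictions purely for illustration.

```python
# Net-benefit analysis: does treating patients above a risk threshold,
# as ranked by the model, beat simple "treat everyone" / "treat no one" policies?
import numpy as np

def net_benefit(y_true: np.ndarray, risk: np.ndarray, threshold: float) -> float:
    """Net benefit of treating patients whose predicted risk >= threshold."""
    treat = risk >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1)) / n
    fp = np.sum(treat & (y_true == 0)) / n
    return tp - fp * threshold / (1.0 - threshold)

# Synthetic validation cohort: outcomes plus model-predicted risks.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)
risk = np.clip(y * 0.4 + rng.normal(0.3, 0.2, 500), 0, 1)

for t in (0.2, 0.3, 0.5):
    nb_model = net_benefit(y, risk, t)
    nb_all = net_benefit(y, np.ones_like(risk), t)   # treat everyone
    print(f"threshold={t:.1f}  model={nb_model:+.3f}  treat-all={nb_all:+.3f}")
# The model only looks worth using at thresholds where its net benefit beats
# both treat-all and treat-none (which is zero by definition).
```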
Consistent validation and testing will only become part of model-development protocols if regulations demand them, van Smeden said, but so far such regulations are lacking. “We have been using models at least since Virginia Apgar with her Apgar score (developed in 1952 to predict the health of newborns), so models are not new at all, and [the] medical field has let this go, almost never demanding a need for regulation,” he said. With advances in computing power and capabilities, however, models are not only quickly escalating in number, but also in scope as they begin to make increasingly high-stakes decisions. He noted, “We are in a situation now where algorithms can potentially take over for medical doctors, so we need some minimal requirements to be met before they are implemented.”
Positive side
Although their studies found flaws in predictive models, both researchers hope the work will help lead to models that improve health care. “Even though it’s tempting to read a lot of results as very negative about the future of algorithms in health and elsewhere, for me I read it much more optimistically,” Obermeyer said, noting that studies like his show a path forward by driving home the importance of “paying close attention to the data-generating and measurement processes.”
Van Smeden also felt that models, if done well, can be of great benefit to the medical field and to patient outcomes. “I see opportunities, [but] we have to step up our game, so to speak, to make them better, so that people have sufficient confidence in them and medical doctors trust them,” he remarked. “That means we have to avoid making a fool of ourselves too often, so we have to test these models.”
References
- Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan, “Dissecting racial bias in an algorithm used to manage the health of populations,” Science, vol. 366, no. 6464, pp. 447–453, Oct. 25, 2019.
- P. Kakani, A. Chandra, S. Mullainathan, and Z. Obermeyer, “Allocation of COVID-19 relief funding to disproportionately black counties,” JAMA, vol. 324, no. 10, pp. 1000–1003, Sep. 8, 2020.
- L. Wynants et al., “Prediction models for diagnosis and prognosis of COVID-19 infection: Systematic review and critical appraisal,” BMJ, vol. 369, Apr. 7, 2020, Art. no. m1328.
- L. Wynants et al., “Three myths about risk thresholds for prediction models,” BMC Med., vol. 17, Oct. 25, 2019, Art. no. 192. Accessed: Oct. 7, 2020. [Online]. Available: https://bmcmedicine.biomedcentral.com/articles/10.1186/s12916-019-1425-3#citeas