Peeking into a black box, the fairness and generalizability of a MIMIC-III benchmarking model


The MIMIC dataset provides a wealth of opportunities to develop, train, and educate; as a prime example, Harutyunyan et al.25 developed a benchmark study to predict IHM risk that has been replicated in multiple settings26,27,28,29,30,31. Given the magnitude of impact of this work, we performed an empirical evaluation of this model – a task generally done far too inconsistently and sparingly33. In this endeavor, we appreciated the authors’ efforts to facilitate model comparison through the development of easily reproducible benchmark models as well as their detailed explanations and general user-friendliness of their open-source code. While there are many strengths to this publication, we also identified several limitations related to class imbalance and fairness that require mitigation and transparent reporting. Specifically, we found three main problems to be addressed. First, the cohort and performance screening unmasked a typical class imbalance problem, where the model struggles to correctly classify minority class instances as demonstrated through low recall. Only every fourth to fifth high-risk patient is identified as such by the AI tool. Second, while the assessment showed the model’s capacity to generalize, the classification parity assessment revealed that model fairness is not guaranteed for certain ethnic and socioeconomic minority groups, but gender is unaffected. Finally, the calibration fairness study pointed to differences in patient comorbidity burden for identical model risk predictions across socioeconomic groups. These results highlight the extent of masked bias in high-quality scientific work and the need for thorough fairness evaluations. In light of the possible repercussions and the pervasiveness of bias in AI models, we provide a detailed analysis of the case study results and highlight recommendations on how to cautiously monitor and validate benchmark pipelines.

The empirical analysis of the Harutyunyan model suggests that class imbalance has a significant effect on this model’s performance in all three analytical framework stages. In the context of AI-guided clinical decision support, such limitations may threaten the model’s usability. As seen in particular from this model, recall rates as low as 25% or less should raise important concerns regardless of the model’s potential future application fields and use cases. In fact, this specific model’s endpoint may not be clinically relevant in many settings, but the pandemic has proven that there are moments where doctors increasingly look for tools to help them triage patients in a resource-scarce or momentarily overstrained setting. In a slightly different context, a case in point of such a tool is the Care Assessment Need (CAN) score deployed for almost a decade by the Veterans Health Administration in the US. It calculates weekly CAN scores for all Veterans who receive primary care services within the VA, including 90-days and 1-year mortality endpoints34. This tool has also been shown to be amenable to COVID-19 mortality risk stratification repurposing to support clinical decision making in a system under duress35.

One of the biggest dangers of model bias and class imbalance, such as exhibited by this benchmark model, is the fact that such intrinsic modeling problems frequently get hidden by inadequate and too simplistic summary performance metrics such as accuracy, or to a lesser degree even in current reporting standards such as AUROC36,37,38. The accuracy paradox explains how this pitfall may cause misleading conclusions in the case of unbalanced datasets by failing to provide comprehensible information about the model’s discriminatory capabilities39. Various targeted data and algorithm-level methods have been developed to effectively mitigate the inherently adverse effects of class imbalance40. But even if such mitigation methods may, in some cases, decisively help in alleviating the repercussions of data imbalance on performance, there is still a critical need to address the remaining negative impact. Given the still widespread neglect surrounding the proper handling of class imbalance and the associated far-reaching negative consequences37, it is urgent to finally work towards broad adoption of adequate performance evaluation procedures and reporting standards, such as provided in MINIMAR32, taking into account the characteristic challenges arising from class imbalance. More specifically, we advocate for the cessation of simplistic performance reporting solely based on accuracy or AUROC since performance metric choices have far-reaching implications on the study’s conclusions, particularly in the presence of data imbalance. Therefore, it is imperative that scientific journals now go beyond current reporting guidelines and start broadly requiring publications to explicitly state the class (im)balance and, in the case of pronounced skew, to require the reporting of further metrics beyond accuracy and AUROC.

Despite the inherent difficulties associated with mathematical definitions of fairness19,21 and the analytic limitations imposed by class imbalance37, significant differences for both examined fairness criteria were found. The classification parity assessment revealed important performance differences for socioeconomic and ethnic groups, but not with respect to gender. Medicaid patients receive significantly worse predictions compared to patients with private insurance in the internal validation assessments of both the MIMIC and STARR-trained model (stage one and three). Studying ethnic disparities, Black patients suffer from significantly lower model prediction performance as compared to non-Hispanic White patients in all assessment stages. Part of this bias might stem from the low representation of Black patients in both datasets, yet the two other ethnic minority groups, Asians and Hispanics, did not show such pronounced bias. Evidently, more research is warranted to better understand the underlying causes of these observed differences.

Complementary to classification parity, the fairness calibration analysis revealed further fairness disparities for publicly insured patients. Particularly Medicare patients consistently suffer from the strongest decalibration as their risks are severely underestimated. Since the Medicare population tends to be distinctively older by the nature of its inclusion criteria resulting in an elevated IHM rate, these results are however not surprising in the context of the class imbalance problem. Studying the relationship of comorbidity burden and algorithmic risk predictions paints a similar picture as Medicare patients consistently have more comorbidities for any risk score percentile. In the hypothetical case of deploying such a model in an ICU setting, older patients would potentially be at risk of not receiving the appropriate care resources they would need to improve their chances of survival.

Another interesting phenomenon in the study of comorbidity-risk associations are the randomly seeming spikes for higher risk percentiles in many demographic groups. They may be partially explained by random fluctuations due to the quickly decreasing sample size as the algorithmic risk score increases. Another explanation, however, might be that the IHM risk of those high-risk strata patients has strongly varying dependencies according to the characteristics of the demographic group. The opposing spiking behavior of Medicaid and privately insured high-risk patients is particularly interesting. The steep positive association of risk score and comorbidity burden in Medicaid patients suggests that their predicted likelihood of death in or after an ICU stay is disproportionally driven by comorbidities. On the other hand, patients with private insurance seem much more likely to receive such a high model risk score because of a sudden unexpected event unrelated to their comorbidities. This disparity in comorbidity burden points to a more systemic problem in our healthcare system, where socioeconomically disadvantaged patients receive unequal quality of care, contributing to many preventable deaths13,14,41,42.

Bias in a deployed AI-guided clinical decision support may have long-lasting and severe negative consequences for the affected patient groups. In this specific case, particularly Black patients and patients under public insurance would be at risk, though other minority groups may also suffer from underrepresentation3,4,5,6. These results build on Obermeyer’s previous work and strengthen the case that Black patients and other systemically disadvantaged groups are the ones left behind by AI revolutionizing standards of care9. Not identifying and addressing such hidden biases would lead to their propagation under the mantle of AI, thereby further exacerbating the health disparities faced by minority populations already today7,8,9,10. Moreover, such wide-ranging repercussions would also destroy public trust and hinder the further adoption of AI tools capable of decisively improving patient outcomes21.

While this case study is focused on the evaluation of the risk-prediction behavior of a MIMIC-trained model, it is important to consider the underlying data and its fit for purpose. Much of our historical healthcare data include inherent biases from decades of a discriminatory healthcare system13,14,43. For example, several studies have documented that non-White patients were less likely to receive an analgesic for acute pain, particularly when the underlying reason is difficult to quantify, such as a migraine or back pain44,45. This disparity becomes embedded in the data and therefore a model learning from these data can only regurgitate the biases in the data itself. This highlights the need for diverse stakeholder involvement throughout the design, development, evaluation, validation and deployment of an AI model to understand how and where these biases may occur in the data and potential mitigation strategies, such as discussion with a nurse/clinician about different levels of missingness for a particular variable46. Ultimately, solutions ready for safe deployment must be developed through a joint team effort involving people on the ground, knowledge experts, decision makers, and end-users.

Our work has far-reaching implications. First, it sheds light on the negative repercussions of disregarding class performance disparities in the context of skewed data distributions – a challenge still largely neglected but impacting many areas of AI research and requiring systemic changes in model evaluation practices and reporting standards36,37,38. Secondly, it showcases the importance of thorough external model validation prior to its use as a benchmarking tool. And finally, models should also systematically undergo fairness assessments to break the vicious cycle of inadvertently perpetuating the systemic biases found in society and healthcare under the mantle of AI. Particularly, the modeling of certain research questions on single-centered datasets like MIMIC is at an elevated risk of embedded systemic bias and the widespread use of only a handful of large public datasets further escalates the negative consequences thereof. Our work shows that particularly Black and socioeconomically vulnerable patients are at higher risk of inaccurate predictions by the model under study. These findings are of particular importance since this model has already been used several dozen times to benchmark new modeling techniques without any prior performance or fairness checks reported26,27,28,29,30,31.

This study has limitations. First, the fairness assessment is significantly limited by the model’s underlying problem of class imbalance. However, the results obtained under these unmodifiable constraints are statistically sound and meaningful. A second limitation is the lack of multi-center data, which would further strengthen the generalizability of the study findings, although two academic settings were evaluated in this study. Finally, the study is also limited by the similarity of the two datasets since STARR originates, like MIMIC, from an affluent academic teaching hospital setting. Yet the large geographic distance resulting in a different demographic patient population mix provides a good basis for a first model generalizability assessment.

In the present era of big data, the use of AI in healthcare continues to expand. An important aspect of the safe and equitable dissemination of AI-based clinical decision support is the thorough evaluation of the model and its downstream effects, a step that goes beyond predictive model performance to further encompass bias and fairness evaluations. Correspondingly, regulatory agencies seek to further initiate and develop real-world performance requirements and test beds, using MIMIC as a gold standard10. A crucial step to achieving these goals is open science. Yet as MIMIC is often viewed as an exemplar of open science, understanding its limitations, through for example case studies, is an imperative. In our evaluation of an IHM benchmarking model, we found challenges around class imbalance and worse predictions for Black and publicly insured patients. Going further, we also saw that model performance measured by AUROC was the center of evaluation and other important assessments, such as minority class performance, risk of bias and concerns around model fairness, were missing or not adequately reported. The repercussions from such non-comprehensive evaluation frameworks are a safety concern of entire populations, where the most vulnerable will ultimately suffer the most. Hence, this study cautions against the imprudent use of benchmark models lacking fairness assessments and external validation in order to make true progress and build trust in the community.

Next Post

Extended in the tooth: Goldie the pufferfish has crisis dental do the job | Animals

A pufferfish had to undertake emergency dental perform after her teeth grew so major she was not able to take in. The owner of Goldie the porcupine pufferfish, Mark Byatt, 64, rushed her to the vets in Kent just after noticing she was dropping excess weight since her long teeth […]