Post-Doctoral Neurosurgery Research Fellow Johns Hopkins University Baltimore, MD, US
Introduction: Machine learning (ML) and deep learning (DL) have increasingly been used to develop clinical prediction models for spine surgery applications. However, proper evaluation of these models in the presence of imbalanced data is crucial for ensuring their effectiveness in clinical practice. This systematic review sought to identify and evaluate all current research-based spine surgery ML and DL models predicting binary outcomes, focusing on their evaluation metrics.
Methods: A comprehensive literature search was conducted through the EMBASE, Medline, and PubMed databases using relevant keywords. No limits were placed on level of evidence or study timing. Overall, 60 papers were included, and findings were reported according to the PRISMA guidelines.
Results: Among the 60 papers, 13 focused on length of stay (LOS), 12 on readmissions, 12 on non-home discharge, six on mortality, and five on reoperations. The target outcomes exhibited class imbalances ranging from 0.44% to 42.4%. Fifty-nine papers reported the model's AUROC, 28 reported accuracy, 33 sensitivity, 29 specificity, 28 positive predictive value (PPV), 24 negative predictive value (NPV), 25 the Brier score (BS), with ten also providing the null-model Brier score, and eight the F1 score. Data visualization also varied across the included papers: 52 included a ROC curve, 27 a calibration curve, 13 a confusion matrix, 12 decision curves, and three a precision-recall (PR) curve. Several common errors and potential sources of bias were identified, including papers reporting favorable metrics while omitting others, optimizing one metric at the expense of others, and reporting high accuracy and AUROC despite poor sensitivity. Additionally, some papers provided poor calibration plots or had missing metrics.
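The pitfall described above, high accuracy alongside poor sensitivity on an imbalanced outcome, can be illustrated with a minimal sketch. The cohort below is hypothetical (a 5% event rate, within the 0.44%-42.4% imbalance range reported in the included studies), and the metric definitions are the standard ones rather than any specific paper's implementation. It also shows the null-model Brier score, the benchmark that only ten of the included papers reported.

```python
# Illustrative sketch (hypothetical data): why accuracy alone misleads on
# imbalanced outcomes. A "null" model predicting the majority class for every
# patient still scores high accuracy when the event rate is low.

def binary_metrics(y_true, y_pred):
    """Standard confusion-matrix metrics for binary labels (1 = event)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = len(y_true)
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - t) ** 2 for t, p in zip(y_true, p_pred)) / len(y_true)

# Hypothetical cohort: 100 patients, 5% event rate (e.g. reoperation).
y_true = [1] * 5 + [0] * 95
y_null = [0] * 100  # majority-class "model": predicts no event for everyone

m = binary_metrics(y_true, y_null)
print(m["accuracy"])     # 0.95 -- looks excellent
print(m["sensitivity"])  # 0.0  -- misses every event

# Null-model Brier score: predict the prevalence (0.05) for every patient.
# Equals prevalence * (1 - prevalence) = 0.0475; a real model should beat it.
print(brier_score(y_true, [0.05] * 100))  # 0.0475
```

Comparing a model's Brier score against this null-model baseline, rather than reporting it in isolation, is what distinguishes a genuinely informative probability estimate from one that merely reflects the event rate.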
Conclusion: Proper evaluation schemes are essential in applied machine learning for spine surgery applications. Researchers, reviewers, and editors should be aware of the pitfalls of inadequate evaluation metrics and the importance of comprehensive model evaluation. Appraisal of AI models by trained statisticians with domain knowledge may also be warranted. This may, in turn, further the use of AI in spine surgery research.