AI/ML Models Face Limitations in Predicting Postoperative Dysphagia: Insights from a Multi-institutional Study Using a Validated Measure

Friday, February 21, 2025

Presenting Author(s)

Ken Porche, MD

Fellow
Mayo Clinic Rochester
Rochester, Minnesota, United States

Disclosure(s):

Ken Porche, MD: No financial relationships to disclose

Introduction: This study assesses the predictive ability of machine learning (ML) models for identifying dysphagia after anterior cervical surgery, using the validated Eating Assessment Tool-10 (EAT-10). Dysphagia is a frequent complication, and early prediction can improve outcomes. This analysis focused on new-onset dysphagia, excluding patients with preoperative dysphagia, and used only preoperative and intraoperative variables.

Methods: A total of 1,715 patients from 3 large institutions who underwent anterior cervical surgery were included. Dysphagia was defined as an EAT-10 score ≥3 within 3 months postoperatively. Patients with posterior surgery or preoperative dysphagia were excluded. The dataset was split into training, validation, and test sets (53:27:20) using nested, stratified k-fold cross-validation (5 outer folds and 3 inner folds) to preserve the ratio of positive and negative cases and to ensure robust generalization. Four feature selection methods were used, including recursive feature elimination (RFE), permutation importance, and XGBoost. Models tested included deep neural network (DNN), regression, gradient boosting, random forest, SVM, and Naïve Bayes. Hyperparameter tuning, regularization, and dropout, where appropriate, were applied to optimize each model's accuracy. Metrics calculated included accuracy, F1 score, ROC-AUC, precision-recall AUC, sensitivity, specificity, PPV, and NPV.
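The 53:27:20 proportions follow from the fold counts: each outer fold holds out 1/5 of the patients for testing, and each inner fold holds out 1/3 of the remaining 80% for validation. The sketch below illustrates this in plain Python, assuming a simple round-robin stratified assignment. The function names and the label counts are hypothetical and are not taken from the study's code or data.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Assign each index to one of n_splits folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def nested_splits(labels, outer=5, inner=3):
    """Yield (train, validation, test) index lists for every outer/inner fold pair."""
    outer_folds = stratified_folds(labels, outer)
    for test in outer_folds:
        test_set = set(test)
        # Everything outside the outer test fold is split again by the inner loop.
        rest = [i for i in range(len(labels)) if i not in test_set]
        rest_labels = [labels[i] for i in rest]
        for val_positions in stratified_folds(rest_labels, inner):
            val_set = set(val_positions)
            val = [rest[p] for p in val_positions]
            train = [rest[p] for p in range(len(rest)) if p not in val_set]
            yield train, val, test


# Illustrative labels only (not study data): 1 = dysphagia, 0 = no dysphagia.
labels = [1] * 600 + [0] * 900
train, val, test = next(nested_splits(labels))
fractions = [round(len(part) / len(labels), 2) for part in (train, val, test)]
print(fractions)  # [0.53, 0.27, 0.2]
```

With 5 outer and 3 inner folds, the generator yields 15 train/validation/test combinations, and each part keeps the same share of positive cases as the full label list.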

Results: Overall model performance was limited, with the DNN model achieving the highest accuracy (63.3%) and an ROC-AUC of 0.613. Regression had a slightly lower accuracy (62.8%) but higher specificity (77.7%) and the second-highest NPV (0.647). Gradient boosting, random forest, and SVM models had similar accuracy (~61%). The gradient boosting model also had a strong specificity (76.1%) and a reasonable precision-recall AUC (0.623). Naïve Bayes underperformed, with the lowest accuracy (58.3%) and sensitivity (19.5%). Precision-recall AUC was highest in the deep learning model (0.645), though sensitivity across all models was generally low, ranging from 19.5% to 47.8%.
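For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity, specificity, PPV, NPV, and F1 are derived from a binary confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not the study's data. They show how a model can reach moderate accuracy and specificity while sensitivity stays low.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return None when a metric is undefined (empty denominator).
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value (precision)
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts for illustration only (not study data).
metrics = confusion_metrics(tp=40, fp=25, tn=75, fn=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```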

Conclusion: Despite advanced methodology, ML models showed poor accuracy and AUC in predicting postoperative dysphagia from clinical and operative information alone. Although deep learning performed best, its predictive power is too low for clinical use. Further research is needed to explore additional variables, such as intraoperative esophageal pressure monitoring, or more advanced techniques to improve prediction.