Patient Perspectives on AI: Comparing Large Language Model and Physician-Generated Responses to Cervical Spine Surgery Questions

Friday, February 21, 2025

Presenting Author(s)

Ezra Yoseph

Medical Student
Stanford University

Introduction: Anterior cervical discectomy and fusion (ACDF) is a common surgical intervention for patients with cervical spine pathologies. The complexity of the procedure and the variable quality of online health information make it difficult for patients to understand the surgery and its outcomes through online resources. In this study, we sought to characterize differences in patient perspectives on large language model (LLM)-generated versus physician-generated responses to frequently asked questions about ACDF surgery.

Methods: This cross-sectional study was conducted in three phases. In Phase 1, we compiled 10 frequently asked questions about ACDF surgery with the assistance of ChatGPT-3.5, ChatGPT-4.0, and Google Search. In Phase 2, we collected responses to these questions from two spine surgeons and then prompted ChatGPT-3.5 and Gemini to answer the same 10 questions. In Phase 3, we recruited cervical spine surgery patients (n = 5) and age-matched controls (n = 5) to rate the responses from both surgeons and both LLM platforms on clarity and completeness.

Results: LLM-generated responses were significantly shorter on average than physician-generated responses (30.0 ± 23.5 vs. 153.7 ± 86.7 words, P < 0.001). Participants rated LLM-generated responses more favorably for clarity (H = 6.25, P = 0.012), with no significant difference in completeness ratings (H = 0.695, P = 0.404). When questions were analyzed individually, ratings of LLM-generated and physician-generated responses did not differ significantly for any question. Compared with age-matched controls, cervical spine surgery patients were more likely to rate physician-generated responses higher in clarity (H = 6.42, P = 0.011) and completeness (H = 7.65, P = 0.006).
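The abstract reports H statistics without naming the test; H is the statistic of the Kruskal-Wallis test, so the sketch below assumes that test was used and that each comparison involved two groups (k = 2, hence 1 degree of freedom). Under those assumptions, H is referred to a chi-square distribution with 1 degree of freedom, whose upper tail has the closed form erfc(sqrt(H / 2)). The helper name `chi2_sf_df1` is hypothetical; the code only checks that each reported P value is consistent with its H statistic and is not the authors' analysis.

```python
import math


def chi2_sf_df1(h: float) -> float:
    """Upper-tail probability P(X > h) for a chi-square variable with 1 degree of freedom.

    With df = 1, X = Z**2 for a standard normal Z, so P(X > h) = erfc(sqrt(h / 2)).
    """
    return math.erfc(math.sqrt(h / 2.0))


# (H statistic, reported P value) pairs taken from the Results paragraph.
reported = {
    "clarity, LLM vs. physician": (6.25, 0.012),
    "completeness, LLM vs. physician": (0.695, 0.404),
    "clarity, patients vs. controls": (6.42, 0.011),
    "completeness, patients vs. controls": (7.65, 0.006),
}

# Recompute each P value from its H statistic, assuming two groups (df = k - 1 = 1).
recomputed = {name: chi2_sf_df1(h) for name, (h, _) in reported.items()}

for name, (h, p_reported) in reported.items():
    print(f"{name}: H = {h}, reported P = {p_reported}, recomputed P = {recomputed[name]:.4f}")
```

Each recomputed value agrees with the reported P value to within rounding, which is what one would expect if the comparisons were two-group Kruskal-Wallis tests.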

Conclusion: Although limited by a small sample size, our findings suggest that LLMs provide information comparable to, and occasionally preferred over, physician-generated responses in terms of the clarity and completeness of answers to common ACDF surgery questions. This is particularly striking given that LLM-generated responses were, on average, about 80% shorter than physician-generated responses. Further studies are needed to determine how LLMs can be integrated into spine surgery education.
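The "about 80% shorter" figure follows from the mean word counts in the Results. The short sketch below shows that arithmetic; it uses only the two reported means and computes the relative reduction as 1 minus the ratio of the LLM mean to the physician mean.

```python
# Mean word counts reported in the Results (LLM vs. physician responses).
llm_mean_words = 30.0
physician_mean_words = 153.7

# Relative reduction in length: 1 - (LLM mean / physician mean).
relative_reduction = 1.0 - llm_mean_words / physician_mean_words
print(f"LLM responses were {relative_reduction:.1%} shorter on average")
```

This gives roughly 80.5%, consistent with the rounded figure in the Conclusion. Because it is based on means only, it does not reflect the large spread in response length (standard deviations of 23.5 and 86.7 words).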