Medical Student, Hackensack Meridian School of Medicine, Chatham, New Jersey, United States
Disclosure(s):
McKenzie A. Mayer, BS: No financial relationships to disclose
Introduction: Artificial intelligence (AI) is rapidly evolving, with significant implications for patients and healthcare professionals. AI has the potential to analyze medical imaging, aid in diagnosis, and offer patients medical advice before neurosurgical follow-up. This study aimed to assess ChatGPT's ability to interpret spinal radiological images and to compare its performance on online versus real-patient images. We hypothesized that ChatGPT would demonstrate greater accuracy in analyzing online images, given their public availability and likely inclusion in training datasets.
Methods: Images of common spinal conditions were sourced online from Radiopaedia, while real-patient images were obtained from our institution's electronic health records. ChatGPT was provided with each anonymized image, and its interpretations were compared against the radiologists' final reports. Performance was evaluated using two key metrics: diagnostic correctness (e.g., Chiari I malformation), which assessed ChatGPT's ability to reliably identify the final diagnosis, and image analysis accuracy (e.g., cerebellar tonsils protruding below the foramen magnum), which assessed whether the reasoning behind the chosen diagnosis was sound. Additionally, the authors assessed whether ChatGPT correctly identified the imaging modality and relevant anatomical structures. Statistical analysis was performed using GraphPad Prism 10 (GraphPad Software, Boston, MA), with significance set at p < 0.05.
Results: A total of 40 spinal images were collected (20 online, 20 real-patient). ChatGPT correctly identified the imaging modality in 100% of online images versus 95% of real-patient images, a difference that was not significant (p = 0.3112). Real-patient images had a higher rate of correct image analysis than online images, but the difference was not significant (40% vs. 25%, p = 0.3112). Online images yielded a higher rate of correct anatomical identification than real-patient images (65% vs. 50%), though this difference was also not significant (p = 0.3373). Lastly, diagnostic accuracy was 10% for online images versus 20% for real-patient images, again with no significant difference (p = 0.3758).
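Note on the statistics (an illustrative sketch, not the authors' workflow): the abstract does not name the specific test, but the reported p-values are consistent with Pearson's chi-square tests without continuity correction on 2x2 contingency tables built from the counts implied by the percentages above (each metric scored correct/incorrect for 20 online and 20 real-patient images). The Python sketch below reproduces the reported values with SciPy under that assumption; the Prism analysis itself may have been configured differently.

```python
# Illustrative reconstruction (assumption, not the authors' code):
# each comparison treated as a Pearson chi-square test on a 2x2 table
# of (correct, incorrect) counts for online vs. real-patient images.
from scipy.stats import chi2_contingency

# Counts reconstructed from the reported percentages (n = 20 per group).
comparisons = {
    "modality":  ([20, 0],  [19, 1]),   # 100% online vs. 95% real-patient
    "analysis":  ([5, 15],  [8, 12]),   # 25% online vs. 40% real-patient
    "anatomy":   ([13, 7],  [10, 10]),  # 65% online vs. 50% real-patient
    "diagnosis": ([2, 18],  [4, 16]),   # 10% online vs. 20% real-patient
}

for metric, (online, real) in comparisons.items():
    # correction=False disables the Yates continuity correction,
    # matching the abstract's p-values.
    chi2, p, _, _ = chi2_contingency([online, real], correction=False)
    print(f"{metric}: chi2 = {chi2:.3f}, p = {p:.4f}")

# Output: p = 0.3112, 0.3112, 0.3373, 0.3758, matching the abstract.
```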
Conclusion: ChatGPT demonstrated no significant difference in performance between online and real-patient images. While it accurately recognizes image modality, its ability to diagnose spinal abnormalities is notably limited. ChatGPT also exhibits some capacity to recognize anatomical structures but struggles with detailed image analysis.