Medical Student Research Fellow, Northwestern University Feinberg School of Medicine
Introduction: This study evaluates three AI-driven literature review tools (Elicit, Semantic Scholar, and ChatGPT 4.0) against a traditional manual review in cervical spine surgery research. The aim is to assess how AI can enhance literature reviews and to guide researchers in tool selection.
Methods: A medical student manually searched Google and PubMed for articles on "three-level anterior cervical discectomy and fusion and four-level anterior cervical discectomy and fusion." Elicit, Semantic Scholar, and ChatGPT 4.0 were then queried with the same phrase, and the relevant articles from each source were compiled. Each tool's results were evaluated for relevance, peer-reviewed status, and publication metrics.
Results: The manual review identified 13 relevant articles, seven of which overlapped with the AI results and six of which were unique. Elicit retrieved 21 relevant articles, 15 of them unique, with 48% overlap with the manual review. Semantic Scholar retrieved 10 articles, two of which overlapped with both Elicit and the manual review. Notably, ChatGPT 4.0 produced only fabricated, non-existent article titles, although they appeared relevant.
Conclusion: Elicit and Semantic Scholar surfaced unique, relevant findings, demonstrating their value in modern literature reviews. Although the AI tools were faster, the manual review remains the more reliable method, particularly for citation validation. ChatGPT 4.0's fabricated citations highlight a critical flaw and underscore the need for human oversight in AI-assisted research. AI tools can supplement traditional methods but require verification.