Dedicated AI Expert System vs Generative AI With Large Language Model for Clinical Diagnoses
Mitchell J. Feldman, Edward P. Hoffer, Jared J. Conley, Jaime Chang, Jeanhee A. Chung, Michael C. Jernigan, William T. Lester, Zachary H. Strasser, & Henry C. Chueh
JAMA Network Open
May 29, 2025
What's the point?
Long before generative AI was applied to clinical diagnosis, diagnostic decision support systems (DDSSs) were helping clinicians diagnose patients and understand their diagnoses. These AI systems use large databases of clinical information to generate differential diagnoses and are purpose-built as diagnostic aids. This study is the first to put a traditional diagnostic tool (DXplain) head-to-head with newer LLMs (ChatGPT and Gemini) on 36 real, unpublished, challenging clinical cases. Comparing the top 25 diagnoses generated by the DDSS and each LLM, the DDSS listed the correct diagnosis more often (50-72%) than either LLM. However, the DDSS missed nine cases, and the LLMs correctly listed six of those nine.
The Bottom Line: While this DDSS performed better than either LLM on diagnostic accuracy, the LLMs performed remarkably well considering they are not designed for diagnosis.
Why does this matter?
With all the hype about generative AI, it’s easy to overlook DDSSs, the AI tools that have supported clinicians for years. Unlike LLMs, DDSSs don’t hallucinate; they explain their reasoning, are not subject to human biases, and consistently outperform LLMs when clinical data such as lab results are involved. But they’re not perfect: they require licenses and aren’t always widely available.
LLMs like ChatGPT, on the other hand, are free and surprisingly good at tackling even challenging, unpublished cases. The future of diagnostic excellence may lie in combining the strengths of both: pairing the reliability of DDSSs with the accessibility and adaptability of generative AI.
Who does this impact?
Clinicians can use these findings as evidence for a hybrid approach, using LLMs alongside DDSSs when available, while recognizing that no AI system is infallible. Although all three systems included the correct diagnosis in their differential for most cases, a handful of cases were missed by every system or identified by only one. This study shows that while LLMs have made remarkable progress in the relatively short time they have been used for diagnostic assistance, they are still tools in development.
Healthcare system leaders and administrators can use this study as a jumping-off point to think about how to combine new AI tools with existing systems, such as harnessing the natural language processing capabilities of LLMs to input narrative text into a DDSS, leveraging the advantages of both AI systems.
For patients, this study can serve as a reminder that clinicians have used AI systems for decades as aids to, not replacements for, doctors.
About Editor's Picks
Curated by the UCSF CODEX team, each Editor’s Pick features a standout study or article that moves the conversation on diagnostic excellence forward. These pieces offer meaningful, patient-centered insights, use innovative approaches, and speak to the needs of patients, clinicians, researchers, and decision-makers alike. All are selected from respected journals or outlets for their rigor and real-world relevance.
See more Editor's Picks.