Editor's Pick: Human-AI collectives improve diagnostic excellence (DxEx)
“Human–AI collectives most accurately diagnose clinical vignettes”
Nikolas Zöller, Julian Berger, Irving Lin, Nathan Fu, Jayanth Komarneni, Gioele Barabucci, Kyle Laskowski, Victor Shia, Benjamin Harack, Eugene A. Chuh, Vito Trianni, Ralf H. J. M. Kurvers, and Stefan M. Herzog
Proceedings of the National Academy of Sciences (PNAS)
June 13, 2025
What's the point?
Can clinicians and AI work better together than alone? This new study suggests yes.
The research team developed a collective intelligence method to combine diagnostic responses from both clinicians and large language models (LLMs). They used a database of over 2,000 case vignettes and 40,000 differential diagnoses submitted by physicians through the Human Diagnosis Project. Five LLMs were then prompted with the same cases and patient information to generate their own diagnoses. The authors then simulated diagnostic accuracy across three groups: physicians alone, LLMs alone, and hybrid collectives combining both.
The results? Hybrid collectives outperformed individual clinicians, standalone LLMs, and even groups of clinicians or LLMs on their own.
The Bottom Line: Clinicians and LLMs tend to miss different things—when they team up, their strengths can cover each other’s diagnostic gaps.
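To make the idea of pooling concrete, here is a minimal sketch of how ranked differential diagnoses from several responders might be merged into one list. It is not the authors' published method: the function name pool_differentials, the optional weights, the 1/rank scoring rule, and the example cases are all assumptions chosen for illustration.

```python
from collections import defaultdict


def pool_differentials(differentials, weights=None):
    """Merge several ranked diagnosis lists into one ranked list.

    differentials: list of lists, each ordered from most to least likely.
    weights: optional per-responder weights (hypothetical; default 1.0 each).
    """
    if weights is None:
        weights = [1.0] * len(differentials)

    scores = defaultdict(float)
    for ranked, weight in zip(differentials, weights):
        for rank, diagnosis in enumerate(ranked, start=1):
            # Assumed scoring rule: each mention earns weight / rank,
            # so higher-ranked diagnoses contribute more.
            scores[diagnosis.strip().lower()] += weight / rank

    # Highest score first; ties broken alphabetically so output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Illustrative, made-up differentials (not data from the study).
clinician = ["pneumonia", "pulmonary embolism", "heart failure"]
llm_a = ["pulmonary embolism", "pneumonia", "pleurisy"]
llm_b = ["pulmonary embolism", "heart failure"]

pooled = pool_differentials([clinician, llm_a, llm_b])
print(pooled)
```

In this toy case the clinician ranks pulmonary embolism second and both LLMs rank it first, so the pooled list puts it on top, while the clinician's heart failure still outranks an option only one LLM raised. A real system would also need to map synonyms to a shared vocabulary before counting, which this sketch only approximates with lowercasing.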
Why does this matter?
LLMs can process huge volumes of medical information, but they also come with baked-in limitations like hallucinations, bias, and a lack of common sense.
According to the authors of this study, these aren’t flaws we can fix with more data, human feedback, or advanced technical tweaks. Instead of aiming to replace clinicians, they suggest a different path: combine the expertise and judgment of physicians with the raw processing power of LLMs.
By letting each do what they do best, we may be able to improve diagnostic accuracy and reduce errors—together.
Who does this impact?
Clinicians or medical students without easy access to experts could use LLMs to support their differential diagnoses. This study found that when a clinician and a single LLM worked together, their combined diagnoses were more accurate than either working alone. Going further, the authors showed that hybrid ensembles—pairing medical students with one or more LLMs—outperformed both individual clinicians and clinician groups.
For safety researchers, this study highlights a key takeaway: partner with informaticists to help close the gap between technical solutions and what clinicians and patients truly need.
About Editor's Picks
Curated by the UCSF CODEX team, each Editor’s Pick features a standout study or article that moves the conversation on diagnostic excellence forward. These pieces offer meaningful, patient-centered insights, use innovative approaches, and speak to the needs of patients, clinicians, researchers, and decision-makers alike. All are selected from respected journals or outlets for their rigor and real-world relevance.