Editor's Pick: Human-AI collaboratives improve DxEx

“Human–AI collectives most accurately diagnose clinical vignettes”

Nikolas Zöller, Julian Berger, Irving Lin, Nathan Fu, Jayanth Komarneni, Gioele Barabucci, Kyle Laskowski, Victor Shia, Benjamin Harack, Eugene A. Chuh, Vito Trianni, Ralf H. J. M. Kurvers, and Stefan M. Herzog
Proceedings of the National Academy of Sciences (PNAS)
June 13, 2025  

Read the paper

What's the point?

Can clinicians and AI work better together than alone? This new study suggests yes.

The research team developed a collective intelligence method to combine diagnostic responses from both clinicians and large language models (LLMs). They used a database of over 2,000 case vignettes and 40,000 differential diagnoses submitted by physicians through the Human Diagnosis Project. Five LLMs were then prompted with the same cases and patient information to generate their own diagnoses. The authors then simulated diagnostic accuracy across three groups: physicians alone, LLMs alone, and hybrid collectives combining both.
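To make the idea of pooling differentials concrete, here is a minimal sketch of one way ranked diagnosis lists from physicians and LLMs could be merged into a single collective ranking. It is illustrative only: the reciprocal-rank scoring rule, the function name `aggregate_differentials`, the `top_k` cutoff, and the example diagnoses are all assumptions made for this sketch, not the method or data reported in the paper.

```python
from collections import defaultdict


def aggregate_differentials(differentials, top_k=5):
    """Merge ranked differential diagnoses into one collective ranking.

    Each differential is a list of diagnosis strings, most likely first.
    Assumed scoring rule (hypothetical, not taken from the paper): a
    diagnosis at 1-based rank r earns 1/r points, and points are summed
    across all respondents, whether human or LLM.
    """
    scores = defaultdict(float)
    for ranked in differentials:
        seen = set()
        for rank, dx in enumerate(ranked[:top_k], start=1):
            key = dx.strip().lower()  # crude normalization of free text
            if key in seen:           # count each diagnosis once per respondent
                continue
            seen.add(key)
            scores[key] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up example lists, not cases from the study.
physician = ["pneumonia", "pulmonary embolism", "heart failure"]
llm_a = ["pulmonary embolism", "pneumonia", "pleurisy"]
llm_b = ["pulmonary embolism", "heart failure", "pneumonia"]

collective = aggregate_differentials([physician, llm_a, llm_b])
print(collective[:3])
```

In this toy case the collective's top diagnosis differs from the physician's own first choice, which shows how respondents who rank things differently can shift the pooled answer. A real pipeline would also need to map free-text diagnoses to a shared vocabulary before any votes are counted.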

The results? Hybrid collectives outperformed individual clinicians, standalone LLMs, and even groups of clinicians or LLMs on their own.

The Bottom Line: Clinicians and LLMs tend to miss different things—when they team up, their strengths can cover each other’s diagnostic gaps.

Why does this matter?

LLMs can process huge volumes of medical information, but they also come with baked-in limitations like hallucinations, bias, and a lack of common sense.

According to the authors of this study, these aren’t flaws we can fix with more data, human feedback, or advanced technical tweaks. Instead of aiming to replace clinicians, they suggest a different path: combine the expertise and judgment of physicians with the raw processing power of LLMs.

By letting each do what they do best, we may be able to improve diagnostic accuracy and reduce errors—together.

Who does this impact?

Clinicians or medical students without easy access to experts could use LLMs to support their differential diagnoses. This study found that when a clinician and a single LLM worked together, their combined diagnoses were more accurate than either working alone. Going further, the authors showed that hybrid ensembles—pairing medical students with one or more LLMs—outperformed both individual clinicians and clinician groups.

For safety researchers, this study highlights a key takeaway: partner with informaticists to help close the gap between technical solutions and what clinicians and patients truly need.


Share your thoughts and join the conversation on LinkedIn.

About Editor's Picks

Curated by the UCSF CODEX team, each Editor’s Pick features a standout study or article that moves the conversation on diagnostic excellence forward. These pieces offer meaningful, patient-centered insights, use innovative approaches, and speak to the needs of patients, clinicians, researchers, and decision-makers alike. All are selected from respected journals or outlets for their rigor and real-world relevance.

View more Editor's Picks here.

