Insights from CODEX Director Sumant Ranji, MD, SFHM: Analyzing New Research on LLMs and Patient Triage


Large language models (LLMs) can accurately diagnose a broad range of conditions in simulated studies when they are provided with a clinical case presentation containing a structured, curated summary of a patient’s clinical information. Because clinicians are trained to synthesize patient data into a standard format, LLMs may be able to augment clinicians’ diagnostic accuracy in real-world patient encounters. This year has seen rapid growth in patient-facing diagnostic AI applications, which are not intended to replace clinicians but are designed to make healthcare more accessible by offering patients advice based on the symptoms and medical history they report. Two studies published this month in Nature Medicine tested how well LLMs performed when patients input their symptoms themselves – a scenario that is already common, as 16% of US adults report consulting an AI chatbot for health advice at least monthly.

In the first study, by Bean et al, the investigators presented adult participants with common medical scenarios and instructed them to decide whether and how to seek additional care (such as going to the emergency room or making an appointment with their primary care physician). The participants were randomized to either use an LLM for advice or use any non-LLM resources they would typically consult. The investigators found that while the LLM provided an accurate differential diagnosis and triage advice on its own, accuracy dropped dramatically when real human users supplied the information. This appeared to be due to both human factors (users inputting incomplete information or misinterpreting the initial LLM output) and model factors (the LLM focusing narrowly on a single term or giving inconsistent responses to similar symptoms).

The second study, by Ramaswamy et al, assessed the ability of the newly released ChatGPT Health to provide accurate triage recommendations in response to vignettes representing a range of medical conditions, including both lower-acuity symptoms (e.g., sore throat) and higher-acuity symptoms (e.g., acute abdominal pain). Concerningly, the authors found that ChatGPT Health systematically under-triaged medical emergencies, recommending that the patient seek care within 24-48 hours in over 50% of cases where the correct advice would have been to seek emergency care immediately. This poses a clear patient safety risk and raises the question of whether the model was adequately trained on relatively uncommon but high-morbidity conditions. The model also failed to consistently provide crisis intervention resources in a case involving a patient contemplating suicide. Across all scenarios, ChatGPT Health’s triage advice was more accurate when objective data (vital signs or lab results) were included in the vignettes, and the model did not exhibit the biases noted in other studies of LLMs’ health advice.

There is a great deal of hype around using LLMs for triage and diagnosis, but as these studies show, the accuracy of an LLM's advice is highly dependent on the information it receives. At least thus far, LLMs do well when given data in a classic case presentation format but may not be helpful (and are potentially unsafe) when given unstructured or incomplete data. ChatGPT Health in particular failed to accurately triage medical emergencies, potentially putting patients at risk.

LLMs should be able to help clinicians improve their diagnostic accuracy, but the utility of their advice for the general public remains in question. At this time, patients should not use publicly available LLMs for diagnostic or triage advice.

These studies also highlight the enduring role of core clinical skills in diagnostic accuracy in the AI era. Learning how to tease out a diagnosis when faced with unstructured and incomplete data is a central focus of health professions education and is what practicing clinicians do every day. LLMs will continue to improve, but the litmus test for patient-facing diagnostic AI applications will be how they perform in real-world triage situations – where currently human clinicians have the upper hand.

Let us know your thoughts and join the conversation on LinkedIn.

__

Read the studies:

1) Bean, A.M., Payne, R.E., Parsons, G. et al. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nat Med 32, 609–615 (2026). https://doi.org/10.1038/s41591-025-04074-y

2) Ramaswamy, A., Tyagi, A., Hugo, H. et al. ChatGPT Health performance in a structured test of triage recommendations. Nat Med (2026). https://doi.org/10.1038/s41591-026-04297-7

///


Sumant Ranji, MD
Director, UCSF CODEX
Director of Quality Improvement and Patient Safety, Division of Hospital Medicine at the Zuckerberg San Francisco General Hospital (ZSFG)

Dr. Ranji is a renowned leader in patient safety, quality improvement, and medical education. He first joined UCSF as a fellow in hospital medicine and clinical research, then went on to become a faculty member in the Division of Hospital Medicine at UCSF Health, where he held multiple leadership positions. As director of CODEX, Sumant leads the center’s strategic planning and the execution of its work to champion diagnostic excellence research and build awareness of and engagement with the field.

Learn more about Sumant here

About UCSF CODEX (Coordinating Center for Diagnostic Excellence)

Every person deserves access to an accurate and timely diagnosis. At CODEX, we never stop working toward making that a reality. We serve as a national coordinating entity, engaging the diagnostic excellence community to promote novel findings, catalyze action, and advance the field. Our mission is to lead change in the field of diagnostic excellence by facilitating activities that result in measurable improvement in diagnostic quality, safety, and equity.

Follow Us
LinkedIn: https://www.linkedin.com/company/ucsfcodex
Mailing List: https://ucsf.co1.qualtrics.com/jfe/form/SV_9zU2iVMliYFqBvg