Editor's Pick: Reliability of LLMs as medical assistants

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study.

Bean, A.M., Payne, R.E., Parsons, G. et al. Reliability of LLMs as medical assistants for the general public: a randomized preregistered study. Nat Med 32, 609–615 (2026). https://doi.org/10.1038/s41591-025-04074-y
Nature Medicine
February 9, 2026

Read the paper

__

Q&A Video Featuring Gezzer Ortega, MD, MPH

Watch the full Q&A here.
The teaser video can be found here.

Gezzer Ortega, MD, MPH
Assistant Professor of Surgery
Department of Surgery and the Center for Surgery and Public Health
Mass General Brigham, Harvard Medical School

__

(Note: The responses below are highlights of Dr. Ortega's answers in the Q&A video above.)

What's the point?

This study asks a simple but critical question: can AI actually help people make better medical decisions? I think the answer is surprising. AI performed really well on its own, but when people used it, outcomes didn't improve, and sometimes they got worse. This tells us something really important. The problem isn't just AI accuracy; it's the interaction between humans and AI. In the study, people gave incomplete information, the AI gave inconsistent or easily misunderstood answers, and even when the AI was right, users didn't always act on it.

From a diagnostic perspective, this introduces a new kind of risk, because it's not a knowledge gap but a communication gap. That matters because AI is quickly becoming the front door to healthcare. The takeaway isn't that AI doesn't work; it's that high test scores don't equal real-world safety.

The Bottom Line: If we want AI to improve diagnosis, then we need to design systems that help people ask better questions, interpret answers correctly, and know when to seek care. Because in the end, diagnostic excellence depends not just on what AI knows, but on how humans use it.

Why does this matter? 

Viewed through the diagnostic excellence lens, this study highlights a critical shift. The question is no longer whether AI can be medically accurate, but whether humans and AI can work together safely. The findings expose a new category of diagnostic risk: not the classic cognitive error or system failure, but an interaction failure between the human and the AI. This is important because diagnosis depends on information exchange: history taking, framing, interpretation. LLMs are now entering a space that has traditionally belonged to the clinician, who asks the patient questions and gathers this information. When that interaction [becomes] unreliable, it can really change the diagnostic trajectory for that patient.

AI may change, not just support, the diagnostic process. But without careful design, it risks introducing noise, mis-prioritization, and false assurance. This study really reframes the priorities: we should evaluate communication quality, not just the accuracy of the AI; design systems that guide users toward clinically meaningful input; and treat patient-facing AI as a diagnostic interface, not just an information tool.

Who does this impact?

From a clinician's standpoint, patients increasingly arrive with AI-informed interpretations of their symptoms, and more and more are bringing them to their clinicians. This study recognizes that there is a misalignment between AI advice and patient understanding that we need to work through.

For researchers, it shifts the focus toward human-interaction science, rather than AI accuracy alone. It calls for us to incorporate behavioral, cognitive, and communication frameworks into the evaluation of these technologies. From a research design standpoint, the study shows the need to reflect real-world use, not idealized prompts.

At the health system level, LLMs are becoming the front door to care. Many entities are building medical-assistant AIs and other technology to triage patients, so as we integrate this technology we have to make sure it is safe and that we are not deploying tools that have not had user-centered validation.

For health system and quality leaders, this means recognizing communication as a socio-technical factor in diagnostic safety, developing metrics for AI-related communication breakdowns when these tools are used with our patients, and defining what the patient-AI-clinician handoff looks like.

For patients, AI can inform, but it shouldn't replace clinical judgment. As clinicians, we should encourage symptom communication and appropriate escalation, and educate our patients, as well as our fellow clinicians, on how to contextualize this technology and how to use it. Diagnostic excellence depends on how information moves between people and systems. If we don't design the interaction carefully, even the most powerful AI can fail at the point where it matters most: helping patients make the right decision.

__

Share your thoughts and join the conversation on LinkedIn. Explore more insights from CODEX Director Sumant Ranji, MD, here.

About Editor's Picks

Curated by the UCSF CODEX team, each Editor’s Pick features a standout study or article that moves the conversation on diagnostic excellence forward. These pieces offer meaningful, patient-centered insights, use innovative approaches, and speak to the needs of patients, clinicians, researchers, and decision-makers alike. All are selected from respected journals or outlets for their rigor and real-world relevance.  

View more Editor's Picks here.