Editor's Pick: Study Finds AI Medical Tools Show Bias, Potential for Misdiagnosis and Patient Harm

“Sociodemographic biases in medical decision making by large language models”

Mahmud Omar, Shelly Soffer, Reem Agbareia, Nicola Luigi Bragazzi, Donald U. Apakama, Carol R. Horowitz, Alexander W. Charney, Robert Freeman, Benjamin Kummer, Benjamin S. Glicksberg, Girish N. Nadkarni & Eyal Klang
Nature Medicine
April 7, 2025

Read the paper

What’s the point?

Artificial intelligence (AI) systems learn from human-generated data that can contain biases, raising questions about how prejudices based on a person’s identity can show up in AI-generated medical evaluations. This study examined how AI tools, specifically large language models (LLMs), make medical recommendations. Researchers tested nine models on 1,000 emergency department cases, keeping the clinical presentation the same but changing details such as the patient’s race, gender, sexuality, income, or housing status. Across more than 1.7 million AI-generated responses, the recommendations often changed based on these personal characteristics rather than the actual health condition.
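To make that setup concrete, here is a minimal, hypothetical sketch in Python of the kind of counterfactual-vignette test the study describes: the clinical content of a case stays fixed while only the sociodemographic label changes, and the model’s recommendations are then compared across versions. The vignette wording, the identity labels, and the `query_llm` placeholder are illustrative assumptions, not the authors’ actual pipeline or data.

```python
# Illustrative sketch (not the authors' code): build counterfactual versions of the
# same clinical vignette that differ only in a sociodemographic label, send each to
# an LLM, and tally how often the recommended triage level changes.
from collections import Counter

VIGNETTE = (
    "A 45-year-old {identity} patient presents to the emergency department with "
    "two hours of substernal chest pressure radiating to the left arm. "
    "Recommend a triage level (emergent, urgent, or routine) and the next diagnostic step."
)

# Hypothetical labels; the study varied race, gender, sexuality, income, and housing status.
IDENTITY_LABELS = ["", "Black", "unhoused", "high-income", "LGBTQIA+"]

def query_llm(prompt: str) -> str:
    """Placeholder for a real model call; plug in whatever LLM client you use."""
    raise NotImplementedError("Supply your own LLM client here.")

def run_counterfactuals() -> Counter:
    triage_by_label = Counter()
    for label in IDENTITY_LABELS:
        # The baseline version carries no demographic label.
        prompt = VIGNETTE.format(identity=label or "adult")
        response = query_llm(prompt)
        # Naive parse: record the first triage keyword found in the response.
        triage = next(
            (t for t in ("emergent", "urgent", "routine") if t in response.lower()),
            "unparsed",
        )
        triage_by_label[(label or "baseline", triage)] += 1
    return triage_by_label
```

In the actual study, the researchers compared more than 1.7 million such responses across nine models, and against what physicians recommended for the same cases.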

For example, patients labeled as Black, unhoused, or LGBTQIA+ were much more likely to be sent for urgent care, invasive procedures, or mental health evaluations, even when that wasn’t clinically necessary. Meanwhile, patients labeled as high-income were far more likely to be offered advanced tests like MRIs or CT scans, compared to those labeled as lower income.

These differences did not align with what physicians would recommend, suggesting that AI could reinforce harmful biases.

The Bottom Line: In a study analyzing over 1.7 million AI-generated vignette responses, researchers found that race, gender, income, and housing status influenced evaluation and treatment recommendations—even when patients had the same health conditions. These findings raise the concern that LLM-generated advice could reinforce stereotypes and potentially lead to misdiagnosis and patient harm.

Why does this matter?

This is the largest study so far to examine whether AI tools used in healthcare treat patients differently based on sociodemographic characteristics like race, gender, income, housing status, or sexual orientation. If AI systems make biased recommendations, it could lead to unfair or even harmful care. While some prompts can help reduce this bias, they don’t fix it entirely.

For example, some groups were more often recommended urgent care or mental health evaluations, even when those steps weren’t clinically necessary. These early decisions, such as how urgent a case is or whether to order more tests, are critical to reaching the right diagnosis and central to diagnostic excellence: triage, additional testing, and follow-up are key components of the pathway to an excellent diagnostic outcome.

The study’s findings highlight the urgent need to evaluate and monitor AI tools to ensure they support equitable, timely, and high-quality healthcare.

Who does this impact?

  • Healthcare professionals can use these findings to better understand the risks and build safeguards when using AI tools. Notably, mitigation prompting reduced bias in 67% of GPT-4o’s responses but did not eliminate it, underscoring that clinicians must proactively recognize how explicit and implicit bias can appear in AI-generated medical recommendations.
  • Health system leaders and administrators can cite this study to support ongoing evaluation and monitoring of AI tools, helping identify and address unfair patterns in care recommendations and ensure equitable, clinically appropriate outcomes for all patients.
  • Patients should be aware that including demographic details in AI-powered systems may influence the care suggested.
  • Policymakers can use this evidence to push for greater transparency and fairness in how these tools are built, audited, and used.

Join the Conversation

We’d love to hear your take—join the conversation with us on LinkedIn or Bluesky and share your thoughts!

About Editor's Picks

Curated by the UCSF CODEX team, each Editor’s Pick features a standout study or article that moves the conversation on diagnostic excellence forward. These pieces offer meaningful, patient-centered insights, use innovative approaches, and speak to the needs of patients, clinicians, researchers, and decision-makers alike. All are selected from respected journals or outlets for their rigor and real-world relevance.