CODEX Digest - 9.4.25
Want this delivered straight to your inbox every Thursday? Subscribe now.
This week’s picks include an analysis of medical malpractice data, a study revealing AI falls short in interpreting ECGs, and an evaluation of racial disparities in insurance coverage for detecting breast cancer. Also featured this week are qualitative studies capturing the perspectives of endometriosis patients, medical students, and healthcare professionals using AI.
Here are this week's must-reads:
Titles link to the PubMed record or free-to-access sites with full text availability.
Fidelity of medical reasoning in large language models.
Bedi S, Jiang Y, Chung P, et al. JAMA Netw Open. 2025;8(8):e2526021.
Whether AI can actively reason rather than merely recognize patterns when answering questions is an important distinction in determining its safety for frontline care. In this cross-sectional study, six LLMs were tested on a standardized medical benchmark assessment in which the correct answer was replaced with “None of the other answers.” All models became less accurate, raising doubts about whether AI is ready to work autonomously in clinical settings.
Demystifying cognitive bias in the diagnostic process for frontline clinicians and educators; new words for old ideas. (subscription required)
Cunningham N, Cook H, Leach D, et al. Diagnosis (Berl). 2025;12(3):322-332.
There are numerous, complicated typologies of cognitive bias, making it challenging for learners to incorporate awareness of bias into their clinical reasoning. This commentary shares a tool called “The Idiom’s Guide to Cognitive Bias” that simplifies the language around cognitive biases. The guide pairs language with images to strengthen clinical reasoning skills while making cognitive bias in diagnosis easier to recognize and discuss.
The effect of medical explanations from large language models on diagnostic accuracy in radiology. (This is a preprint that has not gone through peer review.)
Feuerriegel S, Spitzer P, Hendriks D, et al. Res Sq. Epub 2025 Aug 11.
Clinicians and patients are more likely to accept AI-generated decisions when they can understand, in accessible language, how an LLM arrived at its results. This preprint reports a randomized experiment in which radiologists received LLM advice with different levels of explanation. Participants who received LLM-generated chain-of-thought explanations achieved the highest diagnostic performance. Further analysis showed that when the LLM-generated diagnosis was incorrect, participants in the chain-of-thought explanation group were more likely to override the LLM’s advice than participants in the differential diagnosis group.
Karavadra B, Semlyen J, Morris E, et al. BMC Womens Health. 2025;25(1):319.
Endometriosis is a common condition known to challenge diagnostic excellence. This interview study drew on the lived experiences of 15 endometriosis patients in the UK to explore reasons for diagnostic delay, developing a framework centered on psychological contexts involving stigma, self-care behavior, relationships with clinicians, and the healthcare system. This knowledge can help address beliefs and behaviors that contribute to delayed diagnosis.
Exploring emergency department providers’ uncertainty in neurological clinical reasoning. (subscription required)
Lee AM, Brown KR, Durning SJ, et al. Diagnosis (Berl). 2025;12(3):424-431.
Patients presenting to the emergency room (ER) with neurological complaints can pose diagnostic reasoning challenges. This study presented physicians with a series of clinical vignettes depicting neurological vs. non-neurological cases presenting to the ER. The neurological cases elicited greater self-reported uncertainty and anxiety about diagnosis and management of the condition. This implies that building physicians’ capacity to manage neurological conditions may enhance triage of neurologic issues in the ER.
Racial differences in screening eligibility by breast density after state-level insurance expansion.
Mahmoud MA, Ehsan S, Ginzberg SP, et al. JAMA Netw Open. 2025;8(8):e2525216.
Insurance coverage can affect access to screening programs that support early diagnosis of cancer. This cross-sectional study examined 68,478 Black and White women between 2015 and 2021 in a state that mandates insurance coverage of MRI for those at “higher risk” for breast cancer. It found that using breast density and lifetime risk (calculated with a race-based risk calculator) to decide who receives supplemental screening missed many patients at risk. Black women, who generally had lower density and risk scores, rarely qualified for insurance-covered MRI, and the criteria produced a high rate of false negatives. Current risk scores and breast density rules for MRI may not accurately reflect breast cancer risk for Black women and may contribute to racial disparities in diagnosis.
Identification of neurological text markers associated with risk of stroke.
Mayampurath A, Rosado A, Romo E, et al. J Stroke Cerebrovasc Dis. 2025;34(8):108376.
Timeliness of stroke diagnosis is crucial to preventing patient harm. This retrospective analysis used natural language processing to analyze emergency department (ED) visit notes for text markers that could have flagged stroke occurring within 30 days of the ED encounter. Fifty-eight concepts were identified that may be useful indicators to prompt additional evaluation and avert stroke.
An electronic trigger to detect telemedicine-related diagnostic errors. (subscription required)
Murphy DR, Kadiyala H, Wei L, et al. J Telemed Telecare. 2025;31(7):1050-1055.
Despite increased access to telediagnosis, mechanisms to measure the safety of the practice are limited. This project used the Safer Dx Trigger tool to develop a process for reviewing EHR data and finding factors that delay diagnosis during primary care telehealth visits at a large Department of Veterans Affairs site. The e-trigger algorithm identified instances in which patients returned within 10 days of a telemedicine visit. Further analysis of 100 randomly selected visits revealed that 11% involved a missed diagnostic opportunity during a telehealth visit.
Mutlak Z, Saqer N, Chan SCC, et al. J Clin Med. 2025;14(12):4139.
Behaviors that improve diagnostic excellence can be established during medical training through discussion of error. This mixed-methods study of 65 final-year medical students examined the value of an educational strategy that uses diagnostic case studies to surface cognitive biases and systemic issues as contributors to missteps. The results indicate that encouraging learning from diagnostic mistakes can improve medical students’ confidence and self-awareness.
Nouis SC, Uren V, Jariwala S. BMC Med Ethics. 2025;26(1):89.
Ethical deployment of AI in clinical medicine is a universal concern. This qualitative study examines the perspectives of healthcare professionals from a variety of facilities in one NHS trust. While the findings are not intended to be generalizable, the research illustrates participants’ views on the potential of AI to improve efficiency and decision-making, while documenting weaknesses in accountability, transparency, and ethics that could reduce diagnostic excellence and provider acceptance of AI.
Hidden in Plain Sight: Exposing the Drivers of Diagnostic Error. Part II: Office Based Practice.
Siegal D, Small M, Montminy SL, et al. Coverys; 2025.
This analysis of proprietary medical malpractice data over a five-year period found that office-based practices were associated with the largest percentage of diagnosis-related claims. Among the results, the data show that cancer was the most frequently missed diagnosis, and process weaknesses leading to missed diagnoses and diagnostic errors were most often identified in the initial patient evaluation phase. Orthopedics was the specialty most commonly involved in claims related to surgical events. The included self-assessment enables outpatient practices to surface weaknesses in their diagnostic processes and inform improvement priorities.
Beyond text: the impact of clinical context on GPT-4’s 12-lead electrocardiogram interpretation accuracy. (subscription required)
Zeljkovic I, Novak A, Lisicic A, et al. Can J Cardiol. 2025;41(7):1406-1414.
Electrocardiograms (ECGs) are a mainstay of health assessment, and their accurate interpretation is key to effective diagnosis of heart conditions. This cross-sectional observational study used 150 12-lead ECGs spanning various cardiac conditions to assess the accuracy of GPT-4 interpretations with and without additional clinical information. The researchers found that clinical context enhanced the accuracy of AI interpretation of ECGs; however, GPT-4 alone is not sufficiently accurate to interpret ECGs.
About the CODEX Digest
Stay current with the CODEX Digest, which cuts through the noise to bring you a list of recent must-read publications handpicked by the Learning Hub team. Each edition features timely, relevant, and impactful journal articles, books, reports, studies, reviews, and more selected from the broader CODEX Collection—so you can spend less time searching and more time learning.
Get the latest in diagnostic excellence, curated and delivered straight to your inbox every week:
See past digests here.
