Can we trust chatbots for tacrolimus? A STROBE-aligned multimodel benchmark of large language models for drug information in kidney transplantation: LLM reliability in transplant pharmacology

İlyas Kudaş

doi:10.28982/josam.8536

Authors

İlyas Kudaş, University of Health Sciences, Cam Sakura Training and Research Hospital, Department of General Surgery, Istanbul, Turkey. ORCID: https://orcid.org/0000-0002-5499-5168

Keywords:

large language models, kidney transplantation, immunosuppression, drug–drug interactions, tacrolimus, therapeutic drug monitoring, pharmacology, medication safety

Abstract

Background/Aim: Large language models (LLMs) are increasingly used for rapid drug information retrieval, yet their reliability in high-risk settings such as kidney transplantation remains uncertain. Immunosuppressants have narrow therapeutic indices and clinically consequential drug–drug interactions (DDIs), making even small factual errors potentially harmful.

Methods: We performed a cross-sectional, head-to-head benchmark of four LLMs (GPT-5.1, GPT-4.1, Gemini, Claude) using 150 standardized prompts derived from KDIGO transplant guidance and pharmacology reference standards. Prompts covered four domains: drug mechanism/explanation, major DDIs, dosing principles/therapeutic drug monitoring, and toxicity profiles. Each model produced 150 responses (600 total). Responses were blinded, randomized, and independently scored by two transplant pharmacists and one senior transplant physician using a three-tier rubric: accurate/actionable (Score 2), safe but non-actionable generalization (Score 1), and factual error/hallucination (Score 0). Disagreements were resolved by consensus. Primary outcomes were overall accuracy (Score 2 proportion) and unsafe error rate (Score 0 proportion).
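The two primary outcomes are simple proportions over the consensus scores, and a minimal Python sketch may make the definitions concrete. The function name `summarize_scores` and the example data are hypothetical illustrations, not the authors' analysis code (the reference list suggests the study itself used R).

```python
from collections import Counter


def summarize_scores(scores):
    """Return (accuracy, unsafe_error_rate) for consensus rubric scores.

    accuracy          = proportion of Score 2 (accurate/actionable)
    unsafe_error_rate = proportion of Score 0 (factual error/hallucination)
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("scores must use the three-tier rubric: 0, 1, or 2")
    counts = Counter(scores)
    n = len(scores)
    return counts[2] / n, counts[0] / n


# Hypothetical consensus scores for one model (illustrative only, not study data).
example_scores = [2] * 8 + [1] * 1 + [0] * 1
accuracy, unsafe_rate = summarize_scores(example_scores)
print(f"accuracy={accuracy:.1%}, unsafe error rate={unsafe_rate:.1%}")
```

Score 1 responses count toward neither outcome, so a model can have a low unsafe error rate and still fall short on accuracy if it tends to answer in safe generalities.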

Results: Inter-rater agreement was excellent (Cohen’s κ=0.88). Overall accuracy ranged from 85.3% to 91.3% across models, with low unsafe error rates (1.3%–4.7%). Across domains, performance was highest for foundational mechanism questions, whereas dosing principles and major DDIs generated more Score 1 responses (safe but insufficiently detailed).
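For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Cohen's κ for a single pair of raters, written in Python. It assumes unweighted κ over the three rubric categories; the abstract does not state how agreement was summarized across the three raters, so the function `cohens_kappa` and the example ratings are hypothetical and not a reconstruction of the authors' calculation.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for ten responses (illustrative only, not study data).
pharmacist_1 = [2, 2, 2, 2, 1, 1, 0, 0, 2, 1]
pharmacist_2 = [2, 2, 2, 2, 1, 1, 0, 0, 1, 1]
print(round(cohens_kappa(pharmacist_1, pharmacist_2), 3))
```

Because κ corrects for chance agreement, it is lower than raw percent agreement whenever one score dominates, as Score 2 does in this benchmark.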

Conclusion: LLMs demonstrated high but not fail-safe performance for kidney transplant pharmacology. Given the residual unsafe errors and the variability in actionable specificity, LLM outputs should be used only as adjunctive support, with pharmacist or physician verification before clinical decisions.
