
Table 1 Comparison of correct answer percentages between AI platforms and human test-takers

From: The role of artificial intelligence in medical education: an evaluation of Large Language Models (LLMs) on the Turkish Medical Specialty Training Entrance Exam

| Model | Basic Medical Sciences (%) | Clinical Medical Sciences (%) | Overall Accuracy (%) | p-value (vs. Humans) |
|---|---|---|---|---|
| Human Average | 43.03 (SD = 12.45) | 53.29 (SD = 10.82) | 48.16 | - |
| ChatGPT 4 | 85.83 (SD = 8.21) | 91.67 (SD = 6.54) | 88.75 | < 0.001 |
| Llama 3 70B | 79.17 (SD = 9.12) | 79.17 (SD = 7.89) | 79.17 | < 0.01 |
| Gemini 1.5 Pro | 78.33 (SD = 9.45) | 77.50 (SD = 8.32) | 78.13 | < 0.01 |
| Command R+ | 50.00 (SD = 10.23) | 50.00 (SD = 9.87) | 50.00 | 0.12 (Basic), < 0.05 (Clinical) |
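As an illustration of how the overall accuracy column relates to the two section scores, the following is a minimal sketch that combines section percentages into an overall figure. The function name and parameters are illustrative, not from the paper; the sketch assumes a question-count-weighted mean, and since the table does not state the number of questions per section, reported overall values may differ slightly from an equal-weight calculation.

```python
def overall_accuracy(basic_pct: float, clinical_pct: float,
                     basic_n: int = 1, clinical_n: int = 1) -> float:
    """Question-count-weighted mean of the two section accuracies.

    With the default weights (basic_n == clinical_n), this reduces to a
    simple average of the two section percentages.
    """
    total = basic_n + clinical_n
    return (basic_pct * basic_n + clinical_pct * clinical_n) / total

# With equal weights this reproduces e.g. ChatGPT 4's overall score:
print(round(overall_accuracy(85.83, 91.67), 2))  # 88.75
```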