
Table 1 Comparison of correct answer percentages between AI platforms and human test-takers

From: The role of artificial intelligence in medical education: an evaluation of Large Language Models (LLMs) on the Turkish Medical Specialty Training Entrance Exam

| Model | Basic Medical Sciences (%) | Clinical Medical Sciences (%) | Overall Accuracy (%) | p-value (vs. Humans) |
|---|---|---|---|---|
| Human Average | 43.03 (SD = 12.45) | 53.29 (SD = 10.82) | 48.16 | - |
| ChatGPT 4 | 85.83 (SD = 8.21) | 91.67 (SD = 6.54) | 88.75 | < 0.001 |
| Llama 3 70B | 79.17 (SD = 9.12) | 79.17 (SD = 7.89) | 79.17 | < 0.01 |
| Gemini 1.5 Pro | 78.33 (SD = 9.45) | 77.50 (SD = 8.32) | 78.13 | < 0.01 |
| Command R+ | 50.00 (SD = 10.23) | 50.00 (SD = 9.87) | 50.00 | 0.12 (Basic), < 0.05 (Clinical) |
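As an illustration of how the overall accuracy column relates to the two section scores, the following is a minimal sketch that combines section percentages into an overall figure. The function name and parameters are illustrative, not from the paper; the sketch assumes a question-count-weighted mean, and since the table does not state the number of questions per section, reported overall values may differ slightly from an equal-weight calculation.

```python
def overall_accuracy(basic_pct: float, clinical_pct: float,
                     basic_n: int = 1, clinical_n: int = 1) -> float:
    """Question-count-weighted mean of the two section accuracies.

    With the default weights (basic_n == clinical_n), this reduces to a
    simple average of the two section percentages.
    """
    total = basic_n + clinical_n
    return (basic_pct * basic_n + clinical_pct * clinical_n) / total

# With equal weights this reproduces e.g. ChatGPT 4's overall score:
print(round(overall_accuracy(85.83, 91.67), 2))  # 88.75
```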