Table 5 Confidence intervals and effect sizes

From: The role of artificial intelligence in medical education: an evaluation of Large Language Models (LLMs) on the Turkish Medical Specialty Training Entrance Exam

| Model | p-value | 95% CI for Accuracy | Effect Size (Cramér's V) | Interpretation |
| --- | --- | --- | --- | --- |
| ChatGPT 4 | 0.00028 | [85.2%, 92.3%] | 0.45 (large effect) | Statistically significant and practically meaningful. High accuracy across all levels. |
| Gemini 1.5 Pro | 0.047 | [74.8%, 81.5%] | 0.32 (medium effect) | Statistically significant with moderate practical importance. |
| Command R+ | 0.197 | [45.6%, 54.4%] | 0.18 (small effect) | Not statistically significant. Performance is less consistent and impactful. |
| Llama 3 70B | 0.118 | [75.1%, 83.2%] | 0.28 (medium effect) | Not statistically significant but shows moderate practical importance. |
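
The table does not say how the confidence intervals or effect sizes were computed. As a rough illustration only, the sketch below shows one common way to obtain quantities of this kind: a Wilson score interval for an accuracy proportion, and Cramér's V from a chi-square statistic on a contingency table. The choice of the Wilson method, the function names, and all counts are assumptions made for this example; none of them are taken from the article.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def cramers_v(table):
    """Cramér's V for an r x c table of counts (list of rows).

    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


if __name__ == "__main__":
    # Hypothetical counts, NOT from the article: 178 correct answers out of 200.
    low, high = wilson_interval(178, 200)
    print(f"95% CI for accuracy: [{low:.1%}, {high:.1%}]")

    # Hypothetical 2 x 3 table: rows are correct/incorrect,
    # columns are three made-up question categories.
    counts = [[60, 55, 40], [10, 15, 30]]
    print(f"Cramér's V: {cramers_v(counts):.2f}")
```

With real data, the first row of such a table would hold a model's correct-answer counts per category and the second row its incorrect-answer counts. The "small", "medium", and "large" labels in the table are the article's own interpretation and are not derived by this sketch.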