The role of artificial intelligence in medical education: an evaluation of Large Language Models (LLMs) on the Turkish Medical Specialty Training Entrance Exam

Table 5 Confidence intervals and effect sizes

Model	p-value	95% CI for Accuracy	Effect Size (Cramér's V)	Interpretation
ChatGPT 4	0.00028	[85.2%, 92.3%]	0.45 (large effect)	Statistically significant and practically meaningful. High accuracy across all levels
Gemini 1.5 Pro	0.047	[74.8%, 81.5%]	0.32 (medium effect)	Statistically significant with moderate practical importance
Command R+	0.197	[45.6%, 54.4%]	0.18 (small effect)	Not statistically significant. Performance is less consistent and less impactful
Llama 3 70B	0.118	[75.1%, 83.2%]	0.28 (medium effect)	Not statistically significant but shows moderate practical importance
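
The quantities in Table 5 can be illustrated with a minimal sketch. The table does not state which interval method or which contingency table was used, so the choices here are assumptions: a Wilson score interval for the accuracy, and Cramér's V computed from a correct/incorrect-by-group table of counts. The function names and the counts in the usage example are hypothetical and are not taken from the study.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%).

    Assumption: the source does not say which CI method it used.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of rows.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


if __name__ == "__main__":
    # Hypothetical counts: rows are question groups, columns are correct/incorrect.
    counts = [[90, 10], [80, 20], [60, 40]]
    n_correct = sum(row[0] for row in counts)
    n_total = sum(sum(row) for row in counts)
    low, high = wilson_interval(n_correct, n_total)
    print(f"accuracy 95% CI: [{low:.1%}, {high:.1%}]")
    print(f"Cramér's V: {cramers_v(counts):.2f}")
```

Under this reading, the p-value comes from the chi-square test on the same table, while Cramér's V rescales the chi-square statistic by sample size, which is why a result can miss the significance threshold and still show a medium effect, as in the Llama 3 70B row.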

ISSN: 1472-6920