Is AI the future of evaluation in medical education?? AI vs. human evaluation in objective structured clinical examination

Tekin, Murat; Yurdal, Mustafa Onur; Toraman, Çetin; Korkmaz, Güneş; Uysal, İbrahim

doi:10.1186/s12909-025-07241-4

Table 5 Mean, standard deviation values, and inter-rater reliability levels of evaluators

Basic Life Support Criteria	Inter-rater reliability: α	Inter-rater reliability: κ	H1 M(Sd)	H2 M(Sd)	RT M(Sd)	AI1 M(Sd)	AI2 M(Sd)
Checked the safety of the environment, themselves, and the patient (verbally stating this is sufficient).	0.267	0.100	0.94 (0.9)	1.51 (0.5)	0.96 (0.9)	2 (0.0)	1.89 (0.4)
Gently touched the patient/injured person’s shoulders and asked, “How are you? Are you okay?” (verbally stating this is sufficient).	0.042	0.038	1.87 (0.5)	1.96 (0.2)	1.94 (0.3)	1.79 (0.4)	1.98 (0.2)
If the patient is unconscious, called for help from the environment and gave the command “Call 112” to someone (verbally stating this is sufficient).	0.017	0.161	1.87 (0.5)	1.96 (0.2)	1.77 (0.5)	1.81 (0.4)	1.89 (0.4)
Checked the mouth, opened the airway using the “head-tilt, chin-lift” maneuver.	0.015	0.023	1.70 (0.6)	1.94 (0.3)	1.74 (0.5)	1.72 (0.5)	1.81 (0.5)
Assessed breathing for no more than 10 s using the “look, listen, feel” method and checked for pulse at the carotid artery.	0.051	0.024	1.60 (0.7)	1.96 (0.2)	1.57 (0.7)	1.87 (0.3)	1.68 (0.6)
Performed effective and correct chest compressions (correct hand position, correct compression point, correct depth, correct speed, and allowing chest recoil).	0.293	-0.014	1.55 (0.6)	2 (0.0)	1.34 (0.7)	1.66 (0.5)	1 (0.6)
After 30 chest compressions, effectively gave 2 rescue breaths with proper head-tilt and chin-lift position (closed the patient’s nostrils while giving breaths).	0.151	0.071	1.87 (0.3)	1.98 (0.2)	1.49 (0.5)	1.55 (0.5)	1.68 (0.5)
Minimized interruptions in chest compressions.	0.046	0.042	1.53 (0.9)	2 (0.0)	1.64 (0.6)	1.83 (0.4)	1.70 (0.6)
Continued performing chest compressions and rescue breaths in a 30/2 ratio for two minutes (or stated that they should do so).	0.143	-0.033	1.68 (0.7)	2 (0.0)	1.77 (0.5)	1.96 (0.2)	1.36 (0.8)
Checked the patient’s breathing and pulse every two minutes (verbally stating this is sufficient).	0.163	-0.003	0.57 (0.8)	1.45 (0.5)	1.28 (0.8)	1.62 (0.7)	1.30 (0.9)

H1: Human Evaluating from Video 1, H2: Human Evaluating from Video 2, RT: Real Time Human Evaluator, AI1: ChatGPT, AI2: Gemini Flash, M: Mean, Sd: Standard Deviation, α: Krippendorff’s alpha, κ: Fleiss’ kappa
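
For readers unfamiliar with the κ column, the following is a minimal sketch of how Fleiss' kappa can be computed for one criterion scored by several evaluators. It is not the authors' analysis code. The function name `fleiss_kappa`, the 0 to 2 scoring scale (inferred from the means in the table), and the example ratings are all assumptions made for illustration; the ratings are invented and are not data from the study.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of subjects, each rated once by every rater.

    ratings    -- list of rows; each row holds one score per rater
    categories -- all possible scores (assumed here to be 0, 1, 2)
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who gave subject i the j-th category
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])

    # Observed agreement per subject, then averaged over subjects
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall proportion of each category
    total = n_subjects * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        # Every rating falls in one category; kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical scores: 4 students (rows) rated by 5 evaluators (columns)
example_ratings = [
    [2, 2, 2, 2, 2],
    [1, 2, 2, 1, 2],
    [0, 1, 0, 2, 1],
    [2, 2, 1, 2, 2],
]
kappa = fleiss_kappa(example_ratings, categories=[0, 1, 2])
print(round(kappa, 3))
```

A value near 0 means agreement is close to what chance alone would produce, and a negative value means observed agreement is below the chance-expected level.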

ISSN: 1472-6920

BMC Medical Education
