Performance of single-agent and multi-agent language models in Spanish language medical competency exams

BMC Medical Education

Table 1 Performance metrics for all evaluated strategies on the EUNACOM Exam. Mean scores, standard deviations (SD), API calls, and mean completion time (in seconds) are shown

Category	Strategy	Accuracy (Mean % ± SD)	API Calls	Time (s)
Single-agent	CoT + Few-Shot	87.67% ± 0.12%	1.00	1.74
	Few-Shot	86.88% ± 0.40%	1.00	1.61
	CoT	86.86% ± 0.37%	1.00	2.26
	MEDPROMPT	86.96% ± 0.44%	1.00	2.95
	SELF-REFLECTION	85.38% ± 0.22%	2.65	4.15
	ZERO-SHOT	85.90% ± 0.32%	1.00	1.53
Multi-agent	MDAGENTS	89.97% ± 0.56%	21.14	192.44
	MEDAGENTS	87.99% ± 0.49%	17.00	63.95
	VOTING	87.22% ± 0.31%	6.00	12.51
	BORDA COUNT	86.70% ± 0.18%	6.00	13.03
	Weighted Voting	86.68% ± 0.18%	6.00	12.43

ISSN: 1472-6920