Table 1 Comparison of GPT-4o and Claude-3 performance to Family Medicine residents on questions with diagnostic uncertainty

From: The future of AI clinicians: assessing the modern standard of chatbots and their approach to diagnostic uncertainty

  

| Category         | GPT-4o, n (%) | Claude-3, n (%) | PGY 1 Residents % Correct (95% CI), N = 160 | PGY 2 Residents % Correct (95% CI), N = 160 | PGY 1 + 2 Residents % Correct (95% CI), N = 220 |
|------------------|---------------|-----------------|---------------------------------------------|---------------------------------------------|--------------------------------------------------|
| Overall          | 48 (53.3)     | 52 (57.7)       | 61.1 (58.4–63.7)                            | 63.3 (60.7–66.1)                            | 62.2 (59.6–64.9)                                 |
| Exam Category    |               |                 |                                             |                                             |                                                  |
| Cardiovascular   | 8 (80.0)      | 5 (50.0)        | 63.8 (60.6–67.1)                            | 65.2 (62.7–68.3)                            | 64.5 (61.7–67.7)                                 |
| Gastrointestinal | 7 (70.0)      | 5 (50.0)        | 64.1 (61.7–67.2)                            | 67.3 (65.2–70.1)                            | 65.7 (63.5–68.7)                                 |
| Geriatric Care   | 4 (40.0)      | 7 (70.0)        | 58.7 (55.4–61.2)                            | 60.4 (58.5–63.5)                            | 59.6 (57.0–62.4)                                 |
| Endocrine        | 6 (60.0)      | 5 (50.0)        | 65.2 (63.8–66.7)                            | 66.3 (64.9–68.5)                            | 65.8 (64.4–67.6)                                 |
| Mental Health    | 3 (30.0)      | 8 (80.0)        | 51.3 (48.5–54.8)                            | 53.4 (50.7–56.8)                            | 52.4 (49.6–55.8)                                 |
| MSK              | 5 (50.0)      | 5 (50.0)        | 60.7 (58.2–62.4)                            | 62.9 (59.9–65.7)                            | 61.8 (59.1–64.1)                                 |
| Pediatric        | 5 (50.0)      | 5 (50.0)        | 65.9 (63.2–68.4)                            | 68.6 (65.4–71.2)                            | 67.3 (64.3–69.8)                                 |
| Respiratory      | 6 (60.0)      | 5 (50.0)        | 64.5 (61.8–67.3)                            | 67.9 (64.1–70.2)                            | 66.2 (63.0–68.8)                                 |
| Women’s Health   | 4 (40.0)      | 7 (70.0)        | 55.3 (52.5–57.9)                            | 57.4 (55.2–60.4)                            | 56.4 (53.9–59.2)                                 |
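As a minimal illustration of how a 95% confidence interval for a single proportion can be computed, here is a normal-approximation (Wald) sketch in Python. The table does not state the study's actual CI method, which may differ (e.g. it may account for clustering across residents or use an exact binomial interval); the example assumes the chatbot totals are out of 90 questions, consistent with 48/90 ≈ 53.3%.

```python
from math import sqrt

def prop_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion.

    p_hat: observed proportion correct
    n: number of questions
    z: critical value (1.96 for a two-sided 95% interval)
    """
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se

# Illustration: GPT-4o's overall score, 48 of 90 questions correct (53.3%)
lo, hi = prop_ci(48 / 90, 90)
print(f"{48 / 90:.1%} (95% CI {lo:.1%}–{hi:.1%})")
```

With only 90 questions per chatbot, the resulting interval is wide, which is why single-run chatbot percentages should be compared to the resident CIs with caution.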