Model | Basic Medical Sciences (%) | Clinical Medical Sciences (%) | Overall Accuracy (%) | p-value (vs. Humans) |
---|---|---|---|---|
Human Average | 43.03 (SD = 12.45) | 53.29 (SD = 10.82) | 48.16 | - |
ChatGPT 4 | 85.83 (SD = 8.21) | 91.67 (SD = 6.54) | 88.75 | < 0.001 |
Llama 3 70B | 79.17 (SD = 9.12) | 79.17 (SD = 7.89) | 79.17 | < 0.01 |
Gemini 1.5 Pro | 78.33 (SD = 9.45) | 77.50 (SD = 8.32) | 78.13 | < 0.01 |
Command R+ | 50.00 (SD = 10.23) | 50.00 (SD = 9.87) | 50.00 | 0.12 (Basic), < 0.05 (Clinical) |