Model | p-value | 95% CI for Accuracy | Effect Size (Cramér's V) | Interpretation |
---|---|---|---|---|
ChatGPT 4 | 0.00028 | [85.2%, 92.3%] | 0.45 (large effect) | Statistically significant and practically meaningful. High accuracy across all levels |
Gemini 1.5 Pro | 0.047 | [74.8%, 81.5%] | 0.32 (medium effect) | Statistically significant with moderate practical importance |
Command R+ | 0.197 | [45.6%, 54.4%] | 0.18 (small effect) | Not statistically significant, with little practical importance. Performance is less consistent
Llama 3 70B | 0.118 | [75.1%, 83.2%] | 0.28 (medium effect) | Not statistically significant but shows moderate practical importance |
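
For readers who want to see how quantities like those in the table are typically obtained, the following is a minimal sketch, not the procedure used to produce the reported values. It assumes Cramér's V is computed from a chi-square statistic on a contingency table of difficulty level by correct/incorrect answers, and that the accuracy interval is a Wilson score interval. Both choices, the function names `cramers_v` and `wilson_interval`, and the example counts are hypothetical.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of rows.

    Assumes every row and every column contains at least one observation,
    so that no expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalise by n * (min(rows, cols) - 1) so the result lies in [0, 1].
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


# Hypothetical counts, NOT taken from the study:
# rows are difficulty levels, columns are [correct, incorrect].
example_counts = [[45, 5], [40, 10], [30, 20]]

v = cramers_v(example_counts)
total_correct = sum(row[0] for row in example_counts)
total_items = sum(sum(row) for row in example_counts)
low, high = wilson_interval(total_correct, total_items)

print(f"Cramér's V: {v:.2f}")
print(f"95% CI for accuracy: [{low:.1%}, {high:.1%}]")
```

The "small", "medium", and "large" labels in the table are conventional thresholds applied to V. The cut-offs depend on the table's degrees of freedom, so they are left out of the sketch.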