Table 1 Comparison of GPT-4o and Claude-3 performance to Family Medicine residents on questions with diagnostic uncertainty

From: The future of AI clinicians: assessing the modern standard of chatbots and their approach to diagnostic uncertainty

  

| Category         | GPT-4o, n (%) | Claude-3, n (%) | PGY 1 Residents % Correct (95% CI), N = 160 | PGY 2 Residents % Correct (95% CI), N = 160 | PGY 1 + 2 Residents % Correct (95% CI), N = 220 |
|------------------|---------------|-----------------|---------------------------------------------|---------------------------------------------|--------------------------------------------------|
| Overall          | 48 (53.3)     | 52 (57.7)       | 61.1 (58.4–63.7)                            | 63.3 (60.7–66.1)                            | 62.2 (59.6–64.9)                                 |
| Exam Category    |               |                 |                                             |                                             |                                                  |
| Cardiovascular   | 8 (80.0)      | 5 (50.0)        | 63.8 (60.6–67.1)                            | 65.2 (62.7–68.3)                            | 64.5 (61.7–67.7)                                 |
| Gastrointestinal | 7 (70.0)      | 5 (50.0)        | 64.1 (61.7–67.2)                            | 67.3 (65.2–70.1)                            | 65.7 (63.5–68.7)                                 |
| Geriatric Care   | 4 (40.0)      | 7 (70.0)        | 58.7 (55.4–61.2)                            | 60.4 (58.5–63.5)                            | 59.6 (57.0–62.4)                                 |
| Endocrine        | 6 (60.0)      | 5 (50.0)        | 65.2 (63.8–66.7)                            | 66.3 (64.9–68.5)                            | 65.8 (64.4–67.6)                                 |
| Mental Health    | 3 (30.0)      | 8 (80.0)        | 51.3 (48.5–54.8)                            | 53.4 (50.7–56.8)                            | 52.4 (49.6–55.8)                                 |
| MSK              | 5 (50.0)      | 5 (50.0)        | 60.7 (58.2–62.4)                            | 62.9 (59.9–65.7)                            | 61.8 (59.1–64.1)                                 |
| Pediatric        | 5 (50.0)      | 5 (50.0)        | 65.9 (63.2–68.4)                            | 68.6 (65.4–71.2)                            | 67.3 (64.3–69.8)                                 |
| Respiratory      | 6 (60.0)      | 5 (50.0)        | 64.5 (61.8–67.3)                            | 67.9 (64.1–70.2)                            | 66.2 (63.0–68.8)                                 |
| Women’s Health   | 4 (40.0)      | 7 (70.0)        | 55.3 (52.5–57.9)                            | 57.4 (55.2–60.4)                            | 56.4 (53.9–59.2)                                 |
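As a minimal illustration of how a 95% confidence interval for a single proportion can be computed, here is a normal-approximation (Wald) sketch in Python. The table does not state the study's actual CI method, which may differ (e.g. it may account for clustering across residents or use an exact binomial interval); the example assumes the chatbot totals are out of 90 questions, consistent with 48/90 ≈ 53.3%.

```python
from math import sqrt

def prop_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion.

    p_hat: observed proportion correct
    n: number of questions
    z: critical value (1.96 for a two-sided 95% interval)
    """
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se

# Illustration: GPT-4o's overall score, 48 of 90 questions correct (53.3%)
lo, hi = prop_ci(48 / 90, 90)
print(f"{48 / 90:.1%} (95% CI {lo:.1%}–{hi:.1%})")
```

With only 90 questions per chatbot, the resulting interval is wide, which is why single-run chatbot percentages should be compared to the resident CIs with caution.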