Artificial intelligence based assessment of clinical reasoning documentation: an observational study of the impact of the clinical learning environment on resident documentation quality

Abstract

Background

Objective measures and large datasets are needed to determine which aspects of the Clinical Learning Environment (CLE) impact the essential skill of clinical reasoning documentation. Artificial Intelligence (AI) offers a solution. Here, the authors sought to determine which aspects of the CLE might be impacting resident clinical reasoning documentation quality, as assessed by AI.

Methods

In this observational, retrospective cross-sectional analysis of hospital admission notes from the Electronic Health Record (EHR), all categorical internal medicine (IM) residents who wrote at least one admission note during the study period (July 1, 2018 to June 30, 2023) at two sites of NYU Grossman School of Medicine’s IM residency program were included. Clinical reasoning documentation quality of admission notes was classified as low- or high-quality using a supervised machine learning model. From note-level data, the shift (day or night) and note index within shift (whether a note was the first, second, etc., within the shift) were calculated. These aspects of the CLE were included as potential markers of workload, which has been shown to have a strong relationship with resident performance. Patient data were also captured, including age, sex, Charlson Comorbidity Index, and primary diagnosis. The relationship between these variables and clinical reasoning documentation quality was analyzed using generalized estimating equations accounting for resident-level clustering.

Results

Across 37,750 notes authored by 474 residents, older patient age, more pre-existing comorbidities, and certain primary diagnoses (e.g., infectious and pulmonary conditions) were associated with higher clinical reasoning documentation quality. When controlling for these and other patient factors, variables associated with clinical reasoning documentation quality included academic year (adjusted odds ratio, aOR, for high-quality: 1.10; 95% CI 1.06–1.15; P <.001), night shift (aOR 1.21; 95% CI 1.13–1.30; P <.001), and note index (aOR 0.93; 95% CI 0.90–0.95; P <.001).

Conclusions

AI can be used to assess complex skills such as clinical reasoning in authentic clinical notes, helping to elucidate the potential impact of the CLE on resident clinical reasoning documentation quality. Future work should explore residency program and systems interventions to optimize the CLE.

Background

Clinical reasoning is a core skill that is essential to patient care, and clinical documentation serves as an important measure of resident clinical reasoning [1,2,3]. While the absence of high-quality documentation of clinical reasoning does not mean that reasoning did not occur, higher-quality clinical reasoning documentation has been hypothesized to be associated with improved patient outcomes, including reduced diagnostic errors [4,5,6]. The development of residents’ clinical reasoning skills is inextricably linked to the clinical learning environment (CLE) [7, 8]. The CLE is the interplay between education (including the expected learning outcomes and assessment practices) and the clinical context in which trainees participate in patient care [9, 10]. Additionally, the CLE, as a mediator of attainment of clinical competence, has the potential for long-lasting impact on future practice patterns [11,12,13,14,15]. Learning in the authentic clinical environment can be challenging due to the unpredictability, pace, and complexity inherent to caring for sick patients [9, 10]. Furthermore, feedback, which is essential to the development of clinical reasoning, is often limited in the fast-paced CLE, given high patient volumes and increasing demands on physicians’ time [2].

Studies across medical specialties have investigated how the CLE impacts resident performance. In the procedural specialties of obstetrics and gynecology and surgery, researchers have used procedural outcomes to examine the interplay between the CLE and markers of resident performance by reviewing electronic health record (EHR) data for maternal birth complications and postoperative complications [16, 17]. In internal medicine (IM), research on the CLE has focused on clinical workload, shift timing, and work hours, with noted negative impacts on learning as assessed through trainee surveys; however, objective, attributable, and scalable measurement of such relationships proves difficult in practice [18,19,20]. Studies that do focus on objective, EHR-derived outcomes have suggested that increased resident workload, as measured by higher patient censuses and numbers of admissions, is associated with increased resource use, readmission rates, and potentially increased patient mortality [21, 22]. Similarly, reducing resident workload by lowering patient census has been shown to improve the quality of discharge summaries as assessed by manual human review [23]. However, more automated and scalable measures, including better utilization of EHR data, are needed to determine which specific, modifiable aspects of the CLE might be impacting resident performance and patient outcomes [24].

Clinical reasoning documentation is one such EHR-derived outcome that offers an avenue to further explore the impact of the CLE on resident performance [25]. While multiple validated rating tools exist to assess the quality of clinical reasoning documentation, they are time-intensive for educators, making it difficult to generate insights at the scale needed to effectively measure the impact of the CLE [3, 26,27,28,29]. Artificial intelligence (AI) and big data analytics offer approaches for generating the necessary scalable, large datasets of resident-specific, objective measures of resident performance that, if melded with patient, team, and contextual factors, could further unpack the potential impact of the CLE on resident performance [29]. Despite this opportunity, AI-based assessments of resident performance are not yet widely implemented [30,31,32,33]. In earlier phases of this work, we developed an AI-based assessment of clinical reasoning documentation to increase the frequency and quality of feedback on this important skill [34]. AI-based tools have the potential to assess clinical reasoning at the individual and aggregate level at a scale that is not feasible with human rating. With this technology, we can more comprehensively explore the relationship between the CLE and resident clinical reasoning documentation practices.

Here, we report a cross-sectional analysis of the relationship between the CLE (e.g., markers of workload such as volume of admissions and time of shift) and AI-based assessment of clinical reasoning documentation quality among a large retrospective cohort of IM residents at two distinct residency program hospital sites within one quaternary academic health system.

Methods

Study sites

This study was conducted at NYU Grossman School of Medicine’s IM residency program, a large academic urban training program in New York City. The study was conducted at two of the residency program’s hospital sites—NYU Langone Hospital (NYULH)-Manhattan and NYULH-Brooklyn—which utilize a single EHR. The Manhattan site (181 residents overall) is a university-based hospital and a quaternary referral and transplant center. Manhattan residents also rotate at two other affiliated sites with distinct EHRs that were not included in this study, but they do not rotate at the Brooklyn site. The Brooklyn site (41 residents overall) is a community-based tertiary hospital whose residency program became affiliated with NYU in 2016. At each site, residents staff six medicine inpatient teams. Residents are divided into day teams (7 am to 7 pm) and night teams (7 pm to 7 am), both responsible for caring for existing patients while also admitting patients daily, including writing admission notes. The day team typically comprises a supervising attending, a senior resident, two interns, and medical students; the night team is a resident–intern dyad.

Study population

We included all categorical IM residents who wrote at least one admission note on a general IM service from July 1, 2018 to June 30, 2023 (five academic years), at either NYULH-Manhattan (n = 356 residents) or NYULH-Brooklyn (n = 118 residents). Preliminary residents (e.g., Neurology) and rotating residents (e.g., Psychiatry) were excluded. Notes authored in the intensive care unit were excluded, as such notes were not included in the initial validation work of the AI algorithm [34].

Data retrieval

Admission notes were retrieved using a custom query of the EHR data warehouse in July 2023. Note-level data included the note creation date and time. From this, the shift (day or night) and creation time within the shift were calculated. Additionally, a note index was calculated at the shift level, indicating whether a given note was the first, second, third, etc. note authored within a shift. These aspects of the CLE were included as potential markers of workload, which has been shown to have a strong relationship with resident performance [21,22,23]. Patient-level data at the time of each inpatient hospitalization were also captured, such as patient age, sex, insurance type, and Charlson Comorbidity Index (CCI). These variables were included because they are all patient characteristics that can influence the clinical reasoning process in the CLE [35].
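To make these derivations concrete, the following is a minimal sketch in R (the analysis software named below in the Methods) of how shift and note index can be computed from note creation timestamps. The data frame and column names (notes, resident_id, note_datetime) are hypothetical stand-ins, not the authors’ actual extraction code.

```r
library(dplyr)
library(lubridate)

# `notes` is assumed to have one row per admission note, with hypothetical
# columns resident_id and note_datetime (POSIXct creation timestamp).
notes <- notes %>%
  mutate(
    note_hour = hour(note_datetime),
    shift     = if_else(note_hour >= 7 & note_hour < 19, "day", "night"),
    # Anchor each note to the date its shift started, so night-shift notes
    # written after midnight stay with the previous evening's shift
    shift_date = if_else(shift == "night" & note_hour < 7,
                         as_date(note_datetime) - 1,
                         as_date(note_datetime))
  ) %>%
  group_by(resident_id, shift_date, shift) %>%
  arrange(note_datetime, .by_group = TRUE) %>%
  mutate(note_index = row_number()) %>%  # 1st, 2nd, 3rd, ... note within the shift
  ungroup()
```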

The CCI encompasses 17 disease categories (e.g., congestive heart failure, renal disease) that were weighted using an age-independent adaptation revised for International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes present before the date of hospital admission (score range 0 to 29, with higher scores indicating greater comorbidity) [36,37,38].
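For intuition, the sketch below shows the general form of such a weighted comorbidity sum in R. The weights shown are a small illustrative subset of the standard Charlson weights; the full 17-category ICD-10-CM mapping comes from the cited adaptation [36,37,38], not from this sketch.

```r
# Illustrative only: a Charlson-type score is a weighted sum of comorbidity
# indicator flags derived from ICD-10-CM codes documented before admission.
cci_weights <- c(
  congestive_heart_failure = 1,
  renal_disease            = 2,
  moderate_severe_liver    = 3,
  metastatic_solid_tumor   = 6
)

# `comorbidity_flags` is a hypothetical 0/1 matrix (patients x categories)
compute_cci <- function(comorbidity_flags, weights = cci_weights) {
  as.numeric(comorbidity_flags[, names(weights), drop = FALSE] %*% weights)
}
```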

Hospitalizations were categorized by diagnostic area using the ICD-10-CM chapter of hospitalization-level diagnoses, as done previously (Supplement Digital Appendix 1) [39]. The principal diagnosis (selected by clinicians) was used if available, otherwise the primary diagnosis (selected by a non-physician coder based on chart review) was used.

Author-level data on each resident’s primary training site and post-graduate year (PGY) at the time of note creation were retrieved from an educational data warehouse [40].

AI algorithm

The assessment and plan section of each note was analyzed using a previously developed supervised machine learning model that classifies clinical reasoning documentation quality in IM resident admission notes as low- or high-quality [34]. The human rating gold standard for training the model was the Revised-IDEA assessment tool, which defines a high-quality hospital admission note across four domains: an interpretive summary (I) that is concise and uses semantic vocabulary to highlight the most important elements from history, exam, and testing to represent the patient’s main problems; a differential diagnosis (D) that is explicitly prioritized with specific diagnoses (rather than diagnostic categories, e.g., infection); and a clear explanation of reasoning (E) for the lead and alternative (A) diagnoses in the differential [2]. Ultimately, the model was trained on only the DEA component of the Revised-IDEA tool, excluding the I component, which was too complex to tackle with the technology available at the time of model development. The Revised-DEA human rating score ranges from 0 to 6, with a score of ≥ 3 deemed high quality through a standard-setting process. The model uses the Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES, v4.0.0), an open-source, pre-large language model (LLM) natural language processing (NLP) tool trained on clinical notes, to perform named entity recognition (i.e., extracting diagnoses and other concepts); a classifier trained on the human-rated note corpus then predicts whether a note is low- or high-quality. The algorithm has excellent note-level performance, with an area under the receiver operating characteristic curve of 0.88 [34].
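As an illustration of how the binary outcome and the reported note-level performance metric fit together, here is a minimal R sketch. It assumes a hypothetical human-rated corpus (rated_notes) with a Revised-DEA score and a model-predicted probability per note; the cTAKES feature-extraction step itself is not shown, and this is not the authors’ pipeline.

```r
library(dplyr)
library(pROC)

# Hypothetical columns: revised_dea_score (0-6, human rated) and
# pred_prob (model-predicted probability that a note is high-quality)
rated_notes <- rated_notes %>%
  mutate(high_quality = as.integer(revised_dea_score >= 3))  # standard-setting cut point

# Note-level discrimination, analogous to the reported AUROC of 0.88
roc_obj <- roc(response = rated_notes$high_quality, predictor = rated_notes$pred_prob)
auc(roc_obj)
```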

Statistical analysis

Descriptive statistics, chi-squared testing, regression analyses, and visualizations were performed with R statistical software (v4.2.0) using a variety of packages (tidyverse, data.table, gtsummary, ggplot2). Hypothesis testing was two-sided at α = 0.05. Regression models for high-quality clinical reasoning were built both without adjustment for clustering (logistic regression via the glm function in the stats package) and with adjustment for clustering by resident using generalized estimating equations (GEE) with an exchangeable correlation structure and robust standard error estimates (the geeglm function in the geepack package). Interaction terms were evaluated for each of the CLE variables; significant interactions were included in the final model. The within-cluster intraclass correlation was approximately 0.08, indicating only modest within-resident clustering of performance. The study was approved by the NYU Grossman School of Medicine institutional review board.
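A minimal sketch of the clustered model described above, using the geepack package named in the Methods; the outcome and covariate names are hypothetical stand-ins for the variables reported in Table 2.

```r
library(geepack)

# Data must be sorted so that each resident's notes are contiguous
notes <- notes[order(notes$resident_id), ]

gee_fit <- geeglm(
  high_quality ~ academic_year + shift + note_index + site + pgy +
                 age_group + sex + cci_group + primary_dx_area,
  id     = resident_id,
  data   = notes,
  family = binomial(link = "logit"),
  corstr = "exchangeable"   # robust (sandwich) standard errors are the geeglm default
)

# Adjusted odds ratios with approximate 95% CIs from the robust standard errors
est <- coef(summary(gee_fit))
exp(cbind(aOR   = est[, "Estimate"],
          lower = est[, "Estimate"] - 1.96 * est[, "Std.err"],
          upper = est[, "Estimate"] + 1.96 * est[, "Std.err"]))
```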

Results

Descriptive statistics were computed for resident characteristics, CLE characteristics, and patient characteristics, stratified by residency program site (Table 1 and Supplement Digital Appendix 2) and post-graduate year (PGY) (Supplement Digital Appendix 3). There were 37,750 admission notes (16,180 Manhattan and 21,570 Brooklyn) authored by 474 distinct residents (356 Manhattan and 118 Brooklyn), with a mean of 79.6 notes per resident (45 per Manhattan resident and 183 per Brooklyn resident), across 28,782 distinct patients at both sites from July 2018 to June 2023. Overall, 55% of the notes were high-quality (64% at Manhattan and 48% at Brooklyn). At both sites, PGY2 or PGY3 residents accounted for the majority of high-quality notes compared with PGY1 residents (overall n = 15,823 (76%) vs. 5,034 (24%), P <.001); this seniority gap was larger at Brooklyn (PGY2 or PGY3, n = 8,992 (86%) vs. PGY1, n = 1,448 (14%), P <.001) than at Manhattan (PGY2 or PGY3, n = 6,831 (66%) vs. PGY1, n = 3,586 (34%), P <.001). At both sites, more of the high-quality notes were written during the night shift than the day shift (overall n = 13,518 (65%) vs. n = 7,339 (35%), P <.001), and notes with an earlier note index in the shift accounted for the majority of high-quality notes (overall 1st note in shift n = 11,890 (57%) vs. 2nd note n = 4,657 (22%) vs. 3rd note n = 2,130 (10%) vs. 4th note n = 1,119 (5.4%) vs. 5th or later note n = 1,061 (5.1%), P <.001).

Table 1 Descriptive statistics of note quality by site, resident characteristics, clinical learning environment characteristics, and patient characteristics

Controlling for covariates and accounting for resident-level clustering, patient characteristics reflecting more complex admitting presentations were associated with higher-quality clinical reasoning documentation, including older age (adjusted odds ratio, aOR, 1.13 for age ≥ 80.1; 95% CI 1.05–1.22; P <.001; aOR 1.17 for age 68.1–80; 95% CI 1.09–1.25; P <.001; aOR 1.05 for age 54.1–68; 95% CI 0.99–1.11; P =.10) and higher comorbidity index (aOR 2.18 for CCI ≥ 5 referenced to CCI 0; 95% CI 2.02–2.34; P <.001) (Table 2). The primary diagnosis areas most strongly associated with higher clinical reasoning documentation quality were pulmonary and infectious diseases (aOR 1.53; 95% CI 1.40–1.66; P <.001 and aOR 1.58; 95% CI 1.45–1.73; P <.001, respectively).

Table 2 Regression models for the relationship between clinical reasoning documentation quality and resident characteristics, clinical learning environment characteristics, and patient characteristics

After controlling for covariates, several aspects of the CLE were significantly associated with clinical reasoning documentation quality (Table 2). Academic year was positively associated with clinical reasoning documentation quality (aOR 1.10; 95% CI 1.06–1.15; P <.001). Notably, there was a decline in clinical reasoning documentation quality during AY 2019–2020, driven by notes authored during the peak of the initial COVID-19 surge in New York City (March to June 2020), with subsequent recovery in documentation quality (47.7% high-quality in AY 2019–2020 (49.9% July–February, 42.4% March–June) vs. 56.8% high-quality in all other AYs, P <.001, Fig. 1a). The NYULH-Manhattan site experienced a larger decline in clinical reasoning documentation quality during the COVID-19 surge than the NYULH-Brooklyn site (62.2% high-quality July–February to 50.5% March–June of AY 2019–2020 at Manhattan vs. 44.5% to 37.0% over the same period at Brooklyn, P <.001). Overall, performance differed by site: residents at NYULH-Brooklyn were less likely to write high-quality notes than those at NYULH-Manhattan (aOR 0.46; 95% CI 0.39–0.55; P <.001). Additionally, an increase in clinical reasoning documentation quality by PGY was observed only at the NYULH-Brooklyn site: PGY2 and PGY3 residents (i.e., senior residents) at NYULH-Brooklyn were more likely to write high-quality clinical reasoning notes than their PGY1 peers (aOR 1.39; 95% CI 1.17–1.64; P <.001). A sensitivity analysis with PGY2 and PGY3 separated showed the same result.

Fig. 1

A: High-quality clinical reasoning documentation by academic year (AY), demonstrating a decline in AY 2019–2020 driven by notes authored during the peak of the initial COVID-19 surge in New York City (March to June 2020). B: High-quality clinical reasoning documentation by shift and hour, demonstrating that notes written during the night shift were more likely to be high-quality than those written during the day shift; notes written later in the night shift appeared to be associated with lower note quality, but this relationship disappeared in the multivariable regression. C: High-quality clinical reasoning documentation by note index within shift, demonstrating that within each shift every additional note (i.e., higher note index) was associated with lower-quality clinical reasoning documentation
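As an illustration, a descriptive panel like Fig. 1c could be produced in R roughly as follows, assuming hypothetical high_quality (0/1) and note_index columns in the notes data frame; this is a sketch, not the authors’ plotting code.

```r
library(dplyr)
library(ggplot2)

notes %>%
  mutate(note_index_capped = pmin(note_index, 5)) %>%   # 5 = "5th or later" bucket
  group_by(note_index_capped) %>%
  summarise(prop_high_quality = mean(high_quality)) %>%
  ggplot(aes(x = factor(note_index_capped), y = prop_high_quality)) +
  geom_col() +
  labs(x = "Note index within shift (5 = 5th or later)",
       y = "Proportion of notes rated high-quality")
```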

Notes written during the night shift were more likely to be high-quality than those written during the day shift (aOR 1.21; 95% CI 1.13–1.30; P <.001). Although notes written later in the night shift appeared to be associated with lower note quality (Fig. 1b), this relationship disappeared in the multivariable regression (aOR 1.00; 95% CI 0.99–1.01; P =.80). Rather, within each shift, every additional note (i.e., higher note index) was associated with a 7% lower odds of high-quality clinical reasoning documentation (aOR 0.93; 95% CI 0.90–0.95; P <.001) (Fig. 1c).
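Because note index enters the model as a single per-note adjusted odds ratio, its effect compounds multiplicatively across a shift. As a rough worked example under that log-linear assumption:

```latex
% Assuming note index enters the model linearly on the log-odds scale
% (as a single per-note aOR implies), the effect compounds multiplicatively:
\mathrm{aOR}_{k\text{th note vs. 1st}} = 0.93^{\,k-1}, \qquad 0.93^{4} \approx 0.75
```

That is, under this assumption, the fifth note of a shift would carry roughly 25% lower adjusted odds of high-quality documentation than the first, all else being equal.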

Discussion

We utilized AI-based assessment to measure resident clinical reasoning documentation across five academic years, two residency program sites, and tens of thousands of hospitalizations to better understand the relationship between the CLE and resident performance. Generating a dataset of this size and conducting this exploration were only feasible because of AI: a single person rating continuously for eight hours per day and spending two minutes per note would have taken over 150 days to rate all the notes. Furthermore, while the use of AI to enhance assessment has been widely suggested, only one other validated AI-based assessment tool of clinical reasoning documentation in the authentic CLE has been described, and it focuses on hospital progress notes [31, 41,42,43,44,45,46,47].
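As a quick check of that estimate using the figures reported in this study:

```latex
% Back-of-the-envelope check of the manual-rating estimate quoted above:
37{,}750 \text{ notes} \times 2 \text{ min/note} = 75{,}500 \text{ min} \approx 1{,}258 \text{ h},
\qquad 1{,}258 \text{ h} \div 8 \text{ h/day} \approx 157 \text{ days}
```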

Melding EHR-derived measures of the CLE with this AI-based assessment revealed clinically meaningful performance differences across several CLE factors. We extracted note-level data on creation time within shift and note index to explore markers of workload, which can be a major barrier to learning in the CLE [9, 18]. After controlling for other covariates, including patient characteristics such as primary diagnosis area and patient complexity as measured by CCI, we found that residents were more likely to write high-quality notes during their night shifts and less likely to write high-quality notes as the number of notes within a shift increased. For the former, we hypothesize that note quality was higher during the night shift because residents can typically focus more on new admissions at night, without the added workload of routine non-emergent daytime care such as writing progress notes, calling non-emergent consults, or preparing for discharges. The differences between night and day shift note quality could also be due to differences in team structure, with the senior resident supervising up to four to five junior team members (interns and medical students) during the daytime but only a single intern in the resident–intern dyad at nighttime. Additionally, the fact that notes written earlier in the shift (lower note index) were of higher quality might be due to residents focusing on fewer concomitant admissions at the time of those earlier admissions. While there is consensus that busyness of work and service pressures can cause cognitive overload and be a barrier to learning and performance in the CLE, both findings provide objective measures that could inform structural changes [18, 19, 21, 23, 48]. Such structural changes could include further limits on the number of admissions beyond those that already exist [49], or team structures that could minimize cognitive overload, such as the addition of advanced practice providers (APPs) to support resident teams or the creation of swing shifts focused on admitting without cross-coverage obligations [50,51,52]. These changes would need to be balanced against tradeoffs, including requirements for increased staffing and ensuring residents have enough clinical exposure to develop competence.

Unsurprisingly, the COVID-19 surge in New York City in the spring of 2020, a major disruptor of the CLE, impacted the educational outcome of clinical reasoning documentation quality at both sites. NYULH-Manhattan experienced a larger impact than NYULH-Brooklyn, which correlates with the higher patient volume experienced at that site during the surge [53]. At both sites, clinical reasoning documentation quality has since recovered. While clinical reasoning documentation quality at NYULH-Manhattan has mostly plateaued, NYULH-Brooklyn experienced a continued significant rise over the ensuing two academic years, with a leveling off in AY 2022–2023. We hypothesize this is a result of a major change in the CLE that occurred in 2016 with the merger of NYULH and Lutheran Medical Center, a resource-limited community hospital that is now NYULH-Brooklyn. The merger took a full-integration approach to incorporating this hospital into the NYULH system, including transitioning to the NYULH EHR, establishing new hospital quality metrics, and, importantly, redesigning the residency program curriculum and mission, all of which likely played a role in fostering a CLE conducive to positive change [54]. A similar trend of improvement was seen in clinical outcomes (such as a reduction in mortality rate and fewer central line and catheter-associated urinary tract infections) after the merger, as described by Wang et al. [54]. Using the aggregate performance of AI-based assessment of clinical reasoning documentation can be an important strategy to assess for any intended or unintended impact of systems changes such as mergers or EHR changes.

While likely not an impact of the CLE, other differences seen in educational outcomes between NYULH-Brooklyn and NYULH-Manhattan were overall higher-quality clinical reasoning documentation at the latter site and an increase in clinical reasoning documentation quality by PGY only at the former site. Possible explanations for these differences might be differences in the resident cohorts themselves; each site has its own mission and identity and attracts residents with different clinical interests and career goals and from different medical schools. Our findings might also reflect a ceiling effect at NYULH-Manhattan, where PGY1 residents started at a higher percentage of high-quality notes (66%) than at NYULH-Brooklyn (42%). Despite these differences, both sites share several similarities, including team structures, faculty supervision, EHR note templates with free-text assessment and plans, and clinical reasoning curriculum (a resident-facing dashboard providing feedback on clinical reasoning documentation, generated from the output of the AI-based assessment, was implemented during the study period in November 2020 at both sites) [34]. The implementation of the dashboard could have influenced resident clinical reasoning documentation practices in the later part of the study period, in particular accounting for the trend of improvement seen by academic year at both sites, which is an ideal outcome after implementing feedback via the dashboard. Ultimately, possibly because of these similarities, both sites experienced similar impacts of the CLE when controlling for covariates.

Lastly, we explored clinical reasoning documentation quality by primary diagnosis area to further understand whether residents were meeting educational outcomes by content domain and the potential impact clinical context might have. After controlling for patient characteristics, including preexisting comorbidities, we found that residents were more likely to write high-quality clinical reasoning documentation in the primary diagnosis domains of infectious disease and pulmonary medicine. There could be several explanations for this finding, such as increased exposure to these diagnoses (in particular infectious diseases during COVID-19 [55]), residents having the opportunity to rotate on pulmonary and infectious disease subspecialty-led teams, or certain presenting diagnoses not requiring as much diagnostic reasoning (e.g., a hematology-oncology patient admitted for chemotherapy). Ultimately, these data could be used for feedback on individual resident practice habits, for rotation planning and scheduling residents for specialty rotations, and for program-level improvements to inform curricular planning, such as focusing noon conference series on diagnostic categories with overall lower-quality clinical reasoning. Use for feedback on practice habits could be applicable not only to residents but also to attending physicians and APPs, and has the potential to reduce diagnostic errors and improve patient outcomes by improving clinical reasoning documentation quality [4, 56, 57]. Validating the AI-based assessment tool for attending physician and APP clinical reasoning documentation quality and exploring the relationship between clinical reasoning documentation quality and diagnostic errors are both important areas for future studies.

Limitations

The AI model in this study was developed with older-generation NLP technology that only allowed binary predictions of low- vs. high-quality clinical reasoning documentation; however, it had good performance [34]. LLMs will only further improve the ability to analyze EHR-based note text and allow for more nuanced assessment of clinical reasoning, and we have integrated them into an updated model [58]. In terms of generalizability, although this study was conducted at a single academic institution, it included two hospital sites with distinct resident populations. Additionally, in the current phase of our work, we have validated an updated AI-based assessment of clinical reasoning documentation using newer technologies at a second institution, the University of Cincinnati, and created a pipeline for dissemination of the tool [58]. This will enhance the generalizability of this study, allowing other clinical sites to explore the potential impact of the CLE on resident performance using AI-based assessment of clinical reasoning documentation [58]. Another limitation is that the AI model only assesses whether clinical reasoning was documented, not whether it was accurate or whether reasoning occurred but was simply not documented. However, documentation of clinical reasoning is an essential skill that residents are expected to perform and has been hypothesized to be associated with patient outcomes [4,5,6, 59]. Furthermore, the model excluded the interpretive summary of the Revised-IDEA tool and focused only on the assessment and plan of the note. While the earlier stages of the clinical reasoning process, data acquisition (history and physical exam) and the interpretive summary, are important components, we focused on the latter stages of the process of arriving at a prioritized differential diagnosis [60]. Lastly, while we controlled for many key variables we determined could influence clinical reasoning documentation quality in the CLE, residual confounding may exist, as the CLE is very complex, and the retrospective cross-sectional design precludes assessment of causality.

Conclusions

In this retrospective cross-sectional analysis of IM resident clinical reasoning documentation quality, we demonstrated how AI-based assessment of resident clinical reasoning documentation can provide insights that unpack the impact of the CLE on this essential skill. We found several specific factors in the CLE associated with clinically meaningful differences in resident clinical reasoning documentation quality. These findings can serve as a call to action to expand this work to other institutions, specialties, and healthcare providers in future studies and to inform recommended systems changes.

Data availability

Given the sensitive nature of EHR data, additional data cannot be made readily accessible. Please email the authors at Jesse.Rafel@nyulangone.org with any inquiries about the data or machine learning model.

Abbreviations

AI: Artificial Intelligence
APP: Advanced Practice Provider
CLE: Clinical Learning Environment
EHR: Electronic Health Record
IM: Internal Medicine
LLMs: Large Language Models
NLP: Natural Language Processing
NYULH: New York University Langone Health

References

  1. Connor DM, Durning SJ, Rencic JJ. Clinical reasoning as a core competency. Acad Med. 2020;95(8):1166–71.

  2. Schaye V, Miller L, Kudlowitz D, Chun J, Burk-Rafel J, Cocks P, et al. Development of a clinical reasoning documentation assessment tool for resident and fellow admission notes: a shared mental model for feedback. J Gen Intern Med. 2022;37(3):507–12.

  3. Baker EA, Ledford CH, Fogg L, Way DP, Park YS. The IDEA assessment tool: assessing the reporting, diagnostic reasoning, and decision-making skills demonstrated in medical students’ hospital admission notes. Teach Learn Med. 2015;27(2):163–73.

  4. Kulkarni D, Heath J, Kosack A, Jackson NJ, Crummey A. An educational intervention to improve inpatient documentation of high-risk diagnoses by pediatric residents. Hosp Pediatr. 2018;8(7):430–5.

  5. Schiff GD, Bates DW. Can electronic clinical documentation help prevent diagnostic errors? N Engl J Med. 2010;362(12):1066–9.

  6. Singh H, Giardina TD, Meyer AN, Forjuoh SN, Reis MD, Thomas EJ. Types and origins of diagnostic errors in primary care settings. JAMA Intern Med. 2013;173(6):418–25.

  7. Teunissen P, Scheele F, Scherpbier A, Van Der Vleuten C, Boor K, Van Luijk S, et al. How residents learn: qualitative evidence for the pivotal role of clinical activities. Med Educ. 2007;41(8):763–70.

  8. Konopasky A, Artino AR, Battista A, Ohmer M, Hemmer PA, Torre D, et al. Understanding context specificity: the effect of contextual factors on clinical reasoning. Diagn. 2020;7(3):257–64.

  9. Nordquist J, Hall J, Caverzagie K, Snell L, Chan MK, Thoma B, et al. The clinical learning environment. Med Teach. 2019;41(4):366–72.

  10. Gruppen LD. Context and complexity in the clinical learning environment. Med Teach. 2019;41(4):373–4.

  11. Weiss KB, Bagian JP, Nasca TJ. The clinical learning environment: the foundation of graduate medical education. JAMA. 2013;309(16):1687–8.

  12. Wagner R, Patow C, Newton R, Casey BR, Koh NJ, Weiss KB. The overview of the CLER program: CLER National report of findings 2016. J Grad Med Educ. 2016;8(2 Suppl 1):11–3.

  13. Nasca TJ, Wagner R, Weiss KB. Introduction to the CLER National report of findings 2022: the COVID-19 pandemic and its impact on the clinical learning environment. J Grad Med Educ. 2023;15(1):140–2.

  14. Chen C, Petterson S, Phillips R, Bazemore A, Mullan F. Spending patterns in region of residency training and subsequent expenditures for care provided by practicing physicians for medicare beneficiaries. JAMA. 2014;312(22):2385–93.

  15. Asch DA, Epstein A, Nicholson S. Evaluating medical training programs by the quality of care delivered by their alumni. JAMA. 2007;298(9):1049–51.

  16. Asch DA, Nicholson S, Srinivas S, Herrin J, Epstein AJ. Evaluating obstetrical residency programs using patient outcomes. JAMA. 2009;302(12):1277–83.

  17. Bansal N, Simmons KD, Epstein AJ, Morris JB, Kelz RR. Using patient outcomes to evaluate general surgery residency program performance. JAMA Surg. 2016;151(2):111–9.

  18. Kilty C, Wiese A, Bergin C, Flood P, Fu N, Horgan M, et al. A National stakeholder consensus study of challenges and priorities for clinical learning environments in postgraduate medical education. BMC Med Educ. 2017;17(1):226.

  19. Haney EM, Nicolaidis C, Hunter A, Chan BK, Cooney TG, Bowen JL. Relationship between resident workload and self-perceived learning on inpatient medicine wards: a longitudinal study. BMC Med Educ. 2006;6:35.

  20. Arora VM, Georgitis E, Siddique J, Vekhter B, Woodruff JN, Humphrey HJ, et al. Association of workload of on-call medical interns with on-call sleep duration, shift duration, and participation in educational activities. JAMA. 2008;300(10):1146–53.

  21. Ong M, Bostrom A, Vidyarthi A, McCulloch C, Auerbach A. House staff team workload and organization effects on patient outcomes in an academic general internal medicine inpatient service. Arch Intern Med. 2007;167(1):47–52.

  22. Averbukh Y, Southern W. The impact of the number of admissions to the inpatient medical teaching team on patient safety outcomes. J Grad Med Educ. 2012;4(3):307–11.

  23. Coit MH, Katz JT, McMahon GT. The effect of workload reduction on the quality of residents’ discharge summaries. J Gen Intern Med. 2011;26(1):28–32.

  24. Burk-Rafel J, Sebok-Syer SS, Santen SA, Jiang J, Caretta-Weyer HA, Iturrate E, et al. TRainee attributable & automatable care evaluations in real-time (TRACERs): A scalable approach for linking education to patient care. Perspect Med Educ. 2023;12(1):149.

  25. Thoma B, Turnquist A, Zaver F, Hall AK, Chan TM. Communication, learning and assessment: exploring the dimensions of the digital learning environment. Med Teach. 2019;41(4):385–90.

  26. Hung H, Kueh LL, Tseng CC, Huang HW, Wang SY, Hu YN, et al. Assessing the quality of electronic medical records as a platform for resident education. BMC Med Educ. 2021;21(1):577.

  27. Kogan JR, Hess BJ, Conforti LN, Holmboe ES. What drives faculty ratings of residents’ clinical skills? The impact of faculty’s own clinical skills. Acad Med. 2010;85(10 Suppl):S25–8.

  28. Edwards ST, Neri PM, Volk LA, Schiff GD, Bates DW. Association of note quality and quality of care: a cross-sectional study. BMJ Qual Saf. 2014;23(5):406–13.

  29. Arora VM. Harnessing the power of big data to improve graduate medical education: big Idea or bust? Acad Med. 2018;93(6):833–4.

  30. Turner L, Hashimoto D, Vasisht S, Schaye V. Demystifying AI: current state and future role in precision medical education assessment. Acad Med. 2024;99(4S Suppl 1):S42–7.

  31. Boscardin CK, Gin B, Golde PB, Hauer KE. ChatGPT and generative artificial intelligence for medical education: potential impact and opportunity. Acad Med. 2023:101097.

  32. Abd-Alrazaq A, AlSaad R, Alhuwail D, Ahmed A, Healy PM, Latifi S, et al. Large Language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ. 2023;9(1):e48291.

  33. Preiksaitis C, Rose C. Opportunities, challenges, and future directions of generative artificial intelligence in medical education: scoping review. JMIR Med Educ. 2023;9(1):e48785.

  34. Schaye V, Guzman B, Burk-Rafel J, Marin M, Reinstein I, Kudlowitz D, et al. Development and validation of a machine learning model for automated assessment of resident clinical reasoning documentation. J Gen Intern Med. 2022;37(9):2230–8.

  35. Stolper E, Van Royen P, Jack E, Uleman J, Olde Rikkert M. Embracing complexity with systems thinking in general practitioners’ clinical reasoning helps handling uncertainty. J Eval Clin Pract. 2021;27(5):1175–81.

  36. Quan H, Sundararajan V, Halfon P, Fong A, Burnand B, Luthi J-C et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med Care. 2005:1130–9.

  37. Deyo RA, Cherkin DC, Ciol MA. Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. J Clin Epidemiol. 1992;45(6):613–9.

  38. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5):373–83.

  39. König S, Pellissier V, Hohenstein S, Leiner J, Hindricks G, Meier-Hellmann A, et al. A comparative analysis of in-hospital mortality per disease groups in Germany before and during the COVID-19 pandemic from 2016 to 2020. JAMA Netw Open. 2022;5(2):e2148649–e.

  40. Triola MM, Pusic MV. The education data warehouse: a transformative tool for health education research. J Grad Med Educ. 2012;4(1):113–5.

  41. Lin SY, Shanafelt TD, Asch SM. Reimagining clinical documentation with artificial intelligence. Mayo Clin Proc. 2018;93(5):563-5.

  42. Salt J, Harik P, Barone MA. Leveraging natural language processing: toward computer-assisted scoring of patient notes in the USMLE step 2 clinical skills exam. Acad Med. 2019;94(3):314–6.

  43. Kanjee Z, Crowe B, Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78–80.

  44. Strong E, DiGiammarino A, Weng Y, Kumar A, Hosamani P, Hom J, et al. Chatbot vs medical student performance on free-response clinical reasoning examinations. JAMA Intern Med. 2023;183(9):1028–30.

  45. Liu J, Wang C, Liu S. Utility of ChatGPT in clinical practice. JMIR. 2023;25:e48568.

  46. Feldman J, Hochman KA, Guzman BV, Goodman A, Weisstuch J, Testa P. Scaling note quality assessment across an academic medical center with AI and GPT-4. NEJM Catal Innov Care Deliv. 2024;5(5):CAT. 23.0283.

  47. Jamieson AR, Holcomb MJ, Dalton TO, Campbell KK, Vedovato S, Shakur AH, et al. Rubrics to prompts: assessing medical student post-encounter notes with AI. NEJM AI. 2024;1(12):AIcs2400631.

  48. Fletcher KE, Reed DA, Arora VM. Doing the dirty work: measuring and optimizing resident workload. J Gen Intern Med. 2011;26(1):8–9.

  49. ACGME Program Requirements for Graduate Medical Education in Internal Medicine. 2022. https://www.acgme.org/globalassets/pfassets/programrequirements/140_internalmedicine_2023.pdf. Accessed 15 March 2025.

  50. Thanarajasingam U, McDonald FS, Halvorsen AJ, Naessens JM, Cabanela RL, Johnson MG, et al. Service census caps and unit-based admissions: resident workload, conference attendance, duty hour compliance, and patient safety. Mayo Clin Proc. 2012;87(4):320–7.

  51. Chandra R, Farah F, Munoz-Lobato F, Bokka A, Benedetti KL, Brueggemann C, et al. Sleep is required to consolidate odor memory and remodel olfactory synapses. Cell. 2023;186(13):2911–e2820.

  52. Yu AT, Jepsen N, Prasad S, Klein JP, Doughty C. Adding nocturnal advanced practice providers to an academic inpatient neurology service improves residents’ educational experience. Neurohospitalist. 2023;13(2):130–6.

  53. Schaye VE, Reich JA, Bosworth BP, Stern DT, Volpicelli F, Shapiro NM et al. Collaborating across private, public, community, and federal hospital systems: lessons learned from the Covid-19 pandemic response in NYC. NEJM Catal Innov Care Deliv. 2020;1(6).

  54. Wang E, Arnold S, Jones S, Zhang Y, Volpicelli F, Weisstuch J, et al. Quality and safety outcomes of a hospital merger following a full integration at a safety net hospital. JAMA Netw Open. 2022;5(1):e2142382.

  55. Rhee DW, Pendse J, Chan H, Stern DT, Sartori DJ. Mapping the clinical experience of a new York City residency program during the COVID-19 pandemic. J Hosp Med. 2021;16(6):353–6.

  56. Graber I, John M. Eisenberg Patient Safety and Quality Awards: An Interview with Gordon D. Schiff. Jt Comm J Qual Patient Saf. 2020;46(7):371– 80.

  57. Schiff GD. Diagnosis and diagnostic errors: time for a new paradigm. BMJ Qual Saf. 2014;23(1):1–3.

  58. Schaye V, DiTullio D, Guzman B, Vennemeyer S, Shih H, Reinstein I, et al. Large language model based assessment of clinical reasoning documentation in the electronic health record across two institutions. JMIR (forthcoming).

  59. The Accreditation Council for Graduate Medical Education Internal Medicine Milestones. https://www.acgme.org/globalassets/PDFs/Milestones/InternalMedicineMilestones2.0.pdf. 2020. Accessed 15 Mar 2025.

  60. Bowen JL. Educational strategies to promote clinical diagnostic reasoning. N Engl J Med. 2006;355(21):2217–25.


Acknowledgements

The authors would like to acknowledge Helen Finkelstein, data warehouse and business intelligence developer at the Institute for Innovations in Medical Education, for her assistance in calculating the CCI.

Funding

None.

Author information

Authors and Affiliations

Authors

Contributions

VS made substantial contributions to the conception of the work, interpretation of data, draft and revisions of the work; DT made substantial contributions to the conception of the work, interpretation of data, draft and revisions of the work; DS made substantial contributions to the conception of the work, interpretation of data, draft and revisions of the work; KH made substantial contributions to the conception of the work, interpretation of data, draft and revisions of the work; MH made substantial contributions to the conception of the work, interpretation of data, draft and revisions of the work; IR made substantial contributions to the data acquisition and analysis; BG made substantial contributions to the data acquisition and analysis; JBR made substantial contributions to the conception of the work, interpretation of data, draft and revisions of the work, and the data acquisition and analysis.

Corresponding author

Correspondence to Verity Schaye.

Ethics declarations

Ethical approval

The study was approved by the NYU Grossman School of Medicine institutional review board on 12/9/2023 (i19-00280). As this was a retrospective, observational study of EHR data, the requirement for informed consent from each participant was waived for model development and retrospective data analysis by the NYU Grossman School of Medicine Institutional Review Board.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Schaye, V., DiTullio, D.J., Sartori, D.J. et al. Artificial intelligence based assessment of clinical reasoning documentation: an observational study of the impact of the clinical learning environment on resident documentation quality. BMC Med Educ 25, 591 (2025). https://doi.org/10.1186/s12909-025-07191-x
