
Applying generalized theory to optimize the quality of high-stakes objective structured clinical examinations for undergraduate medical students: experience from the French medical school

Abstract

Background

The national OSCE examination has recently been adopted in France as a prerequisite for medical students to enter accredited graduate education programs. However, the reliability and generalizability of OSCE scores have not been well explored in relation to the national examination blueprint.

Method

To obtain complementary information for monitoring and improving the quality of the OSCE, we performed a pilot study applying generalizability (G-)theory to a sample of 6th-year undergraduate medical students (n = 73) who were assessed by pairs of examiners (24 examiners in total) at three stations. Based on the national blueprint, three different scoring subunits (a dichotomous task-specific checklist evaluating clinical skills, behaviorally anchored scales evaluating generic skills, and a global performance scale) were used to evaluate students and were combined into a station score. A variance component analysis was performed using mixed modelling to identify the impact of different facets (station, student and student × station interactions) on the scoring subunits. The generalizability and dependability statistics were calculated.

Results

There was no significant difference between the mean scores attributable to different examiner pairs across the data. The examiner variance component was greater for the clinical skills score (14.4%) than for the generic skills (5.6%) and global performance scores (5.1%). The station variance component was largest for the clinical skills score, accounting for 22.9% of the total score variance, compared to 3% for the generic skills and 13.9% for the global performance scores. The variance component related to the student represented 12% of the total variance for clinical skills, 17.4% for generic skills and 14.3% for global performance ratings. The combined generalizability coefficients across all the data were 0.59 for the clinical skills score, 0.93 for the generic skills score and 0.75 for global performance.

Conclusions

The combined estimates of relative reliability across all data are greater for generic skills scores and global performance ratings than for clinical skills scores. This is likely explained by the fact that content-specific tasks evaluated using checklists produce greater variability in scores than scales evaluating broader competencies. This work can be valuable to other teaching institutions, as monitoring the sources of errors is a principal quality control strategy to ensure valid interpretations of the students’ scores.


Introduction

Objective structured clinical examination (OSCE) [1] is used to evaluate student performance in many clinical tasks using standardized assessment grids and scales [2, 3]. Content specificity, consistency in examiners’ ratings, and complexity of organization have been identified as the most important challenges in performance-based evaluations, limiting generalizability [4, 5].

Assessment of student performance using OSCEs involves a large number of factors that affect scores, such as students, examiners, stations and items within stations, clinical situations and standardized patients [6]. Multiple types of examiner variance have been reported to impact pass-fail decisions [7,8,9,10,11,12]. Several studies have shown differences in students’ scores due to both the leniency and stringency of examiners [7,8,9]. In one study, including 10,145 students and 1,259 examiners, this variance reached 12% [7]. Another study [8], involving 190 students undertaking an OSCE at the end of their family medicine clerkship, reported examiner variance of 44%, which could affect pass-fail decisions for 11% of students. The challenge is how to limit the variability of cutoff scores and achieve consistent examiner judgments of student performance in large-scale, multi-circuit assessments [13].

The validity, reliability and generalizability of OSCE scores are desirable in the context of summative assessments to justify the utility of OSCE for student evaluation [6, 14]. In classical test theory the calculation of Cronbach’s alpha can inform test score reliability but not whether examiner bias has had an influence on scores [6]. A generalizability study (G-study) enables the exploration of error variances that might contribute to students’ scores [15, 16]. Based on the findings of a G-study, decision studies are designed to minimize errors [15].

A majority of evaluations of medical OSCEs use regression models limited to only two effects (effects related to the candidate and those related to error) or to interrater reliability [17,18,19,20], which are not sufficient for establishing the reliability of test scores from an OSCE. Studies using G-theory to explore the reliability of structured assessments of medical skills are scarce and often involve a limited number of research groups [21]. A large range of G coefficients (0.48–0.93) has been reported in the context of summative OSCEs [22,23,24,25]. However, the results of different studies are difficult to compare given heterogeneous study designs, the complexities of OSCE organization and the different sources of variance (facets) explored. The problem of generalizing findings in the field of medical education has previously been addressed, with only 0.13% of studies being replications [26]. Some G-studies used a single-facet design [22, 27]. The number of stations included in the circuit varies from 2 to 25 [15, 22, 28,29,30]. Residual error can represent the greatest source of variance depending on the study design [9, 15, 23, 28, 29]. For example, one study [23] reported a G coefficient of 0.93 for a multisite summative OSCE (6 clinical sites over two days) with 4 versions and 18 equivalent stations; however, the authors did not evaluate examiners as a potential source of variance, and the greatest source of variance (51%) was residual error. In another study [22], the G coefficients of OSCE scores from eight cohorts of fourth-year medical students (n = 435; each OSCE comprised 25 stations) ranged from 0.48 to 0.80 (median 0.62), and the residual error was 73% for a two-factor design.

The national OSCE examination has recently been adopted in France, as part of the reform of the second cycle of medical studies, as a prerequisite for all medical students to enter accredited graduate medical education programs [31]. According to French national recommendations, undergraduate medical OSCEs are organized as bloc sessions during the 4th, 5th and 6th years of medical studies. Each session comprises 5 stations, and two stations are required for OSCE re-sits [32, 33]. A national qualifying OSCE session is composed of 10 stations [32, 33]. Three different scoring subunits (a dichotomous task-specific checklist evaluating clinical skills, behaviorally anchored scales evaluating generic skills, and a global performance scale) are used to evaluate students [34]. However, their effects on examiners’ ratings have not been explored.

To obtain complementary information for monitoring and improving the quality of the OSCE, we performed a pilot study applying generalizability theory (G-theory) [6, 35] to identify error variances that might contribute to the students’ scores. This report describes our approach and presents key issues and lessons learned from this experience.

Methods

Study settings

We applied G-theory in a pilot study [36] to a sample of 6th-year undergraduate medical students at Nancy Medical School (University of Lorraine) in the context of a final-year block faculty OSCE re-sit session at the end of the 2023 academic year. As variance component estimates from G-studies do not provide information about effect sizes, no power calculation was needed [36]. The OSCE re-sit was scheduled as one half-day session run in 4 parallel circuits. Students (n = 73) and examiners (n = 24) were randomly allocated to stations and circuits. All students had comparable study outcomes. Each student was assessed independently by a pair of examiners. Each circuit comprised three stations. We did not include stations with standardized patients (SPs), in order to separate variance related to examiners’ ratings from variance related to SPs’ performance. Each of the three stations was dedicated to one of the following situations: asthenia, discovery of abnormalities on pulmonary auscultation, or acute respiratory distress, within one of the following domains of clinical competency: i/ management strategy, ii/ physical examination, or iii/ diagnostic strategy. All stations assessed clinical skills; generic skills such as the ability to communicate with patients/peers, summarize clinical data, propose a care plan, etc.; and global performance. Each OSCE station lasted 8 min, with 1 min reserved for rotation between stations.

The stations were developed by clinical teachers according to the Learning Assessment Booklet (LiSA) for the 2nd cycle of medical studies [37], based on the curriculum standards for sixth-year medical students and with the same marking standards. The marking sheets of each station followed the same format and were composed of three scoring units (Supplementary Table 1): i/ a dichotomous task-specific checklist of 10 to 12 items, each rated 1 point (done) or 0 points (not done), for evaluating clinical tasks; ii/ two behaviorally anchored 5-point scales (0–0.25–0.5–0.75–1 point) for evaluating generic skills such as coherence in data synthesis or verbal expression; iii/ a five-point global rating scale for evaluating global performance (insufficient, borderline, satisfactory, very satisfactory, or outstanding performance). The global performance scale included descriptors (unique to each station) specifying the criteria for each level of performance to guide the examiners’ assessment. As detailed in Supplementary Table 1, content specificity refers to a single case (for instance, pneumonia), which does not allow evaluation of students’ performance across a range of areas and makes generalization of the assessed skills challenging. Domain specificity is a broader term, referring to areas such as clinical examination (Supplementary Table 1), history taking or data interpretation.

All OSCE examiners received mandatory examiner training regarding the use of checklists and scales, with the help of the national online learning resources for OSCE examiners (Uness Livret LiSA [37]), and specific instructions regarding the case to be evaluated (example provided in Supplementary Table 1). They participated in a pre-exam calibration session using video-taped scenarios demonstrating expected student performance at two different levels ("borderline" and "very satisfactory"). We chose the borderline and very satisfactory performance levels based on previous reports indicating that it is easier for examiners to agree on ratings of satisfactory performance [30].

Statistical analysis

Descriptive statistics and pass/fail standard

The variables are reported as means ± standard deviations (SDs), medians (interquartile ranges; IQRs) or percentages with 95% confidence intervals (95% CIs). The pass/fail standard for the OSCE stations was set using the borderline regression method (BRM) [38]. The total number of items represented 100% of the overall score. The clinical skills score and generic skills score were combined into a station score expressed on a scale from 0 to 20 points, in accordance with French OSCE guidelines [34, 39]. The overall OSCE pass mark was the average of the station cut scores within the circuit [9, 40, 41]. A reliability coefficient, Cronbach’s alpha, was calculated using a two-factor ANOVA without replication. The scores given by the two examiners at each station were compared with a two-sided Student's t test or the Mann‒Whitney test, as appropriate. A P value < 0.05 was considered statistically significant. All analyses were performed using MS Excel and the IBM Statistical Package for the Social Sciences (SPSS) version 29.0.
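As an illustration of how a borderline regression cut score can be derived, the sketch below regresses hypothetical station scores on the global performance rating and takes the predicted score at the borderline level as the station cut score. The data, the 0–4 coding of the global scale and the choice of borderline = 1 are assumptions for illustration only, not the values used in this study.

```python
import numpy as np

# Hypothetical data for one station: one row per student.
# total_score: combined station score on the 0-20 scale.
# global_rating: global performance level coded 0-4 (assumed: 1 = borderline).
total_score = np.array([11.5, 14.0, 9.0, 16.5, 12.5, 18.0, 10.5, 15.0])
global_rating = np.array([1, 2, 0, 3, 2, 4, 1, 3])

# Borderline regression: fit a straight line of station score on global rating
# and take the predicted score at the borderline level as the station cut score.
slope, intercept = np.polyfit(global_rating, total_score, deg=1)
borderline_level = 1
station_cut_score = intercept + slope * borderline_level
print(f"Station cut score: {station_cut_score:.2f} / 20")

# The overall OSCE pass mark would then be the mean of the station cut scores.
```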

Generalizability study

A generalizability study [16] was performed on the students’ raw scores using a partially crossed (two examiners per case) [42] random-effects design with three facets: person (student, the object of measurement; n = 73), station (n = 3) and examiner (n = 24), to identify which facets or their interactions were the main sources of measurement error. Each person (student) was crossed with each station. Because it was not possible to separate examiners from circuits, and to avoid an overestimation of reliability [43], the data were also analyzed with a fully crossed two-facet random-effects design with persons and stations as facets. Raw station scores were used as scoring subunits to identify sources of variability: the clinical skills score (expressed as a percentage), the generic skills score (expressed as a percentage), and the global performance rating. A variance component analysis was performed in IBM SPSS version 29.0. Minimum norm quadratic unbiased estimation (MINQUE) and restricted maximum likelihood (REML) approaches were used because of the unbalanced dataset [36]. MINQUE produces the least variance among all unbiased estimators and requires neither normally distributed data nor iteration [44]. Given that the REML approach does not immediately generalize to non-normally distributed data [45], reliability was calculated from the results of the MINQUE approach. Additionally, a re-analysis was done using the type III sum of squares (ANOVA) procedure.

The variance component estimates and their proportions of the total score variance were calculated, along with generalizability coefficients (G-coefficients), which reflect the proportion of score variance attributable to the theoretical true score [6, 16]. A decision study was used to calculate reliability estimates [36].
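Because the MINQUE estimation itself was run in SPSS, the following is only a minimal sketch of the underlying logic for a fully crossed two-facet (person × station) design, using the conventional ANOVA expected-mean-squares solution and the standard relative G coefficient. The score matrix is simulated and purely illustrative, and the coefficient shown is the textbook relative formula, which may differ in detail from the combined coefficients reported in the Results.

```python
import numpy as np

# Simulated person x station score matrix (rows = students, columns = stations),
# one combined score per cell, as in a fully crossed two-facet design.
rng = np.random.default_rng(0)
scores = rng.normal(60, 15, size=(73, 3))  # illustrative percentages only

n_p, n_s = scores.shape
grand = scores.mean()
person_means = scores.mean(axis=1)
station_means = scores.mean(axis=0)

# ANOVA mean squares for the p x s design with one observation per cell.
ms_p = n_s * np.sum((person_means - grand) ** 2) / (n_p - 1)
ms_s = n_p * np.sum((station_means - grand) ** 2) / (n_s - 1)
resid = scores - person_means[:, None] - station_means[None, :] + grand
ms_ps = np.sum(resid ** 2) / ((n_p - 1) * (n_s - 1))

# Expected-mean-square solutions for the variance components.
var_ps_e = ms_ps                       # person x station interaction + residual
var_p = max((ms_p - ms_ps) / n_s, 0)   # person (true-score) variance
var_s = max((ms_s - ms_ps) / n_p, 0)   # station (difficulty) variance

# Conventional relative G coefficient for a circuit of n_s stations.
g_rel = var_p / (var_p + var_ps_e / n_s)
print(f"var_p={var_p:.1f}, var_s={var_s:.1f}, var_ps_e={var_ps_e:.1f}, G={g_rel:.2f}")
```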

Ethical considerations

The study was registered with the French National Commission for Data Protection and Liberties (n 2020/118). The present research was of an educational nature. It was not a clinical trial and did not fulfill specific criteria for registration according to the Checklist for Evaluating Whether a Clinical Trial or Study is an Applicable Clinical Trial (ACT) Under 42 CFR 11.22(b) for Clinical Trials Initiated on or After January 18, 2017. Ethics approval and informed consent were not necessary according to French national regulations defined by article L 1123–7 of the Public Health Code (CSP)—Loi Jardé (n°2012–300 of March 5, 2012, in application in November 2016—Article R1121-1). All study participants received verbal information regarding the study objectives, voluntary participation, assurance of confidentiality, and the right to withdraw from the study according to the directives of the University of Lorraine. Participants may contact a data protection officer and the CNIL (http://www.cnil.fr) and can access their personal data, rectify them, delete them and limit their use. Analysis was performed after anonymization of the data.

Results

General presentation of the OSCE results

A total of 73 sixth-year medical students (50.7% females) and 24 examiners (46.0% females) participated in the OSCE session. There were no missing data. Descriptive statistics are presented in Table 1 and Fig. 1 for the full dataset. The mean overall OSCE score for clinical skills was 61.79 ± 19.07 (95% CI; 60.00; 63.58) as a percentage of the total score, 43.38 ± 19.07 (95% CI; 41.70; 45.50) for generic skills as a percentage of the total score, and 1.48 ± 0.91 (95% CI; 1.39; 1.56) for global performance. Differences in the mean scores between the two examiners were not significant. The overall mean pass mark determined by the BRM was 12.2 ± 2.1 SD (95% CI; 11.09; 11.77), with an overall pass rate of 74% across all stations. The combined estimate of reliability (Cronbach’s alpha) across all the stations was 0.66.

Table 1 Descriptive statistics for three scoring subunits across all data

Fig. 1 Histograms of clinical skills scores (A), generic skills scores (B), global performance ratings (C) and BRM scores (D)

Results of the G-studies

Figure 1A-C shows the distribution of the three scoring subunits in the full dataset: the clinical skills score, the generic skills score and the global performance rating. The variance components were modeled using the MINQUE approach (random three-facet and two-facet designs; Tables 2 and 3), with re-analyses using the REML method (Supplementary Table 2) and the type III sum of squares ANOVA (Tables 4 and 5). As mentioned earlier, results of the REML approach should be interpreted with caution, because this method does not generalize to non-normally distributed data [45]; therefore, reliability was calculated from the results of the MINQUE approach. The type III sum of squares ANOVA was employed to compute unbiased estimates for each effect (Tables 4 and 5). The ANOVA method sometimes produces negative variance estimates, which can indicate an incorrect model, an inappropriate estimation method, or a need for more data. Our data did not produce negative variance estimates, suggesting that the model was appropriate.

Table 2 Variance component estimates and generalizability coefficients for the 3-facet model (persons, examiners and stations) using a minimum norm quadratic unbiased estimation (MINQUE) approach
Table 3 Variance component estimates and generalizability coefficients for the 2-facet model (persons, stations) using a minimum norm quadratic unbiased estimation (MINQUE) approach
Table 4 Variance estimate table (type III sum of squares ANOVA) for the 3-facet model (persons, examiners and stations)
Table 5 Variance estimate table (type III sum of squares ANOVA) for the 2-facet model (persons and stations)

Results of the random 3-facet design

As indicated in Table 2, for the clinical skills scores, the variance in the total score attributed to student proficiency represented 12%. The station component accounted for 22.9% of the total score variance and the examiner component represented 14.4%. The variance attributable to the interactions among different facets and the residual variance represented 50.7% of the total variance.

To determine the reliability of the clinical skills scores, the G coefficient [16, 35] was calculated from the variance component estimates using the following equation (zero-valued variance components are omitted): Person/[Person + (Station/3) + (Examiner/24) + (Person × Station × Examiner/5256)]. The G-coefficient for the clinical skills score was 0.59. Based on the variance component estimates, the OSCE circuit would require 8 stations to achieve a G coefficient of 0.80.
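For readability, the expression above can be restated in conventional G-theory notation. In this restatement (which simply rewrites the in-text equation), σ²_p, σ²_s, σ²_e and σ²_pse,res denote the person, station, examiner and interaction/residual variance components, and the divisors 3, 24 and 5256 correspond to the numbers of stations, examiners and person × station × examiner combinations (73 × 3 × 24):

```latex
E\rho^{2} \;=\;
  \frac{\sigma^{2}_{p}}
       {\sigma^{2}_{p}
        + \dfrac{\sigma^{2}_{s}}{3}
        + \dfrac{\sigma^{2}_{e}}{24}
        + \dfrac{\sigma^{2}_{pse,\mathrm{res}}}{5256}}
```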

Regarding the generic skills score, some differences can be observed compared with the clinical skills score. The student component accounted for 17.4% of the total variance. The station component was small, representing 3% of the total score variance, and the examiner variance accounted for 5.6%. The person-station-examiner interactions and the residual error accounted for 73.9% of the total variance. The G coefficient, calculated according to the variance component estimates, was 0.93.

For the global performance rating, the student component represented 14.3% of the total score variance and the station component 13.9%. The examiner variance component accounted for 5.1% of the total score variance, a proportion comparable to that observed for the generic skills score. As for the clinical skills and generic skills scores, the principal sources of variance were the person-station-examiner interactions and the residual error, accounting for 66.7% of the total score variance. The G coefficient calculated from the variance component estimates was 0.75. To achieve a G coefficient of 0.8, the OSCE circuit would require 4 stations.

Results of the random 2-facet design

The results of the random two-facet design using the MINQUE method are shown in Table 3.

Similar to the results of the three-facet model, the station component accounted for a greater proportion of the total variance for the clinical skills score than for the generic skills score and the global performance rating. The person‒station interactions and the residual error represented the main sources of variance for all three scoring subunits. The G coefficients were calculated using the following equation: Person/[Person + (Station/3) + (Person × Station/219)]; they were 0.50 for clinical skills, 0.92 for generic skills and 0.74 for global performance. Based on the variance component estimates for clinical skills, the OSCE circuit would require 12 stations to achieve a G coefficient of 0.80.
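As a sketch of the decision-study projection behind these station counts, the snippet below applies the conventional relative-error formula G(n) = var_p / (var_p + var_ps_e / n) for an increasing number of stations and reports the smallest circuit reaching a target G. The variance components used are arbitrary placeholders, not the values estimated in this study.

```python
def stations_for_target_g(var_p, var_ps_e, target=0.80, max_stations=50):
    """Return the smallest number of stations n for which the projected
    relative G coefficient var_p / (var_p + var_ps_e / n) reaches the target."""
    for n in range(1, max_stations + 1):
        g = var_p / (var_p + var_ps_e / n)
        if g >= target:
            return n, g
    return None, None

# Placeholder variance components (person and person-x-station/residual error),
# in the same arbitrary units; not the values estimated in this study.
n_stations, projected_g = stations_for_target_g(var_p=25.0, var_ps_e=120.0)
print(f"{n_stations} stations give a projected relative G of {projected_g:.2f}")
```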

Discussion

The results of this study show that combined estimates of relative reliability across all data are greater for generic skills scores and global performance ratings than for clinical skills checklist scores. Our results are in concordance with other studies showing that global performance had higher reliabilities than did checklist scores [29, 30, 46, 47]. These observations suggest that G coefficients for generic skills scores and global performance ratings (based on behaviorally anchored scales) would generalize better than ratings for clinical skills (based on a dichotomous checklist).

For our three-factor model the values of the G coefficients were 0.59 for the clinical skills score, 0.93 for the generic skills score and 0.75 for the global performance score, indicating that 59%, 93% and 75%, respectively, of the score variation was explained by consistent differences between the students. The two-facet model had similar outcomes. Clinical skills checklists and a global performance scale were constructed to evaluate the same task. Each examiner evaluated students independently without consensus with the second examiner; therefore, it is less likely that generalizability coefficients are overestimated. The values of the G coefficient considered acceptable for high-stakes examination should be between 0.7 and 0.95 [6, 35]. Our results for generic skills and global performance ratings meet the standard of a G coefficient for high-stakes examination purposes. To achieve a G coefficient of 0.8 the OSCE circuit would require 8 to 12 stations based on estimates of the clinical skills score using three- and two-facet models, respectively.

In line with our findings the literature indicates smaller G coefficients for checklist scores than for global performance ratings [30]. This is likely explained by the fact that content-specific tasks evaluated using checklists produce greater variability in scores than scales evaluating broader competencies. One study in the context of a summative medical OSCE involving 70 students and 3 stations reported G coefficients of 0.82 for checklist scores and 0.96 for global performance [30]. Another group of investigators reported a G coefficient of 0.48 for checklist scores and 0.61 for global performance in the context of midwifery OSCE, which involved 4 sessions of 6 stations with a mean of 27 students per OSCE session [29].

The variance component related to the person (student) represented a small portion of the total score variance but was larger for generic skills scores and global performance ratings than for clinical skills scores in both the three- and two-facet models. In the three-facet model, greater variance in examiner stringency than in student ability was observed for the clinical skills score compared with the generic skills score and global performance. These findings further support the greater ability of generic skills scores and global ratings, compared with clinical skills scores, to differentiate students according to their performance. The person variance component reflects how much students vary in their proficiency and should be a major contributor to the score variance. Regarding OSCEs in health professions education, studies have reported large differences in the person variance component, ranging from 4% to 92% [9, 25, 29, 30, 48], or could not differentiate the effect of the person with the model used [23, 48].

In the two-facet model, the person-by-station interaction component accounted for more than 30% of the total score variance for generic skills and global performance, reflecting the content-specific performance of students. A high case specificity signifies that success on any case or station is specific to the case and does not generalize to other stations. Our observations agree with prior literature [15, 25, 49]. This measurement error can be reduced by increasing the number of stations included in the circuit or by eliminating problematic items/stations [22]. The person-by-station interaction component was larger for generic skills scores and global performance compared to clinical skills scores. This further suggests the influence of items or subunits nested within the station. This variance component can be advantageous for differentiating students’ abilities according to clinical domains of proficiency in the context of domain-specific qualifying examinations. In parallel, it also implies the use of piloted stations, for instance, from the OSCE bank, with known psychometric data and pass marks as references. In our context, the evaluation of domain-specific skills is desirable; however, more observational data are needed to interpret the scores and generalize the sample cases [14]. Repeated administration of the same stations and the use of the same examiners over several years would be helpful for obtaining more observational data.

The station variance component, which reflects variance in case difficulty, was largest for the clinical skills score, accounting for 20% of the total score variance in both the three- and two-facet models. In contrast, the station variance component was smaller for the global performance score and generic skills score, representing 14% and 3 to 4% of the total variance, respectively. This finding indicates that differences in station difficulty have a weaker influence on global ratings and generic skill scores. In parallel, these findings suggest that for the clinical skills score, the examiner rating for determining the pass mark differs by station and that the pass mark will change if other stations are employed.

The examiner component represented a small portion (less than 6%) of the total score variance for generic skills and global performance. In contrast, for the clinical skills scores, the examiner variance represented 14% of the total score variance. Note that the recommended threshold for examiner variance in the context of high-stakes examinations is 6.2% [16]. In line with the literature [9], our findings show that content- or domain-specific assessment grids may lead to greater variability in examiner stringency and leniency than do generic skills and global performance scales.

In our three-facet model, the multiple interactions among the different facets (i.e., person × examiner × station) could not be separated from the residual error (representing variance attributable to facet variables not included in the model). Furthermore, the examiner-by-person interaction component could not be explored with the present model. In our context, examiners evaluate stations that are not related to their clinical specialty; therefore, the potential risk of bias related to examiners’ expertise should be low. However, other sources of bias, such as time spent with the student or examiner expectations, could contribute to this variance [50]. This remains to be addressed in future studies.

Any potential contribution of the residual error to the total score variance could not be investigated in our three-facet model. A residual unidentified error in our two-facet model accounted for approximately 40% of the total variance for all three scoring subunits. The residual error was reported as the greatest source of variance by other studies [15, 23, 28, 29]. Several authors have shown that this residual variance is greater for checklist scores than for global ratings [29, 30], further suggesting that global ratings discriminate better between students. It has been shown that grouping items in domain-specific assessment grids can increase reliability [9, 48]; therefore, a global rating might be more advantageous than the use of dichotomous item-based checklists.

Strengths and limitations

This pilot study was conducted in the context of high-stakes OSCE in undergraduate medical education to identify sources of variance in examiners’ scores according to the French national blueprint for qualifying high-stakes medical OSCEs [34]. The number of stations in this study was small; nevertheless, it is in concordance with previous G-studies in the context of summative OSCEs with two [28], three [30] or six stations [15, 29]. In addition, this format corresponds to the French national recommendations for undergraduate medical OSCEs, with 5 stations required for bloc OSCE and 2 stations for re-sits [32, 33].

We used data collected in educational settings at our institution; therefore, we could not implement a fully crossed three-facet design. However, based on the literature, an unbalanced design can provide acceptable variance estimates [36]. In addition, we employed the MINQUE procedure, which is recommended for unbalanced datasets in place of the analysis of variance (ANOVA) type I and II sums of squares, as the latter can yield erroneous results [51].

As the study was conducted on the occasion of OSCE re-sits, the population of students may not be representative of the whole cohort of 6th-year students; however, all students had comparable study outcomes. OSCEs are highly dependent on SPs; therefore, SP training is crucial to ensure that they perform consistently. We deliberately did not include stations with SPs in our G-study, in order to isolate the sources of variance related to examiners’ ratings. This was a pilot study, and we are planning additional studies to explore the effects of SPs. Several other possible sources of variance, such as the gender, specialty and experience of examiners or the gender of students, were not within the scope of this study and may be the subject of further research.

Practical implications

The present G-study enabled us to establish how error variance is distributed among various sources of error (facets) and suggests a strategy for optimizing the reliability of examiners’ scores. We are planning to explore reliabilities using a greater number of stations. The checklists and global performance scales have been revised. With respect to domain specificity, we plan to collect more observations from other examination centers using a common station bank in order to interpret and generalize the scores meaningfully. This is important because the number of stations per examination is small according to the national recommendations [32] and because logistic constraints limit increases in the number of stations in the circuit. To minimize the examiner effect, we also plan to integrate some stations without examiners into the OSCE circuit, as suggested by Harden [1].

Conclusions

The results of this study suggest that ratings of generic skills and of global performance are superior to checklists for evaluating student abilities in performance-based exams. The higher G coefficients for generic skills and global performance indicate that we can be more confident in these scores than in the clinical skills scores. This work provides additional evidence that G-theory-based analysis is feasible at the local level, enabling rigorous evaluation of the quality of an OSCE. We are planning to use the estimated variance components for decision studies. Our approach provides a strategy for optimizing the reliability of students’ scores that takes into account the national examination blueprint. It can be valuable to other teaching institutions, as monitoring the sources of error is a principal quality control strategy for ensuring valid interpretations of students’ scores.

Data availability

The datasets generated and/or analyzed during the current study are not publicly available due to the privacy of the students but are available from the corresponding author on reasonable request.

References

  1. Harden RM, Stevenson M, Downie WW, Wilson GM. Assessment of Clinical Competence using Objective Structured Examination. BMJ. 1975;1:447–51.


  2. Khan KZ, Ramachandran S, Gaunt K, Pushkar P. The Objective Structured Clinical Examination (OSCE): AMEE Guide no. 81. Part II: Organisation and administration. Medical Teacher. 2013;35:e1447-63.


  3. Pell G, Homer M, Roberts T. Assessor training: Its effects on criterion-based assessment in a medical context. International Journal of Research & Method in Education. 2008;31(2):143–54.


  4. Van Der Vleuten CPM. The assessment of professional competence: Developments, research and practical implications. Adv Health Sci Educ. 1996;1(1):41–67.


  5. Swanson D, Norman G, Linn R. Performance-Based Assessment: Lessons From the Health Professions. Educ Res. 1995;1:24.


  6. Crossley J, Davies H, Humphris G, Jolly B. Generalisability: a key to unlock professional assessment. Med Educ. 2002;36(10):972–8.


  7. McManus IC, Thompson M, Mollon J. Assessment of examiner leniency and stringency ('hawk-dove effect’) in the MRCP(UK) clinical examination (PACES) using multi- facet Rasch modelling. BMC Medical Education. 2006;6:42.


  8. Harasym PH, Woloschuk W, Cunning L. Undesired variance due to examiner stringency/leniency effect in communication skill scores assessed in OSCEs. Adv Health Sci Educ. 2008;13(5):617–32.


  9. Homer M. Pass/fail decisions and standards: the impact of differential examiner stringency on OSCE outcomes. Adv Health Sci Educ. 2022;27:457–73.


  10. Bartman I, Smee S, Roy M. A method for identifying extreme OSCE examiners. Clin Teach. 2013;10(1):27–31.


  11. Yeates P, O’Neill P, Mann K, Eva KW. Seeing the same thing differently: Mechanisms that contribute to assessor differences in directly-observed performance assessments. Adv Health Sci Educ. 2013;18(3):325–41.


  12. Myford C, Wolfe E. Detecting and Measuring Rater Effects Using Many-Facet Rasch Measurement: Part II. J Appl Meas. 2004;1(5):189–227.


  13. Downing SM. Reliability: On the reproducibility of assessment data. Med Educ. 2004;38(9):1006–12.


  14. Downing S. Validity: On the meaningful interpretation of assessment data. Med Educ. 2003;1(37):830–7.


  15. Iramaneerat C, Yudkowsky R, Myford CM, Downing SM. Quality control of an OSCE using generalizability theory and many-faceted Rasch measurement. Adv Health Sci Educ. 2008;13(4):479–93.


  16. Tavakol M, Dennick R. Post-examination interpretation of objective test data: Monitoring and improving the quality of high-stakes examinations: AMEE Guide No. 66. Medical Teacher. 2012;34(3):e161-75.


  17. Faherty A, Counihan T, Kropmans T, Finn Y. Inter-rater reliability in clinical assessments: do examiner pairings influence candidate ratings? BMC Med Educ. 2020;20(1):147.


  18. Schleicher I, Leitner K, Juenger J, Moeltner A, Ruesseler M, Bender B, et al. Examiner effect on the objective structured clinical exam – a study at five medical schools. BMC Med Educ. 2017;17(1):71.


  19. Amadou C, Veil R, Blanié A, Nicaise C, Rouquette A, Gajdos V. Variance due to the examination conditions and factors associated with success in objective structured clinical examinations (OSCEs): first experiences at Paris-Saclay medical school. BMC Med Educ. 2024;24(1):716.


  20. Haviari S, de Tymowski C, Burnichon N, Lemogne C, Flamant M, Ruszniewski P, et al. Measuring and correcting staff variability in large-scale OSCEs. BMC Med Educ. 2024;24(1):817.


  21. Andersen SAW, Nayahangan LJ, Park YS, Konge L. Use of Generalizability Theory for Exploring Reliability of and Sources of Variance in Assessment of Technical Skills: A Systematic Review and Meta-Analysis. Academic Medicine. 2021;96(11). Available from: https://journals.lww.com/academicmedicine/fulltext/2021/11000/use_of_generalizability_theory_for_exploring.34.aspx

  22. Auewarakul C, Downing SM, Praditsuwan R, Jaturatamrong U. Item Analysis to Improve Reliability for an Internal Medicine Undergraduate OSCE. Adv Health Sci Educ. 2005;10(2):105–13.


  23. Trejo-Mejía JA, Sánchez-Mendiola M, Méndez-Ramírez I, Martínez-González A. Reliability analysis of the objective structured clinical examination using generalizability theory. Med Educ Online. 2016;21(1):31650.


  24. Vallevand A, Violato C. A Predictive and Construct Validity Study of a High-Stakes Objective Clinical Examination for Assessing the Clinical Competence of International Medical Graduates. Teach Learn Med. 2012;24(2):168–76.


  25. Peeters M, Cor M, Petite S, Schroeder M. Validation Evidence using Generalizability Theory for an Objective Structured Clinical Examination. INNOVATIONS in pharmacy. 2021;26(12):15.


  26. Makel MC, Plucker JA. Facts Are More Important Than Novelty: Replication in the Education Sciences. Educ Res. 2014;43(6):304–16.


  27. Baig LA, Violato C. Temporal stability of objective structured clinical exams: a longitudinal study employing item response theory. BMC Med Educ. 2012;12(1):121.


  28. Hatala R, Marr S, Cuncic C, Bacchus CM. Modification of an OSCE format to enhance patient continuity in a high-stakes assessment of clinical performance. BMC Med Educ. 2011;11(1):23.


  29. Govaerts MJB, Van der Vleuten CPM, Schuwirth LWT. Optimising the Reproducibility of a Performance-Based Assessment Test in Midwifery Education. Adv Health Sci Educ. 2002;7(2):133–45.


  30. Malau-Aduli B, Mulcahy S, Warnecke E, Otahal P, Teague PA, Turner R, et al. Inter-rater reliability: Comparison of checklist and global scoring for OSCEs. Creative Education. 2012;3(Special Issue):937–42.


  31. Arrêté du 21 décembre 2021 relatif à l’organisation des épreuves nationales donnant accès au troisième cycle des études de médecine. 2021 12 [cited 2024 Jun 9]; Available from: https://www.legifrance.gouv.fr/jorf/id/JORFTEXT000044572679

  32. Légifrance. Arrêté du 21 décembre 2021 portant modification de plusieurs arrêtés relatifs aux formations de santé. 2021 12; Available from: https://www.legifrance.gouv.fr/jorf/id/JORFTEXT000044616269

  33. Légifrance. Arrêté du 24 juillet 2023 portant modification de l’arrêté du 8 avril 2013 relatif au regime des études en vue du premier et du deuxième cycle des études médicales. Journal officiel de la République française. 2023 09;(204). Available from: https://www.legifrance.gouv.fr/jorf/id/JORFTEXT000048038966.

  34. Braun M, Feigerlova E, on behalf of the national OSCE working group. Construire la station d’ECOS et le circuit d’ECOS Cadre general. In : Formation aux ECOS. Conférence des doyens. 2022 [cited 2024 Jun 8]; Available from: https://formation.uness.fr/formation/course/view.php?id=21420.

  35. Tavakol M, Dennick R. Making sense of Cronbach’s alpha. Int J Med Educ. 2011;2:53–5.


  36. Crossley J, Russell J, Jolly B, Ricketts C, Roberts C, Schuwirth L, et al. ‘I’m pickin’ up good regressions’: the governance of generalisability analyses. Med Educ. 2007;41(10):926–34.


  37. Université numérique en santé et sport (Uness) Livret de Suivi des Apprentissages (LiSA). 2023 [cited 2024 Jun 8]; Available from: https://www.uness.fr/nos-services/environnement-uness/uness-livret-lisa

  38. Pell, G., Fuller, R., Homer, M., Roberts, T. How to measure the quality of the OSCE: a review of metrics - AMEE guide no. 49. Medical Teacher. 2010;32:802–11.

  39. Feigerlová E, Braun M. Réflexions sur la démarche de standardisation d’examens cliniques objectifs structurés (ECOS) dans le cadre de la réforme du deuxième cycle des études de médecine en France : expérience sur une cohorte d’étudiants en médecine. Pédagogie Médicale. 2024;25(2):79–86.


  40. McKinley, D.W., Norcini, J.J. How to set standards on performance-based examinations: AMEE Guide No. 85. Medical Teacher. 2014;36(2):97–110.

  41. Hejri SM, Jalili M, Muijtjens AM, Van Der Vleuten CP. Assessing the reliability of the borderline regression method as a standard setting procedure for objective structured clinical examination. Journal of Research in Medical Sciences. 2013;18:887–91.


  42. Lawson DM. Applying Generalizability Theory to High-Stakes Objective Structured Clinical Examinations in a Naturalistic Environment. J Manipulative Physiol Ther. 2006;29(6):463–7.


  43. Sireci SG, Thissen D, Wainer H. On the Reliability of Testlet-Based Tests. J Educ Meas. 1991;28(3):237–47.


  44. Rao C R. Estimation of Variance and Covariance Components—MINQUE Theory. Journal of Multivariate Analysis. 1971;1(257–275).

  45. McCulloch C, Searle S, Neuhaus J. Generalized, Linear, and Mixed Models, 2nd Edition. 2008;1–424.

  46. Wilkinson TJ, Frampton CM, Thompson-Fawcett M, Egan T. Objectivity in Objective Structured Clinical Examinations: Checklists Are No Substitute for Examiner Commitment. Academic Medicine. 2003;78(2). Available from: https://journals.lww.com/academicmedicine/fulltext/2003/02000/objectivity_in_objective_structured_clinical.21.aspx

  47. Hodges B, McIlroy JH. Analytic global OSCE ratings are sensitive to level of training. Med Educ. 2003;37(11):1012–6.


  48. Santen SA, Ryan M, Helou MA, Richards A, Perera RA, Haley K, et al. Building reliable and generalizable clerkship competency assessments: Impact of ‘hawk-dove’ correction. Med Teach. 2021;43(12):1374–80.


  49. Norman G, Bordage G, Page G, Keane D. How specific is case specificity? Med Educ. 2006;40(7):618–23.


  50. Gingerich A, Kogan J, Yeates P, Govaerts M, Holmboe E. Seeing the ‘black box’ differently: assessor cognition from three research perspectives. Med Educ. 2014;48(11):1055–68.


  51. Baltagi B, Song S, Jung B. A comparative study of alternative estimators for the unbalanced 2-way error component regression model. Econometrics Journal. 2002;1(5):480–93.



Acknowledgements

I would like to thank Marc Braun (Dean of the Medical School of the University of Lorraine) for his continuous support with the setting up of the OSCE program. I also thankfully acknowledge members of the OSCE pedagogical team of the Medical School of the University of Lorraine.

Funding

This study received no funding from any institution. The investigator contributed voluntarily to this project.

Author information

Authors and Affiliations

Authors

Contributions

E.F. designed the study, wrote the main manuscript text, prepared figures and Tables and reviewed the manuscript.

Corresponding author

Correspondence to Eva Feigerlova.

Ethics declarations

Ethics approval and consent to participate

As described in the Ethical considerations subsection of the Methods, the study was registered with the French National Commission for Data Protection and Liberties (n 2020/118), and ethics approval and informed consent were not required under French national regulations (Loi Jardé).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.


Supplementary Information

Supplementary Material 1: Supplementary Table 1. Example of the examiner’s marking sheet


Supplementary Material 2: Supplementary Table 2. Variance component estimates and generalizability coefficients for the 3-facet model (persons, examiners and stations) using a restricted maximum likelihood estimation method (REML) approach

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Feigerlova, E. Applying generalized theory to optimize the quality of high-stakes objective structured clinical examinations for undergraduate medical students: experience from the French medical school. BMC Med Educ 25, 643 (2025). https://doi.org/10.1186/s12909-025-07255-y

