
AI-generated questions for urological competency assessment: a prospective educational study

Abstract

Background

The integration of artificial intelligence (AI) in medical education assessment remains largely unexplored, particularly in specialty-specific evaluations during clinical rotations. Traditional question development methods are time-intensive and often struggle to keep pace with evolving medical knowledge. This study evaluated the effectiveness of AI-generated questions in assessing urological competency among medical interns during a standardized clinical rotation.

Methods

Two state-of-the-art AI language models (ChatGPT and Gemini) generated 300 multiple-choice questions across six urological subspecialties. Seven experienced urologists, each with over 10 years of clinical practice and active involvement in resident training programs, independently evaluated the questions using a modified Delphi approach with standardized scoring rubrics. The evaluation criteria encompassed technical accuracy based on current clinical guidelines, clinical relevance to core rotation objectives, construct validity assessed through cognitive task analysis, and alignment with rotation objectives. Questions achieving consensus approval from at least five experts were retained, resulting in 100 validated questions.

Results

From the initial cohort of 45 eligible interns, 42 completed all three assessment points (93.3% completion rate). Performance improved significantly from baseline (mean: 45.2%, 95% CI: 42.6–47.8%) through mid-rotation (mean: 62.8%, 95% CI: 60.4–65.2%) to final assessment (mean: 78.4%, 95% CI: 76.5–80.3%). Technical accuracy was comparable between AI platforms (ChatGPT: 84.3%, Gemini: 83.8%, p = 0.86). Clinical scenario questions demonstrated better discrimination than recall questions (mean indices: 0.28 vs. 0.14, p < 0.001). Subspecialty performance varied, with the highest scores in uro-oncology (mean: 82.6%, 95% CI: 80.2–85.0%) and endourology (mean: 79.4%, 95% CI: 77.0–81.8%).

Conclusions

AI-generated questions showed appropriate technical accuracy and difficulty levels for assessing clinical competency in urology. While promising for formative assessment, particularly with clinical scenarios, current limitations in discrimination capability suggest careful consideration for high-stakes testing. The strong correlation between clinical exposure and improved performance validates their effectiveness in measuring knowledge acquisition. These findings support the potential integration of AI-generated questions in specialty-specific assessment, though careful implementation with expert oversight and continuous validation remains essential.

Clinical trial number

Not applicable.


Background

The development of high-quality assessment tools in urological education faces unique challenges due to the field’s diverse subspecialties, complex procedural skills, and rapidly evolving treatment modalities. While artificial intelligence (AI) has shown promise in medical education assessment across various specialties [1, 2], the unique characteristics of urological training—combining oncological decision-making, surgical technique evaluation, and management of chronic conditions—present distinct opportunities and challenges for AI application.

Recent studies have demonstrated AI’s effectiveness in generating medical education content across specialties, including general surgery, hand surgery, and obstetrics-gynecology, with reported technical accuracy rates of 80–85% [3, 4]. Building upon these advances, our study addresses the specific needs of urological education, where assessment must integrate complex surgical decision-making with medical management across six distinct subspecialties. The procedural nature of urology, combined with its broad scope from oncology to transplantation, requires assessment tools that can evaluate both technical knowledge and clinical reasoning in ways that may differ from other surgical specialties [7, 8].

Urology education presents particular assessment challenges: the need to evaluate competency across diverse procedures (from minimally invasive to open surgery), the integration of rapidly evolving technologies, and the requirement for decision-making that spans both acute and chronic care scenarios. Traditional methods of question generation struggle to keep pace with these demands, often requiring significant faculty time to create comprehensive assessments that adequately cover all subspecialty domains [9, 10]. While AI shows promise in addressing these challenges, its effectiveness in capturing the nuanced requirements of urological training requires specific validation.

This study aims to evaluate the feasibility and effectiveness of AI-generated questions in assessing urological competency during clinical rotations. Specifically, we seek to: (1) assess the technical accuracy and psychometric properties of AI-generated questions in the context of urological education, (2) compare performance across different question types and subspecialties, and (3) evaluate the correlation between clinical exposure and assessment outcomes. Understanding these aspects is crucial for developing evidence-based approaches to integrating AI in specialty-specific medical assessment.

Methods

This prospective validation study employed Messick’s framework to evaluate AI-generated questions for urological competency assessment. The study was conducted between October 2024 and December 2024 at the Department of Urology, Mersin University Faculty of Medicine, Turkey. The study protocol, including predetermined assessment points and outcome measures, was established prior to data collection following established validation frameworks [11, 12]. The ethics committee of Mersin University Faculty of Medicine reviewed and approved this study (Ethics Committee Meeting Date: 19.02.2025, Decision No: 181). All participants provided written informed consent before participation.

Participant selection and setting

Eligible participants were fourth-year medical students completing their mandatory urology rotation at Mersin University Faculty of Medicine. Inclusion criteria were: (1) no previous urology rotation experience, (2) good academic standing, and (3) commitment to completing all three weeks of the rotation. Students with previous urology elective experience or extended absences were excluded. Before the rotation, students completed a questionnaire on their prior exposure to AI-based learning tools in medical education, which was used to categorize them as having “prior AI experience” or not.

Clinical rotation structure

The clinical rotation curriculum encompassed 72 h of supervised practice over three weeks, with students attending daily sessions from 8:00 AM to 4:00 PM. The curriculum was distributed across subspecialties: uro-oncology (35%, 25.2 h), endourology (25%, 18.0 h), functional urology (15%, 10.8 h), pediatric urology (10%, 7.2 h), andrology (10%, 7.2 h), and transplantation (5%, 3.6 h).

Daily activities consisted of a structured schedule beginning with morning rounds from 8:00–9:30 AM, followed by outpatient clinic participation from 9:30 AM to 12:00 PM. After lunch break, students attended case-based learning sessions from 1:00–2:00 PM and then participated in operating room observation and assistance from 2:00–4:00 PM. Additionally, clinical skills workshops were conducted twice per week to reinforce practical competencies.

AI question generation process

Two state-of-the-art AI language models were used: ChatGPT (GPT-4, version 4.0, OpenAI) and Gemini (version 1.5, Google AI), accessed in December 2024. The standardized prompt template was developed through a systematic process that began with initial development based on literature review, medical education expert consultation, and experienced question writers’ input (detailed in Supplementary Material S1). The prompt template included detailed specifications for question structure, content requirements, and quality control criteria across six urological subspecialties: uro-oncology (35% of questions), endourology (25%), functional urology (15%), pediatric urology (10%), andrology (10%), and transplantation (5%).

For each subspecialty, the prompt template specified key focus areas. For example, uro-oncology questions covered prostate cancer staging and management, bladder cancer diagnosis and treatment, kidney cancer imaging and surgical approaches, and cancer surveillance protocols. Endourology questions focused on stone disease management, endoscopic procedures, imaging interpretation, and metabolic evaluation. The template required all questions to follow a standardized format with a clinical scenario or stem, a clear focused question, five answer options with one correct answer, and a detailed explanation. Quality control requirements mandated alignment with current clinical guidelines, clear single correct answers, plausible distractors, unambiguous language, and real-world clinical scenarios.
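The verbatim prompt wording is given in Supplementary Material S1. As an illustration of how such a template can be applied programmatically, the sketch below issues a paraphrased version of the prompt through the OpenAI Python client; the prompt text, function name, and focus-area strings are assumptions for demonstration rather than the study's exact materials.

```python
# Illustrative sketch only: the study's verbatim prompt is in Supplementary
# Material S1; the wording, function name, and focus strings below are
# paraphrased assumptions for demonstration.
from openai import OpenAI  # official OpenAI Python client (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "Write one multiple-choice question for a medical student urology rotation.\n"
    "Subspecialty: {subspecialty}. Focus areas: {focus}.\n"
    "Format: a clinical scenario stem, one clear focused question, five answer\n"
    "options (A-E) with exactly one correct answer and plausible distractors,\n"
    "and a detailed explanation. Align with current clinical guidelines and\n"
    "avoid ambiguous language."
)

def generate_question(subspecialty: str, focus: str) -> str:
    """Request a single question from GPT-4 using the shared template."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(subspecialty=subspecialty,
                                                     focus=focus)}],
    )
    return response.choices[0].message.content

# Example call mirroring one of the endourology focus areas described above:
# generate_question("endourology", "stone disease management and imaging interpretation")
```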

Each model generated 150 multiple-choice questions, distributed across the subspecialties in the proportions specified above, to ensure comprehensive coverage of core competencies, sufficient questions for reliable psychometric analysis, and balanced assessment across all domains regardless of clinical exposure time. This systematic approach to question generation, combined with rigorous quality control requirements, helped ensure consistency and clinical relevance across both AI platforms.

Expert validation methodology

Seven experienced urologists, each with over 10 years of clinical practice and active involvement in resident training programs, independently evaluated the questions. The validation process combined structured review using standardized rubrics with a two-round Delphi process, and consensus meetings were held to resolve disagreements. The assessment encompassed technical accuracy, clinical relevance, educational value, question structure, and alignment with rotation objectives.
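For illustration, the sketch below applies the retention rule used in the study (approval by at least five of the seven experts) to a hypothetical ratings matrix and computes Fleiss' kappa with statsmodels; the ratings are randomly simulated, so the output will not reproduce the reported κ = 0.72.

```python
# Hypothetical illustration of the retention rule (>= 5 of 7 approvals) and the
# agreement statistic; ratings are simulated, so kappa will not equal 0.72.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(300, 7))   # 300 questions x 7 experts, 1 = approve

# Consensus rule described in the Methods: retain items approved by >= 5 experts.
retained = ratings.sum(axis=1) >= 5
print(f"retained {retained.sum()} of {len(retained)} questions")

# Overall inter-rater agreement across all items.
counts, _ = aggregate_raters(ratings)         # per-item counts of each rating category
print(f"Fleiss' kappa = {fleiss_kappa(counts):.2f}")
```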

Each supervisor maintained a consistent weekly schedule, ensuring standardized teaching approaches across all student groups. The student-to-faculty ratio was maintained at 4:1 for clinical activities and 8:1 for theoretical sessions. Weekly faculty meetings were held to discuss student progress and maintain teaching consistency.

Assessment implementation

The three assessment points were conducted at specific intervals: baseline assessment on day 1 (before starting the rotation), mid-rotation assessment on day 11 (after completing 40 h of training), and final assessment on day 21 (after completing all 72 h), ensuring consistent intervals of 10 days between each assessment point.

To prevent test-retest bias and ensure independent assessment at each time point, the 100 validated questions were randomly divided into three equivalent sets (A, B, and C), with each set containing approximately 33–34 questions. The distribution of questions across subspecialties and difficulty levels was maintained proportionally in each set to ensure comparability. Detailed psychometric properties of each question set, including difficulty indices, discrimination indices, and reliability coefficients, are presented in Supplementary Table S1. Statistical analysis confirmed no significant differences in difficulty indices between sets (Set A: mean difficulty index 0.62 ± 0.14; Set B: 0.63 ± 0.13; Set C: 0.61 ± 0.15; p = 0.89). The order of set administration was randomized across participants to control for potential order effects.
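A minimal sketch of one way to perform such a stratified split is shown below, assuming each validated item carries subspecialty and difficulty-level labels; the field names and the round-robin allocation are illustrative assumptions, not the study's exact procedure.

```python
# Sketch: stratified split of the validated items into three comparable sets so
# that each set preserves the subspecialty and difficulty-level mix. Field names
# and the round-robin allocation are illustrative assumptions.
import random
from collections import defaultdict

def split_into_sets(items, seed=42):
    """items: dicts with 'subspecialty' and 'difficulty_level' keys."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:                                    # group items by stratum
        strata[(item["subspecialty"], item["difficulty_level"])].append(item)

    sets, labels, offset = {"A": [], "B": [], "C": []}, ["A", "B", "C"], 0
    for stratum_items in strata.values():
        rng.shuffle(stratum_items)
        for i, item in enumerate(stratum_items):          # deal round-robin, rotating
            sets[labels[(i + offset) % 3]].append(item)   # the starting set per stratum
        offset += len(stratum_items)
    return sets

# Example with 100 items in the study's subspecialty proportions (one difficulty
# label per subspecialty here, purely for brevity):
demo = ([{"subspecialty": "uro-oncology", "difficulty_level": "medium"}] * 35 +
        [{"subspecialty": "endourology", "difficulty_level": "medium"}] * 25 +
        [{"subspecialty": "functional urology", "difficulty_level": "medium"}] * 15 +
        [{"subspecialty": "pediatric urology", "difficulty_level": "medium"}] * 10 +
        [{"subspecialty": "andrology", "difficulty_level": "medium"}] * 10 +
        [{"subspecialty": "transplantation", "difficulty_level": "medium"}] * 5)
print({label: len(qs) for label, qs in split_into_sets(demo).items()})  # ~33-34 each
```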

Data collection and analysis

Sample size was calculated using G*Power software (version 3.1), with parameters set at medium effect size (f = 0.25), alpha of 0.05, and power of 0.90, requiring a minimum sample of 36 participants. The statistical approach consisted of repeated measures ANOVA for longitudinal performance assessment, multiple regression for identifying predictive factors, and Cronbach’s alpha calculations for reliability assessment. Normality was assessed using the Shapiro-Wilk test. All analyses were performed using SPSS software (version 26.0) with significance set at p < 0.05.
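Of the statistics listed above, Cronbach's alpha has the simplest closed form (the number of items scaled by one minus the ratio of summed item variances to total-score variance); the sketch below computes it on a simulated item-response matrix sized like one assessment administration in this study. The data are invented, so the value will not match the study's reliability coefficients, and the G*Power calculation and ANOVA are not reproduced here.

```python
# Minimal Cronbach's alpha on a simulated 0/1 item-response matrix sized like one
# administration in this study (42 examinees x ~34 items); values are illustrative.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
ability = rng.normal(size=(42, 1))                      # latent examinee ability
noise = rng.normal(size=(42, 34))                       # item-level noise
responses = (ability + noise > 0).astype(float)         # 1 = correct, driven partly by ability
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```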

Results

From the initial cohort of 45 eligible students invited to participate, 42 completed all three assessment points (93.3% completion rate). Three participants withdrew due to scheduling conflicts (n = 2) and personal reasons (n = 1). The final sample size (n = 42) exceeded the minimum requirement determined by power analysis (n = 36), ensuring adequate statistical power for all planned analyses. The mean age was 25.3 years (95% CI: 24.6–26.0), with 24 (57.1%) male and 18 (42.9%) female participants. All participants maintained consistent attendance (> 90%) in theoretical sessions and clinical activities throughout the rotation (Table 1).

Table 1 Participant demographics and baseline characteristics

Initial AI generation yielded 300 questions (150 each from ChatGPT and Gemini). Expert validation retained 100 questions (50 from each platform, 33.3% retention rate), with high inter-rater agreement (κ = 0.72, 95% CI: 0.65–0.79) as shown in Figure S1A.

Subspecialty analysis revealed significant variations in AI-generated question quality. Technical accuracy rates varied across subspecialties: uro-oncology (87.2%), endourology (85.4%), functional urology (84.1%), pediatric urology (82.3%), andrology (81.9%), and transplantation (80.6%). Similarly, expert validation retention rates showed notable differences: uro-oncology (38.2%), endourology (35.6%), functional urology (33.4%), pediatric urology (31.8%), andrology (30.2%), and transplantation (28.9%). These variations correlated with the standardization level of clinical guidelines and protocols within each subspecialty (r = 0.82, p < 0.001).

The distribution of expert ratings and quality metrics comparison between AI platforms is shown in Fig. 1A and B, respectively. Both AI platforms demonstrated comparable technical accuracy (ChatGPT: 84.3 ± 6.2%, Gemini: 83.8 ± 5.9%, p = 0.86) and retention rates (33.3% each), as detailed in Table 2, with question difficulty distributions presented in Figure S1B. Traditional board examination questions showed significantly higher discrimination indices (mean: 0.42, 95% CI: 0.37–0.47) compared to AI-generated questions (mean: 0.28, 95% CI: 0.24–0.32; p < 0.001), though difficulty indices were similar.

Performance improved significantly across the three assessment points (F = 156.4, p < 0.001), with mean scores increasing from baseline (mean: 45.2%, 95% CI: 42.6–47.8%) through mid-rotation (mean: 62.8%, 95% CI: 60.4–65.2%) to final assessment (mean: 78.4%, 95% CI: 76.5–80.3%). The learning progression based on prior AI experience is illustrated in Fig. 2A, while the subspecialty performance distribution is shown in Fig. 2B. The greatest improvement occurred during the first week (Δ17.6%), followed by continued but more gradual gains in subsequent weeks (Δ15.6%), with detailed learning rate analysis presented in Figure S2A. Individual learning trajectories demonstrated consistent improvement patterns, with 95% of participants showing significant score increases (p < 0.001) between consecutive assessments. The consistency of performance across subspecialties for individual students is visualized in Figure S2B.

Fig. 1

AI system analysis. A) Distribution of Expert Ratings showing rating scores for GPT-4 and Gemini generated questions (t-test: p = 0.824). B) Question Quality Comparison between AI platforms across technical accuracy, discrimination index, and difficulty index metrics

Table 2 Question bank analysis and expert evaluation results
Fig. 2

Learning analysis. A) Learning Progression by Prior Experience comparing performance trajectories of students with and without prior AI experience across assessment stages. B) Subspecialty Performance Distribution showing violin plots of achievement levels across six urological subspecialties

Table 3 Performance progress analysis across training stages
Table 4 Subspecialty performance analysis at final assessment
Table 5 Learning efficiency analysis by training period

The 72-hour supervised clinical curriculum was distributed across subspecialties according to clinical complexity and patient volume. Uro-oncology comprised 35% (mean: 25.2 h, 95% CI: 24.6–25.8), endourology 25% (mean: 18.0 h, 95% CI: 17.6–18.4), and functional urology 15% (mean: 10.8 h, 95% CI: 10.5–11.1), with pediatric urology and andrology each at 10% (mean: 7.2 h, 95% CI: 7.0–7.4 for both). Transplantation comprised the remaining 5% (mean: 3.6 h, 95% CI: 3.5–3.7).

From the 100 validated questions, clinical scenario-based questions (n = 55) demonstrated higher discrimination indices (mean: 0.28, 95% CI: 0.24–0.32) compared to knowledge-recall questions (n = 45, mean: 0.14, 95% CI: 0.10–0.18; t = 6.82, p < 0.001). Integration-type questions (n = 25) exhibited the highest discrimination indices (mean: 0.34, 95% CI: 0.29–0.39), while application questions (n = 30) showed moderate discrimination (mean: 0.29, 95% CI: 0.24–0.34).
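For reference, the classical item statistics underlying these comparisons can be computed as follows: the difficulty index is the proportion of examinees answering an item correctly, and the discrimination index contrasts that proportion between high- and low-scoring groups. The sketch below assumes the conventional upper/lower 27% grouping, which the paper does not explicitly specify, and uses simulated responses.

```python
# Classical item analysis: difficulty index = proportion correct; discrimination
# index = p(correct | upper group) - p(correct | lower group). The upper/lower 27%
# grouping is a conventional assumption, not specified in the paper; data simulated.
import numpy as np

def item_indices(responses: np.ndarray, group_fraction: float = 0.27):
    """responses: 0/1 matrix, rows = examinees, columns = items."""
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    n_group = max(1, int(round(group_fraction * len(totals))))
    lower, upper = order[:n_group], order[-n_group:]

    difficulty = responses.mean(axis=0)
    discrimination = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)
    return difficulty, discrimination

rng = np.random.default_rng(2)
ability = rng.normal(size=(42, 1))
demo = (ability + rng.normal(size=(42, 34)) > -0.4).astype(int)   # roughly 60% correct overall
diff, disc = item_indices(demo)
print(f"mean difficulty {diff.mean():.2f}, mean discrimination {disc.mean():.2f}")
```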

Performance varied significantly across subspecialties (F = 12.34, p < 0.001), with uro-oncology achieving a mean of 82.6% (95% CI: 80.9–84.3%), endourology reaching 79.4% (95% CI: 77.6–81.2%), and functional urology attaining 77.8% (95% CI: 75.9–79.7%). The detailed analysis of individual learning trajectories, performance predictions, and subspecialty correlations is presented in Fig. 3A-C. The evolution of performance distributions, learning efficiency comparisons, and cumulative achievement analysis are shown in Fig. 4A-C. A strong positive correlation was observed between clinical exposure time and performance improvement (r = 0.78, p < 0.001). Notably, performance in transplantation-related questions, despite limited clinical exposure (5% of rotation time), showed substantial improvement (Δ28.4%), suggesting effective knowledge transfer through case-based discussions.

Fig. 3

Detailed performance analysis. A) Individual Learning Trajectories with 95% confidence intervals showing progression from baseline to final assessment. B) Performance Prediction analysis demonstrating correlation between baseline and final scores (R²=0.852). C) Subspecialty Performance Correlation Network visualizing relationships between different urological domains

Fig. 4

Comparative analysis. A) Performance Distribution Evolution showing score distributions at each assessment stage. B) Learning Efficiency Comparison between initial and final learning periods. C) Cumulative Learning Achievement tracking total performance gains over time

No significant differences were found based on gender (male vs. female, p = 0.68) or age group (≤ 25 vs. > 25 years, p = 0.42). Participants with prior clinical exposure (n = 15) showed higher baseline scores (mean: 52.3%, 95% CI: 48.7–55.9%) than those without prior exposure (n = 27; mean: 41.4%, 95% CI: 38.0–44.8%; p = 0.002), though this advantage had diminished by the final assessment (prior exposure: mean 79.8%, 95% CI: 77.1–82.5% vs. no prior exposure: mean 77.6%, 95% CI: 74.8–80.4%; p = 0.24).

Discussion

This study presents a comprehensive evaluation of AI-generated questions for urological competency assessment within a structured three-week clinical rotation at Mersin University Faculty of Medicine. Our findings demonstrate both the potential and current limitations of AI systems in medical education assessment, with several key implications for future implementation.

Notably, the variation in technical accuracy and validation rates across subspecialties provides important insights for educators. The higher accuracy rates in uro-oncology (87.2%) and endourology (85.4%) likely reflect the availability of well-established guidelines and standardized clinical pathways in these areas. In contrast, the lower accuracy in transplantation (80.6%) suggests that faculty should allocate additional review resources to questions in this subspecialty. These findings can help institutions optimize their question validation processes by focusing expert review effort on subspecialties with historically lower accuracy rates.

Technical accuracy and quality considerations

The observed technical accuracy rate of approximately 84% (ChatGPT: 84.3%, Gemini: 83.8%) raises important considerations for medical education assessment. While this rate might be acceptable for certain educational contexts, it underscores the current limitations of AI systems in high-stakes medical assessment. The standardized prompt template and quality control requirements (detailed in Supplementary Material S1) were crucial in achieving consistent question quality across both AI platforms.

Quality assurance is essential and requires mandatory expert review and modification of all AI-generated content. This includes implementation of multi-stage validation processes, regular updates to maintain alignment with current medical knowledge, and continuous monitoring of question performance metrics.

Integration framework

Based on our findings, we propose a structured framework for integrating AI-generated questions in medical education. The initial generation phase involves AI system question generation using standardized prompts, preliminary automated quality checks, and basic format and content verification. This is followed by expert review and modification, which includes systematic review by subject matter experts, content accuracy verification, clinical relevance assessment, and modification of problematic questions.

The validation and analysis phase encompasses psychometric analysis of modified questions, pilot testing with student groups, performance metric evaluation, and iterative refinement based on results. The implementation and monitoring phase involves gradual integration into curriculum, continuous performance tracking, regular updates and improvements, and faculty development and support.

This framework emphasizes the importance of human oversight while leveraging AI capabilities to enhance efficiency in question development.
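One way to operationalize these four phases is to track each question's status explicitly as it moves through the pipeline. The sketch below is an illustrative data model only; the stage names follow the framework above, while the class, fields, and example values are assumptions.

```python
# Illustrative data model for the four-phase framework; the stage names follow the
# text above, while the class, fields, and example values are assumptions.
from dataclasses import dataclass, field
from enum import Enum, auto

class Stage(Enum):
    GENERATED = auto()        # AI output that passed automated format checks
    EXPERT_APPROVED = auto()  # retained or modified after subject-matter review
    VALIDATED = auto()        # acceptable psychometrics in pilot testing
    DEPLOYED = auto()         # released into the curriculum, under monitoring

@dataclass
class QuestionRecord:
    question_id: str
    subspecialty: str
    stage: Stage = Stage.GENERATED
    history: list = field(default_factory=list)

    def advance(self, new_stage: Stage, note: str) -> None:
        """Move the item to the next phase and log the reason."""
        self.history.append(f"{self.stage.name} -> {new_stage.name}: {note}")
        self.stage = new_stage

q = QuestionRecord("UO-017", "uro-oncology")   # hypothetical item identifier
q.advance(Stage.EXPERT_APPROVED, "5/7 expert consensus, minor stem edit")
q.advance(Stage.VALIDATED, "difficulty 0.62, discrimination 0.31 in pilot")
print(q.stage.name, *q.history, sep="\n")
```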

Methodological considerations and limitations

Our study has several important limitations that should be considered when interpreting the results. First, our single-center design at Mersin University Faculty of Medicine, while allowing for standardized implementation across subspecialties, limits generalizability to other educational settings. This limitation is particularly relevant given potential variations in curriculum structure, teaching methods, and student populations across different institutions.

Second, although our final sample size (n = 42) exceeded the minimum requirement for statistical power, it represents a relatively small cohort from a single academic year. This limited sample size may not capture the full spectrum of student learning patterns and could be influenced by cohort-specific characteristics. Additionally, the voluntary nature of participation might have introduced selection bias, potentially attracting more motivated students.

Third, the expert validation process, while rigorous, relied on seven urologists from our institution. This may have introduced institutional bias in question selection and validation. Furthermore, the retention rate of 33.3% for AI-generated questions suggests significant room for improvement in the initial generation process. The current limitations of AI models in capturing nuanced clinical decision-making scenarios were evident in the lower discrimination indices compared to traditional questions.

Fourth, the three-week rotation period, though standard for our curriculum, may be insufficient to assess long-term knowledge retention and clinical competency development. The intensive nature of the rotation might have promoted short-term knowledge acquisition at the expense of deeper learning. We recommend follow-up studies at 3, 6, and 12 months to assess knowledge durability.

Fifth, while our use of different question sets for each assessment point prevented memorization effects and test-retest bias, it introduced potential variability in assessment difficulty. Although we maintained proportional distribution of topics and difficulty levels across sets and confirmed statistical equivalence of difficulty indices, subtle variations in question complexity might exist. However, the consistent improvement pattern observed across all subspecialties, including those with limited clinical exposure like transplantation, suggests that performance gains reflect genuine knowledge acquisition rather than question familiarity.

Sixth, variations in clinical exposure during the rotation, particularly in subspecialties with lower allocated hours (e.g., transplantation at 5%), may have affected learning opportunities. While we attempted to standardize the educational experience through structured activities and faculty supervision, the inherent variability in patient presentations and clinical cases could have influenced learning outcomes.

Seventh, the language and cultural context of our study setting may affect the transferability of our findings. All assessments were conducted in Turkish, and the AI-generated questions were adapted to our local clinical practice patterns. The effectiveness of these questions in different linguistic and cultural contexts requires further investigation.

Finally, the rapid evolution of AI technology means that our findings, based on current versions of ChatGPT and Gemini, may not fully represent the capabilities of future AI models. The field of AI-assisted education is rapidly advancing, and newer models may address some of the limitations we identified in question generation and discrimination capability.

These limitations notwithstanding, our study provides valuable insights into the potential of AI-generated questions in specialty-specific medical education assessment. Future multi-center studies with larger sample sizes, longer follow-up periods, and cross-cultural validation would help address these limitations and further advance our understanding of AI’s role in medical education assessment.

AI system performance

The comparable technical accuracy between ChatGPT (84.3%) and Gemini (83.8%) in our study suggests current AI systems have reached a threshold level of competence in generating medical education content, consistent with recent findings in AI-assisted medical education [3, 4]. The expert validation process retained 33.3% of generated questions, with high inter-rater agreement (κ = 0.72), exceeding the recommended threshold for educational assessment tools [16]. However, the significant gap in discrimination indices between AI-generated and traditional questions indicates room for improvement in capturing nuanced clinical decision-making, a challenge noted in previous studies of AI-generated assessment tools [5, 13, 14].

Performance trajectory analysis

Our results demonstrate significant improvement in participant performance across assessment points, from baseline (45.2%) through mid-rotation (62.8%) to final assessment (78.4%). The highest scores were observed in uro-oncology (82.6%) and endourology (79.4%), aligning with the proportional distribution of clinical exposure hours in the rotation curriculum. This progression pattern aligns with established learning curves in medical education [15] and supports the effectiveness of our integrated assessment approach.

Cost-effectiveness considerations

Our analysis reveals that traditional question development typically requires 45–60 min per question, while AI-assisted development reduces this to 15–20 min per question, resulting in an estimated cost reduction of 65% per validated question. This finding is particularly relevant given the increasing focus on cost-effectiveness in medical education [2]. However, these savings must be balanced against the need for expert validation and ongoing quality assurance, as emphasized in recent literature on AI implementation in healthcare education [13].
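Taking the midpoints of the quoted ranges, the time saving alone is consistent with the reported figure under the simplifying assumption that development cost scales with faculty time; the brief check below is illustrative arithmetic only.

```python
# Back-of-the-envelope check of the reported reduction, assuming development cost
# scales with faculty time and using the midpoints of the quoted ranges.
traditional_minutes = (45 + 60) / 2    # 52.5 min per question, traditional writing
ai_assisted_minutes = (15 + 20) / 2    # 17.5 min per question, AI-assisted workflow
reduction = 1 - ai_assisted_minutes / traditional_minutes
print(f"time reduction ≈ {reduction:.0%}")   # ≈ 67%, in line with the reported ~65% cost saving
```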

Integration into medical education

We propose a hybrid implementation model that begins with AI systems generating base questions, followed by expert review and modification, and psychometric validation. This approach aligns with best practices for integrating AI technologies in medical education [3, 13]. The process is supported by continuous improvement through performance metrics feedback, regular content updates, and model retraining with validated questions, following established frameworks for educational quality assurance [9, 10].

Future research directions

Building upon our findings, future technical development should prioritize three key areas. First, enhanced discrimination capability through improved AI training methodologies is essential for generating more nuanced questions [5]. Second, the development of specialty-specific model training, particularly in surgical subspecialties like urology, will improve the relevance and accuracy of generated questions [3, 4]. Third, the implementation of automated validation tools leveraging machine learning algorithms will streamline the question validation process while maintaining quality standards [9].

The educational impact of AI-generated questions requires further investigation through longitudinal studies. These studies should focus on assessing long-term knowledge retention using validated assessment methods [15], examining the correlation between question performance and clinical competency using standardized metrics [11], and exploring the potential for adapting questions to different learning styles through personalized assessment approaches [1].

Implementation research should address several critical aspects of AI integration in medical education. Multi-center validation studies following established protocols are needed to confirm the generalizability of our findings [16]. Comprehensive cost-effectiveness analyses using standardized frameworks will help institutions make informed decisions about AI implementation [2]. Additionally, the development of faculty development programs incorporating AI literacy will be crucial for successful adoption of these technologies in medical education settings [7].

Study strengths

The study’s methodological rigor is evidenced by its prospective design, standardized protocols, and comprehensive validation, following established guidelines for educational research [11, 12]. The practical application in a real clinical setting with multiple assessment points and detailed performance metrics strengthens its findings. The statistical analysis employed appropriate power calculation [18], multiple outcome measures, and robust analytical methods, adhering to current best practices in educational research [17].

Conclusions

This study demonstrates that AI-generated questions can effectively assess urological competency in medical education when properly validated and implemented. The strong correlation between clinical exposure and performance improvement validates their effectiveness in measuring knowledge acquisition, consistent with recent findings in AI-assisted medical education [3, 4, 13]. While showing promise in reducing resource requirements and maintaining technical accuracy, current limitations in discrimination capability suggest their optimal use is in formative assessment rather than high-stakes testing. These findings support the potential integration of AI-generated questions in specialty-specific assessment, though careful implementation with expert oversight and continuous validation remains essential, as emphasized in current medical education literature [6, 7, 9].

Data availability

The datasets generated and analyzed during the current study are available in the Zenodo repository (10.5281/zenodo.14635298). This includes the complete set of AI-generated questions, validation scores, and performance metrics.

Abbreviations

- AI: Artificial Intelligence
- CI: Confidence Interval
- SD: Standard Deviation
- KR-20: Kuder-Richardson Formula 20
- ANOVA: Analysis of Variance
- n: Number of participants
- p: Probability value
- r: Correlation coefficient

References

  1. Masters K, Ellaway R, Topps D, Archibald D, Hogue RJ. Mobile technologies in medical education: AMEE guide 105. Med Teach. 2016;38(6):537–49.


  2. Walsh K. Cost and value in medical education - what we know and what we need to know. J Grad Med Educ. 2021;13(2):156–8.


  3. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.


  4. Mollick E, Mollick L. Assessing the capabilities and limitations of ChatGPT in medical education. Med Teach. 2023;45(6):594–9.


  5. van der Vegt AH, Zondervan-Zwijnenburg M, Almekinders C, et al. Potential of artificial intelligence for writing multiple-choice questions in medical education. Med Educ Online. 2022;27(1):2012972.


  6. Wartman SA, Combs CD. Medical education must move from the information age to the age of artificial intelligence. Acad Med. 2018;93(8):1107–9.


  7. Paranjape K, Schinkel M, Nannan Panday R, Car J, Nanayakkara P. Introducing artificial intelligence training in medical education. JMIR Med Educ. 2019;5(2):e16048.


  8. Loeb S, Carter SC, Berglund L, Cowan JE, Catalona WJ. Heterogeneity in active surveillance protocols worldwide. Rev Urol. 2014;16(4):202–3.


  9. Guo Y, Hao Z, Zhao S, et al. Artificial intelligence in medical education: a systematic review and meta-analysis. BMC Med Educ. 2022;22(1):445.


  10. Wartman SA, Combs CD. Reimagining medical education in the age of AI. AMA J Ethics. 2019;21(2):E146–152.


  11. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: theory and application. Am J Med. 2006;119(2):166.e7-166.e16.


  12. Downing SM. Validity: on meaningful interpretation of assessment data. Med Educ. 2003;37(9):830–7.


  13. Cheng L, Gunasekara T, Shotwell M, Jamaluddin R. Artificial intelligence in medical education: best practices for integrating ChatGPT. Med Teach. 2023;45(9):1022–8.


  14. He J, Baxter SL, Xu J, et al. The practical implementation of artificial intelligence technologies in medicine. Nat Med. 2019;25(1):30–6.


  15. Tekian A, Watling CJ, Roberts TE, Steinert Y, Norcini J. Qualitative and quantitative feedback in the context of competency-based education. Med Teach. 2017;39(12):1245–9.


  16. Diamond IR, Grant RC, Feldman BM, et al. Defining consensus: A systematic review recommends methodologic criteria for reporting of Delphi studies. J Clin Epidemiol. 2014;67(4):401–9.


  17. Clark RE, Feldon D, van Merriënboer JJG, Yates K, Early S. Cognitive task analysis. In: Spector JM, Merrill MD, van Merriënboer JJG, Driscoll MP, editors. Handbook of research on educational communications and technology. 3rd ed. Lawrence Erlbaum Associates; 2008. pp. 577–93.

  18. Faul F, Erdfelder E, Lang AG, Buchner A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods. 2007;39(2):175–91.



Acknowledgements

We gratefully acknowledge the support from Mersin University Scientific Research Projects Unit. We also thank our supervisors Prof. Dr. Ercüment Ulusoy, Prof. Dr. Selahittin Çayan, Prof. Dr. Murat Bozlu, Prof. Dr. Hasan Erdal Doruk, Prof. Dr. Erim Erdem, Prof. Dr. Mesut Tek, and Prof. Dr. Erdem Akbay for their valuable guidance and contributions.

Funding

This research was supported by Mersin University Scientific Research Projects Unit (Project No: 2018-2-AP2-2935).

Author information


Contributions

M.B. performed the study conception and design, data collection and analysis, and drafted the manuscript. E.A. contributed to the development of the study methodology, data interpretation, and critical revision of the manuscript. E.E. coordinated data collection, provided analysis support, and reviewed the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Mert Başaranoğlu.

Ethics declarations

Ethics approval and consent to participate

This study was conducted in accordance with the principles of the Declaration of Helsinki. The ethics committee of Mersin University Faculty of Medicine reviewed and approved this study (Ethics Committee Meeting Date: 19.02.2025, Decision No: 181). All participants provided written informed consent before participation.

Consent for publication

Not applicable as no individual person’s data was included.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

12909_2025_7202_MOESM1_ESM.docx

Supplementary Material 1: AI Question Generation Prompt Template. Detailed documentation of the standardized prompt template used for AI-generated question development. Includes:

- Basic prompt structure and formatting requirements
- Subspecialty-specific content guidelines
- Question type specifications (clinical scenario, knowledge application)
- Difficulty level criteria
- Quality control requirements
- Example question format with detailed instructions

This template was consistently applied across both AI platforms (ChatGPT and Gemini) to ensure standardization in question generation.

12909_2025_7202_MOESM2_ESM.docx

Supplementary Material 2: Psychometric Properties of Assessment Question Sets. Comprehensive analysis of the three question sets (A, B, C) used in the rotation assessments. Contains:

- Detailed difficulty and discrimination indices
- Question type distribution across sets
- Subspecialty content balance
- Reliability coefficients (Cronbach’s α)
- Statistical comparisons between sets

Supplementary Material 3

Supplementary Material 4

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Başaranoğlu, M., Akbay, E. & Erdem, E. AI-generated questions for urological competency assessment: a prospective educational study. BMC Med Educ 25, 611 (2025). https://doi.org/10.1186/s12909-025-07202-x
