Item analysis of multiple choice questions from assessment of health sciences students, Tigray, Ethiopia
BMC Medical Education volume 25, Article number: 441 (2025)
Abstract
Background
The rapid expansion of higher education and rising graduation rates in Ethiopia say little about the quality of core processes such as teaching and learning. Although the preparation of good items is essential, little attention has been paid to examining characteristics of exam items such as difficulty, discrimination, and distractor efficiency. Unless these characteristics are examined, results may lead to biased evaluation. To address this issue, this study aimed to carry out item analysis of multiple choice questions (MCQs) administered to undergraduate health sciences students of Mekelle University, Tigray, Ethiopia, in order to evaluate correlations between item difficulty, item discrimination, and distractor efficiency.
Methods
An institution-based cross-sectional study was conducted from March to April 2020. A total of 189 exam papers with 539 multiple choice items and 2,224 distractors were analyzed in terms of difficulty, discrimination, distractor efficiency, and internal consistency reliability. SPSS version 25 was used for analysis, and findings are presented using tables and figures. Pearson's correlation and one-way analysis of variance were used.
Results
In this study, mean difficulty ranged from 0.57 to 0.81 and mean discrimination from 0.27 to 0.52. Difficulty and discrimination were negatively correlated. Six departments achieved excellent reliability (r > 0.9). However, the reliability of the Medicine and Pharmacy department exams was poor (r < 0.5), suggesting areas for improvement. The remaining two departments exhibited acceptable reliability. More than three-fourths of the items had non-functional distractors.
Conclusions
The internal consistency of the exams from six departments demonstrated excellent reliability, whereas the exams from the Medicine and Pharmacy departments showed areas for improvement. To enhance assessment effectiveness, it is crucial to incorporate items that exhibit both desirable difficulty and high discrimination, along with functional distractors. This approach helps ensure assessment accuracy.
Introduction
Educational quality in Ethiopia appears to have been compromised by the rapid expansion of higher education in the country, and the way students are assessed with objective items is believed to contribute to this [1]. Instructors need to be professionally competent in effective assessment, which includes designing exams and analyzing results [2].
Designing exams at the end of a semester is a complex and time-consuming process, and the final output plays an important role in giving feedback to teachers on their educational actions. It clearly has implications for instruction and exam construction. It is therefore essential to assess the exam and to make sure the scores it yields are valid [3].
The validity and reliability of exams can be assessed by post-exam item analysis [4]. This is a process that examines both students’ responses and exam questions to assess the quality of each item and of the exam as a whole [5]. The difficulty index (DIF), discrimination index (DI), and functionality of distractors are the three important parameters of item analysis. Computing the DIF indicates which questions are too easy or too difficult, while the DI shows which questions discriminate between good and poor examinees. A distractor is said to be functional if at least 5% of the examinees choose it; otherwise it is called non-functional (NFD) [6]. After doing so, instructors can remove items that are too easy or too difficult, improve distractors’ efficiency, and exclude non-discriminating items from the pool of future exams [7]. Instructors in higher education routinely develop and administer teacher-made exams; however, a question arises regarding item quality when dealing with such exams [3]. Multiple-choice questions (MCQs) continue to be the preferred method of assessment in medical education [8]. A dependable MCQ requires thorough assessment and continuous refinement [9, 10]. One way to deal with this is through item analysis [3].
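To make these parameters concrete, the sketch below computes a difficulty index, an upper-lower discrimination index, and the non-functional distractors for a single item from hypothetical response data. The 27% upper-lower split is one common convention; the present study does not report which variant of the DI formula it used, so this is an illustrative assumption rather than the authors' exact procedure.

```python
import numpy as np

def difficulty_index(scores):
    """Proportion of examinees answering the item correctly (0/1 scores)."""
    return scores.mean()

def discrimination_index(scores, totals, frac=0.27):
    """Upper-lower discrimination index (one common formulation, assumed here)."""
    n = int(round(len(totals) * frac))
    order = np.argsort(totals)          # rank examinees by total exam score
    lower, upper = order[:n], order[-n:]
    return scores[upper].mean() - scores[lower].mean()

def nonfunctional_distractors(choices, key, options=("A", "B", "C", "D")):
    """Distractors (wrong options) chosen by fewer than 5% of examinees."""
    n = len(choices)
    return [o for o in options if o != key and (choices == o).sum() / n < 0.05]

# Hypothetical data: 20 examinees answering one item keyed 'B'
rng = np.random.default_rng(0)
choices = rng.choice(list("ABCD"), size=20, p=[0.15, 0.6, 0.2, 0.05])
scores = (choices == "B").astype(int)
totals = scores + rng.integers(0, 30, size=20)   # stand-in total exam scores
print(difficulty_index(scores),
      discrimination_index(scores, totals),
      nonfunctional_distractors(choices, "B"))
```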
A main focus of assessment in medical education is minimizing the errors that influence an exam, so that the observed score approaches a learner’s ‘true’ score as reliably and validly as possible. To do so, assessors need to be aware of the potential errors that can influence all components of the assessment cycle, from question creation to the interpretation of exam scores [11]. Using item performance data, instructors can greatly improve the effectiveness of their exam items and the validity of scores by selecting and rewriting their items [12].
Studies showed that only 56% of items were in the acceptable range of the difficulty index and 32% of items were poor in discrimination at the Kalinga Institute of Medical Science, India [13]; a study from Ghana showed that 40% of items fell outside the acceptable discrimination index [14]. In a summative exam for a freshman common course at Gondar, Ethiopia, 41.9% of items lay outside the moderate difficulty range [1].
Despite the importance of good question items, little attention has been paid to examining characteristics of exam items such as DIF, DI, and distractor functionality, and only limited studies on the topic are available in Ethiopia. Hence, the present study analyzed items in terms of difficulty, discrimination, distractor efficiency, and internal consistency reliability to improve the quality and reliability of assessment in medical education.
Methods and materials
Study design and period
An institution-based cross-sectional study was conducted from March to April 2020.
Study setting and participants
Mekelle University’s College of Health Sciences is one of its six campuses. The college hosts a comprehensive specialized hospital serving Tigray and nearby regions. It also encompasses 13 departments and enrolls a total of 2,575 students.
Item analysis using the summative test of the course Human Anatomy was conducted among regular undergraduate health sciences students in Mekelle city, Tigray region, northern Ethiopia.
The summative exam of the course Human Anatomy administered during the 2018/19 academic year was used as the data source. The parameters considered in this study were the difficulty index, discrimination index, internal consistency reliability, and distractor analysis.
The main reason this course was selected is that it is a prerequisite for most other courses within the stream; the impact would therefore be paramount if a biased assessment tool were used in this subject.
Sample size determination
To determine an appropriate sample size for item analysis of the summative Human Anatomy exam administered during the 2018/19 academic year at Mekelle University's College of Health Sciences, a single population proportion calculation was employed.
The minimum sample size was calculated using the single population proportion formula

$n_i = \dfrac{Z^2\, p\,(1 - p)}{d^2}$

where:

- $n_i$ = the minimum (uncorrected) sample size required for the study,
- $Z$ = the standard normal value at the 95% confidence level ($Z$ = 1.96),
- $p$ = 50% (since the proportion is unknown for health science exams in Ethiopia), and
- $d$ = the tolerable margin of error ($d$ = 5% = 0.05).

Because the source population was finite, the corrected sample size was obtained as

$n_f = \dfrac{n_i}{1 + n_i/N}$

where:

- $n_f$ = the corrected sample size,
- $n_i$ = the uncorrected sample size, and
- $N$ = the source population (366); therefore $n_f$ = 384/(1 + 384/366) ≈ 187.

Adding 10% gave 187 + 19 = 206. Exam papers for the nursing department were not available, which made the final sample size 206 − 17 = 189.
The initial calculation yielded a sample size of 384 exam papers. However, since the source population of exams was finite (well below 10,000), the finite population correction was applied, and with the 10% contingency added this gave 206 exam papers. Unfortunately, exam papers from the nursing department (n = 17) were unavailable, as they had been destroyed by the department. Therefore, the final analyzed sample comprised 189 exam papers.
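As a check on the figures above, the short script below reproduces the sample size arithmetic (single population proportion formula, finite population correction with N = 366, a 10% contingency, and exclusion of the 17 nursing papers). Rounding follows the values reported in the paper.

```python
Z, p, d, N = 1.96, 0.5, 0.05, 366                      # values reported above

n_initial = (Z**2 * p * (1 - p)) / d**2                # single population proportion: ~384.16
n_corrected = round(n_initial / (1 + n_initial / N))   # finite population correction: 187
n_proposed = round(n_corrected * 1.10)                 # + 10% contingency: 206
n_analyzed = n_proposed - 17                           # nursing papers unavailable: 189

print(round(n_initial), n_corrected, n_proposed, n_analyzed)
```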
Sampling procedure
A stratified random sampling method with proportionate allocation was used to select the 189 exam papers. The college has 11 departments whose students took the selected course; strata were created for these departments, and the calculated sample size was allocated proportionally across them. Within each stratum, the allocated sample was then selected using a simple random sampling technique.
Data collection procedure and instrument
The Human Anatomy final exam administered during the 2018/19 academic year was used as the research instrument. The sampled summative exam papers for the course were collected from the corresponding departments by two MSc midwives.
Data processing and analysis
The data obtained were entered into SPSS version 25 for analysis, and indices such as the difficulty index, discrimination index, reliability coefficient, and distractor analysis were calculated. Variables are presented as means (standard deviations). The relationships between DIF, DI, and DE were measured using Pearson’s correlation test. One-way analysis of variance was used to examine differences in DI across categories of DIF. A P value < 0.05 was considered statistically significant. In addition, descriptive analysis was applied, and findings are presented using frequencies, tables, and figures.
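For illustration, the minimal sketch below applies the two inferential tests named above to hypothetical per-item indices (not the study's data), using SciPy rather than SPSS. The grouping follows the DIF categories listed next.

```python
import numpy as np
from scipy import stats

# Hypothetical per-item indices for one exam (not the study's data)
dif = np.array([0.22, 0.28, 0.42, 0.55, 0.61, 0.35, 0.68, 0.50, 0.73, 0.80])
di  = np.array([0.08, 0.12, 0.45, 0.40, 0.33, 0.48, 0.28, 0.38, 0.21, 0.15])

# Pearson correlation between difficulty and discrimination
r, p_r = stats.pearsonr(dif, di)

# One-way ANOVA: does DI differ across DIF categories?
difficult  = di[dif < 0.3]
acceptable = di[(dif >= 0.3) & (dif <= 0.7)]
easy       = di[dif > 0.7]
f, p_f = stats.f_oneway(difficult, acceptable, easy)

print(f"r = {r:.2f} (p = {p_r:.3f}), F = {f:.2f} (p = {p_f:.3f})")
```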
➢ Categories of difficulty index
- Difficult item: DIF less than 0.3
- Acceptable item: DIF ranging from 0.3 to 0.7
- Easy item: DIF greater than 0.7
➢ Categories of discrimination index
- Excellent item: DI of 0.35 and above
- Acceptable item: DI ranging from 0.2 to 0.34
- Poor item: DI less than 0.2
➢ Non-functional distractor
- An incorrect option in an MCQ selected by less than 5% of the examinees
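The cut-offs listed above translate directly into a simple classification routine. The sketch below is an illustrative implementation of these definitions, not code used in the study.

```python
def classify_dif(dif: float) -> str:
    """Difficulty categories listed above."""
    if dif < 0.3:
        return "difficult"
    if dif <= 0.7:
        return "acceptable"
    return "easy"

def classify_di(di: float) -> str:
    """Discrimination categories listed above."""
    if di >= 0.35:
        return "excellent"
    if di >= 0.2:
        return "acceptable"
    return "poor"

def is_nonfunctional(times_chosen: int, n_examinees: int) -> bool:
    """Non-functional distractor: chosen by fewer than 5% of examinees."""
    return times_chosen / n_examinees < 0.05

# Example: an item answered correctly by 65% of examinees, DI = 0.18,
# with one distractor picked by a single examinee out of 30
print(classify_dif(0.65), classify_di(0.18), is_nonfunctional(1, 30))
# -> acceptable poor True
```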
Results
Response rate
Of the proposed 206 exam papers, only 189 were available, because the Nursing department’s exams (n = 17) had been burned. This gives a response rate of 92%.
Descriptive statistics on the exams
The total number of exam papers was 189, comprising 539 items across the ten departments. Descriptive statistics for the exams are provided in Table 1.
Internal consistency
Internal consistency, as measured by KR-20, revealed that 60% of the exams had excellent reliability (KR-20 > 0.9). Only two departments had poor reliability. Across departments, KR-20 ranged from 0.96 in Physiotherapy to 0.39 in Medicine (Table 2).
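For reference, a minimal KR-20 implementation on a hypothetical 0/1 score matrix is shown below. The study computed the coefficient in SPSS; this sketch uses the standard KR-20 formula with the sample-variance form, which is an assumption about the exact variant used.

```python
import numpy as np

def kr20(scored: np.ndarray) -> float:
    """Kuder-Richardson 20 for an (examinees x items) matrix of 0/1 scores."""
    k = scored.shape[1]
    p = scored.mean(axis=0)                      # per-item proportion correct
    q = 1 - p
    total_var = scored.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Hypothetical 0/1 score matrix: 8 examinees x 5 items
scored = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
])
print(round(kr20(scored), 2))
```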
Difficulty index
All the exams contained items with acceptable difficulty, ranging from 66.7% of items in the Pharmacy examination to 10% in Medical Laboratory. None of the items in the Pharmacy exam were easy, while none of the items in the Health Informatics exam were difficult. The distribution of items according to their difficulty level is presented in Fig. 1.
Discrimination index
The highest percentage of items with excellent discrimination (82%) was found in the Medical Laboratory examination, and the highest percentage of poorly discriminating items in the Environmental Health examination (60%). The lowest percentage of items with acceptable discrimination was observed in the Health Informatics examination (6.7%), while the highest was 23.1% in Medicine. The distribution of items according to their discrimination capacity is presented in Fig. 2.
Distractor efficiency
Distractor analysis of the individual exams comprised the frequency distribution of NFDs, defined as distractors chosen by fewer than 5% of the examinees. Of the exam items, 425 (78.8%) contained NFDs (Table 3). There were a total of 2,224 distractors, and about one-third of them, 751 (33.8%), were non-functional.
Analysis of the number of NFDs per item revealed that items with one NFD had the highest frequency, 186 (34.5%), followed by items with two, 157 (29.1%), three, 77 (14.3%), and four NFDs, 5 (0.93%). The proportion of items with fully functional distractors also varied across exams, ranging from as high as 40% (Anesthesia) to as low as 5.6% (Dentistry).
When mean distractor efficiency was calculated using all items, the highest value was observed in the Anesthesia exam, 73.6% (SD 25.2), and the lowest in the Health Informatics exam, 30.1% (SD 25.8). The overall DE for all ten exams was 58%.
Using only the standard items, mean distractor efficiency showed considerable improvement in four department exams (Health Informatics, Medical Laboratory, Midwifery, and Pharmacy). The mean distractor efficiency calculated from all items and from standard items is presented in Fig. 3.
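One common way to express distractor efficiency is the percentage of an item's distractors that are functional (chosen by at least 5% of examinees). The study does not spell out its exact formula, so the sketch below should be read as an illustration under that assumption.

```python
def distractor_efficiency(option_counts: dict, key: str, n_examinees: int) -> float:
    """Percent of an item's distractors chosen by at least 5% of examinees
    (a common definition of DE; assumed here, not taken from the paper)."""
    distractors = {opt: c for opt, c in option_counts.items() if opt != key}
    functional = sum(1 for c in distractors.values() if c / n_examinees >= 0.05)
    return 100 * functional / len(distractors)

# Hypothetical item keyed 'C', answered by 25 examinees; option D is barely chosen
counts = {"A": 6, "B": 4, "C": 14, "D": 1}
print(distractor_efficiency(counts, key="C", n_examinees=25))  # ~66.7 (2 of 3 functional)
```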
Correlation between difficulty index and discrimination index
Pearson’s r was used to determine whether DIF and DI were correlated in each of the ten department exams. There was a significant negative correlation between DIF and DI for two department exams, Environmental Health and Health Informatics (r = −0.3 and −0.48; P = 0.04 and 0.008, respectively). This shows that as DIF increases (an item becoming easier), DI decreases, indicating that the item becomes less discriminating. Scatter plots of the relationship between the difficulty index and discrimination index for the two department exams are presented in Fig. 4.
Correlation of DIF and DI with distractor efficiency
The DE was significantly correlated with the DIF and DI for three department exams (Midwifery, Environmental Health, and Health Informatics), with higher DEs for difficult and acceptable items than for easy items. The DE was also higher for items with excellent and acceptable DIs compared to items with poor discrimination. Tables 4, 5, 6 and 7 present the correlation of the difficulty index and discrimination index with distractor efficiency for the three departments.
Mean comparison of difficulty index and discrimination index
Comparison of the mean discrimination index across difficulty categories for each of the ten exams using one-way ANOVA gave no evidence of statistically significant mean differences in discrimination for nine of the exams. Only the Dentistry exam showed evidence that discrimination for easy, acceptable, and difficult items was not the same.
The ANOVA result in Table 5 showed a significant difference in the discrimination index among the three categories of item difficulty (easy, acceptable, and difficult), F(2, 157) = 7.01, P = 0.001. To further investigate these differences, a post hoc analysis using Tukey’s HSD test was conducted for the Dentistry exam (Table 6).
Interpretation and implication
The ANOVA result (P = 0.001) confirmed that the discrimination indices for items of different difficulty levels (easy, acceptable, and difficult) in the Dentistry exam were significantly different, suggesting that the ability of these items to differentiate between high- and low-performing students varies with the level of difficulty. Post hoc analysis (Tukey’s HSD) showed that difficult items had a significantly lower discrimination index than acceptable and easy items; that is, difficult items were less effective in distinguishing between students with different levels of performance. Acceptable and easy items did not show a significant difference in discriminating power, indicating that they were similarly effective in distinguishing between students of different performance levels. These results suggest that while acceptable and easy items are useful for assessing student performance due to their higher discrimination indices, difficult items need to be reviewed and potentially revised to improve their discriminatory power. Ensuring that exam items can effectively differentiate between varying levels of student performance is crucial for quality assessment.
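For readers wishing to reproduce this kind of post hoc comparison, the sketch below runs a one-way ANOVA followed by Tukey's HSD on hypothetical discrimination indices grouped by difficulty category. SciPy's tukey_hsd is used here in place of SPSS, and the numbers are illustrative only.

```python
import numpy as np
from scipy import stats

# Hypothetical DI values grouped by difficulty category (not the study's data)
easy       = np.array([0.40, 0.35, 0.38, 0.42, 0.33])
acceptable = np.array([0.37, 0.41, 0.30, 0.36, 0.39])
difficult  = np.array([0.10, 0.15, 0.08, 0.12, 0.11])

f, p = stats.f_oneway(easy, acceptable, difficult)
print(f"ANOVA: F = {f:.2f}, p = {p:.4f}")

# Tukey's HSD pairwise comparisons (available in SciPy >= 1.8)
print(stats.tukey_hsd(easy, acceptable, difficult))
```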
Discussion
This study found that most exams (60%) were highly reliable, with only two departments having low reliability. This is better than a previous study at Bahria University, which examined the reliability of anatomy exams composed of multiple choice questions [15]. The reliability in that study was lower (between 0.56 and 0.77), which might be because their exams had fewer questions (140). In general, exams with more questions are considered more reliable [16]. This trend is also seen in our study, where the two departments with lower reliability had only 12 or 13 questions in their exams.
This study also found higher exam reliability than previous studies at Arabian Gulf University (average KR-20 reliability of 0.76) [17] and in Ghana (KR-20 coefficient of 0.77) [14]. This difference might be due to the difficulty level of the exam questions, since exams with too many easy or difficult questions can be unreliable. The Arabian Gulf University study found nearly half of its questions were easy (25.9%) or difficult (20.8%) [17]. Similarly, the Ghana study had a large portion of questions in the easy (14%) and difficult (36%) categories [14].
Additionally, the reliability results of this study were higher than those of a cross-sectional study done at Gondar (r = 0.58) [1]. This difference might be due to that study’s small number of items and its many easy and difficult items (number of items = 31; 41.9% of items lay outside the moderate difficulty range).
In the present study, the mean difficulty index for 60% of the exams was in the acceptable range, and for the remaining 40% it was in the easy range. This is slightly better than the study conducted at the Kalinga Institute of Medical Science, India, which revealed 56% of items with acceptable difficulty and 32% in the easy range [13]. The overall mean DIF in this study (68.7%) is considered acceptable. It is higher than in the studies conducted in West Bengal, India (mean DIF = 61.92%) [18] and Ghana (58.6%) [14], though in the same acceptable category.
Similarly, the study done at Gondar [1] revealed that the exam as a whole had a moderate difficulty level (56%). In addition, while the mean difficulty index for 60% of the exams in this study was in the acceptable range (0.57–0.81), the study conducted at St. Mary’s University College, Addis Ababa [3] indicated that a higher proportion of items had acceptable difficulty (83%). This higher proportion could be due to the use of a wider cut-off range for acceptable items (0.2 < DIF < 0.8) than in the current study.
The mean DI of each department’s exam in this study ranged from 0.27 to 0.52. This is higher than in a study done in Hong Kong, which showed mean discrimination ranging from 0.20 to 0.28 [8], and also higher than in the studies conducted in Ghana (DI = 0.22) [14] and Gondar (DI = 0.16) [1]. As the current study contains more items with acceptable difficulty (6 out of 10 exams had 40% or more items with acceptable difficulty), greater discrimination is expected.
The results of this study showed that the frequency of items with a negative discrimination index ranged from 1 to 18, meaning that students with low total scores answered such items correctly more often than those with higher scores [19]. This finding is in line with the study done in West Bengal, India [18], which contained two items with negative DI and noted that other studies have reported negative DIs in 20% and 4% of items [20, 21]. Similarly, 10%, 9.6%, and 5.98% of items had negative DIs in the studies done in Ghana [14], Gondar, and St. Mary’s University College, Addis Ababa [1, 3], respectively. Reasons for a negative DI can include a wrong key and ambiguity in the question.
Regarding mean distractor efficiency, the present study showed a range of 30.1–73.9%, and items with zero NFDs accounted for 21.2%. These values are smaller than those of the study done at Arabian Gulf University, which revealed mean distractor efficiencies of 66.50–90.00%, with 48.4% of items having zero NFDs [17]. This discrepancy could be because the previous study underwent post-validation item analysis while the present study used all items. The overall mean DE (58%) is relatively similar to that of the study from Ghana (55.6%) while being lower than that of the Gondar study (85.7%) [1, 14]. This could be due to the small number of items (n = 31) in the study done at Gondar.
The current study showed that DE was higher for items with excellent and acceptable DIs compared to items with poor discrimination. This finding is similar to that of the study conducted at Arabian Gulf University [17], where the DE was 83.24% and 83.33% for items with excellent and acceptable DIs, respectively, compared to 77.56% for items with poor discrimination.
Comparison of the mean discrimination index across difficulty categories for each of the ten exams using one-way ANOVA gave evidence of statistically significant mean differences in discrimination for only one department. This finding is affirmed by a study done at Bahria University using three modules, which stated that discrimination in Module-II was higher than in the rest; post hoc analysis also revealed that discrimination in Module-II was significantly different [15].
Conclusion
Internal consistency reliability for six of the exams was excellent, while two departments had questionable reliability. A considerable number of items had non-functional distractors, which in turn contributes to the presence of poor-quality items. Post-exam item analysis is a valuable procedure that gives information regarding the quality of an item; it helps to identify items that require immediate revision and contributes to future question bank improvement.
Limitations
The exam papers used in this study were all written in English, but all examinees who took the exams and instructors who prepared them spoke English as a second language (ESL). However, ESL issues were not considered in the current study. Nursing exams were also not available. These factors limit the generalizability of the findings. As strengths, this study is one of the few of its kind conducted within the context of medical education in this setting, and it included a post hoc analysis to identify deficiencies and weaknesses among items.
Future Directions
To build upon the findings and address the limitations of the current study, the following future directions are recommended:
A. Incorporate ESL considerations:
Language proficiency assessment: future studies should include an assessment of the examinees’ and instructors’ proficiency in English, which can help determine whether language barriers affect exam performance and item analysis metrics.
Translation and validation: consider translating exams into the native languages of examinees and validating the translated exams to ensure they accurately measure the intended competencies without the added challenge of language proficiency.
B. Expand departmental coverage:
Include Nursing and other health sciences exams: future research should aim to include exams from the remaining departments to provide a comprehensive analysis across all departments within the medical school.
Broader departmental involvement: encourage participation from all departments to ensure that findings and recommendations are applicable across the entire institution.
C. Longitudinal Studies
Track Changes Over Time: Conduct longitudinal studies to track changes and improvements in item quality and student performance over multiple cohorts. This will help to identify trends and the long-term impact of implemented changes.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Anonymous. Post exam item analysis: implication for intervention. Preprint, first posted online Jan 4, 2019. Retrieved Jan 2020 from https://doi.org/10.1101/510081.
Shavelson RJ. A Brief History of Student Learning Assessment: How We Got Where We Are and a Proposal for Where to Go Next. The Academy in transition. Association of American Colleges and Universities. 2007.
Ashagre E. Improving Test Construction Skills through Item Analysis. Addis Ababa, Ethiopia: UN Conference Center; August 29, 2009. 2009.
Tavakol M, Dennick R. Post-examination analysis of objective tests. Med Teach. 2011;33(6):447–58 Epub 2011/05/26.
Sireci SG, Parker P. Validity on trial: psychometric and legal conceptualizations of validity. Educ Meas Issues Pract. 2006;25(3):27–34.
Musa A, Shaheen S, Ahmed A. Distractor analysis of multiple choice questions: a descriptive study of physiology examinations at the Faculty of Medicine, University of Khartoum. Khartoum Medical Journal. 2021;11(1):1444–53.
Siri A, Freddano M. The use of item analysis for the improvement of objective examinations. Procedia Soc Behav Sci. 2011;29:188–97.
Royal KD, Hodgpeth M-W. The prevalence of item construction flaws in medical school examinations and innovative recommendations for improvement. EMJ Innov. 2017;1(1):61–6.
Yirdaw A. Quality of education in private higher institutions in Ethiopia: the role of governance. Sage Open. 2016;6(1). https://doi.org/10.1177/2158244015624950.
Pham NTT, Nguyen CH, Pham HT, Ta HTT. Internal quality assurance of academic programs: a case study in Vietnamese higher education. Sage Open. 2022;12(4). https://doi.org/10.1177/21582440221144419.
Ministry of Education E. General education quality improvement package (GEQIP). 2008.
Anonymous. White paper technical report: improve your written tests using item analysis. 2006.
Namdeo S, Sahoo B. Item analysis of multiple choice questions from an assessment of medical students in Bhubaneswar, India. International Journal of Research in Medical Sciences. 2016;4:1716–9.
Quaigrain K, Arhin AK, King Fai Hui S. Using reliability and item analysis to evaluate a teacher-developed test in educational measurement and evaluation. Cogent Educ. 2017;4(1):1301013.
Islam ZU, Usmani A. Psychometric analysis of anatomy MCQs in modular examination. Pakistan journal of medical sciences. 2017;33(5):1138–43 Epub 2017/11/17.
Federal Democratic Republic of Ethiopia. Higher Education Proclamation: Proclamation No. 650/2009. FNG. 2009;4976–5044.
Kheyami D, Jaradat A, Al-Shibani T, Ali FA. Item analysis of multiple choice questions at the Department of Paediatrics, Arabian Gulf University, Manama, Bahrain. Sultan Qaboos University Medical Journal. 2018;18(1):e68–74. Epub 2018/04/19.
Mukherjee P, Lahiri SK. Analysis of Multiple Choice Questions (MCQs): item and test statistics from an assessment in a medical college of Kolkata, West Bengal. IOSR Journal of Dental and Medical Sciences. 2015;14(12):47–52.
Tarrant M, Ware J. Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Med Educ. 2008;42:198–206.
Gajjar S, Sharma R, Kumar P, Rana M. Item and test analysis to identify quality Multiple Choice Questions (MCQs) from an assessment of medical students of Ahmedabad, Gujarat. Indian Journal of Community Medicine. 2014;39(1):17–20. Epub 2014/04/04.
McCowan RJ, McCowan SC. Item analysis for criterion-referenced tests. 1999.
Acknowledgements
We would like to acknowledge Mekelle University, College of Health Sciences, for allowing us to conduct this study. Our gratitude also extends to the Anatomy department instructors for their willingness to provide us with their exam papers for the study.
Funding
Mekelle University gave financial support for data collection.
Author information
Authors and Affiliations
Contributions
Study conceived and designed by: MWG, BB, MM. Tool design and preparation: MWG, BB, BA. Data processing and analysis: MWG, BA. Wrote the paper: MWG, BB, MM, BA. All authors read and approved the final paper.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Ethical clearance was obtained from the Ethical Review Board of Mekelle University (ERC 1618/2020), and the study was conducted according to the guidelines of the Declaration of Helsinki. A formal letter of permission and support was then obtained to collect examinations from the Anatomy department. All instructors involved in providing the exams for the study were informed about the purpose of the study and their right to refuse. Informed consent was obtained from the instructors to use their exams. They were also assured that the information obtained would be treated with complete confidentiality and would not cause them any harm.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Gebremichael, M.W., Baraki, B., Mehari, M.A. et al. Item analysis of multiple choice questions from assessment of health sciences students, Tigray, Ethiopia. BMC Med Educ 25, 441 (2025). https://doi.org/10.1186/s12909-025-06904-6
DOI: https://doi.org/10.1186/s12909-025-06904-6