Evaluating the fairness of a high-stakes college entrance exam in Kuwait

Abstract

The use of college entrance exams for facilitating admission decisions has become controversial, and the central argument concerns the fairness of test scores. The Kuwait University English Aptitude Test (KUEAT) is a high-stakes test, yet very few studies have examined the psychometric quality of the scores for this national-level assessment. This study illustrates how measurement approaches can be used to examine fairness issues in educational testing. Through a modern view of fairness, we assess the internal and external bias of KUEAT scores using differential item functioning analysis and differential prediction analysis, respectively, and provide a comprehensive fairness argument for KUEAT scores. The analysis of internal evidence of bias was based on the KUEAT scores of 1790 examinees tested in November 2018. The KUEAT scores and first-year college GPAs of 4033 students enrolled at KU were used to assess the external evidence of bias. Results revealed many items showing differential item functioning across student subpopulation groups (i.e., nationality, gender, high school major, and high school type). Meanwhile, KUEAT scores also predicted college performance differentially across student subgroups (i.e., nationality, high school major, and high school type). Discussion and implications for the fairness of college entrance tests in Kuwait are provided.

Introduction

The use of college entrance exams, such as the Scholastic Assessment Test in the United States, for making admission decisions has been controversial for years. The central argument concerns the fairness of test scores in determining university admission. Rhoades and Madaus (2003) indicated that college entrance exams contain items biased against minority groups. Zwick (2007) also revealed that students from historically under-resourced, marginalized, and underrepresented populations often cannot afford test preparation coaching, which results in lower scores on standardized tests. However, compared with other measures, these exams provide "a neutral yardstick" for comparing the performance of students from different high schools that vary greatly in contextual factors such as course offerings (Buckley et al., 2018), and they objectively measure students' academic achievement (Churchill et al., 2015).

In Kuwait, the Ministry of Education and the Ministry of Higher Education annually set a minimum required high school GPA for Kuwaiti students who want to apply for a fully funded governmental scholarship that covers tuition expenses and provides a monthly stipend. Funded students may study abroad or within the country for their college education. Many students prefer to study in the country, making admission to the only public university—Kuwait University (KU)—competitive, especially in the medical and engineering colleges. The admission decision at KU is based on two criteria: the high school grade point average (GPA) and aptitude test scores. When KU was founded, admission decisions were based only on high school GPA. In 1997, KU collaborated with the Ministry of Education in Kuwait and added the requirement of taking and submitting aptitude test scores. The Kuwait University English Aptitude Test (KUEAT) is one of the college entrance tests at KU. It was developed by the faculty of participating colleges at KU in coordination with the Ministry of Education in Kuwait. The high school GPA and the test scores are combined to compute a weighted average score.

The aptitude test scores are very important for making admission decisions, especially given the inflation of high school GPAs. The percentage of 12th grade students in Kuwait who scored 90% or above on their high school GPA increased from 25.43% in 2016 to 47.16% in 2022 (Hasan, 2022), and remained at 41.09% in 2023 and 41.80% in 2024 (Hasan, 2024). Gershenson (2018) reported that inflated high school GPAs have low discriminating power for distinguishing students at different proficiency levels; among students with high GPAs, only a small proportion received high scores on statewide end-of-course exams. Additionally, the public prosecution detected 40,000 students cheating on the Fall 2022 final exams because the students accessed the test material before the test time with the help of Ministry of Education employees (Habib and Al Hamadi, 2023).

The inflation of high school GPAs and the cheating crisis in Kuwait undermined the validity and credibility of high school GPAs, which in turn highlighted the need for a standardized test for college admission. Standardized tests can be administered to a large student body within a short time frame and under relatively similar testing conditions. A well-developed test can provide valid, reliable, fair, and comparable scores that support college admission decisions. This requires a comprehensive psychometric analysis of test scores, including the detection of misfitting and biased items.

There is a lack of research examining and reporting the psychometric quality of KUEAT scores, especially with modern measurement approaches (e.g., Rasch measurement theory). Additionally, KU has never published technical reports on the KUEAT regarding its development process, administration process, scoring procedure, and psychometric properties. In the existing literature, two studies (Eid, 2009; Shamsaldeen, 2019) examined the predictive validity of KUEAT scores with college performance as the criterion. Eid (2009) investigated the predictive validity of KUEAT scores for students who graduated from high school in 1999, 2000, and 2001, and reported a non-significant relationship between KUEAT scores and college performance. Shamsaldeen (2019) found that KUEAT scores were a significant predictor of second-year college GPAs for science, technology, engineering, and mathematics students. However, KUEAT scores explained only a small amount of the variation in college GPAs. Both the secondary and higher education systems in Kuwait have been evolving over the years, which may explain the contradictory results. This also highlights the necessity of developing, evaluating, and maintaining KUEAT scores regularly based on psychometric theory.

Purpose of the study

Researchers worldwide have attended to the bias of high-stakes test scores and investigated its potential sources (e.g., Huang et al., 2016; Sabatini et al., 2015). However, the KUEAT lacks evidence to support the fair use of its scores in making admission decisions.

This study aims to fill these gaps by adopting a comprehensive view of fairness and examining the internal and external bias of KUEAT scores as part of assessing their psychometric properties. Reliability and validity evidence is also collected, as these are two foundational areas related to fairness. Specifically, differential item functioning (DIF) analysis based on Rasch measurement theory is used to assess internal bias across student subpopulation groups. A moderation analysis based on a regression model is used to detect differential prediction and examine the external bias of test scores (AERA, APA and NCME, 2014). The demographic groups considered in this study are defined by nationality, gender, high school type, and high school major. In particular, we address the following research questions:

  1. How well does the KUEAT scale reflect student English proficiency (i.e., do KUEAT items fit the underlying scale—construct validity)?

  2. How well can students at different English proficiency levels be separated by the Rasch scale (i.e., reliability of separation for students)?

  3. Are KUEAT items equally difficult across different student subgroups (i.e., does any item show DIF—internal evidence of bias)?

  4. How well do KUEAT scores predict first-year college GPA for different student subgroups (i.e., do KUEAT scores show differential prediction—external evidence of bias)?

Fairness as a foundational area of psychometrics

In accordance with the Standards for Educational and Psychological Testing (abbreviated as Test Standards hereafter; AERA, APA and NCME, 2014), fairness is defined as ensuring that all test takers have similar opportunities to demonstrate their abilities on the construct a test intends to measure. Tests yielding fair scores should exhibit an absence of bias, impartial treatment of all examinees throughout the testing procedure, and equal access to learning the material. Fairness spans every assessment phase, from the conceptualization of an assessment to the utilization of assessment outcomes, across all subpopulation groups. While some researchers focus on the consequences of test utilization, others prioritize the validity of inferences drawn from test outcomes (Camilli et al., 2013). Fairness is a crucial foundational component for developing, evaluating, and using educational assessments.

It is crucial to understand the difference between fairness and bias. Fairness is a societal concept, while bias is a psychometric measure. The Test Standards emphasize a comprehensive understanding of fairness that scrutinizes numerous facets related to the purpose of testing (AERA, APA and NCME, 2014). This entails considering the technical properties of tests, how test scores are reported, the consequences of score uses, and the construct-relevant and construct-irrelevant variables attributed to the performance of individuals and subpopulation groups (e.g., Wang et al., 2020). Bias refers to an anomaly in measurement, and fairness can be viewed as the outcome of who is advantaged or disadvantaged by this anomaly (Camilli, 2006). Bias in an item or a test against a particular subgroup undermines score fairness. We can quantify the degree of bias using statistical approaches, such as DIF analysis and differential prediction analysis.

A test demonstrates a bias towards a subpopulation group if consistent non-zero prediction errors occur for members of that subgroup in predicting a criterion for which the test was designed (Cleary, 1968). This could be attributed to the test items requiring sources of knowledge that differ from those intended to be measured, thereby introducing systematic bias and reducing the validity of test scores for a specific group (Camilli & Shepard, 1994). Statistical results may reveal potential bias of test scores against a subpopulation group. Meanwhile, substantive explanations are needed to justify the reasons behind the numbers.

Two sources of bias should be investigated using statistical analysis: internal and external evidence of bias. Internal evidence of bias can be quantified as the difference in the probabilities of correctly answering an item between two groups of equivalent ability, a difference attributable to construct-irrelevant factors being measured (Camilli, 2006). DIF analysis is often used to examine item-level bias and thus collect internal evidence of bias. It examines whether an item functions the same way for examinees at the same ability level regardless of their group membership.

External evidence of bias is assessed by observing differential prediction of criterion variables from test scores across subgroups (Camilli, 2006). The Test Standards (AERA, APA and NCME, 2014) state that "... test developers and/or users are responsible for evaluating the possibility of differential prediction for relevant subgroups for which there is prior evidence or theory suggesting differential prediction" (p. 15), and suggest that "…differential prediction is often examined using regression analysis" (p. 66). Specifically, a regression model with demographic variables as moderators is used for differential prediction analysis. We attend to the relationship between test scores and college performance and how this relationship differs between subpopulation groups. In summary, the internal evidence relates to bias at the item level, and the external evidence concerns bias at the overall test level.

Validity and reliability are two other foundational areas of psychometrics for educational and psychological testing. In some sense, test scores must be valid and reliable before we attend to fairness issues. As indicated by Fig. 1, validity, reliability, and fairness form a stable system for assessing the psychometric quality of a testing instrument. Each foundational component contains internal and external evidence. For instance, internal consistency indicates how reliable the scores are, and generalizability reflects how well the scores maintain their reliability across different conditions. The Test Standards (AERA, APA and NCME, 2014) encourage building an integrated validity argument consisting of five forms of validity evidence. Internal evidence based on test content, internal structure, and response process addresses how well the scores reflect the construct being measured. Relations to other variables provide external evidence of how the current measures compare with criterion measures, e.g., tests that measure the same construct or future performance. The consequences of testing refer to the positive or negative social consequences of a particular test, which contribute to the external evidence of validity. Fairness can be viewed as maintaining valid and reliable scores across subpopulation groups. Internal bias occurs when the internal structure varies as a function of group membership, and external bias appears when relations to other variables differ across subpopulation groups.

Fig. 1 Internal and external evidence of psychometric quality

Methods

Data description

A secondary dataset was used for the analyses. The data were obtained from the Center of Evaluation and Measurement at KU and did not contain identifiable information about individual test-takers. Three parallel forms with the same items in different sequences were administered in November 2018. For test security reasons, the complete test booklets and item content were not provided by KU. Therefore, the examination of internal bias was based on one particular test form, using the item-level responses of the test-takers who took that form of the KUEAT (Sample 1). The external evidence of bias was collected from students who were accepted by KU and began college in 2019/2020 (Sample 2).

Sample 1 consists of 1790 examinees: 1471 (82%) females and 319 (18%) males; 1640 (92%) Kuwaiti and 150 (8%) non-Kuwaiti (international) students; 1683 (94%) from public high schools and 107 (6%) from private high schools; and 1069 (60%) who majored in science and 721 (40%) who majored in humanities in high school (Table 1). These demographic variables were chosen because they define subpopulation groups with different identities, cultural backgrounds, economic statuses, and cognitive levels. The unbalanced ratios of the student subgroups reflect the actual distribution at KU. Dichotomous student responses to individual KUEAT items are used for the analyses (0—incorrect; 1—correct).

Table 1 Demographic characteristics of data samples

Sample 2 consists of 4033 students enrolled at KU: 3660 (90%) Kuwaiti and 373 (10%) non-Kuwaiti (international) students; 3395 (84%) females and 638 (16%) males; 3805 (94%) who graduated from public high schools and 228 (6%) who graduated from private high schools; and 2353 (58%) who majored in science and 1680 (42%) who majored in humanities in high school. The KUEAT total scores and first-year college GPAs of these students were used for the analyses.

The datasets used and analyzed in this study are available from the Center of Evaluation and Measurement at KU upon request.

Rasch-based differential item functioning analysis

Rasch (1960/1980) introduced the dichotomous Rasch model for obtaining objective measures of a unidimensional construct. The probability of answering an item correctly is a function of a person’s latent ability and an item’s difficulty. The Rasch model for dichotomous responses is shown below.

$${P}_{ij}\left({Y}_{ij}=1 \mid {\theta }_{j}, {b}_{i}\right)=\frac{{e}^{({\theta }_{j} - {b}_{i})}}{1+ {e}^{({\theta }_{j} - {b}_{i})}},$$
(1)

where \({P}_{ij}\) is the probability of person j correctly answering item i, \({\theta }_{j}\) represents the English proficiency of person j, and \({b}_{i}\) is the difficulty of item i.
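
To make Eq. (1) concrete, the following sketch in R (the software used later for the regression analyses) computes the model-implied probability for hypothetical ability and difficulty values; it only illustrates the formula and is not part of the operational Winsteps calibration.

# Probability of a correct response under the dichotomous Rasch model (Eq. 1)
rasch_prob <- function(theta, b) {
  exp(theta - b) / (1 + exp(theta - b))
}

# Hypothetical example: an examinee at +0.5 logits and an item at -0.3 logits
rasch_prob(theta = 0.5, b = -0.3)  # approximately 0.69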

Before examining the fairness of scores, we need to evaluate how valid and reliable the scores are (Engelhard & Wang, 2021). Rasch models provide residual-based Outfit and Infit mean square statistics for examining the internal structure of the scale. Outfit is an outlier-sensitive statistic that detects outlying responses. Infit is an information-weighted statistic that identifies unexpected response patterns. For high-stakes multiple-choice items, Linacre (1994) suggests an acceptable range of fit to the scale between 0.8 and 1.2. Values lower than 0.8 indicate overfit to the Rasch scale, with less variation in response patterns than expected, and values greater than 1.2 reflect underfit, with more variation in responses than expected (Linacre, 1994). These fit indices are used to examine fit at the individual item level. Meanwhile, to assess whether the underlying scale measures a single construct (i.e., is unidimensional), the proportion of variance explained by the measures is used. An appropriate cut-off value for the variance explained by the model is debatable. In practice, unidimensional Rasch models are recommended when the proportion is greater than 20% (Hambleton & Rovinelli, 1986).
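
As a rough illustration of how these fit statistics are defined (Winsteps reports them directly; the sketch below only mirrors the standard mean square formulas and assumes a complete persons-by-items response matrix X with person measures theta and item difficulties b already estimated):

# Illustrative item Outfit/Infit mean squares from a persons-by-items
# 0/1 response matrix X, given Rasch person measures theta and item difficulties b
item_fit <- function(X, theta, b) {
  P  <- outer(theta, b, function(t, d) exp(t - d) / (1 + exp(t - d)))  # expected scores
  W  <- P * (1 - P)                            # model variance of each response
  Z2 <- (X - P)^2 / W                          # squared standardized residuals
  outfit <- colMeans(Z2)                       # unweighted mean square: outlier-sensitive
  infit  <- colSums((X - P)^2) / colSums(W)    # information-weighted mean square
  data.frame(item = seq_len(ncol(X)), outfit = outfit, infit = infit)
}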

Rasch models provide a reliability of separation index that can be interpreted as the internal consistency of person scores. This index is comparable to Cronbach’s alpha in that it ranges from 0 to 1, with higher values indicating higher consistency. The reliability can be transformed into a separation index, which expresses the spread among the elements in standard-error units (Wright & Masters, 1982). The index is also unique in that latent scores, rather than observed or raw scores, are used to compute it. The use of latent scores makes this index more useful for examining the reliability of the underlying latent scale. Low reliability values (i.e., close to 0) indicate a narrow range of location estimates, and high reliability values (i.e., close to 1) indicate a wide range of estimates (Bond et al., 2021).
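
A minimal sketch of how the reliability of separation and the separation index can be computed, assuming person measures (theta) and their standard errors (se) are available from the Rasch calibration; Winsteps reports this index directly, so the code is purely illustrative:

# Illustrative person reliability of separation and separation index
separation_reliability <- function(theta, se) {
  obs_var <- var(theta)                  # observed variance of person measures
  err_var <- mean(se^2)                  # mean error variance
  rel <- (obs_var - err_var) / obs_var   # reliability of separation (0 to 1)
  sep <- sqrt(rel / (1 - rel))           # separation: true spread in error units
  c(reliability = rel, separation = sep)
}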

For assessing internal bias at the item level, DIF analysis is conducted using the Rasch–Welch t-test, which examines whether an item is equally difficult across subgroups defined by examinees’ demographic characteristics such as gender, nationality, high school type, and high school major. The Welch t statistic is shown below.

$$\text{Welch t}= \frac{{b}_{1}-{b}_{2}}{\sqrt{{SE}_{1}^{2}+{SE}_{2}^{2} }},$$
(2)
$$df=\frac{{\left({SE}_{1}^{2}+{SE}_{2}^{2}\right)}^{2}}{\frac{{SE}_{1}^{4}}{{n}_{1}-1}+\frac{{SE}_{2}^{4}}{{n}_{2}-1}}$$
(3)

where \({b}_{1}\) and \({SE}_{1}\) are the item difficulty measure and its standard error for Group 1, and similarly, \({b}_{2}\) and \({SE}_{2}\) are those for Group 2. Bond and Fox (2015) suggested that an item exhibits DIF if (a) the significance test for the t statistic produces a p value below 0.05 and (b) the DIF contrast (\({b}_{1}-{b}_{2}\)), as an effect size, is greater than or equal to 0.5 logits.
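
The flagging rule can be sketched in R as follows, assuming each group’s item difficulty estimate and standard error are available from the group-wise calibrations; the function simply mirrors Eqs. (2) and (3) together with the Bond and Fox (2015) criteria and is not the Winsteps implementation itself:

# Illustrative Rasch-Welch t-test for DIF between two groups for one item
dif_welch <- function(b1, se1, n1, b2, se2, n2, alpha = 0.05, min_contrast = 0.5) {
  contrast <- b1 - b2                                  # DIF contrast (effect size in logits)
  t_stat   <- contrast / sqrt(se1^2 + se2^2)           # Eq. (2)
  df       <- (se1^2 + se2^2)^2 /
              (se1^4 / (n1 - 1) + se2^4 / (n2 - 1))    # Eq. (3)
  p_value  <- 2 * pt(-abs(t_stat), df)                 # two-sided p value
  flag     <- (p_value < alpha) & (abs(contrast) >= min_contrast)
  data.frame(contrast, t_stat, df, p_value, flag)
}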

Item-level responses to the KUEAT from Sample 1 are used for the DIF analysis. The data are analyzed using a specialized Rasch analysis program, Winsteps (version 5.3.1; Linacre, 2022), with the joint maximum likelihood estimation method.

Regression-based differential prediction analysis

The differential prediction analysis is used to detect external bias in test scores. We use regression models with demographic variables as moderators. Each regression model contains a single moderator to examine whether differential prediction exists across the subpopulation groups defined by that demographic variable. This is clearer to interpret than including all demographic variables in a single model. The regression model is specified below.

$$Y= {\beta }_{0}+ {\beta }_{1}{X}_{1}+ {\beta }_{2}{X}_{2}+{\beta }_{3}{X}_{1}{X}_{2}+\varepsilon$$
(4)

where \(Y\) denotes first-year GPA, \({X}_{1}\) refers to KUEAT scores, \({X}_{2}\) is the dummy-coded demographic variable or moderator (i.e., gender, nationality, high school type, or high school major), the \(\beta\)’s are the regression coefficients, and \(\varepsilon\) is the residual term. Among the regression coefficients, \({\beta }_{3}\) is of primary interest for addressing the research questions. After accounting for the main effects of KUEAT scores (\({\beta }_{1}\)) and the demographic variable (\({\beta }_{2}\)), the interaction effect indicates whether the prediction of first-year GPA from KUEAT scores differs across demographic groups. If \({\beta }_{3}\) is significantly different from zero, reflecting a moderation effect, then differential prediction of KUEAT scores occurs between demographic groups.
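
In R, each moderated regression in Eq. (4) reduces to a single lm() call with an interaction term. The sketch below uses hypothetical variable names (gpa, kueat, school_type, sample2), since the actual column names in Sample 2 were not reported:

# Illustrative moderated regression for one demographic moderator
# (school_type dummy-coded: 0 = public high school, 1 = private high school)
fit <- lm(gpa ~ kueat * school_type, data = sample2)
summary(fit)  # the kueat:school_type coefficient is beta_3, the moderation effect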

The differential prediction analysis is conducted in R Statistical Software (v4.2.1; R Core Team, 2022) based on Sample 2.

Results

A dichotomous Rasch model is fitted to calibrate student responses to KUEAT items. A variable map is shown in Fig. 2 with persons and items located along a common latent scale. Examinees with higher English proficiency are located at the top, and less proficient students are located near the bottom. For items, more difficult items are located on the top, and easier items are located at the bottom. The Rasch analysis results are used for addressing research questions 1 to 3. Then the moderated regression analyses are used to address research question 4.

Fig. 2 Variable map for the Kuwait University English Aptitude Test calibration. Note. Each “#” represents 10 examinees and each “.” represents 1 examinee on the left side of the scale. Persons are ordered based on their latent English proficiency, with a higher location measure reflecting higher proficiency. The test items are displayed on the right side, with a higher location indicating higher difficulty

RQ1: How well does the KUEAT scale reflect student English proficiency?

First, the constructed Rasch scale can explain 33.4% of the variation in student responses, which supports that KUEAT measures a single dimension of English proficiency. Next, we obtained the Outfit and Infit mean square statistics for individual items. Results indicated 43 items with Outfit values outside the acceptable range (i.e., 0.8–1.2; Linacre, 1994), and 12 items showing misfit based on Infit statistics (Fig. 3).

Fig. 3 Frequency distribution of Outfit and Infit mean square statistics. Note. MNSQ—mean square statistic. The red lines indicate the acceptable range of fit values, between 0.8 and 1.2

RQ2: How well can students at different English proficiency levels be separated by the Rasch scale?

The variable map shows a roughly normal distribution of students’ English proficiency estimates. A few examinees located above the most difficult items (i.e., q25 and q26) answered all the items correctly, and those located below the easiest items (i.e., q45 and q5) answered all items incorrectly. The Rasch analysis provides a person reliability of separation index for assessing how distinct the person estimates are along the Rasch scale. Our analysis returned a value of 0.94, indicating that the underlying Rasch scale with the calibrated KUEAT items can distinguish examinees at different English proficiency levels. This suggests the scale would be highly replicable with similar items measuring the same construct. For comparison purposes, we obtained Cronbach’s alpha using the observed scores. The alpha coefficient was 0.93, suggesting high internal consistency of the KUEAT items.

RQ3: Are KUEAT items equally difficult across different student subgroups?

After placing all examinees on a common latent scale along with the calibrated items, DIF analyses based on the Rasch–Welch t-test were conducted for each demographic variable (i.e., nationality, gender, high school type, and high school major) to examine whether any item shows bias against a particular subpopulation group. Statistically, the test examines whether an item is equally difficult across subgroups. An item was flagged as displaying DIF when its difficulty measures were significantly different (\(p<.05\)) between subgroups.

Table 2 shows the t-test results for DIF items across nationality, gender, and high school major. The DIF items by high school type are displayed separately in Fig. 4. Results indicated five items showing DIF between nationality groups (Table 2, panel A). Two items (q34 and q37) had higher difficulty measures for Kuwaiti than for non-Kuwaiti examinees, and three items (q24, q20, and q82) were more difficult for non-Kuwaiti examinees. Next, five items showed DIF by gender, among which three items (q34, q42, and q5) were significantly more difficult for males and two items (q41 and q39) were more difficult for females (Table 2, panel B). Third, three items exhibited DIF between examinees with different high school majors (Table 2, panel C). In particular, two items (q82 and q25) were more difficult for examinees who majored in humanities, and one item (q67) was more difficult for examinees who majored in science. Lastly, many items (N = 30) showed DIF across subgroups defined by high school type. Among these, 16 items were more difficult for students from public schools, favoring examinees from private schools, while 14 items favored examinees from public schools and were more difficult for students from private schools (Fig. 4).

Table 2 Differential item functioning items by nationality, gender, and high school major

Fig. 4 Flagged items for differential item functioning by high school type

RQ4: Do KUEAT scores have the same prediction toward first-year college GPA across student subgroups?

The differential prediction analysis was conducted on Sample 2 using moderated regression. Specifically, we examined whether KUEAT scores predict first-year college GPAs differentially between student subpopulation groups. The demographic variables defining the subgroups (gender, nationality, high school type, and high school major) served as the moderators in the regression analyses. Table 3 reports the moderation effect of each demographic variable. Figure 5 displays the regression lines showing the relationship between KUEAT scores and GPAs for each subgroup.

Table 3 Moderated regression models with demographic variables

Fig. 5 Moderation analysis by demographic variables with slope estimates. Note. *p < .05, **p < .01, ***p < .001

First, KUEAT scores significantly predicted first-year GPAs for female students (\({\widehat{\beta }}_{1}=-.002,p<.001\)) but not for male students (\({\widehat{\beta }}_{1}+{\widehat{\beta }}_{3}=.000,p=.926\)), as displayed in Fig. 5A. The moderation effect was not significant (\({\widehat{\beta }}_{3}=.002, p=.06\)); however, the prediction results differed for the two gender groups.

Second, the regression analysis with nationality as a moderator revealed a significant interaction effect, that is the slope estimates of the two student groups were statistically different (\({\widehat{\beta }}_{3}=-.005, p<.001\)). This indicates a differential prediction of KUEAT scores between Kuwaiti and non-Kuwaiti students on their first-year college GPA (Fig. 5B). The KUEAT scores positively predicted college GPAs for non-Kuwaiti students (\({\widehat{\beta }}_{1}=.003, p<.05\)), but negatively predicted college GPAs for Kuwaiti students (\({\widehat{\beta }}_{1}+{\widehat{\beta }}_{3}=-.002, p<.001\)).

Next, the regression analysis with the high school type as a moderator showed a significant interaction effect (\({\widehat{\beta }}_{3}=.012, p<.001\)), indicating KUEAT scores predicted first-year college GPA differentially between students from public and private high schools. As shown in Fig. 5C, the KUEAT scores negatively predicted college GPAs for students from public high schools (\({\widehat{\beta }}_{1}=-.002, p<.001\)) but positively predicted college GPAs for students who graduated from private high schools (\({\widehat{\beta }}_{1}+{\widehat{\beta }}_{3}=.009, p<.001\)).

Lastly, KUEAT scores significantly predicted first-year GPAs for students with humanities majors (\({\widehat{\beta }}_{1}=-.002,p<.001\)) but not for students with science majors (\({\widehat{\beta }}_{1}+{\widehat{\beta }}_{3}=.000,p=.937\)), as shown in Fig. 5 (Panel D). The moderation effect of high school major was significant (\({\widehat{\beta }}_{3}=-.002, p<.01\)), indicating differential prediction of first-year college GPA from KUEAT scores between students with different high school majors.

Discussion

This study examined the fairness of KUEAT scores as an example of a high-stakes college entrance exam in Kuwait. It is one of the first attempts to comprehensively evaluate the fairness of high-stakes standardized testing in Kuwait. The fairness of KUEAT scores was evaluated through two aspects: (a) internal evidence of bias at the item level and (b) external evidence of bias at the test level. The results identified a number of DIF items in the KUEAT and showed that KUEAT scores predicted college performance differentially for different subpopulation groups.

We first performed an item analysis based on Rasch measurement theory to support the validity, reliability, and fairness arguments. The reliability of KUEAT scores was very high, indicating good consistency and replicability of the test items. However, many items showed misfit to the constructed scale: 12 misfitting items based on Infit mean square statistics and 40 based on Outfit mean square statistics. In addition, the DIF analyses revealed that 38 of the 85 items, 45% of the entire item set, may be biased against a particular student group. These pieces of evidence weaken the validity and fairness arguments and call attention to the valid and fair use of KUEAT scores for making admission decisions.

The presence of DIF items can seriously affect the validity and comparability of scores for their intended uses and interpretations (AERA, APA and NCME, 2014). It is essential to note that statistical bias alone is insufficient to conclude a lack of fairness. Researchers have investigated different reasons that may cause DIF, such as poorly formatted items, improper item content, or the measurement of an irrelevant construct (Gafni, 1991; O'Neill and McPeek, 1993; Pae, 2004). When the irrelevant construct is associated with group membership, the scores would be biased against a particular student subpopulation (Zieky, 2016). The presence of DIF may also be related to latent group membership, e.g., students who speeded through a test (Cohen & Bolt, 2005; De Ayala et al., 2002). The existence of DIF may imply that the test measures an additional dimension; for example, an integrated writing assessment may measure both reading and writing proficiency (Mazor et al., 1998; Roussos & Stout, 1996). For multiple-choice items, DIF may arise because distractors are perceived differently by individuals from different subgroups (Suh & Bolt, 2011; Suh & Talley, 2015). For test developers and users, it is important to investigate the potential reasons for DIF and to revise or remove DIF items before operational use of the test.

The differential prediction analysis revealed external bias in KUEAT scores: the scores predicted first-year college GPAs differentially between student subgroups. Specifically, there was a positive relationship (i.e., the higher the KUEAT score, the higher the GPA) for non-Kuwaiti examinees and for those who graduated from private high schools. For Kuwaiti examinees, female students, students from public high schools, and students with humanities majors, KUEAT scores negatively predicted college GPAs. For male students and those with science majors, college GPAs were not significantly related to KUEAT scores. A test should not be used for any purpose if it predicts future performance differentially between subgroups (Meade & Fetzer, 2009). This differential prediction calls the fairness of the scores into question and also leaves the predictive validity of the test scores unwarranted.

The existing literature reveals contradictory results concerning the predictive validity of KUEAT scores (Eid, 2009; Shamsaldeen, 2019). In particular, Eid (2009) indicated that KUEAT scores did not predict college performance, whereas Shamsaldeen (2019) found that KUEAT scores significantly predicted second-year college GPAs. Our study found that KUEAT scores predicted college performance differentially across student subgroups. Evidence of predictive validity supports an overall or average relationship between KUEAT scores and college performance; when KUEAT scores predict GPAs differentially, an average relationship cannot represent all student subgroups. We therefore suggest first examining whether a test shows differential prediction. When KUEAT scores predict performance in the same manner for all student subpopulation groups, supporting the fairness argument, it becomes more reasonable and meaningful to discuss the predictive validity of the scores.

The differential prediction of KUEAT scores may be due to differential validity of the test scores (Meade & Fetzer, 2009; Sackett & Wilk, 1994), the presence of DIF items, measurement issues with the criterion variables (Berry, 2015), or contextual influences such as stereotype threat (Steele & Aronson, 1995). Mattern et al. (2017) found that including academic discipline in college affected the results of differential prediction analysis, and the relationship between English proficiency scores and GPAs has been found to vary across academic disciplines (Light et al., 1987). Different college majors require different levels of English proficiency. In some majors, students can be successful regardless of their English level, which may weaken the association between language proficiency and academic achievement (Cotton & Conrow, 1998). Policymakers and administrators may consider assigning different weights to English test scores in admission decisions for different college majors.

Lastly, this study has two limitations. First, there was a range restriction issue with first-year college GPAs, which may lead to underestimated regression coefficients and effect sizes between the predictor and the criterion variables. However, correcting for range restriction requires information about population parameters or estimates from similar studies; as neither is available, we could not conduct the correction. Second, since the test items remain confidential and were not provided to us, we could not perform a substantive review of the item content to further explore the causes of DIF. Content experts should review the DIF items to identify possible sources of DIF and provide “explainable sources of bias” for removing or revising any DIF item (Pedrajita, 2011). Similarly, since we do not have access to individual student records, a further examination of differential prediction could not be conducted. We suggest that test developers and users consider fairness issues and explore factors that may bias item scores and test scores against a particular subpopulation group.

Conclusion

This study examined the internal and external bias of KUEAT scores to support the fairness argument. Results indicated differential item functioning at the item level and differential prediction of college GPAs at the test level. Although KUEAT scores showed high reliability, the validity and fairness of the scores were not supported. We recommend a substantive review of the test items and the establishment of routine examinations of psychometric quality based on modern measurement theory. A well-developed standardized test can serve as a beneficial complement to high school GPAs in making college admission decisions.

Availability of data and materials

The datasets used and analyzed in this study are available from the Center of Evaluation and Measurement at Kuwait University upon request.

Abbreviations

AERA:

American Educational Research Association

APA:

American Psychological Association

DIF:

Differential Item Functioning

GPA:

Grade Point Average

KU:

Kuwait University

KUEAT:

Kuwait University English Aptitude Test

MNSQ:

Mean Square Statistics

NCME:

National Council on Measurement in Education

SAT:

Scholastic Assessment Test

References

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association. https://www.apa.org/about/policy/guidelines-psychological-assessment-evaluation.pdf

  • Berry, C. M. (2015). Differential validity and differential prediction of cognitive ability tests: Understanding test bias in the employment context. Annual Review of Organizational Psychology and Organizational Behavior, 2(1), 435–463. https://doi.org/10.1146/annurev-orgpsych-032414-111256

  • Bond, T., & Fox, C. M. (2015). Applying the Rasch model: fundamental measurement in the human sciences (3rd ed.). New York: Routledge. https://doi.org/10.4324/9781315814698

  • Bond, T., Yan, Z., & Heene, M. (2021). Applying the Rasch Model: fundamental measurement in the human sciences (4th ed.). New York: Routledge.

  • Buckley, J., Letukas, L., & Wildavsky, B. (Eds.). (2018). Measuring success: Testing, grades, and the future of college admissions. Maryland: Johns Hopkins University Press.

  • Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks: Sage Publications.

  • Camilli, G., Briggs, D. C., Sloane, F. C., & Chiu, T. W. (2013). Psychometric perspectives on test fairness: Shrinkage estimation. In K. F. Geisinger, B. A. Bracken, J. F. Carlson, J. I. C. Hansen, N. R. Kuncel, S. P. Reise, & M. C. Rodriguez (Eds.), APA handbook of testing and assessment in psychology, Vol. 3. Testing and assessment in school psychology and education (pp. 571–589) American Psychological Association.

  • Camilli, G. (2006). Test fairness. Educational Measurement, 4, 221–256.

  • Churchill, A., Manno, B. V., Gentles, G., & Malkus, N. (2015). Bless the tests: three reasons for standardized testing. Thomas B. Fordham Institute.

  • Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and white students in integrated colleges. Journal of Educational Measurement, 5(2), 115–124. https://doi.org/10.1111/j.1745-3984.1968.tb00613.x

  • Cohen, A. S., & Bolt, D. M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42(2), 133–148. https://doi.org/10.1111/j.1745-3984.2005.00007

  • Cotton, F., & Conrow, F. (1998). An investigation of the predictive validity of IELTS amongst a group of international students studying at the University of Tasmania. IELTS Research Reports, 1(4), 72–115.

  • De Ayala, R. J., Kim, S.-H., Stapleton, L. M., & Dayton, C. M. (2002). Differential item functioning: A mixture distribution conceptualization. International Journal of Testing, 2(3–4), 243–276. https://doi.org/10.1080/15305058.2002.9669495

  • Eid, G. K. (2009). Al-Adaa alalmy letolab jamat Al-Kuwait wofkan landomat altalem altanay almoktalifa: Derasa tataboaya mokarana [The academic performance of Kuwait university students graduating from different high school educational systems: a follow-up and comparative study]. Riyadh, Saudi Arabia: J Educ Sci.

  • Engelhard, G., & Wang, J. (2021). Rasch models for solving measurement problems: Invariant measurement in the social sciences (Vol. 187). Thousand Oaks: Sage.

  • Gafni, N. (1991). Differential item functioning: Performance by sex on reading comprehension tests. Paper presented at the Annual Meeting of the Academic Committee for Research on Language Testing 9th, Kiryat Anavim, Israel. Retrieved from https://files.eric.ed.gov/fulltext/ED331844.pdf

  • Gershenson, S. (2018). Grade inflation in high schools (2005–2016). North Carolina: Thomas B Fordham Institute.

  • Habib, M. & Al Hamadi, H. (2023, Jan 21). ٤٠ ألف طالب غشّوا في الفترة الدراسية الأولى. [40 thousand students cheated in the first semester]. Al Qabas. https://www.alqabas.com/article/5904447-40-ألف-طالب-غشوا-في-الفترة-الدراسية-الأولى

  • Hambleton, R. K., & Rovinelli, R. J. (1986). Assessing the dimensionality of a set of test items. Applied Psychological Measurement, 10(3), 287–302. https://doi.org/10.1177/014662168601000307

  • Hasan, N. N. (2022, June 30). إحصائية أعدها باحث الدكتوراه ناصر حسن: متفوقو «العلمي» تضاعفوا إلى 47 % في 6 سنوات. [Statistics prepared by researcher Nasser Hasan: “Science major” a students have doubled to 47% in 6 years]. Al Qabas. https://www.alqabas.com/article/5887705-إحصائية-أعدها-باحث-الدكتوراه-ناصر-حسن-متفوقو-العلمي-تضاعفوا-إلى-47-في-6-سنوات

  • Hasan, N. N. (2024). إحصائية خاصة حصلت عليها القبس: 41 % نسبة متفوقي الثانوية في القسم العلمي. [Special statistics obtained by Al-Qabas: 41% percentage of high school graduates in the scientific section]. Al Qabas. https://www.alqabas.com/article/5931024-إحصائية-خاصة-حصلت-عليها-القبس-41-نسبة-متفوقي-الثانوية-في-القسم-العلمي

  • Huang, X., Wilson, M., & Wang, L. (2016). Exploring plausible causes of differential item functioning in the PISA science assessment: Language, curriculum or culture. Educational Psychology, 36(2), 378–390.

  • Light, R. L., Xu, M., & Mossop, J. (1987). English proficiency and academic performance of international students. TESOL Quarterly, 21(2), 251–261. https://doi.org/10.2307/3586734

  • Linacre, J. M. (2022). Winsteps® Rasch measurement computer program (Version 5.3.1). Portland, Oregon: Winsteps.com.

  • Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions., 7, 328.

  • Mattern, K., Sanchez, E., & Ndum, E. (2017). Why do achievement measures underpredict female academic performance? Educational Measurement, Issues and Practice, 36(1), 47–57. https://doi.org/10.1111/emip.12138

  • Mazor, K. M., Hambleton, R. K., & Clauser, B. E. (1998). Multidimensional DIF analyses: The effects of matching on unidimensional subtest scores. Applied Psychological Measurement, 22(4), 357–367.

  • Meade, A. W., & Fetzer, M. (2009). Test bias, differential prediction, and a revised approach for determining the suitability of a predictor in a selection context. Organizational Research Methods, 12(4), 738–761.

  • O’Neill, K. A., & McPeek, W. M. (1993). Item and test characteristics that are associated with differential item functioning. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255–276). Lawrence Erlbaum Associates.

  • Pae, T.-I. (2004). Gender effect on reading comprehension with Korean EFL learners. System, 32(2), 265–281. https://doi.org/10.1016/j.system.2003.09.009

  • Pedrajita, J. (2011). Using contingency table approaches in differential item functioning analysis: a comparison. In ICERI2011 Proceedings (pp. 5449–5458). IATED.

  • R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

  • Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.

  • Rhoades, K., & Madaus, G. (2003). Errors in standardized tests: a systemic problem. National Board on Educational Testing and Public Policy, Chestnut Hill, MA. Retrieved from https://www.bc.edu/research/nbetpp/statements/M1N4.pdf

  • Roussos, L., & Stout, W. (1996). A multidimensionality-based DIF analysis paradigm. Applied Psychological Measurement, 20(4), 355–371. https://doi.org/10.1177/014662169602000404

  • Sabatini, J., Bruce, K., Steinberg, J., & Weeks, J. (2015). SARA reading components tests, RISE forms: Technical adequacy and test design. ETS Research Report Series, 2015(2), 1–20.

  • Sackett, P. R., & Wilk, S. L. (1994). Within-Group norming and other forms of score adjustment in preemployment testing. The American Psychologist, 49(11), 929–954. https://doi.org/10.1037/0003-066X.49.11.929

  • Shamsaldeen, F. (2019). Predictive validity evidence for the Kuwait University Aptitude Tests in STEM colleges [Unpublished master’s thesis]. Pennsylvania: University of Pittsburgh.

  • Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69(5), 797–811. https://doi.org/10.1037/0022-3514.69.5.797

  • Suh, Y., & Bolt, D. M. (2011). A Nested Logit Approach for investigating distractors as causes of differential item functioning. Journal of Educational Measurement, 48(2), 188–205. https://doi.org/10.1111/j.1745-3984.2011.00139.x

  • Suh, Y., & Talley, A. E. (2015). An empirical comparison of DDF detection methods for understanding the causes of DIF in multiple-choice items. Applied Measurement in Education, 28(1), 48–67. https://doi.org/10.1080/08957347.2014.973560

  • Wang, J., Tanaka, V., Engelhard, G., & Rabbitt, M. P. (2020). An examination of measurement invariance using a multilevel explanatory Rasch model. Measurement: Interdisciplinary Research and Perspectives, 18(4), 196–214.

  • Wright, B., & Masters, G. (1982). Rating scale analysis. Chicago: MESA Press.

  • Zieky, M. J. (2016). Fairness in test design and development. In Fairness in Educational Assessment and Measurement (1st ed., pp. 9–31). Routledge. https://doi.org/10.4324/9781315774527-3

  • Zwick, R. (2007). College admission testing [Technical Report]. National Association for College Admission Counseling. Retrieved from https://offices.depaul.edu/enrollment-management/test-optional/Documents/ZwickStandardizedTesting.pdf

Acknowledgements

We would like to thank the Center of Evaluation and Measurement at Kuwait University for providing the data and supporting this study. We would also like to thank Professors Willis Jones and Debbiesiu L. Lee for their helpful discussions and suggestions.

Funding

This study was supported by the Humanities and Social Science Fund of the Ministry of Education of China, titled “Development of Human and Machine Scoring Techniques in Subjective Creativity Assessments” (22YJC860021) and National Key R&D Program of China (2023YFC3341302).

Author information

Contributions

FS and JW made substantial contributions to the conception, design, and manuscript writing. The data analysis and interpretation were conducted by FS under close supervision by JW. SA made substantial contributions to the conception and discussion of findings. All authors were involved in revising the manuscript, approved the submitted version, and agreed to be accountable for their contributions to this study.

Corresponding author

Correspondence to Jue Wang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Shamsaldeen, F., Wang, J. & Ahn, S. Evaluating the fairness of a high-stakes college entrance exam in Kuwait. Lang Test Asia 14, 27 (2024). https://doi.org/10.1186/s40468-024-00301-4

