Item performance across native language groups on the Iranian National University Entrance English Exam: a nationwide study

Bormanaki, Hamidreza Babaee; Ajideh, Parviz

doi:10.1186/s40468-022-00185-2

Research
Open access
Published: 10 September 2022

Language Testing in Asia volume 12, Article number: 29 (2022)

Abstract

This paper reports on an investigation of differential item functioning (DIF) in the Iranian Undergraduate University Entrance Special English Exam (IUUESEE) across four native language groups (Azeri, Persian, Kurdish, and Luri test-takers) using Rasch analysis. A total sample of 14,172 participants was selected for the study. After establishing the unidimensionality and local independence of the data, the authors employed two methods to test for DIF: (a) a t-test uniform DIF analysis, which showed that the Luri test-takers were favored more than the other native language groups, and (b) a nonuniform DIF analysis, which revealed that the majority of nonuniform DIF instances functioned in favor of the low-ability Azeri, the low-ability Persian, the high-ability Kurdish, and the high-ability Luri test-takers. A possible explanation for native language-ability DIF was that the Luri and low-ability test-takers were more likely to venture lucky guesses. We also referred to socioeconomic status (e.g., test-wiseness), guessing, overconfidence, thoughtless errors, stem length, time, L1, and unappealing distractors as possible sources of DIF in the IUUESEE.

Introduction

In the context of L2 proficiency testing, high-stakes tests play an important role in test-takers’ lives. This highlights the importance of test fairness, that is, the attempt to rule out or reduce bias against particular groups of test-takers so that all have equal opportunities to demonstrate their knowledge and skills, which in turn increases social justice (Gipps & Stobart, 2009; McNamara & Ryan, 2011). Therefore, the development of high-stakes tests needs to include a rigorous process of item analysis to ensure that all test-takers with the same underlying level of language proficiency have the same probability of answering the items correctly (Camilli & Shepard, 1994).

This study evaluates the Iranian Undergraduate University Entrance Special English Exam (IUUESEE) through DIF analysis, which is a powerful tool for investigating statistical bias in test items. The IUUESEE was established in 1999 by the National Organization of Educational Testing in multiple-choice format and includes structure, vocabulary, word order, language function, cloze test, and reading comprehension subtests.

The IUUESEE is a high-stakes, norm-referenced test administered annually to candidates seeking admission to Iranian undergraduate foreign language programs. Based on their rank in the test outcomes, participants can select a university for their education. Because the IUUESEE may have social and personal consequences for the participants, this research provides new insights into the psychometric aspects of the test, specifically the DIF that individual items may display. Through this research, stakeholders, particularly test designers, can gauge the probable effect of test-takers’ L1 on test outcomes in different parts of the IUUESEE. This can shed light on construct-irrelevant variance among test-takers and promote the construct validity of the test by giving test designers an opportunity to revise the subtests and items that may unfairly function in favor of a group (or groups).

DIF analysis is a statistical technique for estimating the extent to which participants with different background characteristics but the same level of ability have different probabilities of responding to test items correctly (Cohen & Bolt, 2005; Oliveri et al., 2014; Zumbo, 2007). The presence of DIF indicates that some factors apart from the test construct influence the performance of one group but not the other (Timukova & Drackert, 2019). In other words, DIF is the result of unequal probabilities of correctly answering an item between two groups of test-takers who are otherwise matched in ability on a construct (Ferne & Rupp, 2007). Therefore, the examination of DIF is an indispensable step in the validation of educational and psychological tests (Camilli & Shepard, 1994). It provides researchers with a series of techniques to uncover construct-irrelevant factors that are likely to discriminate unfairly against a specific group of test-takers and hence threaten the validity of test outcomes (Pae, 2004).
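To make the definition concrete, the following minimal sketch shows how unequal item difficulties for two groups translate into unequal success probabilities for test-takers of the same ability under the dichotomous Rasch model, and how a difficulty contrast can be turned into a t-type statistic. The function names, the group labels, and all numeric values are hypothetical illustrations, and the standard-error pooling shown is one common form of the Rasch DIF contrast test; it is not presented as the exact procedure used in this study.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def dif_contrast_t(b_ref: float, se_ref: float, b_focal: float, se_focal: float):
    """Return (contrast, t) for item difficulties estimated separately per group.

    contrast > 0 means the item is harder for the focal group than for the
    reference group at the same ability level.
    """
    contrast = b_focal - b_ref
    t = contrast / math.sqrt(se_ref ** 2 + se_focal ** 2)
    return contrast, t


# Hypothetical values (in logits), chosen only for illustration.
theta = 0.5                      # same ability for both test-takers
b_ref, se_ref = 0.0, 0.10        # item difficulty and SE in the reference group
b_focal, se_focal = 0.6, 0.12    # item difficulty and SE in the focal group

p_ref = rasch_probability(theta, b_ref)
p_focal = rasch_probability(theta, b_focal)
contrast, t = dif_contrast_t(b_ref, se_ref, b_focal, se_focal)

print(f"P(correct | reference) = {p_ref:.3f}")
print(f"P(correct | focal)     = {p_focal:.3f}")
print(f"DIF contrast = {contrast:.2f} logits, t = {t:.2f}")
```

In this sketch the two test-takers are matched on ability, so any difference between the two probabilities comes from the group-specific item difficulty alone, which is what the definition of DIF describes.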

In the context of test fairness, language testing researchers have used statistical DIF analysis, mainly Rasch-based procedures, to disclose statistical bias in test items (see, e.g., Aryadoust, 2012; Aryadoust & Zhang, 2016; Belzak, 2019; Timukova & Drackert, 2019; Trace, 2019; Vanbuel & Deygers, 2021; Xuelian & Aryadoust, 2020; Zenisky et al., 2003; Zhang et al., 2003). In Rasch-based investigations, standardized fit statistics and the Rasch mean square (MNSQ) are usually used to examine how well the data set fits the model (for further explanation, see the DIF analysis section). This study applied Rasch-based DIF analysis to examine native language-based DIF in the Iranian Undergraduate University Entrance Special English Exam.
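As an illustration of the mean-square fit statistics mentioned above, the sketch below computes the conventional infit and outfit MNSQ for a single dichotomous item from known person abilities and an item difficulty. The function name and the response data are hypothetical, and ability and difficulty estimates are assumed to be given; in practice they would come from Rasch estimation software.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def item_fit_mnsq(responses, thetas, difficulty):
    """Return (infit, outfit) mean squares for one dichotomous item.

    outfit: unweighted mean of squared standardized residuals.
    infit:  sum of squared residuals divided by the sum of model variances
            (information-weighted mean square).
    """
    squared_residual_sum = 0.0
    variance_sum = 0.0
    standardized_sum = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, difficulty)
        variance = p * (1.0 - p)          # model variance of a 0/1 response
        squared_residual = (x - p) ** 2
        squared_residual_sum += squared_residual
        variance_sum += variance
        standardized_sum += squared_residual / variance
    outfit = standardized_sum / len(responses)
    infit = squared_residual_sum / variance_sum
    return infit, outfit


# Hypothetical abilities (logits) and scored responses to one item.
thetas = [-1.0, 0.0, 1.0, 2.0]
responses = [0, 1, 1, 1]
infit, outfit = item_fit_mnsq(responses, thetas, difficulty=0.0)
print(f"infit MNSQ = {infit:.2f}, outfit MNSQ = {outfit:.2f}")
```

Both statistics have an expected value of 1 when the data fit the model; values well above 1 suggest noisier responses than the model predicts, and values well below 1 suggest responses that are more predictable than expected.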

Literature review

Previous studies on the effect of native language on test-takers’ performance

In the context of language assessment, DIF based on language background has been of particular interest to researchers. Two lines of inquiry emerge in the pertinent literature. The first includes studies that examined the structure of a test across different language groups (Ackerman et al., 2000; Brown, 1999; Ginther & Stevens, 1998; Hale et al., 1989; Kunnan, 1994; Li & Suen, 2012; Oltman et al., 1988; Swinton & Powers, 1980). These studies examined whether a test measured the same constructs for several language groups (Kim, 2001). Most directly, Swinton and Powers (1980) identified different constructs across non-Indo-European (NIE) and Indo-European (IE) language groups on the Test of English as a Foreign Language (TOEFL). On the other hand, Ackerman et al. (2000) examined dimensionality among three language groups (Korean, Arabic, and French) in the TOEFL Listening Comprehension section and identified a single dimension across all three groups.

The second line of inquiry comprises studies that explored differences in test-takers’ performance at the item level (Alderman & Holland, 1981; Chen & Henning, 1985; Harding, 2011; Kim, 2001; Oliveri et al., 2018; Ryan & Bachman, 1992; Sasaki, 1991; Shin, 2021; Shin et al., 2021; Uiterwijk & Vallen, 2005; Xuelian & Aryadoust, 2020). For example, in two early studies of the effect of native language on test performance at the item level, Chen and Henning (1985) and Sasaki (1991) reported that the vocabulary subsection in different tests functioned in favor of the Spanish language groups. Specifically, Chen and Henning (1985) found that the DIF items identified in the vocabulary subsection favored the Spanish group over the Chinese group. In another study, Oliveri et al. (2018) discovered more DIF items functioning in favor of non-American citizens living in America over American citizens in the verbal reasoning part of the GRE. Recently, Xuelian and Aryadoust (2020) investigated mother tongue differential item functioning in the Pearson Test of English (PTE) Academic Reading test across Indo-European (IE) and non-Indo-European (NIE) language families. They found no statistically significant uniform differential item functioning (UDIF) (p > 0.05); however, they identified three non-uniform differential item functioning (NUDIF) items out of 10 items across the language families.

Examining DIF based on native language constitutes a significant validation inquiry for language test designers in various test situations, especially high-stakes tests (Geranpayeh & Kunnan, 2007). However, a review of these studies reveals several limitations. First, the majority of native language-based DIF investigations have been conducted in European and American settings (Pae, 2004), so the generalizability of their findings is open to question given the lack of DIF studies in other settings such as Asian contexts. The current study was carried out in an Asian context, Iran, on four successive versions of the IUUESEE to help fill this gap. Second, these studies detected DIF items with arbitrary criteria. For instance, in Chen and Henning’s DIF analysis, if the confidence interval had been set narrower than 95%, more instances of significant DIF might have been identified. Third, unbalanced small samples and short tests were also problematic. The present study is a nationwide investigation with a large sample (14,172 test-takers) and a large number of items (70 items). Finally, the presence or absence of DIF across ability levels was not taken into consideration in most of the previous studies. In other words, the procedures employed in those studies did not examine non-uniform DIF (see the “DIF analysis” section). Several studies that did not identify UDIF have been shown to contain NUDIF in their test items (see Mazor et al., 1994). In the current study, we used Rasch analysis to identify both uniform and nonuniform DIF for dichotomous response items.

Previous research has investigated the DIF of the IUUESEE in terms of gender (Barati & Ahmadi, 2010) and field of study (Barati et al., 2006). However, the test has not yet been subjected to native language-based UDIF and NUDIF analysis. Without such analysis, stakeholders, specifically test developers and test users, are left to assume that the test is fair and does not function in favor of any native language group. Therefore, the objective of the present study is to investigate the interaction between item functioning and native language. To this end, the study addresses the following research questions:

1. Do the test data support the assumptions of unidimensionality and local independence, as requirements of Rasch-based DIF analysis?
2. Does the IUUESEE contain UDIF items across the Azeri, the Persian, the Kurdish, and the Luri native language groups? If so, to what extent does the test function differentially across the four groups?
3. Does the IUUESEE contain NUDIF items across ability levels of the four native language groups? If so, to what extent does the test function differentially across the ability levels?
4. Can more stringent Rasch fit criteria indicate the presence of DIF?
5. What are the probable factors that caused DIF in items in the test?

Method

Participants

The participants of this study were randomly selected from high school graduates who sat for the IUUESEE in 2016, 2017, 2018, and 2019. Generally, IUUESEE participants fall into two groups. The first group includes test-takers who take the IUUESEE together with an exam in their high school field of study, which is one of math, science, or literature and humanities. The second group includes those who take only the IUUESEE; in other words, this exam is their main exam for entering undergraduate university programs. The dataset used in this study contained participants from both groups. The participants were selected from four provinces of Iran according to the four native languages under investigation. Overall, a total sample of 14,172 participants was selected for the current study. Table 1 presents the specific information about the participants.

Table 1 Number of participants by province and first language
