Item performance across native language groups on the Iranian National University Entrance English Exam: a nationwide study

This paper reports an investigation of differential item functioning (DIF) in the Iranian Undergraduate University Entrance Special English Exam (IUUESEE) across four native language groups (the Azeri, the Persian, the Kurdish, and the Luri test-takers) via Rasch analysis. A total sample of 14,172 participants was selected for the study. After establishing the unidimensionality and local independence of the data, the authors employed two methods to test for DIF: (a) a t-test-based uniform DIF analysis, which showed that the Luri test-takers were favored more than the other native language groups, and (b) a nonuniform DIF analysis, which revealed that the majority of nonuniform DIF instances functioned in favor of the low-ability Azeri, the low-ability Persian, the high-ability Kurdish, and the high-ability Luri test-takers. A possible explanation for native language-by-ability DIF was that the Luri and the low-ability test-takers were more likely to venture lucky guesses. We also point to socioeconomic status, test-wiseness, guessing, overconfidence, thoughtless errors, stem length, time, L1, and unappealing distractors as possible sources of DIF in the IUUESEE.

The IUUESEE comprises structure, vocabulary, word order, language function, cloze test, and reading comprehension subtests.
The IUUESEE is a high-stakes, norm-referenced test administered annually to candidates seeking admission into Iranian foreign language undergraduate programs. Based on their rank in the test outcomes, participants can select a university for their education. Because of the paramount importance of the IUUESEE, which may carry social and personal consequences for participants, this research provides new insights into the psychometric properties of the test, specifically the DIF that individual items may display. Through this research, stakeholders, especially test designers, can recognize the probable effect of test-takers' L1 on test outcomes in different parts of the IUUESEE. This can shed light on construct-irrelevant variance among test-takers and promote the construct validity of the test by giving test designers an opportunity to revise subtests and items that may unfairly function in favor of one group (or groups).
DIF analysis is a statistical technique for estimating the extent to which participants from different groups, but with the same level of ability, have different probabilities of responding to test items correctly (Cohen & Bolt, 2005; Oliveri et al., 2014; Zumbo, 2007). It shows that some factor apart from the test construct influences the performance of one group but not the other (Timukova & Drackert, 2019). In other words, DIF results from unequal probabilities of correctly answering an item by two groups of test-takers who are otherwise matched in ability on a construct (Ferne & Rupp, 2007). Therefore, the examination of DIF is an indispensable step in the validation of educational and psychological tests (Camilli & Shepard, 1994). It provides researchers with a series of techniques for uncovering construct-irrelevant factors that are likely to discriminate unfairly against a specific group of test-takers and hence threaten the validity of test outcomes (Pae, 2004).
In the context of test fairness, language testing researchers have used statistical DIF analysis, mainly Rasch-based procedures, to disclose statistical bias in test items (see, e.g., Aryadoust, 2012; Aryadoust & Zhang, 2016; Belzak, 2019; Timukova & Drackert, 2019; Trace, 2019; Vanbuel & Deygers, 2021; Xuelian & Aryadoust, 2020; Zenisky et al., 2003; Zhang et al., 2003). In this regard, standardized fit statistics and the Rasch mean square (MNSQ) are usually used in Rasch-based investigations to examine the fit of the data set to the model (for further explanation, see the DIF analysis section). This study applied Rasch-based DIF analysis to examine native language-based DIF in the Iranian Undergraduate University Entrance Special English Exam.

Previous studies on the effect of native language on test-takers' performance
In the context of language assessment, DIF based on language background has been of particular interest to researchers. Two lines of inquiry emerge in the pertinent literature. The first includes studies that examined the structures of a test across different language groups (Ackerman et al., 2000; Brown, 1999; Ginther & Stevens, 1998; Hale et al., 1989; Kunnan, 1994; Li & Suen, 2012; Oltman et al., 1988; Swinton & Powers, 1980). These studies examined whether a test measured the same constructs for several language groups (Kim, 2001).

The IUUESEE is one of the exam groups (e.g., humanities, special English, and art) of the Iranian National University Entrance Exam, known as the Konkur. Konkur is borrowed and adapted from the French term "Concours," referring to the process of sourcing, evaluating, and selecting participants for different objectives (Alavi et al., 2021). The IUUESEE is a large-scale, high-stakes standardized test of English that was first administered nationwide in 2002 (Razmjo, 2006). Twenty years on, the structure of the exam has remained almost intact. The test contains six subtests comprising 70 multiple-choice (MC) items: structure (10 items), vocabulary (15 items), word order (5 items), language function (10 items), cloze test (15 items), and reading comprehension (15 items).
The items of the structure section take the form of incomplete sentences to be completed with one of four alternatives. The questions in this section measure test-takers' understanding of a specific grammatical rule or a mixture of rules. Vocabulary items are likewise incomplete sentences; test-takers must choose the option that best completes the meaning of the sentence. The word order section asks test-takers to choose the option that contains no grammatical mistake in relation to the stem of the item. Items in the language function section ask test-takers to complete short conversations with the best choice. The cloze section consists of a passage with 15 blanks, for each of which test-takers select the option that completes the passage. The last section, reading comprehension, includes three texts whose length varies between 350 and 500 words and which cover a wide range of topics such as academic, scientific, and social issues (Alavi et al., 2021). Each text is followed by five multiple-choice items that check test-takers' understanding of its content.
The time allotted for the 70 items is 105 min. All items are dichotomous. The exam applies a correction for guessing, such that every three incorrect answers cancel out one correct response (Alavi et al., 2021). The test content is not distributed equally across the subtests: structure and grammar account for 27.15%, vocabulary for 34.28%, and reading comprehension for 38.57% of the content (Razmjo, 2006).
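The correction-for-guessing rule described above can be sketched as follows. The source does not give the operational formula, so the familiar rights-minus-wrongs/3 form for four-option multiple-choice items is an assumption here.

```python
def corrected_score(num_correct, num_wrong):
    """Correction for guessing as described for the IUUESEE: every three
    incorrect answers cancel one correct response, i.e. score = R - W/3.
    (The exact operational formula used by the exam board is assumed,
    not taken from the source.) Blank answers are simply not counted."""
    return num_correct - num_wrong / 3

# A test-taker with 40 correct and 21 wrong answers (9 items left blank):
print(corrected_score(40, 21))  # 33.0
```

Under this rule, a test-taker who guesses blindly on four-option items gains nothing in expectation, which is the usual rationale for the penalty.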

Data collection procedures
The National Organization of Educational Testing provided the data for the study. This organization is responsible for designing, organizing, and administering national exams such as the university entrance exam for high school graduates and the university entrance exam for MA candidates. The organization provided us with the anonymous answer sheets for the test-takers of the special English exam in 2016, 2017, 2018, and 2019.

Data analysis
Before performing the DIF analysis, we undertook two main analyses of the test data: (1) an analysis of descriptive statistics, item difficulty measures, fit to the Rasch model, and reliability, and (2) an examination of the dimensionality and degree of local independence of the dataset.

The Rasch model
We conducted the rest of the analysis based on the Rasch model using WINSTEPS, Version 5.1 (Linacre, 2021). The Rasch model for dichotomous items has two core statistical concepts: item difficulty and person ability. The difficulty measure of an item is estimated from the number of participants who answered the item correctly, regardless of their ability levels, and a participant's ability measure is estimated from the number of items he or she answered correctly, regardless of the difficulty levels of the items (Linacre, 2012).
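The dichotomous Rasch model just described can be written as a single logistic function of the difference between ability and difficulty. A minimal sketch (illustrative only; the actual estimation was done in WINSTEPS):

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model:
    P(X = 1) = exp(theta - b) / (1 + exp(theta - b)),
    where theta is person ability and b is item difficulty, both in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the probability is exactly 0.5:
print(rasch_probability(0.0, 0.0))                 # 0.5
# A person 2 logits above the item's difficulty:
print(round(rasch_probability(1.0, -1.0), 3))      # 0.881
```

The logit scale makes person and item parameters directly comparable, which is what the Wright maps reported later exploit.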

Fit
The fit analysis investigates the extent to which the data match the Rasch model. We reported infit MNSQ and outfit MNSQ for items. On the basis of fit results, Bond and Fox (2007) divided items into two groups: underfitting items, whose MNSQ indices are greater than 1.4, and overfitting items, whose MNSQ indices are less than 0.6. Wright and Linacre (1994), in contrast, proposed a rather stringent fit criterion ranging from 0.8 to 1.2. In this study, we preferred Wright and Linacre's criterion because it is stringent and can be applied appropriately to dichotomous data (Smith, 1996).
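The two fit criteria above amount to different cut-offs on the same MNSQ statistic; a small sketch of the classification logic (the labels and default bounds follow the text, the function itself is ours):

```python
def classify_fit(mnsq, lower=0.8, upper=1.2):
    """Classify an item by its mean-square (MNSQ) fit statistic.
    Defaults use Wright and Linacre's (1994) stringent 0.8-1.2 range;
    pass lower=0.6, upper=1.4 for Bond and Fox's (2007) more lenient bounds."""
    if mnsq < lower:
        return "overfit"        # responses more predictable than the model expects
    if mnsq > upper:
        return "underfit"       # erratic responses, e.g. guessing or carelessness
    return "productive fit"

print(classify_fit(1.05))                          # productive fit
print(classify_fit(1.35))                          # underfit under 0.8-1.2
print(classify_fit(1.35, lower=0.6, upper=1.4))    # productive fit under 0.6-1.4
```

The example with MNSQ = 1.35 shows why the choice of criterion matters: the same item is flagged under one range and passes under the other.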

Reliability and separation
We used the Rasch model to examine the reliability of the test. In the Rasch model, reliability is estimated for both persons and items and ranges from 0 to 1. We also used separation as another index of reliability; it refers to the ratio of the test items' or test-takers' standard deviation to their root mean square standard error (Linacre, 2010) and varies from zero to infinity.
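Reliability and separation are algebraically linked, which is worth making explicit since both are reported in the Results. A sketch of the standard conversion (the formulas are the conventional Rasch ones, not taken from the source):

```python
import math

def separation_from_reliability(r):
    """Separation index G from Rasch reliability R: G = sqrt(R / (1 - R))."""
    return math.sqrt(r / (1.0 - r))

def reliability_from_separation(g):
    """Inverse relationship: R = G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)

# A reliability of 0.5 corresponds to a separation of exactly 1, i.e.
# roughly one statistically distinct stratum of performance:
print(separation_from_reliability(0.5))            # 1.0
print(round(reliability_from_separation(3.0), 2))  # 0.9
```

This link explains why the person separations of around one reported later go hand in hand with the modest person reliabilities.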

Point-measure correlation
In this study, point-measure correlations were estimated for all test items. These correlations represent the degree of consistency between observed scores and the latent trait (Linacre, 2012). We also examined the relationships between persons and items on an item-person map, or Wright map, which represents both person ability and item difficulty along a single line calibrated in log-odds units (logits) (Linacre, 2012).

Unidimensionality and local independence
We assessed unidimensionality through principal component analysis of linearized Rasch residuals (PCAR). Residuals are the differences between the expectations of the Rasch model and the observed data (Linacre, 1998; Wright, 1996a). Fit statistics were also used to test for unidimensionality: test items exhibiting irregular fit indices were presumed to carry distorted difficulty measures and to be affected by a factor not intended by the test designer. The assumption underlying local independence is that the response to one item should not affect the response to another item in a test. We tested for local independence using Pearson correlation analysis of the linearized Rasch residuals.
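The residual-correlation check for local independence can be sketched as follows. The toy abilities and responses are invented for illustration; the residual formula is the standard standardized Rasch residual.

```python
import math

def rasch_p(theta, b):
    """Dichotomous Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_residuals(responses, abilities, difficulty):
    """Standardized residuals z = (x - p) / sqrt(p * (1 - p)) for one item."""
    out = []
    for x, theta in zip(responses, abilities):
        p = rasch_p(theta, difficulty)
        out.append((x - p) / math.sqrt(p * (1.0 - p)))
    return out

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

# Toy data: six persons' abilities and two items' dichotomous responses.
abilities = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
item_a = [0, 0, 1, 1, 1, 1]
item_b = [0, 1, 0, 1, 1, 1]

# Residual correlations near 0 support local independence; values above
# about 0.7 would flag local dependence (Linacre, 2010).
r = pearson(item_residuals(item_a, abilities, 0.0),
            item_residuals(item_b, abilities, 0.2))
print(round(r, 2))
```

Because the ability dimension is removed before correlating, any remaining shared variance between two items points to an extra dimension or dependence between them.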

DIF analysis
Our study falls within the "second DIF generation" framework proposed by Zumbo (2007), a generation marked by widespread adoption of the term DIF rather than item bias. Multiple methods have been developed for identifying DIF (e.g., the Rasch model, the Mantel-Haenszel procedure, and logistic regression). We adopted the Rasch model, which has been used frequently in DIF studies. The Rasch model has an important advantage over other methods: it can identify both uniform DIF (UDIF) and non-uniform DIF (NUDIF) (Linacre, 2010), whereas, with the exception of logistic regression (Swaminathan, 1994), the other methods can identify only UDIF.
The presence of uniform DIF indicates that an item consistently functions in favor of a particular group of test-takers across all ability levels, whereas the presence of non-uniform DIF shows that the relative performance of test-takers varies across ability levels (Xuelian & Aryadoust, 2020). In other words, UDIF occurs when "there is no interaction between ability level and group membership" (Prieto Maranon et al., 1997, p. 559), while NUDIF is evidence of an interaction between ability level and group membership (Golia, 2016). The examination of NUDIF is of paramount importance yet is often ignored in DIF studies, and many studies that identified no UDIF have nonetheless been found to display NUDIF (see Mazor et al., 1994). Neglecting the investigation of NUDIF may lead to critical practical consequences (Ferne & Rupp, 2007). Therefore, selecting a method of DIF analysis that can uncover both UDIF and NUDIF is of significant importance.
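The UDIF/NUDIF distinction above can be made concrete with a small simulation: a constant group shift in item difficulty produces uniform DIF, while an ability-by-group interaction produces non-uniform DIF whose item characteristic curves cross. The parameter values are invented for illustration.

```python
import math

def p_correct(theta, difficulty, group_shift=0.0, interaction=0.0):
    """Rasch-type probability with group effects: a constant group_shift
    models uniform DIF; a nonzero interaction (a shift that grows with
    ability) models non-uniform DIF."""
    b = difficulty + group_shift + interaction * theta
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Uniform DIF: the focal group faces a constant 0.5-logit disadvantage,
# so the gap between groups has the same sign at every ability level.
uniform_gaps = [p_correct(t, 0.0) - p_correct(t, 0.0, group_shift=0.5)
                for t in (-2.0, 0.0, 2.0)]

# Non-uniform DIF: with an interaction term, the gap changes sign across
# the ability range, i.e. the two groups' ICCs cross.
nonuniform_gaps = [p_correct(t, 0.0) - p_correct(t, 0.0, interaction=0.4)
                   for t in (-2.0, 2.0)]

print([round(g, 3) for g in uniform_gaps])     # all positive
print([round(g, 3) for g in nonuniform_gaps])  # negative then positive
```

The crossing ICCs in the second case are exactly the pattern described later for item 65 of IUUESEE 2016.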
The Rasch model also has the advantage of being able to examine unidimensionality and local independence, which, according to Ferne and Rupp (2007), function as requirements for Rasch-based DIF analysis. Unidimensionality concerns the contamination of overall test scores by any extraneous dimension, and local independence concerns whether test-takers' performance on one test item is affected by their performance on another item (Ferne & Rupp, 2007). Roussos and Stout (1996, 2004) refer to this perspective as a multidimensionality-based DIF analysis method that integrates dimensionality analysis with DIF analysis. In this approach, the underlying causes of significant DIF are related to the presence of multidimensionality in items (Ackerman, 1992; Shealy & Stout, 1993). As Roussos and Stout (2004) stated, "such items measure at least one secondary dimension in addition to the primary dimension that the item is intended to measure" (p. 108). This multidimensional paradigm of DIF gives researchers opportunities to take account of these secondary dimensions (Geranpayeh & Kunnan, 2007). Dimensionality analysis is therefore a significant requirement for Rasch-based DIF analysis (Ferne & Rupp, 2007, p. 129), yet in previous DIF research only eight of twenty-seven reviewed studies examined unidimensionality (Ferne & Rupp, 2007). The current study investigated unidimensionality and local independence in the IUUESEE items to determine whether they satisfy the preconditions of DIF analysis.
Despite the lack of a comprehensive and solid framework for DIF analysis (see Zumbo, 2007), most researchers have taken one of two approaches in their DIF studies over the past decades: (1) a confirmatory approach, in which hypotheses are first generated through the analysis of test items and then tested via DIF analysis (e.g., Gierl, 2005);
(2) an exploratory approach, in which researchers first identify the items with significant DIF and then try to generate hypotheses about the causes of DIF, explaining the findings through previous studies and the evidence in the results, or conduct a posteriori content analysis of the items exhibiting DIF (e.g., Lin & Wu, 2003). A review of 27 DIF studies revealed that most applied an exploratory analysis (Ferne & Rupp, 2007). The current study is exploratory: we first identified the items with significant DIF and then put forward suppositions regarding the causes of DIF, explaining the findings through previous studies and the evidence found in the data.

Results
As a requirement for the native language-based DIF analysis of the data, testing for unidimensionality and local independence in the IUUESEE was the preliminary objective of this study. After testing for these properties, we investigated the presence of UDIF and NUDIF in items that met the stringent Rasch fit criteria proposed by Linacre (2010). The findings are discussed below.

Wright map
The Wright maps showed that the test items of all versions reflect a rather wide range of difficulty with an even spread. Test items clustered around the mean, where the majority of test-takers also clustered, and no gaps were identified in the item hierarchy. The maps show that many items have similar difficulty, which denotes an adequate number of items measuring test-takers' ability, generally near the mean where most test-takers are located. The maps also plotted some test-takers above the item with the highest difficulty measure, meaning that the tests include some high-ability test-takers whose abilities exceed the difficulty range of the test.

Rasch reliability analysis
The results of the person reliability analyses indicate that 37%, 47%, 51%, and 44% of the variability in person measures of the 2016, 2017, 2018, and 2019 exams, respectively, is attributable to error. Item reliability estimates show that only 1% of the variability in item measures of test versions 2016, 2017, and 2018 is due to error, and there is no sign of error in the variability of the item measures of test version 2019.
The person separation of all test versions is around one, which corresponds to the measurement of approximately one statistical stratum of performance in persons (Wright, 1996). The analyses of item separation revealed that the measurements consistently distinguish approximately twelve levels of item difficulty in the 2016 exam, ten levels in the 2017 exam, thirteen levels in the 2018 exam, and fifteen levels in the 2019 exam (Wright, 1996).

Unidimensionality and local independence
We analyzed unidimensionality and local independence with the WINSTEPS software. The principal component analysis of linearized Rasch residuals revealed that the Rasch dimension explains 28.1% (eigenvalue=27.4) of the observed variance in the 2016 exam, 23.5% (eigenvalue=21.4) in the 2017 exam, 29.6% (eigenvalue=29.4) in the 2018 exam, and 29.7% (eigenvalue=29.5) in the 2019 exam. These values are remarkably close to the Rasch model predictions of 27.7%, 23.5%, 29.6%, and 29.7% for the 2016, 2017, 2018, and 2019 exams, respectively, indicating that the estimation of the Rasch difficulty measures was successful (Linacre, 2010). The first contrast in the residuals explains only 2.3% (eigenvalue=2.2) of the variance in the data in the 2016 exam, 2.7% (eigenvalue=2.4) in the 2017 exam, 2.2% (eigenvalue=2.2) in the 2018 exam, and 2.8% (eigenvalue=2.7) in the 2019 exam.
The first dimension extracted from the residuals is about 13.5 times smaller than the Rasch dimension in IUUESEE 2016, 11 times smaller in IUUESEE 2017, 14.5 times smaller in IUUESEE 2018, and 15 times smaller in IUUESEE 2019. Furthermore, the disattenuated correlations of the clusters in all four exams are 1. These statistical outputs support the assumption of unidimensionality in the IUUESEE. The investigation of Pearson correlations strongly supported the assumption of local independence: correlations above 0.70 indicate local dependence (Linacre, 2010), and all observed correlations in the four exams fell between −0.13 and 0.29, which supports the local independence of all items.

IUUESEE 2016
Tables 2, 3, and 4 present the native language UDIF analysis of the test items of the 2016 exam (items displaying DIF), including the local difficulty of each item for each native language subgroup, the SEM for each measurement, the local difficulty contrast between the subgroups, and a Welch t value and a p value for this contrast. The difference between the local difficulty magnitudes of an item is called the DIF contrast, and the Welch t value expresses the difference between the local difficulties as a two-sided Student's t statistic (Linacre, 2010). For example, Table 3 shows that the difficulty of item 3 is −0.89 with a SEM of 0.09 for the Azeri subgroup and −0.44 with a SEM of 0.17 for the Luri subgroup; the contrast in difficulty, −0.45, is the measure of the DIF effect size (Linacre, 2010); the Welch t value of this contrast is −2.39; and the p value of the contrast is 0.0174, which is significant at the established threshold of p = 0.05, indicating that item 3 displays differential functioning based on the criteria suggested by Linacre (2010). The item therefore functions differently for the Azeri and the Luri groups owing to factors other than the test's construct of interest. Table 3 shows that the uniform DIF analysis identified eleven test items with significant DIF at p < 0.05 between the Azeri and the three other native language subgroups: items 3, 36, and 62 favoring the Azeri test-takers; item 11 favoring the Kurdish test-takers; items 28, 61, and 65 favoring the Luri test-takers; and items 50, 58, and 70 favoring the Persian test-takers. Item 23 favored the Azeri test-takers in comparison with the Persian test-takers, and the Luri and the Kurdish test-takers in comparison with the Azeri subgroup.
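The DIF-contrast test just illustrated can be sketched as follows, using the item 3 figures above. The normal approximation of the two-sided p value is our simplification: WINSTEPS uses the exact Welch-Satterthwaite degrees of freedom (which depend on subgroup sizes not reproduced here), so our t of about −2.34 differs slightly from the tabled −2.39.

```python
import math

def dif_contrast_test(d1, se1, d2, se2):
    """Welch-style t statistic for a uniform DIF contrast between two
    subgroups' local item difficulties d1 and d2 (with standard errors).
    The two-sided p value uses a normal approximation rather than the
    exact Welch-Satterthwaite degrees of freedom."""
    contrast = d1 - d2
    t = contrast / math.sqrt(se1 ** 2 + se2 ** 2)
    p = math.erfc(abs(t) / math.sqrt(2))  # two-sided p, normal approximation
    return contrast, t, p

# Item 3, IUUESEE 2016: Azeri difficulty -0.89 (SEM 0.09), Luri -0.44 (SEM 0.17).
contrast, t, p = dif_contrast_test(-0.89, 0.09, -0.44, 0.17)
print(round(contrast, 2), round(t, 2), round(p, 4))  # contrast -0.45, t ≈ -2.34
```

The contrast of −0.45 logits reproduces the reported DIF effect size exactly; the p value stays below 0.05 under this approximation, matching the paper's significance call.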
Of these eleven items, four (23, 28, 61, and 65) had UDIF magnitudes larger than 0.6 logits. The item characteristic curves (ICCs) of the item with the largest magnitude (item 65) are presented in Fig. 1. Item 65 favors the Luri test-takers over the Azeri ones and thus displays differential item functioning.
The solid line in Fig. 1 is the Rasch model curve. Comparing the Azeri with the Luri test-takers on item 65, the subgroups' ICCs intersect at four points: −4.3, −3.6, −2.8, and −1.5 logits (horizontal axis). These are the turning points at which the two native language subgroups' probabilities of correctly responding to the item are equal. The item favors the Azeri test-takers at ability levels up to −4.3 logits, whereas from −4.3 to −3.6 logits the Luri test-takers were more likely to answer it correctly; from −2.8 to −1.5 logits, the Azeri subgroup was again more likely to answer it correctly. Above −1.5 logits, where many test-takers landed, there was little difference, and from about −1 to 2 logits, where no Luri test-takers landed on this item, the Azeri test-takers were more likely to answer it correctly. Table 4 lists ten items with significant DIF at p < 0.05 between the Kurdish and the other two native language subgroups, namely the Persian and the Luri test-takers. As Table 4 shows, items 23 and 54 are advantageous to the Kurdish test-takers; items 4, 50, and 61 favor the Luri test-takers; and items 3, 14, 24, 29, 50, and 66 favor the Persian test-takers. Six of these items (14, 23, 50, 54, 61, and 66) had UDIF magnitudes larger than 0.6 logits.
According to Table 5, comparing the Luri test-takers with the Persian ones, the UDIF analysis identified fifteen test items with significant DIF at p < 0.05: items 17, 23, 28, 38, 61, 63, 64, and 65 favoring the Luri test-takers, and items 3, 6, 7, 32, 36, 48, and 57 favoring the Persian test-takers. Of these fifteen items, ten (3, 17, 23, 28, 36, 48, 61, 63, 64, and 65) had UDIF magnitudes larger than 0.6 logits. Figure 2 displays the results of the native language UDIF analysis of all test items for the 2016 exam. The lines (color figure available online) represent the local item difficulty for the four native language subgroups. The solid line in this figure is the Rasch model curve.

IUUESEE 2017
We then performed the native language UDIF analysis for the 2017 exam. Nineteen items were identified with significant DIF at p < 0.05 between the Azeri and the three other native language subgroups (Table 6): items 5, 14, 33, 35, 39, 43, and 66 were easier for the Azeri participants, and items 3 and 70 functioned in favor of the Persian test-takers. Ten of the DIF items (12, 15, 16, 24, 25, 30, 35, 56, 65, and 66) had UDIF magnitudes larger than 0.6 logits, indicating that these items were more biased than those with magnitudes below 0.6. Our native language UDIF analysis of the 2017 exam also revealed five items with significant DIF at p < 0.05 between the Luri and two other native language subgroups, the Persian and the Kurdish (Table 7). All of these items functioned in favor of the Luri test-takers except item 35, which was easier for the Persian ones. One of these items, item 24, had a UDIF magnitude larger than 0.6 logits, indicating that it was more biased toward the Luri test-takers than the other items.
Comparing the Kurdish test-takers with the other three native language groups in IUUESEE 2017, our analysis found six items with significant DIF at p < 0.05 (Table 8). Items 18, 24, 53, and 66 functioned in favor of the Kurdish subgroup, and items 22 and 39 favored the Persian test-takers over the Kurdish ones. Four items had UDIF magnitudes larger than 0.6, and item 66 had the largest, displaying more UDIF than the other items.
The comprehensive results of the native language UDIF analysis of the IUUESEE 2017 are displayed in Fig. 3.

IUUESEE 2018
Comparing the Kurdish with the Luri and the Persian test-takers in the 2018 exam (Table 9), our analysis revealed nine items with significant DIF at p < .05. Items 3, 30, 36, 64, and 65 were easier for the Kurdish test-takers; items 45, 50, and 58 were easier for the Luri test-takers; and only item 45 functioned in favor of the Persian test-takers. Items 3, 45, 50, 58, 64, and 65 had UDIF magnitudes larger than 0.60 (Table 10).
Six items with significant DIF were found in the comparison of the Luri test-takers with the Persian subgroup. Items 41, 50, 58, and 65 functioned in favor of the Luri subgroup, and items 3 and 4 functioned in favor of the Persian subgroup. The items that were easier for the Luri test-takers had UDIF magnitudes larger than 0.6, showing that they were more biased than items 3 and 4 and that this test version was more biased toward the Luri test-takers than toward the Persian ones (Table 11). Figure 4 illustrates the UDIF analysis of all items of the 2018 exam.
IUUESEE 2019
Comparing the Luri with the Persian test-takers in the 2019 exam, we found twelve items with significant DIF (Table 14). Most of these items, namely items 5, 16, 19, 22, 25, 45, 55, and 70, functioned in favor of the Luri test-takers, and only items 15, 40, 44, and 66 were easier for the Persian test-takers. Items 19, 22, 40, 66, and 70 had magnitudes larger than 0.6. Item 70 had the largest UDIF magnitude, meaning that it was biased toward the Luri test-takers, as was the case with most of the items with the largest magnitudes.

Non-uniform differential item functioning
To conduct a NUDIF analysis of the four versions of the IUUESEE, we segmented the native language groups into high- and low-ability subgroups by partitioning the range of person ability measures at its midpoint and then performed a NUDIF analysis of all test items. WINSTEPS accordingly invoked 7840 NUDIF comparisons for the 280 test items across the eight native language subgroups. When the high-ability and low-ability subgroups were compared, a total of 1730 instances of significant NUDIF at p < 0.05 were revealed (Table 16). IUUESEE 2019 had the largest number of NUDIF cases and IUUESEE 2017 the fewest. The largest number of NUDIF cases related to the low-ability Persian test-takers, and the fewest to the low-ability Luri test-takers. At the group level, the Persian test-takers accounted for the largest number of NUDIF cases (515 instances) and the Luri test-takers for the fewest (299 instances).
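The ability split described above can be sketched as follows. Note that the paper partitions at the midpoint of the ability range (not at the median); the toy logit values are invented for illustration.

```python
def split_by_ability(measures):
    """Split one native language group's person measures into low- and
    high-ability subgroups by partitioning the ability range at its
    midpoint, as described in the text (midpoint of range, not median)."""
    midpoint = (min(measures) + max(measures)) / 2.0
    low = [m for m in measures if m < midpoint]
    high = [m for m in measures if m >= midpoint]
    return low, high

# Toy logit measures for one subgroup; the range midpoint here is -0.2:
measures = [-2.1, -1.4, -0.3, 0.2, 0.9, 1.7]
low, high = split_by_ability(measures)
print(low)   # [-2.1, -1.4, -0.3]
print(high)  # [0.2, 0.9, 1.7]
```

Applying this split to each of the four language groups yields the eight subgroups on which the 7840 pairwise NUDIF comparisons (28 subgroup pairs × 280 items) were run.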

Discussion
This study set out to investigate DIF caused by native language in the Iranian Undergraduate University Entrance Special English Exam (IUUESEE), using the logistic Rasch model. Overall, the results showed that the item format and content of the IUUESEE interact with the native language of test-takers and introduce bias into the evaluation of their performance. The analyses of descriptive statistics, item difficulty measures, fit to the Rasch model, unidimensionality, local independence, and reliability fulfilled the requirements for DIF analysis.
The reliability analysis of the IUUESEE found strong support for the item reliability and separation of the test; however, it cast doubt on the person reliability and separation, based on Linacre (2012). The findings showed that the IUUESEE captured a narrow ability range and probably did not distinguish appropriately between high and low performers (Linacre, 2012). Our DIF and fit results support this finding. On the other hand, the high item reliability and separation coefficients of the IUUESEE indicate that it measured a wide range of difficulty and that our sample was large enough to locate the items accurately on the latent variable (Linacre, 2012).
Our investigation of Pearson correlations supported the local independence of the items, and the dimensionality and PCAR analyses revealed that test-takers' performances are not influenced by off-dimensional components to a considerable extent. The test items did not form distinct patterns or clusters, which supports unidimensionality (Linacre, 2010).
The fit analysis of the four test versions satisfied the preconditions for DIF analysis. Although, in the absence of a single conventional Rasch fit criterion, we could not be certain a priori whether Bond and Fox's (2007) more lenient criterion would operate better than Wright and Linacre's (1994) more rigid one, our findings showed few erratic response patterns across the data under Wright and Linacre's criterion, and their range (0.8-1.2) proved more advantageous than more lenient criteria such as Bond and Fox's (2007) for investigating test-takers' response patterns. Overall, the majority of the items across all test versions, such as items 3 and 12 in test version 2017, showed MNSQ fit indices of 1 or near 1, which suggests an absence of erratic response patterns in the data. However, the IUUESEE also included several misfitting items. Items such as items 7 and 29 in version 2016, items 9 and 10 in version 2017, items 2 and 17 in version 2018, and items 15 and 26 in version 2019 showed MNSQ fit indices below the lower bound and overfit the model, whereas items such as items 15 and 20 in version 2016, items 16 and 18 in version 2017, items 23 and 27 in version 2018, and items 19 and 22 in version 2019 underfit the model to some extent, producing unexpected variance that is likely due to carelessness or guessing (Wright & Linacre, 1994).
The misfitting items of the IUUESEE do not provide test-takers with equal opportunities to demonstrate their language proficiency: they overestimate test-takers who could function worse and underestimate those who could function better, undermining the fairness of the test. For the easiest misfitting items, whose outfit MNSQ values misfit owing to sensitivity to outliers, high-ability test-takers missed these easy items; for the most difficult misfitting items with the same sensitivity, test-takers with lower levels of language proficiency answered them correctly. For instance, items 11 and 12 were the easiest items of test version 2018 (difficulty measures −3.1 and −2.21, respectively), and their outfit MNSQ values misfit (0.67 and 0.75, respectively). Because outfit is sensitive to outliers, this shows that some high-ability test-takers missed these easy items (Bond & Fox, 2007). The outfit MNSQ values of items 28 (difficulty measure = 2.02) and 65 (difficulty measure = 2.63), the most difficult items of test version 2016, were 2.14 and 3.28, respectively, indicating that low-ability test-takers answered these difficult items correctly. This finding indicates that adopting stricter fit criteria in Rasch-based analysis of dichotomous data contributes to the identification of erratic patterns likely attributable to a confounding item-level effect such as DIF (Smith, 1996), which is why we used Wright and Linacre's (1994) 0.8-1.2 range in this study.
The UDIF analysis showed that IUUESEE 2019 contained the largest number of significant UDIF instances (61 cases), which is consistent with the findings of the NUDIF analysis. In most cases (64), items functioned in favor of the Luri test-takers compared with test-takers from the other native language groups, while the Azeri test-takers were favored on the smallest number of items displaying UDIF. The NUDIF analysis revealed that a large number of NUDIF instances occurred in favor of the low-ability Persian, the low-ability Azeri, the high-ability Kurdish, and the high-ability Luri test-takers. Identifying the real sources of observed DIF is often demanding (Camilli & Shepard, 1994; Gierl, 2005), especially in exploratory DIF investigations that lack a priori hypotheses (Jang & Roussos, 2009); however, reviewing the items suggested some explanations. Low-ability and Luri test-takers had correctly answered several difficult misfitting items that their counterparts missed. This assumption is supported by the outfit MNSQ patterns of these items: the difficult items that low-ability and Luri test-takers answered correctly had outfit MNSQ values outside the 0.8-1.2 range, indicating that their performance on these items was unexpected and most likely attributable to successful lucky guesses. Since all items across the test versions were multiple choice with four options, a blind guess has a one-in-four (25%) chance of success.
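The plausibility of lucky guessing can be quantified with a simple binomial model. This is our own illustration, not part of the original analysis; it assumes each blind guess on a four-option item succeeds independently with probability 0.25.

```python
from math import comb

def p_at_least(k, n, p=0.25):
    """Probability of answering at least k of n four-option
    multiple-choice items correctly by blind guessing alone (binomial model)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
```

For example, a test-taker who guesses blindly on four unknown items still has roughly a 26% chance of getting at least two of them right, so scattered successful guesses across a large group of test-takers are unsurprising.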
A closer look at the items revealed that participants had to match a given source text with sentences that paraphrased it, which entailed a higher level of comprehension. In such items, readers need to establish a text-base understanding and keep it in memory to form a situation model that integrates the incoming information with the surface information (Kintsch, 1998). According to Kintsch (1998), this surface information must contain a robust mental representation of the elements of the two passages, which test-takers then match against the mental representation of the correct choice through higher-level cognitive processing. The extent to which a test-taker performs this cognitive processing successfully determines whether the item is answered correctly, and the complexity of the process may encourage guessing among low-ability test-takers who cannot successfully carry out the comprehension process (Xuelian & Aryadoust, 2020).
The presence of wrong answers among the responses of high-ability test-takers supports the assumption that they missed easy items probably because of carelessness, overconfidence, or thoughtless errors (Aryadoust et al., 2011). This assumption is also supported by the fit analysis, which revealed that high-ability test-takers missed the easiest misfitting items, whose outfit MNSQ values misfit due to sensitivity to outliers (extreme values) in the data set.
Socioeconomic status plays a significant role in test-takers' performance (see, e.g., Şirin, 2005; Suleman et al., 2012; Kormos & Kiddle, 2013) and may be another source of DIF across native language groups. One socioeconomic factor that might have influenced item performance in IUUESEE is test-wiseness: familiarity with the test format, acquired through test-takers' educational background, which can affect test performance (Xuelian & Aryadoust, 2020). The NUDIF analysis showed 292 and 300 instances in favor of the low-ability Azeri and the low-ability Persian test-takers respectively, which points to the importance of test-wiseness and its effect on test-takers' performance. Since the Azeri and the Persian test-takers were mostly from high and middle socio-economic areas, they could afford to participate in IUUESEE preparation courses and equip themselves with test-taking strategies to succeed in the test. As Hayes and Read (2004) stated, test-takers with previous exposure to an exam are trained in specific test-taking strategies for responding to test items, which might have helped them answer items that their counterparts could not. This echoes Ryan and Bachman (1992), who stated that language background "is most likely a surrogate for a complex of cultural, societal, and educational differences" (p. 11). Therefore, native language may be a proxy for the factors that cause DIF in items rather than the main source of the DIF itself (Xuelian & Aryadoust, 2020).
The finding that a large number of items displaying UDIF favored the Luri test-takers contrasts with our expectation based on their socioeconomic status, since the majority of Luri test-takers come from low socio-economic areas in Iran (Chalabi & Janadele, 2007). This is supported by the results of the fit analyses (see Tables 17, 18, 19, and 20 in the Appendix) and the general picture of the native language-based UDIF analyses (see Figs. 2, 3, 4, and 5). The figures show that the Luri native language group deviates more from the Rasch model curve than the other groups, reflecting the largest number of erratic response patterns among the Luri test-takers, as the fit analyses also found. Moreover, the figures indicate that the Azeri native language group showed the smallest number of erratic response patterns across the four versions of IUUESEE, which the specific UDIF results supported. In general, we should acknowledge that, in this study, native language is most likely a surrogate for a combination of educational, cultural, and societal differences (Ryan & Bachman, 1992); the reasons for the DIF may derive from several sources simultaneously, and in some contexts they may not be obvious (Schmitt et al., 1993). Finally, we should not ignore the effect of test-takers' native language on their test performance from a developmental, second language acquisition (SLA) perspective. Given the effect of L1 on L2 acquisition, finding DIF as a function of L1 in IUUESEE is not surprising. Most SLA researchers agree that the effect of L1 is greatest at the initial stages of SLA, or at lower levels of L2 proficiency, and is likely to decrease as L2 proficiency increases, leading to greater DIF at lower L2 ability levels and less DIF at higher ability levels (Ryan & Bachman, 1992).
Similarly, Bradlow and Bent (2008) found that the native language effect became less obvious as language proficiency developed and advanced; the findings of the current study resonate with theirs.
Therefore, in IUUESEE, we need to acknowledge that DIF was concentrated among low-ability test-takers, whose native language influenced their test performance because their target language was insufficiently developed. This is in line with another study which found that the dimensionality of L2 exams is a function of test-takers' proficiency levels (Oltman et al., 1988).

Conclusion and future research
This study has provided insight into the interaction between test-takers' native language and their test performance. This interaction became more evident when the native language groups were divided into subgroups in the NUDIF analysis. The UDIF and NUDIF analyses revealed 165 and 1730 instances of significant DIF respectively at the threshold p-value of 0.05 recommended by Linacre (2010). The Luri test-takers were favored on more IUUESEE items than any other native language group. Since all IUUESEE items are in multiple-choice (MC) format, the test appears to have encouraged lucky guesses among the Luri, the low-ability Azeri, and the low-ability Persian test-takers, who had probably practiced test-taking strategies. Closer inspection of the DIF items showed that many of them had long, wordy stems and unappealing distractors. These factors, together with the limited time available to answer the test items, probably led low-ability and Luri test-takers to venture lucky guesses. This was also supported by the fit analyses, which showed erratic response patterns in the data and revealed that low-ability test-takers successfully answered difficult items whose outfit MNSQ values misfit due to sensitivity to outliers. Test-takers' socioeconomic status appeared to be another factor contributing to the DIF results. These findings suggest several implications for the Iranian National University Entrance English Exam. By examining item functioning across different native language groups and subgroups, they cast doubt on the validity of the IUUESEE. This information is useful for stakeholders such as test writers and policymakers, who should be cognizant that some IUUESEE items display DIF among test-takers with different native languages, leading to construct-irrelevant variance.
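The t-test UDIF method compares an item's difficulty as estimated separately within a focal and a reference group. The sketch below is our own illustration of the usual Rasch-Welch style contrast, assuming the group-wise difficulty estimates and their standard errors are available; a large-sample normal approximation is used for the two-sided p-value.

```python
import math
from statistics import NormalDist

def udif_contrast(b_focal, se_focal, b_ref, se_ref, alpha=0.05):
    """Uniform DIF contrast: difference between group-wise item
    difficulties divided by the joint standard error, flagged at alpha."""
    contrast = b_focal - b_ref
    t = contrast / math.sqrt(se_focal**2 + se_ref**2)
    p = 2 * (1 - NormalDist().cdf(abs(t)))  # two-sided, normal approximation
    return contrast, t, p < alpha           # True means significant UDIF
```

A positive contrast means the item is harder for the focal group; applying such a test per item and per group pair yields counts of significant instances like the 165 UDIF cases reported above.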
Deciding whether to keep or eliminate DIF items would entail an examination of the whole test and would depend on the application of the cancellation rule (Borsboom, 2006). The test designers, however, need to inspect the test bank to determine whether it contains similar DIF items and to offer transparent procedures that help item writers avoid systematic problems. From this perspective, the current study offers empirical evidence that can be taken into consideration for improving the design of IUUESEE.
This study has some limitations that should be stated. It is limited in scope because it examined native language as a separate factor leading to DIF. However, as we mentioned, other factors, including socioeconomic status (e.g., test preparation), guessing, overconfidence, thoughtless errors, stem length, time, and unappealing distractors, may also be sources of DIF in IUUESEE. These factors need to be investigated meticulously to help test designers better evaluate their effect on IUUESEE outcomes. Future research also needs to investigate other variables, such as age, content and item type, academic background, and prior exposure to English, which have all been identified as sources of DIF in previous studies (Aryadoust, 2012; Chubbuck et al., 2016; Pae, 2004; Takala & Kaftandjieva, 2000). Recent developments in latent DIF analysis that integrate Rasch measurement with latent class analysis can pave the way for future research and address the complications of DIF research with manifest variables (Benítez et al., 2016; Cohen & Bolt, 2005; Strobl et al., 2015). Following Zumbo's (2007) third generation of DIF, examining socio-cultural and contextual factors that may affect different native language groups' performances differently would be an interesting domain of investigation. Since task types and test content are undoubtedly the main determinants of test-takers' performance, we believe one line of inquiry for continued research would be quantitative analyses across task types and qualitative studies examining the content of the test with a panel of experts, which could shed light on the relationships between item content and DIF.
Another limitation relates to the nature of the exploratory DIF approach on which our study is grounded. Although this approach was able to detect UDIF and NUDIF in some items, it could not clarify the causes of the DIF. Future research should explore ways to perform a confirmatory native language-based DIF study of IUUESEE to unravel the DIF sources (e.g., Gierl, 2005).