Rasch testlet model and bifactor analysis: how do they assess the dimensionality of large-scale Iranian EFL reading comprehension tests?

Rasch testlet and bifactor models are two measurement models that can deal with local item dependency (LID) in assessing the dimensionality of reading comprehension testlets. This study aimed to apply these measurement models to real item response data from Iranian EFL reading comprehension tests and to compare the validity of the bifactor models and the corresponding item parameters with unidimensional and multidimensional Rasch models. The data were collected from the EFL reading comprehension section of the Iranian national university entrance examinations from 2016 to 2018. Several packages of the R system were employed to fit the Rasch unidimensional, multidimensional, and testlet models and the exploratory and confirmatory bifactor models. Then, item parameters were estimated and testlet effects were identified; moreover, goodness of fit indices and the item parameter correlations for the different models were calculated. Results showed that the testlet effects were small but nonnegligible for all of the EFL reading testlets. Moreover, the bifactor models were superior in terms of goodness of fit, whereas the exploratory bifactor model better explained the factor structure of the EFL reading comprehension tests. However, item difficulty parameters were more consistent across the Rasch models than across the bifactor models. This study has substantial implications for methods of dealing with LID and dimensionality in assessing reading comprehension in EFL testing.


Background of study
Linguistic ability is a complex and interrelated attribute that draws on several proficiencies simultaneously (Wainer and Wang, 2000). Students' reading comprehension achievement depends on the accomplishment of several cognitive skills at the word, sentence, and the whole test level. Therefore, recognizing the construct of the reading comprehension test helps to identify progress in English as a foreign language (EFL) testing (Schindler et al., 2018). Construct validity is at the heart of the notion of test validity (Messick, 1989). One important feature of construct validity is test dimensionality, which matters greatly in the development and validation of tests of second language (L2) ability (Dunn & McCray, 2020). Test dimensionality, defined as the minimum number of examinee abilities measured by the test items, is a unifying concept that underlies some of the most essential issues in the development of large-scale tests (Tate, 2002). Advanced measurement techniques are capable of quantifying the construct validity and dimensionality of large-scale assessments (Reder, 1998; Westen & Rosenthal, 2003). Among them, the Rasch testlet and bifactor analysis models (Holzinger and Swineford, 1937; Reise, 2012; Schmid & Leiman, 1957) are advanced methods capable of investigating the construct dimensionality of EFL reading comprehension tests.
Reading comprehension assessments usually consist of several text passages, each followed by some items that are grouped into item bundles (Rosenbaum, 1988). In this context, an item bundle that shares a common reading comprehension passage is a testlet. More generally, testlets may be built around any common stimulus, such as a graph, table, diagram, map, item stem, or scenario.
Local item dependency (LID) is the main property of testlets; it arises when a special dependency exists between items (Wilson, 1988). The testlet-based format is common in EFL reading comprehension testing. For example, large-scale EFL tests such as the Test of English as a Foreign Language (TOEFL) from Educational Testing Service (ETS), the Cambridge exams of the University of Cambridge Local Examinations Syndicate (UCLES), the First Certificate in English (FCE), and the International English Language Testing System (IELTS) all present passages for examinees to read and then answer related questions (Chalhoub-Deville & Turner, 2000).
Standard item response theory (IRT) models may not function properly with English language reading testlets and may lead to small distortions of parameter estimates and to overestimated reliability (Li et al., 2010). This is because they overlook the testlet effects, which differ from test to test (DeMars, 2012). A testlet effect is a random-effect variance induced by the LID of the test. The larger the variance, the larger the effect contained within the testlet (Wainer and Wang, 2000). Testlet response theory (TRT) (Wainer et al., 2007) and the bifactor models were developed to model the testlet effects in this regard. Moreover, multidimensional item response theory (MIRT) models can also be used to address the testlet effect (Baghaei, 2012; Wainer et al., 2007).
Rasch models provide the possibility of fundamental measurement in standard testing (Andrich, 1988). The models have extensive applications in the objective measurement of dichotomous or polytomous variables in the human sciences, in either unidimensional or multidimensional form (Bond & Fox, 2015). Moreover, the use of the Rasch model in language assessment has gradually increased over the past decades (Fan et al., 2019; Aryadoust et al., 2020), while a Rasch version of the TRT model, a special case of the bifactor analysis model, has also been introduced and developed.
Testlets are used extensively in EFL testing, and because their form and nature differ from other forms of testing, they need suitable methods for analyzing test validity and dimensionality. An investigation of the testlet effect in EFL reading comprehension tests is therefore of great importance. Thus, the present study aimed to apply and compare the Rasch testlet and bifactor models on the Iranian large-scale EFL reading comprehension tests, to investigate the testlet effects, and to compare the validity of these models with that of the corresponding unidimensional and multidimensional Rasch models.
Rasch unidimensional, multidimensional and testlet models, and the bifactor analysis
IRT models are now widely used for test scoring in large-scale educational testing. The Rasch model is a variation of the IRT models. In the field of language testing in particular, Rasch measurement has frequently been used in the assessment of reading, writing, speaking, and listening skills (Aryadoust et al., 2020). It is an item analysis model with logistic item characteristic curves of equal slope (Andersen, 1973). Reckase (2009) extended the standard Rasch model to the multidimensional Rasch model, in which the probability of answering an item correctly is a function of more than one latent ability at the same time.
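The two response functions just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the argument names (`theta`, `b`, `loadings`), and the use of 0/1 loadings to mark which dimensions an item measures are choices made here, not taken from the study, which fitted the models with R packages.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    theta: person ability; b: item difficulty (both on the logit scale).
    All items share the same slope, so only theta - b matters.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def multidimensional_rasch_probability(thetas, loadings, b: float) -> float:
    """Compensatory multidimensional Rasch-type response function.

    thetas: abilities of one person on each dimension.
    loadings: assumed 0/1 indicators of the dimensions the item measures.
    b: item difficulty.
    """
    logit = sum(a * t for a, t in zip(loadings, thetas)) - b
    return 1.0 / (1.0 + math.exp(-logit))
```

When ability equals difficulty, the first function returns 0.5; the second reduces to the first when an item loads on a single dimension.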
Local independence of test items is a common assumption of all IRT models. It means that the examinee abilities provide all the information needed to explain performance and that, once the trait level is controlled for, all other influencing factors are random. This assumption is violated when IRT models are applied to tests that include testlets, because responses to pairs of items within a testlet remain correlated even after conditioning on ability.
TRT is a solution for modeling the dependency among test items when local dependency is a problem (Baldwin, 2007). The 3PL TRT model was developed by adding a testlet effect to the Birnbaum (1968) three-parameter logistic (3PL) IRT model. A path diagram for a hypothetical testlet model, like the model presented in this study, is shown in Fig. 1.
In the Rasch testlet model, discrimination parameters (loadings) are constrained to be equal in each testlet (see the Appendix for the technical formulas of the unidimensional, multidimensional, and Rasch testlet models). Another solution for modeling the dependency among test items in testlets, such as those of a reading test, is the bifactor model (Rijmen, 2010). The bifactor model, also known as a hierarchical model (Markon, 2019) or a nested-factor model (Brunner et al., 2012), is a latent structure in which every item loads on a general factor and, in addition, on specific factors. The general factor accounts for the variation that the items have in common, and the specific factor structure is interpreted as in ordinary factor analytic methods (Reise et al., 2010).
Moreover, two major variations, exploratory and confirmatory analysis, are available for structuring the specific factors in the bifactor model (Reise, 2012). A template for the exploratory and confirmatory bifactor models, with a hypothetical general factor (e.g., reading ability) and three specific factors (e.g., decoding, fluency, and vocabulary skills) in each of which three items are nested, is illustrated in Fig. 2.
As shown in Fig. 2, in the exploratory bifactor model, item factor loadings on all of the specific factors are free to be estimated, whereas in the confirmatory bifactor model each item is constrained to load only on its own specific factor. It is worth noting that the Rasch testlet model is actually a confirmatory bifactor analysis model in which the factor loadings on the specific factors are restricted to be proportional to the loadings on the general factor within each testlet (Li et al., 2006). That is, the Rasch testlet model is the Rasch version of a bifactor model (Jiao et al., 2013). Therefore, in addition to part b of Fig. 2, the general structure of Fig. 1 also holds for the confirmatory bifactor analysis model, with the difference that the magnitudes of the factor loadings are freely estimated within each specific factor.
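As a compact reference for the model described in this section, the Rasch testlet model is commonly written as below. The notation is an assumption made here for illustration and may differ from the notation used in the Appendix: $\theta_n$ is the ability of person $n$, $b_i$ the difficulty of item $i$, and $\gamma_{nd(i)}$ the effect of the testlet $d(i)$ that contains item $i$ for person $n$.

```latex
\begin{equation}
P\left(X_{ni}=1 \mid \theta_n, \gamma_{nd(i)}\right)
  = \frac{\exp\left(\theta_n - b_i + \gamma_{nd(i)}\right)}
         {1+\exp\left(\theta_n - b_i + \gamma_{nd(i)}\right)},
\qquad
\gamma_{nd(i)} \sim N\left(0, \sigma^{2}_{\gamma_{d(i)}}\right)
\end{equation}
```

The variance $\sigma^{2}_{\gamma_{d(i)}}$ is the testlet effect for that testlet. When it is zero, $\gamma_{nd(i)}$ vanishes and the expression reduces to the standard unidimensional Rasch model.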

Literature review
Test dimensionality refers to the number of latent variables that are measured by a set of test items. An essentially unidimensional test measures predominantly one latent variable, while a multidimensional test measures more than one latent variable (Mellenbergh, 2019). Dimensionality in language tests is directly related to construct validity (Henning, 1992). An evidence-based process of determining the total scale of a test and its subscales is essential for building an argument for construct validity, and this is exactly what dimensionality analysis provides (Slocum-Gori & Zumbo, 2011). Unidimensional Rasch models have been applied to assess the dimensionality of EFL reading tests for decades (Aryadoust et al., 2020). As part of a more general study on applying unidimensional IRT models to EFL vocabulary and reading tests, Choi and Bachman (1992) applied the one-parameter Rasch model to the reading comprehension sections of the TOEFL and FCE. Using Stout's approach for assessing unidimensionality (Stout, 1987), they showed that the reading tests were not unidimensional at the 5% significance level. They also determined that the unidimensional Rasch model tended to fit the item response data significantly worse than the 2PL and 3PL IRT models. In another study of that type, Boldt (1992) modified the classical Rasch model by estimating the guessing parameter and fixing it to a certain magnitude, and applied it to the TOEFL listening, structure, and reading comprehension sections. The 3PL IRT model was initially expected to be more efficient, but the study reported that the Rasch model predicted a person's success on the items equally well and was competitive with the 3PL IRT model. Moreover, Lee (2004) investigated the local item dependency (LID) of a Korean EFL reading comprehension test using the IRT-based Q3 index (Yen, 1984).
He demonstrated clear evidence of passage-related LID in the 40-item test. Positive LID was found among items within testlets, induced by the passage content. Outside the EFL testing context, Monseur et al. (2011) evaluated the LID of the reading component data of the Programme for International Student Assessment (PISA; see footnote 1) (2000 and 2003) through the IRT-based Yen's Q3 method. They reported a moderate testlet effect; however, global context dependencies were clear for a large number of reading comprehension sections in different countries. The studies reviewed above applied unidimensional IRT models to EFL reading tests, and not all of them went beyond assessing the unidimensionality assumption, although in the meantime multidimensional IRT methods were growing in the field of EFL language testing (e.g., Ackerman, 1992; McKinley & Way, 1992). In more advanced unidimensional Rasch analysis studies, Baghaei and Carstensen (2013) fitted the mixed Rasch model to an EFL high school reading comprehension test consisting of short and long passages with a total of 20 items. They reported that the model fitted the item response data significantly better than the standard Rasch model. The model also indicated that the students were divided into two classes, highly proficient in short and in long passages, respectively. Furthermore, Aryadoust and Zhang (2016) fitted the mixed Rasch model to a large sample of Chinese college students taking an EFL reading comprehension test. They found that 48 out of 50 items of the EFL reading test fitted the mixed Rasch model well. Moreover, the results indicated two distinct latent classes of students, one good at reading in depth and the other better at skimming and scanning text passages. The act of analyzing a measure requires a number of essential assumptions.
The most important among these assumptions is that the construct is unidimensional (Briggs & Wilson, 2003). Although Baghaei and Carstensen (2013) and Aryadoust and Zhang (2016) applied advanced measurement analyses to EFL reading tests, they were not able to reject the multidimensionality assumption of the EFL reading test in any way.
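The Q3 index used in several of the studies reviewed above can be illustrated with a short sketch: it is the correlation, across examinees, between the residuals of two items after the model-expected scores have been removed. The sketch assumes that person abilities and item difficulties have already been estimated under a Rasch model; the function and variable names are hypothetical and the estimation step itself is not shown.

```python
import math


def rasch_probability(theta, b):
    """Expected score on a dichotomous item under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((c - mean_y) ** 2 for c in y)
    return sxy / math.sqrt(sxx * syy)


def q3(responses, thetas, difficulties, i, j):
    """Q3 for items i and j: correlation of their model residuals.

    responses: list of 0/1 response vectors, one per examinee.
    thetas: assumed (pre-estimated) ability of each examinee.
    difficulties: assumed (pre-estimated) difficulty of each item.
    """
    res_i = [row[i] - rasch_probability(t, difficulties[i])
             for row, t in zip(responses, thetas)]
    res_j = [row[j] - rasch_probability(t, difficulties[j])
             for row, t in zip(responses, thetas)]
    return pearson(res_i, res_j)
```

Under local independence, Q3 values for item pairs are expected to be close to zero; clearly positive values for items that share a passage are the kind of evidence of passage-related LID reported in the studies above.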
The multidimensional Rasch model has also been applied to EFL reading tests in a few case studies. Baghaei (2012) applied a compensatory multidimensional item response theory (MIRT) model to an English comprehension test comprising a listening and a reading comprehension test. Each test had two subtests, which measured informational and interactional listening skills and expeditious and careful reading skills, respectively. The goodness of fit of one unidimensional and two multidimensional Rasch models was evaluated by Akaike's information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978). The results generally supported the multidimensional Rasch models with two and four dimensions over the unidimensional Rasch model, and the four-dimensional model encompassing the aforementioned subtests was superior to the two-dimensional model including only the listening and reading dimensions. Moreover, Baghaei and Grotjahn (2014) analyzed an English C-test (see footnote 2) including two spoken and written discourse passages through confirmatory multidimensional Rasch analysis. They compared the results with the standard unidimensional Rasch model and showed that the two-dimensional model was the best fitting confirmatory model for the C-test. Using multidimensional IRT models improves the precision of measurement substantially, especially when the test is short and includes more than one unidimensional test (Wang et al., 2004). However, the performance of multidimensional IRT models in dealing with test multidimensionality and testlet effects, in comparison to counterpart models such as the bifactor models, has not yet been well researched.
1 PISA is a project that is held regularly every 3 years by the OECD (Organization for Economic Cooperation and Development).
2 The C-test is a kind of cloze test consisting of several short passages in which the second half of every second word is deleted.
In comparison to the unidimensional and multidimensional models, the use of TRT models to analyze EFL reading tests is more recent. In a pioneering study, the Rasch testlet model was applied to a Taiwanese high school English test with 44 items and 11 testlets. The likelihood deviance (−2 log-likelihood) of the model was significantly smaller than that of the unidimensional Rasch model, indicating that local dependence existed among items within the testlets. Some fluctuations between empirical and expected response curves were observed because of the large sample size of 5000 students; however, the Rasch testlet model fitted the item response data fairly well. In another study, Baghaei and Ravand (2016) examined the magnitudes of the local dependencies generated by EFL cloze and reading test passages using the 2PL TRT model. They showed that the 2PL TRT model fitted the EFL reading testlet data better than the counterpart 2PL IRT model; however, the testlet effect for the reading test was negligible. Rijmen (2010) also applied three multidimensional models to testlet-based data stemming from an international English test. He showed that the 2PL TRT model performed better than a unidimensional 2PL IRT model; however, the deviance of the bifactor model was better than that of the TRT/second-order model. Moreover, in an advanced two-level study, Ravand (2015) applied the TRT model to assess the testlet effect in a reading comprehension test held for Iranian applicants to English master's programs in state universities. Four testlets were investigated, and half of the testlet effects were negligible according to criteria proposed by either empirical or simulation studies (Glas et al., 2000; Wang et al., 2002; Zhang et al., 2010; Zhang, 2010, b).
It was also found that ignoring local dependence would result in overestimation at the lower and upper bounds of the ability continuum, even if the item difficulty parameters were the same as in the conventional models. TRT models were typically fitted against unidimensional IRT models in studies such as Baghaei and Ravand (2016) and Ravand (2015). Although testlet effects were evaluated in the aforementioned studies, the goodness of fit of the TRT models was not compared with that of multidimensional or bifactor models alongside the conventional IRT models.
Outside EFL testing but in a similar testing context, Eckes and Baghaei (2015) addressed the issue of local dependency of C-tests in a German as a foreign language test using a fully Bayesian 2PL TRT model. As they highlighted, when the local dependency of the C-test is ignored, the reliability of the test is slightly overestimated, whereas the testlet effects of the eight texts of the C-test were entirely negligible. Nevertheless, in contrast to the previous studies, the results showed that the precision of the 2PL TRT model was lower than that of the conventional 2PL IRT model. In another study in this context, Huang and Wang (2013) applied a hierarchical testlet model and a hierarchical IRT model to several ability and non-ability tests, including a Taiwanese EFL reading test, and compared them to the unidimensional 1PL, 2PL, and 3PL IRT models. BIC indices showed that the testlet model fitted significantly better than the other models. They also showed that ignoring the testlet effect led to biased estimation of item parameters, underestimation of factor loadings, and overestimation of test reliability. Kim (2017) employed the TRT model to investigate the testlet effect in an eighth-grade reading comprehension test in the USA. Using a χ² statistic for evaluating local independence, he showed that there were significant dependencies among the test items and judged the testlet effects to be high with respect to the magnitudes of the testlet item dependencies. He also compared the goodness of fit of the TRT model with that of the unidimensional IRT model by AIC and BIC indices in his Ph.D. dissertation, concluding that the TRT model fitted the test better. Jiao et al. (2012) applied an advanced multilevel testlet model for dual local dependence to a state reading comprehension test for high school graduation in the USA.
Using the deviance information criterion (DIC), they found that the proposed model and the multilevel model fitted the response data of four testlets better than the testlet model, whereas the testlet model showed a better DIC than the standard Rasch model. Moreover, the results of the advanced and typical testlet models showed that the estimated testlet variances for all reading testlets were small. Chang and Wang (2010) fitted standard IRT and TRT models to 10 reading testlets of the Progress in International Reading Literacy Study (PIRLS) test held in 2006. As expected, the TRT model fitted the PIRLS (2006) data significantly better than the unidimensional IRT model. They also reported testlet variances ranging from .168 to .489, which indicated negligible to moderate testlet effects in the test. Moreover, Jiao et al. (2013) applied the Rasch testlet model and a three-level one-parameter testlet model to a large-scale K-12 (see footnote 3) reading test battery including six testlet passages from grades 9 to 11. Applying different Bayesian and non-Bayesian estimation methods, they reported that half of the testlet effects were negligible for the reading test passages. Looking at the findings of the above research, it becomes apparent that the testlet model performs better not only in EFL reading comprehension testing but also in other, non-EFL situations of testing reading. Moreover, the testlet effects in non-EFL reading comprehension tests look as random as those in EFL examinations. However, in none of the reviewed studies were the bifactor models employed or compared to the TRT models.
Turning to studies that applied and compared the bifactor model to investigate the dimensionality of EFL reading comprehension tests, Wu and Wei (2014) compared the bifactor model to the IRT and TRT models to investigate the testlet effect in an EFL passage-based test in China. The results showed that there was not a high degree of dependency between the passage items. They also observed that the item difficulty parameters were the same, while the item discriminations were very different. The ability estimates of the bifactor model were also similar to those of the other models; however, a considerable discrepancy was observed in the standard errors of the different models. Byun and Lee (2016) investigated the testlet effect in a Korean EFL reading comprehension test using the MIRT bifactor model. They found that the bifactor model was the best fitting model and that passage topic familiarity was not necessarily related to the factor scores. They also reported that overall topic familiarity was correlated with the general reading ability score of the examinees. Moreover, in a recent study in the field of language testing, Dunn and McCray (2020) applied and compared a range of CFA model structures on data sets from the British Council's Aptis (see footnote 4) English test (O'Sullivan, and Council, B., 2012). Using absolute and relative measures of goodness of fit (Sun, 2005), they showed how the bifactor model outperforms other confirmatory factor analysis measurement models in the field of second language testing. However, in all of the aforementioned studies, only linear confirmatory factor analysis models were applied to language test data and compared with the bifactor model.
In a different testing situation, to explore the utility of the bifactor model, Betts et al. (2011) investigated early literacy and numeracy as measured by the Minneapolis Kindergarten Assessment (MKA) tool in the USA, in order to correct for the anomalies found in the factor structure of the MKA in the previous research literature. The results showed that the bifactor model provides a strong model conceptualization tool and a precise predictive model for later reading and mathematics. In another study in a non-EFL testing context, Foorman et al. (2015) explored the general and specific factors of an oral language and reading test using the bifactor analysis model. The results supported a bifactor model of lexical knowledge of reading rather than a simple three-factor model including decoding fluency, vocabulary, and syntax. Finally, in the study most similar to the present one, Min and He (2014) examined the relative effectiveness of the multidimensional bifactor and testlet response theory models in dealing with local dependency in reading comprehension testlets of the Graduate School Entrance English Exam (GSEEE) in China. They concluded that although the bifactor model fitted the data better than the TRT model, both models produced the same results in terms of item and ability parameters. Moreover, they found that the unidimensional IRT models did not fit the item response data and had a bigger impact on item slopes than on the other item and person parameters. However, the MIRT model and the different exploratory and confirmatory approaches to the bifactor model were not considered for further explanation of the test dimensionality and testlet effects.
As reviewed above, the results on the magnitudes of testlet effects, or even on the existence of such effects, in EFL or non-EFL reading comprehension passages are contradictory. Some studies concluded that the testlet effects were significant (e.g., Kim, 2017) or moderate (e.g., Monseur et al., 2011), whereas other findings indicated that the testlet effects for the reading passages were small (e.g., Jiao et al., 2012; Wu & Wei, 2014), negligible, or even absent for some testlets of the reading comprehension section (e.g., Baghaei & Ravand, 2016; Chang & Wang, 2010; Jiao et al., 2013; Ravand, 2015). Therefore, more research is needed on the random nature of the testlet effects in reading passages, especially in the EFL testing context, to judge whether the effects are inherent in reading testlets or are random effects rooted in the occasion or content of testing.
Other important features of the Rasch testlet model and the bifactor analysis are their goodness of fit and the item and ability parameters they produce, in comparison to the conventional and multidimensional Rasch models. In most past research, TRT models were compared to at least either the unidimensional Rasch model or the 2PL/3PL IRT models (e.g., Byun & Lee, 2016; Chang & Wang, 2010; Huang & Wang, 2013; Jiao et al., 2012; Kim, 2017; Min & He, 2014; Ravand, 2015; Rijmen, 2010; Wu & Wei, 2014). All of the aforementioned studies except one (Baghaei and Aryadoust, 2015) favor TRT models in terms of goodness of fit; however, the item and ability parameter calibrations were not highly consistent across the applied TRT models. Therefore, the correlation of item parameters in the Rasch models needs closer scrutiny, especially with real data sets from large-scale assessments.
Moreover, multidimensional IRT models were also investigated against conventional IRT models in some case studies (e.g., Baghaei, 2012; Baghaei & Grotjahn, 2014; Byun & Lee, 2016). Within the Rasch family of models, the Rasch testlet model has only been compared to the unidimensional Rasch model. Despite the popularity of the Rasch models among language testing specialists, the validity of the Rasch testlet model has not yet been compared with that of the multidimensional Rasch model or the bifactor analysis model using real data sets of EFL reading comprehension tests. Therefore, following some recent research on EFL reading comprehension testing in the Iranian large-scale university entrance examinations (e.g., Geramipour & Shahmirzadi, 2018; Geramipour & Shahmirzadi, 2019; Geramipour, 2020), the present research also aims to introduce a new collection of dimensionality analysis methods to the field of language testing.

Research questions
Thus, as the main purpose of the current study is to apply the Rasch testlet and bifactor models to analyze the LID and dimensionality of the reading comprehension tests and compare them to the unidimensional and multidimensional Rasch models, the questions that guided this study are the following:
1. How large are the testlet effects for the EFL reading comprehension tests?
2. Is there any correlation between item parameters (difficulty indices) of the Rasch testlet and bifactor models with each other and the other unidimensional and multidimensional Rasch models in the EFL reading comprehension tests?
3. Are the Rasch testlet and bifactor models better fitted to the item response data of the EFL reading comprehension tests in comparison to the other unidimensional and multidimensional Rasch models?

Instruments
The EFL reading comprehension section of the Iranian national university entrance examination from 2016 to 2018 was analyzed in this study. The section is composed of 2 reading passages with 10 multiple-choice items and is part of a high-stakes test held annually to admit candidates to Ph.D. programs in English language studies. The test is designed for students with a master's degree who aim to pursue a Ph.D. degree in state universities. It measures knowledge of general English in three sections: grammar (8 items), vocabulary (12 items), and reading comprehension (10 items), the last of which was chosen for the Rasch and bifactor analyses. The 10 reading comprehension items are evenly distributed between the 2 passages.

Participants
The population data of the reading section were provided by the Iranian national organization of educational testing as Excel data files. Samples of 4200 (61.30% men and 38.70% women), 3220 (60.40% men and 39.60% women), and 4500 (67.20% men and 32.80% women) examinees were then randomly selected from the Excel data files of the 2016, 2017, and 2018 examinations, respectively. To do so, one of the two booklets presented to the examinees was randomly selected each year; then, random cluster samples were drawn from the population in proportion to the participants' gender. IRT analysis methods require item response data from large samples of around 1000 examinees or more to yield accurate item and ability parameters (Hambleton, 1989). In addition, there is no shortage of recommendations regarding the sample size needed for factor analysis; absolute sample sizes ranging from 100 to over 1000 examinees are often suggested (Mundfrom et al., 2005). Therefore, the sample size in the present study is a strength.

Data analysis
Different packages of the R system (R Core Team, 2019) were employed for the statistical data analysis. The unidimensional Rasch analysis was done with the ltm package (Rizopoulos, 2006), the multidimensional Rasch models were fitted to the data with the mirt package (Chalmers, 2012), and the Rasch testlet and bifactor models were run with the TAM package and the sirt package (Robitzsch, 2019). Moreover, the unidimensionality of the EFL reading comprehension tests was evaluated by the modified parallel analysis method (Drasgow & Lissak, 1983) with the ltm package in advance of the main analysis.

Results
Before applying the Rasch models and answering the research questions, the unidimensionality of the test data under the unidimensional Rasch models was inferentially analyzed. The results of checking the unidimensionality of the EFL reading comprehension sections using the ltm package of the R software are shown in Table 1. Moreover, Fig. 3 shows the eigenvalue plots derived from the observed and simulated item response data for testing unidimensionality through the modified parallel analysis. As shown in Table 1 and Fig. 3, the significant differences between the observed and Monte Carlo simulated eigenvalues show that the unidimensionality assumption holds for none of the EFL reading comprehension tests. Thus, there is a strong reason to use multidimensional measurement models, including the Rasch testlet and bifactor models, to explain the dimensionality of the EFL reading comprehension tests.

How large are the testlet effects for the EFL reading comprehension tests?
Table 2 shows the testlet effect variances estimated by the Rasch testlet model, using the TAM and sirt packages of the R software, for each EFL reading comprehension testlet from the Iranian 2016 to 2018 examinations. As shown in Table 2, no substantial testlet effect is observed among the testlet passages based on the criterion (values close to 1) proposed by Glas et al. (2000). However, except for one testlet, all of the testlet effects were higher than 0.25 and thus nonnegligible according to the criteria proposed by Glas et al. (2000) and Zhang et al. (2010).
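The decision rule applied to the testlet variances can be written out as a small sketch. The 0.25 cutoff is the value quoted in the text; the function name and the example variances are hypothetical and are not the values reported in Table 2.

```python
def classify_testlet_effect(variance: float, cutoff: float = 0.25) -> str:
    """Label a testlet effect variance using the 0.25 cutoff cited in the text."""
    return "nonnegligible" if variance > cutoff else "negligible"


# Hypothetical variances for two testlets, for illustration only.
example_variances = {"passage_1": 0.31, "passage_2": 0.19}
labels = {name: classify_testlet_effect(v) for name, v in example_variances.items()}
```

With these illustrative inputs, the first passage would be flagged as nonnegligible and the second as negligible.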

Is there any correlation between item parameters (difficulty indices) of the Rasch testlet and bifactor models with each other and the other unidimensional and multidimensional Rasch models in the EFL reading comprehension tests?
The item calibration results (difficulty and discrimination parameters) from the Rasch unidimensional, multidimensional, and testlet models and the exploratory and confirmatory bifactor models, obtained with the aforementioned R packages, are summarized in Table 3. As shown in Table 3, the item difficulty parameters show a very high (almost perfect) positive level of consistency across the different Rasch models in the 3 years of examinations. Separate matrix scatter plots (a, b, and c) of the item difficulty parameters (b) under the unidimensional Rasch model (IRT), the multidimensional Rasch model (MIRT), the Rasch testlet model, and the confirmatory and exploratory bifactor models for the 3 years of examinations are depicted in Fig. 4. As seen in parts a, b, and c of Fig. 4, the item difficulty parameters in the applied Rasch models are very highly correlated, with correlation coefficients ranging from ρ = 0.99 to ρ = 1. It is worth noting that the item difficulties of the Rasch testlet model are slightly more strongly correlated with those of the unidimensional Rasch model (ρ = 1) than with those of the multidimensional IRT model (ρ = 0.99). However, the item difficulty correlations for the bifactor models were not as close to perfect as those for the Rasch models.
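The consistency check described above can be illustrated with a minimal sketch. It assumes a Pearson coefficient (the text does not state which coefficient underlies ρ), and the difficulty values below are hypothetical placeholders, not estimates from this study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of item difficulties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Cross-product and sums of squares of the centered values
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical difficulty estimates (in logits) for five items under two models.
b_rasch = [-1.20, -0.40, 0.10, 0.75, 1.60]
b_testlet = [-1.25, -0.42, 0.12, 0.80, 1.68]

r = pearson(b_rasch, b_testlet)
print(round(r, 3))
```

In practice the two vectors would be the item difficulty columns extracted from the fitted models, and the same function could be applied to every pair of models to build the matrix plotted in Fig. 4.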
Is the Rasch testlet model better fitted to the item response data of the EFL reading comprehension tests in comparison to the other unidimensional and multidimensional Rasch models and the bifactor analysis model?
Finally, to answer the last research question, the goodness of fit of the applied models to the EFL reading comprehension tests was evaluated by the log-likelihood statistic (Edwards, 1972), Akaike's information criterion (AIC), and the Bayesian information criterion (BIC). For the indices reported in Table 4, the lower the value, the better the Rasch and bifactor models fit the item response data. The results show that the bifactor models are the best-fitting models for the EFL item response data, and that the confirmatory bifactor models fit better than the exploratory models. It is worth noting, however, that the factor loading patterns in the exploratory bifactor models are more interpretable than those in the confirmatory bifactor models. The Rasch TRT model also fits the EFL reading comprehension tests consistently better than the Rasch unidimensional IRT and MIRT models. Surprisingly, the Rasch MIRT model does not necessarily fit the data better than the conventional Rasch model. Therefore, in terms of goodness of fit to the Iranian EFL reading comprehension data, the Rasch models rank from best to worst as TRT, unidimensional IRT, and MIRT.
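As a rough illustration of how these indices relate to one another, the sketch below computes the deviance (−2 × log-likelihood), AIC, and BIC from a model's log-likelihood. It assumes the standard definitions AIC = −2LL + 2k and BIC = −2LL + k ln(n). The log-likelihoods, parameter counts, and sample size are hypothetical and are not taken from Table 4.

```python
from math import log


def information_criteria(log_lik, n_params, n_persons):
    """Return deviance, AIC, and BIC; lower values indicate better fit."""
    deviance = -2.0 * log_lik
    aic = deviance + 2 * n_params
    bic = deviance + n_params * log(n_persons)
    return {"deviance": deviance, "AIC": aic, "BIC": bic}


# Hypothetical fits: (log-likelihood, number of parameters, number of examinees)
fits = {
    "Rasch": information_criteria(-10500.0, 21, 1000),
    "Rasch testlet": information_criteria(-10450.0, 24, 1000),
}

# The model with the smallest BIC is preferred under this criterion.
best_by_bic = min(fits, key=lambda name: fits[name]["BIC"])
print(best_by_bic)
```

Because BIC penalizes each extra parameter by ln(n) rather than 2, it favors the simpler model more strongly than AIC in large samples, which is why the two criteria can occasionally disagree.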

Discussion
The techniques of testing EFL reading comprehension are largely related to what we mean by reading comprehension. Reading comprehension is the ability to read and understand a given context quickly, which requires techniques of skimming and scanning the context, vocabulary recognition, comprehending questions, and giving grammatically correct and comprehensive responses to the related questions about the context (Henning, 1975). For this purpose, passages/testlets are employed on almost every occasion for testing reading comprehension. Because of the LID within the testlets, testlet-based tests usually violate the assumption of unidimensionality required by conventional IRT analysis, an assumption that is very difficult to satisfy (Lee et al., 2001). TRT and bifactor models have dealt with the LID problem of the testlet reading passages and have targeted the dimensionality assumption in several theoretical and empirical studies (e.g., Baghaei & Ravand, 2016; Dunn & McCray, 2020; Kim, 2017; Lee, 2004; Min & He, 2014; Morin et al., 2020; Ravand, 2015; Rijmen, 2010; Wilson & Gochyyev, 2020). However, the magnitude of the testlet effect in reading comprehension passages is still unclear, and it is not known whether the effect is always non-negligible. Moreover, owing to the popularity of the unidimensional and multidimensional Rasch and bifactor models in language testing, they have been extensively applied to testlet-based reading comprehension passages (e.g., Aryadoust et al., 2020; Aryadoust & Zhang, 2016; Baghaei, 2012; Baghaei & Grotjahn, 2014; Baghaei & Carstensen, 2013; Boldt, 1992; Choi & Bachman, 1992; Jiao et al., 2013). Nevertheless, the exploratory and confirmatory bifactor models and the Rasch testlet model have not yet been systematically compared with the unidimensional and multidimensional Rasch models using real data from EFL reading comprehension testlets. Thus, the present research intended to fill this gap by applying the bifactor and Rasch testlet models and comparing them with their unidimensional and multidimensional Rasch counterparts in the Iranian EFL testing context.

How large are the testlet effects for the EFL reading comprehension tests?
Addressing the first research question, about the magnitude of the testlet effects in the EFL reading comprehension passages, the results showed that the testlet effects were non-negligible. However, with the exception of one testlet out of the nine EFL reading comprehension passages, all of the testlet effects were small and non-substantial. DeMars (2012) believes that testlets might increase the authenticity of the reading task because they add more context to the test. However, modeling negligible testlet effects makes the model unnecessarily complicated, risks capitalization on chance, and increases the error in parameter estimates. Conversely, ignoring non-negligible testlet effects leads to overestimation of classical reliability (Gessaroli & Folske, 2002; Li et al., 2010; Sireci et al., 1991; Wainer, 1995; Wainer and Wang, 2000), and estimates of item discrimination parameters may also be distorted (Wainer and Wang, 2000).
There is no accepted rule of thumb for evaluating the magnitudes of testlet effects, especially in the field of EFL language testing (Baghaei & Ravand, 2016). Nonetheless, Glas et al. (2000), Luo and Wolf (2019), Zhang (2010), and Zhang et al. (2010) considered testlet variances below 0.25 to be negligible, whereas testlet variances above 0.50 were regarded as substantial by Luo and Wolf (2019), Wang et al. (2002, b), and Zhang (2010) in several empirical studies. Therefore, in terms of magnitude, most of the testlet effects in the EFL reading comprehension tests in this study were neither substantial nor negligible. The results of this research are largely consistent with the findings of Wu and Wei (2014), who reported that the testlet effects were small but significant for Chinese passage-based language tests. Nevertheless, the results contrast sharply with the findings of Kim (2017) in terms of the magnitudes of item dependencies within testlets.
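The cut-offs cited above can be expressed as a simple decision rule. The sketch below is only an illustration of that rule: the label for the middle band and the handling of values exactly equal to 0.25 or 0.50 are assumptions, and the testlet variances are hypothetical rather than the values in Table 2.

```python
NEGLIGIBLE_BELOW = 0.25    # variances below this are treated as negligible
SUBSTANTIAL_ABOVE = 0.50   # variances above this are treated as substantial


def classify_testlet_variance(variance):
    """Label a testlet effect variance using the 0.25 and 0.50 cut-offs."""
    if variance < 0:
        raise ValueError("a variance cannot be negative")
    if variance < NEGLIGIBLE_BELOW:
        return "negligible"
    if variance > SUBSTANTIAL_ABOVE:
        return "substantial"
    # Assumption: boundary values fall into the middle band.
    return "small but non-negligible"


# Hypothetical variances for three passages.
hypothetical_variances = {"passage_A": 0.18, "passage_B": 0.31, "passage_C": 0.62}
labels = {name: classify_testlet_variance(v)
          for name, v in hypothetical_variances.items()}
print(labels)
```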
DeMars (2012) considers the testlet effect a random nuisance factor that is not of interest in itself. She also believes that two general conceptions, LID and multidimensionality, may be used to model this random nuisance factor, and that content specialists may waste time speculating on why some items load more than others within testlets. However, it is still not clear how random these effects are and whether or not they are context-dependent. Overall, considering the research literature and the results of this study among Iranian EFL students, it seems that the magnitudes of testlet effects are generally random and not dependent upon the context, at least in reading comprehension testing in large-scale Iranian university entrance examinations.
Is there any correlation between item parameters (difficulty indices) of the Rasch testlet and bifactor models with each other and the other unidimensional and multidimensional Rasch models in the EFL reading comprehension tests?
Turning to the second research question, the item parameters of the Rasch models were more consistent than those of the bifactor models in analyzing the EFL item response data. In particular, the item difficulty parameters of the Rasch testlet model were more similar to those of the unidimensional Rasch model than to those of the multidimensional Rasch model. That is, because most of the testlet effects were not substantial, the lower testlet effects yielded item difficulty parameters closer to those of the standard Rasch model. If the testlet effects were zero, the item parameters in the Rasch testlet model would be exactly equal to those of the unidimensional Rasch model. However, the item difficulty parameters in the bifactor models were not as perfectly correlated as those in the Rasch models, although the magnitudes of the correlations were still high. This is probably because the confirmatory bifactor model is a restricted model in which item parameters are distorted in the process of estimation, whereas this problem does not occur in the exploratory version of bifactor analysis (Reise et al., 2010). Thus, lower correlations are to be expected between the item difficulty parameters of the exploratory and confirmatory bifactor models than among those of the Rasch models.
The results of Baghaei and Aryadoust (2015), Ravand (2015), and Wu and Wei (2014) are broadly in line with the results of this study, as they showed that item difficulty parameters were almost the same across the IRT and TRT models. However, the findings of this study contradict the results of Min and He (2014), who showed that item and ability parameters were the same in both the TRT and bifactor models. Moreover, Wainer and Wang (2000) showed that the estimates of item difficulty parameters were not affected by LID in a setting where testlet effects were mostly negligible within 50 EFL reading comprehension testlets. However, more research, including simulation studies, needs to be done before further inferences can be drawn about the behavior of the parameters under different experimental conditions.
Are the Rasch testlet and bifactor models better fitted to the item response data of the EFL reading comprehension tests in comparison to the other unidimensional and multidimensional Rasch models?
Finally, to answer the third research question, the goodness-of-fit comparisons showed that the bifactor models fitted the EFL item response data better than the Rasch models. Bifactor models use the marginal maximum likelihood (MML) estimation method to estimate item parameters, which permits conditional dependence within subsets of items and provides more parsimonious factor solutions (Gibbons et al., 2007). The Rasch testlet model, as a special variation of the bifactor model (Li et al., 2006), also outperformed the unidimensional and multidimensional Rasch models. Earlier work has shown that item and person parameters, as well as testlet effects, can be recovered more accurately in the Rasch testlet model under several experimental conditions. However, not all multidimensional models fitted better than the unidimensional Rasch model in this study, because the multidimensional Rasch model did not fit the EFL test data as well as the conventional Rasch model. This result is in stark contrast to the findings of Baghaei (2012) and Baghaei and Grotjahn (2014), who showed the superiority of multidimensional models in assessing the dimensionality of listening, reading, and C-tests. LID is related to the dependence among the response functions of items within a testlet, and this type of dependency cannot be captured by the MIRT model (Andrich & Kreiner, 2010). This may therefore be one reason why the MIRT model did not perform even as well as the unidimensional Rasch model in terms of goodness of fit.
The results on the superior fit of the bifactor model over the other IRT and TRT models confirm the findings of Byun and Lee (2016) and Foorman et al. (2015). However, the results on the goodness of fit of the multidimensional model versus the unidimensional Rasch model contradict those of Baghaei (2012) and Baghaei and Grotjahn (2014). Nonetheless, the results on the goodness of fit of the Rasch testlet models agree with the findings of Baghaei and Ravand (2016), Chang and Wang (2010), Eckes and Baghaei (2015), Huang and Wang (2013), Kim (2017), and Rijmen (2010).
In addition, the confirmatory bifactor models fitted slightly better than the exploratory bifactor models, but the factor loading patterns in the exploratory bifactor models were more interpretable than those in the confirmatory bifactor models for all of the EFL item response data. Exploratory analyses allow the factor structure to be analyzed directly and modeling problems to be identified, without the restrictions of confirmatory factor analysis methods, under which problems are detected only after the models have been fitted (Browne, 2001). Moreover, in confirmatory bifactor analysis, the item and ability parameters may be distorted, because small cross-loadings are forced to zero and items with significant cross-loadings are accommodated on group factors (Finch, 2011). Reise (2012) insisted on inspecting the test dimensionality through an exploratory bifactor analysis prior to running the confirmatory model. This may be why some researchers choose the exploratory version of the bifactor model to explore the underlying factor structure of psychological measures (e.g., Reise, 2012; Reise et al., 2010). Altogether, the performance of the bifactor models looks very promising, and the approach is still evolving: some researchers have recently begun to blend the bifactor model with structural equation modeling methods (Morin et al., 2020).

Conclusion
The Iranian Ph.D. entrance examination is a national high-stakes test in Iran taken by more than one hundred thousand examinees every year, which makes it an influential test affecting a large number of master's degree holders hoping to pursue Ph.D. education at Iranian universities. To the best of the author's knowledge, this research is the only study to date that applies the Rasch testlet model and bifactor analysis to the high-stakes Iranian university entrance examinations. On the one hand, this study was a successful application of the Rasch testlet and bifactor models to high-profile EFL reading comprehension tests, and these models did not suffer from some of the shortcomings and limitations pointed out by studies deploying unidimensional IRT models to study EFL reading testlets (e.g., Choi & Bachman, 1992; Li et al., 2010). On the other hand, apart from identifying the testlet effects underlying the EFL reading comprehension tests, the study also benefited from newly introduced methods of assessing dimensionality. The findings, which were in many respects new to the field of language testing, might call for a reevaluation of (the construct validity of) high-stakes examinations in light of more stringent alternative methodologies and models.
DeMars (2012) believes that testlet effects are so random and erratic that subject matter specialists should not speculate on the sources of the relationship between the factors and items. Nevertheless, there is still room for qualitative in-depth content analysis of EFL reading tests to find the different sources of LID. Quantitative research designs can also be used to investigate possible relationships between the types of reading comprehension content and the magnitudes of testlet effects among EFL learners.
Moreover, there are also promising measurement models, namely the random block item response theory model (Lee & Smith, 2020) and the composite model (Wilson & Gochyyev, 2020), which have recently been introduced to researchers and practitioners to deal with LID and test dimensionality. The random block item response theory model is statistically equivalent to the Rasch testlet model but allows for a more complex and informative model that includes covariates such as gender and age (Lee & Smith, 2020). The composite model consists of two simultaneous parts: a multidimensional model for the subtests, and a predictive model for a composite of the latent variables based on each subtest. The composite model is claimed to have certain advantages over unidimensional, multidimensional, and hierarchical measurement models, including the TRT and bifactor models (Wilson & Gochyyev, 2020). These measurement models may be considered for future application and comparison in handling testlet effects and test dimensionality analysis, especially in the field of language testing and assessment.
The main limitation of the current study is that its findings are restricted to a limited number of EFL reading comprehension sections of the Iranian national university entrance examinations. Hence, for greater generalizability, future research may also adopt simulation studies (Morris et al., 2019) to investigate the prospects of the Rasch testlet model and bifactor analysis in analyzing item response data. In addition, it is recommended that the proposed measurement models, especially the bifactor models, be employed to investigate the dimensionality of other sections of high-stakes language tests in Iranian or other Asian EFL testing contexts. Finally, I humbly encourage applied linguists and language testing practitioners and learners to benefit from the implications and applications of the Rasch testlet and bifactor models in testing reading.
The unidimensional Rasch model is shown in Formula 1:

$$p(y_{ij} = 1) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)} \tag{1}$$

where $y_{ij}$ is the score of examinee $i$ on item $j$, $p(y_{ij} = 1)$ is the probability that examinee $i$ answers item $j$ of a test correctly, $\theta_i$ denotes the examinee ability, and $b_j$ is the item difficulty. The multidimensional Rasch model is shown in Formula 2:

$$p(y_{ij} = 1) = \frac{\exp(\theta_{i1} + \theta_{i2} + \dots + \theta_{ik} - b_j)}{1 + \exp(\theta_{i1} + \theta_{i2} + \dots + \theta_{ik} - b_j)} \tag{2}$$

where $\theta_{i1}, \theta_{i2}, \dots, \theta_{ik}$ represent the $k$ latent traits or abilities of examinee $i$, so that in this model the probability of answering an item correctly is a function of more than one latent trait. A three-parameter testlet response theory (TRT) model is formulated as follows:

$$p(y_{ij} = 1) = c_j + (1 - c_j)\,\frac{\exp\left[a_j(\theta_i - b_j - \gamma_{id(j)})\right]}{1 + \exp\left[a_j(\theta_i - b_j - \gamma_{id(j)})\right]} \tag{3}$$

where $a_j$ is the discrimination (slope) of item $j$, $c_j$ is the pseudo-chance level parameter of item $j$, and $\gamma_{id(j)}$ is the testlet parameter, that is, the effect of testlet $d(j)$ on examinee $i$. The testlet effect captures the within-testlet covariation in the TRT model. In this vein, the two-parameter logistic (2PL) TRT model and the Rasch testlet model are special cases of the 3PL TRT model: in the 2PL TRT model $c_j = 0$, and in the Rasch testlet model, in addition to setting $c_j$ to zero, $a_j$ is held fixed across items (here at 1). Thus, in this context, the Rasch testlet model reduces to Formula 4, with the same elements as above:

$$p(y_{ij} = 1) = \frac{\exp(\theta_i - b_j - \gamma_{id(j)})}{1 + \exp(\theta_i - b_j - \gamma_{id(j)})} \tag{4}$$

Moreover, taking the log-odds of answering an item correctly versus incorrectly and doing a little algebra on Eq. (4), the simpler Rasch testlet equation is obtained as

$$\log\frac{p(y_{ij} = 1)}{p(y_{ij} = 0)} = \theta_i - b_j - \gamma_{id(j)}$$

If there is no testlet effect, $\gamma_{id(j)} = 0$, the Rasch testlet model is equal to the standard Rasch model (Rasch, 1960).
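The reduction of the Rasch testlet model to the standard Rasch model when the testlet effect is zero can be checked numerically with a short sketch. The sign convention below (the testlet effect is subtracted from ability, as in the common 3PL TRT parameterization) is an assumption, since some presentations add the effect instead. The function names and the ability, difficulty, and testlet values are hypothetical.

```python
from math import exp, log


def rasch_prob(theta, b):
    """Standard Rasch probability of a correct response."""
    return 1.0 / (1.0 + exp(-(theta - b)))


def rasch_testlet_prob(theta, b, gamma):
    """Rasch testlet probability; gamma is the person-specific testlet effect.

    Assumption: gamma enters the logit with a minus sign.
    """
    return 1.0 / (1.0 + exp(-(theta - b - gamma)))


def log_odds(p):
    """Log-odds of a correct versus an incorrect response."""
    return log(p / (1.0 - p))


theta, b = 0.5, -0.2  # hypothetical ability and item difficulty (logits)

# With gamma = 0 the two models give the same probability.
same = rasch_testlet_prob(theta, b, 0.0) == rasch_prob(theta, b)
print(same)

# The log-odds recover theta - b - gamma for a nonzero testlet effect.
print(round(log_odds(rasch_testlet_prob(theta, b, 0.4)), 6))
```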