A comparability study between the General English Proficiency Test-Advanced and the Internet-Based Test of English as a Foreign Language

Methods: Data were collected from 184 test takers, 92 in Taiwan and 92 in the USA. Three analyses were conducted: first, a content analysis of the GEPT-A and iBT reading passages; second, a task analysis of the construct coverage, scope, and task formats used in the reading sections; and third, an analysis of test performance (scores) on the two tests.


Foreword
We have great pleasure in publishing this report: LTTC-GEPT Research Reports RG-06. The study described in this report was funded by the 2013-2014 LTTC-GEPT Research Grants. Headed by Professor Antony John Kunnan of California State University, Los Angeles, USA, the study investigated the comparability of two English language proficiency tests: the GEPT Advanced and the Internet-Based Test of English as a Foreign Language (iBT TOEFL). The study provides validity and reliability evidence for the GEPT Advanced and relates to the concept of portability of tests in the use of the Common European Framework of Reference (CEFR).
The GEPT, developed more than a decade ago by the LTTC to serve as a fair and reliable testing system for EFL learners, has gained wide recognition in Taiwan and abroad, and has generated positive washback effects on English education in Taiwan. As the GEPT has reached out to the international academic community with remarkable success over the years, numerous studies and research projects on GEPT-related subjects have been conducted and published as technical monographs, conference papers, and refereed articles in books and journals. In view of the growing scholarly attention to the GEPT, and in order to assist external researchers in conducting quality research on topics related to the test, the LTTC has set up the LTTC-GEPT Research Grants Program, which offers funding to outstanding research projects.
The annual call for research proposals is publicized every October, attracting proposals from all over the world. A review board, which comprises scholars and experts in English language teaching and testing from Taiwan and abroad, evaluates the research proposals in terms of the following criteria:

Introduction
This study was an investigation into the comparability of two language tests, the General English Proficiency Test-Advanced (GEPT-A hereafter) and the Internet-Based TOEFL (iBT hereafter). Such a study is meaningful because the two tests have a similar purpose and a similar test taker base: both are used for admission and selection purposes in undergraduate and graduate programs at U.S. and Canadian English-medium universities. Obviously, this superficial similarity does not necessarily mean that the tests are comparable. This study therefore investigated in depth the degree to which the two tests are in fact comparable, and the degree to which they measure similar language abilities (with a focus on reading and writing). The findings from the study relate to the concept of portability of tests, listed as one of the key components in the use of the Common European Framework of Reference (CEFR). Portability in the CEFR context refers to the use of a particular test in lieu of another when both tests are available (Kunnan, 2012).

The Cambridge-TOEFL Comparability Study
The most widely-known comparability study of language tests was conducted in 1987-90 and published in 1995 by Lyle Bachman and colleagues. It investigated the comparability of the First Certificate in English (FCE), administered by the University of Cambridge Local Examinations Syndicate (now Cambridge ESOL), and the paper-and-pencil version of the Test of English as a Foreign Language (TOEFL), administered by Educational Testing Service, Princeton.
The research steps used were two-fold: a qualitative content analysis of the items, tasks, and prompts, and a quantitative analysis of the subjects' test performance. The most important aspect of the study was the care taken in choosing the test instruments, test samples (in terms of test taker characteristics and test takers' score norms), and administrative and scoring procedures in such a way that they truly represented the two testing practices.
The instruments used were authentic FCE tests and institutional (retired) TOEFL and SPEAK, the retired form of The Test of Spoken English. But as the institutional version of the Test of Written English was not available, a similar test was developed by the study researchers. Sampling procedures included selecting subjects that represented the characteristics of the test takers of the two tests. In addition, an examination of the descriptive statistics demonstrated that the means and standard deviations of the study subjects and the test norms of world-wide test taker groups for the two tests were only two points apart and were not practically different. Further, the administration and scoring procedures mirrored the procedures used by the test administrators and raters of the two tests. With these strict measures in place, it was possible to make conclusions regarding comparability based on the test content analyses, the test performance analyses, and the correlational analyses.
The qualitative content analysis of the two tests was conducted by expert judges who used the Communicative Language Ability instrument developed for the study. It was concluded that, in general, there were more similarities between the two tests than there were differences. The quantitative statistical analysis of the two tests was conducted by analyzing the test performances of the study subjects. The procedures included descriptive statistics, reliability analyses, correlational analyses, and factor structure analyses for each of the individual tests and then across the two tests. The study concluded that, as the same higher-order factor structure was supported individually for the two tests and across the two tests, the two tests generally measured similar language abilities.

Other Comparability Studies
The content analyses conducted in Bachman et al. (1995) resulted in the development and use of the Communicative Language Ability (CLA) and Test Method Facet (TMF) frameworks. This part of the study was published in Bachman, Davidson, and Milanovic (1996). It outlined the need for the framework and a systematic procedure to analyze test content -the linguistic characteristics, test format characteristics and the communicative language abilities that were tested in the test items, tasks and prompts. Another study, Kunnan (1995), from the same dataset focused on test taker characteristics and test performance on the two tests, the FCE and the TOEFL. This study conducted exploratory and confirmatory factor analyses and structural equation modeling on the test performance and it once again showed that the two tests measured similar language abilities.
Another type of study, Choi, Kim, and Boo (2003), compared two versions of the same test: a paper-based and a computer-based version of the Test of English Proficiency developed at Seoul National University. The findings, based on content analysis using corpus linguistic techniques, supported the notion that the two versions of the test are comparable. Along similar lines, a study investigated the comparability of conventional and computerized tests of reading in a second language (Sawaki, 2001). This study was a comprehensive review of literature on cognitive ability, ergonomics, education, psychology, and L1 reading. It did not draw clear conclusions or generalizations about computerized language assessments because of the range of characteristics involved, such as administrative conditions, computer requirements, test completion time, and test taker effects. Similarly, yet another study investigated the comparability of direct and semi-direct speaking test versions (O'Loughlin, 1997). In general, the author concluded that the live and tape-based versions of the oral interaction sub-tests cannot be substituted for each other.
A different approach to investigating equivalence, focusing on the psychometric equivalence between tests, was taken by Geranpayeh (1994). The study investigated the comparability of scores of subjects who took the TOEFL and the International English Language Testing System (IELTS), and found high to moderate correlations between the TOEFL and IELTS scores. The focus of the study was score equivalence: what did a 600 on the TOEFL mean on the IELTS? Along the same lines, Bridgeman and Cooper (1998) conducted a study to investigate the comparability of scores from hand-written and word-processed essays. Results indicated that scores were higher on the hand-written essays than on the word-processed essays. Other investigations include Weir and Wu (2006), who investigated task comparability in semi-direct speaking tests across three forms of the GEPT-Intermediate, and Stansfield and Kenyon (1992), who investigated the comparability of the oral proficiency interview and the simulated oral proficiency interview. Hawkey (2009) reported further related work. Other related procedures have included developing concordance tables comparing scores on two or more tests. These procedures use only test scores in developing the tables and are therefore not rigorous; they do not provide sufficient evidence for the comparability of tests.

Summary
The two Bachman et al. studies (1995, 1996) obviously have direct bearing on the present study, as they investigated the comparability of two different tests. The other studies reviewed were investigations of different versions of the same test. In addition, they generally focused only on score equivalence, which is insufficient for reaching conclusions about the comparability of tests. Therefore, this study followed principles from the Bachman studies (carefully choosing test instruments, test samples in terms of test taker characteristics, and administrative and scoring procedures in such a way that they truly represent the testing practices of the two tests being compared) and conducted both a test content analysis and a test performance analysis.

Research Questions
Based on the previous work done in this area and the goals of this project, the following research questions were proposed for this study:
1. What is the content of the reading and writing tests of the GEPT-A and the iBT (the test passages, items, and tasks)?
2. What is the test performance of the study participants on the reading and writing tests of the GEPT-A at the total test score and at the item level, and on the iBT at the total test score level?
3. What is the comparability of the reading and writing tests of the GEPT-A and iBT?

Participants
The participants in this study consisted of 184 test takers from two groups, one recruited in Taiwan and the other recruited in the United States. The Taiwan sample included 118 participants recruited by LTTC staff, but only 92 could be used in the study because overall scores and demographic data were not available for the other 26 (i.e., there were no test or survey responses for those cases).
The U.S. sample consisted of 92 international students from three American universities: California State University, Fullerton (n = 28); Indiana University, Bloomington (n = 50); and the University of Texas at San Antonio (n = 14). All participants had previously taken the iBT. The U.S. participants were recruited by faculty on the three campuses through a combination of e-mail announcements, flyers, and word-of-mouth recruitment. As an incentive, U.S. participants received a $50 gift card to a coffee shop chain or to their university bookstore after returning their completed tests and surveys and copies of their iBT score reports.

Instruments
GEPT -Advanced Reading and Writing. The primary instrument for this study was one form of the GEPT-A Reading Comprehension Test (Form AR-1101) and Advanced Level Writing Test. The reading test consisted of two sections. The first section, Careful Reading, included 4 passages and 20 items, including multiple-choice, short answer, and matching items. The expected responses for the short answer questions varied from one word to one or two sentences. The second section, Skimming and Scanning, included 3 passages (with the last passage consisting of 3 shorter sub-passages) and 20 items. The items included both matching and fixed-frame multiple choice (i.e., the options were the same for all multiple-choice questions).
The writing test consisted of two tasks: Task 1 required test takers to read two essays and write an essay in response. Task 2 required students to write a letter to the editor reacting to the information contained in two graphs. Only scores from Task 1 were used in the study.
Test takers survey. The study participants took a brief survey after taking the GEPT-A. The survey asked several background information questions, inquired about test takers' iBT test dates and scores on the four test sections, and included six Likert-scale items about the GEPT-A. The items asked test takers about the difficulty of the GEPT-A reading and writing sections, and about the relevance of the test content and test tasks to their academic studies.
Internet-Based TOEFL (iBT). Three practice iBT reading and writing tests were used for the content comparison of the GEPT-A and iBT. These iBT test forms were taken from The Official Guide to the TOEFL Test (Educational Testing Service, 2012). These forms should be highly comparable to the current operational iBT. Each reading test featured 3 passages and 38-42 items, including multiple-choice, a limited number of multiple response multiple choice (MRMC) questions (selected response items in which test takers must select two or more correct options per item), and one categorization item. The writing tests each contained one integrated writing task and one independent writing task. The integrated writing tasks required test takers to read a short text and listen to a short recording before writing an essay. The independent writing tasks required test takers to write an essay in response to a brief prompt. Table 1 lists the passages used in the study.
The study participants were also required to submit copies of their iBT score reports in order to avoid problems with inaccurately remembering their scores.

Instrument Administration
Standard procedures were used in the administration of the GEPT-A reading test for the Taiwan-based sample. In addition, score report data from the iBT was obtained from all participants. For the U.S.-based sample, the test and survey were e-mailed to the participants as a Microsoft Word document. Participants took the test at home or in a computer lab, with the same time limit as for the regular GEPT-A. One unavoidable difference in terms of time limitations between the two groups was that the U.S. participants were given an overall time limit, while test takers in Taiwan had separate time limits for each section of the test (Careful Reading, Skimming and Scanning, and Writing). Responses were typed directly into the Word document, which was then e-mailed back to the research team. The survey was included in the same electronic document as the test.

Scoring
The tests taken in Taiwan were scored by LTTC using standard procedures. Responses to the reading tests taken in the U.S. were scored by the researchers using the key and scoring guide provided by LTTC; the resulting scores were judged to be highly consistent with those that would have been awarded by LTTC itself. The writing test essays written by U.S.-based participants were scored by LTTC raters. Of the 92 essays, 86 were scored; the remaining 6 were considered non-ratable because they were off-topic or plagiarized.

Data Analysis
This section describes the steps followed in analyzing the data. It begins by describing the procedures used in the content analyses of the passages for the reading tests and integrated writing tasks, and continues with a discussion of the task analysis of the reading items. It then moves on to report the methods used to analyze the participants' responses to the reading test, writing test, and survey.

Content analyses of passages
Reading passages from the GEPT-A and iBT were analyzed using the Coh-Metrix Web Tool (Coh-Metrix, n.d.) and the Compleat Web VP function of the VocabProfiler (Cobb, n.d.).
Coh-Metrix is a "web-based software tool" developed at the University of Memphis to "analyze texts on multiple characteristics" (Graesser, McNamara, & Kulikowich, 2011, p. 224) and "to measure cohesion and text difficulty at various levels of language, discourse, and conceptual analysis" (Crossley, Dufty, McCarthy, & McNamara, 2007). The Coh-Metrix analysis involved the 39 variables deemed most useful from among the 106 generated by the web tool. Many of the variables excluded from this analysis were alternative measures of ones included; for example, percentile scores were reported rather than z-scores, on the grounds that the former were more readily interpretable.
For the vocabulary profile analysis, the "classic" word lists (K1, K2, and AWL) were used, yielding four additional variables. All mid-sentence capitalized nouns were automatically recategorized as K1, and any other capitalized proper nouns were manually recategorized as K1. Possessive forms were also recategorized when that was not done automatically.
Each passage was analyzed separately, as was the summary paragraph for GEPT R4. The summary paragraph was treated separately because, although it superficially resembled a reading passage, it was essentially the combined text of six test questions. The three GEPT-A passages from the skimming task were analyzed together as GEPT R7, because they were read together as part of a single task. For both the GEPT-A and iBT, a weighted mean was computed for each variable, with the weight for a given passage based on its number of words. For the GEPT-A, the summary paragraph from Passage #4 was not included in the weighted average, as it was not deemed comparable to the other seven passages.
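To make the weighting procedure concrete, the following minimal Python sketch (not the tool actually used in the study; all passage names and values are hypothetical) computes a length-weighted mean for each text variable, with each passage weighted by its word count:

```python
import pandas as pd

# Hypothetical data: one row per passage, one column per Coh-Metrix/VocabProfile
# variable, plus a word-count column used as the weight.
passages = pd.DataFrame({
    "passage":    ["GEPT_R1", "GEPT_R2", "GEPT_R3"],
    "word_count": [612, 587, 640],        # illustrative values only
    "MTLD":       [88.2, 91.5, 84.7],     # illustrative values only
    "K1_percent": [78.1, 80.4, 76.9],     # illustrative values only
})

def weighted_means(df, weight_col="word_count"):
    """Length-weighted mean of every numeric text variable except the weight itself."""
    weights = df[weight_col]
    value_cols = [c for c in df.select_dtypes("number").columns if c != weight_col]
    return {c: (df[c] * weights).sum() / weights.sum() for c in value_cols}

print(weighted_means(passages))
```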
The same set of procedures was followed in analyzing the text of the input passages for the integrated writing tasks. For the GEPT-A, two reading passages were analyzed both as a single passage and as two separate passages. The analysis as a single passage was undertaken because both had to be read in order to attempt the writing task. The analysis as two separate passages was also performed to allow a passage-to-passage comparison with the iBT, which used a single reading passage and a single listening passage. The listening scripts for the iBT integrated writing tasks were also analyzed using the same procedures as the reading passages.
Once the 43 text variables had been computed, descriptive statistics were calculated. Complete results for these variables are reported in Appendix A for GEPT-A reading passages, iBT reading passages, GEPT-A (reading) input passages for the integrated writing task, and iBT reading and listening input passages for the integrated writing tasks.

Analysis of participant test performance data
Descriptive statistics. Descriptive statistics were computed for participant demographic information, total GEPT reading and writing scores, scores on both sections of the GEPT, and iBT scores for reading, writing, listening, and speaking. Descriptive statistics were also computed for each of the six items on the survey.
Item and reliability analyses were performed on the GEPT-A reading test. In the absence of appropriate cut scores, point-biserial coefficients were used to estimate item discrimination. Discrimination was calculated using adjusted total scores, that is, with the item in question removed from the total in order to avoid inflation of the coefficient due to autocorrelation.
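As an illustration of the adjusted discrimination index described above, the following Python sketch (assuming a pandas data frame of dichotomously scored items with hypothetical column names; not the software actually used in the study) correlates each item with the total score computed without that item:

```python
import pandas as pd

def corrected_point_biserial(items: pd.DataFrame) -> pd.Series:
    """Point-biserial discrimination for each item against the total score
    with that item removed, to avoid inflating the coefficient."""
    total = items.sum(axis=1)
    return pd.Series(
        {col: items[col].corr(total - items[col]) for col in items.columns},
        name="corrected_rpb",
    )

# Usage with hypothetical 0/1 item scores:
# responses = pd.read_csv("gept_reading_items.csv")   # one column per item
# print(corrected_point_biserial(responses))
```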
In addition, GEPT-A reading and writing task 1 scores were regressed on iBT reading and writing scores.
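A minimal sketch of such a regression, using the Python statsmodels package purely as a stand-in for whatever software was actually used, with hypothetical variable names:

```python
import statsmodels.api as sm

# scores: DataFrame with columns "gept_reading" and "ibt_reading" (hypothetical names)
def simple_regression(scores):
    X = sm.add_constant(scores["ibt_reading"])                # intercept + iBT reading score
    model = sm.OLS(scores["gept_reading"], X, missing="drop").fit()
    # slope/intercept, R-squared, and root mean square residual (approximately the SEE)
    return model.params, model.rsquared, model.mse_resid ** 0.5
```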
Correlations and exploratory and confirmatory factor analyses. Correlations were calculated among the two GEPT-A reading sections, the GEPT-A integrated writing task, the four iBT subscores, and responses to the six survey questions, resulting in a 13 x 13 correlation matrix. Pearson r was used because, with a few minor exceptions, the variables had relatively normal distributions (i.e., means and medians close together, and skewness and kurtosis with absolute values of less than 2). Two-tailed significance tests were used; while it was assumed that the seven test score variables would have positive relationships, and that there might be a negative relationship between perceptions of GEPT-A difficulty and performance on the GEPT-A and iBT, any relationships between the "relevance" survey questions and the other variables could not be predicted in advance.
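The correlation matrix and two-tailed significance tests could be reproduced along the following lines (a Python sketch with SciPy, using pairwise-complete cases; the column names are hypothetical, and the study itself presumably used a standard statistics package):

```python
import numpy as np
import pandas as pd
from scipy import stats

def correlation_matrix_with_p(df: pd.DataFrame):
    """Pairwise Pearson r and two-tailed p-values, using pairwise-complete cases."""
    cols = df.columns
    r = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    p = pd.DataFrame(np.zeros((len(cols), len(cols))), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            pair = df[[a, b]].dropna()
            r_ab, p_ab = stats.pearsonr(pair[a], pair[b])
            r.loc[a, b] = r.loc[b, a] = r_ab
            p.loc[a, b] = p.loc[b, a] = p_ab
    return r, p
```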
An exploratory factor analysis of the GEPT-A reading and writing scores and the iBT reading and writing scores was conducted using SPSS. For the GEPT-A reading, items were grouped into seven testlets, each based on one of the reading passages. Testlet scores were used rather than scores on individual items because item-level scores often lack sufficient variance, particularly in smaller datasets such as this one, for clear factor structures to emerge. The factors were extracted using principal axis factoring. Initially, no minimum criterion or maximum number of factors to extract was set. After the initial extraction, any factor with no unrotated loadings of .30 or greater was dropped (following Comrey, 1992), and the analysis was repeated. When the model would not converge for a given number of factors, extraction was attempted with one factor fewer. When the number of factors stabilized, the result was rotated using the Varimax algorithm. Any factor that did not have at least three variables with loadings of .30 or greater was dropped, and the procedure was repeated; if a model would not converge, the analysis was attempted with one factor fewer. Once the factor structure had stabilized following this procedure, a solution with one additional factor was tried for comparison. These models were compared on the basis of simple structure, parsimony, and interpretability in order to determine the number of factors in the final model. This model was then rotated using Promax, which yielded an oblique factor structure.
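The study ran the EFA in SPSS; purely as an illustration of the extraction-and-rotation sequence described above, a rough Python equivalent using the factor_analyzer package might look like the following (column names are hypothetical, and the package's "principal" extraction is only an approximation of SPSS's principal axis factoring):

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# testlets: DataFrame with the seven GEPT-A reading testlet scores, the GEPT-A
# writing score, and the iBT reading and writing scores (hypothetical column names).
def run_efa(testlets: pd.DataFrame, n_factors: int, rotation: str = "varimax"):
    fa = FactorAnalyzer(n_factors=n_factors, method="principal", rotation=rotation)
    fa.fit(testlets.dropna())
    return pd.DataFrame(fa.loadings_, index=testlets.columns)

# e.g., compare a one-factor solution with a Promax-rotated two-factor solution:
# print(run_efa(scores, 1, rotation=None))
# print(run_efa(scores, 2, rotation="promax"))
```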
A confirmatory factor analysis was then conducted using AMOS, taking the results of the exploratory factor analysis as a starting point (Model 1). The steps for CFA followed standard procedures outlined in Kunnan (1998). For purposes of comparison, additional CFAs were performed with all GEPT-A reading passages and iBT reading scores loading on one factor, and both GEPT-A and iBT writing scores loading on a second factor (Model 2). The two factors were set to correlate with each other, on the assumption that reading and writing are related aspects of language ability. A third model (Model 3) was also tested-primarily for the purposes of exclusion-with the GEPT variables loading on one factor and the iBT reading and writing scores loading on a second factor (which was correlated with the first), to test the hypothesis that the two tests measure separate but related things. Additional models were also tested in an attempt to achieve satisfactory model fit.
Goodness of fit in the CFA models was evaluated using χ², the NFI, NNFI, CFI, and RMSEA. Consideration was also given to the significance of parameter estimates, as tested using the ratio of raw parameter estimates to their standard errors (Byrne, 2010). All fit indices were provided by AMOS, with the exception of the NNFI, which was calculated by hand from AMOS output following Hu and Bentler (1995). We employed a variety of fit indices in order to evaluate fit from multiple perspectives. The least emphasis was placed on χ², since it almost invariably is significant for any model tested, despite whatever other fit indices may show.
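For reference, the hand calculation of the NNFI (also known as the Tucker-Lewis Index) from the model and independence-model chi-square values can be sketched as follows; the numbers shown are purely illustrative, not the study's output:

```python
def nnfi(chi2_model, df_model, chi2_null, df_null):
    """Non-Normed Fit Index (Tucker-Lewis Index) from model and null-model chi-squares."""
    return ((chi2_null / df_null) - (chi2_model / df_model)) / ((chi2_null / df_null) - 1.0)

# Illustrative values only (not taken from the study's AMOS output):
# print(nnfi(chi2_model=85.2, df_model=34, chi2_null=612.4, df_null=45))
```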

Results
In this section, we present the findings of the study. We begin with background information on the study participants, followed by the results of the content analyses of the passages. We then continue with the task analysis of the reading items, and conclude with the analysis of the test and survey results.

Participant Background Information
Tables 2 to 7 summarize the background information of the participants in the study. Table 2 summarizes the breakdowns of ages and genders for all participants in the sample, and Tables 3 and 4 detail their academic status (undergraduate or postgraduate) and academic majors. Table 5 summarizes when participants took the iBT. Tables 6 and 7 then provide information on the first language backgrounds of the U.S.-based participants, and when they arrived in the United States.

Content Analysis of Passages
The Coh-Metrix and LexTutor analyses of the reading passages resulted in 43 values for each reading passage. These results are presented in Appendix A. The independent-samples Mann-Whitney U tests performed on the 43 variables indicated that only six variables differed significantly across the two tests. As shown in Table 8, the two tests varied significantly in the number of words per passage, two separate measures of lexical diversity (MTLD and VOCD) across all words, the mean number of modifiers per noun phrase, mean sentence syntax similarity across paragraphs, and the percentage of K1 words in each passage. The SD (calculated without weighting for passage length) for the variables with significant differences is also presented in Table 8 as a measure of effect size.

The Coh-Metrix and LexTutor analyses of the input passages for the integrated writing tasks (reading only for the GEPT-A; reading and listening for the iBT) also resulted in 43 values for each passage. These results are presented in Appendix A. The results of the independent-samples Kruskal-Wallis test were not significant for any of the text variables. This may have been because of the very small sample size (n = 2, 3, and 3 for the three groups of passages), although it is not possible to say so definitively.
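A minimal sketch of the variable-by-variable Mann-Whitney U comparison (in Python with SciPy; the data frames and column names are hypothetical, and this is not the software used in the study):

```python
import pandas as pd
from scipy import stats

def compare_text_variables(gept: pd.DataFrame, ibt: pd.DataFrame, alpha: float = 0.05):
    """Independent-samples Mann-Whitney U test for each text variable shared by the two tests."""
    rows = []
    for var in gept.columns.intersection(ibt.columns):
        u, p = stats.mannwhitneyu(gept[var].dropna(), ibt[var].dropna(),
                                  alternative="two-sided")
        rows.append({"variable": var, "U": u, "p": p, "significant": p < alpha})
    return pd.DataFrame(rows)
```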

Task Analysis of Reading Items
This section reports on the findings of the task analysis of the reading sections of the GEPT-A and iBT. It begins by reporting the topical content of the various reading passages, and then examines the aspects of the reading construct that individual items seemed most likely to assess, the scope of the items, and the task formats used on the two tests.
The listing of passages and their topics is contained in Table 9. As can be seen from some of the classifications, some passages were more challenging than others to assign to a single subject matter category. The GEPT-A passages came from a range of subject matter topics, but with no content from the life sciences. In contrast, the iBT passages covered the same sorts of topics as the GEPT, but with the addition of life sciences topics. Notably, the iBT physical sciences passages all dealt with geology. (Note to Table 9: Titles for Part 1 were not included on the test form and are taken from the GEPT-A Marking Scheme. Titles for the first two passages in Part 2 are taken from the test. The third passage was a collection of three separately titled shorter passages, and the title of the overall whole was inferred by the researchers.)
Complete lists of the item-by-item findings are presented in Appendix B, but the results are summarized for the GEPT-A and iBT in Table 10 and Figure 1. One issue that presented itself involved vocabulary questions: whether they were tapping into the top-down reading process of identifying the meaning of unfamiliar vocabulary from context clues, or were instead assessing vocabulary knowledge, a point which is treated in the Discussion below. In rating item scope, however, it was assumed that they were in fact assessing the top-down reading process, not knowledge of vocabulary. As can be seen, the construct coverage of the two tests is similar in that both have more items requiring reading for specific details than any other part of the reading construct. Both include paraphrasing and/or summarizing, although the GEPT-A assesses this more extensively than does the iBT. Neither does much to assess the ability to identify the main idea of a passage (the GEPT-A included one such item in its careful reading section; the scanning section primarily requires scanning for the main idea of a paragraph, as opposed to the main idea of an entire passage), although the iBT does include a number of items that appear to assess the ability to read for major points or ideas, one for nearly every passage.
The tests also differ markedly in several ways. The GEPT-A devotes heavy coverage to skimming and scanning, something the iBT ignores. In contrast, the iBT includes a large number of items assessing vocabulary knowledge or the ability to determine the meaning of unfamiliar vocabulary from context. Finally, another major difference between the two tests lies in the area of top-down reading processes such as inferencing, identifying author purpose, and sensitivity to rhetorical organization and cohesion; the iBT includes these to a far greater extent than does the GEPT-A.

Table 11 and Figure 2 describe the breakdown of the scope of the reading items on the GEPT-A and iBT. As can be seen, the iBT predominantly uses items with narrow or very narrow scope (i.e., requiring the processing of several sentences or less). The GEPT-A, on the other hand, focuses more on moderate-scope items (i.e., those requiring the processing of an entire paragraph, or close to it), with this level of scope proving to be the most common one. Finally, the GEPT-A included a higher proportion of items with broad or very broad scope (i.e., items for which the key information was spread across multiple paragraphs or the entire passage, respectively) than did the iBT, with 8 of the 11 such GEPT-A items coming from the scanning section.

In the last portion of the reading test task analysis, we compared the task formats used on the two tests. The results are summarized in Table 12 and Figure 3. While the iBT was entirely dependent upon selected response items, the GEPT-A included a substantial proportion of short answer questions, with only about a third of the items using traditional multiple choice. The iBT, in contrast, relied mainly upon multiple choice items, with only 8% of the total items representing task formats other than multiple choice.

Analysis of Participant Test Performance Data
Table 13 provides the descriptive statistics for GEPT-A and iBT scores. Scores are reported as percentages for comparability; raw scores, including Cronbach's alpha and the standard error of measurement (SEM) for the GEPT-A Reading test, are provided in Appendix C, Table C1. Similarly, descriptive statistics for total score by passage are presented as percentages in Table 14, while the descriptives for raw scores are provided in Table C2. Table 15 presents the item analysis results for the GEPT-A reading items. Discrimination was calculated based on the adjusted total score for the relevant section of the reading test (i.e., with each item's score subtracted from the total, to prevent autocorrelation). Tables C3 and C4 also include item analysis results based on total GEPT score and on total passage scores (likewise adjusted for autocorrelation).

Note. The two questions on difficulty level were rated on a 1-3 scale (easy, medium, difficult); the four questions on content and task relevance were rated on a 1-5 scale (strongly agree, agree, neither agree nor disagree, disagree, strongly disagree).

The regression analysis for the GEPT-A reading scores yielded the following equation: GEPT-A-R = 8.561 + 2.458 × iBT-R (R² = .218, SEE = 19.54). The regression analysis for the GEPT-A Writing Task 1 score yielded GEPT-A-W1 = 1.225 + 0.055 × iBT-W (R² = .148, SEE = 0.46).
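To illustrate how the reported equations can be applied, the following sketch plugs an iBT score into each equation (the coefficients are those reported above; the example input values are arbitrary):

```python
def predict_gept_reading(ibt_reading: float) -> float:
    """Predicted GEPT-A reading score from the reported equation (R² = .218)."""
    return 8.561 + 2.458 * ibt_reading

def predict_gept_writing_task1(ibt_writing: float) -> float:
    """Predicted GEPT-A Writing Task 1 score from the reported equation (R² = .148)."""
    return 1.225 + 0.055 * ibt_writing

# e.g., an iBT reading score of 25 predicts a GEPT-A reading score of about 70:
# predict_gept_reading(25)  ->  8.561 + 2.458 * 25 = 70.0
```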

Analyses regarding relationships among tests
In this section, we present the results of the three stages of correlational analyses that were performed. We begin with the correlations among key variables, move on to exploratory factor analyses of the GEPT-A and iBT reading and writing scores, and conclude with confirmatory factor analyses of those scores.
Correlations among variables. Table 19 contains the correlation matrix for the two GEPT-A reading sections, the GEPT-A first writing task, all four iBT subscores, and responses to the six test takers survey questions. Table D1 repeats the matrix with the actual significance level and sample size for each correlation. Unsurprisingly, all of the correlations among GEPT-A and iBT test scores were highly significant (p ≤ .001). The GEPT-A reading and iBT reading scores were correlated at r = .467; while this is a medium-to-large correlation, it indicates only about 22% shared variance between the two tests. Similarly, the medium correlation (r = .385) between the GEPT and iBT writing scores indicates about 15% shared variance. The correlation between the two GEPT-A reading sections was very high, with an effect size of r² = .491. The effect sizes for the GEPT-A reading and writing scores were somewhat smaller (r² = .283 and .168, respectively). The correlations among the four iBT scores had smaller effect sizes, ranging between 10.8% and 38.4% shared variance for each pair of variables. Correlations across the GEPT-A and iBT variables were somewhat lower overall than those among the scores from within a given test, with effect sizes ranging from near-trivial (r² = .073) to modest but appreciable (r² = .212).
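The shared-variance figures cited here are simply the squared correlations; for example, r = .467 squared gives approximately .218, or about 22% shared variance. A one-line helper makes the conversion explicit:

```python
def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures, i.e., r squared."""
    return r ** 2

# e.g., shared_variance(0.467) -> 0.218, i.e., about 22% shared variance.
```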
As anticipated, there was a negative relationship between test scores and perceptions of GEPT-A difficulty, but it was not significant for every test score variable, and had at most a minor effect size. The largest correlation was between perceptions of GEPT-A reading test difficulty and GEPT-A Reading Task 1 scores, a significant relationship with a minor effect size (r² = .147). Perhaps the most interesting of the correlations with perceived difficulty was the one between the perceived difficulty of the GEPT-A reading and GEPT-A writing tests, which had 11.4% shared variance: a minor relationship, yet nevertheless the second-largest in this set of correlations. Also worth noting was the low correlation between perception of GEPT-A writing difficulty and GEPT-A writing score (r² = .031).
The relationship between the perceived relevance of GEPT-A content (to participants' academic studies) and other variables was for the most part not significant. There was only one significant relationship between content relevance and test scores on the GEPT-A or iBT: the correlation between GEPT-A writing scores and the perceived relevance of the content of the GEPT-A reading test. Given its trivial effect size (r² = .037), it may have been a spurious correlation (i.e., one resulting from chance). Content relevance of the reading and writing tests was closely related, with a strong effect size (r² = .419). There was a significant but trivial relationship between the content relevance of the GEPT-A writing test and participants' perception of the difficulty of the GEPT-A writing tasks, as well as a significant but trivial negative relationship between GEPT-A writing content relevance and GEPT-A reading difficulty.
The relationship between the perceived relevance of GEPT-A tasks (to participants' academic studies) and other variables was minor and difficult to interpret. There were small significant negative correlations with GEPT-A reading and writing scores, except for the non-significant correlation between reading task relevance and GEPT writing scores; none of these correlations had effect sizes greater than 6.7% shared variance. These correlations mean that as participants' perception of task relevance increased, their scores went down, and vice versa. There was also a small significant correlation between perceived GEPT-A writing task difficulty and the relevance of the writing tasks to participants' academic studies, but its effect size (r² = .058) bordered on trivial, indicating that, to a small extent, participants tended to associate GEPT-A writing difficulty with the relevance of the writing tasks to their own academic studies. There were also significant correlations between perceptions of GEPT content relevance and task relevance for both reading and writing, with the highest occurring between writing task relevance and writing content relevance, and between reading task relevance and writing task relevance. These last two correlations both showed medium effect sizes (r² = .283 and .231, respectively).
Exploratory factor analysis. The results of the EFA are summarized in Table 20. Correlations among the variables analyzed are presented in Table D2. A single-factor solution provided relatively high loadings for all 10 variables, and was both parsimonious and very easy to interpret.
In contrast, a correlated two-factor solution (r = .593) yielded a first factor that accounted for most of the GEPT-A reading passages (Passages 3-7), and a second factor on which the iBT reading and writing scores loaded (with reading particularly high). The GEPT-A writing and the remaining two GEPT-A reading passages cross-loaded on both factors, with roughly equal loadings on each. A three-factor solution would not converge. More than three factors would have led to model identification problems, since it would have required at least one factor with only two indicator variables. Therefore, the single-factor model was confirmed as the best one in the EFA.
Confirmatory factor analysis. As explained previously, the first model tested took the results of the EFA as its starting point: a single-factor solution, with the factor assumed to represent reading and writing ability. All parameter estimates were significant, and the model converged in eight iterations. The path diagram for this model is presented in Figure 4.
Model 2, which featured all reading scores loading on one factor and both writing scores loading on a second factor, was attempted next. All parameter estimates were significant, and the model converged in eight iterations. However, as Table 21 indicates, fit was largely unaffected. Because iBT reading and writing scores were self-reported data, and were the only variables in this analysis for which there were missing data, the error terms for these two variables were correlated. This resulted in a noticeable improvement in model fit. The path diagram of the resulting model is shown in Figure 5. The fit borders between moderate and mediocre, as none of the fit indices reach the cutoff values for "good" model fit. Nevertheless, since this model had the best fit, and was also the best in terms of interpretability, it was determined to be the best possible with the current data.
A third model (Model 3) was also tested, primarily for purposes of exclusion, with the GEPT variables loading on one factor and the iBT reading and writing scores loading on a second factor (correlated with the first), to test the hypothesis that the two tests measure separate but related things (see Figure 6). As with the other two models, the model converged in eight iterations, and all parameter estimates were significant. Model fit was slightly better than with the unmodified Model 1 and with Model 2, but was still mediocre at best (see Table 21). Since this model was only identified if missing data were imputed, unlike Models 1 and 2, it was decided that Model 3 was not an accurate description of the factor structure being analyzed. Table 21 summarizes the fit of all the models tested; as can be seen, this model fit the data rather poorly.

A few additional models were attempted with task factors or cross-loadings on multiple factors for selected observed variables, in hopes of finding a model with better fit. These models tended not to converge at all, and when they did, always resulted in worsened fit.
Model 2 was thus confirmed as the best available model that included both the GEPT-A and iBT under the circumstances. Additional EFA analyses were then conducted separately on GEPT-A scores and iBT scores. The GEPT-A was modeled with writing scores included in one model (Model 4) and excluded in another (Model 5). To help with model identification in the iBT analysis, listening and speaking scores were included along with reading and writing. In both exploratory analyses, a single-factor solution proved to be the most interpretable result. The factor matrices for these are presented in Tables 22 and 23, and the goodness-of-fit summaries for the models are presented in Table 24. Figures 7 and 8 show the path diagrams for these models. Model 5 failed to converge, and AMOS indicated that it was probably underidentified. Since model fit was improved but still mediocre, an additional CFA was attempted using only GEPT-A reading data. The resulting Model 6 (see Table 24 and Figure 9) had the best fit of any model in the analyses. It had a non-significant χ², and the NNFI and CFI were both consistent with good fit. The NFI and RMSEA were not as good; taken together, this suggests that the model had marginally good fit.

Discussion
In this section, we discuss the results described in the preceding section. We begin by addressing Research Question 1 with consideration of the results of the content analyses of the passages, followed by the task analysis of the reading items. We then take up Research Question 2 and the analysis of the test and survey results. We next consider Research Question 3 in light of the EFA and CFA results and the overall findings of the study. We conclude the section with a discussion of areas worthy of additional research.

Research Question 1: Content Analysis of Passages
Research Question 1 asked, "What is the content of the reading and writing tests of the GEPT-A and the iBT - the test passages, items, tasks?" Aside from the obvious difference in length between the two tests (seven reading passages for the GEPT-A, four for careful reading and three for expeditious reading, vs. three for the iBT), the GEPT-A and iBT reading passages proved to be highly similar in most respects. The most salient of the six significantly different features was the number of words per passage, with the GEPT-A passages averaging nearly 50 words more than the iBT reading passages (0.6 standard deviations, a medium effect size). The reading passages for the two tests also differed in certain aspects involving vocabulary. The GEPT-A had a higher level of lexical diversity in its passages (by 1.0 and 1.3 standard deviations) on two separate measures, clearly a large effect size. The proportion of words from the K1 list was also higher for the GEPT-A (by 1.1 standard deviations, also a large effect size). In syntax, the iBT had on average significantly more modifiers per noun phrase (1.1 standard deviations, a large effect size). Finally, in terms of syntactic similarity across paragraphs, an indicator of cohesion and/or of ease of processing, the iBT measured higher than the GEPT-A by 2.0 standard deviations (a large effect size).
We can therefore say that while they are similar in a host of other respects, the only significant differences between the reading passages on the two tests are that the GEPT is slightly longer, uses a markedly greater level of lexical variety, and uses more simple vocabulary-but not, oddly enough, a significantly lower level of more challenging vocabulary. In turn, the iBT features longer noun phrases, which presumably increase its syntactic complexity and the level of reading difficulty. The iBT also has much more consistent syntax at the sentence level than the GEPT-A, which should help increase cohesion while lowering reading difficulty. These variables are probably more important in determining the actual readability of a text for non-native speakers than are traditional readability measures (see, e.g., Carrell, 1987), even if the appropriate values for the alternative measures still await determination.
It is a limitation of the present study that the sample size for integrated writing input passages was so small. It seems likely that this was the main reason for no significant differences being identified for any of the input passage variables-even though, for example, the GEPT-A uses much longer input passages than the iBT. Further research with a greater number of passages from each test would be necessary to establish this conclusively, however.

Task analysis
In this section, which also relates to Research Question 1, we begin with a brief discussion of the topics used in the GEPT-A and iBT reading passages, comment on the construct coverage of the two reading tests (in terms of what aspects of reading they assess in the forms analyzed), and then discuss characteristics of the two tests in terms of the scope and task formats of the reading items.
Topics of the Reading Comprehension Passages. The GEPT-A form analyzed in this study included passages with a range of topical content, but had only one passage dealing with the sciences (and none taken from the life sciences). In contrast, the iBT had a much heavier emphasis on the sciences, with two science passages in two forms, and one in the remaining form. Interestingly, the only physical or Earth science topic covered by either the GEPT-A or iBT was geology (loosely defined, as one iBT passage dealt with icebergs). This may stem in part from the markedly different purposes of the two tests, in that many iBT test takers plan to study science or engineering in the United States, and the TOEFL therefore has had a long history of including reading passages from these subjects.
Construct Coverage of the Reading Tests. The main difference between skimming a text and reading it for the gist is the degree of speededness of the task. Likewise, the difference between scanning and reading for specific details is primarily one of how rapidly the task is performed. The iBT did not include any skimming or scanning items; the GEPT-A skimming and scanning section, however, included 20 questions (the same number as on the Careful Reading section) and was allocated a time limit of 20 minutes. Although the time limit for individual sections was not imposed on the U.S.-based test takers, their test instructions did state that there was a 20-minute limit. We believe that this, along with the presence of the overall time limit, made it likely that test takers did in fact attempt these tasks in an expeditious fashion, and that they therefore probably skimmed and scanned rather than reading carefully.

The GEPT-A did not include any questions targeting vocabulary knowledge or the ability to infer the meaning of unfamiliar vocabulary from context, while the iBT made heavy use of vocabulary questions (29.5% of all items). It should be pointed out that the iBT items that involved vocabulary could generally be answered by a test taker unfamiliar with the words who was skilled at inferring the meaning of unfamiliar vocabulary from context. At the same time, every one of the items could be answered using preexisting vocabulary knowledge instead. Given the task format used (multiple choice glosses of the word or phrase in question), examinees who knew the word could answer without even having to read the passage. Furthermore, in some cases, word analysis skills (e.g., application of knowledge of morphology) could be used to infer a definition without having to read the passage, as in Practice Test 1, where the word immeasurably was the focus of one item. By breaking the word down into im- + measure + -able + -ly, a strategic reader could identify the correct meaning without reading the passage. The existence of these alternatives to using the intended reading process is one of the main weaknesses of using this sort of task format to assess the ability to infer the meaning of unfamiliar vocabulary.
However, some vocabulary items on the iBT practice tests could not be answered except by using prior vocabulary knowledge; that is, readers would not be able to determine the meaning of the target word from the context, and could not successfully answer the item unless they already knew the meaning of the word being tested. In summary, then, some of these iBT items would function as vocabulary-in-context items for test takers who did not already know the words, but would function as measures of vocabulary knowledge for anyone who already knew them.
Both tests include items that require students to read for specific details, and both tests make heavy use of paraphrasing rather than using identical language in the item and the passage, a practice that makes these items require more than the ability simply to identify relevant information in the passage, and something that would perhaps not be feasible with tests aimed at lower levels of language ability. On the other hand, the careful reading section of the GEPT-A makes such extensive use of specific details items (11 out of 20 careful reading questions) that there is little room left on the test for other aspects of the reading construct, such as inferencing or reading for the main idea.
Although both the GEPT-A and iBT include items assessing test takers' sensitivity to rhetorical organization, these items differ in important ways in terms of their scope. The one GEPT-A item assessing this portion of the reading construct had a very broad scope, meaning that answering it correctly required processing all or nearly all of the passage. On the other hand, most of the nine iBT items assessing this aspect of reading (one for every passage) required test takers to process all or nearly all of a single paragraph. Two items had narrow scope, with the key information needed to answer them spread across a few sentences. Only one rhetorical organization item had broad scope, requiring test takers to read more than one paragraph in order to answer correctly. None had very broad scope. Thus, the one item of this type on the GEPT was also the only one from either test to require attention to the rhetorical organization of the entire passage, rather than merely a portion of it.
Both tests required paraphrasing and summarizing of material read, but they differed in their emphasis. The GEPT-A requires both paraphrasing and summarizing, but with a greater emphasis on summarizing. In contrast, the iBT straddles the boundary between the two to some extent, and involves a much smaller degree of the information reduction that is required in summarizing. This stems at least in part from the difference in scope between the two tests for these items. In addition, the GEPT-A uses short-answer tasks to address this portion of the reading construct, while the iBT uses multiple choice. Furthermore, it is worth noting that all of the short-answer GEPT-A items-even those not targeting this portion of the reading construct-require at least some degree of paraphrasing because of the strict scoring rules regarding recycling of language taken directly from the passages (i.e., use of more than a key word or phrase is considered "plagiarism").
Interestingly, the iBT appears to have abandoned main idea items in favor of major points. At the same time, however, the GEPT-A only included one main idea item and no major point items. Main idea questions have long been a standard part of testing reading, so it is surprising that the GEPT-A largely omits them. In discussing this finding, however, it is particularly important to keep in mind our determination that the GEPT-A skimming and scanning section did in fact require expeditious reading, not careful reading. Careful reading to identify the main idea of a paragraph would probably be equivalent to reading for major points, but the framework used here differentiates skimming (which inherently involves reading for the gist, and major ideas, of a text) as being qualitatively different from careful reading to identify the main idea (or major points) of a passage.
A final point in terms of the adequacy of the construct representation on these two reading tests involves inferencing and identifying author purpose. Arguably, the latter is an example of the former; in any case, the GEPT-A includes only one author purpose question on the form analyzed in this study, compared to eight inference items and nine author purpose items on the iBT forms (roughly three author purpose items per form, or one per passage).
Scope. The reading comprehension questions on the GEPT-A and iBT differed substantially in terms of scope. The overwhelming majority of iBT items (76%) had narrow or very narrow scope-that is, the necessary information was contained within several sentences or just one sentence, respectively. In marked contrast, 75% of the GEPT-A reading questions on the form analyzed had moderate, broad, or very broad scope, requiring test takers to extract the necessary information from most or all of a paragraph, more than one paragraph, or more than half of the passage, respectively. This could be expected to make the GEPT-A questions more challenging overall, although verification of this is beyond the limits of the present study, given that iBT response data was not available.
Task formats. The two tests differed markedly in terms of the task formats they employed. The majority of reading items on the GEPT-A were selected response, but over a third were short answer items. The selected response items included a substantial portion that were not multiple choice: although about one third of all questions were multiple choice or fixed-format multiple choice, the remaining 30% were matching. The iBT forms analyzed, on the other hand, relied overwhelmingly on traditional multiple choice, with eight multiple-response multiple choice items and one categorization item spread across the nine passages. The limited use by the iBT of nontraditional or "enhanced" selected response task formats is better than none at all, but the more even distribution of task formats on the GEPT-A clearly sets it apart, and is likely to reduce any impact of test method effects on the scores. Furthermore, the fairly heavy use of short answer questions on the GEPT-A is more authentic (Bachman & Palmer, 1996); such items may engage communicative language ability more thoroughly, and could also do a better job of testing actual comprehension, as opposed to mere recognition of the correct answer among the options.

Research Question 2: Test Performance and Survey analyses
This section of the study relates to Research Question 2, which asked "What is the test performance of the study participants on the reading and writing tests of the GEPT-A at the total test score and at the item level, and on the iBT at the total test score level?" It answers this question by considering the overall scores on the GEPT-A and iBT, and the reliability and item performance of the GEPT-A. It then discusses the results of the test takers survey.
Descriptive statistics. Judging from the descriptive statistics for scores on the two tests, it appears that test takers got a higher proportion of questions correct on the iBT than on the GEPT-A-in fact, percentage correct scores were roughly 1.5 standard deviations higher on the iBT. Similarly, although they use very different rating scales and cannot be expected to be scaled the same, test takers seem to have done better on the iBT writing section than on the GEPT-A first writing task, in terms of percentage of points possible on the rating scale. Whether this is because the GEPT-A Writing Task 1 was rated more strictly than the overall iBT writing section (owing to either the rating scales themselves, rater training, or both), because the GEPT-A writing task was more difficult, or a combination of the two, cannot be determined from the data at hand.
The distribution of GEPT-A reading scores was close to normal, with minor negative skewness and kurtosis. iBT reading scores, on the other hand, had high positive kurtosis, and were clearly more negatively skewed than GEPT-A reading scores, although not severely so. Similar but less extreme distributional patterns could be seen in the distributions for the other iBT sections, although they were only truly noteworthy in the case of listening. Given that the mean was only 1.2 standard deviations from the maximum possible score, this suggests that a ceiling effect was taking place in the iBT scores, particularly in the case of reading, at least with this population. If it is correct that there was a ceiling effect in the iBT scores, that fact could also be partially responsible for the low correlations between scores on the two tests, as a result of a restriction in range for the iBT scores. It should be noted, however, that such a ceiling effect might not be observed with a more typical sample, one more representative of the usual international iBT candidature, as opposed to the present sample, which had a higher overall level of language proficiency. This can be seen from the fact that the mean composite iBT score for participants in this study, 95.3, was equivalent to roughly the 73rd percentile among 2014 iBT test takers worldwide, and the mean reading score of 24.9 (82.9% of the possible scale points) was equivalent to roughly the 69th percentile (Educational Testing Service, 2015b).

Study participants tended to do about equally well on the two sections of the GEPT-A reading test, the careful reading and the skimming and scanning sections. Test takers' iBT scores were similar for reading, writing, and listening, but markedly lower for speaking. Reliability and the standard error of measurement (SEM) for iBT reading typically average .85 and 3.35 out of 30 scale points (Educational Testing Service, 2011); this is roughly comparable to the results found for the GEPT-A reading in this study (.880, and 7.6 out of 120 scale points).
There were clearly noticeable differences in performance across the seven passages used in the GEPT-A. The first three passages were highly similar in format and in the nature of their item formats, and the scores on the testlets (sets of items) associated with each passage were roughly comparable. Scores on the other testlets varied widely, with Passage 4 the most difficult, perhaps because of the nature of the task (summarizing and paraphrasing with short-answer questions). Most puzzling, however, was the marked difference between scores on Passages 5 and 6. Both required skimming, and were ostensibly quite similar, but for some reason scores on them differed by 15%. Any definitive statement as to the reason would probably require comparison with additional passages.
Item analysis. Most items had acceptable item analysis values. As the GEPT-A is a criterion-referenced test, item difficulty does not figure into judgments of item acceptability; however, it was reported for reference, and there were no extreme cases, with only one item falling below .20 and only four above .80. As for discrimination, only six items fell below .30, the commonly used criterion on correlational discrimination indices for professionally developed items (see Carr, 2011). Only one was below .20; thus, the items were viewed as doing an adequate job of discrimination overall.
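As an illustration of the item statistics referred to here, the sketch below computes item difficulty (proportion correct) and a corrected item-total correlation, one common correlational discrimination index, for a small hypothetical 0/1 response matrix; the study's own index and data are not reproduced.

    # Item difficulty and a corrected item-total (point-biserial) discrimination
    # index for a hypothetical response matrix (rows = test takers, columns = items).
    import numpy as np

    responses = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 0, 1],
    ])

    difficulty = responses.mean(axis=0)                 # proportion correct per item

    total = responses.sum(axis=1)
    discrimination = []
    for j in range(responses.shape[1]):
        rest = total - responses[:, j]                  # total score excluding item j
        r = np.corrcoef(responses[:, j], rest)[0, 1]    # corrected item-total correlation
        discrimination.append(round(r, 2))

    print(np.round(difficulty, 2), discrimination)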
Survey analysis. Participants reported on average that they found the reading portion of the GEPT-A more difficult than the writing section, by about half of a standard deviation. For both portions of the test, "medium" difficulty was the most common description, with 10% more participants selecting that response for writing than did for reading. At the same time, three times as many participants found the reading "difficult" as did the writing. Twice as many rated the writing "easy" as gave that rating to the reading test.
The average rating for the relevance of test content to students' academic studies was equivalent to a rating of "neither agree nor disagree," for both GEPT-A reading and writing. The relevance of the reading content was judged higher than that of the writing content by about .40 standard deviations. The most common description chosen for the content relevance of both reading and writing tasks was "agree." Similarly, the average rating for the relevance of test tasks to participants' academic studies was equivalent to "disagree" for reading, and between that rating and "neither agree nor disagree" for writing. The most commonly selected rating for reading task relevance was "disagree," while the most common rating for writing task relevance was "agree." In summary, the vast majority of participants found the GEPT-A reading tasks to be of high or medium difficulty, and similar numbers reported the writing tasks to be of low or medium difficulty. These results are puzzling, given that test takers actually tended to perform better on the reading than on the writing. Participants were generally neutral regarding the relevance of the reading test content and tasks to their academic studies, but most agreed that the content and tasks of the writing section were relevant.

Research Question 3: Comparability of the tests
This portion of the discussion addresses Research Question 3, which asked "What is the comparability of the reading and writing tests of the GEPT-A and iBT?" The pervasively significant correlations among the iBT reading and writing scores and the GEPT-A reading testlet and writing scores (and the total GEPT-A reading scores as well) indicate that the two tests are in fact assessing related abilities.
Test takers' perceptions of the difficulty of the GEPT-A reading and writing tasks had a consistently negative relationship with their scores on the two tests. Most of these relationships were significant. This lends some additional, albeit weak, support to the idea that the two tests both assessed the same abilities.
The results of the EFA further indicated that the GEPT-A and iBT reading and writing sections measured substantially the same construct, since all observed variables (the seven passage-based GEPT-A testlets, GEPT-A writing score, and iBT reading and writing scores) loaded on the same common factor.
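A minimal sketch of the kind of single-factor EFA described here is given below; it assumes the ten observed variables (the seven GEPT-A testlet scores, the GEPT-A writing score, and the iBT reading and writing scores) are columns of a pandas DataFrame, and the file name and column layout are hypothetical.

    # Single-factor exploratory factor analysis sketch using the factor_analyzer
    # package.  "scores.csv" and its columns are hypothetical stand-ins for the
    # ten observed variables; the study's actual data are not reproduced.
    import pandas as pd
    from factor_analyzer import FactorAnalyzer

    scores = pd.read_csv("scores.csv")        # one column per observed variable

    efa = FactorAnalyzer(n_factors=1, rotation=None)
    efa.fit(scores)

    loadings = pd.Series(efa.loadings_[:, 0], index=scores.columns)
    print(loadings.round(2))                  # all variables loading on one common factor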
The hypothesis that the two tests measure the same constructs was also supported by the results of the CFA, although not with the same factor structure as suggested by the EFA. The CFA found the best fit for a two-factor (reading and writing) model rather than for a single-factor model. The two-factor reading and writing model also fit better than one with separate factors for the GEPT-A and iBT. However, the two-factor reading and writing model did not fit the data as satisfactorily as could be hoped. This could have been because of problems with the model, particularly with the identification of the writing factor, which loaded on only two observed variables; analyzing the results with a third writing variable might help address this problem. It is also possible that the model was formally identified, but that the size of the sample led in this case to empirical under-identification, a condition sometimes encountered in factor analytic studies in which a unique solution for all parameter estimates is not possible, despite the fact that all parameters are formally (i.e., algebraically) identified (Rindskopf, 1984). This latter possibility is supported by the fact that a single-factor model, which did not have the problem of a two-indicator factor, did not fit as well as the two-factor model.
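For readers unfamiliar with how such a model is specified, the following is a hedged sketch of a two-factor reading/writing CFA written in lavaan-style syntax for the Python package semopy; the variable names and data file are hypothetical, and the specification is not the study's actual model.

    # Two-factor (reading and writing) CFA sketch using semopy.  Variable names
    # and the data file are hypothetical; note that the writing factor has only
    # two indicators, the situation discussed above.
    import pandas as pd
    import semopy

    model_desc = """
    Reading =~ testlet1 + testlet2 + testlet3 + testlet4 + testlet5 + testlet6 + testlet7 + ibt_reading
    Writing =~ gept_writing + ibt_writing
    Reading ~~ Writing
    """

    data = pd.read_csv("scores.csv")
    model = semopy.Model(model_desc)
    model.fit(data)
    print(semopy.calc_stats(model))           # chi-square, CFI, RMSEA, and other fit indices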
Complicating the picture, however, is the fact that both the GEPT-A-only (Model 4) and GEPT-A reading-only (Model 6) models fit better than the best-fitting model that included both tests. This does not necessarily indicate that the two tests are assessing entirely different constructs. For model identification purposes, the iBT-only model had to include listening and speaking scores, whereas GEPT-A listening and speaking were not included. Considering the two models together, therefore, does not involve an apples-to-apples comparison. It does indicate, however, that at least to some extent, the GEPT-A and iBT reading and writing sections are measuring somewhat different constructs. That is to say, both tests are clearly assessing reading and writing ability, but equally clearly, they appear to be assessing different aspects of the reading construct. Further support for this interpretation can be found in the strengths of the correlations between factors in the model with correlated reading and writing factors (Model 2a) and the model with correlated GEPT-A and iBT test factors (Model 3). In Model 2a, the reading and writing factors correlated at .85, whereas the GEPT-A and iBT factors in Model 3 correlated at .70. This indicates that at an overall level, the two tests differ even more than do the constructs of reading and writing. This interpretation is also supported by the findings of the task analysis regarding construct coverage, scope, and task formats. A well-fitting single model that includes both tests should be possible to construct, but would presumably require data at the same level (i.e., testlets from both tests, or individual items from both tests), and perhaps a larger sample than was obtained in this study.
As a final point, it is worth noting the surprising result that Model 5 (iBT scores only) would not converge, even though the four variables were so highly intercorrelated (r = .328 for the lowest value, and r ≥ .545 for the other five correlations). This provides further indication that empirical under-identification was indeed a problem in the CFAs in this study. It may also relate to the potential ceiling effect identified in the iBT data for the present study; such a restriction of range might easily have impaired model fit for any of the models tested that included iBT data.
These results parallel in some ways those of the comparability study described in the review of literature, the Cambridge-TOEFL Comparability Study (Bachman et al., 1995). That study found that scores for individual sections of the FCE and TOEFL were sometimes more closely related to other sections of the same test, intended to assess other constructs, than to sections of the other test that were intended to assess the same construct. Similarly, the present study is not the first in which clear and easily interpretable EFA results have become less clear, or even impossible to model, when replicated in CFA. For example, a higher-order factor model for which Bachman et al. (1995) had found a clear factor structure using EFA failed to converge at all when subsequently subjected to CFA by Kunnan (1995). It should also be pointed out that these two previous studies employed larger samples than the present one. Taken in context, therefore, the present results become somewhat less surprising.

Implications of Other Findings for Research Question 3
The content analysis of the reading passages found noticeable differences between the passages, but it is impossible to say from the results at hand how important those differences are in terms of affecting examinee performance and the comparability of the two tests. As for the topics used in the reading passages, based on the test form analyzed, the GEPT-A places much less emphasis on the reading of scientific or technical topics than the iBT. This is one area in which the two tests seem not to be comparable.
Given the contrasts between the two tests in terms of construct representation, it seems fair to say that the GEPT-A and iBT are not comparable in terms of the aspects of the larger reading construct that they assess. In particular, the GEPT-A does not give adequate coverage to aspects of careful reading besides reading for details and paraphrasing/summarizing, particularly inferencing and the ability to determine the meaning of unfamiliar vocabulary from context. At the same time, however, the iBT omits all coverage of skimming and scanning, and has too many items that function (or can function) as assessments of vocabulary knowledge rather than reading ability.
The tests are also not comparable in terms of the scope of their reading comprehension items. The GEPT-A has a more even distribution of scope across its items than does the iBT, with a much lower proportion of narrow-scope items. This seems appropriate for a test that purports to assess English at a high level of proficiency, whereas a greater emphasis on items of narrow and very narrow scope would be appropriate on tests targeting lower proficiency levels. The task formats used on the two tests are also not comparable, most notably due to the extensive use of short-answer questions on the GEPT-A.
In addition, the score distributions of the two tests were not equivalent in this study, suggesting that the GEPT-A may have been more difficult than the iBT. The reliability of the two tests was comparable, however. Similarly, the correlations between scores on the two tests were high enough to indicate that they probably assess related constructs, but were also low enough to make clear that there are marked differences as well-although, as noted above, ceiling effects in the iBT scores may have depressed the correlations between the two tests.
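The attenuating effect of such a ceiling on correlations is easy to demonstrate by simulation; the sketch below uses an invented underlying correlation and censoring point, so the specific values are illustrative only.

    # Simulation of how a ceiling effect (restriction of range) depresses an
    # observed correlation.  The underlying correlation of .80 and the ceiling
    # point are invented for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    n, true_r = 5000, 0.80

    x = rng.standard_normal(n)
    y = true_r * x + np.sqrt(1 - true_r**2) * rng.standard_normal(n)

    y_capped = np.minimum(y, 0.5)                     # scores pile up at the ceiling

    print(round(np.corrcoef(x, y)[0, 1], 2))          # close to .80
    print(round(np.corrcoef(x, y_capped)[0, 1], 2))   # noticeably lower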
Furthermore, the results of the regression analyses indicated that while iBT reading and writing scores can be used to predict GEPT-A reading and Writing Task 1 scores, the relationship between the two sets of scores is tenuous because of the small effect sizes. Based on the regression equations, together with concordance tables published by ETS (Educational Testing Service, 2015a), C1 in the CEFR is equivalent to an iBT reading or writing score of 24, which in turn equates to a GEPT-A reading score of 68 and a GEPT-A Writing Task 1 score of 3. These regression results must nevertheless be interpreted with caution in view of the small effect sizes and the large standard errors of the estimates.
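The kind of simple linear regression underlying such a concordance is sketched below with invented paired scores; the resulting equation is not the study's, and is included only to show how a predicted GEPT-A score at a given iBT score would be obtained.

    # Simple linear regression sketch for predicting GEPT-A reading from iBT
    # reading.  The paired scores are invented; the fitted equation is not the
    # concordance reported in the study.
    import numpy as np
    from scipy import stats

    ibt_reading = np.array([18, 21, 24, 27, 29, 22, 25])
    gept_reading = np.array([55, 62, 70, 79, 88, 64, 73])

    fit = stats.linregress(ibt_reading, gept_reading)
    print(round(fit.intercept, 1), round(fit.slope, 2))   # intercept and slope
    print(round(fit.rvalue**2, 2))                        # proportion of variance explained

    # Predicted GEPT-A reading score at an iBT reading score of 24:
    print(round(fit.intercept + fit.slope * 24, 1))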
In conclusion, while the passage and task analyses revealed important differences between the two tests, the correlational analyses indicate that the GEPT-A and iBT are both assessing reading and writing, and the scores on the two are very closely related. It is probably most accurate to say that the two tests assess the same constructs but from somewhat different perspectives, and therefore with somewhat different construct definitions.

Areas for future research
One of the limitations of the present study is that it only involved the analysis of a single form of the GEPT-A. A replication of the task analysis from this study using additional GEPT forms would be desirable. This would show how representative this particular form was, and would provide a more reliable description of the test and the characteristics of its tasks. The use of additional iBT forms might be desirable as well, assuming they were balanced by equal numbers of GEPT-A forms.
This study found that reading scores were higher on the iBT than on the GEPT-A. The greater level of challenge for the GEPT-A could be due to some of the points identified in the task analysis, particularly the greater scope of most GEPT-A items and the fairly extensive use of limited production tasks, rather than a total reliance on selected response. Verification of this might be performed using a multi-trait multi-method study, a many-facet Rasch analysis, or preferably both methods used in conjunction.
A study of how various task and passage characteristics (including both Coh-Metrix output and vocabulary-related measures) might affect reading performance would be another interesting area of investigation. It was not possible in this study to explore this question, but a study with additional test forms and passage-based testlet difficulty placed on the same scale (e.g., estimated using IRT and anchor passages) might shed light on this topic. Certainly, empirical text measures that actually predicted testlet difficulty would be an invaluable resource for test development.
It would be desirable to attempt a CFA of the GEPT-A with a larger sample size and separate scores for each of the subscales on the analytic rating scale, and perhaps with listening and speaking scores as well. A clearer understanding of the factor structure of the GEPT-A would be a useful component in the overall validity argument for the test.

Conclusion
This study aimed to investigate the comparability of the GEPT-A and iBT using data from test takers in both Taiwan and the United States, a form of the GEPT-A reading and writing sections, and iBT reading and writing test forms published commercially by ETS. Three research questions were posed regarding the content of the GEPT-A and iBT reading and writing tests, performance on the two tests, and the comparability of the two tests.
We concluded that the passages on the two tests are comparable in many ways, but reading passages differ in several key regards. The task analysis revealed that the construct coverage, item scope, and task formats of the two tests are clearly distinct. Analysis of participant responses indicated that the GEPT-A has good reliability, and that reading comprehension items tend to function quite well. It also appears that the two tests assess the same constructs, but emphasize different aspects of the reading construct, making results on the two tests not entirely comparable.