
Investigating the factor structure of the Test of English for Academic Purposes (TEAP) and its relation to test takers’ perceived test task value

Abstract

Background

This study investigated the scoring and criterion-related validity of the TEAP, a newly developed Test of English for Academic Purposes. In this study, scoring validity was examined by investigating the factor structure, while criterion-related validity was examined by first investigating the longitudinal change of test takers’ perceived test task value toward the measured construct and then investigating the relationship of test takers’ perceived value to the factor structure of the TEAP.

Methods

Confirmatory item-level factor analysis was conducted using data obtained from 2217 first-year university students, comparing four models (unitary, correlated, receptive-productive, and higher order). Additional confirmatory factor analyses were conducted to first investigate the longitudinal change of perceived value toward the measured construct and then to investigate the relationship of test takers’ perceived value of the construct measured by the test to the factor structure of the TEAP.

Results

The results show that the higher-order model was the best-fitting model. This finding replicates a previous small-scale study and suggests the generalizability of the test’s factor structure. Furthermore, test takers’ perceived values measured at the start of university positively affected the values measured about 6 months later.

In addition, perceived values measured both at the start of university study and about 6 months later positively correlated with the higher-order factor of the test.

Conclusions

The results provide further support for the scoring validity of the test. In addition, the positive relationship between the higher-order factor of the TEAP and the perceived value factors provides evidence of the usefulness of test takers’ perceptions in further supporting the criterion-related validity of the test.

Introduction

Educational reform efforts are observed around the world (Chalhoub-Deville, 2016) including Japan’s plan to reform its English education (Allen, 2020; Kuramoto & Koizumi, 2016; Sasaki, 2008). In 2013, the Ministry of Education, Culture, Sports, Science and Technology (MEXT) launched a policy designed to improve English education in Japan and revise the university entrance examination system. The first point of the revision concerns the balanced teaching and learning of the four English skills: reading, writing, listening, and speaking. High school graduates are expected to achieve a level of English competency equivalent to the B1/B2 levels of the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2001), which defines the six common reference levels (A1, A2, B1, B2, C1, C2), while junior high school students are expected to achieve the A1/A2 levels.

The second point of the present reform in Japan concerns the revision of in-house English language college entrance examinations to measure the four English skills. In Japan, most universities develop and administer their own in-house entrance examinations for admission purposes annually, and the revision came from the concern that those in-house tests mostly use multiple-choice formats focusing on receptive aspects of English such as reading and listening. Under the new policy, universities are expected to develop and administer in-house examinations that measure the four skills, but they are also allowed to utilize externally developed standardized English tests of the four skills in addition to in-house examinations. Thus, under the new plan, students who wish to matriculate to universities could choose English tests and take them up to two times within the April–December period for admission purposes, in addition to the in-house tests.

The third point concerns the revision of the National Center Test (NCT), which was first introduced in 1990 and has been taken by over 500,000 students (National Center for University Entrance Examinations, 2017) who wish to matriculate mainly at national or public universities. The revision came from the concern that the NCT measures only the written and listening skills of English in a multiple-choice format, while the new policy requires well-balanced teaching and learning of the four English skills. As of this writing, a newly revised NCT, now called the Common Test (CT), has been launched and is expected to continue until 2023 (Ministry of Education, Culture, Sports, Science and Technology, 2020).

This study concerns the second point of the revision, the plan to use standardized English tests for college entrance purposes. In the original plan, MEXT selected the authorized tests based on criteria such as an appropriate link between the test content and both the national curriculum standards and the CEFR. Such tests include the Test of English as a Foreign Language Internet-based Test (TOEFL iBT®; Educational Testing Service, 2022), the International English Language Testing System (IELTS; British Council, IDP, IELTS Australia, and Cambridge English Language Assessment, 2022), the Cambridge English exams (University of Cambridge Local Examinations Syndicate, 2022), the Test in Practical English Proficiency (EIKEN; Eiken Foundation of Japan, 2022a, 2022b), the Global Test of English Communication (GTEC; Benesse, 2022), and the Test of English for Academic Purposes (TEAP; Eiken Foundation of Japan, 2022a, 2022b). Though this point of the revision was expected to start in 2020, the plan was officially postponed in 2019 due to fairness concerns such as test fees, test center locations, and the number of test administrations (Allen, 2020). Yet, even after the announcement, universities still have the freedom to choose standardized tests to be used for admission purposes, so that students can choose and take such tests in addition to the option of taking in-house tests.

Test of English for Academic Purposes (TEAP)

The TEAP test was developed by the Eiken Foundation of Japan together with Sophia University, a private university in Tokyo known for its English language program, and the Center for Research in English Language Learning and Assessment (CRELLA) of the University of Bedfordshire in the UK. There have been many standardized English tests that measure the four skills of English, such as the TOEFL iBT, IELTS, and EIKEN, but the TEAP was designed specifically for college admission purposes in Japan. Some desirable features of the TEAP include its task difficulty, which is appropriate for university applicants in Japan, or CEFR A2–B2 (MEXT, 2012), and its link to the national course of study, while most other tests were originally designed for purposes other than Japanese college admission. In order to collect evidence to support the use of the TEAP for Japanese college entrance purposes, the TEAP has gone through a series of validation studies conducted in collaboration with CRELLA and reported on the website of the Eiken Foundation of Japan, including studies of the contextual and cognitive aspects of the reading and listening sections (Taylor, 2014), the writing section (Weir, 2014), and the speaking section (Nakatsuhara, 2014; Nakatsuhara et al., 2014); stakeholders’ perceptions of university entrance examinations and the expected washback from the introduction of the TEAP (Green, 2014; Nakamura, 2014); and its factor structure (In’nami et al., 2016).

Dimensionality of foreign language ability

In the socio-cognitive framework (O’Sullivan & Weir, 2011), scoring validity is defined by the questions “to what extent can we depend on the scores on the test? What do the numbers or grades mean?” Large-scale EFL tests which measure multiple skills (e.g., reading, listening, writing, and speaking) have been the target of investigation in previous research (Gu, 2015; In’nami & Koizumi, 2011; In’nami et al., 2016; Sasaki, 1993; Sawaki et al., 2009), while other studies have investigated the validity of single-skill assessments, including Bachman and Palmer (1981), who investigated the scoring validity of the FSI (Foreign Service Institute) speaking assessment. The rationale behind those studies has been not only to validate the scoring procedure but also to elucidate the fundamental issue of the dimensionality of language ability (Bachman & Palmer, 1981; Gu, 2015; In’nami et al., 2016; Oller Jr., 1980; Sawaki et al., 2009). This aspect of validity is important because the actual score reporting policy (e.g., computing an overall score from individual skill scores) must be supported by the psychometric properties of the target assessment.

Oller Jr. (1980) once argued that one’s language proficiency, like one’s intelligence, can be explained by a single unitary factor, based on studies which showed strong correlations among various language tasks. However, later studies (Bachman & Palmer, 1981; Gu, 2015; In’nami et al., 2016; Sawaki et al., 2009) have found a multi-componential factor structure to be plausible in explaining one’s language proficiency, supporting the validity of producing a separate score for each skill as well as a composite score.

Sawaki et al. (2009) investigated the factor structure of the TOEFL iBT using CFA and found that the higher-order factor model was the best-fitting model (CFI = 0.98, RMSEA = .022 [.021, .022]). In’nami and Koizumi (2011) investigated the factor structure of the TOEIC listening and reading test using CFA and found that the correlated factor model, which hypothesized reading and listening as two correlated factors, was the best-fitting model (CFI = 0.972, RMSEA = .065 [.042, .088]). Because the higher-order and correlated models cannot be statistically distinguished when the number of correlated first-order factors is three or fewer, and only the reading and listening components were evaluated, it was difficult to compare the correlated model with more complex models, including the higher-order factor model. However, the authors argued that the high correlation (r = 0.87) between the reading and listening factors suggested the existence of a higher-order factor.

Following the previous studies on the factor structure of standardized English tests, investigating the factor structure of the TEAP, which is the focus of this study, is an essential aspect of validation. In’nami et al. (2016) hypothesized four competing models (i.e., a unitary model, a correlated factor model, a higher-order model, and a receptive-productive model). The single-factor or unitary model hypothesizes that all items from all four skills load on one factor. The correlated four-factor model hypothesizes that each item can be explained by a skill-specific factor that is correlated with the other skill-specific factors. The higher-order factor model hypothesizes the presence of a general higher-order factor with four lower-order factors corresponding to the four assessed language skills. The receptive-productive model hypothesizes the presence of two factors, a receptive factor and a productive factor: the former corresponds to the receptive skills (i.e., reading and listening), and the latter corresponds to the productive skills (i.e., writing and speaking). In order to validate the proposed use of TEAP scores for college admission purposes, the authors investigated the factor structure of the TEAP and its relationship to the TOEFL iBT based on data gathered from 100 college students and confirmed the existence of a higher-order structure. They found that a structural model hypothesizing a higher-order factor, which governs the four first-order factors, fit the data best (CFI = 0.932, RMSEA = .014 [.000, .022]) when compared to a unitary model, a correlated factor model, and a receptive-productive model. They argued that the results supported computing and reporting a score for each of the four English skills of reading, listening, writing, and speaking on the TEAP test as well as a composite score for admission purposes.

Most previous studies on large-scale tests (e.g., the TOEFL iBT and the TEAP) suggest the existence of a higher-order general English proficiency factor under which skill-based factors of reading, listening, writing, and speaking are located (Gu, 2015; In’nami et al., 2016; Sawaki et al., 2009).

Test-taker perception

In the socio-cognitive framework (O’Sullivan & Weir, 2011), criterion-related validity is defined by the question “What external evidence is there outside the test scores themselves that the test is doing a good job?” As this aspect of validity focuses on the extent to which test scores reflect a suitable externally measured variable of performance or demonstration of abilities similar to those included in the test, the appropriateness of the selected variable determines the result of the investigation. Previous studies on large-scale tests have used variables such as scores from other tests which measure a similar construct (e.g., ETS, 2010), students’ evaluations of classroom activities in terms of importance (e.g., Sawaki et al., 2009), or students’ evaluations of their own English proficiency (e.g., Powers & Powers, 2015; Runnels, 2016). Those variables, together with the target test scores, have been used to demonstrate the criterion-related aspect of the validity of the target test. One important variable when considering the criterion-related validity (O’Sullivan & Weir, 2011) of a test is learner beliefs, or students’ perceptions of the target language or test, because what language learners believe affects how they engage with daily learning activities or practice for the target test (Dornyei, 2005). This variable is especially important in the context of this study, where the new policy of English education (MEXT, 2015) aims at balanced teaching and learning of the four English skills in the classroom. Dornyei (2005) argued that learner beliefs greatly affect behavior: a learner who believes in a particular method of learning might reject another.

Previous studies (Horwitz, 1988; Eccles & Wigfield, 2002; Sawaki et al., 2009; Xie & Andrews, 2013; Xie, 2015) on learner beliefs have posited various constructs regarding learners’ beliefs about language learning. Horwitz (1988) defined learner beliefs as learners’ opinions on a variety of issues and controversies related to language learning and developed the Beliefs about Language Learning Inventory (BALLI). The BALLI was developed to validate the existence of learner beliefs and their impact on language learning, with scales for five types of learner beliefs: (a) difficulty of language learning, (b) foreign language aptitude, (c) the nature of language learning, (d) learning and communication strategies, and (e) motivation and expectations. In the field of psychology, Eccles and Wigfield (2002) posited an expectancy-value model in which learners’ achievement performance, persistence, and choice are affected by their expectancy-related (e.g., “Can I do it?”) and task-value (“Do I want to do it?”) beliefs. In the field of language testing, based on the expectancy-value model, Xie and Andrews (2013) defined expectation and value as learners’ beliefs about how well they will do on upcoming tasks and how much they value the upcoming tasks as desirable. In their study (Xie & Andrews, 2013), conducted with 872 Chinese test takers of the College English Test (CET), participants responded to a questionnaire asking about their perceived expectations (e.g., “If I prepare for it in appropriate ways, I believe I will pass CET4.”) and values (e.g., “In order to answer questions correctly, I must understand the key points in reading.”) toward the target test, together with their test preparation activities. Based on SEM, the authors found a significant effect of learners’ perceived test values on their actual test preparation activities, confirming the positive impact of language learners’ beliefs on their language learning practices before the test.

Sawaki et al. (2009) examined the criterion-related validity of the Test of English as a Foreign Language Internet-based Test (TOEFL iBT) listening section by examining its relationship to a criterion measure designed to reflect language-use tasks that university students encounter in everyday academic life. The design of the criterion measure was based on students’ responses to a survey on the frequency (i.e., how often learners engage in a task) and importance (i.e., how important a task is for learners to perform well in class) of various classroom tasks that require academic listening. The authors found a significant positive correlation between the listening section score of the TOEFL iBT test and the students’ responses to the survey (r = 0.64 for all participants), suggesting the usefulness of using students’ survey results to examine the criterion aspect of test validity.

Xie (2015) investigated the impact of learner beliefs on the test scores of the CET by conducting a study in which about 800 Chinese test takers responded to two questionnaires asking about (1) their perceptions of the skills necessary to answer test questions correctly and (2) their test preparation activities before taking the CET. Based on SEM, the author found a significant path (β = 0.389, p < .01) from test-taker perception to test preparation.

Finally, previous studies (Peacock, 2001; Li, 2021) have focused on the longitudinal change of learner beliefs. Peacock (2001) examined the longitudinal change in beliefs about second language learning of ESL teachers over their 3-year program using the BALLI and found nonsignificant differences over the 3 years. In a similar vein, Li (2021) investigated the longitudinal change of Chinese EFL learners’ beliefs upon arrival at university (survey 1) and a year after arrival (survey 2) by asking their degree of agreement with statements on (1) the difficulty of language (perceptions about the difficulty of learning a foreign language in general [e.g., it is difficult for me to take part in group discussion in English]), (2) the nature of language learning (perceptions about a wide range of issues concerning the nature of learning a foreign language [e.g., to learn English means doing a lot of repetition and practice]), and (3) autonomy in language learning (perceptions about readiness to be autonomous in learning a foreign language [e.g., I believe that I should find my own opportunities to use English]). The study found a significant (p < .001) increase in agreement with all three types of questions, possibly reflecting the course of study the learners had gone through for a year.

In summary, investigating the relationship between test takers’ test scores and their self-assessments or beliefs has been an important part of validation, because the English proficiency measured by a test should be consistent with test takers’ perceptions of, or beliefs about, the target test and their study practices. Previous studies on large-scale tests (Powers & Powers, 2015; Ross, 1998; Xie, 2015; Xie & Andrews, 2013; Sawaki & Nissan, 2009) have found significant relationships between test scores and learners’ self-assessments and beliefs. In addition, previous studies (Li, 2021; Peacock, 2001) examining the longitudinal change of learner beliefs have found mixed results in terms of the degree of change over a certain period of time, yet these studies examined the change in a cross-sectional manner in which participants responded to the questions only once.

Purposes of the study

As an attempt to address this research gap, this study is intended to connect the often separately investigated scoring and criterion-related aspects of validity (O’Sullivan & Weir, 2011) of the TEAP. This study is also intended to further investigate the dimensionality of EFL proficiency based on four-skill English tests, the longitudinal change of test takers’ perceived values toward a test, and the link between test takers’ perceived values and the measured constructs of the TEAP.

Thus, this study poses the following three research questions:

  1. Which of the four models (unitary, correlated four factor, higher order, and receptive-productive) best represents the test construct of the TEAP?

  2. To what degree do test takers’ perceptions (task value) toward the measured construct of the TEAP change over time before and after entering university?

  3. To what degree are test takers’ perceptions (task value) toward the measured construct of the TEAP related to the factor structure of the TEAP?

The first question is expected to shed light on the test dimensionality issue (Gu, 2015; In’nami et al., 2016; Kamiya, 2017; Oller Jr., 1980; Sawaki et al., 2009), which can further contribute to the debate on the existence of a higher-order factor structure of EFL proficiency measured by a four-skill test. The first research question also paves the way for the following research questions, because they are addressed based on its findings in terms of the factor structure of the TEAP. The second question will contribute to the discussion of the longitudinal change of test takers’ perceived value of a test (Li, 2021; Peacock, 2001; Xie & Andrews, 2013), which has been found to be one of the factors positively affecting test takers’ test preparation activities. Finally, the third question, which builds on the first and second research questions, will also contribute to the discussion of the link between test takers’ perceptions and measured English proficiency (Xie, 2011; Xie & Andrews, 2013) by investigating the longitudinal change of test takers’ perceptions and their relationship to the proficiency measured by the TEAP.

Method

Participants

A total of 2490 first-year undergraduate Japanese learners of English enrolled at a private university in Tokyo participated in this study. Of these students, the data of 273 students were excluded casewise because they did not complete the study. The English proficiency of the participants, based on the self-reported overall NCT scores available from 1532 of them (the full marks of the written and listening sections are 200 and 50, respectively; written mean = 165.3, SD = 28.2; listening mean = 43.7, SD = 8.02), could be considered to be on the higher end relative to Japanese third-year high school students overall (written mean = 118.9, SD = 41.1; listening mean = 33.16, SD = 9.4) (National Center for University Entrance Examinations, 2017), though these scores represent only about 70% of all participating students. A questionnaire on the skill of English that the participants focused on most at high school showed that 80%, 5%, 8%, and 7% of participants focused mostly on reading, listening, speaking, and writing, respectively. As for the skill focused on most at university, 19%, 18%, 48%, and 15% of participants reported focusing mostly on reading, listening, speaking, and writing, respectively.

Instrument

The Eiken Foundation of Japan provided a mock version of the TEAP which was equivalent to the actual TEAP tests administered at test centers in terms of the format, administration, content, and difficulty. The reading and listening sections are designed to measure the understanding of short and long passages with visual information including graphs and charts. The speaking section is based on face-to-face, one-on-one interviews and includes both monologue and dialogue tasks on various issues. The writing section includes both summary and integrated tasks. In this study, item-level dichotomous responses were obtained for reading and listening skills, while criterion-level ratings were obtained for speaking and writing skills (see Table 1).

Table 1 Structure of the TEAP

A set of questionnaire items (10 questions in total) was prepared based on previous studies (Sawaki & Nissan, 2009; Xie & Andrews, 2013) to investigate test takers’ perceived value toward each section of the TEAP. Instead of asking about the generic value of the test as a whole, as was done in Xie and Andrews (2013), each question asks about the usefulness of a construct measured by each section of the TEAP on a 6-point Likert scale of agreement: 1 = I strongly think so and 6 = I strongly do not think so.
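Because the scale runs from 1 = “I strongly think so” to 6 = “I strongly do not think so”, lower raw responses indicate stronger agreement. The minimal sketch below assumes that responses are reverse-coded before analysis so that higher scores mean stronger perceived value; the file and column names are hypothetical, and the article does not report the exact coding used.

```python
import pandas as pd

# Hypothetical file and column names; q1-q10 hold raw 6-point responses
# (1 = "I strongly think so", 6 = "I strongly do not think so").
responses = pd.read_csv("perceived_value_responses.csv")
items = [f"q{i}" for i in range(1, 11)]

# Assumed reverse-coding so that higher values indicate stronger perceived
# value (the article reports the scale anchors but not the coding used).
responses[items] = 7 - responses[items]

# Item-level means and standard deviations (cf. Table 4).
print(responses[items].agg(["mean", "std"]).T.round(2))
```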

Procedure

First, in April 2015, just a few days after entering university, students responded to a set of questionnaire items (see Table 2) which asked about their perceptions (perceived value at high school, or PVH) of the constructs measured by the TEAP at the section level. Then, in January 2016, 8 months after they started taking university courses, the same group of students took the mock version of the TEAP and responded to the same questionnaire (perceived value at university, or PVU). Students were asked to participate in this series of studies as a mandatory part of the university’s educational program outside regular class hours.

Table 2 Questionnaire to measure test takers’ perception toward test construct

Analysis

To address the first research question, the TEAP item-level raw scores were obtained for each participant. Based on the previous discussion on the dimensionality of EFL proficiency measured by four-skill tests (Gu, 2015; In’nami & Koizumi, 2011; In’nami et al., 2016; Kamiya, 2017; Sawaki et al., 2009) and on the design of the TEAP, which outputs four separate skill scores (reading, listening, writing, and speaking) and an accumulated overall score, four models were hypothesized: a correlated four-factor model, a single-factor model, a higher-order factor model, and a receptive-productive model. The correlated four-factor model (Fig. 1) hypothesizes the presence of four correlated factors corresponding to the four assessed skills. This model assumes that the variance of each item can be explained by a skill-specific factor that is correlated with the other skill-specific factors. Previous studies have found this model to fit the data about as well as the higher-order factor model (Gu, 2015; In’nami et al., 2016; Sawaki et al., 2009).

Fig. 1
figure 1

TEAP correlated four-factor model. R, reading factor. L, listening factor. W, writing factor. S, speaking factor

The single-factor model specifies that all items from all four skills load on one factor. This model assumes that the variance of each item can be explained by a single general factor, as posited by Oller Jr. (1980). Figure 2 shows the TEAP single-factor model.

Fig. 2
figure 2

TEAP single-factor model. G, general factor

The higher-order factor model hypothesizes the presence of a higher-order factor that governs four lower-order factors corresponding to the four assessed language skills. This model assumes that the variance of each item can be explained by a skill-specific factor that is in turn governed by the higher-order factor; previous studies have chosen this model as the final model based on both statistical and theoretical considerations (Gu, 2015; In’nami et al., 2016; Sawaki et al., 2009). Figure 3 shows the TEAP higher-order factor model.

Fig. 3
figure 3

TEAP higher-order factor model. G, general factor. R, reading factor. L, listening factor. W, writing factor. S, speaking factor

The receptive-productive model hypothesizes the presence of two factors: a receptive factor and a productive factor. This model assumes that the ability, or the factor structure of the TEAP, is separable into receptive (i.e., reading and listening) and productive (i.e., writing and speaking) components; previous studies on the factor structure of four-skill English tests have posed this as one of the competing models (In’nami et al., 2016; Kamiya, 2017). Figure 4 shows the TEAP receptive-productive factor model.

Fig. 4
figure 4

TEAP receptive-productive factor model. Rec, receptive skill factor. Pro, productive skill factor
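To make the four competing structures concrete, the sketch below writes them in lavaan-style model syntax (as used, for example, by the Python package semopy or by lavaan in R). This is an illustration only: the item names are placeholders rather than the actual TEAP item identifiers, only three indicators per factor are shown, and the study itself estimated the models in Mplus with the WLSMV estimator.

```python
# Illustrative lavaan-style specifications of the four competing models.
# Item names (r1..., l1..., w1..., s1...) are placeholders, not the actual
# TEAP items; each factor is shown with three indicators for brevity.

single_factor = """
G =~ r1 + r2 + r3 + l1 + l2 + l3 + w1 + w2 + w3 + s1 + s2 + s3
"""

correlated_four_factor = """
R =~ r1 + r2 + r3
L =~ l1 + l2 + l3
W =~ w1 + w2 + w3
S =~ s1 + s2 + s3
"""  # the four first-order factors are allowed to covary

higher_order = """
R =~ r1 + r2 + r3
L =~ l1 + l2 + l3
W =~ w1 + w2 + w3
S =~ s1 + s2 + s3
G =~ R + L + W + S
"""  # a general factor governs the four skill factors

receptive_productive = """
Rec =~ r1 + r2 + r3 + l1 + l2 + l3
Pro =~ w1 + w2 + w3 + s1 + s2 + s3
"""

# Fitting one of the models with semopy (assumed API; requires a DataFrame
# `item_data` of item-level responses, which is not provided here):
# from semopy import Model, calc_stats
# model = Model(higher_order)
# model.fit(item_data)
# print(calc_stats(model))   # fit statistics such as CFI, TLI, RMSEA
```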

To address the second research question, test takers’ responses to the questionnaire, both PVH and PVU, were analyzed by evaluating the model-data fit of the proposed model (see Fig. 5). Figure 5 shows the model in which the PVH affects the PVU.

Fig. 5
figure 5

Test takers’ perception longitudinal model. PVH, perceived value at high school. PVU, perceived value at university

For the third research question, the model-data fit of a proposed model in which the final model from the first research question was correlated with the PVH and PVU factors (see Fig. 6) was investigated.

Fig. 6
figure 6

Test takers’ perception and TEAP higher-order model. PVH, perceived value at high school. PVU, perceived value at university. G, general factor. R, reading factor. L, listening factor. W, writing factor. S, speaking factor
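In the same illustrative notation, the structural models for the second and third research questions might be sketched as follows. The item names are again placeholders (the actual questionnaire has 10 items per administration), and the links between the TEAP general factor and the PVH and PVU factors are written as correlations here, following the description of Fig. 6; the original analysis was carried out in Mplus.

```python
# RQ2: longitudinal model in which perceived value at high school (PVH)
# predicts perceived value at university (PVU). Placeholder item names.
longitudinal_model = """
PVH =~ pvh_q1 + pvh_q2 + pvh_q3 + pvh_q4 + pvh_q5
PVU =~ pvu_q1 + pvu_q2 + pvu_q3 + pvu_q4 + pvu_q5
PVU ~ PVH
"""

# RQ3: the TEAP higher-order factor is allowed to correlate with the PVH
# and PVU factors (cf. Fig. 6). Placeholder item names throughout.
perception_and_teap_model = """
R =~ r1 + r2 + r3
L =~ l1 + l2 + l3
W =~ w1 + w2 + w3
S =~ s1 + s2 + s3
G =~ R + L + W + S
PVH =~ pvh_q1 + pvh_q2 + pvh_q3 + pvh_q4 + pvh_q5
PVU =~ pvu_q1 + pvu_q2 + pvu_q3 + pvu_q4 + pvu_q5
G ~~ PVH
G ~~ PVU
PVH ~~ PVU
"""
```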

For all research questions of this study, Mplus 8.4 (Muthén & Muthén, 1998–2022) was employed to estimate the parameters and evaluate the model fit. The parameters were estimated using the robust weighted least squares estimator (WLSMV) in order to deal with the item-level categorical data of the TEAP. This estimation method was chosen because the WLSMV estimator has been shown to produce accurate test statistics, parameter estimates, and standard errors under both normal and non-normal latent response distributions across various sample sizes (Byrne, 2012). In order to ensure model identification, the factor loadings of the first observed variables for each skill (reading, listening, writing, and speaking) were fixed to 1.0. The chi-square, comparative fit index (CFI), Tucker–Lewis index (TLI), root-mean-square error of approximation (RMSEA), and standardized root-mean-square residual (SRMR) were employed to evaluate model fit. These indices were chosen because, based on a meta-analysis of CFA studies, they are the most frequently reported (In’nami et al., 2016; In’nami & Koizumi, 2011; Kamiya, 2017; Muthén, 2004). Model fit was evaluated by a nonsignificant chi-square, a CFI and TLI of 0.95 or above, an RMSEA of 0.06 or below, and an SRMR of 0.08 or below.
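As a quick reference, the descriptive-fit cutoffs stated above can be wrapped in a small helper function. This is purely illustrative and not part of the original analysis; the chi-square test is evaluated separately, and the fit values passed in the example are hypothetical.

```python
def meets_fit_criteria(cfi: float, tli: float, rmsea: float, srmr: float) -> dict:
    """Apply the cutoffs used in this study: CFI and TLI of 0.95 or above,
    RMSEA of 0.06 or below, and SRMR of 0.08 or below."""
    return {
        "CFI >= 0.95": cfi >= 0.95,
        "TLI >= 0.95": tli >= 0.95,
        "RMSEA <= 0.06": rmsea <= 0.06,
        "SRMR <= 0.08": srmr <= 0.08,
    }

# Hypothetical fit values, for illustration only.
print(meets_fit_criteria(cfi=0.96, tli=0.95, rmsea=0.04, srmr=0.05))
```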

Results

Table 3 shows the results concerning the first research question: CFA fit for the four proposed models (correlated, higher order, single, and receptive-productive). Overall, the correlated model and the higher-order model showed better model-data fit than the other two models (single and receptive-productive).

Table 3 Fit indices for the models

The difference between the correlated and higher-order models was small. This finding was consistent with that of In’nami et al. (2016), who found very similar model-data fit indices for the two models. In addition, the result was also similar to Sawaki et al. (2009), who found similar model-data fit indices for the TOEFL iBT correlated four-factor model and higher-order model.

For the higher-order model, all the estimated parameters were found to be significant (see Table 9 in Appendix), including the paths from the general higher-order factor to the individual skill factors (reading, listening, writing, and speaking; 0.619–0.946). Standardized parameter estimates for the TEAP correlated-factor model were all significant (see Table 10 in Appendix), including the correlations between factors (from 0.457 to 0.766).

Following the previous literature in which two nested models were compared (In’nami et al., 2016; Kamiya, 2017; Sawaki et al., 2009), these two models (correlated and higher order) were further compared using the Mplus DIFFTEST command (Muthén & Muthén, 1998–2022), which is used when the data are dichotomous or categorical. The DIFFTEST result showed that the correlated-factor model was significantly better than the higher-order model (χ2 difference = 57.231, df difference = 2, p = 0.000).
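As an aside, the reported difference values can be related to a p-value as in the minimal sketch below. Note that with WLSMV estimation the difference statistic cannot be obtained by simply subtracting the two model chi-squares, which is why the DIFFTEST procedure is needed; the sketch only illustrates how a given chi-square difference and its degrees of freedom map onto a significance level.

```python
from scipy.stats import chi2

# Difference values reported by the DIFFTEST procedure above.
chi2_diff, df_diff = 57.231, 2

# Upper-tail probability of the chi-square difference.
p_value = chi2.sf(chi2_diff, df_diff)
print(f"p = {p_value:.2e}")  # effectively zero, consistent with p = .000
```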

However, the difference between the two models in terms of CFI, TLI, RMSEA, and SRMR was minimal (Table 3), showing that the two models were practically equivalent. As the higher-order model was more parsimonious than the correlated-factor model and consistent with previous research (In’nami et al., 2016), the higher-order model was chosen as the final model. This argument is in line with Sawaki et al. (2009): even though a significant chi-square difference test (correlated-factor model vs. higher-order model) implied that the correlated-factor model was better than the higher-order model, the difference in fit indices between the TOEFL iBT higher-order model and correlated-factor model was minimal, which allowed them to make the final decision that the TOEFL iBT higher-order model was the best model. In addition, the result fits well with the current score reporting policy in which a composite score is reported together with the scores of each individual skill.

Table 4 shows the descriptive statistics for the questionnaire items on test-taker perception measured at high school and at university. Across all 10 questionnaire items, on average, the degree of value perceived by test takers at university was lower than the value perceived at high school, while the variance at university was greater than at high school.

Table 4 Descriptive statistics for questionnaire items on test-taker perception

Table 5 shows the fit of the structural equation model (SEM) for the test takers’ perception longitudinal model.

Overall, the model showed a decent model-data fit. Table 6 shows the standardized parameter estimates for the test takers’ perception longitudinal model. All the estimated parameters, including the path from PVH (α = 0.90) to PVU (α = 0.98), were significant, showing that the value perceived by test takers at high school positively affected the value perceived at university.

Table 5 Fit indices for the proposed model

Table 7 shows the fit of the SEM for the test takers’ perception and the TEAP higher-order model. Overall, the model showed a decent model-data fit. Table 8 shows the standardized parameter estimates for this model. All the estimated parameters were positive and significant, including the correlations of the TEAP general factor with the PVH and PVU factors and the correlation between PVH and PVU. The correlation of the general factor with PVH (r = 0.178) was higher than that with PVU (r = 0.131), showing that the value perceived at high school had a stronger relationship with the measured construct than the value perceived at university. In addition, the parameter estimates from each skill factor were all significant, as were those from the higher-order general factor to the skill factors.

Table 6 Standardized parameter estimates for the test takers’ perception longitudinal model
Table 7 Fit indices for the test takers’ perception and the TEAP higher-order model
Table 8 Standardized parameter estimates for the test takers’ perception and the TEAP higher-order model

Discussion and conclusion

In order to further collect validity evidence for the TEAP test, the author examined the following: (1) the factor structure of the TEAP to investigate scoring validity, (2) the degree of longitudinal (high school and university) change of test takers’ perceived value toward the measured tasks, and (3) the relationship between the perceived value factors and the TEAP factor to investigate criterion-related validity.

For the first research question, confirmatory factor analysis was conducted on the data collected from 2217 first-year Japanese university students. This research question asked which factor model (single-factor, correlated four-factor, higher-order factor, or receptive-productive factor) best explains the TEAP data. Of these four item-level response models, the higher-order model best explained the TEAP data. This result also supports the current score reporting policy in which a composite overall score is computed from the individual skill scores. This result replicated a previous study (In’nami et al., 2016) which also examined the factor structure of the TEAP, providing added evidence for the scoring validity of the TEAP. The estimated loadings of the first-order factors on the higher-order factor (0.807, 0.946, 0.619, and 0.718 for reading, listening, writing, and speaking, respectively) were similar to those found in the previous study (In’nami et al., 2016), also suggesting the generalizability of the factor structure of the TEAP.

One explanation for this finding is that both studies were conducted on data from Japanese university students with similar regional representativeness (i.e., the Tokyo area), while the results could differ if the data were collected from other age groups of students (e.g., high school students). In addition, this study adds evidence, alongside previous studies (e.g., Sawaki et al., 2009) on the dimensionality of language ability (Gu, 2015; Oller Jr., 1980; Sawaki et al., 2009), for the existence of a higher-order factor structure underlying four-skill English tests.

As for the second research question, confirmatory factor analysis was used on the same data as for the first research question. The second research question concerned the degree to which test takers’ perceptions (task value) change over time before and after entering university. Descriptive statistics of the questionnaire items showed that the degree of agreement at high school was higher than at university, while the variance of agreement was greater at university than at high school across all 10 items. Standardized factor loadings across tasks for the perceived value at university were higher (from 0.848 to 0.937) than those for the perceived value at high school (from 0.637 to 0.878), which showed a wider range of loadings. In addition, the perceived value factor at high school toward the measured tasks (PVH) positively (β = 0.248) affected the perceived value at university (PVU). Based on these results, this study sheds light on the nature of perceived value across different tasks and its longitudinal change (from high school to university). The results suggest that students perceived the value of tasks to varying degrees and that the perceived value at high school positively affected that at university. The reason behind this longitudinal change of perceived value might be that students went through a series of courses taught in English for 8 months at university, perceiving various aspects of each English skill. The majority of students (80%) reported that high school English language instruction focused most on the reading skill, while the trend changed after 8 months, when the four skills were more equally focused upon. This trend probably reflects the nature of the actual courses at university, which gave students the opportunity to learn various aspects of each English skill. This result might also suggest positive washback from the TEAP, which assesses the four skills of English, leading to more balanced teaching and learning of English at high school.

Regarding the third question, this study found that the TEAP higher-order general factor positively correlated with the perceived value factors both at high school (r = 0.178) and at university (r = 0.131). This result is consistent with Xie and Andrews (2013), who found a positive impact of perceived test value on students’ test preparation, while this study found a positive relationship between perceived skill-based value and the measured test result. This result also adds evidence to the discussion of the criterion-related aspects of test validity, suggesting the possibility of including self-reported perceived value as a criterion to examine a test’s validity.

Implications and future research

First, this study identified the higher-order model as the best representation of the underlying factor structure of the TEAP test, which supports the scoring validity of the TEAP as test takers receive not only a score for each skill but also a composite score. This is because universities usually consider admissions based on the submitted composite scores rather than looking at each skill score. However, future research is required to investigate the generalizability of this study by extending the participants to broader populations including students with more diverse English proficiency.

Second, this study found that students’ perceived value toward the tasks on a test changes over time before and after entering university in terms of strength and variation. This suggests that we need to take into account that students tend to hold varying degrees of perceived value toward different task types, reflecting their study practices at each stage of their English study. Additionally, it could be important to raise students’ awareness of each English skill at high school, because the perceived value at high school could positively affect that at university. However, future studies are required to further investigate the longitudinal change of students’ perceived value over time, possibly by extending the interval from university entrance to the time students graduate. In addition, more qualitative aspects of students’ perceived value could be investigated by conducting interviews. Such a study would shed light on the rationale behind test takers’ perceptions and their changes over time after entrance to university.

Third, this study identified a positive relationship between test takers’ perceived value toward the tasks measured by the TEAP and the measured construct of the TEAP, showing the importance of perceived value toward each aspect of the test constructs when considering the validity of a test. Future studies could further include the amount of time spent on test preparation activities.

Availability of data and materials

The data that support the findings of this study are available from the Eiken Foundation of Japan, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data may however be available from the author upon reasonable request and with permission of the Eiken Foundation of Japan.

Abbreviations

TEAP:

Test of English for Academic Purposes

MEXT:

Ministry of Education, Culture, Sports, Science and Technology

CEFR:

Common European Framework of Reference for Languages

NCT:

National Center Test

CT:

Common Test

IELTS:

International English Language Testing System

GTEC:

Global Test of English Communication

TOEFL iBT:

Test of English as a Foreign Language Internet-based Test

CRELLA:

Center for Research in English Language Learning and Assessment

EFL:

English as a Foreign Language

CFA:

Confirmatory factor analysis

TOEIC:

Test of English for International Communication

CFI:

Comparative fit index

RMSEA:

Root-mean-square error of approximation

ETS:

Educational Testing Service

BALLI:

Beliefs about Language Learning Inventory

CET:

College English Test

SEM:

Structural equation modeling

ESL:

English as a Second Language

SD:

Standard deviation

PVH:

Perceived value at high school

PVU:

Perceived value at university

WLSMV:

Weighted least squares mean and variance adjusted

TLI:

Tucker–Lewis index

SRMR:

Standardized root-mean-square residual

References


Acknowledgements

The author thanks all the students who participated in this study. The author would also like to express sincere gratitude to Dr. Yasuyo Sawaki of the Waseda University Graduate School of Education for her helpful discussions and insightful comments on the manuscript.

Funding

Not applicable

Author information


Contributions

The author read and approved the final manuscript.

Corresponding author

Correspondence to Keita Nakamura.

Ethics declarations

Competing interests

The author is employed by the Eiken Foundation of Japan, which develops and administers the TEAP test.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


Tables 9 and 10

Table 9 Standardized parameter estimates for the TEAP higher-order model
Table 10 Standardized parameter estimates for the TEAP correlated-factor model

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Nakamura, K. Investigating the factor structure of the Test of English for Academic Purposes (TEAP) and its relation to test takers’ perceived test task value. Lang Test Asia 12, 35 (2022). https://doi.org/10.1186/s40468-022-00183-4
