Is the Common Test for University Admissions in Japan enough to measure students’ general English proficiency? The case of the TOEIC Bridge

This study investigated to what extent the scores of two English tests are correlated with each other, namely, the English test of the Common Test for University Admissions (Common Test, henceforth) in Japan and the TOEIC Bridge, a commercially available English test developed by Educational Testing Service (ETS) that measures the four skills of listening, reading, speaking, and writing. Moreover, this study examined to what extent the two tests’ constructs overlap from the viewpoint of L2 competence. In total, 128 university freshmen and high school seniors took the Common Test at the official venues and also the TOEIC Bridge at the researcher’s university (n = 92) or at home (n = 36) a few months later. Results indicated that the scores of the corresponding skills are moderately correlated with each other across the two tests (Reading = .548; Listening = .646; Total = .732). Confirmatory factor analyses demonstrated that the three models of test constructs (unitary, correlated skills, correlated tests) fit the data to a statistically similar degree. On the basis of substantive and statistical results, however, we claim that the correlated skills model should be chosen as the best-fit model and, consequently, that the productive skills should be measured in addition to the Common Test.


Background
A majority of Japanese high school students who wish to continue studying in tertiary schools are required to take a test held nationally, organized by the National Center for University Entrance Examinations in Japan (National Center, henceforth). This test was first administered in 1979 and was updated to the National Center Test for University Admissions (Center Test, henceforth) in 1990. Then, the newest version, the Common Test for University Admissions (Common Test, henceforth), was launched in 2021. The Common Test's English test consists of two sections, reading and listening, and is held face-to-face at around 700 testing centers all over Japan. Because of its large number of test-takers, over 500,000 annually, all of the questions are in a multiple-choice format for the sake of time efficiency and fairness (see Table 1).
The Ministry of Education, Culture, Sports, Science, and Technology (MEXT, henceforth) originally planned to replace the English test within the Center Test with commercially available, private English tests of four skills when the Center Test was switched to the Common Test. The plan responded to the criticism that the Center Test measured only the receptive skills (reading and listening), despite the fact that the national curriculum guidelines stipulate that not only the receptive skills but also the productive skills (speaking and writing) must be cultivated in a balanced manner in Japanese high schools (MEXT, 2018a). This motivated Kamiya (2017) to conduct a study to investigate the score compatibility between the Center Test and one of the four-skilled private tests, namely the TOEFL Junior Comprehensive. The overall results supported the validity of a replacement of the former with the latter; however, the data also indicated that test-takers' general English proficiency alone accounted for the major portion of the scores, regardless of skills, across the two tests.
Due to several concerns raised during the transition period from the Center Test to the Common Test, such as difficulty in securing fairness across students with diverse economic statuses (i.e., opportunities to take private tests) and residential backgrounds (i.e., accessibility to testing venues) and score incompatibility between different private tests derived from distinct test constructs, MEXT officially announced in 2019 that the replacement would be postponed until 2024. However, partly because MEXT planned to use both the Common Test and private tests concurrently until then, a few major changes were made from the Center Test to the Common Test. First, the questions to measure "pronunciation and accent" and "grammar and usage" were deleted from the reading section because, arguably, (a) the former would be better measured in a speaking test and the latter in a writing test; (b) the knowledge and skills necessary to answer these questions do not emulate what is necessary for communication; and (c) they can be measured indirectly even in reading and listening sections (National Center for University Entrance Examinations, 2021). This essentially resulted in the section predominantly measuring reading comprehension skills alone. Second, in the Center Test, the scores were unevenly divided between the reading (200) and listening (50) sections, whereas the two sections are weighted evenly in the Common Test (100 and 100); however, the actual score allocation of each of these two sections is left to the discretion of each institution. After a series of meetings held by a special committee run by MEXT, however, MEXT announced in 2021 that the introduction of private tests at university entrance examinations would be aborted, mainly for the abovementioned reasons.
Due to these changes from the Center Test to the Common Test, the applicability of the results obtained in Kamiya (2017) to the Common Test was called into question, which is the rationale for conducting the present study.

TOEIC Bridge
Although it would have been ideal to use the same private test as Kamiya (2017), namely, the TOEFL Junior Comprehensive, for the sake of comparability of results, the TOEFL Junior Comprehensive was defunct at the time of conducting this study. Therefore, we needed to choose another private test. Among various candidates, we selected the TOEIC Bridge because it is designed to target beginning to lower-intermediate level learners, namely, those from the A1 to B1 levels of the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2020) (Schmidgall, 2021). According to the survey conducted by MEXT, only 0.3-0.4% of third-year high school students reached the B2 level (MEXT, 2018c). Thus, we considered the test appropriate from the perspective of difficulty (but see the Limitations section).
The TOEIC Bridge started in 2001 as a test to measure two skills only, reading and listening. However, it was upgraded to measure four skills starting in 2019. The test is held widely in around 35 countries (as of 2019; IIBC, personal communication, March 27, 2023). In 2021, there were 140,700 test-takers for the listening and reading tests and 34,900 for the speaking and writing tests (IIBC, n.d.). The test comprises four skill-based sections as shown in Table 2. The reading and listening tests can be taken either on paper or online, and all the questions are presented in a multiple-choice format. In the speaking test, test-takers record their voices through microphones in response to prompts. In the writing test, they unscramble words to complete sentences or type sentences or paragraphs. Both speaking and writing tests are held only online and their answers are evaluated by raters certified by Educational Testing Service (ETS), the organization that administers the TOEIC Bridge. Due to its recent introduction, to our knowledge, there have been only two attempts to compare its scores with the Common Test (and none with the Center Test). IIBC, which runs the TOEIC Bridge in Japan, has reported correlation coefficients of reading and listening scores between the TOEIC Bridge and the Common Test for 2 years in a row (IIBC, 2021, 2022). The results showed moderate to strong correlations in reading (r = 0.554, 0.592), listening (r = 0.490, 0.559), and the total of both sections (r = 0.623, 0.665). However, the scores of the Common Test were self-reported, not confirmed by official score reports; thus, the self-reported scores might not be accurate. More importantly, their data on the TOEIC Bridge were limited in scope, lacking speaking and writing scores. In sum, there has not been any attempt to explore the score relationships between the Common Test and four-skilled private tests.

Models of test constructs and structure of L2 abilities
This study seeks to unveil the test constructs of the Common Test and the TOEIC Bridge through the lens of the structure of L2 abilities. Because there are a host of studies on this topic, we restricted the selection to those that used at least one of the widely available private tests. As can be seen in Table 3, four major models of the structure of L2 abilities have been proposed as candidates. A unitary or unidimensional model (Fig. 1) presupposes that L2 abilities are a single construct. Therefore, all of the test scores will be subsumed under a single, unobserved latent variable. An uncorrelated model (Fig. 2) presupposes that L2 competence consists of multiple, divisible, first-order variables, such as receptive and productive skills, but that these variables are not highly correlated with each other. When these are highly correlated, it is called a correlated model (Fig. 3). Finally, when all of these first-order variables are subsumed under a single second-order variable, it is called a high-order, second-order, or hierarchical model (Fig. 4). These inquiries began with Oller (1979). He analyzed multiple data sets and consistently claimed that all data could be subsumed under a single dimension. He even stated, "the current practice of many ESL programs, textbooks, and curricula of separating listening, speaking, and reading and writing activities is probably not just pointless but in fact detrimental" (p. 458). His indivisibility hypothesis instigated a host of ensuing explorations, most, if not all, of which criticized Oller's use of principal component analysis (see Fouly et al., 1990) and instead adopted the more rigorous method of confirmatory factor analysis.
Since then, among numerous kinds of private tests, a lot of attention has been paid to the TOEFL, probably due to its large number of test-takers with various backgrounds (around a million test-takers in around 160 countries annually; ETS Japan, personal communication, May 30, 2023). This feature makes it easy to secure strong statistical power and to compare data among multiple diverse groups (Stricker & Rock, 2008). Through this line of research, a clear trend has appeared, which is that either a correlated or a higher-order model is acceptable, rejecting unitary and uncorrelated models (except Wilson, 2000, but see In'nami & Koizumi, 2012, for its possible reasons). When the number of first-order factors is two, a higher-order model cannot be identified. When it is three, these two models are statistically indistinguishable from each other; in such a case, a higher-order model is chosen based on the principle of parsimony. Thus, it is often impractical to decide which one is the best fit. As a consensus has almost been reached on the structure of L2 abilities, the momentum toward identifying that structure has waned, and we witness a shift of studies toward validating newly made tests, such as the TOEFL iBT and the TEAP, when they were released. Pertinent to the present study, however, is that there has been only one study on the TOEIC (In'nami & Koizumi, 2012), but only for listening and reading, not for speaking or writing, and none for the TOEIC Bridge.
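The identification argument in the paragraph above (two versus three first-order factors) can be made concrete by a simple parameter count. The following is a minimal sketch rather than part of any cited analysis. It assumes that the variance of the second-order factor is fixed to 1, so that the second-order part of the model estimates one loading per first-order factor, and it compares that number with the number of distinct covariances among the first-order factors. The function name and the returned labels are hypothetical.

```python
def higher_order_status(n_first_order: int) -> str:
    """Classify the second-order part of a higher-order model by counting.

    Assumption (illustrative): the second-order factor variance is fixed to 1,
    so one second-order loading is estimated per first-order factor.
    """
    if n_first_order < 2:
        raise ValueError("need at least two first-order factors")
    # Information available: distinct covariances among first-order factors.
    covariances = n_first_order * (n_first_order - 1) // 2
    # Parameters to estimate: one second-order loading per first-order factor.
    loadings = n_first_order
    if loadings > covariances:
        return "not identified"
    if loadings == covariances:
        return "equivalent to correlated model"
    return "testable against correlated model"


for k in (2, 3, 4):
    print(k, higher_order_status(k))
```

Under these assumptions, the higher-order structure only becomes empirically distinguishable from the correlated model once there are at least four first-order factors, which is in line with the note to Table 3.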
Somewhat unpredictably, in Kamiya (2017), which targeted the Center Test and the TOEFL Junior Comprehensive, although the correlated model was chosen as the best-fit model, the unitary model was found to fit almost equally well. However, although not reported in the article, when only the data of the TOEFL Junior Comprehensive were extracted and analyzed, the correlated model with two variables of receptive and productive skills was shown to be clearly better than the unitary model (e.g., SRMR = 0.0040 and 0.0132, respectively). This implies that the Center Test, especially its reading section, was measuring general English proficiency rather than reading abilities alone, which skewed the whole data set toward the unitary model. If so, since the pronunciation and grammar sections in the Center Test were excluded at the transition to the Common Test, we can predict that, in the case of the Common Test, the correlated model will demonstrate a better fit when compared to the unitary model.

Table 3
List of studies on test constructs and structure of L2 abilities using private tests. The table is not intended to be an exhaustive list of previous studies on this subject. TEAP = Test of English for Academic Purposes (available only for Japanese university applicants); TOEFL JC = TOEFL Junior Comprehensive; TOEIC LR = TOEIC Listening & Reading. "△" denotes that this model was chosen as the second best. "-" denotes that this model was not considered. The higher-order model was not considered when the number of first-order factors was less than four (see the main text for the rationale).

Research questions
Some additional explanations are necessary regarding the models that we consider. First, owing to the fact that there are only two first-order variables (e.g., receptive and productive skills), this study is incapable of identifying a higher-order model. Second, uncorrelated models will not be considered as they were unidentified in Kamiya (2017). Third, there will be two versions of correlated models to be examined, namely, skill-based and test-based models. The former follows the convention of past literature (receptive and productive skills). As for the latter, whereas the Common Test must strictly follow the national curriculum guidelines (MEXT, 2018a), the TOEIC Bridge has no such restriction for its worldwide administration. Dividing the test scores into test types, rather than skills, may therefore produce a better fit, so the test-based model was additionally considered. In sum, the present study was guided by the following two research questions.
1. How are the scores of the Common Test and the TOEIC Bridge correlated with each other?
2. Which of the three models (unitary, correlated skills, and correlated tests) best represents the test constructs of the Common Test and the TOEIC Bridge?

Participants
The original plan was to recruit only high school students, following the procedure in Kamiya (2017); however, due to the COVID-19 pandemic, the researcher could not get permission to do so from the university. Therefore, the students of the researcher's university were invited to participate throughout the study period. In the final year, though, this restriction was lifted, so a group of high school students participated. In total, 128 Japanese learners of English aged 18 or 19 participated in this study (118 females, 10 males), forming four groups as shown in Table 4. The university students were all freshmen. Efforts were made to recruit students with diverse majors in order to secure a wide range of English proficiencies and thereby ensure high reliability of the analysis (Mizumoto, 2014): international communication (n = 36), Japanese literature (n = 19), English literature (n = 18), fine arts (n = 10), and liberal arts (n = 10). The high school students were from five schools in the same district. This study was approved by the ethics committee of the researcher's university, and all of the participants read and signed the consent form before participating.

Procedure and analyses
For university students, the researcher sent solicitation emails to all freshmen right after enrollment. For high school students, the researcher first contacted the principals of 10 high schools for permission to recruit their third-year students. After all of them agreed, a flier was distributed to the students either face-to-face (paper) or online (PDF) by a teacher in each school. For both university and high school students, those who were interested in participation voluntarily contacted the researcher.
All of the participants took the Common Test at the official venues in January. High school students took the TOEIC Bridge in the following March, the month of their graduation. University students took it in the following May, a month after enrollment. Three groups took the TOEIC Bridge in a CALL Lab at the researcher's university. Due to the spread of the pandemic, following the regulations imposed by the university, those in 2021 needed to take it online at home using Zoom with a camera on, invigilated by the researcher. The scores of the Common Test were confirmed by the official score reports provided by the National Center. The scores of the TOEIC Bridge were confirmed by the official score reports provided by IIBC.
Analyses were conducted in three steps. First, Pearson's correlations were computed in order to see to what extent the score of each section and the total of the Common Test and the TOEIC Bridge are correlated. Second, confirmatory factor analyses (with maximum likelihood estimation) were conducted to detect the model that best fits the current data. Finally, chi-square difference tests were conducted in order to compare model fits. An exploratory factor analysis was not conducted beforehand, in accord with the recommendation against running one prior to a confirmatory factor analysis on the same data set, as doing so leads to model overfit (overly optimistic modeling) (e.g., Fokkema & Greiff, 2017).
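The third step, the chi-square difference test between nested models, can be sketched as follows. This is an illustrative sketch using only the Python standard library, not the software used in the study. The chi-square values and degrees of freedom in the usage lines are hypothetical placeholders, not values from Table 8, and the function names are our own.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then a recurrence in steps of 2.
    if df % 2 == 1:
        k, sf = 1, math.erfc(math.sqrt(half))
    else:
        k, sf = 2, math.exp(-half)
    while k < df:
        sf += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return sf


def chi2_difference_test(chi2_restricted, df_restricted, chi2_general, df_general):
    """Compare a restricted (simpler) model with a more general nested model."""
    delta_chi2 = chi2_restricted - chi2_general
    delta_df = df_restricted - df_general
    if delta_df <= 0:
        raise ValueError("the restricted model must have more degrees of freedom")
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# Hypothetical fit statistics for a unitary and a correlated model.
delta, ddf, p = chi2_difference_test(14.2, 9, 11.9, 8)
print(f"delta chi2 = {delta:.2f}, delta df = {ddf}, p = {p:.3f}")
```

A non-significant p value in such a test means that the more general model does not fit significantly better than the restricted one. The test requires the two models to differ in degrees of freedom, which is why the function raises an error otherwise.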

Descriptive statistics
Although the National Center tries to equalize the test difficulty of the Common Test across multiple years, there are variations in the mean scores because the scores are not adjusted to secure the same level of difficulty through score equating, a practice seen in the TOEFL and the TOEIC (Livingston, 2014). Therefore, the mean scores of the Common Test need to be examined to check whether the three versions of the Common Test used in the present study had approximately the same difficulty level. Table 5 shows the means and standard deviations of the Common Test across Japan in the 3 years when this study was conducted (National Center for University Entrance Examinations, n.d.-d). Although the mean scores admittedly varied among these 3 years, no score adjustment was made and all the scores were aggregated for the following reasons. First, the National Center stipulates that scores will be adjusted when averages across subjects differ by over 20 points (although this applies only to those tests conducted in the same year, and English is not a subject for the adjustment), but the widest gap in the present data was around eight points (61.80 - 53.81 = 7.99 for reading), which is much lower than the benchmark of 20. Second, one-way ANOVAs indicated that the total scores of the Common Test (p = 0.100) and the TOEIC Bridge (p = 0.113) were not significantly different across the four groups of participants.
Because the score equivalency of the TOEIC Bridge is established by score equating (Livingston, 2014), the English proficiency of these four groups can be assumed to be homogeneous. Since the groups' scores on the Common Test were also similar, the Common Test's score equivalency was presumed. Table 6 shows the descriptive statistics of the scores of the participants in this study.
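The two checks described above can be sketched in a few lines. This is an illustration only: the two reading means are those quoted in the text, whereas the group scores passed to the ANOVA function are hypothetical placeholders, and the function and variable names are our own. The sketch computes only the F statistic and its degrees of freedom; converting F to a p value requires the F distribution, which is available in standard statistical packages.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    group_means = [mean(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Check 1: widest gap in yearly reading means (values quoted in the text).
ADJUSTMENT_THRESHOLD = 20
highest_mean, lowest_mean = 61.80, 53.81
gap = highest_mean - lowest_mean
print(f"gap = {gap:.2f}, exceeds threshold: {gap > ADJUSTMENT_THRESHOLD}")

# Check 2: F statistic for four hypothetical groups of total scores.
hypothetical_groups = [
    [152, 160, 171, 148],
    [158, 165, 149, 170],
    [161, 155, 167, 150],
    [149, 172, 163, 157],
]
print(one_way_anova_f(hypothetical_groups))
```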

Pearson's product-moment correlation matrix
Table 7 shows the results of Pearson's product-moment correlation matrix. As expected, the highest correlations are observed between the scores of each section and its total score within the same test, except for the TOEIC Bridge writing, which marked a rather low coefficient (r = 0.661). More importantly, the scores of the corresponding skills also demonstrated relatively high coefficients across the two tests (reading = 0.548; listening = 0.646). Moreover, the total score showed an even higher coefficient (r = 0.732) (see Figs. 5, 6, and 7).
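One way to see why a total score can correlate more strongly across tests than its component sections is a small simulation. The sketch below is purely illustrative and rests on strong simplifying assumptions that are not claims about the present data: every section score is modeled as one shared proficiency plus independent measurement error of equal size, and all names and parameter values are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson's product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=5000, error_sd=1.0, seed=0):
    """Simulate two tests with reading and listening sections each."""
    rng = random.Random(seed)
    read_a, listen_a, read_b, listen_b = [], [], [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)  # shared proficiency (assumed)
        read_a.append(ability + rng.gauss(0, error_sd))
        listen_a.append(ability + rng.gauss(0, error_sd))
        read_b.append(ability + rng.gauss(0, error_sd))
        listen_b.append(ability + rng.gauss(0, error_sd))
    total_a = [r + l for r, l in zip(read_a, listen_a)]
    total_b = [r + l for r, l in zip(read_b, listen_b)]
    return {
        "reading": pearson_r(read_a, read_b),
        "listening": pearson_r(listen_a, listen_b),
        "total": pearson_r(total_a, total_b),
    }


result = simulate()
print({name: round(r, 3) for name, r in result.items()})
```

In this setup, summing two sections doubles the shared signal while the independent errors partly cancel, so the correlation between totals is expected to exceed the correlation between single sections.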

RQ1: Correlations between the Common Test and the TOEIC Bridge
The results of Pearson's product-moment correlations showed that the scores of the Common Test and the TOEIC Bridge are moderately correlated for reading (r = 0.548), listening (r = 0.646), and also total (r = 0.732). A higher correlation coefficient for the total score than for the individual skill (reading or listening) is commonly observed in other studies in which the total score is derived from the sum of the two skills (reading and listening) (IIBC, 2021, 2022) and of the four skills (Kamiya, 2017). Inspecting Figs. 5, 6, and 7, we surmise that this is probably because the total score better reflects the participants' English proficiency: combining the scores of multiple tests reduces the amount of measurement error, which decreases the deviations from the regression line. Plonsky and Oswald (2014) proposed new benchmarks of correlation coefficients for L2 studies, with 0.25 being weak, 0.40 being medium, and 0.60 being large. According to these criteria, all of these correlations can be said to be medium to large. These figures are roughly equal to the correlation coefficients obtained for the Common Test or the Center Test against several private tests, as can be seen in Table 9. However, according to Dorans (2004), a correlation coefficient of 0.866 is minimally necessary for a test to be replaced by another. Although the interchangeability of the Common Test and the TOEIC Bridge is not the objective of our inquiry, at least from a psychometric standpoint, their test constructs are distinct enough to warrant further examination of the reasons.
For the sake of comparison, we summarized the selected specifications of these two tests in Table 10, taken from multiple sources for the Common Test (National Center for University Entrance Examinations, n.d.-c, n.d.-d) and the TOEIC Bridge (Everson et al., 2021; Schmidgall, 2021; Schmidgall et al., 2019, 2021). From this table, it is clear that although these tests share some commonalities, such as the objective to deal with communication in real daily life contexts and the CEFR levels to be measured (A1-B1), there are a number of differences between them. First, the Common Test presumably targets the life of high school students because the questions are made considering the "situations in which students learn in the classroom, discover problems in their social and daily lives" (National Center for University Entrance Examinations, n.d.-b, pp. 1-2). In contrast, the TOEIC Bridge deals with adult life, which makes some of the questions irrelevant to high school students. For instance, IIBC provides sample questions about the TOEIC Bridge on their website (IIBC, n.d.). A listening question plays, "What color is your car?" A reading question (fill in the blank) reads, "We have received your order for twelve yellow roses." A speaking question asks test-takers to summarize the announcement made by a company president at a staff meeting. A writing question asks them to reply to a question, "what types of training or education do you think people will need to get well-paid jobs in the future?" High school students would probably never encounter a situation in which they are exposed to or use any of these sentences in their daily lives, even in their L1 (Japanese).
Second, the Common Test is expected to follow the national curriculum guidelines (MEXT, 2018a), whereas the TOEIC Bridge bears no such obligation. Thus, in the latter, some of the linguistic items may go beyond what is supposed to be covered in the former. For example, a sample reading section of the TOEIC Bridge (Educational Testing Service, 2020) has a choice of "An employee's retirement" in a multiple-choice question. Although this is a high-frequency phrase in the workplace, high school students may not be familiar with such a phrase.
Table 10 Selected specifications of the Common Test and the TOEIC Bridge
Objective. Common Test: To measure the skills to use knowledge of English vocabulary, expressions, grammar, and language functions appropriately in real-life communication according to the purpose, situation, and circumstance. TOEIC Bridge: To measure English language proficiency in the context of everyday adult life.
Dialects. Common Test: The USA, the UK, Japan. TOEIC Bridge: The USA, the UK, Canada, Australia.
Restrictions of guidelines. Common Test: Yes. TOEIC Bridge: No.
CEFR. Common Test: A1-B1. TOEIC Bridge: A1-B1.

Finally, the Common Test mainly consists of American English, and to a much lesser extent, British English and Japanese-accented English (National Center for University Entrance Examinations, n.d.-a). This is because Japanese students are used to American English for two reasons: (a) most, if not all, English textbooks used in Japanese schools are written in American English (e.g., Mitsumura Tosho, n.d.; Tokyo Shoseki, 2018), and (b) the majority of teachers who come from abroad to teach English are Americans.
For instance, in one of the largest programs for hiring foreign teachers, the JET (Japan Exchange and Teaching) Programme, as of 2023, Americans comprise 55% of the entire faculty (Council of Local Authorities for International Relations, 2015). In regard to the use of Japanese-accented English, because sharing the same L1 between speakers and listeners is known to facilitate comprehension (e.g., Tergujeff, 2023), high school students should find Japanese-accented English easier to comprehend. On the other hand, in addition to American and British English, the TOEIC Bridge contains the dialects of Canada and Australia, both of which are lacking in the Common Test. Because high school students are unfamiliar with Canadian and Australian English, they may have more difficulty understanding them than American English. All in all, these discrepancies in specifications between the two tests may have yielded correlation coefficients that are not high enough for the tests to be considered interchangeable. Looking at Table 9, we find that most of the correlation coefficients in the previous and the present studies did not reach the benchmark of 0.866 (Dorans, 2004). This makes sense because the test specifications required for university matriculation imposed on Japanese high school students should be quite distinct from those for assessing the English proficiencies of test-takers all over the world.
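The benchmark comparisons made in this section can be summarized in a short sketch. The cutoffs (0.25, 0.40, 0.60, and 0.866) and the three observed coefficients are taken from the text; the function names and the "below weak" label are our own. The reduction-in-uncertainty function reflects our understanding that the 0.866 benchmark corresponds to a 50% reduction in uncertainty, which is an assumption added here for illustration rather than something stated in the present study.

```python
import math

# Benchmarks for L2 research (Plonsky & Oswald, 2014), as quoted in the text.
PLONSKY_OSWALD = ((0.60, "large"), (0.40, "medium"), (0.25, "weak"))
# Minimum correlation for replacing one test with another (Dorans, 2004).
DORANS_MIN_R = 0.866


def label_correlation(r: float) -> str:
    """Label the size of a correlation coefficient by the L2 benchmarks."""
    size = abs(r)
    for cutoff, label in PLONSKY_OSWALD:
        if size >= cutoff:
            return label
    return "below weak"  # hypothetical label for values under 0.25


def reduction_in_uncertainty(r: float) -> float:
    """Proportional reduction in prediction error, 1 - sqrt(1 - r^2)."""
    return 1.0 - math.sqrt(1.0 - r * r)


observed = {"reading": 0.548, "listening": 0.646, "total": 0.732}
for name, r in observed.items():
    print(name, label_correlation(r), r >= DORANS_MIN_R,
          round(reduction_in_uncertainty(r), 3))
```

Read this way, even the total-score correlation of 0.732 falls well short of the level at which one test could stand in for the other.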

RQ2: Test constructs of the Common Test and the TOEIC Bridge
The results of confirmatory factor analyses revealed that the three models compared (unitary, correlated skills, correlated tests) explained the data equally well (see Table 8). The correlated skills model and the correlated tests model have the same degrees of freedom; therefore, a chi-square difference test cannot be conducted between these two models. Comparing the values in each index, the correlated skills model seems to be superior to the correlated tests model (e.g., RMSEA = 0.058 and 0.071, respectively). Moreover, in the correlated tests model, the two latent variables (Common_Test and TOEIC_Test) are highly correlated with each other, with a correlation coefficient of 0.98. Traditionally, when the correlation coefficient between two latent variables is over 0.9, they are regarded as being statistically indistinguishable (Gu, 2015). Therefore, we deem that the correlated tests model should be rejected.
Of the two remaining models (unitary and correlated skills), the principle of parsimony would favor the simpler model with more degrees of freedom (unitary) over the more heavily parameterized model with fewer degrees of freedom (correlated skills). Yet, because this homogeneous result may be ascribed to the small sample size (see the Limitations section), we must consider other criteria, and we deem the correlated skills model the best-fit model for the following reasons.
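Why the two correlated models share the same degrees of freedom can be illustrated by counting parameters. The sketch below is an illustration under assumptions that may not match the specification behind Table 8: six observed scores (two from the Common Test and four from the TOEIC Bridge), each loading on exactly one factor, factor variances fixed to 1, and no correlated residuals. The function name and the model dictionary are hypothetical.

```python
def cfa_degrees_of_freedom(indicators_per_factor):
    """Degrees of freedom of a simple-structure CFA.

    Assumptions (illustrative): each indicator loads on one factor only,
    factor variances are fixed to 1, and residuals are uncorrelated.
    """
    n_indicators = sum(indicators_per_factor)
    n_factors = len(indicators_per_factor)
    # Distinct variances and covariances among the observed scores.
    moments = n_indicators * (n_indicators + 1) // 2
    # Free parameters: loadings, residual variances, factor covariances.
    free = n_indicators + n_indicators + n_factors * (n_factors - 1) // 2
    return moments - free


models = {
    "unitary": [6],
    "correlated skills": [4, 2],  # receptive (4 scores), productive (2)
    "correlated tests": [2, 4],   # Common Test (2 scores), TOEIC Bridge (4)
}
for name, spec in models.items():
    print(name, cfa_degrees_of_freedom(spec))
```

Under these assumptions, the two correlated models estimate the same number of parameters and are not nested in each other, so their fit can only be compared descriptively through the fit indices.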
First, as can be seen in Table 3, Oller's studies are the only ones that fully supported the unitary model, and a large body of literature following them uniformly negates it.Based on the accumulation of such empirical evidence, it is more natural to disregard its viability.
Second, because the correlation coefficient of the two latent variables (Receptive_skills and Productive_skills) does not exceed 0.9, the threshold mentioned above, it is more valid to set up these two constructs separately, rather than combining them into a single construct.
Finally, in order to find out to what extent the TOEIC Bridge was capable of distinguishing the receptive and productive skills, we performed confirmatory factor analyses of two models (unitary and correlated skills) only with the data of the TOEIC Bridge. Table 11 shows the results of model fits. To our surprise, unlike the TOEFL Junior Comprehensive, the TOEIC Bridge does not seem to distinguish well between receptive and productive skills. A chi-square difference test also confirmed that these two models are statistically homogeneous (p = 0.128). This may be because of the nature of some of the questions for speaking and writing tests, in which integrative skills are necessary to answer such questions (see the Limitations section for another possible reason). Looking at Table 11, the unitary model is superior to the correlated skills model on two indices (RMSEA and TLI), whereas the opposite is true for the other three (SRMR, CFI, and NFI). Therefore, there is no consistent pattern as to the superiority of either model. By contrast, when the scores of the Common Test are added, all the indices consistently favor the correlated skills model because it recorded (a) the lowest values in RMSEA and SRMR and (b) the highest values in CFI, NFI, and TLI (Table 8). Adding the scores of the Common Test strengthened the model fit of the correlated skills model, indicating that the Common Test itself measured receptive skills rather than general English proficiency.
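The index-by-index comparison described above can be sketched as a simple tally. The direction of each index follows the text (lower is better for RMSEA and SRMR; higher is better for CFI, NFI, and TLI). Apart from the two RMSEA values of 0.058 and 0.071 quoted earlier, every number in the example is a hypothetical placeholder rather than a value from Table 8 or Table 11, and the function names are our own.

```python
LOWER_IS_BETTER = {"RMSEA", "SRMR"}
HIGHER_IS_BETTER = {"CFI", "NFI", "TLI"}


def favored_model_per_index(fits):
    """Map each fit index to the model it favors.

    `fits` maps model name -> {index name -> value}; all models are
    assumed to report the same set of indices.
    """
    favored = {}
    index_names = next(iter(fits.values())).keys()
    for index in index_names:
        if index in LOWER_IS_BETTER:
            pick = min
        elif index in HIGHER_IS_BETTER:
            pick = max
        else:
            raise ValueError(f"unknown fit index: {index}")
        favored[index] = pick(fits, key=lambda model: fits[model][index])
    return favored


def is_consistent(favored):
    """True when every index favors the same model."""
    return len(set(favored.values())) == 1


# Hypothetical values, except RMSEA = 0.058 and 0.071 quoted in the text.
hypothetical_fits = {
    "unitary": {"RMSEA": 0.066, "SRMR": 0.040, "CFI": 0.980, "NFI": 0.960, "TLI": 0.970},
    "correlated skills": {"RMSEA": 0.058, "SRMR": 0.035, "CFI": 0.990, "NFI": 0.970, "TLI": 0.980},
    "correlated tests": {"RMSEA": 0.071, "SRMR": 0.041, "CFI": 0.979, "NFI": 0.959, "TLI": 0.968},
}
favored = favored_model_per_index(hypothetical_fits)
print(favored, is_consistent(favored))
```

A consistent tally, as in this hypothetical example, corresponds to the pattern reported for the combined data, whereas a split tally corresponds to the pattern reported for the TOEIC Bridge alone.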

Limitations
There were several limitations in the present study. First, there was a time gap of two (high school students) to four (university students) months between the Common Test and the TOEIC Bridge. This was unavoidable for logistical reasons. It is unknown whether the participants' English proficiency changed, and if so, how much, during this period.
Second, although we aimed to recruit roughly the same number of participants as Kamiya (2017) (n = 144), this was unfeasible due to budget limitations. This could be a reason why all three models were found to fit the data equally well. Thus, the results should be interpreted cautiously.
Third, although the level of the TOEIC Bridge is appropriate for the majority of high school students on the national level, it may have been too easy for the participants in this study. Among the 93 university students, 54 (58.1%) were English majors. All the high school students were recruited from high-level schools in the district. According to Table 6, the accuracy rate reached approximately 80% across all the skills and all of the data were negatively skewed, which may have weakened the test's discriminative power. Moreover, the mean score of the writing section was as high as 45.7/50, which may explain why this section had the lowest factor loading in all the models.
Fourth, the 36 participants who participated in the first year of this project needed to take the TOEIC Bridge online due to the pandemic. Independent t-tests indicate that they scored higher than those who took it face-to-face for speaking (p = 0.010) and total scores (p = 0.049). Speculatively, the online participants may have felt less anxious when speaking English aloud at home alone than those who took the test in a computer lab, where other students around them could hear their voices. Although this decision was beyond the researcher's control, there is some doubt about the score equivalency of the two modes. The pandemic facilitated online administration of even high-stakes tests, such as the TOEFL iBT, but to our knowledge, there has not been any systematic attempt to validate the score comparability of the speaking section between online and face-to-face administration. This point may be worth addressing in future studies.

Conclusion
Commotion ensued around 10 years ago when the idea of replacing the Center Test with private tests of four skills was introduced by MEXT. After countless heated arguments, this idea has been officially aborted. The introduction of assessment of the productive skills, be they speaking and/or writing, to university entrance examinations does not seem to be happening anytime soon.
However, the picture does not seem completely bleak for three reasons. First, a speaking test called ESAT-J (English Speaking Achievement Test for Junior High School Students) has been part of the entrance examination for all public senior high schools in Tokyo since November 2022, despite the fact that speaking skills are "the most logistically challenging and controversial to assess" (O'Sullivan et al., 2022, p. 12) among the four skills. Second, a few universities are devising their own ways to measure students' speaking skills at their entrance examinations (Committee for Selection of Good Practices in University Admissions, 2022). Third, a growing number of universities are now utilizing the scores of private tests of four skills for entrance examinations by, for instance, requiring a certain score for applying or adding extra points to the score of the test conducted by each university (Kawaijuku Education Institution, n.d.). Thus, albeit slowly, we are moving forward from the measurement of receptive skills only toward that of four skills, in line with the high school English curriculum guidelines (MEXT, 2018b), which stipulate that all of the four skills must be cultivated in a balanced manner. We truly hope that the National Center will create a reliable and valid measurement of four skills someday. But until that day comes, a viable solution seems to be to have each university implement an assessment of productive skills on its own in addition to the Common Test.

Fig. 1
Sample of unitary or unidimensional model

Fig. 4
Sample of high-order, second-order, or hierarchical model

Fig. 8
Standardized regression weights of the unitary model

Table 1
Components of the English test of the Common Test

Table 2
Components of the TOEIC Bridge

Table 4
Demographic information of participants

Table 5
Descriptive statistics of the Common Tests across Japan

Table 6
Descriptive statistics of scores

Table 7
Results of Pearson's product-moment correlation matrix

Table 8
Results of model fits. df = degrees of freedom; CI = 90% confidence interval

Table 9
Correlation coefficients between Center Test or Common Test and private tests. a Total refers to the added score of reading and listening except for Kamiya (2017) and the present study, in which it refers to the added score of the four skills. b STEP Eiken is a test to measure English proficiency conducted in Japan. The test is currently divided into seven levels, five of which measure four skills whereas the other two (the lowest levels) measure the receptive skills only

Table 10
Selected specifications of the Common Test and the TOEIC Bridge

Table 11
Results of model fits of the TOEIC Bridge