Longitudinal measurement of growth in vocabulary size using Rasch-based test equating

The purpose of this study is to equate and further validate three forms of the vocabulary size test (VST) created by Aizawa and Mochizuki (2010). These three forms, VST 1, 2, and 3, were administered to a cohort of 189 high school students ranging in age from 16 to 18 in April of their 1st, 2nd, and 3rd year of high school. Although these alternate forms were designed to be of equal difficulty, formal equating of the three forms was never carried out. In order to verify whether gains in test scores were due to growth in vocabulary size or differences in difficulty among the three forms, a fourth form comprised of items selected from VST 1–3 was created and administered in December of year 3. The four test forms were then equated using Rasch analysis, placing persons and items on a single, uniform logit scale. The results indicated that (1) the three original forms of the VST all showed good fit to the Rasch model, (2) differences in test difficulty among the original three forms were minor, and (3) the four VST forms, linked via a single Rasch analysis, can be appropriately used for estimating gains in students’ VS across their high school career. In addition, follow-up analyses indicated considerable overlap in item difficulty among word frequency bands, suggesting that word frequency was not the sole indicator of difficulty of vocabulary items, and that learner progress in vocabulary learning tended to be uniform and parallel across all frequency bands. Overall, the study illustrates a method for creating a valid and reliable measure of growth in VS over an extended period of time and provides insight into the relationships among word frequency, word difficulty, and progress in vocabulary learning.


Introduction
Researchers and teachers agree that learning vocabulary is an essential part of second language (L2) acquisition. The importance of learning vocabulary is revealed in three ways. Firstly, vocabulary is the primary building block of understanding for communication. As Wilkins (1972) puts it, "without grammar, very little can be conveyed; without vocabulary nothing can be conveyed" (pp. 111-112). Secondly, sufficient vocabulary helps L2 learners expand the four skills. Regarding this, Nation (1994) notes, "a rich vocabulary makes the skills of listening, speaking, reading, and writing selected words based on word-item (lemma 1) counting instead of word family 2 counting, which many vocabulary specialists believe is better tuned to learner knowledge of morphology compared to word family. Specifically, many Japanese EFL learners may be unfamiliar with the various word forms, such as noun, adjective, and adverb, knowledge of which is assumed when basing VS estimates on word families (see Kremmel, 2016; McLean, 2018). Finally, he based frequency calculations on a locally developed corpus, namely the Hokkaido University English Vocabulary List (HUEVL; Sonoda, 1996). The HUEVL is based on approximately 9 million words from Time Magazine plus a total of 2.7 million words from the US Department of Energy Corpus from 1989 to 1993. Thus, the HUEVL represents English as used for current topics and science. It is organized into five level bands from junior-high to college-advanced level. Soon after Aizawa created his test, Mochizuki (1998) revised the HUEVL and created a revised version of the test that reflected higher familiarity with loan words and grouped words into seven word-frequency levels: the 1000- to 7000-word levels. Mochizuki's (1998) test included 30 items in each level band, in which test takers identify, from six alternatives, the two English words that match the two Japanese definitions. An example item from the Mochizuki (1998) test is shown in Table 1.
The 2000- to 7000-word levels of Mochizuki's (1998) test were further analyzed by Kasahara (2006) using the FACETS (Linacre, 2005) Rasch measurement software package. He replaced 21 items which were deemed too easy or which did not match a hierarchical Rasch model with new items selected from the Japan Association of College English Teachers (JACET) List of 8000 Basic Words (JACET, 2003). He argued that his revisions maintained the high reliability of the test and improved its validity by removing items that did not fit a unidimensional scale structure, as indicated by FACETS' fit statistics. Later, Aizawa and Mochizuki (2010) devised three forms of the VST with 26 words in each level band. They also provided a formula for estimating the total VS of junior and senior high school learners, which has been widely used in Japan (e.g., Yashima, 2002; Kosuge, 2003; Katagiri, 2009; Akase & Uenishi, 2015). Using these VSTs, estimated VS was calculated separately for each level band by multiplying the percent of items answered correctly by 1000, with the total VS then estimated by summing the results for the seven level bands.
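The raw score calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `raw_score_evs` and the example scores are hypothetical and are not taken from Aizawa and Mochizuki (2010).

```python
def raw_score_evs(correct_by_band, items_per_band=26, words_per_band=1000):
    """Estimate vocabulary size (VS) with the raw score method.

    correct_by_band: number of items answered correctly in each level band.
    Each band's proportion correct is scaled to the band's word count,
    and the band estimates are summed to give the total VS.
    """
    band_estimates = [
        (correct / items_per_band) * words_per_band
        for correct in correct_by_band
    ]
    return band_estimates, sum(band_estimates)


# Hypothetical learner: 26, 20, and 13 items correct in three level bands.
bands, total = raw_score_evs([26, 20, 13])
print([round(b) for b in bands], round(total))
```

Note that each band estimate in this method depends only on the items in that band, a property that becomes relevant when the method is compared with Rasch-based estimation.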
As mentioned above, the word frequency approach has generally been used to construct VSTs. However, aside from frequency, a wide range of other factors have been acknowledged to make words relatively easy or difficult to learn, including phonotactic regularity, structural or morphological complexity, word class, imageability of concept, and word meaningfulness (see Laufer, 1997; de Groot & van Hell, 2005; de Groot, 2006). These intralexical factors are also influenced by the regularity of the language: the more regular the lexis of a language, and the more that a word or phrase conforms to the language's norms, the easier it is to learn (Schmitt & Schmitt, 2020). These multiple sources of word difficulty might call into question whether the unidimensional Rasch model is appropriate for estimating difficulty. However, distinctions between psychological dimensionality and measurement dimensionality have been well discussed in the literature (Sick, 2010; McNamara, 1996, pp. 270-271), and employing a Rasch approach for VST analysis is well established (e.g., Beglar, 2010; McLean, Kramer, & Beglar, 2015).
Although it is acknowledged that many factors affect word difficulty, VSTs have generally adopted word frequency as the sole indicator of difficulty for practical reasons (Nation & Anthony, 2016). This is due in part to the development of corpus linguistics, which has made it easy to categorize words from various sources and genres into frequency bands. Nevertheless, the extent to which frequency determines difficulty in a VST is an unresolved issue. Kasahara (2006), for example, found that while mean differences in difficulty between the frequency bands were orderly and progressive, there was considerable overlap in the range of difficulty of the items within the six frequency bands that he analyzed. Ha (2021), on the other hand, found that frequency bands, with the exception of the 5000-word level, were not only progressive in difficulty but showed very little overlap.
A related question is whether growth in VS is characterized by the progressive mastery of the words comprising each frequency band, in order, or whether learners make more or less equal progress in each frequency band as a course of instruction progresses. To my knowledge, no studies have directly addressed this issue.
To gain a better understanding of growth in vocabulary, this study focused on measuring longitudinal growth in VS of Japanese EFL high school students. Measuring longitudinal growth presents two problems. One is a possible practice effect. If learners take the same test several times, motivated learners might look up or learn words on the test, which can lead to overestimating their VS on future tests. Aizawa and Mochizuki (2010) constructed three forms of their VST with the goal of avoiding this practice effect. However, these three forms were assumed to be of equal difficulty without any formal equating.
The other issue with the measurement of growth over time is a possible testing effect. The three VST forms might vary somewhat in difficulty. When attempting to measure growth, it is not clear whether higher scores on a testing occasion are due to learners' growth, slight differences in test difficulty, or some combination of both. In this study, equating these three forms using Rasch analysis is offered as a solution to both problems.

Applying Rasch measurement to vocabulary size tests
According to Sick (2008a, 2008b), Rasch measurement is used to assess the quality of tests and questionnaires. It constructs true interval-scale measures from raw scores. It also plays an important role in construct validation. In addition, and most importantly for the purpose of this research, Rasch analysis can be used to equate tests by creating a common scale based on Rasch logits. Figure 1 is a simplified example of using Rasch measurement theory to link two tests with common items. Test 1 is easier than test 2. If a group of people take both tests, it is expected that they will get a lower percent score on test 2, the more difficult test. To be able to compare their scores, it is necessary to place the test scores on a common scale. Rasch analysis enables the analyst to place all test items on a single logit scale, based on the difficulty of a subset of common items. Item difficulty is estimated based on the percentage of test takers who answered each item correctly. The item scores are then converted to Rasch logits, which represent a probability that a test taker of a certain ability will be able to answer an item correctly. Easier items have lower logits, such as −2 and −1, and more difficult items have higher logits, such as 3 and 4. The zero logit is set to the average difficulty of item 4 on both tests. Items 4, 5, and 6, which appear on both tests, are common items that can be used as linking items. A Rasch analysis constructs an interval-level scale which is invariant between test occasions. No matter which items are selected, persons with greater ability will tend to do better, and regardless of a person's ability, there will be a greater chance of success on an easy item than on a more difficult one. In addition to producing interval-level measures for both person ability and item difficulty that can be employed in further statistical analyses, a Rasch analysis can be used to evaluate test targeting using a Wright map (Engelhard Jr. and Wang, 2021), a figure which shows the simultaneous locations of persons and items along a common Rasch logit scale.
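As a rough illustration of how common items can place two separately calibrated tests on one scale, the following Python sketch shifts the test 2 difficulties by the mean difference observed on the linking items. Two assumptions should be noted: this is simple mean-shift linking of two separate calibrations, not the concurrent calibration used in this study, and all item names and logit values are invented.

```python
def link_by_common_items(test1, test2, common):
    """Shift test 2 item difficulties onto the test 1 logit scale.

    test1, test2: dicts mapping item name -> difficulty in logits,
                  each calibrated separately (each has its own zero point).
    common:       names of the linking items that appear on both tests.
    """
    # Average difference in difficulty of the common items across the scales.
    shift = sum(test1[i] - test2[i] for i in common) / len(common)
    # Apply the same constant to every test 2 item.
    linked = {item: d + shift for item, d in test2.items()}
    return linked, shift


# Invented values: items 4-6 appear on both tests.
test1 = {"item1": -2.0, "item4": 0.5, "item5": 1.0, "item6": 1.5}
test2 = {"item4": -1.0, "item5": -0.5, "item6": 0.0, "item9": 2.0}
linked, shift = link_by_common_items(test1, test2, ["item4", "item5", "item6"])
print(shift, linked["item9"])
```

In this toy case the common items look 1.5 logits easier on test 2's own scale, so every test 2 item is moved up by that amount before the two tests are compared.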
Fig. 1 Rasch measurement: linking tests with common items (adapted from Sick, 2008b)

Rasch measurement may also provide an alternative method of estimating total VS. The dichotomous Rasch model (Rasch, 1980) is an estimation of the probability that a test taker of an estimated ability will answer an item of an estimated difficulty correctly. The Rasch dichotomous model can be expressed using the following equation, in which θ represents the ability level of the test taker, δ represents the difficulty of an item on the same scale, and e is the base of the natural logarithm (approximately 2.718):

$$P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{e^{\theta_n - \delta_i}}{1 + e^{\theta_n - \delta_i}}$$

Assuming the data fit the Rasch model, the formula can be used to determine the probability that student n will answer item i correctly (X = 1).
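A brief worked example may help to fix the scale. The values here are invented for illustration: under the dichotomous Rasch model, a student whose ability is one logit above an item's difficulty (θ − δ = 1) would be expected to answer correctly about 73% of the time, while a student whose ability equals the item's difficulty (θ − δ = 0) has an even chance.

$$P(X = 1) = \frac{e^{1}}{1 + e^{1}} \approx \frac{2.718}{3.718} \approx 0.73, \qquad P(X = 1) = \frac{e^{0}}{1 + e^{0}} = \frac{1}{2}$$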
The above equation can also be used to estimate the number of items within a set that a student will answer correctly. For example, by calculating the mean item difficulty of a set of items sampled from a frequency band, it is possible to estimate the number of items in that set that will be answered correctly by a student of known ability. It is then possible to infer the total number of words in that frequency band that is known by the student, based on the mean item difficulty of the sample (see Gibson and Stewart, 2014, for a more detailed explanation).
An important difference from the formula proposed by Aizawa and Mochizuki (2010), which will hereafter be referred to as the raw score method, is that the Rasch method incorporates information from all test items, not only the items in that level band. In the VS estimation using the Rasch method, the estimated number of words known in a frequency band is equal to the mean probability of success multiplied by 1000. The total VS is the sum of estimated words known for each level.
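The Rasch method of estimating VS can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function names are hypothetical, the item difficulties are invented, and the person measure `theta` is assumed to have been estimated beforehand by Rasch software.

```python
import math


def rasch_probability(theta, delta):
    """Dichotomous Rasch model: probability of a correct response.

    Algebraically equal to exp(theta - delta) / (1 + exp(theta - delta)).
    """
    return 1.0 / (1.0 + math.exp(delta - theta))


def rasch_evs(theta, difficulties_by_band, words_per_band=1000):
    """Estimate VS per band as mean probability of success times band size."""
    band_estimates = []
    for difficulties in difficulties_by_band:
        probs = [rasch_probability(theta, d) for d in difficulties]
        mean_p = sum(probs) / len(probs)
        band_estimates.append(mean_p * words_per_band)
    return band_estimates, sum(band_estimates)


# Invented item difficulties (logits) for two small frequency bands.
toy_bands = [[-3.0, -2.0, -1.0], [0.0, 1.0, 2.0]]
estimates, total = rasch_evs(theta=1.0, difficulties_by_band=toy_bands)
print([round(e) for e in estimates], round(total))
```

Because `theta` is estimated from the learner's responses to every item on the test, each band estimate draws indirectly on information from all bands.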
This study attempts to equate and validate the three forms of a VST created by Aizawa and Mochizuki (2010) and to determine the validity of using these three test forms to measure longitudinal growth in VS during three years of high school study. The larger purpose of the study is to lay the groundwork for investigating factors that affect or influence growth in VS across an extended period, such as a course of EFL study during secondary school. However, in the author's opinion, that goal cannot be separated from the need to methodically link the instruments used to measure VS. Above all, it is necessary to account for and eliminate both a testing effect and practice effect. Without these steps, measurement of growth in VS will be ambiguous in that it will not be clear whether observed changes in longitudinal measures are due to growth in VS, differences in difficulty of the VSTs, or a combination of both. Thus, the first three research questions form a logical sequence.
RQ1: To what extent do the three forms of Aizawa and Mochizuki's (2010) VST fit the Rasch measurement model?
RQ2: If the three forms of the VST fit the Rasch model individually, can a fourth form (VST 4) be created to validly link them to a common scale?
RQ3: To what extent can the four forms then be used to unambiguously measure growth in VS during a three-year high school career?
Two additional research questions address characteristics of word difficulty and growth in VS with potential pedagogical implications:
RQ4: To what degree do the words within frequency bands vary in difficulty, as estimated using a common scale of measurement?
RQ5: To what degree is growth in VS influenced by word frequency? Specifically, is progress characterized by the progressive mastery of each frequency band, or do learners make parallel and equivalent progress in multiple frequency bands?

Participants
Originally, 204 Japanese EFL high school students majoring in science and engineering at a National Institute of Technology (NIT) were recruited for this study. The study began in April 2018. However, during the 3-year high school duration of the study, 15 students were unable to participate in all parts of the study, reducing the pool to 189 participants (36 female, 153 male). Participants were informed that participation was voluntary and that no data obtained from the study would affect their school grades.
Before entering NIT, all participants had studied English for at least 3 years in Japanese junior high school. They had also taken Foreign Language Activities in grades five and six in elementary school, where they developed a foundation of communication abilities through English. During the three-year enrollment in NIT, their general English proficiency could be estimated as ranging from grade pre-2 (CEFR A2) to grade 2 (CEFR B1) of the STEP Eiken test, which is the most widely used standardized test for assessing test takers' English proficiency in Japan. Most students recognized the importance of studying English, and a few highly motivated students achieved Eiken Grade Pre-1 (CEFR B2). In April 2021, at the end of their 3rd year, all students took the Test of English for International Communication (TOEIC), a commercial test which is widely used by Japanese companies when recruiting employees in Japan. The average TOEIC score of the participants was around 400.

Materials and procedure
The three forms of Aizawa and Mochizuki's (2010) VST were employed in this study. Because of time constraints and test difficulty, six level bands representing the 1000- to 6000-word levels were used. Higher level bands were not included because it was assumed that few of the participants would have encountered such low-frequency words, and with a multiple-choice format such as the Aizawa and Mochizuki VSTs, widespread guessing of unknown words could both overestimate VS and lower reliability (Stewart, 2014). Each level has 26 items, which means the total number of items was 156 for each form. The three VST forms (VST 1, VST 2, and VST 3) were administered to the participants in April of 2018 (first year), 2019 (second year), and 2020 (third year), respectively.
As mentioned previously, no formal equating of test difficulty of the three VST forms has ever been carried out. Consequently, a longitudinal study would conflate learner increases in VS with differences in test difficulty. Although Rasch analysis is frequently employed for equating test forms, the problem was that these forms did not have any common items that could be used to link them. In order to link the three forms of the VST, it was necessary to make a fourth version, VST 4, which consisted of words selected from each level of each test form. Specifically, for each vocabulary frequency band, 10 items were selected from VST 1 and 8 items each from VST 2 and VST 3, making a total of 26 items for each vocabulary band, equivalent in length to VST 1, 2, and 3.
Prior to selecting items for VST 4, separate Rasch analyses were carried out on VST 1-3 using Winsteps version 4.7 (Linacre, 2019) in order to determine that the three forms fit the Rasch model individually and were thus suitable for linking. Items were selected for VST 4 based on having very good fit to the Rasch model and to represent the full range of difficulty found in each frequency band. However, items that were answered correctly by all or all but a few participants were avoided because items with extreme scores cannot be used as linking items. Consequently, it was expected that VST 4 would be slightly more difficult, on average, than VST 1-3. VST 4 was administered to the same participants in December 2020.

Analyses
Following the individual analyses of VST 1-3 and the administration of VST 4, data from the four forms were analyzed using the stacking method (Wright, 1996; Bond, Yan, & Heene, 2021, p. 203). The stacking method combines data from all four administrations, initially treating persons tested at different times as different persons, in order to estimate all item difficulties concurrently. The common items comprising VST 4 provide a basis for linking the four forms. This technique places all forms of the test on a common logit scale, allowing unambiguous measurement of mean item difficulty of the three original test forms, as well as of person ability at different times. In all, the stacking analysis combined data from a total of 468 items (156 items times 3) and 756 person records (189 students times 4). Figure 2 provides a graphic illustration of the procedure.
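The data layout implied by stacking can be sketched in Python as below. This is an assumption-laden illustration of the data arrangement only: the function name, record labels, and toy responses are hypothetical, and the actual estimation in this study was carried out in Winsteps rather than with code like this.

```python
def stack_administrations(administrations):
    """Stack several test administrations into one person-by-item table.

    administrations: list of (label, responses) pairs, where responses maps
                     person id -> {item id: 0 or 1}.
    Returns (item_ids, rows). Each row is (record id, scores), with None
    marking items that were not on the form taken on that occasion.
    """
    # Column order: every distinct item across all forms, first seen first.
    item_ids = []
    for _, responses in administrations:
        for answers in responses.values():
            for item in answers:
                if item not in item_ids:
                    item_ids.append(item)

    rows = []
    for label, responses in administrations:
        for person, answers in responses.items():
            # The same student at a different time becomes a separate record.
            record_id = f"{person}@{label}"
            rows.append((record_id, [answers.get(item) for item in item_ids]))
    return item_ids, rows


# Toy data: two forms with no shared items, plus a linking form (T4)
# built from one item of each earlier form.
t1 = {"s1": {"a1": 1, "a2": 0}, "s2": {"a1": 1, "a2": 1}}
t2 = {"s1": {"b1": 0, "b2": 0}, "s2": {"b1": 1, "b2": 0}}
t4 = {"s1": {"a2": 1, "b1": 1}, "s2": {"a2": 1, "b1": 1}}
items, rows = stack_administrations([("T1", t1), ("T2", t2), ("T4", t4)])
print(items)
print(len(rows))
```

In the toy table, the T4 records are the only rows with responses to items from both earlier forms, which is what allows all items to be calibrated on one scale.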
Following the stacking analysis, mean item difficulties of VST 1 to VST 4 were compared using a one-way repeated measures analysis of variance (ANOVA) in order to assess whether the original forms were of equivalent difficulty. Following this, the total VS for each student at each administration was estimated using the Rasch estimation method. Mean VS at each administration was then plotted and a one-way repeated measures ANOVA used to assess whether growth in mean VS was statistically significant and substantially meaningful.
Finally, the range of difficulty found within each vocabulary frequency band was plotted and compared in order to ascertain the degree to which word frequency alone affects the likelihood of students knowing the word. The analysis of frequency bands was further supplemented by plotting growth in VS by frequency band in order to examine whether VS within frequency bands tends to grow at differential or parallel rates.

Individual Rasch analyses of VST 1-3
The results of the individual Rasch analyses for VST 1-3 are presented in Table 2. The individual item fit statistics were explored to examine the technical quality of the three original VSTs. This preliminary examination is primarily to determine whether the individual tests have adequate fit to the Rasch model and are thus suitable for linking via common items. As shown in Table 2, the mean score rose with each successive VST administration. The maximum possible score on each VST was 156. Average raw scores for VST 1-3 were 69.2, 79.7, and 88.9, which were 44.4%, 51.1%, and 57.0% of the total, respectively. The standard deviation also rose with each administration, which indicates that the scores were spreading out across time.
The items comprising the three forms of the VST fit a Rasch model well 3 (RQ1). Infit MNSQ for all items was within 0.7-1.3. 4 Although there were some items with high Outfit MNSQ, close examination showed that these cases were all very easy or very difficult items. Unexpected scores were probably due to guessing correctly or carelessness by a small number of test takers.
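The difference between the two fit statistics can be illustrated with a short Python sketch. It uses the standard mean-square definitions (outfit as the unweighted mean of squared standardized residuals, infit as the information-weighted version) and does not reproduce Winsteps output; the expected probabilities and responses below are invented.

```python
def item_fit(responses, probabilities):
    """Infit and outfit mean squares for one dichotomous item.

    responses:     observed 0/1 scores, one per person.
    probabilities: model-expected probability of success for each person.
    """
    variances = [p * (1.0 - p) for p in probabilities]
    sq_resid = [(x - p) ** 2 for x, p in zip(responses, probabilities)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(responses)
    # Infit: squared residuals weighted by their information (variance).
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit


# Invented example: a very easy item missed by the most able test taker.
probs = [0.95, 0.9, 0.9, 0.8, 0.7]
obs = [0, 1, 1, 1, 1]
infit, outfit = item_fit(obs, probs)
print(round(infit, 2), round(outfit, 2))
```

A single unexpected miss on a very easy item inflates outfit far more than infit, which is consistent with the observation that the high outfit values were confined to very easy or very difficult items.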
Item and person reliability 5 ranged from 0.87 to 0.98, which indicates a high degree of replicability. The person reliability of VST 1 was lower than that of VST 2 and 3, possibly because vocabulary knowledge was lower at Time 1 and the participants answered more items by guessing. Item separation 6 , an estimate of the spread or separation of the items along the measured variable, ranged from 5.17 to 6.24. This shows that the items can be separated into 5 to 6 statistically distinct bands based on their difficulty. Person separation 7 ranged from 2.54 to 3.71, which means the participants can be divided into two or three statistically different groups according to their scores on the VSTs.

Based on the aforementioned results of the individual Rasch analyses for VST 1-3, items were carefully selected for VST 4 to be representative of each level band and to reflect the full range of difficulty found in each frequency band. This procedure allowed the researcher to validly link the four test administrations to a common Rasch logit scale (RQ2). For the benefit of other researchers who may wish to link other tests to the Aizawa and Mochizuki tests, the Rasch measures of all linking items (VST 4) have been included as an appendix.

Figure 3 presents a Wright map generated from the stacked analysis of VST 1-4. The distribution of student abilities is shown on the left-hand side of the map, and the distribution of items by difficulty, labeled to indicate the frequency band they were drawn from, is illustrated on the right-hand side of the map. The Wright map illustrates the targeting, that is, the match of student ability to item difficulty, of the four VST administrations as a single system of measurement.

3 (Engelhard, 2013).
4 What constitutes "acceptable" infit and outfit mean square fit statistics is a matter of contention in Rasch measurement theory. Linacre (2012) has suggested a range between 0.50 and 1.50 as acceptable fit for a test or questionnaire under development. However, a more conservative range of 0.7-1.3 has been recommended for multiple-choice tests used for high stakes or substantial decisions (Wright et al., 1994; Bond et al., 2021).
5 Reliability statistics report the reproducibility of the measures. A reliability value of 0.90 or higher is accepted as a high value. In a Rasch analysis, the person reliability estimates how likely these person measures would be reproduced using a different set of items sampled from the same domain. Likewise, the item reliability estimates how likely the item difficulty measures would be reproduced if the test were administered to a similar sample of test takers (Linacre, 2012).
6 The item separation index shows the number of statistically significant levels into which the items could be divided according to difficulty. A value of 3.00 or more is considered good (Linacre, 2012).
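The separation and reliability indices reported for VST 1-3 are two expressions of the same underlying ratio. In Rasch measurement, reliability R and separation G are conventionally related by R = G² / (1 + G²). The Python sketch below applies that standard relationship to the separation values reported above as a consistency check; the function name is hypothetical and the calculation is not taken from the study's own output.

```python
def reliability_from_separation(g):
    """Convert a Rasch separation index G into a reliability coefficient."""
    return g * g / (1.0 + g * g)


# Lowest person separation and highest item separation reported above.
for g in (2.54, 6.24):
    print(g, round(reliability_from_separation(g), 2))
```

Under this relationship, the lowest and highest reported separation values correspond to the two ends of the reported reliability range.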

Wright map of person and item measures
Both persons and items are measured on a common logit scale that ranges from −4 to 4 logits. A single item from level 5, the word "offspring," stands out at the top of the map as the most difficult. Along the bottom of the map, we can see that a number of items from level 1 were very easy for this group of test takers. However, a few items from level 3 (e.g., "bean" and "balloon") were also very easy for this group of persons.
Overall, the Wright map indicates that the distribution of vocabulary item difficulties matches the distribution of student abilities during their 3-year course of study well and that there is a correspondence between frequency level band and item difficulty. However, the correspondence is not perfect. For example, a few level 4 words, such as "feast," "hinder," and "triumph," were difficult items at slightly over 3 logits, answered correctly by approximately 42% of test takers only. In addition, one level 2 word, "mend" at 2.5 logits, was located at the outer range of person ability (RQ3 and RQ4).

Figure 4 and Table 3 show the results of a one-way ANOVA comparing differences in mean item difficulty of VST 1-4. Although there are some differences in the mean item difficulty of the three original forms, the error bars in Fig. 4 indicate that these differences are within the error of estimation. Only VST 4, which deliberately excluded the easiest items from VST 1-3, has a mean difficulty that appears to be outside the error bars of the other three versions. However, a one-way ANOVA indicated that after equating, mean differences among the versions were not significant (p = .45). An implication of this is that the three original forms of the VST created by Aizawa and Mochizuki (2010) do not differ significantly in difficulty and can now be regarded as having been formally equated (RQ3). Figure 5 shows the estimated mean vocabulary size at each administration of the VST. Both the estimated vocabulary size using the raw score method (score EVS) and the estimated vocabulary size using the Rasch method (Rasch EVS) indicate growth at each consecutive administration.

Estimated mean vocabulary size
Although there are some slight differences between the methods, both indicate consistent growth in VS across time. Because VST 2 was slightly easier than VST 1, it is probable that the raw score EVS slightly overestimated mean growth in VS during the first year. The difference between the two methods is most pronounced at Time 4. This is most likely because VST 4 by design omitted the easiest items, since items answered correctly by all test takers are not suitable for Rasch equating. The score EVS thus underestimates VS at Time 4.

Figure 6 shows the distribution of item difficulty within the vocabulary frequency bands. Frequency band Level 1 is clearly the easiest, and in fact, the 26 items in this level band were answered correctly by about 95 to 98% of the test takers in each VST. In contrast, other frequency bands overlap considerably in difficulty (RQ4).

Item difficulty ranges of frequency level bands
One level 2 item, "mend," was more difficult than the average difficulty of other frequency bands. The words "bean" and "balloon" were the easiest items in level 3 and, in fact, were also below the difficulty measures of most level 1 words. The word "offspring," which is a level 5 word, was the most difficult item on the test. Similar examples can be observed in other level bands. In some cases such as "balloon," "garbage," "hydrogen," and perhaps "economically," the easiness can be explained by the school curriculum, which is science and technology oriented. It is possible that 3rd year students encounter these words outside of their English classes.

Figure 7 shows the mastery of level bands across four time periods. Overall, the growth of VS within frequency bands tended to be consistent and parallel. Students made progress in all frequency bands above level 1 at a similar rate. The annual growth rates of levels 2-6 were consistently between 6 and 8%, and the growth rates of level 3 and level 4 were so close as to be overlapping in the figure. Growth rates tended to be slightly higher between the 2nd and 3rd VST administrations (RQ5).

Discussion
This study sought to answer five RQs related to the validating and equating of the three forms of the VST created by Aizawa and Mochizuki (2010). A Rasch analysis was used to equate and compare the original three forms of the VST using a fourth form comprised of common items as a linking test. All items showed good fit to the Rasch model, which further establishes the validity and reliability of the VSTs. A Wright map based on a stacking analysis demonstrated that the four test forms, as a measurement system, were well targeted for measuring vocabulary growth during three years of study at NIT. In addition, an analysis of the relative difficulty of the three original forms based on the equated Rasch item difficulties found that they were not significantly different. This point could be important to teachers who wish to monitor growth in vocabulary size during high school, but feel they do not have the knowledge or means to link the test forms using Rasch analysis. This study has formally equated the original three forms, and teachers or future researchers may now regard the raw score counts as equivalent. Estimates of growth in VS derived from these forms should be free from both practice effects and testing effects.

Regarding estimating VS based on test performance, the score method and the Rasch method were introduced and compared. The two methods produced similar but slightly different estimates of VS. It is hypothesized that differences might arise because the Rasch method incorporates information from all items to estimate the probability that a learner will know any specific word. The score method, in comparison, is derived from the total number correct in a single frequency band only and is not influenced by words outside of that band. This could be problematic in view of the fact that item difficulty is not determined entirely by word frequency. Consequently, it is suggested that future researchers employ the Rasch method for estimating VS as a more precise and valid estimate.
The analysis of word difficulty by frequency band (Fig. 6) indicated that word difficulty is not entirely determined by frequency. Although students can generally be expected to know a greater number of high-frequency words than low-frequency words, their occurrence in the school curriculum as well as intralexical factors (Schmitt & Schmitt, 2020) almost certainly influence students' learning of vocabulary. Due to the demand for engineers to have English communication skills, students in science and engineering fields are also increasingly required to learn some English words related to their areas of expertise. Thus, it is possible that the frequency of encounter is not the same as that found in the HUEVL corpus (Sonoda, 1996), which was based on current events and science articles from the 1990s. Future research may need to reexamine the consequences of using the HUEVL or any other specific corpora to estimate word frequency in the Japanese EFL context. Although this result was not unexpected, it does emphasize the need for further research into the intralexical factors that make words easier or more difficult to learn.

The mastery of level bands across the four time periods (Fig. 7) demonstrates that apart from the Level 1 words, learners do not learn the words comprising the frequency bands in succession. Rather, they seem to be learning a portion of words from Level 2 to 6 each year at a gain of roughly 5-8% per band per year. This is possibly because in the Japanese EFL context, new vocabulary is acquired by deliberate study of textbook passages and assigned word lists rather than from graded readers. As noted above, further research into the relationship between word difficulty and the order in which new words are introduced in the school curriculum could have useful pedagogical implications.
Another point worth mentioning is that in Kasahara's (2006) modified version of the VST, the 1000-word level band was left unchanged. Furthermore, he did not include the Level 1 words in a study conducted with university students, assuming they would know most of the words. However, easy items known by nearly all students are necessary for accurate estimation of VS. Furthermore, the small gains from 95 to 98% that these high school students made in the 1000-word level could be pedagogically important, as some researchers (e.g., Hu & Nation, 2000; Laufer, 2005) have suggested that 98% coverage of a foreign language text is required for good comprehension and guessing from context. If so, it would be especially important that learners master a minimum of 98% of the very high-frequency level 1 words.

Conclusion and future research
This study has completed a formal validation and linking of three forms of a VST designed to measure growth in VS and tested their efficiency in the context of a three-year high school program at NIT. A Rasch analysis both established their equivalence and demonstrated that observed gains in VS could validly be attributed to changes in learner ability, rather than differences in the difficulty of the tests. The study also demonstrated that in the Japanese EFL context, word frequency is not the sole determiner of whether a student will know a word. Furthermore, learners in this study were shown to make similar gains in all frequency bands during the course of the study, an indication that, for better or worse, word frequency may not be the primary factor determining the order of word presentation in the Japanese school curriculum.
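For readers less familiar with Rasch measurement, the reasoning behind this separation of ability from test difficulty can be illustrated with the standard dichotomous Rasch model. The notation is an assumption made here for illustration and is not taken from the analysis itself: \theta_n is the ability of person n and \delta_i is the difficulty of item i, both in logits.

```latex
\[
P(X_{ni} = 1 \mid \theta_n, \delta_i)
  = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)},
\qquad
\ln \frac{P(X_{ni} = 1)}{1 - P(X_{ni} = 1)} = \theta_n - \delta_i .
\]
```

Because the log-odds of success depend only on the difference \theta_n - \delta_i, calibrating the items of all forms on one common scale means that a change in a learner's \theta_n between administrations reflects a change in ability, whichever form, easier or harder, was taken.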
A limitation that could be addressed in future studies is whether these results would generalize to other cohorts of learners of various academic backgrounds. Since this study was based entirely on science and engineering students, high school students who opt for humanities courses may show different patterns of development in their VS. Such studies could shed further light on how high school students develop their knowledge of English vocabulary.
Future research might make use of these now equated test forms to investigate other factors that influence growth in VS, such as learning strategies, motivation, and other affective factors. Further examination of which individual differences influence growth and rate of growth in VS would lead to a better understanding of vocabulary acquisition among Japanese EFL high school students. Finally, this study has demonstrated the utility of using Rasch measurement to link test forms and separate true gains in VS from differences in test difficulty. Furthermore, a method for more accurately estimating VS in each level band using Rasch measures was demonstrated. It is recommended that future researchers investigating growth in VS consider using this approach for maximum validity and accuracy.
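As a rough illustration of how band-level VS might be estimated from Rasch measures, the following minimal sketch computes the expected proportion of a band's items that a learner of a given ability would answer correctly and scales it to the size of the band. It assumes the dichotomous Rasch model; the function names, the item difficulties, the ability values, and the 1000-word band size are hypothetical and are not taken from the study's data or procedure.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits on the same (linked) scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_band_mastery(ability, band_difficulties):
    """Expected proportion of a band's sampled items answered correctly."""
    probabilities = [rasch_probability(ability, d) for d in band_difficulties]
    return sum(probabilities) / len(probabilities)


def estimated_words_known(ability, band_difficulties, words_in_band=1000):
    """Scale the expected proportion to the number of words in the band."""
    return expected_band_mastery(ability, band_difficulties) * words_in_band


# Hypothetical item difficulties (logits) for the items sampled from one band.
band_items = [-1.0, 0.0, 1.0]

# Hypothetical ability estimates for one learner at two time points.
ability_time1 = 0.0
ability_time2 = 1.0

words_time1 = estimated_words_known(ability_time1, band_items)
words_time2 = estimated_words_known(ability_time2, band_items)
print(f"Estimated gain in this band: {words_time2 - words_time1:.0f} words")
```

Because person and item measures are on one linked scale, the same item difficulties can be reused at every time point, so the difference between the two estimates depends only on the change in ability.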
Appendix. Rasch item difficulties of VST 4 (linking items)

Acknowledgements
The author thanks all the students and teachers who helped in collecting data. This study would not have been possible without the support of the author's supervisor, Dr. James Sick at Takushoku University Graduate College of Language Education, to whom the author also expresses sincere gratitude for helpful discussions and insightful comments on the manuscript.

Author's contributions
The author is the sole author of the manuscript. The author read and approved the final manuscript.

Authors' information
Masaki Akase is an Associate Professor in the Division of General Education at National Institute of Technology, Nagano College. His research interests include second language vocabulary acquisition, learning strategies, and individual differences. He also collaborates with other researchers on comparative analysis of English textbooks in Japan and other Asian countries.

Funding
The author would like to acknowledge the Japan Society for the Promotion of Science (JSPS) for providing financial support (Grant-in-Aid, Project Number: 20K00882) for this study.

Availability of data and materials
A copy of the linking test (VST 4) as well as summarized Rasch difficulty calibrations for the 156 items comprising that form will be provided as supplementary materials. Individual participant responses to all test items cannot be provided at the present time because permission to publish or provide raw data (non-summarized) was not granted by the institutional ethics committee.

Declarations
Ethics approval and consent to participate
Full consent to conduct the research and publish results was obtained from the National Institute of Technology, Nagano College, the institution where it was conducted. The NIT Ethics Oversight Committee reviewed and approved the research in accordance with institutional and official policies regarding research involving high school students below the age of majority. Participants were informed that their participation was voluntary and that test results would not affect their grades in any way.

Consent for publication
The institutional ethics committee gave permission for de-identified, summarized data to be published, on behalf of the participants. Approval from the institution, rather than the parents, is customary in Japan in the case of high school, junior high school, or elementary school students.