Investigating the effect of classroom-based feedback on speaking assessment: a multifaceted Rasch analysis

Due to subjectivity in oral assessment, much attention has been paid to obtaining a satisfactory measure of consistency among raters. However, the process for obtaining more consistency might not result in valid decisions. One matter that is at the core of both reliability and validity in oral assessment is rater training. Recently, multifaceted Rasch measurement (MFRM) has been adopted to address the problem of rater bias and inconsistency in scoring; however, no research has incorporated the facets of test takers’ ability, raters’ severity, task difficulty, group expertise, scale criterion category, and test version together in a single study along with their two-sided impacts. Moreover, little research has investigated how long rater training effects last. Consequently, this study explored the influence of the training program and feedback by having 20 raters score the oral production of 300 test-takers in three phases. The results indicated that training can lead to higher interrater reliability and reduced severity/leniency and bias. However, it will not bring the raters into total unanimity; it only makes them more self-consistent. Even though rater training might result in higher internal consistency among raters, it cannot simply eradicate individual differences related to their characteristics. That is, experienced raters, due to their idiosyncratic characteristics, did not benefit as much as inexperienced ones. This study also showed that the outcome of training might not endure in the long term; thus, ongoing training is required throughout the rating period to let raters regain consistency.

vital matter. Such importance calls upon valid and reliable approaches to assessing this skill (Hughes, 2011).
One of the most significant issues related to the scoring process is the rating scale and how it is developed and used. A majority of students' performances are scored subjectively in many speaking tests by utilizing a rating scale. Scoring descriptions can then be obtained by relating the assigned number to the relevant corresponding descriptor in the scoring rubric guide (Hazen, 2020). Two related issues here are, first, the criteria selected against which the students are to be rated and, second, the number of bands or categories in the rating scale that can be justified (Moradkhani & Goodarzi, 2020).
One issue that has always been regarded as an inherent source of evaluation error, and that may distort the true assessment of students' speaking competence, is rater variability (McNamara, 1996; Tavakoli et al., 2020). Therefore, rater effects must be considered in order to measure test takers' speaking competence appropriately. A lot of research (e.g., Tavakoli et al., 2020; Theobold, 2021) on second language speaking assessment by raters has concentrated on causes of rater variation. Such variables consist of rater severity, reciprocity with other facets of the scoring setting, and inter-rater reliability (Lynch & McNamara, 1998; Rezai et al., 2022).
Without rater consistency, raters are not likely to give equal scores to a single performance; thus, severity, which is the tendency of raters to award lower scores, and leniency, which is the reverse, are increased. This turns the assessment into a lottery in which it is a matter of chance which rater scores a particular test taker (Ahmadi, 2019). That is, a test taker may be scored by the most lenient member of the rater group and benefit as a consequence, or may be scored by the severest member and be disadvantaged as a result. Because speaking tests demand subjective assessment of this skill, much attention has been paid to achieving a satisfactory measure of consistency among raters so that scoring oral language can be done impartially and systematically (Kwon & Maeng, 2022). Nevertheless, the more emphasis is put on reliability, the less validity is obtained (Ghahderijani et al., 2021; Huang et al., 2020); in other words, emphasizing higher measures of reliability does not necessarily lead to valid measurements of speaking skills. What paves the way for both a reliable and a valid measurement of speaking skills is rater training.
On the contrary, McQueen and Congdon (1997) argue that although rater training is intended to maximize interrater agreement, it does not assure the quality of assessment. Some scholars (e.g., McNamara, 1996; Weigle, 1998) have cautioned against the hazards of compulsory consistency and, as a result, have underlined individual self-consistency (intra-rater agreement) as a more fruitful goal of the training program. It is well documented that, without such training, scoring is doomed to be extremely inconsistent (Iannone et al., 2020). A fairly substantial body of literature, commencing with the research done by Huang et al. (2020) and continuing up to now with the work of Davis (2019), establishes that training is a highly significant factor in the reliability of speaking ratings in both first and second language settings. Although it is well established that trained raters can rate students' performances reliably, there remain a number of questions about the validity of these ratings. This is because reliable ratings do not necessarily lead to valid judgments of speaking skills.
In performance assessment, rater training has also been discussed, although from various viewpoints, especially regarding the ultimate goal of achieving notable measures of consistency in scoring. Linacre (1989) specifies that unwanted error variance in scoring should be removed or diminished as much as possible; however, there are some conceptual and theoretical obstacles to fulfilling this objective. For example, even if we train raters to assign precisely similar scores to test-takers, which is far-fetched, there remain concerns regarding the interpretability of such scores.
The Multi-faceted Rasch model introduced by Linacre (2002), which can be run using the computer software FACETS, takes a different viewpoint on the issue of rater variability by both considering the rater factor in performance-based language testing and supplying feedback to the raters based on their performance in scoring (Ahmadian et al., 2019; Lumley & McNamara, 1995). Pioneers of the Rasch technique in assessment argue that it is impossible to train raters to obtain the same degree of severity (Lunz et al., 1990). In reality, the application of Rasch measurement removes the need to bring raters to higher consistency. This is because measures of test takers' abilities are free from those of raters' severity in assessment. However, Lumley and McNamara (1995) state that rater variation could be identified with respect to severity and random error. Thus, training and even retraining are suggested for those raters who are identified as misfitting by the Rasch technique (Lunz et al., 1990) to provide more self-consistency (intra-rater consistency) among raters. The implication is that rater training does not intend to force raters into consistency. Consequently, as Wigglesworth (1997) suggests, the primary purpose of rater training should be to prevent raters from applying their subjective judgments in the short term and, as a result, to alter their rating approaches in the long term.
Attempts to reduce raters' biases have produced conflicting results. In a related study, Wigglesworth (1997) found a reduction of bias as a result of feedback and training and found that the raters were able to incorporate the feedback in their subsequent ratings. However, more recent studies have found little or no significant effect (Elder et al., 2007; John Bernardin et al., 2016; Rosales Sánchez et al., 2019). Wigglesworth (1997) investigated bias in the context of rater training to evaluate both live and tape-based oral tests. She observed different behaviors and significant variations in how the raters responded to various test criteria based on the modality of the interview. Some raters were severer on fluency or vocabulary, while others rated them more leniently. The raters also differed in their severity estimates for different task types. Nevertheless, it seemed that raters were able to incorporate the feedback they received in their subsequent ratings, since their level of bias was reduced to a considerable extent compared to the previous ratings. However, in Wigglesworth's study raters were first given the feedback and then attended a group rater training session; therefore, it is not clear whether the changes in rating behavior were due to the individual feedback alone or to both the bias reports and the rater training session. Fan and Yan (2020) investigated the consistency of raters' severity/leniency over several grading intervals by using MFRM. The outcomes of data analysis demonstrated significant instability in two of the three scoring periods ranging from one to four. In another study, Lumley and McNamara (1995) investigated three sets of grading of a spoken English test over 20 months. The findings on the interactional effects of the time and rater facets showed a significant change in raters' severity.
Studies conducted by Brown (2005) revealed that trained raters who are native or advanced speakers of the target language score test takers with the same degree of severity and consistency as other raters do.
The use of MFRM in bias analysis has several implications for performance assessment. First, MFRM helps researchers study the rater facet in relation to their facet of interest by keeping the other facets constant and neutral (Lunz et al., 1990). Second, it can help researchers in administering rater training programs. Research has shown that rater consistency and rating validity can be increased through training (McQueen & Congdon, 1997). Third, MFRM can help reduce self-inconsistency and increase intra-rater reliability, which increases the fairness of tests, specifically placement and summative evaluation tests (Davis, 2019). Tavakoli et al. (2020) investigated the rating of 40 essays written by Japanese students by employing 40 native English speakers. Each rater scored all 40 essays on a six-point analytic rating scale of five categories. The results showed that some raters scored higher-ability test-takers more severely and lower-ability ones more leniently than expected. Brown (2005) studied rater bias in a face-to-face oral test of Japanese EFL learners. The results of MFRM showed significant bias in scoring criteria but no significant bias in task fulfillment.
Nevertheless, much of the research done up to now has explored the use of FACETS on just a couple of facets, for instance, research on raters' severity or leniency toward test-takers (Lynch & McNamara, 1998), task types (Wigglesworth, 1997), and specific rating time (Lumley & McNamara, 1995; Vadivel & Beena, 2019). However, no research so far has incorporated the facets of test takers' ability, raters' severity, task difficulty, group expertise, scale criterion category, and test version all in one study together with their two-sided impacts.
Even though earlier research on rater variation has emphasized achieving higher measures of raters' consistency as the ultimate aim of rater training (Bijani & Fahim, 2011; Lumley & McNamara, 1995; McNamara, 1996; Vadivel et al., 2021), rater variability can still be traced following training, not only in rater severity but also in internal consistency. Also, the dynamic and unpredictable nature of oral interaction calls into question the reliability of the measure of oral competence. This unpredictability will also affect test validity. In other words, test takers may receive different scores on different occasions from different raters. There is a considerable amount of research exploring the discourse of oral language interviews (e.g., Brown, 2005); however, little research has investigated the variation among raters.
Although it is established that rater training has a significant role in promoting higher consistency among raters in terms of their rating behaviors, there is still a paucity of information about how training functions to provide higher measures of consistency among raters. Even if several rater training impacts have been specified, there are still few studies stipulating such impacts (Brown, 2005; Liu et al., 2021). In addition, little research has explored the duration of rater training effects (Bijani, 2010). There are studies exploring the effectiveness of the training program over short periods, but few studies have investigated its effectiveness a long time after training, even though Lumley and McNamara (1995) suggested that the outcomes of training might not endure in the long term and that raters may change over time. Thus, the need for renewed training is worth investigating.
The results of this study can provide fruitful information to teachers who are in pre-service teacher education programs or who are already in in-service education programs. Since teachers might assess students' speaking performances for a variety of reasons, the provision of opportunities to practice rating, accompanied by individual rating feedback, can help raters improve their rating ability. Moreover, the results of this study could be used for raters with various degrees of rating proficiency, both inexperienced and experienced. Also, the results of this study can demonstrate characteristics to be used in rater training and teacher education.
Therefore, this study focused on raters' severity, bias and interaction measures, and internal consistency, considering the interaction of the six facets used in the study (test takers' ability, rater severity, raters' group expertise, task difficulty, test version, and rating scale criteria) using a quantitative approach. Each rater's rating behavior was first analyzed so that feedback could be provided to them accordingly. Then, the scoring behaviors of the two groups of raters (experienced and inexperienced raters) were investigated. In addition, this study investigated the improvement of rating ability over time in both rater groups. Also, the two groups of raters were compared with each other in each rating session. Therefore, the following research questions were formed: 1. How much of test takers' total score variance can be accounted for in each facet (test takers' ability, raters' severity, task difficulty, group expertise, scale criterion category, and test version)? 2. To what extent was the provided feedback successful following the training program regarding severity, bias, and consistency measures?

Design
In order to investigate the research questions outlined in the first chapter of this dissertation, the researcher employed a pre-post, mixed-methods research design in which a combination of quantitative and qualitative approaches was used to investigate the raters' development over time concerning rating L2 speaking performance (Cohen et al., 2007). This method offered a comprehensive approach to the investigation of the research questions involving a comparison of raters' and test takers' perceptions before and after the rater training program. In addition, the type of sampling which was used in this study was "subjects of convenience", that is the subjects were selected based on certain reasons and they were not selected randomly (Dörnyei, 2007).

Participants
A total of 300 adult Iranian students of English as a Foreign Language (EFL), consisting of 150 males and 150 females between the ages of 17 and 44, took part in this research as test-takers. The participants were chosen from the Intermediate, High-intermediate, and Advanced levels at the Iran Language Institute (ILI). Students at these levels were chosen because they had already acquired the fundamental principles of academic oral performance. Although the test takers were selected from three different English proficiency levels at the ILI, institutional level alone could not be a valid criterion for classifying learners into different proficiency levels. Thus, to make sure that the three groups of test-takers differed in language proficiency, a TOEFL test was administered, and an ANOVA was run to determine whether there was a significant mean difference among the scores of the three groups. Table 1 demonstrates the ANOVA statistical analysis of the TOEFL scores of the three groups of test-takers.
The outcome shows that there is a significant difference among the three groups of test takers with respect to their general language proficiency (TOEFL score).
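As an illustration of the comparison described above, the following minimal sketch computes a one-way ANOVA F statistic for three groups. The scores are hypothetical placeholders, not the study's TOEFL data, and the function name `one_way_anova_f` is introduced here only for illustration; the source does not state which software was used for this step.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of scores per group.
    """
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k = len(groups)          # number of groups
    n = len(all_scores)      # total number of observations

    # Between-group sum of squares: group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scores around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical TOEFL-like scores for the three proficiency levels
# (illustrative values only, NOT the data summarized in Table 1).
intermediate = [450, 460, 455, 470, 465]
high_intermediate = [500, 510, 505, 495, 515]
advanced = [560, 570, 555, 565, 575]

f_stat, df_b, df_w = one_way_anova_f([intermediate, high_intermediate, advanced])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to its degrees of freedom indicates that the group means differ by more than within-group variation would suggest; the p-value would then be read from the F distribution.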
A total of 20 Iranian EFL teachers, consisting of 10 males and 10 females between the ages of 24 and 58, took part in this research as raters. The raters held Bachelor's and Master's degrees in English language-related majors and worked in various public and private academic centers. As one of the prerequisites of this study, the raters had to be separated into experienced and inexperienced groups in order to explore their similarities and differences and to investigate which group might outperform the other. In addition, to keep the data provided by the raters confidential, their names and identities were anonymized by assigning each of them a number from 1 to 10.
The raters were provided with a background questionnaire, adapted from McNamara and Lumley (1997), which collected information on (1) demographics, (2) rating experience, (3) teaching experience, (4) rater training, and (5) relevant courses passed. The obtained data are summarized in Table 2.
Thus, the raters were classified into two expertise groups based on their experiences stated above.
A. Raters with no or fewer than two years of experience in rating and undertaking rater training, no or fewer than five years of experience in English language teaching, and who had passed fewer than the four core courses relevant to English language teaching. From now on these raters are referred to as new raters.
B. Raters with two or more years of experience in rating and undertaking rater training, five or more years of experience in English language teaching, and who had passed all four core courses relevant to English language teaching as well as a minimum of two other selective courses. From now on these raters are referred to as old raters.
A more important reason for choosing these expertise groups was to investigate any differences between experienced and inexperienced raters in terms of how they approach the task of oral assessment and how they are affected by the rating process. It should be noted that, in order to eliminate the rater expectancy effect, the raters were not informed of the existence of two different groups or of any similarities and differences between the two. Table 3 displays the summary characteristics of the raters participating in the study.
It should be noted that all the participants were informed in advance that they were participating in a research study, and the researchers obtained their oral consent for the outcomes of this research to be used in publications, on the condition that their identities would be kept anonymous.

Instruments
The present study used the Community English Program (CEP) test to evaluate test takers' speaking ability in different settings. The goal of the speaking test is to evaluate to what extent the speakers of a second language can produce meaningful, coherent, and contextually appropriate responses to the following five tasks. Task 1 (description task) is an independent-skill task that draws on test takers' personal experience and is answered without input provision (Bachman & Palmer, 1996). Tasks 3 (summarizing task) and 4 (role-play task) draw on test takers' listening ability in responding orally to given input. In other words, the response content is given to the test takers via short and long listening passages. For tasks 2 (narration task) and 5 (exposition task), the test takers are required to respond to pictorial prompts consisting of a series of photos, graphs, figures, and tables. The aforementioned tasks were implemented via two delivery methods: (1) direct and (2) semi-direct. The former is intended for individual face-to-face administration, whereas the semi-direct test is mainly intended for use in a language laboratory context.
As one of the requirements of this study to evaluate the influence of using a scoring rubric on the validity and reliability of assessing test takers' oral ability, this study aimed to employ an analytic rating scale. The purpose of using an analytic rating scale was to assess test takers' oral performance to determine the extent to which it evaluates the oral proficiency of test-takers more validly and reliably. For either version of the test, all the test takers' task performances were evaluated by the use of the ETS (2001) analytic rating scale. In ETS (2001) rating scale, evaluation is done based on fluency, grammar, vocabulary, intelligibility, cohesion, and comprehension. Each of these criteria is accompanied by a set of 7 descriptors. All scoring is done on a Likert scale from 1 to 7.
The reliability of the test was estimated. According to Table 4, the reliability of the questionnaire as a whole, comprising 20 items, was α ≥ 87.7%, which, according to Cohen's table of effect size, is considered much larger than typical.
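The source reports the reliability coefficient only as α. Assuming this refers to Cronbach's alpha, which is the usual reading for an internal-consistency estimate over items, the computation can be sketched as follows. The response matrix is hypothetical and the function name `cronbach_alpha` is introduced only for this example.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a respondents-by-items score matrix.

    responses: list of rows, each row holding one respondent's item scores.
    Uses sample variances (n - 1 in the denominator) throughout.
    """
    k = len(responses[0])                      # number of items
    item_columns = list(zip(*responses))       # transpose to items
    item_variances = [variance(col) for col in item_columns]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores: 4 respondents x 3 items (NOT the study's data).
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [3, 3, 4],
    [5, 6, 6],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.3f}")
```

Values closer to 1 indicate that the items vary together, which is what a coefficient such as the reported 87.7% would express on a percentage scale.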
Also, to ascertain the validity of the test, a confirmatory factor analysis (CFA) was run, and the obtained model fit indices were NFI (normed fit index) = 0.91, CFI (comparative fit index) = 0.92, TLI (Tucker-Lewis index) = 0.95, SRMR (standardized root mean square residual) = 0.06, and RMSEA (root mean square error of approximation) = 0.042. All the obtained indices indicate the goodness of the model and confirm the validity of the questionnaire.

Pre-training phase
The 300 students were randomly selected to take a sample TOEFL (iBT) test including listening, structure, and reading comprehension to make sure that they were not at the same level of language proficiency and that there was a significant difference between the three groups. Meanwhile, the raters were given the background questionnaire before the test tasks were run and data were collected. As indicated before, this was intended to separate the raters into the two groups of experienced and inexperienced ones.
Once it had been established that the three groups of test-takers were at different levels of language proficiency, and the raters' background information and level of expertise had been identified and used to classify them as inexperienced or experienced, the speaking test started. It is worth noting that the 300 test-takers who took part in this research were separated into three groups, each of which would take part in one stage of this research (pre-, immediate post-, or delayed post-training). Half of the members of each group would participate in the direct and the other half in the semi-direct version of the speaking test. The reason why all the raters did not take part in both versions of the oral test was the impact of each version, which would most likely influence their performance in the other test version. Such an action would familiarize the raters with the type of questions appearing in either version and would thus negatively influence the validity of the research. The raters were then given a week to submit their ratings, based on the six-band analytic rating scale, to the researcher.

Rater training procedure
Once the pre-training phase was over, the raters took part in a training or norming session during which they became familiar with the oral tasks and the rating scales. The training program was conducted by the first author of this article, who is an authorized IELTS instructor and holds a Ph.D. in Teaching English as a Foreign Language (TEFL). He trained the raters in two sessions and also evaluated all test takers' performances in the three phases of the study to serve as benchmarks for raters' ratings and further data analysis.
They also had the opportunity to practice the instructed materials provided with a number of sample responses collected from some similar proficiency-level students other than the ones participating in this study as test-takers. The researcher gave each rater information about the scoring process as the objective of the training program was to make raters with various degrees of expertise familiar with significant aspects of scoring while they score each student's speech production.
In the meantime, the responses which were previously recorded were played for the raters as they were monitored and provided with direct guidance from the trainer. The raters were also encouraged to form panel discussions and share their justifications and reasons behind the scores they decided to assign while giving reference to the scoring rubric.
The trainer also provided individual feedback for each rater regarding their previous ratings during the pre-training phase. This is what Wallace (1991) stresses in rater training programs. He believes that acquired knowledge is internalized through reflection, not merely through repeated practice. Such feedback gives the raters a chance to reflect upon their scoring behavior. Since each rater possesses a different rating ability and rating behavior, each rater needed to be provided with feedback individually.

Immediate post-training phase
Immediately following the rater training program, when the raters had acquired the required skill in rating speaking ability, the speaking tasks (description, summarizing, role-play, narration, and exposition) of both versions of the test (direct and semi-direct) were administered one by one. As in the pre-training data collection procedure, the second third of the test takers (100 students) were tested and data were elicited from them. Again, the oral tasks were assessed using the ETS rating scale. The selection of 100 oral performance data in whole ([100 × 5] semi-direct + [100 × 5] direct = 1000 in general) of both methods at this stage was done randomly for each rater. Randomization was done to counteract the influence of the sequencing of performances on the raters' behavior, so that they could not remember how many performances at a particular score level they had rated.

Delayed post-training phase
Exactly 2 months (as suggested by McNamara, 1996) after the immediate post-training data collection, the fourth phase of the data collection procedure was administered. In this phase, the last third of the test takers (including 100 students) were tested from whom data were elicited. The raters were provided with the collected data to rate based on the knowledge they had already gained during the rater training program two months before. The aim was to observe the delayed impact of the training program on raters and also the degree of inter-rater reliability. The expectation was that raters were still consistent in rating.

Data analysis
Quantitative data (i.e., raters' scores based on an analytic rating scale) were gathered and analyzed with MFRM during three scoring sessions. In order to compensate for the impact of test methods and rating factors, MFRM has been widely used in second/foreign language performance assessments. By estimating the probabilities of patterns of responses, MFRM estimates the variability associated with factors such as test-takers, raters, rater groups, tasks, and rating scales involved in performance assessment procedures (McNamara, 1996). MFRM also provides information on rater characteristics, specifically severity, consistency, randomness, and inter-rater reliability (McNamara, 1996). Consequently, studies on rater variability in the speaking and writing scoring process have often used MFRM in their analyses (McNamara, 1996; Weigle, 1998). In the present study, MFRM was performed using the FACETS program after each rating stage or phase of the study to examine both individual rater and rater group scoring patterns. A six-facet model was used, including the facets of test taker (test takers' ability), rater (raters' severity/leniency), rater group (experienced/inexperienced), task (the tasks used in the study), rating criterion (categories of the analytic rating scale), and test method (direct/semi-direct). Therefore, a six-facet partial credit model was employed.
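The source names a six-facet partial credit model but does not print its equation. Under the usual many-facet Rasch formulation, the log-odds of receiving category k rather than category k-1 equals the test taker's ability minus the measures of the other facets and minus the step threshold F_k. The sketch below computes category probabilities under that assumption. It is an illustration, not the FACETS implementation: the function names, the subtractive orientation of every non-test-taker facet, and the threshold values are assumptions made here; in a partial credit specification the thresholds passed in would be those of the particular element (for example, the criterion) being rated.

```python
import math


def category_probabilities(ability, severity, group, task, criterion,
                           version, thresholds):
    """Category probabilities under a many-facet Rasch model (sketch).

    All measures are in logits. thresholds[k-1] is the step difficulty
    F_k for moving from category k-1 to category k, so the result has
    len(thresholds) + 1 entries, indexed from the lowest category.
    """
    # Assumed orientation: every facet other than ability is subtracted.
    eta = ability - severity - group - task - criterion - version

    # Log-numerator of category k is the cumulative sum of (eta - F_j).
    log_numerators = [0.0]
    for f in thresholds:
        log_numerators.append(log_numerators[-1] + (eta - f))

    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_numerators)
    weights = [math.exp(v - m) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def expected_category(probabilities):
    """Expected category index (0-based) under the given probabilities."""
    return sum(k * p for k, p in enumerate(probabilities))


# Hypothetical step thresholds for a six-category scale (illustrative).
thresholds = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Severity, task, and criterion values echo measures reported in the
# Results (OLD8, NEW6, Exposition, Cohesion); group and version are set
# to 0.0 because their measures are not given in this section.
severe = category_probabilities(2.0, 1.72, 0.0, 0.82, 0.79, 0.0, thresholds)
lenient = category_probabilities(2.0, -1.97, 0.0, 0.82, 0.79, 0.0, thresholds)

print("severe rater: ", [round(p, 3) for p in severe])
print("lenient rater:", [round(p, 3) for p in lenient])
```

Holding the test taker, task, and criterion fixed, the more lenient rater shifts probability toward the higher categories, which is the rater effect the model is meant to separate from ability.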
The patterns of the awarded scores of the two groups of raters (new and old) were investigated each time they rated test takers' oral performances by the use of an analytic rating scale. The quantitative data were compared (1) across the two groups of raters to explore the raters' capability cross-sectionally at each scoring stage, and (2) within each rater group to study the improvement of the raters' ability.

Results
Having analyzed the data during the pre-training phase, the FACETS variable map representing all the facets was obtained. In the FACETS variable map, presented in Fig. 1, the facets are placed on a common logit scale that facilitates interpretation and comparison across and within the facets in one report. The figure plots test takers' ability, raters' severity, task difficulty, scale criterion difficulty, test version difficulty, and group expertise. According to McNamara (1996), the logit scale is a measurement scale that expresses the probabilities of test takers' responses in various conditions of measurement. It also contains the means and standard deviations of the distributions of estimates for test-takers, raters, and tasks at the bottom.
The first column (logit scale) in the map depicts the logit scale. It acts as a fixed reference frame for all the facets. It is a true interval scale with equal distances between the intervals (Prieto & Nieto, 2019). Here, the scale ranges from 4.0 to -4.0 logits.
The second column (Test Taker) displays estimates of test takers' proficiency. Each star represents a single test taker. Higher scoring (more competent) test takers are at the top of the column, whereas lower scoring (less competent) ones are at the bottom. Here, test takers' proficiency ranges from 3.81 to -3.69 logits, a spread of 7.50 logits in test takers' ability. It is worth noting that no test taker was identified as misfitting; thus, none of them was excluded from data analysis during the pre-training phase of this research. The third column (rater) displays raters according to their severity or leniency estimates in scoring test takers' oral proficiency. Since more than one rater scored each test taker's performance, raters' severity or leniency scoring patterns can be estimated. This gives us raters' severity indices. In this column, each star represents one rater. Severer raters appear at the top of the column, whereas more lenient ones appear at the bottom. In the pre-training phase, rater OLD8 (severity measure 1.72) was the severest rater and rater NEW6 (severity measure -1.97) was the most lenient. In addition, in this phase, OLD raters were, on average, somewhat severer than NEW raters. Here, raters' severity estimates range from 1.72 to -1.97 logits, a spread of 3.69 logits, which is much narrower than the distribution of the test takers' proficiency measures (logit range = 7.50), in which the highest and lowest proficiency logit measures were 3.81 and -3.69 respectively. This demonstrates that the effect of individual differences among raters on test-takers was relatively small. Raters, as shown in the figure, seem to be spread equally above and below 0.00 logits.
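The spreads quoted above follow directly from the reported extreme measures. A minimal arithmetic check, using only the values stated in the text (the helper name `logit_spread` is introduced here for illustration):

```python
def logit_spread(highest, lowest):
    """Distance in logits between the highest and lowest measure."""
    return round(highest - lowest, 2)


# Extreme measures as reported for the pre-training phase.
test_taker_spread = logit_spread(3.81, -3.69)
rater_spread = logit_spread(1.72, -1.97)

print("test taker spread:", test_taker_spread)
print("rater spread:", rater_spread)
print("rater spread as a share of test taker spread:",
      round(rater_spread / test_taker_spread, 2))
```

The rater spread is roughly half the test taker spread, which is the basis for describing rater differences as relatively small in comparison with ability differences.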
The fourth column (task) displays the oral tasks used in this study in terms of their difficulty estimates. The tasks at the top of the column were harder for the test takers to perform than the ones at the bottom. Here, the Exposition task (logit value = 0.82) was the hardest for the test takers, while the Description task (logit value = -0.37) was the least difficult, a spread of 1.19 logits. This column shows the lowest variation, with all of its elements gathered around the mean.
The fifth column (scale category) displays how severely each rating scale category was scored. The most severely rated scale category appears at the top and the least severely rated one at the bottom. Here, Cohesion was the most severely scored category (logit value = 0.79), whereas Grammar was the least severely scored (logit value = -0.46).
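The spreads quoted for the pre-training map follow directly from the reported extremes of each column. The short Python sketch below only restates that arithmetic; the dictionary and function names are illustrative and not part of the FACETS output, and the scale-category spread is derived here rather than reported in the text.

```python
# Pre-training extremes (highest, lowest) in logits, as reported in the text.
PRE_TRAINING_EXTREMES = {
    "test takers": (3.81, -3.69),
    "raters": (1.72, -1.97),
    "tasks": (0.82, -0.37),
    "scale categories": (0.79, -0.46),
}


def spread(highest: float, lowest: float) -> float:
    """Return the logit range covered by a facet, rounded to two decimals."""
    return round(highest - lowest, 2)


# Hypothetical helper structure: one spread per facet.
spreads = {
    facet: spread(highest, lowest)
    for facet, (highest, lowest) in PRE_TRAINING_EXTREMES.items()
}

for facet, value in spreads.items():
    print(f"{facet}: {value:.2f} logits")
```

Under these figures the task column has the smallest spread, which agrees with the observation that it shows the lowest variation.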
Columns 6 to 11 (rating scale categories) display the six-point rating scale categories employed by the raters to evaluate the test takers' oral performances. The horizontal lines across the columns are the category threshold measures, which specify the points at which the probability of achieving the next rating (score) starts. The figure shows that every score level was used, although the extreme points were used less frequently. Here, test takers with proficiency measures between -1.0 and +1.0 logits were likely to receive ratings of 3 to 4 in Cohesion. Similarly, test takers with a proficiency of 2.0 logits had a relatively high probability of receiving a 5 from a rater at the severity level of 2.0 in Intelligibility.
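For readers less familiar with MFRM, the way the columns of the variable map share a single logit scale can be summarized with the general form of the many-facet Rasch model. The expression below is a sketch under assumptions: the symbols are illustrative and not taken from this study, only four facets are written out, and the remaining facets analyzed here (test version and group expertise) would presumably enter as further subtracted terms.

```latex
% Illustrative notation (assumed, not the authors'):
%   B_n : ability of test taker n
%   C_j : severity of rater j
%   D_i : difficulty of task i
%   E_c : difficulty of scale category (criterion) c
%   F_k : threshold between rating k-1 and rating k
\ln\left(\frac{P_{njick}}{P_{njic(k-1)}}\right) = B_n - C_j - D_i - E_c - F_k
```

Here $P_{njick}$ is the probability that test taker $n$ receives rating $k$ from rater $j$ on task $i$ and criterion $c$. Because every term is expressed in logits, a more able test taker raises the log-odds of the higher rating, while a more severe rater or a harder task lowers it.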

RQ1: How much of test takers' total score variance can be accounted for in each facet?
The FACETS program makes it possible to determine how much of the score variance is attributable to each of the facets employed. Accordingly, a further analysis was carried out to measure the extent to which the total score variance is associated with each of the facets identified in this study. Table 5 shows the percentage of total score variance associated with each facet prior to the training program. The table shows that the greatest percentage of the total variance (44.82%) is related to differences in test takers' ability; however, the remaining variance (55.18%) is related to other facets, including rater severity, group expertise, test version, task difficulty, and scale categories.
The rather high percentage of total score variance attributable to facets other than test takers' ability at the pre-training phase calls for caution about unsystematic rating and the presence of undesirable facets influencing the final score. Furthermore, the rater facet accounts for a substantial share of total test variance (26.13%), which indicates likely inconsistency and disagreement among raters in their judgments: some raters are relatively more severe or more lenient towards the test takers than others. This finding means that test takers will be scored differently depending on the rater. The rather small effect of the other facets, including test version, task difficulty, and scale categories, shows that there is only a slight bilateral and multilateral interaction among the facets involved in test variability, which suggests that the combination of these facets has a neutralizing effect on test variability.
After the data from the immediate post-training phase were analyzed, the FACETS variable map, which represents all the facets and summarizes the main information about each one, was obtained. The map, displayed in Fig. 2, plots test takers' ability, raters' severity, task difficulty, scale criterion difficulty, test version difficulty, and group expertise.
The second column (test taker) displays estimates of test takers' proficiency. Here, test takers' proficiency ranges from 3.62 to -3.16 logits, a spread of 6.78 logits. The reduction in the logit range of test takers' proficiency from 7.50 (before training) to 6.78 (after training) shows that they were rated more similarly with regard to severity/leniency indices. That is, the test takers became more clustered around the mean in raters' scoring of their oral proficiency level.
The third column (rater) displays raters' severity or leniency estimates in rating test takers' oral proficiency. Here, raters' severity estimates range from 1.26 to -1.05 logits, so the distribution of rater severity measures (logit range = 2.31) is again far narrower (almost one-third) than the distribution of test takers' proficiency measures (logit range = 6.78), whose highest and lowest values were 3.62 and -3.16 logits, respectively. This demonstrates that the effect of individual rater differences on test takers was relatively small. As in the pre-training phase, raters, as shown in the figure, appear to be spread equally above and below 0.00 logits. Besides, the marked reduction in the spread of raters' severity measures from 3.69 logits in the pre-training phase to 2.31 logits in the immediate post-training phase demonstrates the efficiency of the training program in bringing raters closer to one another in severity/leniency. In other words, they rated more similarly after the training program.
The fourth column (task) displays the oral tasks used in this study in terms of their difficulty estimates. Here, the Exposition task (logit value = 0.61) was the hardest for the test takers, while the Description task (logit value = -0.14) was the least difficult, a spread of 0.75 logits. The reduction in logit range compared to the pre-training phase indicates that the tasks were rated with less severity and leniency. This column shows the lowest variation, with all of its elements gathered around the mean.
The fifth column (scale category) displays how severely each rating scale category was scored. Here, Cohesion was the most severely scored category (logit value = 0.58), whereas Grammar was the least severely scored (logit value = -0.17).
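The comparison between the pre-training and immediate post-training maps can be checked with a few lines of Python. This is a minimal sketch that takes the reported logit ranges as given; the function names and the percentage-reduction summary are illustrative additions and do not come from the FACETS output.

```python
# Logit ranges reported in the text for the first two phases.
SPREADS = {
    "pre-training": {"test takers": 7.50, "raters": 3.69, "tasks": 1.19},
    "immediate post-training": {"test takers": 6.78, "raters": 2.31, "tasks": 0.75},
}


def rater_to_test_taker_ratio(phase: str) -> float:
    """Rater severity spread as a fraction of the test-taker proficiency spread."""
    values = SPREADS[phase]
    return round(values["raters"] / values["test takers"], 2)


def percent_reduction(facet: str) -> float:
    """Percentage by which a facet's spread shrank from pre- to immediate post-training."""
    before = SPREADS["pre-training"][facet]
    after = SPREADS["immediate post-training"][facet]
    return round(100 * (before - after) / before, 1)


for phase in SPREADS:
    print(f"{phase}: rater/test-taker spread ratio = {rater_to_test_taker_ratio(phase)}")

for facet in ("test takers", "raters", "tasks"):
    print(f"{facet}: spread reduced by {percent_reduction(facet)}%")
```

With these figures the ratio in the immediate post-training phase is close to one-third, in line with the description above.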
As in the pre-training phase, the total score variance attributable to each facet was calculated to measure the effect of each facet immediately following the training program. Table 6 displays the percentage of total score variance associated with each of the facets used in the study at the immediate post-training phase. The table shows that the greatest percentage of the total variance (67.12%) is related to differences in test takers' ability; however, the remaining variance (32.88%) is related to other facets, including rater severity, group expertise, test version, task difficulty, and scale categories.
The considerable increase in the percentage of total score variance attributed to test takers' ability, together with the reduction in the percentage attributed to other facets, indicates a marked increase in the systematicity and consistency of scoring following the training program. In other words, the training program was quite effective in reducing the influence of undesirable facets and unsystematic scoring on total score variance in the immediate post-training phase. The scoring procedure moved towards consistency, such that the majority of score variance was associated with differences in test takers' performance ability.
Having analyzed the data at the delayed post-training phase of this research, the FACETS variable map representing all the facets was obtained. The FACETS variable map, displayed in Fig. 3, plots test takers' ability, raters' severity, task difficulty, scale criterion difficulty, test version difficulty, and group expertise.
The second column (test taker) displays estimates of test takers' proficiency. Here, test takers' proficiency ranges from 3.70 to -3.53 logits, a spread of 7.23 logits.
The third column (rater) displays raters' severity or leniency estimates in rating test takers' oral proficiency. Here, raters' severity estimates range from 1.28 to -1.26 logits, so the distribution of rater severity measures (logit range = 2.54) is again far narrower (almost one-third) than the distribution of test takers' proficiency measures (logit range = 7.23), whose highest and lowest values were 3.70 and -3.53 logits, respectively. This demonstrates that the effect of individual rater differences on test takers was relatively small. As in the previous two phases of the study, raters, as shown in the figure, appear to be spread equally above and below 0.00 logits. A comparison of the severity distributions shows that raters were still closer to one another in the delayed post-training phase (2.54 logits) than in the pre-training phase (3.69 logits), which points to the rather long-lasting effectiveness of the training program. However, the increase in the severity logit range compared to the immediate post-training phase (2.31 logits) reflects the raters' tendency to drift gradually back to their own way of rating, which implies a need for ongoing training programs at specific intervals.
The fourth column (task) displays the oral tasks used in this study in terms of their difficulty estimates. Here, the Exposition task (logit value = 0.66) was the hardest for the test takers, while the Description task (logit value = -0.24) was the least difficult. This column shows the lowest variation, with all of its elements gathered around the mean.
The fifth column (scale category) displays how severely each rating scale category was scored. The most severely scored scale category is at the top and the least severely scored one is at the bottom. Here, Cohesion was the most severely scored category (logit value = 0.62), whereas Vocabulary was the least severely scored (logit value = -0.24).
Figures 4, 5, 6, 7, 8, and 9 plot the raters' bias interaction with the test takers, in z-scores, for NEW and OLD raters at the three phases of the study. The graphs display all rater biases, whether significant or not. In each plot, the curved line displays the raters' severity in logits. The • symbols show z-scores that indicate non-significant bias, and the ✖ symbols indicate significant bias.
Bijani et al. Language Testing in Asia (2022) 12:26
Pre-training: there were 3 significant biases for NEW raters, all of which were identified as significantly lenient. For OLD raters, the data showed 4 significant biases, of which 3 were identified as significantly severe and 1 as lenient.

Fig. 5 New raters' bias interaction (pre-training)
Immediate post-training: there were 3 significant biases for OLD raters, all of which were identified as significantly severe. No NEW rater showed a significant bias in the immediate post-training phase of the study.
Delayed post-training: there was 1 significant bias for NEW raters, identified as significantly lenient; however, the leniency was only slightly below the acceptable range and could therefore be ignored. For OLD raters, the data showed 4 significant biases, of which 3 were identified as significantly severe and 1 as lenient. One rater was on the borderline of the severity measure.
Additionally, the raters' infit mean square values were used to represent graphically the raters' consistency throughout the three phases of the study. As indicated before, an infit mean square between 0.6 and 1.4 is considered acceptable (Wright & Linacre, 1994). Fig. 10 plots the change in raters' rating consistency, based on infit mean square values, across the three phases of the study.
The raters achieved more consistency in the immediate post-training phase. In the delayed post-training phase, although the raters were still more consistent than in the pre-training phase, their consistency had decreased considerably compared to the immediate post-training phase. For a great number of the raters, the training program and feedback were beneficial and brought them within the acceptable range of consistency after training. Only rater OLD8 (Infit MnSq. = 0.5) still displayed inconsistency after training. In the delayed post-training phase, a few more raters appear to have lost consistency compared to the immediate post-training phase: raters OLD3 and OLD8, with infit mean square values of 1.5 and 0.4, respectively, showed inconsistency. It should be noted that the raters who did not improve, or who even lost consistency after training, were among those who were not positive about the rater training program and the feedback they were to be provided with.
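The fit check described above amounts to comparing each rater's infit mean square with the 0.6 to 1.4 band. The following Python sketch illustrates that rule using only the three values quoted in the text; the function name and the record layout are hypothetical, and the overfit/underfit labels follow the fit-statistics terminology used elsewhere in this paper.

```python
# Acceptable infit mean square band cited in the text (Wright & Linacre, 1994).
LOWER_BOUND = 0.6
UPPER_BOUND = 1.4


def classify_infit(mean_square: float) -> str:
    """Label a rater's infit mean square as overfit, acceptable, or underfit."""
    if mean_square < LOWER_BOUND:
        return "overfit"    # less variation than the model expects
    if mean_square > UPPER_BOUND:
        return "underfit"   # more variation than the model expects (misfit)
    return "acceptable"


# The only infit values quoted in the text: (phase, rater, infit mean square).
REPORTED_VALUES = [
    ("immediate post-training", "OLD8", 0.5),
    ("delayed post-training", "OLD3", 1.5),
    ("delayed post-training", "OLD8", 0.4),
]

for phase, rater, mean_square in REPORTED_VALUES:
    print(f"{phase}: {rater} ({mean_square}) -> {classify_infit(mean_square)}")
```

Under this rule OLD8 falls below the band in both post-training phases, whereas OLD3 falls above it in the delayed phase.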
As in the previous two phases of the study, the total score variance associated with each facet was calculated to measure the effect of each facet during the delayed post-training phase. Table 7 displays the percentage of total score variance associated with each of the facets used in the study at the delayed post-training phase. The table shows that once again the greatest percentage of the total variance (61.85%) is attributed to differences in test takers' ability; however, the remaining variance (38.15%) is related to other facets, including rater severity, group expertise, test version, task difficulty, and scale categories.
In the delayed post-training phase, scoring remained considerably more consistent, and the influence of other intervening facets on total score variance remained considerably lower, than before training. Here, a considerable share of the total score variance is related to differences in test takers' oral ability, which shows the relative systematicity and consistency of scoring compared to the pre-training phase. This outcome provides evidence of the ongoing efficiency of the training program in the long term. However, compared to the immediate post-training phase, the total score variance associated with test takers' ability decreased and the variance related to other intervening facets increased. Although this outcome still shows that scoring was largely based on test takers' oral ability, it points to a gradual loss of consistency and an increase in error and unsystematicity after training.
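The variance figures reported for the three phases can be lined up in a short Python sketch. It only restates the percentages given in the text (Tables 5 to 7); the names are illustrative, and the phase-to-phase changes are simple differences in percentage points rather than values reported by FACETS.

```python
# Percentage of total score variance attributed to test takers' ability, per phase.
ABILITY_SHARE = {
    "pre-training": 44.82,
    "immediate post-training": 67.12,
    "delayed post-training": 61.85,
}


def other_facets_share(phase: str) -> float:
    """Percentage of variance left to all facets other than test-taker ability."""
    return round(100 - ABILITY_SHARE[phase], 2)


def change_in_ability_share(earlier: str, later: str) -> float:
    """Change in the ability share between two phases, in percentage points."""
    return round(ABILITY_SHARE[later] - ABILITY_SHARE[earlier], 2)


for phase, share in ABILITY_SHARE.items():
    print(f"{phase}: ability {share}%, other facets {other_facets_share(phase)}%")

print(change_in_ability_share("pre-training", "immediate post-training"))
print(change_in_ability_share("immediate post-training", "delayed post-training"))
```

The ability share rises sharply right after training and then falls back somewhat in the delayed phase, while staying well above its pre-training level.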

RQ2: To what extent was the provided feedback successful following the training program concerning severity, bias, and consistency measures?
The following tables (Tables 5, 6, 7, and 8) demonstrate the results of training and feedback provision on severity, bias, and consistency measures during the three phases for both successful and unsuccessful adjustments. Table 8 shows the differences in the successful application of the training program and the effectiveness of feedback in reducing raters' severity, based on severity logit values during the three phases of the study. A pairwise comparison using a chi-square analysis revealed a considerable difference in successful severity reduction between the pre-training and the immediate post-training phase (χ²(1) = 32.59, p < 0.05) and between the pre-training and the delayed post-training phase (χ²(1) = 9.761, p < 0.05). However, no statistically significant difference was observed between the immediate post-training and the delayed post-training phase (χ²(1) = 1.408, p > 0.05).
Table 9 presents the same comparison for biasedness. The analysis is based on a comparison of the z-score values obtained from FACETS. The result is fairly similar to that of the severity analysis. A pairwise comparison using a chi-square analysis revealed a considerable difference in successful bias reduction between the pre-training and the immediate post-training phase (χ²(1) = 16.42, p < 0.05) and between the pre-training and the delayed post-training phase (χ²(1) = 4.97, p < 0.05). However, no statistically significant difference was observed between the immediate post-training and the delayed post-training phase (χ²(1) = 0.154, p > 0.05).
Table 10 displays the results of the consistency comparison across the three phases, based on infit mean square values. The result was similar to those reported in the two preceding tables. A chi-square analysis showed a significant difference in successful consistency achievement between the pre-training and the immediate post-training phase (χ²(1) = 23.14, p < 0.05) and between the pre-training and the delayed post-training phase (χ²(1) = 7.63, p < 0.05). However, no statistically significant difference was obtained between the immediate post-training and the delayed post-training phase (χ²(1) = 0.822, p > 0.05).
As indicated before, fit statistics are used to identify which raters tended to overfit the model (showing too much consistency) or underfit it (misfit, showing too much variation), and at the same time to identify which raters rated consistently with the rating model. Table 11 displays the frequencies and percentages of rater fit values falling within the overfit, acceptable, and underfit (misfit) categories.
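The significance statements for the chi-square comparisons above can be reproduced from the reported statistics alone. The Python sketch below assumes one degree of freedom and a 0.05 significance level, as stated in the text, and uses the identity that the upper-tail probability of a chi-square variable with one degree of freedom equals erfc(sqrt(x / 2)). The function and dictionary names are illustrative; the underlying contingency tables are not reproduced here.

```python
import math

ALPHA = 0.05  # significance level used in the text


def chi2_p_value_df1(statistic: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def is_significant(statistic: float, alpha: float = ALPHA) -> bool:
    """True when the p-value falls below the chosen significance level."""
    return chi2_p_value_df1(statistic) < alpha


# Chi-square statistics reported in the text, keyed by measure and comparison.
REPORTED = {
    "severity": {"pre vs immediate": 32.59, "pre vs delayed": 9.761, "immediate vs delayed": 1.408},
    "bias": {"pre vs immediate": 16.42, "pre vs delayed": 4.97, "immediate vs delayed": 0.154},
    "consistency": {"pre vs immediate": 23.14, "pre vs delayed": 7.63, "immediate vs delayed": 0.822},
}

for measure, comparisons in REPORTED.items():
    for comparison, statistic in comparisons.items():
        p_value = chi2_p_value_df1(statistic)
        verdict = "significant" if is_significant(statistic) else "not significant"
        print(f"{measure}, {comparison}: chi2 = {statistic}, p = {p_value:.4f} ({verdict})")
```

For all three measures, both comparisons against the pre-training phase come out significant, while the immediate versus delayed comparison does not, which matches the pattern reported above.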

Discussion
One finding of the study, in line with previous work (Bijani, 2010; Kim, 2011; Theobold, 2021; Weigle, 1998), was that rater training can not only make raters consistent in their own ratings (intra-rater reliability) but also increase consistency among raters (interrater reliability). It should, however, be noted that this finding contrasts with Davis (2019), Eckes (2008), and McNamara (1996), who found that rater training is beneficial only in promoting self-consistency, not inter-rater consistency. The discrepancy might be due to differences in sampling, oral tasks, or the scoring techniques used for measuring and analyzing the data. The findings of this study, first of all, revealed a wide variation in raters' behavior from before to after training: raters reduced their severity/leniency estimates to a large extent, which made them more similar to one another. This reduction in severity estimates was more noticeable for NEW raters. Although severity variation among raters was reduced after training, some significant severity differences remained. This rather abnormal behavior, even after training, is due to some extreme raters, namely OLD8, OLD4, and OLD7 (in severity) and OLD3, OLD9, and NEW6 (in leniency), who, owing to arrogance, overconfidence, or unwillingness to accept the effectiveness of the training program, did not change their behavior even after training; this ultimately caused significant overall variation among raters after training. In other words, the raters whose rating behavior improved very little or even worsened after the training program were those who were relatively less positive, or rather pessimistic, in their perceptions of the oral assessment rater training program.
However, it is important to note that even though a causal relationship between raters' attitudes and the rating outcome cannot be formulated, it is possible to assume that if training programs are in line with the expectations and requirements of raters, they will result in more promising outcomes which will automatically result in a higher consistency with the other raters and the benchmark as well. This indicates that although training has brought raters' extreme differences within the acceptable range of severity, it could not eradicate severity variation among them. This finding is parallel with that of Stahl and Lunz (1991, cited in Weigle, 1998) who found that training cannot eliminate severity differences among raters.
Second, the training program and feedback were successful in modifying raters' fit statistics, which indicate consistency among raters, after training. A considerable number of the raters who were identified as inconsistent before the training became consistent afterward. One rater (OLD8) was still identified as inconsistent after training. This might indicate that not all raters are suited to the rating job and thus, according to Winke et al. (2012) and Iannone et al. (2020), such raters should be excluded from it.
The outcomes indicated that the training program was successful in bringing the raters closer to one another in rating and in increasing their central tendency. The raters were also able to diminish their biases compared to the pre-training phase, most probably because they were provided with post-rating feedback in which their biases were specifically pointed out. This also confirmed the impact of rater training on the overall consistency of raters' scoring behavior. Another possible reason for the reduction of raters' biases is that the raters were given instructions with explicit and clear rating procedures, which is probably why little bias was observed after training. Previous literature is divided on this point. The reduction of raters' biases after the training program is consistent with Wigglesworth (1997), who reported a similar reduction in bias measures after training. On the other hand, Elder et al. (2007) found that training had a rather insignificant effect on reducing bias in raters' subsequent scoring behavior.
The drastic change in the rating behavior of some raters, including rater OLD7 (moving from extreme leniency to extreme severity), NEW8 (moving from extreme leniency to severity), and OLD3 (moving from severity to extreme leniency), might be due to overgeneralization of the feedback provided. Concerning raters' fit statistics, raters identified as misfitting can, according to Huang (1984, cited in Shohamy et al., 1992), be viewed as relatively inefficient and thus, like misfitting items on a test, as candidates for being discarded. Consequently, misfitting raters would ordinarily be removed from the study; however, for the sake of examining the effectiveness of the training program, they were kept in order to better observe their change of rating behavior throughout the study. This decision is also supported by Stahl and Lunz (1991, cited in Eckes, 2008), who stated that misfitting raters must be trained rather than excluded from the rating task.
Concerning the delayed post-training phase, although this study provided promising results for the long-lasting effects of the training program, it also showed traces of a gradual loss of consistency and an increase in biasedness. The outcomes showed that, with the passage of time, variation gradually increases and raters tend to return to the way they rated before; nevertheless, raters remained more consistent in rating than they were before training.

Conclusions
One of the major findings of this study concerned the extent to which the training program affected the severity and internal consistency of the raters as measured by FACETS. The comparison of pre- and post-training data demonstrated that training reduced differences in severity among raters, to a particularly high extent among NEW raters; i.e., most of the raters who were identified as inconsistent before the training were no longer inconsistent afterward. The second major finding indicated that NEW raters had a broader range of severity and inconsistency than OLD ones before training. However, this was not the case after training. NEW raters tended to show less severity and higher consistency than OLD ones after training. The third finding showed that there was less variance in test takers' scores after training than in the pre-training phase. Finally, the fourth finding showed that the training program helped raters understand the planned rating criteria and put them into practice, and helped them modify their expectations of test takers' features and performance ability and their demands of the oral tasks.
The major finding was that the training program decreased, yet did not eradicate, the variation in severity and consistency among raters. The comparison across raters demonstrated that NEW raters showed a greater degree of inconsistency than OLD ones before the training. However, this difference was reduced after training, to the point that NEW raters even became more consistent and less biased than OLD ones.
The outcomes of this study demonstrated that rating is still possible without training, but training is essential for reliable rating. The primary purpose of training is to help raters articulate and justify their scoring decisions so that ratings are reliable. Before training, raters differed strongly from one another in severity, bias, and consistency; following the training, however, their severity and bias diminished to a high extent, which increased the consistency of rating.

Implications of the study
Although rater training is a significant part of teacher education, it cannot make raters proficient on its own. Training raters to be consistent is typically a lengthy process, since raters may not be capable of applying the techniques and strategies from training to the real scoring setting. Besides, the impact of training might change by the time of the delayed results. Thus, the implication is that longitudinal rater training should be provided before any improvement in raters' scoring capability and rater variability is discussed.
The outcomes of the study showed that training led to higher degrees of interrater reliability and to diminished severity/leniency, biasedness, and inconsistency. However, training cannot make raters identical to one another in their rating behavior; it can merely bring about higher self-consistency (intra-rater consistency) among them.
In line with previous research, even though rater training could help raters achieve higher self-consistency (intra-rater reliability) and could increase interrater reliability accordingly, it cannot simply eradicate raters' differences related to their characteristics. That is, experienced raters, due to their idiosyncratic characteristics, did not benefit as much as inexperienced ones. Also, some degree of severity remained after training, which may have an impact on future interpretations and decisions. More training and individual feedback could reduce it further but not remove it entirely. The analysis of the raters' fit statistics demonstrated that raters are likely to increase their internal consistency in rating through training, feedback, and experience.
MFRM can point out sources of rater bias, thus making assessment fairer. It can reduce test takers' fear of being accepted or rejected based on factors that have nothing to do with their true ability. Besides, it can determine raters' bias, which is the extent to which raters interact with particular test versions or categories of the rating scale. The implication is that MFRM equips decision-makers with a tool to spot misfitting raters. Rating is typically an expensive and time-consuming activity in which misfitting raters can invalidate test outcomes, resulting in a huge loss. Therefore, although MFRM does not solve the problem by itself, it can help provide feedback that assists raters in applying ratings more consistently.
Concerning the rather marked variation between the immediate and delayed post-training phases, the study showed that the outcome of training might not endure for long afterward. The implication is that ongoing training is required throughout the rating period to let raters regain consistency.
This study showed that raters can rate reliably, regardless of their background or level of expertise. However, rating reliability can be enhanced through training programs. The substantial severity/leniency differences among raters, also found in some previous research (e.g., Bijani & Fahim, 2011; Eckes, 2008; Theobold, 2021), have an important consequence for decision-makers: in rater training, more attention should be dedicated to consistency within raters (intra-rater agreement) than to consistency between or among raters (interrater agreement). The fact that raters' consistency decreased and their bias and severity increased in the delayed post-training phase, compared to the immediate post-training phase, reflects the need for assessment organizations to constantly monitor raters' severity/leniency, bias, and consistency.

Suggestions for further research
This study focused only on raters' assessment of oral performance. Thus, further research could study other skills (e.g., writing) and investigate raters' scoring variability on those skills, including the facets used in this study. Besides, this study did not explore group oral assessment. Therefore, further studies could investigate the influence of the group oral-assessment technique on learners' performance quality and on raters' internal agreement in scoring. In addition, no investigation was done regarding the differences between native and non-native speaker raters. Consequently, future studies could investigate the differences in rating reliability and rating behavior between native speaker (NS) and non-native speaker (NNS) raters. Future studies could also investigate raters from language backgrounds other than Persian and how they rate test takers' oral performances. Further research is required to explore the impact of raters' and test takers' backgrounds and personalities (e.g., different first language backgrounds and language accents) on the consistency of raters' ratings.