Fairness in classroom assessment: development and validation of a questionnaire

Although fairness in assessment practices has gained noticeable attention over the recent years, there has been a long-lasting study to design and validate a questionnaire to measure it from a psychometric perspective. Thus, this study aims to develop and validate a questionnaire with adequate psychometric properties to measure fairness in classroom assessment. Using a random sampling method, two samples of male and female university students for the first pilot (n = 128) and the second pilot (n = 360) were selected from Ayatollah Borujerdi University and Lorestan University. Drawing on the past literature, a pool of items (n = 118) were extracted and subjected to a 12-step systematic procedure, including content analysis and sampling; creating an item bank; running the first pilot; creating item pool one; expert judgment to evaluate the sub-scales; running an interview and think-aloud protocol; running Cronbach’s alpha; running the second pilot; running exploratory factor analysis, confirmatory factor analysis, and Cronbach’s alpha; creating item pool two; expert review; and translation and translation quality check. Findings yielded a 110-item questionnaire with 10 sub-scales: learning materials and practices (18 items); test design (24 items); opportunities to demonstrate learning (8 items); test administration (21 items); grading (11 items); offering feedback (6 items); tests results interpretation (5 items); decisions based on tests results (3 items); test results consequences (4 items); and students’ fairness-related beliefs and attitudes (10 items). The hope is that this questionnaire can serve research and educational purposes.

Different scholars have tried to illuminate the basic features of fair APs. For example, Peters et al. (2017) consider APs as fair if (a) they are not used as a mechanism for classification but as a diagnostic tool, (b) they are used to improve student learning not as an external tool to measure students' performance, and (c) they are used to even out the overall students' evaluation, not as a punishment tool for students who do not meet the intended requirements. Additionally, Pettifor and Saklofske (2012) note that one of the best ways to transfer educational APs to fair practices is by making test-takers familiar with evaluation criteria. They add that those evaluation criteria should be codefined jointly by test-makers and test-takers. Moreover, Stobart (2005) maintains that to achieve fairness in APs, test-makers should make sure that there is no bias against test-takers regarding their gender, ethnicity, nationality, and socioeconomic status. Likewise, for Kyaruzi et al. (2018), APs are fair when they are tailored to test-takers' needs and characteristics.
In the past literature, a range of studies has been conducted to verify the fundamental features of fair APs from teachers' and students' perspectives. In an early attempt, Green et al. (2007) examined teachers' perceptions about fair challenges in summative tests. Their findings documented that confidentiality, communication about grading, and multiple assessment opportunities received the highest value from the participants' perspectives. Moreover, in another study, Tierney et al. (2011) examined how Canadian teachers assessed their students. Their findings evidenced that (a) teachers should take into account their students' progress during the course; (b) skills related to products than procedures should be given attention; (c) professional judgment along with standards-based grades should be used by teachers to assess their students' learning; and (d) teachers should provide students with enough feedback about their performance and grades. Additionally, in research by Segers and Tillema (2011), teachers' conceptions about fair APs in the Netherlands were investigated. Their results documented that APs are considered as fair if they met some criteria: being useful for student learning, being beneficial to demonstrate what students have learned, being interesting for students, being helpful to create a collaborative climate in the classroom, and serving to exert accountability. Furthermore, Tierney (2014) carried out a multi-case study to re-conceptualize fair assessment from the Canadian primary and secondary teachers' perspectives. His results documented that the participants perceived APs as fair if they equitable for all testtakers, offer multiple learning opportunities to all test-takers, transparent, create a trustful environment in the classroom promoting critical reflection, and avoid an equal assessment for all test-takers. Likewise, Scott et al. (2014) did mixed-methods research to disclose the Canadian teachers' perceptions about fairness in CA. Their findings revealed that fair assessment is more connected to the notion of equity. Moreover, their results showed that APs are perceived as fair if they meet five criteria: (a) test makers have a clear understanding of the effects of tests on test-takers and their families; (b) tests are designed and administered based on test-takers' needs and characteristics (e.g., ability level, gender, socioeconomic status, culture, and language); (c) all testing stakeholders have the right to express their voices and concerns about assessment malpractices; (d) test-takers and their families are not overwhelmed by the frequency, intensity, and intrusion of APs; and (e) APs are not used as instruments to punish or reward test-takers. Finally, Murillo and Hidalgo (2020) conducted a phenomenographic study to disclose fairness in APs from teachers' conceptions in Spain. Their findings indicated that the participants' conceptions of fair assessment were closely related to the principle of equality and equity. Additionally, their findings unveiled that the participants' perceptions of fair assessment were influenced by the school context.
University students' perceptions about fair assessment were examined at Southwestern University by Pepper and Pathak (2008). Their findings indicated that the participants perceived APs as fair if there was explicitness in assessment administration and grading criteria, frequent feedback, and proactivity in the assessment process. Further, in research by Murillo and Hidalgo (2017), primary and secondary school students' perceptions about fair APs were explored in Spain. They found that, on the one hand, the participants' perceptions about fair assessment were associated with equality, objectivity, transparency, and evaluation of class content. On the other hand, the participants' perceptions about fair assessment were related to equity which included some ideas, such as diversification of tests, adaptation, and qualitative assessment. Likewise, in another study, Wallace (2018) explored Taiwanese university L2 learners' (n = 83) perceptions about the fairness of a single test administration. The participants reported that the test administration had a high level of interactional fairness and a high level of procedural fairness. However, the level of distributive fairness was moderate. These findings mean that for the participants, interactions with their teachers were respectful and testing procedures were followed equally for all test-takers. Still, the test scores did not represent their performance adequately. Moreover, Rasooli et al. (2019) tried to conceptualize fair assessment from Iranian university teachers' perceptions. Their findings evidenced that the participants' perceptions of assessment fairness included three principles: distributive justice, procedural justice, and interactional justice. In addition, their participants perceived the procedures for outcome distributions, the communication procedures, and the interpersonal relationships as crucial in the conceptualization of assessment fairness. Likewise, in a systematic meta-ethnographic study, Rasooli et al. (2018) tried to present a comprehensive conceptualization of assessment fairness in the classroom with a dominant focus on how fair APs affect student learning. They found that APs are perceived as fair if (a) students have enough opportunities for learning and enough opportunities for demonstrating learning; (b) there is transparency, consistency, and justification in APs; (c) there are suitable accommodations; (d) APs follow the 'do no harm principle' and classroom environment is constructive; (e) there is no score pollution; and (d) students have opportunities to do group work and assess their peers' performance. Finally, Bazvand and Rasooli (2022) explored Iranian postgraduate university students' perceptions of fairness in classroom assessment within the higher education context. They found that the participants' perceptions of fairness had been affected by 'equity principle' and 'interactional fairness principle' .

Fair assessment models
In the past literature, some models have been presented to conceptualize fair assessment. Here, we review critically three influential ones. One of the first comprehensive models to illuminate the concept of fairness was presented by Kunnan (2004). This model consists of five features: validity, absence of bias, access, administration, and social consequences. The feature of validity means that the required evidence of 'content representativeness or coverage evidence' , 'construct or theory-based validity evidence' , 'criterion-related validity evidence' , and 'reliability' is collected. Content representativeness or coverage evidence means that testing practices represent test domain adequately. Construct or theory-based validity evidence suggests that testing practices represent the test domain adequately. Criterion-related validity evidence means that "the test scores under consideration meet criterion variables such as school or college grades and on the job-ratings, or some other relevant variable" (Kunnan, 2004, p. 37). Reliability indicates that test results are consistent in terms of stability (e.g., test scores' consistency on different testing occasions), alternative form evidence (e.g., test scores' consistency between two or more different forms of a test), inter-rater evidence (e.g., test scores' consistency between two or more raters), and internal consistency evidence (e.g., "in the way test items measure a construct function" (Kunnan, 2004, p. 37). The feature of the absence of bias means that the required evidence of 'offensive content or language' , 'unfair penalization based on test taker's background' , and 'disparate impact and standard setting' is gathered. Offensive content or language means that the content of tests is not offensive for test-takers with different backgrounds (e.g., gender, religion, age, first language and culture, and nationality). Unfair penalization based on testtaker's background means that the content of tests does not cause unfair penalization due to the membership of a test-taker to a particular group or community. Disparate impact and standard setting suggest that there is no bias against a group of test-takers in terms of different performance and outcomes. The feature of access means that the needed evidence for 'educational access' , 'financial access' , 'geographical access' , 'personal access' , and 'conditions or equipment access' is collected. The educational access means that all test-takers have equal opportunities to learn the content and they have equal opportunities to become familiar with testing practices. The financial access means that all test-takers afford to pay for tests' expenses. The geographical access means that all test-takers have easy access to test sites. Personal access means that test accommodations are appropriate for all test-takers even those with physical and learning disabilities. Conditions or equipment access means that "takers are familiar with the test-taking equipment (such as computers), procedures (such as reading a map), and conditions (such as using planning time)" (Kunnan, 2004, p. 38). The administration feature implies that the required evidence of 'physical conditions' and 'uniformity or consistency' is gathered. The physical conditions suggest that test administration conditions and facilities (e.g., light, temperature level, chair) are appropriate. The consistency means that test administration conditions are consistent across tests sites. However, the uniformity infers that all test-takers take tests under the same conditions. The social consequences feature implies that the required evidence of 'washback' and 'remedies' is collected. The washback treats test effects on instructional practices (e.g., educational materials, ways of teaching, ways of learning, and test-taking strategies). The remedies refer to "remedies offered to test takers to reverse the detrimental consequences of a test, such as re-scoring and re-evaluation of test responses, and legal remedies for highstakes tests (Kunnan, 2004, p. 39).
The second model is the Assessment Use Argument (AUA), presented by Bachman and Palmer (2010). AUA includes four claims, namely assessment records, interpretations, decisions, and consequences. For each claim, they offer one or more assumptions requiring theoretical and empirical support to establish a compelling validity argument. They argue if interpretations are meaningful, impartial, generalizable, relevant, and sufficient; decisions are values sensitive and equitable; consequences are beneficial; and assessment records are consistent. Under AUA, test results interpretation and use are valid if they are stated clearly and are supported by strong evidence. In simple terms, to evaluate the validity of test results interpretations and uses, there is a need for the completeness and coherence of a network of inferences and assumptions (or an argument) (Kane & Burns, 2013). It should be stressed that the researcher used the above-alluded studies and models to extract the sub-scales and items.
As this review demonstrates, while the above-alluded studies and models have been noticeable attempts to present a comprehensive definition of fair assessment construct, none of them have purported to develop and validate a psychometrically sound questionnaire to measure fairness in CA. Therefore, the present study is the first attempt to develop and validate a questionnaire with sound psychometric properties to gauge fairness in CA.

Method
As pointed out above, this study aims to develop and validate a questionnaire to measure fairness in CA. The researcher went through a systematic, 12-step design, and validation procedure for the development of an assessment fairness questionnaire (AFQ). The primary purpose was to produce a psychometrically sound questionnaire by ensuring that the reliability and validity criteria were met well. This systematic procedure was based on the practices recommended by leading scholars in social sciences (Artino Jr et al., 2014;Dörnyei, 2003) and followed by Salehi and Jafari (2015). Table 1 presents the steps taken to design and validate AFQ. Each of the steps is detailed below.
The first step was content analysis and sampling. In line with Clément et al. (1994), the past literature, including definitions, models, and instruments was meticulously inspected to extract and verify the most frequent and relevant components of assessment fairness. The analysis yielded 10 overarching sub-scales: (1) learning materials and practices; (2) test design; (3) opportunities to demonstrate learning; (4) test administration; (5) grading; (6) offering feedback; (7) tests results interpretation; (8) decisions based on tests results; (9) test results consequences; and (10) students' fairness-related beliefs and attitudes. As teaching and testing are complementary, the emerged sub-scales included both teaching and testing processes. The teaching sub-scale covered opportunities that test-takers may have to learn educational materials. The testing sub-scales comprised the steps that should be taken to implement quality APs in the classroom.
The second step was creating an item bank. To meet the current study's goals, over 203 items were collected to make an item bank. For this purpose, the researcher went through the available literature meticulously to extract and verify the items related to the sub-scales of assessment fairness. The items were designed based on three dimensions of fairness: distributive justice, procedural justice, and interactional justice (Green et al., 2007;Greenberg, 1993;Kunnan, 2000Kunnan, , 2004Rasooli et al., 2018Rasooli et al., , 2019Tierney, 2014;Wallace, 2018). Distributive justice examines if outcomes are distributed based on three principles: 'equity' , 'equality' , and 'need' . The principle of equity examines if the comparison of the ratio of outcome distribution of a test-taker is same with that of a similar test-taker. The principle of equality inspects if the outcomes are equally distributed among test-takers. The principle of need probes if the outcomes are distributed based on test-takers' needs. Procedural justice treats the fairness of the procedures for outcome distributions based on five principles: consistency, bias suppression, accuracy, correctability, voice, and ethicality.  Table 1 Twelve-step questionnaire development and validation procedures Step 1 Content analysis and sampling Step 2 Creating an item bank Step 3 Running the first pilot Step 4 Creating item pool one Step 5 Running expert judgment to evaluate the sub-scales Step 6 Running an interview and think-aloud protocol Step 7 Running Cronbach's alpha Step 8 Running the second pilot Step 9 Running exploratory factor analysis, confirmatory factor analysis, and Cronbach's alpha Step 10 Creating item pool two Step 11 Running expert reviews Step 12 Running translation and translation quality check outcomes and procedures is logical. In order to develop quality items, the researcher constantly reviewed and contrasted the items of the item bank with AFQ. It should be noted that the researcher took no item of the item pool directly (constructed in the succeeding stages) from this item bank. The major role of this item bank was to increase the quality of the items in the item pools. The third step was creating the item pool one. The researcher developed the first version of the items. Due to the additive nature of Likert scale self-analysis questionnaires, the number of items assigned to each sub-scale is of paramount importance. The reason for this is reliability where the number of items for each sub-scale should not be less than four items (Dörnyei, 2003). The researcher wrote the items in Persian and English concurrently. Considering the items in the item pool one, the researcher wrote the first draft of AFQ with 118 items. It should be noted that the researcher constantly read, edited, and revised the draft. Next, the items written in natural English were simplified lexically. As the intended respondents were supposed to be students with different English proficiency levels, simplifying the items would increase the readability of AFQ for a large scale of respondents. In turn, as Radhakrishna (2007) stresses, this led to the increase of AFQ reliability and validity. It should be highlighted that the researcher used the Corpus of Contemporary American English (COCA) constantly to examine the simplicity of the words of the item pool. In particular, the researcher chose the words that were among the first 2000 words of the corpus in terms of frequency.
The fourth step was running expert judgment to evaluate the sub-scales. The researcher used expert judgment to increase the content validity of AFQ and make sure if the sub-scales emerged from the analysis of the literature could be included in the item pool. Using a sub-scale evaluation checklist, the researcher invited four university professors specialized in applied linguistics at Lorestan University to judge if the sub-scales were necessary for AFQ. It should be noted that expert judgment was used as a pre-testing method before running the psychometric validity procedures. As Olson (2010) notes, expert judgment can be used to "discern questions that manifest data quality problems" (p. 295). The sub-scales evaluation checklist consisted of 10 items. For example, the first item asked, "Do you agree that providing opportunities to demonstrate learning is an essential component for assessment fairness in the classroom? If yes, how much? The Likert type used for the evaluation of sub-scales checklist was a three-point Likert-scale. It included three items, namely not essential, somehow essential, and very essential. Then, the experts were asked to provide their reasons for their choice. The experts' judgment revealed that all the sub-scales were needed, and therefore, no subscale was deleted.
The fifth step was running an interview and think-aloud protocol. The researcher used a verbal report protocol to interview 15 students from the target samples. The reason was to identify the vague items and make sure the response validity of the items. The participants were invited to report on the items lacking the required readability. During the online interviews running on Adobe connect platform, the participants carefully went over the items one by one, thought aloud, and expressed their views and feelings. The researcher asked if the items were clear enough to the participants. If the participants expressed a problem with the items, they were stopped and asked for the defective aspects of the items. It is worthy to note that the Persian version of the items was used and the interviews were run in Persian so that the interviewees can express their ideas and conceptions with ease.
The sixth step was running the first pilot. To pilot AFQ for the first time, a total of 128 university students were selected using a random sampling method at Ayatollah Borujerdi University and Lorestan University, Iran. The underlying reason for selecting the participants was their easy availability and their great deal of experience with testtaking. They included males (n = 82) and females (n = 38) and their ages ranged from 19 to 45. They were B.A. (n = 105) and M.A. (n = 23) undergraduate students who majored in English literature, applied linguistics, and linguistics. To access the participants, the first researcher referred to the Deputy of Education of Ayatollah Borujerdi University and Lorestan University and explained the present study's objectives. Both Deputies of Education allowed the researcher to meet the department heads of English language and literature. Having described in detail the present study's objectives, the department heads permitted the researcher and colleagues to run the study with the cooperation of their students. Since the present study was conducted during the outbreak of the COVID-19, the students were not present on the campus and, therefore, the researcher could not meet them in person. The researcher got the students' phone numbers and sent a podcast voice to them via WhatsApp. During the COVID-19 pandemic, all university students installed WhatsApp on their phones to be in touch with their university teachers, university officials, and classmates. The podcast voice explained the current study's objectives and asked if they agree to fill in the questionnaire. A total of 128 students agreed willingly to fill in the questionnaire. Then, the researcher sent a digital format of AFQ to them. It should be noted that AFQ started with digital written consent (in Persian) and if the participants agreed with its content, they could move on to the next stage to fill in the questionnaire. During answering the items, the participants could contact the researcher to raise their problems with the items.
The seventh step was running Cronbach's alpha. In line with Dörnyei (2003), in the pilot phase, the internal consistency indexes were used to reduce the problematic items of the item pool one. For this purpose, the researcher used Cronbach's alpha to delete the problematic items. Based on the results, the items whose Cronbach's alpha was less than 0.70 (8 items) were deleted. In total, the Cronbach's alpha of 110 items was larger than 0.80.
The eighth step was running the second pilot. According to the results of the first expert judgments and Cronbach's alpha, 10 sub-scales with 110 items were verified. Hence, the second item pool included 110 items. In this step, AFQ was distributed among 360 university students selected the through random sampling method. They included male (n = 245) and female (n = 115) students and their ages ranged from 19 to 47. They were B.A. (n = 270) and M.A. (n = 90) undergraduate students majoring in English literature, applied linguistics, and linguistics. The procedures explained in the sixth step were followed to achieve the participants in the second pilot two. It should be noted that the participants were ensured that their responses would remain confidential and they would be kept informed about the final findings.
The ninth step was running exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and Cronbach's alpha. According to Riazi (2016), EFA is a statistical test used to disclose the underlying theoretical foundations of a topic by reducing data to a smaller set of variables. However, CFA is a statistical test used to verify the factor structure of a set of observable variables. The researcher subjected the 110 items of AFQ to an EFA to explore its factorial structure. Afterward, the researcher run a CFA to check if the factors emerged in the EFA were confirmed. Next, the researcher examined the internal consistency of items using Cronbach's alpha. The primary purpose was to identify and delete the defective items creating factor pollution and alpha reduction from item pool two.
The tenth step was creating the item pool two. To make it, the researcher made some modifications. He rewrote and replaced the items deleted in the statistical analysis section above with new items. In line with the current study's aims, the researcher simplified some items more in terms of grammar and lexicon to improve their readability. He chose simple tenses, changed the sentences' voices at times, and reversed and paraphrased some items to become simpler.
The eleventh step was running expert reviews. Two associate professors in Applied Linguistics at Tehran University were invited to review and comment on the items. The researcher referred to their offices and asked them kindly to examine the items based on six criteria: double-barreled, vague, unrepresentative, hard, sensitive, and burdensome (Dörnyei, 2003). In light of the professors' comments, some minor modifications were made in regard to the language of the items. However, no item was deleted from the final version of AFQ.
The last step was running translation and translation quality check. As pointed out above, the researcher wrote the items in English and Persian from the beginning and changed both equivalents of the items concurrently. The reason for this was making parallel versions of AFQ. As mentioned above, in the first and second pilot phases, the Persian version of the items was used. There were some reasons for this. First, the participants had different language proficiency levels. Second, misunderstanding of the items may have jeopardized the reliability and validity of the participants' responses. Third, responding to the questionnaire in Persian was naturally less anxiety-provoking for the participants. Fourth, answering the questionnaire in Persian was less time-consuming for the participants. It should be stressed that the researcher checked the quality of the translation in two ways. In the first way, they invited 15 students to report on the clarity of the translations and highlight the vague words, phrases, and sentences. In the second way, the researcher invited two experts in translation to check the clarity of the translations and the equivalence between the Persian and English items. Based on their comments, some modifications were made to the defective translations.

Results and discussion
To achieve the intended aims, the researcher used Cronbach's alpha, EFA, and CFA. The researcher used EFA and CFA in the second pilot. He used SPSS version 22 to run EFA analyses and he used the analysis of moment structures (AMOS) V. 21 program to run CFA. The reason for using AMOS was that it supports SEM (structural equation modeling) and has a graphical interface and is diagram-based (Kline, 1998). The results of internal consistency reliability for the pilot one was 0.72 (Table 2).
The researcher subjected the 110 items of the questionnaire to EFA with oblique rotation (direct oblimin) to explore the factorial structure of AFQ in the sample. The sampling adequacy for the analysis, KMO = 0.92 ('marvelous' according to Kaiser & Rice, 1974) was verified by the Kaiser-Meyer-Olkin measure. Bartlett's test of sphericity was χ 2 (5995) = 39803.55, p < .05, indicating that the correlation structure is adequate for factor analyses. An initial analysis was run to obtain eigenvalues for each factor in the data. Ten factors had eigenvalues over Kaiser's criterion of 1 and in combination explained 73.26 % of the variance. We retained 10 factors because of the large sample size and the convergence of the screen plot and Kaiser's criterion on this value. Table (rotated component matrix) shows the factor loadings after rotation. The table shows the results of the final version of TF. As Field (2013) asserts, since no item had loadings below 0.4, none of the items were deleted from the final version. The items clustering on the same factor suggest that factor 1 represents 'test design' , factor 2 represents 'test administration' , factor 3 represents 'learning materials and practices' , factor 4 represents 'grading' , factor 5 represents 'students' fairness related beliefs and attitudes' , factor 6 represents 'opportunities to demonstrate learning' , factor 7 represents 'offering feedback' , factor 8 represents 'tests results interpretation' , factor 9 represents 'tests results consequences' , and factor 10 represents 'decisions based on tests results' (Table 3).
The researcher subjected the 10-factor model which emerged from EFA to CFA using AMOS (see Fig. 1). In the present study, the researcher used χ 2 /df (chi-square divided by degree of freedom), goodness of fit index (GFI), root mean square error of approximation (RMSEA), normed fit index (NFI), Tucker and Lewis index (TLI), and comparative fit index (CFI). According to MacCallum et al. (1996), a fit model is acceptable if χ 2 / df is less than 3, GFI, NFI, TLI, and CFI are above 0.90, and RMSEA is less than 0.08. As reported in Table 4, the results of CFA showed that all goodness-of-fit indices were above the cutoff points. Hence, the factorial structure of AF was confirmed by CFA.
The results of the internal consistency of the 10 sub-scales of AFQ are presented in Tables 4 and 5.
As it can be observed, the sub-scales along with the whole questionnaire gained acceptable indexes of Cronbach's alpha: learning materials and practices (0.97), test design (0.98), opportunities to demonstrate learning (0.94), test administration (0.97), grading (0.95), offering feedback (0.93), tests results interpretation (0.92), decisions based on tests results (0.86), test results consequences (0.91), and students' fairnessrelated beliefs and attitudes (0.96). The internal consistency of the whole questionnaire is 0.91 which suggests that AFQ is highly reliable with this sample (Appendix).

Conclusion
The present study purported to develop and validate a questionnaire to measure fairness in CA within the context of Iranian higher education. Having followed a 12-step systematic procedure, a questionnaire with 10 sub-scales was developed. The construct validity of the questionnaire was supported by CFA and EFA and its reliability was confirmed by Cronbach's alpha. The hope is that the present questionnaire can be used as a useful instrument to measure the fairness of APs administered in the classroom. This questionnaire is likely to be used for diagnostic purposes by teachers to gauge how much fairness is met in their APs. Additionally, this questionnaire can be used for research purposes to investigate the correlation between fairness in APs and other variables impacting students' learning. For example, future studies can explore any significant correlation between fairness in APs and students' motivation to continue learning. Moreover, the 10 sub-scales that emerged from the data are likely to be considered as a new model of fairness in CA in the literature. Although the present study was an early systematic effort to develop and validate an AFQ, there were two limitations with it that should be acknowledged. The first limitation was related to the sample of the participants. It was limited to two state universities in Iran. Further studies are needed to examine the reliability and validity of the present questionnaire by including other samples of participants (e.g., teachers and school students) in other settings (e.g., schools). The second limitation was germane to the Likert-scale used in the current study. It had only five points (from strongly disagree (1) to strongly agree (5)). By using larger scales, for example with seven to eleven points, researchers are likely to get a better understanding of respondents' response variance on individual items and, in turn, improve the reliability of the subscales. We hope that the implementation of such revisions leads to a more validated questionnaire.

Tests results interpretation
1 My grades are not interpreted clearly (The criteria to interpret my grades are not specified).
2 My grades are interpreted consistently (For example, if my grade is lower than ten, it always means that I have to retake the course).
3 Teachers express their reasons for my grades interpretations.

Author's contributions
The author(s) read and approved the final manuscript.

Funding
This study was not supported by any funds of any sorts.

Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.