Test review of Iranian university entrance exam: English Konkur examination

The present paper appraises a standardized test, the entrance exam of Iranian universities known as “Konkur,” which is administered annually as a means of gaining admission to higher education in Iran. This norm-referenced test is administered to students majoring in mathematics, experimental sciences, and humanities, whose scores, along with their weighted GPAs in the last 3 years of high school, are used as indicators of their rank. Based on the rank achieved, they find the opportunity to select a highly regarded university for their education. Given the importance of such a high-stakes test, which may bring about social and long-term consequences for the participants, the present paper evaluates the test and its psychometric aspects. It is apparent that the exam provides a limited situation for measuring the participants' “knowledge of language” rather than their “knowledge about language.” Therefore, the dimensionality and validity of the test are debatable. Thus, the present review characterizes the Konkur examination and discusses the room that remains, in aspects not yet addressed, for improving its quality.

sources of instruction in the school years in Iran. The test is considered to be a norm-referenced test that is designed to evaluate the students' ability. The examinees are mainly native speakers of Farsi who sit for the test only in the standard exam centers, within and outside the country, approved by the NOET.
At the time the test was first administered, the main focus of language instruction was on developing students' reading comprehension ability, because they were expected to cope with the demands of reading technical texts at university. Thus, the focus of the test was placed on reading comprehension, structure of the language, grammar, and vocabulary (Farhady, 1985). After 18 years, the structure of the test has remained almost intact. Recently, the course books have been reviewed and revised, and we believe that a new test format is required to assess the communicative ability of students rather than purely memorized grammar and vocabulary.
The test pamphlet is distributed among the participants in four codes (A, B, C, D). All pamphlets, regardless of code, present the same questions, the same sub-sections, and the same sequence of sub-sections; the only difference among the codes is the sequence of items or the sequence of the options for items from one person to another. That is, although the examinees answer the same questions, they receive the questions in different orders. The administration time is 105 minutes. All items are dichotomously scored. A correction for guessing is also applied, whereby three incorrect answers remove one correct answer. The test has 70 questions presented in sub-sections, namely, grammar (10 items), vocabulary (15 items), sentence structure (5 items), language functions (10 items), cloze test (15 items), and reading comprehension presented as three separate texts (15 items). Statistically speaking, the test content is not distributed equally among the different sections and skills; the distribution is as follows: about 27.15% for structure and grammar, 34.28% for vocabulary, and 38.57% for reading comprehension (Razmjo, 2006).
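The correction for guessing described above can be sketched as follows. This is only an illustration of the stated rule (three incorrect answers cancel one correct answer); the function names, the assumption that blank answers carry no penalty, and the conversion to a percentage of the 70 items are ours and are not taken from the official scoring documentation.

```python
from fractions import Fraction

TOTAL_ITEMS = 70                      # number of items stated in the text
PENALTY_PER_WRONG = Fraction(1, 3)    # three wrong answers remove one correct answer


def corrected_score(correct: int, wrong: int) -> Fraction:
    """Raw score after the guessing correction (blank items assumed unpenalized)."""
    if correct < 0 or wrong < 0 or correct + wrong > TOTAL_ITEMS:
        raise ValueError("correct and wrong must be non-negative and sum to at most 70")
    return correct - wrong * PENALTY_PER_WRONG


def percent_score(correct: int, wrong: int) -> float:
    """Corrected score as a percentage of the total number of items (hypothetical scale)."""
    return float(corrected_score(correct, wrong) / TOTAL_ITEMS * 100)


# Hypothetical examinee: 40 correct, 15 wrong, 15 left blank.
print(corrected_score(40, 15))   # 40 - 15/3 = 35
print(percent_score(40, 15))     # 35 / 70 = 50.0
```

Under this rule an examinee who answers every item wrongly obtains a negative corrected score, which is why leaving an item blank can be preferable to guessing.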

Test formats
Based on the content of the books presented to the students at high schools, the test is designed to accommodate six sections, as follows:

Grammar section
This section includes 10 questions on English grammar. The questions are presented as incomplete sentences that the students complete by selecting one of the options; the options may be phrases, words, prepositions, or verbs. There is no pre-determined rule for the sequence of the questions: they are sequenced randomly for each participant rather than in a fixed order that is the same for all participants. Sometimes two grammatical rules are combined in one question, which makes it very difficult for the students to detect the idea behind the question and find the correct answer.

Vocabulary
In this part, 15 questions are presented in the form of incomplete sentences. The students should select the option that best completes the meaning of the sentence. Among the options provided, the correct answer has been taught previously in the classroom context; the distractors, however, may be new words for the participants. The part of speech of the options may differ across questions, but an attempt is made to keep it the same among the options of each single item in order to reduce the possibility of random guessing on the part of the participants.

Sentence structure
In this part, there are 5 questions; each option presents a sentence, and the participants should select the option that contains no grammatical mistake, based on the stem of the question. The sentences are mainly long compound and complex sentences, and the mistake may appear in any component of the sentence.

Language functions
This part is composed of 10 questions based on several written conversations (usually no more than 3). The participants should complete the conversations with the best answers from the options. The correct answer should complete the language function that is being carried out between the two sides of the conversation.

Cloze test
In this part, participants should read a passage containing 15 blanks (which mainly occur at regular intervals, for instance every ten words) and select the option that best completes each sentence. Since the blanks are presented in one text, misunderstanding or failing to find the correct answer for one blank may mislead the students when they select the option for the following blanks.

Reading section
Each test has three reading comprehension texts whose length ranges from 350 to 500 words covering a wide range of topics such as academic, scientific, and social issues. For each text, there are 5 multiple choice items asking about the content of the text, meaning of the vocabulary, and sentence interpretations.

Reliability
Due to the importance of the test consequences, the Konkur examination constructors should try their best to meet all the necessary conditions for test reliability. The quality and number of the items indicate that the objectivity of measurement is taken seriously: a sufficient number of items (N = 70) are presented, all in multiple-choice format, and are assessed through machine scoring, which is a reliable scoring procedure (Roberts, Altenberg, & Hunter, 2020). We believe that the major concerns for reliability are the imbalanced number of items in each sub-section, the equal weight given to the selection of a wrong option in different sub-sections, and the interference of skills in sections such as the cloze test or grammar. The analyses that we ran on the internal consistency of the test show that the level of reliability is not equal among the different sub-sections, partly due to the unequal number of items included in them; it ranges from the lowest level, which belongs to the grammar section, to the highest level, which belongs to the sentence structure section.
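For readers who wish to run a comparable check, internal consistency for dichotomously scored items is often estimated with the Kuder-Richardson formula 20 (KR-20). The sketch below is a minimal illustration under that assumption; the text does not state which coefficient was used in our analyses, and the response matrix is a toy example rather than Konkur data.

```python
def kr20(responses: list[list[int]]) -> float:
    """KR-20 for a persons-by-items matrix of 0/1 scores.

    KR-20 = k / (k - 1) * (1 - sum(p_i * q_i) / var_total),
    using the population variance of the total scores.
    """
    n_persons = len(responses)
    k = len(responses[0])
    if k < 2 or n_persons < 2:
        raise ValueError("need at least two items and two persons")

    # Proportion of correct answers per item (p_i); q_i = 1 - p_i.
    p = [sum(person[i] for person in responses) / n_persons for i in range(k)]
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)

    # Population variance of the examinees' total scores.
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    return k / (k - 1) * (1 - sum_pq / var_total)


# Toy data: 4 examinees, 3 items (not real test data).
toy = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
print(round(kr20(toy), 2))  # 0.75
```

Applying such a function separately to the items of each sub-section would give one coefficient per sub-section, which is the kind of comparison referred to above.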

Generalizability and dependability of findings
Another concern in terms of reliability is the dependability of the findings and the generalizability of the results. How can we be sure that the outcome of the test reflects the real performance of the participants? Conceptually, dependability refers to how well the results of a test show the intended level of the construct we want to measure. The use of neutral texts and sentences in this test shows that the developers were aware of the issue and tried to prevent any potential bias in the functioning of the items. Regarding the generalizability of the results, we should be aware that the participants were all Iranian students and the content was taken from high-school books; therefore, the findings can be generalized to similar contexts within the country of administration rather than to an international level. Khodi (2020) ran a generalizability analysis on a sample of 5000 examinees and reported that 86% of the total variance can be explained by individuals, which indicates a high degree of reliability of the test. Since the participants differed in major as well as in gender, the study also examined the potential contribution of major to the performance of students. It is reported that the interaction of individuals' fields of study and the overlapping questions in the test sections caused an error of about 1.5%. This shows that the national entrance examination does not have a bias against any group of participants with different educational backgrounds.
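Percentages such as those reported above are obtained by dividing each estimated variance component by the sum of all components. The following sketch shows only that arithmetic; the component names and values are hypothetical placeholders chosen for illustration and are not the estimates reported by Khodi (2020).

```python
def variance_shares(components: dict[str, float]) -> dict[str, float]:
    """Express each variance component as a percentage of the total variance."""
    if any(value < 0 for value in components.values()):
        raise ValueError("variance components must be non-negative")
    total = sum(components.values())
    if total == 0:
        raise ValueError("total variance is zero")
    return {name: value / total * 100 for name, value in components.items()}


# Hypothetical variance components (NOT the values from the cited study).
example = {"persons": 6.0, "items": 1.0, "residual": 3.0}
for name, share in variance_shares(example).items():
    print(f"{name}: {share:.1f}% of total variance")
```

A large share for the person component, relative to the facets treated as error, is what supports the interpretation of high reliability given in the text.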

Validity
As an academic test that is designed to assess the English level of test takers, it should possess certain qualities, the most important of which is believed to be validity. Based on Messick (1989) and Bachman (1990), validity accommodates a wide range of concepts, including construct validity, content representativeness, and criterion-related validity. For the present test, validity means measuring what the test is supposed to measure, although we believe that the social aspect of validity (Chalhoub, 2016) should be added to this old definition. We believe that no test can be considered valid outside of the specific use and context it is designed for (Messick, 1989). Thus, in what follows, some points related to the validity of the test are mentioned and explained.
The nature and structure of the questions in the test show that, in spite of the construct definition of language proficiency, the Konkur designers found it difficult to operationalize it fully, owing to the constraints and considerations that other test qualities impose in practice. For instance, speaking, writing, and listening were not accommodated in the Konkur examination on account of the vast regional differences among participants in access to proper instruction. The exclusion of these skills is also due to the subjectivity in scoring productive skills, which would pose some concerns in the matter of validity and reliability.
Overall, we believe that the construct of academic English was operationalized as the reading, grammar, and vocabulary skills that are critical to the success of a first-year student at university; alternatively, it may be the need for such skills at universities that has led to a test with such a format. It seems that the Konkur examination does not enjoy full construct validity, as there is a wide gap between the intended curriculum and the test.

Factor structure of the test and test dimensionality
It is not clearly stated whether the construct of measurement, that is, language proficiency, is defined as a unidimensional or a multidimensional construct. The form and content of the test suggest that there are several dimensions in the test, but the summative scoring procedure shows that no weighted score is dedicated to these dimensions and all are taken into account equally. Even the difficulty of the items does not contribute to the calculation of the participants' final score. This means that answering a very difficult question brings the same score as answering an easy question. We suggest that, for such a high-stakes test with major social and life-long consequences, weighted scores and item difficulty levels be applied in the scoring procedure, because this would increase the quality and dependability of the results. In addition to what we have stated, the scoring procedure raises another major concern. Although a wide range of item response theory models, such as bifactor, higher-order, or unidimensional models, can be taken as the basic framework of analysis, we could unfortunately find no sign of these models being used in the scoring procedure of the test or in the validation of its findings. In an independent study, we compared these models and checked whether the nature of language in this test is multidimensional or unidimensional. We found that the factor structure of language proficiency is best explained by the testlet model rather than by the bifactor model (Alavi, Karami, & Khodi, in press).
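The suggestion to weight items can be illustrated with a simple weighted proportion score. The sketch below is one possible scheme among many, shown under our own assumptions: the weights are hypothetical (for example, larger for harder items or for selected sections) and do not represent any scoring rule used or proposed officially for the Konkur.

```python
def weighted_score(responses: list[int], weights: list[float]) -> float:
    """Weighted proportion score: sum of the weights of correctly answered
    items divided by the sum of all weights (responses are 0/1)."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * r for w, r in zip(weights, responses)) / total_weight


# Hypothetical weights: items 3 and 4 are treated as twice as difficult as items 1 and 2.
weights = [1.0, 1.0, 2.0, 2.0]
examinee_a = [1, 1, 0, 0]   # answers only the two easy items correctly
examinee_b = [0, 0, 1, 1]   # answers only the two hard items correctly

print(sum(examinee_a), sum(examinee_b))        # same raw score: 2 and 2
print(weighted_score(examinee_a, weights))     # 2 / 6
print(weighted_score(examinee_b, weights))     # 4 / 6
```

Under sum scoring the two hypothetical examinees are indistinguishable, whereas the weighted score separates them, which is the point of the recommendation above.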

Impact and washback
It is believed that "testing is never a neutral process and always has consequences" (Stobart, 2003, p. 140). Evaluation of washback is complex and multi-dimensional, since washback does not exist naturally and is taken as the aftereffect of the contribution of teachers, educators, or other factors to the test-taking procedure (Alderson & Wall, 1993; Bailey, 1996; Cheng & Falvey, 2000; Spratt, 2005). For the present test in particular, the washback effect occurs because its structure is in practice a centralized, measurement-driven system whose orientation is bound to teacher-dominated classes, textbooks, and testing impact (Ghorbani & Neissari, 2015). Although washback is emphasized mainly in connection with communicative-oriented methods, and it is stressed that tests should explicitly be designed to bring positive washback (Cheng, Watanabe, & Curtis, 2004), in Iran it apparently occurs for a reading-oriented test. Therefore, very few Iranian students finish high school with the ability to speak English effectively in spite of mastering the prescribed textbooks (Farhady, Jafarpoor, & Birjandi, 1994), and English instruction in most Iranian academic situations seems to be ineffective and impractical (Hosseini, 2007).
In Iran, it was found that the UEE negatively and implicitly influences English teachers' instruction of the content and format of the test (Salehi & Yunus, 2012) and that, given the UEE format and importance, students potentially spend more time on grammatical structures, vocabulary, and reading than on writing, pronunciation, speaking, and listening exercises (Farhady et al., 1994; Ghorbani, 2012). The ultimate objectives of the EFL program remain to be outlined, which has led to different repercussions over the various periods of the educational program, including assessment programs. The large-scale, high-stakes test named Konkur functions like an agreement among the instructors in deciding on the material of instruction, and, through its negative impact known as washback, the implementation of the actual curriculum fails (Jahangard, 2007). This exam is extremely important not only to students and their parents but also to the larger society, and the whole society is affected by its impacts. For instance, traffic restrictions are changed on the administration day, and many parents wait outside the administration centers until the end of the exam. Reflecting this importance, many non-governmental institutes have started to offer simulated exams, supplementary classes, and books for students to achieve the utmost skill in taking the test, a procedure that starts almost 3 years before the test. Teachers are also influenced by the Konkur and try to adapt their instruction to the hidden curriculum and to students' needs and even preferences.

Clarity of the test
The information distributed by the test administration board aims at providing transparency about the test configuration, test time, and format. In the particular case of the Konkur examination, such information is always published, disambiguated, and clarified. It is believed that the content presented in the textbooks suffices for the needs of only some students; for achieving higher ranks in the Konkur and having a good command of English, supplementary sources are needed. These sources differ fundamentally among the students of different schools and regions, and this hidden syllabus is never specified and there is no consensus about it (Salehi & Zamanian, 2012).

Conclusion
The UEE offers a well-situated presentation of the content instructed and addresses the curriculum in a comprehensive manner. Although the washback looks somewhat destructive, it could be re-established with regard to the new course books and the priorities that have been newly accepted and changed. Challenges inevitably remain: the overlooking of listening, writing, and speaking could possibly be resolved through the addition of new content to the instructional curriculum and a new test format. Construct-irrelevant variance may also exist, but it could be suppressed by applying weighted scores and adding new variables such as students' educational background. The remaining concerns should be resolved once a consensus is achieved about the definition of the construct that is assessed in the test, and new technological tools could be used in the assessment procedure. One major reason that some participants fail in the exam is the incompatibility between what the items are meant to assess and the item format, for instance, assessing the communicative ability of the participants through filling in the blanks of a conversation. The overall analysis of the UEE shows that, although the context of the test and the content are in accordance, the test in its present nature is not a good indicator of participants' real language ability because some important language skills are missing from it. Given the social and lifetime consequences of the test in Iranian society, it is to be hoped that an adequate number of research studies will evaluate the UEE and provide suggestions for its betterment.