The impact of 24-h take-home exam on language learning and teaching on the China campus of a British university

Take-home exam (THE) use has been reported in various disciplines, but research on THE use in language modules in higher education appears to be scarce. The current study employed surveys and interviews to examine how the shift to written THE, in place of the traditional in-class exam (ICE) during the pandemic, impacted language learning and teaching on the China campus of a British university. Additionally, correlation analyses were conducted with ranking data of students from the same cohort under THE and ICE to explore patterns in student performance under these exam conditions. In surveys and interviews, teachers reported that their teaching foci did not change under THE, while many students reported that their learning practices were different under THE and ICE. Students also exhibited a tendency to spend more time practicing skills that they expected to be assessed in the exam. Overall, both teachers and students expressed preference for ICE, with many raising concerns about fairness issues in THE. Furthermore, correlation analyses showed that, overall, for a given group of students, written ICE rankings exhibited strong correlations with each other but written THE rankings did not, suggesting relative instability of THE results. However, when written THE and oral ICE results from the same module are combined, the resultant rankings strongly correlated with pure ICE rankings. This indicates that combining ICE and THE components for assessment could help mitigate some perceived short-comings, including the instability issue, of THE used alone.


Introduction
Language learning at university is usually evaluated through either paper-and-pencil tests or performance tests (McNamara, 2000). The former typically take the form of a traditional end-of-semester final exam, in which students are evaluated with an "in-class, closed-book, invigilated pen-and-paper exam" (Bengtsson, 2019, p. 1), known as an in-class exam (ICE). This testing method has been used in many disciplines for a long time with little change (Williams & Wong, 2009). However, the outbreak of COVID-19 forced educational institutions to change their teaching and assessment methods significantly. Many universities switched from traditional face-to-face teaching to online teaching within a short time and adopted new assessment methods. When devising new assessment methods, colleges and universities focused on how to most effectively and authentically assess student learning online (Harrison, 2020). Chan (2022) reviewed the practice of 76 universities and found that the most common approach was to focus on grading, employing for example a "binary grading system" which gives students a "Pass" or "Fail" as a "safe landing" instead of changing the assessment itself, while a few universities, like Princeton University, replaced all examinations with take-home examinations (p. 8). A take-home exam (THE) is an exam that "the students can do at any location of their choice non-proctored" and whose "time limit is extended to day(s) rather than hours as is the typical time limit for an ICE" (Bengtsson, 2019, p. 2). THE had been used prior to the pandemic as "an assessment method on a regular basis" in universities in Australia, Canada, Finland, and Sweden but was relatively uncommon in UK universities before 2020 (Bone & Maharg, 2019, p. 934), and little investigation has been conducted on its use in the field of second language teaching and learning. Hence, research in this area is necessary.

Context of the current study
The current study was conducted on the Chinese campus of a British university. The campus provides a UK-style education in terms of curriculum, pedagogy, systems, language, and resources (Quality Assurance Agency for Higher Education, 2013), with the same quality assurance standards and regulations as the British main campus. English is the medium of instruction for all subjects except second language modules, where languages are taught in English and the target language. On campus, there are over 8000 students, more than 90% of whom are native Chinese (University of Nottingham Ningbo China [UNNC], n.d.) and share the common language of Mandarin Chinese. They were admitted to the university through the first tier of China's National College Entrance Examination (Gaokao), before which most of them had not formally studied in a western educational system.
The Chinese students on campus are graduates of domestic high schools that foster an exam-driven learning culture, where "assessment provided motivational forces by offering results indicative of learning progress" for them (Gao, 2006, p. 61). In Chinese high schools, students take English examinations that emphasize the learning of vocabulary and grammar, so students "might develop a belief that learning language is mainly about acquiring knowledge rather than developing communicative skills" (Li & Ruan, 2015, p. 48). After entering the university where the current research was conducted, all students who are not native speakers of English take 1 year of English for academic purposes (EAP) courses to support their later study. After this preliminary year of English training, students formally start their academic degree study, and many of them can choose a second language course in the language center.
The language center (LC) offers French, German, Japanese, Korean, Mandarin, and Spanish courses. In February 2020, when the coronavirus outbreak began, the whole university shifted to online teaching. However, in May 2020, the majority of students were able to come back to campus to receive face-to-face teaching, while a small number of students who were unable to return for various reasons continued learning online in small separate groups. Because of the uncertainty surrounding COVID, the LC announced at the beginning of the autumn semester of academic year (AY) 2020-2021 that the end-of-semester written exam would move entirely online as a 24-h THE. The THE format was used for two semesters until the exam changed back to ICE in autumn 2021 (see Table 1).

The implementation of THE
When the written exam moved online in autumn 2020, the traditional 1.5- to 2-h in-class written exam was replaced by a 24-h THE, and the exam tasks were changed to two pieces of writing based on two reading stimuli (see Table 2). For the 24-h THE, students accessed Moodle, where the module convener had set up the THE assignment before the exam date. Students downloaded the exam paper and were expected to upload their answers to Moodle within 24 h. We chose THE as the end-of-semester assessment for several reasons. Firstly, not all students had been able to come back to campus, so it was not possible to resume the traditional ICE. Secondly, THE was one of the alternative methods employed by institutions in other parts of the world where teaching was still totally online (Gamage et al., 2020), including the LC on our home campus in the UK. Thirdly, for many researchers, THE was "a promising move to assessment for learning during the time of Covid-19" (Tam, 2022, p. 488) and could even be "far more valuable than being an emergency alternative to in-class exam" (Braselmann et al., 2022, p. 99). Finally, it should be noted that we gave students 24 h to finish their exam because some students were outside China and in different time zones, and because technical issues could arise and take time to resolve. The LC also considered the potential problems related to THE, namely that students could consult books, the internet, and other sources for answers. To cope with these problems, we changed the exam design.
Table note: In academic year 2020-2021, the usual 1.5- to 2-h in-class written exam was replaced by a 24-h THE and the exam tasks were changed to two pieces of writing based on two reading stimuli. Regardless of format, the written exam counted for 50% of the final mark of a module. The oral exam (always an ICE) constituted the other 50%.
The regular in-class written exam consisted of three parts: reading comprehension, use of language (grammar), and writing, whereas in the 24-h THE, there were two writing tasks based on two reading stimuli, where students were required to read the stimulus text and write personal responses. The reading stimuli were provided in picture format so students could not directly copy and paste any text, in the hope that this could reduce the potential use of tools like machine translation.
Students were asked to hand-write their answers either on writing sheets centrally designed by LC or on blank white sheets of paper. They would then scan or take pictures of their work and upload the files on Moodle. After the end of the 24-h period, module conveners downloaded all students' scripts and distributed them within their respective language team for marking. The marking criteria were centrally designed by the home university, but there was no time to discuss them and standardize their use across language teams on our campus, so standardization was conducted only within each language team.
At our university, language modules also include an oral exam, whose format did not change during the pandemic. The oral exam is normally an in-class closed-book exam, where groups of 2 or 3 students engage in a conversation prompted by a randomly drawn card, after which each student answers some questions asked by examiners on the spot. For the oral exam, the only change caused by the pandemic was that a very small number of students had to be examined online with the same invigilation and procedural standards. For the current study, therefore, the oral exam is conceptualized as always being an ICE.

Literature review
Two other concepts related to THE and ICE appear frequently in the literature: open-book exams (OBEs) and closed-book exams (CBEs). It is important to first clarify how these terms are used in the current study. First, in OBEs, students can access class notes and teaching materials during the exam (e.g., Tao & Li, 2012). OBEs can be in-class invigilated open-book exams or take-home open-book exams. The latter take place under the same conditions as THE. In this paper, therefore, the term THE is used for all exams that students take at home or in other places without invigilators during a given period of time, from a few hours to a few days; these include take-home OBEs. Second, CBEs are usually in-class exams (ICEs), so in this paper, ICEs and CBEs are considered interchangeable and refer to exams that take place in class with invigilators, where students are not allowed to access learning materials. Closed-book take-home exams (e.g., Fernald & Webster, 1991) are not considered in the current study.
There are also review reports. Bengtsson (2019) reviewed 35 articles about THE in higher education and concluded that THE may be the preferred choice of assessment because it promotes higher-order thinking skills and allows time for reflection. Durning et al. (2016) and Johanns et al. (2017) have done similar work.
However, there seems to be a dearth of research on the use of THE in second language education.

Impact of THE on learning and teaching
Researchers reported inconsistent results about the impact of THE (including take-home OBEs) on students' learning.
Some researchers found that THE had a positive impact on students' learning. For example, López et al. (2011) concluded that THE is "a powerful tool" for assessing all types of skills (p. 6) and was greatly appreciated by students as it improved their learning process. Later studies also echoed these statements, reporting that THE implementation deepened understanding (Karagiannopoulou & Milienos, 2013; Jacobs, 2021), and increased student motivation and engagement (Myyry & Joutsenvirta, 2015).
These results, however, must be viewed with care, as these benefits might be associated with changes in other factors brought about by THE implementation, such as changes in question types, rather than the shift to THE per se. For instance, in López et al. (2011), more than half of the student participants found their THE in a computer science course "very demanding" and agreed that significant learning took place during the exam (pp. 5-6). This could be due to the fact that López and colleagues designed a THE with 16 open-ended questions that required students to collect and synthesize information from the course and on the internet (p. 5). Similarly, when shifting to THE, Jacobs (2021) made more extensive use of open-ended questions in chemistry examinations. Another example is Braselmann and colleagues' (2022) study: the researchers not only created a complex THE design with a mixture of closed, semi-open, and open-ended components, but also redesigned the whole course around the THE requirements. When reviewing the reported benefits of THE, it is therefore essential to identify and consider the factors affected by THE use in individual studies.
While the abovementioned THE studies involved certain degrees of redesign of exam tasks, Marsh (1984) gave two groups of students the same set of multiple-choice questions in THE and ICE formats, respectively. In an unexpected delayed test 1 week later, the students in the ICE group outperformed those in the THE group, indicating that ICE may be associated with better retention compared with THE. As Marsh explained, students' expectation that they could rely on study materials in a THE could hinder learning (p. 112).
None of the studies discussed so far was from the field of language education, where exams are designed to assess language knowledge and skills and can look very different from exams in other disciplines. Thus, the current study contributes to the discussion on THE's impact on learning and teaching by examining its use in language education.

Exam results
Extant studies that compared student results in THE and ICE mainly used average scores or grades in the same course. For example, Braselmann and colleagues (2022) compared average grades under THE with previous ICE grades in the same course and found "no significant difference" (p. 97). Spiegel and Nivette (2021) also reported comparable results between THE and ICE. In Jacobs (2021), average scores were overall higher for THE than ICE, but not by much.
These comparisons have some limitations. First, given the relatively small amount of data, no pattern can be reliably observed yet and the comparisons do not reveal much about the exam formats. Second, some of these studies (e.g., Braselmann et al., 2022) compared results from different cohorts. This may make results harder to interpret, as different cohorts may possess varying characteristics that affect exam results, and therefore may also limit the insight that can be gained through comparing THE and ICE results.
In response to these limitations, the current study utilized marks from the same cohort across four consecutive semesters, as well as the corresponding ranking information, to explore patterns in performance of the same students in THE and ICE.

Exam design and fairness
The majority of research on THE and ICE has demonstrated that, either because of its format (open-ended and essay-type questions) or because of its longer duration, THE promotes higher-order cognitive activities (Bengtsson, 2019; Tam, 2022), evaluation and creation of knowledge (Khan, 2022), and reflection on personal experience (Ng, 2020).
However, a basic principle in assessment design is to ensure that assessments enable students to demonstrate their learning (Quality Assurance Agency for Higher Education, 2018), and "reflect students' real competence" (Şenel & Şenel, 2021, p. 246). The accuracy of the assessment in representing students' real competence not only depends on effective design of exam tasks but is also related to students' conduct during exams; students might misbehave in exams and violate academic integrity, which can impact exam fairness. Because of the open-book nature of THE, dishonesty is a common concern (Cleophas et al., 2021; Ng, 2020). Studies have revealed that there is a higher probability for online students to cheat in assessments in comparison with campus-based students (e.g., Gamage et al., 2020).
Moreover, when adopting THE, it is also necessary to consider the overall assessment design of the module. Durning et al. (2016) and Johanns et al. (2017) concluded that a combined approach (of OBEs/THE and CBEs/ICEs) could be more effective in assessing different competencies. The current study will examine this point in its data analysis.
All the research discussed so far focused on content modules in non-language subjects, and no research was found related to language examinations, which often aim at enabling students to demonstrate the mastery of multidimensional skills in a language rather than knowledge in a subject. This study also contributes to filling this gap.

Research questions
The research discussed so far focused on THE as an assessment tool in content modules aiming at testing students' knowledge of concepts and theories and their ability to apply them in disciplinary contexts. Most studies that analyzed exam results did so only in terms of average marks or grades from different cohorts. Furthermore, few studies surveyed both students' and teachers' perspectives on THE and ICE, as well as their learning and teaching strategies under these exam formats.
Given these gaps, the current study aims at contributing to the ongoing discussion about THE by exploring its use in the field of language education, and more specifically by answering the following research questions:
1. To what extent does the awareness of take-home exam (THE) implementation affect students' learning strategies and teachers' teaching strategies?
2. How well correlated are exam results of a given cohort under THE and ICE as measured by rankings?
3. How do students and teachers believe THE should be designed to assess language skills accurately?

Methods and samples
A combination of quantitative and qualitative data collection, including surveys and semi-structured interviews, was adopted in this study. The findings were triangulated with the analysis of exam results.
We conducted two questionnaire surveys to gather information from students and teachers respectively about their behaviors, opinions, and experiences in relation to THE, and explored these data for possible similarities and differences (Neuman, 2014). As Wellington (2015) put it, a survey is a "fact-finding mission" (p. 191); this reflects the purpose of our two surveys. In addition, to gain a deeper, more nuanced understanding of the participants' survey answers and their subjective experience with THE (Kvale, 2008), respondents were also invited to an interview.
After ethical approval was obtained from the university, two anonymous surveys were launched in May 2022, one for language center students and one for teachers. The surveys were sent to all 23 language center teachers and all 424 students who took A2-B1 level language modules in 2021-2022. We targeted this cohort of students because they experienced both the 24-h THE in 2020-2021 and the ICEs in 2021-2022 and therefore could compare their THE and ICE experience.
Each survey consisted of three main parts. For students, the first part included questions on basic background information, including the language they were learning and the level of their language module at LC. The second part focused on their learning strategies and practices under THE and ICE, and their attitudes toward the two assessment formats. In the third part, students were invited to present their opinions on preferred exam format and question types. The last two parts included both Likert scale and open-ended questions to enable respondents to elaborate on their answers. For language teachers, the survey questions covered their teaching strategies under THE and ICE, their attitudes toward the two exam formats, and what they thought should be taken into consideration when designing THE. This survey also contained Likert scale and open-ended questions.
Before the surveys were officially launched, a pilot survey was conducted with five students and two language teachers. The surveys were then refined following their feedback to improve effectiveness and reliability. Changes were made (1) to reorder some questions; (2) to improve the readability of the phrasing; (3) to adjust the number of options for closed questions; and (4) to correct typos.
Out of the 424 eligible students, 135 completed the survey. Table 3 displays their demographic information. French (32.6%), Japanese (30.4%), and Spanish (25.2%) were the three languages with the most respondents. Out of the 23 teachers, 11 completed the survey.
At the end of the survey, nine students and five teachers indicated interest in being interviewed. Interviews lasted about 20 min each for students and 30-45 min each for teachers. Interview recordings were transcribed after each interview.
In addition, to triangulate our findings, quantitative analysis was conducted to compare the THE results in 2020-2021 and ICE results in 2021-2022. We collected the exam results of 206 students, of whom 28.3% were studying French, 25.9% Japanese, and 45.9% Spanish.

Data analysis
We adopted four empirical methods for quantitative analysis: descriptive analysis, scale analysis, non-parametric analysis, and correlation analysis, as illustrated in Fig. 1. Descriptive analysis revealed important facts and patterns of respondents' language learning and teaching practices under different assessment methods. Scale analysis enabled us to measure respondent attitudes toward different assessment methods, and included both validity tests (Kaiser-Meyer-Olkin [KMO] test and Bartlett's test; Bartlett, 1954; Dabestani et al., 2014; Gunawan et al., 2022; Kaiser, 1974) and reliability tests (Cronbach's alpha coefficient; Croasmun & Ostrom, 2011; Gliem & Gliem, 2003; Vaske et al., 2017), both of which are commonly used to assess Likert-scale questionnaires. Our survey included rank-order questions where respondents were asked to rank their learning or teaching foci under the two assessment methods, and non-parametric tests can compare these ranked data more robustly than parametric tests (Krzywinski & Altman, 2014). Hence, we followed Shin and Park's (2009) model and used the non-parametric Wilcoxon signed rank test to compare learning and teaching foci under the two assessment formats. As for analyses of student exam results, we conducted paired t-tests and the Spearman correlation test to compare results under THE and ICE. For comparison of cohort average marks under THE and ICE, we employed the paired t-test because several previous simulations have found parametric tests to be more robust in analyzing both normally and non-normally distributed continuous data in most situations when the sample size is not very small (Skovlund & Fenstad, 2001; Wadgave & Kahairnar, 2019). Moreover, to gain further insight into how students performed under different exam settings, it is vital to also analyze changes in students' relative positions within their cohort, as "rank eliminates any disparity between the two characteristics compared" (Spearman, 2010, p. 1141). Therefore, Spearman's rank correlation coefficient analysis, one of the best-known tests for comparing rankings (Csató, 2013), was conducted to compare students' exam rankings.
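As an illustration of the rank-based comparison described above, the following is a minimal Python sketch of how Spearman's ρ can be computed from two sets of marks for the same students. It is not the code used in the study: the function names and the marks are hypothetical, and the coefficient is obtained as the Pearson correlation of average ranks, which is one standard way of handling tied marks.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical marks of the same five students in two semesters.
semester_1 = [72, 65, 58, 80, 61]
semester_2 = [70, 60, 63, 85, 55]
rho = spearman_rho(semester_1, semester_2)
print(round(rho, 3))
```

Ranking from the lowest or from the highest mark gives the same ρ as long as both lists are ranked in the same direction. The significance test reported alongside ρ is not reproduced here; in practice it would normally come from a statistics package.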
Regarding qualitative analysis, to ensure interrater reliability, we proceeded as follows. Two researchers of the team coded two students' and one teacher's interview transcripts separately using descriptive codes (Saldaña, 2013). They then met, and selected and retained the codes that they had both identified or deemed relevant to the study, based on which they built the codebook (see Table 4). Afterwards, one of the researchers coded all the students' and teachers' interviews including those already coded (in total 11 students and 5 teachers). NVivo 12 was used to organize and manage the codes.

Results
In this part, we first present the quantitative results based on the surveys and the exam results and then present the qualitative results from the interviews and the open-ended questions in the surveys. In both parts, we present results in the following topical order: Impact of THE on Learning and Teaching, Exam Results, and Exam Design and Fairness.

Impact of THE on learning and teaching
The student survey shows that most students believed that THE did not influence their class attendance (83.05%), class participation (82.76%), or homework completion (76.72%). Table 5 shows that the majority of students reported that they did not change their weekly self-study time (64.1%). Compared with ICE, 26.7% reported spending less time on self-study for THE, while 9.2% spent more time for THE. Most students did not change their self-study focus (55.6%). According to Table 6, grammar (mean=3.65, SD=1.58 for THE; mean=3.78, SD=1.37 for ICE) and vocabulary (mean=3.33, SD=1.50 for THE; mean=3.43, SD=1.36 for ICE) were the two aspects students spent the most time on, regardless of assessment type. In contrast, reading was the skill they spent the least time on for both types of assessment (mean=2.59, SD=1.46 for THE, and mean=3.02, SD=1.43 for ICE). Students spent significantly more time on reading-focused self-study under ICE implementation compared with THE (paired differences in mean=−0.43, t=−2.359, p=0.02).
Table note c: 0=no, 1=yes (extracurricular practices include such activities as consuming recreational contents in the target language, participating in events organized by language teachers, following off-campus language classes, and using language apps).
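The paired comparison of reading-focused self-study reported above rests on the paired t statistic, that is, the mean of the per-student differences divided by its standard error. The following minimal Python sketch shows the calculation with hypothetical ratings; the function name and the data are assumptions for illustration and do not reproduce the survey data.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Standard error of the mean difference uses the sample standard deviation.
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return mean(diffs), t, n - 1


# Hypothetical 5-point ratings of time spent on reading by the same six students.
reading_the = [2, 3, 1, 4, 2, 3]
reading_ice = [3, 3, 2, 4, 3, 4]
mean_diff, t_stat, df = paired_t(reading_the, reading_ice)
print(round(mean_diff, 2), round(t_stat, 3), df)
```

A negative mean difference here means lower ratings under THE than under ICE, which is the direction reported in the text. The p-value is obtained by referring t to a t distribution with n − 1 degrees of freedom, typically through a statistics package.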

Table 7 shows that teachers (Obs=8) focused most on grammar (mean rank=2.75 for THE and 2.88 for ICE) and speaking (mean rank=3.00 for THE and 2.88 for ICE) skills under both THE and ICE. According to the Wilcoxon signed rank test shown in Table 8, there was no statistically significant change in teachers' teaching focus on vocabulary, grammar, phonetics, listening, reading, and writing when THE was implemented. Teachers' teaching activities did not change for THE either, as shown by their overall neutral or "disagreeing" responses to questions about the impact of THE on their teaching practices (Table 9).
Students' attitudes toward 24-h take-home exams were measured in terms of perceived benefits, issues, and level of stress, while their attitudes toward traditional exams were measured in terms of perceived benefits and level of stress, all on a 5-point Likert scale. The Cronbach's alpha coefficients for these scales (Table 10) indicated satisfactory internal consistency; hence, the questionnaire has good reliability in measuring respondent attitudes (Vaske et al., 2017). In the reliability test, an observation is dropped if it has a missing value in at least one of the specified variables, so the numbers of observations in Table 10 might be different from those in Table 12. According to Table 11, the result of the KMO test for 24-h take-home exams is 0.721 (p<0.001), and that for traditional exams is 0.723 (p<0.001), meaning that the questionnaire has good validity in measuring respondent attitudes (Kaiser, 1974).
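For readers less familiar with the reliability check mentioned above, Cronbach's alpha for a scale of k items is k/(k − 1) multiplied by one minus the ratio of the summed item variances to the variance of respondents' total scores. The sketch below is a minimal Python illustration under that standard definition; the responses are hypothetical, and respondents with missing values are assumed to have been dropped beforehand.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha.

    items: list of item columns; each column holds the responses of the
    same respondents in the same order (complete cases only).
    """
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 5-point Likert responses: 3 items answered by 5 respondents.
likert_items = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 3, 4],
]
alpha = cronbach_alpha(likert_items)
print(round(alpha, 3))
```

Higher values indicate that the items vary together across respondents; which threshold counts as acceptable depends on the convention followed.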
As shown in Table 12, students thought that the exam duration of THE was too long, that they developed less in reading and grammar skills, and that they spent less time on learning overall. Similarly, teachers thought that, under THE implementation, students' reading and grammar skills were less developed, and that students spent less time on language learning and did not do their homework (Table 13).

Exam results
The comparison of exam results of ICEs and THE (Table 14) shows that during the pandemic, in 2021-2022, there was a statistically significant drop in the average mark for French (from 64.4 to 54.4 out of 100 marks) and Spanish (from 61.6 to 58.6), while the Japanese average mark increased (from 59.6 to 61.2).
Spearman's rank correlation coefficient analysis was conducted to compare students' rankings within their cohort under different exam conditions. First, we compared the rankings for the written exam and for the oral exam between the two semesters within 2020-2021 and within 2021-2022 (Table 15). The comparison shows that, for each of the three languages, the oral exam rankings were strongly correlated with each other, with ρ values between 0.625 and 0.802 (p<0.001). On the other hand, the written exam rankings were not strongly correlated in 2020-2021 when THE was implemented (ρ between 0.399 and 0.541, p≤0.003), while they exhibited much stronger correlations in 2021-2022 under ICE (ρ between 0.636 and 0.773, p<0.001).
The written mark, whether from THE or ICE, constitutes only 50% of the final mark of the language modules, so we then included the oral exam (always ICE), which makes up the other 50%, in our analysis.
We used rankings obtained from average marks between semesters within the same academic year to indicate overall results in that academic year. We compared students' rankings for both written and oral components in 2020-2021 (with a written THE) and 2021-2022 (with a written ICE, Table 16). The results show that the correlations for the written exams were weak for French (ρ=0.331, p=0.014) and Spanish (ρ=0.346, p<0.001) but strong for Japanese (ρ=0.699, p<0.001); the correlations for the oral component for all languages were strong, between 0.695 and 0.770 (p<0.001).
When we used the rankings obtained by combining the oral and written components (i.e., based on the final overall module results; written 50%, oral 50%), correlation analyses showed strong correlations between 2020-2021 and 2021-2022 rankings for each of the three languages (ρ between 0.682 and 0.787, p<0.001), despite the inconsistency in results observed when the written component alone was used for analysis (Table 16).
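To make the combination step concrete, the following minimal Python sketch weights a written and an oral mark at 50% each and derives a ranking from the combined marks. The student identifiers, marks, and function names are hypothetical and serve only to show how a ranking based on the written component alone can differ from the ranking based on the overall module result.

```python
def combined_marks(written, oral, written_weight=0.5):
    """Weighted module mark per student; the oral mark takes the remaining weight."""
    return {
        student: written_weight * written[student] + (1 - written_weight) * oral[student]
        for student in written
    }


def ranking(marks):
    """Student ids ordered from highest to lowest mark (ties broken by id)."""
    return sorted(marks, key=lambda student: (-marks[student], student))


# Hypothetical marks out of 100 for four students.
written_the = {"A": 70, "B": 55, "C": 62, "D": 48}
oral_ice = {"A": 58, "B": 75, "C": 64, "D": 50}

final = combined_marks(written_the, oral_ice)
print(ranking(written_the))  # ranking on the written component alone
print(ranking(final))        # ranking on the combined module result
```

In this made-up example, student B moves from third place on the written component to first place overall, because the oral mark carries half of the weight.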
The surveys also yielded relevant information in understanding student results under THE and ICE. For THE, students reported (Table 12) that they had more time to write and check their answers (mean=4.58, SD=0.77) and that they felt less pressure because there was more time and no invigilator and they had the chance to seek help. However, some thought that 24 h was too long for THE (mean=2.91, SD=1.41). On the other hand, students believed that ICEs were fair for everyone (mean=3.87, SD=1.31) and allowed them to demonstrate their skills (mean=3.43, SD=1.30) and obtain higher marks (mean=3.24, SD=1.41). According to the teacher respondents (Table 13), the major issue of THE was suspected plagiarism (mean=4.45, SD=0.82), followed by the possibility of technical issues (mean=3.64, SD=1.12). Some also found the exam duration too long (mean=3.60, SD=0.97). Like students, teachers believed that ICEs were fair for everyone (mean=4.73, SD=0.47) and enabled students to obtain higher marks (mean=3.70, SD=0.95) and to demonstrate their skills (mean=3.55, SD=1.29).
In the interviews, students and teachers also gave comments that could help interpret the result differences under the two exam formats. These will be presented under "Exam results" in the qualitative results.

Exam design and fairness
Based on their experience with THE, survey respondents were asked to express their preference for different options for THE exam design in relation to such aspects as exam content, exam duration, learning skills, and learning focus required when preparing for THE.
For THE content, about half of the student respondents (46.5%) preferred a comprehensive exam consisting of reading, use of language, and one writing task; 27.9% preferred two writing tasks, while 25.6% preferred one reading task and one writing task. For the exam duration, 44.4% of the student respondents preferred a take-home exam with a 4-h time limit, while 31.6% preferred a 24-h take-home exam. Most students believed that a good THE design should require them to use critical thinking and resource finding abilities (60.15%), time management skills (52.63%), and higher-order thinking (34.59%), and require their learning foci to be writing (74.44%) and grammar (57.89%).
On the other hand, in the eyes of the teacher respondents, a good THE should be only focusing on writing tasks (70%) and should have a time limit of 2 h (70%).A good design should require students to utilize time management skills (mean=3.91,SD=1.14), critical thinking (mean=3.82,SD=1.08), and resource finding skills (mean=3.30,SD=1.06).
If the THE were to be implemented again, teachers (Obs=10) would focus on teaching grammar (mean rank=2.90)and writing (mean rank=3.00),followed by vocabulary and reading (mean rank=3.50).They would assign more homework related to writing and provide more feedback on writing (72.7%).About half of the teacher respondents would also like to apply stricter marking criteria for THE.

Qualitative results: interviews and open-ended questions in surveys
The 11 student participants were coded S1, S2, ..., S11, with S1 representing student 1, and so forth. Nine out of 11 students study Japanese, one French, and one Spanish. The five teacher participants were coded T1, T2, ..., T5. Three out of five teach Mandarin, one French, and one Japanese.

Impact of THE on learning and teaching
Two students reported a positive impact on their learning attitudes because THE put them under less pressure (S1) and therefore made them "love their second language" (S2). In contrast, the majority of student participants reported negative attitudes toward language learning when THE was implemented. They felt less motivated (S3, S6, S9) and spent less time on learning (S3, S5, S8); they worked much harder and spent more time on remembering vocabulary for ICEs (S4, from both survey and interview responses). However, with THE, students' learning changed to focus more on expressing themselves in the THE writing tasks (S6).
Most student interviewees did not change their learning practices even though they had been informed at the beginning of the semester that the exam would be a THE, because they felt they needed to "learn grammar, vocabulary to compose [their] own paragraph" anyway (S4); and because they were learning to develop the ability to communicate (rather than to merely pass exams) and the class contact hours were the same (regardless of exam format) (S2). They had interest and motivation (S3, S4, S5, S7): "learning language needs long-term effort and daily accumulation of knowledge" (S8).
The main change in learning practice happened during exam preparation. Most of the interviewees said that they felt less stressed because they did not need to remember words with a 24-h THE (S2) and could read the textbook during the exam (S3). Some of them did less preparation (S4) or even stopped reviewing (S9).
As for teaching, most teacher interviewees did not change their teaching methods because "[they] teach according to the learning outcomes;" that is, they teach the four skills (listening, reading, speaking, and writing) to prepare students for communication rather than for exams (T1).They did not want students to focus on exams too much (T4).
However, one teacher changed teaching to focus on oral and writing skills (T3). Others gave students more writing tasks or reading practice opportunities and tasks to develop their vocabulary during the teaching weeks to help them cope with the expected changes in assessment (T1, T2). No matter what teachers did during the semester, they focused on preparing the students for the exam when it was closer to the exam period (T1, T2, T3, T5).

Exam results
One student participant confirmed that she obtained higher marks in THE than ICE because she was better at using the language for communication rather than answering detailed grammar questions (S1). Nonetheless, most students believed that they were disadvantaged in THE, because there were no reading or grammar questions to help them obtain a higher mark (S3) and their peers obtained higher marks with "perfect works" in THE (S4), which signals academic integrity concerns with regard to THE.
When talking about overall exam performance, students had blurred or even contradictory impressions about which format of assessment led to better results. One student (S1) mentioned that most of their classmates did not like THE because it only had writing tasks and they feared that they would not obtain a high mark. However, S1 clarified later that students' overall performance was better in THE because some students obtained good results which they did not deserve. Other students had similar opinions: on the one hand, they thought that the reading and grammar tasks in ICE help students obtain higher marks as these questions are objective and one could get full marks if the response is correct. On the other hand, they also thought that THE "enabled" weaker students to ask for external help because there was enough time to do so and there was no supervision, whereby weaker students could also get good marks, causing an increase in average marks under THE compared with ICE (S5).
Teacher T3 agreed that the reading task in ICE did help a certain group of students who were in the second class (50-69 out of 100 marks) to obtain a higher mark, while cheating helped lower-achieving students to get higher marks.
Other teachers (T1, T5, T6) reported that, according to their own impression, there were no significant differences between students' exam results in THE and in ICEs. However, one of them mentioned that she heard students from other languages performed much better in the THE than in ICEs (T1).

Exam design and fairness
Students had polarized attitudes toward THE. Some were enthusiastic about THE, because it made the exam easier. Others hated THE and questioned its validity because they believed that it was not a "serious exam" (S4).
Most student interviewees preferred traditional ICE because "there's less chance of cheating," "it's efficient," and it has stood the test of time (S3).
For THE design, many students wanted the reading task. This is not only because reading skills are an important component of language learning (S2), but also because examiners' evaluation of the answers is standardized and "objective" and students have a chance at obtaining full marks (S3).
In the survey, some students expressed preference for different question types for different exam formats: THE should have writing tasks, as other tasks such as reading are "inefficient" (survey result), while for ICEs, they think that it is better to test "grammar and the knowledge points from the textbook" and it can be a fair method to test students' language level (S8).
Two students had a "bigger picture" about exam design. One student (S7) mentioned that "language is a tool for communication" so regardless of exam format, a language exam should include tasks that involve situations "a person might meet in the world of work." Another student (S8) recommended having both THE and ICE because they required different skills from students and these skills are all useful.
Teachers preferred traditional ICE to THE, and thought that the latter was only for emergency use. They thought that for THE, duration should be shortened or more tasks should be given, but that the use of writing tasks per se was appropriate (T3). Writing based on a stimulus text not only tests writing skills, but students also need to understand the main points of the stimulus text and respond to it, so it also tests reading skills, analytical skills, and other higher cognitive skills like critical thinking skills (T3).
All teachers were satisfied with the THE design with writing tasks. One teacher mentioned that the reading stimulus text could be longer to integrate more reading skills assessment into the exam. Some teachers also recommended that ICEs have the same design as THE because the writing tasks in THE also evaluate grammar and reading in a more communicative way (T3, survey result).
To tackle the cheating issue in THE, one teacher recommended talking with the students about cheating and its consequences (T3). Another teacher (T4) suggested looking into the overall design of the assessment. According to him, having the (closed-book) oral exam, which is worth 50% of the overall mark, is important because the oral exam is "the most challenging" for students and they would "try to study more [for] oral exams" (T4).

Impact of THE on learning and teaching
As presented in "Impact of THE on learning and teaching", about 45% of the student respondents self-reported that, under THE, their learning foci were different from those under ICE. Specifically, in 2021-2022, when ICE was reinstated, they reported spending significantly more time on reading practice than in 2020-2021. The interview results in "Impact of THE on learning and teaching" provided a possible explanation for this difference. In 2020-2021, knowing in advance that THE would be implemented might have impacted some students' learning attitudes and practices, because, for example, they knew they would have time to consult learning materials during the exam (see also Agarwal & Roediger, 2011; Durning et al., 2016). When ICE was reinstated in 2021-2022, students may have realized that THE had only been implemented as a temporary measure and thought it would be wise to study more than they had with THE in 2020-2021 to maintain a good level of performance. Moreover, the increase in reading practice time also mirrors the change in exam format: reading tasks appeared to be absent in the 2020-2021 THE but then returned as a significant part of the ICE in 2021-2022.
Nevertheless, albeit seemingly missing from the 2020-2021 THE, reading was in fact still a crucial part of the assessment, because students needed to read and comprehend the stimulus text in order to write an appropriate response. Some students did not seem to understand this point. Some student interviewees (S3, S4) frequently mentioned that the THE did not include a reading task, which suggests that they did not fully understand the role of the reading stimulus in the exam: their reading skills were still being assessed despite the lack of traditional tasks of reading comprehension, such as multiple-choice, true-or-false, and fill-in-the-blank questions. Interestingly, there is evidence that our THE task format had some success in assessing reading skills: under THE implementation, a teacher participant (T3) felt students' reading skills were not well developed and that many students did not fully understand the reading stimulus and in some cases went off topic or failed to respond to it in full. While it is unclear whether there was a connection between students' insufficient understanding of how reading was being assessed in the THE and their performance in that aspect, the current observations revealed that students' understanding of an exam format may still be limited even with access to all relevant information; sometimes, what is obvious to teachers might not be as obvious to students. Consequently, it may be beneficial to provide students with more explicit explanations on certain aspects of an exam, so that students fully understand its requirements and expectations (see also Durning et al., 2016).
Furthermore, as Biggs and Tang (2011) pointed out, teachers should see the intended learning outcomes as the key element of their teaching (p. 197). Our teacher participants reported that they did not change their teaching methods, out of the belief that teaching should be oriented toward learning outcomes rather than exams. However, students may think otherwise: they "learn what they think they will be tested on" (Biggs & Tang, 2011, p. 197). Our student participants did exactly that. When ICE (along with the reading comprehension tasks) was reinstated, students spent significantly more time on reading skills in self-study. Agarwal and Roediger (2011) compared students' learning habits and exam performance when students had different expectations for the final exam (closed-book vs. open-book) and found that "students' study habits may be based, in large part, on the perceived difficulty of a final test" (p. 850) and that the majority of students who were not informed about a specific type of final exam expected a closed-book exam (p. 849). As a result, the authors recommend that "teachers give closed-book tests or at least do not announce in advance that they will be giving open-book tests" (p. 850). However, in the UK higher education system, including at the university where the current study was conducted, students are to be informed of the type of exams they will undertake at the beginning of the module. To compensate for any potential impact of THE on learning, Agarwal and Roediger (2011) suggested that "teachers administer frequent quizzes" to improve long-term retention (p. 850).

Exam results
The comparison between THE and ICE exam results in "Exam results" showed that there was an apparent drop in the average marks for French and Spanish learners after the reinstatement of ICE in 2021-2022, while Japanese learners' average marks increased. The average mark could have been influenced by many factors such as the difficulty of the papers, which was not taken into consideration in this research. Therefore, we will limit our discussion to the analyses of the exam rankings.
According to the ranking analysis in "Exam results", the Spearman correlations for THE were much lower than those for ICEs, an indication that students' rankings in the ICEs were more stable compared with those in the THE. The unstable results in THE could be related to many factors which need further investigation but, importantly, they indicate that this type of THE is not as stable as ICE as an assessment tool for students' language skills at our institution. This result of our study reminds us to be cautious about the notion of using THE as the sole evaluation method for language modules.
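As background for readers less familiar with the statistic, Spearman's rank correlation between two sets of rankings has a standard closed form when no two students share a rank. This is the textbook definition, offered here as a reminder only; how tied ranks were handled in the present analysis is not stated, and the symbols n and d_i are notation introduced for this illustration.

\[
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)}
\]

Here n is the number of students ranked in both exams and d_i is the difference between student i's rank in the first exam and in the second. A value close to 1 means that the ordering of students is largely preserved from one exam to the other, while a value close to 0 means that one ranking says little about the other.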
Students' written THE and ICE rankings were weakly correlated for French and Spanish, but correlation was higher for Japanese. For French and Spanish, the results show that students' performance was quite different under different written exam formats. These differences mirror the interviews with students and teachers in "Exam results". Most interviewees believed that there were differences in exam performance between THE and ICEs. According to their belief, the types of exam tasks and dishonest behaviors were the two main factors that could have influenced the exam results. Based on their opinions, on the one hand, the lack of reading tasks in the THE lowered the mark of the top- and middle-achieving student groups. Karagiannopoulou and Milienos (2013) also found that THE benefited different student groups differently, depending on their approach to learning and their preference of exam format. We should thus take students' individual differences into account in assessment procedures as suggested by Myyry and Joutsenvirta (2015). On the other hand, the suspected dishonest behaviors were believed to have increased lower achievers' marks. One fact that could be considered consistent with this belief is that the top- and middle-achievers' results did not change as dramatically as those of the lower achievers. The dishonesty issue will be further discussed in "Exam design and fairness".
Curiously, our Japanese language students' performance in THE was strongly correlated with their performance in ICEs but our current data do not seem to effectively explain this difference.
Overall, students' results in the closed-book oral exams showed strong correlations across all semesters. When comparing students' exam results in 2020-2021 and 2021-2022, the correlations were stronger when both written and oral results were combined in the analysis (i.e., when the final overall results of a module were used), compared with when only the written exam was considered. Crucially, this means that the oral exam (which is a closed-book in-class exam and is conceptualized as always being an ICE; see end of "The implementation of THE") could mitigate the THE instability issue discussed earlier in this section. This result constitutes a significant complement to the conclusion of Durning et al. (2016) and Johanns et al. (2017) that a combined approach (of THE/OBE or ICE/CBE) could be more effective in assessing different competencies.
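The comparison described in this paragraph can be illustrated with a minimal Python sketch. It is not the analysis procedure or the data used in this study: all marks are invented for six hypothetical students, the function names (average_ranks, spearman, overall) are illustrative, and the only detail taken from the study is the 50/50 weighting of written and oral marks. The sketch ranks students under each condition and computes Spearman's correlation as the Pearson correlation of the ranks, with tied marks sharing an average rank.

```python
from statistics import mean


def average_ranks(marks):
    """Rank marks from highest (rank 1) to lowest; tied marks share the average rank."""
    order = sorted(range(len(marks)), key=lambda i: -marks[i])
    ranks = [0.0] * len(marks)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next mark equals the current one (a tie).
        while end + 1 < len(order) and marks[order[end + 1]] == marks[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def overall(written, oral, written_weight=0.5):
    """Module mark as a weighted sum of written and oral marks (50% each by default)."""
    return [written_weight * w + (1 - written_weight) * o
            for w, o in zip(written, oral)]


# Hypothetical marks for the same six students in two consecutive years.
written_the = [70, 55, 82, 64, 48, 75]  # year 1: written take-home exam
oral_year1 = [68, 72, 60, 80, 50, 58]   # year 1: in-class oral exam
written_ice = [62, 74, 66, 81, 45, 57]  # year 2: written in-class exam
oral_year2 = [66, 70, 63, 78, 52, 60]   # year 2: in-class oral exam

rho_written = spearman(written_the, written_ice)
rho_overall = spearman(overall(written_the, oral_year1),
                       overall(written_ice, oral_year2))

print(f"written only: rho = {rho_written:.3f}")
print(f"written + oral: rho = {rho_overall:.3f}")
```

With these invented marks, the correlation between the two years is higher for the combined marks than for the written marks alone, which is the pattern reported above. The example only shows how such a comparison can be computed; it says nothing about the size of the effect in the real data.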

Exam design and fairness
According to the survey results shown in "Exam design and fairness" and the interview results in "Exam design and fairness", teachers and most students preferred ICE in general as they believed it is much fairer. Fairness was the main concern students reported, and they felt that ICE is fairer because it has been tested over a long period of time; Gamage et al. (2020) also stated that ICE is a more secure exam method in the sense that academic misconduct is less likely to happen in an invigilated environment.
Many studies have recorded self-reported cases of some form of cheating. As reported by the International Centre for Academic Integrity (ICAI, n.d.), more than two-thirds of college students have self-reported cheating behaviors and cheating is still on the rise. Khan and colleagues (2022) commented that the shift to remote learning "comes with its own challenges, particularly in academic integrity during assessments, like the issue of academic dishonesty" (pp. 18-19). To tackle this, the language center tried various methods as explained in "The implementation of THE". Some teachers also talked with students about cheating and its consequences, as suggested by McCabe et al. (2012) and by participants in Erguvan (2022). Despite these efforts, however, misbehavior in THE was still the biggest concern of our participants. Certainly, in the post-COVID digital age, more research and more experimentation with different strategies to avoid academic misconduct will be necessary. In these endeavors, just like in the current study, student voices should continue to be considered and their involvement should be encouraged (Azizi, 2022).
As far as second language learning is concerned, most of the participants of this study thought that ICE is not only fairer but more appropriate than THE. This opinion might be due to two reasons: 1) The nature of second language learning and testing. The key learning outcome tested in the language exams in the current study is the ability to express oneself, and especially at beginner and elementary levels, topics are related to everyday life. It is easy for a student taking a THE to obtain help from a native or proficient user of the target language and this would not be detected by any anti-cheating software. This practice is known as "contract cheating" (Ahsan et al., 2021, p. 523). Conversely, contract cheating can be more difficult in THE for other disciplines as they require specialized knowledge and, sometimes, citation of examples and references (Gamage et al., 2020); 2) Second language proficiency necessitates gradual learning. Learning a language is not a one-semester or one-year effort. It requires time and students need to demonstrate progress from one level to the next by demonstrating that they have acquired the required skills. If they do not have the necessary foundations, the following level of study will be more challenging. Enhancing the base knowledge is the key and ICEs are thought to be associated with a greater amount of study and produce better learning in a university-level environment (Marsh, 1984).
The two reasons above might also apply in other disciplines, like science courses where there is clear progression in knowledge and skills and whose exams could require mostly calculations and solution of problems without the need for extensive reading and citation. ICEs might work better than THE for evaluating students' learning in sciences, but more research is needed both for language and scientific disciplines.
Nevertheless, during COVID-19, at the language center, THE was thought to be the only possible solution in place of ICE. To maximize THE validity and reliability, the language center relied on changing the exam design and duration. As Cleophas and colleagues (2021) mentioned, a key to avoiding fraud in the first place is a suitable design of online exams. Teacher participants in this study agreed that THE should test productive skills such as writing, and that it would be better to also test students' receptive skills such as reading. Our THE writing task with a reading stimulus required students to understand the main points of the stimulus and write a response based on their own experience or express their own opinions. It incorporated reading skills into the exam and was thus evaluated internally as an appropriate design. Teacher participants felt that this kind of task enabled them to assess language skills, as well as such key skills for university students as critical thinking and information management skills (see also López et al., 2011). They also thought that this type of task enhanced deep learning as students engaged with higher-order skills like analytical skills, agreeing with the conclusion of Johanns et al. (2017).
Regarding exam duration, over half of students and most teachers who responded to the survey agreed that 24 h was too long for the THE, because it gave time for dishonest behaviors to take place. Respondents suggested 2 to 4 h for THE, more specifically about 2 h to answer the exam questions and some extra time to deal with any operational tasks, such as uploading the answers to the exam platform. These responses are supported by findings of some existing studies (Ng, 2020; Spiegel & Nivette, 2021; Tam, 2022), which also recommended setting tight time restrictions for THE.
Besides the written exam design, it is also important to look into the overall module assessment design. One teacher mentioned that when implementing THE, it is essential to have another exam component, in our case a closed-book in-class oral exam, to ensure the reliability and validity of the overall module assessment. As the participant explained, students will study more if they need to take an in-class oral exam which requires them to memorize and use vocabulary, grammar, and phrase expressions, skills they would also use in the THE. Therefore, having to take the THE with an oral exam, students are more likely to study the language rather than merely rely on cheating. This result is echoed by the conclusion of Durning et al. (2016) that the combined use of OBE and CBE could be effective as "OBEs and CBEs can contribute to an assessment program in part because of their complementary pros and cons" (p. 588), and by Johanns and colleagues (2017) who favored the use of a mixed method of examinations throughout the course of a nursing program.

Limitations
The biggest concern of our research participants is exam fairness and suspected academic dishonesty. However, it was difficult to obtain data of cheating behavior in THE and we were unable to fathom the real impact of academic misconduct on THE results. While this study provides insights into the impact of THE in second language learning and teaching, there are important limitations to consider that might influence the interpretation of our findings. First, the interviewees study and teach different languages, and because of the limited number of participants from each language, we could not evaluate whether their opinions and experiences were related to specific characteristics of the teaching and/or learning of a certain language, or whether they were shared by more students or teachers at the institution. The limited sample size of interviewees and survey respondents also means that their views and experiences may not fully represent those of all our language students and teachers. Second, the self-reported nature of the survey and interview data may have introduced bias and led to inaccuracies. Third, we did not investigate the impact of factors such as students' well-being (Stowell & Bennett, 2010), motivation, or learning style (Spiegel & Nivette, 2021). Although we used exam result data from the same cohort of students, these data span two academic years. This study did not consider any changes that individual students and/or the cohort may have undergone during this time. Fourth, the teacher participants come from diverse cultural backgrounds, which might have influenced their teaching and marking (Bianco & Crozet, 2003). Fifth, the study was conducted on a transnational campus, which has its own features that may not be found in other higher education contexts, rendering it necessary to take extra caution against overgeneralization of our findings. Finally, the study was conducted in the unique context of the COVID-19 pandemic. Our results about THE's impact on teaching and learning cannot be generalized to non-pandemic situations.

Conclusions
This study investigated the impact of the 24-h take-home exam in the field of language education on the China campus of a British university by analyzing students' exam performance, teachers' teaching methods, and students' learning strategies under THE implementation. Our findings show that the implementation of THE during the pandemic did not change teachers' teaching foci but that the expected exam format influenced many students' learning practices: students tended to spend more time on skills that they anticipated would be tested in the exam. Also, students and teachers believed that cheating was a major issue under THE.
The current study also found that students' rankings exhibited fairly strong stability in ICEs, but such stability was not found with THE. When results of THE and ICE (oral) components were combined, however, the overall ranking stability greatly improved, suggesting that the oral ICEs mitigated the ranking instability associated with the written THE in our study.
Student and teacher participants preferred ICEs for evaluating learners' language levels, but teachers still considered the THE with two writing tasks based on reading stimuli an appropriate tool to evaluate language learning as these tasks involve higher-level thinking skills. In addition, some students and teachers favored the design of language assessment with a combined use of ICE and THE, based on the consideration that when THE was implemented, the in-class closed-book oral exam could enhance the overall validity of the assessment for a language module. Strategies like this could potentially improve the usability of THE as a formal assessment instrument, and future research could explore the effectiveness of various strategies used in operations such as exam design, administration, and grading to offset the shortcomings of THE.

Fig. 1
Approaches and methods of quantitative analysis

Table 1
The timeline of THE implementation in LC

Table 2
Structure of assessment at the language center under THE and ICE implementation

Table 3
Demographic characteristics of students

Table 5
Impact of THE on students' out-of-class learning
a 1=fewer hours, 2=no impact, 3=more hours (self-study includes any study activity outside of class time.)
b 0=no, 1=yes (learning foci include vocabulary, grammar, reading, and writing.)

Table 6
Comparisons of students' self-study time and focus for THE and ICE
a Estimated number of hours of study per week

Table 7
Teachers' teaching focus for ICE and THE
a Respondents were required to rank the options from the most important (1) to the least (7)
Table 10 and Table 11 provide results of the reliability and validity check. As shown in Table 10, the Cronbach's alpha of all items are greater than 0.5, indicating good

Table 8
Comparison of teaching foci under THE and ICE
The sum of negative ranks equals the sum of positive ranks a
Based on positive ranks b

Table 9
Impact of THE on language teachers' teaching activities

Table 10
Reliability check of students' attitudes toward different assessment methods
a Listwise deletion based on all variables in the procedure

Table 11
Validity check of students' attitudes toward different assessment methods

Table 13
Teachers' opinions about THE and ICE

Table 14
Comparisons of average marks in written THE and ICEs

Table 15
Comparisons of students' rankings in different semesters of the same academic year

Table 16
Comparisons of students' rankings from written, oral, and overall (written and oral 50% each) marks