Micro-Corpus LLM Synergy for English Writing Instruction: A Controlled Experiment

Zhongfang Cheng; Chun Keat Yeap; Amirah Mohd Juned

doi:10.47405/mjssh.v11i5.3995

Zhongfang Cheng Academy of Language Studies, Universiti Teknologi MARA (UiTM) Melaka Branch, 78000 Alor Gajah, Melaka, Malaysia
Chun Keat Yeap Academy of Language Studies, Universiti Teknologi MARA (UiTM) Melaka Branch, 78000 Alor Gajah, Melaka, Malaysia https://orcid.org/0000-0002-8106-4406
Amirah Mohd Juned Academy of Language Studies, Universiti Teknologi MARA (UiTM) Melaka Branch, 78000 Alor Gajah, Melaka, Malaysia

Keywords: Large Language Models, Corpus Linguistics, Retrieval-Augmented Generation (RAG), Educational Technology, Controlled Experiment

Abstract

Although Large Language Models (LLMs) generate text efficiently and effectively, their tendency to hallucinate and their limited controllability raise concerns in educational contexts, where transparency and reliability are essential. Responding to claims that traditional corpora are becoming obsolete, this study re-examines the pedagogical value of corpora in the era of LLMs through a focused review of recent literature and a classroom-based randomized controlled trial. Representative studies published between 2020 and 2024 are systematically reviewed using a three-dimensional analytical framework comprising Efficiency, Controllability, and Educational Adaptability. In addition, a controlled experiment (N = 60) in a university English writing course compares an LLM-only instructional model with a corpus-augmented LLM model that uses Retrieval-Augmented Generation (RAG). The corpus-augmented group made significantly fewer errors on the immediate post-test (9.7) than the LLM-only group (28.3). The corpus-augmented model also improved delayed retention, encouraged more learner questioning behaviour, and reduced teachers’ preparation time by approximately 30%. These findings indicate that small-scale, task-specific pedagogical corpora play a critical role in enhancing the controllability and instructional effectiveness of LLMs. The study proposes a micro-corpus–LLM synergy framework and provides an open micro-corpus to support classroom implementation and replication.
