Micro-Corpus LLM Synergy for English Writing Instruction: A Controlled Experiment
Abstract
Although Large Language Models (LLMs) generate text efficiently and effectively, their tendency to hallucinate and their limited controllability raise concerns in educational contexts, where transparency and reliability are essential. Responding to claims that traditional corpora are becoming obsolete, this study re-examines the pedagogical value of corpora in the era of LLMs through a focused review of recent literature and a classroom-based randomized controlled trial. Representative studies published between 2020 and 2024 are systematically reviewed using a three-dimensional analytical framework comprising Efficiency, Controllability, and Educational Adaptability. In addition, a controlled experiment (N = 60) in a university English writing course compares a pure LLM-based instructional model with a corpus-augmented LLM model that uses Retrieval-Augmented Generation (RAG). The results show that the corpus-augmented group made significantly fewer errors on the immediate post-test (9.7) than the LLM-only group (28.3), retained more on the delayed test, and asked more questions, while teachers’ preparation time fell by approximately 30%. These findings indicate that small-scale, task-specific pedagogical corpora play a critical role in enhancing the controllability and instructional effectiveness of LLMs. The study proposes a micro-corpus–LLM synergy framework and provides an open micro-corpus to support classroom implementation and replication.
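To make the corpus-augmented setup concrete, the following is a minimal sketch of the retrieval step in a RAG-style pipeline, not the study's actual implementation. The corpus sentences, the function names (`retrieve`, `build_prompt`), the bag-of-words cosine scoring, and the prompt wording are all assumptions introduced for illustration; the call to an LLM is deliberately left out.

```python
import math
import re
from collections import Counter

# Hypothetical micro-corpus: a handful of teacher-vetted example sentences.
# These are illustrative only and are not taken from the study's corpus.
MICRO_CORPUS = [
    "She suggested that he revise the introduction.",
    "The results depend on the sample size.",
    "I look forward to hearing from you.",
    "He is interested in learning statistics.",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=2):
    """Return up to k corpus sentences that share vocabulary with the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(doc))), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop sentences with no lexical overlap so the prompt stays grounded.
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(learner_sentence, corpus, k=2):
    """Assemble a prompt that restricts feedback to retrieved corpus evidence."""
    examples = retrieve(learner_sentence, corpus, k)
    lines = [
        "Correct the learner sentence. "
        "Base your feedback only on the corpus examples.",
        "Corpus examples:",
    ]
    lines += [f"- {example}" for example in examples]
    lines.append(f"Learner sentence: {learner_sentence}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("I look forward to hear from you", MICRO_CORPUS))
```

The prompt produced this way would then be sent to whichever LLM the course uses. A real deployment would more likely rely on embedding-based retrieval than on word overlap, but the controllability argument is the same: the model's feedback is tied to a small set of inspectable sentences.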