USING LARGE LANGUAGE MODELS FOR TEXT ANALYSIS IN THE EVALUATION OF UNIVERSITY EDUCATIONAL PROGRAMS
Abstract
Background. Large language models (LLMs) are increasingly used in educational analytics, particularly for processing large volumes of accreditation-related documents. However, it remains unclear how reliably such models can assess the quality of self-evaluation reports for educational programs, and which textual characteristics shape the assessments they produce.
Materials and Methods. The study analyzed ten self-evaluation reports of educational programs: five identified by expert assessment as the strongest within the higher education institution over the last three years, and five as the weakest over the same period. The GPT-5 and Gemini-2.5 models independently evaluated each document against the ten official criteria of the Ukrainian National Agency for Higher Education Quality Assurance (NAQA) and against eight textual metrics reflecting structural, semantic, argumentative, and factual properties of the text. All scores were generated directly by the models on a unified scale from 1 to 10. Pearson's and Spearman's correlation coefficients were used to analyze the relationships between the NAQA criteria and the textual metrics.
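As an illustration of the correlation step, the minimal sketch below computes Pearson's and Spearman's coefficients for one pair of score vectors; the variable names and score values are hypothetical and are not taken from the study's data.

    # Minimal sketch of the correlation analysis described above.
    # Score values and variable names are illustrative only.
    from scipy.stats import pearsonr, spearmanr

    # Hypothetical 1-10 model scores for the ten reports (five strong, five weak):
    naqa_criterion  = [8, 7, 9, 8, 7, 4, 5, 3, 4, 5]   # one NAQA criterion
    factual_density = [9, 8, 8, 9, 7, 3, 4, 3, 5, 4]   # one textual metric

    r, p_r = pearsonr(naqa_criterion, factual_density)       # linear association
    rho, p_rho = spearmanr(naqa_criterion, factual_density)  # rank-based association

    print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
    print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")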
Results and Discussion. The models showed limited alignment with the NAQA criteria, yielding only weak correlations. In contrast, the textual metrics, primarily factual density, argumentativeness, semantic coherence, and lexical diversity, consistently differentiated between stronger and weaker reports. GPT-5 exhibited lower score variability and reduced sensitivity to stylistic noise, whereas Gemini-2.5 reacted more strongly to structural and stylistic deficiencies. The correlation matrices showed that the textual metrics capture the latent quality characteristics of the documents better than direct application of the NAQA criteria.
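A hedged sketch of how the comparisons above could be reproduced: a Spearman correlation matrix between NAQA criteria and textual metrics, plus a simple dispersion comparison of the two models' scores. The file names and the column naming scheme (naqa_*, text_*) are assumptions made for illustration, not the study's actual data layout.

    import pandas as pd

    # One row per report; columns: 10 NAQA criteria and 8 textual metrics per model.
    gpt5   = pd.read_csv("gpt5_scores.csv")      # hypothetical file name
    gemini = pd.read_csv("gemini25_scores.csv")  # hypothetical file name

    naqa_cols = [c for c in gpt5.columns if c.startswith("naqa_")]
    text_cols = [c for c in gpt5.columns if c.startswith("text_")]

    # Spearman correlation matrix: NAQA criteria (rows) vs. textual metrics (columns).
    corr = gpt5[naqa_cols + text_cols].corr(method="spearman").loc[naqa_cols, text_cols]
    print(corr.round(2))

    # Score dispersion per model: a lower mean standard deviation is consistent
    # with reduced sensitivity to stylistic noise.
    print("GPT-5 mean std:", round(float(gpt5[text_cols].std().mean()), 2))
    print("Gemini-2.5 mean std:", round(float(gemini[text_cols].std().mean()), 2))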
Conclusion. The results show that LLMs currently do not accurately reproduce expert evaluations based on the formal NAQA criteria but effectively analyze the structural and content-related characteristics of reports using textual metrics. These metrics complement the NAQA criteria by accelerating expert workflows and enhancing document monitoring. Future research will focus on expanding the dataset, standardizing prompts, and comparing a broader range of models.
Keywords: large language models, educational programs, quality assessment.
DOI: http://dx.doi.org/10.30970/eli.33.1