PARAMETER-EFFICIENT FINE-TUNING AND OVERFITTING IN GPT LARGE LANGUAGE MODELS: A METRIC-BASED COMPARISON

Bohdan Pavlyshenko, Ivan Bulka

Abstract


Background. Building upon previous research, this study explores Large Language Models (LLMs), with an emphasis on fine-tuning and evaluating LLaMA-3.1 for instruction tasks. LLaMA-3.1 is a new-generation model that has gained considerable recognition for its strong performance on various benchmarks. Besides assessing the differences and improvements between the base and fine-tuned versions of LLaMA-3.1 on an instruction dataset, the study also addresses the problem of overfitting in LLaMA-3.1. Furthermore, it compares LLaMA-3.1 with both its predecessor, LLaMA-2, and another LLM, Mixtral, thereby providing a more comprehensive picture of LLaMA-3.1's capabilities relative to other models.

Materials and Methods. The fine-tuning of LLaMA-3.1 employed state-of-the-art techniques, such as Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA), on comprehensive instruction datasets. Because LLM fine-tuning is resource-intensive, optimization measures were taken: the process relied on Parameter-Efficient Fine-Tuning (PEFT) and ran on NVIDIA A100 Tensor Core GPU (graphics processing unit) instances. All models were fine-tuned with the Hugging Face and PyTorch frameworks.
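As an illustration of this kind of setup, the following minimal sketch shows how 4-bit quantization and LoRA adapters can be combined for parameter-efficient fine-tuning with the Hugging Face transformers, peft, and bitsandbytes libraries. The checkpoint name, adapter rank, and other hyperparameters are assumptions for illustration, not the exact configuration used in the study.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-3.1-8B"  # assumed base checkpoint

# Quantize the frozen base weights to 4 bits (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # A100 GPUs support bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank LoRA adapters; only these small matrices are trained,
# which is what makes the procedure parameter-efficient (PEFT).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# From here, training on an instruction dataset proceeds with the standard
# transformers Trainer (or trl's SFTTrainer) on top of PyTorch.

Because only the adapter matrices receive gradients while the quantized base weights stay frozen, the memory footprint of such a run is small enough to fit on a single A100-class GPU for models of this size.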

Results and Discussion. The results of fine-tuning and evaluating LLaMA-3.1 offer valuable insights into how this model performs on specific tasks. The evaluation framework proved helpful for efficiently assessing LLM performance on instruction tasks. The research highlights the importance of evaluation for LLM applications: it shows that fine-tuning is not always a good choice, depending on the nature of the model and the specifics of the task, and it draws attention to the problem of overfitting.
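The abstract does not list the specific metrics used. As one hedged illustration of such a metric-based comparison, the sketch below scores model outputs against reference answers with ROUGE and BERTScore via the Hugging Face evaluate library; the example predictions and references are invented for illustration.

import evaluate

# Model outputs paired with reference answers from a held-out instruction set.
predictions = [
    "Paris is the capital of France.",
    "Water boils at 100 degrees Celsius at sea level.",
]
references = [
    "The capital of France is Paris.",
    "At sea level, water boils at 100 degrees Celsius.",
]

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(
    predictions=predictions, references=references, lang="en"
)

print("ROUGE-L:", rouge_scores["rougeL"])
print("BERTScore F1:", sum(bert_scores["f1"]) / len(bert_scores["f1"]))

Computing the same metrics for the base and the fine-tuned model on a held-out instruction set, and comparing them with the scores obtained on the training data, is one practical way to detect the overfitting discussed above.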

Conclusion. The close examination of LLaMA-3.1 contributes to the field of machine learning by offering insights into how this model works and how it can be fine-tuned for specialized tasks. The findings of this research open opportunities for more in-depth studies of LLM applications. They also highlight the importance of efficient evaluation with established metrics.

Keywords: LLMs, GPT, Mixtral, LLaMA, fine-tuning, overfitting


DOI: http://dx.doi.org/10.30970/eli.30.3
