PARAMETER-EFFICIENT FINE-TUNING AND OVERFITTING IN GPT LARGE LANGUAGE MODELS: A METRIC-BASED COMPARISON

Bohdan Pavlyshenko, Ivan Bulka

Abstract


Background. Building upon previous research, this study explores Large Language Models (LLMs), with an emphasis on fine-tuning and evaluating LLaMA-3.1 for instruction tasks. LLaMA-3.1 is a new-generation model that has gained considerable recognition for its strong performance on various benchmarks. Besides assessing the differences and improvements between the base and fine-tuned versions of LLaMA-3.1 on an instruction dataset, the study also addresses the problem of overfitting in LLaMA-3.1. Furthermore, it compares LLaMA-3.1 with both its predecessor, LLaMA-2, and another LLM, Mixtral, thereby providing a more comprehensive picture of LLaMA-3.1's capabilities relative to other models.

Materials and Methods. The fine-tuning of LLaMA-3.1 employed state-of-the-art techniques, such as Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA), on comprehensive instruction datasets. Because LLM fine-tuning is resource-intensive, optimization measures were taken: the process relied on Parameter-Efficient Fine-Tuning (PEFT) and ran on NVIDIA A100 Tensor Core GPU (graphics processing unit) instances. All models were fine-tuned with the Hugging Face and PyTorch frameworks.
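As an illustration of this kind of setup, the following minimal sketch shows how 4-bit quantization and LoRA adapters can be combined for parameter-efficient fine-tuning with the Hugging Face transformers, peft, and bitsandbytes libraries. The checkpoint name, adapter rank, and other hyperparameters are assumptions for illustration, not the exact configuration used in the study.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-3.1-8B"  # assumed base checkpoint

# Quantize the frozen base weights to 4 bits (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # A100 GPUs support bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank LoRA adapters; only these small matrices are trained,
# which is what makes the procedure parameter-efficient (PEFT).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# From here, training on an instruction dataset proceeds with the standard
# transformers Trainer (or trl's SFTTrainer) on top of PyTorch.

Because only the adapter matrices receive gradients while the quantized base weights stay frozen, the memory footprint of such a run is small enough to fit on a single A100-class GPU for models of this size.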

Results and Discussion. The results of fine-tuning and evaluating LLaMA-3.1 offer valuable insights into how this model performs on specific tasks. The evaluation framework proved helpful for efficiently assessing LLM performance on instruction tasks. The research highlights the importance of evaluation for LLM applications: it shows that fine-tuning is not always a good choice, depending on the nature of the model and the specifics of the task, and it draws attention to the problem of overfitting.
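The abstract does not list the specific metrics used. As one hedged illustration of such a metric-based comparison, the sketch below scores model outputs against reference answers with ROUGE and BERTScore via the Hugging Face evaluate library; the example predictions and references are invented for illustration.

import evaluate

# Model outputs paired with reference answers from a held-out instruction set.
predictions = [
    "Paris is the capital of France.",
    "Water boils at 100 degrees Celsius at sea level.",
]
references = [
    "The capital of France is Paris.",
    "At sea level, water boils at 100 degrees Celsius.",
]

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(
    predictions=predictions, references=references, lang="en"
)

print("ROUGE-L:", rouge_scores["rougeL"])
print("BERTScore F1:", sum(bert_scores["f1"]) / len(bert_scores["f1"]))

Computing the same metrics for the base and the fine-tuned model on a held-out instruction set, and comparing them with the scores obtained on the training data, is one practical way to detect the overfitting discussed above.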

Conclusion. The close examination of LLaMA-3.1 contributes to the field of machine learning by offering insights into how this model works and how it can be fine-tuned for specialized tasks. The findings of this research open opportunities for more in-depth studies of LLM applications. They also highlight the importance of efficient evaluation with established metrics.

Keywords: LLMs, GPT, Mixtral, LLaMA, fine-tuning, overfitting


DOI: http://dx.doi.org/10.30970/eli.30.3
