INFLUENCE OF DATA AUGMENTATION ON NAMED ENTITY RECOGNITION USING TRANSFORMER-BASED MODELS

Bohdan Pavlyshenko, I. Drozdov

Abstract


Transformer-based models have demonstrated their effectiveness for natural language processing tasks. Training these models requires large amounts of textual data, and creating a high-quality dataset demands substantial resources for data collection, processing, and annotation. Moreover, building a large dataset for less commonly used languages or narrow domains is a significant challenge, since the available texts are often insufficient to form a comprehensive corpus. Data augmentation is one approach to generating synthetic data, which helps increase the initial dataset size and enhance model performance.
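To make the idea concrete, the following minimal sketch (an illustrative assumption, not the authors' code) shows word-level augmentation for a labeled NER sentence: replacing a non-entity word with a synonym produces a new training example while the entity tags stay valid. The sentence, the toy synonym table, and the rule of touching only O-tagged tokens are assumptions for illustration.

# Minimal illustration: synonym replacement on non-entity tokens
# yields a new sentence whose IOB2 tags remain correct.
tokens = ["Germany", "imported", "sheep", "from", "Britain"]
tags   = ["B-LOC",   "O",        "O",     "O",    "B-LOC"]

synonyms = {"imported": "bought"}  # toy synonym table

augmented = [synonyms.get(tok, tok) if tag == "O" else tok
             for tok, tag in zip(tokens, tags)]
print(augmented)  # ['Germany', 'bought', 'sheep', 'from', 'Britain']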

The main goal of this article is to explore how data augmentation can enhance the capabilities of popular transformer-based models: BERT, ALBERT, DistilBERT, and RoBERTa. The study used CoNLL-2003, one of the most popular datasets for named entity recognition research. During the experiments, reduced versions of the initial dataset were created (down to 20%, 10%, and 5% of the original size) with different approaches to sentence selection. Word-level augmenters were used for data augmentation: antonym replacement, synonym replacement, and word-embedding-based substitution, as well as their combinations. All experiments were conducted on identical hardware to obtain comparable results, and performance was evaluated with the F1 score. The results demonstrate that data augmentation is effective for small datasets, where significant improvements were achieved; as the dataset grows, the impact of augmentation decreases.
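The sketch below outlines such a pipeline using the NlpAug [21] and Seqeval [22] libraries from the reference list. It is a minimal sketch under stated assumptions, not the authors' implementation: the 10% sampling rate, the word2vec file path, and the toy sentences are illustrative. For NER specifically, replacements must keep tokens aligned with their IOB2 tags (e.g., by augmenting only non-entity tokens, as in the example above); label-aware strategies of this kind are analyzed by Dai and Adel [15].

# A minimal sketch of the experimental pipeline described above,
# built with the nlpaug [21] and seqeval [22] libraries.
import random

import nlpaug.augmenter.word as naw   # word-level augmenters
import nlpaug.flow as naf             # pipelines for combining augmenters
from seqeval.metrics import f1_score  # entity-level F1 for tag sequences

# 1) Reduced dataset: e.g., random sentence selection down to 10%.
train_sentences = [
    "EU rejects German call to boycott British lamb .",
    "Peter Blackburn",
    "BRUSSELS 1996-08-22",
]  # toy stand-in for the CoNLL-2003 training sentences
random.seed(42)
subset = random.sample(train_sentences,
                       k=max(1, int(0.10 * len(train_sentences))))

# 2) Word-level augmenters: synonym, antonym, and word-embedding based.
#    SynonymAug/AntonymAug need nltk's wordnet corpus downloaded.
synonym_aug = naw.SynonymAug(aug_src="wordnet")
antonym_aug = naw.AntonymAug()
embedding_aug = naw.WordEmbsAug(
    model_type="word2vec",
    model_path="GoogleNews-vectors-negative300.bin",  # hypothetical local path
    action="substitute",
)
# Chaining augmenters yields the "combinations" mentioned in the abstract.
combined_aug = naf.Sequential([synonym_aug, embedding_aug])

for sentence in subset:
    print(synonym_aug.augment(sentence))   # nlpaug >= 1.1.11 returns a list
    print(antonym_aug.augment(sentence))
    print(combined_aug.augment(sentence))

# 3) Evaluation: seqeval computes entity-level F1 over IOB2 tag
#    sequences, one list of tags per sentence.
y_true = [["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]]
y_pred = [["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]]
print(f1_score(y_true, y_pred))  # 1.0 for a perfect match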

Keywords: named entity recognition, natural language processing, augmentation, BERT, ALBERT, DistilBERT, RoBERTa.



References


  1. Li J., Sun A., Han J., Li C. (2020). A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 34(1), 50–70.
  2. Roy A. (2021). Recent trends in named entity recognition (NER). arXiv preprint arXiv:2101.11420v1.
  3. Yadav V., Bethard S. (2019). A survey on recent advances in named entity recognition from deep learning models. arXiv preprint arXiv:1910.11470v1.
  4. Shen Y., Yun H., Lipton Z. C., Kronrod Y., Anandkumar A. (2017). Deep active learning for named entity recognition. arXiv preprint arXiv:1707.05928v3.
  5. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Polosukhin I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  6. Devlin J., Chang M.-W., Lee K., Toutanova K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805v2.
  7. Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Stoyanov V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692v1.
  8. Sanh V., Debut L., Chaumond J., Wolf T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108v4.
  9. Lan Z., Chen M., Goodman S., Gimpel K., Sharma P., Soricut R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942v6.
  10. Pavlyshenko B. M. (2023). Analysis of Disinformation and Fake News Detection Using Fine-Tuned Large Language Model. arXiv preprint arXiv:2309.04704v1.
  11. Pavlyshenko B. M. (2023). Financial News Analytics Using Fine-Tuned Llama 2 GPT Model. arXiv preprint arXiv:2308.13032v2.
  12. Chen J., Tam D., Raffel C., Bansal M., Yang D. (2021). An Empirical Survey of Data Augmentation for Limited Data Learning in NLP. arXiv preprint arXiv:2106.07499v1.
  13. Feng S., Gangal V., Wei J., Chandar S., Vosoughi S., Mitamura T., Hovy E. (2021). A survey of data augmentation approaches for NLP. arXiv preprint arXiv:2105.03075v5.
  14. Chen S., Aguilar G., Neves L., Solorio T. (2021). Data Augmentation for Cross-Domain Named Entity Recognition. arXiv preprint arXiv:2109.01758v1.
  15. Dai X., Adel H. (2020). An Analysis of Simple Data Augmentation for Named Entity Recognition. arXiv preprint arXiv:2010.11683v1.
  16. Pavlyshenko B., Stasiuk M. (2023). Augmentation in a binary text classification task. 2023 IEEE 13th International Conference on Electronics and Information Technologies (ELIT), Lviv, Ukraine, pp. 177–180.
  17. Pavlyshenko B., Stasiuk M. (2024). Data augmentation in text classification with multiple categories. Electronics and Information Technologies, 25, 67–80. DOI: http://dx.doi.org/10.30970/eli.25.6
  18. Sang E. F., De Meulder F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint arXiv:cs/0306050v1.
  19. Mikolov T., Le Q., Sutskever I. (2013). Exploiting Similarities among Languages for Machine Translation. arXiv preprint arXiv:1309.4168v1.
  20. HuggingFace [Electronic resource]. Access mode: https://huggingface.co/
  21. NlpAug library repository [Electronic resource]. Access mode: https://github.com/makcedward/nlpaug
  22. Seqeval library repository [Electronic resource]. Access mode: https://github.com/chakki-works/seqeval
  23. Pavlyshenko B., Drozdov I. (2023). Named entity recognition using OpenAI GPT series models. Electronics and Information Technologies, 23, 46–58. DOI: http://dx.doi.org/10.30970/eli.23.5




DOI: http://dx.doi.org/10.30970/eli.28.6
