NAMED ENTITY RECOGNITION USING GENERATIVE TRANSFORMER MODELS WITH SENTENCE-LEVEL DATA AUGMENTATION APPROACHES
Abstract
Background. Named entity recognition (NER), one of the key tasks in natural language processing (NLP), plays a vital role in processing and understanding text. Transformer-based models demonstrate exceptional performance on most NLP tasks but require a considerable amount of data for effective training. Building a high-quality annotated dataset for named entity recognition is resource-intensive, especially for low-resource languages. Extending an annotated dataset with synthetic data through augmentation provides an opportunity to increase the effectiveness of NER models. This study aims to use sentence-level augmentations and large language models to improve model performance on small datasets.
Materials and Methods. To investigate the impact of data augmentation, subsets of 5%, 10%, and 20% of the training data were taken from the CoNLL-2003 and OntoNotes 5 datasets, which have different characteristics. Three main approaches were used to construct the augmented data: summarizing sentences with the T5 model and then inserting named entities, paraphrasing sentences with the OpenAI API, and several methods of replacing named entities in the original and synthetic sentences. The BERT, ALBERT, DistilBERT, and RoBERTa models were used for evaluation.
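The full paper details the augmentation pipelines; purely as an illustration, the Python sketch below shows the simplest of the three ideas, replacing entity mentions with other mentions of the same type drawn from the training data. It assumes sentences are stored as parallel token/BIO-tag lists; the function names (collect_mentions, replace_mentions) are hypothetical and not taken from the authors' code.

```python
import random
from collections import defaultdict

def collect_mentions(dataset):
    """Build a pool of entity mentions grouped by type from BIO-tagged sentences.

    `dataset` is an iterable of (tokens, tags) pairs, e.g.
    (["John", "lives", "in", "Kyiv"], ["B-PER", "O", "O", "B-LOC"]).
    """
    pool = defaultdict(list)
    for tokens, tags in dataset:
        i = 0
        while i < len(tags):
            if tags[i].startswith("B-"):
                etype = tags[i][2:]
                j = i + 1
                while j < len(tags) and tags[j] == f"I-{etype}":
                    j += 1
                pool[etype].append(tokens[i:j])
                i = j
            else:
                i += 1
    return pool

def replace_mentions(tokens, tags, pool, rng=random):
    """Return an augmented copy of a sentence in which each entity mention is
    swapped for a random mention of the same type from the pool."""
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{etype}":
                j += 1
            # Keep the original mention if no replacement of this type exists.
            repl = rng.choice(pool[etype]) if pool.get(etype) else tokens[i:j]
            new_tokens.extend(repl)
            new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(repl) - 1))
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tags[i])
            i += 1
    return new_tokens, new_tags

# Toy usage: build a mention pool from two sentences, augment the first one.
train = [
    (["John", "lives", "in", "Kyiv"], ["B-PER", "O", "O", "B-LOC"]),
    (["Maria", "visited", "New", "York"], ["B-PER", "O", "B-LOC", "I-LOC"]),
]
pool = collect_mentions(train)
print(replace_mentions(*train[0], pool))
```

The summarization- and paraphrase-based augmentations described above work at the sentence level instead, generating new contexts around the annotated entities rather than swapping the entities themselves.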
Results and Discussion. The results show that the effectiveness of the different augmentation methods depends strongly on the initial dataset and its quality. For small datasets with few entity categories, sentence-level augmentation through summarization or paraphrasing improves model performance by up to 10%. However, as the dataset grows, artificially created data can degrade recognition results.
Conclusion. Data augmentation for named entity recognition is an effective tool for small datasets and can improve model performance in resource-constrained settings such as specific domains and low-resource languages. However, synthetic data, obtained by extending the context of existing named entities and generating new, synthetic entities, cannot fully replace a larger, better-constructed original dataset.
Keywords: named entity recognition, natural language processing, data augmentation, large language models
References
- Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. Proceedings of NAACL-HLT 2016, 260–270. https://doi.org/10.18653/v1/N16-1030
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Yamada, I., Asai, A., Shindo, H., Takeda, H., & Matsumoto, Y. (2020). LUKE: Deep contextualized entity representations with entity-aware self-attention. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 6442–6454. https://doi.org/10.18653/v1/2020.emnlp-main.523
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. https://doi.org/10.48550/arXiv.1907.11692
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., ... & Lample, G. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. https://doi.org/10.48550/arXiv.2302.13971
- Chen, J., Tam, D., Raffel, C., Bansal, M., & Yang, D. (2023). An empirical survey of data augmentation for limited data learning in NLP. Transactions of the Association for Computational Linguistics, 11, 191-211. https://doi.org/10.1162/tacl_a_00542
- Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., & Hovy, E. (2021). A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.84
- Chen, S., Aguilar, G., Neves, L., & Solorio, T. (2021). Data augmentation for cross-domain named entity recognition. arXiv preprint arXiv:2109.01758. https://arxiv.org/abs/2109.01758
- Dai, X., & Adel, H. (2020). An Analysis of Simple Data Augmentation for Named Entity Recognition. In Proceedings of the 28th International Conference on Computational Linguistics, 3861–3867. International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.343
- Pavlyshenko, B., & Stasiuk, M. (2023). Augmentation in a binary text classification task. In 2023 IEEE 13th International Conference on Electronics and Information Technologies (ELIT) (pp. 177–180). IEEE. https://doi.org/10.1109/ELIT57602.2023.10151742
- Pavlyshenko, B., & Stasiuk, M. (2024). Data augmentation in text classification with multiple categories. Electronics and Information Technologies, 25, 67–80. http://dx.doi.org/10.30970/eli.25.6
- Pavlyshenko, B., & Drozdov, I. (2024). Influence of data augmentation on named entity recognition using transformer-based models. Electronics and Information Technologies, 28, 61–72. http://dx.doi.org/10.30970/eli.28.6
- Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050. https://arxiv.org/abs/cs/0306050
- Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2006). OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers (pp. 57–60). https://aclanthology.org/N06-2015
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67. http://jmlr.org/papers/v21/20-074.html
- HuggingFace. (n.d.). HuggingFace [Computer software]. Retrieved June 2025, from https://huggingface.co/
- OpenAI. (n.d.). OpenAI API [Computer software]. Retrieved June 2025, from https://platform.openai.com/docs/api-reference
- Seqeval library repository. (n.d.). Retrieved June 2025, from https://github.com/chakki-works/seqeval
DOI: http://dx.doi.org/10.30970/eli.31.1