NAMED ENTITY RECOGNITION USING GENERATIVE TRANSFORMER MODELS WITH SENTENCE-LEVEL DATA AUGMENTATION APPROACHES
Abstract
Background. Named entity recognition (NER), one of the key tasks in natural language processing (NLP), plays a vital role in processing and understanding text. Transformer-based models demonstrate exceptional performance on most NLP tasks but require a considerable amount of data for effective training. Building a high-quality annotated dataset for named entity recognition is resource-intensive, especially for low-resource languages. Extending an annotated dataset with synthetic data through augmentation provides an opportunity to increase the effectiveness of NER models. This study aims to use sentence-level augmentations and large language models to improve model performance on small datasets.
Materials and Methods. To investigate the impact of data augmentation, subsets of 5%, 10%, and 20% of the training data were taken from the CoNLL-2003 and OntoNotes 5 datasets, which have different characteristics. Three main approaches were used to construct the augmented data: summarizing sentences with the T5 model and then inserting named entities, paraphrasing sentences with the OpenAI API, and several methods of replacing named entities in the original and synthetic sentences. The BERT, ALBERT, DistilBERT, and RoBERTa models were used for evaluation.
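The full paper details the augmentation pipelines; purely as an illustration, the Python sketch below shows the simplest of the three ideas, replacing entity mentions with other mentions of the same type drawn from the training data. It assumes sentences are stored as parallel token/BIO-tag lists; the function names (collect_mentions, replace_mentions) are hypothetical and not taken from the authors' code.

```python
import random
from collections import defaultdict

def collect_mentions(dataset):
    """Build a pool of entity mentions grouped by type from BIO-tagged sentences.

    `dataset` is an iterable of (tokens, tags) pairs, e.g.
    (["John", "lives", "in", "Kyiv"], ["B-PER", "O", "O", "B-LOC"]).
    """
    pool = defaultdict(list)
    for tokens, tags in dataset:
        i = 0
        while i < len(tags):
            if tags[i].startswith("B-"):
                etype = tags[i][2:]
                j = i + 1
                while j < len(tags) and tags[j] == f"I-{etype}":
                    j += 1
                pool[etype].append(tokens[i:j])
                i = j
            else:
                i += 1
    return pool

def replace_mentions(tokens, tags, pool, rng=random):
    """Return an augmented copy of a sentence in which each entity mention is
    swapped for a random mention of the same type from the pool."""
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{etype}":
                j += 1
            # Keep the original mention if no replacement of this type exists.
            repl = rng.choice(pool[etype]) if pool.get(etype) else tokens[i:j]
            new_tokens.extend(repl)
            new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(repl) - 1))
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tags[i])
            i += 1
    return new_tokens, new_tags

# Toy usage: build a mention pool from two sentences, augment the first one.
train = [
    (["John", "lives", "in", "Kyiv"], ["B-PER", "O", "O", "B-LOC"]),
    (["Maria", "visited", "New", "York"], ["B-PER", "O", "B-LOC", "I-LOC"]),
]
pool = collect_mentions(train)
print(replace_mentions(*train[0], pool))
```

The summarization- and paraphrase-based augmentations described above work at the sentence level instead, generating new contexts around the annotated entities rather than swapping the entities themselves.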
Results and Discussion. The results show that the effectiveness of the different augmentation methods depends strongly on the initial dataset and its quality. For small datasets with few entity categories, sentence-level augmentation through summarization or paraphrasing improves model performance by up to 10%. However, as the dataset grows, artificially created data can degrade recognition results.
Conclusion. Data augmentation for named entity recognition is an effective tool for small datasets and can improve model performance in resource-constrained settings such as specific domains and low-resource languages. However, synthetic data, obtained by extending the context of existing named entities and generating new, synthetic entities, cannot fully replace a larger, better-constructed original dataset.
Keywords: named entity recognition, natural language processing, data augmentation, large language models
References
- Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. Proceedings of NAACL-HLT 2016, 260–270. https://doi.org/10.18653/v1/N16-1030
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Yamada, I., Asai, A., Shindo, H., Takeda, H., & Matsumoto, Y. (2020). LUKE: Deep contextualized entity representations with entity-aware self-attention. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 6442–6454. https://doi.org/10.18653/v1/2020.emnlp-main.523
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. https://doi.org/10.48550/arXiv.1907.11692
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., ... & Lample, G. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971. https://doi.org/10.48550/arXiv.2302.13971
- Chen, J., Tam, D., Raffel, C., Bansal, M., & Yang, D. (2023). An empirical survey of data augmentation for limited data learning in NLP. Transactions of the Association for Computational Linguistics, 11, 191-211. https://doi.org/10.1162/tacl_a_00542
- Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., & Hovy, E. (2021). A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.findings-acl.84
- Chen, S., Aguilar, G., Neves, L., & Solorio, T. (2021). Data augmentation for cross-domain named entity recognition. arXiv preprint arXiv:2109.01758. https://arxiv.org/abs/2109.01758
- Dai, X., & Adel, H. (2020). An Analysis of Simple Data Augmentation for Named Entity Recognition. In Proceedings of the 28th International Conference on Computational Linguistics, 3861–3867. International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.343
- Pavlyshenko, B., & Stasiuk, M. (2023). Augmentation in a binary text classification task. In 2023 IEEE 13th International Conference on Electronics and Information Technologies (ELIT) (pp. 177–180). IEEE. https://doi.org/10.1109/ELIT57602.2023.10151742
- Pavlyshenko, B., & Stasiuk, M. (2024). Data augmentation in text classification with multiple categories. Electronics and Information Technologies, 25, 67–80. http://dx.doi.org/10.30970/eli.25.6
- Pavlyshenko, B., & Drozdov, I. (2024). Influence of data augmentation on named entity recognition using transformer-based models. Electronics and Information Technologies, 28, 61–72. http://dx.doi.org/10.30970/eli.28.6
- Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050. https://arxiv.org/abs/cs/0306050
- Hovy, E., Marcus, M., Palmer, M., Ramshaw, L., & Weischedel, R. (2006). OntoNotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers (pp. 57–60). https://aclanthology.org/N06-2015
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., ... & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67. http://jmlr.org/papers/v21/20-074.html
- HuggingFace. (n.d.). HuggingFace [Computer software]. Retrieved June 2025, from https://huggingface.co/
- OpenAI. (n.d.). OpenAI API [Computer software]. Retrieved June 2025, from https://platform.openai.com/docs/api-reference
- Seqeval library repository. (n.d.). Retrieved June 2025, from https://github.com/chakki-works/seqeval
DOI: http://dx.doi.org/10.30970/eli.31.1