DATA AUGMENTATION IN TEXT CLASSIFICATION WITH MULTIPLE CATEGORIES

Bohdan Pavlyshenko, M. Stasiuk

Abstract


In the modern world, the amount of text data generated every day is enormous. However, because languages differ in how widely they are used in day-to-day life, the amount of data generated in English is much greater than, for example, in Ukrainian. Moreover, a large number of languages may become extinct in the near future. Because of this, there is a demand for methods and techniques that make it possible to preserve endangered languages and to use them effectively in machine learning approaches. One of the established methods for creating new data from already existing information is called augmentation.

The purpose of this article is to investigate the effect of data augmentation on the multi-class text classification task, performed by different transformer models: BERT, DistilBERT, ALBERT, and XLM-RoBERTa. Data for the models' training and testing were taken from HuggingFace. The data were modified using different augmentation techniques: at the word level, synonym, antonym, and contextual word-embedding augmentation were used; at the sentence level, abstractive summarization and LAMBADA augmentation were applied. Rather than implementing training and evaluation directly, the training infrastructure provided by the HuggingFace portal was used. Several metrics of model training efficiency were considered: training time, validation and training loss, accuracy, precision, recall, and F1-score.
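
As a rough illustration of this pipeline, the sketch below combines word-level synonym augmentation from the nlpaug library (Ma, 2019) with multi-class fine-tuning through the HuggingFace Trainer. The dataset name ("ag_news"), model checkpoint, subsample sizes, and hyperparameters are illustrative assumptions and do not reproduce the exact experimental setup of this article.

# Minimal illustrative sketch (not the article's exact setup): word-level synonym
# augmentation with nlpaug and BERT fine-tuning with the HuggingFace Trainer.
# The dataset ("ag_news"), checkpoint, subsample sizes and hyperparameters are assumptions.
import numpy as np
import nlpaug.augmenter.word as naw
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Word-level augmenter: synonym replacement via WordNet.  naw.AntonymAug() and
# naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="substitute")
# are drop-in alternatives for the antonym and contextual-embedding variants.
synonym_aug = naw.SynonymAug(aug_src="wordnet")

dataset = load_dataset("ag_news")                                    # assumed multi-class dataset
num_labels = dataset["train"].features["label"].num_classes
train_raw = dataset["train"].shuffle(seed=42).select(range(2000))    # small subsample for the sketch
test_raw = dataset["test"].select(range(1000))

def augment_batch(batch):
    # Append a synonym-augmented copy of every training text and duplicate its label.
    batch["text"] = batch["text"] + synonym_aug.augment(batch["text"])
    batch["label"] = batch["label"] * 2
    return batch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_raw.map(augment_batch, batched=True).map(tokenize, batched=True)
test_ds = test_raw.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Accuracy, precision, recall and F1-score, as listed in the abstract.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="macro")
    return {"accuracy": accuracy_score(labels, preds),
            "precision": precision, "recall": recall, "f1": f1}

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=num_labels)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-augmented", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())

The same script can be repeated with other checkpoints (distilbert-base-uncased, albert-base-v2, xlm-roberta-base) and other nlpaug augmenters to compare models and augmentation techniques under the same metrics.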

The results of this investigation make it possible to compare the efficiency of each of the considered models on the multi-class text classification task. At the same time, the efficiency of the different text augmentation techniques was estimated. This is valuable for selecting the most suitable combination of transformer model and augmentation technique to obtain the best performance in classification with multiple categories.

Keywords: augmentation, multi-class text classification, BERT, ALBERT, DistilBERT, XLM-RoBERTa.



References


  1. Shorten C., Khoshgoftaar T. M., Furht B. (2021). Text data augmentation for deep learning. Journal of Big Data, 8, 1-34.
  2. Wei J., Zou K. (2019). EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196.
  3. Romaine S. (2007). Preserving endangered languages. Language and Linguistics Compass, 1(1-2), 115-132.
  4. Magueresse A., Carles V., Heetderks E. (2020). Low-resource languages: A review of past work and future challenges. arXiv preprint arXiv:2006.07264.
  5. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Polosukhin I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
  6. Pavlyshenko B. M. (2023). Analysis of Disinformation and Fake News Detection Using Fine-Tuned Large Language Model. arXiv preprint arXiv:2309.04704.
  7. Pavlyshenko B. M. (2023). Financial News Analytics Using Fine-Tuned Llama 2 GPT Model. arXiv preprint arXiv:2308.13032.
  8. Pavlyshenko B. M. (2022). Methods of Informational Trends Analytics and Fake News Detection on Twitter. arXiv preprint arXiv:2204.04891.
  9. Devlin J., Chang M. W., Lee K., Toutanova K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  10. Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Stoyanov V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  11. Sanh V., Debut L., Chaumond J., Wolf T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  12. Lan Z., Chen M., Goodman S., Gimpel K., Sharma P., Soricut R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  13. Zhang X., Zhao J., LeCun Y. (2015). Character-level convolutional networks for text classification. Advances in neural information processing systems, 28.
  14. HuggingFace [Electronic resource]. Access mode: https://huggingface.co/
  15. Ma E. (2019). NLP augmentation. GitHub repository: https://github.com/makcedward/nlpaug




DOI: http://dx.doi.org/10.30970/eli.25.6
