SEMANTIC SIMILARITY ANALYSIS USING TRANSFORMER-BASED SENTENCE EMBEDDINGS
Abstract
Background. Transformer-based models have become central to natural language processing, demonstrating state-of-the-art performance in semantic similarity assessment, a task critical for various applications. These models capture nuanced relationships between texts, advancing the ability to gauge semantic relatedness.
Materials and Methods. The performance of sentence embedding models, including all-mpnet-base-v2, all-MiniLM-L6-v2, paraphrase-multilingual-mpnet-base-v2, bge-base-en-v1.5, all-roberta-large-v1, all-distilroberta-v1, LaBSE, paraphrase-MiniLM-L3-v2, and bge-large-en-v1.5, was assessed across different dataset sizes using two datasets. The following preprocessing steps were applied to the datasets: lowercasing, stop-word removal, removal of special symbols and numbers, and lemmatization. Cosine similarity scores with negative values, indicating semantic dissimilarity, were treated as equivalent to a human-annotated similarity score of 0, and non-negative cosine similarity values were scaled to the 0-5 range. Metrics such as R2, MSE, RMSE, MAE, Spearman's correlation coefficient, and Kendall's tau were used for evaluation.
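The rescaling of cosine similarity onto the annotation range and the evaluation metrics described above can be sketched as follows (a minimal illustration in Python with NumPy and SciPy; the function names and toy scores are assumptions for illustration, not the study's actual code):

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def scale_similarity(cos_sim):
    """Map cosine similarity onto the 0-5 annotation range:
    negative values (semantic dissimilarity) are clipped to 0,
    non-negative values are scaled linearly to [0, 5]."""
    cos_sim = np.asarray(cos_sim, dtype=float)
    return np.clip(cos_sim, 0.0, None) * 5.0

def evaluate(predicted, gold):
    """Compute R2, MSE, RMSE, MAE, Spearman's rho, and Kendall's tau
    between model-derived and human-annotated similarity scores."""
    predicted = np.asarray(predicted, dtype=float)
    gold = np.asarray(gold, dtype=float)
    errors = predicted - gold
    mse = float(np.mean(errors ** 2))
    ss_res = float(np.sum(errors ** 2))
    ss_tot = float(np.sum((gold - gold.mean()) ** 2))
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAE": float(np.mean(np.abs(errors))),
        "Spearman": spearmanr(predicted, gold)[0],
        "KendallTau": kendalltau(predicted, gold)[0],
    }
```

For example, scale_similarity([-0.3, 0.5, 1.0]) yields [0.0, 2.5, 5.0]. In practice, the raw cosine scores would come from one of the evaluated embedding models, e.g. all-MiniLM-L6-v2 loaded via the sentence-transformers library.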
Results and Discussion. Model performance generally improves with increased data. Evaluation of the sentence embedding models revealed notable performance variations. all-roberta-large-v1 showed strong accuracy with high R2 values and low errors. BAAI/bge-large-en-v1.5 excelled in capturing semantic relationships, demonstrating high Spearman's and Kendall's tau coefficients. all-MiniLM-L6-v2 demonstrated the fastest embedding generation. BAAI/bge-base-en-v1.5 showed the lowest accuracy. Processing times generally increase with data size.
Conclusion. This study highlights a trade-off between accuracy and efficiency in sentence embedding. Model selection depends on balancing these factors to align with specific application needs. Applications requiring high accuracy should favor all-roberta-large-v1, while those prioritizing speed would benefit from all-MiniLM-L6-v2. BAAI/bge-large-en-v1.5 is most suitable for tasks demanding fine-grained semantic understanding of text.
Keywords: semantic similarity, sentence embeddings, transformers.
References
- Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of machine learning research, 3(Feb), 1137-1155. https://doi.org/10.1162/153244303322533223
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. https://doi.org/10.48550/arXiv.1301.3781
- Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543). https://doi.org/10.3115/v1/D14-1162
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. https://doi.org/10.48550/arXiv.1706.03762
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) (pp. 4171-4186). https://doi.org/10.18653/v1/N19-1423
- Pavlyshenko, B., & Stasiuk, M. (2024). Data augmentation in text classification with multiple categories. Electronics and information technologies, (25). http://dx.doi.org/10.30970/eli.25.6
- Pavlyshenko, B., & Stasiuk, M. (2025). Using Large Language Models for Data Augmentation in Text Classification Models. International Journal of Computing, 24(1), 148-154. https://doi.org/10.47839/ijc.24.1.3886
- Pavlyshenko, B. (2014). Clustering of authors’ texts of english fiction in the vector space of semantic fields. Cybernetics and Information Technologies, 14(3), 25-36. https://doi.org/10.2478/cait-2014-0030
- Pavlyshenko, B. (2013). Classification analysis of authorship fiction texts in the space of semantic fields. Journal of Quantitative Linguistics, 20(3), 218–226. https://doi.org/10.1080/09296174.2013.799914
- Han, M., Zhang, X., Yuan, X., Jiang, J., Yun, W., & Gao, C. (2021). A survey on the techniques, applications, and performance of short text semantic similarity. Concurrency and Computation: Practice and Experience, 33(5), e5971. https://doi.org/10.1002/cpe.5971
- Risch, J., Möller, T., Gutsch, J., & Pietsch, M. (2021). Semantic answer similarity for evaluating question answering models. https://doi.org/10.48550/arXiv.2108.06130
- Vrbanec, T., & Meštrović, A. (2017, May). The struggle with academic plagiarism: Approaches based on semantic similarity. In 2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (pp. 870-875). IEEE. https://doi.org/10.23919/MIPRO.2017.7973544
- Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. https://doi.org/10.18653/v1/S17-2001
- mteb/stsbenchmark-sts [Dataset]. Hugging Face. Retrieved from https://huggingface.co/datasets/mteb/stsbenchmark-sts
- Song, K., Tan, X., Qin, T., Lu, J., & Liu, T. Y. (2020). MPNet: Masked and permuted pre-training for language understanding. Advances in neural information processing systems, 33, 16857-16867. https://doi.org/10.48550/arXiv.2004.09297
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. https://doi.org/10.48550/arXiv.1907.11692
- Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT sentence embedding. https://doi.org/10.48550/arXiv.2007.01852
DOI: http://dx.doi.org/10.30970/eli.30.4
Electronics and information technologies / Електроніка та інформаційні технології