Development and Evaluation of the ModernUzBERT Semantic Embedding Model with Extended Context for the Uzbek Language

I.Sh. Khujayarov; M.M. Ochilov; O.A. Kholmatov; V.I. Jumanov

doi:10.62132/ijdt.v9i2.375


Published: May 15, 2026


Keywords:

NLP, Uzbek language, ModernBERT, semantic embedding, information retrieval, long context, Flash Attention

I.Sh. Khujayarov

Tashkent University of Information Technologies named after Muhammad al-Khwarizmi

M.M. Ochilov

Tashkent University of Information Technologies named after Muhammad al-Khwarizmi

O.A. Kholmatov

Tashkent University of Information Technologies named after Muhammad al-Khwarizmi

V.I. Jumanov

State Institution “Center for the Development of Information and Communication Technologies in Justice Bodies and Institutions”

Abstract

This study presents ModernUzBERT, an embedding model for the Uzbek language built on a modern Transformer architecture with an extended context window of 8192 tokens. For low-resource, agglutinative languages such as Uzbek, a limited ability to process long context significantly hampers model performance. To address this, a large-scale corpus of Uzbek text in Latin script comprising over 125 million tokens was curated, and an optimized vocabulary of 52,000 subword units was developed. Training proceeded in two stages: the model was first pre-trained from scratch with the Masked Language Modeling (MLM) objective, and then adapted for semantic search via Supervised Fine-Tuning (SFT) on a dataset of 30,000 question-answer pairs. The use of Flash Attention 2 and unpadding substantially improved the efficiency of GPU resource utilization. The 8192-token context window enables the analysis of long documents while preserving semantic integrity. The developed model and all its components have been released openly to the scientific community on the Hugging Face platform.
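The abstract names unpadding but does not describe how it is implemented. As a rough illustration of the general idea only (not the authors' code), the sketch below contrasts conventional right-padding with an unpadded layout, in which all real tokens are concatenated into one flat sequence and the sequence boundaries are stored as cumulative lengths. The token ids, the `PAD_ID` value, and the helper names `pad_batch`, `unpad_batch`, and `repack` are assumptions made for this example.

```python
from itertools import accumulate

PAD_ID = 0  # hypothetical padding token id, chosen for this example only


def pad_batch(sequences, pad_id=PAD_ID):
    """Conventional batching: right-pad every sequence to the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def unpad_batch(sequences):
    """Unpadded batching: concatenate all real tokens and record boundaries.

    Returns (flat_tokens, cu_seqlens), where sequence i occupies
    flat_tokens[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    flat_tokens = [tok for seq in sequences for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return flat_tokens, cu_seqlens


def repack(flat_tokens, cu_seqlens):
    """Recover the original per-sequence lists from the flat layout."""
    return [flat_tokens[start:end]
            for start, end in zip(cu_seqlens, cu_seqlens[1:])]


# Three toy sequences of different lengths (made-up token ids).
batch = [[11, 12, 13, 14, 15, 16], [21, 22], [31, 32, 33]]

padded = pad_batch(batch)
flat, cu_seqlens = unpad_batch(batch)

padded_slots = sum(len(row) for row in padded)  # positions processed with padding
real_tokens = len(flat)                         # positions processed without padding

print("padded positions:  ", padded_slots)
print("unpadded positions:", real_tokens)
print("cu_seqlens:        ", cu_seqlens)
```

In this toy batch the padded layout spends 18 positions on 11 real tokens, and the gap generally widens as sequence lengths in a batch become more uneven, which is common when documents range from a few tokens up to a long context window. Attention kernels that accept variable-length input use boundary offsets of this kind to keep tokens from different sequences from attending to one another.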

How to Cite

Khujayarov, I., Ochilov, M., Kholmatov, O., & Jumanov, V. (2026). Development and Evaluation of the ModernUzBERT Semantic Embedding Model with Extended Context for the Uzbek Language. INTERNATIONAL JOURNAL OF THEORETICAL AND APPLIED ISSUES OF DIGITAL TECHNOLOGIES, 9(2), 45–53. https://doi.org/10.62132/ijdt.v9i2.375

Issue

Vol. 9 No. 2 (2026): International journal of theoretical and applied issues of digital technologies

Section

Articles

This work is licensed under a Creative Commons Attribution 4.0 International License.

