Detection and correction of spelling errors in Uzbek texts based on machine learning algorithms
Abstract
This study addresses the detection and correction of spelling errors in Uzbek texts. Because Uzbek is agglutinative and morphologically complex, traditional spell-checking methods do not achieve sufficient accuracy. The proposed approach therefore uses the Levenshtein distance to measure word similarity and language models to correct errors in context. Three language models were compared: KenLM (a statistical language model), LSTM (Long Short-Term Memory), and BiLSTM (Bidirectional LSTM). A text corpus of 80 million words was collected and analyzed for model training. In testing, the BiLSTM model achieved the highest spelling-correction accuracy (90.09%), followed by the LSTM model (84.62%) and the KenLM model (62.21%). These findings indicate that deep learning models capable of contextual analysis can significantly improve the automatic detection and correction of spelling errors in Uzbek. Future work will apply transformer models, expand annotated corpora, and develop models that account for the morphological characteristics of Uzbek.
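As an illustration of the word-similarity step named in the abstract, the following is a minimal sketch of the Levenshtein distance and of how it could shortlist correction candidates. It is not the authors' implementation: the function names, the toy vocabulary, and the `max_distance` threshold are assumptions made for this example, and the language-model ranking of candidates in context is left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to limit row length.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def candidates(word, vocabulary, max_distance=1):
    """Return vocabulary words within max_distance edits, closest first.
    (Hypothetical helper; the threshold is an assumed parameter.)"""
    scored = sorted((levenshtein(word, entry), entry) for entry in vocabulary)
    return [entry for distance, entry in scored if distance <= max_distance]


# Toy vocabulary, assumed for illustration only.
vocabulary = ["kitob", "kitoblar", "maktab"]
suggestions = candidates("kitop", vocabulary)
```

In a full pipeline of the kind the abstract describes, a language model would then choose among such candidates according to the surrounding words.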

This work is licensed under a Creative Commons Attribution 4.0 International License.