Evaluating the Effectiveness of Speaker Recognition Algorithms Based on Speech Signals

K.E. Shukurov
U.K. Khasanov

Abstract

This article analyzes the effectiveness of different models for speaker recognition and selects the best one for the system. The classical MFCC + cosine similarity architecture is compared with the modern x-vector and ECAPA-TDNN + PLDA architectures in terms of accuracy and speed. Using a dataset compiled from different speakers, the models are evaluated on accuracy, F1 score, equal error rate (EER), latency, and GPU utilization. According to the experimental results, the ECAPA-TDNN model outperforms the other models, reaching an accuracy of 95.7%. Because the speaker recognition stage is also important for speaker diarization systems, its accuracy is highly relevant. The ECAPA-TDNN + PLDA model offers good solutions in terms of computational resource usage and the processing and analysis of large datasets.
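
As an illustration of two quantities named in the abstract, the following minimal sketch shows how a cosine similarity score between two speaker embeddings and an approximate EER over a set of trial scores could be computed. It is not the authors' implementation: the function names, the threshold sweep, and the toy scores are assumptions made for the example.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over all observed scores.

    target_scores: scores of same-speaker trials.
    nontarget_scores: scores of different-speaker trials.
    Returns the mean of the false acceptance and false rejection rates
    at the threshold where the two rates are closest.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        frr = np.mean(target < t)       # same-speaker trials rejected
        far = np.mean(nontarget >= t)   # different-speaker trials accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)


# Toy data, assumed for illustration only.
same_speaker = [0.9, 0.8]
different_speaker = [0.1, 0.2]
eer = equal_error_rate(same_speaker, different_speaker)
```

A lower EER means the scoring back end separates same-speaker and different-speaker trials more cleanly; with the fully separated toy scores above, the sweep finds a threshold at which both error rates are zero.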

Article Details

How to Cite
Shukurov, K., & Khasanov, U. (2026). Evaluating the Effectiveness of Speaker Recognition Algorithms Based on Speech Signals. International Journal of Theoretical and Applied Issues of Digital Technologies, 9(1), 80–89. https://doi.org/10.62132/ijdt.v9i1.325
Section
Articles
