A Comparative Evaluation of Similarity Measures for Semi-Supervised Density Peak Clustering


Ahmed Saad Hussein

Abstract

The Semi-supervised Density Peak clustering algorithm (SDenPeak) is known for its effectiveness and simplicity in clustering tasks. It improves clustering performance by adding pairwise constraints, namely must-link and cannot-link constraints, which guide the grouping process by specifying which data points are similar and which are dissimilar. A key factor in clustering accuracy is the choice of similarity measure, since different measures capture different structural characteristics of the data. No single measure is universally best: the appropriate choice depends on the nature of the data, which makes selection difficult. This study systematically evaluates the effect of six similarity measures on the performance of SDenPeak: Euclidean distance, cosine similarity, city block (Manhattan) distance, Minkowski distance, Earth Mover's Distance, and RapidMic (rapid computation of the maximal information coefficient). Extensive experiments on real-world datasets were carried out to assess the clustering accuracy and structural consistency obtained with each measure. The results provide a comparison of the effectiveness of the measures and illustrate their suitability for different data distributions, offering practical guidance for achieving the best clustering performance in semi-supervised models.
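To make the compared measures concrete, the following is a minimal Python sketch of four of the six measures named above (Euclidean, Manhattan, Minkowski, cosine). It is an illustration only, not the implementation used in the study: the function names, the sample vectors, and the default order p = 3 for the Minkowski distance are assumptions made here. Earth Mover's Distance and RapidMic are omitted because they need more machinery than a short sketch allows.

```python
import numpy as np


def euclidean(a, b):
    """Euclidean (L2) distance between two vectors."""
    return float(np.linalg.norm(a - b))


def manhattan(a, b):
    """City block (Manhattan, L1) distance: sum of absolute differences."""
    return float(np.abs(a - b).sum())


def minkowski(a, b, p=3):
    """Minkowski distance of order p (p=3 is an arbitrary illustrative default).

    p=1 reduces to the Manhattan distance and p=2 to the Euclidean distance.
    """
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (a similarity, not a distance)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical sample points: y is a scaled copy of x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print("euclidean:", euclidean(x, y))
print("manhattan:", manhattan(x, y))
print("minkowski (p=3):", minkowski(x, y))
print("cosine similarity:", cosine_similarity(x, y))
```

Because y is a scaled copy of x, the cosine similarity treats the two points as identical in direction while the three distances report them as clearly separated. This is one simple way in which different measures can reflect different structural characteristics of the same data.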

Article Information

How to Cite
Hussein, A. S. (2026). A Comparative Evaluation of Similarity Measures for Semi-Supervised Density Peak Clustering. Международный Журнал Теоретических и Прикладных Вопросов Цифровых Технологий, 9(2), 7–19. https://doi.org/10.62132/ijdt.v9i2.371
Section
Articles
