A Comparative Evaluation of Similarity Measures for Semi-Supervised Density Peaks Clustering

Ahmed Saad Hussein

Abstract

The Semi-supervised Density Peak (SDenPeak) algorithm is known to be simple and efficient for clustering tasks. It improves clustering performance by adding pairwise constraints, namely must-link and cannot-link constraints, which guide the grouping process by imposing similarity or dissimilarity between data points. The choice of similarity measure is a key factor in clustering accuracy because different measures capture different structural properties of the data. Because no similarity measure is universally best, choosing a suitable one is difficult and depends on the nature of the data. This study systematically evaluates the effect of six measures on the performance of SDenPeak: Euclidean Distance, Cosine Similarity, City Block (Manhattan) Distance, Minkowski Distance, Earth Mover’s Distance (EMD), and Rapid Computation of the Maximal Information Coefficient (RapidMIC) Distance. Extensive experiments on real-world datasets assess the clustering accuracy and structural consistency obtained with each measure. The findings compare the effectiveness of the measures and illustrate their applicability to different data distributions, providing a useful guide to achieving the best clustering performance in semi-supervised models.
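To make the compared measures concrete, the following is a minimal Python sketch of five of the six measures using their textbook definitions. It is an illustration only and not the article's implementation: the function names are hypothetical, cosine similarity is converted to a distance as one minus the similarity (an assumption), the EMD is shown only for the special case of one-dimensional histograms on shared, unit-spaced bins, and RapidMIC is omitted because it requires a dedicated estimator.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    """Euclidean distance: the Minkowski distance with p = 2."""
    return minkowski(x, y, 2)


def city_block(x, y):
    """City block (Manhattan) distance: the Minkowski distance with p = 1."""
    return minkowski(x, y, 1)


def cosine_distance(x, y):
    """One minus cosine similarity (assumed conversion to a distance).

    Both vectors must be non-zero.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def emd_1d(u, v):
    """EMD between two 1-D histograms on the same unit-spaced bins.

    Each histogram is normalised to total mass 1; in this special case the
    EMD equals the sum of absolute differences of the cumulative sums.
    """
    total_u, total_v = sum(u), sum(v)
    cum_u = cum_v = 0.0
    work = 0.0
    for a, b in zip(u, v):
        cum_u += a / total_u
        cum_v += b / total_v
        work += abs(cum_u - cum_v)
    return work


if __name__ == "__main__":
    x, y = [0.0, 0.0], [3.0, 4.0]
    print("euclidean :", euclidean(x, y))
    print("city block:", city_block(x, y))
    print("minkowski3:", minkowski(x, y, 3))
    print("cosine    :", cosine_distance([1.0, 0.0], [0.0, 1.0]))
    print("emd (1-D) :", emd_1d([1, 0, 0], [0, 0, 1]))
```

Any of these functions could be used to fill the pairwise distance matrix that a density-peaks style algorithm consumes, which is where the choice of measure would influence the resulting clusters.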

How to Cite
Hussein, A. S. (2026). A Comparative Evaluation of Similarity Measures for Semi-Supervised Density Peaks Clustering. INTERNATIONAL JOURNAL OF THEORETICAL AND APPLIED ISSUES OF DIGITAL TECHNOLOGIES, 9(2), 7–19. https://doi.org/10.62132/ijdt.v9i2.371
