Abstract:
Affinity propagation (AP) clustering automatically determines both the number of clusters and their centers, but the number of clusters it returns often differs substantially from the inherent cluster structure of the dataset. This paper therefore proposes a method for determining the number of potential clusters in a dataset. The similarity matrix is built from the squared Euclidean distance between each pair of data points, and a preference update formula, parameterized by the sample size and the median of the off-diagonal elements of the similarity matrix, is established to determine the number of clusters. Similarity and availability are summed to form an affinity matrix, and the data points whose main-diagonal entries in this matrix are positive are taken as cluster centroids, so that the number of clusters and the number of centroids verify each other. Simulations on both random and real datasets, evaluated with several performance metrics and with algorithm running time, show that the proposed method not only estimates the number of clusters accurately but also accelerates convergence, making it suitable for big data applications.
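
As a rough illustration only, the two quantities that the abstract names as parameters of the preference update formula (the sample size and the median of the off-diagonal similarities) could be computed as in the sketch below. The update formula itself is not stated in the abstract and is not reproduced here. Negating the squared distance follows the usual AP convention and is an assumption, as are the function names `similarity_matrix` and `preference_inputs` and the toy data points.

```python
from statistics import median


def similarity_matrix(points):
    """Pairwise similarity: negative squared Euclidean distance.

    Assumption: the sign follows the usual AP convention, so that
    closer points have larger (less negative) similarity.
    """
    n = len(points)
    return [
        [-sum((a - b) ** 2 for a, b in zip(points[i], points[k])) for k in range(n)]
        for i in range(n)
    ]


def preference_inputs(points):
    """Return (sample size, median of the off-diagonal similarities).

    These are only the inputs to a preference formula; the formula
    itself is not given in the abstract and is not implemented here.
    """
    s = similarity_matrix(points)
    n = len(points)
    off_diagonal = [s[i][k] for i in range(n) for k in range(n) if i != k]
    return n, median(off_diagonal)


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
n, med = preference_inputs(points)
print(n, med)
```

In standard AP the median similarity is a common default for the preference; the proposed method instead combines it with the sample size, in a way the abstract does not spell out.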