Determination of the optimal number of clusters in AP clustering algorithm

  • Abstract: Affinity propagation (AP) clustering automatically searches for the number of clusters and their centers, but the number of clusters it returns differs considerably from the inherent cluster structure of the data. A method for determining the latent number of clusters in a dataset is therefore proposed. The squared Euclidean distances between all pairs of data points form the similarity matrix, and a preference update formula, parameterized by the sample size and the median of the off-diagonal elements of the similarity matrix, is established to determine the number of clusters. Similarity and availability are summed to form an affinity matrix, and the main-diagonal elements of this matrix that take positive values are taken as the cluster centroids, so that the number of clusters and the number of centroids verify each other. Simulations on random and real datasets, evaluated with several performance metrics and the running time of the algorithm, show that the proposed method not only estimates the number of clusters accurately but also speeds up convergence, making it suitable for big data applications.
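The quantities named in the abstract can be illustrated with a minimal sketch. It is not the proposed method: the abstract does not state the preference update formula, so the sketch uses the median of the off-diagonal similarities as the preference, which is the usual default in standard AP. It also assumes the standard AP convention of negating the squared Euclidean distance, and it selects exemplars from the positive diagonal of responsibility plus availability, which may differ from the affinity matrix described above. The function names, the damping factor, and the iteration count are illustrative assumptions.

```python
from statistics import median


def similarity_matrix(points):
    """Negative squared Euclidean distance between every pair of points."""
    n = len(points)
    return [[-sum((a - b) ** 2 for a, b in zip(points[i], points[k]))
             for k in range(n)] for i in range(n)]


def median_preference(S):
    """Median of the off-diagonal similarities (standard AP default,
    used here as a stand-in for the paper's unstated update formula)."""
    n = len(S)
    return median(S[i][k] for i in range(n) for k in range(n) if i != k)


def affinity_propagation(S, preference, damping=0.5, iterations=200):
    """Standard AP message passing; returns the indices of the exemplars.

    Requires at least two points. Runs a fixed number of iterations
    instead of checking for convergence, to keep the sketch short.
    """
    n = len(S)
    S = [row[:] for row in S]          # copy so the caller's matrix is kept
    for i in range(n):
        S[i][i] = preference           # preference goes on the diagonal
    R = [[0.0] * n for _ in range(n)]  # responsibilities
    A = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        for i in range(n):
            vals = [A[i][k] + S[i][k] for k in range(n)]
            first = max(range(n), key=vals.__getitem__)
            max1 = vals[first]
            max2 = max(v for k, v in enumerate(vals) if k != first)
            for k in range(n):
                new = S[i][k] - (max2 if k == first else max1)
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        # a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
        # a(k,k) = sum of positive r(i',k), i' != k
        for k in range(n):
            pos = [max(R[i][k], 0.0) for i in range(n)]
            pos[k] = R[k][k]           # self-responsibility is not clipped
            total = sum(pos)
            for i in range(n):
                if i == k:
                    new = total - pos[k]
                else:
                    new = min(0.0, total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # A point is an exemplar when its diagonal criterion value is positive.
    return [k for k in range(n) if R[k][k] + A[k][k] > 0]


if __name__ == "__main__":
    # Two small, well-separated groups on a line (toy data for illustration).
    points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
    S = similarity_matrix(points)
    p = median_preference(S)
    exemplars = affinity_propagation(S, p)
    print("preference:", p)
    print("exemplar indices:", exemplars)
    print("estimated number of clusters:", len(exemplars))
```

Counting the positive diagonal entries gives the number of exemplars, which is the quantity the abstract cross-checks against the estimated number of clusters. Lowering the preference generally yields fewer clusters and raising it yields more, which is why the choice of preference controls the cluster count.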

     
