Determination of the Optimal Number of Clusters in the K-Means Algorithm

  • Abstract: The K-means clustering algorithm is a classic algorithm in both academia and industry. However, it has two notable defects: (1) the number of clusters must be known in advance, and (2) it is highly sensitive to the random initialization of the algorithm. To address these problems, this paper first summarizes the basic steps of the K-means algorithm and analyzes clustering validity. Then, based on the Euclidean distances between data points, the sum of between-cluster centroid distances and the sum of within-cluster distances are defined as functions of the number of clusters k, and a cluster validity evaluation function is constructed from them. Finally, following an empirical rule, the optimal number of clusters for a data set is determined by minimizing the cluster validity evaluation function over the possible range of cluster numbers. Simulation results on three UCI data sets (Iris, Seeds, and Wine) show that the proposed cluster validity evaluation function not only accurately reflects the true cluster structure of the data but also effectively suppresses the algorithm's sensitivity to random initialization; repeated runs of the K-means algorithm further verify the robustness of the cluster validity evaluation function.
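The search procedure described in the abstract can be illustrated with a minimal sketch. The exact form of the validity evaluation function is not given here, so the score used below (within-cluster distance sum divided by the sum of between-cluster centroid distances) is only an assumed placeholder; the search range k ∈ [2, √n] follows the common empirical rule, and scikit-learn's KMeans stands in for the paper's own implementation.

```python
# Sketch: scan candidate cluster numbers, score each clustering with a
# validity function built from within-cluster and between-centroid distances,
# and keep the k that minimizes the score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

def validity(X, labels, centroids):
    # Hypothetical validity score (not the paper's exact formula):
    # within-cluster distance sum / sum of pairwise centroid distances,
    # smaller is better.
    within = sum(np.linalg.norm(X[labels == i] - c, axis=1).sum()
                 for i, c in enumerate(centroids))
    between = sum(np.linalg.norm(ci - cj)
                  for i, ci in enumerate(centroids)
                  for cj in centroids[i + 1:])
    return within / between

X = load_iris().data                 # Iris: one of the UCI data sets used in the paper
k_max = int(np.sqrt(len(X)))         # empirical rule: search k in [2, sqrt(n)]

scores = {}
for k in range(2, k_max + 1):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = validity(X, km.labels_, km.cluster_centers_)

best_k = min(scores, key=scores.get)  # k minimizing the validity function
print("optimal k =", best_k)
```

Because the score is recomputed from a fresh KMeans fit for every candidate k, rerunning the loop with different random seeds gives a simple way to check how stable the selected k is, which is the robustness property the abstract reports.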

     
