Abstract:
K-means clustering algorithm is a classic algorithm in academic and industrial fields. However, it has two most obvious defeats: one is that the number of clusters needs to be known in advance; the other is that it is very sensitive to the random initialization of the algorithm. In order to solve these problems, this paper summarizes the basic step of K-means algorithm and analyzes the clustering validity. Then, based on the Euclidean distance of the data points, the sum of centroid distances between classes and the sum of distances within cluster with the number of clusters
k as the independent variable are defined, and the cluster validity evaluation function is constructed. Finally, according to the empirical rules, the optimal number of clusters in the data set is determined by solving the minimum value of the cluster validity evaluation function within the possible range of number of clusters. The simulation results of the three UCI datasets Iris, Seeds, and Wine shows that the proposed cluster validity evaluation function can not only accurately reflect the true cluster structure of the data, but also effectively suppress the sensitivity of the algorithm to random initialization. The multiple runs of the K-means algorithm also verify the robustness of the cluster validity evaluation function.