Text Clustering Method with Improved Fitness Function and Cosine Similarity Measure

Abstract: The K-means algorithm plays an important role in text clustering because of its simplicity and efficiency. However, the traditional K-means algorithm is sensitive to the choice of initial points and easily falls into a local optimum, and combining it with a genetic algorithm has become a common way to address this. While retaining the efficiency of K-means, this paper uses the global adaptive optimization of the genetic algorithm to overcome the sensitivity to initial points. In addition, the cosine measure is used to evaluate the similarity between objects, and on this basis a new fitness function for the genetic algorithm, a new convergence criterion, and a new population update scheme for the genetic algorithm are constructed. Experimental results show that the new method improves the clustering accuracy and the stability of the combined K-means and genetic algorithm.
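
To make the idea of a cosine-based fitness function concrete, the following is a minimal sketch and not the formulation used in this paper, which the abstract does not spell out. It assumes that a genetic-algorithm individual encodes a set of cluster centroids and that its fitness is the mean cosine similarity between each document vector and its most similar centroid. The names `cosine_similarity`, `assign`, and `fitness`, as well as the toy data, are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign(docs, centroids):
    """For each document, return the index of its most similar centroid."""
    return [
        max(range(len(centroids)), key=lambda k: cosine_similarity(d, centroids[k]))
        for d in docs
    ]


def fitness(docs, centroids):
    """Assumed fitness: mean cosine similarity of each document to its best centroid.

    Higher is better; this is an illustrative choice, not the paper's definition.
    """
    total = sum(max(cosine_similarity(d, c) for c in centroids) for d in docs)
    return total / len(docs)


# Toy term-weight vectors: two documents along each axis.
docs = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
good = [[1.0, 0.0], [0.0, 1.0]]  # one centroid per direction
poor = [[1.0, 1.0], [1.0, 1.0]]  # both centroids on the diagonal

labels = assign(docs, good)
score_good = fitness(docs, good)
score_poor = fitness(docs, poor)
```

Under this assumed definition, the individual `good` scores higher than `poor`, so a selection step would favor it. Because the cosine measure ignores vector length, documents of different lengths that point in the same direction are treated as equally similar to a centroid.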