A Bipartite Graph Text Clustering Algorithm Using Word Hypercliques

Clustering Algorithm of Bipartite Graph Partition Based on Word Hyperclique

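  • Abstract (translated from the Chinese): Traditional text clustering methods that cluster on inter-document similarity have known shortcomings. Building on conventional text mining, this paper proposes a new text clustering algorithm: bipartite graph text clustering using word hypercliques. The algorithm uses association patterns among the words in documents to assess inter-document similarity and to predict topic categories, and it applies a graph partitioning strategy to greatly reduce the complexity of comparing document similarities. Using hypercliques as an extension of the feature structure also reduces, to some extent, the loss of linguistic information and improves clustering quality. Experiments show that the algorithm is effective.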

     

    Abstract: This paper proposes a new algorithm for document-word co-clustering. Word hyperclique patterns are first mined to capture document semantics, and the document dataset is then described as a bipartite graph. An efficient graph partitioning algorithm is applied to this graph, which avoids the high computational overhead that traditional clustering algorithms incur on huge document datasets. Word hyperclique patterns, which carry rich document semantics, are preserved during clustering. In this way, the algorithm partially circumvents the loss of document semantics that is common in traditional clustering algorithms based on pairwise document similarity alone. Extensive experimental results demonstrate the effectiveness of the algorithm in both document clustering accuracy and cluster topic detection.
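
    The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the usual h-confidence definition of a hyperclique pattern (support of the whole word set divided by the largest single-word support), mines only word pairs, and links each document to the patterns it contains. Because the abstract does not name the graph partitioning algorithm, connected components stand in for it here. The names `h_confidence`, `mine_hypercliques`, `cluster_documents`, the threshold 0.6, and the toy documents are all hypothetical.

```python
from itertools import combinations


def h_confidence(pattern, docs):
    """Support of the whole pattern divided by the largest single-word support."""
    joint = sum(1 for d in docs if pattern <= d)
    largest = max(sum(1 for d in docs if w in d) for w in pattern)
    return joint / largest if largest else 0.0


def mine_hypercliques(docs, size=2, min_hconf=0.6):
    """Brute-force search for word sets whose h-confidence meets the threshold."""
    vocab = sorted(set().union(*docs))
    return [frozenset(c) for c in combinations(vocab, size)
            if h_confidence(frozenset(c), docs) >= min_hconf]


def cluster_documents(docs, patterns):
    """Group documents via the document-pattern bipartite graph.

    Assumption: connected components replace a real graph partitioner.
    Nodes 0..n-1 are documents; nodes n..n+m-1 are patterns.
    """
    n = len(docs)
    parent = list(range(n + len(patterns)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, d in enumerate(docs):
        for j, p in enumerate(patterns):
            if p <= d:  # edge: document i contains pattern j
                parent[find(i)] = find(n + j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy corpus: each document is a set of words (hypothetical data).
docs = [
    {"graph", "partition", "cut"},
    {"graph", "partition", "spectral"},
    {"stock", "market", "price"},
    {"stock", "market", "trade"},
]
patterns = mine_hypercliques(docs)
clusters = cluster_documents(docs, patterns)
print(clusters)  # [[0, 1], [2, 3]]
```

    On real data the graph is rarely disconnected this cleanly, so a weighted partitioner (for example a spectral or multilevel method) would be needed in place of connected components, and pattern mining would use a pruned search rather than enumerating every word pair.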

     
