Semi-Supervised Semantic Dynamic Text Clustering Algorithm

QIAN Zhi-sen; HUANG Rui-zhang; WEI Qin; QIN Yong-bin; CHEN Yan-ping

doi:10.3969/j.issn.1001-0548.2019.06.001

Volume 48 Issue 6

Nov. 2019

QIAN Zhi-sen, HUANG Rui-zhang, WEI Qin, QIN Yong-bin, CHEN Yan-ping. Semi-Supervised Semantic Dynamic Text Clustering Algorithm[J]. Journal of University of Electronic Science and Technology of China, 2019, 48(6): 803-808. doi: 10.3969/j.issn.1001-0548.2019.06.001

1. School of Computer Science and Technology, Guizhou University, Guiyang 550025
2. Public Big Data Laboratory of Guizhou, Guizhou University, Guiyang 550025

Received Date: 2019-07-24
Revision Received Date: 2019-10-19
Publish Date: 2019-11-30

Abstract

In traditional dynamic text clustering, similar texts with different wording are assigned to different groups, and the number of clusters produced differs markedly from the number of real categories. To address these problems, this paper proposes a semi-supervised semantic dynamic text clustering algorithm (SDCS). The algorithm captures the semantic relationships between texts by representing them semantically, and it dynamically learns category semantics during clustering so that texts can be clustered accurately according to semantics. In addition, the algorithm uses semi-supervised clustering to supervise the generation of new classes, producing clustering results that are consistent with the actual situation. Experimental results show that the proposed algorithm is effective and feasible.
Keywords:
- dynamic text clustering
- semantic learning
- semi-supervised text clustering
- text clustering

References

[1]	ZHANG T, RAMAKRISHNAN R, LIVNY M. BIRCH: An efficient data clustering method for very large databases[C]//ACM SIGMOD International Conference on Management of Data. Montreal, Canada: ACM, 1996: 103-114.
[2]	RODRIGUES P P, GAMA J, PEDROSO J P. ODAC: Hierarchical clustering of time series data streams[C]//Proceedings of the 6th SIAM International Conference on Data Mining. Bethesda, MD, USA: SIAM, 2006: 615-627.
[3]	KRANEN P, ASSENT I, BALDAUF C, et al. The ClusTree: Indexing micro-clusters for anytime stream mining[J]. Knowledge and Information Systems Journal, 2011, 29(2): 249-272. doi: 10.1007/s10115-010-0342-8
[4]	IBRAHIM O A, DU Y, KELLER J. Robust on-line streaming clustering[C]//International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems. Cádiz, Spain: Springer, 2018: 467-478.
[5]	GUHA S, MEYERSON A, MISHRA N, et al. Clustering data streams: Theory and practice[J]. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(3): 515-528. doi: 10.1109/TKDE.2003.1198387
[6]	ARTHUR D, VASSILVITSKII S. K-means++: The advantages of careful seeding[C]//Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia: Society for Industrial and Applied Mathematics, 2007: 1027-1035.
[7]	ACKERMANN M R, LAMMERSEN C, MÄRTENS M, et al. StreamKM++: A clustering algorithm for data streams[J]. Journal of Experimental Algorithmics, 2012, 17(1): 173-187.
[8]	AGGARWAL C C, HAN J, WANG J, et al. A framework for clustering evolving data streams[C]//Proceedings of the 29th International Conference on Very Large Data Bases. Berlin: VLDB Endowment, 2003: 81-92.
[9]	BAO J P, WANG W Q, YANG T S, et al. An incremental clustering method based on the boundary profile[J]. PLOS ONE, 2018, 13(4): e0196108. doi: 10.1371/journal.pone.0196108
[10]	CAO F, ESTER M, QIAN W, et al. Density-based clustering over an evolving data stream with noise[C]//Proceedings of the 2006 SIAM International Conference on Data Mining. Bethesda, MD: SIAM, 2006: 328-339.
[11]	LIU L X, GUO Y F, KANG J, et al. A three-step clustering algorithm over an evolving data stream[C]//IEEE International Conference on Intelligent Computing and Intelligent Systems. Shanghai, China: IEEE, 2009: 160-164.
[12]	CHEN Y, TU L. Density-based clustering for real-time stream data[C]//Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA: ACM, 2007: 133-142.
[13]	YANG F, CAO J, ZHOU K, et al. An adaptive clustering algorithm based on CFSFDP[C]//The 33rd Youth Academic Annual Conference of Chinese Association of Automation. Nanjing, China: IEEE, 2018: 404-408.
[14]	YIN J, WANG J. A Dirichlet multinomial mixture model-based approach for short text clustering[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA: ACM, 2014: 233-242.
[15]	YIN J, WANG J. A text clustering algorithm using an online clustering scheme for initialization[C]//ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, USA: ACM, 2016: 1995-2004.
[16]	LAOHAKIAT S, PHIMOLTARES S, LURSINSAP C. A clustering algorithm for stream data with LDA-based unsupervised localized dimension reduction[J]. Information Sciences, 2017, 381: 104-123.
[17]	MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[J]. Advances in Neural Information Processing Systems, 2013, 26: 3111-3119.
[18]	WILLIAMS R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning[J]. Machine Learning, 1992, 8(3-4): 229-256. doi: 10.1007/BF00992696
[19]	MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space[C]//Proceedings of the International Conference on Learning Representations. Scottsdale, AZ, USA: ICLR, 2013: 1-12.
[20]	ZHONG S. Semi-supervised model-based document clustering: A comparative study[J]. Machine Learning, 2006, 65(1): 3-29.
[21]	TANG J, ZHANG J, YAO L, et al. Arnetminer: Extraction and mining of academic social networks[C]//ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Las Vegas, USA: ACM, 2008: 990-998.
[22]	LAMSAL R, MALCOMBER I. Real-time clustering algorithm based on predefined level-of-similarity[EB/OL]. (2018-10-03). https://pdfs.semanticscholar.org/35fd/1eea45b0a54d28624771d8745f78226370d1.pdf.

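With the rapid development of the Internet and the widespread use of mobile devices, platforms such as Zhihu and Weibo generate massive data streams. These data spread quickly and change dynamically over time.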

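Unlike traditional static data, these data streams cannot be processed directly by traditional clustering algorithms, so researchers have proposed dynamic text clustering algorithms. These algorithms work in two phases, online and offline: the online phase uses a dedicated data structure to summarize the continuously arriving data, and the offline phase clusters the summaries with a traditional algorithm such as k-means as the base algorithm. This divide-and-conquer scheme successfully solves the problem of clustering data streams. However, most of these algorithms assume that text features are independent and do not recognize semantics, and they cluster around class centers computed as group averages. Such a scheme cannot handle newly arriving out-of-vocabulary word features, cannot cope with the same event being described in different styles, and cannot identify the relationships among events, existing events, and newly added texts. As a result, when the data stream arrives, texts are assigned to the wrong groups on the basis of word co-occurrence features, and semantic information cannot be used to cluster them correctly. Meanwhile, the number of text clusters should change as the data stream evolves. Traditional dynamic text clustering algorithms generally rely on a number of categories that is set manually or learned automatically, so they cannot accurately produce a number of clusters that matches the actual situation and often produce more clusters than there are real categories.

To address these problems, this paper proposes a semi-supervised semantic dynamic text clustering algorithm (SDCS). The algorithm represents texts with their semantic information, which resolves the difficulty of representing texts when new word features appear in traditional dynamic text clustering and better captures the relationships between texts. Because existing algorithms mostly focus on word-level semantic representation and lack a description of category semantics, the algorithm dynamically learns category semantics during clustering, so that texts of the same category that differ only in how they are described can be clustered accurately according to semantics. In addition, the algorithm incorporates semi-supervised clustering to introduce supervision information and uses it to supervise the generation of new classes, producing clustering results that better match the actual situation.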
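
To illustrate why a semantic representation helps where word co-occurrence fails, the following is a minimal Python sketch, not the SDCS implementation. It assumes that a text is represented by averaging word vectors and that texts are compared by cosine similarity; `TOY_VECTORS`, `text_vector`, and `cosine` are hypothetical names, and the vector values are made up to stand in for pretrained embeddings.

```python
import math

# Hypothetical toy word vectors (assumption); a real system would load
# pretrained embeddings instead of this hand-written table.
TOY_VECTORS = {
    "earthquake": [0.90, 0.10, 0.00],
    "quake":      [0.85, 0.15, 0.00],
    "hits":       [0.50, 0.50, 0.00],
    "strikes":    [0.50, 0.45, 0.05],
    "concert":    [0.00, 0.10, 0.90],
    "tickets":    [0.05, 0.20, 0.80],
}

def text_vector(tokens):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

text_a = ["earthquake", "hits"]
text_b = ["quake", "strikes"]      # same event, different wording
text_c = ["concert", "tickets"]    # unrelated event

vec_a, vec_b, vec_c = (text_vector(t) for t in (text_a, text_b, text_c))
same_event = cosine(vec_a, vec_b)
other_event = cosine(vec_a, vec_c)
```

The first two texts share no words, so a co-occurrence measure would treat them as unrelated, whereas their averaged vectors are nearly parallel and `same_event` is far larger than `other_event`.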

1. Related Work

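In recent years researchers have proposed many data stream clustering algorithms, which fall roughly into four categories: hierarchical, partition-based, density-based, and model-based.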

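In hierarchical clustering algorithms, a decision to merge two clusters cannot be undone. BIRCH [1], a classic stream-processing algorithm, introduces the clustering feature vector (CF) [1] and the height-balanced CF-tree [1]; it assigns data to the tree according to a manually set threshold and then merges or splits the tree. ODAC [2] maintains a tree-shaped hierarchy of clusters with a top-down strategy. ClusTree [3], an algorithm that adapts to the stream speed, varies how new data are assigned to micro-clusters as the stream speed changes and automatically adjusts the micro-cluster size, so it can keep the same number of clusters in memory independently of the stream speed. Reference [4] proposes EROLSC, a clustering algorithm with incremental parameter updates based on probabilistic C-Means and Gaussian mixtures, which can identify and adjust clusters and detect anomalous data.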
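
The clustering feature used by BIRCH can be sketched as follows. In the standard formulation a CF is the triple (N, LS, SS), that is, the point count, the per-dimension linear sum, and the sum of squared norms, and two CFs merge by component-wise addition. The class and method names below are illustrative assumptions, and the CF-tree itself is omitted.

```python
import math

class ClusteringFeature:
    """CF = (N, LS, SS): count, linear sum per dimension, sum of squared norms."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = 0.0

    def add_point(self, x):
        """Absorb one point by updating the three summary statistics."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        """CFs are additive: merging two sub-clusters adds their components."""
        merged = ClusteringFeature(len(self.ls))
        merged.n = self.n + other.n
        merged.ls = [a + b for a, b in zip(self.ls, other.ls)]
        merged.ss = self.ss + other.ss
        return merged

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root mean squared distance to the centroid: sqrt(SS/N - ||LS/N||^2)."""
        c = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero

left = ClusteringFeature(2)
for p in [(0.0, 0.0), (2.0, 0.0)]:
    left.add_point(p)

right = ClusteringFeature(2)
for p in [(4.0, 0.0), (6.0, 0.0)]:
    right.add_point(p)

both = left.merge(right)
```

Because the centroid and radius are derived from the summary alone, a threshold test on the radius can be made without keeping the raw points in memory, which is what makes the structure usable on streams.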

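Partition-based algorithms require the number of clusters to be specified in advance, are sensitive to noise and outliers, and tend to form spherical clusters. Among them, STREAM [5] uses k-means to obtain representative points and then applies k-means again to obtain the final clustering. Later, the well-known StreamKM++ [7] was proposed on the basis of k-means++ [6]; it constructs and maintains in memory a coreset that represents the data stream and then clusters the coreset with k-means++. Reference [8] proposes the classic two-phase CluStream algorithm, which builds and maintains micro-clusters under a pyramidal time frame in the online phase and re-clusters the micro-clusters in the offline phase, giving users information about past micro-clusters within a given time horizon. Reference [9] proposes BPIC, based on a boundary point detection algorithm, which represents clustering results by identifying boundary profiles.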
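
The seeding step that distinguishes k-means++ from plain k-means can be sketched as below: each new seed is sampled with probability proportional to its squared distance from the nearest seed already chosen. This is a minimal sketch of the seeding only, not of StreamKM++ or its coreset construction; `sq_dist` and `kmeanspp_seeds` are hypothetical helper names.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers by D^2 sampling (the k-means++ seeding rule)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance from each point to its nearest chosen seed.
        d2 = [min(sq_dist(p, s) for s in seeds) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a seed; fall back to uniform choice.
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds

# Two well-separated groups of identical points.
points = [(0.0, 0.0)] * 5 + [(10.0, 10.0)] * 5
seeds = kmeanspp_seeds(points, k=2)
```

Points that coincide with an existing seed get weight zero, so in this toy data the second seed always lands in the other group, which is the behavior that makes the seeding less sensitive to a poor first choice.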

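Density-based algorithms, by contrast, can discover clusters of arbitrary shape and handle noise. The finer-grained density grid algorithms partition the object space into a finite number of grids or cells and then cluster the non-empty grids, which effectively reduces the computational complexity for high-dimensional data. Reference [10] proposes DenStream, which effectively supports clusters of arbitrary shape and handles outliers; it uses a decay function and does not fix the number of micro-clusters. Building on it, reference [11] proposes the improved rDenStream, which adds a third, retrospective phase that lets the algorithm relearn previously discarded data, giving misjudged clusters a chance to form potential micro-clusters. DStream [12] is a grid-based stream clustering algorithm: the online phase maps the input onto a grid, and the offline phase tries to cluster the dense grids and removes the sparse ones. To address the inability of CFSFDP to identify cluster centers adaptively, reference [13] proposes ACFSFDP, an adaptive clustering algorithm that combines the Max-min algorithm with clustering by fast search and find of density peaks; the former determines the number of clusters and the latter obtains the cluster centers.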
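
The decay function mentioned for DenStream is, in the original DenStream formulation, the exponential fading function f(t) = 2^(-λt) with λ > 0. The sketch below shows only how such a function lets a micro-cluster weight be maintained incrementally; the class name `DecayedWeight` and the value of λ are assumptions for illustration, and the rest of DenStream (core, potential, and outlier micro-clusters) is omitted.

```python
class DecayedWeight:
    """Weight of a micro-cluster under the fading function f(t) = 2 ** (-lam * t).

    Timestamps passed to insert() and value_at() must be non-decreasing.
    """

    def __init__(self, lam):
        self.lam = lam
        self.weight = 0.0
        self.last_t = None

    def _fade_to(self, t):
        """Decay the stored weight from the last update time up to time t."""
        if self.last_t is not None:
            self.weight *= 2.0 ** (-self.lam * (t - self.last_t))
        self.last_t = t

    def insert(self, t):
        """A new point arrives at time t and contributes weight 1."""
        self._fade_to(t)
        self.weight += 1.0

    def value_at(self, t):
        """Current weight at time t, with no new point arriving."""
        self._fade_to(t)
        return self.weight

w = DecayedWeight(lam=1.0)   # lam = 1.0 is an arbitrary illustrative choice
w.insert(0.0)                # weight 1.0
w.insert(1.0)                # old point fades to 0.5, new point adds 1.0
weight_now = w.value_at(1.0)
weight_later = w.value_at(3.0)
```

A micro-cluster that stops receiving points therefore loses weight geometrically, which gives a stream algorithm a natural criterion for discarding stale clusters.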

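Finally, model-based algorithms rest on the idea that the objects within a cluster follow the same statistical distribution. They usually cluster well, but their computational complexity is high and they are unsuitable for data sets with many clusters and few objects. In 2016, reference [15] proposed FGSDMM+ on the basis of GSDMM [14]. It detects the number of clusters automatically: each document chooses either a non-empty cluster or a potential new cluster derived from the Dirichlet multinomial mixture model, the probability that later documents choose a potential cluster is reduced, and Gibbs sampling is finally used to obtain the clustering result. Reference [16] proposes LLDstream, which introduces ULLDA, a dimension reduction technique that performs LDA locally and without supervision, and then assigns the input points to micro-clusters in the projected space.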

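However, none of these algorithms considers the semantic information of text data, and they cluster around cluster centers computed under the assumption that text features are independent. They cannot handle out-of-vocabulary word features that emerge as events develop, nor can they correctly cluster text data that differ only in how they are described. Moreover, they all rely too heavily on manual thresholds, so the number of clusters they produce differs considerably from the actual number. For these reasons, this paper proposes a semi-supervised semantic dynamic text clustering algorithm.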

4. Conclusion

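This paper proposes a semantics-based dynamic clustering algorithm for data streams that lets newly arriving data be clustered directly according to the learned category semantics, and it tracks the dynamic changes in the data in a semi-supervised manner to generate a more accurate number of categories. The algorithm makes up for the lack of semantic description in traditional data stream clustering algorithms and also strengthens the learning of category semantics. Representing texts with semantic vectors not only overcomes problems of traditional representations such as the dimension explosion but also allows accurate category semantics to be learned during clustering, which has a positive effect on clustering performance. The semi-supervised clustering used by the model also resolves the inaccurate cluster counts of traditional clustering algorithms and yields results that better match the actual situation. Finally, comparisons with other algorithms show good results, indicating that the proposed algorithm is effective and robust.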

Dataset	L	V	K
TweetSet	5 450	4 383	3
PaperSet	890	2 400	3
Kajjle	12 122	15 440	20

Time window	SDCS	Single-Pass	RCADF
T=1	0.363	0.209	0.344
T=2	0.598	0.117	0.438
T=3	0.627	0.276	0.402

Time window	SDCS	Single-Pass	RCADF
T=1	0.372	0.348	0.239
T=2	0.448	0.302	0.336
T=3	0.416	0.409	0.327

Model	TweetSet	PaperSet	Kajjle
Single-Pass	0.481	0.863	0.229
k-means	0.757	0.856	0.472
FGSDMM+	0.702	0.497	0.467
RCADF	0.612	0.757	0.332
CluStream	0.402	0.339	0.312
SDCS	0.851	0.945	0.535

Model	TweetSet	PaperSet	Kajjle
Single-Pass	0.713	0.925	0.206
k-means	0.818	0.890	0.332
FGSDMM+	0.546	0.832	0.354
RCADF	0.693	0.807	0.256
CluStream	0.705	0.558	0.205
SDCS	0.925	0.980	0.415
