Research on Dimensional Reduction of Sparse Matrix Data Based on Information Entropy

HE Xing-gao; LI Chan-juan; WANG Rui-jin; DENG Fu-hu; LIU Xing

doi:10.3969/j.issn.1001-0548.2018.02.012

Volume 47 Issue 2

Mar. 2018

HE Xing-gao, LI Chan-juan, WANG Rui-jin, DENG Fu-hu, LIU Xing. Research on Dimensional Reduction of Sparse Matrix Data Based on Information Entropy[J]. Journal of University of Electronic Science and Technology of China, 2018, 47(2): 235-241. doi: 10.3969/j.issn.1001-0548.2018.02.012

School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054

Received Date: 2017-01-04
Rev Recd Date: 2017-06-15
Publish Date: 2018-03-30

Abstract

Data dimensionality reduction is a necessary step in mining useful information from high-dimensional data. When the traditional principal component analysis (PCA) algorithm is applied to high-dimensional sparse data, not all data features can be read into memory at once for analysis and computation; moreover, the improved block processing PCA algorithm is too time consuming to meet practical requirements. In this paper, we propose the E-PCA algorithm, which improves PCA by introducing the concept of information entropy. First, useless features are eliminated through feature selection based on information entropy; then PCA is used to reduce the dimensionality of the large, high-dimensional sparse data. Experimental results show that, when the same proportion of the original data's information is retained, the proposed E-PCA algorithm outperforms the block processing PCA algorithm in memory usage, run time, and dimensionality reduction results.
Keywords: block processing; dimensionality reduction; high-dimensional sparse data; information entropy; principal component analysis

References

[1]	JAIN A, CHANDRASEKARAN B. Dimensionality and sample size considerations in pattern recognition practice[J]. Handbook of Statistics, 1982(2):835-855. https://www.sciencedirect.com/science/article/pii/S0169716182020422
[2]	HOU L, GAO J, CHEN R. An information entropy-based animal migration optimization algorithm for data clustering[J]. Entropy, 2016, 18(5):185-200. doi: 10.3390/e18050185
[3]	WANG Rui-jin, LI Dong-fen, QIN Zhi-guang. An immune quantum communication model for dephasing noise using four-qubit cluster state[J]. International Journal of Theoretical Physics, 2016, 55(1):609-616. doi: 10.1007/s10773-015-2698-8
[4]	王珏, 杨剑, 李伏欣, 等. 机器学习的难题与分析[C]//第三届机器学习及应用研讨会. 南京: [s. n. ], 2005. WANG Yu, YANG Jian, LI Fu-xin, et al. Difficulties and analysis of machine learning[C]//The Third Machine Learning and Application Seminar. Nanjing: [s. n. ], 2005.
[5]	LI Dong-fen, WANG Rui-jin, ZHANG Feng-li, et al. Quantum information splitting of arbitrary two-qubit state by using four-qubit cluster state and Bell-state[J]. Quantum Information Processing, 2015, 14(3):1103-1116. doi: 10.1007/s11128-014-0906-8
[6]	尹芳黎, 杨雁莹, 王传栋, 等. 矩阵奇异值分解及其在高维数据处理中的应用[J]. 数学的实践与认识, 2011, 41(15):171-177. YIN Fang-li, YANG Yan-ying, WANG Chuan-dong, et al. Matrix singular value decomposition and its application in high dimensional data processing[J]. Mathematics in Practice and Theory, 2011, 41(15):171-177. http://d.old.wanfangdata.com.cn/Periodical/sxdsjyrs201115025
[7]	PEARSON K. On lines and planes of closest fit to systems of points in space[J]. Philosophical Magazine, 1901, 2(6):559-572. http://www.citeulike.org/user/zambujo/article/2013414
[8]	FISHER R A, MACKENZIE W A. Studies in crop variation Ⅱ. The manurial response of different potato varieties[J]. Journal of Agricultural Science, 1923, 13(3):311-320. doi: 10.1017/S0021859600003592
[9]	HOTELLING H. Analysis of a complex of statistical variables into principal components[J]. Journal of Educational Psychology, 1933, 24(6):417-520. doi: 10.1037/h0071325
[10]	JOLLIFFE I T. Principal component analysis[J]. Journal of Marketing Research, 2002, 87(100):513. doi: 10.1002/wics.101
[11]	GUEBEL D V, TORRES N V. Principal component analysis (PCA)[M]. New York: Springer, 2013.
[12]	张道强, 陈松灿. 高维数据降维方法[J]. 中国计算机学会通讯, 2009, 5(8):15-22. ZHANG Dao-qiang, CHEN Song-can. Research on dimension reduction methods of high dimensional data[J]. Communications of the CCF, 2009, 5(8):15-22. http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=qbkx200708029
[13]	WANG Y. Semi-supervised dimensionality reduction[J]. Proceedings of the International Symposium on Computer Science, 2010, 41(9):1993-1998. doi: 10.1137/1.9781611972771.73
[14]	REZGHI M, OBULKASIM A. Noise-free principal component analysis: an efficient dimension reduction technique for high dimensional molecular data[J]. Expert Systems with Applications, 2014, 41(17):7797-7804. doi: 10.1016/j.eswa.2014.06.024
[15]	ABRAHAM G, INOUYE M. Fast principal component analysis of large-scale genome-wide data[J]. PLoS ONE, 2014, 9(4):e93766. doi: 10.1371/journal.pone.0093766
[16]	HALKO N, MARTINSSON P G, SHKOLNISKY Y, et al. An algorithm for the principal component analysis of large data sets[J]. SIAM Journal on Scientific Computing, 2010, 33(5):2580-2594. http://adsabs.harvard.edu/abs/2010arXiv1007.5510H
[17]	陈伏兵, 杨静宇. 分块PCA及其在人脸识别中的应用[J]. 计算机工程与设计, 2007, 28(8):1889-1892. CHEN Fu-bing, YANG Jing-yu. Realization of face recognition algorithm based on block PCA[J]. Computer Engineering and Design, 2007, 28(8):1889-1892. http://d.wanfangdata.com.cn/Periodical_jsjgcysj200708048.aspx
[18]	CHEN Fu-bing, YANG Jing-yu. PCA face recognition algorithm based on local feature[J]. Mini-Micro Systems, 2006, 7(10):1943-1947.
[19]	尹飞, 冯大政. 基于PCA算法的人脸识别[J]. 计算机技术与发展, 2008, 18(10):31-33. YIN Fei, FENG Da-zheng. Face recognition based on PCA algorithm[J]. Journal of Computer Technology and Development, 2008, 18(10):31-33. doi: 10.3969/j.issn.1673-629X.2008.10.009
[20]	LI Dong-fen, WANG Rui-jin, ZHANG Feng-li, et al. A noise immunity controlled quantum teleportation protocol[J]. Quantum Information Processing, 2016, 15(11):4819-4837. doi: 10.1007/s11128-016-1416-7
[21]	AMPILOVA N, SOLOVIEV I. On application of entropy characteristics to texture analysis[J]. WSEAS Transactions on Biology & Biomedicine, 2014, 11(1):194-202. http://www.sciencedirect.com/science/article/pii/S0927025604002058
[22]	PHOENIX S J D. Elements of information theory[M]. [S.l.]: Wiley, 1992.
[23]	LI Dong-fen, WANG Rui-jin, ZHANG Feng-li. Quantum information splitting of a two-qubit Bell state using a four-qubit entangled state[J]. Chinese Physics C, 2015, 39(4):26-30. doi: 10.1088/1674-1137/39/4/043103
[24]	LI Dong-fen, WANG Rui-jin, ZHANG Feng-li, et al. Quantum information splitting of arbitrary three-qubit state by using seven-qubit entangled state[J]. International Journal of Theoretical Physics, 2015, 54(6):2068-2075. doi: 10.1007/s10773-014-2413-1

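With the rapid development of the big-data industry, the data objects of interest have become increasingly complex, and industry demand for data analysis and processing techniques, especially for high-dimensional data, has grown more urgent. Processing high-dimensional data directly runs into the following difficulties [1-6]: the curse of dimensionality, the empty space phenomenon, ill-posedness, and algorithm failure. An effective remedy is to reduce the dimensionality of the data, which can be done in two ways: feature selection and feature transformation [2]. Depending on the classification criterion, dimensionality reduction algorithms can be divided into linear and nonlinear, supervised and unsupervised, global and local, and so on; examples include PCA, ICA, LDA, LLE, ISOMAP, LTSA, and KPCA. PCA applies to numerical data: the data are first converted to matrix form, and the relevant computations are then carried out. The algorithm has no parameter restrictions, but in some cases it runs inefficiently. For example, in records of user visits to websites, the number of websites is enormous while the number of sites any one user visits is very small. Such data have a high feature dimension but little useful information; that is, they are high-dimensional sparse big data. This paper addresses the memory limits and long processing time that PCA faces on high-dimensional sparse data and gives an improved solution. Experimental results show that the improved algorithm lowers the time cost while retaining the same proportion of the original data's information.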

1. Related Work

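In 1901, the concept of principal component analysis (PCA) [7] was first proposed in statistics. In 1923, reference [8] regarded it as a form of model analysis better suited to the data in question than analysis of variance. In 1933, reference [9] extended it to random variables, and it became an unsupervised, linear learning method well known in the data mining community. PCA focuses on the main properties of the data: the original variables are linearly combined through a linear transformation, mapping n-dimensional features onto k dimensions (k < n). These k dimensions are newly constructed orthogonal features, called principal components. PCA is simple and has advantages such as no linear error and no parameter restrictions [10-12]. However, it needs a large amount of storage and has high computational complexity, and the linear mapping it uses can also affect the final result. In addition, the size of the covariance matrix grows with the dimension of the sample points (it is an n × n matrix for n features), which makes computing the eigenvectors of high-dimensional data difficult.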
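The mapping from n features to k principal components described above can be sketched as follows. This is a minimal illustration with NumPy, not the authors' implementation: the function name `pca_by_contribution`, the synthetic data, and the choice of k as the smallest number of components whose cumulative variance share reaches a contribution rate f are assumptions made for the example.

```python
import numpy as np

def pca_by_contribution(X, f=0.95):
    """Project X (samples x features) onto the fewest principal components
    whose cumulative variance share (contribution rate) reaches f."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition (ascending)
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)    # clear tiny negative round-off
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, f) + 1), len(eigvals))
    return Xc @ eigvecs[:, :k], k

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))
Z, k = pca_by_contribution(X, f=0.95)
print(Z.shape, k)
```

The n × n covariance matrix built on the third line of the function is the object whose size becomes the bottleneck when n is large.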

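To address PCA's limitations, such as the lack of a clear criterion for determining the principal components, the Gaussian and linearity assumptions, and the failure to consider serial correlation in the data, researchers have proposed many improved algorithms, such as dynamic PCA, nonlinear PCA, and multiscale PCA. Reference [14] studies dimensionality reduction for molecular data; to solve the problem that traditional PCA is easily affected by noise, it proposes NFPCA (noise-free PCA), which adds a penalty term to the computation of the PCs to control noise. Reference [15] addresses the rapid growth in the number of features of genome-wide single nucleotide polymorphism data, which makes classical PCA very time consuming, and proposes flash PCA, a high-performance PCA implementation based on a randomized algorithm. Reference [16] addresses large data sets that cannot be held in random-access memory by using a randomized version of the block Lanczos method; it needs very few iterations and gives nearly optimal results. The larger the parameter l, the higher the computational complexity, but there is no definite method for choosing l. Reference [17] addresses problems in face recognition such as high image feature dimension, small sample size, long run time, and large memory consumption; considering the features used in face recognition and the characteristics of images, it adopts block processing and proposes block PCA. When expression and illumination vary, block PCA can capture local facial features and turns the small-sample problem into a large-sample one, clearly outperforming PCA in recognition performance and recognition rate.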
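To make the role of a sketch-width parameter such as l concrete, the following is a simplified randomized PCA in NumPy. It is a generic random-projection scheme with a few power iterations, assumed here for illustration; it is not the exact block Lanczos variant of the cited work, and the function name `randomized_pca`, the default `l = k + 10`, and the test data are hypothetical.

```python
import numpy as np

def randomized_pca(X, k, l=None, n_iter=2, seed=0):
    """Approximate the top-k principal directions of X (samples x features)
    from a random sketch of width l >= k."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                       # centering makes the data dense
    l = k + 10 if l is None else l                # sketch width (hypothetical default)
    l = min(l, min(Xc.shape))
    # Random projection: Q spans an approximation of the column space of Xc
    Q, _ = np.linalg.qr(Xc @ rng.normal(size=(Xc.shape[1], l)))
    for _ in range(n_iter):                       # power iterations sharpen the subspace
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)
    B = Q.T @ Xc                                  # small l x n matrix
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k], s[:k]                          # principal directions, singular values

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 50))   # hypothetical rank-5 data
components, s_approx = randomized_pca(X, k=5, l=15)
s_exact = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)[:5]
print(np.allclose(s_approx, s_exact))
```

In this sketch a larger l means larger QR and SVD factorizations, which is one way to see why the cost rises with l.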

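PCA consumes a large amount of memory and time, and when the feature dimension is high its processing time cannot meet application requirements. To address this, this paper proposes an information-entropy-based dimensionality reduction algorithm for high-dimensional sparse big data (E-PCA). The algorithm introduces information entropy: it first performs feature selection to reduce the number of features, which densifies the large sparse matrix, and then performs dimensionality reduction.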
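The two-stage idea (entropy-based feature selection, then PCA) might look like the sketch below. This excerpt does not give the paper's entropy formula or selection rule, so everything specific here is an assumption: the entropy is the Shannon entropy of each column's empirical value distribution, columns at or below a hypothetical threshold are dropped, and the names `column_entropy`, `entropy_filter`, and `pca_svd` are illustrative.

```python
import numpy as np

def column_entropy(col):
    """Shannon entropy (bits) of the empirical value distribution of one feature."""
    _, counts = np.unique(col, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_filter(X, threshold):
    """Keep columns whose entropy exceeds threshold (assumed selection rule)."""
    H = np.array([column_entropy(X[:, j]) for j in range(X.shape[1])])
    keep = np.flatnonzero(H > threshold)
    return X[:, keep], keep, H

def pca_svd(X, f=0.95):
    """PCA via SVD, keeping the fewest components with variance share >= f."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(ratio, f) + 1), len(s))
    return Xc @ Vt[:k].T, k

rng = np.random.default_rng(0)
# Hypothetical user-by-site visit matrix: 100 users, 50 sites, mostly zeros
X = np.zeros((100, 50))
X[:, :8] = rng.integers(0, 2, size=(100, 8))   # 8 informative sites
X[0, 8] = 1.0                                  # one site visited by a single user
X_sel, keep, H = entropy_filter(X, threshold=0.1)
Z, k = pca_svd(X_sel, f=0.95)
print(X.shape, X_sel.shape, Z.shape)
```

All-zero columns have zero entropy and are removed, so the matrix passed to PCA is both narrower and denser than the original.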

4. Conclusion

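Sparse big data have an excessively high feature dimension. When PCA is used to reduce their dimensionality, the matrix computations consume too much memory, and the block processing technique of reference [17] is cumbersome and its run time falls far short of application requirements. To address this, this paper improves the PCA dimensionality reduction algorithm and gives the information-entropy-based E-PCA algorithm. Experimental results show that E-PCA greatly improves run time and memory consumption while retaining as much of the original data's information as possible. In future work, quantum computing and communication [23-24] will be used to further improve the algorithm's performance.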

Attribute dimension	Memory usage (theoretical)/MB	Memory usage (actual)/MB
1 000	7.63	309
5 000	190.73	867
10 000	762.94	3 816
15 000	1 716.61	4 169
20 000	3 051.76	6 214
30 000	6 866.46	14 558.51
40 000	12 207.03	27 587.89
56 535	16 930.88	40 828.93
169 605	219 466.06	-
282 669	609 602.08	-
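The theoretical column for the smaller dimensions is consistent with storing one dense n × n matrix of 8-byte floating-point values, with 1 MB taken as 2^20 bytes. This reading is inferred from the numbers in the table rather than stated in the text, and the function name below is hypothetical.

```python
def covariance_memory_mb(n_features, bytes_per_value=8):
    """Memory in MB (2**20 bytes) for a dense n x n matrix of fixed-size values."""
    return n_features ** 2 * bytes_per_value / 2 ** 20

for n in (1_000, 5_000, 10_000, 20_000):
    print(n, round(covariance_memory_mb(n), 2))
# 1000 7.63
# 5000 190.73
# 10000 762.94
# 20000 3051.76
```

Because the requirement grows with the square of the dimension, doubling the number of attributes roughly quadruples the memory needed for the matrix alone.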

Dataset	Attribute dimension	Run time/s
Company R	282 669	153 487.65
Arcene	10 000	132

Method	Time cost/s	Contribution rate f	Result dimension k
E-PCA	3 365.83	0.95	961
PCA	15 487.65	0.95	6 323

Algorithm	Contribution rate f	Result dimension k	KNN after reduction/%	SVM after reduction/%	KNN before reduction/%	SVM before reduction/%
E-PCA	0.95	961	53.1	53.9	53.1	53.6
PCA	0.95	3 323	52.5	50.5	53.1	53.6
