基于信息熵的高维稀疏大数据降维算法研究

Research on Dimensional Reduction of Sparse Matrix Data Based on Information Entropy

  • 摘要: 数据降维是从高维数据中挖掘有效信息的必要步骤。传统的主成分分析(PCA)算法应用于超高维稀疏数据降维时,存在着无法将所有数据特征一次性读入内存以进行分析计算的问题,而之后提出的分块处理PCA算法由于耗时太长,并不能满足实际需求。本文引入信息熵的思想对PCA算法进行改进,提出E-PCA算法,先利用信息熵对数据进行特征筛选,剔除大部分无用特征,再使用PCA算法对处理后的超高维稀疏数据进行降维。通过实验结果表明,在保留相同比例原数据信息的情况下,本文提出的基于信息熵的E-PCA算法在内存占用、运行时间以及降维结果都优于分块处理PCA算法。

     

    Abstract: Data dimensionality reduction is a necessary step in mining effective information from high-dimensional data. When applying the traditional principal component analysis (PCA) algorithm to high-dimensional sparse data dimensionality reduction, there is a problem that unable to read all data features at once into memory for analysis and calculation, furthermore, the improved block processing PCA algorithm also can not meet the actual requirements because of the time consuming. In this paper, we propose the E-PCA algorithm by introducing the concept of information entropy to improve the PCA algorithm. First, the useless features are eliminated through feature selection based on information entropy, and then PCA algorithm is used to reduce the dimensionality of large, high-dimensional sparse data. The experimental results show that in the case of keeping the same proportion of raw data, the information entropy-based E-PCA algorithm proposed in this paper is superior to block processing PCA algorithm in terms of memory usage, run time and the results of dimension reduction.

     

/

返回文章
返回