Text Feature Gene Extraction on Imbalanced Big Dataset

  • Abstract: On imbalanced big datasets, traditional feature processing methods favor large classes and neglect small ones, which degrades classification performance. This paper proposes a text feature gene extraction method. First, considering how an imbalanced class distribution of samples affects feature selection, a feature selection method based on a CHI statistical matrix combined with information entropy is presented to strengthen the features of small classes. Second, building on an investigation of the high-order correlations in multidimensional statistical data, a text feature gene extraction method is designed using independent component analysis to enhance the generalization ability of feature items. Finally, the two methods are combined into a new text feature gene extraction method for imbalanced big datasets. Experimental results show that the proposed method performs well in early maturity and feature dimensionality reduction, and outperforms common feature selection algorithms in classifying small classes.
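
The first step of the abstract (a per-class CHI matrix weighted by information entropy) can be illustrated with a minimal sketch. The abstract does not give the paper's formula, so everything specific here is an assumption made for illustration: the function names `chi_matrix`, `entropy_weight`, and `select_features`, the weight `1 - H(t) / log K` computed from class-size-normalized document frequencies, and the equal per-class feature quota. The independent component analysis step is not covered.

```python
import math
from collections import Counter, defaultdict

def chi_matrix(docs):
    """docs: list of (set_of_terms, label). Returns chi-square per (term, class)."""
    n = len(docs)
    class_size = Counter(label for _, label in docs)
    term_class = defaultdict(Counter)  # term -> label -> document count
    for terms, label in docs:
        for t in terms:
            term_class[t][label] += 1
    chi = {}
    for t, per_class in term_class.items():
        df = sum(per_class.values())  # documents containing t
        chi[t] = {}
        for c, n_c in class_size.items():
            a = per_class[c]       # in class c, contains t
            b = df - a             # outside c, contains t
            c_ = n_c - a           # in class c, lacks t
            d = n - n_c - b        # outside c, lacks t
            denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
            chi[t][c] = n * (a * d - b * c_) ** 2 / denom if denom else 0.0
    return chi, term_class, class_size

def entropy_weight(per_class, class_size):
    """Hypothetical weight: 1 - normalized entropy of the term's class distribution.

    Frequencies are divided by class size first, so a term that is frequent
    inside a small class is not drowned out by raw counts from a large class.
    """
    rates = [per_class[c] / class_size[c] for c in class_size]
    total = sum(rates)
    probs = [r / total for r in rates if r > 0]
    h = -sum(p * math.log(p) for p in probs)
    k = len(class_size)
    return 1.0 - h / math.log(k) if k > 1 else 1.0

def select_features(docs, per_class_k):
    """Keep the top per_class_k terms for every class (equal quota is an assumption)."""
    chi, term_class, class_size = chi_matrix(docs)
    selected = {}
    for c in class_size:
        scored = [(chi[t][c] * entropy_weight(term_class[t], class_size), t)
                  for t in chi if term_class[t][c] > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        selected[c] = [t for _, t in scored[:per_class_k]]
    return selected

# Toy imbalanced corpus: 8 "sports" documents versus 2 "law" documents.
docs = [({"match", "the"}, "sports")] * 8 + [({"court", "the"}, "law")] * 2
selected = select_features(docs, per_class_k=1)
print(selected)
```

Ranking terms per class, rather than by a single global score, is one way to keep small-class features from being crowded out. Whether the paper uses this exact weighting is not stated in the abstract.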

     
