Abstract:
On large imbalanced datasets, traditional feature processing methods are biased toward the large class and neglect the small class, which degrades classification performance. This paper therefore proposes a text feature gene extraction method. First, to account for the effect of an imbalanced class distribution on feature selection, a feature selection method based on the CHI statistical matrix combined with information entropy is used to strengthen the features of the small class. Second, based on the high-order correlation of multidimensional statistical data, a text feature extraction method is designed to improve the generalization ability of the feature items. Finally, the two methods are combined into a new text feature extraction method for large imbalanced datasets. Experimental results show that the proposed method performs better in terms of early maturity and feature dimension reduction, and that it far outperforms common feature selection algorithms in classifying small classes.
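
To make the two ingredients of the feature selection step concrete, the following is a minimal sketch of the standard CHI (chi-square) statistic for one term/class pair and of the Shannon entropy of a term's distribution across classes. It is not the implementation used in this paper: the function names `chi_square` and `class_entropy` and the toy counts are assumptions for illustration, and the abstract does not specify how the two quantities are combined, so no combination rule is shown.

```python
import math


def chi_square(a, b, c, d):
    """Standard CHI statistic for one term/class pair (2x2 document counts).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def class_entropy(term_counts_per_class):
    """Shannon entropy (in bits) of a term's document counts across classes.

    Low entropy means the term is concentrated in few classes;
    high entropy means it is spread evenly over the classes.
    """
    total = sum(term_counts_per_class)
    if total == 0:
        return 0.0
    probs = [k / total for k in term_counts_per_class if k > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical toy corpus: 100 small-class and 900 large-class documents.
# The term occurs in 40 small-class and 10 large-class documents.
score = chi_square(a=40, b=10, c=60, d=890)
spread = class_entropy([40, 10])
```

In this toy case the term is rare overall but concentrated in the small class, so it has a high CHI score for that class and an entropy below the 1-bit maximum for two classes.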