Abstract:
An unsupervised feature annotation criterion, the information gain criterion (IGC), based on the information gain of the feature matrix, is proposed to rank the feature variables. According to this ranking, three new feature filtering methods, direct selection (DS), cumulate maximum entropy (CEM), and information gain maximum (IGM), are given to reduce clustering complexity. The clustering results of these three filtering methods were tested and compared with those of two existing methods, variance selection (VS) and gene shaving (GS), using the classic QC or K-means algorithm on three biological datasets: rod-shaped viruses (RSV), mixed-lineage leukemia (MLL), and acute leukemia patients (ALP). The experimental results show that our feature filtering methods are clearly superior in accelerating the clustering procedure and preserving the clustering structure of the initial data.