Feature Annotation Method for Biological Information Data Based on Mutual Information

  • Abstract: An unsupervised feature annotation criterion, the information gain criterion (IGC), is proposed for ranking feature variables; it is based on the information gain of the feature matrix. Using this ranking, three new feature filtering methods are given to reduce clustering complexity: direct selection (DS), cumulative maximum entropy (CEM), and information gain maximum (IGM). Their effect on clustering results was tested and compared with that of two existing filtering methods, variance selection (VS) and gene shaving (GS), using the classic QC or K-means clustering algorithm on three biological datasets: rod-shaped viruses (RSV), mixed-lineage leukemia (MLL), and acute leukemia patients (ALP). The experimental results show that all three proposed filtering methods have clear advantages in accelerating the clustering process and in preserving the clustering structure of the original data.
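The abstract does not give the formula behind IGC, so the following is only an illustrative sketch, not the authors' definition. It assumes that the "information gain" of a feature can be approximated by the drop in a normalized SVD-based entropy of the data matrix when that feature is removed, and that direct selection simply keeps the top-ranked features. The function names `svd_entropy`, `information_gain_scores`, and `direct_selection` are hypothetical.

```python
import numpy as np


def svd_entropy(matrix):
    """Normalized Shannon entropy of the squared singular values (assumed measure)."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    count = len(singular_values)
    if count <= 1:
        return 0.0
    energy = singular_values ** 2
    total = energy.sum()
    if total == 0:
        return 0.0
    p = energy / total
    p = p[p > 0]  # skip exact zeros so log() is defined
    return float(-(p * np.log(p)).sum() / np.log(count))


def information_gain_scores(data):
    """Leave-one-out score per feature; data has shape (samples, features).

    Assumption: a feature's score is the entropy of the full matrix minus
    the entropy of the matrix with that feature's column removed.
    """
    base = svd_entropy(data)
    scores = np.empty(data.shape[1])
    for j in range(data.shape[1]):
        reduced = np.delete(data, j, axis=1)
        scores[j] = base - svd_entropy(reduced)
    return scores


def direct_selection(data, k):
    """Return the column indices of the k highest-scoring features."""
    scores = information_gain_scores(data)
    order = np.argsort(-scores)  # descending by score
    return order[:k]
```

Under these assumptions, the filtered matrix `data[:, direct_selection(data, k)]` would then be passed to the clustering step (QC or K-means in the abstract); the choice of `k` is left to the user here.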

     
