Interaction-Dependent Feature Selection Algorithm for Gene Data

ZHANG Li

doi:10.12178/1001-0548.2021136

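School of Computer Engineering, Jiangsu University of Technology, Changzhou, Jiangsu 213001

Funding: National Science and Technology Basic Work Program (2015FY111700-6)

About the author:
ZHANG Li (1977-), male, PhD, associate professor; his research focuses on feature engineering and machine learning

Corresponding author: ZHANG Li, E-mail: zhangli_3913@163.com

CLC number: TP181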

Abstract: Feature selection is an essential step in data preprocessing in bioinformatics. Traditional feature selection algorithms overlook the dependency-based relevance and the redundancy among features. This paper therefore proposes a feature selection algorithm based on joint mutual information, named joint feature relevance and redundancy (JFRR). The algorithm uses mutual information to compute the redundancy between features, and joint mutual information to compute the relevance among the selected feature set, the candidate features, and the class labels. JFRR is compared with six other feature selection algorithms on two classifiers and nine gene datasets, using the classification accuracy metrics Precision_micro and F1_micro. The experimental results show that JFRR effectively improves classification accuracy.

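| No. | Dataset | Samples | Features | Classes | Source |
|---|---|---|---|---|---|
| 1 | lung | 203 | 3312 | | ASU |
| 2 | lung_discrete | | 325 | | ASU |
| 3 | lymphoma | | 4026 | | ASU |
| 4 | Carcinom | 174 | 9182 | | ASU |
| 5 | nci9 | | 9712 | | ASU |
| 6 | GLIOMA | | 4434 | | ASU |
| 7 | dermatology | 358 | | | UCI |
| 8 | wdbc | 569 | | | UCI |
| 9 | arrhythmia | 416 | 279 | | UCI |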

| Dataset | JFRR | MID | MIQ | CMIM | JMIM | CFR | CMI-MRMR |
|---|---|---|---|---|---|---|---|
| lung | 87.555 | 86.546 | 79.907 | 84.632 | 81.804 | 75.266 | 90.526 |
| lung_discrete | 86.039 | 80.722 | 63.346 | 77.378 | 63.531 | 77.718 | 84.493 |
| lymphoma | 89.322 | 86.672 | 67.11 | 86.794 | 61.462 | 85.761 | 87.835 |
| Carcinom | 77.595 | 72.123 | 58.313 | 53.895 | 58.294 | 50.884 | 64.932 |
| nci9 | 75.554 | 72.605 | 48.581 | 72.903 | 55.971 | 69.034 | 46.108 |
| GLIOMA | 80.011 | 79.448 | 58.68 | 61.055 | 58.496 | 54.448 | 74.627 |
| dermatology | 94.41 | 93.017 | 93.298 | 93.341 | 94.175 | 93.572 | 94.41 |
| wdbc | 95.966 | 95.789 | 94.738 | 95.445 | 94.557 | 94.38 | 95.259 |
| arrhythmia | 55.677 | 55.235 | 49.122 | 52.762 | 49.962 | 56.36 | 57.473 |
| Average | 82.459 | 80.24 | 68.122 | 75.356 | 68.695 | 73.047 | 77.296 |
| WINS/TIES/LOSSES | | 9/0/0 | 9/0/0 | 9/0/0 | 9/0/0 | 8/0/1 | 6/1/2 |
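The summary rows of the table above can be cross-checked with a few lines of Python. This is a minimal sketch, assuming the WINS/TIES/LOSSES row counts the datasets on which JFRR scores higher than, equal to, or lower than each competitor; that reading reproduces the 8/0/1 and 6/1/2 entries. The function names are illustrative, and the values are copied from the JFRR, CFR, and CMI-MRMR columns.

```python
# Per-dataset scores copied from the table (rows lung ... arrhythmia).
jfrr = [87.555, 86.039, 89.322, 77.595, 75.554, 80.011, 94.41, 95.966, 55.677]
cfr = [75.266, 77.718, 85.761, 50.884, 69.034, 54.448, 93.572, 94.38, 56.36]
cmi_mrmr = [90.526, 84.493, 87.835, 64.932, 46.108, 74.627, 94.41, 95.259, 57.473]


def wins_ties_losses(ours, theirs):
    """Count datasets where `ours` is higher than, equal to, or lower than `theirs`."""
    wins = sum(a > b for a, b in zip(ours, theirs))
    ties = sum(a == b for a, b in zip(ours, theirs))
    losses = sum(a < b for a, b in zip(ours, theirs))
    return wins, ties, losses


def mean(values):
    """Arithmetic mean, as used for the Average row."""
    return sum(values) / len(values)


print(wins_ties_losses(jfrr, cfr))       # (8, 0, 1)
print(wins_ties_losses(jfrr, cmi_mrmr))  # (6, 1, 2)
print(round(mean(jfrr), 3))              # 82.459
```

The same check applied to the remaining columns gives 9/0/0 for MID, MIQ, CMIM, and JMIM.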

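| Dataset | JFRR | MID | MIQ | CMIM | JMIM | CFR | CMI-MRMR |
|---|---|---|---|---|---|---|---|
| lung | 91.106 | 90.111 | 77.344 | 89.126 | 84.694 | 85.184 | 92.563 |
| lung_discrete | 91.906 | 87.767 | 66.539 | 86.272 | 61.304 | 83.985 | 87.49 |
| lymphoma | 95.102 | 95.102 | 70.741 | 93.713 | 65.171 | 91.959 | 95.102 |
| Carcinom | 88.653 | 89.693 | 74.39 | 74.702 | 74.39 | 62.107 | 87.826 |
| nci9 | 83.15 | 79.304 | 48.168 | 76.868 | 53.494 | 75.737 | 48.168 |
| GLIOMA | 32.165 | 32.165 | 32.165 | 34.248 | 30.081 | 34.248 | 36.331 |
| dermatology | 92.432 | 91.876 | 91.876 | 91.867 | 97.466 | 92.432 | 91.876 |
| wdbc | 94.91 | 94.563 | 90.677 | 94.38 | 90.852 | 94.559 | 90.333 |
| arrhythmia | 59.445 | 58.746 | 57.509 | 57.509 | 57.509 | 57.509 | 58.464 |
| Average | 80.985 | 79.925 | 67.712 | 77.6317 | 68.329 | 75.302 | 76.461 |
| WINS/TIES/LOSSES | | 6/2/1 | 8/1/0 | 8/0/1 | 8/0/1 | 7/1/1 | 6/1/2 |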

| Dataset | JFRR | MID | MIQ | CMIM | JMIM | CFR | CMI-MRMR |
|---|---|---|---|---|---|---|---|
| lung | 95.568 | 118.655 | 57.357 | 127.739 | 109.277 | 126.454 | 882.251 |
| lung_discrete | 2.718 | 0.969 | 1.0 | 2.796 | 2.721 | 2.781 | 31.682 |
| lymphoma | 37.508 | 9.497 | 9.763 | 27.825 | 27.308 | 27.966 | 326.731 |
| Carcinom | 198.758 | 88.56 | 100.252 | 212.276 | 369.935 | 369.935 | 2298.744 |
| nci9 | 76.264 | 27.323 | 25.375 | 51.558 | 50.038 | 48.722 | 25.375 |
| GLIOMA | 46.543 | 20.548 | 22.712 | 38.615 | 70.993 | 65.942 | 353.598 |
| dermatology | 0.868 | 0.31 | 0.318 | 0.55 | 0.551 | 0.554 | 6.811 |
| wdbc | 1.434 | 0.598 | 0.591 | 1.31 | 1.311 | 1.285 | 14.378 |
| arrhythmia | 15.985 | 5.952 | 8.217 | 19.022 | 23.048 | 16.727 | 214.663 |
| Average | 52.85 | 30.268 | 25.065 | 53.521 | 72.798 | 73.374 | 461.581 |

| Algorithm | Accounts for changes in interaction relevance among features | Feature redundancy |
|---|---|---|
| MID | $I(f_k; C)$ | Yes |
| CMIM | $I(f_k; C)$ | Yes |
| MIQ | $I(f_k; C)$ | Yes |
| JMIM | $I(f_k, f_i; C)$ | No |
| CFR | $I(f_k; C)$ | Yes |
| JFRR | $I(f_k, f_i; C)$ | Yes |
| CMI-MRMR | $I(f_i; C \mid f_k)$ | Yes |
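For reference, the three relevance terms in the table above are standard information-theoretic quantities. The definitions below are the textbook forms for discrete variables, not formulas reproduced from the paper; here $x$ ranges over the values of $f_k$, $y$ over the values of $f_i$, and $c$ over the class labels.

```latex
\begin{align}
I(f_k; C) &= \sum_{x}\sum_{c} p(x, c)\,\log\frac{p(x, c)}{p(x)\,p(c)} \\
I(f_k, f_i; C) &= \sum_{x}\sum_{y}\sum_{c} p(x, y, c)\,\log\frac{p(x, y, c)}{p(x, y)\,p(c)} \\
I(f_i; C \mid f_k) &= \sum_{x}\sum_{y}\sum_{c} p(x, y, c)\,\log\frac{p(x)\,p(x, y, c)}{p(x, y)\,p(x, c)}
\end{align}
```

The chain rule links them: $I(f_k, f_i; C) = I(f_k; C) + I(f_i; C \mid f_k)$. The joint term therefore includes whatever $f_i$ adds about $C$ once $f_k$ is known, which the marginal term $I(f_k; C)$ alone cannot capture.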

Received: 2021-05-18

Accepted: 2022-07-01

Revised: 2022-04-28

Published online: 2022-10-25

Issue date: 2022-09-25

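Over the past few decades, bioinformatics has produced a large volume of gene data^[1-2]. Such data typically have few samples, high dimensionality, and high noise^[3]. Handling the resulting irrelevant and redundant features poses a major challenge for dimensionality reduction. Common dimensionality reduction approaches fall into two categories: feature extraction^[4] and feature selection^[5]. Feature selection has attracted considerable attention because it removes irrelevant and redundant features while retaining the relevant original features.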

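Research on feature selection proceeds at two levels: the data level (filter methods) and the algorithm level (wrapper and embedded methods)^[6-8]. Filter methods have become a research focus because of their low computational cost, independence from any specific classifier, and broad applicability. Common information-theoretic filter algorithms include those that adopt an average-redundancy strategy (MID^[9], MIQ^[9], JMI^[10], CFR^[11], etc.) and those that adopt a "max-min" extreme criterion (CMIM^[12], JMIM^[13], DWUR^[14], etc.). However, these algorithms neglect to assess the relevance and redundancy of interaction-dependent features.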
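The two strategies can be illustrated with one representative of each. The sketch below uses the commonly cited forms of MID (relevance minus mean redundancy) and CMIM (minimum conditional relevance over the selected features), with plug-in entropy estimates on discrete data. The helper names and the toy data are hypothetical, and MIQ would replace the difference in `mid_score` with a quotient.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy, in bits, of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())


def mi(x, y):
    """I(x; y) = H(x) + H(y) - H(x, y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def cmi(x, y, z):
    """I(x; y | z) = H(x, z) + H(y, z) - H(x, y, z) - H(z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def mid_score(cand, selected, c):
    """Average-redundancy strategy (MID): relevance minus mean redundancy."""
    if not selected:
        return mi(cand, c)
    return mi(cand, c) - sum(mi(cand, s) for s in selected) / len(selected)


def cmim_score(cand, selected, c):
    """Max-min strategy (CMIM): worst-case conditional relevance; the
    candidate with the largest such minimum is selected."""
    if not selected:
        return mi(cand, c)
    return min(cmi(cand, c, s) for s in selected)


# Toy data (hypothetical): f2 duplicates f1; f3 is a different noisy copy of c.
c = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 0, 1, 1, 1, 1, 0]
f2 = list(f1)
f3 = [0, 0, 1, 0, 1, 1, 0, 1]

for name, f in (("f2", f2), ("f3", f3)):
    print(name, round(mid_score(f, [f1], c), 3), round(cmim_score(f, [f1], c), 3))
```

With `f1` already selected, both criteria rank `f3` above the duplicate `f2`: MID penalizes `f2` for its redundancy with `f1`, and CMIM gives it a conditional relevance of zero.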

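This paper therefore proposes joint feature relevance and redundancy (JFRR), a feature selection algorithm that uses joint mutual information and mutual information to assess the relevance between features and class labels and the redundancy among features. The algorithm uses joint mutual information to compute the relevance between a candidate feature and the class label given the already selected features, and mutual information to compute the redundancy between the selected features and the candidate feature. In comparative experiments on nine benchmark gene datasets, JFRR outperforms the other feature selection algorithms (MID, MIQ, CMIM, JMIM, CFR, and CMI-MRMR^[15]).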
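The idea can be sketched as a greedy forward search. The exact JFRR scoring formula is not given in this excerpt, so the score below is an assumption made for illustration only: the mean, over the selected features, of the joint-mutual-information relevance minus the mutual-information redundancy. The function names and the toy data are likewise hypothetical.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy, in bits, of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in Counter(rows).values())


def mutual_info(x, y):
    """I(x; y) = H(x) + H(y) - H(x, y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def joint_mutual_info(x, y, c):
    """I(x, y; c) = H(x, y) + H(c) - H(x, y, c)."""
    return entropy(x, y) + entropy(c) - entropy(x, y, c)


def select_features(features, labels, k):
    """Greedy forward selection with an ASSUMED JFRR-style score
    (not the paper's formula): mean of I(f, s; C) - I(f; s) over selected s."""
    remaining = dict(features)
    # Seed with the feature that is most relevant on its own.
    first = max(remaining, key=lambda name: mutual_info(remaining[name], labels))
    selected = [first]
    del remaining[first]
    while remaining and len(selected) < k:
        def score(name):
            cand = remaining[name]
            total = sum(
                joint_mutual_info(cand, features[s], labels)
                - mutual_info(cand, features[s])
                for s in selected
            )
            return total / len(selected)

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy XOR data (hypothetical): the label depends on a and b only jointly.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [x ^ y for x, y in zip(a, b)]
features = {"a": a, "a_copy": list(a), "b": b}

print(select_features(features, labels, 2))
```

In this toy example every feature has zero mutual information with the label on its own, so the first pick falls to `a` by insertion order. The joint term then favors `b`, which determines the label together with `a`, over the redundant `a_copy`.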

4. Conclusion

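As high-dimensional feature data in gene datasets keep growing, the relationships among features become increasingly complex, with large numbers of irrelevant and redundant features. Traditional feature selection algorithms often overlook the connection between relevance and redundancy among features. This paper proposes JFRR, an algorithm based on joint mutual information. It uses the relationship between mutual information and joint mutual information to dynamically analyze and adjust the relevant and redundant information among features and between features and class labels, thereby removing irrelevant and redundant features and improving the quality of the selected feature subset. To validate JFRR, experiments were conducted on nine gene datasets, and the quality of the selected feature subsets was evaluated with two classifiers (C4.5 and SVM) and two classification accuracy metrics (fmc and pcm, i.e., F1_micro and Precision_micro). The results show that JFRR clearly outperforms the six feature selection algorithms MID, MIQ, CMIM, JMIM, CFR, and CMI-MRMR.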
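The two metrics can be computed directly from predicted and true labels. The following is a minimal pure-Python sketch of micro-averaged precision and F1, which pool true positives, false positives, and false negatives over all classes before forming the ratios. The function name and the example labels are illustrative and not taken from the paper's experiments.

```python
def micro_precision_f1(y_true, y_pred):
    """Micro-averaged precision and F1 for single-label classification."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # predicted this class and it was correct
            elif pred == label:
                fp += 1  # predicted this class but the truth differs
            elif truth == label:
                fn += 1  # missed a sample of this class
    if tp == 0:
        return 0.0, 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1


# Hypothetical labels: 4 of 6 predictions are correct.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
precision, f1 = micro_precision_f1(y_true, y_pred)
print(round(precision, 4), round(f1, 4))
```

When every sample has exactly one true and one predicted label, each misclassification counts once as a false positive and once as a false negative, so micro-averaged precision, recall, and F1 all equal plain accuracy.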

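On some gene datasets, however, JFRR still selects unsatisfactory feature subsets. Future work will further study and refine the relationship between mutual information and joint mutual information to optimize JFRR, and will validate the algorithm on a wider range of gene datasets to improve classification accuracy.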

References (20)

[1] DABBA A, ABDELKAMEL T, SAMY M, et al. Gene selection and classification of microarray data method based on mutual information and moth flame algorithm[J]. Expert Systems with Applications, 2021, 166: 114012. doi: 10.1016/j.eswa.2020.114012
[2] HAMBALI M A, OLADELE T O, ADEWOLE K S. Microarray cancer feature selection: Review, challenges and research directions[J]. International Journal of Cognitive Computing in Engineering, 2020, 1: 78-97. doi: 10.1016/j.ijcce.2020.11.001
[3] WANG X, HU X G. Overview on feature selection in high-dimensional and small-sample-size classification[J]. Journal of Computer Applications, 2017, 37(9): 2433-2438. doi: 10.11772/j.issn.1001-9081.2017.09.2433 (in Chinese)
[4] WANG X, LIU J, CHENG Y, et al. Dual hypergraph regularized PCA for biclustering of tumor gene expression data[J]. IEEE Transactions on Knowledge and Data Engineering, 2019, 31(12): 2292-2303. doi: 10.1109/TKDE.2018.2874881
[5] LIU H, GREGORY D. A semi-parallel framework for greedy information-theoretic feature selection[J]. Information Sciences, 2019, 492: 13-28. doi: 10.1016/j.ins.2019.03.075
[6] CAI J, LUO J W, WANG S L, et al. Feature selection in machine learning: A new perspective[J]. Neurocomputing, 2018, 300: 70-79. doi: 10.1016/j.neucom.2017.11.077
[7] LEE C Y, CAI J Y. LASSO variable selection in data envelopment analysis with small datasets[J]. Omega, 2020, 91: 102019. doi: 10.1016/j.omega.2018.12.008
[8] GAO L Y, WU W G. Relevance assignation feature selection method based on mutual information for machine learning[J]. Knowledge-Based Systems, 2020, 209: 106439. doi: 10.1016/j.knosys.2020.106439
[9] XIE J Y, WANG M Z, ZHOU Y, et al. Differential expression gene selection algorithms for unbalanced gene datasets[J]. Chinese Journal of Computers, 2019, 42(6): 1232-1251. doi: 10.11897/SP.J.1016.2019.01232 (in Chinese)
[10] MACEDO F, OLIVEIRA M R, PACHECO A, et al. Theoretical foundations of forward feature selection methods based on mutual information[J]. Neurocomputing, 2019, 325: 67-89. doi: 10.1016/j.neucom.2018.09.077
[11] GAO W F, HU L, ZHANG P, et al. Feature selection considering the composition of feature relevancy[J]. Pattern Recognition Letters, 2018, 112: 70-74. doi: 10.1016/j.patrec.2018.06.005
[12] BROWN G, POCOCK A, ZHAO M J, et al. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection[J]. The Journal of Machine Learning Research, 2012, 13: 27-66.
[13] BENNASAR M, HICKS Y, SETCHI R. Feature selection using joint mutual information maximisation[J]. Expert Systems with Applications, 2015, 42(22): 8520-8532. doi: 10.1016/j.eswa.2015.07.007
[14] XIAO L J, GUO J C, GU X Y. Algorithm for selection of features based on dynamic weights using redundancy[J]. Journal of Xidian University, 2019, 46(5): 155-161. (in Chinese)
[15] GU X Y, GUO J C, XIAO L J, et al. Conditional mutual information-based feature selection algorithm for maximal relevance minimal redundancy[J]. Applied Intelligence, 2022, 52(2): 1436-1447. doi: 10.1007/s10489-021-02412-4
[16] MEYER P E, SCHRETTER C, BONTEMPI G. Information-theoretic feature selection in microarray data using variable complementarity[J]. IEEE Journal of Selected Topics in Signal Processing, 2008, 2(3): 261-274. doi: 10.1109/JSTSP.2008.923858
[17] ZHANG P, GAO W F. Feature selection considering uncertainty change ratio of the class label[J]. Applied Soft Computing, 2020, 95: 106537. doi: 10.1016/j.asoc.2020.106537
[18] CHE J X, YANG Y L, LI L, et al. Maximum relevance minimum common redundancy feature selection for nonlinear data[J]. Information Sciences, 2017, 409-410: 68-86. doi: 10.1016/j.ins.2017.05.013
[19] ZHANG Y S, ZHANG Q, CHEN Z J, et al. Feature assessment and ranking for classification with nonlinear sparse representation and approximate dependence analysis[J]. Decision Support Systems, 2019, 122: 113064. doi: 10.1016/j.dss.2019.05.004
[20] XIE J Y, DING L J, WANG M Z. Spectral clustering based unsupervised feature selection algorithms[J]. Journal of Software, 2020, 31(4): 1009-1024. doi: 10.13328/j.cnki.jos.005927 (in Chinese)
