Prediction of Molecular Biological Activity Based on Graph Convolution Method of Multi-Characteristic Fusion

TAN Lulu; ZHANG Xinxin; ZHOU Yinzuo

doi:10.12178/1001-0548.2021158

Drug development is slow and costly. Computer-aided virtual screening can effectively improve the efficiency of lead compound screening. This paper proposes a new attention-based feature fusion scheme, called the multi-feature fusion scheme. Combined with an existing edge-attention graph convolutional network, the scheme is applied to bioactivity prediction on several types of bioactivity data sets selected from PubChem, the public chemical database. Learning features directly from the molecular graph avoids the instability and unreliability of manually computed descriptors, and the attention-based multi-feature fusion scheme lets the model adaptively fuse multiple edge-attribute features. The results show that the method predicts the biological activity of molecules more accurately than the other machine learning methods compared.

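Drug development is slow and expensive, and the attrition rate is high. At present, nine out of every ten drug candidates fail in phase I clinical trials or at regulatory approval^[1]. To improve the efficiency of drug discovery, shorten the development cycle of new drugs, and raise the success rate, medicinal chemists proposed the concept of quantitative structure-activity relationships (QSAR). QSAR quantitatively measures the biological activity of a series of derivatives of a known lead compound, analyzes the relationship between the derivatives' physicochemical parameters and their biological activity, builds a mathematical model relating structure to activity, and uses that model to guide drug molecule design^[2]. In the early stage, machine learning was the commonly used modeling approach in QSAR. Because traditional machine learning methods can only handle fixed-size inputs, most early QSAR models relied on manually generated, fixed-length molecular descriptors tailored to each task. Commonly used molecular descriptors include^[3]: 1) molecular fingerprints, which encode molecular structure as a series of binary digits representing specific substructures^[3]; 2) one-/two-dimensional molecular descriptors, which are derived by statisticians and chemists from the physicochemical and differential-topological properties of molecules^[3]. Common modeling methods include linear methods (such as linear regression) and nonlinear methods (such as support vector machines and random forests). In recent years, deep learning has become the latest research direction in QSAR modeling.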

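Over the past decade, deep learning has become the main modeling approach in many fields, particularly in medicine, where it is applied to bioactivity prediction, de novo drug design, medical image analysis, and synthesis prediction. The convolutional neural network (CNN) is a special deep learning architecture that has successfully solved problems on structured data such as images^[4]. However, when a graph has irregular shape and size, its nodes have no spatial order, and a node's neighbors also depend on position, traditional CNNs cannot be applied to the graph directly. For such non-Euclidean structured data, researchers proposed the graph convolutional network (GCN) and derived various architectures from it. Reference [5] proposed the first graph neural network (GNN), an architecture based on recursive neural networks that learns on undirected, directed, and cyclic graphs. Reference [6] proposed a graph convolutional network based on spectral graph theory. Other forms of GCN now exist, such as the graph attention network (GAT), graph autoencoders, and spatio-temporal graph convolution.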

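In recent years, many studies have applied graph convolution to predicting the biological activity of molecules. In chemical graph theory, a compound's structure is usually represented as a hydrogen-depleted (hydrogens omitted) molecular graph: each compound is an undirected graph with atoms as nodes and bonds as edges. Atoms and bonds both carry many attributes, such as atom type and bond type. Reference [7] built a graph convolution model from the attributes of nodes (atoms) and edges (bonds). Reference [8] created atom feature vectors and bond feature vectors and concatenated the two into atom-bond feature vectors. Reference [9] proposed the graph memory network (graphMem), a memory-augmented neural network that can handle molecular graphs with multiple bond types. MPNN^[10] summarized the GNN models available at that stage, abandoned handcrafted features, and marked an important step in applying GNNs to molecular graphs. SchNet^[11] advanced the use of GNNs in molecular dynamics simulation, making them conform to the constraint equations of physics. DimeNet^[12] models directional information in molecules, further improving prediction accuracy. None of these studies, however, distinguishes node features from bond attributes or attends to the relationships among them. In fact, assigning different weights to the various types of interaction between atom pairs is the more accurate approach.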
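As a minimal sketch of the hydrogen-depleted molecular graph described above, the following pure-Python snippet encodes ethanol (heavy atoms C, C, O) as an undirected graph with atoms as nodes and bonds as edges. The `MolGraph` class and its field names are hypothetical and serve only as illustration; they are not taken from any of the cited models.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Hydrogen-depleted molecular graph: atoms are nodes, bonds are edges."""
    atoms: list                                  # element symbol per heavy atom
    bonds: dict = field(default_factory=dict)    # (i, j) with i < j -> bond order

    def add_bond(self, i: int, j: int, order: int = 1) -> None:
        # Store each undirected edge once, under a sorted key.
        self.bonds[(min(i, j), max(i, j))] = order

    def adjacency(self) -> list:
        # Symmetric 0/1 adjacency matrix of the undirected graph.
        n = len(self.atoms)
        adj = [[0] * n for _ in range(n)]
        for (i, j) in self.bonds:
            adj[i][j] = adj[j][i] = 1
        return adj

    def degree(self, i: int) -> int:
        # Number of heavy-atom neighbours of atom i.
        return sum(self.adjacency()[i])


# Ethanol, CH3-CH2-OH, with hydrogens omitted: C-C-O.
ethanol = MolGraph(atoms=["C", "C", "O"])
ethanol.add_bond(0, 1)
ethanol.add_bond(1, 2)
```

Atom and bond attributes (atom type, bond type, and so on) would be attached to the nodes and edges of such a graph before it is passed to a graph convolution model.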

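Recently, reference [13] proposed the edge attention graph convolutional network (EAGCN), which introduces an edge attention layer to evaluate the weight of every edge in a molecule. An attribute tensor is constructed in advance; after it passes through the attention layer, several attention weight tensors are produced, each containing all possible attention weights for one edge attribute in the data set (of molecular graphs). An attention matrix is then built by looking up the value of each bond of the molecule in that weight tensor. This lets the model learn different attention weights at different levels and for different edge attributes. Experiments show that the EAGCN framework is widely applicable and that it learns task-specific molecular features directly from the graph structure, avoiding the errors introduced in the data preprocessing stage.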
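The lookup step described above can be illustrated with a small pure-Python sketch for a single edge attribute (bond order). It is a simplification under stated assumptions, not the EAGCN implementation: the weight values are made up rather than learned, and self-loops, normalization, and the learnable linear transform of a real graph convolution layer are omitted. The function names `attention_matrix` and `propagate` are hypothetical.

```python
# One weight per possible value of the edge attribute "bond order".
# In a trained model these would be learned; the numbers here are made up.
bond_order_weights = {"single": 0.2, "double": 0.7, "triple": 0.9, "aromatic": 0.5}


def attention_matrix(n_atoms, bonds, weights):
    """Build the attention matrix by looking up each bond's attribute value."""
    att = [[0.0] * n_atoms for _ in range(n_atoms)]
    for (i, j), value in bonds.items():
        att[i][j] = att[j][i] = weights[value]   # undirected graph: symmetric
    return att


def propagate(att, features):
    """One attention-weighted neighbour aggregation step: H' = A_att @ H."""
    n, d = len(features), len(features[0])
    return [
        [sum(att[i][k] * features[k][c] for k in range(n)) for c in range(d)]
        for i in range(n)
    ]


# Acetaldehyde heavy atoms: C(0)-C(1)=O(2).
bonds = {(0, 1): "single", (1, 2): "double"}
att = attention_matrix(3, bonds, bond_order_weights)

# Toy one-hot node features: column 0 = carbon, column 1 = oxygen.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
updated = propagate(att, features)
```

In this toy case the middle carbon receives a larger contribution from its double-bonded oxygen than from its single-bonded carbon neighbour, which is the kind of distinction between bond types that per-attribute attention weights are meant to capture.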

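Building on the EAGCN framework, and considering the instability caused by its inability to learn feature importance adaptively, this paper proposes an attention graph convolution model based on multi-feature fusion (multi-feature fusion edge attention graph convolutional network, MF_EAGCN). Its multi-feature fusion scheme is a feature fusion approach based on the self-attention mechanism, which effectively lets the model adaptively adjust the weight allocation among multiple feature tensors. This paper applies several filtering methods to restrict targets and other fields in the PubChem database, selects several bioactivity data sets of different types, applies the proposed algorithm together with several baseline models to them, and analyzes and evaluates the performance of each.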
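To make the idea of adaptive weight allocation concrete, the sketch below fuses several per-attribute feature vectors with softmax attention weights. It is only an illustration under assumptions: the exact MF_EAGCN formulation is not given in this section, so the scoring rule (a dot product with a `query` vector standing in for a learned parameter) and the function name `fuse` are hypothetical.

```python
import math


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(feature_vectors, query):
    """Attention-weighted sum of several same-length feature vectors."""
    # One relevance score per feature vector (dot product with the query).
    scores = [sum(q * x for q, x in zip(query, f)) for f in feature_vectors]
    alphas = softmax(scores)
    dim = len(feature_vectors[0])
    fused = [
        sum(a * f[c] for a, f in zip(alphas, feature_vectors))
        for c in range(dim)
    ]
    return fused, alphas


# Three toy per-attribute feature vectors for one atom (values are made up).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 1.0]   # stands in for a learned parameter
fused, alphas = fuse(features, query)
```

Because the weights `alphas` are computed from the features themselves, they can change from input to input, in contrast to a fusion with fixed, hand-set weights.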

3. Conclusion

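This paper proposes a multi-feature fusion scheme based on the self-attention mechanism, which effectively optimizes a graph convolutional network model based on edge attention. An edge-attention graph convolutional network architecture is applied to the different kinds of bioactivity prediction tasks selected in this paper, which avoids the errors introduced by manual feature engineering; comparison with several machine learning baseline algorithms verifies the effectiveness of the proposed algorithm. On this basis, to address a problem in the earlier model, namely its inability to set the weights of edge-attribute features adaptively, this paper proposes a molecular multi-feature fusion scheme that improves the model's feature extraction ability: several features are fused adaptively through self-attention, which effectively solves the problem and yields better prediction performance. The data sets used here are relatively small; future work will extend the method to larger data sets and to other bioactivity prediction tasks. When applied to larger data sets, the model can be optimized for specific tasks, which may improve its generalization performance and stability.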

References (24)

[1]	DREWS J. Drug discovery: A historical perspective[J]. Science, 2000, 287(5460): 1960-1964.
[2]	DEVILLERS J. Neural networks in QSAR and drug design[M]. [S. l.]: Academic Press, 1996.
[3]	SUN M, ZHAO S, GILVARY C, et al. Graph convolutional networks for computational drug development and discovery[J]. Briefings in Bioinformatics, 2020, 21(3): 919-935.
[4]	KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25(6): 1097-1105.
[5]	GORI M, MONFARDINI G, SCARSELLI F. A new model for learning in graph domains[C]//2005 IEEE International Joint Conference on Neural Networks. [S.l.]: IEEE, 2005: 729-734.
[6]	BRUNA J, ZAREMBA W, SZLAM A, et al. Spectral networks and locally connected networks on graphs[EB/OL]. [2020-10-11]. https://arxiv.org/pdf/1312.6203.pdf.
[7]	KEARNES S, MCCLOSKEY K, BERNDL M, et al. Molecular graph convolutions: Moving beyond fingerprints[J]. Journal of Computer-Aided Molecular Design, 2016, 30(8): 595-608.
[8]	CONNOR W C, BARZILAY R, GREEN W H, et al. Convolutional embedding of attributed molecular graphs for physical property prediction[J]. Journal of Chemical Information and Modeling, 2017, 57(8): 1757-1772.
[9]	PHAM T, TRAN T, VENKATESH S. Graph memory networks for molecular activity prediction[C]//2018 24th International Conference on Pattern Recognition (ICPR). [S.l.]: IEEE, 2018: 639-644.
[10]	GILMER J, SCHOENHOLZ S S, RILEY P F, et al. Neural message passing for quantum chemistry[C]//International Conference on Machine Learning. [S.l.]: PMLR, 2017: 1263-1272.
[11]	SCHÜTT K T, KINDERMANS P J, SAUCEDA H E, et al. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: Curran Associates Inc, 2017: 992-1002.
[12]	KLICPERA J, GROß J, GÜNNEMANN S. Directional message passing for molecular graphs[EB/OL]. [2020-10-21]. https://arxiv.org/abs/2003.03123.
[13]	SHANG C, LIU Q, CHEN K S, et al. Edge attention-based multi-relational graph convolutional networks[EB/OL]. [2020-10-25]. https://arxiv.org/abs/1802.04944v2.
[14]	DAHL G E, JAITLY N, SALAKHUTDINOV R. Multi-task neural networks for QSAR predictions[EB/OL]. [2020-10-28]. https://arxiv.org/pdf/1406.1231.pdf.
[15]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[EB/OL]. [2020-11-10]. https://arxiv.org/abs/1706.03762.
[16]	BOLTON E E, WANG Y, THIESSEN P A, et al. PubChem: Integrated platform of small molecules and biological activities[C]//Annual Reports in Computational Chemistry. [S.l.]: Elsevier, 2008: 217-241.
[17]	WEININGER D. SMILES, a chemical language and information system[J]. Journal of Chemical Information and Computer Sciences, 1988, 28(1): 31-36.
[18]	HELLER S, MCNAUGHT A, STEIN S, et al. InChI-the worldwide chemical structure identifier standard[J]. Journal of Cheminformatics, 2013, 5(1): 1-9.
[19]	MAURI A, CONSONNI V, PAVAN M, et al. Dragon software: An easy approach to molecular descriptor calculations[J]. Match, 2006, 56(2): 237-248.
[20]	MAURI A. alvaDesc: A tool to calculate and analyze molecular descriptors and fingerprints[M]// Ecotoxicological QSARs. New York: [s.n.], 2020: 801-820.
[21]	FRISCH M J, TRUCKS G W, SCHLEGEL H B, et al. Gaussian09, revision A.1[EB/OL]. [2020-11-10]. https://www.scienceopen.com/document?vid=45e9a2b5-64f1-4e2c-8a2e-0e0bec409f69.
[22]	YAP C W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints[J]. Journal of Computational Chemistry, 2011, 32(7): 1466-1474.
[23]	O'BOYLE N M, BANCK M, JAMES C A, et al. Open Babel: An open chemical toolbox[J]. Journal of Cheminformatics, 2011, 3(1): 33.
[24]	LANDRUM G. RDKit documentation[EB/OL]. [2021-1-20]. https://buildmedia.readthedocs.org/media/pdf/rdkit/latest/rdkit.pdf.

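Atom attribute	Description	Value type
Atomic number	Position of the atom in the periodic table	Int
Number of connected atoms	Number of neighboring nodes	Int
Number of adjacent hydrogens	Number of hydrogen atoms	Int
Aromaticity	Whether the atom is aromatic	Boolean
Formal charge	Formal charge on the atom	Int
Ring status	Whether the atom is in a ring	Boolean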

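Bond attribute	Description	Value type
Atom pair type	Types of the atoms joined by the bond	Int
Bond order	Single / double / triple / aromatic bond	Int
Aromaticity	Whether the bond is aromatic	Int
Conjugation	Whether the bond is conjugated	Boolean
Ring status	Whether the bond is in a ring	Boolean
Placeholder	Whether a bond exists between the atoms	Boolean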

PubChem AID	Screening criterion	Active molecules	Inactive molecules
1851(1a2)	Cytochrome P450, family 1, subfamily A, polypeptide 2	5997	7242
1851(2c19)	Cytochrome P450, family 2, subfamily C, polypeptide 19	5905	7522
1851(2d6)	Cytochrome P450, family 2, subfamily D, polypeptide 6, isoform 2	2769	11127
1851(3a4)	Cytochrome P450, family 3, subfamily A, polypeptide 4	5265	7732
492992	Identify inhibitors of the two-pore domain potassium channel (KCNK9)	2097	2820
651739	Inhibition of Trypanosoma cruzi	4051	1326
652065	Identify molecules that bind r(CAG) RNA repeats	2969	1288

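Baseline method	Hyperparameter	Value range	Meaning
RF	Ntrees	($50,100,\cdots ,500$)	Number of trees
	max_depth	($1, 5,\cdots ,50$)	Maximum depth of each tree
	max_features	($1, 5,\cdots ,50$)	Maximum number of features considered at each split
SVM	Kernel	RBF	Kernel function
	C	(1,10,100)	Penalty coefficient
	γ	(0.1,0.001,0.0001,0.00001,1,10,100)	Controls how the data are mapped into the new feature space
DNN	Epoch	100	Number of training epochs
	Batch size	100	Number of training samples per batch
	Hidden layers	(2,3,4)	Number of hidden layers
	Number of neurons	(10,50,100,500,700,1000)	Neurons per layer
	Activation function	ReLU	Neuron activation function
	Loss function	binary_crossentropy	Loss function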
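The value ranges in the table above define a finite set of candidate settings for each baseline. The sketch below enumerates them for the SVM and DNN rows, assuming the ranges were explored exhaustively as a grid; the source does not state the search strategy, and the dictionary keys and the `expand` helper are hypothetical names. The RF rows are left out because their ranges are elided in the table.

```python
from itertools import product

# Value ranges copied from the SVM and DNN rows of the table.
svm_grid = {
    "kernel": ["rbf"],
    "C": [1, 10, 100],
    "gamma": [0.1, 0.001, 0.0001, 0.00001, 1, 10, 100],
}
dnn_grid = {
    "hidden_layers": [2, 3, 4],
    "neurons": [10, 50, 100, 500, 700, 1000],
}


def expand(grid):
    """Return every combination of the listed values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


svm_candidates = expand(svm_grid)   # 1 * 3 * 7 = 21 settings
dnn_candidates = expand(dnn_grid)   # 3 * 6 = 18 settings
```

Each candidate dictionary would then be used to train and validate one model, with the best-scoring setting retained.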

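Model	Hyperparameter	Value range	Meaning
EAGCN	Batch size	64	Number of samples per training batch
	Epoch	100	Number of training epochs
	weight_decay	0.00001	Weight decay rate
	dropout	0.5	Dropout rate
	Activation function	ReLU	Activation function
	Loss function	binary_crossentropy	Loss function
	kernel_size	1	Convolution kernel size
	stride	1	Convolution kernel stride
	n_sgcn1	(30, 10, 10, 10, 10)	Output channels of the multi-feature graph convolution layer
MF_EAGCN	Batch size	64	Number of samples per training batch
	Epoch	100	Number of training epochs
	weight_decay	0.00001	Weight decay rate
	dropout	0.5	Dropout rate
	Activation function	ReLU	Activation function
	Loss function	binary_crossentropy	Loss function
	kernel_size	1	Convolution kernel size
	stride	1	Convolution kernel stride
	n_sgcn1	(20, 20, 20, 20, 20)	Output channels of the multi-feature graph convolution layer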

Corresponding author: CHEN Bin, bchen63@163.com