A Novel Fusion Model for Predicting Pathogenic Amino Acid Substitutions

  • Abstract: Amino acid substitutions often alter protein structure and function and can thereby cause disease. Several computational methods have been proposed to predict the pathogenicity of amino acid substitutions. This paper constructs a novel fusion model intended to improve both prediction performance and generalization. First, biological features that influence pathogenicity are extracted, and the optimal feature subset is selected by recursive feature elimination (RFE). Next, a deep learning model comprising a convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM) network is built to extract features, and the two types of features are fused by concatenation to serve as the model input. Finally, a fusion model based on XGBoost, CatBoost, LightGBM, and Random Forest is constructed to predict the pathogenicity of amino acid substitutions. The fusion model achieves an accuracy of 92.8% under 10-fold cross-validation and 93.1% on a blind test, which the authors report as the highest prediction accuracy and generalization to date. The tool can assist clinical diagnosis and drug design and help reduce research and development costs.
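The two fusion steps named in the abstract can be illustrated with a minimal sketch. It rests on two assumptions that the abstract does not confirm: that "the two types of features" are the RFE-selected biological features and the CNN/BiLSTM outputs, and that the four base learners are combined by equal-weight soft voting (the actual scheme could be stacking or weighted voting). The function names `concat_features` and `soft_vote` and all numeric values are hypothetical; in practice the probabilities would come from trained XGBoost, CatBoost, LightGBM, and Random Forest models.

```python
from __future__ import annotations

from statistics import fmean
from typing import Sequence


def concat_features(handcrafted: Sequence[float], deep: Sequence[float]) -> list[float]:
    """Fuse two feature vectors for one variant by simple concatenation."""
    return list(handcrafted) + list(deep)


def soft_vote(probabilities: Sequence[float], threshold: float = 0.5) -> tuple[float, int]:
    """Average per-model pathogenic probabilities and threshold the mean.

    Assumption: equal-weight soft voting. The abstract does not state how
    the four base learners are actually combined.
    """
    mean_p = fmean(probabilities)
    return mean_p, int(mean_p >= threshold)


# Hypothetical values for a single variant.
selected_bio = [0.12, 1.0, 0.47]  # stand-in for features kept by RFE
deep_feats = [0.80, 0.05]         # stand-in for CNN + BiLSTM outputs
fused = concat_features(selected_bio, deep_feats)

# Hypothetical probabilities from the four base learners.
base_probs = [0.91, 0.84, 0.88, 0.73]
mean_p, label = soft_vote(base_probs)
print(len(fused), round(mean_p, 2), label)
```

Here a label of 1 stands for "pathogenic" and 0 for "benign". Soft voting is used only because it is the simplest way to combine probabilistic classifiers; it should not be read as the paper's method.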

     
