Abstract:
Amino acid substitutions often alter the structure and function of proteins and can thereby lead to disease. Several computational methods have been proposed to predict the pathogenicity of amino acid substitutions. This paper constructs a new fusion model to improve prediction performance and generalization. First, a range of biological features that affect pathogenicity is extracted, and the optimal feature subset is selected by recursive feature elimination (RFE). Then, a deep learning model combining convolutional neural networks and bi-directional long short-term memory is built to extract further features, and the two types of features are fused by concatenation to form the model input. Finally, a fusion model based on XGBoost, CatBoost, LightGBM and Random Forest is constructed to predict the pathogenicity of amino acid substitutions. The fusion model reaches an accuracy of 92.8% in 10-fold cross-validation and 93.1% on a blind test set, the highest prediction accuracy and generalization reported to date. The tool can be used to assist clinical diagnosis and drug design and to reduce research and development costs.
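The two fusion steps described above, splicing the two feature types and combining the four base learners, can be illustrated with a minimal sketch. The abstract does not state how the base models' outputs are combined, so the soft-voting rule (averaging predicted probabilities) is an assumption, and the names `splice_features`, `fuse_predictions` and the toy stand-in models are hypothetical, not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

# A base model is anything that maps a feature vector to P(pathogenic).
# In the paper these would be XGBoost, CatBoost, LightGBM and Random Forest;
# here they are hypothetical stand-ins.
Model = Callable[[List[float]], float]


def splice_features(selected: Sequence[float], learned: Sequence[float]) -> List[float]:
    """Concatenate the RFE-selected features with the deep-model features."""
    return list(selected) + list(learned)


def fuse_predictions(
    models: Sequence[Model], x: List[float], threshold: float = 0.5
) -> Tuple[float, bool]:
    """Soft voting (an assumed fusion rule): average the base models'
    probabilities and call the substitution pathogenic at or above threshold."""
    if not models:
        raise ValueError("at least one base model is required")
    mean_prob = sum(model(x) for model in models) / len(models)
    return mean_prob, mean_prob >= threshold


if __name__ == "__main__":
    # Toy inputs: two hand-crafted features and one learned feature.
    x = splice_features([0.2, 0.7], [0.5])
    # Four constant stand-in models, one per base learner.
    toy_models = [lambda v: 0.9, lambda v: 0.7, lambda v: 0.8, lambda v: 0.6]
    prob, is_pathogenic = fuse_predictions(toy_models, x)
    print(x, round(prob, 3), is_pathogenic)
```

Soft voting is only one option; stacking with a meta-learner would keep the same `splice_features` step and replace the averaging in `fuse_predictions`.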