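A promoter is a DNA sequence, usually located upstream of a gene, that RNA polymerase binds specifically in order to initiate transcription. As a key element of transcription initiation, it recruits RNA polymerase to the template DNA, and this binding is the first step of gene expression and transcriptional regulation [1].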
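In prokaryotes, the σ factor of RNA polymerase specifically recognizes and binds the promoter. Escherichia coli has several σ factors, which are grouped by molecular weight into seven classes: σ70, σ54, σ38, σ32, σ28, σ24, and σ19. Of these seven known classes, the first six are highly conserved, whereas σ19 is absent from most genomes [2]. Each class has a specific biological function [3-6]: σ70 is mainly responsible for transcribing housekeeping genes; σ54 is regarded as a regulator of nitrogen metabolism that also controls some auxiliary processes; σ38 regulates stationary-phase genes; σ32 is the heat-shock σ factor; σ28 is involved in flagellar synthesis; σ24 is associated with the response to extreme heat stress; and σ19 takes part in regulating the iron transport system. By homology, σ factors fall broadly into two families: the σ70 family, comprising σ70, σ38, σ32, σ28, σ24, and σ19, and the σ54 family.
Promoters in the E. coli genome can likewise be classified by the type of σ factor that binds them, and the consensus sequences differ between promoter types. Promoters are therefore also divided, according to the segments that are recognized, into a σ70 family and a σ54 family. A σ70 promoter, for example, has two important motif regions, the −10 region and the −35 region, located about 10 bp and 35 bp upstream of the transcription start site (TSS), respectively. The −10 region contains the conserved sequence "TATAAT", also known as the Pribnow box or TATA box; it is rich in adenine (A) and thymine (T), which facilitates unwinding and separation of the DNA double strand. The −35 region consists of the six conserved nucleotides "TTGACA" [7]. Besides σ70 itself, the other members of the σ70 family also recognize the −10 and −35 regions. By contrast, the consensus sequences of σ54 promoters and their positions differ markedly from those of σ70 promoters: the conserved regions lie at −24 and −12, with the conserved sequences "TGGCA[CT][GA]" and "TGC[AT][TA]", respectively [8].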
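The bracket notation in the σ54 consensus sequences can be read directly as regular-expression character classes. The following is a minimal Python sketch of an exact-match scan; the function name `find_motif` and the toy sequence are hypothetical, and real promoters usually deviate from the consensus, so exact matching is only an illustration and not a prediction method.

```python
import re

# Consensus strings quoted in the text; [CT] means "C or T" at that position.
SIGMA54_MINUS_24 = "TGGCA[CT][GA]"
SIGMA54_MINUS_12 = "TGC[AT][TA]"


def find_motif(sequence, pattern):
    """Return 0-based start positions of all matches, including overlapping ones.

    `find_motif` is a hypothetical helper written for this illustration.
    A zero-width lookahead is used so that overlapping hits are not skipped.
    """
    lookahead = re.compile(f"(?=(?:{pattern}))")
    return [match.start() for match in lookahead.finditer(sequence.upper())]


# Toy sequence (made up for the example, not a real promoter).
toy = "AATGGCACGCCCCTGCATAA"
minus_24_hits = find_motif(toy, SIGMA54_MINUS_24)
minus_12_hits = find_motif(toy, SIGMA54_MINUS_12)
```

The same helper works for the σ70 hexamers "TTGACA" and "TATAAT", since a plain string is also a valid pattern.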
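Identifying promoter sequences is essential for studying gene expression, analyzing gene regulatory mechanisms, investigating gene structure, and annotating genes. Accurate promoter identification generally relies on experimental assays that are expensive, slow, and labor-intensive, which makes genome-wide detection a formidable task. With advances in sequencing and computing, the complete genomes of more and more organisms have been sequenced, prokaryotes in particular. This has given rise to promoter prediction methods based on computational biology, which continue to improve and help to identify promoter sequences.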
| No. | Tool | Benchmark dataset size (promoters) | Sequence similarity | Feature extraction / selection | Classification algorithm | Evaluation strategy | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | TLS-NNPP[9] | 771 (E. coli) | / | The empirical probability distribution of TSS-TLS distance | ANN | Independent test | / |
| 2 | SIDD[10] | 500 (E. coli) | / | SIDD | FLD | Independent test | / |
| 3 | FS_LSSVM[11] | 53 (E. coli) | / | A domain theory for promoters / C4.5 decision tree | LSSVM | 10-fold cross-validation | / |
| 4 | Free energy[12] | 1044 (E. coli); 879 (B. subtilis) | / | Free energy | Modified scoring function | Independent test | / |
| 5 | PromPredict[13] | 1145 (E. coli); 615 (B. subtilis); 82 (M. tuberculosis) | / | GC content; Average free energy | difference between the average free energy | Training and validation | / |
| 6 | SIDD-ANN[14] | 1648 (E. coli) | / | SIDD profile data | ANN | Independent test | / |
| 7 | PePPER[15] | L. lactis | / | PWM | HMM | / | / |
| 8 | G4PromFinder[16] | 3570 (S. coelicolor); 2117 (P. aeruginosa) | / | AT-rich element and G-quadruplex motif-based algorithm | / | Independent test | / |
| 9 | LN-QSAR[17] | 135 (M. bovis) | / | Pseudo-folding 2D lattice graph | LDA | Independent test | / |
| 10 | Ensemble-SVM[18] | 450 (E. coli σ70) | / | k-mer with location with respect to the TSS / Symmetric uncertainty | Ensemble-SVM | 10-fold cross-validation | / |
| 11 | TSS-PREDICT[19] | 450 (E. coli σ70); 205 (B. subtilis); 26 (C. trachomatis) | / | Information content; PWM | Ensemble-SVM | Independent test | / |
| 12 | TSS-SLP[20] | 669 (E. coli σ70) | / | Dinucleotide frequency features | SLP | 5-fold cross-validation; Independent test | / |
| 13 | PCSF[21] | 683 (E. coli σ70) | / | Conservation of sequence segments; PCSF | Score function | 10-fold cross-validation | / |
| 14 | IPMD[22] | 270 (B. subtilis σ43); 741 (E. coli σ70) | / | PCSF; ID | Modified MD | 10-fold cross-validation | 0.847 (B. subtilis); 0.920 (E. coli) |
| 15 | 70ProPred[23] | 741 (E. coli σ70) | / | PSTNPss; PseEIIP | SVM | 5-fold cross-validation; Jackknife test | 0.990 |
| 16 | iProEP[24] | 270 (B. subtilis); 741 (E. coli) | ≤80% | PseKNC; PCSF / mRMR; IFS | SVM | 10-fold cross-validation | 0.988 (B. subtilis); 0.976 (E. coli) |
| 17 | IPWM[25] | 683 (E. coli σ70) | / | Entropy-based conservative characteristics; Improved PWM | Score function | 10-fold cross-validation | / |
| 18 | BacPP[26] | 1034 (E. coli) | / | Binary digits | ANN | (2,3,10)-fold cross-validation; Independent test | / |
| 19 | vw Z-curve[27] | 1401 (E. coli); 660 (B. subtilis) | / | Variable-window Z-curve / IFS | PLS | 10-fold cross-validation | / |
| 20 | Stability[28] | 1035 (E. coli) | / | DNA duplex stability | ANN | (2,3,10)-fold cross-validation | / |
| 21 | iPro54-PseKNC[29] | 161 (prokaryotic σ54) | ≤75% | PseKNC / F-score; IFS | SVM | Jackknife test | / |
| 22 | Promote Predictor[30] | 161 (prokaryotic σ54) | ≤75% | Motif profile-based ANF / MRMD | Bagging; RF; SVM | 10-fold cross-validation; Independent test | / |
| 23 | meta-predictor[31] | 579 (E. coli σ70) | ≤45% | Sequence-based features; structure-based features | Meta-predictor | Independent test | 0.850 |
| 24 | bTSSfinder[32] | 3597 (E. coli); 12797 (Nostoc); 351 (Synechocystis); 1471 (S. elongatus) | / | PWM; Physicochemical properties / Mahalanobis distance | ANN | Independent test | / |
| 25 | iPro70-PseZNC[33] | 741 (E. coli σ70) | / | PseZNC / F-score; IFS | SVM | 5-fold cross-validation | 0.909 |
| 26 | iPromoter-FSEn[34] | 741 (E. coli σ70) | / | Nucleotide statistics; k-mer; g-gapped k-mer; Approximate signal pattern count; Position-specific occurrences; Distribution of nucleotides / Feature subspace | Ensemble learning | 10-fold cross-validation | 0.932 |
| 27 | iPro70-FMWin[35] | 741 (E. coli σ70) | / | k-mer; g-gapped k-mer; Pattern finding; Positioning distance count / Adaboost | LR | 10-fold cross-validation | 0.959 |
| 28 | CNNProm[36] | 839 (E. coli σ70); 746 (B. subtilis) | / | one-hot | CNN | 5-fold cross-validation | / |
| 29 | IBBP[37] | 1888 (E. coli σ70) | / | Image-based and evolutionary approach | SVM | Independent test | / |
| 30 | SAPPHIRE[38] | 170 (P. aeruginosa and P. putida σ70) | / | one-hot | ANN | 5-fold cross-validation; Independent test | / |
| 31 | iPromoter-2L[39] | 2860 (E. coli) | ≤80% | Multi-window-based PseKNC | RF | 5-fold cross-validation; Jackknife test | / |
| 32 | iPromoter-2L2.0[40] | 2860 (E. coli) | ≤80% | Smoothing Cutting Window algorithm; k-mer; PseKNC | SVM; Ensemble learning | 5-fold cross-validation | / |
| 33 | MULTiPly[41] | 2860 (E. coli) | ≤80% | Bi-profile Bayes; KNN; k-mer; DAC / F-score | SVM | 5-fold cross-validation; Jackknife test; Independent test | / |
| 34 | pcPromoter-CNN[42] | 2860 (E. coli) | ≤80% | one-hot | CNN | 5-fold cross-validation; Independent test | 0.957 |
| 35 | iPromoter-BnCNN[43] | 2860 (E. coli) | ≤80% | one-hot; k-mer; Structural properties | CNN | 5-fold cross-validation; Independent test | / |
| 36 | SELECTOR[44] | 2860 (E. coli) | ≤80% | CKSNAP; PCPseDNC; PSTNPss; DNA strand | Ensemble learning | 5-fold cross-validation; Independent test | 0.984 |
| 37 | iPSW(2L)-PseKNC[45] | 3382 (E. coli) | ≤85% | NCP; ANF | SVM | 5-fold cross-validation | 0.905 |
| 38 | deepPromoter[46] | 3382 (E. coli) | ≤85% | Combination of Continuous FastText N-Grams / MRMD | CNN | 5-fold cross-validation | 0.885 |
| 39 | iPSW(PseDNC-DL)[47] | 3382 (E. coli) | ≤85% | one-hot; PseDNC | CNN | 5-fold cross-validation | 0.925 |

PWM: position weight matrix; SIDD: stress-induced DNA duplex destabilization; PCSF: position-correlation scoring function; ID: increment of diversity; PSTNPss: position-specific trinucleotide propensity based on single-strand; PseEIIP: electron-ion interaction pseudo-potentials of trinucleotide; PseKNC: pseudo k-tuple nucleotide composition; ANF: accumulated nucleotide frequency; PseZNC: pseudo multi-window Z-curve nucleotide composition; KNN: k-nearest neighbors; DAC: dinucleotide-based auto-covariance; PCPseDNC: parallel correlation pseudo dinucleotide composition; NCP: nucleotide chemical property; PseDNC: pseudo dinucleotide composition; mRMR: minimum redundancy maximum relevance; IFS: incremental feature selection; MRMD: maximum-relevance-maximum-distance; ANN: artificial neural network; SVM: support vector machine; FLD: Fisher linear discriminant; SLP: single-layer perceptron; LSSVM: least squares support vector machine; MD: Mahalanobis discriminant; PLS: partial least squares; HMM: hidden Markov model; RF: random forest; LR: logistic regression; CNN: convolutional neural network; LDA: linear discriminant analysis.
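Almost all machine learning methods take numerical vectors as input, so a suitable feature descriptor is needed to convert each sample in a dataset into a numerical vector that reflects its sequence information. In prokaryotic promoter identification, these features fall roughly into five categories: nucleotide composition, nucleotide physicochemical properties, pseudo nucleotide composition, binary encoding, and position weight matrix. Each is introduced briefly below.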
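Nucleotide composition, also called k-mer, counts how often each possible substring of length k occurs in a DNA sequence fragment. It is calculated as:
where i denotes a particular k-mer, of which there are $4^{k}$ possibilities; N(t) is the number of times a given k-mer occurs in the DNA sequence; and L is the length of the DNA sequence. As k increases, more local, or short-range, information about the DNA sequence is captured.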
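The k-mer feature described above can be sketched in a few lines of Python. This sketch assumes that the frequency of a k-mer is its count N(t) divided by the number of overlapping windows, L − k + 1; that normalization, the function name `kmer_frequencies`, and the decision to skip windows containing characters other than A, C, G, and T are assumptions made for illustration, not details taken from the text.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return a dict mapping each of the 4**k possible k-mers to its frequency.

    Assumption: frequency = count / (L - k + 1), where L is the sequence length.
    """
    sequence = sequence.upper()
    windows = len(sequence) - k + 1
    if k < 1 or windows < 1:
        raise ValueError("k must be between 1 and the sequence length")

    # Start every possible k-mer at zero so the vector always has 4**k entries.
    counts = {"".join(letters): 0 for letters in product("ACGT", repeat=k)}

    # Slide a window of width k along the sequence, one base at a time.
    for start in range(windows):
        kmer = sequence[start:start + k]
        if kmer in counts:  # windows with ambiguous bases (e.g. N) are skipped
            counts[kmer] += 1

    return {kmer: count / windows for kmer, count in counts.items()}


freqs = kmer_frequencies("ATATAT", 2)
```

For k = 2 the resulting vector has 16 entries; for k = 6 it already has 4096, which is why feature selection is often paired with larger values of k.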
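In addition, nucleotide composition also covers g-gapped k-mer, GC content, and accumulated nucleotide frequency (ANF), among others. ANF describes the distribution density of each base along the sequence and is expressed as:
where $ \left|{s}_{i}\right| $ is the position of the i-th base; $ N\left({s}_{i}\right) $ is the number of occurrences of a given base; and $ q\in \left\{A,C,G,T\right\} $.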
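A minimal Python sketch of ANF follows. It assumes the commonly used definition in which the density at position i is the number of times the base found at position i has occurred within the first i positions, divided by i. The function name `accumulated_nucleotide_frequency` is hypothetical.

```python
def accumulated_nucleotide_frequency(sequence):
    """Return one density value per position of the sequence.

    Assumed definition: density_i = (occurrences of base s_i in s_1..s_i) / i,
    with positions counted from 1.
    """
    seen = {}        # running count of each base encountered so far
    densities = []
    for position, base in enumerate(sequence.upper(), start=1):
        seen[base] = seen.get(base, 0) + 1
        densities.append(seen[base] / position)
    return densities


anf = accumulated_nucleotide_frequency("ACAA")
```

For the toy input "ACAA", the values are 1/1, 1/2, 2/3, and 3/4: the first A is the only base seen so far, C is one of two, and the later A's make up two of three and three of four positions.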
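The physicochemical properties of the bases in a DNA sequence can also serve as important features for promoter prediction. These include the chemical properties of nucleotides, duplex stability, free energy, and stress-induced DNA duplex destabilization.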
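According to the classification of nucleotides in Table 2, the i-th nucleotide $ {s}_{i} $ of a DNA sequence can be represented as:

$$ {s}_{i}=\left({x}_{i},{y}_{i},{z}_{i}\right) \tag{3} $$

| Chemical property | Class | Nucleotides |
| --- | --- | --- |
| Ring structure | Purine | A, G |
| Ring structure | Pyrimidine | C, T |
| Functional group | Amino | A, C |
| Functional group | Keto | G, T |
| Hydrogen bond | Strong | C, G |
| Hydrogen bond | Weak | A, T |

where $ {x}_{i} $, $ {y}_{i} $, and $ {z}_{i} $ denote the ring structure, the functional group, and the hydrogen bond, respectively, for example:

$$ {x}_{i}=\begin{cases}1, & {s}_{i}\in \left\{A,G\right\}\\ 0, & {s}_{i}\in \left\{C,T\right\}\end{cases}\quad {y}_{i}=\begin{cases}1, & {s}_{i}\in \left\{A,C\right\}\\ 0, & {s}_{i}\in \left\{G,T\right\}\end{cases}\quad {z}_{i}=\begin{cases}1, & {s}_{i}\in \left\{A,T\right\}\\ 0, & {s}_{i}\in \left\{C,G\right\}\end{cases} \tag{4} $$

The four bases (A, C, G, T) are therefore represented as (1,1,1), (0,1,0), (1,0,0), and (0,0,1), respectively.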
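The encoding can be derived directly from the three class memberships in Table 2. The short Python sketch below does this and reproduces the four triples given above; the names `ncp_triple` and `ncp_encode` are hypothetical.

```python
# Class memberships from Table 2 that map to the value 1.
PURINES = {"A", "G"}   # ring structure
AMINO = {"A", "C"}     # functional group
WEAK_H = {"A", "T"}    # hydrogen bond


def ncp_triple(base):
    """Return (x, y, z) for a single base: ring structure, functional group, hydrogen bond."""
    if base not in "ACGT" or len(base) != 1:
        raise ValueError(f"unsupported base: {base!r}")
    return (int(base in PURINES), int(base in AMINO), int(base in WEAK_H))


def ncp_encode(sequence):
    """Encode a DNA sequence as a list of (x, y, z) triples, one per nucleotide."""
    return [ncp_triple(base) for base in sequence.upper()]


encoded = ncp_encode("ACGT")
```

A sequence of length L thus becomes an L×3 matrix, which is why this feature is often concatenated with a per-position density such as ANF.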
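Pseudo k-tuple nucleotide composition (PseKNC) was first proposed in reference [52] and comes in two types, Type I and Type II. Both incorporate global, or long-range, sequence-order information of the DNA sequence on the basis of the physicochemical properties of nucleotides.
Type I PseKNC, also called parallel-correlation pseudo nucleotide composition, converts each DNA sequence into a vector of $ {4}^{k}+\lambda $ dimensions, expressed as:
Type II PseKNC, also called series-correlation pseudo nucleotide composition, yields a vector of $ {4}^{k}+\lambda \Lambda $ dimensions:
In Eqs. (5) and (6), $ {f}_{u} $ is the same as in Eq. (1); the first $ {4}^{k} $ elements are nucleotide composition features and the remaining elements are pseudo nucleotide composition features; $ \lambda $ is a positive integer that reflects the order of the sequence correlation; $ \omega $ is a weight factor that balances the influence of the nucleotide composition against that of the local structural properties of the DNA sequence; and $ {\tau }_{j} $ is the j-th order correlation factor, which reflects the j-th order sequence correlation among all dinucleotides of each DNA sequence.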
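As a rough sketch of how these symbols usually fit together, the general PseKNC formulation divides both the k-mer frequencies and the weighted correlation factors by a common denominator. The form below is an assumption based on that general formulation, written with the symbols defined above; it is not a reproduction of Eqs. (5) and (6).

$$ \mathbf{D}={\left[{d}_{1},{d}_{2},\dots ,{d}_{{4}^{k}},{d}_{{4}^{k}+1},\dots ,{d}_{{4}^{k}+\lambda }\right]}^{\mathrm{T}} $$

$$ {d}_{u}=\begin{cases}\dfrac{{f}_{u}}{\sum_{i=1}^{{4}^{k}}{f}_{i}+\omega \sum_{j=1}^{\lambda }{\tau }_{j}}, & 1\le u\le {4}^{k}\\[2ex] \dfrac{\omega \,{\tau }_{u-{4}^{k}}}{\sum_{i=1}^{{4}^{k}}{f}_{i}+\omega \sum_{j=1}^{\lambda }{\tau }_{j}}, & {4}^{k}<u\le {4}^{k}+\lambda \end{cases} $$

Under the same assumption, the Type II vector would have the same shape with the upper index $ \lambda $ replaced by $ \lambda \Lambda $. Setting $ \omega =0 $ removes the correlation terms and leaves only the normalized k-mer frequencies.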
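Binary encoding converts each of the four nucleotides into a four-element vector in which one element is 1 and the rest are 0; that is, A, C, G, and T are represented as (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1), respectively. A DNA sequence of length L can therefore be represented as an L×4 two-dimensional matrix.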
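A minimal Python sketch of this encoding, using the A, C, G, T column order given above. The function name `one_hot_encode` is hypothetical, and rejecting characters outside A, C, G, and T is an assumption; some tools map ambiguous bases to an all-zero row instead.

```python
# Column order follows the text: A, C, G, T.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence):
    """Return an L x 4 matrix (list of rows) for a DNA sequence of length L."""
    rows = []
    for base in sequence.upper():
        if base not in ONE_HOT:
            raise ValueError(f"unsupported base: {base!r}")
        rows.append(list(ONE_HOT[base]))
    return rows


matrix = one_hot_encode("GATTACA")
```

Each row contains exactly one 1, so the matrix keeps the full positional information of the sequence, which is what the CNN-based tools in the table above take as input.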
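A position weight matrix (PWM) can be used to represent the conserved segments of a sequence. Taking the degree of base conservation at each sequence position as the parameter, a conservation index is computed for each of the four bases and used as the feature, expressed as:
where $ {S}_{i,j} $ is the conservation index of base i at position j; $ {q}_{i,j} $ is the frequency with which base i occurs at position j of the aligned sequences; and $ {b}_{i} $ is the background probability of base i. The PWM is therefore a 4×L two-dimensional matrix:

$$ \mathrm{PWM}=\begin{pmatrix}{S}_{A,1} & {S}_{A,2} & \cdots & {S}_{A,L}\\ {S}_{C,1} & {S}_{C,2} & \cdots & {S}_{C,L}\\ {S}_{G,1} & {S}_{G,2} & \cdots & {S}_{G,L}\\ {S}_{T,1} & {S}_{T,2} & \cdots & {S}_{T,L}\end{pmatrix} $$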
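The following Python sketch builds such a matrix from a set of aligned sequences of equal length. It assumes that the conservation index is the log-odds ratio ln(q / b), that the background is uniform (0.25 per base) unless supplied, and that a pseudocount of 1 is added so that unseen bases do not produce log(0). The log base, the pseudocount, and the names `build_pwm` and `score_sequence` are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(aligned_sequences, background=None, pseudocount=1.0):
    """Return a dict base -> list of L scores, i.e. a 4 x L matrix.

    Assumed score: S[i][j] = ln(q[i][j] / b[i]), with q estimated from the
    aligned sequences plus a pseudocount.
    """
    if not aligned_sequences:
        raise ValueError("at least one sequence is required")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all sequences must have the same length")
    if background is None:
        background = {base: 0.25 for base in BASES}  # uniform background (assumption)

    total = len(aligned_sequences) + len(BASES) * pseudocount
    pwm = {base: [] for base in BASES}
    for j in range(length):
        column = [seq[j].upper() for seq in aligned_sequences]
        for base in BASES:
            q = (column.count(base) + pseudocount) / total
            pwm[base].append(math.log(q / background[base]))
    return pwm


def score_sequence(pwm, sequence):
    """Sum the per-position scores of a sequence that has the same length as the PWM."""
    return sum(pwm[base][j] for j, base in enumerate(sequence.upper()))


# Toy "−10 region" sites, made up for the example.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)
```

With these toy sites, a candidate that matches the consensus, such as "TATAAT", receives a higher summed score than an unrelated hexamer, which is how PWM-based tools rank candidate sites.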