-
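A promoter is a DNA sequence, usually located upstream of a gene, that RNA polymerase binds specifically in order to initiate transcription. As a key element of transcription initiation, it enables RNA polymerase to bind the template DNA, which is the first step of gene expression and transcriptional regulation[1].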
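In prokaryotes, the σ factor of RNA polymerase specifically recognizes and binds the promoter. Escherichia coli has several σ factors, which are grouped into seven classes by molecular weight: σ70, σ54, σ38, σ32, σ28, σ24, and σ19. The first six are highly conserved, whereas σ19 is absent from most genomes[2]. Each class has a specific biological function[3-6]: σ70 is mainly responsible for transcribing housekeeping genes; σ54 is regarded as a regulator of nitrogen metabolism and also controls some ancillary processes; σ38 regulates stationary-phase genes; σ32 is the heat-shock σ factor; σ28 is involved in flagellar synthesis; σ24 is associated with the extreme heat-stress response; and σ19 regulates the iron transport system. By homology, σ factors fall into two broad families: the σ70 family, comprising σ70, σ38, σ32, σ28, σ24, and σ19, and the σ54 family. Promoters in the E. coli genome are classified by the σ factor that binds them, and the consensus sequences differ between promoter types. Promoters are therefore likewise divided into a σ70 family and a σ54 family according to the elements that are recognized. A σ70 promoter, for example, has two important motif regions, the −10 region and the −35 region, located about 10 bp and 35 bp upstream of the transcription start site (TSS). The −10 region contains the conserved sequence "TATAAT", also known as the Pribnow box or TATA box; it is rich in adenine (A) and thymine (T), which facilitates unwinding and separation of the DNA double strand. The −35 region consists of six conserved nucleotides, "TTGACA"[7]. The −10 and −35 regions are also important elements recognized by the other members of the σ70 family. By contrast, the consensus sequences of σ54 promoters and their positions differ markedly from those of σ70 promoters: the conserved regions lie at −24 and −12, with consensus sequences "TGGCA[CT][GA]" and "TGC[AT][TA]", respectively[8].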
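As a rough illustration of the σ54 consensus patterns quoted above, the bracketed positions can be read directly as regular-expression character classes. The sketch below scans a sequence for exact matches only; the function name `find_motif` and the toy sequence are hypothetical, and real promoters often deviate from the consensus, so exact matching is not meant as a practical predictor.

```python
import re

# Consensus patterns quoted in the text for the sigma54 -24 and -12 elements.
SIGMA54_MINUS24 = "TGGCA[CT][GA]"
SIGMA54_MINUS12 = "TGC[AT][TA]"


def find_motif(seq, pattern):
    """Return 0-based start positions of all exact matches (overlaps included)."""
    # Wrapping the pattern in a lookahead lets overlapping hits be reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq.upper())]


toy = "AAATGGCACGCCCCCTTGCATTT"  # made-up sequence, for illustration only
print(find_motif(toy, SIGMA54_MINUS24))  # [3]
print(find_motif(toy, SIGMA54_MINUS12))  # [16]
```

The spacing between the two elements is not checked here because the text does not state it.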
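Identifying promoter sequences is essential for studying gene expression, analyzing gene regulatory mechanisms, studying gene structure, and annotating genes. Accurate identification has generally relied on experimental assays, which are costly, time-consuming, and laborious, so experimental detection at genome-wide scale is a formidable task. With advances in sequencing and computing, the complete genomes of more and more organisms, especially prokaryotes, have been sequenced. This has led to promoter prediction methods based on computational biology, which continue to improve and help identify promoter sequences.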
Table 1 Comparison of 39 prokaryotic promoter prediction tools

| Tools | Benchmark dataset size (promoter) | Sequence similarity | Feature extraction/ selection | Classification algorithm | Evaluation strategy | AUC |
|---|---|---|---|---|---|---|
| 1. TLS-NNPP[9] | 771 (E. coli) | / | The empirical probability distribution of TSS-TLS distance | ANN | Independent test | / |
| 2. SIDD[10] | 500 (E. coli) | / | SIDD | FLD | Independent test | / |
| 3. FS_LSSVM[11] | 53 (E. coli) | / | A domain theory for promoters/ C4.5 decision tree | LSSVM | 10-fold cross-validation | / |
| 4. Free energy[12] | 1044 (E. coli); 879 (B. subtilis) | / | Free energy | Modified scoring function | Independent test | / |
| 5. PromPredict[13] | 1145 (E. coli); 615 (B. subtilis); 82 (M. tuberculosis) | / | GC content; Average free energy | Difference between the average free energy | Training and validation | / |
| 6. SIDD-ANN[14] | 1648 (E. coli) | / | SIDD profile data | ANN | Independent test | / |
| 7. PePPER[15] | L. lactis | / | PWM | HMM | / | / |
| 8. G4PromFinder[16] | 3570 (S. coelicolor); 2117 (P. aeruginosa) | / | AT-rich element and G-quadruplex motif-based algorithm | / | Independent test | / |
| 9. LN-QSAR[17] | 135 (M. bovis) | / | Pseudo-folding 2D lattice graph | LDA | Independent test | / |
| 10. Ensemble-SVM[18] | 450 (E. coli σ70) | / | k-mer with location with respect to the TSS/ Symmetric uncertainty | Ensemble-SVM | 10-fold cross-validation | / |
| 11. TSS-PREDICT[19] | 450 (E. coli σ70); 205 (B. subtilis); 26 (C. trachomatis) | / | Information content; PWM | Ensemble-SVM | Independent test | / |
| 12. TSS-SLP[20] | 669 (E. coli σ70) | / | Dinucleotide frequency features | SLP | 5-fold cross-validation; Independent test | / |
| 13. PCSF[21] | 683 (E. coli σ70) | / | Conservation of sequence segments; PCSF | Score function | 10-fold cross-validation | / |
| 14. IPMD[22] | 270 (B. subtilis σ43); 741 (E. coli σ70) | / | PCSF; ID | Modified MD | 10-fold cross-validation | 0.847 (B. subtilis); 0.920 (E. coli) |
| 15. 70ProPred[23] | 741 (E. coli σ70) | / | PSTNPss; PseEIIP | SVM | 5-fold cross-validation; Jackknife test | 0.990 |
| 16. iProEP[24] | 270 (B. subtilis); 741 (E. coli) | ≤80% | PseKNC; PCSF/ mRMR; IFS | SVM | 10-fold cross-validation | 0.988 (B. subtilis); 0.976 (E. coli) |
| 17. IPWM[25] | 683 (E. coli σ70) | / | Entropy-based conservative characteristics; Improved PWM | Score function | 10-fold cross-validation | / |
| 18. BacPP[26] | 1034 (E. coli) | / | Binary digits | ANN | (2,3,10)-fold cross-validation; Independent test | / |
| 19. vw Z-curve[27] | 1401 (E. coli); 660 (B. subtilis) | / | Variable-window Z-curve/ IFS | PLS | 10-fold cross-validation | / |
| 20. Stability[28] | 1035 (E. coli) | / | DNA duplex stability | ANN | (2,3,10)-fold cross-validation | / |
| 21. iPro54-PseKNC[29] | 161 (prokaryotic σ54) | ≤75% | PseKNC/ F-score; IFS | SVM | Jackknife test | / |
| 22. Promote Predictor[30] | 161 (prokaryotic σ54) | ≤75% | Motif profile-based ANF/ MRMD | Bagging; RF; SVM | 10-fold cross-validation; Independent test | / |
| 23. meta-predictor[31] | 579 (E. coli σ70) | ≤45% | Sequence-based features; structure-based features | Meta-predictor | Independent test | 0.850 |
| 24. bTSSfinder[32] | 3597 (E. coli); 12797 (Nostoc); 351 (Synechocystis); 1471 (S. elongatus) | / | PWM; Physicochemical properties/ Mahalanobis distance | ANN | Independent test | / |
| 25. iPro70-PseZNC[33] | 741 (E. coli σ70) | / | PseZNC/ F-score; IFS | SVM | 5-fold cross-validation | 0.909 |
| 26. iPromoter-FSEn[34] | 741 (E. coli σ70) | / | Nucleotide statistics; k-mer; g-gapped k-mer; Approximate signal pattern count; Position-specific occurrences; Distribution of nucleotides/ Feature subspace | Ensemble learning | 10-fold cross-validation | 0.932 |
| 27. iPro70-FMWin[35] | 741 (E. coli σ70) | / | k-mer; g-gapped k-mer; Pattern finding; Positioning distance count/ Adaboost | LR | 10-fold cross-validation | 0.959 |
| 28. CNNProm[36] | 839 (E. coli σ70); 746 (B. subtilis) | / | One-hot | CNN | 5-fold cross-validation | / |
| 29. IBBP[37] | 1888 (E. coli σ70) | / | Image-based and evolutionary approach | SVM | Independent test | / |
| 30. SAPPHIRE[38] | 170 (P. aeruginosa and P. putida σ70) | / | One-hot | ANN | 5-fold cross-validation; Independent test | / |
| 31. iPromoter-2L[39] | 2860 (E. coli) | ≤80% | Multi-window-based PseKNC | RF | 5-fold cross-validation; Jackknife test | / |
| 32. iPromoter-2L2.0[40] | 2860 (E. coli) | ≤80% | Smoothing Cutting Window algorithm; k-mer; PseKNC | SVM; Ensemble learning | 5-fold cross-validation | / |
| 33. MULTiPly[41] | 2860 (E. coli) | ≤80% | Bi-profile Bayes; KNN; k-mer; DAC/ F-score | SVM | 5-fold cross-validation; Jackknife test; Independent test | / |
| 34. pcPromoter-CNN[42] | 2860 (E. coli) | ≤80% | One-hot | CNN | 5-fold cross-validation; Independent test | 0.957 |
| 35. iPromoter-BnCNN[43] | 2860 (E. coli) | ≤80% | One-hot; k-mer; Structural properties | CNN | 5-fold cross-validation; Independent test | / |
| 36. SELECTOR[44] | 2860 (E. coli) | ≤80% | CKSNAP; PCPseDNC; PSTNPss; DNA strand | Ensemble learning | 5-fold cross-validation; Independent test | 0.984 |
| 37. iPSW(2L)-PseKNC[45] | 3382 (E. coli) | ≤85% | NCP; ANF | SVM | 5-fold cross-validation | 0.905 |
| 38. deepPromoter[46] | 3382 (E. coli) | ≤85% | Combination of continuous FastText N-grams/ MRMD | CNN | 5-fold cross-validation | 0.885 |
| 39. iPSW(PseDNC-DL)[47] | 3382 (E. coli) | ≤85% | One-hot; PseDNC | CNN | 5-fold cross-validation | 0.925 |

Note: "/" marks an entry that is not given. PWM: position weight matrix; SIDD: stress-induced DNA duplex destabilization; PCSF: position-correlation scoring function; ID: increment of diversity; PSTNPss: position-specific trinucleotide propensity based on single-strand; PseEIIP: electron-ion interaction pseudo-potentials of trinucleotide; PseKNC: pseudo k-tuple nucleotide composition; ANF: accumulated nucleotide frequency; PseZNC: pseudo multi-window Z-curve nucleotide composition; KNN: k-nearest neighbors; DAC: dinucleotide-based auto-covariance; PCPseDNC: parallel correlation pseudo dinucleotide composition; NCP: nucleotide chemical property; PseDNC: pseudo dinucleotide composition; mRMR: minimum redundancy maximum relevance; IFS: incremental feature selection; MRMD: maximum-relevance-maximum-distance; ANN: artificial neural network; SVM: support vector machine; FLD: Fisher linear discriminant; SLP: single-layer perceptron; LSSVM: least squares support vector machine; MD: Mahalanobis discriminant; PLS: partial least squares; HMM: hidden Markov model; RF: random forest; LR: logistic regression; CNN: convolutional neural network; LDA: linear discriminant analysis.
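Almost all machine learning methods take numerical vectors as input, so a suitable feature descriptor is needed to convert each sample in the dataset into a numerical vector that reflects its sequence information. In prokaryotic promoter identification, these features fall broadly into five categories: nucleotide composition, nucleotide physicochemical properties, pseudo nucleotide composition, binary encoding, and position weight matrices. Each is briefly introduced below.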
-
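Nucleotide composition, also called k-mer composition, is the frequency of occurrence of every possible substring of length k in a DNA sequence:
$$ f_i = \frac{N(i)}{L-k+1} $$ (1)
where $i$ denotes a particular k-mer, of which there are $4^k$ possibilities; $N(i)$ is the number of times k-mer $i$ occurs in the DNA sequence; and $L$ is the length of the sequence. As k increases, more local (short-range) information about the sequence is captured.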
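Equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than any tool's implementation; the function name `kmer_frequencies` and the choice to skip windows containing symbols other than A, C, G, T are assumptions.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return {k-mer: frequency} with f_i = N(i) / (L - k + 1), as in Eq. (1)."""
    seq = seq.upper()
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    # Enumerate all 4**k possible k-mers so the vector has a fixed length.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for start in range(windows):
        kmer = seq[start:start + k]
        if kmer in counts:  # assumption: windows with non-ACGT symbols are skipped
            counts[kmer] += 1
    return {kmer: n / windows for kmer, n in counts.items()}


freqs = kmer_frequencies("TATAAT", 2)
print(len(freqs))   # 16 features for k = 2
print(freqs["TA"])  # "TA" occurs in 2 of the 5 windows
```

The dictionary always has $4^k$ entries, which makes the exponential growth of the feature dimension with k easy to see.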
Nucleotide composition also includes g-gapped k-mers, GC content, and the accumulated nucleotide frequency (ANF). ANF describes the density with which each base is distributed along the sequence:
$$ d_i = \frac{1}{\left| s_i \right|}\sum_{j=1}^{i} N(s_j), \qquad N(s_j) = \begin{cases} 1, & s_j = q \\ 0, & \text{otherwise} \end{cases} $$ (2)
where $\left| s_i \right|$ is the position of the i-th base, that is, the length of the prefix ending at position i; $N(s_j)$ counts the occurrences of a given base within that prefix; and $q \in \{A, C, G, T\}$ is the base found at position i.
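A small sketch of the ANF computation may help make the index ranges concrete. It assumes the usual reading of Eq. (2), in which the density at position i is the number of times the base at position i has appeared in the first i bases, divided by i; the function name `anf` is hypothetical.

```python
def anf(seq):
    """Accumulated nucleotide frequency: one density value per position."""
    seen = {}        # running count of each base in the prefix read so far
    densities = []
    for position, base in enumerate(seq.upper(), start=1):
        seen[base] = seen.get(base, 0) + 1
        # Occurrences of this base in the prefix, divided by the prefix length.
        densities.append(seen[base] / position)
    return densities


print(anf("TATAAT"))
```

For "TATAAT" the first T gives 1/1, the first A gives 1/2, the second T gives 2/3, and so on.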
The physicochemical properties of the bases in a DNA sequence can also serve as important features for promoter prediction. They include the chemical properties of nucleotides, duplex stability, free energy, and stress-induced DNA duplex destabilization.

Table 2 Chemical properties of nucleotides

| Chemical property | Class | Nucleotides |
|---|---|---|
| Ring structure | Purine | A, G |
| | Pyrimidine | C, T |
| Functional group | Amino | A, C |
| | Keto | G, T |
| Hydrogen bond | Strong | C, G |
| | Weak | A, T |

According to the classification of nucleotides in Table 2, the i-th nucleotide in a DNA sequence can be represented as:
$$ N_i = (x_i, y_i, z_i) $$ (3)
where $x_i$, $y_i$, and $z_i$ denote the ring structure, the functional group, and the hydrogen bond, respectively:
$$ x_i = \begin{cases} 1, & N_i \in \{A, G\} \\ 0, & N_i \in \{C, T\} \end{cases} \qquad y_i = \begin{cases} 1, & N_i \in \{A, C\} \\ 0, & N_i \in \{G, T\} \end{cases} \qquad z_i = \begin{cases} 1, & N_i \in \{A, T\} \\ 0, & N_i \in \{C, G\} \end{cases} $$ (4)
The four bases A, C, G, and T are therefore represented as (1,1,1), (0,1,0), (1,0,0), and (0,0,1), respectively.
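The encoding of Eqs. (3) and (4) translates directly into code. The sketch below derives each triple from the three class memberships in Table 2 rather than hard-coding it; the function names `ncp` and `ncp_encode` are hypothetical.

```python
def ncp(base):
    """Return (x, y, z) for one nucleotide, following Eq. (4)."""
    base = base.upper()
    if len(base) != 1 or base not in "ACGT":
        raise ValueError(f"unsupported nucleotide: {base!r}")
    x = 1 if base in "AG" else 0  # ring structure: purine -> 1
    y = 1 if base in "AC" else 0  # functional group: amino -> 1
    z = 1 if base in "AT" else 0  # hydrogen bond: weak -> 1
    return (x, y, z)


def ncp_encode(seq):
    """Encode a sequence as a list of (x, y, z) triples, one per position."""
    return [ncp(base) for base in seq]


print(ncp_encode("ACGT"))  # [(1, 1, 1), (0, 1, 0), (1, 0, 0), (0, 0, 1)]
```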
-
Pseudo k-tuple nucleotide composition (PseKNC) was first proposed in reference [52] and comes in two forms, type I and type II. Both use the physicochemical properties of nucleotides to introduce global (long-range) sequence-order information.
Type I PseKNC, also called parallel correlation pseudo nucleotide composition, converts each DNA sequence into a vector of dimension $4^k + \lambda$:
$$ d_u = \begin{cases} \dfrac{f_u}{\sum\limits_{i=1}^{4^k} f_i + \omega \sum\limits_{j=1}^{\lambda} \tau_j}, & 1 \leqslant u \leqslant 4^k \\ \dfrac{\omega \tau_{u-4^k}}{\sum\limits_{i=1}^{4^k} f_i + \omega \sum\limits_{j=1}^{\lambda} \tau_j}, & 4^k + 1 \leqslant u \leqslant 4^k + \lambda \end{cases} $$ (5)
Type II PseKNC, also called series correlation pseudo nucleotide composition, produces a vector of dimension $4^k + \lambda\Lambda$:
$$ d_u = \begin{cases} \dfrac{f_u}{\sum\limits_{i=1}^{4^k} f_i + \omega \sum\limits_{j=1}^{\lambda\Lambda} \tau_j}, & 1 \leqslant u \leqslant 4^k \\ \dfrac{\omega \tau_{u-4^k}}{\sum\limits_{i=1}^{4^k} f_i + \omega \sum\limits_{j=1}^{\lambda\Lambda} \tau_j}, & 4^k + 1 \leqslant u \leqslant 4^k + \lambda\Lambda \end{cases} $$ (6)
In Eqs. (5) and (6), $f_u$ is the same as in Eq. (1); the first $4^k$ elements are the nucleotide composition features and the remaining elements are the pseudo nucleotide composition features; $\lambda$ is a positive integer giving the number of sequence-order correlation tiers considered; $\Lambda$ is the number of physicochemical properties used; $\omega$ is a weight factor that balances the influence of nucleotide composition against that of the local structural properties of the DNA sequence; and $\tau_j$ is the j-th tier correlation factor, which reflects the j-th tier sequence-order correlation among all dinucleotides in the sequence.
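The normalization shared by Eqs. (5) and (6) can be sketched as follows. Because the text does not spell out how the correlation factors $\tau_j$ are computed from physicochemical properties, this sketch takes them as given inputs; the function name `pseknc_vector` and the toy numbers are hypothetical.

```python
def pseknc_vector(freqs, taus, omega):
    """Assemble a PseKNC vector from k-mer frequencies and correlation factors.

    freqs: the 4**k k-mer frequencies f_1..f_{4^k}
    taus:  correlation factors (lambda of them for type I,
           lambda * Lambda for type II), assumed precomputed
    omega: weight factor
    """
    denominator = sum(freqs) + omega * sum(taus)
    if denominator == 0:
        raise ValueError("normalizing denominator is zero")
    composition = [f / denominator for f in freqs]     # first 4**k elements
    pseudo = [omega * t / denominator for t in taus]   # remaining elements
    return composition + pseudo


# Toy input: k = 1 (four frequencies) and two correlation tiers.
vector = pseknc_vector([0.4, 0.3, 0.2, 0.1], [0.5, 0.25], omega=0.5)
print(len(vector))  # 4**1 + 2 = 6
print(sum(vector))  # the elements sum to 1 by construction
```

The only difference between the two types in this sketch is the length of `taus`.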
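Binary (one-hot) encoding converts each of the four nucleotides into a four-element feature vector in which one element is 1 and the rest are 0; that is, A, C, G, and T are represented as (1,0,0,0), (0,1,0,0), (0,0,1,0), and (0,0,0,1), respectively. A DNA sequence of length L can therefore be represented as an L×4 two-dimensional matrix.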
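A minimal sketch of this encoding, using plain Python lists so that it has no dependencies; the function name `one_hot` is hypothetical.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot(seq):
    """Encode a DNA sequence as an L x 4 matrix (a list of L rows)."""
    try:
        return [ONE_HOT[base] for base in seq.upper()]
    except KeyError as err:
        raise ValueError(f"unsupported nucleotide: {err.args[0]!r}") from None


matrix = one_hot("TATAAT")
print(len(matrix), len(matrix[0]))  # 6 4
```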
-
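A position weight matrix (PWM) represents the conserved segments of a set of sequences. Taking the degree of conservation of the bases at each position as the parameter, a conservation index is computed for each base and used as a feature:
$$ S_{i,j} = \log\frac{q_{i,j}}{b_i} $$ (7)
where $S_{i,j}$ is the conservation index of base i at position j; $q_{i,j}$ is the frequency with which base i occurs at position j in the aligned sequences; and $b_i$ is the background probability of base i. The PWM is therefore a 4×L two-dimensional matrix:
$$ \boldsymbol{P} = \begin{bmatrix} S_{A,1} & S_{A,2} & \cdots & S_{A,L} \\ S_{C,1} & S_{C,2} & \cdots & S_{C,L} \\ S_{G,1} & S_{G,2} & \cdots & S_{G,L} \\ S_{T,1} & S_{T,2} & \cdots & S_{T,L} \end{bmatrix} $$ (8)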
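Equations (7) and (8) can be illustrated with a short sketch that builds a PWM from equal-length aligned sequences and scores a candidate. Several choices here are assumptions not stated in the text: a uniform background of 0.25 per base, the natural logarithm, and a pseudocount of 1 to avoid taking the logarithm of zero. The names `build_pwm` and `pwm_score` are hypothetical.

```python
import math

BASES = "ACGT"


def build_pwm(aligned, background=None, pseudocount=1.0):
    """Return {base: [S_{base,1}, ..., S_{base,L}]} as in Eqs. (7) and (8)."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    pwm = {b: [] for b in BASES}
    for j in range(length):
        column = [s[j].upper() for s in aligned]
        total = len(column) + pseudocount * len(BASES)
        for b in BASES:
            q = (column.count(b) + pseudocount) / total  # smoothed q_{i,j}
            pwm[b].append(math.log(q / background[b]))
    return pwm


def pwm_score(pwm, seq):
    """Sum the conservation indices of the bases observed at each position."""
    return sum(pwm[base][j] for j, base in enumerate(seq.upper()))


toy_sites = ["TATAAT", "TATGAT", "TACAAT"]  # made-up -10-like hexamers
pwm = build_pwm(toy_sites)
print(pwm_score(pwm, "TATAAT") > pwm_score(pwm, "GGGGGG"))  # True
```

A sequence that matches the conserved columns scores higher than one that does not, which is what makes the PWM score usable as a feature.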
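Equations (1), (5), and (6) show that the feature dimension grows exponentially with k, which leads to the "curse of dimensionality" and to overfitting. In addition, a fused feature set assembled from several feature extraction methods usually contains redundant or irrelevant information. Selecting useful features is therefore an indispensable step for avoiding these problems and improving computational efficiency.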
-
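Minimum redundancy maximum relevance (mRMR)[53] reduces redundant information by selecting the features that are most relevant to the class while being least redundant with one another. Applying mRMR greatly reduces the feature dimension and the model training time with almost no loss of useful information.
For two random variables x and y, the mutual information is:
$$ I(x, y) = \iint p(x, y)\log\frac{p(x, y)}{p(x)p(y)}\,{\rm d}x\,{\rm d}y $$ (9)
where $p(\cdot)$ denotes a probability density function.
The maximum relevance criterion is:
$$ \max D(S, c), \quad D = \frac{1}{\left| S \right|}\sum_{x_i \in S} I(x_i; c) $$ (10)
where c is the class variable and S is the feature subset.
The minimum redundancy criterion is:
$$ \min R(S), \quad R = \frac{1}{\left| S \right|^2}\sum_{x_i, x_j \in S} I(x_i; x_j) $$ (11)
The final selection criterion combines the two:
$$ \max \phi(D, R), \quad \phi = D - R $$ (12)
mRMR ranks all features by this max-relevance, min-redundancy score in descending order; the larger the value, the more important the feature.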
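The criterion in Eqs. (10) to (12) is commonly applied greedily, adding one feature at a time. The sketch below follows that idea under stated assumptions: features are discrete, so the integral in Eq. (9) becomes a sum over observed value pairs; the natural logarithm is used; and redundancy is averaged over the features already selected. The names `mutual_information` and `mrmr_rank` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Discrete mutual information estimated from paired observations."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, labels):
    """Greedy ranking by relevance minus mean redundancy with chosen features."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected, remaining = [], set(features)
    while remaining:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 1, 1, 1, 1, 1],  # informative
    "f2": [0, 0, 0, 1, 1, 1, 1, 1],  # exact copy of f1 (fully redundant)
    "f3": [0, 1, 0, 1, 0, 1, 1, 1],  # weakly informative, less redundant
}
print(mrmr_rank(features, labels))  # ['f1', 'f3', 'f2']
```

The duplicate `f2` is as relevant as `f1` but is ranked last, because once `f1` has been chosen its redundancy cancels its relevance.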
-
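When two features are highly dependent, their contributions to the model do not add up. Reference [54] therefore proposed max-relevance-max-distance (MRMD), which uses distance functions to measure the independence of each feature.
MRMD ranks features by two measures: 1) the relevance of the feature subset to the target class, and 2) the redundancy of the feature subset. Relevance is measured with the Pearson correlation coefficient, and redundancy is computed with several distance functions. The larger the Pearson correlation coefficient, the more relevant the feature is to the target class; the larger the distance between features, the lower the redundancy of the subset. Features with a large sum of relevance and distance are selected into the final subset, so the subset produced by MRMD has low redundancy and strong relevance to the target class.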
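A simplified sketch of the ranking described above is given here. It is an assumption-laden illustration, not the procedure of reference [54]: only Euclidean distance is used (the text mentions several distance functions without naming them), relevance is the absolute Pearson correlation, and the two terms are added without weighting, so features should be on comparable scales. The names `pearson`, `euclidean`, and `mrmd_rank` and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def euclidean(xs, ys):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))


def mrmd_rank(features, labels):
    """Rank features by |Pearson relevance| + mean distance to other features."""
    names = list(features)
    scores = {}
    for name in names:
        relevance = abs(pearson(features[name], labels))
        others = [euclidean(features[name], features[o])
                  for o in names if o != name]
        distance = sum(others) / len(others) if others else 0.0
        scores[name] = relevance + distance
    return sorted(names, key=lambda n: scores[n], reverse=True)


labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 1, 1, 1, 1, 1],
    "f2": [0, 0, 0, 1, 1, 1, 1, 1],  # exact copy of f1
    "f3": [0, 0, 0, 0, 0, 1, 1, 1],  # equally relevant, but not duplicated
}
print(mrmd_rank(features, labels))  # ['f3', 'f1', 'f2']
```

All three toy features correlate equally with the labels, so the distance term decides the order: `f3` has no duplicate and ranks first.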
-
F-score is a filter-based feature selection method that assigns an importance score to each feature:
$$ F_{(i)} = \frac{\left( \bar{x}_i^{(+)} - \bar{x}_i \right)^2 + \left( \bar{x}_i^{(-)} - \bar{x}_i \right)^2}{\dfrac{1}{n^+ - 1}\sum\limits_{k=1}^{n^+}\left( x_{k,i}^{(+)} - \bar{x}_i^{(+)} \right)^2 + \dfrac{1}{n^- - 1}\sum\limits_{k=1}^{n^-}\left( x_{k,i}^{(-)} - \bar{x}_i^{(-)} \right)^2} $$ (13)
where $n^+$ and $n^-$ are the numbers of positive and negative samples; $\bar{x}_i^{(+)}$, $\bar{x}_i^{(-)}$, and $\bar{x}_i$ are the means of the i-th feature over the positive samples, the negative samples, and all samples, respectively; and $x_{k,i}^{(+)}$ and $x_{k,i}^{(-)}$ are the values of the i-th feature in the k-th positive and k-th negative sequence. F-score is usually combined with incremental feature selection to determine the optimal feature subset.
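Equation (13) for a single feature can be sketched as below. The function name `f_score`, the handling of a zero denominator, and the toy values are assumptions made for illustration.

```python
def f_score(pos, neg):
    """F-score of one feature, given its values in positive and negative samples."""
    n_pos, n_neg = len(pos), len(neg)
    if n_pos < 2 or n_neg < 2:
        raise ValueError("need at least two samples per class")
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Numerator: squared distance of each class mean from the overall mean.
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sum of the two within-class sample variances.
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    denominator = var_pos + var_neg
    if denominator == 0:  # assumption: constant within both classes
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


# Well-separated classes give a large score; overlapping classes a small one.
print(f_score([0.9, 1.0, 1.1], [0.1, 0.2, 0.3]))
print(f_score([0.1, 0.5, 0.9], [0.2, 0.5, 0.8]))
```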
-
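Incremental feature selection (IFS) is used to determine the optimal feature subset. Its core idea is to add features, sorted by importance score in descending order, to the feature subset one at a time, forming a new subset at each step. Each subset is fed to the model, and the subset that yields the best result is chosen as the optimal feature subset.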
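The IFS loop can be sketched generically. In practice the evaluation step would be something like the cross-validated accuracy of a classifier trained on the current subset; here it is a caller-supplied function, and the names `incremental_feature_selection` and `toy_evaluate` as well as the toy scoring rule are hypothetical.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the subset one ranked feature at a time and keep the best subset.

    ranked_features: feature names sorted by importance, most important first
    evaluate: callable taking a list of feature names and returning a score
              (higher is better), for example a cross-validated accuracy
    """
    best_subset, best_score = [], float("-inf")
    subset = []
    for feature in ranked_features:
        subset.append(feature)
        score = evaluate(list(subset))
        if score > best_score:  # the first (smallest) subset wins ties
            best_subset, best_score = list(subset), score
    return best_subset, best_score


def toy_evaluate(subset):
    """Stand-in for model evaluation: rewards useful features, penalizes size."""
    useful = {"f1", "f3"}
    return len(useful & set(subset)) - 0.1 * len(subset)


print(incremental_feature_selection(["f3", "f1", "f2"], toy_evaluate))
```

With the toy scoring rule, adding `f2` lowers the score, so the selected subset stops at the first two features.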
A Brief Review for Identifying Prokaryotic Promoters Based on Computational Biology
-
Abstract: As a key region of deoxyribonucleic acid (DNA), the prokaryotic promoter contains the conserved sequences required for specific binding of ribonucleic acid (RNA) polymerase and for transcription initiation, and it plays an important role in transcriptional regulation. However, because experimental methods are slow and costly, accurate large-scale identification of promoter sequences remains a formidable task in molecular biology. With the development of computer technology, dozens of prokaryotic promoter prediction methods based on computational biology have emerged. These methods are highly diverse in data quality, dataset size, extracted features, feature selection techniques, classification algorithms, and evaluation strategies. This paper systematically compares and summarizes them in order to improve and further develop prokaryotic promoter identification techniques.
-
Key words: bioinformatics; machine learning; predictor; prokaryotic promoter
-
[1] PERRON G G, WHYTE L, TURNBAUGH P J, et al. Functional characterization of bacteria isolated from ancient arctic soil exposes diverse resistance mechanisms to modern antibiotics[J]. PLoS One, 2015, 10(3): e0069533. doi: 10.1371/journal.pone.0069533
[2] COOK H, USSERY D W. Sigma factors in a thousand E. coli genomes[J]. Environ Microbiol, 2013, 15(12): 3121-3129. doi: 10.1111/1462-2920.12236
[3] BARRIOS H, VALDERRAMA B, MORETT E. Compilation and analysis of sigma(54)-dependent promoter sequences[J]. Nucleic Acids Res, 1999, 27(22): 4305-4313. doi: 10.1093/nar/27.22.4305
[4] JANGA S C, COLLADO-VIDES J. Structure and evolution of gene regulatory networks in microbial genomes[J]. Res Microbiol, 2007, 158(10): 787-794. doi: 10.1016/j.resmic.2007.09.001
[5] LEE S J, GRALLA J D. Sigma38 (rpoS) RNA polymerase promoter engagement via -10 region nucleotides[J]. J Biol Chem, 2001, 276(32): 30064-30071. doi: 10.1074/jbc.M102886200
[6] POTVIN E, SANSCHAGRIN F, LEVESQUE R C. Sigma factors in Pseudomonas aeruginosa[J]. FEMS Microbiol Rev, 2008, 32(1): 38-55. doi: 10.1111/j.1574-6976.2007.00092.x
[7] ABRIL A G, RAMA J L R, SANCHEZ-PEREZ A, et al. Prokaryotic sigma factors and their transcriptional counterparts in Archaea and Eukarya[J]. Appl Microbiol Biot, 2020, 104(10): 4289-4302. doi: 10.1007/s00253-020-10577-0
[8] DING H, DENG E Z, CHEN W, et al. The sequence analysis and prediction of σ54 promoter in bacteria[J]. Journal of University of Electronic Science and Technology of China, 2015, 44(1): 147-149. doi: 10.3969/j.issn.1001-0548.2015.01.025
[9] BURDEN S, LIN Y X, ZHANG R. Improving promoter prediction for the NNPP2.2 algorithm: A case study using Escherichia coli DNA sequences[J]. Bioinformatics, 2005, 21(5): 601-607. doi: 10.1093/bioinformatics/bti047
[10] WANG H, BENHAM C J. Promoter prediction and annotation of microbial genomes based on DNA sequence and structural responses to superhelical stress[J]. BMC Bioinformatics, 2006, 7: 248. doi: 10.1186/1471-2105-7-248
[11] POLAT K, GUNES S. A novel approach to estimation of E-coli promoter gene sequences: Combining feature selection and least square support vector machine (FS_LSSVM)[J]. Appl Math Comput, 2007, 190(2): 1574-1582.
[12] RANGANNAN V, BANSAL M. Identification and annotation of promoter regions in microbial genome sequences on the basis of DNA stability[J]. J Biosciences, 2007, 32(5): 851-862.
[13] RANGANNAN V, BANSAL M. Relative stability of DNA as a generic criterion for promoter prediction: Whole genome annotation of microbial genomes with varying nucleotide base composition[J]. Mol Biosyst, 2009, 5(12): 1758-1769. doi: 10.1039/b906535k
[14] BLAND C, NEWSOME A S, MARKOVETS A A. Promoter prediction in E-coli based on SIDD profiles and artificial neural networks[J]. BMC Bioinformatics, 2010, 11: S17.
[15] DE J A, PIETERSMA H, CORDES M, et al. PePPER: A webserver for prediction of prokaryote promoter elements and regulons[J]. BMC Genomics, 2012, 13: 299. doi: 10.1186/1471-2164-13-299
[16] DI S M, PINATEL E, TALA A, et al. G4PromFinder: An algorithm for predicting transcription promoters in GC-rich bacterial genomes based on AT-rich elements and G-quadruplex motifs[J]. BMC Bioinformatics, 2018, 19(1): 36. doi: 10.1186/s12859-018-2049-x
[17] PEREZ-BELLO A, MUNTEANU C R, UBEIRA F M, et al. Alignment-free prediction of mycobacterial DNA promoters based on pseudo-folding lattice network or star-graph topological indices[J]. J Theor Biol, 2009, 256(3): 458-466. doi: 10.1016/j.jtbi.2008.09.035
[18] GORDON J J, TOWSEY M W, HOGAN J M, et al. Improved prediction of bacterial transcription start sites[J]. Bioinformatics, 2006, 22(2): 142-148. doi: 10.1093/bioinformatics/bti771
[19] TOWSEY M, TIMMS P, HOGAN J, et al. The cross-species prediction of bacterial promoters using a support vector machine[J]. Comput Biol Chem, 2008, 32(5): 359-366. doi: 10.1016/j.compbiolchem.2008.07.009
[20] RANI T S, BHAVANI S D, BAPI R S. Analysis of E. coli promoter recognition problem in dinucleotide feature space[J]. Bioinformatics, 2007, 23(5): 582-588. doi: 10.1093/bioinformatics/btl670
[21] LI Q Z, LIN H. The recognition and prediction of sigma70 promoters in Escherichia coli K-12[J]. J Theor Biol, 2006, 242(1): 135-141. doi: 10.1016/j.jtbi.2006.02.007
[22] LIN H, LI Q Z. Eukaryotic and prokaryotic promoter prediction using hybrid approach[J]. Theory Biosci, 2011, 130(2): 91-100. doi: 10.1007/s12064-010-0114-8
[23] HE W, JIA C, DUAN Y, et al. 70ProPred: A predictor for discovering sigma70 promoters based on combining multiple features[J]. BMC Syst Biol, 2018, 12(Suppl 4): 44.
[24] LAI H Y, ZHANG Z Y, SU Z D, et al. iProEP: A computational predictor for predicting promoter[J]. Mol Ther Nucleic Acids, 2019, 17: 337-346. doi: 10.1016/j.omtn.2019.05.028
[25] WU Q Q, WANG J J, YAN H. An improved position weight matrix method based on an entropy measure for the recognition of prokaryotic promoters[J]. Int J Data Min Bioin, 2011, 5(1): 22-37. doi: 10.1504/IJDMB.2011.038575
[26] DE A E, ECHEVERRIGARAY S, GERHARDT G J. BacPP: Bacterial promoter prediction - a tool for accurate sigma-factor specific assignment in enterobacteria[J]. J Theor Biol, 2011, 287: 92-99. doi: 10.1016/j.jtbi.2011.07.017
[27] SONG K. Recognition of prokaryotic promoters based on a novel variable-window Z-curve method[J]. Nucleic Acids Res, 2012, 40(3): 963-971. doi: 10.1093/nar/gkr795
[28] SILVA S A, FORTE F, SARTOR I T, et al. DNA duplex stability as discriminative characteristic for Escherichia coli sigma(54)- and sigma(28)- dependent promoter sequences[J]. Biologicals, 2014, 42(1): 22-28. doi: 10.1016/j.biologicals.2013.10.001
[29] LIN H, DENG E Z, DING H, et al. iPro54-PseKNC: A sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition[J]. Nucleic Acids Res, 2014, 42(21): 12961-12972. doi: 10.1093/nar/gku1019
[30] LIU B, HAN L, LIU X, et al. Computational prediction of sigma-54 promoters in bacterial genomes by integrating motif finding and machine learning strategies[J]. IEEE/ACM Trans Comput Biol Bioinform, 2019, 16(4): 1211-1218. doi: 10.1109/TCBB.2018.2816032
[31] ABBAS M M, MOHIE-ELDIN M M, EL-MANZALAWY Y. Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors[J]. PLoS One, 2015, 10(3): e0119721. doi: 10.1371/journal.pone.0119721
[32] SHAHMURADOV I A, MOHAMAD RAZALI R, BOUGOUFFA S, et al. bTSSfinder: A novel tool for the prediction of promoters in cyanobacteria and Escherichia coli[J]. Bioinformatics, 2017, 33(3): 334-340.
[33] LIN H, LIANG Z Y, TANG H, et al. Identifying sigma70 promoters with novel pseudo nucleotide composition[J]. IEEE/ACM Trans Comput Biol Bioinform, 2019, 16(4): 1316-1321.
[34] RAHMAN M S, AKTAR U, JANI M R, et al. iPromoter-FSEn: Identification of bacterial sigma(70) promoter sequences using feature subspace based ensemble classifier[J]. Genomics, 2019, 111(5): 1160-1166. doi: 10.1016/j.ygeno.2018.07.011
[35] RAHMAN M S, AKTAR U, JANI M R, et al. iPro70-FMWin: Identifying Sigma70 promoters using multiple windowing and minimal features[J]. Mol Genet Genomics, 2019, 294(1): 69-84. doi: 10.1007/s00438-018-1487-5
[36] UMAROV R K, SOLOVYEV V V. Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks[J]. PLoS One, 2017, 12(2): e0171410. doi: 10.1371/journal.pone.0171410
[37] WANG S, CHENG X, LI Y, et al. Image-based promoter prediction: A promoter prediction method based on evolutionarily generated patterns[J]. Sci Rep, 2018, 8(1): 17695. doi: 10.1038/s41598-018-36308-0
[38] COPPENS L, LAVIGNE R. SAPPHIRE: A neural network based classifier for sigma70 promoter prediction in Pseudomonas[J]. BMC Bioinformatics, 2020, 21(1): 415. doi: 10.1186/s12859-020-03730-z
[39] LIU B, YANG F, HUANG D S, et al. iPromoter-2L: A two-layer predictor for identifying promoters and their types by multi-window-based PseKNC[J]. Bioinformatics, 2018, 34(1): 33-40. doi: 10.1093/bioinformatics/btx579
[40] LIU B, LI K. iPromoter-2L2.0: Identifying promoters and their types by combining smoothing cutting window algorithm and sequence-based features[J]. Mol Ther Nucleic Acids, 2019, 18: 80-87. doi: 10.1016/j.omtn.2019.08.008
[41] ZHANG M, LI F, MARQUEZ-LAGO T T, et al. MULTiPly: A novel multi-layer predictor for discovering general and specific types of promoters[J]. Bioinformatics, 2019, 35(17): 2957-2965. doi: 10.1093/bioinformatics/btz016
[42] SHUJAAT M, WAHAB A, TAYARA H, et al. pcPromoter-CNN: A CNN-based prediction and classification of promoters[J]. Genes (Basel), 2020, 11(12): 1529. doi: 10.3390/genes11121529
[43] AMIN R, RAHMAN C R, AHMED S, et al. iPromoter-BnCNN: A novel branched CNN-based predictor for identifying and classifying sigma promoters[J]. Bioinformatics, 2020, 36(19): 4869-4875. doi: 10.1093/bioinformatics/btaa609
[44] LI F, CHEN J, GE Z, et al. Computational prediction and interpretation of both general and specific types of promoters in Escherichia coli by exploiting a stacked ensemble-learning framework[J]. Brief Bioinform, 2021, 22(2): 2126-2140. doi: 10.1093/bib/bbaa049
[45] XIAO X, XU Z C, QIU W R, et al. IPSW(2L)-PseKNC: A two-layer predictor for identifying promoters and their strength by hybrid features via pseudo K-tuple nucleotide composition[J]. Genomics, 2019, 111(6): 1785-1793. doi: 10.1016/j.ygeno.2018.12.001
[46] LE N Q K, YAPP E K Y, NAGASUNDARAM N, et al. Classifying promoters by interpreting the hidden information of DNA sequences via deep learning and combination of continuous fasttext N-grams[J]. Front Bioeng Biotechnol, 2019, 7: 305. doi: 10.3389/fbioe.2019.00305
[47] TAYARA H, TAHIR M, CHONG K T. Identification of prokaryotic promoters and their strength by integrating heterogeneous features[J]. Genomics, 2020, 112(2): 1396-1403. doi: 10.1016/j.ygeno.2019.08.009
[48] SANTOS-ZAVALETA A, SALGADO H, GAMA-CASTRO S, et al. RegulonDB v 10.5: Tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12[J]. Nucleic Acids Research, 2019, 47(D1): D212-D220. doi: 10.1093/nar/gky1077
[49] ISHII T, YOSHIDA K, TERAI G, et al. DBTBS: A database of Bacillus subtilis promoters and transcription factors[J]. Nucleic Acids Research, 2001, 29(1): 278-280. doi: 10.1093/nar/29.1.278
[50] HUANG Y, NIU B F, GAO Y, et al. CD-HIT suite: A web server for clustering and comparing biological sequences[J]. Bioinformatics, 2010, 26(5): 680-682. doi: 10.1093/bioinformatics/btq003
[51] SU W, LIU M L, YANG Y H, et al. PPD: A manually curated database for experimentally verified prokaryotic promoters[J]. Journal of Molecular Biology, 2021, 433(11): 166860.
[52] CHEN W, LEI T Y, JIN D C, et al. PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition[J]. Anal Biochem, 2014, 456: 53-60. doi: 10.1016/j.ab.2014.04.001
[53] PENG H C, LONG F H, DING C. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy[J]. IEEE T Pattern Anal, 2005, 27(8): 1226-1238. doi: 10.1109/TPAMI.2005.159
[54] ZOU Q, ZENG J, CAO L, et al. A novel features ranking metric with application to scalable visual and bioinformatics data classification[J]. Neurocomputing, 2016, 173: 346-354. doi: 10.1016/j.neucom.2014.12.123
[55] CHANG C C, LIN C J. LIBSVM: A library for support vector machines[J]. ACM Transactions on Intelligent Systems and Technology, 2011, 2(3): 1-27.
[56] FISHER R A. The use of multiple measurements in taxonomic problems[J]. Annals of Human Genetics, 2012, 7(7): 179-188.