-
多序列比对是生物信息学研究中重要的课题之一,对于识别未知基因功能、分析物种间的进化关系、识别基因之间的保守区域等问题有着重要作用。随着测序技术的发展,基因序列数据快速增长,现有软件难以处理大规模的多序列比对问题。
目前大多数软件采用的是渐进式比对策略或者迭代式比对策略[1],如MAFFT[2]、Kalign3[3]、Clustal[4-5]、MUSCLE[6]、T-Coffee[7]、HAlign[8]等。渐进式比对需要先计算两两序列之间的距离,再根据距离矩阵使用层次聚类算法,如UPGMA、Neighbor Joining等构建一颗比对的指导树,沿着指导树的枝干进行两两比对与合并,最后得到最终结果。而迭代式比对策略在此基础上,还要对合并的最终结果选取适当的策略,如剪枝、局部重新比对和随机选取序列重新比对进行迭代,直到比对精确收敛或者迭代次数达到上限。迭代式比对策略可以解决渐进式比对初期可能遗留下的问题。因为渐进式比对策略是贪心策略,在初期局部的比对结果上可能陷入局部最优,而错误会一直保留至最终结果中。而通过迭代式比对可以选取适当的策略,去更正局部比对的一些错误,但迭代式比对增加了时间复杂度。这两者都有着较高的时间复杂度,所以难以在有限时间内处理大规模数据的比对。
渐进式比对策略的时间复杂度与序列数量呈多项式级增长,因此在面对大规模数据的情况下,该策略时间复杂度太高、比对时间过长。而星比对是一种启发性的策略,其时间复杂度与序列数量呈线性增长,这有效降低了大规模序列比对的时间。然而,星比对算法在相似度不高的数据集上的比对精度较低,目前只能应用到相似度非常高的同源序列上,这大大限制了星比对的应用。
针对星比对精度低的问题,本文将渐进比对的模式应用于星比对中,提出了基于profile比对的改进星比对算法。实验证明改进后的算法提高了比对的精度,同时也节省了比对时间。
Improved Center Star Alignment Algorithm Based on Profile Alignment
-
摘要: 多序列比对在序列分析研究中起着重要的作用,包括功能重要位点的识别和系统发育分析等问题。目前大多数比对软件都使用渐进比对或迭代比对的策略,但两种策略都具有较高的时间复杂度,因此难以处理长序列和大规模序列的比对问题。而星比对虽然具有很低的时间复杂度,但精度并不理想,目前只适用于相似度非常高的序列。针对此问题,引进了渐进比对中的profile比对来改进星比对算法的精度,同时避免大幅度地增加星比对的时间复杂度。最后,通过实验证明了改进的星比对算法可以有效地提高比对的精度。Abstract: Multiple sequence alignment plays an important role in sequence analysis, including identification of functionally important sites and phylogenetic analysis. At present, most alignment software uses the strategy of progressive alignment or iterative alignment, but both strategies have high time complexity, so it is difficult to deal with the alignment problem of long sequence and large datasets. Although star alignment has a very low time complexity, the accuracy of star alignment is not ideal, so it only applies to sequences with very high similarity. To solve this problem, we introduce profile alignment in progressive alignment to improve the accuracy of star alignment algorithm and avoid significantly increasing the time complexity of star alignment. Experiments show that the improved star alignment algorithm can effectively improve the accuracy of the alignment.
-
[1] BISWANATH C, GAUTAM G. A review on multiple sequence alignment from the perspective of genetic algorithm[J]. Genomics, 2017, 109(5): 419-431. [2] KATOH K, MISAWA K, KUMA K, et al. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform[J]. Nucleic Acids Research, 2002, 30(14): 3059-3066. doi: 10.1093/nar/gkf436 [3] LASSMANN T. Kalign 3: Multiple sequence alignment of large datasets[J]. Bioinformatics, 2020, 36(6): 1928-1929. [4] SIEVERS F, WILM A, DINEEN D, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega[J]. Molecular Systems Biology, 2011, 7(1): 539. doi: 10.1038/msb.2011.75 [5] THOMPSON J D, HIGGINS D G, GIBSON T J. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice[J]. Nucleic Acids Research, 1994, 22(22): 4673-4680. doi: 10.1093/nar/22.22.4673 [6] EDGAR R C. MUSCLE: A multiple sequence alignment method with reduced time and space complexity[J]. BMC Bioinformatics, 2004, 5(1): 1-19. doi: 10.1186/1471-2105-5-1 [7] NOTREDAME C, HIGGINS D G, HERINGA J. T-Coffee: A novel method for fast and accurate multiple sequence alignment[J]. Journal of Molecular Biology, 2000, 302(1): 205-217. doi: 10.1006/jmbi.2000.4042 [8] ZOU Q, HU Q H, GUO M Z, et al. HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy[J]. Bioinformatics, 2015, 31(15): 2475-2481. doi: 10.1093/bioinformatics/btv177 [9] NEEDLEMAN S B, WUNSCH C D. A general method applicable to the search for similarities in the amino acid sequence of two proteins[J]. Journal of Molecular Biology, 1970, 48(3): 443-453. doi: 10.1016/0022-2836(70)90057-4 [10] SMITH T F, WATERMAN M S. Identification of common molecular subsequences[J]. Journal of Molecular Biology, 1981, 147(1): 195-197. doi: 10.1016/0022-2836(81)90087-5 [11] FU L M, NIU B F, ZHU Z W, et al. CD-HIT: Accelerated for clustering the next-generation sequencing data[J]. Bioinformatics, 2012, 28(23): 3150-3152. doi: 10.1093/bioinformatics/bts565