MBE Advance Access published online on August 20, 2007
Molecular Biology and Evolution, doi:10.1093/molbev/msm176
Research Article
Mind the Gaps: Evidence of Bias in Estimates of Multiple Sequence Alignments
1 School of Biological Sciences, University of Sydney
2 School of Biomolecular, Biomedical and Chemical Sciences, University of Western Australia
3 John Curtin School of Medical Research, Australian National University
4 Sydney Bioinformatics
5 Centre for Mathematical Biology
* Corresponding author; e-mail: lars.jermiin@usyd.edu.au; telephone: 612-9351-3717
Received for publication June 14, 2007. Revision received August 2, 2007. Accepted for publication August 14, 2007.
Multiple sequence alignment (MSA) is a crucial first step in the analysis of genomic and proteomic data. Commonly occurring sequence features, such as deletions and insertions, are known to affect the accuracy of MSA programs, but the extent to which alignment accuracy is affected by the positions of insertions and deletions has not been examined independently of other sources of sequence variation. We assessed the performance of six popular MSA programs (CLUSTALW, DIALIGN-T, MAFFT, MUSCLE, PROBCONS and T-COFFEE), and one experimental program, PRANK, on amino acid sequences that differed only by short regions of deleted residues. The analysis showed that the absence of residues often led to an incorrect placement of gaps in the alignments, even though the sequences were otherwise identical. In datasets containing sequences with partially overlapping deletions, most MSA programs preferentially aligned the gaps vertically, at the expense of incorrectly aligning residues in the flanking regions. Of the programs assessed, only DIALIGN-T was able to place overlapping gaps correctly relative to one another, but this was usually context-dependent, and was observed only in some of the datasets. In datasets containing sequences with non-overlapping deletions, both DIALIGN-T and MAFFT (G-INS-I) were able to align gaps with near-perfect accuracy, but only MAFFT produced the correct alignment consistently. The same was true for datasets that comprised multiple isoforms of alternatively spliced gene products: both DIALIGN-T and MAFFT produced highly accurate alignments, with MAFFT being the more consistent of the two. Other programs, notably T-COFFEE and CLUSTALW, were less accurate. For all datasets, alignments produced by different MSA programs differed markedly, indicating that reliance on a single MSA program may give misleading results. 
It is therefore advisable to use more than one MSA program when dealing with sequences that may contain deletions or insertions, particularly for high-throughput and pipeline applications where manual refinement of each alignment is not practicable.
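The test design described above, sequences that are identical except for short deleted regions, has the useful property that the correct alignment is known in advance: every surviving residue belongs in the column of its position in the parent sequence. The following Python sketch illustrates that idea under stated assumptions. The parent sequence, the deletion intervals, the function names, and the pair-agreement score (a Jaccard index over aligned residue pairs) are all hypothetical choices made for illustration; they are not the datasets or the accuracy measure used in the study.

```python
from itertools import combinations

GAP = "-"


def true_alignment(parent, deletions):
    """Build the known-correct alignment for sequences derived from one parent.

    Each (start, end) half-open interval yields one row in which the deleted
    residues of the parent are replaced by gap characters.
    """
    return [parent[:s] + GAP * (e - s) + parent[e:] for s, e in deletions]


def aligned_pairs(rows):
    """Return the set of residue pairs that share a column.

    A residue is identified by (row index, position in the ungapped sequence),
    so the result does not depend on how many columns the alignment has.
    """
    counters = [0] * len(rows)
    pairs = set()
    for column in zip(*rows):
        present = []
        for i, ch in enumerate(column):
            if ch != GAP:
                present.append((i, counters[i]))
                counters[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_agreement(candidate, reference):
    """Jaccard index of aligned residue pairs (1.0 means identical pairing).

    Illustrative score only: it penalises both missing and spurious pairs.
    """
    ref, cand = aligned_pairs(reference), aligned_pairs(candidate)
    return len(ref & cand) / len(ref | cand)


# Hypothetical parent sequence with two partially overlapping deletions.
parent = "MKTAYIAKQRQISFVKSH"
reference = true_alignment(parent, [(4, 8), (6, 10)])

# A candidate that stacks the two gaps vertically, misaligning the flanks.
stacked = ["MKTA----QRQISFVKSH",
           "MKTA----YIQISFVKSH"]

print(reference)
print(pair_agreement(reference, reference))
print(pair_agreement(stacked, reference))
```

In this toy case the stacked candidate scores below 1.0 because it pairs flanking residues (`QR` with `YI`) that the true alignment places opposite gaps, which is the kind of error the abstract attributes to vertically aligned gaps.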
Key Words: Multiple sequence alignment; CLUSTALW; DIALIGN-T; MAFFT; MUSCLE; PROBCONS; T-COFFEE
