MBE Advance Access published online on February 14, 2008
Molecular Biology and Evolution, doi:10.1093/molbev/msn043
Research Article
Calculating Bootstrap Probabilities of Phylogeny Using Multilocus Sequence Data
Professional Programme for Agricultural Bioinformatics, University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan. Phone: +81-3-5841-1139, Fax: +81-3-5841-5068, E-mail: seo{at}iu.a.u-tokyo.ac.jp, Web: http://www.iu.a.u-tokyo.ac.jp/seo/seo.html
Received for publication September 17, 2007. Revision received January 11, 2008. Revision received February 3, 2008. Accepted for publication February 7, 2008.
Phylogeny estimation is central to the study of molecular evolution. The growing amount of available genomic data facilitates phylogeny estimation from multilocus sequence data. Although maximum likelihood and Bayesian methods are available for phylogeny reconstruction from multilocus sequence data, they are computationally demanding, and their application is limited to analyses of a moderate number of genes and taxa. Distance matrix methods are suitable alternatives for analyzing very large amounts of sequence data. However, how distance methods should be applied to multilocus sequence data remains unclear. Here, we suggest new procedures, within the framework of the distance method, to estimate molecular phylogeny from multilocus sequence data and to evaluate its significance. We found that concatenating multilocus sequence data may yield an incorrect phylogeny with an extremely high bootstrap probability, owing to incorrect estimation of the distances and to the disregard of inter-gene variation. Therefore, instead of reconstructing phylogeny from concatenated sequence data, we suggest estimating the distance matrices for the individual genes separately and then combining these matrices to reconstruct the phylogeny. To calculate the bootstrap probabilities of the reconstructed phylogeny, we suggest two-stage bootstrap procedures in which genes are resampled first and the sequence columns within each resampled gene are then resampled. Resampling genes when calculating bootstrap probabilities properly accounts for inter-gene variation. Through simulation studies and empirical data analysis, we demonstrate that our two-stage bootstrap procedures are more suitable than the conventional bootstrap procedure applied after sequence concatenation.
Key Words: bootstrap probability; two-stage bootstrap; distance method
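The two-stage resampling scheme summarized in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation, and several pieces are assumptions made only for illustration: the uncorrected p-distance stands in for whatever distance estimator is actually used, the per-gene matrices are combined by an unweighted element-wise mean, and the tree is restricted to four taxa so that the topology can be chosen with the simple four-point criterion. All function names are hypothetical.

```python
import random


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (assumed estimator)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(gene):
    """Pairwise distance matrix for one gene, given as a list of aligned strings."""
    n = len(gene)
    return [[p_distance(gene[i], gene[j]) for j in range(n)] for i in range(n)]


def combine_matrices(matrices):
    """Combine per-gene matrices by an unweighted element-wise mean (an assumption)."""
    n, k = len(matrices[0]), len(matrices)
    return [[sum(m[i][j] for m in matrices) / k for j in range(n)] for i in range(n)]


def two_stage_replicate(genes, rng):
    """Stage 1: resample genes with replacement.
    Stage 2: resample alignment columns within each resampled gene."""
    replicate = []
    for _ in range(len(genes)):
        gene = rng.choice(genes)
        length = len(gene[0])
        cols = [rng.randrange(length) for _ in range(length)]
        replicate.append(["".join(seq[c] for c in cols) for seq in gene])
    return replicate


def quartet_topology(d):
    """Four-point criterion for four taxa: pick the pairing with the smallest sum."""
    sums = {
        "01|23": d[0][1] + d[2][3],
        "02|13": d[0][2] + d[1][3],
        "03|12": d[0][3] + d[1][2],
    }
    return min(sums, key=sums.get)


def bootstrap_support(genes, n_replicates=100, seed=0):
    """Fraction of two-stage bootstrap replicates supporting each quartet topology."""
    rng = random.Random(seed)
    counts = {"01|23": 0, "02|13": 0, "03|12": 0}
    for _ in range(n_replicates):
        replicate = two_stage_replicate(genes, rng)
        combined = combine_matrices([distance_matrix(g) for g in replicate])
        counts[quartet_topology(combined)] += 1
    return {topology: c / n_replicates for topology, c in counts.items()}
```

Calling `bootstrap_support(genes)` with `genes` as a list of per-gene alignments (each a list of four equal-length strings in the same taxon order) returns the proportion of replicates favoring each of the three possible quartet topologies. For more than four taxa, the quartet step would have to be replaced by a general distance-based tree-building method such as neighbor joining.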