MBE Advance Access published online on May 30, 2003
Molecular Biology and Evolution, doi:10.1093/molbev/msg115
Molecular Biology and Evolution © Society for Molecular Biology and Evolution 2003; all rights reserved
1 Section of Evolution and Ecology, University of California, Davis, California, 95616 USA
* To whom correspondence should be addressed. E-mail: mjsanderson{at}ucdavis.edu.
To improve the accuracy of tree reconstruction, phylogeneticists are extracting increasingly large multi-gene data sets from sequence databases. Determining whether a database contains at least k genes sampled from at least m species is an NP-complete problem. However, the skewed distribution of sequences in these databases permits all such data sets to be obtained in reasonable computing times even for large numbers of sequences. We developed an exact algorithm for obtaining the largest multi-gene data sets from a collection of sequences. The algorithm was then tested on a set of 100,000 protein sequences of green plants and used to identify the largest multi-gene ortholog data sets having at least 3 genes and 6 species. The distribution of sizes of these data sets forms a hollow curve, and the largest are surprisingly small, ranging from 62 genes by 6 species, to 3 genes by 65 species, with more symmetrical data sets of around 15 taxa by 15 genes. These upper bounds to sequence concatenation have important implications for building the tree of life from large sequence databases.
Key Words: biclique, NP-complete, sequence concatenation, phylogeny, optimization
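As a rough illustration of the problem stated in the abstract, the sketch below treats the database as a bipartite graph of species and genes and enumerates every maximal complete data set (maximal biclique) with at least a given number of genes and species. It is a brute-force enumeration whose running time grows exponentially with the number of species, not the exact algorithm the article describes. The function name, the parameter names, and the toy species and gene labels are all hypothetical.

```python
from itertools import combinations


def maximal_data_sets(presence, min_genes, min_species):
    """Enumerate maximal gene-by-species data sets by brute force.

    presence: dict mapping each species to the set of genes sequenced for it
              (a hypothetical input format, assumed for this sketch).
    Returns a set of (genes, species) pairs of frozensets in which every
    species has every gene, and to which no further gene or species can be
    added without leaving a gap.
    """
    species = sorted(presence)
    found = set()
    # Try every group of species of the required size or larger.
    for size in range(max(1, min_species), len(species) + 1):
        for group in combinations(species, size):
            # Genes sequenced in every species of the group.
            shared = set.intersection(*(presence[s] for s in group))
            if len(shared) < min_genes:
                continue
            # Add every other species that also has all shared genes, so the
            # resulting pair cannot be extended in either direction.
            closed = frozenset(s for s in species if shared <= presence[s])
            found.add((frozenset(shared), closed))
    return found


# Toy presence/absence data (made up for illustration only).
toy = {
    "sp1": {"g1", "g2", "g3"},
    "sp2": {"g1", "g2"},
    "sp3": {"g1", "g3"},
    "sp4": {"g1"},
}
for genes, taxa in sorted(maximal_data_sets(toy, 2, 2), key=lambda p: sorted(p[0])):
    print(sorted(genes), sorted(taxa))
```

With thresholds of 2 genes and 2 species, the toy data yield two maximal data sets: genes g1 and g2 for sp1 and sp2, and genes g1 and g3 for sp1 and sp3. Because every subset of species is visited, this approach is practical only for very small inputs.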
Original Articles
Obtaining Maximal Concatenated Phylogenetic Data Sets from Large Sequence Databases
2 Department of Computer Science, Iowa State University, Ames, IA 50011, USA