Molecular Biology and Evolution 17:1854-1858 (2000)
© 2000 Society for Molecular Biology and Evolution
ARTICLE
Assessing an Unknown Evolutionary Process: Effect of Increasing Site-Specific Knowledge Through Taxon Addition
*Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, New Mexico; and Department of Biological Sciences, Louisiana State University at Baton Rouge
| Abstract |
|---|
Assessment of the evolutionary process is crucial for understanding the effect of protein structure and function on sequence evolution and for many other analyses in molecular evolution. Here, we used simulations to study how taxon sampling affects accuracy of parameter estimation and topological inference in the absence of branch length asymmetry. With maximum-likelihood analysis, we find that adding taxa dramatically improves both support for the evolutionary model and accurate assessment of its parameters when compared with increasing the sequence length. Using a method we call "doppelgänger trees," we distinguish the contributions of two sources of improved topological inference: greater knowledge about internal nodes and greater knowledge of site-specific rate parameters. Surprisingly, highly significant support for the correct general model does not lead directly to improved topological inference. Instead, substantial improvement occurs only with accurate assessment of the evolutionary process at individual sites. Although these results are based on a simplified model of the evolutionary process, they indicate that in general, assuming processes are not independent and identically distributed among sites, more extensive sampling of taxonomic biodiversity will greatly improve analytical results in many current sequence data sets with moderate sequence lengths.
| Introduction |
|---|
Understanding the evolutionary process in proteins and other macromolecules is central to the pursuit of evolutionary functional genomics (the use of evolutionary information to predict and better understand the structure, function, and interaction of genome components), but accurate inferences of both the topology of taxon relationships and the rates of substitution at different sites can be elusive. Many previous studies theoretically examined the question of topology assessment with known models of evolution, particularly for simple four-taxon situations (Gaut and Lewis 1995
We introduce a technique we call "doppelgänger trees," or shadowlike doubles of the tree of interest. The doppelgänger sequences are homologous to the sequences of interest but evolve independently. Thus, the phylogenetic relationships within each doppelgänger tree are exactly the same as the tree of interest, but the trees are connected by a branch of effectively infinite length. Incorporating these trees allows us to add site-specific information about rates of evolution in a controlled manner without adding information about the state of internal nodes. We conclude that, using maximum likelihood (ML), phylogenetic reconstruction and assessment of an unknown evolutionary process are often improved more efficiently by adding taxa than by increasing the length of existing sequences.
| Materials and Methods |
|---|
Branch reconstruction percentages, parameter estimates, and likelihood estimates were all obtained using PAUP* (Swofford 1998
ΔlnL) can be used to determine levels of support for the more complicated models (Huelsenbeck and Rannala 1997
Compared with the equal-rates model, the gamma model has one more degree of freedom, in the form of the shape parameter (α), which is estimated in the ML procedure. The ratio of rates for the two different rate categories is directly calculable from α, and the likelihood at each site in GML is estimated as the sum of likelihoods for each of the two rate categories. When α = ∞, the underlying model in GML is the same as the equal-rates model. In addition to the correct and absolute prior on the rate category for each site, SSML also has one degree of freedom difference from EML, which is the ratio of the site-specific rates (ρ = λ₁/λ₂, where λ₂ and λ₁ are the rates for the two categories). When ρ = 1, SSML is equivalent to EML.
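The relationship among the three likelihood models can be made concrete with a minimal sketch. It is not the PAUP* implementation: it assumes a Jukes-Cantor substitution process, a single pair of sequences instead of a full tree, and equal weights of one half on the two rate categories. The names `r1`, `r2`, `t`, and all function names are hypothetical.

```python
import math

def jc_site_likelihood(same, rate, t):
    """Likelihood of one aligned site for two sequences under Jukes-Cantor.

    same: True if the two nucleotides match.
    rate: relative substitution rate at this site (assumed, illustrative).
    t:    distance between the sequences in substitutions per site at rate 1.
    """
    decay = math.exp(-4.0 * rate * t / 3.0)
    p = 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay
    return 0.25 * p  # 0.25 is the stationary frequency of the first nucleotide

def eml_site(same, t):
    """Equal-rates model: every site evolves at relative rate 1."""
    return jc_site_likelihood(same, 1.0, t)

def gml_site(same, t, r1, r2):
    """Two-category mixture: category unknown, so sum over both (weights 1/2 assumed)."""
    return 0.5 * jc_site_likelihood(same, r1, t) + 0.5 * jc_site_likelihood(same, r2, t)

def ssml_site(same, t, r1, r2, category):
    """Site-specific model: the rate category of the site is given in advance."""
    return jc_site_likelihood(same, r1 if category == 1 else r2, t)
```

When `r1 == r2 == 1.0`, all three functions return the same value, which mirrors the statement that the mixture and site-specific models reduce to the equal-rates model in that limit.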
For most trees, the entire topology was reconstructed, but only the existence of the innermost branch was assessed. For doppelgänger trees, simulations were run independently for two or three eight-taxon trees and combined into one alignment. Reconstruction probabilities for the innermost branch were assessed for one of these trees (the focus tree), while the topology for the remaining branches was given prior to ML and Pars evaluation; this was necessary for sufficient speed in performing the repetitions. For the same reason, the attachment point for the branch between the focus tree and the doppelgänger trees was arbitrarily fixed on one of the internal branches other than the innermost branch in order to allow timely evaluation of the replicates. Analysis of eight-taxon trees where the topology of the terminal tips was fixed showed that fixing branches other than the innermost branch made only a slight difference in parameter estimation (α, ρ), log likelihood differences, and reconstruction probabilities for the innermost branch (data not shown). Simulations with doppelgänger trees were replicated 300 times, while other simulations were replicated 1,000 times.
The significances of log likelihood differences were calculated by assuming that the ΔlnL statistic was chi-square distributed with one degree of freedom (Huelsenbeck and Rannala 1997). For example, the 5% significance level is thus 3.84.
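The significance calculation described above can be sketched in a few lines. This is an illustration under the stated assumption of a chi-square distribution with one degree of freedom, not code from the study; the function names are hypothetical. For one degree of freedom the upper-tail probability has a closed form in terms of the complementary error function, so only the standard library is needed.

```python
import math

def lrt_statistic(lnl_complex, lnl_simple):
    """Twice the difference in log likelihoods between two nested models."""
    return 2.0 * (lnl_complex - lnl_simple)

def lrt_p_value(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    Uses P(Z**2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    """
    if stat <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(stat / 2.0))
```

For example, `lrt_p_value(lrt_statistic(-1000.0, -1003.0))` evaluates the support for a more complex model whose log likelihood is 3 units higher than that of the simpler model.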
| Results |
|---|
In an initial analysis, we began with a four-taxon tree and compared the effect of doubling the number of taxa with that of doubling the sequence length (fig. 1). Regardless of the number of taxa added, the topological question evaluated was always that of the unrooted reconstruction of the initial four-taxon tree. We found that in the four-taxon case, ΔlnL values between GML and EML did not significantly support GML in the majority of replicates, even though this model was in fact correct. With double the sequence length, slightly more than 75% of the replicates supported GML at the 5% significance level (table 1). For the eight-taxon case, however, all of the replicates consistently gave extremely significant support (P << 0.001) for the gamma model. The shape parameter, α, had a variance approximately 100-fold lower for the eight-taxon case than for the four-taxon cases and, as a consequence, was also less biased. This reduction in the sampling variance of α when the number of taxa was increased is consistent with previous work comparing results for three and four taxa (Gu, Fu, and Li 1995
For the eight-taxon case, there are two plausible effects which could cause improvement in tree reconstruction capability relative to the four-taxon case: the increase of information about the state of the internal nodes (seen in the reconstruction improvement for EML with eight taxa), and information about which rate is in effect at each site (which, when known completely, yields the improved performance of SSML relative to EML). For GML, it appears plausible that despite more accurate knowledge of the global parameters of the model with the eight-taxon tree, the indeterminate placement of sites into rate categories lowers tree reconstruction success. In order to test this hypothesis, we added site-specific information to the eight-taxon tree by using doppelgänger trees. The doppelgänger trees were duplicates of the eight-taxon focus tree with sites evolving at the same rates but independently of that tree. These trees had the same topology and branch lengths as the tree of interest (the focus tree), but evolution was simulated independently; this is equivalent to the sequences from these trees being related by a branch of infinite length. The doppelgänger sequences were added to the alignment, and phylogenetic analyses were performed for the combined data sets. Thus, the doppelgänger trees provided additional information about the probable site-specific rate category of each site, but no information about the state of internal nodes on the focus tree.
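The construction of a doppelgänger alignment can be sketched as follows. This is a simplified illustration, not the simulation code used in the study: it assumes a Jukes-Cantor process, a rooted symmetric binary tree with equal branch lengths, and hypothetical function names and parameter values. Each tree starts from its own random root sequence, which under this model has the same effect as joining the trees by a branch of infinite length, while the per-site rates are shared across all trees.

```python
import math
import random

BASES = "ACGT"

def jc_mutate(base, rate, t, rng):
    """Evolve one nucleotide along a branch of length t under Jukes-Cantor."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * rate * t / 3.0)
    if rng.random() < p_same:
        return base
    return rng.choice([b for b in BASES if b != base])

def simulate_tree(site_rates, depth, branch_len, rng):
    """Simulate tip sequences on a symmetric binary tree with 2**depth tips."""
    level = [[rng.choice(BASES) for _ in site_rates]]  # random root sequence
    for _ in range(depth):
        nxt = []
        for seq in level:
            for _child in range(2):  # each node has two descendants
                nxt.append([jc_mutate(b, r, branch_len, rng)
                            for b, r in zip(seq, site_rates)])
        level = nxt
    return ["".join(seq) for seq in level]

def doppelganger_alignment(site_rates, n_trees, depth, branch_len, rng):
    """Stack independently simulated trees that share the same per-site rates."""
    rows = []
    for _ in range(n_trees):
        rows.extend(simulate_tree(site_rates, depth, branch_len, rng))
    return rows
```

Calling `doppelganger_alignment(rates, n_trees=2, depth=3, branch_len=0.1, rng=random.Random(1))` with a list `rates` of per-site relative rates yields a 16-row alignment in which the first eight rows play the role of the focus tree and the remaining eight the single doppelgänger.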
The doppelgänger results confirmed that with more site-specific information derived from the data, GML can approach the performance of SSML. We tested a 16-taxon single doppelgänger (16TD) and a 24-taxon double doppelgänger (24TD), and for EML and SSML the likelihood values per eight-taxon tree were almost identical to previous results, as expected (table 1). Also, topology reconstruction rates were essentially unchanged (fig. 2), which indicates that the doppelgänger trees had little effect when these models were used. In contrast, GML had twice the ΔlnL improvement per eight-taxon tree for 16TD, and for 24TD it was 2.6 times as high per eight-taxon tree as without doppelgängers, indicating nonlinear improvement in support for GML. The shape parameter was also better estimated; the variance of α for 16TD was about one third that for the eight-taxon case, and for 24TD it was about one fifth. The changes in topology reconstruction probabilities for GML were also dramatic (fig. 2). For 16TD, GML made up half the difference with SSML, while for 24TD, GML reconstruction probability was at nearly the same level as for SSML. Reconstruction rates for parsimony in the 16TD and 24TD cases decreased to the same rates as in the original four-taxon case.
| Discussion |
|---|
It appears that more useful site-specific information can be obtained by adding taxa to a data set than by increasing sequence length. This information can increase phylogenetic reconstruction probabilities both by increasing knowledge of the state of internal nodes and by increasing knowledge of the rate at individual sites. Taxon addition also dramatically improves the accuracy of global parameter estimation, but this has little independent effect on phylogenetic reconstruction for the conditions of this study. This complements earlier observations that it is important to add taxa when reconstructing site-specific interactions (Pollock and Taylor 1997
ΔlnL levels approached 1.0 per site. Although adding sequence length can be useful if the rate category for each site is specified (as in SSML), for the more general case where each site may belong to any of the possible rate categories, a large portion of the improvement in reconstruction capability can come only through taxon addition.
The number of taxa required to gain most of the potential improvement in this situation (24) is an obtainable number for most evolutionary researchers, although we note that actual benefits will vary depending on how added sequences are related to the initial sequences (Goldman 1998; Rannala et al. 1998). Although we used a simple model here to evaluate general principles, we expect that these principles will hold qualitatively for the more complicated models needed to describe protein evolution, which take into account codons, differential rates of exchange between the 20 amino acids, and varying rates and other parameters among many more site categories. When the evolutionary process is unknown, it is best to increase sampling of taxonomic biodiversity in order to get as much information as possible about site-specific substitution rates. This will lead to improved topological reconstruction and support for models that more accurately reflect the underlying complexity, and will in turn allow better understanding of the effect of structure and function on the evolutionary process.
Our results appear to conflict with some previous studies which have ascribed better results to increased sequence length rather than increased taxonomic sampling, or recommended avoidance of additional sequences outside the clade under consideration. These conflicts are the result of using Pars rather than ML. In order to understand the difference with regard to the question of taxon addition, we simulated a single rate at all sites, with different rates over a series of simulations. We found a broad zone in which parsimony fails to make efficient use of the information available in the eight-taxon tree (fig. 3). Pars was equivalent to ML for slow rates, but it underperformed for all larger evolutionary rates up to the point where all methods performed equally poorly. This effect is different from the well-known problem of long-branch attraction (the "Felsenstein Zone" [Felsenstein 1978; Hendy and Penny 1989]), as in this case the tree was entirely symmetrical and all branches outside of the innermost branch were equal in length. The effect is surprising in that many phylogenetic researchers would expect parsimony to perform well in this situation (Hillis 1996, 1998). While the average rate in our previous two-rate simulations was situated in the center of this zone, where the discrepancy between ML and Pars was greatest (as was the rate used by Poe and Swofford [1999]), the individual rates were on either end of the zone, where the discrepancy was small. Parsimony appears to take on the average characteristics of the underlying rates rather than the characteristics of a single rate equal to the average, and the apparent conflict is thus explained. For four-taxon trees with double the sequence length, reconstruction using parsimony or ML was slightly less accurate than that for eight taxa using ML (data not shown).
In addition to this zone, we have shown that Pars is confounded by additional data from distant taxa (even without long-branch attraction), while ML is not distracted and can make use of the information about site-specific rates. We note that although the behavior of Pars appears somewhat pathological in our simulations, the situation is extreme in that the doppelgänger trees evolved independently from the focus tree, and this is not a realistic assumption for alignable sequences from the natural world. We did not specifically address (and in fact intentionally avoided) long-branch attraction (Felsenstein 1978
| Acknowledgements |
|---|
We thank A. L. Halpern, B. Korber, M. Lachmann, and C. Macken for comments on the manuscript. D.D.P. was supported by a Los Alamos National Laboratory Director's Fellowship, and W.J.B. was supported by a grant from the Department of Energy.
| Footnotes |
|---|
Antony Dean, Reviewing Editor
1 Abbreviations: α, gamma shape parameter; ΔlnL, double the difference in log likelihoods between models; λ₁ and λ₂, the rate parameters for each of the two site categories; ρ = λ₁/λ₂, the parameter for the ratio of the site-specific rates; EML, maximum likelihood with equal rates among sites; GML, maximum likelihood with rates evolving according to a two-category gamma model; ML, maximum likelihood; Pars, parsimony; SSML, maximum likelihood with a site-specific two-rate model where the rate category of each site was correctly specified prior to evaluation.
2 Keywords: evolutionary models, maximum likelihood, rate variation, taxon addition, phylogenetic inference
3 Address for correspondence and reprints: David D. Pollock, Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana 70803. Phone: 225-388-4597, Fax: 225-388-2597. E-mail: daviddpollock{at}yahoo.com
| Literature Cited |
|---|
Bruno, W. J., and A. L. Halpern. 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16:564-566.
Felsenstein, J. 1978. Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. 27:401-410.
Gaut, B. S., and P. O. Lewis. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152-162.
Goldman, N. 1998. Phylogenetic information and experimental design in molecular systematics. Proc. R. Soc. Lond. B Biol. Sci. 265:1779-1786.
Graybeal, A. 1998. Is it better to add taxa or characters to a difficult phylogenetic problem? Syst. Biol. 47:9-17.
Gu, X., Y.-X. Fu, and W.-H. Li. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12:546-557.
Hendy, M. D., and D. Penny. 1989. A framework for the quantitative study of evolutionary trees. Syst. Zool. 38:297-309.
Hillis, D. M. 1995. Approaches for assessing phylogenetic accuracy. Syst. Biol. 44:3-16.
Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130-131.
Hillis, D. M. 1998. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. 47:3-8.
Huelsenbeck, J. P. 1995a. The performance of phylogenetic methods in simulation. Syst. Biol. 44:17-48.
Huelsenbeck, J. P. 1995b. The robustness of two phylogenetic methods: four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining. Mol. Biol. Evol. 12:843-849.
Huelsenbeck, J. P. 1997. Is the Felsenstein Zone a fly trap? Syst. Biol. 46:69-74.
Huelsenbeck, J. P., and B. Rannala. 1997. Phylogenetic methods come of age: testing hypotheses in an evolutionary context. Science 276:227-232.
Kim, J. 1996. General inconsistency conditions for maximum parsimony: effects of branch lengths and increasing numbers of taxa. Syst. Biol. 45:363-374.
Kim, J. 1998. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol. 47:43-60.
Poe, S., and D. L. Swofford. 1999. Taxon sampling revisited. Nature 398:299-300.
Pollock, D. D. 1998. Increased accuracy in analytical molecular distance estimation. Theor. Popul. Biol. 54:78-90.
Pollock, D. D., and D. B. Goldstein. 1995. A comparison of two methods for constructing evolutionary distances from a weighted contribution of transition and transversion differences. Mol. Biol. Evol. 12:713-717.
Pollock, D. D., and W. R. Taylor. 1997. Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng. 10:647-657.
Pollock, D. D., W. R. Taylor, and N. Goldman. 1999. Coevolving protein residues: maximum likelihood identification and relationship to structure. J. Mol. Biol. 287:187-198.
Rannala, B., J. P. Huelsenbeck, Z. Yang, and R. Nielsen. 1998. Taxon sampling and the accuracy of large phylogenies. Syst. Biol. 47:702-710.
Swofford, D. L. 1998. Phylogenetic analysis using parsimony (*and other methods). Sinauer, Sunderland, Mass.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314.
Yang, Z. 1996. Phylogenetic analysis using parsimony and likelihood methods. J. Mol. Evol. 42:294-307.
Yang, Z. 1998. On the best evolutionary rate for phylogenetic analysis. Syst. Biol. 47:125-133.
Figure legend (plot symbols lost in extraction): EML with four taxa; EML with eight taxa; parsimony with eight taxa. Other than there being a single rate for all sites rather than two, conditions are the same as in [figure reference lost].