MBE Advance Access originally published online on May 21, 2004
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Mol. Biol. Evol. 21(9):1629-1642. 2004
DOI: 10.1093/molbev/msh159
© 2004 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
On Inconsistency of the Neighbor-Joining, Least Squares, and Minimum Evolution Estimation When Substitution Processes Are Incorrectly Modeled


* Genome Atlantic, Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
Genome Atlantic, Canadian Institute for Advanced Research, Program in Evolutionary Biology, Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia, Canada
E-mail: susko{at}mathstat.dal.ca.
| Abstract |
|---|
Using analytical methods, we show that under a variety of model misspecifications, Neighbor-Joining, minimum evolution, and least squares estimation procedures are statistically inconsistent. Failure to correctly account for differing rates-across-sites processes, failure to correctly model rate matrix parameters, and failure to adjust for parallel rates-across-sites changes (a rates-across-subtrees process) are all shown to lead to a "long branch attraction" form of inconsistency. In addition, failure to account for rates-across-sites processes is also shown to result in underestimation of evolutionary distances for a wide variety of substitution models, generalizing an earlier analytical result for the Jukes-Cantor model reported in Golding (Mol. Biol. Evol. 1:125142, 1983) and a similar bias result for the GTR or REV model in Kelly and Rice (1996). Although standard rates-across-sites models can be employed in many of these cases to restore consistency, current models cannot account for other kinds of misspecification. We examine an idealized but biologically relevant case, where parallel changes in rates at sites across subtrees is shown to give rise to inconsistency. This changing rates-across-subtrees type model misspecification cannot be adjusted for with conventional methods or without carefully considering the rate variation in the larger tree. The results are presented for four-taxon trees, but the expectation is that they have implications for larger trees as well. To illustrate this, a simulated 42-taxon example is given in which the microsporidia, an enigmatic group of eukaryotes, are incorrectly placed at the archaebacteria-eukaryotes split because of incorrectly specified pairwise distances. The analytical nature of the results lend insight into the reasons that long branch attraction tends to be a common form of inconsistency and reasons that other forms of inconsistency like "long branches repel" can arise in some settings. In many of the cases of inconsistency presented, a particular incorrect topology is estimated with probability converging to one, the implication being that measures of uncertainty like bootstrap support will be unable to detect that there is a problem with the estimation. The focus is on distance methods, but previous simulation results suggest that the zones of inconsistency for distance methods contain the zones of inconsistency for maximum likelihood methods as well.
Key Words: inconsistency rates across sites distance methods Neighbor-Joining molecular evolution phylogenetics
| Introduction |
|---|
Distance methods for the estimation of phylogenetic trees such as the Neighbor-Joining method (NJ; Saitou and Nei 1987), and least-squares estimation (LS; Cavalli-Sforza and Edwards 1967) are valuable tools because of their computational simplicity. In large-scale problems, or when repeated fitting is required, maximum likelihood (ML) estimation can become infeasible, with distance methods providing a reasonable alternative. Such cases arise when competing models of molecular evolution are being considered, when large numbers of sequences are under consideration, when bootstrapping is required and in simulation studies. Because of the importance of distance methods, investigations of situations where these methods fail to consistently estimate topologies is valuable. In the absence of model misspecification most distance methods are consistent. Consistency of distance methods has been proven by Saitou and Nei (1987) for NJ, and by Rzhetsky and Nei (1993) for minimum evolution (ME) and LS. However some distance methods like UPGMA and ME with weighted least-squares branch lengths can be inconsistent even in the absence of model misspecification (Gascuel, Bryant, and Denis 2001).
We consider here whether the topology will be consistently estimated when an incorrect model or no model has been used in constructing the pairwise distances. Sequence evolution in nature is an extremely complex and heterogeneous (across sites and across lineages) process. In most cases, therefore, when inferring phylogeny from sequences, models with simplifying assumptions regarding rate matrices, rates-across-sites, or stationarity of the rates-across-sites process across lineages will always be incorrect to some degree. For this reason, it is of critical importance to investigate the performance of methods when their models are misspecified in various ways. Inconsistency of a method of phylogenetic estimation implies poor performance as the number of sites gets large, and as such it is at least theoretically possible that a method that is inconsistent will perform reasonably with small numbers of sites. This is rarely the case in practice; inconsistency is usually a good indicator of situations in which poor performance can be expected, even with small numbers of sites. Indeed, for the forms of model misspecification similar to those considered through simulation with finite samples by Gaut and Lewis (1995) and Huelsenbeck (1995), the zones of inconsistency match up well with regions of poor performance by the methods considered by those authors.
The results presented indicate that serious difficulties of inference often arise in the presence of inconsistency. In many of the cases considered we show that with probability converging to one, a particular incorrect topology is estimated. As a consequence, with larger numbers of sites, there will be large bootstrap support for this incorrect topology. To illustrate briefly, we simulated data from the tree ((A : 0.8, C : 0.1) : 0.1, (B : 0.8, D : 0.1)) with a Jukes-Cantor (JC; Jukes and Cantor 1969) substitution process. As indicated later, with probability one, NJ with Hamming or p-distance will estimate the long branch topology, characterized by the neighbors (A, B), with a large enough number of sites. Even with 200 sites, the probability is very large that the long branch topology will be estimated. In 1,000 simulations the long branch topology (A, B) was estimated every time. More significantly, the average bootstrap support for the topology (A, B) over the 1,000 simulations was 81.3%, so that there was little information to suggest that there was a problem.
The analytical nature of the results presented lend insight into the reasons for inconsistency. The form of inconsistency most commonly discussed in the literature is long branch attraction in the context of the maximum parsimony method (Felsenstein 1978; Hendy and Penny 1989). Long branch attraction turns out to be the main form of inconsistency that we find occurring as well. The reason for this is that in so many of the cases where an overly simple model or no model is used to construct the distances, the limiting distances turn out to be a concave function of the actual distances. In contrast, when the limiting distances are convex functions of the actual distances, long branch inconsistency will not arise [a continuous concave (convex) function has negative (positive) second derivative everywhere or roughly speaking, has an upside-down (right-side up) bowl shape]. Instead, as we illustrate with an example, the form of inconsistency that can arise in this case is a "long branches repel" type of inconsistency.
Because it is analytically tractable and eases discussion, we primarily consider four taxon topologies. However a form of parallel rates-across-sites variation model misspecification is considered both analytically and, through simulation, with more taxa. Although idealized, the basic notion of parallel rates-across-sites variation is biologically relevant as a model for the evolutionary implications of differing functional constraints for different organisms. For instance, Steel, Huson, and Lockhart (2000) discussed a case in which changes in the numbers of variable sites in 18S ribosomal DNA from groups of Holometabolus insects provided artifactual support for Strepsitera as a sister taxon of Diptera. In this study, we consider a 42-taxon example where a group of intracellular parasitic eukaryotes, the microsporidia, are erroneously placed at the archeabacteria-eukaryote split in an elongation factor 1
phylogeny rather than with their fungal sister taxa (Hirt et al. 1999; Baldauf et al. 2000), possibly because of model misspecification. Whether the simplified model considered actually explains the erroneous placement or not, it illustrates that the analytical results for four taxon topologies have implications for larger numbers of taxa.
The results that we present are shown for the NJ method. In the four-taxon case, however, there is a close connection between the zones for NJ, LS, and ME. An immediate connection comes from the results of Gascuel (1994), which show that ME, with sum of LS branch lengths, and NJ give the same estimated trees. More generally, with concave limiting distances, we show that the zones of inconsistency are the same for NJ and ME, with ME criteria the sums of absolute values of branch lengths, or the sums of positive branch lengths, as well as for constrained LS (branch lengths are constrained to be non-negative). However, the zones of inconsistency for unconstrained LS (negative branch lengths are allowed) are larger than those of NJ. This is consistent with the findings of Kuhner and Felsenstein (1994), who show through simulation that corrections for negative branch lengths improve the accuracy of both topological estimation and branch length estimation.
A number of simulations studies have been conducted that investigate the biases and variability of various phylogenetic methods. These include studies by Gaut and Lewis (1995), Huelsenbeck (1995), Kuhner and Felsenstein (1994), and Bruno and Halpern (1999). On the basis of this and other work, it appears that there is widespread understanding that model misspecification can lead to biased estimation and inconsistency. Theoretical work that explicitly considers consistency in the case of Hamming distances includes reports by DeBry (1992) and Rzhetsky and Sitnikova (1996). Models where rates vary across sites and subtrees have been considered theoretically by Chang (1996) who shows that inconsistency can arise as branch length converges to zero. The work here can be considered an addition to the present work on the topic that (1) lends analytical insight into the reasons for the long branch form attraction that so frequently arises, (2) provides illustrations of alternative forms of model misspecification that can lead to inconsistency (specifically parallel rates-across-sites variation) that are not easily adjusted for, and (3) more closely considers the possible inconsistency of NJ as well as its relationship to LS and ME.
An outline of the article is as follows. The next two sections are technical, introducing assumptions, notation, and the main derivations leading to a zone of inconsistency. The following two subsections give examples of zones of inconsistency arising from a failure to model rates-across-sites and rates-across-subtrees processes. We also show in these sections that failure to account for rates-across-sites variation results in underestimation of evolutionary distances for a variety of models generalizing previous results for the JC model (Golding 1983), GTR model (Kelly and Rice, 1996) and confirming the simulation results of Kuhner and Felsenstein (1994). Other examples are then given of inconsistency corresponding to misspecifications in conventional model assumptions and resulting in long branches repelling. We conclude with a discussion of the relationships between the zones of inconsistency for NJ and those of LS and ME methods.
| Limiting Distances and Assumptions |
|---|
Consider a pair of taxa, i and j. Distances between the two taxa are computed from the relative frequencies
, with which character states x and y arise in the two taxa across all sites. In some cases additional parameters might be required to compute the distance between taxa. For instance, computing gamma-corrected distances requires an estimate of the
parameter (c.f. Nei, Chakraborty, and Fuerst 1976). More generally we let F(i,j) be the matrix of relative frequencies
and let
represent any additional parameters. We assume that a consistent estimate of
,
, is available and let dij(F(i,j),
) denote the estimated distance between taxa i and j. As an example, Hamming distance or p-distance defined as the proportion of sites whether the character states for the two taxa differ, depends only on the relative frequencies F(i,j) through the proportion of different character states. Because no additional parameters are required for the computation of Hamming distance,
and
can be taken to equal 1. Here dij(F(i,j),
) = 1
x
.
We assume generally that dij(F(i,j),
) is a continuous function of F(i,j) and
. Let
be the probability that character states x and y arise in the two taxa across all sites. Since
converges to
as the number of sites goes to infinity, and by assumption
converges to
we get that
|
|
It will be valuable to think of incorrectly specified distances in terms of their dependence on the correctly specified distances. To do this, we must make the additional assumption that the distance between two taxa is a monotonically increasing transformation g(t) of the correctly specified distance or branch length. This assumption is equivalent to the statement that larger branch lengths should be associated with larger distances, which should always be the case for any reasonable distance. As an example, for the JC model, the Hamming distance along a branch is related to branch length through the transformation g(t) = 3/4 3 exp(4t/3)/4. In most of the examples that we consider, g(t) is a concave transformation of t, and so, unless stated otherwise, this is assumed as well. We will assume throughout most of the discussion that the true tree is ((A : b, C : a) : a, (B : b, D : a)):
|
In this case the limiting distances, dij(P(i,j),
), as the number of sites goes to infinity, are as given in table 1.
|
| Zones of Inconsistency: Neighbor Joining |
|---|
In the case of a four-taxon tree with taxa A, B, C, and D, there are three topologies which can be described in terms of the neighbor of A: (A, B), (A, C), and (A, D).
|
Let dij be the distance between taxa i and j. For a four-taxon tree the Neighbor-Joining algorithm can be shown (Saitou and Nei 1987) to be determined by the following rules:
- (A, C) is preferred to (A, D) if

- (A, C) is preferred to (A, B) if

Substituting the limiting distances from table 1 into (1) and (2) we obtain that, with probability 1, as the number of sites gets large, (A, C) is preferred to (A, D) if
|
|
|
|
|
|
Thus h(b;a) is a decreasing function of b and so either h(b;a) < 0 for all a or h(b0(a);a) = 0 for some b0(a) in which case h(b;a) > 0 for all b < b0(a) and h(b;a) < 0 for b > b0(a). The zone of inconsistency can then be found by determining the solution, b0(a) of the equation h(b;a) = 0; we set b0(a) =
if no solution exists. For fixed a the topologies estimated by NJ in the limit are as follows:
| ||||||||
On the boundary, when b = b0(a), regardless of the number of sites, both of the topologies (A, C) or (A, B) will have positive probability of being estimated.
Although in most cases b0(a) must be determined numerically, for some distances an explicit solution for b0(a) can be calculated. For instance, if the true model is a JC model but Hamming distance is used, then b0(a) is determined as
|
|
|
|
| Effects of Ignoring Rates-Across-Sites Variation |
|---|
Many phylogenetic models for sequence data assume independent Markov substitution processes across sites. Most of the parameters for these Markov processes are assumed constant across sites, but it has long been recognized that it is unreasonable to assume that the overall rate of evolution is constant across sites (Fitch and Markowitz 1970; Uzzel and Corbin 1971; Nei 1987). It is usually possible to adjust for the heterogeneity of rates across sites if one assumes a probability distribution for these rates (Gu and Li 1996; Waddell and Steel 1997). The corrections under a JC model are given in table 2 for some of the more common rates-across-sites (RAS) models. The zones of inconsistency for several RAS models when the distance is calculated without adjustment (assuming a single fixed rate) are given in figure 2A through 2C. In each case, the size of the zone of inconsistency increases with the magnitude of model misspecification.
|
|
One of the other consequences of failing to make RAS adjustments is that distances will be underestimated. Golding (1983) shows that this is true for the JC model. Kelly and Rice (1996) show that branch lengths are biased downwards for the GTR model when stationary frequencies are not estimated. In fact, it is more generally true, holding for the following commonly used models: JC, K2P (Kimura 1980), F81 (Felsenstein 1981), F84 (Felsenstein 1984), and the GTR or REV model (Yang 1994). In the remainder of this subsection we sketch the reasons for this phenomenon. Full details are available in the Supplementary Material online.
To establish notation, we fix the pair of taxa i and j under consideration and leave as implicit the dependence of various quantities on i and j. Let t denote the true distance between the pair,
the diagonal matrix with the stationary probabilities of the character states along its diagonal, and Q the rate matrix for the model under consideration. The number of additional parameters in Q varies depending on the model under consideration. For instance, for the JC model there are no additional parameters and for the F84 model there is a transition-transversion ratio. The matrix of pairwise probabilities, with xyth entry the probability of observing character states x an y for the two taxa, is then
E[eQrt], where expectation is with respect to the random rate r. The estimate
of
can be constructed from the observed frequencies or using ML estimation, but we assume that the estimated distance
and the rest of the parameters in the estimated rate matrix
are obtained through ML estimation without assuming a RAS model.
Now, the estimated distance
always satisfies that
|
|
e
is the estimate of the pairwise probability matrix. As is shown in the section titled Convergence of Pairwise Probability Estimates When Rates-Across-Sites Variation Is Ignored in the Supplementary Material Online, for the models listed above, this matrix converges to the actual pairwise probability matrix
E[eQrt]. In addition,
converges to
and so
converges to
|
|
{log(E[eQrt])}ii
Qiit with strict inequality as long as t
0 and the variance of the rate distribution is positive. Thus
|
|
| Parallel Rates-Across-Sites Variation |
|---|
Just as failing to adjust for rates-across-sites processes can cause inconsistency, so can a failure to account for rate variation across subtrees. We consider here the zones of inconsistency that can arise if parallel rates-across-sites changes occur in subtrees of the larger tree. In the usual rates-across-sites model, a site must remain in the same rate class over the whole tree, whereas in the model considered here, rates vary across sites and across subtrees. By parallel rates-across-sites variation we refer to variation across sites where there is an increase in the probability that a given rate shift will occur in parallel in two subtrees. In the case of homogeneous rates-across-sites processes, gamma-corrected distances can be used to avoid inconsistency. For the parallel rates-across-sites variation considered here, however, such a correction will not in general work, because the rates-across sites-process varies with the pairs of taxa considered. Although the model presented here is a simple one, the notion of parallel rates-across-sites variation is biologically relevant as a model for the evolutionary implications of differing functional constraints for different organisms (see Lockhart et al. 1998; Inagaki et al. 2003). In the 42-taxon EF-1
example considered later, it seems possible that some sites may be constrained in eukaryotes but not in archaebacteria or Microsporidia. In terms of distributional properties, this is suggestive of a certain frequency of larger rates-across-sites for archaebacteria and microsporidia and an increased probability that these rate shifts will occur in parallel in the two subtrees. We start by considering a four-taxon tree. The true tree is assumed to be of the form ((A : b, C : a) : a, (B : b, D : a). Rates are assumed constant along the middle branch and the branches leading to C and D. The rate variation is along the branches leading to A and B, and the rates-across-sites assigned to these branches are assumed to be positively correlated. Specifically, we assume that rates at any given site are assigned to the branches according to the probabilities of table 3.
|
The entries are chosen so that the expected rate along each branch is 1 and so that the same marginal rate distribution applies to each branch. For this to be a valid distribution
00
, where
00 is the probability that, at a site, 0 rates are assigned to both branches and, for a given branch,
is the marginal probability that, at a site, a 0 rate is assigned to it. For the rates to be positively correlated, it is necessary to have
00
2. Note that while the model has been specified in terms of the zero rates, the implication is that the branches leading to A and B will have certain probability at a site of having larger rates, 1/(1
), in parallel, than the rate 1 at all of the other branches. We assume a JC substitution process and that the estimates of the pairwise branches are the uncorrected JC branch length estimates,
|
|
ij is the proportion of nucleotides that are different. Let p(i, j) denote the path from i to j, and let rb, tb be the rates and branch lengths for branch b. Note that for the branches leading to C and D, as well as the middle branch, these rates would be 1 and the branch lengths would be a, whereas for the other two branches the rates are random variables. As the number of sites gets large this proportion approaches the probability that two nucleotides are different, which can be shown to be
|
|
|
|
| ||||||||||||||
Here
|
|
|
|
|
|
|
|
00
2, then
|
|
and
00 are given in figure 3. With
00 =
2, the rate variation in the two branches is independent. As
00 moves away from
2, making the rates in the two branches more positively correlated, the zone of inconsistency increases.
|
It is worth noting that the expressions for the limiting distances involve expectations of the form E[e4rb/3]. These expectations depend only on the marginal rates-across-sites distribution for r implied by the parallel rates-across-sites process described here. One might expect that by correctly inferring the marginal rates-across-sites distribution, a correction might be made to avoid inconsistency. Indeed, Tuffley and Steel (1997) prove as much for the covarion rates-across-sites and subtrees model they investigated. For the parallel rates-across-sites process considered here, however, that rate distribution will not be constant across pairs of taxa, as is indicated by equations (6) and (7), and thus correction becomes much more difficult.
| The Archaebacteria-Eukaryotes Split |
|---|
The split of archaebacteria from eukaryotes is considered to be near the root of the tree of life, and thus there is considerable interest in which taxa are closest to this split. In the estimated tree of figure 4A the clade of microsporidia (Glugea plecoglossi, Encephalitozoon cuniculi, and Nosema locustae) was placed with a long extending branch at the archaebacteria-eukaryotes split. To illustrate that problems of inconsistency from misspecified distances can occur with a larger number of taxa, we investigate what parameter settings in a model of parallel rate change could cause the microsporidia to be placed at the archaebacteria-eukaryotes split. We do this under the hypothesis that they are actually more closely related to the fungi (Podospora anserina and Schizosaccharomyces pombe in figure 4A). That Microsporidia are truly close relatives of fungi and not deeply branching eukaryotes (as their placement in the elongation factor-1
would suggest in figure 4A) has gained much recent support from phylogenetic analyses of RNA-encoding and protein genes both in isolation (Keeling and Doolittle 1996; Hirt et al. 1998; Fast, Logsdon, and Doolittle 1999; van de Peer, Ben Ali, and Meyer 2000) and in concatenated sets (Baldauf et al. 2000)
|
The tree given in figure 4A was constructed with NJ from eukaryotic and archaebacterial elongation factor-1
data consisting of 349 sites for 42 taxa generated by modifying Inagaki and Doolittle's data set (Inagaki and Doolittle 2000). The sequence data for a pair of taxa was used to obtain ML estimates of the distance between them, based on a model using the PAM amino acid substitution matrix, described in Dayhoff and Eck (1968) and Dayhoff, Schwartz, and Orcutt (1979), without a gamma correction; we will refer to them as "uncorrected gamma distances." The tree that was used in the simulations, with microsporidia more closely related to the fungi, is given in figure 4B. The branch with the microsporidia clade was placed between the fungi and the animals (Homo sapiens and Caenorhabditis elegans in figure 4B), but all other parts of the topology were fixed as in figure 4A. The branch lengths were re-estimated using ML estimation with a fixed topology. The model for rate change assumes that parallel changes occur along the branch A leading to the archaebacteria and the branch B leading to the microsporidia. Specifically, we assume the following:
- For the branches A and B, independent of all else, rates are assigned according to table 3.
- Whenever branches A and B are both assigned 0 rates (which happens with probability
00), a 0 rate is assigned to all other branches.
- Whenever at least one of the rates assigned to A or B is non-negative (which happens with probability 1
00), a common 0 rate is assigned to all other branches with probability
. Otherwise a common non-zero rate is assigned to all other branches.
The model of rate variation in branches A and B is the same model of parallel rate change that gave rise to long branch attraction in the four-taxon case. The main difference between this model and the model in the four-taxon case is that now a large proportion of 0 rate sites are allowed in all other branches of the tree. Under the model above, the expected proportion of sites with 0 rates is
for A and B, and it is
00 +
(1
00) for all other branches. This provides a simple model for parallel rate variation that could arise because of a loss of function in genes. The cellular structure of microsporidia is highly degenerated, probably because of their parasitic lifestyle. It is also likely that the reduction of these organisms could be extended to the molecular level, since only 1,997 open reading frames appear to be encoded in the E. cuniculi genome (Katinka et al. 2001). Suppose for instance, that some sites in elongation factor 1
(EF-1
) are constrained in Eukaryotes but not in archaebacteria because of functions carried out by eukaryotic EF-1
but never present in archaebacterial EF-1
. If those functions were later lost in microsporidia, the corresponding sites would be free to vary, in contrast to the rest of the eukaryotes and in parallel with the archaebacteria. Eukaryotic EF-1
(eEF 1
) proteins are known to have both the core function as a translation elongation factor that is absolutely indispensable for all cellular systems and non-translational (auxiliary) functions that probably are absent in the archaebacterial orthologs (reviewed in Negrutskii and El'skaya 1998). Therefore it is plausible that microsporidian EF-1
proteins secondarily lost some eukaryotic EF-1
specific auxiliary functions during the reductive evolution in this intracellular parasitic lineage (Inagaki et al. 2004). By these parallel absence of EF-1
auxiliary functions in the microsporidian and archaebacterial proteins, the two sub-trees could show correlated patterns of rate variation, different from other eukaryotes (Inagaki et al. 2004).
To obtain a reasonable set of parameters,
00,
, and
for the simulation, we used the proportions of sites with no change among the subgroups of taxa of interest:
| ||||||||||
A rough estimate of
, the probability of a 0 rate for the archaebacterial subtree, is then 0.258. Similarly 0.155 is an estimate for
00, and 0.436 is an estimate of
00 +
(1
00), the probability of a 0 rate in all other branches of the tree. Since 0.155 is an estimate for
00, an estimate for
can be obtained by solving
|
|
. All of the estimates above will be biased upward because some sites will appear invariant by chance. However, with 42 taxa, the expectation is that the bias will not be large. The tree used in the simulations is given in figure 4B. The branch with the microsporidia clade was placed between the fungi and the animals, but all other parts of the topology were fixed as in the topology estimated by NJ with uncorrected PAM distances (fig. 4A). If in fact parallel rate variation is present, then uncorrected ML estimates of the branch lengths for the topology in figure 4B will be biased. We used the regression approach described in the Supplementary Material online as a rough bias adjustment.
Because inconsistency implies that wrong topologies will be estimated with high probability when the number of sites is large, we simulated single data sets of 5,000 sites from the model above. Because of the uncertainty in the parameter and branch length estimates, simulations were conducted for a variety of settings for
,
00, and
near the estimated values of 0.333, 0.155, and 0.258. It was clear from these simulations that (1) larger values of the parameters
00 and
are more likely to give rise to microsporidia being placed with the archaebacteria, (2) changing
from its estimated value of 0.33 did not change results, and (3)
00 had to be close to
for microsporidia to be placed with the archaebacteria. The results for some of the parameter settings are indicated in table 4. For the estimated parameters (
,
00, and
set to 0.333, 0.155, and 0.258), the estimated topology places Entamoeba histolytica closest to archaebacteria, as do most of the parameter settings. Parameter settings with
00 large are the only ones that place microsporidia closest to the archaebacteria. The estimated topologies did not vary except in the placement of microsporidia. It was of interest to conduct simulations with 349 sites, because this was the size of the data set used. In this case, however, incorrect placement may be explained by sampling variation and, indeed, the estimated topologies differed more than just in their placement of microsporidia. Nevertheless, the results with respect to the clade closest to the archaebacteria were the same as indicated in table 4.
|
Although inconsistency did not result at the estimated parameter values, the results in table 4 indicate that there are plausible misspecified distances which could explain the placement of microsporidia near the archaebacteria. The model for parallel rate variation used here is a simple one. Given the results, it is possible that a more refined model would more clearly indicate that the placement of microsporidia can be explained by parallel rate variation. Cases of parallel rate variation may be particularly problematic for "deep" phylogenetic problems, where parallel losses or changes in functional constraints in distantly related lineages are more likely to be observed (Lockhart et al. 1998; 2000).
Regardless of the placement of microsporidia, the example illustrates that results for four taxon trees are likely to have implications for larger number of taxa. In the present case a key feature of the tree used in the simulation was the long branches leading to the microsporidia (0.77) and archaebacteria (0.30). The distance between the nodes corresponding to these branches was not small (0.25), and there were a number of intervening branches but, as indicated in figure 5, the shape of the tree was similar to the shape of the long branch four-taxon trees that have been considered throughout most of this article.
|
| Transition/Transversion Ratios |
|---|
Not adjusting for parameters in the substitution process at a site can lead to long branch inconsistency for NJ. We consider here failure to adjust for a transition/transversion ratio R different from that implied by the JC model (R = 1/2). For simplicity we assume a fixed rate model in which case the Kimura 2-parameter model (Kimura 1980) adjusts for a difference. The ML branch length estimates for a pair of taxa under this model is calculated to be
|
|
|
| An Example Where Long Branches Repel |
|---|
In the examples above and in most of the literature the most common form of inconsistency that arises is long branch attraction. It is, however, possible that long branches will repel. As an example of this phenomenon, we consider a true tree of the form ((A : b, C : b) : a, (B : a, D : a)), where the two long branches have length b.
|
In this case the limiting distances are as follows:
| ||||||||||||||||
Substituting these limiting distances into equations (1) and (2), we get that, in the limit, (A, C) is preferred to both (A, B) and (A, D), if
|
|
Suppose now that a JC substitution model is in effect with a gamma rate distribution having
= 1. If the RAS model adopted is incorrectly specified as consisting of a mixture of a fixed rate and a set of invariable sites arising with probability 0.2, then
|
|
|
|
| Connections Between LS, ME, and NJ |
|---|
The least-squares estimate of a tree is obtained by choosing a topology and branch lengths so that
|
|
ij is the sum of branch lengths along the path between i and j in the tree under consideration. This could be unconstrained least-squares estimationbranch lengths are allowed to be negativeor constrained estimationbranch lengths are restricted to be positive. Assuming a true tree ((A : b, C : a) : a, (B : b, D : a)), and that g is a monotone concave transformation, a similar analysis to the one for NJ, given in the Supplementary Material online, allows one to draw the following main conclusions.
- The zone of inconsistency for the unconstrained LS estimate contains the zone of inconsistency for NJ. There are values of (a, b) for which unconstrained LS will be inconsistent and NJ will be consistent. An example is given in figure 7.
- The constrained LS estimate will, with probability 1, be the same as the NJ estimate with a large enough number sites, and thus will have the same zone of inconsistency.
- In the region where the unconstrained and constrained LS estimates differ, the estimated middle branch for unconstrained LS is negative and the estimated topology is (A, D) from unconstrained LS, where it might be (A, C) or (A, B) for constrained LS, depending on the values of (a, b).
For four taxa, it is shown in Gascuel (1994) that ME, with sum of LS branch lengths, and NJ give the same estimated trees and hence share the same zones of inconsistency. Other suggestions for the ME criterion are sums of absolute values of branch lengths (Kid and Sgaramella-Zonta 1971) and sums of positive branch lengths (Swofford et al. 1996). Note that these sums will always be larger than or equal to the unconstrained sum of branch lengths with equality if the branch lengths are all non-negative. Thus if the estimated ME topology with sum of unconstrained branch lengths has non-negative branch lengths, it will be the estimated ME topology under the other two criteria as well. It is shown in the Constrained Least-Squares Estimation subsection of the Supplementary Material online that this is the case with concave g. Hence the NJ and ME zones of inconsistency coincide regardless of the criterion used.
| Concluding Remarks |
|---|
The analytical results presented here provide clear, rigorous reasons demonstrating that distance methods become inconsistent and the manner in which they do so when models are misspecified; we expect and simulations suggest that ML methods will exhibit similar zones of inconsistency. It is the shape of the limiting distances, which we have denoted by g(t) as a function of the true distances t, that determines how and whether there will be a zone of inconsistency. If these functions are concave, there will be a zone of inconsistency and it will favour topologies in which long branches attract. For many of the realistic examples of model misspecification or ignorance discussed here (including the use of Hamming distances), it is clear that, ignoring rate variation, misspecifying rate matrices, g(t) is a concave function and that long branch attraction results. Indeed, this appeared to be the case for other similar forms of model misspecification considered but not presented here. In light of this tendency it appears reasonable that investigators carefully consider the possibility of model misspecification for topologies with several long branches close to one another. Nonetheless an example was presented in which g(t) was convex, leading to a long branches repel form of inconsistency. This result indicates that it is not a certainty that long branch attraction will be a consequence of model misspecification. At present we can see no theoretical reasons that other heretofore unconsidered types of model misspecification will most likely lead to concave g(t) rather than convex g(t) and consequently long branch repulsion. What is clear through careful consideration of the form of the zone of inconsistency determining equation (4) in the case of long branch attraction and equation (8) in the case of long branch repulsion, is that for a topology to be in the zone, it is necessary for there to be several relatively long branches; this is the only way in which the negative terms in the sum will be larger than the positive terms. The implication is that model misspecification should only be suspected to be a serious potential source of error for estimated topologies with several relatively long branches.
Investigations about inconsistency like the one conducted here provide information about the bias of estimation procedures and not the variance. Through simulation Gaut and Lewis (1995) plot regions where ML estimation performs well with finite samples under model misspecification. For small middle branches a, the shapes of their plots are similar to the shapes given here, suggesting that the poor estimation in these finite sample cases is due in large part to bias. However, for larger middle branches a, the zones where ML performs poorly are much less restricted, indicating, not surprisingly since the variance in distance estimates increases with distance, that variance provides additional problems for estimation in these cases.
For almost all of the examples considered here, it has been shown that inconsistency implies that a particular topology, different from the true one, will be converged on with probability 1. Because a particular topology is being converged on, with a large number of sites, measures of uncertainty, such as for instance bootstrap support, will suggest that with a high degree of certainty this wrong topology is the only one supported by the data. The fact that a particular topology is being converged on is important for studies investigating the use of simple distance measures as well. An overly simplistic distance measure that has a convex limiting distance function g(t), for instance, will converge on the wrong topology if the generating topology has long branches together, but it will converge on the right topology if the long branches are apart. In fact, such a distance measure might actually appear to be very efficient in estimating this topology. This is a point that has been considered in some detail in Bruno and Halpern (1999).
The examples involving parallel rate variation were important for several reasons. First, the application to the archaebacteria-eukaryotes EF-1
data indicate that the four-taxon results presented here likely have implications for larger, more complicated topologies. Second, although the model considered is simple, the basic notion of parallel rates-across-sites variation is biologically relevant. Finally, where many of the other forms of model misspecification can be adjusted for with appropriate distances, adjustments here would require a careful consideration of the processes taking place throughout the entire tree. The issues raised are similar to but different from those in Tuffley and Steel (1997). A correction for distances in the covarion model they consider is available as a rates-across-sites adjustment. Similarly, in the parallel rates-across-sites case, corrections could in principle be made through a rates-across-sites adjustment; in contrast, however, the rate distribution will not be constant across the taxa considered. It seems possible that the problem could be diagnosed on the basis of a pilot estimate of the tree, with separate rate distributions being fit to separate subsets of taxa, but we cannot pursue that further here. In any case, some form of iterative distance estimation, iterating between tree estimation and distance estimation with model adjustments, or perhaps ML estimation, would be required.
| Acknowledgements |
|---|
E.S. and A.J.R. are supported by the Natural Sciences and Engineering Research Council of Canada. A.J.R. thanks the Canadian Institute for Advanced Research Program in Evolutionary Biology and the Canadian Institutes for Health Research for fellowship support. A.J.R. and Y.I. were supported by NSERC Operating Grant 227085-00 and NSERC Genomics Grant 228263-99. This collaboration is part of the Prokaryotic Genome Evolution and Diversity Project of Genome Atlantic/Genome Canada.
| Footnotes |
|---|
Peter Lockhart, Associate Editor
| Literature Cited |
|---|
Baldauf, S. L., A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle. 2000. A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290:972-977.
Bruno, W. J., and A. L. Halpern. 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16:564-566.[Web of Science][Medline]
Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. Evolution 32:550-570.[CrossRef]
Chang, J. T. 1996. Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math. Biosci. 134:189-214.[CrossRef][Web of Science][Medline]
Dayhoff, M. O., and R. V. Eck. 1968. Atlas of protein sequence and structure 19671968. National Biomedical Research Foundation, Silver Spring, Maryland.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1979. A model of evolutionary change in proteins. Pp 345352 in M. O. Dayhoff, ed. Atlas of protein sequence and structure, volume 5, supplement 3. National Biomedical Research Foundation, Silver Spring, Md.
DeBry, R. W. 1992. The consistency of several phylogeny-inference methods under varying evolutionary rates. Mol. Biol. Evol. 9:537-551.[Abstract]
Fast, N. M., J. M. Logsdon, and W. F. Doolittle. 1999. Phylogenetic analysis of the TATA-binding protein (TBP) gene from Nosema locustae: evidence for a microsporidia-fungi relationship and spliceosomal intron loss. Mol. Biol. Evol. 16:1415-1419.[Web of Science][Medline]
Felsenstein, J. 1978. Cases in which parsimony and compatibility methods will be positively misleading. Syst. Zool. 27:27-33.[CrossRef][Web of Science]
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[CrossRef][Web of Science][Medline]
Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579-593.[CrossRef][Web of Science][Medline]
Gascuel, O. 1994. Concerning the NJ algorithm and its unweighted version, UNJ. Pp 149170 in B. Mirkin, F. R. McMorris, F. S. Roberts, and A. Rzhetsky, eds. Mathematical hierarchies and biology, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Vol. 37. American Mathematical Society, Providence, R.I.
Gascuel, O., D. Bryant, and F. Denis. 2001. Strengths and limitations of the minimum evolution principle. Syst. Biol. 50:621-627.
Gaut, B. S., and P. O. Lewis. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152-162.[Abstract]
Golding, G. B. 1983. Estimates of DNA and protein sequence divergence: an examination of some assumptions. Mol. Biol. Evol. 1:125-142.[Abstract]
Gu, X., and W. H. Li. 1996. A general additive distance with time-reversibility and rate variation among nucleotide sites. Proc. Natl. Acad. Sci. USA 93:4671-4676.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.[CrossRef][Web of Science][Medline]
Hendy, M. D., and D. Penny. 1989. A framework for the study of evolutionary trees. Syst. Zool. 38:297-309.[CrossRef][Web of Science]
Huelsenbeck, J. P. 1995. Performance of phylogenetic methods in simulation. Syst. Biol. 44:17-48.
Hirt, R. P., J. M. Logsdon, B. Healy, M. W. Dorey, W. F. Doolittle, and T. M. Embley. 1999. Microsporidia are related to fungi: evidence from the largest subunit of RNA polymerase II and other proteins. Proc. Natl. Acad. Sci. USA 96:580-585.
Inagaki, Y., E. Susko, N. M. Fast, and A. J. Roger. 2003. Covarion shifts cause a long branch attraction artifact that unites microsporidia and archaebacteria in EF-1
phylogenies. Mol. Biol. Evol. 21:1340-1349.
Inagaki, Y., and W. F. Doolittle. 2000. Evolution of the eukaryotic translation termination system: origins of release factors. Mol. Biol. Evol. 17:882-889.
Jukes, T. H., and C. R. Cantor. 1969. Pp 21123 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press. New York.
Katinka, M. D., S. Duprat, E. Cornillot, G. Metenier, F. Thomarat, G. Prensier, V. Barbe, E. Peyretaillade, P. Brottier, and P. Wincker, et al. 2001. Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature 414:450-453.[CrossRef][Medline]
Keeling, P. J., and W. F. Doolittle. 1996. Alpha-tubulin from early-diverging eukaryotes and the evolution of the tubulin family. Mol. Biol. Evol. 13:1297-1305.[Abstract]
Kelly, C., and Rice, J. 1996. Modeling nucleotide evolution: A heterogeneous rate analysis. Math. Biosci. 133:85-109.[CrossRef][Web of Science][Medline]
Kid, K. K., and L. A. Sgaramella-Zonta. 1971. Phylogenetic analysis: concepts and methods. Am. J. Hum. Genet. 23:235-252.[Web of Science][Medline]
Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.[CrossRef][Web of Science][Medline]
Kuhner, M. K., and J. Felsenstein. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459-468.[Abstract]
Li, W. H. 1994. Molecular evolution. Sinauer Associates, Sunderland, Mass.
Lockhart, P. J., D. Huson, U. Maier, M. J. Fraunholz, Y. van de Peer, A. C. Barbrook, C. J. Howe, and M. A. Steel. 2000. How molecules evolve in eubacteria. Mol. Biol. Evol. 15:1183-1188.
Negrutskii B. S., and A. V. El'skaya. 1998. Eukaryotic elongation factor-1
: structure, expression, functions and possible role in aminoacyl-tRNA channeling. Proc. Nucl. Acid Res. Mol. Biol. 60:47-78.
Nei, M. 1987. Molecular evolutionary genetics. Columbia University Press, New York.
Nei, M., R. Chakraborty, and P. A. Fuerst. 1976. Infinite allele model with varying mutation rate. Proc. Natl. Acad. Sci. USA. 73:4164-4168.
Page, R. D. M., and E. C. Holmes. 1998. Molecular evolution: a phylogenetic approach. Blackwell Science, Oxford.
Penny, D., B. J. McComish, M. A. Charleston, and M. D. Hendy. 2001. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. J. Mol. Evol. 53:711-723.[CrossRef][Web of Science][Medline]
Rzhetsky, A., and M. Nei. 1993. Theoretical foundation of the minimum-evolution methods of phylogenetic inference. Mol. Biol. Evol. 10:1073-1095.[Abstract]
Rzhetsky, A., and T. Sitnikova. 1996. When is it safe to use an oversimplified substitution model. Mol. Biol. Evol. 13:1255-1265.[Abstract]
Saitou, N., and M. Nei. 1987. The Neighbor-Joining method: a new method for reconstructing evolutionary trees. Mol. Biol. Evol. 4:406-425.[Abstract]
Steel, M., D. Huson, and P. J. Lockhart. 2000. Invariable sites models and their use in phylogeny reconstruction. Syst. Biol. 49:225-232.[CrossRef][Web of Science][Medline]
Swofford, D. L., G. L. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Chapter 11 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics, 2nd edition. Sinauer Associates, Sunderland, Mass.
Tuffley, C., and M. A. Steel. 1997. Modelling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:63-91.
Uzzel, T., and K. W. Corbin. 1971. Fitting discrete probability distributions to evolutionary events. Science 173:1089-1096.
van de Peer, Y., A. Ben Ali, and A. Meyer. 2000. Microsporidia: accumulating molecular evidence that a group of amitochondriate and suspectedly primitive eukaryotes are just curious fungi. Gene 246:1-8.[CrossRef][Web of Science][Medline]
Waddell, P. J., and M. A. Steel. 1997. General time-reversible distances with unequal rates across sites: Mixing
and inverse gaussian distributions with invariant sites. Mol. Phyl. Evol. 8:398-414.[CrossRef][Web of Science][Medline]
Yang, Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105-111.[Web of Science][Medline]
This article has been cited by other articles:
![]() |
J. Kim and M. J. Sanderson Penalized Likelihood Phylogenetic Inference: Bridging the Parsimony-Likelihood Gap Syst Biol, October 1, 2008; 57(5): 665 - 674. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Birin, Z. Gal-Or, I. Elias, and T. Tuller Inferring horizontal transfers in the presence of rearrangements by the minimum evolution criterion Bioinformatics, March 15, 2008; 24(6): 826 - 832. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Spencer, D. Bryant, and E. Susko Conditioned Genome Reconstruction: How to Avoid Choosing the Conditioning Genome Syst Biol, February 1, 2007; 56(1): 25 - 43. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Ruano-Rubio and M. A. Fares Artifactual Phylogenies Caused by Correlated Distribution of Substitution Rates among Sites and Lineages: The Good, the Bad, and the Ugly Syst Biol, February 1, 2007; 56(1): 68 - 82. [Abstract] [Full Text] [PDF] |
||||
![]() |
H.-C. Wang, M. Spencer, E. Susko, and A. J. Roger Testing for Covarion-like Evolution in Protein Sequences Mol. Biol. Evol., January 1, 2007; 24(1): 294 - 305. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Shalchian-Tabrizi, M. Skanseng, F. Ronquist, D. Klaveness, T. R. Bachvaroff, C. F. Delwiche, A. Botnen, T. Tengs, and K. S. Jakobsen Heterotachy Processes in Rhodophyte-Derived Secondhand Plastid Genes: Implications for Addressing the Origin and Evolution of Dinoflagellate Plastids Mol. Biol. Evol., August 1, 2006; 23(8): 1504 - 1515. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. J Roger and L. A Hug The origin and diversification of eukaryotes: problems with molecular phylogenetics and molecular clock estimation Phil Trans R Soc B, June 29, 2006; 361(1470): 1039 - 1054. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Lockhart, P. Novis, B. G. Milligan, J. Riden, A. Rambaut, and T. Larkum Heterotachy and Tree Building: A Case Study with Plastids and Eubacteria Mol. Biol. Evol., January 1, 2006; 23(1): 40 - 45. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. Belda, A. Moya, and F. J. Silva Genome Rearrangement Distances and Gene Order Phylogeny in {gamma}-Proteobacteria Mol. Biol. Evol., June 1, 2005; 22(6): 1456 - 1467. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Spencer, E. Susko, and A. J. Roger Likelihood, Parsimony, and Heterogeneous Evolution Mol. Biol. Evol., May 1, 2005; 22(5): 1161 - 1164. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
















