MBE Advance Access originally published online on April 18, 2008
Molecular Biology and Evolution 2008 25(7):1512-1520; doi:10.1093/molbev/msn098
© The Author 2008. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Difficulties in Testing for Covarion-Like Properties of Sequences under the Confounding Influence of Changing Proportions of Variable Sites

Nicole Gruenheit*, Peter J. Lockhart†, Mike Steel‡ and William Martin*

* Institute of Botany III, University of Düsseldorf, Düsseldorf, Germany
† Institute for Molecular BioSciences, Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand
‡ Biomathematics Research Centre, Allan Wilson Centre for Molecular Ecology and Evolution, University of Canterbury, Christchurch, New Zealand

E-mail: nicole.gruenheit@uni-duesseldorf.de.


    Abstract
The covarion (COV)-like properties of sequences are poorly described and their impact on phylogenetic analyses poorly understood. We demonstrate using simulations that, under an evolutionary model where the proportion of variable sites changes in nonadjacent lineages, log likelihood values for rates across site (RAS) and COV models become similar, making models difficult to distinguish. Further, although COV and RAS models provide a great improvement in likelihood scores over a homogeneous model with these simulated data, reconstruction accuracy of tree building is low, suggesting caution when it is suspected that proportions of variable sites differ in different evolutionary lineages. We study the performance of a recently developed contingency test that detects the presence of COV-type evolution modified for protein data. We report that if proportions of variable sites (pvar) change in a lineage-specific manner such that their distributions in different lineages become sufficiently nonoverlapping, then the contingency test can incorrectly suggest a homogeneous model. Also of concern is the possibility of different proportions of variable sites between the groups being studied. In a study of chloroplast proteins, interpretation of the test is found to be susceptible to different partitioning of taxon groups, making the test very subjective in its implementation. Extreme intergroup differences in the extent of divergence and difference in proportions of variable sites could be contributing to this effect.

Key Words: covarion • phylogenetics • chloroplast proteins • contingency test


    Introduction
Although sequence evolution is a temporally and spatially heterogeneous process, it is typically described by a homogeneous, stationary, time reversible model (Liò and Goldman 1998). Within this framework, improved phylogenetic estimates have often been obtained when site-specific properties of sequences have been modeled assuming that some sites are invariable (Adachi and Hasegawa 1995; Lockhart et al. 1996), nonindependent (von Haeseler and Schoniger 1998), and/or evolving with a discrete number of rate classes according to a gamma distribution (rates across site [RAS] models: Uzzel and Corbin 1971; Rzhetsky and Nei 1994; Yang 1994; Waddell et al. 1997).

More recently, a number of covarion (COV) (Fitch and Markowitz 1970) models have been implemented for phylogenetic analyses (Galtier 2001; Huelsenbeck 2002; Guindon et al. 2004; Wang et al. 2007), and these COV models have been found to provide further improvement over RAS models in terms of the relative fit to sequence data. This is presumably because these models capture a component of temporal heterogeneity in the evolutionary process: unlike RAS models, they allow the substitution properties of a site to change over time in a lineage-specific fashion. Under COV models, a site is free to switch back and forth between variable and invariable states along a branch.

In the COV model of Tuffley and Steel (1998), a site in a sequence may be either variable or invariable, and the state may differ in different lineages. All sites that are variable evolve under the same substitution process (e.g., JC69, HKY85, etc.) and at the same rate. The COV model of Huelsenbeck (2002) extends the Tuffley and Steel model by allowing a discrete number of rate classes for the variable state. Under this model, a site can switch between the OFF state and one of the variable rate classes but not between the different variable rate classes. A third COV model is that of Galtier (2001). In this model, there is a discrete number of rate classes for the variable state, and a site can switch between these rate classes, but there is no OFF state. Most recently, Wang et al. (2007) have combined the latter 2 models to produce a general model (one in which there can be a switch between all variable states and an OFF state).

All these COV models are stationary time reversible models and have an expectation that the proportion of variable sites is the same in all evolutionary lineages. However, this assumption can be overly restrictive as proportions of variable sites, pvar, have been inferred to vary in lineage-specific ways (Lockhart et al. 1996, 2006; Lopez et al. 2002). This property of sequence evolution can lead to topological biases that will mislead tree building (Lockhart et al. 1996, 2006). With some proteins, changes in pvar can be explained by lineage-specific differences in functional and structural constraints, due to differential loss/gain of functions ancillary to the core function of specific molecules (Susko et al. 2002; Inagaki et al. 2004; Guo and Stiller 2005).

Improving substitution models for phylogenetic analysis requires accurate tests to quantify the extent and nature of substitution model misspecification. A number of tests have been proposed to characterize COV-like substitution properties. However, as we illustrate using simulated and real data, the results of these tests need to be interpreted cautiously, particularly when pvar is not constant across the underlying phylogeny. Our findings highlight the need for improved analytical methods for studying the COV-like properties of sequences.


    Materials and Methods
Maximum Likelihood Analyses
Within a maximum likelihood framework, log likelihood scores can be used to evaluate the relative fit of COV, RAS, and homogeneous models to sequence data. The best scores can then be used to identify the best substitution model for tree building. To examine the accuracy of this approach under conditions that might approximate the biological complexity expected with empirical data, we have examined the scores obtained when sequences are simulated under what we call a Tuffley and Steel (1998) + invariable site + switch (TS + I + S) model. In this model, a proportion of sites is specified as invariable and a proportion of sites is evolving under a TS model. At specified positions in the tree, a proportion of the specified invariable class switches to the TS class of sites, and some of the TS class of sites may also switch to the invariable class. On a 4-taxon tree that contains 2 switch positions (x and y in fig. 1), this model produces data that are identical to a phylogenetic mixture of 8 classes of TS model (each class has the same topology but different branch lengths), as described in table 1. This mixture representation is possible because a site that becomes invariable in various regions of the tree, but whose evolution is otherwise covered by the TS model, still follows a TS model but with branch lengths set to zero over the regions where the site is invariable.


FIG. 1.— Pvar influences log likelihood. (a) Phylogenetic tree used for simulating alignments under a TS model. Relative branch lengths are indicated. A dot marks the point where the proportion of variable sites is increased in the outer branches leading to taxa B and C; at that point, 0–30% of the invariable sites are switched on. (b) Mean log likelihood for the simulated alignments according to the tree they were simulated on and 6 different models as indicated in the inset at upper right. The highest standard deviation (not plotted) for log likelihood among any 100 replicates is 329.43 found for the general model at pvar = 0.2.

 

Table 1
 
However, in the special case of convergent increase in variable sites in nonadjacent lineages, where at the 2 switch positions 1) the only change in class is from invariable to TS sites and 2) the sites that change class are the same, the phylogenetic mixture reduces from 8 to just 3 classes of TS model. The mixture allowed us to specify the proportion of sites belonging to the invariable class and to specify the proportion of sites that switch from this class to the TS class in the nonadjacent lineages. Sequences 10,000 nt in length were simulated with Seq-Gen-Aminocov (Rambaut and Grassly 1997) specifying a Tuffley and Steel (1998) model wherein the variable states evolve according to a Jukes and Cantor (1969) rate matrix. Table 1 shows the calculation of relative partition sizes for mixtures that describe a 4-taxon tree (with relative branch lengths) wherein 1) the ancestral proportion of variable sites is 0.2; 2) the terminal branches have a length of 0.4; and 3) at a distance of 0.1 (at points x and y) along the branches to taxa B and C, the proportion of sites undergoing a Tuffley and Steel process is increased. Pvar was increased in increments of 2% up to 30% of the invariable class so that the overall pvar in B and C ranged from 20% (no increase) to 50% (30% increase). For sites in the Tuffley and Steel class, a switching rate setting of 0.1 was used.
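The reduction from 8 mixture classes to 3 can be checked by enumeration. The following is a minimal sketch, not the authors' procedure: it assumes each site class is fully described by its starting state and by whether it switches at x and at y, and it assumes that the switched fraction is expressed as a fraction of all sites (so that the overall pvar in B and C runs from 0.2 to 0.5). The function names `mixture_classes` and `partition_sizes` are hypothetical.

```python
from itertools import product


def mixture_classes(allow_ts_to_inv=True, same_sites_switch=False):
    """Enumerate site classes for a 4-taxon tree with switch points x and y.

    Each class is (start_state, state_after_x, state_after_y), where "I"
    means invariable and "V" means variable under the Tuffley-Steel process.
    Assumption: a class is defined only by these three states.
    """
    flip = {"I": "V", "V": "I"}
    classes = []
    for start, sw_x, sw_y in product("IV", (False, True), (False, True)):
        if not allow_ts_to_inv and start == "V" and (sw_x or sw_y):
            continue  # only invariable -> variable switches are permitted
        if same_sites_switch and sw_x != sw_y:
            continue  # the same sites must switch at both points
        after_x = flip[start] if sw_x else start
        after_y = flip[start] if sw_y else start
        classes.append((start, after_x, after_y))
    return classes


def partition_sizes(pvar_ancestral, switched):
    """Relative sizes of the 3 classes in the special (convergent) case.

    Assumption: `switched` is a fraction of ALL sites, not of the
    invariable class.
    """
    if pvar_ancestral + switched > 1.0:
        raise ValueError("class proportions exceed 1")
    return {
        "always_variable": pvar_ancestral,
        "switched_on_in_B_and_C": switched,
        "always_invariable": 1.0 - pvar_ancestral - switched,
    }


if __name__ == "__main__":
    print(len(mixture_classes()))  # general TS + I + S case
    print(mixture_classes(allow_ts_to_inv=False, same_sites_switch=True))
    print(partition_sizes(0.2, 0.3))
```

Under these assumptions the general case yields 8 classes and the convergent case yields 3: sites that are always invariable, sites that are always variable, and sites that are switched on in both B and C.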

In our study, for each of the increments, 100 replicates were generated, and each simulated alignment was analyzed using Procov1.3 (Wang et al. 2007). The following models were compared using the standard optimization files without reestimation of the branch lengths: homogeneous, Tuffley and Steel (1998), RAS (Yang 1994), Galtier (2001), General (Wang et al. 2007), and Huelsenbeck (2002). For each alignment and model, the log likelihood was extracted using a Perl script. The mean for each parameter was calculated and plotted using MATLAB. The standard deviations for each set of 100 replicates were very narrow (<0.1% of the mean in all cases) and hence were not plotted.

Trees were reconstructed for the simulated data sets using PAUP* (Swofford 2003; maximum likelihood: lset nst = 1 basefreq = equal; lset tratio = 0.5 pinv = 0 rates = gamma shape = estimate; hsearch start = stepwise swap = tbr status = no nbest = 1; parsimony: hsearch start = stepwise swap = tbr status = no nbest = 1) and MrBayes (Ronquist and Huelsenbeck 2003; lset nst = 1 covarion = yes; mcmc nruns = 1 ngen = 250000 samplefreq = 100 filename = run1.nex; sumt burnin = 400).

Contingency Tests
Another approach to test whether a collection of sites in a multiple sequence alignment exhibits COV-type evolutionary properties is the contingency test developed by Lockhart et al. (1998), which is based on the test statistic W. This compares substitution differences between 2 groups of sequences:

\[ W = \frac{N_5}{N} - \frac{N_3 + N_5}{N} \cdot \frac{N_4 + N_5}{N} \]
where N5 is the number of sites that have varied in both groups, N3 and N4 are the numbers of those sites that have varied in one group but not in the other, and N is the total number of sites. Site patterns referred to as N1 and N2 (Lockhart et al. 1998) are not relevant here. N1 site patterns have the same residue in both groups. N2 sites are polymorphic between but not within groups. W compares the fraction of varied sites in each group and the extent to which these sites overlap with sites that have varied in both groups.

N3 or N4 (syn. type 3 or type 4) sites should be less frequent if sequences are evolving according to a RAS model than if the sites are evolving in a manner that approximates a COV model (Lockhart et al. 1998). If in real data there are more N3 and N4 sites than expected to occur by chance under a RAS model, this would constitute evidence for deviation from the assumptions of a RAS model and possibly evidence for a COV mode of sequence evolution (Lockhart et al. 1998).
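The site-pattern definitions above can be made concrete with a short sketch. This is an illustration only and not the Java program used in the study. It assumes that type 3 means "varied in group 1 only" and type 4 means "varied in group 2 only" (the text does not fix this assignment), that W takes the form N5/N − ((N3 + N5)/N)((N4 + N5)/N), and that columns containing a gap character `-` are skipped. The function names are hypothetical.

```python
def classify_site(col1, col2):
    """Return the site type (1-5) for one alignment column split by group."""
    varied1 = len(set(col1)) > 1
    varied2 = len(set(col2)) > 1
    if varied1 and varied2:
        return 5  # varied in both groups
    if varied1:
        return 3  # assumption: type 3 = varied in group 1 only
    if varied2:
        return 4  # assumption: type 4 = varied in group 2 only
    # constant within each group: same residue (type 1) or different (type 2)
    return 1 if col1[0] == col2[0] else 2


def site_counts(group1, group2):
    """Count site types over two lists of equal-length aligned sequences."""
    counts = {t: 0 for t in range(1, 6)}
    for col1, col2 in zip(zip(*group1), zip(*group2)):
        if "-" in col1 or "-" in col2:
            continue  # gapped columns are ignored
        counts[classify_site(col1, col2)] += 1
    return counts


def w_statistic(counts):
    """W = N5/N - ((N3 + N5)/N) * ((N4 + N5)/N), with N = all counted sites."""
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no ungapped sites to analyze")
    p1 = (counts[3] + counts[5]) / n  # fraction of sites varied in group 1
    p2 = (counts[4] + counts[5]) / n  # fraction of sites varied in group 2
    return counts[5] / n - p1 * p2


if __name__ == "__main__":
    g1 = ["MKLAV", "MKIAL"]  # toy sequences, not real data
    g2 = ["MRLAV", "MRLGI"]
    c = site_counts(g1, g2)
    print(c, w_statistic(c))
```

In the toy alignment each of the five columns falls into a different site type, which makes the classification easy to verify by eye.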

Ané et al. (2005) improved upon this test by providing a more rigorous means for obtaining expectations for the test statistic W under 3 different models of evolution: 1) a homogeneous model, wherein different sequence positions are equally variable; 2) a RAS model, wherein some sites are evolving faster than other sites; and 3) a Tuffley and Steel (1998) COV model. In doing this, they noted that W predicts that sites that are varied in one group are likely to be varied in other groups under RAS and COV models but not under a homogeneous model. A RAS model predicts a strong degree of correlation and a COV model a weaker degree of correlation. Under a homogeneous model, the W statistic is statistically zero. It is positive under a COV model and even more positive under a RAS model. The Ané et al. test uses simulation to interpret values of W in terms of support for each of the 3 models.

It does this by first examining whether there is evidence to reject a homogeneous model of sequence evolution in favor of a heterogeneous model. If so, it then examines whether there is evidence to reject a RAS model in favor of a more complex model of substitution. That is, if the derived W differs significantly from the expected distribution of W under a RAS model, the nucleotide or protein sequence is inferred to have evolved under a RAS + COV model. Ané et al. (2005) used this test to infer that a large proportion of proteins encoded in chloroplast genomes evolve according to a RAS + COV model.
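The two-stage logic of the test can be sketched as follows. This is a simplified illustration and not the implementation of Ané et al. (2005): it assumes that values of W simulated under the homogeneous and RAS models are already available as lists, that the homogeneous model is rejected when the observed W is unusually large, and that the RAS model is rejected when the observed W is unusually small (because a COV model is expected to give a less positive W than a RAS model). The function names, the one-sided tails, and the default `alpha` are assumptions.

```python
def upper_tail_p(observed, null_samples):
    """Fraction of simulated W values at least as large as the observed W."""
    return sum(s >= observed for s in null_samples) / len(null_samples)


def lower_tail_p(observed, null_samples):
    """Fraction of simulated W values at least as small as the observed W."""
    return sum(s <= observed for s in null_samples) / len(null_samples)


def classify_model(w_obs, w_homogeneous, w_ras, alpha=0.05):
    """Sequential test: homogeneous first, then RAS (assumed one-sided)."""
    # Stage 1: is W significantly greater than expected with no heterogeneity?
    if upper_tail_p(w_obs, w_homogeneous) > alpha:
        return "homogeneous not rejected"
    # Stage 2: is W significantly smaller than expected under RAS?
    if lower_tail_p(w_obs, w_ras) > alpha:
        return "RAS not rejected"
    return "RAS rejected (RAS + COV inferred)"


if __name__ == "__main__":
    # Toy null distributions standing in for simulated values of W.
    w_hom = [(i - 50) / 5000 for i in range(100)]   # centered near 0
    w_ras = [0.05 + i / 5000 for i in range(100)]   # clearly positive
    for w in (0.0005, 0.0601, 0.03):
        print(w, classify_model(w, w_hom, w_ras))
```

An observed W that sits between the two simulated distributions is the case in which both simpler models are rejected. The plain empirical fractions used here can return a P value of exactly zero; a corrected estimator could be substituted without changing the structure.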

We have implemented the method of Ané et al. (2005) for analyzing protein sequences and used it to reexamine chloroplast genome sequences also studied by Ané et al. The sequences used are from Acorus calamus (NC_007407), Adiantum capillus-veneris (NC_004766), Amborella trichopoda (NC_005086), Anthoceros formosae (NC_004543), Arabidopsis thaliana (NC_000932), Atropa belladonna (NC_004561), Calycanthus floridus (NC_004993), Chaetosphaeridium globosum (NC_004115), Chlamydomonas reinhardtii (NC_005353), Chlorella vulgaris (NC_001865), Cyanidioschyzon merolae (NC_004799), Cyanophora paradoxa (NC_001675), Epifagus virginiana (NC_001568), Ginkgo biloba (DQ069337–DQ069702), Guillardia theta (NC_000926), Lotus corniculatus (NC_002694), Marchantia polymorpha (NC_001319), Medicago truncatula (NC_003119), Mesostigma viride (NC_002186), Nephroselmis olivacea (NC_000927), Nicotiana tabacum (NC_001879), Nuphar advena (DQ069337–DQ069702), Nymphaea alba (NC_006050), Odontella sinensis (NC_001713), Oenothera elata (NC_002693), Oryza sativa (NC_001320), Physcomitrella patens (NC_005087), Pinus koraiensis (NC_004677), Pinus thunbergii (NC_001631), Porphyra purpurea (NC_000925), Psilotum nudum (NC_003386), Ranunculus macranthus (DQ069337–DQ069702), Saccharum officinarum (NC_006084), Spinacia oleracea (NC_002202), Triticum aestivum (NC_002762), Typha latifolia (DQ069337–DQ069702), Yucca schidigera (DQ069337–DQ069702), and Zea mays (NC_001666).

Sequences were aligned using ClustalW (Thompson et al. 1994), and all gapped sites were removed. To obtain a phylogenetic overview of the data set, sequences were concatenated and LogDet distances were computed with LDDist (Lake 1994; Lockhart et al. 1994; Thollesson 2004), from which phylogenetic networks were constructed with Neighbor-Net as implemented in SplitsTree4 (Huson and Bryant 2006). A Java program was written to count the different types of sites and is available upon request. For each alignment, it reports the numbers of type 1, 2, 3, 4, and 5 sites. Sites with gaps are ignored.


    Results
Maximum Likelihood Analyses
We have investigated the extent to which time reversible substitution models describe the evolution of sequences that have evolved under a Tuffley and Steel (1998) model where the proportion of variable sites, pvar, does not remain constant across all lineages. Figure 1 shows the relative fit (log likelihood scores) of homogeneous, RAS, and 3 COV models to these simulated data. When the sequences have evolved under the comparatively simple Tuffley–Steel model, more complicated COV models nevertheless gave improved likelihood scores both when pvar was constant across all lineages and when pvar was incrementally increased to 0.3 in 2 nonadjacent lineages. Changes in pvar > 0.16 resulted in differences of log likelihood for replicates for a given heterogeneous model that exceeded the differences among different heterogeneous models for a given pvar. As further shown in figure 1, log likelihood values for the different substitution models began to converge as pvar increased in nonadjacent lineages. For these data, tree building exhibited low reconstruction accuracy (fig. 2). A change of merely 8–12% of the invariable sites becoming variable in nonadjacent lineages caused sufficient topological distortion (long branches leading to sequences B and C) to mislead maximum likelihood (lset nst = 1, assumed discrete gamma, estimated alpha; fig. 2a), parsimony (fig. 2b), and Bayesian (lset nst = 1, assumed Huelsenbeck COV model, estimated switching rates; fig. 2c) tree building. Thus, although RAS and heterogeneous COV models provided an improved fit to the sequences, as pvar increased in nonadjacent lineages it became more difficult to distinguish among the different heterogeneous models, and neither RAS nor a COV model was sufficient to allow for reliable reconstruction of the true phylogeny with only moderate increases in pvar. This underscores a significant and seldom examined effect of pvar in phylogenetic inference.


FIG. 2.— Phylogenetic reconstruction accuracy for increased proportions of variable sites. Data were simulated on the tree shown in figure 1 with incremental increases in pvar as indicated on the abscissa, and the phylogeny was inferred with 3 different methods. (a) Maximum likelihood inference (RAS). (b) Parsimony inference (PARS). (c) Bayesian inference (COV). Reconstruction accuracy for all 3 methods drops markedly with an increase of only ~12% of invariable sites switching to the Tuffley and Steel site class at points x and y shown in figure 1. Only the maximum likelihood inference method delivered unresolved (star) trees. BA, BC, and BD designate the 3 possible topologies, respectively.

 
Contingency Tests
Characterization of the COV-like properties of sequence data can also be made using contingency tests. The test of Ané et al. overcomes problems of interpreting the W statistic with real data that were not solved by Lockhart et al. (1998). In doing so, these authors also described the impact that taxon sampling is expected to have on the power of the test. They studied this for the case of 2 monophyletic groups in terms of the edge lengths within (t) and between (T) compared clades. However, in implementing the test, these authors compared a monophyletic group and a paraphyletic group. Although the test is still validly applied in this case, we report that the expectations for performance of the test differ from those described when 2 monophyletic groups are compared. In demonstrating this, 2 different data sets were analyzed. The first data set (#1) comprised 29 land plants and algae (fig. 3a) and 42 chloroplast proteins. In the second data set (#2), there were 26 land plants (fig. 3b) and 57 chloroplast proteins. The contingency test of Ané et al. (2005) was adapted to investigate protein instead of nucleotide sequences (source code available upon request). As in Ané et al. (2005), we identified groups for comparison: (group 1) eudicotyledons, (group 2) angiosperms, and (group 3) angiosperms and gymnosperms. We also considered (group 4) angiosperms, gymnosperms, moss, and ferns. In the implementation of the test of Ané et al. (2005), each of the first 3 groups was compared against the rest of the data set. In our implementation, 2 monophyletic groups were compared. The N3 and N4 sites in the alignment were counted for each comparison using a Java program and plotted as histograms (fig. 4). The proportions of N3 and N4 sites among all sites are shown in table 2.


FIG. 3.— Comparisons used in the present study. (a) Neighbor-Net (Bryant and Moulton 2004) of 45 concatenated alignments of chloroplast proteins representing all taxa of the old data set. Marked with colored boxes are the groups used in the analyzed comparisons. Groups 1 (red, eudicots) and 2 (green, angiosperms) were proposed in Ané et al. (2005). In addition to those, 2 more groups were chosen. Group 3 (blue) contains all angiosperms and 2 gymnosperms, group 4 (turquoise) contains group 3 and all mosses and ferns. (b) Neighbor-Net of 58 concatenated alignments of chloroplast proteins representing taxa in the second data set. Groups used in the comparisons are indicated.

 

FIG. 4.— Influence of groups compared upon relative proportions of inferred N3 and N4 sites in chloroplast-encoded proteins. Proportion of proteins in different group comparisons is given on the y axis. Proportions of N3 and N4 sites per protein are indicated on the x axis. Groups compared are shown in figure 3, comparisons (inset) are listed in table 2. (a) Ingroup–outgroup comparisons. (b) Comparisons of monophyletic groups.

 

Table 2 Comparisons of Group A versus Group B and Proportions of N3 and N4 Sites among All Sites

 
A striking feature of the aligned sequence data is the different proportions of N3 and N4 sites among different groups of sequences. In comparisons of a monophyletic versus paraphyletic group, the number of N3 sites greatly exceeds the number of N4 sites. All proteins had at least 20% N3 sites, and in 16% of the proteins >70% of all sites were N3, whereas no protein had an N4 site (fig. 4a). In some proteins, more than 80% of all sites were N3 or N4 sites. In the comparison of 2 monophyletic groups, far fewer N3 and N4 sites were found and a considerably greater balance between the numbers of N3 and N4 sites was observed (fig. 4b). Most proteins had <5% of either N3 or N4 sites; the maximum proportion of N3 or N4 sites lay between 40% and 50%. Ané et al. (2005) detected 21 proteins that were inferred to reject the RAS model using nucleotide site patterns. Using the same groups and amino acid site patterns (instead of nucleotide site patterns), we found that 28/42 (67%) of the proteins tested (data set #1) would reject the RAS model in all comparisons of the ingroup versus outgroup type. By contrast, only one protein out of 57 investigated (data set #2), rbcL, rejected RAS in all comparisons of monophyletic groups (table 3). Thus, balanced versus unbalanced sampling of sequences gave very different results in terms of evidence for COV-like properties of the sequences.


Table 3 Proteins Rejecting a RAS Model at P = 0.95

 
A further property of the test statistic W also suggests caution in its application. This is that W can become negative (or close to 0) when distributions of variable sites in the groups being compared become sufficiently different, as might happen if the spatial pattern of substitution differs from that expected under time reversible COV models. Thus, unexpected but nevertheless COV-like patterns could lead the W statistic to underestimate the heterogeneity of the substitution process. The expected value w of W can be written as:

\[ w = p_{12} - p_1 p_2 \]
where $p_i$ is the probability that a site has varied in group $i$ ($i$ = 1 or 2) and where $p_{12}$ is the probability that a site has varied in both group 1 and group 2. Under both the Tuffley–Steel model and the RAS model, $w \geq 0$. However, if the distribution of variable sites has evolved in a more complex manner than envisaged by Tuffley–Steel, then it can be shown that $w \leq 0$ is possible. For example, consider a model where sites fall into 4 classes depending on whether they are variable or invariable in the 2 groups $G_1$, $G_2$, and let

$v_i$ = Proportion of variable sites in group $G_i$,
$v_{12}$ = Proportion of sites variable in both $G_1$ and $G_2$,
$\pi_i$ = Probability that a site that is variable in $G_i$ has varied in $G_i$, and
$\pi_{12}$ = Probability that a site that is variable in $G_1$ and $G_2$ has varied in $G_1$ and $G_2$.

Then $p_i \approx \pi_i v_i$ and $p_{12} \approx \pi_{12} v_{12}$, and for a substitution process that is group based (e.g., Jukes and Cantor; Kimura 2P and Kimura 3ST models), we also have $\pi_{12} = \pi_1 \pi_2$ and $w \approx \pi_1 \pi_2 (v_{12} - v_1 v_2)$. If the proportion of variable sites increases in $G_2$ whereby the variable sites in $G_1$ are a subset of the variable sites in $G_2$, then the proportion of sites variable in both $G_1$ and $G_2$ will equal the proportion of sites variable in $G_1$, thus $v_{12} = v_1$ and $w \geq 0$ because $w \approx \pi_1 \pi_2 (v_{12} - v_1 v_2) = \pi_1 \pi_2 v_1 (1 - v_2) \geq 0$. However, if there is little, or in the extreme case, no overlap in the sites that are variable in $G_1$ and $G_2$, then $w$ can take a negative value.

As a simple example, consider a hypothetical protein 100 amino acids in length. In $G_1$, the 30 N-terminal sites of this protein become variable but the 70 C-terminal sites remain constant, whereas in $G_2$, the 30 C-terminal sites become variable but the 70 N-terminal sites remain constant. In this case, $w \leq 0$ even though the proportion of variable sites in the 2 groups is the same ($v_1 = v_2$), provided that $v_{12} < v_1^2$ (because if $v_1 = v_2$, then $w \approx \pi_1 \pi_2 (v_{12} - v_1 v_2) = \pi_1 \pi_2 (v_{12} - v_1^2)$). Here $v_{12} = 0$ and $v_1 = v_2 = 0.3$, so $w \approx \pi_1 \pi_2 (0 - 0.3 \times 0.3) < 0$.
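The two scenarios discussed above (nested versus non-overlapping sets of variable sites) can be checked numerically. The sketch below assumes the approximation w ≈ π1π2(v12 − v1v2), with v12 the proportion of sites variable in both groups. It sets π1 = π2 = 1 purely for illustration, since any positive values leave the sign of w unchanged. The function names and the nested example (20 variable sites in G1 contained within 50 in G2) are hypothetical.

```python
def variable_fractions(sites1, sites2, n_sites):
    """Return (v1, v2, v12) from the sets of variable site indices."""
    v1 = len(sites1) / n_sites
    v2 = len(sites2) / n_sites
    v12 = len(sites1 & sites2) / n_sites  # variable in both groups
    return v1, v2, v12


def expected_w(v1, v2, v12, pi1=1.0, pi2=1.0):
    """Approximate w = pi1 * pi2 * (v12 - v1 * v2).

    pi1 and pi2 default to 1.0 as an illustrative assumption.
    """
    return pi1 * pi2 * (v12 - v1 * v2)


if __name__ == "__main__":
    n = 100
    # Disjoint case from the text: 30 N-terminal sites vs 30 C-terminal sites.
    disjoint = variable_fractions(set(range(0, 30)), set(range(70, 100)), n)
    # Nested case (hypothetical numbers): G1 sites are a subset of G2 sites.
    nested = variable_fractions(set(range(0, 20)), set(range(0, 50)), n)
    print("disjoint:", disjoint, expected_w(*disjoint))
    print("nested:  ", nested, expected_w(*nested))
```

With non-overlapping variable sites the approximation gives a negative value, whereas the nested configuration gives a positive one, in line with the argument in the text.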


    Discussion
For confidence in the reliability of tree building from highly diverged sequences, it is essential to develop low parameter substitution models that capture the heterogeneous complexity of sequence evolution. However, as we have illustrated, current methods need to be applied cautiously in characterizing the evolutionary properties of highly diverged sequences, and our current understanding of sequence evolution is limiting for model development. In this respect, it is important to note that tests of heterotachy, which we have not discussed (e.g., Lopez et al. 1999; Misof et al. 2002; Susko et al. 2002; Baele et al. 2006), while being informative, are nevertheless not sufficient for developing models of sequence evolution. The reason is that different processes of change can lead to very similar patterns of heterotachy. These tests cannot distinguish an evolutionary model where there is a constant rate of evolution, but different proportions of variable sites in different lineages (the model studied by Lockhart and Steel 2005), from a model where there is the same proportion of variable sites in different lineages and lineage-specific rates of substitution (the scenario studied by Felsenstein 1978). This distinction is important because, as demonstrated here, when pvar changes, model fitting can favor a model that does not improve phylogenetic accuracy. Contingency tests to identify COV-like properties may seem promising, but their implementation is problematic. Sampling of taxa can significantly affect the outcome of the test, and deciding upon an objective sampling criterion is not straightforward. In the present study, the contingency test of Ané et al. (2005) gave very different results depending on whether comparisons were made between 2 monophyletic groups or a monophyletic group and a paraphyletic group. Both comparisons are valid, but which result is correct? Further, it is unclear whether this difference is due to the much greater divergence among the paraphyletic species (this group containing algae, e.g., which have had much more time to evolve than sites in the eudicots; hence, many N3 sites are expected even under a RAS model) or whether it is because the substitution properties in the algal sequences differ significantly from those in the higher plants (Lockhart et al. 2006; Rodriguez-Ezpelata et al. 2007).

A recent development in modeling substitution properties of sequences is to fit a mixture of substitution models to each site in an alignment of sequences (e.g., Pagel and Meade 2004; Lartillot et al. 2007). This approach can also be extended to fit a mixture of trees with different branch lengths to the sequences (e.g., Kolaczkowski and Thornton 2004; Zhou et al. 2007). There are issues of identifiability with complex mixture and COV models (Allman and Rhodes 2007), but potentially tests might be developed using such models to better characterize temporal heterogeneity in the evolution of sequences. Such developments will be important because although RAS models have generally improved phylogenetic inference, as we demonstrate here, they are unable to account for lineage-specific patterns of changing pvar. They, and currently implemented COV models, are unable to account for the form of heterotachy that most likely describes the evolution of biological sequences; the further development of mixture models is of interest in this respect.


    Acknowledgements
We thank Simon Whelan, Andrew Roger, Ed Susko, John Rhodes, Liat Shavit, Simon Joly, Elizabeth Allman, Oliver Deusch, and Tal Dagan for helpful discussions and Microsoft (P.J.L.) and the Julius von Haast Fellowship Fund (W.M.) for research fellowships. This work was funded by the New Zealand Marsden Fund (P.J.L.) and the Deutsche Forschungsgemeinschaft (W.M.).


    Footnotes
 
Andrew Roger, Associate Editor


    References
 

    Adachi J, Hasegawa M. Improved dating of the human/chimpanzee separation in the mitochondrial DNA tree: heterogeneity among amino acid sites. J Mol Evol (1995) 40:622–628.

    Allman ES, Rhodes J. The identifiability of tree topology for phylogenetic models. J Comput Biol (2007) 13:1103–1113.

    Ané C, Burleigh JG, MacMahon MM, Sanderson MJ. Covarion structure in plastid genome evolution: a new statistical test. Mol Biol Evol (2005) 22:914–924.

    Baele G, Raes J, Van de Peer Y, Vansteelandt S. An improved statistical method for detecting heterotachy in nucleotide sequences. Mol Biol Evol (2006) 23:1397–1405.

    Bryant D, Moulton V. Neighbor-Net: an agglomerative method for the construction of phylogenetic networks. Mol Biol Evol (2004) 21:255–265.

    Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool (1978) 27:401–410.

    Fitch WM, Markowitz E. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem Genet (1970) 4:579–593.

    Galtier N. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol Biol Evol (2001) 18:866–873.

    Guo Z, Stiller J. Comparative genomics and evolution of proteins associated with RNA polymerase II C terminal domain. Mol Biol Evol (2005) 22:2166–2178.

    Guindon S, Rodrigo AG, Dyer KA, Huelsenbeck JP. Modelling the site specific variation of selection patterns along lineages. Proc Natl Acad Sci USA (2004) 101:12957–12962.

    Huelsenbeck JP. Testing a covariotide model of DNA substitution. Mol Biol Evol (2002) 19:698–707.

    Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol (2006) 23:254–267.

    Inagaki Y, Susko E, Fast NM, Roger AJ. Covarion shifts cause a long-branch attraction artifact that unites microsporidia and archaebacteria in EF-1α phylogenies. Mol Biol Evol (2004) 21:1340–1349.

    Jukes TH, Cantor CR. Evolution of protein molecules. In: Munro HN, ed. Mammalian protein metabolism. (1969) New York: Academic Press. 21–123.

    Kolaczkowski B, Thornton JW. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature (2004) 431:980–984.

    Lake JA. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc Natl Acad Sci USA (1994) 91:1455–1459.

    Lartillot N, Brinkmann H, Philippe H. Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol (2007) 7(Suppl 1):S4.

    Liò P, Goldman N. Models of molecular evolution and phylogeny. Genome Res (1998) 8:1233–1244.

    Lockhart PJ, Larkum AWD, Steel MA, Waddell PJ, Penny D. Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc Natl Acad Sci USA (1996) 93:1930–1934.

    Lockhart PJ, Novis P, Milligan BG, Riden J, Rambaut A, Larkum AWD. Heterotachy and tree building: a case study with plastids and eubacteria. Mol Biol Evol (2006) 23:40–45.

    Lockhart PJ, Steel M. A tale of two processes. Syst Biol (2005) 54:948–951.

    Lockhart PJ, Steel M, Barbrook AC, Huson DH, Charleston MA, Howe CJ. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol Biol Evol (1998) 15:1183–1188.

    Lockhart PJ, Steel M, Hendy M, Penny D. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol Biol Evol (1994) 11:605–612.

    Lopez P, Casane D, Philippe H. Heterotachy, an important process of protein evolution. Mol Biol Evol (2002) 19:1–7.

    Lopez P, Forterre P, Philippe H. The root of the tree of life in the light of the covarion model. J Mol Evol (1999) 49:496–508.

    Misof B, Anderson CL, Buckley TR, Erpenbeck D, Rickert A, Misof K. An empirical analysis of mt 16S rRNA covarion-like evolution of insects: site-specific rate variation is clustered and frequently detected. J Mol Evol (2002) 55:460–469.

    Pagel M, Meade A. A phylogenetic mixture model for detecting pattern heterogeneity in gene sequence or character-state data. Syst Biol (2004) 53:571–581.

    Rambaut A, Grassly NC. Seq-Gen: an application for the Monte-Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci (1997) 13:235–238.

    Rodriguez-Ezpelata N, Philippe H, Brinkmann H, Burkhard B, Melkonian M. Phylogenetic analyses of nuclear, mitochondrial and plastid multi-gene datasets support the placement of Mesostigma in the Streptophyta. Mol Biol Evol (2007) 24:723–731.

    Ronquist F, Huelsenbeck JP. MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.

    Rzhetsky A, Nei M. Unbiased estimates of the number of nucleotide substitutions when substitution rate varies among different sites. J Mol Evol (1994) 38:295–299.

    Susko E, Inagaki Y, Field C, Holder ME, Roger AJ. Testing for differences in rates-across-sites distributions in phylogenetic subtrees. Mol Biol Evol (2002) 19:1514–1523.

    Swofford DL. PAUP*. Phylogenetic analysis using parsimony (*and other methods), version 4 (2003) Sunderland (MA): Sinauer.

    Thollesson M. LDDist: a Perl module for calculating LogDet pair-wise distances for protein and nucleotide sequences. Bioinformatics (2004) 20:416–418.

    Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.

    Tuffley C, Steel M. Modeling the covarion hypothesis of nucleotide substitution. Math Biosci (1998) 147:63–91.

    Uzzel T, Corbin KW. Fitting discrete probability distributions to evolutionary events. Science (1971) 172:1089–1096.

    von Haeseler A, Schoniger M. Evolution of DNA or amino acid sequences with dependent sites. J Comput Biol (1998) 5:149–164.

    Waddell PJ, Penny D, Moore T. Hadamard conjugations and modelling sequence evolution with unequal rates across sites. Mol Phylogenet Evol (1997) 8:33–50.

    Wang H-C, Spencer M, Susko E, Roger AJ. Testing for covarion-like evolution in protein sequences. Mol Biol Evol (2007) 24:294–305.

    Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol (1994) 39:306–314.

    Zhou Y, Rodrigue N, Lartillot N, Philippe H. Evaluation of the models handling heterotachy in phylogenetic inference. BMC Evol Biol (2007) 7:206.

Accepted for publication April 13, 2008.





