Molecular Biology and Evolution 19:1514-1523 (2002)
© 2002 Society for Molecular Biology and Evolution
Testing for Differences in Rates-Across-Sites Distributions in Phylogenetic Subtrees



*Department of Mathematics and Statistics, Dalhousie University;
Program in Evolutionary Biology, Department of Biochemistry and Molecular Biology, Canadian Institute for Advanced Research, Dalhousie University;
Xeotron Corporation, Houston
| Abstract |
|---|
|
|
|---|
It has long been recognized that the rates of molecular evolution vary amongst sites in proteins. The usual model for rate heterogeneity assumes independent rate variation according to a rate distribution. In such models the rate at a site, although random, is assumed fixed throughout the evolutionary tree. Recent work by several groups has suggested that rates at sites often vary across subtrees of the larger tree as well as across sites. This phenomenon is not captured by most phylogenetic models but instead is more similar to the covarion model of Fitch and coworkers. In this article we present methods that can be useful in detecting whether different rates occur in two different subtrees of the larger tree and where these differences occur. Parametric bootstrapping and orthogonal regression methodologies are used to test for rate differences and to make statements about the general differences in the rates at sites. Confidence intervals based on the conditional distributions of rates at sites are then used to detect where the rate differences occur. Such methods will be helpful in studying the phylogenetic, structural, and functional bases of changes in evolutionary rates at sites, a phenomenon that has important consequences for deep phylogenetic inference.
| Introduction |
|---|
|
|
|---|
Recent work has suggested that the rates of molecular evolution at sites in an alignment often vary across subtrees of the larger evolutionary tree as well as across sites (Miyamoto and Fitch 1995
The first type of methodology for detecting rate differences uses distances between rates estimated separately for the two subtrees of interest. Because there is a wide variety of ways in which rate variation can occur (for instance, only small rates might vary between the two subtrees or only large rates might vary) a number of different distances are considered. A parametric bootstrap is used with these distances to determine what types of distances are expected under the null hypothesis that there are no rate differences across the two trees.
An alternative approach for detecting whether rate differences exist comes through a regression analysis of the rate estimates. This is fairly easy to implement and can be used to complement the parametric bootstrap methodology. Under the null hypothesis that the rates are the same at each of the sites, a plot of the rates for one of the trees against the rates at the corresponding site for the other tree should look like scatter around the x = y line. We use orthogonal regression methodology with the rate estimates for the two trees to test whether the mean relationship between the rates satisfies this hypothesis. The estimated coefficients from the regression methods allow for statements about the general differences in the rates.
Given that rate differences exist between subtrees, it is of interest to determine at which sites in the sequences these changes occur. Confidence intervals for the rates at a site in the two subtrees based on the conditional distribution of the rates, given the data at the site, provide one way of detecting where changes that take random variation into account occur. Although confidence intervals for the rate at a site for a tree are of interest themselves, in the present context they are useful to construct confidence intervals for the differences in rates. Sites with confidence intervals for differences that do not contain 0 are likely to correspond to rate differences in the two subtrees.
Confidence intervals can be estimated separately for subtrees or based on a bivariate distribution for the rates in the separate subtrees. The bivariate model considered here is similar to Gu's (2001)
model of functional divergence. Under the Gu (2001)
model, the bivariate distribution of the rates in subtrees is a mixture of a distribution in which the rates are assigned independently to the two subtrees and a distribution in which the same rates are assigned to both subtrees. Here, in contrast, the bivariate distribution for rates is not restricted to any particular form. In addition, the methodology presented here provides confidence intervals for the differences in the rates at sites.
| Models for Rate Variation |
|---|
|
|
|---|
The usual model for rate variation treats rates at different sites as random variables that are independently distributed from a common rate distribution (cf. Yang, 1994
![]() |
1, ...,
k to rates r1, ..., rk is then
|
|
Given the estimated tree for a set of data, rate estimates are based on the conditional distribution of rates, given the data at a site. This distribution can be obtained through Bayes' formula:
|
|
|
|
In cases where differences in rates in subtrees are of interest, separate rate estimates can be obtained by treating the data in the subtrees separately:
- Use the data for subtree 1 to obtain a tree estimate T1 for the taxa in that subtree, and similarly obtain a tree estimate T2 for the taxa in subtree 2.
- Use the separate conditional distributions P(rj|x;T1), P(rj|x;T2) and the separate data at the sites to obtain rate estimates ri1 and ri2 at site i.
| Parametric Bootstrap Methodology for Detecting Rate Differences |
|---|
|
|
|---|
The first method that we present for detecting whether rate differences exist between two subtrees compares the distance between rates estimated separately for the two subtrees with a parametric bootstrap distribution of the distances under the null hypothesis of no rate change. To obtain an overall measure of the change in rates in subtrees we developed three global distance measures, whereby real data could be compared with Monte-Carlo simulated data under the null hypothesis of no rate change.
The first rate-distance measure is simply the absolute value of the conditional mode rate for position i in the alignment from the first subtree subtracted from the conditional mode rate for the corresponding position in the second subtree. The sum of these distances, summed over all sites, yields the global arsum distance measure
|
|
< 1 where the distribution has a large, but low mass, tail), the arsum measure will be most strongly influenced by changes between the largest rate category in one subtree and all other rate categories in the other subtree. To deal with this problem we used a second rate-distance measure based on the absolute value of the logarithm of site-by-site rate ratios summed over all sites:
|
|
|
|
In addition to these three distance measures, the same measures were calculated without the absolute values (designated rsum, lrsum, and brsum, respectively). In principle, changes in rates at sites can be equally distributed across both subtrees (a homogeneous shift) or unequally distributed (nonhomogeneous shift). In the latter case, the rates could be systematically higher in one subtree versus the second subtree. The nonabsolute value measure described here can distinguish between these alternatives. If no systematic shift in rates has occurred, then the summed differences are expected to cancel out and fall within the simulated null distribution of the same measure. By contrast, a systematic shift will yield an imbalance of positive or negative distances that, through summation, will be evident as a large positive or negative value falling outside of the simulated null distribution.
The parametric bootstrapping tests were carried out as follows. For each pair of subtrees, the full data set alignment of the two subdata sets together is initially considered. From this alignment, the
parameter,
total, is estimated by maximum likelihood on a neighbor-joining topology using TREE-PUZZLE version 4.02 (Strimmer and von Haeseler 1996
), and maximum likelihood distances for all pairs of taxa are estimated using the PAM +
model (the PAM 001 model was used and the gamma distribution is approximated by an eight-category discrete rate distribution, where the eight categories have equal probability and the rate for each category is set to its mean). From this distance matrix, an optimal topology is estimated using the Fitch-Margoliash weighted least-squares method with 10 random additions with global rearrangements implemented in the PHYLIP 3.5c (Felsenstein 1993
) program FITCH. Maximum likelihood branch lengths for topology are estimated with TREE-PUZZLE with the PAM +
model using
total. From this topology, N data sets of equal size are simulated with the PAM +
model using the program PSeq-Gen (Rambaut and Grassly 1997
). Each of the full data set alignments (observed or simulated) is then split into the two smaller data sets corresponding to the separate subtrees (data set1 and data set2) for separate analysis. For each of the smaller data sets, a maximum likelihood distance matrix is estimated with the PAM +
model using
total. A topology is estimated from this matrix using the neighbor-joining algorithm (implemented in the PHYLIP program NEIGHBOR), and this is used as a user-defined tree for a second round of TREE-PUZZLE analysis. This analysis uses the PAM +
model, using
total to estimate maximum likelihood branchlengths for the NJ topology and the conditional mode rates for the data set. The conditional mode rates for each site are then compared between data set1 and data set2, using the global distance measures described earlier. A shell-script program, COVAR, was written to automate these comparisons.
| Testing for Rate Differences Using Regression Methods |
|---|
|
|
|---|
The regression methods presented in this section provide a straightforward way of testing whether rate changes have occurred at all in the two subtrees of interest.
Let ri1, ri2 denote the rate estimates at site i for the two trees, and let µi1, µi2 denote the true unobserved rates at the site. We can then define errors in estimation through
![]() |
ijs will be near 0. Without additional information that indicates otherwise, we assume that the error in estimation of rates will on average be similar for both of the trees so that the
ijs are independent 0 mean random variables having a common distribution.
Under the null hypothesis of interest H0: µi1 = µi2 for all i. If this null hypothesis is true, then
![]() |
i =
i2 -
i1. Then
![]() |
![]() |
In most regression models of the form (5)
, least squares estimates of (ß0, ß1) are used. These are not, however, appropriate in the current setting. This is because both of the variables in the regression equation are subject to error. As a consequence, least squares estimates of the slope are biased in the direction of 0. Because of the common error process for the two trees (the errors
i1 and
i2 are assumed to have the same error distribution), an adjustment is available through orthogonal regression (cf.
12.3 Casella and Berger 1990
, pp. 583594).
Orthogonal regression is best explained by contrast with least squares regression. In both forms of regression, parameters are chosen to make the sum of the distances between the observed rates (ri1, ri2) and points on the regression line as small as possible. The regression methods differ in the way they measure the distance from the observed (ri1, ri2) to the line. As illustrated in figure 1 , least squares estimation uses the vertical distance, whereas orthogonal regression uses perpendicular distance.
|
Standard calculations give the orthogonal regression estimates of the slope and intercept of the regression line as
|
|
i (ri1 -
1)2, S22 =
i (ri2 -
2)2, and S12 =
i (ri1 -
1)(ri2 -
2).
If the distribution of the error terms
i1 and
i2 is normal, orthogonal regression turns out to be equivalent to maximum likelihood estimation of the regression parameters. Even if this is not the case, it provides a reasonable methodology for estimating the parameters in the model. A standard error for the estimate
1 is available as
|
|
The hypothesis that (ß0, ß1) = (0,1) in (eq. 5
) is equivalent to the hypothesis that ß1 = 1 and that the mean rates, averaged over all sites, for the two subtrees is the same. A paired t-test can be used to test the hypothesis that the mean rates are equal. A test of the hypothesis that ß1 = 1 can be constructed from the orthogonal regression estimate and standard error for
1. An approximate P value for the test is
![]() |
(z) is the standard normal cumulative distribution function. | Confidence Bounds for Rates |
|---|
|
|
|---|
The regression and parametric bootstrap methodologies allow one to draw inferences about whether a rate difference between two subtrees is present. In the event that a rate difference is detected through the methodology, a follow-up analysis of the location of the rate differences is required. In this section we present methods that can be used to construct confidence intervals for rates at individual sites in a given subtree.
Confidence bounds for the rate at a site for a given subtree is most naturally based on the conditional distribution for rates at a site:
|
|
) x 100% confidence limits for the rate at site with data x is given by [l(x), u(x)], where l(x) and u(x) satisfy
|
|
) for any l(x) and u(x) satisfying equations (9) and (10)
. To make the widths of the confidence intervals as small as possible, l(x) should be chosen as the largest number satisfying equation (9)
and u(x) should be chosen as the smallest value satisfying equation (10) . The confidence interval [l(x), u(x)] must be considered an approximate confidence interval because its stated confidence properties hold when the rate distribution, branch lengths, and all other parameters involved in the calculation of P(rj|x; T) are known. In practice, estimates of these quantities are used.
The confidence bounds for the rates at individual sites in individual subtrees are of interest in their own right but also provide a means for detecting the locations of rate differences. If, for a given site, the confidence bounds for rates in two subtrees do not overlap, a rate difference is suggested. More generally, the confidence intervals for the rates at sites can be used to construct confidence intervals for the rate difference between the two sites. For a given site, suppose that the (1 -
/2) x 100% confidence bounds for the rates in two subtrees are [l1, u1] and [l2, u2]. Then a (1 -
) x 100% confidence interval for the rate difference is the set of all difference r1 - r2 with r1 in [l1, u1] and r2 in [l2, u2]; this set is simply [l1 - u2, u1 - l2].
| Confidence Bounds for Rate Differences |
|---|
|
|
|---|
Using individual confidence intervals to detect rate differences turns out to be a crude approach to the problem in cases where rates at sites for two subtrees appear to be positively correlated. In this case, estimation of a bivariate rate distribution can be used to construct tighter confidence limits for the rate difference at a site.
The usual model for rate variation treats rates at different sites as random variables that are independently distributed from a common rate distribution (cf. Yang 1994
; Felsenstein and Churchill 1996
). For all practical purposes, rate distributions are discrete; they assign probabilities
1, ...,
k to a set of rates r1, ..., rk. For a given tree, and a set of rates r1, ..., rk, the likelihood for the rate distribution parameters
1, ...,
k is obtained by taking the product of the probabilities of data at the individual sites. An estimate of the rate distribution can be obtained by maximizing the likelihood over all
1, ...,
k corresponding to rate distributions that have a mean rate of 1 (cf. Susko et al. 2001
).
The extension of rate distributions for a given tree to separate subtrees would be a bivariate distribution that assigns probabilities to pairs of rates (r1, r2) at a site for the two subtrees. If r1, ..., rk is the set of possible rates for tree T1 and s1, ..., sk is the set of possible rates for tree T2, a bivariate distribution would assign some probability
ij to the pair of rates (ri, sj). To ensure that the branch lengths can be interpreted as the expected number of substitutions, the rate distribution should be chosen so that the expected rates, separately, at the two subtrees are 1. Specifically,
|
|
ij > 0 only if i = j, then the rates for the two subtrees are always the same. Another possibility would be that
ij =
i
j, where
1, ...,
k are the probabilities of the rates r1, ..., rk for subtree T1 and
1, ...,
k are the probabilities of the rates s1, ..., sk for subtree T2. Here rate assignment would be independent in the two subtrees.
When estimating rate distributions or constructing confidence intervals for a rate in a given subtree, it suffices to consider only the data in that subtree. In contrast, calculations of the probability of data at a site in a bivariate model require that the data in both subtrees be considered jointly. For a given site and given rates r and s for subtrees T1 and T2, let f (x,y|r,s;T1,T2) be the joint probability of the data x in subtree T1 and data y in subtree T2. The probability of data x and y at a site is then obtained as
|
|
ij is obtained by taking the product of the probabilities of data at the individual sites. An estimate of the rate distribution can be obtained by maximizing the likelihood over all
ij corresponding to bivariate rate distributions that have a mean rate of 1 for each of the subtrees.
Given probabilities
ij for the pairs of rates (ri, sj), similarly as in the case of a single subtree, the conditional distribution of the pair (ri, sj) is calculated as
|
|
|
|
) x 100% confidence interval can be calculated as for individual rates as [l(x), u(x)], where l(x) and u(x) satisfy that
|
|
The joint probability of the data f (x,y|r,s; T1,T2) is needed in all of the aforementioned calculations. Calculation of this quantity requires an additional branch length for the branch connecting the two subtrees, branches in the two subtrees where the additional branch will join the two subtrees, and a position along the additional branch where the rate change occurs. To avoid the additional computation implied by these additional parameters we use an independence model approximation in practice, replacing f (x,y|r,s; T1,T2) by the product of the separate probabilities of data in the two subtrees: f (x|r;T1)f(y|s;T2). If the branch connecting the two subtrees is relatively long, this should provide a good approximation.
| Results |
|---|
|
|
|---|
As an illustrative example we consider amino acid data for taxa coming from four subtrees, one for the eukaryotic elongation factor 1
(eukaryotic EF-1
, the eEF-1
data set), one for archaebacterial EF-1
(the aEF-1
data set), one for Hsp70 subfamily B Suppressor 1 (HBS1), and one for the eukaryotic release factor 3 (eRF3). Although eukaryotic and archaebacterial EF-1
have the same primary biological function of delivering aminoacyl-tRNAs to the A site of the ribosome during translation elongation, they are known to have different auxiliary protein-protein interactions in the two domains of life (Inagaki and Doolittle 2000
paralogs that are believed to have arisen from EF-1
via gene duplications before the divergence of extant eukaryotes (Inagaki and Doolittle 2000
or eRF3 (Inagaki and Doolittle 2000
and eEF-1
data sets as well as to compare the HBS1 and eRF3 data sets. The aEF-1
data set had 13 taxa, the eEF-1
data set had 28 taxa, the HBS1 data set had 13 taxa, and the eRF3 data set had 17 taxa. New sequences were added manually to the previous alignment (Inagaki and Doolittle 2000| Parametric Bootstrap Methodology |
|---|
|
|
|---|
Maximum likelihood distance-Fitch Margoliash trees were obtained under a discrete gamma model with 16 rate categories for two combined data sets, including the aEF-1
and eEF-1
data set (aEF-1
+ eEF-1
data set), and the HBS1 and eRF3 data sets (HBS1 + eRF3 data set). The estimated trees are given in figure 2
. The
parameters for the gamma models were estimated from the aEF-1
+ eEF-1
and HBS1 + eRF3 data sets in TREE-PUZZLE as 0.63 and 1.03, respectively. These two
parameters were used as estimates of
total for the parametric bootstrap analyses.
|
The parametric bootstrap analyses detected a significant difference between eukaryotic and archaebacterial EF-1
data sets. The P values for the test statistics arsum, alrsum, and abrsum were estimated from the parametric bootstrap distribution as 0.001, 0.000, and 0.000, respectively (fig. 3AC
). Curiously, the nonabsolute value rate distances between the two EF-1
data sets, lrsum and brsum, fell significantly on one side of the bootstrap distribution (>99th percentile, data not shown). Their positions indicate that, at least for medium and low rate differences (emphasized by the lrsum and brsum values), sites in archaebacterial EF-1
evolved at a systematically higher rate than those in the eukaryotic homologs.
|
By contrast, the distances observed between HBS1 and eRF3 gave P values that ranged from being marginally insignificant to marginally significant at the 0.01 level of significance for the three test statistics. The P values for the test statistics arsum, alrsum, and abrsum were estimated from the parametric bootstrap distribution as 0.003, 0.01, and 0.018, respectively (fig. 4AC ). The observed differences from nonabsolute value distance measures were not significantly different from those under the null distribution (data not shown). These results suggest that the tempo and mode of the HBS1 evolution may be slightly different from those of eRF3. However, these two show less overall rate difference than that observed for the aEF-1
and eEF-1
comparison.
|
| Regression Methodology Results |
|---|
|
|
|---|
A scatter plot of the log-transform estimated rates for the aEF-1
and eEF-1
data sets is given in figure 5A.
As an initial test of whether rate differences exist between the aEF-1
and eEF-1
data sets or the HBS1 and eRF3 data sets, we used the orthogonal regression methodology. The rate estimates used were the conditional mean rate estimates given by equation (1)
. The usual rate estimates are conditional mode estimates which, with a discrete gamma model, take on only a few values. Regression methods are more appropriate for rate estimates that can take on a large set of values, which is the case for conditional mean estimates. Because many of the rate estimates are small, log transformations were taken to increase the range of data and avoid difficulties with outlying values. The orthogonal regression methodology was then applied to the log-transformed conditional mean rate estimates. The results are given in table 1 and suggest very strongly that there are rate differences between the aEF-1
and eEF-1
subtrees but that there is little evidence of rate differences between the HBS1 and eRF3 subtrees.
|
|
| Confidence Bounds for Rate Differences |
|---|
|
|
|---|
Because the results of the orthogonal regression tests suggest that there are rate differences for the aEF-1
and eEF-1
subtrees but little evidence of rate differences for the HBS1 and eRF3 subtrees, further analysis was restricted to the aEF-1
and eEF-1
subtrees. Confidence intervals were calculated for the rates at each of the sites for each of the subtrees. A total of 13 out of the 269 sites (5%) have 95% confidence intervals that do not overlap (fig. 6)
.
|
Use of nonoverlapping confidence intervals to identify sites where significant rate differences exist is valid but can be expected not to detect some of the sites where changes have occurred because of its failure to incorporate the clear correlations between rates at sites in the two subtrees evident from the orthogonal regression analysis. Modeling the bivariate distribution of the rates allows one to adjust for the correlation. For the aEF-1
and eEF-1
data sets we estimated a bivariate rate distribution, using likelihood methods and then constructed confidence intervals for the rate differences. A total of 60 of the 269 rate differences (22%) have 95% confidence intervals that do not contain 0 (fig. 7 ). All of the sites that were deemed to have significant differences because the individual confidence intervals were nonoverlapping are still significant, but a large number of additional sites are also significant.
|
| Concluding Remarks |
|---|
|
|
|---|
We have presented a number of different methods in this article. The first set of methods, orthogonal regression and parametric bootstrap tests, are useful for detecting whether rate changes have occurred. The orthogonal regression methodology uses the rate estimates alone as a means of determining whether there are significant differences in the means for the subtrees. The parametric bootstrap repeats the process of tree, branch length, and rate estimation under the null hypothesis. Because it incorporates additional sources of variability, it can be expected to be more sensitive in detecting significant rate differences. This is borne out in the comparison of rates between the HBS1 and eRF3 data sets. The orthogonal regression methodology failed to find significant differences where the parametric bootstrap results ranged from marginally significant to marginally insignificant. Both methodologies have value; the parametric bootstrap can be expected to be more sensitive in detecting departures from the null, whereas the orthogonal regression methods requires much less computation in large problems. In general, all of the methods will detect rate differences more easily if there are large numbers of sites and taxa; however, power investigations in simplified settings (data not shown) suggest that larger rate difference might be detected with as few as five taxa per subtree.
Failure to reject the hypothesis of significant differences suggests that further analysis is not necessary and effectively controls the type I error rate. Given the rejection of the orthogonal regression or parametric bootstrap tests, confidence intervals can be constructed for rates to locate sites that are likely to have significant rate differences. Confidence bounds for rate differences, calculated after estimating a bivariate rate distribution for the two subtrees of interest, can be expected to produce tighter bounds for the rate differences. To allow for easier computation, an independence assumption was made about the data in the two subtrees. When the branch connecting the two subtrees is relatively long, one can expect the resulting approximations to be reasonable. We expect that sites with low rates in both subtrees will be the ones most significantly affected by a loosening of the independence assumption. In this case, the branch length between the two subtrees for the site is effectively shortened because of the low rates so that the data in the two subtrees become more dependent. Determining how or whether this would affect the confidence interval for the rate difference will require further study.
Our methods have been applied to compare rates at sites for data sets with subtrees from the HBS1 and eRF3 data sets. Although significant rate differences were suggested in the comparison of the rates from the aEF-1
and eEF-1
data sets, the rate differences in the HBS1 versus eRF3 comparisons were marginally significant and insignificant using the parametric bootstrapping test and regression test, respectively. Further analysis is needed to determine what the effects of failing to adjust for rate differences in subtrees might be for phylogenetic estimation.
The likelihood ratio statistic test of Knudsen and Miyamato (2001)
treats the rate difference at a site as a fixed parameter for estimation. It uses only the data at the site and assumes that large sample likelihood theory is applicable. In contrast, the bivariate confidence intervals treat the rate difference as a random variable. The information about the range of likely rate differences contained within the data at a site is obtained by conditioning on the data at a site, but information from other sites is obtained by using the bivariate rate distribution in the calculation of intervals. The bivariate model considered here can be viewed as an extension of the rates-across-sites model where rates are allowed to vary in subtrees of the larger tree. The model is similar to Gu's (2001)
model for functional divergence but places less restrictions on the form of the bivariate distribution. The extension is also similar to the covarion models of Tuffley and Steel (1998), Galtier (2001)
, and Penny et al. (2001)
, which also allow rate variation between subtrees. These covarion models assume a stationary process of rate variation at a site throughout the tree. This differs from the bivariate model considered here in at least two respects. First, the process of rate variation is constant throughout the tree so that the rate distribution or rates-across-sites model for one subtree should be the same as for any other subtree. The bivariate model allows different rate distributions in different subtrees. Second, in the Tuffley and Steel model, rates are allowed to vary within any branch of a subtree. In contrast, the bivariate model assumes a single rate at a site for a given subtree. Nevertheless, the bivariate models considered here can be useful in detecting covarion-type rate variation similar to that of the Tuffley and Steel model. In the Tuffley and Steel model one can think of an average rate at a site for a subtree, where the average is taken over branches of the subtree. Assuming that rates at a site vary randomly throughout a tree, by chance there should be differences in the average rates at a site in any two subtrees of the larger tree. Because the average rates will differ, with sufficient data, the orthogonal regression or parametric bootstrap tests presented here will reject the null hypothesis of a single rate distribution model.
The bivariate model can be extended to allow rate variation in smaller and smaller subtrees of the tree of interest. In the most general case, we could partition the tree into m subtrees, T1, ..., Tm. In a multivariate model for rate variation at a site, a set of rates r1, ..., rm for the subtrees would be drawn from a multivariate distribution that assigns some probability
(i1, ..., im) to every possible set ri1, ..., rim of m rates that could arise for the subtrees.
| Acknowledgements |
|---|
|
|
|---|
E.S. and C.F. were supported by the Natural Sciences and Engineering Research Council of Canada. A.J.R. thanks the Canadian Institute for Advanced Research Program in Evolutionary Biology for fellowship support. A.J.R. and Y.I. were supported by the NSERC Operating Grant 227085-00 and NSERC Genomics Grant 228263-99. This collaboration is part of a Genome Atlantic/Genome Canada Large-Scale Project.
| Footnotes |
|---|
Naruya Saitou, Reviewing Editor
Abbreviations: EF-1
, elongation factor 1
; aEF-1
, archaebacterial EF-1
; eEF-1
, eukaryotic EF-1
; eRF3, eukaryotic release factor 3. ![]()
Keywords: covarion
rates-across-sites
Markov models
maximum likelihood
molecular evolution
phylogenetics ![]()
Address for correspondence and reprints: Edward Susko, Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia B3H 3J5, Canada. E-mail: susko{at}mathstat.dal.ca ![]()
| References |
|---|
|
|
|---|
Casella G., R. L. Berger, 1990 Statistical inference Brooks/Cole, Pacific Grove, Calif
Dayhoff M. O., R. M. Schwartz, B. C. Orcutt, 1979 A model of evolutionary change in proteins Pp. 345352 in M. O. Dayhoff, ed. Atlas of protein sequence and structure, Vol. 5(Suppl. 3). National Biomedical Research Foundation, Silver Spring, Md
Felsenstein J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach J. Mol. Evol 17:368-376[Web of Science][Medline]
. 1993 PHYLIP (phylogeny inference package). Version 3.5c Distributed by the author, Department of Genetics, University of Washington, Seattle
Felsenstein J., G. A. Churchill, 1996 A hidden Markov model approach to variation among sites in rate of evolution Mol. Biol. Evol 13:93-104[Abstract]
Fitch W. F., E. Markowitz, 1970 An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution Biochem. Genet 4:579-593[Web of Science][Medline]
Galtier N., 2001 Maximum-likelihood phylogenetic analysis under a covarion-like model Mol. Biol. Evol 18:866-873
Gaucher E. A., M. M. Miyamato, S. A. Benner, 2001 Function-structure analysis of proteins using covarion-based evolutionary approaches: elongation factors Proc. Natl. Acad. Sci. USA 98:548-552
Gu X., 2001 Maximum likelihood approach for gene family evolution under functional divergence Mol. Biol. Evol 18:453-464
Inagaki Y., W. F. Doolittle, 2000 Evolution of the eukaryotic translation termination system: origins of release factors Mol. Biol. Evol 17:882-889
Knudsen B., M. M. Miyamato, 2001 A likelihood ratio test of evolutionary rate shifts and functional divergence among proteins Proc. Natl. Acad. Sci. USA 98:14512-14517
Lockhart P. J., D. H. Huson, U. Mairer, M. J. Fraunholz, Y. Van de Peer, A. C. Barbrook, C. J. Howe, M. A. Steel, 1998 A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages Mol. Biol. Evol 15:1183-1188[Abstract]
. 2000 How molecules evolve in eubacteria Mol. Biol. Evol 17:835-838
Lopez P., Forterre P., H. Phillipe, 1999 The root of the tree of life in the light of the covarion model J. Mol. Evol 49:496-508[Web of Science][Medline]
Miyamoto M. M., W. M. Fitch, 1995 Testing the covarion hypothesis of molecular evolution Mol. Biol. Evol 12:503-513[Abstract]
Penny D., B. J. McComish, M. A. Charleston, M. D. Hendy, 2001 Mathematical elegance with biochemical realism: the covarion model of molecular evolution J. Mol. Evol 53:711-723[Web of Science][Medline]
Rambaut A., N. C. Grassly, 1997 Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees Comput. Appl. Biosci 13:235-238
Strimmer K., A. von Haeseler, 1996 Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies Mol. Biol. Evol 13:964-969[Web of Science]
Susko E., C. Field, C. Blouin, A. R. Roger, 2002 Estimation of rate distributions in phylogenetic models Preprint.
Tuffley C., M. Steel, 1998 Modeling the covarion hypothesis of nucleotide substitution Math. Biosci 147:63-91[Web of Science][Medline]
Yang Z., 1994 Maximum likelihood phylogentic estimation from DNA sequences when substitution rates differ over sites: approximate methods J. Mol. Evol 39:306-314[Web of Science][Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
N. Gruenheit, P. J. Lockhart, M. Steel, and W. Martin Difficulties in Testing for Covarion-Like Properties of Sequences under the Confounding Influence of Changing Proportions of Variable Sites Mol. Biol. Evol., July 1, 2008; 25(7): 1512 - 1520. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Kolaczkowski and J. W. Thornton A Mixed Branch Length Model of Heterotachy Improves Phylogenetic Accuracy Mol. Biol. Evol., June 1, 2008; 25(6): 1054 - 1066. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. P. Townsend Profiling Phylogenetic Informativeness Syst Biol, April 1, 2007; 56(2): 222 - 231. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Ruano-Rubio and M. A. Fares Artifactual Phylogenies Caused by Correlated Distribution of Substitution Rates among Sites and Lineages: The Good, the Bad, and the Ugly Syst Biol, February 1, 2007; 56(1): 68 - 82. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Doron-Faigenboim and T. Pupko A Combined Empirical and Mechanistic Codon Model Mol. Biol. Evol., February 1, 2007; 24(2): 388 - 397. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. J Roger and L. A Hug The origin and diversification of eukaryotes: problems with molecular phylogenetics and molecular clock estimation Phil Trans R Soc B, June 29, 2006; 361(1470): 1039 - 1054. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Furmonaviciene, B. J. Sutton, F. Glaser, C. A. Laughton, N. Jones, H. F. Sewell, and F. Shakib An attempt to define allergen-specific molecular surface features: a bioinformatic approach Bioinformatics, December 1, 2005; 21(23): 4201 - 4204. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Landau, I. Mayrose, Y. Rosenberg, F. Glaser, E. Martz, T. Pupko, and N. Ben-Tal ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures Nucleic Acids Res., July 1, 2005; 33(suppl_2): W299 - W302. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Blouin, D. Butt, and A. J. Roger Impact of Taxon Sampling on the Estimation of Rates of Evolution at Sites Mol. Biol. Evol., March 1, 2005; 22(3): 784 - 791. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. J. Keeling and Y. Inagaki A class of eukaryotic GTPase with a punctate distribution suggesting multiple functional replacements of translation elongation factor 1{alpha} PNAS, October 26, 2004; 101(43): 15380 - 15385. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. Mayrose, D. Graur, N. Ben-Tal, and T. Pupko Comparison of Site-Specific Rate-Inference Methods for Protein Sequences: Empirical Bayesian Methods Are Superior Mol. Biol. Evol., September 1, 2004; 21(9): 1781 - 1791. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Inagaki, E. Susko, N. M. Fast, and A. J. Roger Covarion Shifts Cause a Long-Branch Attraction Artifact That Unites Microsporidia and Archaebacteria in EF-1{alpha} Phylogenies Mol. Biol. Evol., July 1, 2004; 21(7): 1340 - 1349. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Inagaki, C. Blouin, E. Susko, and A. J. Roger Assessing functional divergence in EF-1{alpha} and its paralogs in eukaryotes and archaebacteria Nucleic Acids Res., July 15, 2003; 31(14): 4227 - 4237. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


























