MBE Advance Access originally published online on October 19, 2005
Molecular Biology and Evolution 2006 23(2):352-364; doi:10.1093/molbev/msj040
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Article |
On the Correlation Between Composition and Site-Specific Evolutionary Rate: Implications for Phylogenetic Inference
School of Computer Science, University of Manchester, Manchester M13 9PL, United Kingdom
E-mail: magnus.rattray{at}cs.man.ac.uk.
| Abstract |
|---|
|
|
|---|
Model-based phylogenetic reconstruction methods traditionally assume homogeneity of nucleotide frequencies among sequence sites and lineages. Yet, heterogeneity in base composition is a characteristic shared by most biological sequences. Compositional variation in time, reflected in the compositional biases among contemporary sequences, has already been extensively studied, and its detrimental effects on phylogenetic estimates are known. However, fewer studies have focused on the effects of spatial compositional heterogeneity within genes. We show here that different sites in an alignment do not always share a unique compositional pattern, and we provide examples where nucleotide frequency trends are correlated with the site-specific rate of evolution in RNA genes. Spatial compositional heterogeneity is shown to affect the estimation of evolutionary parameters. With standard phylogenetic methods, estimates of equilibrium frequencies are found to be biased towards the composition observed at fast-evolving sites. Conversely, the ancestral composition estimates of some time-heterogeneous but spatially homogeneous methods are found to be biased towards frequencies observed at invariant and slow-evolving sites. The latter finding challenges the result of a previous study arguing against a hyperthermophilic last universal ancestor from the low apparent G + C content of its rRNA sequences. We propose a new model to account for compositional variation across sites. A Gaussian process prior is used to allow for a smooth change in composition with evolutionary rate. The model has been implemented in the phylogenetic inference software PHASE, and Bayesian methods can be used to obtain the model parameters. The results suggest that this model can accurately capture the observed trends in present-day RNA sequences.
Key Words: phylogeny RNA G + C content compositional heterogeneity across-site variation ancestral composition
| Introduction |
|---|
|
|
|---|
Contemporary phylogenetic methods are no longer limited to the descriptive study of species relatedness. Current model-based approaches are as much designed to unravel the mechanisms of sequence evolution as they are to recover the evolutionary relationship of organisms, and the Markov models that describe the nucleotide substitution process have become increasingly complex (see Felsenstein [2004]
There are cases where evolutionary patterns are known to be different within a gene (e.g., different codon positions of a protein-coding gene or loops and stems of RNA genes). An easy way to account for this heterogeneity is to partition the data beforehand according to a priori knowledge. Combined models that use different substitution models for different partitions have been studied and are in common use (Yang, 1996b
; Pupko et al., 2002
; Seo, Kishino, and Thorne, 2005
).
Nevertheless, partitioning of the data is only an option when partitions can be identified in advance, and we are more interested here in compositional heterogeneity within a partition. In the past 10 years, methods have been proposed to capture spatial heterogeneity (not specifically compositional heterogeneity) when the correct partitioning scheme is unknown or when intragenic variability cannot be related directly to specific DNA segments. Latent class models (or mixture models) have been used quite often in this context. These models assume that each site is evolving according to an unobserved process chosen among a finite set of substitution models, and the likelihood is computed by integrating over all the possible substitution processes at a site. For instance, Huelsenbeck and Nielsen (1999)
introduced a model where the transition/transversion rate ratios can vary along a sequence according to a gamma distribution. Following the work of Nielsen and Yang (1998)
, Yang et al. (2000)
proposed a set of codon models with variable synonymous/nonsynonymous ratio across sites. The standard discrete gamma rate model for ASRV (Yang, 1994
) and the invariant sites model (Reeves, 1992
) were early examples of latent class models that allow for rate heterogeneity across sites without prior classification.
Closer to the main topic of this paper, some methods have also been devised specifically to account for the variation in composition across sites. Most substitution models ignore this aspect of sequence evolution and assume that the equilibrium frequencies across sites are constant. Dimmic, Mindell, and Goldstein (2000)
incorporated site-specific selection effects using different amino acid fitness functions. More recently, Pagel and Meade (2004)
introduced a "pattern-heterogeneity" mixture model to account for the heterogeneity both in average evolutionary rate and exchangeability parameters, and their model can easily be extended to describe the spatial variation of frequency parameters. Both models are also latent class methods. Models using a specific frequency vector at each site have also been designed. Lartillot and Philippe (2004)
constructed a model that allows for a variable number of frequency profiles to fit the variation of the amino acid equilibrium distribution and, in effect, classifies sites into distinct categories. A parameter-rich model was also attempted by Bruno (1996)
and Halpern and Bruno (1998)
, where a single mutation process is shared across sites but a site-specific frequency vector is used to model the variation of selection effects across sites.
In this paper we concentrate on the variation in nucleotide composition across RNA genes. Secondary structure constraints on RNA molecules and differences between paired and unpaired regions are taken into account. By contrasting the composition observed at slow- and fast-evolving pairs in contemporary RNA helices, it is suggested that the differences between their patterns of evolution is not simply limited to variation of the average substitution rate. The most stable G:C pairs are more common in slowly evolving parts of RNA stems, and nucleotide frequencies in RNA loops are also found to be quite variable across rate categories. In general, it appears that base composition is correlated with the site-specific evolutionary rate. In order to account for the observed compositional trends across sites in present-day sequences, we propose a new latent class method for ASRV. Our model extends the discrete gamma model (Yang, 1994
) to allow for variable equilibrium frequencies in each rate category. The main model assumption, supported by empirical evidence, is that these frequencies change in a smooth way across rate categories. We use a Gaussian process to control the smoothness, and parameters are estimated from the data during a Bayesian inference process using Markov chain Monte Carlo (MCMC) techniques. This model is implemented and available in PHASE, a software package for maximum likelihood (ML) and Bayesian phylogenetic inference designed specifically for use with RNA genes (Jow et al., 2002
; Hudelot et al., 2003
).
We motivate our work by highlighting the biasing effects of compositional heterogeneity across sites on standard phylogenetic methods that do not account for it. In our previous works, we noticed that frequency estimates were biased towards the frequencies of fast-evolving sites (Jow et al., 2002
; Hudelot et al., 2003
), and we investigated whether compositional variations across sites could explain these results. More generally, the effects of compositional heterogeneity on phylogenetic estimates are explored in the ML and Bayesian frameworks.
Because lineage-specific base compositional bias has previously been shown to cause phylogenetic artifacts (Lockhart et al., 1994
; Jermiin et al., 2004
), more effort has been put into modeling compositional variation over evolutionary time rather than across sites (Yang and Roberts, 1995
; Galtier and Gouy, 1998
; Brooks, Fresco, and Singh, 2004
; Foster, 2004
). However, we find that the impact of compositional variation across sites on models proposed to relax the assumption of time homogeneity is also worrying. We show, as an example, that under a model that wrongly assumes across-site homogeneity of the equilibrium frequencies, part of the observed variation across time will in fact be due to variations across sites, especially when the model is used conjointly with a model of rate heterogeneity. Galtier, Tourasse, and Gouy (1999)
suggested that the low inferred G + C content of the last universal common ancestor (LUCA) of all extent life forms was not compatible with the expected G + C content of a thermophilic species, but this result has been challenged by several other studies (Di Giulio, 2000
, 2003
; Schwartzman and Lineweaver, 2004
). We provide results suggesting that the ancestral G + C compositions proposed by Galtier, Tourasse, and Gouy (1999)
for the two studied rRNAs genes were most likely underestimated.
| Materials and Methods |
|---|
|
|
|---|
Throughout the paper, two very different methods are used to estimate frequencies at slow- and fast-evolving sites and to produce "frequency versus average rate" curves. The first method is a fast ML method that estimates the parameters of a standard site-homogeneous model (but allowing for rate variation across sites) and computes rate-specific empirical Bayes posterior averages. The second method uses a site-heterogeneous model that we develop in this work. Both methods are descibed at the end of Materials and Methods after a description of the evolutionary models.
Substitution Models
Standard Markov models of nucleotide substitution are used in this paper (Swofford et al., 1996
). These models are fully defined by a matrix Q = {tij} of transition rates between states (e.g., the 4 nt or the 20 amino acids). We typically assume that the process is reversible, and consequently, it can be parameterized by a set of stationary frequencies,
= {
i}, which is the expected composition when the process is at equilibrium, and a symmetric matrix of exchangeability parameters R = {
ij} = {
ji}. Using these parameters, the instantaneous transition rates of the Markov process are defined as:
![]() | (1) |
![]() | (2) |
The most general time-reversible model for nucleotide substitution has four frequency parameters and six exchangeability parameters. Frequencies must sum up to 1.0, and we fix
AG to 1.0 because it is not possible to identify separately the six exchangeability parameters and µ. For computational reasons and because it does not affect the arguments we are developing in this paper, we will use a constrained version of this general model proposed by Tamura and Nei (1993)
. It has two exchangeability rates
AG and
CT for nucleotide transitions (with
AG = 1.0) and only one exchangeability rate for nucleotide transversions (
AC =
AT =
CG =
CT).
RNA molecules fold onto themselves to produce small stretches of base-pairing (helices or stems) and single-stranded loops. Watson-Crick pairs (A:U and G:C) and the intermediate noncanonical G:U pair are most commonly found in RNA helices. Consequently, RNA sequences are associated with a secondary structure that turns out to be well conserved over evolutionary time, even though the underlying nucleotide sequences are not. To preserve RNA helices, compensatory substitutions occur frequently to maintain complementarity (Stephan, 1996
). Replacements within the two symmetric groups [A:U, G:U, G:C] and [U:A, U:G, C:G] are quite common, whereas exchanges between these two groups are unlikely because they involve an intermediate state that is thermodynamically unstable. As far as phylogenetics is concerned, the compensatory substitution mechanism affects standard methods because it is traditionally assumed that sites are evolving independently, and spatially distant substitutions are assumed to be uncorrelated. Specific substitution models that account for the RNA secondary structure are implemented in PHASE (http://www.bioinf.man.ac.uk/resources/phase). Reviewed by Savill, Hoyle, and Higgs (2001)
, these models consider pairs of nucleotides as their elementary unit and partially alleviate the independence assumption. In this paper, we use the Other RNA (OTRNA) model introduced by Tillier and Collins (1998)
and referred to as 7D in Savill, Hoyle, and Higgs (2001)
. The transition matrix is given below. This model is a seven-state model with the four Watson-Crick pairs (A:U/U:A and G:C/C:G) and the two U:G/G:U pairs. The 10 remaining unstable combinations are lumped into a single mismatch state MM. The OTRNA model has seven frequency parameters. Exchangeabilities corresponding to each other in the two groups defined above are assumed to be equal, for example,
A:U,G:U =
U:A,U:G, and the model has only four different exchangeability parameters. Inside each symmetric group, one parameter is used for the two single transitions
s =
A:U,G:U =
G:U,G:C and one for the double transition
d =
A:U,G:C, set to 1.0 as a reference. The third exchangeability parameter ß is for the unlikely replacements between the two groups defined above while the fourth one
is for the replacements to and from the mismatch state.
|
|
Across-Sites Heterogeneity
RNA genes were partitioned according to their secondary structure in order to perform combined analyses. Single-stranded and helical regions are modeled using different evolutionary models. A standard nucleotide model is used with unpaired nucleotides, and a specific paired-site model is used for the helices. Because secondary structure can change over evolutionary time, a consensus structure must be used. Pairs that are not conserved for at least 50% of the species are not retained, and the 2 nt are considered as independently evolving sites in this case. In a combined analysis, substitution models can share some parameters (Yang, 1996b
), but we assume here that only the phylogeny is shared. The two partitions are supposed to have evolved along a common topology, and it is also assumed that the average substitution rate of each substitution model is related by a proportionality factor that is constant over the whole tree. This is in effect equivalent to the model of Yang (1996b)
, which assumes that branch lengths for each partition are related by a proportionality factor.
We use the discrete gamma model to account for ASRV (Yang, 1994
). Across-site rate variation is modeled by a gamma distribution, approximated for computational reasons by a finite number of categories k with equal prior probabilities 1/k. Two parameters control the shape of this distribution: a mean and a shape parameter. The mean is the average substitution rate of the model, that is, the expected number of substitutions over a branch of length 1.0. The gamma shape parameter (
) determines the extent of rate variation.
Sequence Data
We use both real and synthetic data sets. The real RNA alignments D1 and D2 are distributed with the PHASE package or available upon request. The synthetic data sets D3 and D4 were generated by PHASE and used to investigate spatial compositional heterogeneity using simulations. Later on, when mentioning D3 or D4, we actually refer to a replicate generated according to the evolutionary models described below, and the number of sites in the alignment will always be specified.
D1: Mitochondrial tRNA and rRNA Genes from Primates
This data set was extracted directly from the data set used for mammalian phylogenetic inference by Hudelot et al. (2003)
. The genes used are the complete set of mitochondrial tRNAs and rRNAs. For computational reasons, we kept only 13 primate species (gorilla, human, chimpanzee, pygmy chimpanzee, orangutan, gibbon, baboon, barbary ape, capuchin, loris, tarsier, ring-tailed lemur, malayan flying lemur) and 3 species from the grouping Laurasiatheria as an outgroup (dog, cow and rhinoceros) from the original data set. The consensus secondary structure is a key part of this alignment because nucleotides are treated differently according to their position. This data set is used to illustrate compositional variation across sites in real sequences and we assume that the substitution process is not excessively time heterogeneous with this alignment.
D2a & D2b: Nuclear Large Subunit and Single Subunit RNAs
These two rRNA genes were used by Galtier, Tourasse, and Gouy (1999)
to infer the G + C composition of the ancestral sequence of the common ancestor to all life forms and deduce its likely environmental temperature. The sequences were downloaded from the European Ribosomal RNA database (Wuyts et al., 2001
, 2002
) for the same set of 40 species. Ambiguous regions were removed, and the alignments were further refined by eye. Once the biasing effects of compositional variation across sites on time-heterogeneous methods are demonstrated, this data set, highly heterogeneous both in time and in space, is used to reevaluate the robustness of the results of Galtier, Tourasse, and Gouy (1999)
.
D3: Synthetic Data Set with Spatially Homogeneous Composition
Replicates of D3 are generated using a TN93 substitution model (see Substitution Models) along the branches of a 10-species tree. Arbitrary but reasonable values are chosen for the branch lengths (see fig. 1). The substitution model is time homogeneous, and the base frequency distribution is expected to remain constant both across sites and lineages. Rate heterogeneity is simulated using a gamma distribution with 20 equiprobable discrete gamma categories. The gamma shape parameter
is set to 0.5, a reasonable value for the genes we are studying. This corresponds to an L-shaped distribution of the average substitution rate across sites with most of the sites evolving slowly. Exchangeabilities are set at reasonable values:
CT = 2.5,
transversion = 0.4, and the equilibrium frequencies are {
A,
C,
G,
T} = {0.40, 0.25, 0.15, 0.20}.
|
D4: Synthetic Data Set with Spatial Compositional Heterogeneity
D4 is the equivalent of D3 with spatial compositional heterogeneity. The base frequency distribution is no longer assumed to be shared across sites and we use a different set of frequency parameters for each gamma category to simulate a correlation between composition and site-specific average substitution rate (see table 1 and the "real process" curve in fig. 3). Fast-evolving sites are G + C rich and A + T poor compared to slow-evolving sites. Frequency parameters for each gamma category were chosen so that the frequency distribution, averaged over the whole sequence, is that given for D3. Note that the process is still time homogeneous and stationary. Consequently, we expect the composition to be time homogeneous for each gamma category, and the pattern of compositional variation across sites is the same for contemporary and ancestral sequences.
|
|
Empirical Estimate of Compositional Variation Across Sites
A likelihood-based method is used to estimate the site-specific average rate of evolution and to investigate associations between evolutionary rates and site-specific base composition. We neglect here the possibility of significant rate change at a site over evolutionary time (the covarion hypothesis introduced by Fitch and Markowitz, 1970
) and a gamma distribution of rates, we compute maximum likelihood estimates (MLEs) for branch lengths
and substitution model parameters
These values maximize the probability that the assumed evolutionary model generated the observed sequences
computed by the pruning algorithm (Felsenstein, 1981
![]() | (3) |
We are interested in the composition of each rate category, and we estimate the distribution of frequencies conditional on rate using the posterior probabilities from equation (3) as weights. Then the expectation value of the frequency vector
conditional on rate ri is,
![]() | (4) |
is approximated by the observed frequencies at each site.
Modeling Compositional Heterogeneity Across Sites
In this paper, we mention some issues related to the use of time-heterogeneous method with spatially heterogeneous data sets. To avoid confusion, we wish to emphasize that the new ASRV method proposed here is meant to model compositional variation across sites and not compositional variation in time. Actually, we expect this new model to behave badly in case of compositional biases among species.
The evolutionary processes that cause the observed compositional trends with respect to the site-specific evolutionary rate are poorly understood and, in any case, hard to model explicitly. One simple method is to use different equilibrium frequencies for each rate category and to consider each frequency vector as a free parameter to be estimated from the data. Nevertheless, this method has some disadvantages. It is parameter rich, and the number of gamma categories used has a direct impact on the model and the number of parameters (whereas discretization of the gamma distribution was primarily supposed to be a mathematical convenience). This substitution model was implemented and tested with simulated data sets and we found that the method did not fit the real frequency curves in practice.
We do not expect the composition to vary strongly between neighboring rate categories. Consequently, it seems reasonable to solve the issues with the previous unconstrained method by assuming that equilibrium frequencies vary smoothly with respect to the site-specific evolutionary rate. We tried simple parametric models to express the frequencies as a function of the substitution rate, but again, we often found the fit unsatisfactory or limited to a specific data set (work not shown).
We believe parametric methods failed because they were too rigid, and we consequently used Gaussian processes to incorporate the prior of smoothness in our model of compositional variation. Gaussian processes are becoming popular in the machine learning community (MacKay, 1998
), and they have already been applied in various fields, usually for classification (Gibbs and MacKay, 2000
) and regression problems (Williams and Rasmussen, 1996
; see also Chu et al. [2005]
for an application in computational biology). Easy to implement and to interpret, Gaussian processes are also appealing because their nonparameteric nature does not constrain the model to a specific functional form. In our specific case, Gaussian processes can simply be considered as useful smoothing devices that return a prior probability of observing a set of frequencies without making excessively strong assumptions on the underlying variation process. Frequencies at each rate category were treated as free parameters, but strong variations were penalized by the Gaussian process prior. The method was implemented in the Bayesian framework using MCMC techniques.
Gaussian processes are fully defined by their mean and covariance function. The covariance matrix C specifies how similar the composition is between each pair of rate categories. In phylogenetic inference, substitution rates are always expressed relative to each other, and consequently, the problem was reparameterized with xi = log(ri), where ri is the average substition rate of the rate category i. We chose a covariance form adapted to our problem:
![]() | (5) |
0 defines the amount of variation expected for a typical function,
1 scales the rate abscissa,
2 and
3 introduce a linear trend in the Gaussian process and define the mean expected values of the process as a straight line with respect to the evolutionary rates x while
4 is a jitter element on the diagonal of the covariance matrix which prevents it from being ill conditioned. A distinct covariance matrix (i.e., a specific vector
) is used for each of the n possible states of the model (e.g., 4 nt for TN93 and seven pairs for OTRNA) defining a set of hyperparameters
We used a unique prior on each
For each model, the n frequency parameters used at a specific rate category were reparameterized with n "activation" values using a softmax function to free ourselves from the usual constraints imposed on frequency parameters: 0
i
1 and
![]() | (6) |
The ai(x) vectors are not uniquely defined. It would have been possible to keep the extra degree of freedom for each gamma category and to let the Gaussian process prior resolve the identifiability issue, but we preferred to add the simple constraint
although it has some effects on the prior for the variations.
The probability density of a set of activations
where
are the rates of the discrete gamma categories, is:
![]() | (7) |
Z is a normalizing factor, and
(x) is the Dirac delta function. When a standard substitution model is used for Bayesian inference in PHASE, a Dirichlet proposal mechanism is used to propose new equilibrium frequencies from the current distribution vector during an MCMC run. Multiple frequency vectors are used in the substitution model presented here and we simply chose to update them independently using the same mechanism. Because strong correlations between the frequencies of neighboring categories were introduced, distant moves are now regularly refused, and the parameters that control the spread of the Dirichlet proposals are initially chosen (and also tuned during the MCMC "burnin" period) to ensure a reasonable acceptance rate when we sample from the chain. Because large moves are not possible, mixing properties are not impressive, and a huge number of cycles are necessary to obtain a reliable sample from the stationary distribution of each gamma category. A possible solution would be to design a suitable move affecting all the frequencies at once in order to preserve the correlations but this problem was not addressed here and we used very long runs instead. Two runs were performed for each simulation and we checked that the final posterior probability distributions of frequency parameters were similar for each gamma category.
| Results |
|---|
|
|
|---|
Correlation of Substitution Rate and Composition
Using the empirical Bayesian method described above (equations 3 and 4), compositional trends and their correlation with the site-specific rate of evolution were studied in the primate data set D1. As previously mentioned, a TN93 substitution model was used with RNA loops, and the OTRNA model was used with RNA stems. In both models, a gamma distribution of rates was assumed and approximated with eight discrete categories. The empirical method requires us to assume a specific topology, and we simply used the majority-rule consensus topology found by a Bayesian analysis with this data set (see Jow et al. [2002]
Graphs for the estimated composition of RNA loops and RNA helices in D1 are shown in figure 2. The estimated frequency distribution for each discrete rate category is plotted against the average substitution rate of the category. In RNA stems, frequencies of symmetrical pairs (e.g.,
G:U and
U:G) were summed for clarity. In loops, we observe that
G is negatively correlated with the rate of evolution and that G is scarce at fast-evolving sites. In RNA stems, we observe a striking increase in
A:U+U:A and
MM, correlated with the site-specific evolutionary rate, whereas
G:C+C:G is decreasing.
|
In order to ascertain whether this method is robust, it was tested with 10 replicates of D3 (1,000 nt) and 10 replicates of D4 (20,000 nt). Data sets were analyzed using a TN93 substitution model and assuming the true topology, known for these synthetic data sets. Rate heterogeneity was modeled with 20 discrete gamma categories in both cases. The test with the spatially heterogeneous data set D4 was to evaluate the performance of the method. The test with the spatially homogeneous and small data set D3 was done to ensure that we were not finding heterogeneity when there was none. Indeed, the Markov process is observed for a very short period of time and with limited sequence size, and we wanted to test whether noise or other side effects could be responsible for the observed trends with the real data set D1.
Over the 10 experiments with D3, frequencies estimated for each rate category were found to be quite close to the stationary distribution that was used to generate the sequences. Differences between the estimated composition for each category and the real uniform frequencies were in the tight range (2.48%, +2.56%; 95% confidence interval [CI]). Moreover, no significant trends related to the substitution rate were visible. Consequently, we rule out the possibility that a methodological bias is responsible for the trends observed with real data sets. Figure 3 shows the results with data set D4 (first replicate) which indicate that the method can identify compositional differences across sites but is probably underestimating the extent of variation when it exists.
Effects on Standard Phylogenetic Inference Methods
To assess the impact of spatial compositional heterogeneity in phylogenetic inference, 100 replicates of D4 were generated (20,000 sites per alignment) and analyzed with a standard homogeneous substitution model. A TN93 substitution model with gamma distributed rate heterogeneity (20 discrete rate categories) was used. This is the generative model for D3, but it cannot account for the variation of composition across sites in D4. For each replicate, branch lengths and substitution model parameters (i.e., frequencies, exchangeabilities, and gamma shape parameter) were reestimated by ML optimization, assuming the known tree topology.
The following values were recovered for the frequency parameters {
A = 38.7 ± 0.5%,
C = 27.3 ± 0.5%,
G = 16.3 ± 0.4%,
T = 17.7 ± 0.4%}. These MLEs were clearly biased towards the composition of fast-evolving sites (given in table 1). The site-specific composition depends on the rate category with replicates of D4, but one would have expected the MLEs for the frequency parameters to be close to the empirical frequencies, averaged over the whole sequence {
A = 40%,
C = 25%,
G = 15%,
T = 20%}. The bias also exists when ASRV is not modeled, but it was less noticeable: {
A = 39.2 ± 0.5%,
C = 25.4 ± 0.4%,
G = 15.8 ± 0.4%,
T = 19.6 ± 0.4%}. Frequencies were not the only parameters affected. Yang, Goldman, and Friday (1994)
and Huelsenbeck and Nielsen (1999)
noticed that branch lengths were underestimated when using a simpler evolutionary model that does not accommodate rate heterogeneity or transition/transversion rate variation across sites. Similarly, we noticed that all branch lengths were slightly underestimated (by 3% approximately) when spatial frequency variation was not accounted for. The estimation of exchangeability parameters was also affected, with
CT = 2.02 ± 0.15 and
transversion = 0.37 ± 0.02 instead of
CT = 2.50 and
transversion = 0.40. Experiments were repeated with replicates of D3 (same alignment size, 20,000 sites) to confirm that deviations from the expected result were only due to the compositional variation across sites in D4. When replicates of D3 were used, we naturally found that MLEs were close to their true values. Indeed, ML is consistent when the true model is used and the alignments were large enough to grant us such results. Because these different biases were not observed when replicates of data set D3 were used, we can be quite confident that spatial compositional variation in the synthetic replicates D4 was responsible for these results.
Above, we presented results where the frequency distribution is a free parameter of the model inferred with the other evolutionary parameters during ML optimization. Because stationarity is assumed, a sensible method, commonly applied in phylogenetic inference, could have been to estimate the frequency parameters directly from the empirical composition of the studied sequences (Whelan and Goldman, 1999
). Experiments with D4 were repeated in such a setting, and we found that biases were more pronounced when frequencies were fixed to their empirical values. Branch lengths were even more underestimated (by 4% approximately) as were the exchangeability parameters (
CT = 1.85 ± 0.15 and
transversion = 0.36 ± 0.02).
Effects on topology estimates were studied in the Bayesian framework. One hundred replicates of D4 were generated with 1,000 sites per alignment. We compared the Bayesian posterior probability (BPPs) of clades obtained when the true substitution model that generated D4 was assumed (i.e., base frequencies for each rate category were different and fixed to their true values) and when a homogeneous TN93 substitution model was assumed (i.e., with a unique equilibrium frequency vector estimated during the inference process). Twenty discrete gamma categories were used in both cases. On top of the biases highlighted above in the ML framework, we found that the BPPs used to measure the support for different clades were sensitive to the model used. Focusing on the clades with BPPs between 50% and 95% when the true model was used (the values that are usually reported on a consensus tree), the corresponding BPPs obtained with the spatially homogeneous model were slightly (but significantly) different (± 5% on average and ± 10%20% in some cases).
Effects on Time-Heterogeneous Methods
Standard evolutionary models are stationary and time homogeneous: frequency and exchangeability parameters are supposed to remain constant over time and over lineages, and the distribution of character states is at equilibrium. Nevertheless, it is unrealistic to assume that the forces that drive evolution have been constant over time and heterogeneous methods have been proposed (Yang and Roberts, 1995
; Galtier and Gouy, 1998
; Foster, 2004
). Though details differ in the three works mentioned, the general idea is to use different homogeneous substitution models for each branch in the tree.
We used the program EVAL_NH in the NHML package (Galtier, Tourasse, and Gouy, 1999
), to analyze the effect of compositional variation across sites with time-heterogeneous methods. This program was chosen because the implemented substitution model has few parameters and is easily tractable. It has one equilibrium frequency,
, which is the equilibrium G + C content (
/2 =
G =
C using our notation) and one transition/transversion exchangeability
. The
parameter is shared all over the tree, but
is branch specific and allowed to vary. With heterogeneous methods, the likelihood is dependent on the root position (Yang and Roberts, 1995
), and the ancestral G + C frequency (
) observed at the root node cannot be assumed to be equal to the stationary distribution of states anymore because the system is not at equilibrium. However, it is possible to infer it directly from the data during the inference process.
Replicates of D4 (100 alignments of 20,000 nt) were analyzed with EVAL_NH. The true topology was assumed once again. Experiments were performed with and without a discrete gamma model, where eight categories were used in the first case. With the time-heterogeneous and rate-heterogeneous models implemented in NHML, we found that the inferred ancestral G + C frequency
was 33.85% (95% CI: 33%, 34.57%). Albeit the model that generated the replicates of D4 is spatially heterogeneous, it is still time homogeneous and stationary so that the real ancestral G + C frequency in replicates of D4 is equal to the mean stationary frequency (
G +
C = 40%) plus some stochastic noise. Nevertheless, the
we found is clearly biased towards the G + C frequency of slow-evolving sites (
G +
C = 27.1% for the slowest gamma category). The bias was less striking but still visible without rate heterogeneity (
= 37.3%, 95% CI: 36.56%, 38.06%).
The model proposed by Galtier and Gouy (1998)
is a time-heterogeneous version of the T92 model (Tamura, 1992
). Because the T92 substitution model is less complex than the TN93 used to generate the data and because we used only eight discrete gamma categories to analyze data sets generated with 20 categories, one might object that the violation of these assumptions is responsible for the observed differences. Actually, when the experiments were repeated with replicates of D3, results confirmed that these assumptions had an impact on the results (
= 38.8 ± 0.7% with rate heterogeneity among sites and
= 39.7 ± 0.8% without). Nevertheless, deviations from the expected ancestral G + C frequency are not as great as deviations found with D4. Moreover, similar results were reproduced when the generative TN93 substitution model used to create the synthetic data sets D3 and D4 was replaced with a T92 model.
The G + C content of rRNA genes is known to be correlated to the optimal growth temperature in prokaryots (Galtier and Lobry, 1997
). Galtier, Tourasse, and Gouy (1999)
analyzed the evolutionary history of the large subunit and single subunit genes (data sets similar to D2a and D2b) with the nonhomogeneous model described above in order to infer the ancestral G + C content, and the environmental temperature, of the LUCA. They found a low ancestral G + C content and naturally concluded that LUCA was a nonhyperthermophilic species. We tried to evaluate whether the effects of spatial heterogeneity on time-heterogeneous methods could be responsible for this unexpected conclusion. Di Giulio (2000)
already reported that the results of Galtier, Tourasse, and Gouy (1999)
could not be repeated using maximum parsimony, and the bias we are reporting here is a likely explanation for this disagreement. Because the possible bias of the nonhomogeneous method is related to a directional trend in the composition with respect to the site-specific evolutionary rate, we studied the variations of the G + C content across sites in D2a and D2b as we did for D1. Loops and stems were not partitioned, and we used a single TN93 substitution model (with 8 discrete rate categories) in the Bayesian empirical approach described previously. In D2a and D2b, the average G + C content is found to increase with the average evolutionary rate (see fig. 4). A similar amount of variation is observed for both genes (i.e., 10%15%). This suggests that the G + C content returned by the time-heterogeneous method is probably an underestimate in both cases. Note that the procedure we are using to estimate the correlation between site-specific rate and composition is assuming that the substitution process is homogeneous, whereas the data set D2 is clearly time heterogeneous (and not only spatially heterogeneous). We do not expect the assumption of homogeneity to have a great impact on the site-specific rate estimation procedure and on the results of the Bayesian empirical method. Nevertheless, the curves for the two genes in figure 4 can only be considered as a mean across their respective data set, and the species-specific variation of G + C composition with respect to the evolutionary rate differs markedly from this average behavior.
|
Modeling compositional heterogeneity across sites
The new model that constrains compositional variations between neighboring rate categories with a Gaussian process was tested with data sets D1 and D4. The data set D1 was analyzed using the same two substitution models as previously: TN93 for loops and OTRNA for stems. The eight gamma categories assumed in both models were assigned a different set of frequencies, and Gaussian process priors were used in both models to avoid strong variations at neighboring rate categories. The synthetic data set D4 was analyzed using a TN93 substitution model and eight gamma categories with variable frequencies.
Mean posterior estimates (MPEs) for the hyperparameters
(eq. 5) are given in table 2 for both runs. Figures 5 and 6 show the MPEs for the frequency parameters at each rate category. Inferred frequency parameters are compared (1) with the frequency found with the Bayesian empirical method for the real primate data set D1 (fig. 5) and (2) with the stationary frequencies of the simulated process for heterogeneous data set D4 (fig. 6). The results in figure 6 show excellent agreement between inferred and true frequencies. The fit is less impressive in figure 5 (though reasonable), but one must keep in mind that the empirical estimates probably flatten the curves and underestimate the variation of frequencies across sites (as previously observed in fig. 3).
|
|
|
This new model was also tested with a replicate of D3 (1,000 nt) to perform a negative control. Though we recovered some trends, the real homogeneous frequency parameters are within the credibility intervals. Results are provided as Supplementary Material online.
| Discussion |
|---|
|
|
|---|
Compositional Trends in Real Sequences
The empirical Bayesian method that we proposed for the study of compositional variations among RNA sites suffers from some drawbacks. Mayrose et al. (2004)
For the results with the primate data set D1, we focus the discussion around the frequencies that vary across rate categories. Note that the high A content observed with D1 in unpaired RNA regions is explained by Gutell et al. (2000)
, who suggest that unpaired adenine nucleotides are crucial components for the formation of three-dimensional rRNA molecules. We previously argued that the biological processes responsible for compositional variations were hardly understood and difficult to model, but we think the observed trends can mostly be explained by the combined effects of the mutation and selection processes. Presumably fast categories are under reduced selection and respond to the mutational pressure, whereas the frequency distribution we observe at sites under purifying selection reflects an average of the site-specific selection pressure over all slow-evolving sites. This explanation fits nicely with the results. In loops, the decreasing trend in G content is certainly due to the strong mutational bias away from G. This mutational bias is often invoked to explain the low G content at the third codon position for mitochondrial protein-coding genes found on the H strand (Gibson et al., 2005
), and our results are consistent with the hypothesis of the deamination of C on the H strand which results in the decrease of G and increase of A in RNA products (Reyes et al., 1998
). Indeed, the two rRNA genes, which account for two-thirds of D1, are on the H strand too. Moreover, when D1 was split into separate data sets (tRNAs and rRNAs) and variation of composition was studied separately, the decreasing trend in G content is more striking for rRNAs than for tRNAs as was expected because tRNA genes can be found both on the L and the H strands (curves not shown). Nevertheless, because a consensus secondary structure was used, we cannot exclude the possibility of a "contamination" of the loop partition with slow-evolving paired sites, more G + C rich than the standard composition in RNA loops. However, we would also expect strong compositional variations for the C and U bases in such a case, and because G and A are the only unpaired nucleotides exhibiting trends, we find this explanation less plausible.
Compositional trends in RNA helices are stronger and fit with what could reasonably be expected. The frequency of mismatches increases steadily with the evolutionary rate, reflecting a weaker selection pressure to maintain the RNA secondary structure at fast-evolving sites. Because G:C pairs are thermodynamically stronger than A:U pairs and have a strong stabilizing effect on the structure of RNA molecules, it seems reasonable that slow-evolving regions, subject to a stronger selection pressure, are G:C rich compared to fast-evolving regions. Nevertheless, the mutation bias away from G mentioned previously might also be responsible for the striking decrease in the G:C content in favor of the A:U content. More studies are probably needed to assess how ubiquitous the decrease in the G:C content is.
Our study here was limited to the correlations between site-specific evolutionary rate and composition in RNA genes. Though we did not test this assumption, similar trends certainly exist with protein-coding genes at the nucleotide/codon level and at the amino acid level. Selective constraints imposed by the genetic code and mutational biases might give rise to unexpected patterns in other genes.
Impact on Evolutionary Estimates
When a single stationary distribution of frequencies is assumed and shared across sites evolving at different rates, an invariant site provides only one single independent sample of this distribution, whereas, at the other extreme, a site with infinite substitution rate would provide N uncorrelated samples (N being the number of taxa in the alignment). Deviations of frequency estimates from the empirical composition of fast-evolving sites consequently have a larger detrimental effect on the likelihood and, as confirmed by our results with simulated data sets, ML and Bayesian inference methods favor frequency parameters that are closer in composition to the fast-evolving sites when they are estimated during the inference process. This was not tested here, but we can reasonably expect the divergence between empirical and estimated frequencies to grow as taxon sampling increases (because N will increase). Though the effects of spatial compositional heterogeneity on the estimation of substitution parameters are worrying, the impact on branch lengths and topology estimates seems quite limited (but not negligible). Consequently, disregarding the pattern heterogeneity of the substitution process has probably little effect on the inferred phylogeny, although we feel more studies are needed.
In this paper, we make the assumption that the spatial compositional variation observed in the studied sequences results from the variation of the substitution process across sites. This is a reasonable assumption for a set of closely related species, but we must concede that compositional trends can also be generated (and consequently explained) by a process heterogeneous in time. Indeed, a time-heterogeneous method coupled with a standard model of ASRV can generate sequences with complex compositional variation patterns across sites because fast-evolving sites react more quickly to the variations of the shared equilibrium distribution. The composition at fast and slow-evolving sites cannot remain homogeneous for long. Perhaps the main issue with the time-heterogeneous model of Galtier and Gouy (1998)
is not that it does not take into account the possibility that the process is heterogeneous across sites but the fact that the composition of the ancestral sequence is assumed homogeneous.
The ancestral G + C frequency obtained when analyzing D4 with NHML is strongly biased towards the frequencies observed at slow-evolving sites. Indeed, spatial variation of the G + C content, as seen in D4, can be accommodated by a time-heterogeneous process having a low ancestral G + C composition and a high G + C equilibrium frequency (as observed at fast-evolving sites). The program NHML has been used in many studies so far. Effects of the bias are probably limited for phylogenetic reconstruction, but we feel that studies that focused on the ancestral G + C estimate values should be complemented by a study of the correlation between the G + C content and the site-specific evolutionary rate to confirm the robustness of their results. Although we did not use exactly the same sequences and the same alignment, the data set used by Galtier, Tourasse, and Gouy (1999)
probably exhibited trends similar to what we found with D2 (see fig. 4). If site-specific variation of the substitution process also has a role in the observed compositional heterogeneity across sites, then it is quite likely that Galtier, Tourasse, and Gouy (1999)
underestimated the actual ancestral G + C content of the two rRNA genes. The spatial G + C trend estimated in D2 is comparable, in range and direction, to the G + C variation we observed when D4 was analyzed with the empirical method (see above and fig. 3). The number of species, shape of the tree, and branch lengths certainly have a big influence on the extent of the underestimation, and more importantly, the real substitution process acting with D2 is probably both heterogeneous in time and in space unlike the simulated process that generated D4. We do not think that the extent of underestimation with D2 and D4 is similar, but we cannot help noticing that if both ancestral G + C contents of the two real rRNA genes have been underestimated by as much as 6% (as was found with D4), then it would be almost possible to conclude that LUCA lived in a thermophilic environment. We think that a model combining both variation in time and variation across sites is necessary for a more accurate estimate of the ancestral composition.
Model of Spatial Heterogeneity
In Thorne, Goldman, and Jones (1996)
and in subsequent works from the Goldman group (e.g., Goldman, Thorne, and Jones, 1998
; Liò and Goldman, 2002
), a trained Hidden Markov Model is used to incorporate information on the variation of the substitution patterns across sites before the inference. On the contrary, in more recent works (Soyer et al., 2002
; Lartillot and Philippe, 2004
; Pagel and Meade, 2004
), evolutionary models that could potentially detect and characterize new patterns of heterogeneity across sites with little a priori knowledge were used. Incorporating strong prior constraints to prevent variations at neighboring rate categories clearly prevents the putative patterns of evolution to emerge freely in our model. Even though different substitution models are used for each rate category, we did not completely relax the "one model fits all" assumption used in standard substitution models.
The mutational process can be assumed constant across sites, and we believe it is fine to suppose that frequencies at fast-evolving categories are correlated. Nevertheless, one can reasonably argue that slow-evolving sites should not be grouped according to their average substitution rate but rather according to the site-specific selection pressure which drives the process equilibrium frequencies. While we agree with that point, we wish to emphasize that the new model we propose is primarily designed for RNA helices. We believe that slow-evolving pairs can be treated with the same process because structural stability is the common and predominant selection factor.
Results using a Gaussian process prior on the primate data set D1 and the synthetic data set D4 are promising. With D1, inferred frequencies do not seem to fit very well the empirical curves (fig. 5) but, as previously mentioned, results from the empirical Bayesian methods are just a best guess derived from a homogeneous model, and the actual amount of variation might be seriously underestimated. Results with D4 are quite impressive (fig. 6) because the alignment is of sufficient size and because the frequencies of the D4 process were close to linear in the logged substitution rate scale which is an assumption of the prior. Sensitivity and effect of the prior should always be investigated in Bayesian inference. In this case, frequencies inferred at slow-evolving sites are clearly more influenced by the trend imposed by the prior rather than the data. A low value for the hyperparameter
0 is important to prevent the unrealistic variations observed with the unconstrained method at slow-evolving sites. With some data sets we found that the method would not behave properly unless
0 was constrained close to 0 with an exponential prior. One obvious reason might be that patterns uncorrelated to the evolutionary rate were present in these real data sets and negating our attempt to smooth variations. However, we believe that time heterogeneity and "difficult" species are a more likely explanation for these occasional failures. More work is clearly needed to understand and evaluate the impact of the Gaussian process prior used to smooth frequency variations.
| Conclusion |
|---|
|
|
|---|
We have found a heterogeneous base composition in RNA genes that is most likely due to the variation of the substitution process across sites. The distribution of states observed at a site is correlated to the site-specific evolutionary rate, and a new method is proposed to capture this aspect of sequence evolution. Variation of equilibrium frequencies across sites is not accounted for by most evolutionary models, and we investigated the effects of spatial compositional variation with standard methods. Impact on phylogenetic reconstruction is probably limited, but we demonstrate that frequency parameter estimates in statistical phylogenetic inference methods are biased towards the frequencies of rapidly evolving sites. With nonhomogeneous methods, ancestral frequency estimates are biased towards frequency distributions of conserved sites, and we recommend caution in applying these methods.
| Supplementary Material |
|---|
|
|
|---|
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
|
|
|---|
We thank anonymous reviewers for helpful comments and valuable suggestions. V.G.-S. gratefully acknowledges a studentship from the School of Computer Science at the University of Manchester.
| Footnotes |
|---|
Peter Lockhart, Associate Editor
| References |
|---|
|
|
|---|
Brooks, D. J., J. R. Fresco, and M. Singh. 2004. A novel method for estimating ancestral amino acid composition and its application to proteins of the Last Universal Ancestor. Bioinformatics 20:22512257.
Bruno, W. J. 1996. Modeling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13:13681375.[Abstract]
Carlin, B. P., and T. A. Louis. 2000. Bayes and empirical Bayes methods for data analysis. Chapman and Hall, New York.
Chu, W., Z. Ghahramani, F. Falciani, and D. L. Wild. 2005. Biomarker discovery in microarray gene expression data with Gaussian processes. Bioinformatics 21:33853393.
Di Giulio, M. 2000. The universal ancestor lived in a thermophilic or hyperthermophilic environment. J. Theor. Biol. 203:203213.[CrossRef][Web of Science][Medline]
. 2003. The universal ancestor was a thermophile or a hyperthermophile: tests and further evidence. J. Theor. Biol. 221:425436.[CrossRef][Medline]
Dimmic, M. W., D. P. Mindell, and R. A. Goldstein. 2000. Modeling evolution at the protein level using an adjustable amino acid fitness model. Pac. Symp. Biocomput. 5:1829.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[CrossRef][Web of Science][Medline]
. 2001. Taking variation of evolutionary rates between sites into account in inferring phylogenies. J. Mol. Evol. 53:447455.[CrossRef][Web of Science][Medline]
. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Mass.
Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579593.[CrossRef][Web of Science][Medline]
Foster, P. G. 2004. Modeling compositional heterogeneity. Syst. Biol. 53:485495.
Galtier, N., and M. Gouy. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871879.[Abstract]
Galtier, N., and J. R. Lobry. 1997. Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol. 44:632636.[CrossRef][Web of Science][Medline]
Galtier, N., N. Tourasse, and M. Gouy. 1999. A nonhyperthermophilic common ancestor to extant life forms. Science 283:220221.
Gibbs, M. N., and D. J. C. MacKay. 2000. Variational Gaussian process classifiers. IEEE Trans. Neural. Netw. 11:1456.
Gibson, A., V. Gowri-Shankar, P. G. Higgs, and M. Rattray. 2005. A comprehensive analysis of mammalian mitochondrial genome base composition and improved phylogenetic methods. Mol. Biol. Evol. 22:251264.
Goldman, N., J. L. Thorne, and D. T. Jones. 1998. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149:445458.
Gutell, R. R., J. J. Cannone, Z. Shang, Y. Du, and M. J. Serra. 2000. A story: unpaired adenosine bases in ribosomal RNAs. J. Mol. Biol. 304:335354.[CrossRef][Web of Science][Medline]
Halpern, A. L., and W. J. Bruno. 1998. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol. Biol. Evol. 15:910917.[Abstract]
Hudelot, C., V. Gowri-Shankar, H. Jow, M. Rattray, and P. Higgs. 2003. RNA-based phylogenetics methods: application to mammalian mitochondrial RNA sequences. Mol. Phylogenet. Evol. 28:241252.[CrossRef][Web of Science][Medline]
Huelsenbeck, J. P., and R. Nielsen. 1999. Variation in the pattern of nucleotide substitution across sites. J. Mol. Evol. 48:8693.[CrossRef][Web of Science][Medline]
Jermiin, L. S., S. Y. W. Ho, F. Ababneh, J. Robinson, and A. W. D. Larkum. 2004. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. 53:638643.
Jow, H., C. Hudelot, M. Rattray, and P. G. Higgs. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol. 19:15911601.
Kimura, M., and T. Otha. 1974. On some principles governing molecular evolution. Proc. Natl. Acad. Sci. USA 71:28482852.
Lartillot, N., and H. Philippe. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacment process. Mol. Biol. Evol. 21:10951109.
Liò, P., and N. Goldman. 2002. Modeling mitochondrial protein evolution using structural information. J. Mol. Evol. 54:519529.[CrossRef][Web of Science][Medline]
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605612.[Web of Science][Medline]
MacKay, D. J. C. 1998. Introduction to Gaussian processes. Pp. 133165 in C. M. Bishop, ed. Neural networks and machine learning. Kluwer Academic Press, Boston.
Mayrose, I., D. Graur, N. Ben-Tal, and T. Pupko. 2004. Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Mol. Biol. Evol. 21:17811791.
Nielsen, R., and Z. Yang. 1998. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929936.
Pagel, M., and A. Meade. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. 53:571581.
Pupko, T., D. Huchon, Y. Cao, N. Okada, and M. Hasegawa. 2002. Combining multiple data sets in a likelihood analysis: which models are the best? Mol. Biol. Evol. 19:22942307.
Reeves, J. H. 1992. Heterogeneity in the substitution process of amino acid sites of proteins coded for by mitochondrial DNA. J. Mol. Evol. 35:1731.[CrossRef][Web of Science][Medline]
Reyes, A., C. Gissi, G. Pesole, and C. Saccone. 1998. Asymmetrical directional mutation pressure in the mitochondrial genome of mammals. Mol. Biol. Evol. 15:957966.[Abstract]
Savill, N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum likelihood methods. Genetics 157:399411.
Schwartzman, D. W., and C. H. Lineweaver. 2004. The hyperthermophilic origin of life revisited. Biochem. Soc. Trans. 32:168171.[CrossRef][Medline]
Seo, T., H. Kishino, and J. L. Thorne. 2005. Incorporating gene-specific variation when inferring and evaluating optimal evolutionary tree topologies from multilocus sequence data. Proc. Natl. Acad. Sci. USA 102:44364441.
Soyer, O., M. W. Dimmic, R. R. Neubig, and R. A. Goldstein. 2002. Using evolutionary methods to study g-protein coupled receptors. R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, T. E. and Klein, eds. Pac. Symp. Biocomput., 7:625636.
Stephan, W. 1996. The rate of compensatory evolution. Genetics 144:419426.[Abstract]
Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407515 in D. M. Hillis, ed. Molecular systematics, 2nd edition. Sinauer Associates, Sunderland, Mass.
Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol. Biol. Evol. 9:678687.[Abstract]
Tamura, K., and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512526.[Abstract]
Thorne, J. L., N. Goldman, and D. T. Jones. 1996. Combining protein evolution and secondary structure. Mol. Biol. Evol. 13:666673.[Abstract]
Tillier, E. R. M., and R. A. Collins. 1998. High apparent rate of simultaneous compensatory base-pair substitutions in ribosomal RNA. Genetics 148:19932002.
Whelan, S., and N. Goldman. 1999. Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics. Mol. Biol. Evol. 16:12921299.[Web of Science]
Williams, C. K. I., and C. E. Rasmussen. 1996. Gaussian processes for regression. Pp. 514520 In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds. Advances in neural information processing systems, Vol. 8. MIT Press, Cambridge, Mass.
Wuyts, J., P. De Rijk, Y. Van de Peer, T. Winkelmans, and R. De Watcher. 2001. The European large subunit ribosomal RNA database. Nucleic Acids Res. 29:175177.
Wuyts, J., Y. Van de Peer, T. Winkelmans, and R. De Watcher. 2002. The European database on small subunit ribosomal RNA. Nucleic Acids Res. 30:183185.
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306314.[CrossRef][Web of Science][Medline]
. 1996a. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 11:367372.[CrossRef]
. 1996b. Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42:587596.[CrossRef][Web of Science][Medline]
Yang, Z., N. Goldman, and A. Friday. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316324.[Abstract]
Yang, Z., R. Nielsen, N. Goldman, and A. M. K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431449.
Yang, Z., and D. Roberts. 1995. On the use of nucleic acid sequences to infer early branchings in the tree of life. Mol. Biol. Evol. 12:451458.[Abstract]
Yang, Z., and T. Wang. 1995. Mixed model analysis of DNA sequence evolution. Biometrics 51:552561.[CrossRef][Web of Science][Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
R. R. Stocsits, H. Letsch, J. Hertel, B. Misof, and P. F. Stadler Accurate and efficient reconstruction of deep phylogenies from structured RNAs Nucleic Acids Res., October 1, 2009; 37(18): 6184 - 6193. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Delport, K. Scheffler, and C. Seoighe Models of coding sequence evolution Brief Bioinform, January 1, 2009; 10(1): 97 - 109. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. C. Sheffield, H. Song, S. L. Cameron, and M. F. Whiting A Comparative Analysis of Mitochondrial Genomes in Coleoptera (Arthropoda: Insecta) and Genome Descriptions of Six New Beetles Mol. Biol. Evol., November 1, 2008; 25(11): 2499 - 2509. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Gowri-Shankar and M. Rattray A Reversible Jump Method for Bayesian Phylogenetic Inference with a Nonhomogeneous Substitution Model Mol. Biol. Evol., June 1, 2007; 24(6): 1286 - 1299. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. M. Kjer, J. J. Gillespie, and K. A. Ober Opinions on Multiple Sequence Alignment, and an Empirical Comparison of Repeatability and Accuracy between POY and Structural Alignment Syst Biol, February 1, 2007; 56(1): 133 - 146. [Full Text] [PDF] |
||||
![]() |
B. Boussau and M. Gouy Efficient Likelihood Computations with Nonreversible Models of Evolution Syst Biol, October 1, 2006; 55(5): 756 - 768. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

















