MBE Advance Access originally published online on March 7, 2007
Molecular Biology and Evolution 2007 24(6):1286-1299; doi:10.1093/molbev/msm046
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
A Reversible Jump Method for Bayesian Phylogenetic Inference with a Nonhomogeneous Substitution Model
School of Computer Science, University of Manchester, Manchester, United Kingdom
E-mail: magnus.rattray{at}manchester.ac.uk.
| Abstract |
|---|
|
|
|---|
Nonhomogeneous substitution models have been introduced for phylogenetic inference when the substitution process is nonstationary, for example, when sequence composition differs between lineages. Existing models can have many parameters, and it is then difficult and computationally expensive to learn the parameters and to select the optimal model complexity. We extend an existing nonhomogeneous substitution model by introducing a reversible jump Markov chain Monte Carlo method for efficient Bayesian inference of the model order along with other phylogenetic parameters of interest. We also introduce a new hierarchical prior which leads to more reasonable results when only a small number of lineages share a particular substitution process. The method is implemented in the PHASE software, which includes specialized substitution models for RNA genes with conserved secondary structure. We apply an RNA-specific nonhomogeneous model to a structure-based alignment of rRNA sequences spanning the entire tree of life. A previous study of the same genes from a similar set of species found robust evidence for a mesophilic last universal common ancestor (LUCA) by inference of the G + C composition at the root of the tree. In the present study, we find that the helical GC composition at the root is strongly dependent on the root position. With a bacterial rooting, we find that there is no longer strong support for either a mesophile or a thermophile LUCA, although a hyperthermophile LUCA remains unlikely. We discuss reasons why results using only RNA helices may differ from results using all aligned sites when applying nonhomogeneous models to RNA genes.
Key Words: Bayesian phylogeny nonhomogeneous nonstationary RNA substitution model MCMC compositional bias
| Introduction |
|---|
|
|
|---|
Standard phylogenetic methods have been shown to be adversely affected when nucleotide composition varies across taxa. The potential biasing effect of this compositional heterogeneity on distance, parsimony, and model-based reconstruction methods has long been recognized (see, e.g., Loomis and Smith 1990
As a practical illustration that this effect can be significant, in figure 1 we show results obtained with a nuclear large subunit (LSU) rRNA alignment from 5 bacterial species. Although convincing external evidence supports the grouping of the genus Thermus with the genus Deinococcus (Embley et al. 1993
; Foster 2004
), the thermophilic species Thermus thermophilus is attracted to the 2 other thermophilic species when standard homogeneous methods are used with this data set. As discussed later, the G + C content of the rRNA genes of prokaryote species is correlated to their optimal growth temperature, and homogeneous methods presumably fail to recover the correct tree because of the resulting G + C compositional bias present in the sequences (we use G + C to denote the G and C nucleotides in a DNA sequence; GC is specifically used to denote GC base pairs in RNA). Other studies have found similar results on these species with the small subunit (SSU) rRNA (Galtier and Gouy 1995
; Foster 2004
; Jayaswal et al. 2005
; Ababneh et al. 2006
).
|
In this paper, we focus on model-based phylogenetic methods in which the model parameters are typically estimated by maximum likelihood or Bayesian inference. These methods explicitly model sequence evolution as a Markov process with a parameterized substitution model determining the rate of nucleotide or amino acid substitutions at each site in an alignment. Standard time-homogeneous substitution models are parameterized with a single composition vector that defines the equilibrium distribution of the substitution process. Such models are consistent with an assumption of a stationary substitution process with the same equilibrium nucleotide composition everywhere in the tree. This assumption is therefore clearly violated in data sets where composition differs significantly across taxa.
Several approaches have been proposed in order to reduce bias when applying model-based methods to data showing compositional heterogeneity. The simplest approach is to recode the data in a way that masks any compositional variation, for example, by using RY-coding (Woese et al. 1991
; Phillips et al. 2004
) or AGY-coding (Gibson et al. 2005
) approaches. Although successful under some circumstances, these recoding schemes are limited to specific types of heterogeneity in composition and do not allow for changes in the relative frequency of purines and pyrimidines. A more generally applicable approach is to adapt the substitution models used by model-based methods in order to capture changes in composition. Following the early work of Barry and Hartigan (1987)
, Yang and Roberts (1995)
and Galtier and Gouy (1998)
proposed some nonhomogeneous substitution models that assign different composition parameters to the branches of the phylogeny. Such processes are not at equilibrium, and the average nucleotide composition varies along the branches of the tree. Felsenstein's (1981)
pulley principle does not apply with these processes that are not time reversible, and the likelihood of such evolutionary models depends on the placement of the root. Consequently, rooted trees have to be considered, and it is also necessary to specify the initial state composition at the root, which can also be considered as parameters of the substitution model.
In Yang and Roberts (1995)
, the most general nonhomogeneous substitution model, called N2, assigns 4 nucleotide composition parameters (3 free parameters) to each branch of the phylogeny. Three more parameters are used for the ancestral composition at the root of the tree. This model is highly parameterized, which can be problematic because there may not be enough data available to accurately estimate all these parameters and the computational burden of estimating an independent composition vector for each branch is considerable. Yang and Roberts also proposed a less flexible N1 model, which uses a common set of composition parameters for all the internal branches. Nonhomogeneous models of Yang and Roberts are based on the HKY85 nucleotide substitution model (Hasegawa et al. 1985
). Galtier and Gouy (1998)
proposed a simplified version of N2 by replacing the HKY85 model with the T92 model (Tamura 1992
). In the T92 model, the nucleotide composition is only described by the G + C content and is obtained by imposing
C =
G =
/2 on the composition parameters of the HKY85 model. This model consequently uses a single parameter for each branch and for the root, instead of 3. Nevertheless, this model still requires the estimation of many parameters when a large number of species is used.
The nonhomogeneous model presented here is based on these early works. Using an independent set of composition parameters for each branch, as was done in the model N2 of Yang and Roberts (1995)
and in Galtier and Gouy (1998)
, would certainly be more realistic, but a typical RNA alignment does not contain enough variable sites to estimate the composition parameters of a complex model on each branch of a phylogeny. A parameter-rich model is not necessarily a better model if there are not enough data to estimate its parameters accurately (Steel 2005
), and restricting the number of composition vectors seems a good and necessary trade-off to model compositional heterogeneity. Foster (2004)
has shown, on specific examples, that compositional heterogeneity can be accounted for with a limited number of composition parameters, and the same approach is followed here. The composition parameters for each branch are chosen from a pool of available composition vectors. The composition vectors in the pool and their placement on the phylogeny are both variables of the model that are estimated during the inference process.
The approach developed in this paper is quite similar to Foster's approach but the work is carried further. Firstly, the number of composition vectors is a parameter of the model which is allowed to vary by using Markov chain Monte Carlo (MCMC) methods for Bayesian parameter estimation. Reversible jump MCMC methods are used to add or remove composition parameters during the inference and to determine the amount of heterogeneity evidenced by the data (Green 1995
), bypassing the need for complex model selection procedures. Another advantage of using reversible jump methods is that they account for the uncertainty in the amount of heterogeneity while other phylogenetic parameters of interest are estimated. Reversible jump techniques have already been applied in Bayesian phylogenetics to model variations of the evolutionary process across sites (Suchard et al. 2003
), for model selection (Huelsenbeck et al. 2004
), and to allow for polytomous tree topologies (Lewis et al. 2005
). Secondly, rooted trees are considered, and the root position is not constrained to internal trifurcating nodes. New MCMC proposals are devised to cope with the presence of the root and to account for a parameter that allocates the composition vectors to the branches. Prior knowledge can be used to constrain the location of the root during the inference process, and this functionality is used when the method has difficulty recovering the root position with certainty. Thirdly, it is shown that using a uniform prior for the allocation of composition vectors on the branches has some unexpected, and probably unwanted, side effects. A hierarchical model that defines a more flexible prior on this allocation parameter is presented.
As well as modifying an important general class of nonhomogeneous substitution models, we also provide an implementation which allows for the use of specialized models of substitution in RNA genes with conserved secondary structure. As mentioned above, RNA genes are examples where compositional heterogeneity is considered important in some cases due to large changes in G + C composition in the helices. The substitution process between paired nucleotides in the RNA helices is strongly correlated because nucleotide substitutions on one side of the helix are often compensated for by substitutions on the opposite side, in order to maintain the molecular structure by base pairing (Higgs 1998
). Models have been introduced which account for this covariation by treating base-paired nucleotides as the basic units of evolution (reviewed by Savill et al. 2001
). A number of these models are implemented in the PHASE software for phylogenetic inference (Jow et al. 2002
; Hudelot et al. 2003
) and they can be used in conjunction with standard nucleotide substitution models for the nonpaired sites. These models have been shown to provide a better fit to data from RNA alignments than standard nucleotide substitution models (Telford et al. 2005
). When we used a standard homogeneous RNA model on the data used as an example in figure 1, we found the correct topology without resorting to a nonhomogeneous model (results not shown). Here, we will use these RNA-specific models when applying our nonhomogeneous modeling framework to RNA genes.
| Methods |
|---|
|
|
|---|
Substitution Model
The branches of a rooted phylogenetic tree are each associated with substitution models selected from a pool. Each branch is therefore associated with an allocation parameter a
(q) specifying which model in the pool is associated with branch q = 1,2,...,b of tree
. The substitution models are Markov processes described by an instantaneous rate matrix, which defines the substitution rate rij =
ij
j between states i and j, where
ij is a symmetric matrix of exchangeability parameters and
i are state composition parameters which are collected in the vector
= [
i]. Specific models are defined by introducing constraints on the exchangeability and composition parameters. In a standard nucleotide substitution model, the states are the 4 nucleotides (A, C, T, and G). In the results presented in this paper, we use the TN93 model (Tamura and Nei 1993We have not attempted to select optimal substitution models in this work, and it is possible that more flexible models would be better supported by the data. We have chosen models that we think will provide a reasonable trade-off between flexibility and computational tractability. However, there are many alternatives that could also have been considered for both the unpaired and paired sites.
In this paper, we follow Foster's approach of using a common set of exchangeability parameters for all models in the pool and only allowing the composition vectors to differ between models. We have implemented a more general scheme in the PHASE software, which also allows for differences between the exchangeability parameters.
Markov Chain Monte Carlo
Given a substitution model and a rooted tree one can obtain the likelihood
of the alignment data D, given model k with parameters
(tree topology, branch lengths, exchangeability parameters, composition parameters of each model in the pool, and allocation parameter) (Felsenstein 1981
). Bayes theorem provides the posterior probability distribution of the model parameters,
|
| (1) |
is the prior over the model parameters and
is the marginal likelihood, also known as the evidence, for model k. Each model k in the present context refers to a particular choice for the size of the pool of substitution models. The denominator in this expression is intractable due to the huge number of alternative tree topologies and allocations of models to branches. MCMC methods have become popular in phylogenetics because they provide a practical way to sample from this posterior distribution (reviewed by Lewis [2001]
A MetropolisHastings MCMC algorithm constructs a Markov chain in the parameter space. A sequence
1,
2, ...,
n is generated according to a 1st order Markov process that will eventually produce a sample from the posterior distribution
for large n (Metropolis et al. 1953
; Hastings 1970
). Let
be the probability with which a new state in the chain is proposed. The proposal is accepted with the acceptance probability
which is calculated as
![]() | (2) |
n =
. This approach is tractable because the denominator in equation (1) cancels when computing the acceptance probability. The quantity
is known as the Hastings ratio. The standard MetropolisHastings sampler only works when the model k has a fixed dimension. Later, we describe a more general reversible jump MCMC scheme in which the model dimension can also change as part of the inference process.
MCMC Proposals for Continuous Parameters
The proposals for continuous parameters are the same as those already used in the PHASE software version 2. We use a Dirichlet distribution for the prior and proposal distributions, with density function
![]() | (3) |
i] and "tightness" parameter p0. Each composition vector in the pool is considered independently and has a uniform Dirichlet prior, that is, setting p0
i = 1 above. New values are proposed by drawing from a Dirichlet distribution centered at the current composition vector, that is, with
i =
i, where the tightness parameter p0 is tuned during the burn-in period to get a reasonable acceptance probability. The same prior and proposal are used for an additional composition vector at the root of the tree. A Dirichlet prior and proposal are also used for the set of exchangeability parameters, as suggested by Zwickl and Holder (2004)When partitioned models are used, then the sum of the average substitution rates for each partition is constrained to be equal to the number of substitution models and a uniform Dirichlet prior and proposal is used for the rates of each partition normalized by the sum of rates of all the partitions.
MCMC Proposals for Topologies
The nearest neighbor interchange (NNI) and subtree pruning and regrafting (SPR) topology proposals available in PHASE for unrooted trees have been altered slightly in order to account for the presence of the root. These proposals disrupt the allocation of composition vectors, and it is necessary to describe how the allocation parameter is handled when a new topology is proposed. In figures 2 and 3, the numbers refer to the composition vectors associated with each branch (these can refer to the same composition vector).
|
|
The NNI proposal is described in figure 2. An internal branch is chosen at random, let it be R
E where R is the parent. The second sibling branch emanating from R is then swapped with one of the child branches of E chosen randomly. The subtree containing the root (A in the fig. 2) is kept unchanged. The proposal is unaffected if the chosen internal branch is directly linked to the root, that is, if R is the root and A does not exist. The Hastings ratio for such a proposal is 1. The SPR proposal is described in figure 3. Two branches are chosen randomly and the first branch is detached and reattached on the second one. The 2 branches linked to the root are considered as a single branch for this proposal. When the first branch is removed, the 2 adjacent branches have to be merged into a single one. Their 2 lengths are summed, but it would be difficult to design a proposal that would merge their 2 composition vectors while remaining reversible. Consequently, the pruned branch "carries" one of the composition vectors and inserts it on the destination branch. The Hastings ratio is the ratio of the length of the insertion branch to the sum of the lengths of the 2 branches adjacent to the displaced branch.
MCMC Proposal for Swapping Composition Vectors
This proposal changes the composition vectors assigned to a subset of branches of the tree. First, a limited number of branches is selected for the proposal. Each branch is added to the subset with probability p. Because this proposal would not modify the current state when this subset is empty, it is applied to one branch randomly chosen if no branch is selected in the first step. The composition vector of each selected branch is then replaced by another vector drawn from the pool. Note that the current vector is excluded from the draw and this proposal is consequently not allowed when there is just 1 vector in the pool. The parameter p is tuned during the burn-in period to reach an acceptance rate between 20% and 25%. The Hastings ratio of this proposal is 1. Some partial likelihoods used in the previous iteration are still valid when attempting such a proposal, and partial likelihoods at an internal node do not have to be computed if the subtree below this node was not modified.
Birth and death proposals were not implemented, and it is possible for one or more composition vectors in the pool to end up being unused with this proposal. Such superfluous parameters have no influence on the likelihood and are not strongly penalized by the prior used in the hierarchical model presented later. This inflates the amount of compositional heterogeneity without real justification from the data. We do not consider this to be a serious problem because it does not strongly affect the statistical inferences made by the model, but it does lead to computational inefficiency because it can result in an unnecessary increase in the number of parameters. The problem can be corrected with a prior that penalizes substitution models using a larger number of composition vectors, as we will show in the Results section.
Reversible Jump MCMC Proposals
Reversible jump MCMC methods can be used to carry out Bayesian inference when the dimension of the parameter space can vary (Green 1995
). This is very useful here as it allows for changes in the size of the pool of composition vectors.
Consider changes between the states in 2 different models,
and
where k and l are discrete parameters representing the models (in our case, the size of pool) and
(k) and
(l) are sets of parameters valid in their 2 subspaces which have dimensions dk and dl, respectively. To perform a reversible jump from the first to the second parameter space and to cope with their unequal dimensions, a vector of continuous random variables u(k) of size mk is drawn from a given probability distribution and
(l) is determined from
(k) and u(k). Reciprocally, one jumps from the 2nd to the 1st space by drawing a vector of random variables u(l) of size ml and by determining
(k) from
(l) and u(l). Green (1995)
showed that by establishing a bijection between (
(k), u(k)) and (
(l), u(l)) and by matching their dimension, that is, dk + mk = dl + ml, it is possible to perform a reversible jump between the 2 spaces. In practice, it is often easier not to draw any random variables when jumping from a higher dimension to a lower dimension but, for reasons that are explained below, both mk or ml are chosen here to be nonzero.
The acceptance probability for the MCMC transition
is now
![]() | (4) |
(k), u(k)) to (
(l), u(l)). If a nonuniform prior on k is chosen, then that must also be included in the above ratio.
Split and merge proposals are used to increase and reduce the dimensionality of the model. These proposals are described here assuming that the sequence data are not partitioned and that a unique substitution model is used at all sites, but the algorithm is easily adapted to the case of partitioned data. If k is the number of composition vectors in the current pool, then, with probability pS(k), a split is attempted and 2 new composition vectors,
(1) and
(2), are created from an initial composition vector
(0) randomly chosen from the pool. With probability pM(k), a merge is proposed and 2 randomly chosen composition vectors,
(1) and
(2), are fused to build the composition vector
(0).
Because composition vectors appear and disappear during these moves, the allocation vector that assigns composition parameters to each branch has to be modified when split and merge proposals are attempted. A natural choice for a split is to reassign randomly, with equal probability, the branches previously allocated to
(0) to
(1) and
(2). Symmetrically, assigning to
(0) all the branches previously allocated to
(1) and
(2) seems reasonable for a merge (see fig. 4).
|
Split and merge proposals should produce states that are similar enough to have comparable likelihood. Let b0 be the number of branches allocated to
(0) before the split or after the merge. Let b1 and b2 be the number of branches allocated respectively to
(1) and
(2) after the split or before the merge. Intuitively,
(0) should be a compromise of
(1) and
(2) that takes b1 and b2 into account.
Composition parameters are all greater than 0 and must sum to 1. The last composition parameter is entirely determined by the others and therefore we define
for i = 0, 1, 2, where n is the number of states in the substitution model. Dimension matching is achieved by drawing several random variables and establishing a bijection between
and
The
are random variables drawn from some probability distributions when a split proposal is attempted and {s1, s2} are drawn when a merge is attempted. Additional parameters
are determined from knowledge of the n1 other parameters, that is,
From the 3 sets of composition vectors
(0),
(1), and
(2) and the 3 random variables s0, s1, and s2, 3 sets of variables
are defined such that:
|
| (5) |
for i = 0, 1, or 2. These 3 sets of variables represent the unscaled composition parameters and were introduced to bypass the fact that composition parameters have to sum up to 1.
The vector
is then defined as a compromise between the vectors
and
weighted by their relative importance in the phylogeny before the merge or after the split. Some models in the pool might not be allocated and there is no guarantee that b0, b1, and b2 are greater than 0, which would be an issue in the mathematical expressions that follow. Therefore,
are introduced to express the relations between the 3 vectors,
|
| (6) |
and
are defined using the set of random variables
![]() | (7) |
values remain positive. Admittedly, this is a rather complex choice for the bijection satisfying the dimension-matching requirement and this particular choice deserves a brief explanation. When a Dirichlet proposal is used to modify a composition vector, one has to balance the desire to perform a long-distance move while maintaining a reasonable acceptance rate. The tightness parameter of the Dirichlet distribution (p0 in eq. 3) is used to control the step size. Experience has shown that for a fixed p0, the acceptance rate can vary widely depending on the data set, which is why it is modified during the burn-in period to improve the mixing. Intuitively, the optimal step size for the split proposal should be related to the optimal step size used when modifying composition vectors. The bijection chosen to perform split and merge proposals was consequently designed to approximate roughly the original Dirichlet proposal. For that purpose, s0, s1, and s2 are drawn from a gamma distribution with parameter p0 and each uj is drawn from the standard normal distribution.
The acceptance probability of this proposal is determined by the prior ratio, proposal distribution, and Jacobian of the bijection and is given in the Appendix.
Hierarchical Model for Nonuniform Allocation Probabilities
In a model with a fixed number k of composition vectors, Foster (2004)
maximizes likelihood under an assumption that each branch is equally likely to select one composition vector from this pool. This corresponds to a uniform prior over the allocation of composition vectors a
(q) to the
branches,
|
| (8) |
|
| (9) |
| Results |
|---|
|
|
|---|
The Effect of the Prior on the Allocation Parameter
We first compare the uniform and hierarchical priors on the allocation parameter using synthetic data sets that clearly reveal the differences between them. A random ultrametric tree of 100 species was generated with a Yule process and the common distance from root to tips was set to 0.5. On one side of the root, an outgroup clade of 12 species was found, whereas 2 clades c1 and c2 with 18 and 70 species, respectively, was found on the other side of the root. Using a program of the PHASE package, 100 alignments of 1,000 nucleotides were generated by Monte Carlo simulation using different nucleotide substitution models for the different clades. The composition vector {
A = 25%,
C = 20%,
G = 25%,
T = 30%} was used to generate randomly the ancestral sequence at the root, and a TN93 substitution model was subsequently used to evolve the sequences along the branches of the tree. A discrete gamma model with 4 gamma categories was assumed to model rate heterogeneity across sites, and a constant gamma shape parameter equal to 0.4 was used (Yang 1994
trans = 0.1,
AG = 0.3,
CT = 0.6), but composition parameters were different in the 3 main clades: {
A = 25%,
C = 23%,
G = 25%,
T = 27%} was used for the equilibrium distribution on the branches of the outgroup clade, {
A = 25%,
C = 17%,
G = 25%,
T = 33%} was used for c1, and {
A = 25%,
C = 35%,
G = 25%,
T = 15%} was used for c2. These large alignments were then reduced to produce 100 data sets of 24 species that were analyzed with the 2 different priors. To reduce each replicate from 100 to 24 species, the selected sequences were randomly extracted from the original data set: 7 species were chosen from the outgroup, 5 species from c1 and 12 species from c2. The sequences were analyzed with both priors. The upper bound of the uniform prior on the model order k was arbitrarily set to 10. Notice that there are comparatively few species extracted from c1 and that the composition parameters used for c1 and the outgroup are quite similar. Both methods are consequently expected to have difficulty recovering 2 different models for the outgroup and c1. The results are summarized in figure 5, and they demonstrate that the hierarchical prior is more likely to introduce new composition vectors into the pool. On average, the maximum a posteriori estimate for the number of composition vectors was found to be equal to the correct number of vectors more often with the hierarchical Bayesian model than with the original method (69% against 53%).
|
RNA Tree of Life
G + C Content and Thermophily
The G + C content of nuclear rRNA sequences is quite variable among prokaryotes. The SSU and LSU rRNA sequences of 40 species, spanning the Tree of Life, were downloaded from the European rRNA database (Wuyts et al. 2004
The G + C content of nuclear RNA genes is correlated with the optimal growth temperature (OGT) in prokaryotes (Galtier and Lobry 1997
). The rRNA sequences of thermophilic (50 °C < Topt < 80 °C) and hyperthermophilic species (Topt > 80 °C) are G + C rich. The correlation is strongest in stem regions (Wang and Hickey 2002
), presumably because GC base pairs bond together through 3 hydrogen bonds and are more stable than AU pairs that pair with 2 hydrogen bonds. The stronger bonds confer better heat resistance to the G + C rich rRNA molecules that can remain operational at higher temperature. This is illustrated in figure 6 where the OGT of the prokaryotes (Bacteria and Archaea) is contrasted with the percentage of GC pairs found in the stems of their rRNA genes. The regression results reported by Galtier and Lobry (1997)
, and later by Wang and Hickey (2002)
, were obtained using standard linear regression and therefore ignore the confounding influence of the phylogenetic signal. However, it has been shown that the relationship between G + C content and OGT does remain significant when the phylogenetic signal is considered (Hurst and Merchant 2001
).
|
Galtier and Gouy (1998)
Phylogenetic Reconstruction
Before studying the ancestral composition using only RNA stems, we used both the stems and the loops in order to infer the topology of the tree. The concatenated LSU + SSU data set was analyzed with the nonhomogeneous model using TN93 for loops and OTRNA for stems (see Methods). In both blocks, rate heterogeneity across sites was accounted for using the discrete gamma model with 8 discrete gamma categories (Yang 1994
).
Sixteen MCMC experiments were run without constraining the position of the root but the chains converged toward different results. Eleven chains converged toward a distribution where the root was on the bacterial branch with 100% Bayesian posterior probability (BPP) and 3 chains converged toward a distribution where the root was on the eukaryal branch with 100% BPP. The 2 remaining chains were discarded because the root, which was initially within the Archaea, switched to the branch leading to Bacteria during the sampling. This clearly means that current algorithms cannot resolve the position of the root on the rRNA tree and cannot attach a posterior probability to the different alternatives. Because the chains were run for 30,000,000 iterations plus 9,000,000 iterations for the burn-in, we do not believe longer runs could have solved this mixing problem.
The rooting of the universal tree is not yet a resolved problem. Because the monophyly of Archaea is uncertain and not supported by this data set, we will only consider the bacterial and eukaryotic rootings in what follows. Even though the 14 retained chains converged toward 2 different posterior distributions, results were consistent for a given rooting point. The resulting phylogenies for both rootings are summarized in the supplementary material (Supplementary Material online) along with the mean estimated GC content on each branch. Apart from the root position, both tree topologies are actually identical and the tree is similar to the one recovered by Galtier et al. (1999)
and is most probably the same (cf. fig. 1 of their paper). The monophyly of Archaea is rejected in both cases, albeit with a low support for the clade EuryarchaeotaEubacteria when the root is positioned on the branch leading to eukaryotes. Both trees seem surprisingly well resolved, that is, not necessarily correct but with high posterior probabilities for the different clades, most of them being equal to 100%. Nevertheless, clade posterior probabilities were found to be even higher when the corresponding homogeneous model, that is, TN93 + OTRNA, was used. The unrooted topology recovered with a homogeneous model was found to be the same except for one major difference: when composition parameters are assumed to be constant across lineages, Entamoeba histolytica is found to branch out after the phylum Euglenozoa (represented here by 2 species). This suggests that the branch leading to the G + C-rich Giardia intestinalis and the branch leading to the G + C-poor E. histolytica are repulsing each other when variation of the substitution process over time is not accounted for (see also Hasegawa et al. 1993
).
Ancestral G + C Content
Using their software EVAL_NH, Galtier et al. (1999)
recovered a maximum likelihood estimate for the ancestral G + C content of LUCA that falls at the upper end of the range of G + C content found in contemporary mesophiles (cf. fig. 2 of their paper). They consequently concluded that LUCA was probably not a hyperthermophile. This result was found to be quite robust to the gene used (i.e., SSU or LSU), to the rooting point, and to species sampling. We partially repeated their experiment using the Bayesian methods developed here. Because the variation of the G + C content is related to the variation of the proportion of GC pairs in RNA stems and is only weakly correlated to the G + C content in RNA loops (Wang and Hickey 2002
), loops were not used during this analysis. To allow for direct comparison between the inferred ancestral GC content and the GC content observed in contemporary species, positions containing pairs with gaps and/or ambiguous nucleotides were also removed from the data set leaving 613 aligned pairs.
We inferred the ancestral GC content assuming that the unrooted topology found with the complete data set was correct. The position of the root was constrained during these analyses but both rooting points were considered. An OTRNA substitution process with 8 discrete gamma rate categories was used to model the evolution of RNA pairs and 4 MCMC runs were performed for each possible rooting point. Because this data set is smaller, we could afford 60,000,000 sampling iterations for each run. Long runs were required because mixing was very slow for the model order parameter (see Supplementary Material online for a comparison of results from each independent run). The inferred ancestral GC composition is given in figure 7 and compared with the GC composition of extant species.
|
With the bacterial rooting, the mean posterior estimate for the ancestral GC content is GCbact = 66.02%. With the eukaryotic rooting, GCeuk = 61.42% was inferred. Although these results do not drastically differ from the results of Galtier et al. (1999)
Experiments were repeated without fixing the topology in order to account for phylogenetic uncertainty when inferring the ancestral GC composition. The bacterial and eukaryotic clades were constrained to be monophyletic and the ancestral GC content was estimated once again using both possible rootings. The 2 consensus trees, inferred only with the stems, are slightly different, and the support values are lower than with the complete data set. Nevertheless, mean posterior estimates and credibility intervals for the ancestral GC content are found to be almost identical (GCbact = 65.97% and GCeuk = 61.45%), and accounting for the phylogenetic uncertainty does not alter the previous results.
Amount of Heterogeneity
The number of composition vectors is an important parameter of the nonhomogeneous substitution model developed here. When analyzing the LSU sequences of 5 bacterial species (see fig. 1), we found that 2 composition vectors could explain the data well, which is consistent with Foster's result using the SSU gene (Foster 2004
). Nevertheless, once a flexible prior was implemented for the allocation of composition parameters on the tree and with a larger number of species, it was found that a relatively large number of composition vectors were necessary to fit the Tree of Life data set presented here (see fig. 8). This suggests that the method cannot be applied easily to large data sets when the data are highly heterogeneous. The chains had to run for a long time to produce the results presented here because the model order parameter was found to mix slowly and many samples were required to get consistent results from independent runs for this parameter (see Supplementary Material online for a comparison of results from each independent run).
|
The choice of a particular prior on the number of composition vectors in the pool can have a substantial impact on the posterior distribution. In figure 9, the posterior distributions of the number of composition vectors obtained with various prior distributions are compared. Using a Poisson prior with a low mean significantly affects the posterior distribution on the number of composition vectors, but the impact on the ancestral GC posterior distribution is negligible. Clade support values were also largely unaffected by a change of the prior distribution (differences were within 6%).
|
| Discussion |
|---|
|
|
|---|
The original maximum likelihood methods for modeling compositional variation over evolutionary time do not scale well when a large number of species is used because they require the estimation of independent composition vectors on each branch (Yang and Roberts 1995
We applied our methods to a similar Tree of Life rRNA data set as considered by Galtier et al. (1999)
but using only RNA stems and an RNA-specific evolutionary model that accounts for the covariation between base-paired nucleotides. It was found that the GC composition of LUCA was very sensitive to the rooting of the tree. With the bacterial rooting it was observed that one cannot completely rule out a thermophilic ancestor, although a hyperthermophilic one seems unlikely.
Our results are significantly different from those of Galtier et al. (1999)
. This is probably not only due to the difference in the model used here. We have recently demonstrated that variation in composition across sites in an RNA alignment can have a significant impact on phylogenetic estimates (Gowri-Shankar and Rattray 2006
). In particular, it was shown that nonhomogeneous methods of the type considered here produce estimates of ancestral composition that may be biased toward the composition observed at slowly evolving sites. When RNA loops and stems are combined, as in the data set used by Galtier et al. (1999)
, the slowly evolving sites are observed to have a lower G + C composition than the rapidly evolving sites. This suggests that their results may be biased toward lower ancestral G + C composition. Conversely, when only stems are considered, then the GC composition of the slowly evolving sites is higher than at the fast-evolving sites. Therefore, it is also possible that our own ancestral estimates are biased upwards. Unfortunately, it may be quite difficult to separate the effects of this heterogeneity across sites from the temporal heterogeneity considered in this paper. We have introduced a model that allows for a nonlinear relationship between composition and substitution rate assuming a stationary and time-homogeneous substitution process (Gowri-Shankar and Rattray 2006
). In order to resolve time and site compositional heterogeneity, it may be necessary to combine the model considered there with the nonhomogeneous model considered here. Although that may be straightforward in principal, the computational burden of such a combined model is substantial and it would be a challenge to develop a practical implementation.
Blanquart and Lartillot (2006) have recently introduced a new nonhomogeneous model in which changes to the substitution model are themselves modeled as a stochastic process on the phylogenetic tree. This approach removes the unrealistic assumption used here, and in previous models, that changes in the substitution model must occur at the base of the branches. Blanquart and Lartillot explicitly model the "breakpoints" where the model changes on the tree, and they develop a practical Bayesian MCMC method for parameter estimation, which also allows for the model dimensionality to be learned as part of the inference process. Their approach is elegant and clearly leads to a very parsimonious model when the number of breakpoints is small. In their current implementation a zeroth order process is used for model changes and each breakpoint will therefore be associated with a completely new model. Different parts of the tree, separated by breakpoints, cannot share parameters. In our own case, the allocation of models to neighboring branches is independent, but branches across the tree can share parameters. If similar substitution patterns evolve many times independently, then the approach considered in this paper may provide a simpler explanation of the data. As Blanquart and Lartillot suggest, it would therefore be very interesting to compare the reversible jump approach developed here with their model and that is clearly an important future direction of research. The 2 models could also be combined by having the models used at each breakpoint selected from a pool, and in this case, the reversible jump proposals developed here could prove useful.
The models discussed above are all locally homogeneous corresponding to an assumed model of punctuated changes in the substitution process. In the original work of Barry and Hartigan (1987)
, a more general model was considered where the assumption of local homogeneity is not required. Methods have recently been developed for estimating the parameters of this model (Jayaswal et al. 2005
) and for using the model to test various aspects of the substitution process (Ababneh et al. 2006
). This provides an interesting alternative approach to the development of locally homogeneous methods.
Methods that have previously been developed to account for the variation in nucleotide composition across lineages do not always solve the obvious topological errors encountered when compositionally heterogeneous data sets are studied (Foster and Hickey 1997
; Chang and Campbell 2000
; Tarrío et al. 2001
). Consequently, the method developed here, which is in essence very similar, will probably not be the panacea for all composition-related reconstruction artifacts. Conant and Lewis (2001)
have already argued that compositional heterogeneity alone cannot completely explain the failure of traditional methods in some examples used by Lockhart et al. (1994)
. This suggests that accounting only for the compositional bias cannot prevent all reconstruction artifacts.
| Supplementary Material |
|---|
|
|
|---|
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Appendix |
|---|
|
|
|---|
In this appendix, we calculate the acceptance probability for the dimension changing reversible jump proposals. First, we consider the case of a uniform prior distribution over allocations of models to branches. Then, we consider the changes introduced when a hierarchical nonuniform allocation distribution is used.
The acceptance probability for the split proposal is considered here.
![]() | (10) |
n denotes the nth state in the MCMC chain, k represents the size of the pool,
and 
are respectively the current tree topology and the associated set of branch lengths, a
(k) is the allocation vector that assigns these composition parameters to the branches,
is the fixed dimension vector that groups the substitution parameters that are constant across branches and also includes the ancestral state distribution, and
(k) is the set of composition vectors available in the pool.
Reformulating equation (4) with a product of ratios, the acceptance probability can also be written as
|
|
Prior Ratio
A simple factorized form is chosen for this prior
|
| (11) |
(0) and the 2 resulting vectors
(1) and
(2). The complete prior ratio for a split move is consequently
![]() | (12) |
Proposal Ratio
The proposal ratio is a complex term and it can be further decomposed into a product of 3 ratios. The 1st term of the product is related to the tree more specifically to the modification of the allocation vector as described in figure 4. The 2nd term is related to the composition parameters and the probability of switching between
(0) and {
(1),
(2)} once the parameters to split/merge have been selected. The 3rd term is a nonobvious term that arises because no ordering constraint is enforced on the parameters of the different subspaces.
The probability of attempting a split and proposing the new model and allocation vector {k = d + 1, a
(d+1)} from {k = d, a
(d)} is equal to
|
|













. In the 1st case, 
