Skip Navigation


MBE Advance Access originally published online on August 24, 2006
Molecular Biology and Evolution 2006 23(11):2058-2071; doi:10.1093/molbev/msl091
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
23/11/2058    most recent
msl091v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Blanquart, S.
Right arrow Articles by Lartillot, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Blanquart, S.
Right arrow Articles by Lartillot, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

A Bayesian Compound Stochastic Process for Modeling Nonstationary and Nonhomogeneous Sequence Evolution

Samuel Blanquart and Nicolas Lartillot

Projet Méthodes et Algorithmes pour la Bioinformatique, LIRMM-CNRS, Montpellier, France

E-mail: samuel.blanquart{at}lirmm.fr.


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Material
 Results
 Discussion
 Supplementary Material
 Appendix
 Acknowledgements
 References
 
Variations of nucleotidic composition affect phylogenetic inference conducted under stationary models of evolution. In particular, they may cause unrelated taxa sharing similar base composition to be grouped together in the resulting phylogeny. To address this problem, we developed a nonstationary and nonhomogeneous model accounting for compositional biases. Unlike previous nonstationary models, which are branchwise, that is, assume that base composition only changes at the nodes of the tree, in our model, the process of compositional drift is totally uncoupled from the speciation events. In addition, the total number of events of compositional drift distributed across the tree is directly inferred from the data. We implemented the method in a Bayesian framework, relying on Markov Chain Monte Carlo algorithms, and applied it to several nucleotidic data sets. In most cases, the stationarity assumption was rejected in favor of our nonstationary model. In addition, we show that our method is able to resolve a well-known artifact. By Bayes factor evaluation, we compared our model with 2 previously developed nonstationary models. We show that the coupling between speciations and compositional shifts inherent to branchwise models may lead to an overparameterization, resulting in a lesser fit. In some cases, this leads to incorrect conclusions, concerning the nature of the compositional biases. In contrast, our compound model more flexibly adapts its effective number of parameters to the data sets under investigation. Altogether, our results show that accounting for nonstationary sequence evolution may require more elaborate and more flexible models than those currently used.

Key Words: phylogeny • MCMC • nonstationary • nonhomogeneous • compositional bias • compound stochastic process


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Material
 Results
 Discussion
 Supplementary Material
 Appendix
 Acknowledgements
 References
 
Base composition has been shown to be highly variable among species (Jukes and Bhushan 1986Go; Montero et al. 1990Go; Bernardi 1993Go), a phenomenon generally denoted as compositional biases. When analyzing phylogenetic relationships between species using standard methods, a similar nucleotidic composition is often interpreted as phylogenetic signal, leading unrelated species to be grouped together in the resulting tree (Lockhart et al. 1992Go, 1994Go; Lake 1994Go; Galtier and Gouy 1995Go; Yang and Roberts 1995Go; Foster and Hickey 1999Go; Mooers and Holmes 2000Go; Foster 2004Go).

A first way to avoid this problem consists in recoding the character states into functional groups, so as to homogenize the composition between sequences. For instance, the RY coding (Woese et al. 1991Go) consists in replacing nucleotides A and G by R (purine) and C and T by Y (pyrimidine). In this way, only transversion events are considered, nucleotides A and G, C and T become synonymous and GC biases are removed. As transitions often occur more frequently than transversions (Brown et al. 1982Go), the RY coding also decreases saturation and this enhances ancient phylogenetic signal. It has been used for resolving deep divergences (Phillips and Penny 2002Go; Delsuc et al. 2003Go). An analogous coding system has been proposed for amino acid sequences (Dayhoff coding, Hrdy et al. 2004Go). More generally, one can accommodate the data by removing saturated sites from the analysis such as third codon positions (Swofford et al. 1996Go; Delsuc et al. 2002Go; Canbäck et al. 2004Go) or fast-evolving sites (Brinkmann and Philippe 1999Go; Philippe et al. 2000Go). These methods have not been specifically devised to deal with compositional biases, but assuming that biased sites are generally among the fast-evolving ones, they should nevertheless be efficient against them. Altogether, these filtering and recoding methods have proven quite effective, in particular, for resolving deep divergences (Philippe et al. 2000Go; Phillips and Penny 2002Go; Delsuc et al. 2003Go). On the other hand, they may also result in a loss of phylogenetic information. Moreover, the RY and the Dayhoff coding may not be efficient in all situations. For instance, the RY coding only alleviates GC biases and will not efficiently suppress more general compositional shifts in DNA composition.

All methods mentioned thus far aim at filtering away the compositional bias from the data set but do not question the underlying evolutionary model itself, which is still assumed to be stationary and homogeneous. In contrast, a series of methods based on nonstationary models have been proposed. Among them, there are distance-based methods, such as the "LogDet" (Lockhart et al. 1994Go), "paralinear" (Lake 1994Go), or modified "Tamura–Nei" (Tamura and Kumar 2002Go) distances, and full-likelihood approaches, based on maximum likelihood (Yang and Roberts 1995Go; Galtier and Gouy 1998Go) or Bayesian (Foster 2004Go) frameworks. Some of these methods have been successfully applied, in particular, to studies about ancestral GC contents (Galtier et al. 1999Go; Tarrio et al. 2001Go) or to phylogenetic inference from GC-biased sequences (Herbeck et al. 2004Go).

In the model proposed by Galtier and Gouy, a single parameter is associated to each branch of the phylogenetic tree. This parameter represents the branch-specific GC ratio. The model proposed by Yang and Robert accommodates more general biases as 3 free parameters, handling frequencies for each of the 4 nucleotides, are assigned to each branch. In both models, the values of the parameters are estimated from the data by maximum likelihood. However, associating different compositions to each branch may result in a large amount of free parameters, and problems of overparameterization may then be encountered (Foster 2004Go). In order to reduce such overparameterization effects, Foster (2004)Go proposed a model based on a predefined number of clusters of base frequencies, smaller than the number of branches. The base frequencies of each cluster, and the cluster each branch is associated to, are sampled from their joint posterior distribution by Markov chain Monte Carlo (MCMC, Neal 1993Go). Such a method may be effective in reducing the number of parameters. On the other hand, it still lacks flexibility as the number of different clusters is fixed to a prespecified number. Ideally, this number should be a free parameter of the phylogenetic inference. Moreover, it does not completely address the problems that are the original cause of the overparameterization of the branchwise models, such as those of Galtier and Gouy or Yang and Roberts. Fundamentally, this overparameterization is a consequence of the fact that the equilibrium frequencies of the substitution process are reassessed at the base of each branch of the tree, whereas in many practical situations, the equilibrium frequencies may have remained constant for time periods spanning several branches, sometimes entire groups. Moreover, in the branchwise models, changes of compositional bias are associated with speciation events, that is, with the nodes of the phylogenetic tree, which is not realistic; speciation events and changes of composition should ideally be uncoupled.

In this direction, an interesting solution was proposed by Huelsenbeck et al. (1999)Go, allowing variations of the substitution rate along lineages, according to the so-called compound stochastic process. Specifically, substitutions between sequence states happen according to a classical first-order Markov process, whose instant rate is described by a second, piecewise constant, stochastic process. Realizations of the second stochastic process are sampled using MCMC methods. The number of rate change events is thus a free variable in the model of Huelsenbeck et al.

Here we propose a nonstationary model, also based on a compound stochastic process, generalizing the models of both Galtier and Gouy and Yang and Robert. As in the model of Huelsenbeck et al., we use an additional stochastic process operating along lineages, but this time, to model compositional shift events. Those events occur independently from speciations, according to a Poisson process, and are thus a free, Poisson distributed variable of our model. Importantly, this approach allows a flexible dimensionality, in contrast to previous proposed nonstationary models. We implemented this model in a Bayesian framework, using the MCMC approach to sample realizations of the compositional shift stochastic process, and applied it to several nucleotidic data sets. In particular, our results show that its free dimensionality leads our model to better fit the data, especially compared with the model proposed by Yang and Robert, which in contrast appears to be penalized by its lack of control of the dimensionality.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Material
 Results
 Discussion
 Supplementary Material
 Appendix
 Acknowledgements
 References
 
A set of homologous aligned sequences is available in the form of a data matrix of J sequences of K sites. Phylogenetic relationships between the J extant species are represented by a rooted binary tree, denoted as {tau}, whose nodes represent speciation events. A length is associated to each branch. Let t = {t1, ..., t2J – 2}, where (1, ..., 2J – 2) are branch indices, denote the set of branch lengths. Additionally, sites have their own rate of substitution, r = (r1, ..., rK), distributed according to a continuous gamma distribution.

Markovian Model of the Substitution Process
Probabilistic models in phylogenetics usually assume that sequence evolution can be seen as a Markovian process. This Markovian process is defined on a state space of size S (S = 4 for nucleotide, S = 20 for amino acid) and is characterized by a stochastic instantaneous rate matrix, Q, of size S x S. Given a stationary probability vector, {pi} (or "profile"), of size S, and a matrix {rho} of relative exchange rates between states, a stochastic matrix Q is obtained as follows:

Formula (1)

Formula (2)

At a site k of rate rk, and along a branch j of length tj, the evolutionary distance vjk is

Formula

A state l is substituted into a state m in an evolutionary distance vjk with probability:

Formula (3)

Nonstationary Model of Substitution
Homogeneous models of sequence evolution assume a single Markovian substitution process, defined by a single Q matrix operating along the whole tree. The Markovian process is assumed to be at equilibrium, and thus, the model is stationary. In this article, we design a nonstationary model of sequence evolution. We model shifts of sequence composition along lineages as a compound and piecewise constant stochastic process, defined as follows: discontinuous changes occur according to a Poisson process of rate {varepsilon}. At each discontinuity point, the profile {pi} of the substitution process (i.e., its stationary probability vector) switches to a new value {pi}', directly drawn from a prespecified distribution G0 on the simplex:

Formula

We use by default a uniform distribution for G0. Note that {pi}' is independent from {pi}, consequently the series of successive compositional shifts follows a 0th-order Markov process. The relative exchange rates {rho}, which are here considered as free parameters, are kept constant in the whole tree, so that the substitution process is described by a stochastic matrix Q = {pi}{rho} before, and Q' = {pi}'{rho} after, the discontinuity point. Thanks to these compositional shift events, our nonstationary model can take into account substitution processes specific to each part of the tree. For example, high stationary probabilities for G and C found in a given lineage will drive an evolution toward a GC richer content.

Ideally, the likelihood of particular values of the parameters (topology, branch lengths, etc.) has to be integrated over all realizations of the stochastic compositional shift process described above. However, computing this integral directly is intractable. We, therefore, use the MCMC approach to sample realizations of this process. We call "break points," or BP, the points at which the discontinuous changes occur (fig. 1). A realization of the process is then defined by the number (N) of break points, each of them being specified by its position in the tree and its associated profile.


Figure 1
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Splitting the tree into piecewise constant areas. Three BP are placed on the tree: a default break point placed at the root of the tree, defining the black area, and 2 other break points, defining the hatched and white areas. Each area involves a specific Markovian process of substitution.

 
Data Structure for the Nonstationary Model
Conditional on a particular tree, of total length T, the number N of break points follows a Poisson process of rate {varepsilon}. With N break points, plus a default root break point placed at the root of the tree (with index 0), we split the tree into N + 1 constant areas (i.e., where the substitution process is homogeneous).

A break point n is entirely characterized by its profile {pi}n, the lineage where it appears bn (i.e., the branch onto the break point is placed), and the point along the branch at which it appears xn (i.e., the relative position of the break point on its branch bn: 0 < xn < 1), where 0 ≤ n ≤ N. By extension, for the particular root break point, we defined b0 = 0, t0 = 0, and x0 = 0. Let us denote by Nj the number of break point on the jth branch. Thus, N is equal to {sum}j=12J–2 Nj, where J is the number of taxa and 2J – 2 the number of branches.

Given the global set of relative exchange rates {rho} and a break point profile {pi}n, one can compute the Qn matrix (eqs. 1 and 2). Then, substitution probabilities between states (eq. 3) are computed, in each area of the split tree, using in each case the relevant rate matrix.

Finally, as the stationary assumption does not apply, we cannot assume that stationary probabilities at the root are equal to those of the process at this point of the tree (i.e., {pi}0). As in Galtier and Gouy, we therefore define an extraparameter {pi}{infty}, which represents the stationary probabilities at the root. Altogether, the parameter vector {theta} of the nonstationary model is written:

Formula

Probability Densities of the Data Structure
The fact that we define break points with relative positions on branches induces a nontrivial prior distribution, which is explained here. Break points appear following a Poisson process of rate µ, as in Huelsenbeck et al. (1999)Go. Thus, the number Nj of break points on branch j, of length tj, follows a Poisson distribution of mean µtj:

Formula

Given Nj, all possible distributions of the Nj break points along the branch are equally likely. Denoting one such distribution by Xj = {x1,x2,...,xNj}, such that x1<x2<...<xNj the density of Xj is Formula where

Formula
The joint probability density for Nj and Xj is thus:

Formula
Taking the product over all branches and rearranging the factors yields the prior density of the overall break point distribution:

Formula

This last formula can be decomposed into 2 factors: the first factor is the probability of observing N break points on the whole tree, given a Poisson distribution of rescaled rate {varepsilon} = µT. The second is the probability density of a particular distribution of the N break points on the tree. We directly parameterize our model in terms of {varepsilon}, rather than µ. This allows a more direct interpretation of the results: {varepsilon} is simply the mean number of break points across the whole tree.

Canonical priors are used for all other model parameters. Specifically, we use a uniform prior over topologies ({tau}), a Gamma distribution of mean 1 and variance Formula for the rate across sites (r), an exponential prior of mean Formula for the branch lengths (t), an exponential prior of mean 1 for relative exchange rate parameters ({rho}), and finally, a uniform prior for the profiles ({pi}). The hyperparameters {alpha}, ß, and {varepsilon} are also free parameters of the model, and all follow exponential priors of mean 1.

Likelihood Computation
We can easily adapt the pruning algorithm of Felsenstein (1981)Go to compute the likelihood given the break points and the base frequencies. In effect, each break point is equivalent to a new node, so that a branch with Nj break points is subdivided into Nj + 1 segments. Along each segment, the vector of partial likelihoods is propagated using the relevant Q matrix, which is itself obtained by combining the global set of relative exchange rates with the local base frequencies.

MCMC Sampling
By Bayes theorem, the posterior probability is proportional to the prior times the likelihood:

Formula
where D denotes the data, M a given model, and {theta} a particular realization of its associated parameters. In order to obtain a sample from the posterior distribution of {theta}, we use the MCMC sampling method, based on the Metropolis–Hasting's algorithm. Applying MCMC to phylogenetic reconstruction problems has been developed by Larget and Simon (1999)Go, Huelsenbeck and Ronquist (2001)Go, and Huelsenbeck et al. (2002)Go. We have implemented the nonstationary model into the software "PhyloBayes" (Lartillot and Philippe 2004Go), which provides a MCMC implementation in a stationary context. Briefly, at each step of a MCMC, one modifies the current value of parameter vector {theta}, according to a stochastic kernel q({theta}, d{theta}'), obtaining a new value {theta}', which is accepted with probability:

Formula
where Formula is the ratio of posterior densities, or Metropolis ratio, and Formula is the Hastings ratio (Neal 1993Go), that is, the probability of proposing a backward modification on {theta}' that would exactly reverse the forward modification on {theta}, divided by the probability of the forward modification. Green (2003)Go provides a general formula for the Hastings ratio:

Formula (5)
where w and w' are the set of random numbers picked with distribution g and g', when modifying {theta} into {theta}', or symmetrically {theta}' into {theta}. The second factor Formula is the absolute value of the Jacobian determinant of the transformation from {{theta}, w} to {{theta}', w'}. Originally, Green's formula was introduced for dealing with reversible MCMC moves, but as noted by Holder et al. (2005)Go, this formula happens to be useful in much more general MCMC frameworks.

Update Mechanisms
Three stochastic kernels, or update mechanisms, were devised to update the break point structure mapped onto a given topology, allowing one to update the number, positions, and profiles of the break points. We also devised 3 topological update mechanisms that keep track of the break point structure and leave the total length of the tree and the number of break points unchanged: a "subtree pruning and regrafting" or SPR, as described by Swofford et al. (1996)Go, a "node sliding" as described by Lartillot and Philippe (2004)Go, and a topological move of the root's position. All these update mechanisms, and their corresponding Hastings ratios, are described in the Appendix. Rates across sites, relative exchange rates, branch lengths, and hyperparameters are updated as described by Lartillot and Philippe (2004)Go.

Nonstationary Model Configurations
Several variants of our nonstationary model can be proposed. Instead of considering the general compositional shift process, where state frequencies of the {pi} profiles are free parameters, one can constrain the model so that {pi}C = {pi}G and {pi}A = {pi}T, according to a GC ratio parameter. Additionally, rather than considering the number of break points and their positions as free parameters, it is possible to constrain the model so that a break point is placed at the beginning of each branch. In this way, our nonstationary model reduces to the models proposed by Galtier and Gouy or Yang and Robert. More specifically, if one uses GC ratio parameters, one obtains the model proposed by Galtier and Gouy, denoted in the following as GGGC, and otherwise, if profiles are left unconstrained, the settings are equivalent to the model proposed by Yang and Robert, denoted as YR{pi}. By homology, one denotes our model (considering the number of break points and their positions as free parameters) by BPGC and BP{pi}, depending on whether GC or unconstrained profiles are used. Finally, when constraining the number of break points to N = 0, and setting {pi}{infty} = {pi}0, the Markovian substitution process defined at the root is applied to the whole tree and our nonstationary model reduces to a stationary and homogeneous GTR + Gamma model, denoted in the following as STAT.

MCMC Settings
We define a MCMC cycle as the consecutive call to all relevant update mechanisms, given a model. Some update mechanisms are called several times, with different tuning parameters (see table S1, Supplementary Material online for details concerning a cycle). Continuous update mechanisms were tuned so as to reach an acceptance ratio of 30–70%. The transdimensional update (creating and deleting break point) cannot by tuned. Its acceptance ratio was highly variable, for example, of about 10% for a data set of 5 16S rRNAs, and seems to decrease as the lengths of sequences increase. We run chains for a total of 500,000 cycles, discarding a burn-in period of 100,000 cycles and saving 4,000 samples among the 400,000 remaining points. Some chains were performed under free topology, in which case we computed the majority-rule consensus (Margush and McMorris 1981Go) from the topologies of the 4,000 resulting samples. When chains were run under fixed topology, we estimated for each branch j the mean number of break points:

Formula
where A is the number of samples, and Njais the number of break points placed on branch j at sample a. We also computed the mean profile per branch:

Formula
where Formula is the mean frequency, averaged over A samples, of state i along branch j, xn+1a and xna are the relative positions of 2 consecutive break points (with x0a = 0 and xNj+1a = tja), tja is the length of branch j, {pi}ina is the frequency of state i defined at break point n, and a denotes a particular sample. We estimated standard errors (SE) on Formula and Formula using the "window estimators" method described by Geyer (1992)Go, discussed by Raftery and Lewis (1992)Go, and used by Wilson et al. (2003)Go. The method consists in estimating the effective size of the sample from its autocorrelation function. The SE is then equal to the standard deviation divided by the square root of the effective size. To check convergence of the MCMC, each experiment was run twice.

Model Comparison by Bayes Factor Evaluation
Given a data set D, the relative fit between 2 models M0 and M1 can be formulated by the ratio of their marginal likelihoods:

Formula
where

Formula

A Bayes factor B greater (respectively, lower) than one indicates a support in favor of model M1 (respectively, M0). To numerically estimate the Bayes factor, we used the thermodynamic integration method (Ogata 1989Go; Gelman et al. 2004Go). An implementation of this method is provided in the PhyloBayes program (Lartillot and Philippe 2006Go). Specifically, we used the "model-switch" scheme as defined in that paper. We performed several types of thermodynamic integrations: 1) between one of the BP{pi}, BPGC, YR{pi}, GGGC, and STAT models and the stationary model already implemented in the PhyloBayes program, under free or fixed topology, and 2) between 2 topologies under a fixed model. The first type of thermodynamic integration allows us to compare fits of model configurations using the stationary model as a reference. The second type of thermodynamic integration is a way of determining the relative support of 2 candidate topologies under a given model. Although sampling the topology space already gives such an answer, problems in the chains' mixing behavior may be encountered, and thus evaluation of the following Bayes factor:

Formula
where {tau}1 and {tau}2 are the 2 candidate topologies, provides a confirmation of the results obtained under free topology.

For each experiment, we ran a 1,000,000 cycle long bidirectional thermodynamic integration. For each direction, we got a set of 1,000 samples. Sampling, thermic lag, and discretization errors are combined into a single 95% confidence interval for the Bayes factor approximation, as proposed by Lartillot and Philippe (2006)Go.


    Material
 TOP
 Abstract
 Introduction
 Methods
 Material
 Results
 Discussion
 Supplementary Material
 Appendix
 Acknowledgements
 References
 
We first applied the nonstationary model to a data set of 5 eubacterial (Thermus thermophilus, Deinococcus radiodurans, Bacillus subtilis, Thermotoga maritima, and Aquifex pyrophilus) 16S rRNAs, assembled by Embley et al. (1993)Go. A topology {tau}1, supported by much independent evidences (Murray 1991Go; Eisen 1995Go; Gupta 1998Go), groups T. thermophilus with D. radiodurans, to the exclusion of B. subtilis, T. maritima, and A. pyrophilus (fig. 2A). However, this set of sequences is known to be prone to phylogenetic reconstruction artifacts under stationary models due to the attraction of sequence of similar composition (Embley et al. 1993Go; Mooers and Holmes 2000Go; Foster 2004Go). The artifact leads to group together the mesophilic bacteria D. radiodurans and B. subtilis, leading to topology {tau}2 (fig. 2B). In the following, we will call {tau}1 and {tau}2, respectively, as the "correct" and the "artifact" topology.


Figure 2
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Candidate phylogenies for bacteria Thermus thermophilus, Deinococcus radiodurans, Bacillus subtilis, Thermotoga maritima, and Aquifex pyrophilus. (A) The assumed correct phylogeny. (B) A commonly obtained reconstruction artifact, where mesophilic bacteria with similar GC content attract together (as well as thermophilic GC richer bacteria, which also attract together). Percentages indicate the GC content of each species.

 
We additionally compared the fits of the 4 nonstationary model configurations (BP{pi}, YR{pi}, BPGC, and GGGC, see Methods), on several sets of bacterial 16S rRNAs and of yeast genes. First, we analyzed 4 data sets of 5, 10, 15, and 20 Proteobacteria and Deinococci 16S rRNAs (species belonging to data sets 5, 10, and 20: Pelobacter propionicus, Photobacterium profundum, Vitreoscilla stercoraria, Sulfurospirillum arcachonensis, and D. radiodurans; to data sets 10, 15, and 20: Thioploca ingrica, Burkholderia pseudomallei, Zymomonas mobilis, Alvinella pompejana epibiont, and T. thermophilus; to data sets 15 and 20: Syntrophus gentianae, Flavimonas oryzihabitans, Leptothrix cholodnii, Rhodospirillum molischianum, and Deinococcus murrayi; to data set 15: Desulfuromusa bakii, Zymobacter palmae, Rickettsia honei, Campylobacter rectus, and Thermus ruber; and to data set 20: Desulfovibrio fairfieldensis, Nitrosomonas europae, Acidosphaera rubrifaciens, Arcobacter cryaerophilus, and Thermus filiformis).

Second, we analyzed the BAS1 gene, chosen among the 106 genes of the data set assembled by Rokas and Carroll (2005)Go. In the latter case, we investigate 2 versions of the data set, comprising 7 and 14 species (species belonging to data sets 7 and 14: Saccharomyces paradoxus, Saccharomyces kudriavzevii, Saccharomyces castellii, Saccharomyces kluyveri, Kluyveromyces lactis, Debaryomyces hansenii, and Yarrowia lipolytica; and to data set 14: Saccharomyces cerevisiae, Saccharomyces mikatae, Saccharomyces bayanus, Candida glabrata, Candida albicans, Ashbya gossypii, and Kluyveromyces waltii).

For each data set, we evaluated the Bayes factors using thermodynamic integration between the nonstationary models and the reference stationary model, under fixed topology (see Methods). For the 4 bacterial 16S rRNAs data sets, the topologies were obtained using the MrBayes software (Huelsenbeck and Ronquist 2001Go) under the default GTR + Gamma model, and for the 2 yeast gene data sets, we use the topology obtained using MrBayes, by Jeffroy et al. (2006)Go on amino acid sequences. In the latter case, MrBayes chains were run under the WAG + Gamma + Invariant model.


    Results
 TOP
 Abstract
 Introduction
 Methods
 Material
 Results
 Discussion
 Supplementary Material
 Appendix
 Acknowledgements
 References
 
Check of Model and Implementation
We performed several checks of our implementation. First, when the likelihood terms in the Metropolis–Hastings ratio are omitted, the MCMC should yield a sample from the prior distribution defined by our model, which we checked, marginally, on several parameters of interest (break point number and profiles, relative exchange rates, branch lengths, fig. S1, Supplementary Material online; and bipartitions, fig. S2, Supplementary Material online). Second, we compared the posterior mean values obtained under the default GTR + Gamma model of MrBayes, with those of our model configured as closely as possible to MrBayes (table S2, Supplementary Material online). All parameter posterior values determined under our model were close to those estimated by MrBayes: the largest relative difference is of 0.8%, obtained for the total tree length. This latter difference still represents 5 times the SE, but this could be explained by the fact that our stationary configuration is not strictly identical to that of MrBayes. In particular, the prior of the relative exchange rates is not the same. Finally, we performed simulations, and measured the hit probabilities, as was done by Wilson et al. (2003)Go. The underlying idea is that, if the implementation is correct, the expected fraction of the simulations for which the true (simulation) value of a given parameter falls within the p x 100% confidence interval should be equal to p. Our checks (table S3, Supplementary Material online) are consistent with these expected fractions.

Posterior Values of Model Parameters on Fixed Topologies
As a way of illustrating the behavior of our model, we performed a series of fixed topology analyses. We considered the data set of 5 16S rRNAs (T. thermophilus, D. radiodurans, B. subtilis, T. maritima, and A. pyrophilus) and run chains under the BP{pi} model, fixing the topology to its correct {tau}1 or to its artifact {tau}2 configuration. We rooted the tree as in Olsen et al. (1994)Go and Galtier and Gouy (1998)Go, in the branch leading to A. pyrophilus. As the number of break points on each branch, as well as their respective profiles, may change during the MCMC, we propose to visualize the average effect of all these fluctuations by just looking at the mean posterior number of break points, and at the mean posterior profiles of stationary probabilities, on each branch (fig. 3 and table S4, Supplementary Material online).


Figure 3
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Posterior mean profiles and number of break points per branch obtained under the nonstationary model on the artifact (A and C) and the correct (B and D) topology. (A) and (B): posterior mean profiles, with the {pi}{infty} profile placed at the very bottom of the figure. (C) and (D): posterior mean number of break points.

 
On both the correct and the artifact topologies, AT rich profiles are favored along terminal branches leading to B. subtilis and D. radiodurans (fig. 3A and B, Supplementary Material online) and more specifically in the case of the artifact topology, also along the internal branch leading to the clade (B. subtilis and D. radiodurans) (fig. 3A). The model thus takes into account the compositional shift of B. subtilis and D. radiodurans toward an AT richer content. Moreover, one obtains mean posterior numbers of break point (1) of 1.030 on the B. subtilis and D. radiodurans ancestor branch, and of at most 0.074 on all other branches, for the artifact topology (fig. 3C), and (2) of 1.234 on the B. subtilis branch, of 1.269 on the D. radiodurans branch, and of at most 0.117 on all other branches, for the correct topology (fig. 3D). In other words, on average, the chains mainly sampled break points on branches leading to the most significantly biased sequences of the data set. Moreover, the model parsimoniously adapts to the correct, or to the artifact, topology and explains the compositional shift of B. subtilis and D. radiodurans toward AT richness, respectively, as a convergent evolution (2 independent events) or as a shared derived character (1 ancestral event).

Comparison of 2 Candidate Topologies
We then wanted to know which of the 2 candidate topologies is preferred, depending on the model used, that is, the nonstationary model BP{pi} or the stationary model STAT. As expected from previous analyzes (Foster 2004Go), under the STAT model B. subtilis groups with D. radiodurans (artifact topology {tau}2), with a posterior probability of 0.97. In contrast, under the BP{pi} model, we obtained the clade (T. thermophilus and D. radiodurans) with a posterior probability of 1 (correct topology {tau}1).

As a confirmation, we evaluated the relative support of the {tau}1 and {tau}2 topologies, by thermodynamic integration under a fixed model, either BP{pi} or STAT (see Methods). A positive value of the logarithm of the Bayes factor (95% credibility interval [2.4, 15.7]) is obtained under the nonstationary model, whereas a negative value (in interval [–9.5, –0.1]) is obtained under the stationary model. This means that the correct {tau}1 topology better fits the data under the BP{pi} model, and in contrast, that the artifact {tau}2 topology is chosen by STAT. These results are consistent with our analyses under free topology and show that our model is able to recover the correct topology.

As previously suggested by Foster (2004)Go, this disagreement between the 2 models may be explained by the fact that the stationary model tends to artifactually group together unrelated taxa sharing similar base composition. In contrast, the nonstationary model, handling compositional heterogeneities, would be able to discern the phylogenetic signal from the compositional bias. If this interpretation is correct, one would expect a better fit on this data set for the nonstationary model than for the stationary one. We, therefore, compared the 2 models by Bayes factor evaluation, using the thermodynamic integration method. Importantly, as the 2 models do not favor the same phylogeny, the integration was done under free topology (see Methods). We estimated the logarithm of the Bayes factor to lie in a 95% credibility interval of [59.3, 67.2] (table 1). This estimation strongly rejects the stationary model in favor of the BP{pi} model and thus retrospectively confirms previous results and assumptions considering the {tau}1 topology to be closer than {tau}2 to biological reality, and {tau}2 to be an artifact caused by compositional bias (Murray 1991Go; Embley et al. 1993Go; Eisen 1995Go; Gupta 1998Go; Mooers and Holmes 2000Go; Foster 2004Go).


View this table:
[in this window]
[in a new window]

 
Table 1 Logarithm of the Bayes Factor Estimated for Several Alternative Models on 16S rRNAs (Thermus thermophilus, Deinococcus radiodurans, Bacillus subtilis, Thermotoga maritima, and Aquifex pyrophilus). The Stationary Model Is Used as a Reference (the comparison to the STAT configuration is a control, which is expected to be close to 0). 95% Credibility Interval Is Shown

 
Comparison between Nonstationary Model Configurations
Several nonstationary models have already been proposed (Yang and Roberts 1995Go; Galtier and Gouy 1998Go; Foster 2004Go), which differ mostly by the kind of bias that they consider (i.e., GC or general biases). In addition all these models are branchwise (i.e., the stationary probabilities of the substitution process are reassessed at the base of each branch). We wanted to provide a comprehensive analysis of the relative merits of some of these models, in particular those of Yang and Roberts (1995)Go and Galtier and Gouy (1998)Go, whose configurations can be reproduced in our implementation (i.e., YR{pi} and GGGC, see Methods). We, therefore, performed a general Bayes factor analysis of the bacterial 16S rRNA data set introduced in the previous section, under all these model configurations and using thermodynamic integration under free topology.

According to our results, all nonstationary models better fit the data than the stationary model, confirming that the stationary model is strongly rejected on this data set (table 1). In addition, among the nonstationary models under consideration, BP{pi} obtained the best Bayes factor and YR{pi} the worst. However, the estimated Bayes factors are very close to each other, which suggests that the differences are not significant in this case.

In order to get a more informative view of the relative merits of the nonstationary models, we performed additional Bayes factor evaluations on 4 other data sets of 5, 10, 15, and 20 Proteobacteria and Deinococci 16S rRNAs (see Material). On the 10, 15, and 20 species data sets, the estimated Bayes factors are all positive, rejecting the stationary model in favor of the nonstationary ones (fig. 4A). Only the 5 species data set behaves differently. However, it displays weak compositional heterogeneity and is close to stationarity (a {chi}2 test yields a minimum P value of 0.22 over the 5 taxa). Accordingly, on this data set, all models except BP{pi} are rejected in favor of the stationary model. Interestingly, the mean posterior number of break points sampled under the BP{pi} model is close to 0 in this case (95% CI = [0, 2], table 2), indicating that, when the analyzed data display no significant compositional bias, the compound process model reduces itself to the stationary model.


Figure 4
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— (A) Bayes factor estimations, obtained by thermodynamic integration under fixed topology, on 4 16S rRNA data sets. Error bars stand for 95% CI. (B) Mean posterior number of free parameters for each of the considered data sets (inferred from table 2).

 

View this table:
[in this window]
[in a new window]

 
Table 2 Posterior Mean Number of Break Points Sampled under Nonstationary Models, on Data Sets of 5, 10, 15, and 20 Proteobacteria and Deinococci 16S rRNAs. 95% CI Are Shown within Brackets

 
As shown in table 2, the mean posterior number of break points inferred by BP{pi} and BPGC models remains smaller than the number of branches, which implies that these models always use fewer free parameters than their homologous branchwise versions. This could indicate that not all the stationary profiles assumed by the YR{pi} and GGGC models (i.e., one per branch) are useful and, thus, that some of them represent no real compositional shift events. Consistent with this observation, we note that the YR{pi} model systematically obtained the worst Bayes factor and, at the same time, handles the largest number of free parameters (table 2 and fig. 4B). These correlation between Bayes factors and model dimensionality can be explained as follows: for n successive branches along which there are no compositional shift events, the YR{pi} has to infer n times anew the same profile; this is a highly unlikely configuration a priori, which penalizes the model by lowering its marginal likelihood. Note that this interpretation does not totally fit all the observations. In particular, the BPGC model handles the smallest number of free parameters, yet it obtains a smaller Bayes factor than its homologous model, GGGC. However, this may be due to the conservative prior we chose for the {varepsilon} parameter, which leads the mean number of break points {varepsilon} and, consequently, the posterior number of break points to tend toward one. This prior seems to penalize the BPGC model, compared with its homologous GGGC having a fixed number of break points, especially when many break points must be fitted, that is, when the number of analyzed biased sequences increases, as is the case here (table 2).

Finally, the 2 GC models (BPGC and GGGC) obtained the best fit on the data sets of 15 and 20 species. This may indicate that these data sets do not contain significant biases other than GC biases. Consistent with this, rRNA stems have similar proportions of G and C nucleotides (Higgs 2000Go), and thus, the observed compositional biases should be well described by GC ratio parameters. In this case, GC bias–based models avoid another overparameterization effect as they do not have to repeatedly infer similar proportions for A and T and for G and C.

However, not all biases are GC, and therefore, to offer a more complete spectrum of model comparisons, we analyzed data sets displaying more unequal composition in G and C nucleotides. We evaluated the Bayes factors of the nonstationary models for 2 data sets of 7 and 14 yeast species, for the BAS1 gene (see Material). In these 2 cases, the BPGC and GGGC were penalized and did not obtain the best Bayes factor (fig. 5A). Importantly, on the 14 species data set, the YR{pi} model again obtained the worst Bayes factor and, at the same time, was the model involving the highest number of free parameters (table 3 and fig. 5B). In contrast, on both data sets, the BP{pi} model obtained the best fit. Note that only 2.2 (95% CI = [1, 4]), and 4.1 (95% CI = [2, 6]), break points are inferred on average, respectively, on the 7 and 14 species data sets (table 3), which results in a considerably smaller number of free parameters, compared with YR{pi}. These observations reinforce the interpretation proposed above for the lack of fit of YR{pi}, that is, that it is fundamentally an overparameterization problem.


Figure 5
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 5.— (A) Bayes factor estimations, obtained by thermodynamic integration under fixed topology, on 2 yeast gene data sets. Error bars stand for 95% CI. (B) Mean posterior number of free parameters for the 2 data sets (inferred from table 3).

 

View this table:
[in this window]
[in a new window]

 
Table 3 Posterior Mean Number of Break Points Sampled under Nonstationary Models, on Data Sets of 7 and 14 BAS1 Genes of Yeast Species. 95% CI Are Shown within Brackets

 
Importantly, in the present case, this overparameterization phenomenon causes YR{pi} to obtain a worse fit than GGGC. Thus, relying on branchwise models only, one would conclude that the bias of the 14 species data set is a pure GC bias, rather than a more general one. However, the BAS1 gene displays very unequal proportions in G and C (35% of A, 16% of C, 28% of G, and 21% of T). Consistent with this, the break point models show a better fit in favor of general biases (BP{pi}), compared with GC biases (BPGC). Hence, we are here in a case where the lack of control of the number of parameters inherent to branchwise models would have resulted in a wrong biological interpretation. In contrast, our compound process model, which is able to control its dimensionality according to the data, provides a more reliable conclusion.


    Discussion
 TOP
 Abstract
 Introduction
 Methods
 Material
 Results
 Discussion
 Supplementary Material
 Appendix
 Acknowledgements
 References
 
The nonstationary model introduced here differs from previous full-likelihood–based models handling compositional bias phenomena (Yang and Roberts 1995Go; Galtier and Gouy 1998Go; Foster 2004Go) by allowing one to infer a free number of compositional shift events along lineages. This was done using the compound stochastic process method, inspired from Huelsenbeck et al. (1999)Go, to model variations of substitution rates along lineages. To deal with the implied variations in the model dimensionality, we used the Green (2003)Go formalism. Compared with the models proposed by Galtier and Gouy and Yang and Robert, and as we were able to show by Bayes factor evaluations, our model is less subject to overparameterization effects, especially when many species are analyzed.

The overparameterization issue seems to be highly important and is particularly conspicuous in the case of the branchwise general compositional shift model (the Yang and Robert–like settings). In our experiments, this model always involved the greatest number of parameters and, at the same time, obtained the worst fit. One would expect this overparameterization problem to loose its importance as the length of the alignment increases as the branchwise versions of the nonstationary model are consistent in the limit of infinite sequence length (Chang 1996Go). But in practice, it should be remembered that many phylogenetic studies are conducted with rRNA, using a large number of taxa (Maidak et al. 1996Go; Cole et al. 2003Go). As was demonstrated previously by Hasegawa and Hashimoto (1993)Go, rRNAs are often compositionally biased and should therefore be investigated using adequate nonstationary models. Yet, in such cases, branchwise models will probably not be so reliable because of overparameterization. In contrast, our break point version should behave more reliably. Branchwise models may also be problematic when applied to the amino acid sequences, as amino acid alphabet implies 19 free parameters per branch (instead of 3 for nucleotides), which would have overwhelming deleterious consequences on the fit of the model and maybe also on the estimated phylogeny. Because amino acid sequences can be biased (Foster et al. 1997Go; Foster and Hickey 1999Go), an efficient nonstationary model is also needed in this case. Normally, the break point model should be able to handle this correctly.

It is not clear whether the branchwise model of Foster is sensitive to such overparameterization effects. It was, in fact, explicitly designed to avoid these problems. However, the equilibrium frequencies still need to be rechosen among the available ones at the base of each branch, which also has a cost, in principle, and possibly a significant one in a context where there are few compositional shift events, compared with the number of nodes. In addition, at least in the current version of this model, the number of profiles needs to be fixed a priori, which lacks flexibility. In this respect, it would be informative to modify this model such as it handles a free number of profiles, using reversible-jump Monte Carlo (e.g., as in Green 2003Go) and then to compare this modified version to our model. Or conversely, to draw the profile of each break point of our compound process model from a predefined set of profiles, as in Foster's model.

Apart from this, some improvements of the realism of our model can be considered. First, the exponential prior on the apparition rate of compositional shift events is conservative and penalizes the model when many events have to be fitted. To avoid this problem, one could instead use another prior (e.g., uniform). Second, one could change the uniform distribution G0, from which profiles are a priori created and evaluated, to a generalized Dirichlet, whose hyperparameters can also be inferred. Third, modeling the compositional shifts as piecewise constant processes is not realistic and should be considered as a convenient mathematical device. In this respect, another improvement of the model would be to use a first order, rather than a 0th order, Markov process, when creating the profile of a new break point. Each break point profile depending on the profile of the previous break point, more break points may be created, which may allow to model quasi continuous compositional shift, although at the cost of an increasing computational time (the complexity of our modified pruning algorithm depending linearly on the number of break points). All these elaborations of our model could improve the quality of our reconstruction of the history of compositional trends of the substitution process. In each case, Bayes factor evaluation can be performed to see which of these configurations actually improve the resulting fit.

In a completely opposite direction, one could try to simplify the current model and keep only its most essential aspects. In particular, if one is not so much interested in the detailed reconstruction of the history of the compositional shifts, but only in the phylogenetic reconstruction, it might make sense to consider the branch of the phylogenetic tree as the fundamental unit of resolution. In this context, one could go back to the usual practice consisting in constraining the stochastic events to appear at the tree nodes, although now, not systematically, but with a probability that could itself be estimated from the data. This proposition remains different from the models proposed by Foster, Galtier and Gouy, and Yang and Robert, as (1) the equilibrium frequencies are not systematically reassessed for all branches, and as (2) each event leads to an independent compositional drift. Such a simplified version of the compound stochastic process presented here may have the same statistical properties than more complex versions, while avoiding the overparameterization pitfalls.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Methods
 Material
 Results
 Discussion
 Supplementary Material
 Appendix
 Acknowledgements
 References
 
Supplementary figures S1 and S2 and tables S1–S4 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Appendix
 TOP
 Abstract
 Introduction
 Methods
 Material
 Results
 Discussion
 Supplementary Material
 Appendix
 Acknowledgements
 References
 
Creating and Deleting Break Points
Note that, in the following, w will denote a random number picked from a uniform distribution.

The number of break points N is a free parameter of the nonstationary model. The "create/delete" MCMC update mechanism creates, or deletes, a break point with probability Formula or Formula if N > 0. If N = 0, then there is no break point to delete and pcreate = 1. When deleted, a break point is uniformally picked among the N extant ones (except the default root break point, which cannot be destroyed for obvious reasons). The probability to delete one of the extant break points is thus:

Formula

When created, the position of the new break point on the tree, that is, its branch b and its relative position x on its branch, are picked uniformally. The probability of creating a break point, at any relative position, on a given branch j is thus Formula The profile {pi} of the newly created break point is randomly picked following a uniform Dirichlet distribution: p({pi}) = Dir({pi}). Taking the product yields the overall probability:

Formula

By definition, the Hastings ratio, H, is the probability of the backward move divided by the probability of the forward move. Thus, in the case of break point creation, Formula and reciprocally, in the case of break point deletion, Formula

Formula (6)

Formula (7)

Note that some factors involved in these expressions will cancel out with the ratio of prior probabilities, Formula According to equation (4), this ratio is:

Formula (8)
in the case of a creation, and

Formula (9)
in the case of a deletion. Combining equation (6) with equation (8) and equation (7) with equation (9) yields factored Hastings-prior ratios, HP:

Formula
that is, in the case of a creation: Formula if N = 0 and Formula if N > 0, and in the case of a deletion: Formula if N = 1 and Formula if N > 1.

Updating Break Point Positions
To update the relative positions of break points, we randomly pick one of them, except the default root break point, and set its new relative position as x' = x + {lambda}(w – 0.5), where {lambda} is the tuning parameter, and w is a uniform [0, 1] number. If x' ≤ 0 or x' ≥ 1, we reflect x' back into [0, 1]. The corresponding Hastings ratio is 1. Note that this update mechanism may swap the relative positions of 2 break points on their branch, and as a result, modify the effect areas.

Updating Break Point Profiles
To update break point profiles, we uniformally pick one of them, including the profile of the default root break point. The new profile {pi}' is picked from a Dirichlet distribution centered on the current {pi} profile value: {pi}' ~ Dirichlet({lambda}{pi}1, {lambda}{pi}2, ...,