Skip Navigation

This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (14)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Schadt, E.
Right arrow Articles by Lange, K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Schadt, E.
Right arrow Articles by Lange, K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Molecular Biology and Evolution 19:1534-1549 (2002)
© 2002 Society for Molecular Biology and Evolution

Codon and Rate Variation Models in Molecular Phylogeny

Eric Schadt*{dagger} and Kenneth Lange*{ddagger}

*Department of Biomathematics, UCLA School of Medicine, Los Angeles, CA;
{dagger}Computational Genomics, Rosetta Inpharmatics, Kirkland, WA;
{ddagger}Department of Human Genetics, UCLA School of Medicine, Los Angeles, CA


    Abstract
 TOP
 Abstract
 Introduction
 Theory and Methods
 Discussion
 Acknowledgements
 Appendix
 References
 
This article generalizes previous models for codon substitution and rate variation in molecular phylogeny. Particular attention is paid to (1) reversibility, (2) acceptance and rejection of proposed codon changes, (3) varying rates of evolution among codon sites, and (4) the interaction of these sites in determining evolutionary rates. To accommodate spatial variation in rates, Markov random fields rather than Markov chains are introduced. Because these innovations complicate maximum likelihood estimation in phylogeny reconstruction, it is necessary to formulate new algorithms for the evaluation of the likelihood and its derivatives with respect to the underlying kinetic, acceptance, and spatial parameters. To derive the most from maximum likelihood analysis of sequence data, it is useful to compute posterior probabilities assigning residues to internal nodes and evolutionary rate classes to codon sites. It is also helpful to search through tree space in a way that respects accepted phylogenetic relationships. Our phylogeny program LINNAEUS implements algorithms realizing these goals. Readers may consult our companion article in this issue for several examples.


    Introduction
 TOP
 Abstract
 Introduction
 Theory and Methods
 Discussion
 Acknowledgements
 Appendix
 References
 
If the promise of the burgeoning DNA sequence and protein databases is to be realized, then more refined models of evolution at the sequence level must be constructed. Not only must these models be more realistic, they must also permit fast computation in frequentist and Bayesian statistical inference. This is a tall order. There are several ways to increase the realism of single-nucleotide substitution models. For instance, one can work at the protein level and consider transitions between amino acids or between secondary-structure within proteins rather than between nucleotides (Koshi and Goldstein 1997Citation , 1998Citation ). Although this traditional approach has the advantage of taking evolution and sequence conservation seriously, it suffers from two defects.

First, not all sequence conservation occurs within coding regions. Regions of gene regulation also experience the guiding hand of selection. Second, amino acid substitution models blur the distinction between mutation, which operates at the nucleotide level, and selection, which operates at the amino acid level. One can reasonably expect the rate of mutation at a nucleotide site to be independent of the role its resident nucleotide plays at the codon level. Whether selection retains or eliminates a mutation depends on whether the corresponding codon substitution is synonymous or nonsynonymous. Some nonsynonymous substitutions can be more easily tolerated than others. These considerations suggest that we view selection as accepting some codon substitutions and rejecting others according to definite probabilistic rules (Hasegawa 1986Citation ; Adachi and Hasegawa 1992Citation ; Goldman and Yang 1994Citation ; Pedersen, Wiuf, and Christiansen 1998Citation ; Yang, Nielsen, and Hasegawa 1998Citation ). More realistic modeling must recognize that some codon sites experience stronger selection than other sites. This fact suggests that one allows for variation in the rate of evolution among sites (Koshi and Goldstein 1998Citation ; Yang, Nielsen, and Hasegawa 1998Citation ; Yang 2000Citation ). Finally, because conserved catalytic domains, protein folds, and tertiary structures extend over both consecutive and nonconsecutive codon sites, it is desirable to pursue models that allow for spatial correlation in rate variation (Yang 1995Citation ; Felsenstein and Churchill 1996Citation ; Wagner, Baake, and Gerisch 1999Citation ).

These broad research trends have been set in motion by the researchers just noted. Our goal here is to refine some of the current models in interesting and productive ways and show how these models fit into fast maximum likelihood estimation. For example, we advocate modeling the acceptance of nonsynonymous substitutions using equivalence classes of amino acids on the basis of physical and chemical similarities. Modeling of spatial variation in the rate of evolution has been limited to the construction of hidden Markov chains and coupled Ising chains connecting adjacent nucleotide sites (Yang 1995Citation ; Felsenstein and Churchill 1996Citation ; Wagner, Baake, and Gerisch 1999Citation ). These natural developments parallel the successful applications of hidden Markov chains in contexts such as gene finding, similarity searches, and homology recognition (Gribskov, McLachlan, and Eisenberg 1987Citation ; Tatusov, Altschul, and Koonin 1994Citation ; Yi and Lander 1994Citation ; Burge and Karlin 1997Citation ). But there is a growing recognition of the need to liberate phylogenetic modeling from the unidirectional flow of information implicit in Markov chains and simple nearest neighbor interactions required by Ising chains (Koshi and Goldstein 1998Citation ; Yang, Nielsen, and Hasegawa 1998Citation ). The most nuanced models will achieve these goals with codons rather than nucleotides as the units of selection.

In our view, the theory of Markov random fields offers a perfect framework for expansion of the rate variation models. In particular, we demonstrate that imposing a Gibbs prior with limited spatial interactions is fully compatible with fast likelihood evaluation. This opens the door to more realistic, but still parsimonious, codon models. On the numerical side, we present several algorithmic improvements in likelihood evaluation that can decrease computing times by at least two orders of magnitude. To stimulate the wider use of sophisticated models, we have implemented all of our advances in the computer program LINNAEUS. Applications of LINNAEUS appear in our companion article (Schadt, Sinsheimer, and Lange 2002Citation ) and will not be pursued here.


    Theory and Methods
 TOP
 Abstract
 Introduction
 Theory and Methods
 Discussion
 Acknowledgements
 Appendix
 References
 
Our point of departure is our previous generalization of Kimura's nucleotide substitution model (Kimura 1980Citation ; Schadt, Sinsheimer, and Lange 1998Citation ). This generalization possesses just enough symmetry to permit explicit diagonalization and exponentiation of the infinitesimal generator, or rate matrix, of the underlying continuous-time Markov chain. Given the partial derivatives of the infinitesimal generator, we have shown how to evaluate the partial derivatives of the likelihood almost as quickly as the likelihood itself. Such evaluation is important because most methods of maximum likelihood search rely heavily on gradient information. Our previous algorithmic advances have even greater relevance to codon models, with 61 states, than to nucleotide models, with just four states. Fortunately, many of the analytic successes at the nucleotide level carry over to the codon level. The algorithms for quick evaluation of the likelihood and its derivatives are cases in point. Other tasks such as matrix exponentiation are less tractable. But considerable progress can be made even here if one pays heed to the sparseness and reversibility properties of the infinitesimal generator. In particular, it is profitable to relate reversibility at the codon level to reversibility at the nucleotide level.

Nucleotide Models
Codon models are preferable to amino acid models because they incorporate detail at the mechanistic level at which mutation operates. The codon models of Goldman and Yang (1994)Citation and Muse and Gaut (1994)Citation rely on a continuous-time Markov chain with 61 states. Mutations to one of the three stop codons totally disrupt gene function and can be ignored safely. Of course, codon models are no panacea. Nucleotide substitution is seldom completely random in conserved regions flanking coding regions, near intron-exon boundaries, or in intronic regions such as the silencer and enhancer regions implicated in regulation and splicing. In the case of regulation, substitutions can affect transcription and transcript stability; in the case of splicing, substitutions can change the entire structure of the translated protein (Miyata and Yasunaga 1980Citation ; Li, Wu, and Luo 1985Citation ; Nei and Gojobori 1986Citation ; Li 1993Citation ).

As a basis for our codon models, we adopt the generalization of Kimura's nucleotide substitution model proposed by Schadt, Sinsheimer, and Lange (1998)Citation . Kimura's model and subsequent generalizations view a phylogeny as a binary tree with contemporary taxa forming the leaves. Evolution occurs from site to site and branch to branch via independent Markov chains bouncing back and forth among the four possible states A (adenine), G (guanine), C (cytosine), and T (thymine). These continuous-time chains share the common infinitesimal generator


in our earlier nucleotide model. All eigenvalues and eigenvectors of {Lambda} are real and explicitly computable. This fact facilitates finding the equilibrium distribution and forming the finite-time transition matrix P(t) = exp(t{Lambda}). The entry pij(t) of P(t) represents the probability that a daughter node of the tree occupies state j at time t given that its mother node occupies state i at time 0. Note that {Lambda} embodies the assumption that transversion rates depend only on their destination states and not on their initial states. This assumption allows more flexibility than competing models (Kimura 1980Citation ; Goldman and Yang 1994Citation ; Muse and Gaut 1994Citation ) currently available in widely used phylogeny programs. Although we presently ignore nucleotide and codon gaps, the models discussed here can accommodate gaps as additional states. If these models prove useful, we will extend them to allow for gaps.

Rate Variation Among Lineages
Because, in general, it is impossible to separate the time parameter for each branch from the kinetic parameters, we adopt Yang's (1994)Citation convention and set one of the branch lengths equal to 1. This convention allows a branch length to be interpreted as the expected number of changes between the two taxa at the ends of the branch. The nucleotide substitution models sketched above implicitly assume that the rate of evolution across lineages is constant. To allow for stochastic variation in branch lengths, one can introduce randomized times for each branch.

There are two ways of choosing these times, both involving gamma distributed random variates. In the traditional model (Nei and Gojobori 1986Citation ; Tamura and Nei 1993; Yang 1993Citation ; Kelly and Rice 1996), each branch length at a given site is multiplied by the same gamma variate with mean 1. The gamma variates at different sites are chosen independently. Unfortunately, this procedure enormously complicates likelihood evaluation because the likelihood at the site must be integrated over all possible realizations of the gamma variate. Although one can carry out such quadratures numerically, using say Laguerre polynomials, accuracy tends to be low, and computation times tend to be high, particularly when derivatives of the likelihood are also sought.

Alternatively, one can imagine multiplying the different branch lengths at a given site by independent gamma distributed variates of mean 1. Although this creates even more integrals, these can be done along each branch separately as part of a single overall likelihood evaluation. Furthermore, as noted below, the necessary quadratures can be performed analytically. To promote model parsimony, it is preferable to specify the same shape parameter across both branches and sites for the independent gamma variates. In this regard, recall that the standard parameterization of a gamma distributed variate T involves a scale parameter {zeta} and a shape parameter {nu}. The shape parameter determines the coefficient of variation of T through the equation


The constraint E(T) = {nu}/{zeta} = 1 forces {zeta} = {nu}.

In our alternative model with independent gamma multipliers rather than a common gamma multiplier, we must compute the unconditional finite-time transition matrix Q for evolution along each branch. Assuming a branch has average length 1/{zeta} and the random multiplier T has mean 1 and shape parameter {nu}, one calculates Q using the formula


The most natural method of computing Q({Lambda}) is to consider it to be a matrix version of the analytic function


defined for Real({lambda}) < {zeta}. Differentiation under the integral sign and integration by parts show that f({lambda}) satisfies the ordinary differential equation


with solution f({lambda}) = (1 - {lambda}/{zeta})-{nu} subject to the initial condition f(0) = 1.

Although this insight is useful, practical computation of Q({Lambda}) is best achieved when {Lambda} is diagonalizable (Horn and Johnson 1991Citation , 520 p.). If {Lambda} = UDU-1 with D diagonal having diagonal entries di, then


where f(D) is the diagonal matrix having diagonal entries f(di). The condition Real(di) <= 0, satisfied by all infinitesimal generators {Lambda}, guarantees that all diagonal entries f(di) are well defined. Extension of Q({Lambda}) to nondiagonalizable {Lambda} is discussed thoroughly by Horn and Johnson (1991)Citation and will be omitted here.

From Nucleotide Models to Codon Models
Any nucleotide substitution model induces a trivial codon substitution model. For example, if vbd is the infinitesimal transition rate from nucleotide b to nucleotide d, then on the codon level, the infinitesimal transition rate from codon (a,b,c) to codon (a,d,c) is still vbd, assuming sites mutate independently and neither codon (a,b,c) nor codon (a,d,c) is a stop codon. Substitutions involving the first or third sites are treated similarly, and single-step substitutions between codons differing at more than one site are prohibited.

The only difference between running this naive codon model and three adjacent independent nucleotide models is that the three stop codons are disallowed in the codon model. For this reason, the naive codon model is not terribly interesting. One can increase its relevance by incorporating an acceptance mechanism. If two neighboring codons (a,b,c) and (a,d,c) are nonsynonymous, then one can penalize transitions between them by replacing vbd with {rho}vbd. Here {rho} [0, 1] is the probability that the proposed evolutionary change is accepted. If the destination codon is one of the three stop codons, the acceptance probability is 0. The acceptance probability for a synonymous change is 1. Mathematically, the codon model with acceptance probabilities is determined by a 61 x 61 infinitesimal generator {Upsilon} whose off-diagonal entries are the products of the corresponding entries of the infinitesimal generator of the naive codon model and the acceptance matrix. The diagonal entries of {Upsilon} are defined so that row sums vanish.

Preservation of Reversibility
Although the codon model is too complicated to permit explicit exponentiation of its infinitesimal generator {Upsilon}, symmetry considerations can simplify this task and likelihood computation in general. For example, to avoid complex arithmetic in computing the eigenstructure of {Upsilon}, it is desirable to derive sufficient conditions guaranteeing that {Upsilon} has real eigenvalues and eigenvectors. Reversibility is the key. One can easily show that reversibility in the nucleotide model is equivalent to imposing the two constraints ß{gamma} = {lambda}{sigma} and {alpha}{delta} = {epsilon}{kappa} on {Lambda}. With these constraints in force, the equilibrium distribution {pi} is


If we let D{pi} denote the diagonal matrix with the entries of {pi} along its diagonal, then reversibility is equivalent to the detailed balance condition

where the superscript * denotes vector or matrix transpose. Detailed balance implies that the eigenvalues of {Lambda} are real. To prove this result rigorously, consider the matrix D{pi}1/2{Lambda}D{pi}-1/2, which is similar to {Lambda} (Kelly 1979Citation , pp. 5–10). The calculation


does the trick when we recall that a symmetric matrix has real eigenvalues and real orthogonal eigenvectors (Horn and Johnson 1985Citation , pp. 169–171).

Besides accounting for nonsynonymous codon changes in a parsimonious manner, the codon model possesses the attractive property of turning a reversible nucleotide model into a reversible codon model. The scalar version of detailed balance is {pi}bvbd = {pi}dvdb. If the acceptance probabilities in the codon model are symmetric, then we claim that the equilibrium probability of the codon (a,b,c) is {pi}a{pi}b{pi}c up to a multiplicative constant. The proof of this statement is simply the equality

where {rho} = {rho}(a,b,c)->(a,d,c) = {rho}(a,d,c)->(a,b,c). Thus, detailed balance is preserved, and {Upsilon} is reversible with an identified equilibrium distribution. Because stop codons are omitted, we must normalize the proposed equilibrium distribution by dividing its entries by the sum {Sigma}(a,b,c) {pi}a{pi}b{pi}c taken over all nonstop codons (a,b,c).

Codon Acceptance Probabilities
In contrast to Muse and Gaut (1994)Citation , who employ a single acceptance probability that makes no distinctions between the different kinds of nonsynonymous codon substitutions, and in contrast to Goldman and Yang (1994)Citation , who use empirically determined codon acceptance probabilities based on the amino acid distances given in Grantham (1974)Citation , we prefer to estimate a small number of codon acceptance probabilities directly from the data. A large amount of sequence data would tend to favor this tactic of trading extra parameters for greater biological realism. Of course, the extreme model treating each amino acid separately requires (20 2) = 190 symmetric acceptance probabilities. To avoid such a proliferation of parameters, we group amino acids into similarity (equivalence) classes and then define acceptance probability parameters for transitions within or between classes. For instance, if we define similarity classes on the basis of the polarity properties of the amino acids, then the natural classes comprise the nonpolar amino acids S1 = {G, I, V, L, A, M, P, F, W}, the positive-polar–positively charged amino acids S2 = {Q, N, C, Y, H, K, R}, and the negative-polar–negatively charged amino acids S3 = {S, T, E, D}. We could further divide these classes on the basis of the size of amino acid side chains or other properties potentially affecting substitution rates. For instance, to account for cysteine's propensity to form disulfide bonds bridging different parts of a protein, we could place cysteine in its own similarity class S4 = {C}.

With the similarity classes S1 through S4 just defined, we postulate an acceptance probability {rho}0 within classes, an acceptance probability {rho}1 between a polar and a nonpolar class, an acceptance probability {rho}2 between different polar classes, and an acceptance probability {rho}3 involving substitution to or from cysteine. One would anticipate that {rho}0 > {rho}1 > {rho}2 and {rho}0 > {rho}3. The symmetry constraints implicit in this parameterization are not completely realistic. For instance, they probably fail for transitions involving cysteine. Substitution of another amino acid for a cysteine involved in a disulfide bond is bound to be much less likely than the reverse substitution. But it would take an enormous amount of data to see this effect, and it is mathematically advantageous to maintain symmetry for the sake of reversibility.

The penalty model just described appears to coincide with one proposed earlier by Yang, Nielsen, and Hasegawa (1998)Citation . In our companion article (Schadt, Sinsheimer, and Lange 2002Citation ), we compare it with a generalization of the amino acid metric model proposed by Goldman and Yang (1994)Citation . No clear winner emerges in this comparison, but the similarity class and metric models are both clearly superior to the naive Muse and Gaut (1994)Citation model.

Matrix Exponentiation and Differentiation
One of the primary concerns in likelihood evaluation is computation of the finite-time transition matrix exp(t{Upsilon}) and its partial derivatives with respect to the parameters. Computation is much easier if {Upsilon} is diagonalizable. Under the conditions of reversibility described above, let {nu} denote the product equilibrium distribution constructed from the nucleotide equilibrium distribution (eq. 1 ). Also let D{nu} denote the diagonal matrix with the entries of {nu} on its diagonal. Because D{nu}1/2{Upsilon}D{nu}-1/2 is symmetric, there exists a real orthogonal matrix U and a real diagonal matrix {Delta} such that D{nu}1/2{Upsilon}D{nu}-1/2 = U{Delta}U*. The columns of U are the eigenvectors of D{nu}1/2{Upsilon}D{nu}-1/2, and the diagonal entries of {Delta} are its eigenvalues. It follows from this representation that


Here exp(t{Delta}) is the diagonal matrix derived by exponentiating each of the eigenvalues separately. The sparseness of D{nu}1/2{Upsilon}D{nu}-1/2 helps in finding U. Only about 15% of the entries of {Upsilon} and thus of D{nu}1/2{Upsilon}D{nu}-1/2 are nonzero in the codon model. When we use special routines in the LAPACK linear algebra package and the D2a class in the IML++ package that exploit sparseness and symmetry, the computational burden of computing exp(t{Upsilon}) decreases by a factor of five. Details on the use of these packages for sparse matrix computations can be found in Dongarra et al. (1996)Citation and Pozo, Remington, and Lumsdaine (1996)Citation .

Suppose {xi} is a parameter of the codon model. Any numerical method that can compute the entrywise partial derivatives ({partial}/{partial}{xi})exp(t{Upsilon}) using only {Upsilon} and the entrywise partial derivatives ({partial}/{partial}{xi}){Upsilon} is highly desirable. For instance, the optimization engine SEARCH of LINNAEUS works faster and more reliably given these partial derivatives (Lange, Boehnke, and Weeks 1988Citation ). Formulas for ({partial}/{partial}{xi})exp(t{Upsilon}) are available in the statistical literature on fitting differential equation models (Jennrich and Bright 1974Citation ) and in the Lie group literature (Richtmyer 1981Citation , pp 143–144), but these have not seen application in molecular phylogeny. For this reason, we indicate briefly one method of calculating ({partial}/{partial}{xi})exp(t{Upsilon}).

Suppose f(z) is an analytic function of the complex variable z. The particular choice f(z) = exp(tz) is the pertinent one for us. Because f(z) is locally defined by an absolutely convergent power series, one can define a matrix-valued analytic function f(M) on square matrices M by substituting M for z in this series. This correspondence gives rise to a matrix functional calculus whose crowning highlight is the Cauchy integral formula


where I is the identity matrix, C is a simple closed curve strictly enclosing all the eigenvalues of M, and f(z) is analytic on C and its interior (Ikebe and Inagaki 1986Citation ; Horn and Johnson 1991Citation ). Differentiation of this formula yields


If we assume that M is diagonalizable in the form M = SDS-1 with D a diagonal matrix with jth diagonal entry dj, then we can write


where Dj is a matrix of zeros except for a one in diagonal entry j. Substituting this expression in equation (2) produces


based on the standard version of Cauchy's integral formula for scalar-valued analytic functions (Marsden 1987Citation , pp. 121–122). If F is the matrix with entry fjk = f'(dj) when dj = dk and fjk = (f(dj) - f(dk))/(dj - dk) when dj != dk, then this reduces to


where {circ} denotes the Hadamard (pointwise) product of two matrices. The computational complexity of computing ({partial}/{partial}{xi})f(M) is consequently dominated by ordinary matrix multiplication.

When M is not diagonalizable, a more complicated formula of Schwerdtferger applies based on the Jordan canonical form and Frobenius covariants of M (Horn and Johnson 1991Citation ). Fortunately for reversible models, {Upsilon} is diagonalizable. Even when {Upsilon} is not diagonalizable a priori, we can assume this condition in numerical practice because almost all matrices have distinct eigenvalues and are therefore diagonalizable. The appendix contains a short proof of this measure-theoretic assertion.

Rate Variation Models
Several authors have objected to the naive assumptions that different codon sites evolve equally rapidly and independently (Goldman and Yang 1994Citation ; Yang 1995Citation ; Felsenstein and Churchill 1996Citation ). Clearly, residues forming part of a catalytic domain or critical protein fold are more strongly conserved than residues that fall outside such regions. Better models of evolution must allow for variation in the rate of evolution and correlation in the rates of neighboring codon sites. The simplest way to model rate variation is to suppose that each codon site i is assigned a random rate class Ci from among r possible rate classes, which for convenience we label 1, ..., r. For two rate classes, we nominate the first class as slow and the second as fast. Probably the most parsimonious way of differentiating slow from fast evolution is to modulate the rate of evolution through the acceptance probabilities. Slow evolution can be distinguished from fast evolution by introducing a multiplicative parameter {eta} (0, 1) and replacing each acceptance probability {rho}i by {eta}{rho}i. For synonymous codon changes, one retains the acceptance probability of 1.

Modeling correlations in the rate of evolution among neighboring codon sites is more subtle. It is important to recognize that we are dealing with two different graphs in molecular phylogeny: (1) a taxon graph representing the relationships between the taxa and (2) a site graph representing which codon sites interact in determining rates of evolution. The site-to-site independence assumption essentially hides the fact that the site graph exists. The nodes of this graph are the different codon sites; its edges connect interacting pairs of sites. Yang (1995)Citation and Felsenstein and Churchill (1996)Citation implicitly define a site graph through their Markov chain models for rate variation. The Markov chain models postulate that the rate of evolution at the current site depends on the rate of evolution at the previous site. Although this is certainly a more realistic premise than independent rate variation, it is not the last word on the subject. Hidden Markov chains impose a unidirectional flow of information. Except for mathematical convenience, there is no reason to assume a preferred direction. Markov chain models also permit interactions only between nearest neighbors. These limitations can be relaxed without overly complicating maximum likelihood estimation and statistical inference.

The key to generalization is to replace Markov chains by Gibbs random fields (Kelly 1979Citation ; Whittaker 1990Citation , pp. 47–61; Cressie 1993Citation , pp. 410–422). The prototype Gibbs random field is the Ising model from statistical mechanics. As opposed to a Markov chain, it allows a bidirectional flow of information. Wagner, Baake, and Gerisch (1999)Citation introduced a simple two-state nucleotide evolution model based on coupled Ising chains. This model captures nearest neighbor interactions between particles on a one-dimensional lattice (Thompson 1972Citation , pp. 116–144). The Ising model and more complicated Gibbs fields are defined through potential functions H(c) defined on the realizations of the random rate class vector C = (C1, ..., Cn) over n consecutive codon sites. A Gibbs random field assigns the prior probability


to the realization C = c. For the sake of notational simplicity, we omit the usual minus signs in the definition of potentials and priors. The partition function {Sigma}d eH(d) is a multiple sum extending over all vectors d = (d1, ..., dn) drawn from the n-fold Cartesian product of {1, ..., r}. As a prelude to integrating the Gibbs models with phylogeny likelihood calculation, it is useful to review some particular Gibbs fields that can be applied directly to this problem.

Gibbs Fields and Their Partition Functions
In the Potts generalization of the Ising model (Baxter 1982Citation , pp. 422–452), one determines the prior probability of the random vector C = (C1, ..., Cn) over n codon sites through either the linear potential function


or the circular potential function


Here the vector (µi) controls the frequency of the different rate classes, and the matrix ({theta}ij) controls the spatial correlation in rate class between neighboring codons. This model generalizes the Ising model by allowing codon sites to belong to an arbitrary number of rate classes. In the circular case, we set cn+1 = c1.

One of the hardest problems in statistical mechanics has been the evaluation of partition functions. Fortunately, the Potts model permits exact evaluation. If we define the matrix


then


Conveniently, the circular model dispenses with frequency parameters after reparameterization. If we let M be the r x r matrix (e{omega}i,j) and v the r x 1 vector (eµi/2), then the partition functions reduce to the expressions


These expressions can be further simplified by diagonalizing M.

For example, in the Ising model when r = 2 and


the matrix


has eigenvalues


These can be used to write tr(Mn) = {phi}+n + {phi}-n in the circular Ising model (Ising 1925Citation ; Kramers and Wannier 1941Citation ; Onsager 1944Citation ).

Evaluation of the partition function can sometimes be done directly without recourse to linear algebra. For instance, consider the linear logistic model


with variation in nearest neighbor interactions but no frequency differences in rate classes (Derin and Elliott 1987Citation ). For the sake of argument, imagine that the rate indicators Ci are independent and uniformly distributed over the r rate classes. The n - 1 indicator random variables 1{Ci=Ci+1} are then independent, and the normalized partition function


can be viewed as their multivariate moment generating function. By virtue of independence, m({theta}1, ..., {theta}n-1) reduces to the product


of the univariate moment generating functions. It follows that the partition function satisfies


which obviously simplifies to a power if the {theta}i are equal. Extending this result to include frequency differences appears intractable (Strauss 1977Citation ), but in this simple case we see that allowing for rate variation among codon sites imposes little extra computational burden.

As a final example, suppose that r = 2 and the rate indicators Ci are chosen from {-1, 1} rather than {1, 2}. The partition function


joins nearest and next nearest neighbors. It can be radically simplified by writing Di = CiCi+1 and observing that Di {-1, 1} and DiDi+1 = CiCi+2. These considerations imply that


Thus H(d) exactly matches the potential of the linear Ising model with n - 1 sites. It follows that the explicit formula found for the Ising partition function applies here as well.

Likelihood Evaluation Under Rate Variation
To compute the likelihood of the data under the rate variation model, we let L(c) = {Pi}ni=1 Li(ci) denote the likelihood of the observations over all codon sites conditional on the value (c1, ..., cn) of the rate class vector C. Here Li(ci) is the likelihood specific to codon site i when it is in rate class ci. Given the value of the finite-time transition matrix et{Upsilon} and its derivatives ({partial}/{partial}{xi})exp(t{Upsilon}), computation of Li(ci) and its derivatives ({partial}/{partial}{xi})Li(ci) is carried out by the algorithms sketched by Felsenstein (1981)Citation and Schadt, Sinsheimer, and Lange (1998)Citation . These algorithms mimic Baum and Petrie's (1966)Citation forward and backward algorithms and have computational complexity O(ts2), where t is the number of taxa and s = 4 or s = 61 is the number of states of the model. As Felsenstein (1981)Citation notes, a complexity of O(ts2) is a vast improvement over the complexity O(st) encountered when one visits the internal nodes of the evolutionary tree simultaneously rather than recursively. Jensen, Lauritzen, and Olesen (1990)Citation present a nice generalization of the Baum and Petrie and Felsenstein algorithms to more complex graphical structures. This generalization closely resembles algorithms developed in pedigree analysis (Lange 1997Citation , pp. 102–110).

To compute the full likelihood in the rate variation model with spatial correlation, we must take into account the prior probability of the rate class vector C. Given a potential function H(c), we have


At this level of generality, it is not possible to suggest an efficient method of evaluating . But once we limit the extent of spatial interaction, things become easier. For instance, let us take as a concrete example the potential function


for a linear logistic model with interactions extending over k nearest neighbors. In this case, the exponentiated potential


comes into play as a product of arrays.

Computation of either the partition function or the numerator of reduces to the evaluation of a multiple sum of products of arrays. It is best to compute the multiple sums as iterated sums. Consider the numerator of . In the forward algorithm, we eliminate the indices c1, ..., cn in the indicated order. Starting with the array f0(c1, ..., ck) = 1 at step 0, at step i we recursively create the new array fi(ci+1, ..., cmin{i+k,n}) using


and then immediately discard fi-1(ci, ..., cmin{i+k-1,n}). The final array fn has zero dimension and furnishes the numerator of in equation (3) . In computing the array fi(ci+1, ..., cmin{i+k,n}), it is advantageous to perform the indicated array multiplications pairwise, starting with the two smallest arrays Li(ci)eµci, multiplying the resulting product against e{theta}11{ci=ci+1}, and so forth until a single array holding Li(ci)eµc, {Pi}min{k,n-i}j=1 e{theta}j1{ci=ci+j} is formed. The final multiplication of this array against fi-1(ci, ..., cmin{i+k-1,n}) can be carried out simultaneously with addition over the index ci. When the array e{theta}j1{ci=ci+j} is brought into play, its obvious symmetries can be exploited to reduce the overall computational burden.

The computational complexity of the forward algorithm scales linearly in n. At each step of the algorithm, one should rescale constructed arrays to prevent numerical underflow. Because a similar algorithm applies to the partition function, there is no need to express the partition function in closed form. Of course, if a closed form exists, it will usually be preferred on esthetic grounds.

Likelihood Differentiation Under Rate Variation
For certain rate variation models, it may be impossible to compute partial derivatives exactly. In such cases numerical partial derivatives of the likelihood can be approximated by either forward or central differences. Each forward difference requires an additional likelihood evaluation, and each central difference requires two additional likelihood evaluations. These extra likelihood evaluations are especially costly for kinetic parameters because they involve recalculation of all of the site-specific likelihoods Li(ci). For spatial parameters, the site-specific likelihoods are fixed, and it is possible to evaluate the full likelihood quickly given them. This suggests that numerical differentiation with respect to the spatial parameters is reasonable. One drawback of forward and central differences is that they entail roundoff errors originating from the subtraction of nearly equal function values. For a real analytic function f(x), there is a better alternative (Squire and Trapp 1998Citation ) based on the approximation


where x and y are real and i = . This approximation can be implemented in computer languages such as C and Fortran that accommodate complex arithmetic. Fortunately, in the linear logistic model (eq. 4 ), is a real analytic function of the spatial parameters µj and {theta}i.

The product rule allows us to express the partial derivative of with respect to a kinetic parameter {xi} as


For each i, the inner sum on the multi-index c can be computed using the forward algorithm by substituting ({partial}/{partial}{xi})Li(ci) for Li(ci). Although each of these separate evaluations is fast, the cumulative work can be considerable if n is large. It is possible to reduce the required effort by carrying out a backward algorithm and storing intermediate results. The backward algorithm for likelihood evaluation begins with the array bn+1(cn-k+1, ..., cn) = 1 and recursively defines new arrays


until reaching the constant array = b1.

It is conceptually simple but notationally messy to describe how to combine the forward and backward algorithms to compute partial derivatives. The basic idea is to exploit as much of the forward and backward information as possible. This is accomplished by using fi-1(ci, ..., ci+k-1) and bi+k(ci, ..., ci+k-1) as a basis for further computation. For the sake of simplicity, we deal with a central value of i satisfying k <= i <= n - k. When i < k, we omit the forward array and deal with the backward array alone, and when i > n - k we do the reverse. These special cases we leave to the reader. For a central value of i we collect all arrays that have not been visited into a single array a(ci, ..., ci+k-1) by forming the product


It is now a simple matter of summing over the remaining indices to produce


In practice, efficient evaluation of and its partial derivatives with respect to the kinetic parameters can be orchestrated by first carrying out the backward algorithm, storing the intermediate arrays, and then performing the forward algorithm. At the ith step of the forward algorithm, we sandwich the necessary forward and backward results and compute one piece of the differential of as indicated in equations (6) and (7) .

Inferring Hidden States
The models and algorithms discussed so far lend themselves to the optimal assignment of codons to internal nodes and of codon sites to rate classes. This flexibility will help scientists in understanding and visualizing the most likely ancestral sequences and the most highly conserved regions in a protein. Felsenstein and Churchill (1996)Citation detailed such an algorithm in computing posterior probabilities for a given nucleotide site belonging to a given rate class, using their hidden Markov-based model of rate variation among sites. Others have detailed fast and efficient algorithms for the reconstruction of nucleotide and amino acid sequences and demonstrated the usefulness of these reconstructions in annotating the functional regions of protein sequences (Pupko et al. 2000Citation ; Armon, Graur, and Ben-Tal 2001Citation ). We generalize these reconstruction and rate assignment algorithms so they can be applied in the context of the generalized reconstruction models we have developed here.

There are two common methods for inferring hidden states in the type of likelihood models detailed here. The first simply computes the posterior probability of each possible assignment and chooses the most probable one at each node or site. The second method finds the most likely constellation of assignments over all nodes or sites simultaneously by dynamic programming. The posterior probability method involves fairly modest changes to the various likelihood algorithms. The dynamic programming method requires more radical surgery and adapts the Viterbi (1967)Citation algorithm as generalized by Dawid (1992)Citation .

Inferring Hidden Rate Classes
Consider for instance computation of the posterior probability that a codon site belongs to a given rate class. The posterior probability in question is the ratio of a joint probability and the likelihood of the data. The joint probability captures the data with the given codon site restricted to a particular rate class. Computation of the joint probability proceeds exactly as likelihood computation with the exception that when the codon site in question is reached in the computation, the range of summation shifts from all possible rate classes to a single fixed rate class. To carry out this strategy as efficiently as possible, we mimic the process of computing partial derivatives. Thus, by analogy to equations (6) and (7) we form


and the multiple sum


omitting the sum on ci. Dividing S(ci) defined by equation (9) by the likelihood now gives the desired posterior probability of rate class ci at codon site i.

To explain the Viterbi algorithm for finding the best sequence of rate classes, it is clearer to adopt a general perspective. The problem of likelihood evaluation reduces to an evaluation of a multiple sum of multiple products of arrays of the form


Here Z = Z1 x ··· x Zn is a Cartesian product of n finite sets. In our case, each set Zi = {1, ..., r} contributing to the product represents the possible choices of rate class at codon i. The multi-index or sequence c = (c1, ..., cn) determines a rate class assignment for each of the n codon sites. The array aTj(c) depends only on the components of c determined by the subset Tj {1, ..., n}. For instance, if n = 3 and Tj = {1, 3}, then the array aTj(c) reduces to an array a{1,3}(c1, c3) depending on the first and third indices. The restriction to subsets indexed by elements j J conveys the fact that only some of the 2n subsets of {1, ..., n} occur in the product defining . We can avoid duplicate sets Tj = Tk by replacing corresponding arrays aTj(c) and aTk(c) by their pointwise product aTj(c)aTk(c). In practice, {cup}jJTj = {1, ..., n}.

In contrast to likelihood evaluation, the best assignment problem involves finding a multi-index c maximizing {Pi}jJ aTj(c). In other words, maxima replace sums. Dynamic programming solves a problem by taking advantage of already computed solutions of smaller examples of the same problem. In this case, we solve a sequence of n subproblems. For subproblem m, put


to pin down the arrays partially indexed by one or more of the first m indices c1, ..., cm. In the current context, dynamic programming exploits the identity


The validity of this identity hinges on the fact that the arrays in the first product on the right involve the index cm but none of the indices c1, ..., cm-1. The solutions to the inner maximization depend on the indices ci with


These are precisely the indices other than c1, ..., cm-1 implicated by association with c1, ..., cm-1 through some array. For each choice of the associated indices, we must preserve a pointer that specifies a choice of c1, ..., cm-1 providing a solution. When we perform the outer maximization over cm, the solution depends on the indices ci with i {cup}jUm Tj\{1, ..., m}. Again we preserve a pointer that specifies a best cm together with the sequence c1, ..., cm-1 specified by the solution at stage m - 1. At stage n we find a maximum and recover the solution vector by tracing back along the associated pointers.

As a toy example, consider the product


For m = 1 we find a maximum of


as a function of c2 and record a pointer specifying a best c1 for each c2. For m = 2 we find a maximum of


as a function of (c3, c4) and record a pointer from (c3, c4) to (c1, c2). For m = 3 we find a maximum of


as a function of (c4) and record a pointer from (c4) to (c1, c2, c3). For m = 4 we find a maximum over c4 and combine it with its pointer to (c1, c2, c3).

The rate class problem possesses two major advantages. First, the index set (eq. 10 ) at stage m - 1 always contains at most k indices for k nearest neighbor interactions. Second, it is natural to proceed from left to right along the n codon sites. As we will discuss later, the same strategy may or may not work well for neighborhood structures connecting nonconsecutive sites.

Inferring Hidden Codons: Ancestral Sequence Reconstruction
We now turn to the problem of imputing codons to internal nodes. To simplify our exposition, we fix a tree connecting the taxa of interest, a codon site, and a rate class for that site. In our previous article (Schadt, Sinsheimer, and Lange 1998Citation ), we sketched Felsenstein's (1981)Citation algorithm for computing the likelihood of the data observed at the leaves of the tree. This algorithm works for both nucleotide and codon models. In discussing the algorithm, we introduced several probabilities. These were (1) the conditional probability Tk->l(i, j) that the daughter node l of a branch k -> l is in state j given that the mother node k is in state i, (2) the conditional probability Dk(i) of the subtree rooted at node k given node k is in state i, (3) the conditional probability Lk(i) of the left subtree rooted at node k given node k is in state i, (4) the conditional probability Rk(i) of the right subtree rooted at node k given node k is in state i, and (5) the joint probability Uk(i) of node k being in state i and the complementary subtree showing its observed bases at its leaves. In definition (1), note that Tk->l(i, j) is nothing more than an entry of the finite-time transition matrix P(t) pertaining to the branch k -> l. The complementary subtree referred to in definition (3) contains node k and all nodes that are not descendents of k. The letters D, L, R, and U in these definitions suggest the directions down, left, right, and up, with the root occurring at the top of the tree.

We remind the reader that the quantities Lk(i) and Rk(i) are computed in a postorder traversal of the tree in which daughter nodes are visited prior to mother nodes. The quantities Uk(i) are computed in a subsequent preorder traversal of the tree in which mother nodes are visited before daughter nodes. If l is one of the daughter nodes of k, then computation of the Ul(j) relies on the stored values of Lk(i) and Rk(i) available after the postorder traversal and on the stored values of Uk(i) available after the partially completed preorder traversal. At the time Ul(j) is computed, it is trivial to compute the product Ul(j)Ll(j)Rl(j) giving the joint probability of the data at the site and an assigned state j for node l. The ratio of this probability to the likelihood of the data represents the conditional probability of the assigned state i given the data.

To remove the restriction to a particular site and rate class, let i denote a codon site, l a node, and ci a rate class at site i. We have just shown how to compute the joint likelihood Li(ci, xl) of the observed data at site i and a specified codon xl at node l. Within the context of the rate class model, efficient computation of the posterior probabilities of the various codons xl at the various nodes l proceeds by constructing the array a(ci, ..., ci+k-1) of equation (8) omitting the factor Li(ci). This amended array is then used to form the sum S(ci) of equation (9) . The joint probability of the data and codon xl at node l is determined by the sum {Sigma}rci=1 Li(ci, xl)S(ci). Dividing this joint probability by the likelihood of the data yields the posterior probability of codon xl at node l of site i. If done correctly, the whole process of imputing codons to internal nodes is very efficient.

Implementation of the Viterbi algorithm is similar in spirit to the posterior probability algorithm, but practical implementation and notation becomes cumbersome when one attempts to integrate it with the codon model. For this reason, let us assume that we have already assigned as previously described an optimal rate class to each site. This allows us to fix a tree, a codon site, and a rate class for that site. Our description of the Viterbi algorithm for assigning optimal codons to nodes again exploits both a postorder and a preorder traversal. We retain the symbol Dm(i) but interpret it as the conditional probability of the subtree emanating from m with optimally assigned codons. Here we obviously condition on codon i at node m. If m is a leaf, then Dm(i) is the 0/1 function 1{i=j} indicating consistency between the hypothesized codon i and the observed codon j at m. Now consider a mother node m with left and right daughter nodes l and r. The appropriate recurrence defining Dm(i) is


When we reach the root, we evaluate maxi {pi}iDroot(i) using the codon priors {pi}i at the root. This gives the probability of the best assignment of codons.

Of course, this does not solve the problem of finding the best assignment. To find the best assignment, we must define left and right pointers at each node m that track the best j and k in equation (11) given an i. In the preorder phase of the Viterbi algorithm, we then simply trace back the pointers from mother nodes to daughter nodes and note the optimal codon at each node as we visit it. The whole algorithm is obviously reminiscent of maximum parsimony. Both methods assign codon (or nucleotides) to nodes. If the models we have been exploring capture natural variation well, then maximum parsimony assignment is apt to be inferior to Viterbi assignment using likelihoods. Further exploration of this claim is clearly warranted.

Tree Building Algorithms
Efficient likelihood calculation and good optimization algorithms go only partway toward easing the computational burdens of finding the best evolutionary tree. For m contemporary taxa, there are Tm = (2m - 3)!/[2m-2(m - 2)!] rooted trees to consider. A host of methods exists for efficiently searching tree space; among these are branch-and-bound (Felsenstein 1981Citation ), simulated annealing (Schadt, Sinsheimer, and Lange 1998Citation ), and the recent progressive search algorithms of Warnow, Moret, and St. John (2001)Citation and Nakhleh et al. (2001)Citation . Here we will discuss methods of exhaustive enumeration that take into account constraints on phylogenies available through accepted groupings of taxa.

To illustrate our point, consider the unrooted tree of figure 1 depicting 16 complete HIV-1 genomes across five subtypes. In the absence of prior information on subtypes, these taxa are connected by more than 6.19 x 1015 possible rooted trees. Grouping the taxa by subtype dramatically reduces the number of possible trees. There are three taxa each of subtype A, B, C, and F and four taxa of subtype G. Each group of three related taxa can be arranged in three trees, the group of four related taxa can be arranged in 15 trees, and the five subtypes themselves can be arranged in 105 trees. Therefore, there are only (34)(15)(105) = 127,575 rooted trees connecting the taxa and preserving the subtypes. This represents a reduction in the number of possible trees by 10 orders of magnitude. Because there are fewer trees, the time devoted to fitting each possible tree is less of an issue. Accordingly, constraining tree space through known groupings promotes more sophisticated modeling.



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 1.—Constraining the topology of tree space can greatly reduce the number of possible trees.

 
Before discussing direct enumeration with group constraints, it is worth spending a few moments considering direct enumeration without such constraints. The standard inductive proof of the formula Tm = (2m - 3)!/[2m-2(m - 2)!] suggests an efficient means of visiting all possible rooted trees. Suppose we number the taxon from 1 to m. Given a tree with m - 1 taxa, a tree with m taxa can be constructed by attaching taxon m to any one of the existing 2m - 4 edges or directly to the root if the current bifurcation at the root is moved forward in time. If we label the edges 1, ..., 2m - 4 and the root 2m - 3, then we can think of enlarging the tree as a choice among these integers. We can view the whole tree as a vector with m - 2 integer entries representing such successive choices. Position 1 correspond