Molecular Biology and Evolution 19:1534-1549 (2002)
© 2002 Society for Molecular Biology and Evolution
Codon and Rate Variation Models in Molecular Phylogeny


*Department of Biomathematics, UCLA School of Medicine, Los Angeles, CA;
Computational Genomics, Rosetta Inpharmatics, Kirkland, WA;
Department of Human Genetics, UCLA School of Medicine, Los Angeles, CA
| Abstract |
|---|
|
|
|---|
This article generalizes previous models for codon substitution and rate variation in molecular phylogeny. Particular attention is paid to (1) reversibility, (2) acceptance and rejection of proposed codon changes, (3) varying rates of evolution among codon sites, and (4) the interaction of these sites in determining evolutionary rates. To accommodate spatial variation in rates, Markov random fields rather than Markov chains are introduced. Because these innovations complicate maximum likelihood estimation in phylogeny reconstruction, it is necessary to formulate new algorithms for the evaluation of the likelihood and its derivatives with respect to the underlying kinetic, acceptance, and spatial parameters. To derive the most from maximum likelihood analysis of sequence data, it is useful to compute posterior probabilities assigning residues to internal nodes and evolutionary rate classes to codon sites. It is also helpful to search through tree space in a way that respects accepted phylogenetic relationships. Our phylogeny program LINNAEUS implements algorithms realizing these goals. Readers may consult our companion article in this issue for several examples.
| Introduction |
|---|
|
|
|---|
If the promise of the burgeoning DNA sequence and protein databases is to be realized, then more refined models of evolution at the sequence level must be constructed. Not only must these models be more realistic, they must also permit fast computation in frequentist and Bayesian statistical inference. This is a tall order. There are several ways to increase the realism of single-nucleotide substitution models. For instance, one can work at the protein level and consider transitions between amino acids or between secondary-structure within proteins rather than between nucleotides (Koshi and Goldstein 1997
First, not all sequence conservation occurs within coding regions. Regions of gene regulation also experience the guiding hand of selection. Second, amino acid substitution models blur the distinction between mutation, which operates at the nucleotide level, and selection, which operates at the amino acid level. One can reasonably expect the rate of mutation at a nucleotide site to be independent of the role its resident nucleotide plays at the codon level. Whether selection retains or eliminates a mutation depends on whether the corresponding codon substitution is synonymous or nonsynonymous. Some nonsynonymous substitutions can be more easily tolerated than others. These considerations suggest that we view selection as accepting some codon substitutions and rejecting others according to definite probabilistic rules (Hasegawa 1986
; Adachi and Hasegawa 1992
; Goldman and Yang 1994
; Pedersen, Wiuf, and Christiansen 1998
; Yang, Nielsen, and Hasegawa 1998
). More realistic modeling must recognize that some codon sites experience stronger selection than other sites. This fact suggests that one allows for variation in the rate of evolution among sites (Koshi and Goldstein 1998
; Yang, Nielsen, and Hasegawa 1998
; Yang 2000
). Finally, because conserved catalytic domains, protein folds, and tertiary structures extend over both consecutive and nonconsecutive codon sites, it is desirable to pursue models that allow for spatial correlation in rate variation (Yang 1995
; Felsenstein and Churchill 1996
; Wagner, Baake, and Gerisch 1999
).
These broad research trends have been set in motion by the researchers just noted. Our goal here is to refine some of the current models in interesting and productive ways and show how these models fit into fast maximum likelihood estimation. For example, we advocate modeling the acceptance of nonsynonymous substitutions using equivalence classes of amino acids on the basis of physical and chemical similarities. Modeling of spatial variation in the rate of evolution has been limited to the construction of hidden Markov chains and coupled Ising chains connecting adjacent nucleotide sites (Yang 1995
; Felsenstein and Churchill 1996
; Wagner, Baake, and Gerisch 1999
). These natural developments parallel the successful applications of hidden Markov chains in contexts such as gene finding, similarity searches, and homology recognition (Gribskov, McLachlan, and Eisenberg 1987
; Tatusov, Altschul, and Koonin 1994
; Yi and Lander 1994
; Burge and Karlin 1997
). But there is a growing recognition of the need to liberate phylogenetic modeling from the unidirectional flow of information implicit in Markov chains and simple nearest neighbor interactions required by Ising chains (Koshi and Goldstein 1998
; Yang, Nielsen, and Hasegawa 1998
). The most nuanced models will achieve these goals with codons rather than nucleotides as the units of selection.
In our view, the theory of Markov random fields offers a perfect framework for expansion of the rate variation models. In particular, we demonstrate that imposing a Gibbs prior with limited spatial interactions is fully compatible with fast likelihood evaluation. This opens the door to more realistic, but still parsimonious, codon models. On the numerical side, we present several algorithmic improvements in likelihood evaluation that can decrease computing times by at least two orders of magnitude. To stimulate the wider use of sophisticated models, we have implemented all of our advances in the computer program LINNAEUS. Applications of LINNAEUS appear in our companion article (Schadt, Sinsheimer, and Lange 2002
) and will not be pursued here.
| Theory and Methods |
|---|
|
|
|---|
Our point of departure is our previous generalization of Kimura's nucleotide substitution model (Kimura 1980
Nucleotide Models
Codon models are preferable to amino acid models because they incorporate detail at the mechanistic level at which mutation operates. The codon models of Goldman and Yang (1994)
and Muse and Gaut (1994)
rely on a continuous-time Markov chain with 61 states. Mutations to one of the three stop codons totally disrupt gene function and can be ignored safely. Of course, codon models are no panacea. Nucleotide substitution is seldom completely random in conserved regions flanking coding regions, near intron-exon boundaries, or in intronic regions such as the silencer and enhancer regions implicated in regulation and splicing. In the case of regulation, substitutions can affect transcription and transcript stability; in the case of splicing, substitutions can change the entire structure of the translated protein (Miyata and Yasunaga 1980
; Li, Wu, and Luo 1985
; Nei and Gojobori 1986
; Li 1993
).
As a basis for our codon models, we adopt the generalization of Kimura's nucleotide substitution model proposed by Schadt, Sinsheimer, and Lange (1998)
. Kimura's model and subsequent generalizations view a phylogeny as a binary tree with contemporary taxa forming the leaves. Evolution occurs from site to site and branch to branch via independent Markov chains bouncing back and forth among the four possible states A (adenine), G (guanine), C (cytosine), and T (thymine). These continuous-time chains share the common infinitesimal generator
|
|
are real and explicitly computable. This fact facilitates finding the equilibrium distribution and forming the finite-time transition matrix P(t) = exp(t
). The entry pij(t) of P(t) represents the probability that a daughter node of the tree occupies state j at time t given that its mother node occupies state i at time 0. Note that
embodies the assumption that transversion rates depend only on their destination states and not on their initial states. This assumption allows more flexibility than competing models (Kimura 1980
Rate Variation Among Lineages
Because, in general, it is impossible to separate the time parameter for each branch from the kinetic parameters, we adopt Yang's (1994)
convention and set one of the branch lengths equal to 1. This convention allows a branch length to be interpreted as the expected number of changes between the two taxa at the ends of the branch. The nucleotide substitution models sketched above implicitly assume that the rate of evolution across lineages is constant. To allow for stochastic variation in branch lengths, one can introduce randomized times for each branch.
There are two ways of choosing these times, both involving gamma distributed random variates. In the traditional model (Nei and Gojobori 1986
; Tamura and Nei 1993; Yang 1993
; Kelly and Rice 1996), each branch length at a given site is multiplied by the same gamma variate with mean 1. The gamma variates at different sites are chosen independently. Unfortunately, this procedure enormously complicates likelihood evaluation because the likelihood at the site must be integrated over all possible realizations of the gamma variate. Although one can carry out such quadratures numerically, using say Laguerre polynomials, accuracy tends to be low, and computation times tend to be high, particularly when derivatives of the likelihood are also sought.
Alternatively, one can imagine multiplying the different branch lengths at a given site by independent gamma distributed variates of mean 1. Although this creates even more integrals, these can be done along each branch separately as part of a single overall likelihood evaluation. Furthermore, as noted below, the necessary quadratures can be performed analytically. To promote model parsimony, it is preferable to specify the same shape parameter across both branches and sites for the independent gamma variates. In this regard, recall that the standard parameterization of a gamma distributed variate T involves a scale parameter
and a shape parameter
. The shape parameter determines the coefficient of variation of T through the equation
|
|
/
= 1 forces
=
.
In our alternative model with independent gamma multipliers rather than a common gamma multiplier, we must compute the unconditional finite-time transition matrix Q for evolution along each branch. Assuming a branch has average length 1/
and the random multiplier T has mean 1 and shape parameter
, one calculates Q using the formula
|
|
) is to consider it to be a matrix version of the analytic function
|
|
) <
. Differentiation under the integral sign and integration by parts show that f(
) satisfies the ordinary differential equation
|
|
) = (1 -
/
)-
subject to the initial condition f(0) = 1.
Although this insight is useful, practical computation of Q(
) is best achieved when
is diagonalizable (Horn and Johnson 1991
, 520 p.). If
= UDU-1 with D diagonal having diagonal entries di, then
|
|
0, satisfied by all infinitesimal generators
, guarantees that all diagonal entries f(di) are well defined. Extension of Q(
) to nondiagonalizable
is discussed thoroughly by Horn and Johnson (1991)
From Nucleotide Models to Codon Models
Any nucleotide substitution model induces a trivial codon substitution model. For example, if vbd is the infinitesimal transition rate from nucleotide b to nucleotide d, then on the codon level, the infinitesimal transition rate from codon (a,b,c) to codon (a,d,c) is still vbd, assuming sites mutate independently and neither codon (a,b,c) nor codon (a,d,c) is a stop codon. Substitutions involving the first or third sites are treated similarly, and single-step substitutions between codons differing at more than one site are prohibited.
The only difference between running this naive codon model and three adjacent independent nucleotide models is that the three stop codons are disallowed in the codon model. For this reason, the naive codon model is not terribly interesting. One can increase its relevance by incorporating an acceptance mechanism. If two neighboring codons (a,b,c) and (a,d,c) are nonsynonymous, then one can penalize transitions between them by replacing vbd with
vbd. Here
[0, 1] is the probability that the proposed evolutionary change is accepted. If the destination codon is one of the three stop codons, the acceptance probability is 0. The acceptance probability for a synonymous change is 1. Mathematically, the codon model with acceptance probabilities is determined by a 61 x 61 infinitesimal generator
whose off-diagonal entries are the products of the corresponding entries of the infinitesimal generator of the naive codon model and the acceptance matrix. The diagonal entries of
are defined so that row sums vanish.
Preservation of Reversibility
Although the codon model is too complicated to permit explicit exponentiation of its infinitesimal generator
, symmetry considerations can simplify this task and likelihood computation in general. For example, to avoid complex arithmetic in computing the eigenstructure of
, it is desirable to derive sufficient conditions guaranteeing that
has real eigenvalues and eigenvectors. Reversibility is the key. One can easily show that reversibility in the nucleotide model is equivalent to imposing the two constraints ß
= 
and 
= 
on
. With these constraints in force, the equilibrium distribution
is
|
|
If we let D
denote the diagonal matrix with the entries of
along its diagonal, then reversibility is equivalent to the detailed balance condition
![]() |
are real. To prove this result rigorously, consider the matrix D
1/2
D
-1/2, which is similar to
(Kelly 1979|
|
Besides accounting for nonsynonymous codon changes in a parsimonious manner, the codon model possesses the attractive property of turning a reversible nucleotide model into a reversible codon model. The scalar version of detailed balance is
bvbd =
dvdb. If the acceptance probabilities in the codon model are symmetric, then we claim that the equilibrium probability of the codon (a,b,c) is
a
b
c up to a multiplicative constant. The proof of this statement is simply the equality
![]() |
=
(a,b,c)
(a,d,c) =
(a,d,c)
(a,b,c). Thus, detailed balance is preserved, and
is reversible with an identified equilibrium distribution. Because stop codons are omitted, we must normalize the proposed equilibrium distribution by dividing its entries by the sum
(a,b,c)
a
b
c taken over all nonstop codons (a,b,c).
Codon Acceptance Probabilities
In contrast to Muse and Gaut (1994)
, who employ a single acceptance probability that makes no distinctions between the different kinds of nonsynonymous codon substitutions, and in contrast to Goldman and Yang (1994)
, who use empirically determined codon acceptance probabilities based on the amino acid distances given in Grantham (1974)
, we prefer to estimate a small number of codon acceptance probabilities directly from the data. A large amount of sequence data would tend to favor this tactic of trading extra parameters for greater biological realism. Of course, the extreme model treating each amino acid separately requires (20 2) = 190 symmetric acceptance probabilities. To avoid such a proliferation of parameters, we group amino acids into similarity (equivalence) classes and then define acceptance probability parameters for transitions within or between classes. For instance, if we define similarity classes on the basis of the polarity properties of the amino acids, then the natural classes comprise the nonpolar amino acids S1 = {G, I, V, L, A, M, P, F, W}, the positive-polarpositively charged amino acids S2 = {Q, N, C, Y, H, K, R}, and the negative-polarnegatively charged amino acids S3 = {S, T, E, D}. We could further divide these classes on the basis of the size of amino acid side chains or other properties potentially affecting substitution rates. For instance, to account for cysteine's propensity to form disulfide bonds bridging different parts of a protein, we could place cysteine in its own similarity class S4 = {C}.
With the similarity classes S1 through S4 just defined, we postulate an acceptance probability
0 within classes, an acceptance probability
1 between a polar and a nonpolar class, an acceptance probability
2 between different polar classes, and an acceptance probability
3 involving substitution to or from cysteine. One would anticipate that
0 >
1 >
2 and
0 >
3. The symmetry constraints implicit in this parameterization are not completely realistic. For instance, they probably fail for transitions involving cysteine. Substitution of another amino acid for a cysteine involved in a disulfide bond is bound to be much less likely than the reverse substitution. But it would take an enormous amount of data to see this effect, and it is mathematically advantageous to maintain symmetry for the sake of reversibility.
The penalty model just described appears to coincide with one proposed earlier by Yang, Nielsen, and Hasegawa (1998)
. In our companion article (Schadt, Sinsheimer, and Lange 2002
), we compare it with a generalization of the amino acid metric model proposed by Goldman and Yang (1994)
. No clear winner emerges in this comparison, but the similarity class and metric models are both clearly superior to the naive Muse and Gaut (1994)
model.
Matrix Exponentiation and Differentiation
One of the primary concerns in likelihood evaluation is computation of the finite-time transition matrix exp(t
) and its partial derivatives with respect to the parameters. Computation is much easier if
is diagonalizable. Under the conditions of reversibility described above, let
denote the product equilibrium distribution constructed from the nucleotide equilibrium distribution (eq. 1 ). Also let D
denote the diagonal matrix with the entries of
on its diagonal. Because D
1/2
D
-1/2 is symmetric, there exists a real orthogonal matrix U and a real diagonal matrix
such that D
1/2
D
-1/2 = U
U*. The columns of U are the eigenvectors of D
1/2
D
-1/2, and the diagonal entries of
are its eigenvalues. It follows from this representation that
|
|
) is the diagonal matrix derived by exponentiating each of the eigenvalues separately. The sparseness of D
1/2
D
-1/2 helps in finding U. Only about 15% of the entries of
and thus of D
1/2
D
-1/2 are nonzero in the codon model. When we use special routines in the LAPACK linear algebra package and the D2a class in the IML++ package that exploit sparseness and symmetry, the computational burden of computing exp(t
) decreases by a factor of five. Details on the use of these packages for sparse matrix computations can be found in Dongarra et al. (1996)
Suppose
is a parameter of the codon model. Any numerical method that can compute the entrywise partial derivatives (
/
)exp(t
) using only
and the entrywise partial derivatives (
/
)
is highly desirable. For instance, the optimization engine SEARCH of LINNAEUS works faster and more reliably given these partial derivatives (Lange, Boehnke, and Weeks 1988
). Formulas for (
/
)exp(t
) are available in the statistical literature on fitting differential equation models (Jennrich and Bright 1974
) and in the Lie group literature (Richtmyer 1981
, pp 143144), but these have not seen application in molecular phylogeny. For this reason, we indicate briefly one method of calculating (
/
)exp(t
).
Suppose f(z) is an analytic function of the complex variable z. The particular choice f(z) = exp(tz) is the pertinent one for us. Because f(z) is locally defined by an absolutely convergent power series, one can define a matrix-valued analytic function f(M) on square matrices M by substituting M for z in this series. This correspondence gives rise to a matrix functional calculus whose crowning highlight is the Cauchy integral formula
|
|
|
|
If we assume that M is diagonalizable in the form M = SDS-1 with D a diagonal matrix with jth diagonal entry dj, then we can write
|
|
|
|
dk, then this reduces to
|
|
denotes the Hadamard (pointwise) product of two matrices. The computational complexity of computing (
/
)f(M) is consequently dominated by ordinary matrix multiplication.
When M is not diagonalizable, a more complicated formula of Schwerdtferger applies based on the Jordan canonical form and Frobenius covariants of M (Horn and Johnson 1991
). Fortunately for reversible models,
is diagonalizable. Even when
is not diagonalizable a priori, we can assume this condition in numerical practice because almost all matrices have distinct eigenvalues and are therefore diagonalizable. The appendix contains a short proof of this measure-theoretic assertion.
Rate Variation Models
Several authors have objected to the naive assumptions that different codon sites evolve equally rapidly and independently (Goldman and Yang 1994
; Yang 1995
; Felsenstein and Churchill 1996
). Clearly, residues forming part of a catalytic domain or critical protein fold are more strongly conserved than residues that fall outside such regions. Better models of evolution must allow for variation in the rate of evolution and correlation in the rates of neighboring codon sites. The simplest way to model rate variation is to suppose that each codon site i is assigned a random rate class Ci from among r possible rate classes, which for convenience we label 1, ..., r. For two rate classes, we nominate the first class as slow and the second as fast. Probably the most parsimonious way of differentiating slow from fast evolution is to modulate the rate of evolution through the acceptance probabilities. Slow evolution can be distinguished from fast evolution by introducing a multiplicative parameter
(0, 1) and replacing each acceptance probability
i by 
i. For synonymous codon changes, one retains the acceptance probability of 1.
Modeling correlations in the rate of evolution among neighboring codon sites is more subtle. It is important to recognize that we are dealing with two different graphs in molecular phylogeny: (1) a taxon graph representing the relationships between the taxa and (2) a site graph representing which codon sites interact in determining rates of evolution. The site-to-site independence assumption essentially hides the fact that the site graph exists. The nodes of this graph are the different codon sites; its edges connect interacting pairs of sites. Yang (1995)
and Felsenstein and Churchill (1996)
implicitly define a site graph through their Markov chain models for rate variation. The Markov chain models postulate that the rate of evolution at the current site depends on the rate of evolution at the previous site. Although this is certainly a more realistic premise than independent rate variation, it is not the last word on the subject. Hidden Markov chains impose a unidirectional flow of information. Except for mathematical convenience, there is no reason to assume a preferred direction. Markov chain models also permit interactions only between nearest neighbors. These limitations can be relaxed without overly complicating maximum likelihood estimation and statistical inference.
The key to generalization is to replace Markov chains by Gibbs random fields (Kelly 1979
; Whittaker 1990
, pp. 4761; Cressie 1993
, pp. 410422). The prototype Gibbs random field is the Ising model from statistical mechanics. As opposed to a Markov chain, it allows a bidirectional flow of information. Wagner, Baake, and Gerisch (1999)
introduced a simple two-state nucleotide evolution model based on coupled Ising chains. This model captures nearest neighbor interactions between particles on a one-dimensional lattice (Thompson 1972
, pp. 116144). The Ising model and more complicated Gibbs fields are defined through potential functions H(c) defined on the realizations of the random rate class vector C = (C1, ..., Cn) over n consecutive codon sites. A Gibbs random field assigns the prior probability
|
|
d eH(d) is a multiple sum extending over all vectors d = (d1, ..., dn) drawn from the n-fold Cartesian product of {1, ..., r}. As a prelude to integrating the Gibbs models with phylogeny likelihood calculation, it is useful to review some particular Gibbs fields that can be applied directly to this problem.
Gibbs Fields and Their Partition Functions
In the Potts generalization of the Ising model (Baxter 1982
, pp. 422452), one determines the prior probability of the random vector C = (C1, ..., Cn) over n codon sites through either the linear potential function
|
|
|
|
ij) controls the spatial correlation in rate class between neighboring codons. This model generalizes the Ising model by allowing codon sites to belong to an arbitrary number of rate classes. In the circular case, we set cn+1 = c1.
One of the hardest problems in statistical mechanics has been the evaluation of partition functions. Fortunately, the Potts model permits exact evaluation. If we define the matrix
|
|
|
|
i,j) and v the r x 1 vector (eµi/2), then the partition functions reduce to the expressions
|
|
For example, in the Ising model when r = 2 and
|
|
|
|
|
|
+n +
-n in the circular Ising model (Ising 1925
Evaluation of the partition function can sometimes be done directly without recourse to linear algebra. For instance, consider the linear logistic model
|
|
|
|
1, ...,
n-1) reduces to the product
|
|
|
|
i are equal. Extending this result to include frequency differences appears intractable (Strauss 1977
As a final example, suppose that r = 2 and the rate indicators Ci are chosen from {-1, 1} rather than {1, 2}. The partition function
|
|
{-1, 1} and DiDi+1 = CiCi+2. These considerations imply that
|
|
Likelihood Evaluation Under Rate Variation
To compute the likelihood of the data under the rate variation model, we let L(c) =
ni=1 Li(ci) denote the likelihood of the observations over all codon sites conditional on the value (c1, ..., cn) of the rate class vector C. Here Li(ci) is the likelihood specific to codon site i when it is in rate class ci. Given the value of the finite-time transition matrix et
and its derivatives (
/
)exp(t
), computation of Li(ci) and its derivatives (
/
)Li(ci) is carried out by the algorithms sketched by Felsenstein (1981)
and Schadt, Sinsheimer, and Lange (1998)
. These algorithms mimic Baum and Petrie's (1966)
forward and backward algorithms and have computational complexity O(ts2), where t is the number of taxa and s = 4 or s = 61 is the number of states of the model. As Felsenstein (1981)
notes, a complexity of O(ts2) is a vast improvement over the complexity O(st) encountered when one visits the internal nodes of the evolutionary tree simultaneously rather than recursively. Jensen, Lauritzen, and Olesen (1990)
present a nice generalization of the Baum and Petrie and Felsenstein algorithms to more complex graphical structures. This generalization closely resembles algorithms developed in pedigree analysis (Lange 1997
, pp. 102110).
To compute the full likelihood
in the rate variation model with spatial correlation, we must take into account the prior probability of the rate class vector C. Given a potential function H(c), we have
|
|
. But once we limit the extent of spatial interaction, things become easier. For instance, let us take as a concrete example the potential function
|
|
|
|
Computation of either the partition function or the numerator of
reduces to the evaluation of a multiple sum of products of arrays. It is best to compute the multiple sums as iterated sums. Consider the numerator of
. In the forward algorithm, we eliminate the indices c1, ..., cn in the indicated order. Starting with the array f0(c1, ..., ck) = 1 at step 0, at step i we recursively create the new array fi(ci+1, ..., cmin{i+k,n}) using
|
|
in equation (3)
. In computing the array fi(ci+1, ..., cmin{i+k,n}), it is advantageous to perform the indicated array multiplications pairwise, starting with the two smallest arrays Li(ci)eµci, multiplying the resulting product against e
11{ci=ci+1}, and so forth until a single array holding Li(ci)eµc,
min{k,n-i}j=1 e
j1{ci=ci+j} is formed. The final multiplication of this array against fi-1(ci, ..., cmin{i+k-1,n}) can be carried out simultaneously with addition over the index ci. When the array e
j1{ci=ci+j} is brought into play, its obvious symmetries can be exploited to reduce the overall computational burden. The computational complexity of the forward algorithm scales linearly in n. At each step of the algorithm, one should rescale constructed arrays to prevent numerical underflow. Because a similar algorithm applies to the partition function, there is no need to express the partition function in closed form. Of course, if a closed form exists, it will usually be preferred on esthetic grounds.
Likelihood Differentiation Under Rate Variation
For certain rate variation models, it may be impossible to compute partial derivatives exactly. In such cases numerical partial derivatives of the likelihood can be approximated by either forward or central differences. Each forward difference requires an additional likelihood evaluation, and each central difference requires two additional likelihood evaluations. These extra likelihood evaluations are especially costly for kinetic parameters because they involve recalculation of all of the site-specific likelihoods Li(ci). For spatial parameters, the site-specific likelihoods are fixed, and it is possible to evaluate the full likelihood
quickly given them. This suggests that numerical differentiation with respect to the spatial parameters is reasonable. One drawback of forward and central differences is that they entail roundoff errors originating from the subtraction of nearly equal function values. For a real analytic function f(x), there is a better alternative (Squire and Trapp 1998
) based on the approximation
|
|
. This approximation can be implemented in computer languages such as C and Fortran that accommodate complex arithmetic. Fortunately, in the linear logistic model (eq. 4
),
is a real analytic function of the spatial parameters µj and
i.
The product rule allows us to express the partial derivative of
with respect to a kinetic parameter
as
|
|
/
)Li(ci) for Li(ci). Although each of these separate evaluations is fast, the cumulative work can be considerable if n is large. It is possible to reduce the required effort by carrying out a backward algorithm and storing intermediate results. The backward algorithm for likelihood evaluation begins with the array bn+1(cn-k+1, ..., cn) = 1 and recursively defines new arrays
|
|
= b1.
It is conceptually simple but notationally messy to describe how to combine the forward and backward algorithms to compute partial derivatives. The basic idea is to exploit as much of the forward and backward information as possible. This is accomplished by using fi-1(ci, ..., ci+k-1) and bi+k(ci, ..., ci+k-1) as a basis for further computation. For the sake of simplicity, we deal with a central value of i satisfying k
i
n - k. When i < k, we omit the forward array and deal with the backward array alone, and when i > n - k we do the reverse. These special cases we leave to the reader. For a central value of i we collect all arrays that have not been visited into a single array a(ci, ..., ci+k-1) by forming the product
|
|
|
|
and its partial derivatives with respect to the kinetic parameters can be orchestrated by first carrying out the backward algorithm, storing the intermediate arrays, and then performing the forward algorithm. At the ith step of the forward algorithm, we sandwich the necessary forward and backward results and compute one piece of the differential of
as indicated in equations (6)
and (7)
.
Inferring Hidden States
The models and algorithms discussed so far lend themselves to the optimal assignment of codons to internal nodes and of codon sites to rate classes. This flexibility will help scientists in understanding and visualizing the most likely ancestral sequences and the most highly conserved regions in a protein. Felsenstein and Churchill (1996)
detailed such an algorithm in computing posterior probabilities for a given nucleotide site belonging to a given rate class, using their hidden Markov-based model of rate variation among sites. Others have detailed fast and efficient algorithms for the reconstruction of nucleotide and amino acid sequences and demonstrated the usefulness of these reconstructions in annotating the functional regions of protein sequences (Pupko et al. 2000
; Armon, Graur, and Ben-Tal 2001
). We generalize these reconstruction and rate assignment algorithms so they can be applied in the context of the generalized reconstruction models we have developed here.
There are two common methods for inferring hidden states in the type of likelihood models detailed here. The first simply computes the posterior probability of each possible assignment and chooses the most probable one at each node or site. The second method finds the most likely constellation of assignments over all nodes or sites simultaneously by dynamic programming. The posterior probability method involves fairly modest changes to the various likelihood algorithms. The dynamic programming method requires more radical surgery and adapts the Viterbi (1967)
algorithm as generalized by Dawid (1992)
.
Inferring Hidden Rate Classes
Consider for instance computation of the posterior probability that a codon site belongs to a given rate class. The posterior probability in question is the ratio of a joint probability and the likelihood of the data. The joint probability captures the data with the given codon site restricted to a particular rate class. Computation of the joint probability proceeds exactly as likelihood computation with the exception that when the codon site in question is reached in the computation, the range of summation shifts from all possible rate classes to a single fixed rate class. To carry out this strategy as efficiently as possible, we mimic the process of computing partial derivatives. Thus, by analogy to equations (6)
and (7)
we form
|
|
|
|
To explain the Viterbi algorithm for finding the best sequence of rate classes, it is clearer to adopt a general perspective. The problem of likelihood evaluation reduces to an evaluation of a multiple sum of multiple products of arrays of the form
|
|
{1, ..., n}. For instance, if n = 3 and Tj = {1, 3}, then the array aTj(c) reduces to an array a{1,3}(c1, c3) depending on the first and third indices. The restriction to subsets indexed by elements j
J conveys the fact that only some of the 2n subsets of {1, ..., n} occur in the product defining
. We can avoid duplicate sets Tj = Tk by replacing corresponding arrays aTj(c) and aTk(c) by their pointwise product aTj(c)aTk(c). In practice,
j
JTj = {1, ..., n}.
In contrast to likelihood evaluation, the best assignment problem involves finding a multi-index c maximizing
j
J aTj(c). In other words, maxima replace sums. Dynamic programming solves a problem by taking advantage of already computed solutions of smaller examples of the same problem. In this case, we solve a sequence of n subproblems. For subproblem m, put
|
|
|
|
|
|
j
Um Tj\{1, ..., m}. Again we preserve a pointer that specifies a best cm together with the sequence c1, ..., cm-1 specified by the solution at stage m - 1. At stage n we find a maximum and recover the solution vector by tracing back along the associated pointers.
As a toy example, consider the product
|
|
|
|
|
|
|
|
The rate class problem possesses two major advantages. First, the index set (eq. 10 ) at stage m - 1 always contains at most k indices for k nearest neighbor interactions. Second, it is natural to proceed from left to right along the n codon sites. As we will discuss later, the same strategy may or may not work well for neighborhood structures connecting nonconsecutive sites.
Inferring Hidden Codons: Ancestral Sequence Reconstruction
We now turn to the problem of imputing codons to internal nodes. To simplify our exposition, we fix a tree connecting the taxa of interest, a codon site, and a rate class for that site. In our previous article (Schadt, Sinsheimer, and Lange 1998
), we sketched Felsenstein's (1981)
algorithm for computing the likelihood of the data observed at the leaves of the tree. This algorithm works for both nucleotide and codon models. In discussing the algorithm, we introduced several probabilities. These were (1) the conditional probability Tk
l(i, j) that the daughter node l of a branch k
l is in state j given that the mother node k is in state i, (2) the conditional probability Dk(i) of the subtree rooted at node k given node k is in state i, (3) the conditional probability Lk(i) of the left subtree rooted at node k given node k is in state i, (4) the conditional probability Rk(i) of the right subtree rooted at node k given node k is in state i, and (5) the joint probability Uk(i) of node k being in state i and the complementary subtree showing its observed bases at its leaves. In definition (1), note that Tk
l(i, j) is nothing more than an entry of the finite-time transition matrix P(t) pertaining to the branch k
l. The complementary subtree referred to in definition (3) contains node k and all nodes that are not descendents of k. The letters D, L, R, and U in these definitions suggest the directions down, left, right, and up, with the root occurring at the top of the tree.
We remind the reader that the quantities Lk(i) and Rk(i) are computed in a postorder traversal of the tree in which daughter nodes are visited prior to mother nodes. The quantities Uk(i) are computed in a subsequent preorder traversal of the tree in which mother nodes are visited before daughter nodes. If l is one of the daughter nodes of k, then computation of the Ul(j) relies on the stored values of Lk(i) and Rk(i) available after the postorder traversal and on the stored values of Uk(i) available after the partially completed preorder traversal. At the time Ul(j) is computed, it is trivial to compute the product Ul(j)Ll(j)Rl(j) giving the joint probability of the data at the site and an assigned state j for node l. The ratio of this probability to the likelihood of the data represents the conditional probability of the assigned state i given the data.
To remove the restriction to a particular site and rate class, let i denote a codon site, l a node, and ci a rate class at site i. We have just shown how to compute the joint likelihood Li(ci, xl) of the observed data at site i and a specified codon xl at node l. Within the context of the rate class model, efficient computation of the posterior probabilities of the various codons xl at the various nodes l proceeds by constructing the array a(ci, ..., ci+k-1) of equation (8)
omitting the factor Li(ci). This amended array is then used to form the sum S(ci) of equation (9) . The joint probability of the data and codon xl at node l is determined by the sum
rci=1 Li(ci, xl)S(ci). Dividing this joint probability by the likelihood of the data yields the posterior probability of codon xl at node l of site i. If done correctly, the whole process of imputing codons to internal nodes is very efficient.
Implementation of the Viterbi algorithm is similar in spirit to the posterior probability algorithm, but practical implementation and notation becomes cumbersome when one attempts to integrate it with the codon model. For this reason, let us assume that we have already assigned as previously described an optimal rate class to each site. This allows us to fix a tree, a codon site, and a rate class for that site. Our description of the Viterbi algorithm for assigning optimal codons to nodes again exploits both a postorder and a preorder traversal. We retain the symbol Dm(i) but interpret it as the conditional probability of the subtree emanating from m with optimally assigned codons. Here we obviously condition on codon i at node m. If m is a leaf, then Dm(i) is the 0/1 function 1{i=j} indicating consistency between the hypothesized codon i and the observed codon j at m. Now consider a mother node m with left and right daughter nodes l and r. The appropriate recurrence defining Dm(i) is
|
|
iDroot(i) using the codon priors
i at the root. This gives the probability of the best assignment of codons. Of course, this does not solve the problem of finding the best assignment. To find the best assignment, we must define left and right pointers at each node m that track the best j and k in equation (11) given an i. In the preorder phase of the Viterbi algorithm, we then simply trace back the pointers from mother nodes to daughter nodes and note the optimal codon at each node as we visit it. The whole algorithm is obviously reminiscent of maximum parsimony. Both methods assign codon (or nucleotides) to nodes. If the models we have been exploring capture natural variation well, then maximum parsimony assignment is apt to be inferior to Viterbi assignment using likelihoods. Further exploration of this claim is clearly warranted.
Tree Building Algorithms
Efficient likelihood calculation and good optimization algorithms go only partway toward easing the computational burdens of finding the best evolutionary tree. For m contemporary taxa, there are Tm = (2m - 3)!/[2m-2(m - 2)!] rooted trees to consider. A host of methods exists for efficiently searching tree space; among these are branch-and-bound (Felsenstein 1981
), simulated annealing (Schadt, Sinsheimer, and Lange 1998
), and the recent progressive search algorithms of Warnow, Moret, and St. John (2001)
and Nakhleh et al. (2001)
. Here we will discuss methods of exhaustive enumeration that take into account constraints on phylogenies available through accepted groupings of taxa.
To illustrate our point, consider the unrooted tree of figure 1 depicting 16 complete HIV-1 genomes across five subtypes. In the absence of prior information on subtypes, these taxa are connected by more than 6.19 x 1015 possible rooted trees. Grouping the taxa by subtype dramatically reduces the number of possible trees. There are three taxa each of subtype A, B, C, and F and four taxa of subtype G. Each group of three related taxa can be arranged in three trees, the group of four related taxa can be arranged in 15 trees, and the five subtypes themselves can be arranged in 105 trees. Therefore, there are only (34)(15)(105) = 127,575 rooted trees connecting the taxa and preserving the subtypes. This represents a reduction in the number of possible trees by 10 orders of magnitude. Because there are fewer trees, the time devoted to fitting each possible tree is less of an issue. Accordingly, constraining tree space through known groupings promotes more sophisticated modeling.
|
Before discussing direct enumeration with group constraints, it is worth spending a few moments considering direct enumeration without such constraints. The standard inductive proof of the formula Tm = (2m - 3)!/[2m-2(m - 2)!] suggests an efficient means of visiting all possible rooted trees. Suppose we number the taxon from 1 to m. Given a tree with m - 1 taxa, a tree with m taxa can be constructed by attaching taxon m to any one of the existing 2m - 4 edges or directly to the root if the current bifurcation at the root is moved forward in time. If we label the edges 1, ..., 2m - 4 and the root 2m - 3, then we can think of enlarging the tree as a choice among these integers. We can view the whole tree as a vector with m - 2 integer entries representing such successive choices. Position 1 correspond


































