Molecular Biology and Evolution 19:8-13 (2002)
© 2002 Society for Molecular Biology and Evolution
Estimating Amino Acid Substitution Models: A Comparison of Dayhoff's Estimator, the Resolvent Approach and a Maximum Likelihood Method
Deutsches Krebsforschungszentrum, Theoretische Bioinformatik, Im Neuenheimer Feld 280, Heidelberg, Germany
Duke University, Institute of Statistics and Decision Sciences, Durham, North Carolina
| Abstract |
|---|
|
|
|---|
Evolution of proteins is generally modeled as a Markov process acting on each site of the sequence. Replacement frequencies need to be estimated based on sequence alignments. Here we compare three approaches: First, the original method by Dayhoff, Schwartz, and Orcutt (1978) Atlas Protein Seq. Struc. 5:345352, secondly, the resolvent method (RV) by Müller and Vingron (2000) J. Comput. Biol. 7(6):761776, and finally a maximum likelihood approach (ML) developed in this paper. We evaluate the methods using a highly divergent and inhomogeneous set of sequence alignments as an input to the estimation procedure. ML is the method of choice for small sets of input data. Although the RV method is computationally much less demanding it performs only slightly worse than ML. Therefore, it is perfectly appropriate for large-scale applications.
| Introduction |
|---|
|
|
|---|
Differences between homologous proteins are the result of a mutation process starting from a common, though unknown ancestor. In a mutation event, an amino acid at a certain position in a protein is replaced by another one. These exchanges are constrained by the requirement to maintain protein structure or function. Certain mutations tend to have little effect in this respect and are observed more frequently than exchanges that clearly influence the protein structure.
If this replacement improves the fitness of the organism the new amino acid will be accepted by natural selection. Replacements of very dissimilar amino acids often drastically change the fold of the protein, leading to a complete loss of function. Hence, such mutations are less often observed than those of similar amino acids, which have only slight effects with respect to fold and function. Therefore, amino acid similarity is reflected in replacement frequencies. It is important to note that actual mutation counts depend not only on these similarities but also on the degree of divergence of the sequences that one compares. We thus need a model which describes protein evolution as a function of time.
Modeling amino acid replacements by a Markov chain has been introduced by Dayhoff, Schwartz, and Orcutt (1978)
. In this approach a set of identical Markov chains acting independently on each site of the protein is used. The time index of the process is interpreted as a measure of evolutionary divergence. The challenge is to estimate the parameters of the process from divergent and time-inhomogeneous sequence data.
In the original approach of Dayhoff, Schwartz, and Orcutt (1978)
the actual estimation is restricted to only very closely related pairs of sequences. However, once a Markov model is fitted to this data, replacement frequencies characteristic for distantly related sequences can be extrapolated from the model.
Dayhoff's approach has been generalized and applied to larger data sets (Gonnet, Cohen, and Benner 1992
; Jones, Taylor, and Thornton 1992
). Furthermore, the advent of large numbers of structurally derived alignments has raised interest in using information also from very distant related alignments (Overington et al 1990
; Risler et al 1988
). However, these authors do not provide an estimation procedure which would account for the time-inhomogeneity of the input data.
Benner, Cohen, and Gonnet (1994)
were the first to point out the problem of estimating one consistent model from an inhomogeneous pool of alignment data. They sketch a normalization algorithm, which is based on computing logarithms of transition matrices which they approximate by power series. The approach is heuristic, as the convergence of the power series cannot be guaranteed for empirically derived matrices.
Müller and Vingron (2000)
present a rigorous estimation procedure which is based on an entirely different mathematical formalism. We refer to this method as resolvent method and briefly review it in the Resolvent Method section. Alternatively, we describe a novel maximum likelihood based approach. The Maximum Likelihood section provides the details of the mathematical formalisms and computations.
In principle, one has two important, although conflicting, criteria for evaluating the quality of the methods. For large-scale applications, time performance of the algorithms is crucial, whereas low statistical efficiency of the estimator can be compensated for by the huge amount of data that is used. On the other hand, if one is restricted to only a small set of input data, the accuracy of the estimator is more important. In the Results section we discuss the individual merits of the resolvent and the maximum likelihood estimator.
| Model |
|---|
|
|
|---|
Let P(t) be the transition probability matrix of a time continuous Markov chain with entries pij(t) = Prob(X [s + t] = j | X[s] = i). We consider only Markov chains for which the rate matrix Q = liml
0 (P[t] - I)/t exists, where I denotes the identity matrix. Hence, we get the following linear approximation
![]() |
|
|
tP(t) using the product rule for matrix differentiation we get
|
|
> 0, t
0. Using this we obtain:
|
|
-1 reduces the above equation to
![]() |
is in fact invertible for all
> 0 (see Müller and Vingron 2000).
The forward and backward equations can be solved under the initial condition P(0) = I and yield
![]() |
The problem of modeling amino acid replacement frequencies requires additional assumptions on the Markov chain. Following Dayhoff, Schwartz, and Orcutt (1978)
we model the evolution of each site of the proteins by a single time-homogeneous Markov chain X(t), calibrated, such that on average 1% of the amino acids are changed after one unit of time: Prob[X(t)
X(t + 1)] = 0.01. Once calibrated, the time t in the Markov chain can be used as a measure of evolutionary divergence. The acronym PAM (Point Accepted Mutations) is commonly used for this unit of divergence.
We assume that any amino acid can change into any other one. This is ensured by the requirement that the exchange rates qij be strictly positive for all i, j. Then there exists a unique limiting amino acid distribution
= (
1, ...,
20) where
j = limt
pij(t) > 0 is independent of the initial residue i. The distribution
fulfills the equations
Q = 0 and
P(t) =
for all t
0 and is called the equilibrium distribution. We only consider Markov processes which are in equilibrium. With the transition probabilities (Pt)t
0 and the overall amino acid distribution we calculate the joint distribution mij(t) =
ipij(t) of (Xs, Xs+t). M(t) = mij(t) denotes the probability of finding amino acid ai and amino acid aj aligned with each other in two sequences that are t time units apart.
The Markov chain Xt describes the evolution of a single site in a protein from ancestors to descendants. Whereas data from ancestor sequences are not available, we do observe pairs of proteins that have evolved from a common, though unknown, ancestor. We cannot decide the direction of the mutation, we only observe pairs of corresponding amino acids at certain positions in a protein. It would not be appropriate to model such a direction. Such processes are described by the class of time-reversible Markov chains. This means that the probability of being in amino acid ai and going from ai to aj in time t is equal to that of being in amino acid aj and going from amino acid aj to ai. Consequently, we get the detailed balance equation
iqij =
jqji. In particular, M(t) is a symmetric matrix. We use the shortcut Evolutionary Markov Process (EMP) for a process satisfying all the conditions discussed earlier.
The transition and the rate matrix of an EMP have the following mathematical properties. Denote by F the diagonal matrix with entries
i. Then F constitutes a symmetric, positive definite matrix and
x, y
F =
x, Fy
defines an inner product. Because of the reversibility, Q and P(t) are selfadjoint relative to
·, ·
F, i.e.,
P(t)x, y
F =
x, P(t)y
F, and can therefore be transformed into diagonal form by a change of coordinates. The eigenvalues of Q are real by selfadjointness, and negative owing to Gershgorin's theorem. Using definition (4) we can rewrite Q and P(t) as
|
|
i are the eigenvalues of Q, and the matrix S consists of the joint orthonormal basis of eigenvectors of Q and P(t). According to the Perron-Frobenius theorem the largest eigenvalue of P(t) equals 1 and therefore the largest eigenvalue of Q equals 0. In general, given a rate matrix Q of an EMP, one can easily calculate P(t) for all t > 0 by formula (5). | Estimation Algorithms |
|---|
|
|
|---|
The problem we are dealing with is the estimation of the parameters of an EMP from observed replacement frequencies. However, practically these frequencies are derived from different sequence alignments and will generally not conform to one EMP. We describe three approaches to the EMP estimation problem. We start with summarizing the original work of Dayhoff, Schwartz, and Orcutt (1978)
Dayhoff's Method
Dayhoff's method has a strategy consisting of estimating P(1) and then extrapolating to higher PAM distances. She pools input alignments of only closely related sequences. Nevertheless, this data can still be time-inhomogeneous. From this data she derives a calibrated transition matrix using the fact that on small evolutionary distances the calibration can be carried out linearly, as in equation (1)
. It is important to note that Dayhoff intended to use only alignments of very closely related pairs of sequences. There is no theoretical justification for applying it to more divergent input alignments. In fact, from our discussion earlier it becomes clear that linear calibration fails for more divergent data. The computational details of Dayhoff's method are summarized in various textbooks, see e.g., Setubal and Meidanis (1997)
.
Maximum Likelihood
With the enormous number of divergent alignments available today, Dayhoff's approach implies a huge loss of information. Yet it is highly desirable to exploit sequence alignments of widely different evolutionary distances. However, this requires progress in the theory of EMP estimation. An appropriate estimator should account for the evolutionary divergence of each alignment in the data set.
Our procedure is based on a two-step algorithm to improve a given set of rates and an equilibrium distribution. In the first step, the degree of evolutionary distance for each given alignment is estimated under the input parameters. In the second step, new parameters are estimated assuming the divergence times just calculated. This two-step procedure was started using Dayhoff's parameters and then iterated until no change of the parameters could be detected. This iterative approach is discussed in Müller and Vingron (2000)
. The time estimation part is discussed in Barry and Hartigan (1987a, 1987b
); Adachi and Hasegawa (1996)
; Baake and von Haeseler (1999)
; Müller and Vingron (2000)
. For the following exposition we thus focus on the estimation of the rate matrix assuming that the evolutionary distances of all alignments are given.
The main problem in formulating a maximum likelihood estimator is to develop a parameterization for the rate matrix that reflects all requirements for an EMP. In order to specify the EMP, we estimate 210 parameters, namely the equilibrium distribution
and the rate matrix Q.
For simplicity we start by formulating the estimator only for a single given alignment with evolutionary distance t. The maximum likelihood method yields the following estimates for
and Q:
|
|
i, and Q is a rate matrix. We use the parameterization
![]() |
j>i
jrij)-1 and
|
|
i < 1,
20i=1
i = 1, and i = -
i
j qij. Note that Q is a rate matrix and satisfies the detailed balance equation. Conversely, all rate matrices that fulfill detailed balance can be parameterized as in (8). We proceed to show that multiplication by the factor c = (200
j>i
jrij)-1 calibrates the process to 1 PAM. To this end, we search for a c such that tr (Fec
) = 0.99. From equation (1)
we obtain
![]() |
-(100 tr[F
])-1. We want to express c in terms of the parameterization variables:
|
|
|
|
|
|
The maximization problem is tackled using standard optimization algorithms (Brent 1973
; Press et al 1990
). To this end, the constrained optimization problem needs to be mapped to an unconstrained one. For example, we map the distribution values to the interval (0,1) using the function x
arctan(x)/
+ 0.5 and we map the positive relative rates to the positive real line by the function x
exp(-x).
Resolvent Method
An alternative method for estimating the rate matrix Q of an EMP is presented in Müller and Vingron (2000)
. It is based on the relation
![]() |
denotes the resolvent of the Markov chain (2). Once the resolvent is computed, one can derive the rate matrix by applying this formula. The problem is putting this formalism to use in the estimation problem, where we do not have perfect knowledge of all transition matrices, but instead are given discrete sets of counts drawn at arbitrary distances.
Let n alignments be given and assume that tk is the degree of divergence of the sequences in alignment k. The goal is to estimate an EMP from the alignment data using the distances tk. We first estimate P(tk) by the empirical transition frequencies in the respective alignments. We estimate pij(tk) by counting all occurrences of (ai, aj) and (aj, ai), and then normalizing by the overall frequency of amino acid i. For each alignment, this yields one estimated transition matrix
(tk) for each time tk. We want to approximate the integral (R
)ij = 
0 e-
tpij(t) dt. This is done using linear interpolation of the pij(tk) and then integrating the piecewise functions. Note that the 20 x 20 entries of the resolvent can be calculated separately and independently of each other. Theoretically, the rate matrix is independent of
, but for empirical integrals it is not. The approximation of the integral (R
)ij = 
0 e-
tpij(t) dt is most accurate if the lattice points pij(tk) lie in the high-density region of the integrand. Whether this is the case or not depends on the parameter
. Consequently, a sensitive choice for
is called for. Our approach is likelihood based. With equation (3) we obtain
|
|
with maximal likelihood gives us a rate matrix. In practice, the resolvent method is used iteratively with time estimation updates as in the case of the maximum likelihood method. | Results |
|---|
|
|
|---|
Clearly, if applied to real sequence data, the different estimation procedures result in different substitution models. We do not see any obvious biological criterion that can be applied to decide whether one model is superior to another. In contrast, we start from a given model (Q,
) and a set of evolutionary degrees of divergence t1, ..., tn and sample artificial pairwise alignment data according to the associated distribution M(ti) = F exp(tiQ). More concretely, we draw independent pairs of amino acids from this distribution, and the concatenation of these pairs gives us a simulated gap-free alignment at PAM distance tk.
We pool alignments of various degrees of divergence and run the estimators on this data, which is then used to reestimate the parameters (Q,
) that are used for alignment generation. In this setup, estimator evaluation is straightforward.
The resulting estimated substitution models can be equivalently represented by the rate matrix Q, any transition matrix P(t), or a matrix of pair frequencies M(t). As the first two display strong diagonal dominance, graphical comparison of them is not appropriate. Instead we choose the M(100) pair frequencies.
We show three simulation results. First, we test the estimators in the case of a small input data set. We reestimate the model parameters from 30 alignments of 300 sites each, where the degree of divergence varies from 10 to 300 PAMs with one alignment for each distance. The results are shown in figure 1 . In this situation the maximum likelihood method is more accurate than either the resolvent method or Dayhoff's method. This is not surprising, because maximum likelihood approaches are known to yield highly efficient estimators. Yet one can clearly improve the accuracy of estimations by using more input data. In the case of protein evolution, tens of thousands of alignments are easily accessible. Theoretically, one would assume that the maximum likelihood estimator would be superior in this setup, but in practice, it is not appropriate because it is computationally too demanding. For real data sets the evolutionary degree of divergence of all alignments is very likely to be different. This is especially bad as it makes likelihood evaluations slow, see equation (10) . In order to simulate the performance of the maximum likelihood estimator on a data set of medium size we use a set containing 30 alignments of 5,000 sites, for distances ranging from 10 to 300 PAM. This yields a large set of observed amino acid pairs but only at 30 different PAM distances. The results are shown in figure 2 . One can clearly observe that the resolvent method catches up when compared to the maximum likelihood method, whereas Dayhoff's method shows the expected bias resulting from ignoring the evolutionary distances.
|
|
For huge amounts of input data, the maximum likelihood estimator cannot be evaluated. Figure 3 compares Dayhoff's method with the resolvent estimator for 10,000 alignments of length 300 with PAM distances distributed uniformly at the interval [0, 300]. The resolvent method shows very satisfying accuracy, and as one would expect, it clearly outperforms Dayhoff's method.
|
Table 1 summarizes the three comparison experiments. Here, we assess the performance of the three methods in terms of the relative entropy H(
[100] | M[100]) =
ij mij(100)log([mij{100}]/[
ij{100}]) of the joint distributions
(100) of the estimated models to the model M(100) that is underlying the simulations.
|
As we pointed out, a sensitive choice of the resolvent parameter
is critical for the performance of the resolvent method. This is clearly supported by the simulation experiments. Figure 4
shows a plot of the average log-likelihood of the fitted model versus
. It becomes clear that although the rate matrix Q is theoretically independent of
, numerical problems can significantly weaken the resolvent method if
is inappropriately chosen.
|
| Discussion |
|---|
|
|
|---|
We discussed the problem of estimating amino acid replacement frequencies from inhomogeneous divergent alignment data. We tested the applicability of Dayhoff's method to this kind of data, reviewed and evaluated the resolvent method, and developed a novel maximum likelihood estimator. All of these methods model protein evolution by a Markov process acting independently on each site of the proteins. Whereas the RV and ML methods take evolutionary distances into account, Dayhoff's method does not. In simulations with time-heterogeneous alignment data we prove the importance of including distances into the model. In particular, maximum likelihood proved to perform best for small data sets, whereas for larger data sets where maximum likelihood becomes computationally infeasible, the resolvent method is a good alternative.
The EMP model reduces the phenomenon of protein evolution to 210 parameters. It is obvious that this cannot cover the entire complexity of evolution. One assumes that the positions in a protein evolve independently of each other with the same dynamics for each site, which can be modeled by a Markov chain. Of course, it is well known that different sites in a protein may evolve at different speeds and that possibly different replacement mechanisms are operating. All of our assumptions are questionable from a biological point of view. However, from the perspective of data analysis it is obvious that one needs to simplify to make model fitting practical. The challenge is to reflect as much of the reality as possible with 210 parameters.
In 1972 when Dayhoff et al. proposed the first solution to this problem, only a few sequences were available, and homology detection was restricted to relatively closely related pairs of sequences. Clearly, their method was intended for the use of this kind of data. Our use of Dayhoff's method in the context of divergent alignments is not in the sense of its authors and has no theoretical justification. We use it to demonstrate the practical importance of including the divergence parameter into a model.
A natural shortcoming of using only closely related sequence alignments is that the estimator is biased toward the evolution of fast-evolving positions in proteins. Thus, basing the estimation on the large and divergent data set that we used does not only improve the model parameters because of the much larger amount of input data, but might also reflect protein evolution on longer time scales more appropriately.
Our simulation results show that the maximum likelihood estimator is more efficient than the resolvent method. On the other hand, it is restricted to input data sets of moderate size. But more input data clearly gives more accurate estimates for the transition probabilities, which may well compensate for the theoretical suboptimality of the estimator. In principle, we have a trade off between the statistical and the computational efficiency of the estimators. For small data sets we recommend a maximum likelihood approach, whereas the resolvent method is a practical alternative tailored for huge data sets.
| Acknowledgements |
|---|
|
|
|---|
We would like to thank two anonymous referees for helpful comments on an earlier version of this paper. We are also grateful to Sven Rahmann and Marc Rehmsmeier for many stimulating discussions.
| Footnotes |
|---|
William Taylor, Reviewing Editor
Keywords: amino acid replacement
amino acid score matrix
maximum-likelihood
protein evolution ![]()
Address for correspondence and reprints: Tobias Müller, Deutsches Krebsforschungszentrum, Theoretische Bioinformatik, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany. t.mueller{at}dkfz.de
. ![]()
| References |
|---|
|
|
|---|
Adachi J., M. Hasegawa, 1996 Molphy: (programs for molecular phylogenetics) Version 2.3 Institute of Statistical Mathematics, Tokyo
Baake E., A. von Haeseler, 1999 Distance measures in terms of substitution processes Theor. Popul. Biol 5:166-175
Barry D., J. Hartigan, 1987a. Asynchronous distance between homologous DNA sequences Biometrics 43::261-276[ISI][Medline]
. 1987b. Statistical analysis of hominoid molecular evolution Stat. Sci 2:191-210
Benner S., M. Cohen, G. Gonnet, 1994 Amino acid substitution during functionally constrained divergent evolution of protein sequences Protein Eng 7:1323-1332
Brent R. P., 1973 Algorithms for minimization without derivatives Prentice Hall, Englewood Cliffs, NJ
Dayhoff M., R. Schwartz, B. Orcutt, 1978 A model of evolutionary change in protein Atlas Protein Seq. Struct 5:345-352
Gonnet G., M. Cohen, S. Benner, 1992 Exhaustive matching of the entire protein sequence database Science 256:1443-1445
Jones D. T., W. R. Taylor, J. M. Thornton, 1992 The rapid generation of mutation data matrices from protein sequences CABIOS 8:275-282
Müller T., M. Vingron, 2000 Modeling amino acid replacement J. Comput. Biol 7:761-776[ISI][Medline]
Overington J., M. Johnson, A. Sali, T. Blundell, 1990 Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction Proc. R. Soc. Lond. B 241:132-145[Medline]
Press W. H., B. P. Flannery, S. Teukolsky, W. T. Vetterling, 1990 Numerical recipes in C Press Syndicate of the University of Cambridge, Cambridge
Risler J., M. Delorme, H. Delacroix, A. Henaut, 1988 Amino acid substitutions in structurally related proteins J. Mol. Biol 204:1019-1029[ISI][Medline]
Setubal J., J. Meidanis, 1997 Introduction to computational molecular biology PWS Publishing Company, Boston
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
M. Csuros, I. B. Rogozin, and E. V. Koonin Extremely Intron-Rich Genes in the Alveolate Ancestors Inferred with a Flexible Maximum-Likelihood Approach Mol. Biol. Evol., May 1, 2008; 25(5): 903 - 911. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Selig, M. Wolf, T. Muller, T. Dandekar, and J. Schultz The ITS2 Database II: homology modelling RNA structure for molecular systematics Nucleic Acids Res., January 11, 2008; 36(suppl_1): D377 - D380. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Csuros, J. A. Holey, and I. B. Rogozin In search of lost introns Bioinformatics, July 1, 2007; 23(13): i87 - i96. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Doron-Faigenboim and T. Pupko A Combined Empirical and Mechanistic Codon Model Mol. Biol. Evol., February 1, 2007; 24(2): 388 - 397. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Luz and M. Vingron Family specific rates of protein evolution Bioinformatics, May 15, 2006; 22(10): 1166 - 1171. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. A. Price, G. E. Crooks, R. E. Green, and S. E. Brenner Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap Bioinformatics, October 15, 2005; 21(20): 3824 - 3831. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. E. Crooks, R. E. Green, and S. E. Brenner Pairwise alignment incorporating dipeptide covariation Bioinformatics, October 1, 2005; 21(19): 3704 - 3710. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. E. Crooks and S. E. Brenner An alternative model of amino acid replacement Bioinformatics, April 1, 2005; 21(7): 975 - 980. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. A. Simossis, J. Kleinjung, and J. Heringa Homology-extended sequence alignment Nucleic Acids Res., February 7, 2005; 33(3): 816 - 824. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Lartillot and H. Philippe A Bayesian Mixture Model for Across-Site Heterogeneities in the Amino-Acid Replacement Process Mol. Biol. Evol., June 1, 2004; 21(6): 1095 - 1109. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. C. Edgar MUSCLE: multiple sequence alignment with high accuracy and high throughput Nucleic Acids Res., March 19, 2004; 32(5): 1792 - 1797. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. C. Edgar Local homology recognition and distance measures in linear time using compressed amino acid alphabets Nucleic Acids Res., January 16, 2004; 32(1): 380 - 385. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y.-K. Yu, J. C. Wootton, and S. F. Altschul The compositional adjustment of amino acid substitution matrices PNAS, December 23, 2003; 100(26): 15688 - 15693. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Gil, F. J. Silva, E. Zientz, F. Delmotte, F. Gonzalez-Candelas, A. Latorre, C. Rausell, J. Kamerbeek, J. Gadau, B. Holldobler, et al. The genome sequence of Blochmannia floridanus: Comparative analysis of reduced genomes PNAS, August 5, 2003; 100(16): 9388 - 9393. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



















