Molecular Biology and Evolution 17:890-896 (2000)
© 2000 Society for Molecular Biology and Evolution
Article |
A Fast Algorithm for Joint Reconstruction of Ancestral Amino Acid Sequences


*Department of Zoology, George S. Wise Faculty of Life Sciences,
and
Department of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Ramat Aviv, Israel
| Abstract |
|---|
|
|
|---|
A dynamic programming algorithm is developed for maximum-likelihood reconstruction of the set of all ancestral amino acid sequences in a phylogenetic tree. To date, exhaustive algorithms that find the most likely set of ancestral states (joint reconstruction) have running times that scale exponentially with the number of sequences and are thus limited to very few taxa. The time requirement of our new algorithm scales linearly with the number of sequences and is therefore applicable to practically any number of taxa. A detailed description of the new algorithm and an example of its application to cytochrome b sequences are provided.
| Introduction |
|---|
|
|
|---|
By using extant sequences and the phylogenetic relationships among them, it is possible to infer the most plausible ancestral sequences from which they have been derived. Ancestral reconstruction has been applied in several contexts. For instance, the parsimony method (Fitch 1971
Maximum likelihood (ML) is a general estimation paradigm which has been widely utilized in evolutionary studies, overcoming many shortcomings of parsimony (Felsenstein 1981
; Kishino, Miyata, and Hasegawa 1990
). ML-based methods for inference of ancestral sequences were devised by Yang, Kumar, and Nei (1995)
and Koshi and Goldstein (1996)
. A widely used variant of ML (the Bayesian approach) finds the most probable parameter set given the data. Applying this ML variant to ancestral-sequence reconstruction, one maximizes P(ancient amino acid sequences | contemporary sequences). Indeed, the method developed by Yang, Kumar, and Nei (1995)
is Bayesian. By using the most likely set of sequences at all the internal nodes to evaluate the number of synonymous versus nonsynonymous substitutions along branches, Zhang, Rosenberg, and Nei (1998)
inferred positive Darwinian selection after gene duplication in primate ribonuclease genes.
Yang (1995)
distinguished between two variants of ancestral ML reconstruction, termed "joint" and "marginal." To illustrate the difference between the two methods, let us consider the multifurcated tree in figure 1
. Suppose character state A can change to either B or C, and then to D ... I, according to the probabilities listed in figure 1
. If we are interested in the most likely pathway, then clearly the answer is set{A, C, I}, i.e., A
C
I, which has a probability of 0.45 x 0.9 = 0.405. If, on the other hand, we are interested in the most likely character state after one step, then B is the winner (although B does not even feature in the most likely set). As far as ancestral-sequence inference is concerned, we have an analogous situation. We may be interested either in a the set of all the hypothetical taxonomic unit (HTU) sequences (joint reconstruction) or in a specific HTU whose sequence we would like to estimate (marginal reconstruction). As our examples demonstrate, the results are not necessarily the same under the two methods of ML reconstruction.
|
The two ML reconstruction methods have been implemented in the phylogenetic software packages PAML (Yang 1995
Koshi and Goldstein (1996)
have developed a fast dynamic programming algorithm for marginal reconstruction, whose variants are implemented in existing software (Yang 1995
; Zhang and Nei 1997
). To date, no fast algorithm exists for joint reconstruction.
Here, we provide a new efficient algorithm for joint ML ancestral reconstruction. The running time of our algorithm scales linearly with the number of sequences and thus can be applied to a practically unlimited number of sequences. The new algorithm is based on the dynamic programming scheme (for a general description of this scheme, see, e.g., Cormen, Leiserson, and Rivest 1990
). Finally, we compare the performance of our method to those of extant algorithms by applying it to a data set of cytochrome b sequences.
| Materials and Methods |
|---|
|
|
|---|
ML Ancestral Reconstruction
Following Yang, Kumar, and Nei (1995)
j replacement probability, denoted as Pij(t), can be calculated from the eigenvalue decomposition of M (Kishino, Miyata, and Hasegawa 1990
Consider, for example, an unrooted tree with five operational taxonomic units (OTUs), as in figure 2
. At a given sequence position, there are 203 possible reconstructions of the amino acids at the three internal nodes (A in node 6, A in node 7, A in node 8; or A in node 6, A in node 7, C in node 8; ... or Y in node 6, Y in node 7, Y in node 8). The aim of joint ML is to identify the triplet
, maximizing P(
| data). That is, we want to find, from among all possible triplets, the one that maximizes
|

Since P(data) is the same for all triplets, it suffices to maximize P(data |
) x P(
). For the tree in figure 2
, we solve

The choice of node 8 as the root is arbitrary, because the model assumed is time-reversible (Felsenstein 1981
; Yang, Kumar, and Nei 1995
).
Complexity
The complexity issue is central to ancestral reconstruction. The solution sought in equation (2) is the maximum over all possible triplets. However, for larger trees, say, with h HTUs, one needs to maximize over the set of all possible combinations of h ancestral character states. As explained in the introduction, this set is very large, including 20h such combinations. Even if one considers only the c character states observed in the site, there are ch combinations of such states that are likely to appear (Zhang and Nei 1997
). Naive implementation, examining each such combination separately, is impractical for all but very modest values of h.
In the following, we present a fast new algorithm for ancestral reconstruction. This dynamic programming algorithm guarantees finding the set of sequences, one sequence per node, most likely to have been the progenitors of the extant sequences. The complexity of the algorithm is linear. That is, its running time per site is proportional to the number of internal nodes, and its efficiency enables its employment for any number of OTUs. Note that our algorithm maximizes the likelihood over all 20h possible combinations. For very small numbers of OTUs, this implies longer running times than those for existing programs, which check only ch combinations. However, since our algorithm is efficient, it does not require more than a few seconds (see Results).
Terminology
We root the tree at an arbitrary internal node. If node x is the direct descendant of node y, we say that y is the father of node x, and x is the son of node y. Thus, each nonroot node has a father. Each internal node has two sons, except for the root, which has three sons. OTUs have no sons. Also, each tree branch supports a subtree, which includes the father and son that it connects, together with all of the descendants of the son. For a demonstration of the terminology, consider the tree in figure 2
. Node 8 is the arbitrary root. Node 7 is the father of nodes 3 and 6. The branch connecting nodes 7 and 6 supports the subtree in the shaded region, including nodes 7, 6, 5, and 4.
A Linear-Time Algorithm for Joint Ancestral Reconstruction
Only alanine (A) and valine (V) are observed at OTUs of the tree in figure 2
. Thus, the chances of any other amino acid occurring at internal nodes are quite negligible. For the sake of clarity and compactness, in this section we shall only deal with the values of Pij(t) for i and j which are either alanine or valine. For simplicity, we further assume that all the branches of the phylogenetic tree in figure 2
are of the same length t and that the replacement probabilities Pij(t) for this value of t are given in table 1
. Note, however, that this table is merely a toy model for the sole purpose of demonstrating the algorithm. In practice, we use Pij(t) values from matrices published in standard literature (Dayhoff, Schwartz, and Orcutt 1978
; Jones, Taylor, and Thornton 1992
; Adachi 1995
) and use trees whose branch lengths have been estimated.
|
Like many phylogenetic algorithms, our algorithm first traverses the tree from the OTUs toward the root. Upon visiting a nonroot node x, we compute for each character state i a quantity Lx(i) and a character state Cx(i). Lx(i) and Cx(i) are to be interpreted as follows: If the father of node x is assigned character state i, then finding the best reconstruction of the subtree supported by the branch between x and its father is a separate problem from reconstructing the rest of the tree. Lx(i) is the likelihood of the best reconstruction of this subtree on the condition that the father of node x is assigned character state i. Cx(i) is the character state assigned to node x in this optimal conditional reconstruction.
These concepts are best understood with an example. Consider node x = 6 in the tree in figure 2 . If one assumes its father, i.e., node 7, is assigned A, then the best reconstruction of the shaded subtree is given in figure 2b. In this case, L6(A) = 0.7 x 0.7 x 0.3 = 0.147, and C6(A) = A. By a similar argument, L6(V) = 0.55 x 0.55 x 0.45 = 0.136, and C6(V) = V, as demonstrated in the legend of figure 2 .
We now give a full description of the algorithm.
- For each OTU y perform the following:
- 1a. Let j be the amino acid at y. Set, for each amino acid i: Cy(i) = j. This implies that no matter what is the amino acid in the father of y, j is assigned to node y.
- 1b. Set for each amino acid i: Ly(i) = Pij(ty), where ty is the branch length between y and its father.
|
Computer Program
A computer program that implements the algorithm presented above, called FastML, was written in C++ for PC architecture. It is available at http://kimura.tau.ac.il/~tal. The program allows the user to obtain both joint and marginal reconstructions of amino acid sequences at all the internal nodes of a given phylogenetic tree. It also calculates the likelihoods of the reconstructed sequences and the posterior probability at each sequence position.
Empirical Example
Aligned cytochrome b amino acid sequences from a sample of mammals were taken from the data of Takezaki and Gojobori (1999)
. Phylogenetic trees with 4 to 24 sequences (from 21 taxa) were prepared by taking a subset from these sequences and using the neighbor-joining algorithm (Saitou and Nei 1987
) to reconstruct the tree. For each such tree, we compared the running time required to estimate the ancestral sequences by the three programs: FastML, PAML, and ANCESTOR.
We also compared joint versus marginal reconstruction for cytochrome b sequences of the 21 taxa in the data set. The phylogenetic tree obtained for these sequences (fig. 5
) was in agreement with the tree obtained by Takezaki and Gojobori (1999)
. Calculations of the replacement probabilities were done with the REV model (Adachi 1995
).
|
| Results |
|---|
|
|
|---|
Complexity
The program ANCESTOR was unable to reconstruct the best joint ancestral sequences for more than nine amino acid sequences. In PAML, exact search for joint reconstruction is limited to six amino acid sequences. For a larger number of taxa, both the ANCESTOR and PAML programs resort to heuristic approaches, which are not guaranteed to find the optimal reconstruction. The running time of our program (FastML) versus the running time of ANCESTOR for the exact search is given in figure 4 .
|
Joint Versus Marginal Reconstruction
Results of joint and marginal ML reconstruction of the 19 internal nodes for the tree in figure 5 are available at http://kimura.tau.ac.il/~tal. Differences between joint reconstruction and marginal reconstruction at each of the nodes are given in table 2 . Thirteen differences were found at five positions, each in different HTUs. One such example, involving the inference of sequence position 241, is shown in figure 5 .
|
| Discussion |
|---|
|
|
|---|
Previous joint reconstruction algorithms are of exponential time complexity, which severely limits their applicability to only a small number of sequences. Our algorithm overcomes this problem, reducing the complexity of the exhaustive search to linear time. Thus, the algorithm guarantees the identification of the most probable amino acid ancestral pathway of amino acid replacements even when the number of taxa is very large. Linear time algorithms are implemented in PAML and ANCESTOR only for marginal reconstruction. Nevertheless, marginal reconstruction optimizes a different criterion and can only be considered an approximation to joint reconstruction, as demonstrated in our results.
As shown in table 2 , in all but five positions, there was no difference between joint and marginal reconstruction. However, depending on the intention of the study, the differences may be important. For example, consider node 33 (fig. 5 ). If one seeks the most likely amino acid in this HTU, then it is leucine. However, when jointly reconstructing the whole tree, the most likely member of the set containing all the internal-node amino acid assignments is methionine. Deciding which is "more correct" depends on the question asked. For instance, if one wishes to count the number of threonine-to-methionine replacements over the entire tree, then the joint reconstruction should be used to obtain this number (2, in our case, on the branch connecting node 24 to node 3 and the branch connecting node 32 to node 33). However, if one wishes to synthesize the hypothetical cytochrome b sequence of the ancestor of Cetartiodactyla, then one should use the marginal reconstruction approach. We emphasize that both methods compute optimal reconstructions by using all of the available data. Discrepancies originate not from misuse of information, but from the difference in the nature of the probabilistic questions asked.
The rate of amino acid replacement is usually not constant among sites. Our algorithm finds the most likely ancestral sequence only for the case of constant rate only. There are as yet no programs for joint reconstruction that take rate variation among sites into account. In PAML, a marginal reconstruction of ancestral sequence that assumes gamma distribution among sites is available. We were unable to apply our linear algorithm to the gamma model because of the different expressions that have to be maximized.
The algorithm described above is presented in terms of amino acids and bifurcating trees. However, the algorithm can be easily adapted for nucleotide sequences and multifurcating trees.
| Acknowledgements |
|---|
|
|
|---|
This study was supported in part by a grant from the Israel Ministry of Science and a fellowship of the Clore Foundation. This study was also supported by the Magnet Da'at Consortium of the Israel Ministry of Industry and Trade and a grant from Tel Aviv University (689/96).
| Footnotes |
|---|
Shozo Yokoyama, Reviewing Editor
1 Keywords: ancestral sequences
fast algorithm
joint reconstruction
maximum likelihood
dynamic programming
molecular evolution ![]()
2 Address for correspondence and reprints: Dan Graur, Department of Zoology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Israel. E-mail: graur{at}post.tau.ac.il ![]()
| literature cited |
|---|
|
|
|---|
Adachi, J. 1995. Modeling of molecular evolution and maximum likelihood inference of molecular phylogeny. Ph.D. dissertation, Graduate University for Advanced Studies, Hayama, Japan.
Cormen, T. H., C. E. Leiserson, and R. L. Rivest. 1990. Introduction to algorithms. MIT Press, Cambridge, Mass.
Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1978. A model of evolutionary change in proteins. Pp. 345352 in M. O. Dayhoff, ed. Atlas of protein sequences and structure. Vol. 5, Suppl. 3. National Biomedical Research Foundation, Washington, D.C.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[ISI][Medline]
Fitch, W. M. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20:406416.
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275282.
Kishino, H., T. Miyata, and M. Hasegawa. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 31:151180.
Koshi, J. M., and R. A. Goldstein. 1996. Probabilistic reconstruction of ancestral protein sequences. J. Mol. Evol. 42:313320.[ISI][Medline]
Malcolm, B. A., K. P. Wilson, B. W. Matthews, J. F. Kirsch, and A. C. Wilson. 1990. Ancestral lysozymes reconstructed, neutrality tested, and thermostability linked to hydrocarbon packing. Nature 345:8689.
Saitou, N., and M. Nei. 1987. The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406425.[Abstract]
Stewart, C.-B. 1995. Active ancestral molecules. Nature 374:1213.
Swofford, D. L. 1993. PAUP: phylogenetic analysis using parsimony. Version 3.1. Illinois Natural History Survey, Champaign.
Takezaki, N., and T. Gojobori. 1999. Correct and incorrect phylogenies obtained by the entire mitochondrial DNA sequences. Mol. Biol. Evol. 16:590601.[Abstract]
Walker, A. E. 1998. Problems with parsimony in sequences of biased base composition. J. Mol. Evol. 47:686690.[ISI][Medline]
Yang, Z. 1995. PAML: a phylogenetic analysis by maximum likelihood. Version 2.0e. Pennsylvania State University, University Park.
Yang, Z., S. Kumar, and M. Nei. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:16411650.
Zhang, J., and M. Nei. 1997. Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. J. Mol. Evol. 44(Suppl. 1):s139s146.
Zhang, J., H. F. Rosenberg, and M. Nei. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:37083713.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
S. L. Kosakovsky Pond, A. F.Y. Poon, A. J. Leigh Brown, and S. D.W. Frost A Maximum Likelihood Method for Detecting Directional Evolution in Protein Sequences and Its Application to Influenza A Virus Mol. Biol. Evol., September 1, 2008; 25(9): 1809 - 1824. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Fuchs, A. J. Martin-Galiano, M. Kalman, S. Fleishman, N. Ben-Tal, and D. Frishman Co-evolving residues in membrane proteins Bioinformatics, December 15, 2007; 23(24): 3312 - 3319. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Yang PAML 4: Phylogenetic Analysis by Maximum Likelihood Mol. Biol. Evol., August 1, 2007; 24(8): 1586 - 1591. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Iwasaki and T. Takagi Reconstruction of highly heterogeneous gene-content evolution across the three domains of life Bioinformatics, July 1, 2007; 23(13): i230 - i239. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. Borenstein, T. Shlomi, E. Ruppin, and R. Sharan Gene loss rate: a probabilistic measure for the conservation of eukaryotic genes Nucleic Acids Res., January 12, 2007; 35(1): e7 - e7. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Jin, L. Nakhleh, S. Snir, and T. Tuller Maximum likelihood of phylogenetic networks Bioinformatics, November 1, 2006; 22(21): 2604 - 2611. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Yu and J. L. Thorne Dependence among Sites in RNA Evolution Mol. Biol. Evol., August 1, 2006; 23(8): 1525 - 1537. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. G. Hall Simple and accurate estimation of ancestral protein sequences PNAS, April 4, 2006; 103(14): 5431 - 5436. [Abstract] [Full Text] [PDF] |
||||
![]() |
U. Gophna, J. R. Thompson, Y. Boucher, and W. F. Doolittle Complex Histories of Genes Encoding 3-Hydroxy-3-methylglutaryl-CoenzymeA Reductase Mol. Biol. Evol., January 1, 2006; 23(1): 168 - 178. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Csuros and I. Miklos Statistical Alignment of Retropseudogenes and Their Functional Paralogs Mol. Biol. Evol., December 1, 2005; 22(12): 2457 - 2471. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. L. Kosakovsky Pond and S. D. W. Frost Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection Mol. Biol. Evol., May 1, 2005; 22(5): 1208 - 1222. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Blanchette, E. D. Green, W. Miller, and D. Haussler Reconstructing large regions of an ancestral mammalian genome in silico Genome Res., December 1, 2004; 14(12): 2412 - 2423. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Pachter and B. Sturmfels Tropical geometry of statistical models PNAS, November 16, 2004; 101(46): 16132 - 16137. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. M. Robinson, D. T. Jones, H. Kishino, N. Goldman, and J. L. Thorne Protein Evolution with Dependence Among Codons Due to Tertiary Structure Mol. Biol. Evol., October 1, 2003; 20(10): 1692 - 1704. [Abstract] [Full Text] |
||||
![]() |
E. Schadt and K. Lange Codon and Rate Variation Models in Molecular Phylogeny Mol. Biol. Evol., September 1, 2002; 19(9): 1534 - 1549. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Zhang and H. F. Rosenberg From the Cover: Complementary advantageous substitutions in the evolution of an antiviral RNase of higher primates PNAS, April 16, 2002; 99(8): 5486 - 5491. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Zhang and H. F. Rosenberg From the Cover: Complementary advantageous substitutions in the evolution of an antiviral RNase of higher primates PNAS, April 16, 2002; 99(8): 5486 - 5491. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||









