MBE Advance Access originally published online on November 30, 2005
Molecular Biology and Evolution 2006 23(3):626-632; doi:10.1093/molbev/msj069
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Article |
Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions


* School of Computer Science, Tel-Aviv University, Israel;
Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand; and
Mathematics Department, University of California, Berkeley
E-mail: m.hendy{at}massey.ac.nz.
| Abstract |
|---|
|
|
|---|
Maximum likelihood (ML) is a popular method for inferring a phylogenetic tree of the evolutionary relationship of a set of taxa, from observed homologous aligned genetic sequences of the taxa. Generally, the computation of the ML tree is based on numerical methods, which in a few cases, are known to converge to a local maximum on a tree, which is suboptimal. The extent of this problem is unknown, one approach is to attempt to derive algebraic equations for the likelihood equation and find the maximum points analytically. This approach has so far only been successful in the very simplest cases, of three or four taxa under the Neyman model of evolution of two-state characters. In this paper we extend this approach, for the first time, to four-state characters, the Jukes-Cantor model under a molecular clock, on a tree T on three taxa, a rooted triple. We employ spectral methods (Hadamard conjugation) to express the likelihood function parameterized by the path-length spectrum. Taking partial derivatives, we derive a set of polynomial equations whose simultaneous solution contains all critical points of the likelihood function. Using tools of algebraic geometry (the resultant of two polynomials) in the computer algebra packages (Maple), we are able to find all turning points analytically. We then employ this method on real sequence data and obtain realistic results on the primate-rodents divergence time.
Key Words: maximum likelihood phylogenetic trees Jukes-Cantor Hadamard conjugation analytical solutions symbolic algebra
| Introduction |
|---|
|
|
|---|
Maximum likelihood (ML) is increasingly used as an optimality criterion for selecting evolutionary trees (Felsenstein 1981
|
The change from two to four character states incurs a major increase in the complexity of the underlying algebraic system. Like previous works, our starting point is to present the general ML problem on phylogenetic trees as a constrained optimization problem. This gives rise to a complex system of polynomial equations, and the goal is to solve them. The complexity of this system prevents any manual solution, so we relied heavily on Maple, a symbolic mathematical system. However, even with Maple, a simple attempt to solve the system (e.g., just typing solve) fails, and additional tools are required. Spectral analysis (Hendy and Penny 1993
lgp/small-trees/.
It is evident that the currently available heuristic methods can fail to infer the correct tree, even for small number of taxa. This is true not only in the presence of multiple ML points but also in cases where the neighborhood of the (single) ML point is relatively flat. Therefore, a practical consequence of this work is the use of rooted triplets in supertree methods (constructing a big tree from small subtrees). For unrooted trees, the subtrees must have at least four leaves (e.g., "the tree from quartets" problem). For rooted trees, it is sufficient to amalgamate a set of rooted triplets (Aho et al. 1981
). Our work enables such approaches to rely on rooted ML triplets based on four characters states rather than two.
Additionally, analytic solutions are able to reveal properties of the ML points that are not obtainable numerically. For small trees, our result can serve as a method for checking the accuracy of the heuristic methods used in practice.
The remainder of this work is organized as following. In the next section we describe the materials and methods we used in this work. Specifically, in Definitions, Notations, and the Hadamard Conjugation we provide definitions and notations used throughout the rest of this work. In Definitions, Notations, and the Hadamard Conjugation we develop the identities and structures induced by the Jukes-Cantor model, while in Obtaining the ML Solution we develop ML on phylogenetic trees and subsequently solves the system for the special case of three species under Jukes-Cantor and molecular clock. In Results and Discussion we give results and discuss future research directions: Results on Genomic Sequences reports on experimental results of applying our method on real genomic sequences, while in Directions for Future Research we conclude with three open problems.
| Materials and Methods |
|---|
|
|
|---|
Here we detail on the methods and tools we employed in order to obtain our results. These are developed specifically for the tree T on three taxa referred to as 1, 2, and 3. We label the leaves of T as 1, 2, and 3 and the edges (branches) as e1, e2, and e3, as illustrated in figure 1a. Taxon 3 is chosen to be the reference taxon. Our analysis is focused on the site substitutions required to transform the reference sequence to those of 1 and 2 under Kimura's 3-substitution model (K3ST) (Kimura 1981
Definitions, Notations, and the Hadamard Conjugation
Given aligned nucleotide sequences for the three taxa 1, 2, and 3, there are 43 nucleotide combinations possible at each site. We refer to each of these as a "character pattern." A character pattern
can be described by the state
3 at taxon 3, together with the substitutions
Kimura (1981)
in his K3ST model, describes three classes of substitution: the transitions; and two types of transversions. Following Hendy and Snir (2005)
, we use
to denote a transition, ß and
to denote the two transversion types, together with
to indicate no substitution, as described in figure 2. For example, the character pattern
is obtained with the assignment of C to taxon 3, and the substitution pattern
for taxa 1 and 2.
|
In Hendy, Penny, and Steel (1994)
is indexed by the pair of subsets ({1, 2}, {2}), and the probability of obtaining this substitution pattern is written as s12,2.
The 16 site pattern probabilities are arranged as a 4 x 4 matrix
![]() |
In the Jukes-Cantor model, the expected numbers of the three substitution types on an edge ei are the same, that is,
![]() |
![]() |
Theorem 1where
(1)
(2) and the log function ln is applied to each component of the matrix HQH.
Equation (2) in Theorem 1 is equivalent to earlier expressions (Steel et al. 1992
; Székely et al. 1993
) of Hadamard conjugations for the K3ST model, which used Hadamard matrices of 2n rows and columns applied to vectors of 2n entries, with qi = q
(ei) = qß(ei) = q
(ei). The current expression has recently been developed in Hendy and Snir (2005)
. A full proof of this result follows directly from applying Theorem 7 of Hendy and Snir (2005)
on the tree T. This approach has been followed in order to clarify the role of path sets, which explain the intermediate terms in the conjugations, enabling us to derive and interpret the values in equation (8) directly.
We now define several auxiliary matrices that will be useful in the sequel:
![]() |
![]() |
![]() |
![]() | (3) |
![]() | (4) |
![]() | (5) |
![]() | (6) |
![]() | (7) |
![]() |
![]() |
![]() | (8) |
![]() | (9) |
From Theorem 1, we find the expected sequence spectrum is
![]() | (10) |
![]() | (11) |
![]() | (12) |
In particular, when we impose the molecular clock condition q1 = q2, and hence x1 = x2, equation (12) reduces to
![]() | (13) |
Obtaining the ML Solution
When we have an aligned sequence of length c for each of the three taxa 1, 2, and 3, we can count the relative frequencies (normalized to sum to 1) of the substitution patterns among the sites. For (X, Y) | X, Y
{1, 2}, we set fX,Y to be the relative frequency of observing the site pattern (X, Y), and let F be the 4 x 4 matrix with entries f(X, Y) indexed as in the sequence spectrum S. If the sequences were generated under the Jukes-Cantor model on the tree T with edge weights q1, q2, q2, then the expected number of substitution pattern (X, Y) is fX,Y = csX,Y, where c is the number of sites.
However, in c independent samples, the observed frequencies will include sampling error, so we cannot directly conclude S = F. ML provides a method of estimating the parameters q1, q2, and q3 from the observed frequencies F.
Given a set of edge lengths q1, q2, and q3, the "likelihood" of observing F under the Jukes-Cantor model on T is
![]() | (14) |
![]() | (15) |
![]() |
![]() |
![]() |
![]() |
![]() |
In contrast to previous works (Chor et al. 2000
This eliminates the need to introduce the constraints of the ML points being on a "tree surface." By the chain rule, we get:
![]() | (16) |
![]() | (17) |
) we have from equation (14)
![]() | (18) |
![]() |
x3. In our analysis below, we will explicitly impose the equality x1 = x2 to find the turning points. The inequality will need to be tested on any potential solution and, if it were not satisfied, a maximum could be sought on the boundary of the valid tree domain, where x1 = x2 = x3.
The constraint x1 = x2 implies a1 = a2, so by replacing x1 and a1, equation (18) reduces to
![]() | (19) |
We now show that the system of two resulting polynomials {E1, E2} has only finitely many solutions, all of which we can find. The major tool used here is the "resultant" of two polynomials. Let
and
be two polynomials in one variable, x. The resultant of f and g, denoted Res(f, g, x), is a polynomial in the coefficients ai and bj of f and g, which is 0 whenever f and g have a common zero. The coefficients can themselves be unknowns or functions of other variables, in which case, the resultant replaces the two polynomials f and g by a single polynomial in one fewer variable.
Computing the resultant is a classical technique for eliminating one variable from two equations. There is an elegant formula for computing it due to Sylvester and another due to Bezout, which have been implemented in the computer algebra package Maple.
We can compute the resultant ER = Res(E1, E2, x3) of E1 and E2 with respect to x3. This eliminates x3 from the equations and yields a single polynomial ER, in just x2 and the parameters. The polynomial ER has the form:
![]() | (20) |
Theorem 2 The turning points of L (eq. 14) corresponding to realistic trees (namely, trees with positive edge lengths) are exactly the roots of P0.
Proof. The only factors in ER (eq. 20) that admit positive real roots are P0 andHowever, x2 = 1 implies q2 = 0, which is not realistic.
Corollary 3 The Jukes-Cantor triplet has a finite number of ML points.
Proof. P0 has at most 11 different solutions and, for each such solution, we can back substitute to obtain all the values of x3.
Maxima on the Boundaries of the Parameter Space
The realistic parameter space R is bounded by the constraints
![]() |
![]() |
We identify three boundary subspaces
![]() |
gives a single polynomial constraint.
In case I, with x3 = 0, x = x1 = x2, we find
![]() |
![]() |
In case II, with x = x1 = x2 = x3, we find
![]() |
In case III, with x = x3, x1 = x2 = 1, L = 0 (unless f1 = f2 = f4 = 0, whence L is undefined).
| Results and Discussion |
|---|
|
|
|---|
Here we report experimental results obtained by employing our method on genomic sequences. We conclude by discussion and pointing out future research directions.
Results on Genomic Sequences
In order to evaluate our method, we tested it on three sets of real genomic sequences. In each case, we found a single solution to P0 = 0 in the realistic parameter space, for each of the three rooted triples. By evaluating the likelihood at each of these turning points, we found each was a maximum.
We looked at the natural killer cell receptor D gene on human, mouse, and rat (accession numbers AF260135, AF030313 and AF009511, respectively). We aligned the sequences using ClustalW (Thompson, Higgins, and Gibson 1994
). Next, we computed the observed sequence spectrum F (eq. 21, as explained in Obtaining the Maximum Likelihood Solution).
![]() | (21) |
We also calculated the ML value for each of the three rooted trees for the beta actin gene, for the three species guinea pig, goose, and Caenorhabditis elegans (accession numbers AF508792, M26111, and NM_076440, respectively), finding the ((guinea pig, goose), C. elegans) tree maximal, with q1 = q2 = 0.021819 and q3 = 0.050188 giving ln L = 1241.5. Finally, we calculated the ML value for each of the three rooted trees for the histone gene of Drosophila melangoster, Hydra vulgaris, and human (accession numbers AY383571, AY383572, and NM_002107, respectively), finding the ((D. melangoster, H. vulgaris), Human) tree maximal, with q1 = q2 = 0.001555 and q3 = 0.012740 with ln L = 86.835133.
Each of the results above agree closely with the numerical values obtained using the popular phylogenetic reconstruction packages PHYLIP (Felsenstein 1995
) and PAUP* (Swofford 1998
), which use iterative methods to estimate the maxima. In each case, the likelihood function had a unique maximum in the parameter range.
Directions for Future Research
The progress made here brings up a number of open problems:
- Our ML solutions are derived from the roots of a univariate, degree 11 polynomial. This implies that the number of ML solutions is finite. It would be interesting to explore the question of "uniqueness" of the solution. If this is the case, it will most likely follow from the existence of a single solution corresponding to a realistic tree, as in Chor, Khetan, and Snir (2003)
.
- The Jukes-Cantor substitution model is a special case of the family of Kimura substitution models. It would be interesting to further extend the result in this paper for the other models (two and three parameters) of the Kimura family.
- It would be interesting to extend these results to rooted trees with "four leaves" under Jukes-Cantor model and a molecular clock. Here we have two different topologiesthe fork and the comb (Chor, Khetan, and Snir 2003
). It is expected that such extension will face substantial technical difficulties.
- The Jukes-Cantor substitution model is a special case of the family of Kimura substitution models. It would be interesting to further extend the result in this paper for the other models (two and three parameters) of the Kimura family.
| Appendix |
|---|
|
|
|---|
The polynomial P0 is presented where the coefficients are functions of the observed pattern frequencies, fi with f12 = f1 + f2. Using Maple we find:
|
|
| Acknowledgements |
|---|
|
|
|---|
Thanks to Joseph Felsenstein for his fruitful discussions, to Bernd Sturmfels for enlightening comments on this manuscript and informing us about his manuscript (Hosten, Khetan, and Sturmfels 2004), and to the two other reviewers for their constructive comments. This research was supported by the Israel Science Foundation grant 418/00.
| Footnotes |
|---|
William Martin, Associate Editor
| References |
|---|
|
|
|---|
Aho, A. V., Y. Sagiv, T. G. Szymanski, and J. D. Ullman. 1981. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. 10:405421.[CrossRef]
Catanese, F., S. Hosten, A. Khetan, and B. Sturmfels. 2004. The maximum likelihood degree. (http://front.math.ucdavis.edu/math.AG/0406533).
Chor, B., M. Hendy, B. Holland, and D. Penny. 2000. Multiple maxima of likelihood in phylogenetic trees: an analytic approach. Mol. Biol. Evol. 17:15291541. (Earlier version appeared in RECOMB 2000).
Chor, B., M. Hendy, and D. Penny. 2001. Analytic solutions for three taxon mlmc trees with variable rates across sites. Pp. 204213 in B. Moret and O. Gascuel, eds. Lecture notes in Computer Science 2149 in Workshop on Algorithms in BioInformatics 2001. Springer-Verlag, Heidelberg, Germany.
Chor, B., A. Khetan, and S. Snir. 2003. Maximum likelihood on four taxa phylogenetic trees: analytic solutions. Pp. 7683 in W. Miller, M. Vingron, S. Istrail, P. Pevzner, and M. Waterman, eds. Proceedings of the Seventh Annual International Conference on Computational Molecular Biology (RECOMB). AMC Press, Berlin.
Chor, B., and S. Snir. 2004. Molecular clock fork phylogenies: closed form analytic maximum likelihood solutions. Syst. Biol. 53(6):963967.[CrossRef][ISI][Medline]
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368376.[CrossRef][ISI][Medline]
. 1995. PHYLIP (phylogeny inference package). Distributed by the author, Department of Genetics, University of Washington, Seattle.
Hendy, M. D., and D. Penny. 1993. Spectral analysis of phylogenetic data. J. Classif. 10:524.
Hendy, M. D., D. Penny, and M. A. Steel. 1994. Discrete Fourier analysis for evolutionary trees. Proc. Natl. Acad. Sci. USA 91:33393343.
Hendy, M. D., and S. Snir. 2005. Hadamard conjugation for the Kimura 3st model: combinatorial proof using pathsets. (http://arxiv.org/abs/q-bio.PE/0505055).
Hosten, S., A. Khetan, and B. Sturmfels. 2004. Solving the likelihood equations. (http://front.math.ucdavis.edu/math.ST/0408270).
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
Kimura, M. 1981. Estimation of evolutionary sequences between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78:454458.
Neyman, J. 1971. Molecular studies of evolution: a source of novel statistical problems. Pp. 127 in S. Gupta and Y. Jackel, eds. Statistical decision theory and related topics. Academic Press, New York.
Steel, M., M. D. Hendy, and D. Penny. 1998. Reconstructing phylogenies from nucleotide pattern probabilities: a survey and some new results. Discrete Appl. Math. 88:367396.
Steel, M. A., M. D. Hendy, L. A. Székely, and P. L. Erdös. 1992. Spectral analysis and a closest tree method for genetic sequences. Appl. Math. Lett. 5:6367.
Sturmfels, B., and S. Sullivant. 2005. Toric ideals of phylogenetic invariants. J. Comput. Biol. 12:204228.[CrossRef][ISI][Medline]
Swofford, D. L. 1998. PAUP*beta. Sinauer Associates, Sunderland, Mass.
Székely, L., P. L. Erdös, M. A. Steel, and D. Penny. 1993. A Fourier inversion formula for evolutionary trees. Appl. Math. Lett. 6:1317.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalty and weight matrix choice. Nucleic Acids Res. 22:46734780.
Yang, Z. 2000. Complexity of the simplest phylogenetic estimation problem. Proc. R. Soc. Lond. B 267:109116.[Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

q3, then this tree satisfies the molecular clock as the path lengths from r to each leaf are the same.





and the log function ln is applied to each component of the matrix HQH.




























However, x2 = 1 implies q2 = 0, which is not realistic. 








