MBE Advance Access originally published online on October 19, 2006
Molecular Biology and Evolution 2007 24(1):288-293; doi:10.1093/molbev/msl153
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Performance of a New Invariants Method on Homogeneous and Nonhomogeneous Quartet Trees
Department of Matemàtica Aplicada I, Universitat Politècnica de Catalunya, Barcelona, Spain
E-mail: marta.casanellas{at}upc.edu.
| Abstract |
|---|
|
|
|---|
An attempt to use phylogenetic invariants for tree reconstruction was made at the end of the 80s and the beginning of the 90s by several researchers (the initial idea due to Lake [1987] and Cavender and Felsenstein [1987]). However, the efficiency of methods based on invariants is still in doubt (Huelsenbeck 1995; Jin and Nei 1990). Probably because these methods only used few generators of the set of phylogenetic invariants. The method studied in this paper was first introduced in Casanellas et al. (2005) and it is the first method based on invariants that uses the "whole" set of generators for DNA data. The simulation studies performed in this paper prove that it is a very competitive and highly efficient phylogenetic reconstruction method, especially for nonhomogeneous models on phylogenetic trees.
Key Words: invariants phylogenetic inference molecular evolution Markov process phylogenetic tree nonhomogenous
| Introduction |
|---|
|
|
|---|
Since the introduction of phylogenetic invariants by Cavender and Felsenstein (1987)
Phylogenetic invariants are relationships satisfied by the expected pattern frequencies occurring in sequences evolving along a given tree topology T under an evolutionary model. More precisely, if t is the set of d model parameters on T and p
(t) is the probability of observing the pattern
at the leaves of T, by letting t vary on an open subset of Rd, the probability vector p(t) = (pAA...A(t), ..., pTT...T(t)) defines a subset ST of dimension
d of R4n. A "phylogenetic invariant" is a real-valued continuous function f(x) on R4n such that f(p) = 0 for any p
ST, but not for all the points on the subset ST' determined by another tree topology T'. Essentially, the equations f(p) = 0 are satisfied for pattern frequencies arising from any model parameters on a fixed tree, so they might be used for recovering the tree topology.
In practice, the vector of observed pattern frequencies
obtained from an alignment of sequences for n taxa with enough data should approximate p(t) for some set of parameters t on a tree topology T. In other words,
should be a point close to the subset ST, so if f is an invariant for the topology T, one should have f(
) very close to 0. As the tree topology is identifiable via invariant-based methods (Allman and Rhodes 2006
), using phylogenetic invariants for tree reconstruction is a consistent method (see Hagedorn and Landweber 2000
; Felsenstein 2003
). A practical introduction to the theory of invariants can be found in the book by J. Felsenstein (2003
), Chapter 22, whereas the book by Pachter and Sturmfels (2005)
provides a beautiful insight into the applications of algebraic statistics (and in particular polynomial phylogenetic invariants) to computational biology.
There are 2 major motivations for using phylogenetic invariants in tree reconstruction. One of them is the prohibitive computational expense of a full maximum likelihood estimation of a tree, its edge lengths, a base distribution, and a rate matrix. The other is that the evolutionary models underlying the theory of invariants allow for nonhomogeneous mutation. Indeed, it is known that, for some biological data sets, different rate matrices should be allowed in different lineages. Thus, it is essential to have at our disposal phylogenetic methods for reconstructing trees admitting nonhomogeneous models (Galtier and Gouy 1998
; Yang and Yoder 1999
).
In this paper, a phylogenetic reconstruction method that uses polynomial phylogenetic invariants (introduced in Casanellas et al. 2005
) is studied and tested for quartet trees evolving under the Kimura (1981)
3-parameter model of nucleotide substitution. Actually, we consider an "algebraic" Kimura model: the parameters of the model are the entries of the substitution matrices on the edges (and not a single rate matrix together with edge lengths). Hence, the model is nonhomogenous—because it allows different rate matrices among the edges—but it is stationary (and the distribution of the bases is uniform), and we always assume that all sites are independent and identically distributed (i.i.d. hypotheses). We performed simulation studies to test its efficiency. One of our approaches to evaluate the performance and efficiency of the method is taken from Huelsenbeck (1995)
so that a large portion of the tree space is examined to get a general idea of how the algorithm performs. We present the results obtained for sequences of length 100 up to 10,000 base pairs.
We also checked the performance of the method on simulated data from nonhomogeneous models by carrying out a comparison to Neighbor-Joining algorithm (Saitou and Nei 1987
), maximum likelihood algorithm (Felsenstein 1981
), and an algorithm for a nonhomogeneous model from PAML (Yang 1997
) for sequences generated under a Kimura (1980)
2-parameter model and different rate matrices along different tree branches.
| Results |
|---|
|
|
|---|
Homogeneous Data
The performance of the invariants method studied here on homogeneous Kimura models can be seen in the figure 1. Using the approach of J.P. Huelsenbeck (1995)
– ß –
):
![]() |
= 0.1,
= 3.0, and ß = 0.5 (hence, a 5:1 transition:transversion bias) and for nucleotide sequences of lengths 100, 500, 1,000, and 10,000. See also figures III and IV, Supplementary Material online, for studies performed with other rate matrices. This figure is to be compared with those shown in Figure A2 of (Huelsenbeck 1995)
|
|
|
For length 1,000, the efficiency of the invariants method considered here is similar to that obtained for lengths
10,000 in many of the methods tested in Huelsenbeck (1995)
Nonhomogeneous Data
We tested the invariants method studied in this paper on data simulated according to a nonhomogeneous Kimura model by comparing it with other methods.
Comparison with Neighbor-Joining
First of all, we compared the performance of the invariants method presented here with Neighbor-Joining (the algorithm of Saitou and Nei 1987
) using Kimura (1981)
3-parameter distance. As it can be seen in figure 2, considering certain nonhomogeneous sets of simulated data, we found that the invariants method is more efficient than Neighbor-Joining. Indeed, we simulated data on an unrooted quartet tree evolving under the Kimura 3-parameter model where different rate matrices at each edge were chosen (see fig. 3), and we studied the efficiency of the method when varying the length of nucleotide sequences. In this case, the mean of correctly reconstructed trees for the invariants method is 90.2%, whereas the mean for Neighbor-Joining is 84%. It is worth pointing out that the use of Kimura distance is still justified in the nonhomogeneous setting, so that it is fair to compare both methods. For this particular tree, the maximum likelihood algorithm for the Kimura 3-parameter model reconstructed the tree correctly almost all times, so we do not include the results here.
Comparison with Maximum Likelihood
We compared the invariants method with 2 versions of maximum likelihood: the usual maximum likelihood for Kimura (1980)
2-parameter and nonhomogeneous maximum likelihood method developed in the package PAML (Yang 1997
) for Kimura 2-parameter model (namely, the option nhomo). This last method allows different transition/transversion ratio in different tree branches. To perform this comparison, we simulated data according to the tree in figure 3 evolving under the Kimura 2-parameter model. When
= 1, the tree is homogeneous and we studied the efficiency of the 3 methods as
increases up to 9. Figure 4 summarizes the results relative to the comparison between our method, the nonhomogeneous method in PAML for Kimura 2-parameter model (option nhomo=2 in the baseml control file) and the maximum likelihood homogeneous in PAML for Kimura 2-parameter (option nhomo=0). It shows the percentage of correctly reconstructed trees for each value of parameter
. As expected, the homogeneous maximum likelihood algorithm becomes less efficient as
increases. The use of a method assuming homogeneity on data produced in accord with a nonhomogeneous model is, of course, not recommended as such model misspecification is known to lead to unreliable results. Already for
= 5, the invariants method overtakes the homogeneous maximum likelihood algorithm. As it is deduced from figure 4, our method performs worse than the nonhomogeneous algorithm in PAML. However, it is worth noticing that in this test we generated data according to Kimura 2-parameter model and the invariants method presented here was developed under Kimura 3-parameter model. Similar results are obtained when the data evolve under Kimura 3-parameter model (see figure II, Supplementary Material online).
|
| Methods |
|---|
|
|
|---|
The phylogenetic reconstruction method used in this paper is based on phylogenetic invariants and was first introduced by the first author and L.D. Garcia and S. Sullivant in Casanellas et al. (2005)
The taxa are given by an alignment of n DNA sequences of length N. A Markov process along an unrooted binary tree of n species is assumed, and we consider that all sites are independently and identically distributed (i.i.d. hypotheses). The parameters of the model we are considering are the entries of the substitution matrices of a Kimura (1981)
3-parameter model (we should rather speak of an "algebraic" Kimura 3-parameter model, according to the book by Pachter and Sturmfels 2005
, Chapters 1 and 4). Note that in the usual Kimura 3-parameter model, a rate matrix Q is fixed and common to the whole tree and the substitution matrix is the exponential eQti (for some parameter ti representing time). However, in our method, we do not make use of rate matrices Q—we only use the substitution matrices—so, as we will see later, the rate matrices might vary among the edges. It is worth pointing out that, as we consider a Kimura 3-parameter model along all branches, the uniform distribution of base composition holds for the whole tree (stationarity hypothesis).
Phylogenetic Invariants
Sturmfels and Sullivant (2005)
gave an explicit description of the generators of the set of polynomial phylogenetic invariants I(T) for an arbitrary tree evolving under a "group-based model12." For the Kimura 3-parameter model on an unrooted 4-taxa tree, the ideal of phylogenetic invariants has 8,002 minimal generators (see the Web page of Garcia and Porter 2005
, Casanellas et al. 2005
and the discussion in Sturmfels and Sullivant 2005
, Section 7 to see why a smaller subset of invariants does not suffice). According to the results in Sturmfels and Sullivant (2005)
, we produced this generating set for an unrooted tree with 4 leaves under the Kimura 3-parameter model. This requires doing a Fourier transform (or Hadamard conjugation) on the vector of probabilities (pA...A,...,pT...T), and the phylogenetic invariants are then described as binomials in the Fourier coordinates.
Algorithm
Our tree reconstruction algorithm performs the following tasks. Given 4 aligned DNA sequences s1, s2, s3, and s4, it first computes the observed relative frequencies of each pattern for the topology ((s1, s2), s3, s4) on an unrooted quartet tree. Then it transforms these relative frequencies to Fourier coordinates. From this, we compute the Fourier coordinates in the other 2 possible topologies for unrooted trees with 4 species. We then evaluate all phylogenetic invariants for the Kimura 3-parameter model in the Fourier coordinates of each tree topology. We call sfT the absolute value of this evaluation for the polynomial f and tree topology T. From the values {sfT}f, we produce a score for each tree topology T, namely
The algorithm then chooses the topology that corresponds to the minimum score. The code was written in PERL and is available upon request. For an alignment of 4 sequences of 1,000 nt, it takes 0.35 s on a single 3.0-Ghz processor.
Software Used
The simulations of this study were obtained using the program Seq-Gen v1.3.2 (Rambaut and Grassly 1997
). We used the algorithms implemented in the package APE (Paradis et al. 2006
) v1.8-3 of R (R Development Core Team 2005
) v2.1.1 to compute Kimura (1981)
3-parameter distance and perform Neighbor-Joining algorithm (Saitou and Nei 1987
). The package PAML (Yang 1997
) was used for phylogenetic inference involving maximum likelihood methods.
| Discussion |
|---|
|
|
|---|
The simulation studies performed in this paper present a very competitive phylogenetic reconstruction method based on invariants. If one compares our results on the tree space and those of Huelsenbeck (1995)
We would like to comment on the algorithm presented here and, more generally, on all methods based on invariants. First of all, we need to emphasize that the computation of the phylogenetic invariants of a given model just needs to be done once, so the computation need not contribute to the running time of an algorithm using them. Secondly, increasing the size of the sequences does not drop the computational efficiency of the algorithm. Indeed, the sequence's length only accounts for computing the relative frequencies of the observed patterns (which is something that most algorithms based on evolutionary models must do), but it does not participate in any other part of the algorithm.
A small comment on the election of the 1-norm: we performed simulation studies not presented here to prove that the algorithm performs clearly better with 1-norm than with maximum norm and slightly better than with the euclidean norm (the "1-norm" of a vector x = (x1, ..., xn) is
the "euclidean norm" is
and the "maximum norm" is ||x||
= maxi |xi|). Another consideration that might be important for the computational efficiency of the method is that, in Fourier coordinates, the polynomials considered here are "binomials," and hence they are easy to evaluate at a given point (so there is no need to worry computationally about the evaluation of the polynomials). Moreover, as it is proved in Sturmfels and Sullivant (2005)
, these binomials have degree 4 at the most so, again, the computational cost is low. Choosing another generating set of invariants or a different weightening of the polynomials will lead to other results, so this is an issue that should certainly be studied in the future.
We implemented and tested the algorithm presented here only on 4-taxon trees. This seems a limitation of the method, but as the reader may have noticed, the method is universal and could be used to infer the topology of trees with arbitrary number of taxa. However, the computational demands of deducing large phylogenies led us not to develop this algorithm for larger trees. Instead, we suggest that invariants might be a good starting point for quartet methods of phylogenetic inference. In this direction, it is also worth thinking about new tree reconstruction algorithms for arbitrary taxa based on invariants (this is something the authors will surely work on in the future).
In this paper, we focused on the Kimura 3-parameter evolutionary model. However, a full generating set of invariants is known for any group-based model (Sturmfels and Sullivant 2005
), a large set of invariants is known for the general Markov model (Allman and Rhodes 2003
, 2004a
) and for a strand symmetric model (Casanellas and Sullivant 2005
), and some invariants are already known for certain rate-class models (Allman and Rhodes 2006
). Therefore, the method presented here can be extrapolated to these evolutionary models and in the future further models can be considered. In particular, invariants might also perform well for models allowing site-to-site variation.
As we pointed out in the introduction, invariants-based methods focus on recovering the tree topology and not estimating the parameters. Nevertheless, as Allman and Rhodes (2003
, 2004b
) say, it is fair to think that phylogenetic invariants may also be useful for parameter recovery.
As we have already mentioned, one of the advantages of the method presented here versus other methods of phylogenetic reconstruction based on evolutionary models is that the model considered here is nonhomogeneous in the sense that the rate matrix is allowed to vary along the different branches of the tree. However, as we are assuming a Kimura model, the base distribution is the same at all nodes of the tree and so the model is stationary (note that invariants methods based on other evolutionary models would not require stationarity of base distribution). For an unrooted tree with n taxa, our algebraic Kimura model has 3(2n – 3) free parameters, and it is a special case of the "general Markov model12" which involves 12(2n – 3) + 3 parameters. If one considered a Kimura 3-parameter model (not the algebraic model considered here) allowing different rate matrices along the tree one would have 3(2n – 3) + 2n – 4 (the extra parameters correspond to time parameters). Other nonhomogeneous models have been considered in the literature, see for example Galtier and Gouy (1998)
and Yang and Roberts (1995)
. The emphasis in these 2 works is put on the nonstationarity hypothesis, and the maximum likelihood approach taken in most nonhomogeneous methods makes them computationally unfeasible.
| Supplementary Material |
|---|
|
|
|---|
Supplementary figures I–IV are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
|
|
|---|
The authors would like to thank B. Sturmfels, L. Pachter, and N. Eriksson for useful comments on a draft version of the paper. We would also like to thank the reviewers for useful comments that improved the paper. Both authors were partially supported by Ministerio de Educación y Ciencia (Ramón y Cajal and Juan de la Cierva programs) and BFM2003-06001. J.F.S. was also supported by MTM2005-01518.
| Footnotes |
|---|
Peter Lockhart, Associate Editor
| References |
|---|
|
|
|---|
Allman E and Rhodes J. (2003) Phylogenetic invariants for the general Markov model of sequence mutation. Math Biosci 186:2113–144.[CrossRef][ISI][Medline]
Allman E and Rhodes J. (2004a) Phylogenetic ideals and varieties for the general Markov model. Available from: http://arxiv.org/abs/math.AG/0410604. Advances in Applied Mathematics.
Allman E and Rhodes J. (2004b) Quartets and parameter recovery for the general Markov model of sequence mutation. Appl Math Res Exp 2004:4107–131.[CrossRef]
Allman E and Rhodes J. (2006) The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J Comput Biol 13:1101–1113.[CrossRef][ISI][Medline]
Casanellas M, Garcia L, Sullivant S. (2005) Catalog of small trees. In Pachter L and Sturmfels B (Eds.). Algebraic statistics for computational biology. Chapter 15(Cambridge University Press, Cambridge (UK)).
Casanellas M and Sullivant S. (2005) The strand symmetric model. Algebraic statistics for computational biology. Chapter 16 (Cambridge University PressIn Pachter L and Sturmfels B (Eds.). , Cambridge (UK)).
Cavender J and Felsenstein J. (1987) Invariants of phylogenies in a simple case with discrete states. J Classif 4:57–71.[CrossRef]
Eriksson N. (2005) Tree construction using singular value decomposition. In Pachter L and Sturmfels B (Eds.). Algebraic statistics for computational biology. Chapter 19(Cambridge University Press, Cambridge (UK)) pp. 347–358.
Evans S and Speed T. (1993) Invariants of some probability models used in phylogenetic inference. Ann Stat 21:355–377.
Felsenstein J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376.[CrossRef][ISI][Medline]
Felsenstein J. (2003) Inferring phylogenies(Sinauer Associates Inc, Sunderland (MA)).
Ferretti V and Sankoff D. (1995) Phylogenetic invariants for more general evolutionary models. J Theor Biol 14:7–162.
Galtier N and Gouy M. (1998) Inferring pattern and process: maximum likelihood implementation of a non-homogeneous model of DNA sequence evolution for phylogenetic analysis. Mol Biol Evol 154:4871–879.
Garcia LD and Porter J. (2005) Small phylogenetic trees webpage. Available from: http://bio.math.berkeley.edu/ascb/chapter15/.
Hagedorn T and Landweber L. (2000) Phylogenetic invariants and geometry. J Theor Biol 205:365–376.[CrossRef][ISI][Medline]
Huelsenbeck J. (1995) Performance of phylogenetic methods in simulation. Syst Biol 44:17–48.[CrossRef][ISI]
Jin L and Nei M. (1990) Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol Biol Evol 7:82–102.[Abstract]
Kim YR, Kwon O, Paeng S, Park C. (2006) Phylogenetic tree constructing algorithms fit for grid computing with SVD. Available from http://arix.org/abs/q-bio.QM/0611015.
Kimura M. (1980) A simple method for estimating evolutionary rates of base substitution through comparative studies of nucleotide sequences. J Mol Evol 16:111–120.[CrossRef][ISI][Medline]
Kimura M. (1981) Estimation of evolutionary sequences between homologous nucleotide sequences. Proc Natl Acad Sci USA 78:454–458.
Lake J. (1987) A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol Biol Evol 4:167–191.[Abstract]
In Pachter L and Sturmfels B (Eds.). Algebraic statistics for computational biology (2005) (Cambridge University Press, Cambridge (UK)).
Paradis E, Strimmer K, Claude J, Jobb G, Opgen-Rhein R, Dutheil J, Noel Y, Bolker B, Lemon J. (2006) Ape: analyses of phylogenetics and evolution. R package version 1.8-3.
R Development Core Team. (2005) R: a language and environment for statistical computing. (R Foundation for Statistical Computing, Vienna (Austria)).
Rambaut A and Grassly N. (1997) Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci 13:235–238.
Saitou N and Nei M. (1987) The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:4406–425.[Abstract]
Sankoff D and Blanchette M. (1999) Phylogenetic invariants for genome rearrangements. J Comput Biol 6:431–445.[CrossRef][ISI][Medline]
Steel M, Székely L, Erd
s P, Waddell P. (1993) A complete family of phylogenetic invariants for any number of taxa under Kimura's 3ST model. N Z J Bot 31:289–296.
Sturmfels B and Sullivant S. (2005) Toric ideals of phylogenetic invariants. J Comput Biol 12:204–228.[CrossRef][ISI][Medline]
Yang Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 15:555–556.
Yang Z and Roberts D. (1995) On the use of nucleic acid sequences to infer early branchings in the tree of life. Mol Biol Evol 12:451–458.[Abstract]
Yang Z and Yoder AD. (1999) Estimation of the transition/transversion rate bias and species sampling. J Mol Evol 48:274–283.[CrossRef][ISI][Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




