MBE Advance Access originally published online on November 9, 2005
Molecular Biology and Evolution 2006 23(3):491-498; doi:10.1093/molbev/msj059
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Article |
Beyond Pairwise Distances: Neighbor-Joining with Phylogenetic Diversity Estimates

* Department of Mathematics, University of California, Berkeley and
Department of Mathematics, Duke University
E-mail: lpachter{at}math.berkeley.edu.
| Abstract |
|---|
|
|
|---|
The "neighbor-joining algorithm" is a recursive procedure for reconstructing trees that is based on a transformation of pairwise distances between leaves. We present a generalization of the neighbor-joining transformation, which uses estimates of phylogenetic diversity rather than pairwise distances in the tree. This leads to an improved neighbor-joining algorithm whose total running time is still polynomial in the number of taxa. On simulated data, the method outperforms other distance-based methods. We have implemented neighbor-joining for subtree weights in a program called MJOIN which is freely available under the Gnu Public License at http://bio.math.berkeley.edu/mjoin/.
Key Words: neighbor-joining phylogenetic diversity subtree weights
| Introduction |
|---|
|
|
|---|
Distance-based methods for phylogenetic reconstruction are based on the observation that edge-weighted phylogenetic X-trees (trees that have a set X as their leaves, all interior vertices of degree at least three and nonnegative weights
on every edge) can be encoded by certain metrics on X. Perhaps, the most fundamental result underlying distanced-based tree reconstruction is as follows.Theorem 1 (four-point condition [Buneman 1971]). Given a metric
there exists an edge-weighted phylogenetic X-tree T such that
iff
for every four leaves i, j, k, l. Furthermore, T is unique.
Such metrics are called "tree metrics," and many methods have been proposed for projecting dissimilarity maps (functions
with D(x, x) = 0 and D(x, y) = D(y, x)) to "nearby" tree metrics. The neighbor-joining algorithm, introduced by Saitou and Nei (1987)
, is the most popular and widely used. It is particularly convenient for reconstructing phylogenetic trees when the size of X is large, in which case methods that require an exhaustive exploration of the space of trees are computationally prohibitive.
There are four parts to the neighbor-joining algorithm:
- A procedure for estimating pairwise distances between elements of X.
- A criterion for identifying neighboring pendant edges (cherries) in a tree.
- A recursive reduction.
- A branch length estimation formula.
- A criterion for identifying neighboring pendant edges (cherries) in a tree.
Theorem 2 (Saitou and Nei 1987; Studier and Keppler 1988
). If D is a tree metric and
then the pair x, y that minimizes QD(x, y) is a cherry in the tree.
Although the exact formula for Q may seem a bit mysterious at first, it is a very natural criterion. For example, the neighbor-joining algorithm which is based on it is "consistent" (i.e., if D is a tree metric then the algorithm returns the tree), the input order of the taxa does not change the outcome of the algorithm, and the criterion is a linear function of the distances. Bryant (2005)
has recently shown that the neighbor-joining selection criterion Q(i, j) is the "only" one satisfying the properties above. Furthermore, Gascuel (1997b)
has shown that the neighbor-joining criterion can be interpreted as greedily minimizing a balanced minimum evolution criterion which provides added understanding as to why it has been a very successful method.
The recursive reduction step and branch length estimation formula have been examined extensively and have resulted in a number of improvements to the basic neighbor-joining algorithm. For example, the reduction step has been shown to be optimal when variances on the estimates are unknown, yet improvable when variance information is incorporated (Gascuel 1994
, 1997a
, 1997b
).
Nevertheless, the main problem with the neighbor-joining scheme is that in the first step, the distances are estimated from noisy data, and consequently, the resulting dissimilarity map is very unlikely to be a tree metric. For biological sequences, the pairwise distance estimates are typically based on a probabilistic model of evolution such as the Jukes and Cantor (1969)
model: given two sequences of length L with k differences between them, the distance is estimated as
![]() |
The variance is approximated by Kimura and Ohta (1972)
![]() |
the variance approaches infinity, which reflects the fact that long branch lengths are difficult to resolve with finite sequences. This phenomenon exists whenever branch lengths are estimated using Markov models of evolution. Although the neighbor-joining algorithm is consistent, the fact that dissimilarity maps estimated from the data are not tree metrics means that there is no guarantee that the algorithm produces the correct tree. A number of attempts have been made to understand the good results obtained with the neighbor-joining algorithm, especially given the problems with the inference procedures used for estimating pairwise distances. One of the main results is the following theorem
Theorem 3 (Atteson 1999). Neighbor-joining has l
radius 1/2.
This means that if the distance estimates are at most half the minimal edge length of the tree away from their true value, then the neighbor-joining algorithm will reconstruct the correct tree. However, as we will see in Results, this criterion is rarely attained even in cases where neighbor-joining has a high success rate.
Despite the unavailability of formal criteria for predicting the success of neighbor-joining, there have been efforts aimed at improving the distance estimates which form the input to the algorithm. For example, the TRIPLEML method (Ranwez and Gascuel 2002
) improves on the pairwise distance estimates by adjusting them using additional taxa: for each pair of leaves, a third leaf is selected and an approximate (numerical) maximum likelihood estimate for the branch lengths of the three leaf subtree is computed from which the pairwise distance of the original leaves is estimated. The use of three-way distances has also been considered by Joly and Calvé (1995)
. In the WEIGHBOR algorithm (Bruno, Socci, and Helpern 2000
), the neighbor-joining criterion is replaced so as to weight long branch lengths. These methods, and others similar to them, have the drawback that either their performance remains limited by the inherent uncertainty in pairwise distance estimates, or else the simple, natural, and mathematically justified structure of the neighbor-joining algorithm is abandoned.
It was suggested by Pachter and Speyer (2004)
that an alternative encoding of edge-weighted phylogenetic X-trees may be used to improve phylogenetic reconstruction while preserving many of the properties of distance-based methods. This alternate encoding is based on a natural extension of dissimilarity from pairs of taxa to subsets of taxa of size m (Fig. 1).
|
Let T be an edge-weighted X-tree, and let R be an m element subset of X. By [R] we denote the smallest subtree of T spanning R. The sum of the weights of the edges in [R] is called the "m-subtree weight" or "phylogenetic diversity" of R (Faith 1992
X, |R| = m,
Notice that if m = 2 and R = {i, j}, then [R] is simply the path from i to j, and DT(R) is equal to the weight of the path: DT is a tree metric and determining T from DT follows from the four-point condition. However, for general m, we have the following theorem.
Theorem 4 (Pachter and Speyer 2004). Let T be a phylogenetic X-tree (|X| = n) and m
2 be an integer. If n
2m 1, then T is determined by the m-dissimilarity map DT (and this is not true if 2m 2 = n > 2).
Instead of reconstructing trees from pairwise dissimilarity (m = 2), it was suggested that maximum likelihood methods could be used to more accurately estimate the phylogenetic diversity values D(R) for every R
X, |R| = m. Such estimates result in
values which form an m-dissimilarity map, D. The problem then is to develop consistent tree reconstruction algorithms that find a tree whose m-subtree weights are "close" to the m-dissimilarity map.
In this paper we propose a practical, efficient method for tree reconstruction based on m-dissimilarity maps. We begin by refining Theorem 4 and show that even if n < 2m 1, partial information about the tree is recoverable. We then describe a neighbor-joining algorithm whose cherry-picking criterion makes use of m-subtree weights. The algorithm is a generalization of standard neighbor-joining (in the special case m = 2, the formulas in the algorithm simplify to neighbor-joining). It also satisfies many of the same properties: the method is consistent, the input order of the taxa does not change the outcome, and the cherry-picking criterion is a linear function of the distances. In Results, we argue that it is more accurate than neighbor-joining, and the fact that it is polynomial in the number of taxa means that it is practical for the same kinds of large problems for which neighbor-joining is used. In fact, the running time for m = 3 is O(n3), the same as for standard neighbor-joining (only with a higher time constant for the initial estimation of the weights).
Our main results depend on yet another encoding of phylogenetic X-trees. Given four leaves i, j, k, l in a phylogenetic X-tree, we denote by (i, j; k, l) the edges common to the path between i and j, and the path between k and l. We say that (i, j; k, l) is a tree quartet if |(i, j; k, l)| = Ø. If q(T) denotes the set of tree quartets, then there is a partial order less than or equal to
all X-trees where T'
T iff q(T')
q(T).
Theorem 5 (Buneman 1971; Semple and Steel 2003
). Let T and T' be two phylogenetic X-trees. Then q(T) = q(T') iff T
T'.
| Tree Metrics from m-Weights |
|---|
|
|
|---|
Our main results about m-subtree weights are based on a mapping that associates to any m-dissimilarity map D a 2-dissimilarity map SD such that, if D is induced by a tree T, then SD is a tree metric that preserves a certain subforest of T. This subforest is characterized by containing those edges whose removal results in sufficiently small components in the tree. Specifically, for a tree T, the removal of any edges results in two components, and we denote by T
k the subforest of T whose edge set consists of edges whose removal results in one of the components having size at most k. For example, T
1 consists of all the pendant edges (adjacent to leaves), and T
k = T for any
because the removal of any edge in a tree leaves a component of size at most
For the tree T in figure 2 with 24 leaves, T = T
12.
|
Theorem 6. Let D be an m-dissimilarity map on a set X of size n and defineIf D = DT for some edge-weighted phylogenetic X-tree T, then SD is a tree metric. If T' is the tree corresponding to this tree metric, then T'
(1) T with T'
nm
T
nm. Furthermore, there is an invertible linear map between the edge weights in T
nm and the corresponding edge weights in T'
nm (with the exception that in the case that T
T
nm, the pendant edge weights are not uniquely determined).
One can find a detailed proof of Theorem 6 in Appendix. Intuitively, SD(i, j) may be thought of as the sum of all m-weight values containing i and j. The proof relies on counting the number of occurrences of an edge in the spanning subtrees comprising the value of SD and then showing that SD satisfies the four-point condition in exactly the same way as T. Hence, SD is a tree metric for a tree T' which is isomorphic to T. However, the edges of T' are much longer, and to recover the original edge lengths we require the following proposition.
Proposition 7. Let T, T', m, and n as in Theorem 6.
- If e is an internal edge of T
nm, then
where a and c are leaves on opposite sides of e, and Li(e) denotes the set of leaves in the component of T e that contains leaf i.
- If T = T
nm, denote the edges adjacent to the leaves by e1, ..., en (with corresponding edges in
) and the set of internal (nonpendant) edges by int(E(T)). Let
Then,
where
I is the identity matrix, and J is the matrix consisting entirely 1s.
A full proof of Proposition 7 may be found in Appendix. In order to recover wT(e) for every edge, we start by calculating the interior edge weights, after which we can calculate the values Ci. The matrix A is always invertible if m
n 1; however, calculating Ci requires that int(E(T)) = int(E(T')). If n < 2m 1, then while we can determine all the interior edge weights of T
nm from T', it is possible that some interior edges of T have been collapsed in T': in particular, the set of edges in E(T)\E(T
nm). If E(T)\E(T
nm)
Ø, then T
nm is composed of at least two connected components, and every connected component has strictly fewer than m leaves. As a result, every m-subtree weight will include at least one undetermined edge, and so there is no way to uniquely determine the weights of the pendant edges.
| Neighbor-Joining with Subtree Weights |
|---|
|
|
|---|
Theorem 6 forms the basis of the neighbor-joining algorithm with subtree weights. First, we need a generalization of the neighbor-joining criterion: for a fixed tree T and integer m, let S = SD where D is the m-dissimilarity map induced by T.
Theorem 8 (cherry-picking theorem). Let T be an edge-weighted phylogenetic X-tree with |X| = n, and let m be an integer satisfying 2m
n 2. Let
be the m-dissimilarity map corresponding to the weights of the subtrees of size m in T. If QD(x, y) is a minimal element of the matrix
then x, y is a cherry in the tree T.
Note that when m = 2, this is exactly the neighbor-joining criterion (Q-criterion of Theorem 2) as described by Studier and Keppler (1988)
.
Proof: LetBy Theorem 6 we know that S is a tree metric. Observe that
In other words, QD(i, j) is just a scalar multiple of the neighbor-joining criterion for the tree metric S. By Theorem 2 (m = 2) we know that the minimal element of QS(i, j) is a cherry in T' (the tree corresponding to the tree metric S). Because m
n 1, we know that T'
2 is isomorphic to T
2, and therefore, the minimal element of QD(i, j) is a cherry.
It follows from Theorem 6 that if
then the neighbor-joining algorithm applied directly to S is "topologically consistent," that is, will reconstruct the correct tree topology starting with the weights of all subtrees of size m. The fact that there is an invertible linear map between the edge weights means that we can reconstruct T, thus leading to a consistent neighbor-joining algorithm with subtree weights (Algorithm 1).
Algorithm 1: Neighbor-joining algorithm with subtree weights
|
The running time for computing the weights of the subtrees is O(Lnm), where L is the length of the alignment and the computation of S(i, j) is O(nm) (both steps are trivially parallelizable). The subsequent neighbor-joining is O(n3), and the edge weight reconstruction is O(n2). It is interesting to note that for fixed L the running time of the algorithm is O(n3) for both m = 2 and m = 3.
| Results |
|---|
|
|
|---|
We have implemented the neighbor-joining algorithm for subtree weights in a program called MJOIN. The implementation incorporates the fastDNAml (Olsen et al. 1994
We tested MJOIN with simulated data on the two-parameter family of trees described by Ota and Li (2000)
. These are trees for which neighbor-joining has difficulty in resolving the correct topology. We simulated 1,000 data sets on each of the two tree shapes, T1 and T2 (fig. 3) at the three edge length ratios, a/b = 0.01/0.07, 0.02/0.19, and 0.03/0.42. This was repeated twice for sequences of length 500 and 1,000 bp. We also repeated the runs with the Kimura 2-parameter model and obtained similar results (not shown).
|
Table 1 denotes the success rate of MJOIN for m = 2, 3, and 4 (denoted by NJ(m)) for each data set and compares these results to the success rate of other tree reconstruction methods. The success rate is defined as the proportion of trials in which the generating tree is reconstructed with the complete accurate tree topology. It is clear from the table that as m increases, the success rate of MJOIN increases. Hence, for m > 2, MJOIN consistently outperforms neighbor-joining (NJ(2)). For the T1 tree, NJ(4) outperforms even fastDNAml.
|
Figure 4 shows the standard deviation in the m-weights. We believe it is the relative improvement in the m-weight errors that is contributing to the improved performance of MJOIN as m increases. Checking the l
distance of the 2-distance maps from the true tree metric, we find that even in cases where neighbor-joining has a high success rate, the number of distance maps that satisfy Atteson's condition is fewer than 1%. This suggests that the success of neighbor-joining is due to other favorable features of the projection, and we believe that a deeper understanding of neighbor-joining is necessary in order to rigorously understand the reasons for the improvements with m-subtree weights.
|
| Discussion |
|---|
|
|
|---|
Theorem 6 establishes that pairwise distance-based reconstruction methods can be used to reconstruct trees from m-subtree weights. This immediately suggests a number of potential improvements to the algorithm we have described. For example, by taking into account the variances of the S(i, j), it should be possible to improve on the neighbor-joining algorithm for subtree weights with better agglomeration (as is done in BIONJ).
In tests we performed with n = 10 taxa and m = 5 (results not reported), we observed a deterioration in the accuracy of the tree reconstruction algorithm, which we attribute to inaccuracies in the subtree weights estimated with fastDNAml. In fact, tests with fastDNaml on five taxa revealed that fastDNaml fails to even reconstruct the correct tree topology a significant fraction of the time. Thus, we believe that until further improvements are made in maximum likelihood estimation of trees, the best subtree weight size to use will be m = 4. We are encouraged by various efforts in this direction (Contois and Levy 2005
; Ho
ten, Khetan, and Sturmfels 2005
).
We have found subtree weight reconstruction to be practical and efficient for much larger examples than described here. We have run the algorithm with m = 3 on trees of up to 50 taxa on a standard personal computer, and it is worth noting that for larger problems it is trivial to parallelize the m-weight estimation. Thus, we believe that our method is practical and recommended for large tree constructions that currently rely on either a pairwise distance method or a heuristic maximum likelihood search. Because the latter can fail with regularity on trees with only five taxa, it is unlikely to be accurate for large trees.
Our investigations have opened up a number of interesting questions. For example, it would be useful to obtain an analog of the four-point condition that characterizes the space of m-dissimilarity maps arising from trees. It would also be of interest to develop a subtree-weight analog of the neighbor-net algorithm (Bryant and Moulton 2004
).
Finally, we point out that our results can be viewed as providing approximations to maximum likelihood tree reconstruction by refining distance-based methods. We believe that a deeper understanding of m-dissimilarity maps should yield further results in this direction.
| Appendix |
|---|
|
|
|---|
Proof of Theorem 6 and Proposition 7
Let T be an edge-weighted phylogenetic X-tree, and let m
2 be an integer. Further, let S = SD where D = DT is the m-dissimilarity map induced by T. We first observe that any linear combination of the m-subtree weights is a linear combination of the edge weights wT(e) in the tree. For a linear function on the m-subtree weights
let vF(e) denote the coefficient of wT(e) in F. For instance,
![]() |
[R]. Note that
We use the notation Li(e) to denote the set of leaves in the component of T e that contains leaf i and we let Pab be the path from vertex a to b.
Lemma 9. Given a pair of leaves a, b, and any edge e we have
Proof: If e is on the path from a to b, then it will be included in all the subtrees [{a, b}Y]. If e is not on the path from a to b, then the only way it will be excluded is if all the other leaves fall on the a side of e (which is the same as the b side). That is, if Y
La(e)\{a, b}. There are
such sets.
Lemma 10. Given a quartet (a1, a2; a3, a4) in T with interior vertices b1 and b2 (fig. 5), then,
and
View larger version (5K):
[in this window]
[in a new window]
FIG. 5. A quartet (a1, a2; a3, a4).
Proof: We use the fact thatand apply the previous lemma. We also note that for
for all i.
Corollary 11. For a quartet (a1, a2; a3, a4), we define
Then,
![]() |
Corollary 11 implies that S satisfies the four-point condition (Theorem 1), although it may be that
which means that there are interior edges in T' which have been collapsed (with length equal to 0). Suppose, however, that (a1, a2; a3, a4)
q(T) and [{a1, a2, a3, a4}] is in a connected component of T
nm (in other words the subtree spanning the quartet consists of edges whose removal leaves a small component). This means that if
then either
or
and so S(a1, a2; a3, a4) < 0 which means that (a1, a2; a3, a4)
q(T'). Therefore q(T
nm)
q(T'
nm) and it follows from Theorem 5 that T
nm
T'
nm.
It remains to show that there is an invertible linear map between the edge weights in the forests T
nm and T'
nm.
Lemma 12. If e is an internal edge of Tnm with e' the corresponding edge in T', then
where a is a leaf in one component of T e and c is a leaf in the other.
Proof: Because e is an internal edge, we may choose a, b, c, and d such that e is the only edge on the splitting path of (a, b; c, d) (fig. 6). Then
View larger version (4K):
[in this window]
[in a new window]
FIG. 6. The quartet (a, b; c, d) has only the one edge e on its splitting path.
Rewriting terms, we have that
![]() |
Lemma 13. Denote the leaf edges of T by e1, ..., en (with corresponding edges in), and denote the internal edges by int(E(T)). Let
and let A be the matrix
where I is the identity matrix and J is the matrix consisting entirely 1s. Then
Proof: The interior vertex of an edge e also adjacent to a leaf i is incident to two other edges. Choose a leaf a such that Pia intersects one of the edges and b such that Pib intersects the other (fig. 7). Thenwhich after some algebra gives the result.
View larger version (4K):
[in this window]
[in a new window]
FIG. 7. The leaf edge ei is incident on two other edges. We may choose leaves a and b such that Pia Pib = ei.
| Acknowledgements |
|---|
|
|
|---|
We thank the anonymous referees for comments that improved the manuscript. This work was partially funded by the National Institutes of Health (NIH) grant (R01HG2362). L.P. was also supported by a Sloan foundation fellowship. D.L. was also supported by NIH grant (GM68423).
| Footnotes |
|---|
William Martin, Associate Editor
| References |
|---|
|
|
|---|
Atteson, K. 1999. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica 25:251278.[CrossRef]
Bruno, W., N. Socci, and A. Halpern. 2000. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17:189197.
Bryant, D. 2005. On the uniqueness of the selection criterion in neighbor-joining. J. Classif. 22:315.
Bryant, D., and V. Moulton. 2004. NeighborNet: an agglomerative algorithm for the construction of planar phylogenetic networks. Mol. Biol. Evol. 21:255265.
Buneman, P. 1971. The recovery of trees from measures of dissimilarity. Pp. 387395 in F. Hodson, D. Kendall, and P. Tautu, eds. Mathematics in the archaeological and historical sciences. Edinburgh University Press, Edinburgh, United Kingdom.
Contois, M., and D. Levy. 2005. Small trees and generalized neighbor joining. Pp. 335346 in L. Pachter and B. Sturmfels, eds. Algebraic statistics for computational biology. Cambridge University Press, Cambridge.
Faith, D. 1992. Conservation evaluation and phylogenetic diversity. Biol. Conserv. 61:110.
Gascuel, O. 1994. A note on Sattath and Tversky's Saitou and Nei's and Studier and Keppler's algorithms for inferring phylogenies from evolutionary distances. Mol. Biol. Evol. 11:961963.[ISI][Medline]
. 1997a. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685695.[Abstract]
. 1997b. Concerning the NJ algorithm and its unweighted version, UNJ. Pp. 149170 in B. Mirkin, F. McMorris, F. Roberts, and A. Rhetsky, eds. Mathematical hierarchies and biology. American Mathematical Society, Providence, R.I.
Ho
ten, S., A. Khetan, and B. Sturmfels. 2005. Solving the likelihood equations. Found. Comput. Math. (in press).
Joly, S., and G. Calvé. 1995. Three way distances. J. Classif. 12:191205.
Jukes, T., and C. Cantor. 1969. Evolution of protein molecules. Pp. 2132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
Kimura, M., and T. Ohta. 1972. On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol. 2:8790.[CrossRef][ISI][Medline]
Olsen, G., H. Matsuda, R. Hagstrom, and R. Overbeek. 1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10:4148.
Ota, S., and W. Li. 2000. NJML: a hybrid algorithm for the neighbor-joining and maximum likelihood methods. Mol. Biol. Evol. 17:14011409.
Pachter, L., and D. Speyer. 2004. Reconstructing trees from subtree weights. Appl. Math. Lett. 17:615621.[CrossRef]
Ranwez, V., and O. Gascuel. 2002. Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets. Mol. Biol. Evol. 19:19521956.
Saitou, N., and M. Nei. 1987. The neighbor joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406425.[Abstract]
Semple, C., and M. Steel. 2003. Phylogenetics. Oxford lecture series in mathematics and its applications, Vol. 24. Oxford University Press, Oxford.
Studier, J., and K. Keppler. 1988. A note on the neighbor-joining method of Saitou and Nei. Mol. Biol. Evol. 5:729731.[ISI][Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
there exists an edge-weighted phylogenetic X-tree T such that
iff 







) and the set of internal (nonpendant) edges by int(E(T)). Let

I is the identity matrix, and J is the matrix consisting entirely 1s.
be the m-dissimilarity map corresponding to the weights of the subtrees of size m in T. If QD(x, y) is a minimal element of the matrix 
By Theorem 6 we know that S is a tree metric. Observe that 






.



Y]. If e is not on the path from a to b, then the only way it will be excluded is if all the other leaves fall on the a side of e (which is the same as the b side). That is, if Y
such sets. 



and apply the previous lemma. We also note that for
for all i. 





where I is the identity matrix and J is the matrix consisting entirely 1s. Then 


Pib = ei.