MBE Advance Access originally published online on July 28, 2006
Molecular Biology and Evolution 2006 23(11):1997-2000; doi:10.1093/molbev/msl072
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Review |
Neighbor-Joining Revealed

* LIRMM, Montpellier, France
Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand
E-mail: gascuel{at}lirmm.fr; M.Steel{at}math.canterbury.ac.nz.
| Abstract |
|---|
|
|
|---|
It is nearly 20 years since the landmark paper (Saitou and Nei 1987) in Molecular Biology and Evolution introducing Neighbor-Joining (NJ). The method has become the most widely used method for building phylogenetic trees from distances, and the original paper has been cited about 13,000 times (Science Citation Index). Yet the question "what does the NJ method seek to do?" has until recently proved somewhat elusive, leading to some imprecise claims and misunderstanding. However, a rigorous answer to this question has recently been provided by further mathematical investigation, and the purpose of this note is to highlight these results and their significance for interpreting NJ. The origins of this story lie in a paper by Pauplin (2000) though its continuation has unfolded in more mathematically inclined literature. Our aim here is to make these findings more widely accessible.
Key Words: distance method algorithm phylogenetic criterion minimum evolution consistency Neighbor-Joining
First we review briefly the Neighbor-Joining (NJ) method and outline what it does "not" do. NJ builds a tree from a matrix of pairwise evolutionary distances relating the set of taxa being studied. The distance between any taxon pair i and j is denoted as d(i, j) and can be obtained from sequence data by a variety of approaches, for example, using Kimura's (1980)
2-parameter estimate. NJ iteratively selects a taxon pair, builds a new subtree, and agglomerates the pair of selected taxa to reduce the taxon set by one (fig. 1). Pair selection is based on choosing the pair i, j that minimizes the following Q criterion:
|
| (1) |
![]() | (2) |
|
| (3) |
|
Again, this formula is that of Studier and Keppler, not that of Saitou and Nei, but they are equivalent and the 2 NJ versions always reconstruct the same tree, both in terms of topology and branch lengths (Gascuel 1994
Even though NJ behaved well, both with simulated and real data (Saitou and Nei 1987
; Saitou and Imanishi 1989
; Kuhner and Felsenstein 1994
), mathematical foundations of these 3 formulas only became clear over several years, following mathematical investigations. The 2 main questions related to consistency (i.e., whether NJ correctly reconstructs a tree when the distances fit perfectly on that tree) and to the criterion that NJ optimizes. The consistency proofs of Saitou and Nei and Studier and Keppler were contested by Mirkin (1996)
but fixed by Gascuel (1997)
and Atteson (1999)
. The latter even proved a stronger result, showing that NJ still reconstructs the correct tree when the distance matrix is perturbed by small noise and that NJ is optimal regarding tolerable noise amplitude. Recently, Bryant (2005)
provided a simple and elegant proof of NJ consistency, and so the first question has been well and truly laid to rest.
The second question was that of the criterion being optimized and of its biological relevance. Indeed, most phylogenetic methods are based on explicit criteria, for example, parsimony or maximum likelihood, and understanding properties of these criteria is a central issue, which is independent of (and should precede) the design of optimization algorithms. Saitou and Nei (see also Saitou 1996
; Nei and Kumar 2000
) showed that the selection criterion (1) is related to minimization of tree length, which is conceptually close to parsimony. Assuming that the current taxon set (a, b, c, d, e, f, and g in fig. 1a) only contains original taxa (i.e., not resulting from previous agglomerations, as is the case with a, f, and g), they proved that selection criterion (1) is equivalent to minimizing the ordinary (unweighted) least-squares (OLS) length estimate (Felsenstein 2004
, p 148) of the agglomerated tree (as shown in fig. 1b, but removing dashed subtrees). Accordingly, they called this approach the minimum evolution (ME) principle, and the criterion being minimized by NJ was thought to be the OLS tree length estimate. However, this result was not fully convincing. First, the biological meaning and consistency of this ME + OLS principle was not established at the time of its publication. This was partly solved by Rzhetsky and Nei (1993)
who demonstrated ME + OLS consistency; but the biological relevance of the OLS component was still questionable because it does not account for the high variance of the large evolutionary distances, as weighted least-squares (WLS) approaches do (Fitch and Margoliash 1967
). Second, the result of Saitou and Nei applies to the first NJ step where we only have original taxa in the taxon set but not to the subsequent steps; assuming that the current taxon set contains agglomerated nodes (e.g., a, f, and g in fig. 1), the OLS tree length estimate of the agglomerated tree (fig. 1b) differs from the estimate of Saitou and Nei. Thus, NJ performs local optimization to select taxon pairs but is not truly guided by minimization of the OLS tree length estimate. Several simulation studies (Saitou and Imanishi 1989
; Kumar 1996
; Gascuel 2000
) showed that trees being shorter (in the OLS sense) than NJ trees are easily found, but these simulations also showed that these shorter trees are (slightly) less accurate than NJ trees, thus demonstrating that 1) ME + OLS does not fit the phylogenetic requirements well and 2) NJ cannot be viewed as optimizing this criterion.
Due to this confused situation (consistency, biological relevance, and the fact that NJ does not fully minimize the OLS tree length estimate), numerous authors (e.g., Swofford et al. 1996
, p 490; Gascuel 2000
; Felsenstein 2004
, p 169) wrote that NJ does not explicitly optimize any criterion. In fact, NJ greedily optimizes a natural tree length estimate, as we shall explain now. In that respect, NJ does not differ from the other usual phylogenetic methods, for example, based on parsimony or maximum likelihood. It performs a heuristic search of tree space where each step is guided by this (global) tree length criterion. However, just as other standard methods, it does not explore the whole space and has no guarantee of finding the "best" tree for the criterion it seeks to optimize.
Given a tree with branch lengths, its length (l) is the total sum of all its branch lengths. We can express l as a weighted sum of pairwise distances between pairs of taxa, as has been observed by various authors (e.g., Yushmanov 1984
; Makarenkov and Leclerc 1997
; Sumiyama et al. 2001
). For example, the reader should convince oneself that for the tree at the top of figure 2 we can write
|
| (4) |
|
| (5) |
|
As Pauplin observed, this suggests a new version of the ME principle: simply select the tree that minimizes tree length estimate in equation (5), instead of using OLS estimator as proposed by Saitou and Nei. We designed fast algorithms following this new "balanced minimum evolution" (BME) scheme (Desper and Gascuel 2002
Equation (5) also provides a way of quantifying any modification of an existing tree by considering the change in the l value. In particular, we can consider the selection of a pair of taxa, as occurs in the selection of neighbors by NJ (fig. 1), and this brings us to the main result of this note, whose proof appears in Desper and Gascuel (2005)
.
Theorem. The NJ method, as defined by equations (1), (2), and (3), selects at each step as neighbors that pair of current taxa, which most decreases the whole tree length, as computed using the generalized Pauplin formula (eq. 5).
Whole tree means that we consider all subtrees resulting from previous agglomerations and not only the central node. With figure 1, this implies that among all possible agglomerations within the current taxon set (i.e., (a, b), (a, c), ..., until (f, g)), NJ selects that pair which minimizes the length of the whole tree shown in figure 1b, including the subtrees and the original taxa a', a'', f', f'', etc. In other words, NJ greedily minimizes the BME score. Moreover, this explains why FastME performs better than NJ: it is based on the same BME criterion but optimizes it further via topological rearrangements. This does not provide any guarantee on a given data set as the correct tree may not be the one with best fit, whatever the phylogenetic criterion being optimized (e.g., see Nei et al. 1998
). But, on average, topological accuracy is improved by intensifying criterion optimization, as clearly shown by the comparisons of Vinh and von Haeseler (2005)
between NJ and FastME.
We end by noting some other recent mathematical insights into NJ. Bryant (2005)
has provided a different way of viewing the way that NJ selects pairs of taxa. Rather than showing that this selection optimizes some global quantity (the BME score) at each step, Bryant lists 3 desirable properties that any method for building trees from distances should possess if it operates (like NJ does) by successively grouping pairs of taxa. These properties are related to consistency and to taxon exchangeability. Assuming this, he proves that NJ selection criterion is unique among all possible linear criteria. In a further development, Levy et al. (2005)
have shown how to modify the NJ algorithm to move beyond pairwise distances to more informative multiway distances (as might be obtained using maximum likelihood on sequences). Most recently, Mihaescu et al. (2006)
have verified a conjecture of Atteson (1999)
by establishing a robustness property of NJ to small perturbations in the data.
| Acknowledgements |
|---|
|
|
|---|
Many thanks to Richard Desper and Charles Semple, who coauthored several of our articles on the topics that are described here.
| Footnotes |
|---|
Naruya Saitou, Associate Editor
| References |
|---|
|
|
|---|
Atteson K. (1999) The performance of the neighbor-joining methods of phylogenetic reconstruction. Algorithmica 25:25178.[CrossRef]
Bryant D. (2005) On the uniqueness of the selection criterion in neighbor-joining. J Classif 22:315.
Desper R and Gascuel O. (2002) Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J Comput Biol 9:687705.[CrossRef][Web of Science][Medline]
Desper R and Gascuel O. (2004) Theoretical foundations of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting. Mol Biol Evol 21:58798.
Desper R and Gascuel O. (2005) The minimum evolution distance-based approach to phylogenetic inference. In Gascuel O (Ed.). Mathematics of evolution & phylogeny(Oxford University Press, Oxford, UK) pp. 132.
Felsenstein J. (2004) Inferring phylogenies (Sinauer Associates, Sunderland, MA).
Fitch WM and Margoliash E. (1967) Construction of phylogenetic trees. Science 155:27984.
Gascuel O. (1994) A note on Sattath and Tversky's, Saitou and Nei's and Studier and Keppler's algorithms for inferring phylogenies from evolutionary distances. Mol Biol Evol 11:9613.[Web of Science][Medline]
Gascuel O. (1997) Concerning the NJ algorithm and its unweighted version, UNJ. In Mirkin B, McMorris FR, Roberts FS, Rzhetsky A (Eds.). Mathematical hierarchies and biology. DIMACS series in discrete mathematics and theoretical computer science(American Mathematical Society, Providence, RI) pp. 14970.
Gascuel O. (2000) On the optimization principle in phylogenetic analysis and the minimum-evolution criterion. Mol Biol Evol 17:4015.
Kimura M. (1980) A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J Mol Evol 16:11120.[CrossRef][Web of Science][Medline]
Kuhner MK and Felsenstein J. (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol 11:45968.[Abstract]
Kumar S. (1996) A stepwise algorithm for finding minimum evolution trees. Mol Biol Evol 13:58493.[Abstract]
Levy D, Yoshida R, Pachter L. (2005) Beyond pairwise distances: neighbor joining with phylogenetic diversity estimates. Mol Biol Evol (Advanced access November 9, 2005).
Makarenkov V and Leclerc B. (1997) Circular orders of tree metrics, and their uses for the reconstruction and fitting of phylogenetic trees. In Mirkin B, McMorris FR, Roberts F, Rzhetsky A (Eds.). Mathematical hierarchies and biology. DIMACS series in discrete mathematics and theoretical computer science(American Mathematical Society, Providence, RI) pp. 183208.
Mihaescu R, Levy D, Pachter L. (2006) Why neighbour-joining works. arXiv:cs.DS/0602041 v1.
Mirkin B. (1996) Mathematical classification and clustering. (Kluwer Academic Publishers, London).
Nei M and Kumar S. (2000) Molecular evolution and phylogenetics(Oxford University Press, Oxford, UK).
Nei M, Kumar S, Takahashi K. (1998) The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc Natl Acad Sci 95:123907.
Pauplin Y. (2000) Direct calculation of a tree length using a distance matrix. J Mol Evol 51:417.[Web of Science][Medline]
Rzhetsky A and Nei M. (1993) Theoretical foundation of the minimum-evolution method of phylogenetic inference. Mol Biol Evol 10:107395.[Abstract]
Saitou N. (1996) Reconstruction of gene trees from sequence data. In Doolittle R (Ed.). Methods in enzymology(Academic Press, Inc, Orlando, FL) Volume 266: pp. 42749.[Web of Science][Medline]
Saitou N and Imanishi M. (1989) Relative efficiencies of the Fitch-Margoliash, maximum-parsimony, maximum-likelihood, minimum-evolution, and neighbor-joining methods of phylogenetic reconstructions in obtaining the correct tree. Mol Biol Evol 6:51425.[Web of Science]
Saitou N and Nei M. (1987) The neighbor-joining method: a new method for reconstruction of phylogenetic trees. Mol Biol Evol 4:40625.[Abstract]
Semple C and Steel M. (2003) Phylogenetics(Oxford University Press, Oxford, UK).
Semple C and Steel M. (2004) Cyclic permutations and evolutionary trees. Adv Appl Math 32:66980.[CrossRef]
Studier JA and Keppler KJ. (1998) A note on the neighbor-joining method of Saitou and Nei. Mol Biol Evol 5:72931.
Sumiyama K, Kim CB, Ruddle FH. (2001) An efficient cis-element discovery method using multiple sequence comparisons based on evolutionary relationships. Genomics 71:2602.[CrossRef][Web of Science][Medline]
Swofford DL, Olsen GL, Waddell PJ, Hillis DM. (1996) Phylogenetic inference. In Hillis DM, Moritz C, Mable BK (Eds.). Molecular sytematics(Sinauer, Sunderland, MA) pp. 407514.
Vinh LS and von Haeseler A. (2005) Shortest triplet clustering: reconstructing large phylogenies using representative sets. BMC Bioinformatics 6:92.[CrossRef][Medline]
Yushmanov SV. (1984) Construction of a tree with p leaves from 2p-3 elements of its distance matrix (Russian). Matematicheskie Zametki 35:87787.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
M. N. Price, P. S. Dehal, and A. P. Arkin FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix Mol. Biol. Evol., July 1, 2009; 26(7): 1641 - 1650. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Pulim, B. Berger, and J. Bienkowska Optimal contact map alignment of protein-protein interfaces Bioinformatics, October 15, 2008; 24(20): 2324 - 2328. [Abstract] [Full Text] [PDF] |
||||
![]() |
O. Gillor, J. A. C. Vriezen, and M. A. Riley The role of SOS boxes in enteric bacteriocin regulation Microbiology, June 1, 2008; 154(6): 1783 - 1792. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





