MBE Advance Access originally published online on October 12, 2005
Molecular Biology and Evolution 2006 23(2):254-267; doi:10.1093/molbev/msj030
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Article |
Application of Phylogenetic Networks in Evolutionary Studies

* Center for Bioinformatics (ZBIT), Tübingen University, Tübingen, Germany; and
Department of Mathematics, Auckland University, Auckland, New Zealand
E-mail: huson{at}informatik.uni-tuebingen.de.
| Abstract |
|---|
|
|
|---|
The evolutionary history of a set of taxa is usually represented by a phylogenetic tree, and this model has greatly facilitated the discussion and testing of hypotheses. However, it is well known that more complex evolutionary scenarios are poorly described by such models. Further, even when evolution proceeds in a tree-like manner, analysis of the data may not be best served by using methods that enforce a tree structure but rather by a richer visualization of the data to evaluate its properties, at least as an essential first step. Thus, phylogenetic networks should be employed when reticulate events such as hybridization, horizontal gene transfer, recombination, or gene duplication and loss are believed to be involved, and, even in the absence of such events, phylogenetic networks have a useful role to play. This article reviews the terminology used for phylogenetic networks and covers both split networks and reticulate networks, how they are defined, and how they can be interpreted. Additionally, the article outlines the beginnings of a comprehensive statistical framework for applying split network methods. We show how split networks can represent confidence sets of trees and introduce a conservative statistical test for whether the conflicting signal in a network is treelike. Finally, this article describes a new program, SplitsTree4, an interactive and comprehensive tool for inferring different types of phylogenetic networks from sequences, distances, and trees.
Key Words: phylogeny networks software confidence intervals
| Introduction |
|---|
|
|
|---|
The familiar evolutionary model assumes a tree, a model that has greatly facilitated the discussion and testing of hypotheses. However, it is well known that more complex evolutionary scenarios are poorly described by such models. Even when evolution proceeds in a tree-like manner, analysis of the data may not be best served by forcing the data onto a tree or assuming a tree-like model. Rather, visualization and exploration of the data to discover and evaluate its properties can be an essential first step. Recognition of these issues has led to the development of a number of different types of phylogenetic networks (see Fig. 1).
|
In this article we first discuss the terminology used to describe and distinguish between the different types of phylogenetic networks and methods that are currently in use. There are a number of different types of phylogenetic networks, and this can be a source of confusion.
We then provide a detailed introduction to split networks, how they are defined, and how they can be interpreted. We discuss the use of split networks in strategies for dealing with systematic error, especially as a preliminary step to tree-based analysis and show examples of a number of different types of networks.
We outline the beginnings of a statistical framework for phylogenetic inference using split network methods. Split networks have been widely used as a tool for visualization but have been neglected as a tool for statistical inference. Their success as a tool for visualizing incompatible and ambiguous phylogenetic signals suggests that if they are properly used, they can become an invaluable statistical inference tool for phylogenetics. Here we describe a method for constructing approximate confidence intervals (CIs) for trees using networks and a simple test to determine whether an incompatible signal in a network is statistically significant. There is considerable scope for improvement of both methods.
Finally, we present our new program SplitsTree4, which provides a comprehensive and interactive framework for estimating phylogenetic trees and networks. The program provides methods for computing split networks "from sequences," such as the median network (Bandelt et al. 1995
) and networks based on spectral analysis (Hendy and Penny 1993
); "from distances," such as split decomposition (Bandelt and Dress 1992
) and neighbor-net (Bryant and Moulton 2004
); and "from trees," such as consensus networks (Holland et al. 2004
) and supernetworks (Huson et al. 2004
). Moreover, new methods are provided for computing hybridization networks from trees (Huson et al. 2005
) and recombination networks from binary sequences (Huson and Kloepper 2005
). Additionally, SplitsTree4 includes a large number of different distance computations and tree-building methods.
| Terminology |
|---|
|
|
|---|
A "phylogenetic tree" is commonly defined as a leaf-labeled tree that represents the evolutionary history of a set of taxa, possibly with branch lengths, either unrooted or rooted.
The concept of a "phylogenetic network" is much less well defined, and there exist many different usages of the term (see Fig. 1). We propose to define a phylogenetic network as "any" network in which taxa are represented by nodes and their evolutionary relationships are represented by edges. (For phylogenetic trees, edges are referred to as branches.) Under this very general heading, one can distinguish between a number of different types of networks. Phylogenetic trees constitute one type (see Fig. 2). A second type is the "split network," which is obtained as a combinatorial generalization of phylogenetic trees and is designed to represent incompatibilities within and between data sets (see Fig. 3a). A third type, "reticulate network," represents evolutionary histories in the presence of reticulate events such as hybridization, horizontal gene transfer, or recombination (see Fig. 3b). A number of additional types of networks exist (Posada and Crandall 2001
), for example, to represent gene duplication and loss phylogenies (Hallett and Lagergren 2000
; Durand, Halldorsson, and Vernot 2005
) or host and parasite coevolution (Charleston 1998
). Other approaches for constructing phylogenetic networks include "statistical parsimony" (Templeton, Crandall, and Sing 1992
), the "netting" method (Fitch 1997
), and a method that, in effect, consists of adding "short-cut" edges to a tree (Legendre and Makarenkov 2002
).
|
|
One major source of confusion has been that different authors define the generic term phylogenetic network rather narrowly to mean some particular type of network currently under study. For example, a recent article on recombination (Gusfield and Bansal 2005
Reticulate networks provide an "explicit" representation of evolutionary history, generally depicted as a phylogenetic tree with additional edges. The internal nodes in such a network represent ancestral species, and nodes with more than two parents correspond to reticulate events such as hybridization or recombination.
Split networks are used to represent incompatible and ambiguous signals in a data set. In such a network, parallel edges, rather than single branches, are used to represent the splits computed from the data. To be able to accommodate incompatible splits, it is often necessary that a split network contains nodes that do not represent ancestral species. Thus, split networks provide only an "implicit" representation of evolutionary history.
The distinction between these explicit and implicit representations of evolution is important and has not always been made clear in the past (Morrison 2005
).
| Background |
|---|
|
|
|---|
A "split" is a partition of the taxa into two nonempty subsets, such as the partition obtained when we remove a branch from a phylogenetic tree. For example, removing the branch indicated by the arrow in Figure 2a splits the taxa into two groups {B, C, D, E} and {o, A, F, G, H, I, J, K}. The interpretation of split networks is based on one simple principle: "a split network contains exactly the same information as a list of splits with a weight for each split."
In a split network, every edge is associated with a split of the taxa, but there may be a number of parallel edges associated with each split. In Figure 4, all the edges corresponding to a split are highlighted, as well as all the taxa on one side of the split. If the edges associated with a particular split were to be deleted, then the network would become disconnected with precisely two components corresponding to the two parts of the split. The edges separate taxa on one side of the split from the taxa on the other side of the split. The length of an edge in the network is proportional to the weight of the associated split. This is analogous to the length of a branch in a phylogenetic tree.
|
The example in Figure 4a and b gives two representations of the same information: (a) a split network representing 14 different splits of a set of eight taxa, and (b), a listing of the splits and their weights.
Formally, for a given taxon set X and set of splits
we define a split network
to be a connected graph in which some of the nodes are labeled by taxa and all edges are labeled by splits, such that
- (N1) Removing all edges associated with a given split S in
divides
into two connected components, one part containing all taxa on one side of S and the other part containing all taxa on the other side.
- (N2) The edges along any shortest path in
are all associated with different splits.
- (N2) The edges along any shortest path in
Every split network represents a unique collection of splits. However, uniqueness does not hold in the other direction, as a given collection of splits can have many different split network representations. In Figure 5 we show two different split networks that both represent the splits ABC|DEF, ABF|CDE, AEF|BCD, together with all six trivial splits on the set {A, B, C, D, E, F} (Wetzel 1995
).
|
Due to this nonuniqueness, it is often inappropriate to consider internal nodes as hypothetical ancestors (Bandelt and Dress 1992
The split network, then, is a graphical representation of a collection of splits with weights. The interpretation of the network therefore depends on exactly how the splits were constructed and assigned weights. As we shall see, this varies considerably between methods and between applications.
Interpreting Split Networks: Representing Multiple Trees
There are many situations in phylogenetics where we have a large collection of trees that we wish to summarize in some way. The trees might be the result of a bootstrap analysis, samples from a posterior distribution, or might come from a multigene analysis, for example. Techniques for summarizing multiple trees using split networks are described in Bandelt (1995)
, Holland et al. (2004)
, and Huson et al. (2004)
. The basic idea is to code each individual tree as a collection of splits, define a summary set of splits from these, and represent the resulting set using a split network. For example, "consensus networks" are constructed from all splits appearing in at least some fixed proportion of the input trees. A consensus network can represent much more information than a single tree with P values.
Here we take the consensus network methods one step further and use split networks to define "confidence sets" of trees. Suppose that we have assigned an interval for the weights of each split (represented) in a split network
We say that a tree T is contained within the split network
if
- (1) Every split in the tree is a split in the network.
- (2) For every split in the tree, the corresponding branch length is contained within the interval assigned to the appropriate split weight.
- (3) For every split in the network not in the tree, the interval assigned to that split contains zero.
- (2) For every split in the tree, the corresponding branch length is contained within the interval assigned to the appropriate split weight.
|
A geometric interpretation might prove useful here. Suppose that the splits are indexed from 1 to m. A tree can then be coded as a point in m-dimensional space: the ith coordinate is the length corresponding to the ith split, or 0 if that split is not present in the tree (Holmes 2005
A network with intervals assigned to edges is an "X% confidence" network if, for different random samples, it has an X% probability of containing the "true" tree. In the Appendix we propose a method for constructing "approximate" confidence networks using nonparametric bootstrapping. Confidence networks can also be defined for Bayesian analyses, where they would describe posterior confidence sets. The split summaries produced by MrBayes (Ronquist and Huelsenbeck 2003
) come close to this.
The use of split networks to describe confidence sets for trees fits well with the geometric analysis of tree space in Billera, Holmes, and Vogtman (2001)
. They construct a model of tree space by pasting together sections of Euclidean space. Each dimension, in their picture, corresponds to a different split. Hence their model of tree space sits naturally in the much larger, and much simpler, space of split networks.
Interpreting Split Networks: Networks and Systematic Error
The rapid growth in available genomic sequence data opens up exciting new possibilities and also new challenges for phylogenetic inference. This development means that "sampling error" is becoming less of an issue while the impact of "systematic error" is becoming increasingly important. Sampling error is random error resulting from a small sample size (number of sites). Systematic error occurs when mistakes in the assumptions of a model or method cause data to be misinterpreted, something that is even more likely to occur when we consider large, multigene, heterogeneous data sets. These errors cause biases and artifacts in phylogenetic inference, some of which can be corrected by modifying the employed model of sequence evolution (Swofford et al. 1996
; Felsenstein 2004a
; Bryant, Galtier, and Poursat 2005
; Delsuc, Brinkmann, and Philippe 2005
). Systematic error is particularly important when there is a possibility of reticulate evolutionary events because, in such cases, no tree-based model can accurately model the data.
The two most widely used methods for checking the reliability of a tree are the nonparametric bootstrap (Felsenstein 1985
) and (for Bayesian analysis) multiple samples from the posterior distribution (Rannala and Yang 1996
; Ronquist and Huelsenbeck 2003
). These techniques are designed to protect from sampling error; they are not designed to protect from systematic error. On short sequences, bootstrap resampling has the effect of "jiggling" the data and so can provide some assessment of robustness. However, when the number of sites is very large, the bootstrap replicates will be all very similar and so the bootstrap support will be high, no matter how poorly the data fit a tree.
Systematic error, unlike sampling error, does not disappear as the sequence length increases. Indeed it might even become worse (Delsuc, Brinkmann, and Philippe 2005
).
In tree-based phylogenetic analysis the goal is to find the phylogenetic tree that best explains the observed patterns in the data. When there is systematic error, shortcomings of the model mean that the observed data will (in general) not appear to have originated from any tree. The tree-building method will attempt to fit a tree, even if there is still a huge gap between the data and the best tree that the method can find. This gap is the principal cause of reconstruction artifacts (Steel 2005
).
Model-based split network methods (e.g., Huber et al. 2002
; Bryant and Moulton 2004
; Winkworth et al. 2005
) deal with systematic error by adding parameters to the evolutionary model. In conventional phylogenetics, parameters are added to give a more complex model of the substitution process down a branch. However, there are two kinds of parameters in phylogenetic inference: parameters describing the evolutionary model (rate variation, substitution probabilities, etc.) and parameters describing the topology (the tree and branch lengths). Evolutionary models based on split networks add extra topology-related parameters. A phylogenetic tree corresponds to a collection of compatible splits (with weights or lengths). A split network model is obtained by allowing additional splits with weights. These extra parameters allow split networks to fit the data better than individual trees.
It may seem odd to use split networks, which are not trees, to represent phylogenetic signals that, for the most part, originate from trees. However this is not an unusual practice in statistics. Consider the statement "in the year 2000, the expected number of children in a randomly chosen family in the US was 1.86." Of course, there was not a single family with exactly 1.86 children, and it does not make sense to talk about a fractional number of children in a given family. It does make sense, however, to use fractions in summary statistics like this one. The same applies to split networks. In the absence of reticulation, it does not make sense to speak of sequences evolving on a network, but it does make sense to infer split networks as summary statistics.
The example in Kolaczkowski and Thornton (2004)
demonstrates that unaccounted rate variation can mislead both maximum likelihood (ML) and parsimony-based analyses into selecting the wrong tree. We show later that the rate variation did not mislead a split network method, split decomposition, which displayed, simultaneously, support for the true tree and the artifactual tree. The split network better represented the phylogenetic signal than either tree.
The example in Esser et al. (2004)
shows how split network methods can extract phylogenetic signals that are missed by tree-based methods. The study involves multiple genes, and an ML tree analysis of each gene gives statistical (i.e., bootstrap) support for conflicting phylogenies. However, it is apparent that the ML analyses are affected by systematic error. A neighbor-net analysis returns split networks for each gene that incorporates both the ML tree and additional splits. Furthermore, many of these additional splits appear in almost all networks for the other genes. The probability of this occurring by chance is extremely small. Hence, the additional splits recovered by the split network method are probably contained in the true underlying phylogeny.
To summarize, an initial strategy for using split network methods in phylogenetic inference would be as follows:
- (1) Construct a split network using the best available model and method.
- (2) Determine if the network is significantly different from a tree.
- (3) If the network is significantly nontreelike, then there is probably an error in the model. If possible, improve the model and go back to step 1. If the conflicting or ambiguous signal in the split network cannot be explained, then this failure to explain the data properly should be reported and taken into account in any conclusions drawn from the phylogenetic analysis.
- (4) If the network is not significantly nontreelike (and there is evidence that the sampling error of the network method is not too large), then continue with a tree-based phylogenetic analysis.
- (2) Determine if the network is significantly different from a tree.
Reticulate Networks
Split networks provide an implicit picture of evolutionary relationships. Sets of parallel edges are employed to represent splits. As mentioned above, internal nodes in a split network do not necessarily correspond to hypothetical ancestors.
In contrast, reticulate networks provide an explicit picture of evolution. In such a network, edges represent lineages of descent or reticulate events such as hybridization, horizontal gene transfer, or recombination, and all nodes correspond to hypothetical ancestors, whether the product of speciation and mutation, or reticulate events.
Such explicit networks are usually drawn "rooted," so that the edges have a direction with an evolutionary meaning. In contrast, most split networks published to date are displayed as unrooted networks. However, it is possible to root split networks, for example, by specifying an outgroup, as illustrated in Figure 2a, and an algorithm for drawing rooted split networks is available in SplitsTree4.
A hybridization network is a reticulate network
that can explain a given set of trees in terms of hybridization (see Maddison 1997
; Baroni, Semple, and Steel 2004
; Nakhleh, Warnow, and Linder 2004
; Huson et al. 2005
). More precisely, given a set of trees
usually obtained from a collection of different genes, one would like to determine a putative reticulate network
from which the trees arise. If such a network can be found, then we say that the network
"explains" the set of trees
in terms of hybridization.
In a recent article (Huson et al. 2005
), we describe a method that can solve this problem for certain patterns of hybridization, and this is implemented in the SplitsTree4 program. It takes as input a set of trees and first computes the splits network that represents all splits present in the input trees. Each "netted component" ("2-connected component," in terms of graph theory) of the network is then individually analyzed in turn and is replaced by a reticulation scenario, if one can be found that explains the given splits in the component.
As an example, suppose that the two trees depicted in Figure 2 are based on two different genes. In Figure 3a we show the split network that represents all splits present in either of the two trees. The netted regions in the network indicate in which parts of the phylogeny there is disagreement between the two trees.
Figure 3b depicts a reticulate network that can explain the differences in the two trees using three reticulation events. In this example, the clade {B, C} arises from a reticulation event between the lineages leading to taxa A and D, whereas the taxon H, or I, arises from a reticulation event between the lineages leading to G and {J, K} or to {F, G} and J, respectively.
Recombination as a reticulate event is usually considered in population studies (Hudson 1983
; Hein 1990
) and the arising graphs are called ancestor recombination graphs (ARGs) or recombination networks. A number of algorithms for inferring such networks have been proposed recently (Gusfield and Bansal 2005
; Huson and Kloepper 2005
; Lyngsoe, Song, and Hein 2005
). Such methods take as input a sequence of two-state characters and attempt to explain the given characters in terms of evolution by speciation, mutation, and recombination under the restriction that a mutation of any character may happen at most once throughout the whole phylogeny. In Huson and Kloepper (2005)
, we describe a new method that solves this problem for certain patterns of recombination, and this approach is implemented in SplitsTree4.
| Methods |
|---|
|
|
|---|
SplitsTree4 is a completely new program that we have developed over the past 3 years. It was inspired by SplitsTree3 (Huson 1998
There are now many published methods for inferring split networks, and we have implemented most of them in SplitsTree4. These include the following:
- Median networks (Bandelt et al. 1995
), parsimony splits (Bandelt and Dress 1994
), and spectral analysis (Steel et al. 1992
), which construct split networks (or equivalently, weighted splits) directly from character data.
- Split decomposition (Bandelt and Dress 1992
) and neighbor-net (Bryant and Moulton 2004
) which construct split networks from inferred distance matrices.
- Consensus networks (Bandelt 1995
; Holland et al. 2004
) and supernetworks (Huson et al. 2004
) which construct split networks from sets of trees.
- Split decomposition (Bandelt and Dress 1992
The software implements ML estimation of distances from amino acid and nucleotide sequences, under all standard evolutionary models. The implemented phylogenetic tree methods include "neighbor joining" (Saitou and Nei 1987
), "Bio-NJ" (Gascuel 1997
), "Unweighted Pair Group Method with Arithmetic Mean (UPGMA)" (Sokal and Michener 1958
), "the Buneman tree" (Buneman 1971
), the "refined Buneman tree" (Brodal et al. 2003
), and standard consensus tree methods. SplitsTree4 also provides a graphical front-end for ML analysis using PhyML (Guindon and Gascuel 2003
) and parsimony analysis using PHYLIP (Felsenstein 2004b
). All character-based analyses can be bootstrapped.
The graphical interface makes it easy for users to interactively explore their data using different phylogenetic methods, starting with a standard file format (e.g., NEXUS, FastA, PHYLIP, etc.). Users can select from different analyses, filter data, and manipulate networks, all using standard menus. The data, and networks, can be exported to a variety of different file formats including the standard image formats PostScript, JPEG, SVG, PNG, and GIF. Multiple analyses can be run at the same time, and the program is multithreaded, so time-consuming calculations can be done in the background.
More details on the architecture of the program can be found in the Appendix.
| Examples |
|---|
|
|
|---|
Heterogeneous Evolution and Split Networks
The variation of evolutionary rates across sites and between lineages is a well-recognized source of phylogenetic error. A particularly intriguing synthetic example is described in Kolaczkowski and Thornton (2004)
Sequences of varying lengths were evolved on a single tree: half of the sites with one set of branch lengths and the remaining sites with a second set of branch lengths (see Fig. 7a). The change in branch lengths simulates an extreme example of rate variation, one that violates the basic assumptions of standard ML analysis. Consequently, ML repeatedly reconstructs the wrong tree, even with long (or indeed, infinite) sequences. Parsimony was also misled, but, in this setup, less so than ML.
|
We repeated the experiment of Kolaczkowski and Thornton (2004)
Split decomposition returned the correct split for even small values of r. Of course, split decomposition gets an unfair advantage: effectively, it can choose two trees instead of one. This is exactly the advantage of networks. In this case, split decomposition cannot decide between two trees: the true tree and the ML tree. The networks returned for different values of r (and infinite sequences) are given in Figure 7c. Looking, for example, at the split network produced when r = 0.3, we see that there is an even balance between support for the tree AB|CD and the tree AD|BC. The indecision in the ML analysis is due to sampling error pushing the signal towards one tree or the other. As r increases, the phylogenetic tree methods settle uniquely on a single tree which, in this case, is the correct tree. However the network still has a large box: the closest tree may be the correct tree, but there is a substantial amount of phylogenetic signal in the data that is not explained adequately by a single tree.
Animal Phylogeny
Our second case study illustrates the application of SplitsTree4 to genome-scale phylogenetics. The data set was prepared by Philippe, Lartillot, and Brinkman (2005)
and consists of 71 (slowly evolving) genes (20,705 amino acid positions) from 35 animals, 10 fungi, and 2 choanoflagellates. The phylogeny for the bilaterian animals has been the subject of much discussion recently, one key question being whether the ecdysozoa (e.g., arthropods, nematodes, tardigrades) or the coelomates (e.g., mollusca, deuterosomes, arthropoda) are monophyletic (Hedges 2002
; Wolf, Rogozin, and Koonin 2004
; Philippe, Lartillot, and Brinkman 2005
).
We conducted a neighbor-net analysis using ML distances inferred from a concatenated data set under the Jones, Taylor, and Thornton (JTT) + F +
model (Jones, Taylor, and Thornton 1992
) (see Fig. 8). One hundred bootstrap replicates were performed. Nematodes, arthropods, deuterotomes, choanoflagellates, and platyhelminthes are all well supported groups with 100% bootstrap support. However, the relationships between these groups are less clear: the network seems midway between a tree grouping nematodes, tardigrades, and arthropods (the ecdysozoa hypothesis) and a tree grouping arthropods, tardigrades, and deuterostomes (the coelomate hypothesis). Neither hypothesis is supported by a clear split in the network: the clearest split in favor of the ecdysozoa hypothesis misplaces the annelids, while the clearest split in favor of the coelomates misplaces the cnidaria.
|
Neighbor-net, like any phylogenetic method, is affected by sampling error, and this could potentially explain the observed conflicting signals. There are many potential sources for systematic error in a data set this large and diverse (Philippe, Lartillot, and Brinkman 2005
Using a subset of the taxa, it was recently demonstrated (Philippe, Lartillot, and Brinkman 2005
) that support for a coelomates clade or an ecdysozoa clade with a tree-based method depends greatly on the choice of outgroup (see Fig. 9). The ecdysozoa clade appeared only once a closely related outgroup (a cniderian) was used, indicating that the coelomate tree is due to long-branch attraction. This is an example of the effect of taxon instability: the information present in the cniderian sequence changes the resolution of taxa in other parts of the tree. If we look at a split network analysis of the sequences, we see that there is at least some support for the ecdysozoa clade even when the cnidarian taxon is absent.
|
When the cniderian taxon is absent, the tree-based method (in this case, Gascuel 1997
The Evolutionary History of Dusky Dolphins
This example is taken from a recent article (Cassens et al. 2003
) that investigates the phylogeography of dusky dolphins (Lagenorhhynchus oscurus) and compares a number of different network methods. The data used are the 60 variable positions in the DNA sequences of the full mitochondrial cytochrome b gene for 36 different haplotypes seen in 124 individuals, sampled off Peru, Argentina, and Southwest Africa. The article discusses the application of four different network methods: split decomposition, the "minimum spanning network," "statistical parsimony," and the "median-joining network."
We have reanalyzed this data using methods available in SplitsTree4. Calculation of observed P distances and application of the neighbor-joining method produced the tree depicted in Figure 10a. The edges are labeled by the percent bootstrap support attained in 1,000 bootstrap replicates.
|
Based on this tree, the 95% confidence network for neighbor joining displayed in Figure 10b shows that there is considerable sampling variance in the tree estimate, particularly when we consider the large CIs on the edge lengths. One split S' places the Atlantic haplotype A9 together with two Pacific haplotypes P4.1 and P4.2, which is incompatible with the central split S between Pacific and Atlantic haplotypes. The CIs for both splits include 0, so there is too much sampling error with neighbor joining to discriminate between the two possibilities.
Application of the maximum parsimony algorithm using the PHYLIP dnapars program produced three most parsimonious trees, whose consensus network is depicted in Figure 10c. Here we see that the three trees do not differ significantly.
In Figure 10d we display the median network. Each split is labeled by the columns of the alignment that support it. We see that the central split S separating Pacific and Atlantic haplotypes is supported by precisely three positions, 3, 18, 25, whereas the incompatible split S' is only supported by position 11.
Application of split decomposition to the uncorrected P distances produces the split network shown in Figure 10e. On this data set, the split decomposition captures most of the incompatible signals that are present in the median network, including the two incompatible splits mentioned above. Finally, we display the split network produced by neighbor-net in Figure 10f. By definition, this method can only produce a planar network. In this example the resulting network is missing some of the splits that are present in the median network and also in the split decomposition network. (In theory, and for small or highly similar data sets, the split decomposition method produces more resolved networks than neighbor-net, as the networks produced by split decomposition are not restricted to be planar. However, in practice, and for large or divergent data sets, split decomposition suffers from low resolution and the networks produced by neighbor-net are usually more resolved.)
| Discussion |
|---|
|
|
|---|
Phylogenetic networks have an important role to play in the reconstruction of evolutionary history. Implicit models such as split networks are very useful for exploring and visualizing the different signals in a data set. Explicit models such as hybridization and recombination networks can be used to provide an explicit description of reticulate evolution. Both types of networks have an important role to play. Many current methods for computing split networks from characters, distances, or trees are very robust and can provide valuable insights. In contrast, the existing methods for computing hybridization and recombination networks are unproven, and, as all are based on some kind of combinatorial analysis of a given configuration of splits, they are very susceptible to false-positive signals.
In the past, split network methods have been neglected as a tool for statistical inference in phylogenetics. This was due, in part, to the lack of an appropriate statistical framework, the absence of an integrated software package, and conceptual difficulties of thinking in terms of split networks as well as trees.
We have described the beginnings of a comprehensive statistical framework for phylogenetic analysis based on split network methods. The key messages are that split networks are representations of splits, and using more splits permits a more accurate representation of the data. The confidence networks and tests we presented illustrate a new general approach, although much remains to be done to improve the efficiency and power of these methods.
Finally, we have introduced the new SplitsTree4 program, which was inspired by the popular SplitsTree3 program. SplitsTree4 is an integrated and user-friendly software package allowing users to conduct phylogenetic analysis using trees, split networks, and reticulate networks. With SplitsTree4, we hope to provide a robust framework for inferring and investigating phylogenetic networks.
| Appendix |
|---|
|
|
|---|
Bootstrap Approximate Confidence Networks
A confidence network is essentially a representation of multiple CIs: one CI for the weight of each split. The problem of constructing confidence networks is therefore the same as the problem of constructing "simultaneous confidence intervals." Methods for this problem date back to Scheffé (1953)
Suppose that for each split we construct the interval for the weight of the split. We say that the intervals are "balanced" if the probability that each interval contains the true weight is the same for each interval. We say that the collection of intervals (or equivalently, the confidence network) "level" 1
, if the probability that all of the intervals contain the true weights simultaneously is 1
. The goal when constructing simultaneous CIs is to construct a balanced collection of intervals with the correct level.
In SplitsTree4 we have implemented the nonparametric bootstrapping "B method" of Beran (1988
, 1990
) to construct the simultaneous CIs represented by a confidence network (see Beran [1988]
for a complete description of the method; note that we use the difference between split weights as the confidence set root when implementing the B method). Beran proved that, under fairly general conditions, the confidence sets will be asymptotically correct to a first-order approximation. Hence, given sufficient bootstrap replications and long-enough sequences, the confidence network will have close to the correct level and will be approximately balanced.
There are two major caveats. First, network methods like neighbor-net are not continuous, in the sense that a small change in the data can sometimes cause a substantial change in the inferred network. This means that the asymptotic convergence conditions outlined in Beran (1988)
will hold locally but not over the entire parameter space.
The second problem is that methods like split decomposition and neighbor-net only construct split networks involving a small number of splits compared to the number of splits in total. When bootstrapping, many splits appear in only one- or two-bootstrap replicate networks, and so most splits have weight zero for almost all replicates. The net result is that, in simulation, the confidence networks constructed using Beran's B method have incorrect level, even with sequences of length 5,000 and 1,000 bootstrap replicates. The problem is caused by splits that appear in the true tree but not in the estimate tree, perhaps because of their signals being lost in the sampling error.
There are several avenues for future investigation. Beran (1990)
describes a "double" bootstrap method that is more accurate than the original B method. Efron, Halloran, and Holmes (1996)
applied a double bootstrapping to one-dimensional hypothesis tests in phylogenetics. Unfortunately, both these methods require that the number of bootstrap replicates be squared.
Network Tree-Likeness Test
Suppose that we have inferred a split network
from some data and that this network is not a tree. We want to test whether it is likely that the data originated from a tree. This is easy to do if we have an efficient and correct method for constructing correct confidence networks.
- (1) Construct a confidence network for
with level 1
.
- (2) If the confidence network does not "contain" a tree (in the technical sense defined above), then reject the null hypothesis that the data originated on a tree.
- (2) If the confidence network does not "contain" a tree (in the technical sense defined above), then reject the null hypothesis that the data originated on a tree.
Part (1) can be executed quickly by constructing the set of splits with CIs excluding zero and rejecting the null hypothesis if and only if this set is incompatible. The performance of the test will, naturally, depend on the confidence network method used. Simulations using the above confidence network indicate that the test is correct but has unacceptably low power. Much work remains to be done in this field.
The Architecture of SplitsTree4
The graphical interface makes it easy for a user to interactively explore their data using different phylogenetic methods. The user can select from different analyses, filter data, and manipulate networks, all using standard menus. The data and networks can be imported and exported using a variety of different file formats. Multiple analyses can be run simultaneously, and the program is multithreaded so that time-consuming calculations are done in the background. SplitsTree4 uses the NEXUS format, as does PAUP*, MrBayes, Mesquite, and MacClade.
For the more advanced features, it is important to have an understanding of how SplitsTree4 organizes and processes data. The program arranges its data into a list of different "blocks," namely the "taxa," "unaligned," "characters," "distances," "quartets," "trees," "splits," and "network" blocks. The taxa block is mandatory, but the other blocks are optional. The list of blocks, in this order, is called the "processing pipeline." Additional blocks are used to direct the analyses and assumptions made by the program.
Usually, the initial input to the program will consist of a taxa block and one additional block containing the input data, which is called the "source" block. Any phylogenetic method is viewed as a "transformation" of one type of data block to another. Most analyses involve a sequence of transformations applied in the same order as the processing pipeline. For example, suppose that the source is a characters block containing an alignment of DNA sequences and the goal is to produce the neighbor-net network for this data. The data in the characters block are first transformed into a distances block using a distance calculation such as the "uncorrected P" calculation. Next, the distances block is transformed into a splits block using the neighbor-net algorithm. Finally, the data in the splits block is transformed into a network block using a network construction algorithm such as "equal angle" (Dress and Huson 2004
).
SplitsTree4 is written in Java, and installers are available for Windows, MacOS X, and Unix/Linux. The program can be used interactively in a GUI mode or can be run in a noninteractive mode using the command line to facilitate batch processing.
| Acknowledgements |
|---|
|
|
|---|
We would like to thank Andreas W. M. Dress for introducing us to splits and phylogenetic networks. We would also like to thank Pete J. Lockhart, Barry G. Hall, and Susan Holmes for helpful discussions.
| Footnotes |
|---|
William Martin, Associate Editor
| References |
|---|
|
|
|---|
Bandelt, H. J. 1995. Combination of data in phylogenetic analysis. Plant Syst. Evol. Suppl. 9:355361.
Bandelt, H. J., and A. W. M. Dress. 1992. A canonical decomposition theory for metrics on a finite set. Adv. Math. 92:47105.[CrossRef]
. 1994. A relational approach to split decomposition. Technical Report, Universität Bielefeld, Bielefeld, Germany.
Bandelt, H. J., P. Forster, B. C. Sykes, and M. B. Richards. 1995. Mitochondrial portraits of human population using median networks. Genetics 141:743753.[Abstract]
Baroni, M., C. Semple, and M. A. Steel. 2004. A framework for representing reticulate evolution. Ann. Comb. 8:391408.[CrossRef]
Beran, R. 1988. Balanced simultaneous confidence sets. J. Am. Stat. Assoc. 83:679686.[CrossRef]
. 1990. Refining bootstrap simultaneous confidence sets. J. Am. Stat. Assoc. 85:417426.[CrossRef]
Billera, L., S. Holmes, and K. Vogtman. 2001. Geometry of the space of phylogenetic trees. Adv. Appl. Math. 27:733767.[CrossRef]
Brodal, G. S., R. Fagerberg, A. Östlin, C. N. S. Pedersen, and S. S. Rao. 2003. Computing refined buneman trees in cubic time. Lect. Notes Comput. Sci. 2812:259270.
Bryant, D., N. Galtier, and M.-A. Poursat. 2005. Likelihood calculation in molecular phylogenetics. Pp. 3362 in O. Gascuel, ed., Mathematics of evolution and phylogeny. Oxford University Press, Oxford, UK.
Bryant, D., and V. Moulton. 2004. NeighborNet: an agglomerative algorithm for the construction of planar phylogenetic networks. Mol. Biol. Evol. 21:255265.
Buneman, P. 1971. The recovery of trees from measures of dissimilarity. Pp. 387395 in F. R. Hodson, D. G. Kendall, and P. Tautu, eds., Mathematics in the archaeological and historical sciences. Edinburgh University Press, Edinburgh.
Cassens, I., K. van Waerebeek, P. B. Best, E. A. Crespo, J. Reyes, and M. C. Milinkovitch. 2003. The phylogeography of dusky dolphins (lagenorhynchus obscurus): a critical examination of network methods and rooting procedures. Mol. Ecol. 12:17811792.[CrossRef][Medline]
Charleston, M. A. 1998. Jungles: a new solution to the host/parasite phylogeny reconciliation problem. Math. Biosci. 149:191223.[CrossRef][ISI][Medline]
Delsuc, F., H. Brinkmann, and H. Philippe. 2005. Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6:361375.[ISI][Medline]
Dress, A. W. M., and D. H. Huson. 2004. Constructing splits graphs. IEEE Trans. Comput. Biol. Bioinform. 1:109115.[CrossRef]
Durand, D., B. V. Halldorsson, and B. Vernot. 2005. Hybrid micro-macroevolutionary approach to gene tree reconstruction. Pp. 250264 in S. Miyano, J. Mesirov, S. Kasif, S. Istrail, P. Pevzner, and M. Waterman, eds., Lecture Notes in Computer Science, vol. 3500. Springer-Verlag, Berlin.
Efron, B., E. Halloran, and S. Holmes. 1996. Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. USA 93:1342913434.
Esser, C., N. Ahmadinejad, C. Wiegand et al. (15 co-authors). 2004. A genome phylogeny for mitochondria among alpha-proteobacteria and a predominantly eubacterial ancestry of yeast nuclear genes. Mol. Biol. Evol. 21:16431660.
Felsenstein, J. 1985. Confidence-limits on phylogenies, an approach using the bootstrap. Evolution 39:7837911.[CrossRef][ISI]
. 2004a. Inferring phylogenies. Sinauer Associates, Sunderland, Mass.
. 2004b. PHYLIP (phylogeny inference package). Version 3.6. http://evolution.genetics.washington.edu/phylip.html. Distributed by the author, Department of Genetics, University of Washington, Seattle.
Fitch, W. M. 1997. Networks and viral evolution. J. Mol. Evol. 44:S65S75.[Medline]
Gascuel, O. 1997. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 14:685695.[Abstract]
Guindon, S., and O. Gascuel. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696704.[CrossRef][ISI][Medline]
Gusfield, D., and V. Bansal. 2005. A fundamental decomposition theory for phylogenetic networks and incompatible characters. Pp. 217232 in S. Miyano, J. Mesirov, S. Kasif, S. Istrail, P. Pevzner, and M. Waterman, eds. Research in Computational Biology. Lecture Notes in Computer Science, vol. 3500, Springer-Verlag, Berlin.
<









