

MBE Advance Access originally published online on November 9, 2005
Molecular Biology and Evolution 2006 23(5):848-855; doi:10.1093/molbev/msj061

© The Author 2005. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Proceedings of the SMBE Tri-National Young Investigators' Workshop 2005

Improved Consensus Network Techniques for Genome-Scale Phylogeny

Barbara R. Holland*, Lars S. Jermiin† and Vincent Moulton‡

* Allan Wilson Centre, Institute of Fundamental Sciences, Massey University, New Zealand; † School of Biological Sciences and Sydney University Biological Informatics & Technology Centre, University of Sydney, Sydney, Australia; Unité de Biologie Moleculaire de Gène chez les Extrêmophiles, Institut Pasteur, Paris, France; and ‡ School of Computing Sciences, University of East Anglia, Norwich, United Kingdom

E-mail: b.r.holland@massey.ac.nz.


    Abstract
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Acknowledgements
 References
 
Although recent studies indicate that estimating phylogenies from alignments of concatenated genes greatly reduces the stochastic error, the potential for systematic error still remains, heightening the need for reliable methods to analyze multigene data sets. Consensus methods provide an alternative, more inclusive, approach for analyzing collections of trees arising from multiple genes. We extend a previously described consensus network method for genome-scale phylogeny (Holland, B. R., K. T. Huber, V. Moulton, and P. J. Lockhart. 2004. Using consensus networks to visualize contradictory evidence for species phylogeny. Mol. Biol. Evol. 21:1459–1461) to incorporate additional information. This additional information could come from bootstrap analysis, Bayesian analysis, or various methods to find confidence sets of trees. The new methods can be extended to include edge weights representing genetic distance. We use three data sets to illustrate the approach: 61 genes from 14 angiosperm taxa and one gymnosperm, 106 genes from eight yeast taxa, and 46 members of a gene family from 15 vertebrate taxa.

Key Words: consensus networks • genome-scale phylogeny • gene trees


    Introduction
 
Recent phylogenetic studies have used massive amounts of sequence data to infer the evolutionary history of genomes (Rokas et al. 2003; Goremykin et al. 2004), in some cases using over 100 genes for a moderate number of taxa (Rokas et al. 2003). Although these studies have resulted in highly supported trees (for example, Rokas et al. [2003] used bootstrap analysis and obtained 100% support for all edges in the tree), it is alarming to discover how sensitive these trees are to data coding and to the choice of model of sequence evolution employed in the tree reconstruction. Commentaries by Phillips, Delsuc, and Penny (2004) on the sequence data used by Rokas et al. (2003) and by Stefanovic, Rice, and Palmer (2004) on the sequence data used by Goremykin et al. (2004) illustrate this sensitivity. The fact that stochastic error is greatly reduced for genome-scale data sets shifts the focus toward model specification, a problem that is already known to be difficult for many phylogenetic data sets (e.g., Hillis, Huelsenbeck, and Swofford 1994; Buckley 2002; Ho and Jermiin 2004; Jermiin et al. 2004).

To model sequence evolution for a concatenation of genes, it is possible to allow a different substitution model for each gene or for each codon position within each gene (see Pagel and Meade 2004; Poladian and Jermiin 2005). However, this requires estimating a very large number of parameters. An alternative approach is to infer trees for each gene separately and then to combine these trees into a consensus tree (see, e.g., Rokas et al. 2003). Although conceptually simple, the consensus tree approach, by its very nature, omits information as it cannot explicitly display incongruence between different gene trees.

In order to retain more of the information available from the gene trees, Holland et al. (2004) proposed consensus networks and demonstrated their use by reanalyzing the data set presented in Rokas et al. (2003). Consensus networks were obtained by combining the optimal tree for each gene, under either the maximum parsimony or maximum likelihood criterion, and assigning them equal weight. However, this method does not include any additional information on the confidence in phylogenetic trees inferred from the genes. For example, stochastic error is more of a problem for short genes than for long genes, and genes may be more or less conserved.

Consensus networks are described in Holland and Moulton (2003) and Holland, Delsuc, and Moulton (2005) and build upon an idea originally presented in Bandelt (1995). They generalize strict and majority-rule consensus trees by allowing the representation of conflicting information that cannot be displayed in a single tree. In particular, given a collection of trees on the same set of taxa, a consensus network displays all those bipartitions, or "splits," that correspond to the edges that are present in more than a certain proportion, x, of the input trees. For example, if x = 1.0, then the consensus network is a strict consensus tree because only those splits that correspond to edges present in every tree will be displayed; if x = 0.5, then the consensus network is the majority-rule consensus tree; and if x < 0.5, then the consensus network can display conflicting splits. In its simplest setting, where edge weights and tree weights are not considered, each split displayed by the consensus network is given support equal to the frequency of its occurrence in the input trees; this can be reflected by the length of the edges that represent the split in the consensus network.
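The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' scripts: the function names, the representation of a tree as a list of splits, and the four hypothetical taxa in the usage note are all assumptions made for the example.

```python
from collections import Counter


def canonical_split(side, taxa):
    """Represent a split as a frozenset of its two sides, so that A|B equals B|A."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})


def consensus_splits(trees, taxa, x):
    """Return {split: frequency} for splits present in more than a proportion x of trees.

    Each tree is given as a collection of splits; each split is given by one of
    its sides (a set of taxon names). This input format is an assumption.
    """
    counts = Counter()
    for tree in trees:
        # Building a set first ensures a split is counted once per tree.
        for split in {canonical_split(side, taxa) for side in tree}:
            counts[split] += 1
    n = len(trees)
    return {split: c / n for split, c in counts.items() if c / n > x}
```

For three hypothetical four-taxon trees, `consensus_splits([[{"a", "b"}], [{"a", "b"}], [{"a", "c"}]], {"a", "b", "c", "d"}, 0.5)` keeps only the split ab|cd, whereas a threshold of 0.2 also keeps the conflicting split ac|bd. Because the comparison here is strict, the strict-consensus case (x = 1.0) would need an inclusive comparison; the sketch does not treat it specially.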

The standard consensus network method is appropriate when the input trees contribute equally. Such cases include the single-gene setting with equally parsimonious or equally likely trees, sets of bootstrap trees (Felsenstein 1985), or, in a Bayesian setting (Huelsenbeck et al. 2002), trees sampled by Markov chain Monte Carlo. However, we also wish to combine trees that should not contribute equally, so we require additional techniques. In this paper, we present new methods to incorporate additional information into consensus networks and demonstrate their use by applying them to three previously published data sets.


    Methods
 
In the multigene setting, phylogenetic analysis of each gene often leads to a collection of plausible trees. When such collections arise from a bootstrap analysis (or from a Markov chain Monte Carlo simulation), a straightforward way to retain the information on what confidence we have in the gene trees is to generate a consensus network for the amalgamation of all the bootstrap trees from each gene. In this way, we identify the strongly supported splits for each gene and then display these splits within a consensus network.

In other cases, the plausible trees for each gene may have associated weights. For example, expected likelihood weights (Strimmer and Rambaut 2002), or the P values derived using the Shimodaira and Hasegawa (SH)-test (Shimodaira and Hasegawa 1999), can be used to weight the trees before they are combined into the consensus network. In order to incorporate tree-specific weights into consensus networks, higher weights are assigned to splits corresponding to edges in trees with higher weights than to splits in trees with lower weights.

Suppose we have a collection of trees indexed by some set $I$, for which tree $T_i$ has weight $W_i$. For split $j$, let $I_j$ be the subset of $I$ consisting of the indices of the trees containing split $j$, and define the support of split $j$ as

$$s_j = \frac{\sum_{i \in I_j} W_i}{\sum_{i \in I} W_i}.$$

We then display those splits in the consensus network for which $s_j$ is greater than the threshold, $x$. Note that in the case where all tree weights are equal, this is equivalent to ignoring the tree weights and applying the standard consensus network approach. Moreover, if the tree weights are positive integers and we replace the original collection of trees by a new one consisting of $W_i$ copies of the tree with index $i$, then this method is again equivalent to applying the standard consensus network. This new weighted consensus network method can also be viewed as a generalization of the weighted consensus tree method described in Jermiin et al. (1997).
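A minimal Python sketch of the weighted support calculation follows. It assumes that the support of a split is the total weight of the trees containing it divided by the total weight of all trees, which is the normalization under which equal weights reduce to the standard method. The function names and the use of string labels such as "ab|cd" for splits are hypothetical choices for the example, not the published scripts.

```python
def weighted_split_support(trees, weights):
    """Map each split to (weight of trees containing it) / (total tree weight).

    trees: list of sets of hashable split labels (one set per tree).
    weights: one non-negative number per tree, in the same order.
    """
    total = sum(weights)
    support = {}
    for splits, w in zip(trees, weights):
        # set() guards against a split being listed twice for one tree.
        for split in set(splits):
            support[split] = support.get(split, 0.0) + w
    return {split: s / total for split, s in support.items()}


def select_splits(support, x):
    """Keep only the splits whose support exceeds the threshold x."""
    return {split: s for split, s in support.items() if s > x}
```

With trees `[{"ab|cd"}, {"ac|bd"}, {"ab|cd"}]` and weights `[0.6, 0.3, 0.1]`, the split "ab|cd" collects the weight of the first and third trees. Giving a tree the integer weight 2 yields the same supports as listing that tree twice with weight 1, which mirrors the equivalence noted in the text.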

We now describe the case where the input trees have edge lengths as well as tree-specific weights. Define $l_{ij}$ to be the edge length of the edge displaying split $j$ in tree $i$. Taking a given threshold, $x$, we display split $j$ in the consensus network if $s_j > x$ and give the edge in the consensus network displaying this split the length $w_j$, where

$$w_j = \frac{\sum_{i \in I_j} W_i \, l_{ij}}{\sum_{i \in I_j} W_i}.$$

In other words, the edges representing some split in the consensus network are given length equal to the weighted average of the lengths of the edges representing that split in each of the input trees. Analogous schemes could be used to calculate median or minimum weights.
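The edge-length scheme can be sketched in the same style. The example below assumes each tree is supplied as a dictionary from a split label to the length of the edge displaying that split, that a split's support is its share of the total tree weight, and that the threshold x is non-negative. The function name and input format are illustrative assumptions only.

```python
def weighted_consensus_edges(trees, weights, x):
    """Return {split: (support, length)} for splits with support greater than x.

    trees: list of dicts {split label: edge length}, one dict per tree.
    weights: one non-negative tree weight per tree, in the same order.
    The length is the tree-weighted average of the split's edge lengths,
    taken over the trees that contain the split.
    """
    total = sum(weights)
    weight_sum = {}    # sum of tree weights over trees containing the split
    weighted_len = {}  # sum of (tree weight * edge length) over those trees
    for edges, w in zip(trees, weights):
        for split, length in edges.items():
            weight_sum[split] = weight_sum.get(split, 0.0) + w
            weighted_len[split] = weighted_len.get(split, 0.0) + w * length
    result = {}
    for split, ws in weight_sum.items():
        support = ws / total
        if support > x:  # ws > 0 here because x >= 0, so the division is safe
            result[split] = (support, weighted_len[split] / ws)
    return result
```

For instance, if "ab|cd" has length 0.10 in a tree of weight 3 and length 0.30 in a tree of weight 1, its consensus length is (3 × 0.10 + 1 × 0.30) / 4 = 0.15. A median or minimum could be substituted for the weighted average by collecting the lengths per split instead of summing them.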

The methods described above have been implemented as Python scripts that create Nexus files, which can be displayed as networks by Spectronet (Huber et al. 2002). The scripts are available from the corresponding author on request and can also be downloaded from http://www.usyd.edu.au/subit/.


    Results and Discussion
 
We illustrate the incorporation of weights into consensus networks through three examples. The first example uses the multigene data set that was analyzed by Goremykin et al. (2005). Using this data set, we apply the approach of combining strongly supported splits for each gene. The second example uses the multigene data set that was analyzed by Rokas et al. (2003). Using this data set, we apply the approach of combining strongly supported trees for each gene. The third example uses a gene family that was analyzed by Hardy et al. (2004). Using this data set, we illustrate the use of tree-specific weights to select a set of splits to be displayed for a single-gene data set. For all three examples, we also illustrate the difference between networks with edge lengths calculated as a weighted average of the relevant edge lengths from the collection of input trees and networks with edge lengths proportional to the support of the split they represent. Our aim is to illustrate the incorporation of weights into consensus networks for a range of different data sets and weighting schemes.

Data of Goremykin et al. (2005)
The data consist of 61 alignments of protein-coding genes in 15 taxa (14 angiosperm species, with Pinus as an outgroup). Phylogenetic analysis was carried out using PAUP* version 4.0b10 (Swofford 2002). For each gene we used Modeltest version 3.06 (Posada and Crandall 1998) to estimate the symmetric model of nucleotide substitution that most appropriately fits the gene (as selected by the hierarchical likelihood ratio test option). To reduce computation time, we estimated the parameters of the model on a neighbor-joining tree inferred from p-distances rather than simultaneously estimating the parameters and the most likely tree for each gene. These model parameters were then fixed when estimating the most likely tree using heuristic search (the default heuristic search settings were retained).

The consensus network of the 61 most likely gene trees is shown in figure 1A. We chose a threshold value of x = 0.1 because experimentation showed that this threshold displayed the most interesting conflicts while maintaining a split system that could be displayed in three dimensions. (Note that this is much better than the worst-case scenario with x = 0.1, which could contain nine-dimensional hypercubes.) To identify the strongly supported splits for each gene, we generated 100 bootstrap trees using maximum likelihood for each gene and then concatenated these results into a single collection of trees. The consensus network with x = 0.1 is shown in figure 1B. The network in figure 1C contains the same splits as figure 1B but with edge lengths proportional to a weighted average of the genetic distances.
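The nine-dimensional worst case mentioned above follows from a short counting argument, sketched here under two standard assumptions that the text does not spell out: a k-dimensional hypercube in the network corresponds to k pairwise incompatible splits, and a single tree can contain at most one split from such a set. Writing $s_j$ for the support of split $j$ (the proportion of input trees containing it), each of the k splits has $s_j > x$, while their supports sum to at most 1 because no tree is counted twice:

```latex
\begin{align*}
  k\,x \;<\; \sum_{j=1}^{k} s_j \;\le\; 1
  \quad\Longrightarrow\quad
  k \;<\; \frac{1}{x} = 10 \quad \text{for } x = 0.1
  \quad\Longrightarrow\quad
  k \le 9 .
\end{align*}
```

The same argument gives at most four dimensions for x = 0.2, the threshold used in the other two examples.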


FIG. 1. Consensus networks for data from Goremykin et al. (2005) generated by combining (A) the maximum likelihood tree for each of 61 genes, x = 0.1, and (B) 100 maximum likelihood bootstrap trees for each of 61 genes, x = 0.1; (C) displays the splits from (B) with edge lengths proportional to the weighted average of genetic distance.

 
There are three areas of uncertainty within figure 1A–C. The most important of these concerns the position of the root, where there are four groupings that appear above the x = 0.1 threshold: Amborella basal; Nymphaea basal; Amborella plus Nymphaea basal; and grasses (Triticum, Oryza, Zea, Saccharum) basal. The support for the Amborella plus Nymphaea basal hypothesis is reduced when considering the strongly supported splits from each gene via the maximum likelihood bootstrap (fig. 1B) rather than a single maximum likelihood tree for each gene (fig. 1A). Also, support for two splits drops below the threshold when the extra information about split support is considered, meaning that figure 1B has less conflict than figure 1A. By considering 6,100 maximum likelihood bootstrap trees, rather than just 61 maximum likelihood trees, we expect to reduce the variance of individual split support values.

The results of Goremykin et al. (2003, 2004), suggesting that grasses may be the most basal angiosperms, were criticized by Stefanovic, Rice, and Palmer (2004) on the grounds that the phylogenetic methodology was inadequate and that the taxon sampling was unbalanced. They suggested that parsimony and maximum likelihood without correction for variable rates across sites were being misled by long-branch attraction. The analysis here mitigates the effect of long-branch attraction by using the maximum likelihood optimality criterion with an appropriate model fit for each gene. However, the unbalanced taxon sampling remains an issue. Overall, our results suggest that none of the four possibilities listed above for the rooting of the angiosperms can be conclusively ruled out.

This example illustrates the different advantages of the two edge weighting schemes (either proportional to genetic distance [fig. 1C] or to support [fig. 1A and B]). The potential for long-branch attraction between Pinus and the grasses can be seen in figure 1C but is not apparent in figure 1A and B. However, the relative support for different rootings of the angiosperms is shown most clearly in figure 1B.

Data of Rokas et al. (2003)
The data consist of 106 gene alignments for eight taxa (seven species from the Saccharomyces genus, with Candida albicans as an outgroup). Phylogenetic analysis of the genes was carried out using PAUP* (Swofford 2002) and Modeltest (Posada and Crandall 1998) as described for the previous example.

We used two different methods to identify strongly supported trees for each gene: expected likelihood weights (ELWs) (Strimmer and Rambaut 2002) and the multiple-comparison SH-test (Shimodaira and Hasegawa 1999). These two approaches were chosen to provide an illustration of the method; in general, any method that produces trees with weights could be used, and different weighting schemes could be appropriate depending on the particular interests of the user. Both ELWs and the SH-test require maximum-likelihood scores to be evaluated for many bootstrap replicates for all the trees under consideration. To ease the computational burden, we considered the 45 trees agreeing with the constraint tree ((Saccharomyces cerevisiae, Saccharomyces paradoxus), (C. albicans, Saccharomyces castellii, Saccharomyces kluyveri), Saccharomyces mikatae, Saccharomyces kudriavzevii, Saccharomyces bayanus) and used the resampling of estimated log-likelihoods (RELL) bootstrap procedure (Kishino, Miyata, and Hasegawa 1990). To aid comparison of the two weighting schemes, we used the same RELL bootstrap procedure to calculate the ELW as well as the P values for the SH-test. Consensus networks, based on the ELW and the P value for the SH-test, are shown in figure 2A and B, respectively. The network in figure 2C contains the same splits as in figure 2A and B but with edge lengths proportional to a weighted average of the genetic distances. In all networks x = 0.2; again this value was chosen after some experimentation in order to show the major conflicting splits without allowing the network to become too high dimensional.
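To make the RELL-based ELW calculation concrete, the following Python sketch assumes that the ELW of a tree is the average, over bootstrap replicates, of that tree's likelihood divided by the summed likelihoods of all candidate trees, and that RELL replicates are formed by resampling precomputed per-site log-likelihoods with replacement. The function name and input layout are hypothetical, and this is not the procedure as run by the authors, who used existing software.

```python
import math
import random


def rell_elw(site_loglik, n_boot=1000, rng=None):
    """Approximate expected likelihood weights from per-site log-likelihoods.

    site_loglik: one list per candidate tree, each holding the log-likelihood
    of every alignment site under that tree (all lists the same length).
    Returns one weight per tree; the weights sum to 1.
    """
    rng = rng or random.Random()
    n_trees = len(site_loglik)
    n_sites = len(site_loglik[0])
    weights = [0.0] * n_trees
    for _ in range(n_boot):
        # RELL step: resample sites with replacement and reuse stored values.
        sites = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = [sum(tree[s] for s in sites) for tree in site_loglik]
        # Normalize the likelihoods; subtracting the maximum avoids underflow.
        top = max(totals)
        rel = [math.exp(t - top) for t in totals]
        norm = sum(rel)
        for i, r in enumerate(rel):
            weights[i] += r / norm
    return [w / n_boot for w in weights]
```

Because the resulting weights sum to 1 for each gene, they can be passed directly as tree weights when combining the candidate trees of several genes.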


FIG. 2. Consensus networks for data from Rokas et al. (2003) generated by combining collections of weighted trees for each gene. Tree-specific weights were (A) ELWs (Strimmer and Rambaut 2002), x = 0.2, and (B) P values from the SH-test (Shimodaira and Hasegawa 1999), x = 0.2; (C) shows the same split system but with edge lengths proportional to the weighted average of genetic distance.

 
The networks obtained from using ELWs (fig. 2A) and SH-test P values as weights (fig. 2B) both contain the same splits, but they give different support to these splits. Using ELWs, the split {S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii}|{S. bayanus, C. albicans, S. castellii, S. kluyveri} has higher weight than the conflicting split {S. cerevisiae, S. paradoxus, S. mikatae, C. albicans, S. castellii, S. kluyveri}|{S. bayanus, S. kudriavzevii}, but using SH-test P values as weights gives higher weight to the latter split. Figure 2A also has the same splits, and similar split support, to figure 1B from Holland et al. (2004) (a consensus network of the 106 most likely trees for each gene). We checked whether this was due to most genes giving support to just one or a few trees, but this was not the case. The average number of trees in the 95% confidence interval for each gene was 15.47 for the bootstrap trees and 11.75 for the ELWs. For the SH-test there was an average of 23.23 trees for each gene that could not be rejected as being significantly worse than the most likely tree at α = 0.05. We conclude that the similarity is due to the fact that such a large number of genes (106) are considered. For data sets with fewer genes, we expect that the differences between consensus networks that do or do not consider tree weights could be greater.

There are two contradictory signals regarding the placement of S. kudriavzevii. Phillips, Delsuc, and Penny (2004) provided a possible explanation of this: they showed that a signal linking S. kudriavzevii and S. bayanus may be due to compositional heterogeneity in the data and that the other signal is probably historical. The other conflict in the networks (fig. 2) arises from the three different possibilities for grouping the taxa that are associated with edges, that is, C. albicans, S. kluyveri, and S. castellii. The concatenated data set resolves this in favor of the split grouping C. albicans and S. kluyveri. Figure 2C, which incorporates edge lengths proportional to genetic distance, suggests that support for the other two splits may result from attraction of long edges.

Data of Hardy et al. (2004)
This data set is based on an amino acid alignment of 46 type I interferons from human, mouse, chicken, sheep, goat, musk ox, giraffe, cow, pig, horse, rabbit, duck, zebrafish, Takifugu rubripes, and Tetraodon nigroviridis. An initial collection of 10,000 trees was inferred using Tree-Puzzle (Schmidt et al. 2002) under the Jones-Taylor-Thornton model of amino acid substitutions (Jones, Taylor, and Thornton 1992), with rates-of-change across sites modeled by a discrete Γ distribution with four rate categories and a proportion of invariant sites (all parameters were estimated on the neighbor-joining tree). The initial collection of 10,000 trees was compared using the test by Kishino and Hasegawa (1989), and 9,499 trees were found to differ significantly from the most likely tree. This gave a reduced set of 501 plausible trees that were assigned tree-specific weights using the likelihood-weighted tree-averaging method of Jermiin et al. (1997); in the present case, we used exponential weighting of the differences between the log-likelihood score of the most likely tree and those of the other plausible trees, standardized by the standard errors of those differences (and with a significance level of 5%) (for details and justification, see Jermiin et al. 1997).

The 501 plausible trees and their tree-specific weights were used to infer a weighted consensus tree (fig. 3). The edge lengths were obtained through maximum-likelihood analysis of the data, given the consensus tree and the model of substitution. The tree shows that the evolutionary relationship of these sequences is well resolved in many areas of the tree and poorly resolved in other areas: at the origin of the β-, ε-, and κ-subfamilies; at the origin of the ω- and τ-subfamilies; and within the human and mouse α-subfamily. In some instances, short edges are highly supported, and in other cases long edges are poorly supported. This latter case suggests that there may be other conflicting signals in the data that cannot be displayed within a consensus tree.


FIG. 3.— Weighted majority-rule consensus tree showing the inferred evolutionary history of the type I interferon family. The edge lengths represent the evolutionary distances (the scale bar represents 20 substitutions per 100 sites) and are inferred by maximum likelihood on the consensus tree. The relative-likelihood scores (0%–100%) associated with internal edges represent the degree of confidence in these edges.

 
The 501 plausible trees and their tree-specific weights were also used to infer a consensus network (fig. 4A) using x = 0.2. The network clearly identifies areas where there is support for conflicting hypotheses. However, the network does not use any edge length information from the input trees, so the network is difficult to interpret within an evolutionary context.


FIG. 4. Consensus networks for the type I interferon family. Tree weights were assigned using the likelihood-weighted tree-averaging method of Jermiin et al. (1997) with α = 0.05 and Class-V weighting of the plausible trees. In (A), edge lengths correspond to split support, and in (B), edge lengths correspond to a weighted average of genetic distance.

 
The collection of plausible trees, with their edge lengths and tree-specific weights, was used to estimate an alternative consensus network (fig. 4B), again using x = 0.2. This network allows for assessment of conflicting evidence in the data as well as for interpretation of the network within an evolutionary context. Of the four poorly resolved areas, one stands out strongly (i.e., the one at the joining of the β-, ε-, and κ-subfamilies) and another one is barely visible (i.e., the one at the joining of the ω- and τ-subfamilies). The other two areas of conflict are negligible and most likely due to a high degree of sequence similarity. Due to the presence of edge lengths that reflect the amount of genetic change along different lineages, the interpretation of the consensus network is relatively easy and in this case largely consistent with that of Hardy et al. (2004).


    Conclusions
 
If stochastic error were the only challenge to accurate phylogenetic estimation, then there would be no question that concatenating genes is the best strategy. However, analyses of genome-scale data sets (Phillips, Delsuc, and Penny 2004) show that this is not the case. Tree estimation for these data is sensitive to data coding and model selection, indicating that some nonhistorical signals arising from systematic biases in the data are of a magnitude similar to the historical signal. Trees inferred from concatenated data sets tend to have high bootstrap support, but it is clearly inappropriate to interpret high bootstrap support values as measures of accuracy. A recent modification to the bootstrap that makes it more appropriate for use with multigene data was suggested by Seo, Kishino, and Thorne (2005); they implement a two-stage bootstrap procedure where the first stage resamples genes and the second stage resamples columns (sites). However, even with this improvement, given long enough sequences, data sets with many conflicting signals may still receive high bootstrap support.
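The two-stage resampling idea can be sketched as follows. This is only an illustration of the two stages as summarized above (genes first, then columns within each sampled gene); the function name, the dictionary input format, and the choice to draw as many genes and columns as the original data contain are assumptions, and details of the published procedure may differ.

```python
import random


def two_stage_bootstrap(alignments, rng=None):
    """Draw one two-stage bootstrap replicate from a multigene data set.

    alignments: dict mapping a gene name to its list of alignment columns.
    Returns a list of (gene name, resampled columns) pairs.
    """
    rng = rng or random.Random()
    genes = list(alignments)
    # Stage 1: resample genes with replacement.
    sampled_genes = [rng.choice(genes) for _ in genes]
    replicate = []
    for gene in sampled_genes:
        columns = alignments[gene]
        # Stage 2: resample columns (sites) with replacement within the gene.
        replicate.append((gene, [rng.choice(columns) for _ in columns]))
    return replicate
```

Each replicate would then be analyzed in the same way as the original data, and support values tallied across replicates.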

Consensus networks provide a more inclusive approach than phylogenetic analysis of concatenated gene sequences because they allow weak or conflicting signals in the genes to be shown. However, we believe there are more appropriate ways to combine available information from different genes than simply to take the best tree for each gene. Here we have presented methods that incorporate more of the information that can be obtained from different genes.

In the approach involving identification of the strongly supported splits at each locus, for example, via a bootstrap analysis, all genes contribute the same number of source trees, but short genes with weak signals will contribute many different trees, whereas longer genes with stronger signals will contribute many copies of the same tree. In contrast, concatenations of genes produce results where the length of each gene acts as a weight on phylogenetic signal of each gene, so long genes with strong signals will tend to dominate the signals of other genes. While this might be desirable from some points of view, it does not allow for assessment of potential conflicts between different subsets of phylogenetic data. While this paper has concentrated on conflicts in the phylogenetic signal, in some data sets conflict might be due to historical signal, for example, in the case of either lateral gene transfer or hybridization.

How much different genes contribute to the overall picture, when combining the sets of plausible trees from different genes, depends on the tree-specific weights used. The advantage of using ELW or Bayesian posterior probabilities is that they sum to a constant amount for each gene, that is, the sum of the tree-specific weights for each gene is 1 for ELW and 0.95 for Bayesian posterior probabilities (given α = 0.05). By contrast, P values, such as those determined by the Kishino and Hasegawa (KH)-tests or SH-tests, do not. In the case of an uninformative gene, many trees would have high P values and thus bias the consensus network. The sum of these P values might be much greater than 1, and the sum will differ from gene to gene, which may be undesirable. One possible solution to this problem would be to use the SH-test or KH-test to identify a set of plausible trees and then to normalize the corresponding P values so that the sum of the tree weights for each gene was equal across genes.
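The normalization suggested above could look like the following sketch. It assumes that a tree is retained as plausible when its P value is at least α and that each gene's retained P values are rescaled to sum to 1; both the common sum of 1 and the function name are choices made for illustration, not something the text prescribes.

```python
def normalize_weights_per_gene(p_values_by_gene, alpha=0.05):
    """Turn per-gene P values into tree weights that sum to 1 for every gene.

    p_values_by_gene: {gene: {tree label: P value}}.
    Trees with P value below alpha are treated as rejected and dropped.
    Genes with no plausible tree (or only zero P values) are skipped.
    """
    normalized = {}
    for gene, p_values in p_values_by_gene.items():
        # Plausible set: trees not rejected at significance level alpha.
        plausible = {t: p for t, p in p_values.items() if p >= alpha}
        total = sum(plausible.values())
        if total == 0:
            continue
        normalized[gene] = {t: p / total for t, p in plausible.items()}
    return normalized
```

After this step an uninformative gene, for which many trees have high P values, contributes the same total weight as an informative gene; its weight is simply spread over more trees.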

Consensus networks, as originally described, do not conform with the conventional biological interpretation of edge lengths (Morrison 2005). Usually, the length of an edge in a phylogenetic tree represents the genetic distance between two nodes, whereas the length of an edge in a standard consensus network is proportional to the number of input trees that display the particular edge. The implementation of consensus networks in SplitsTree4 (Huson and Bryant 2005) offers various options in regard to the edge lengths: edges of the network representing a certain split can have their length set to the minimum, median, or average length of the edges representing that split in the input trees. The support values for each edge (i.e., the number of trees in which they appear) can be displayed as numbers associated with the edge in the style of bootstrap values or by edge thickness. Here we have incorporated both edge weights and tree weights into the consensus network.

One difficulty in interpreting consensus networks, or more generally splits graphs, is that their internal nodes are not meant to represent inferred ancestral species; that is, in contrast to phylogenetic trees, they do not attempt to provide an explicit representation of evolutionary history. Thus, it might be useful to develop alternative approaches to combining (rooted) trees into networks such as those described in, for example, Baroni, Semple, and Steel (2004). It might also be interesting to see whether the approaches described in this paper for consensus networks can be extended to super networks (Huson et al. 2004), networks that are similar in spirit to consensus networks but allow for partial data, that is, some genes are not available for all taxa.

In conclusion, we have presented some new approaches for incorporating additional phylogenetic information in the construction of consensus networks. Using three data sets, we have demonstrated that the resulting methods allow us to form hypotheses about whether conflicting signals are due to, for example, rate heterogeneity among lineages causing long-branch attraction, small distances between sequences causing lack of resolution, or systematic biases. We believe that consensus networks that incorporate the information available from analysis of individual genes, both tree weights and edge weights, will provide a useful tool for exploring the issues arising in the rapidly expanding field of genome-scale phylogeny.


    Acknowledgements
 
We wish to thank Michael Charleston and David Penny for their constructive comments on the manuscript and Patrick Forterre and the Institut Pasteur for their hospitality toward L.S.J. We thank Vadim Goremykin and Antonis Rokas for providing us with their alignments. B.R.H. acknowledges funding from the New Zealand Foundation for Research Science and Technology. This research was partly funded by a Discovery Grant (DP0453173) from the Australian Research Council. This is research paper number 016 from the Sydney University Biological Informatics & Technology Centre.


    Footnotes
 
Laura Katz, Associate Editor


    References
 

    Bandelt, H.-J. 1995. Combination of data in phylogenetic analysis. Plant Syst. Evol. Suppl. 9:355–361.

    Baroni, M., C. Semple, and M. Steel. 2004. A framework for representing reticulate evolution. Ann. Comb. 8:391–408.

    Buckley, T. R. 2002. Model misspecification and probabilistic tests of topology: evidence from empirical data sets. Syst. Biol. 51:509–523.

    Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791.

    Goremykin, V. V., K. I. Hirsch-Ernst, S. Wölfl, and F. H. Hellwig. 2003. Analysis of the Amborella trichopoda chloroplast genome sequence suggests that Amborella is not a basal angiosperm. Mol. Biol. Evol. 20:1499–1505.

    ———. 2004. The chloroplast genome of Nymphaea alba: whole-genome analyses and the problem of identifying the most basal angiosperm. Mol. Biol. Evol. 21:1445–1454.

    Goremykin, V. V., B. Holland, K. I. Hirsch-Ernst, and F. H. Hellwig. 2005. Analysis of Acorus calamus chloroplast genome and its phylogenetic implications. Mol. Biol. Evol. 22:1813–1822.

    Hardy, M. P., C. M. Owczarek, L. S. Jermiin, M. Ejdebäck, and P. J. Hertzog. 2004. Characterization of the type I interferon locus and identification of novel genes. Genomics 84:331–345.

    Hillis, D. M., J. P. Huelsenbeck, and D. L. Swofford. 1994. Hobgoblin of phylogenetics? Nature 369:363–364.

    Ho, S. Y. W., and L. S. Jermiin. 2004. Tracing the decay of the historical signal in biological sequence data. Syst. Biol. 53:623–637.

    Holland, B. R., F. Delsuc, and V. Moulton. 2005. Visualising conflicting evolutionary hypotheses in large collections of trees: using consensus networks to study the origins of placentals and hexapods. Syst. Biol. 54:66–76.

    Holland, B. R., K. T. Huber, V. Moulton, and P. J. Lockhart. 2004. Using consensus networks to visualize contradictory evidence for species phylogeny. Mol. Biol. Evol. 21:1459–1461.

    Holland, B., and V. Moulton. 2003. Consensus networks: a method for visualising incompatibilities in collections of trees. Pp. 165–176 in G. Benson and R. Page, eds. Algorithms in bioinformatics. Springer-Verlag, Berlin.

    Huber, K. T., M. Langton, D. Penny, V. Moulton, and M. Hendy. 2002. Spectronet: a package for computing spectra and median networks. Appl. Bioinform. 1:159–161.

    Huelsenbeck, J. P., B. Larget, R. E. Miller, and F. Ronquist. 2002. Potential applications and pitfalls of Bayesian inference of phylogeny. Syst. Biol. 51:673–688.

    Huson, D. H., and D. Bryant. 2005. Estimating phylogenetic trees and networks using SplitsTree4. (http://www.splitstree.org).

    Huson, D. H., T. Dezulian, T. Kloepper, and M. Steel. 2004. Phylogenetic super-networks from partial trees. IEEE/ACM Trans. Comput. Biol. Bioinform. 1:151–158.

    Jermiin, L. S., S. Y. W. Ho, F. Ababneh, J. Robinson, and A. W. D. Larkum. 2004. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. 53:637–643.

    Jermiin, L. S., G. J. Olsen, K. L. Mengersen, and S. Easteal. 1997. Majority-rule consensus of phylogenetic trees obtained by maximum-likelihood analysis. Mol. Biol. Evol. 14:1296–1302.

    Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275–282.

    Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170–179.

    Kishino, H., T. Miyata, and M. Hasegawa. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 30:151–160.

    Morrison, D. 2005. Networks in phylogenetic analysis: new tools for population biology. Int. J. Parasitol. 35:567–582.

    Pagel, M., and A. Meade. 2004. A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst. Biol. 53:571–581.

    Phillips, M. J., F. Delsuc, and D. Penny. 2004. Genome-scale phylogeny and the detection of systematic biases. Mol. Biol. Evol. 21:1455–1458.

    Poladian, L., and L. S. Jermiin. 2006. Multi-objective evolutionary algorithms and phylogenetic inference with multiple data sets. Soft Comput. 10:359–368.

    Posada, D., and K. A. Crandall. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14:817–818.

    Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798–803.

    Schmidt, H. A., K. Strimmer, M. Vingron, and A. von Haeseler. 2002. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18:502–504.

    Seo, T.-K., H. Kishino, and J. L. Thorne. 2005. Incorporating gene-specific variation when inferring and evaluating optimal evolutionary tree topologies from multilocus sequence data. Proc. Natl. Acad. Sci. USA 102:4436–4441.

    Shimodaira, H., and M. Hasegawa. 1999. Multiple comparison of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol. 16:1114–1116.

    Stefanovic, S., D. W. Rice, and J. D. Palmer. 2004. Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evol. Biol. 4:35.

    Strimmer, K., and A. Rambaut. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. R. Soc. Lond. B 269:137–142.

    Swofford, D. L. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.

Accepted for publication November 2, 2005.





