

MBE Advance Access originally published online on July 13, 2007
Molecular Biology and Evolution 2007 24(9):2029-2039; doi:10.1093/molbev/msm139

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Treeness Triangles: Visualizing the Loss of Phylogenetic Signal

WT White*,1, SF Hills*,1, R Gaddam*,1, BR Holland* and David Penny*

* Allan Wilson Center for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand

E-mail: D.Penny@massey.ac.nz.


    Abstract
 
It is well known that molecular data "saturates" with increasing sequence divergence (thereby losing phylogenetic information) and that saturation may be accompanied by the accumulation of misleading information due to chance similarities or to systematic bias. Exploratory data analysis methods that can quantify the extent of signal loss or convergence for a given data set are scarce. Such methods are needed because genomics delivers very long sequence alignments spanning substantial phylogenetic depth, where site saturation may be compounded by systematic biases or other alternative signals. Here we introduce the Treeness Triangle (TT) graph, in which signals detectable by Hadamard (spectral) analysis are summed into 3 categories: those supporting 1) external and 2) internal branches in the optimal tree, and 3) the residuals (potential internal branches not present in the optimal tree). These 3 values are plotted in a standard ternary coordinate system. The approach is illustrated with simulated and real data sets, the latter from complete chloroplast genomes, where potential problems of paralogy or lateral gene acquisition can be excluded. The TT uncovers the divergence-dependent loss of phylogenetic signal as subsets of chloroplast genomes are investigated that span increasingly deeper evolutionary timescales. The rate of signal loss (or signal retention) varies with the gene and/or the method of analysis.

Key Words: plastid genomes • spectral analysis • model misspecification • exploratory data analysis • ternary plot • Hadamard conjugation


    Introduction
 
Estimating phylogenies for deep divergences with sequence data is known to be a mathematically hard problem for a number of reasons. Over timescales on the order of about 600 Myr or more, the historical signal contained in sequences will be obscured by random noise (Penny et al. 2001). The theoretical results of Mossel and Steel (2004, 2005) demonstrate that under standard Markov models, as currently employed in molecular phylogenetics, primary sequences should lose all information about divergences approaching 1 billion years in age. For example, following theorem 14.2 of Mossel and Steel (2005), we can calculate that for 4 sequences of length 1000 evolving under a Jukes–Cantor model of nucleotide substitution with a mutation rate per nucleotide of about 10^-8 per year, if all 4 lineages existed as far back as 1 billion years ago, the probability of correctly estimating the tree would be 1/3 plus 0.002 (where the 1/3 term is just the chance of guessing correctly). With this model and substitution rate, it requires sequences of ~100,000 bp to have a 50% chance of recovering the correct tree for just 4 taxa. This calculation assumes ideal conditions; any sources of conflicting information would require longer sequences to compensate, and hence, the calculation places an upper bound on the expected result for the case of a simple, known model.

A related complication is that commonly used models of sequence evolution assume that, across the entire tree, each site is evolving in the same rate class. This includes the widely used general time reversible (GTR) model and its extension to models where a distribution of rates-across-sites (RAS) is assumed, with or without some sites being considered to be invariant. However, models assuming a gamma distribution require that each site must stay in the same rate class across all lineages (Steel et al. 1994). Such RAS models are only a simplified approximation of how sequences really evolve in nature (Lockhart et al. 2000, 2006), but for shorter time scales they provide a sufficiently good approximation to allow accurate phylogenetic estimation. We refer to such short to intermediate time periods (up to about 300 Myr) as the "comfort zone" because simulations reinforce the conclusion that phylogenetic inference is very powerful here (Penny et al. 2001). However, over time scales of half a billion years or more, the failure to incorporate lineage-specific processes, such as changes in nucleotide composition between taxa, may have dire consequences for phylogenetic estimation (see e.g., Ho and Jermiin 2004). Simulations allow us to predict the loss of information under specific models, but for real data sets where the actual substitution process is poorly understood, we need to be able to assess quantitatively the phylogenetic information in a given data set.

Confidence in inferred trees is often estimated by bootstrap values or posterior probabilities. Such values are useful when assessing whether or not sampling error may be influencing the results. However, bootstrap values do not detect systematic error; thus, they do not guarantee that the branch in question is correct. For example, several studies of genome-scale data sets have shown that "support" in terms of bootstrap proportions (BPs) can swing from 100% for one tree to 100% for a different tree by adjusting the model of nucleotide substitution (Phillips et al. 2004; Goremykin et al. 2005). The bootstrap is generally not useful for assessing either loss or presence of phylogenetic signal for deep divergences because it does not take into account systematic error such as mutational bias (Lockhart et al. 1992; Lockhart and Cameron 2001; Buckley 2002). Stated another way, the bootstrap permits statements about site pattern frequencies, but it does not address the issue of whether or not site patterns reflect historical signal.

To determine whether systematic error is readily detectable for a given data set, tools to evaluate the goodness of fit of models of evolution are often employed. In present practice, goodness of fit is typically assessed using relative tests such as the likelihood ratio test or the Akaike Information Criterion as implemented in Modeltest (Posada and Crandall 1998), which ask whether model A fits the data significantly better than model B without, however, revealing how close model B comes to approximating the true model. Another class of tests has been used to answer questions about the absolute goodness of fit of models to data in a phylogenetic context (Reeves 1992; Goldman 1993; Bollback 2002; Waddell 2005; Jayaswal et al. 2005), but failure to pass such tests does not explain what aspects within the data are causing the poor fit. The parametric bootstrap is another test and can be used to compare, for example, the observed and predicted numbers of "singleton" sites, which basically correspond to the external branches of the tree (Goremykin et al. 2005; Waddell 2005).

Phylogenetic network methods also allow exploration of different, potentially conflicting, signals in the data. It is well known that there is a one-to-one correspondence between phylogenetic trees and sets of compatible splits; a binary tree with n taxa corresponds to a set of 2n – 3 splits. Network methods allow sets of incompatible splits and correspondingly more detailed graphs. One of the first was split decomposition (Bandelt and Dress 1992a), which takes a metric (distance matrix) on n taxa and produces a set of up to n(n – 1)/2 weakly compatible weighted splits, as implemented in SplitsTree 4 (Huson and Bryant 2006). A useful feature is that both the proportion of the metric that is explained (graphically represented) by the split system and the residual that is not explained (undepicted) are calculated. NeighborNet (Bryant and Moulton 2004) is a more recent method that produces a set of up to n(n – 1)/2 circular splits; these can always be represented on a planar graph. Other exploratory methods include spectral analysis (Hendy and Penny 1993), Lento plots (Lento et al. 1995), and consensus networks (Holland et al. 2005). These methods have proved useful for assessing conflicting signals within individual data sets (Kennedy et al. 2005; Nannya et al. 2005). The likelihood-mapping approach of Strimmer and von Haeseler (1997), which also uses a triangle plot, provides a useful graphical gauge of phylogenetic signal without recourse to assumptions about the underlying tree. Unfortunately, its output may be difficult to interpret: if most points fall near the center of the diagram, it can be concluded with confidence that the data is non-treelike, but if the points cluster at corners of the triangle, the data may or may not be treelike. More importantly, there is frequently a need to compare multiple data sets or various models on the same data set. In such cases, it is convenient to have an exploratory approach that enables rapid comparison across many data sets. Although the approaches mentioned above are useful, they are also visually complex, which makes it hard to compare results across many data sets and treatments. For example, whereas likelihood-mapping summarizes a data set with a set of points on a diagram, a treeness triangle (TT) summarizes a data set with a single point, enabling multiple data sets, or multiple analyses of a single data set, to be compared on a single diagram. Although the dekapentagonal mapping approach of Zhaxybayeva et al. (2004) extends the quartet-based likelihood-mapping method to 5-taxon data sets, with a single point per data set, generalizing the method to n taxa appears problematic.

Building upon the concept of treeness, introduced by Andreas Dress and used in Eigen and Winkler-Oswatitsch (1981) and Eigen et al. (1988) to assess how well data fit a tree, we introduce the "Treeness Triangle" (TT) method. This assorts phylogenetic signals in aligned sequences into 3 components: signals that correspond to internal edges (branches) of a tree (I), signals that correspond to external edges of a tree (E), and the residual signals (R) that correspond to edges not present in the specified tree. These 3 values must sum to 1.0 and can therefore be plotted in a standard triangle (ternary) plot that readily reveals the relative proportion of each signal type in a given data set. We illustrate the TT with both simulated and real data, the latter from complete chloroplast (plastid) genomes. Here, we compare the redistribution of signal proportions for the same genes as a function of increasing evolutionary time from flowering plant (fp) evolution spanning roughly 160–200 Myr (Magallon and Sanderson 2005) to the early diversification of photosynthetic eukaryote lineages, including red algae, whose fossil record spans at least 1200 Myr (Butterfield 2000). It is essential to understand the extent to which sequences retain phylogenetic signal for ancient divergences and to detect conflicting signals. For the reasons given above, this chloroplast data set is a suitable test case for evaluating the TT.


    Materials and Methods
 
Simulated Data
Random ultrametric trees were sampled from the PDA (Proportional to Distinguishable Arrangements) distribution, in which each tree topology is equally likely; the Markov model that generates these trees is described in Steel and Penny (1993). Each random ultrametric tree was produced by taking a symmetric two-taxon rooted tree and randomly adding edges. Sequences were simulated on these random trees using Seq-Gen (Rambaut and Grassly 1997) and the Jukes–Cantor model, with 0.2 (figs. 2A and D), 0.4 (figs. 2B and E), and 0.6 (figs. 2C and F) expected mutations per site along any path from the root to a tip. One hundred random trees were produced, and for each tree and mutation rate, data sets were generated with 100, 200, 400, 800, 1600, 3200, and 6400 sites.


 
FIG. 2. TT illustration of sampling error on simulated data. Results are for different sequence lengths and amounts of nucleotide change using the distance Hadamard (panels A–C) and projected Hadamard (panels D–F). TT points for 100 12-taxon trees under a molecular clock model were simulated with the Jukes–Cantor model and having 0.2 (panels A and D), 0.4 (panels B and E), and 0.6 (panels C and F) expected mutations per site between the root and the tip. For each point, data was generated on a different random tree consistent with a molecular clock and the optimal tree inferred by using the closest tree algorithm on the recovered split vector. The values are for 100 (blue), 400 (green), 1600 (orange), and 6400 (red) sites. Data sets created by taking 6400-site data sets and randomly shuffling nucleotides within columns are shown in gray. Note that some data sets produced split vectors that could not be analyzed using the projected Hadamard directly because of negative arguments to the log function; points for these data sets were omitted. (See also supplementary table S2, Supplementary Material online.)

 
Real Data
The real data set has 30 complete chloroplast (plastid) sequences and is subdivided into 4 overlapping subsets of 12 taxa each. The first subset has 12 flowering plants (fp), and each subsequent subset contains 6 sequences from the previous data set and 6 new ones. Thus, the land plant (lp) data set has 6 fps and 6 others from conifers to bryophytes; the green plant (gp) data set has 6 lps and 6 green algae; and the plastid (pl) data set has 6 from the gp data set and 6 other algae. The taxa in each data set, together with GenBank accession numbers, are shown in table 1.



 
Table 1 The Complete Chloroplast Genomes Used in This Study and the Data Sets They Appear in. They are fp (flowering plants), lp (land plants), gp (green plants), and pl (plastids–reds, greens, and browns)

 
Complete annotated plastid genomes were downloaded from GenBank, and annotations for the genomes were tabulated using a Perl script. A table was generated which included information about gene sequence, protein sequence, and gene location. The sequences were then imported into a Microsoft Access database. The database allowed sequences for each gene to be accessed quickly across the taxa of interest. For each gene sequence, alignments were carried out in BioEdit (Hall 1999); nucleotide data was translated to protein sequences, aligned, and translated back to nucleotide sequences. Where automated alignment was carried out, Clustal X (Thompson et al. 1994) was used, together with manual editing. Distance matrices were generated using PAUP* (Swofford 2001) from the aligned gene data sets (all alignments are available from http://awcmee.massey.ac.nz/downloads.htm).

Hadamard Transformation
Although the TT could be used directly on the frequencies of splits as observed in the data, it is usual to use it after correcting for inferred multiple changes. For mathematical reasons (Hendy et al. 1994), the full Hadamard transform requires either 2-state characters with a symmetric distribution or 4-state characters for the Kimura 3ST model and its submodels, namely the Jukes–Cantor and Kimura 2ST. However, the distance Hadamard calculation can be used with more complex models, including those that are nonstationary such as the general Markov model to which the LogDet applies (Lockhart et al. 1994) and any form of maximum likelihood distances (Felsenstein 2003, p. 196–221). This method is summarized in the next section. Although it may initially appear counterintuitive, given the reduction of information in distances relative to sequences (Penny 1982; Huson and Steel 2004), there are some potential advantages of the distance Hadamard method over the full Hadamard. Because the distance Hadamard only uses pairwise distances, both the variance and the bias are reduced when correcting for inferred multiple changes (Hendy and Charleston 1993; Charleston et al. 1994; Waddell et al. 1994; Nei 1996). The variance and the bias on distance values both increase as the number of changes between taxa increases, and the increase in the bias is faster than linear owing to a logarithmic factor used in the correction term (Tajima 1993). Obviously, the minimum observed length of a quartet must be larger than that for the pairs contained within it, and consequently, the variance and bias of the inferred length of the quartet will be larger than for either pair. However, because of the loss of information in distances (Penny 1982; Huson and Steel 2004), we test for the effect of this loss and also use the projected Hadamard method (Waddell and Hendy 1997). This uses a separate Hadamard conjugation for each of the 3 parameters under the Kimura 3ST model. The comparison of the distance and projected Hadamard approaches is thus straightforward.

Calculation of the Distance Hadamard
The Hadamard transformation requires distance values for all subsets of taxa with an even number of members: 0, 2, 4, 6, 8, ..., n. This is an extension from quartet methods (e.g., Vinh and von Haeseler 2004) that only include subsets of 4 taxa. The values for the C(n, 2) = n(n – 1)/2 pairs of taxa are standard pairwise distances and are given by the input distance matrix. The values are either observed (uncorrected or Hamming) distances calculated directly from sequences or corrected (inferred) distances. Similarly, there are C(n, 4) possible quartets of 4 taxa and each value is the minimum of the 3 combinations of pairwise distances. For example, for the quartet q = {i, j, k, l}, the entry is min{d(i, j) + d(k, l), d(i, k) + d(j, l), d(i, l) + d(j, k)}, where d(x, y) is the pairwise distance between taxa x and y. Again, the quartet values are from observed values or from inferred distances. For all larger subsets having an even number m of sequences, the distance is determined by finding the combination of taxon pairs from this subset having minimum total distance. In practice, it suffices to examine the sums of the distance values for each pair and the remaining m – 2 taxa, which have already been calculated.
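The recursion described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' software: the function name `generalized_distances` and the list-of-lists input format are assumptions made here. For each even-sized subset, the first taxon is paired with every other member in turn and the value already computed for the remaining m – 2 taxa is reused.

```python
from functools import lru_cache
from itertools import combinations


def generalized_distances(d):
    """Return {frozenset_of_taxa: value} for every even-sized subset.

    d is a symmetric pairwise distance matrix (list of lists); taxa are
    numbered 0..n-1. The names and input format are hypothetical.
    """
    n = len(d)

    @lru_cache(maxsize=None)
    def min_pairing(subset):
        # subset: sorted tuple of taxon indices with an even length
        if not subset:
            return 0.0
        first, rest = subset[0], subset[1:]
        # Pair the first taxon with each possible partner and add the
        # (cached) minimum for the remaining m - 2 taxa.
        return min(
            d[first][partner]
            + min_pairing(tuple(t for t in rest if t != partner))
            for partner in rest
        )

    result = {}
    for size in range(0, n + 1, 2):
        for subset in combinations(range(n), size):
            result[frozenset(subset)] = min_pairing(subset)
    return result


# Four taxa on the tree ((0,1),(2,3)): external edges 1, internal edge 2.
d = [[0, 2, 4, 4],
     [2, 0, 4, 4],
     [4, 4, 0, 2],
     [4, 4, 2, 0]]
gd = generalized_distances(d)
print(gd[frozenset({0, 1, 2, 3})])  # quartet entry: min(2+2, 4+4, 4+4)
```

For a quartet this reduces to the minimum of the 3 pairings given in the text; caching the smaller subsets is what keeps the larger subsets cheap.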

Treeness Triangle
The TT uses splits, subdivisions of a set of n taxa into 2 disjoint subsets, each of which may correspond to an edge in a tree. In general, for n taxa there are 2^(n–1) splits including the null split. The analysis was carried out using software based on SpectroNet (Huber et al. 2002). The programs are available from http://awcmee.massey.ac.nz/downloads.htm. The main operations are indicated in supplementary figure S1 (Supplementary Material online). Nucleotide or RY-coded sequences can be translated directly into the frequency of observed splits (s vector), or, using a program such as PAUP* (Swofford 2001) or the freely available PHYLIP (Felsenstein 2004), nucleotide, RY-coded or protein sequences can be converted to either observed or corrected pairwise distances. Pairwise distance values can be expanded into full generalized distances (respectively, r for observed and ρ for inferred/corrected), which have values for all subsets with an even number of taxa (Hendy and Penny 1993; Penny et al. 1993; Hendy et al. 1994) via the distance Hadamard. The values in the s, r, ρ, and γ vectors (supplementary fig. S1, Supplementary Material online) are interconvertible by the Hadamard conjugation. Subsets of entries from either the s or γ vector can be selected, for example, those with values greater than zero. A network (Huber et al. 2002), Lento plot (Lento et al. 1995), or TT (fig. 1A) can then be drawn. The Lento plot (fig. 1B) and TT both require a tree for their calculation. In the current implementation, the tree is obtained by the closest tree method (Hendy 1991) using a standard branch and bound search (Penny and Hendy 1987), although we emphasize that the TT can be used with any methods for producing both a set of splits and a tree from a data set. For example, when working with distance data, the weakly compatible set of splits output by SplitsTree 4 (Huson and Bryant 2006) could be used as an alternative to splits generated via the distance Hadamard, and a minimum evolution algorithm (Rzhetsky and Nei 1992) could be used to generate a tree.
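As a rough illustration of the Hadamard conjugation that links the s and γ vectors, the following Python sketch handles the simplest case only: 2-state characters under a symmetric model, where γ = H^-1 ln(H s) and H is the Sylvester Hadamard matrix of order 2^(n–1). The function names, the bitmask indexing of splits (index 0 is the null split), and the error handling are assumptions made for this example; it is not the SpectroNet-based software, and it does not cover the distance or projected variants.

```python
import math


def fwht(v):
    """Multiply v by the Sylvester Hadamard matrix (len(v) a power of 2)."""
    v = list(v)
    h = 1
    while h < len(v):
        for start in range(0, len(v), 2 * h):
            for i in range(start, start + h):
                a, b = v[i], v[i + h]
                v[i], v[i + h] = a + b, a - b
        h *= 2
    return v


def hadamard_conjugation(s):
    """Return gamma = H^-1 ln(H s) for a 2-state split-frequency vector s.

    s[k] is the observed frequency of the split whose bitmask is k
    (hypothetical indexing); the entries of s sum to 1.
    """
    hs = fwht(s)
    if min(hs) <= 0:
        # Saturated or noisy data can make the log undefined.
        raise ValueError("non-positive argument to log")
    log_hs = [math.log(x) for x in hs]
    # H^-1 = H / len(s) for a Sylvester Hadamard matrix.
    return [x / len(s) for x in fwht(log_hs)]


# Two taxa, 10% of sites differ: gamma[1] equals -0.5 * ln(1 - 2 * 0.1),
# the usual 2-state correction for multiple changes.
gamma = hadamard_conjugation([0.9, 0.1])
print(gamma[1])
```

The explicit check before taking logarithms mirrors the situation noted for figure 2, where some split vectors could not be analyzed because of negative arguments to the log function.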


 
FIG. 1. A TT (A) and a Lento plot (B). The sum of all signals in the data (γ values after a Hadamard transform) is normalized to unity. The 3 values calculated are the proportion of signal on the external and internal edges (branches) of the optimal tree and the sum of the remaining (residual) signals. The upper vertex of the triangle (E) is from a star tree where there are no internal edges in the data, the lower left (I) where the data fits entirely onto the internal edges of the tree, and the lower right (R, residuals) where all signals are of equal value (and thus the data does not represent a tree). The values plotted are 0.33, 0.53, and 0.14 for internal, external, and residuals, respectively. In principle, any tree can be used for the plot but the "closest tree" was used here because of its speed of computation and reasonable statistical properties (Hendy 1991). The Lento plot (B) shown is for the atpB gene of the fp data set. The values above the axis are the values of the signals for a split (edge or branch of a tree) and the values below the axis are the normalized sum of the values of other splits that are incompatible with that split. The bars have been shaded dark, medium, and white for splits belonging to the E, I, and R classes, respectively. Note that the abscissa scale differs by a factor of 2 above and below the origin. The species, in order, are Arabidopsis thaliana, Oenothera elata, Lotus corniculatus, Atropa belladonna, Nicotiana tabacum, Panax ginseng, Spinacia oleracea, Amborella trichopoda, Nymphaea alba, Calycanthus floridus, Zea mays, and Oryza sativa (see table 1).

 
Of course, if the model used to build the tree is incorrect (model misspecification) or if there is insufficient data (sampling error), it is possible that the tree used as input to the TT does not match the (usually unknown) true tree. In the case of the closest tree algorithm used in this paper, the tree recovered corresponds roughly to the tree that gives the best possible treeness values, in the sense of minimising the R component. Thus, the treeness components computed for a data set will be optimistic when a tree different from the true tree better explains the data. This does not invalidate the outcome of a TT analysis: the TT faithfully evaluates the tree likeness of a data set "with respect to a tree-building method of the user's choice." Although both sampling error and model misspecification can be tested for (e.g., using bootstrapping and the absolute and relative tests of goodness of fit described in the introduction, respectively), this is probably not justified when using the TT simply as an exploratory data analysis tool.

In the TT, the values of all signals in the data sum to unity and the proportion of signal on the external (E) and internal (I) edges (branches) of the optimal tree are indicated by the first 2 of the 3 entries indicated at the 3 apices of the triangle. The sum of the residual signals (R) is the third entry. Given a set of splits and a tree as input, these 3 values are computed by classifying each split in the split set as an external edge of the tree, an internal edge of the tree, or absent from the tree and adding the split's weight to the corresponding total: E, I, or R, respectively. The final step is normalization so that the total E + I + R equals 1. The upper apex represents the star tree where there are no internal edges in the data, the lower left where the data fits entirely onto internal edges of the tree, and the lower right where all signals are of equal value (there is no support for any particular tree). For a specified tree, the 3 classes of values (E, I, and R) are summed as described above, normally as the γ values, and these 3 coordinates are plotted within the TT as illustrated in figure 1A. This summarizes 3 signals in just one point, in contrast to a Lento plot (shown in fig. 1B for the fp data set). In further contrast to a Lento plot, the TT can summarize a large number of comparisons in a single graph (see figs. 2 and 3). For data that perfectly fit a tree, all points would have an R component of 0 and hence would lie on the line connecting the E and I apices.
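The classification and normalization steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: splits are represented as frozensets holding one side of the bipartition, taxa are numbered 0..n-1, only positive weights are kept (the text mentions selecting entries greater than zero), and the names `canonical`, `treeness_components`, and `ternary_xy` are hypothetical. The plotting helper places E at the top, I at the lower left, and R at the lower right, following the layout described for figure 1A.

```python
import math


def canonical(side, n_taxa):
    """Represent a split by the side that does not contain taxon 0."""
    side = frozenset(side)
    return frozenset(range(n_taxa)) - side if 0 in side else side


def treeness_components(split_weights, tree_splits, n_taxa):
    """Sum split weights into E, I and R and normalize them to total 1.

    split_weights: {one side of a split (iterable of taxa): weight}
    tree_splits:   internal splits of the chosen tree (one side each)
    """
    tree = {canonical(s, n_taxa) for s in tree_splits}
    totals = {"E": 0.0, "I": 0.0, "R": 0.0}
    for side, weight in split_weights.items():
        key = canonical(side, n_taxa)
        smaller = min(len(key), n_taxa - len(key))
        if smaller == 0 or weight <= 0:
            continue  # skip the null split and non-positive signals
        if smaller == 1:
            totals["E"] += weight  # single taxon vs. rest: external edge
        elif key in tree:
            totals["I"] += weight  # internal edge of the chosen tree
        else:
            totals["R"] += weight  # split absent from the tree
    total = sum(totals.values())
    return {k: v / total for k, v in totals.items()}


def ternary_xy(e, i, r):
    """Map (E, I, R) with E + I + R = 1 to 2D plot coordinates."""
    return (r + e / 2.0, e * math.sqrt(3) / 2.0)


weights = {
    frozenset({0}): 2, frozenset({1}): 1, frozenset({2}): 1,
    frozenset({3}): 1, frozenset({0, 1}): 3, frozenset({0, 2}): 2,
}
tt = treeness_components(weights, [{0, 1}], n_taxa=4)
print(tt, ternary_xy(tt["E"], tt["I"], tt["R"]))
```

Because every tree on the full taxon set contains all single-taxon splits, the sketch assigns those to E without consulting the tree; only splits with at least 2 taxa on each side need to be looked up.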


 
FIG. 3. Basic results for the 4 chloroplast data sets, with the distance (panels A and D) and projected (panels B and E) Hadamard methods. Each of 35 genes from the 4 data sets (fp, red; lp, orange; gp, green; and pl, blue) is used, together with randomized (shuffled) columns for all 4 data sets in gray. In A and B, each dot represents a separate gene. In C and F, 6 genes are identified and the arrows indicate the change in TT value in going through the fp, lp, gp, and pl data sets (fp -> lp -> gp -> pl). There are 3 genes with >1,500 bp (atpB, psbB, and rbcL), 2 with >500 bp (psbA and petB), and one with <150 bp (psaJ). As expected, there is a decrease in signal for the internal branches on moving from the fp to the pl data set. For the pl data set (blue), there is apparently little phylogenetic signal at all on the internal branches (however, see the concatenated data set in fig. 4). This analysis shows that much more signal is retained in the projected Hadamard than in the distance Hadamard, but much of that additional signal does not fit onto the optimal tree. G is the residual component of the TT plotted against sequence length for all genes and taxa sets. D and E are equivalent to 3A and 3B but are the result of using the global optimum tree (concatenated) instead of the closest trees calculated on each data set individually. As expected, the residuals are marginally larger in 3D and E.

 

    Results
 
Simulated Data: TT Using the True Model
We first analyzed simulated data in order to examine the extent of the sampling error in the residuals when the model (tree plus mechanism of nucleotide change) is correct. Figure 2 shows results for simulated nucleotide data sets with 12 taxa and increasing numbers of sites. For 100, 200, 400, 800, 1600, 3200, and 6400 sites, each data set was analyzed with both the distance and projected Hadamard methods. Although in this case the true tree for each data set was available, the trees used for TT analysis were recovered from the simulated data using the closest tree algorithm as is usual for real data sets, allowing for the (realistic) possibility of recovering an incorrect tree. Additional data sets, created by shuffling the data within each alignment column, were used in order to measure the effect of complete information loss. For clarity, figure 2 only shows the results for data sets with 100, 400, 1600, and 6400 sites, and the shuffled data sets of 6400 nt. Figures 2A–C show the results for the distance Hadamard and figures 2D–F the projected Hadamard.
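The column-shuffling control is simple to reproduce. The sketch below is one possible way to do it in Python and is not taken from the authors' programs; the function name `shuffle_columns`, the list-of-strings alignment format, and the optional `seed` argument are assumptions. Each column is permuted independently among the taxa, so per-site composition is preserved while any association between sites and the tree is destroyed.

```python
import random


def shuffle_columns(alignment, seed=None):
    """Shuffle the characters within every alignment column independently.

    alignment: list of equal-length sequence strings, one per taxon.
    Returns a new list of strings with the same column compositions.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*alignment)]
    for col in columns:
        rng.shuffle(col)  # permute this site among the taxa
    return ["".join(row) for row in zip(*columns)]


aln = ["ACGT", "AGGT", "TCGA"]
print(shuffle_columns(aln, seed=1))
```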

For figure 2A in particular, the points representing the same sequence lengths cluster into bands, each point indexed by 3 signal types (I, E, and R). As expected, with increasing sequence length the points approach the IE (R = 0) line. Because this data is simulated on a tree and uses the same mechanism to simulate the data and recover the tree, sampling error is the only significant factor contributing to the residual component. These results can be compared later with real data where model misspecification may be significant. As expected (see Waddell et al. 1994), the residuals (R axis) decrease in inverse proportion to the sequence length. In contrast, there is a faster-than-linear increase in the residuals component as the rate of change increases. This trend is shown in row 1 (figs. 2A–C) and row 2 (figs. 2D–F), where the diagrams on the left, center, and right correspond to expected numbers of substitutions per site of 0.2, 0.4, and 0.6, respectively.

The spread of points along the I–E line occurs because each point is from a different random tree; the spread does not vary noticeably between data set sizes. The projected Hadamard (figs. 2D–F) retains more information from the original data and hence carries a larger residual component. This is seen by comparing each TT plot in figures 2D–F with the one immediately above it. This means that the distance Hadamard is still underestimating the full values of the residual component. With the projected Hadamard, there is still a significant residual component with 6400 sites for the highest rates of nucleotide change (fig. 2F).

Real Data
We first checked for each gene whether the parameters for the gamma distribution (Γ) of rates across sites and the proportion of variable sites (Pvar) were reasonably constant in the 4 subsets of taxa (supplementary table S2, Supplementary Material online). To conform to the mathematical assumptions, this constancy of gamma and Pvar should hold when going from the fp to the lp, gp, and pl data sets. Although the estimates for the gamma shape parameter vary considerably from ~0.3 to ∞ across genes and data sets, there is no clear trend with increasing divergence of the taxa. (Note that ∞ is a valid gamma value, indicating that all sites are evolving at an equal rate.) As expected, the proportion of invariant sites is significantly higher in the fp than the other 3 taxa sets. This may be a bias in estimation from having more constant sites in the fp alignments than in the 3 more divergent taxa sets. Nevertheless, the decrease in constant sites is consistent with the prediction of a relaxed covarion model that additional sites will become variable for deeper comparisons (Gaucher et al. 2002; Lockhart et al. 2006). In the 3 most divergent taxa sets (lp, gp, and pl), there is a significant positive correlation between the gamma shape parameter and the proportion of invariant sites (correlation coefficients of 0.46, 0.44, and 0.48, respectively). In other words, in models where more sites are classed as invariant, rates are close to being equal across the variable sites and in models where there are few invariant sites the rate distribution is more skewed.

Figures 3A and B are TT plots with points for each of 35 genes, calculated under the distance Hadamard (fig. 3A) and the projected Hadamard (fig. 3B). The points colored red, orange, green, and blue correspond to the fp, lp, gp, and pl subsets of taxa. The gray points are shuffled versions of each data set. For the distance Hadamard, there is a strong tendency, as expected, for the points to move closer toward the E (external branches) apex with older divergences (fp -> lp -> gp -> pl). For the projected Hadamard, there is a similar tendency to move toward the E apex with increasing divergence. Compared with the distance Hadamard, the projected Hadamard yields TT points with a much larger residual value.

With progressively deeper geological divergence times from ~200 Ma to ~1.2 Ga, the points in figure 4A migrate toward apex E, but there is no apparent shift toward the R apex, as might have been expected for random data. To understand this effect, consider the following. As sequence length tends to infinity, we expect shuffling by columns to produce homogeneous genetic distances between all pairs of taxa. In other words, all entries in the resulting distance matrix, apart from the diagonal, would become equal to some constant d, whose value is determined by the number and nature of sequence differences in the data. Such distances can be represented exactly on a star tree with each external edge of length d/2; this corresponds to the upper point (the E apex) in the TT. However, for shuffled alignments of finite length, the values in the distance matrix only approximate d; hence, we see some signal mapped to internal edges and some, typically a larger component, mapped to residuals. Notably, genes for the pl data set, spanning more than 1.2 Ga, map to points in the same region of the plot as the shuffled (randomized) data. The contrast with figure 3B is instructive: with the projected Hadamard there is a stronger movement toward the R apex. Figures 3C and F track 6 individual genes, with arrows in the direction of increasingly divergent taxa sets.
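The star-tree argument above can be checked numerically. The following is a minimal sketch, not the procedure used in this study: it assumes that "shuffling by columns" means permuting the characters within each alignment column independently across taxa, it uses uncorrected p distances, and the toy four-taxon alignment and all function names are hypothetical.

```python
import random
from itertools import combinations
from statistics import pstdev

def shuffle_columns(alignment, rng):
    """Permute the characters within each column independently across taxa."""
    columns = [list(col) for col in zip(*alignment)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(row) for row in zip(*columns)]

def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def pairwise_distances(alignment):
    """Uncorrected distances for every pair of sequences."""
    return [p_distance(a, b) for a, b in combinations(alignment, 2)]

def toy_alignment(n_sites, rng):
    """Hypothetical test data: four taxa forming two closely related pairs."""
    def mutate(seq, rate):
        return "".join(rng.choice("ACGT") if rng.random() < rate else c
                       for c in seq)
    root = "".join(rng.choice("ACGT") for _ in range(n_sites))
    left, right = mutate(root, 0.4), mutate(root, 0.4)
    return [mutate(left, 0.05), mutate(left, 0.05),
            mutate(right, 0.05), mutate(right, 0.05)]

rng = random.Random(0)
original = toy_alignment(5000, rng)
shuffled = shuffle_columns(original, rng)

# Spread of the off-diagonal distances before and after shuffling.
spread_original = pstdev(pairwise_distances(original))
spread_shuffled = pstdev(pairwise_distances(shuffled))
print(f"spread before shuffling: {spread_original:.4f}")
print(f"spread after shuffling:  {spread_shuffled:.4f}")
```

Under these assumptions the shuffled distances cluster around a single value d, while their mean is unchanged because each column keeps the same set of characters. The remaining spread is the finite-length effect that the text attributes to internal edges and residuals.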


FIG. 4.— Results from concatenated genes for each of the 4 data sets (fp, red; lp, orange; gp, green; and pl, blue). Filled circles are used for distance (GTR) calculations, and crosses are used for projected Hadamard calculations on the concatenated data sets. For data sets with older divergences, the values again show a progression toward the E axis (longer external and shorter internal edges). Nevertheless, the points are much closer to the IE (Internal–External) axis than with the individual genes (fig. 3). This is a positive result and implies that to some extent the nonphylogenetic signals observed with individual genes cancel out. In contrast, the crosses indicate that in comparison to the distance Hadamard, the projected Hadamard retains considerably more of the information in the original data, but little of the information retained corresponds to the closest tree.

 
In figure 3A, it appears that for individual genes most phylogenetic signal, corresponding to internal edges of the optimal tree, is lost in the oldest data sets comprising gp and algae, especially pl. Indeed, for many genes the residual component (the distance from the IE axis) is larger than the signal for the internal branches of the optimal tree (distance from the ER axis). In general, the TT reveals that for each gene taken individually, most of the phylogenetic signal for deep divergences has been lost. Figure 3G shows the relationship between sequence length and the residual component of the signals. As expected from the simulation results (fig. 2), the residuals component is generally smaller for longer genes.

For concatenated genes, however, the results for the distance Hadamard for each of the 4 data sets (filled circles in fig. 4) indicate a substantial component of signal that maps to internal edges. The expected migration of points toward the E (external) apex with increasing evolutionary time is observed in the transition from the fp -> lp -> gp -> pl data sets. Again, for each of the corresponding data sets shuffled by columns, virtually all signal on internal edges of the tree is lost. However, it is most striking that in the concatenated data the points lie close to the internal–external (IE) axis, meaning that the signal for the residual axis is both quite small and spread over many possible alternative signals. This important observation suggests that the high residual signal for individual genes differs across genes. Put another way, the residual signal from the individual genes could stem both from sampling effects of gene length and from lineage-specific differences in functional constraints (which might average out).

Another area where the TT allows easy comparison is across different treatments of the same data set. In figure 5, we show the effect of different distance corrections on the position of the points within the TT for the genes atpF (fig. 5A) and petD (fig. 5B); the results for the other genes are in supplementary material online. The 4 distance methods used are uncorrected p distances, filled diamond; Tamura–Nei corrected distances, open diamond; LogDet distances, closed circle; and GTR maximum likelihood distances, open circle. All genes, except for rbcL (see supplementary material online), show the fp, lp, gp, and pl progression. It is interesting that going from the uncorrected distances to any form of correction tends to increase the values of both the internal (I) and residual (R) components. This indicates that the uncorrected data underestimate the internal branches of the tree but simultaneously that the signal not conforming to an optimal tree is amplified when distances are corrected for multiple changes.
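To give a feel for what a correction for multiple changes does to a distance, the sketch below applies the Jukes–Cantor (JC69) formula to uncorrected p distances. JC69 is a simpler stand-in chosen for illustration only; it is not one of the corrections used here (Tamura–Nei, LogDet, and GTR maximum likelihood), and the function name is hypothetical.

```python
import math

def jukes_cantor(p):
    """JC69 correction of an uncorrected p distance (defined for 0 <= p < 0.75)."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

# The correction stretches larger distances proportionally more than small ones.
for p in (0.05, 0.2, 0.4, 0.6):
    d = jukes_cantor(p)
    print(f"p = {p:.2f}  corrected = {d:.3f}  ratio = {d / p:.2f}")
```

Because the stretch grows with p, any correction of this kind rescales the larger entries of the distance matrix more strongly than the smaller ones, which is one plausible way for both tree-supporting and conflicting signals to be amplified together.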


FIG. 5.— Effect of different corrections on the fit between model and data. The distance Hadamard applied under different corrections for inferred multiple changes. The 4 distance matrices used are: filled diamond, uncorrected p distances; open diamond, Tamura–Nei corrected distances; filled circle, LogDet distances; and open circle, maximum likelihood distances. A and B are for the proteins atpF and petD, respectively. In general, the more complex the optimal model, the better the fit between the data and the tree.

 

    Discussion
The tendency in current phylogenetic practice is to focus attention on those aspects of a given data set that map onto a particular tree. But the issue of how well a bifurcating tree actually describes the observed properties of the data in question is at least as important. What can we really assume safely about sequence evolution? For any given "individual gene," it can probably be safely assumed that all sequences that we observe in nature are in fact related by a series of treelike lineage splits that correspond to a recurrent process of DNA duplication and mutational accumulation; the only readily imaginable exceptions to such a rule would entail intragenic recombination or gene conversion among sequence variants possessing fixed differences. If we neglect the latter 2 mechanisms, then our default assumption would be that gene and protein sequences are related by processes that in mathematical terms are well described by trees.

The issues become distinctly more problematic if we add the further assumption that "all" gene sequences in a particular given chromosome are related by one and the same tree. This assumption is inherent to the concept that there is a single tree of life by which all things are related and that all we need to do is to identify its topology. But many evolutionary mechanisms that affect the evolution of genes are known, which are fundamentally not depictable as strictly bifurcating trees. The 4 most prominent and mechanistically best understood examples of non–treelike evolutionary processes include 1) hybridization events, as are common among flowering plants (fps); 2) gene transfers from organelle genomes in the endosymbiotic origin of organelles; 3) lineage sorting, which occurs when gene trees differ from species trees because of coalescence events occurring in a different order than speciation events (as described in, e.g., Rosenberg 2002; Degnan and Salter 2005); and 4) lateral gene transfer among prokaryotic chromosomes, as mediated by 4a) transduction via phages, 4b) transformation in the case of naturally competent bacteria such as Haemophilus influenzae, and 4c) conjugation via plasmids, as any hospital that uses antibiotics can attest. Treelike or not, all these processes do fit within the more general mechanism of descent with modification.

In the age of genomes and phylogenomics, where gene trees are produced on an industrial scale, we often find discrepancies between trees produced for a collection of genes within a particular set of chromosomes. It has become quite popular to infer a prevalence of lateral gene transfer or other non–treelike biological process as the cause of such differences. However, from the mathematical standpoint the issue might more readily be formulated as, "How likely is it that we will infer the same tree, or even similar trees, for 2 genes from the same set of organisms even if we know exactly how molecules are evolving?" Even when the true tree and the true model of sequence evolution are known, as in computer simulated data, it is very difficult to infer the true tree for moderately diverged genes (Nei 1996; Penny et al. 2001), and only with such "perfect" data can we begin to feel how well or how poorly methods of phylogenetic inference actually perform with distantly related taxa. If our goal is to learn something about the evolutionary past from gene sequence data, we need to better understand the relationship between the data that we observe and the trees that are inferred from them. That means that there is a need to understand not only the site patterns that will fit onto a binary tree, but also those that will not (i.e., conflicting data). Networks, Lento plots, and TT plots are steps in that direction.

Here, we investigated both simulated data and real sequence data from chloroplast genomes. The reason for investigating the latter stems from the circumstance that, with the exception of rbcL (which has long been known to exhibit paralogy across the red algal-green algal boundary; Martin et al. 1992), there is every reason to assume that the sequences of proteins encoded in chloroplast genomes are all related by the same historical process of evolutionary bifurcations. This is because there are no known cases of gene families within chloroplast genomes, no duplicate copies of chloroplast genes (with the exception of those encoded in the inverted repeat, whose sequences are identical), and no known examples of gene replacement via lateral acquisitions (leaving rbcL aside). Therefore, for a given taxon sample, all chloroplast-encoded proteins should, in principle, produce the same tree in phylogenetic inference. The observation is, however, that they produce different trees, sometimes with very high BPs (Goremykin et al. 1997; Martin et al. 1998; Lockhart et al. 2000; Vogl et al. 2003). The reasons underlying the inability of current molecular phylogenetic methods to extract the same tree for different chloroplast proteins (or any other protein set where paralogy or lateral acquisitions can be reasonably excluded a priori) need clarification, if progress is to be made in understanding deeper evolutionary history. The problem of distinguishing between historical and other types of signal in molecular data is hard and becomes increasingly severe for deep divergence times.

The projected Hadamard (Waddell and Hendy 1997) uncovers more conflict than the distance Hadamard. There is still the option of exploring the full Hadamard on 4-state characters. However, this requires a vector with $4^{n-1}$ entries, rather than $2^{n-1}$ for the distance Hadamard and $3 \times 2^{n-1}$ for the projected. The number of signals in the residual component is large. For n taxa there are $2^{n-1}$ possible splits, n of which correspond to external branches of the tree and $n - 3$ to internal branches; thus (omitting also the null split) there are $2^{n-1} - n - (n - 3) - 1 = 2^{n-1} - 2n + 2$ residual splits, that is, 2026 for n = 12 taxa. In principle, both the mean and standard deviation of the support for any particular split can be calculated for the Hadamard (Waddell et al. 1994). In practice, the large number of signals means that the variance of the splits will be relatively high, and this will contribute to the higher residual values for the projected Hadamard versus the distance Hadamard.
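The counts in the preceding paragraph can be tallied directly. This is a small worked sketch that assumes the vector lengths are the powers 2^(n-1), 3 x 2^(n-1), and 4^(n-1), and that the residual splits are whatever remains after removing the n external splits, the n - 3 internal splits of a binary tree, and the null split; the function names are hypothetical.

```python
def spectrum_sizes(n):
    """Vector lengths needed by each Hadamard variant for n taxa."""
    return {
        "distance": 2 ** (n - 1),
        "projected": 3 * 2 ** (n - 1),
        "full_4_state": 4 ** (n - 1),
    }

def residual_split_count(n):
    """Splits left over once the tree's own splits and the null split are removed."""
    total = 2 ** (n - 1)   # all splits of n taxa, counting the null split
    external = n           # one external branch per taxon
    internal = n - 3       # internal branches of a fully resolved tree
    return total - external - internal - 1

n = 12
print(spectrum_sizes(n))
print(residual_split_count(n))
```

Counted this way, the residual component for 12 taxa pools roughly two thousand candidate splits, against only 9 internal branches in the optimal tree.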

In this paper, we have generated TT points with respect to the closest tree, although the method could be used more generally. For example, to compare the effect of different distance corrections on the weakly compatible splits systems produced by split decomposition (Bandelt and Dress 1992b), one could define a triangle point by summing up the weights of trivial splits (of the form A|B, where either |A| or |B| = 1) and assigning it to the E (external) corner, summing up the weights of the nontrivial splits and assigning it to the I (internal) corner, and assigning the split-prime residue to the R (residual) corner. When it comes to depicting signal conflicts, TT is complementary to both Lento plots and networks; Lento plots show all the conflicting signals and networks show the most important conflicting signals. A TT point shows how much conflicting signal there is, without identifying the signals, making it easy to compare the amount of signal across different data sets and treatments. TTs reveal that the vast majority of all phylogenetic signals observed in the real chloroplast data (or in simulated data) conflict with the optimal tree, rather than support it, even for comparatively short divergence times corresponding to less than about 200 Myr. Using the projected Hadamard, the difference between the shuffled and unshuffled pl data set was small. This warrants caution with regard to interpreting trees for deeper divergences.
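The split-based triangle point described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the figures: splits are represented by one side as a frozenset of taxon labels, the three sums are normalized to add to 1, and the placement of the I and R corners at the lower left and lower right is a guess (only the E apex is described as the upper point). The names `tt_point` and `ternary_to_xy` and the example weights are hypothetical.

```python
import math

def tt_point(split_weights, n_taxa, residue):
    """Return normalized (E, I, R) from split weights and a residue.

    split_weights maps one side of each split (a frozenset of taxa) to its weight.
    A split is trivial when one of its sides contains a single taxon.
    """
    external = sum(w for side, w in split_weights.items()
                   if len(side) in (1, n_taxa - 1))
    internal = sum(w for side, w in split_weights.items()
                   if 1 < len(side) < n_taxa - 1)
    total = external + internal + residue
    if total <= 0:
        raise ValueError("total signal must be positive")
    return external / total, internal / total, residue / total

def ternary_to_xy(e, i, r):
    """Map (E, I, R) to the plane with I at (0, 0), R at (1, 0), E at the top."""
    return r + 0.5 * e, (math.sqrt(3) / 2) * e

# Hypothetical weights for 5 taxa: five trivial splits and two nontrivial ones.
weights = {
    frozenset("a"): 0.25, frozenset("b"): 0.25, frozenset("c"): 0.25,
    frozenset("d"): 0.125, frozenset("e"): 0.125,
    frozenset("ab"): 0.25, frozenset("abc"): 0.25,
}
e, i, r = tt_point(weights, n_taxa=5, residue=0.5)
print((e, i, r), ternary_to_xy(e, i, r))
```

Normalizing the three sums discards the overall scale of the distances, so points computed under different corrections or for different genes remain comparable within one triangle.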


    Supplementary Material
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
This work was supported by the New Zealand Marsden Fund. We thank Bill Martin for considerable input for the design of the project at the SMBE meetings at UC Irvine, for the first chloroplast dataset used during development, and for ongoing discussions.


    Footnotes
 
1 These authors contributed equally to this work.

Jianzhi Zhang, Associate Editor


    References

    Bandelt H-J, Dress AWM. A canonical decomposition theory for metrics on a finite set. Adv Math. (1992a) 92:47–105.

    Bandelt H-J, Dress AWM. Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Mol Phylogenet Evol. (1992b) 1:242–252.

    Bollback JP. Bayesian model adequacy and choice in phylogenetics. Mol Biol Evol. (2002) 19:1171–1180.

    Bryant D, Moulton V. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol Biol Evol. (2004) 21:255–265.

    Buckley TR. Model misspecification and probabilistic tests of topology: evidence from empirical data sets. Syst Biol. (2002) 51:509–523.

    Butterfield NJ. Bangiomorpha pubescens n. gen. n. sp.: implications for the evolution of sex, multicellularity, and the Mesoproterozoic/Neoproterozoic radiation of eukaryotes. Paleobiology. (2000) 26:386–404.

    Charleston MA, Hendy MD, Penny D. The effects of sequence length, tree topology and number of taxa on the performance of phylogenetic methods. J Comp Biol. (1994) 1:133–151.

    Degnan JH, Salter LA. Gene tree distributions under the coalescent process. Evolution. (2005) 59:24–37.

    Eigen M, Winkler-Oswatitsch R. Transfer-RNA: the early adaptor. Naturwissenschaften. (1981) 68:217–228.

    Eigen M, Winkler-Oswatitsch R, Dress A. Statistical geometry in sequence space: a method of quantitative comparative sequence-analysis. Proc Natl Acad Sci USA. (1988) 85:5913–5917.

    Felsenstein J. Inferring phylogenies (2003) Sunderland (MA): Sinauer Associates.

    Felsenstein J. PHYLIP (phylogeny inference package). (2004) Seattle (WA): Department of Genome Sciences, University of Washington. Version 3.6b. Distributed by the author.

    Gaucher EA, Gu X, Miyamoto MM, Benner SA. Predicting functional divergence in protein evolution by site-specific rate shifts. Trends Biochem Sci. (2002) 27:315–321.

    Goldman N. Statistical tests of models of DNA substitution. J Mol Evol. (1993) 36:182–198.

    Goremykin VV, Hansmann S, Martin WF. Evolutionary analysis of 58 proteins encoded in six completely sequenced chloroplast genomes: revised molecular estimates of two seed plant divergence times. Plant Syst Evol. (1997) 206:337–351.

    Goremykin VV, Holland B, Hirsch-Ernst KI, Hellwig FH. Analysis of Acorus calamus chloroplast genome and its phylogenetic implications. Mol Biol Evol. (2005) 22:1813–1822.

    Hall TA. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser. (1999) 41:95–98.

    Hendy MD. A combinatorial description of the closest tree algorithm for finding evolutionary trees. Discrete Math. (1991) 96:51–58.

    Hendy MD, Penny D. Spectral analysis of phylogenetic data. J Classif. (1993) 10:5–24.

    Hendy MD, Charleston MA. Hadamard conjugation: a versatile tool for modeling nucleotide-sequence evolution. N Z J Bot. (1993) 31:231–237.

    Hendy MD, Penny D, Steel MA. A discrete Fourier analysis for evolutionary trees. Proc Natl Acad Sci USA. (1994) 91:3339–3343.

    Ho S, Jermiin L. Tracing the decay of historical signal in biological sequence data. Syst Biol. (2004) 53:623–637.

    Holland BR, Delsuc F, Moulton V. Visualizing conflicting evolutionary hypotheses in large collections of trees: using consensus networks to study the origins of placentals and hexapods. Syst Biol. (2005) 54:66–76.

    Huber KT, Langton M, Penny D, Moulton V, Hendy M. SpectroNet: a package for computing spectra and median networks. Appl Bioinformatics. (2002) 1:159–161.

    Huson DH, Bryant D. Application of phylogenetic networks in evolutionary networks. Mol Biol Evol. (2006) 23:254–267.

    Huson DH, Steel M. Distances that perfectly mislead. Syst Biol. (2004) 53:327–332.

    Jayaswal V, Jermiin LS, Robinson J. Estimation of phylogeny using a general Markov matrix. Evol Bioinform Online. (2005) 1:62–80.

    Kennedy M, Holland BR, Gray RD, Spencer HG. Untangling long branches: identifying conflicting phylogenetic signals a priori using spectral analysis, neighbor-net, and consensus networks. Syst Biol. (2005) 54:620–633.

    Lento GM, Hickson RE, Chambers GK, Penny D. Use of spectral-analysis to test hypotheses on the origin of pinnipeds. Mol Biol Evol. (1995) 12:28–52.

    Lockhart PJ, Cameron SA. Trees for bees. Trends Ecol Evol. (2001) 16:84–88.

    Lockhart PJ, Huson D, Maier U, Fraunholz MJ, Van de Peer Y, Barbrook AC, Howe CJ, Steel MA. How molecules evolve in eubacteria. Mol Biol Evol. (2000) 17:835–838.

    Lockhart PJ, Novis P, Milligan BG, Riden J, Rambaut A, Larkum T. Heterotachy and tree building: a case study with plastids and eubacteria. Mol Biol Evol. (2006) 23:40–45.

    Lockhart PJ, Penny D, Hendy MD, Howe CJ, Beanland TJ, Larkum AWD. Controversy on chloroplast origins. FEBS Lett. (1992) 301:127–131.

    Lockhart PJ, Steel MA, Hendy MD, Penny D. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol Biol Evol. (1994) 11:605–612.

    Magallon SA, Sanderson MJ. Angiosperm divergence times: the effect of genes, codon positions, and time constraints. Evolution. (2005) 59:1653–1670.

    Martin W, Somerville CC, Loiseauxdegoer S. Molecular phylogenies of plastid origins and algal evolution. J Mol Evol. (1992) 35:385–404.

    Martin W, Stoebe B, Goremykin V, Hansmann S, Hasegawa M, Kowallik KV. Gene transfer to the nucleus and the evolution of chloroplasts. Nature. (1998) 393:162–165.

    Mossel E, Steel MA. A phase transition for a random cluster model on phylogenetic trees. Math Biosci. (2004) 187:189–203.

    Mossel E, Steel M. How much can evolved characters tell us about the tree that generated them? In: Gascuel O, ed. Mathematics of evolution and phylogeny. (2005) Oxford: Oxford University Press. 384–412.

    Nannya Y, Sanada M, Nakazaki K. (11 co-authors). A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. (2005) 65:6071–6079.

    Nei M. Phylogenetic analysis in molecular evolutionary genetics. Annu Rev Genet. (1996) 30:371–403.

    Penny D. Towards a basis for classification: the incompleteness of distance measures, incompatibility analysis and phenetic classification. J Theor Biol. (1982) 96:129–142.

    Penny D, Hendy MD. Turbotree: a fast algorithm for minimal trees. Comput Appl Biosci. (1987) 3:183–187.

    Penny D, McComish BJ, Charleston MA, Hendy MD. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. J Mol Evol. (2001) 53:711–723.

    Penny D, Watson EE, Hickson RE, Lockhart PJ. Some recent progress with methods for evolutionary trees. N Z J Bot. (1993) 31:275–288.

    Phillips MJ, Delsuc F, Penny D. Genome-scale phylogeny: sampling and systematic errors are both important. Mol Biol Evol. (2004) 21:1455–1458.

    Posada D, Crandall KA. Modeltest: testing the model of DNA substitution. Bioinformatics. (1998) 14:817–818.

    Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. (1997) 13:235–238.

    Reeves JH. Heterogeneity in the substitution process of amino acid sites of proteins coded for by mitochondrial DNA. J Mol Evol. (1992) 35:17–31.

    Rosenberg NA. The probability of topological concordance of gene trees and species trees. Theor Pop Biol. (2002) 61:225–247.

    Rzhetsky A, Nei M. A simple method for estimating and testing minimum-evolution trees. Mol Biol Evol. (1992) 9:945–967.

    Steel MA, Penny D. Distributions of tree comparison metrics: some new results. Syst Biol. (1993) 42:126–141.

    Steel MA, Székely L, Hendy MD. Reconstructing trees when sequence sites evolve at variable rates. J Comput Biol. (1994) 1:153–163.

    Strimmer K, von Haeseler A. Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proc Natl Acad Sci USA. (1997) 94:6815–6819.

    Swofford DL. PAUP* phylogenetic analysis using parsimony (*and other methods). Version 4.0b8 (2001) Sunderland (MA): Sinauer Associates.

    Tajima F. Unbiased estimation of evolutionary distance between nucleotide sequences. Mol Biol Evol. (1993) 10:677–688.

    Thompson JD, Higgins DG, Gibson TJ. Clustal-W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. (1994) 22:4673–4680.

    Vinh S, von Haeseler A. IQPNNI: moving fast through tree space and stopping in time. Mol Biol Evol. (2004) 21:1565–1571.

    Vogl C, Badger J, Kearney P, Li M, Glegg M, Jiang T. Probabilistic analysis indicates discordant gene trees in chloroplast evolution. J Mol Evol. (2003) 56:330–340.

    Waddell PJ. Measuring the fit of sequence data to phylogenetic model: allowing for missing data. Mol Biol Evol. (2005) 22:395–401.

    Waddell PJ, Hendy MD. Using phylogenetic invariants to enhance spectral analysis of nucleotide sequence data. In: Information and Mathematical Sciences Reports, Series B (A. Swift, ed) (1997) Massey University: Palmerston North. [cited 2007 July 23]. [Internet]. http://awcmee.massey.ac.nz/people/mhendy/pdf/ProjectedHadamardTemp.pdf.

    Waddell PJ, Penny D, Hendy MD, Arnold GC. The sampling distributions and covariance matrix of phylogenetic spectra. Mol Biol Evol. (1994) 11:630–642.

    Zhaxybayeva O, Hamel L, Raymond J, Gogarten JP. Visualization of the phylogenetic content of five genomes using dekapentagonal maps. Genome Biol. (2004) 5:R20.

Accepted for publication June 22, 2007.