Skip Navigation


MBE Advance Access originally published online on February 14, 2008
Molecular Biology and Evolution 2008 25(5):960-971; doi:10.1093/molbev/msn043
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
25/5/960    most recent
msn043v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Seo, T.-K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Seo, T.-K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Calculating Bootstrap Probabilities of Phylogeny Using Multilocus Sequence Data

Tae-Kun Seo

Professional Programme for Agricultural Bioinformatics, University of Tokyo, 1-1-1 Yayoi Bunkyo-Ku, Tokyo 113-8657, Japan

E-mail: seo{at}iu.a.u-tokyo.ac.jp.


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Concluding Remarks
 Acknowledgements
 References
 
Phylogeny estimation is extremely crucial in the study of molecular evolution. The increase in the amount of available genomic data facilitates phylogeny estimation from multilocus sequence data. Although maximum likelihood and Bayesian methods are available for phylogeny reconstruction using multilocus sequence data, these methods require heavy computation, and their application is limited to the analysis of a moderate number of genes and taxa. Distance matrix methods present suitable alternatives for analyzing huge amounts of sequence data. However, the manner in which distance methods can be applied to multilocus sequence data remains unknown. Here, we suggest new procedures to estimate molecular phylogeny using multilocus sequence data and evaluate its significance in the framework of the distance method. We found that concatenation of the multilocus sequence data may result in incorrect phylogeny estimation with an extremely high bootstrap probability (BP), which is due to incorrect estimation of the distances and intentional ignorance of the intergene variations. Therefore, we suggest that the distance matrices for multilocus sequence data be estimated separately and these matrices be subsequently combined to reconstruct phylogeny instead of phylogeny reconstruction using concatenated sequence data. To calculate the BPs of the reconstructed phylogeny, we suggest that 2-stage bootstrap procedures be adopted; in this, genes are resampled followed by resampling of the sequence columns within the resampled genes. By resampling the genes during calculation of BPs, intergene variations are properly considered. Via simulation studies and empirical data analysis, we demonstrate that our 2-stage bootstrap procedures are more suitable than the conventional bootstrap procedure that is adopted after sequence concatenation.

Key Words: bootstrap probability • 2-stage bootstrap • distance method • phylogeny


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Concluding Remarks
 Acknowledgements
 References
 
Reconstruction of phylogeny is one of the most important issues in the study of evolution. In the past when limited genomic information was available, molecular phylogeny was typically estimated using single locus data; however, due to the increase in the amount of available genomic data, it is currently being estimated using multilocus sequences (e.g., Adachi et al. 2000Go; Cao, Fujiwara, et al. 2000Go; Cao, Sorenson, et al. 2000Go; Nikaido et al. 2003Go; Nishihara et al. 2007Go). A simple approach to deal with multilocus sequence data is concatenation, that is, merging multilocus sequences and treating the merged set as a single locus. It has been noted that separate analysis is preferred over sequence concatenation (e.g., Adachi et al. 2000Go; Cao, Fujiwara, et al. 2000Go; Cao, Sorenson, et al. 2000Go; Nikaido et al. 2003Go; Nishihara et al. 2007Go). Because different genes may have been subject to different evolutionary processes, it would be appropriate to investigate their individual evolutionary histories. Evolutionary histories are typically summarized using parameters such as tree topology, nucleotide frequencies, branch lengths of phylogeny, etc. Sequence concatenation does not reflect the natural data partitioning; hence, the evolutionary parameters estimated from the concatenated sequences may not adequately represent the evolutionary implications of individual data partitions. Another problem concerning sequence concatenation occurs during the bootstrap procedure for measuring the uncertainty in the estimated phylogeny. The conventional bootstrap procedures (e.g., Felsenstein 1985Go; Kishino and Hasegawa 1989Go; Shimodaira and Hasegawa 1999Go) assume that the sequence columns are independently and identically distributed (i.i.d.). With this assumption, pseudodata are generated by resampling of the sequence columns from the concatenated sequences. However, it is inappropriate to assume an i.i.d. for all the concatenated sequence columns because different genes have been subject to different evolutionary processes.

Recently, Seo et al. (2005)Go suggested new procedures for determining and testing optimal tree topology using multilocus sequence data in the framework of the likelihood method. Instead of "combining (i.e., concatenating) sequence data," they suggested "combining the log-likelihood scores" obtained from the separate analysis. The approach of combining log-likelihood scores is conventional and has been adopted in many analyses (e.g., Adachi et al. 2000Go; Cao, Fujiwara, et al. 2000Go; Cao, Sorenson, et al. 2000Go; Nikaido et al. 2003Go; Nishihara et al. 2007Go). Calculation of phylogenetic uncertainty by conventional separate analysis involves 1) merging the pools of log-likelihood scores and applying the bootstrap procedure (Kishino and Hasegawa 1989Go; Shimodaira and Hasegawa 1999Go) or 2) stratifying the pools of log-likelihood scores and applying the bootstrap procedure within the strata (Yoder and Yang 2000Go). The latter approach is referred to as "stratified sampling." In contrast to the conventional bootstrap procedures after separate analysis, Seo et al. (2005)Go suggested 2-stage bootstrap procedures in which genes are resampled and the sequence columns are subsequently resampled from within the resampled genes. By resampling the genes, the intergene variations are appropriately considered while testing for the significance of tree topologies.

Although the bootstrap procedures for likelihood methods that were proposed by Seo et al. (2005)Go compare the competing tree topologies as a whole, they cannot quantify the significance of subclades within the phylogeny. Typically, the significance of subclades is represented by bootstrap probabilities (BPs) that are determined by parsimony and distance methods (Felsenstein 1985Go). However, the manner in which parsimony and distance methods can be applied to multilocus sequence data without sequence concatenation remains unknown. Bayesian approaches can be applied to multilocus sequence data without sequence concatenation (Rannala and Yang 1996Go; Ronquist and Huelsenbeck 2003Go; Edwards et al. 2007Go). However, Bayesian methods generally require considerable computation, and their application is limited to a moderate number of taxa and sequences.

In our study, we suggest new distance methods for reconstructing the phylogeny and calculating the BPs using multilocus sequence data. By a simulation study, we demonstrate the advantage of separate analysis over sequence concatenation and the importance of considering intergene variations while measuring the BPs for the subclades. Further, we discuss the applicability of our procedures with the analysis of mammalian sequence data.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Concluding Remarks
 Acknowledgements
 References
 
In this section, we begin by explaining the potential problem associated with sequence concatenation and the origin of the problem with reference to the framework of the distance method. Subsequently, we discuss 1) how to combine the individual results obtained using multilocus sequence data in order to determine optimal tree topology and 2) how to calculate BPs.

Potential Problem Associated with Sequence Concatenation
Recently, it was noted that heterotachy can result in incorrect tree estimation under the parsimony and likelihood framework (Kolaczkowski and Thornton 2004Go). When a single locus data set comprises multiple partitions having identical tree topologies but different branch lengths, the tree reconstructed using the entire single locus data set may be inconsistent with that of the individual partitions. The mechanism by which the problem of heterotachy occurs in single locus data set straightforwardly applies to concatenated sequence data. Due to the differences in the evolutionary processes, different genes possess different sets of branch lengths even when they have identical tree topologies. Here, we show that heterotachy is also a concern in the application of the distance method, and we investigate the origin of incorrect estimation of the phylogenetic tree.

Suppose, we measure the evolutionary distance between taxa A and B using sequence data from 2 loci, that is, loci 1 and 2. We denote the evolutionary distance between A and B at the loci 1 and 2 as dFormula and dFormula, respectively. The evolutionary distance is measured in terms of the number of nucleotide substitutions per nucleotide site. To simplify the problem, we assume that both loci have identical and infinitely large sequence lengths. Generally, due to the intergene variations caused by the different evolutionary processes, dFormula != dFormula. Because the 2 loci are infinitely long, we can estimate dFormula and dFormula without any error. According to the Jukes–Cantor (JC; Jukes and Cantor 1969Go) model, the following relationship is known:

Formula (1)
where i = 1, 2 and pFormula is the proportion of the different sites between taxa A and B in the ith gene.

Statistically, we can assume that dFormula and dFormula are random samples from a certain unknown distribution. The mean of this distribution would be a good summary measure of the evolutionary distance between taxa A and B. The arithmetic mean of the 2 distances is the unbiased estimator for the mean of distance distribution:

Formula (2)
On sequence concatenation of the 2 loci, the proportion of the different sites is (pFormula + pFormula)/2. The JC evolutionary distance estimated from the concatenated sequences is as follows:

Formula (3)
Based on the concave relationship between pFormula and dFormula in equation (1), we can show that

Formula
That is, dFormula is a "biased" estimator for the mean of distance distribution. The underestimation by dFormula is more severe when distantly related sequences are concatenated with closely related sequences. Figure 1 is the schematic representation of the underestimation by dFormula.


Figure 1
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Underestimation by dFormula:pFormula(i = 1,2) represents the proportion of the different sites between taxa A and B in the ith gene. The dFormula represents the evolutionary distance derived from pFormula. Because of the concave relationship between pFormula and dFormula, dFormula – which is derived from (pFormula + pFormula)/2–is always less than dFormula:=(dFormula + dFormula/2.

 
Figure 2 shows an example in which the underestimation by the concatenation results in an incorrect tree estimation with perfect confidence. We compare the phylogenetic reconstruction of 4 taxa using the approaches specified in equations (2) and (3), followed by the Neighbor-Joining (NJ) algorithm (Saitou and Nei 1987Go). The 2 loci have identical and infinite sequence lengths. They evolve along identical tree topologies but have different branch lengths. Because the sequence lengths are infinite, we can perfectly estimate the pairwise distances and reconstruct the true individual phylogenies of the 2 loci. In this case, the pairwise distances are consistent with the reconstructed tree for each gene. That is, the distance between the 2 taxa is equal to the sum of the lengths of all the branches connecting the 2 taxa. This consistency is maintained when we combine the 2 distances with equation (2), and the estimated tree topology is identical to each individual tree topology (refer to the right lower box of fig. 2). However, the consistency is not guaranteed on concatenation of the 2 loci, which may result in incorrect estimation of the phylogenetic tree (refer to the right upper box of fig. 2). Because the concatenated sequences are infinitely long, the resampled data are very similar to the original concatenated data in terms of the estimated distances, and we obtain 100% BP for the incorrect phylogeny.


Figure 2
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— The potential problem associated with sequence concatenation: The 2 boxes on the left represent the distance matrices and phylogenies estimated using infinitely long sequences from the 4 taxa. The terminal and internal branches b1, b2, and b3 represent 0.1, 0.01, and 0.4 nucleotide (nt) substitutions per nucleotide site, respectively. The upper and lower boxes on the right exhibit the distance matrices and the phylogenies estimated using the concatenated sequence data (eq. 3) and the arithmetic means of the individual distances (eq. 2). The distances were estimated using the JC model, and the phylogenies were reconstructed by the NJ method.

 
Although the example in figure 2 may appear to be somewhat artificial due to the infinite sequence lengths, it clearly reveals the potential problem associated with sequence concatenation. It has been noted that the consistent phylogenetic signal of each gene, which is too weak to be detected individually, can be amplified and detected in the concatenated sequences (e.g., Queiroz and Gatesy 2006Go). This "hidden support" is occasionally taken into consideration during phylogeny reconstruction. However, the example in figure 2 reveals that the hidden support is not necessarily a real signal but may be an artifact. The clade (A,C) in the phylogenetic tree estimated from the concatenated sequences is not supported at all by the individual loci.

Determining the Optimal Tree by Separate Analysis
Because there is a potential problem associated with sequence concatenation, we suggest that sequence data be subjected to separate analysis and the results be combined for phylogeny reconstruction. For convenience of mathematical notation, we denote the pairwise distances in the vector form. Let the true distance vector of the ith gene be {delta}i. For given m taxa, there exist m(m – 1)/2 pairwise distances, and {delta}i has m(m 1)/2 components. Based on the variability of the {delta}is among genes, we assume that {delta}i is a random variable whose mean and variance are denoted as follows:

Formula (4)
When the effects of the systematic biases—such as incomplete lineage sorting and horizontal gene transfer (Degnan and Rosenberg 2006Go)—are relatively minor, {Delta} would be consistent with the true species tree. Therefore, phylogeny reconstruction using {Delta} is quite reasonable (refer to Discussion).

For given n genes, the unbiased estimators of {Delta} and {Sigma} are calculated as follows:

Formula
The vectors of {delta}is are not directly observed but estimated from the sequence data and are then denoted as Formula s. Thus, we can consider the following more practical estimators of {Delta} and {Sigma}:

Formula (5)

Formula (6)
The phylogenetic tree can be estimated by applying the tree reconstruction algorithm T(·) to Formula :

Formula (7)

Calculating BPs Using Multilocus Sequence Data
The variance of Formula can be calculated as follows:

Formula (8)
where the first and the second approximations are valid when the sequence lengths and n are sufficiently large.

With regard to the nonnegative definite matrix Formula in equation (8), we can determine a unique nonsingular matrix Formula such that (Anderson 1984Go)

Formula
where Formula is the rank for Formula and the estimate for the rank of {Sigma}. We note that {Sigma} is not necessarily of full rank. With regard to the bifurcating phylogeny of m taxa, m(m – 1)/2 elements of {delta}i are determined by the lengths of 2m – 3 branches. When all the genes have identical tree topologies, the variability of m(m – 1)/2 elements can be explained by the variability of 2m – 3 branches, and the rank of {Sigma} is 2m – 3. In practice, the rank of {Sigma} may differ from 2m – 3 because each gene may not be perfectly consistent with the tree structure.

The purpose of obtaining Formula from Formula is to normalize Formula ; that is, when n and the sequence lengths are large:

Formula

We adopt 2-stage bootstrap procedures (Seo et al. 2005Go) for quantifying the uncertainty of distance measure. In the first stage, we resample genes with replacement from the original data set, which will be denoted by the "*" sign. In the second stage, we resample the sequence columns with replacement within the resampled genes, which will be denoted by the "{star}" sign. With regard to the resampled data set, we consider the following statistics, similar to those in equations (5) and (6):

Formula
For large n and long sequence lengths, we can show that

Formula
With regard to the nonnegative definite matrix Formula , we can determine a unique nonsingular matrix Formula (Anderson 1984Go) such that

Formula
where Formula is the rank for Formula . Here, Formula is obtained in order to normalize Formula . That is,

Formula

Following the 2 guidelines—centering and pivoting—for bootstrap hypothesis testing (Hall and Wilson 1991Go), we approximate the distribution of the normalized Formula by using the distribution of the normalized Formula :

Formula (9)
where the "~" sign signifies that the distribution of the left term can be approximated with that of the right term.

In equation (9), {Delta} is an unknown parameter. If we adopt the Bayesian idea of Efron et al. (1996)Go, we can consider the posterior probability of {Delta}. That is, with uniform prior of {Delta}, equation (9) leads to

Formula (10)
With regard to the original data set, we calculate Formula and Formula once. With regard to each resampled data set, we calculate Formula and Formula . Using iteration to generate the pseudodata set and calculate Formula and Formula , we obtain the posterior distribution of the unknown parameter {Delta}. The posterior probability of T({Delta}) can be used to calculate the BPs for the phylogenetic tree.

Simple Bootstrap Strategies
Equation (10) represents the most general bootstrap procedure in our study, hereafter referred to as B5. We consider 2 other bootstrap procedures—B3 and B4—that are simpler than B5 and involve resampling of genes. The performance of the procedures B3–B5 will be compared with that of the conventional procedures—B1 and B2 (see Results and Discussion). The procedures B1–B4 are explained below. Whereas B1 is used with sequence concatenation, B2–B5 are used with separate analysis of equations (5) and (7).

  1. B1: Multilocus sequences are concatenated for tree reconstruction, and the sequence columns are resampled with replacement from the concatenated data in order to calculate the BPs. The BPs are calculated in a manner similar to that used for the analysis of a single gene (Felsenstein 1985Go).
  2. B2: To generate pseudodata, we resample the sequence columns within each gene without resampling the genes. That is, the BPs for the phylogenetic tree are calculated by iteration of TFormula . This "stratified sampling" has been used for likelihood methods (e.g., Yoder and Yang 2000Go). By comparing the results of B1 and B2, we can see the effects of separate analysis.
  3. B3: Following the 2-stage bootstrap procedures (Seo et al. 2005Go), we resample the genes and sequence columns. However, we do not consider centering and pivoting of the bootstrap guidelines set forth by Hall and Wilson (1991)Go. The BPs for the phylogenetic tree are calculated by iterating TFormula . By comparing the results of B2 and B3, we can see the effects of the resampling genes.
  4. B4: The Bayesian idea of Efron et al. (1996)Go is adopted to calculate the BPs. The centering scheme proposed by Hall and Wilson (1991)Go is considered; however, the pivoting scheme is ignored. Ignoring pivoting in equation (10) implies ignoring Formula and Formula , leading to

    Formula
    The BPs for the phylogenetic tree are calculated using the iteration of T(2Formula Formula ). By comparing the results of B3 and B4, we can see the effects of the centering scheme of bootstrap procedures. Further, by comparing the results of B4 and B5, we can see the effects of the pivoting scheme because centering and pivoting are both considered in B5.


    Results
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Concluding Remarks
 Acknowledgements
 References
 
Simulation Studies
We performed simulation in order to compare the performance of the bootstrap procedures B1–B5. We used sequence concatenation and separate analysis as represented in equations (5) and (7) to reconstruct the phylogenies of randomly generated genes. When generating genes for the 4 taxa, we considered the variations in terms of the tree topology, branch lengths, and sequence lengths. We assumed that each gene evolved along either of the 2 tree topologies, namely, ((A,B),(C,D)) and ((A,C),(B,D)), which will hereafter be referred to as Trees 1 and 2, respectively. Each gene was determined to evolve along Tree 1 (or Tree 2) with the probability P (or 1 – P). The terminal and internal branch lengths of all genes were random samples obtained from the uniform distributions Unif[0, 0.2] and Unif[0, 0.02], respectively. The sequence length of each gene was a random sample obtained from the uniformly distributed integers between 501 and 3,500. In each gene, random DNA sequences of the 4 taxa were generated using the JC model (Jukes and Cantor 1969Go) after determination of the tree topology, branch lengths, and sequence lengths.

Once we generated n random genes, the pairwise distances for the concatenated sequences and separate genes were calculated using the JC model. For tree reconstruction, the NJ algorithm (Saitou and Nei 1987) was applied to the distance matrix of the concatenated sequences and to the average distance matrix of equation (5). To calculate the BPs, we generated 100 pseudosamples by following the procedures B1–B5. We calculated the BPs for 3 clades, that is, (A,B), (A,C), and (A,D), hereafter referred to as Nodes 1, 2, and 3, respectively.

For a given n and P, the procedure for generating genes and calculating the BPs at Nodes 1–3 was repeated 1, 000 times. Subsequently, we investigated the distribution of the BPs at the 3 Nodes. Node 3 is not supported by either Tree 1 or Tree 2. Nodes 1 and 2 compete with and are complimentary to each other, that is, a high BP at Node 1 implies a low BP at Node 2 and vice versa. Therefore, we will focus on the description of the results at Node 1, although we have shown the figures for all 3 Nodes. Using different values for n and P, we consider the following 4 scenarios:

  1. Scenario 1 (P = 0.6, n = 5). In this Scenario, we consider a situation in which the number of multilocus genes is relatively small (n = 5) and one of the 2 trees is favored. Because P > 0.5, Tree 1 is favored, and it is reasonable to expect high BPs at Node 1. When the procedures B1 and B2 are used, some data sets exhibit very high BPs (≥95%) at Node 1; however, other data sets exhibit very low BPs (≤5%) at Node 1 (fig. 3). A low BP at Node 1 is inconsistent with the fact that Tree 1 is favored (P = 0.6). The difference between the calculated values of the BPs for different data sets is more exaggerated when the B1 procedure is used. Of the 1,000 iterations in the B1 procedure, 199 data sets display ≤5% BP at Node 1. Although extremely low BPs are observed with B3 and B4, their proportions out of 1,000 iterations are smaller than those of B1 and B2. In contrast to B1–B4, B5 displays neither extremely high nor extremely low BPs at Node 1 (see Discussion).
  2. Scenario 2 (P = 0.6, n = 100). As in Scenario 1, Tree 1 is favored (P = 0.6); however, the number of multilocus genes is relatively large (n = 100). In figure 4, the histograms of the BPs for B1–B5 at Node 1 are concentrated in the higher range, which is consistent with the fact that Tree 1 is favored (P = 0.6). It is notable that some data sets (36 out of 1,000) for B1 exhibit very low BPs (≤5%) at Node 1, despite the proportion being small. Because the intergene variations are taken into consideration, B3–B5 have a more scattered distribution than B1 and B2.
  3. Scenario 3 (P = 0.5, n = 5). The number of multilocus genes is relatively small (n = 5), and both trees are equally favored (P = 0.5). When the procedures B1 and B2 are used, a majority of the data sets display extremely high or low BPs at Node 1 (fig. 5). However, as in the result of Scenario 1, the proportions of the extreme BPs for B3 and B4 are smaller than those of B1 and B2. When B5 is used, the distribution of BPs at Node 1 is similar to that observed in Scenario 1, and most of the data sets are concentrated in the moderate range of BPs.
  4. Scenario 4 (P = 0.5, n = 100). As in Scenario 3, both trees are equally favored (P = 0.5). However, the number of multilocus genes is relatively large (n = 100). When the procedures B1 and B2 are used, a majority of the data sets display extremely high or low BPs at Node 1 (fig. 6). When the procedures B3–B5 are used, the distribution of BPs at Node 1 is scattered between 0 and 1 and is similar to uniform distribution (see Discussion).


Figure 3
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Scenario 1 (P = 0.6, n = 5). The procedure for generating 5 (n = 5) genes from 4 taxa and calculating the BPs at the 3 Nodes is iterated 1,000 times. For each iteration, the BPs are calculated using 5 different procedures (B1–B5). The resulting histograms of the BPs are shown. When random sequences are generated, the topology of each gene is determined as Tree 1 (or Tree 2) with the probability P (or 1 – P). The 2 tree topologies, namely, Trees 1 and 2, are ((A,B),(C,D)) and ((A,C),(B,D)), respectively. Nodes 1, 2, and 3 are (A,B), (A,C), and (A,D), respectively. Therefore, Nodes 1 and 2 are consistent with Trees 1 and 2, respectively.

 

Figure 4
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Scenario 2 (P = 0.6, n = 100): Refer to the legend of figure 3.

 

Figure 5
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 5.— Scenario 3 (P = 0.5, n = 5): Refer to the legend of figure 3.

 

Figure 6
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 6.— Scenario 4 (P = 0.5, n = 100): Refer to the legend of figure 3.

 
Empirical Data Analysis
To compare the performance of the procedures B1–B5, we analyzed the sequence data published by Nishihara et al. (2007)Go. They collected a sequence data set comprising 2,789 protein-coding data partitions from 10 mammalian species and investigated their tree topologies and significance. They considered 3 candidate trees—Trees 1, 2, and 3 (fig. 7)—representing different relationships among Boreotheria, Xenarthra, and Afrotheria. Using the maximum likelihood (ML) method and the nucleotide substitution model, they observed that the concatenation approach yields the monoclade of Xenarthra and Afrotheria (Tree 3 of fig. 7) with a 100% BP (see similar result by Hallstrom et al. [2007]Go). However, by separate analysis, Tree 1 was observed to be the ML tree. Although the BPs of Tree 1 were generally higher than those of Tree 3, the BPs of Trees 1 and 3 varied considerably according to the differences in the evolutionary models and the categorization of the sequence data. Therefore, they noted that Tree 3 cannot be rejected against Tree 1, and a greater amount of sequence data are needed to resolve the phylogenetic relationship.


Figure 7
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 7.— The 3 major trees of 10 mammalian species. The Nodes 1–6 are common among the 3 trees. The Nodes 7, 8, and 9 are consistent with the Trees 1, 2, and 3 respectively. Group A, B, and C represent Boreotheria, Xenarthra, and Afrotheria, respectively.

 
The total length of 2,789 data partitions is 1,011,870 nt. This implies that on an average, 1 data partition has approximately 395 nt. We found that short sequences can create a problem in our distance method because distance estimates have large variances (see Discussion). Therefore, from among the 2,789 data partitions, we selected partitions whose lengths are moderately large. In this data set, there are 508 and 103 data partitions that are longer than 500 and 1,000 nt, respectively, hereafter referred to as data sets A and B, respectively. The total lengths of the data sets A and B are 422,937 and 146,958 nt, respectively. For each data partition within the data sets A and B, we calculated the distances using the Tamura–Nei (TN; Tamura and Nei 1993Go) model with a gamma distribution of the rate heterogeneity among sites (Yang 1993Go, 1994bGo).

We adopted the value of 0.375 as the gamma rate heterogeneity parameter and calculated the TN distances using the formula of Nei and Kumar (2000Go, p. 44; for justification of the value of 0.375, see Discussion). Reconstruction of NJ trees with sequence concatenation of the data sets A and B yielded Tree 3 as shown in figure 7. Reconstruction of NJ trees with separate analysis using equations (5) and (7) for the data sets A and B yielded Tree 1. Table 1 shows the BPs for the data sets A and B obtained using the 5 bootstrap procedures at Nodes 1–9 of figure 7.


View this table:
[in this window]
[in a new window]

 
Table 1 BPs Calculated Using 5 Bootstrap Procedures (B1–B5)

 

    Discussion
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Concluding Remarks
 Acknowledgements
 References
 
New Distance Methods Using Multilocus Sequence Data
Here, we suggest new procedures for determining the optimal tree topology and calculating BPs using the distance method and multilocus sequence data. For calculating BPs using the B4 and B5 approaches, we adopt the Bayesian idea of Efron et al. (1996)Go. Therefore, the BPs calculated using the procedures B4 and B5 can be regarded as the posterior probabilities for the subclades. The advantage of adopting the Bayesian idea is that interpretation of the BP is rendered easy and intuitive. The interpretation of a conventional BP is "repeatability" (Felsenstein 1985Go). The pseudodata repeatedly generated around the original data represent information regarding the true data-generating mechanism for the original data. This interpretation is rather complicated and is not intuitive. By adopting the Bayesian idea, BP can be interpreted as the posterior probability that will enable the determination of the existence of a subclade.

The advantage of our procedures over the conventional Bayesian methods (e.g., Rannala and Yang 1996Go; Ronquist and Huelsenbeck 2003Go) is that they require considerably less computation and can deal with a larger sequence data set. The application of the conventional Bayesian methods is limited to a moderate number of genes and taxa because it requires heavy computation. Our procedures could be used as alternatives to the conventional Bayesian methods when large amounts of sequence data are analyzed.

The Importance of Considering Intergene Variations
Based on the results of figure 3, it is evident that the problem associated with concatenation (B1) can be very serious when the number of genes is small. When n is small, stochastic errors have a greater influence on the sampled genes, and all genes may be incidentally sampled from a tree topology that does not represent the true phylogenetic relationship. In such situations, B1 after sequence concatenation and B2 of stratified sampling may yield high BPs for incorrect subclades. Considering the existence of intergene variations, more care should be exercised during confidently formulating a conclusion based on a small number of genes. Even when all the sampled genes support identical tree topologies, if there are considerable intergene variations in terms of the distance, we reasonably expect that there may be other genes that support different topologies but are incidentally not selected. This expectation prevents us from formulating a confident decision until we obtain sufficient information, that is, a large number of genes. As a result, the BPs calculated using B3–B5 are distributed mostly in the intermediate range when n is small (fig. 3) and are concentrated in the higher range at Node 1 when n is large (fig. 4). When there is little variation or consistency among genes, the procedures B3–B5 can yield confident results even with a small number of genes. In all Scenarios, Node 3 is supported by neither Tree 1 nor Tree 2, which is consistent among all genes. Therefore, B3–B5 yield low BPs at Node 3 even with a small number of genes (figs. 3 and 5).

The poor performance of B1 after concatenation and that of B2 with stratified sampling is very clear in Scenario 4 (fig. 6). The number of genes is large (n = 100). Because P = 0.5, Trees 1 and 2 are equally probable. However, when B1 and B2 are used, a majority of the 1,000 iterations exhibit extremely high (≥95%) or low (≤5%) BPs at Node 1. Because intergene variations are considered, a more flat and widely scattered distribution is obtained using B3–B5. The distributions of BPs calculated using B3–B5 are similar to the uniform distribution between 0 and 1. This is against our intuitive expectation that BPs would be concentrated around 0.5 at Node 1. A similar problem was noted under the Bayesian framework (Suzuki et al. 2002Go; Lewis et al. 2005Go). When the multifurcating node is true, the posterior probabilities of a bifurcating node are uniformly distributed between 0 and 1 (Lewis et al. 2005Go). The situation of P = 0.5 in Scenario 4 is somewhat similar to the situation of the multifurcating node that Lewis et al. (2005)Go considered. When P = 0.5, the true {Delta} is located at the boundary between the space of Trees 1 and 2. The distance estimates Formula are symmetrically distributed around the boundary. If Formula is incidentally located in the space of Tree 1, a majority of the resampled estimates Formula s will be located in the space of Tree 1, resulting in a high BP at Node 1. If Formula is incidentally far away from the boundary, the BP at Node 1 can be extremely high. Similarly, if Formula is incidentally located in the space of Tree 2, the BP at Node 1 can be low, even extremely low. This intuitively explains why the distributions of the BPs at Node 1 calculated using B3–B5 converge to a uniform distribution between 0 and 1 instead of converging to a point mass at 0.5. If the Formula s are located exactly at the boundary between the space of Trees 1 and 2, the Formula s should be symmetrically distributed around the boundary, and the distributions of BPs should converge to a point mass at 0.5. However, this does not happen at all. However, large the amount of data, the Formula s are symmetrically distributed around the boundary and are not located exactly at it.

In Scenario 3, the BPs at Node 1 calculated using B5 are more concentrated around 0.5, whereas the BPs at Node 1 calculated using B3–B4 are rather uniformly distributed between 0 and 1. The deviation of B5 from the uniform distribution can be explained on the basis of poor approximation for the variance. The approximation of equation (8) is valid when the sequence lengths and number of genes are sufficiently large. Otherwise, Formula and Formula are not reliable approximates; therefore, B3 and B4 should be preferred over B5. When the number of genes is large, the distribution of the BP calculated using B5 is similar to the distributions of the BPs calculated using B3 and B4 as shown in Scenarios 2 and 4. This implies that the effects of centering and pivoting are relatively minor when the data set is large. Therefore, we conclude that B3 and B4 are sufficiently reliable for application in most circumstances, despite B5 being, theoretically, a more sophisticated procedure.

We note that different sequence lengths can be considered during phylogenetic reconstruction and 2-stage bootstrap procedures. By considering distance matrices weighted by sequence lengths, we can modify equation (5) and define Formula as follows:

Formula
where li is the sequence length of ith gene. Then, the phylogeny can be reconstructed by using T(Formula . This approach is similar to the "sum criterion" of ML estimation proposed by Seo et al. (2005)Go. The 2-stage bootstrap procedures using Formula and Formula result in considerable variations in the BPs because the variations of the sequence lengths are considered (data not shown). Tree reconstruction using Formula and Formula yields 2 extremes: one, in which the information regarding sequence length is completely ignored and the other, in which it is considered. It is possible to create an intermediate weighting scheme between these 2 extremes. In our study, we mainly focused on the procedure using Formula rather than discussing the selection of an optimal weighting scheme.

Analysis of Mammalian Data
Rate Heterogeneity among Sites
Nishihara et al. (2007)Go observed that tree reconstruction by the NJ method (Saitou and Nei 1987Go) with the TN model (Tamura and Nei 1993Go) yields a misleading tree topology in which rodents (mouse and rat) are placed basally and are grouped together with opossum. They noted that the unreasonable tree reconstruction by the above method may be due to long-branch attraction. Here, we note that ignoring the rate heterogeneity among sites may be another possible explanation for the misleading tree topology. All the data partitions considered for analysis by Nishihara et al. are protein-coding sequences. In general, the third codon position evolves faster than the first and second positions. Therefore, it is appropriate to take into consideration the rate heterogeneity among sites during the calculation of the pairwise distances. It is very important to consider rate heterogeneity among sites during tree reconstruction and ignoring it has been noted to possibly result in the reconstruction of an incorrect phylogeny (e.g., Yang 1995Go).

The rate heterogeneity among sites is typically modeled with a gamma distribution whose mean and variance are 1 and 1/{alpha}, respectively (Yang 1993Go). In principle, it is not possible to estimate {alpha} by the distance method, and we applied the ML method to determine the appropriate {alpha} for calculating the pairwise distances. PAML (Yang 2007Go) was used to estimate {alpha} using the TN model (Tamura and Nei 1993Go) for the concatenated sequences of the 2,789 data partitions. The estimated {alpha}s for Trees 1, 2, and 3 of figure 7 are 0.375, 0.375, and 0.374, respectively. Similarly, {alpha}s for the concatenated sequences of the data sets A and B were estimated, and the estimated {alpha}s for Trees 1, 2, and 3 of the data set A are 0.363, 0.363, and 0.362, respectively, and those of the data set B are 0.371, 0.372, and 0.370, respectively. The value of 0.375 that we adopted as an {alpha} parameter for calculating the pairwise distances is qualitatively similar to the {alpha}s estimated from the concatenated data partitions. When {alpha} = 0.375 is considered during the calculation of the pairwise distances, the rodent group is properly positioned within the Boreotheria group. The NJ trees for 2,789 data partitions reconstructed using the TN model with sequence concatenation and with separate analysis were observed to be Trees 3 and 1, respectively. This result is consistent with that of Nishihara et al. (2007)Go. They applied the ML method with the General Time Reversible (Yang 1994aGo) model, and the ML trees reconstructed by sequence concatenation and by separate analysis were observed to be Trees 3 and 1, respectively.

Variations among the Data Partitions
Because different data partitions have been subject to different evolutionary mechanisms, the optimal {alpha} for different data partitions will be different. However, we did not consider this variation in calculating the pairwise distances and applied {alpha} = 0.375 to all data partitions. It is possible to estimate {alpha}s for separate data partitions by the ML method and to apply the separate {alpha}s for calculating the pairwise distances for separate data partitions. However, this approach would be impractical when a huge amount of data need to be analyzed, and the ML method also requires a lot of computational time. Our general recommendation is to apply the ML method to concatenated sequences to obtain the appropriate value for {alpha}; this is relatively easy even when a huge amount of sequence data need to be analyzed. As shown in our results, the Formula estimates appear to be quite robust for different trees and different lengths of concatenated data.

We note that a substantial portion of the 2,789 data partitions comprises relatively short sequences. For example, there are 645 (1,991) data partitions whose lengths are smaller than 200 (400) nt. In our distance method, we found that short sequences could cause a problem because distance estimates have a large variance. In the analysis using the ML method conducted by Nishihara et al. (2007)Go, they adopted the resampling estimated log-likelihood (RELL; Kishino and Hasegawa 1989Go) procedure. That is, instead of resampling the sequence data and recalculating the likelihood scores for the resampled data, they resampled the likelihood scores obtained from the original sequence data. When the sequences are short, the RELL procedure may underestimate the variance because it ignores the variance of the estimation conducted using resampled data. In our distance method, pairwise distances are recalculated for resampled sequence data sets. Because the distance estimates from short sequences are subject to large variations, BPs are also subject to large variations. Although Tree 1 represents the NJ tree reconstructed using the 2,789 data partitions and Formula of equation (5), the BPs at Nodes 1–6 obtained by the 2-stage bootstrap procedures are relatively low (data not shown). Because of the large variance in the distance estimates, the distance matrices from the resampled data are distributed over the space of many trees, resulting in relatively low BPs for a specific clade. To reduce the problem caused by short sequences, we arbitrarily set the threshold value as 500 nt (for data set A) and 1,000 nt (for data set B) for selection of data partitions. With moderately long data partitions, the NJ trees were identical to those obtained from all 2,789 data partitions and the BPs at Nodes 1–6 were high (table 1).

Problem Associated with Sequence Concatenation
The phylogenetic relationship among Boreotheria, Xenarthra, and Afrotheria remains controversial, and Nishihara et al. (2007)Go noted that a greater amount of sequence data and a more appropriate taxon sampling are required to resolve the phylogenetic relationship among these 3 taxa. Nishihara et al. criticized the sequence concatenation approach because it presents a 100% BP for the clade of Xenarthra and Afrotheria (Node 9 of fig. 7). In our analysis, we obtained results that are consistent with those of Nishihara et al. When the number of data partitions is moderate (data set B), the BP at Node 9 calculated using B1 is not extreme; however, when the number of data partitions is large (data set A), it is 100.0%. This leads us to arrive at an incorrect conclusion, which is also demonstrated in our simulation study. For data sets A and B, the BPs at Nodes 7 and 9 calculated using B2–B5 were moderately high, which is consistent with the results of Nishihara et al.’s ML method. Whereas there is a drastic difference between B1 and B2–B5 (table 1), the differences among B2–B5 are relatively minor. However, this does not generally imply that B2 is as acceptable as B3–B5. It appears that the effect of separate analysis is dominant over the effect of the 2-stage bootstrap procedures in this example involving mammalian data. As shown in the simulation study, there is a potential problem associated with B2 because it ignores the intergene variations. Therefore, we emphasize that B3–B5 are generally preferable over B1 and B2.

Gene Tree versus Species Tree
In our procedures, we assume that {Delta} is consistent with the species tree. Therefore, the species tree can be estimated using Formula —the arithmetic mean of Formula . The gene trees reconstructed using Formula are distributed around the species tree, and we assume that systematic bias has a minor effect on the distribution of Formula .

Recently, it was noted that gene trees are not necessarily consistent with the species tree due to various evolutionary events, such as incomplete lineage sorting and horizontal gene transfer (Degnan and Rosenberg 2006Go). In such situations, the species tree cannot be estimated using a consensus gene tree. Our tree reconstruction method using equations (5) and (7) does not consider systematic bias caused by incomplete lineage sorting and horizontal gene transfer. However, it is notable that incomplete lineage sorting is generally not a major concern when the species are moderately distant (e.g., Degnan and Rosenberg 2006Go). Moreover, only a small portion of the genomic loci may be subject to horizontal gene transfer during the evolution of higher organisms. Tree estimation using equations (5) and (7) is suitable for situations in which the effect of systematic bias is minor, and the consensus gene tree is consistent with the species tree.


    Concluding Remarks
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Concluding Remarks
 Acknowledgements
 References
 
Here, we introduce new procedures for estimating phylogeny and calculating BPs using multilocus sequence data in the context of the distance method. By theoretical investigation, simulation studies, and empirical data analysis, we demonstrated that there is a potential problem associated with the sequence concatenation and that intergene variations should be considered during the measurement of phylogenetic uncertainty. Although B5 is the most sophisticated procedure, its application appears to be limited to data sets comprising a large number of genes and long sequences. Therefore, B3 and B4 would be preferable in most circumstances. Because our distance methods require considerably less computation, they are good alternatives to the Bayesian and likelihood methods when huge amounts of sequence data need to be analyzed.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Concluding Remarks
 Acknowledgements
 References
 
The author thanks Jeffrey L. Thorne, Asger Hobolth, Sang Chul Choi, Hirohisa Kishino, Marcy Uyenoyama, and 2 anonymous reviewers for their valuable comments and Hidenori Nishihara for providing data. The author was supported by Japan Society for the Promotion of Science (EYS B-18770216).


    Footnotes
 
Marcy Uyenoyama, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Concluding Remarks
 Acknowledgements
 References
 

    Adachi J, Waddell PJ, Martin W, Hasegawa M. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast. J Mol Evol (2000) 50:348–358.[Web of Science][Medline]

    Anderson TW. Introduction to multivariate statistical analysis (1984) 2nd ed. Hoboken (NJ): John Wiley & Sons, Inc. 596.

    Cao Y, Fujiwara M, Nikaido M, Okada N, Hasegawa M. Interordinal relationships and timescale of eutherian evolution as inferred from mitochondrial genome data. Gene (2000) 259:149–158.[CrossRef][Web of Science][Medline]

    Cao Y, Sorenson MD, Kumazawa Y, Mindell DP, Hasegawa M. Phylogenetic position of turtles among amniotes: evidence from mitochondrial and nuclear genes. Gene (2000) 259:139–148.[CrossRef][Web of Science][Medline]

    Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genet (2006) 2:762–768.[Web of Science]

    Edwards SV, Liu L, Pearl DK. High-resolution species trees without concatenation. Proc Natl Acad Sci USA (2007) 102:4436–4441.[CrossRef]

    Efron B, Halloran E, Holmes S. Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci USA (1996) 93:13429–13434.[Abstract/Free Full Text]

    Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution (1985) 39:783–791.[CrossRef][Web of Science]

    Hall P, Wilson SR. Two guidelines for bootstrap hypothesis testing. Biometrics (1991) 47:757–762.[CrossRef][Web of Science]

    Hallstrom BM, Kullberg M, Nilsson MA, Janke A. Phylogenomic data analyses provide evidence that Xenarthra and Afrotheria are sister groups. Mol Biol Evol (2007) 24:2059–2068.[Abstract/Free Full Text]

    Jukes TH, Cantor CR. Evolution of protein molecules. In: Mammalian protein metabolism—Munro HN, ed. (1969) New York: Academic Press. 21–132.

    Kishino H, Hasegawa M. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. J Mol Evol (1989) 29:170–179.[CrossRef][Web of Science][Medline]

    Kolaczkowski B, Thornton JW. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature (2004) 431:980–984.[CrossRef][Medline]

    Lewis PO, Holder MT, Holsinger KE. Polytomies and Bayesian phylogenetic inference. Syst Biol (2005) 54:241–253.[Abstract/Free Full Text]

    Nei M, Kumar S. Molecular evolution and phylogenetics (2000) New York: Oxford University Press.

    Nikaido M, Cao Y, Harada M, Okada N, Hasegawa M. Mitochondrial phylogeny of hedgehogs and monophyly of Eulipotyphla. Mol Phylogenet Evol (2003) 28:276–284.[CrossRef][Web of Science][Medline]

    Nishihara H, Okada N, Hasegawa M. Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biol (2007) 8:R199.1–R199.10.

    Queiroz AD, Gatesy J. The supermatrix approach to systematics. Trends Ecol Evol (2006) 22:34–41.[CrossRef]

    Rannala B, Yang Z. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J Mol Evol (1996) 43:304–311.[Web of Science][Medline]

    Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.[Abstract/Free Full Text]

    Saitou N, Nei N. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol (1987) 4:406–425.[Abstract]

    Seo T-K, Kishino H, Thorne JL. Incorporating gene-specific variation when inferring and evaluating optimal evolutionary tree topologies from multilocus sequence data. Proc Natl Acad Sci USA (2005) 102:4436–4441.[Abstract/Free Full Text]

    Shimodaira H, Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol (1999) 16:1114–1116.[Web of Science]

    Suzuki Y, Glazko GV, Nei N. Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proc Natl Acad Sci USA (2002) 99:16138–16143.[Abstract/Free Full Text]

    Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol (1993) 10:512–526.[Abstract]

    Yang Z. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol (1993) 10:1396–1401.[Abstract]

    Yang Z. Estimating the pattern of nucleotide substitution. J Mol Evol (1994a) 39:105–111.[Web of Science][Medline]

    Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol (1994b) 39:306–314.[CrossRef][Web of Science][Medline]

    Yang Z. Evaluation of several methods for estimating phylogenetic trees when substitution rates differ over nucleotide sites. J Mol Evol (1995) 40:689–697.[CrossRef][Web of Science]

    Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol (2007) 24:1586–1591.[Abstract/Free Full Text]

    Yoder AD, Yang Z. Estimation of primate speciation dates using local molecular clocks. Mol Biol Evol (2000) 17:1081–1090.[Abstract/Free Full Text]

Accepted for publication February 7, 2008.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Syst BiolHome page
L. Liu, L. Yu, D. K. Pearl, and S. V. Edwards
Estimating Species Phylogenies Using Coalescence Times among Sequences
Syst Biol, October 1, 2009; 58(5): 468 - 477.
[Abstract] [Full Text] [PDF]


Home page
Syst BiolHome page
L. Liu and S. V. Edwards
Phylogenetic Analysis in the Anomaly Zone
Syst Biol, August 1, 2009; 58(4): 452 - 460.
[Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
25/5/960    most recent
msn043v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Seo, T.-K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Seo, T.-K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?