MBE Advance Access originally published online on February 14, 2008
Molecular Biology and Evolution 2008 25(5):960-971; doi:10.1093/molbev/msn043
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Calculating Bootstrap Probabilities of Phylogeny Using Multilocus Sequence Data
Professional Programme for Agricultural Bioinformatics, University of Tokyo, 1-1-1 Yayoi Bunkyo-Ku, Tokyo 113-8657, Japan
E-mail: seo{at}iu.a.u-tokyo.ac.jp.
| Abstract |
|---|
|
|
|---|
Phylogeny estimation is extremely crucial in the study of molecular evolution. The increase in the amount of available genomic data facilitates phylogeny estimation from multilocus sequence data. Although maximum likelihood and Bayesian methods are available for phylogeny reconstruction using multilocus sequence data, these methods require heavy computation, and their application is limited to the analysis of a moderate number of genes and taxa. Distance matrix methods present suitable alternatives for analyzing huge amounts of sequence data. However, the manner in which distance methods can be applied to multilocus sequence data remains unknown. Here, we suggest new procedures to estimate molecular phylogeny using multilocus sequence data and evaluate its significance in the framework of the distance method. We found that concatenation of the multilocus sequence data may result in incorrect phylogeny estimation with an extremely high bootstrap probability (BP), which is due to incorrect estimation of the distances and intentional ignorance of the intergene variations. Therefore, we suggest that the distance matrices for multilocus sequence data be estimated separately and these matrices be subsequently combined to reconstruct phylogeny instead of phylogeny reconstruction using concatenated sequence data. To calculate the BPs of the reconstructed phylogeny, we suggest that 2-stage bootstrap procedures be adopted; in this, genes are resampled followed by resampling of the sequence columns within the resampled genes. By resampling the genes during calculation of BPs, intergene variations are properly considered. Via simulation studies and empirical data analysis, we demonstrate that our 2-stage bootstrap procedures are more suitable than the conventional bootstrap procedure that is adopted after sequence concatenation.
Key Words: bootstrap probability 2-stage bootstrap distance method phylogeny
| Introduction |
|---|
|
|
|---|
Reconstruction of phylogeny is one of the most important issues in the study of evolution. In the past when limited genomic information was available, molecular phylogeny was typically estimated using single locus data; however, due to the increase in the amount of available genomic data, it is currently being estimated using multilocus sequences (e.g., Adachi et al. 2000
Recently, Seo et al. (2005)
suggested new procedures for determining and testing optimal tree topology using multilocus sequence data in the framework of the likelihood method. Instead of "combining (i.e., concatenating) sequence data," they suggested "combining the log-likelihood scores" obtained from the separate analysis. The approach of combining log-likelihood scores is conventional and has been adopted in many analyses (e.g., Adachi et al. 2000
; Cao, Fujiwara, et al. 2000
; Cao, Sorenson, et al. 2000
; Nikaido et al. 2003
; Nishihara et al. 2007
). Calculation of phylogenetic uncertainty by conventional separate analysis involves 1) merging the pools of log-likelihood scores and applying the bootstrap procedure (Kishino and Hasegawa 1989
; Shimodaira and Hasegawa 1999
) or 2) stratifying the pools of log-likelihood scores and applying the bootstrap procedure within the strata (Yoder and Yang 2000
). The latter approach is referred to as "stratified sampling." In contrast to the conventional bootstrap procedures after separate analysis, Seo et al. (2005)
suggested 2-stage bootstrap procedures in which genes are resampled and the sequence columns are subsequently resampled from within the resampled genes. By resampling the genes, the intergene variations are appropriately considered while testing for the significance of tree topologies.
Although the bootstrap procedures for likelihood methods that were proposed by Seo et al. (2005)
compare the competing tree topologies as a whole, they cannot quantify the significance of subclades within the phylogeny. Typically, the significance of subclades is represented by bootstrap probabilities (BPs) that are determined by parsimony and distance methods (Felsenstein 1985
). However, the manner in which parsimony and distance methods can be applied to multilocus sequence data without sequence concatenation remains unknown. Bayesian approaches can be applied to multilocus sequence data without sequence concatenation (Rannala and Yang 1996
; Ronquist and Huelsenbeck 2003
; Edwards et al. 2007
). However, Bayesian methods generally require considerable computation, and their application is limited to a moderate number of taxa and sequences.
In our study, we suggest new distance methods for reconstructing the phylogeny and calculating the BPs using multilocus sequence data. By a simulation study, we demonstrate the advantage of separate analysis over sequence concatenation and the importance of considering intergene variations while measuring the BPs for the subclades. Further, we discuss the applicability of our procedures with the analysis of mammalian sequence data.
| Methods |
|---|
|
|
|---|
In this section, we begin by explaining the potential problem associated with sequence concatenation and the origin of the problem with reference to the framework of the distance method. Subsequently, we discuss 1) how to combine the individual results obtained using multilocus sequence data in order to determine optimal tree topology and 2) how to calculate BPs.
Potential Problem Associated with Sequence Concatenation
Recently, it was noted that heterotachy can result in incorrect tree estimation under the parsimony and likelihood framework (Kolaczkowski and Thornton 2004
). When a single locus data set comprises multiple partitions having identical tree topologies but different branch lengths, the tree reconstructed using the entire single locus data set may be inconsistent with that of the individual partitions. The mechanism by which the problem of heterotachy occurs in single locus data set straightforwardly applies to concatenated sequence data. Due to the differences in the evolutionary processes, different genes possess different sets of branch lengths even when they have identical tree topologies. Here, we show that heterotachy is also a concern in the application of the distance method, and we investigate the origin of incorrect estimation of the phylogenetic tree.
Suppose, we measure the evolutionary distance between taxa A and B using sequence data from 2 loci, that is, loci 1 and 2. We denote the evolutionary distance between A and B at the loci 1 and 2 as d
and d
, respectively. The evolutionary distance is measured in terms of the number of nucleotide substitutions per nucleotide site. To simplify the problem, we assume that both loci have identical and infinitely large sequence lengths. Generally, due to the intergene variations caused by the different evolutionary processes, d
d
. Because the 2 loci are infinitely long, we can estimate d
and d
without any error. According to the Jukes–Cantor (JC; Jukes and Cantor 1969
) model, the following relationship is known:
|
| (1) |
is the proportion of the different sites between taxa A and B in the ith gene.
Statistically, we can assume that d
and d
are random samples from a certain unknown distribution. The mean of this distribution would be a good summary measure of the evolutionary distance between taxa A and B. The arithmetic mean of the 2 distances is the unbiased estimator for the mean of distance distribution:
|
| (2) |
+ p
)/2. The JC evolutionary distance estimated from the concatenated sequences is as follows:
![]() | (3) |
and d
in equation (1), we can show that |
|
is a "biased" estimator for the mean of distance distribution. The underestimation by d
is more severe when distantly related sequences are concatenated with closely related sequences. Figure 1 is the schematic representation of the underestimation by d
.
|
Figure 2 shows an example in which the underestimation by the concatenation results in an incorrect tree estimation with perfect confidence. We compare the phylogenetic reconstruction of 4 taxa using the approaches specified in equations (2) and (3), followed by the Neighbor-Joining (NJ) algorithm (Saitou and Nei 1987
|
Although the example in figure 2 may appear to be somewhat artificial due to the infinite sequence lengths, it clearly reveals the potential problem associated with sequence concatenation. It has been noted that the consistent phylogenetic signal of each gene, which is too weak to be detected individually, can be amplified and detected in the concatenated sequences (e.g., Queiroz and Gatesy 2006
Determining the Optimal Tree by Separate Analysis
Because there is a potential problem associated with sequence concatenation, we suggest that sequence data be subjected to separate analysis and the results be combined for phylogeny reconstruction. For convenience of mathematical notation, we denote the pairwise distances in the vector form. Let the true distance vector of the ith gene be
i. For given m taxa, there exist m(m – 1)/2 pairwise distances, and
i has m(m – 1)/2 components. Based on the variability of the
is among genes, we assume that
i is a random variable whose mean and variance are denoted as follows:
|
| (4) |
would be consistent with the true species tree. Therefore, phylogeny reconstruction using
is quite reasonable (refer to Discussion).
For given n genes, the unbiased estimators of
and
are calculated as follows:
![]() |
is are not directly observed but estimated from the sequence data and are then denoted as
and
:
|
| (5) |
|
| (6) |
|
| (7) |
Calculating BPs Using Multilocus Sequence Data
The variance of
can be calculated as follows:
![]() | (8) |
With regard to the nonnegative definite matrix
in equation (8), we can determine a unique nonsingular matrix
such that (Anderson 1984
)
|
|
is the rank for
. We note that
is not necessarily of full rank. With regard to the bifurcating phylogeny of m taxa, m(m – 1)/2 elements of
i are determined by the lengths of 2m – 3 branches. When all the genes have identical tree topologies, the variability of m(m – 1)/2 elements can be explained by the variability of 2m – 3 branches, and the rank of
is 2m – 3. In practice, the rank of
may differ from 2m – 3 because each gene may not be perfectly consistent with the tree structure.
The purpose of obtaining
from
is to normalize
; that is, when n and the sequence lengths are large:
![]() |
We adopt 2-stage bootstrap procedures (Seo et al. 2005
) for quantifying the uncertainty of distance measure. In the first stage, we resample genes with replacement from the original data set, which will be denoted by the "*" sign. In the second stage, we resample the sequence columns with replacement within the resampled genes, which will be denoted by the "
" sign. With regard to the resampled data set, we consider the following statistics, similar to those in equations (5) and (6):
![]() |
![]() |
|
|
![]() |
Following the 2 guidelines—centering and pivoting—for bootstrap hypothesis testing (Hall and Wilson 1991
), we approximate the distribution of the normalized
by using the distribution of the normalized
:
|
| (9) |
" sign signifies that the distribution of the left term can be approximated with that of the right term.
In equation (9),
is an unknown parameter. If we adopt the Bayesian idea of Efron et al. (1996)
, we can consider the posterior probability of
. That is, with uniform prior of
, equation (9) leads to
|
| (10) |
. The posterior probability of T(
) can be used to calculate the BPs for the phylogenetic tree.
Simple Bootstrap Strategies
Equation (10) represents the most general bootstrap procedure in our study, hereafter referred to as B5. We consider 2 other bootstrap procedures—B3 and B4—that are simpler than B5 and involve resampling of genes. The performance of the procedures B3–B5 will be compared with that of the conventional procedures—B1 and B2 (see Results and Discussion). The procedures B1–B4 are explained below. Whereas B1 is used with sequence concatenation, B2–B5 are used with separate analysis of equations (5) and (7).
- B1: Multilocus sequences are concatenated for tree reconstruction, and the sequence columns are resampled with replacement from the concatenated data in order to calculate the BPs. The BPs are calculated in a manner similar to that used for the analysis of a single gene (Felsenstein 1985
).
- B2: To generate pseudodata, we resample the sequence columns within each gene without resampling the genes. That is, the BPs for the phylogenetic tree are calculated by iteration of T
. This "stratified sampling" has been used for likelihood methods (e.g., Yoder and Yang 2000
). By comparing the results of B1 and B2, we can see the effects of separate analysis.
- B3: Following the 2-stage bootstrap procedures (Seo et al. 2005
), we resample the genes and sequence columns. However, we do not consider centering and pivoting of the bootstrap guidelines set forth by Hall and Wilson (1991)
. The BPs for the phylogenetic tree are calculated by iterating T
. By comparing the results of B2 and B3, we can see the effects of the resampling genes.
- B4: The Bayesian idea of Efron et al. (1996)
is adopted to calculate the BPs. The centering scheme proposed by Hall and Wilson (1991)
is considered; however, the pivoting scheme is ignored. Ignoring pivoting in equation (10) implies ignoring
and
, leading to
The BPs for the phylogenetic tree are calculated using the iteration of T(2
–
). By comparing the results of B3 and B4, we can see the effects of the centering scheme of bootstrap procedures. Further, by comparing the results of B4 and B5, we can see the effects of the pivoting scheme because centering and pivoting are both considered in B5.
| Results |
|---|
|
|
|---|
Simulation Studies
We performed simulation in order to compare the performance of the bootstrap procedures B1–B5. We used sequence concatenation and separate analysis as represented in equations (5) and (7) to reconstruct the phylogenies of randomly generated genes. When generating genes for the 4 taxa, we considered the variations in terms of the tree topology, branch lengths, and sequence lengths. We assumed that each gene evolved along either of the 2 tree topologies, namely, ((A,B),(C,D)) and ((A,C),(B,D)), which will hereafter be referred to as Trees 1 and 2, respectively. Each gene was determined to evolve along Tree 1 (or Tree 2) with the probability P (or 1 – P). The terminal and internal branch lengths of all genes were random samples obtained from the uniform distributions Unif[0, 0.2] and Unif[0, 0.02], respectively. The sequence length of each gene was a random sample obtained from the uniformly distributed integers between 501 and 3,500. In each gene, random DNA sequences of the 4 taxa were generated using the JC model (Jukes and Cantor 1969
Once we generated n random genes, the pairwise distances for the concatenated sequences and separate genes were calculated using the JC model. For tree reconstruction, the NJ algorithm (Saitou and Nei 1987) was applied to the distance matrix of the concatenated sequences and to the average distance matrix of equation (5). To calculate the BPs, we generated 100 pseudosamples by following the procedures B1–B5. We calculated the BPs for 3 clades, that is, (A,B), (A,C), and (A,D), hereafter referred to as Nodes 1, 2, and 3, respectively.
For a given n and P, the procedure for generating genes and calculating the BPs at Nodes 1–3 was repeated 1, 000 times. Subsequently, we investigated the distribution of the BPs at the 3 Nodes. Node 3 is not supported by either Tree 1 or Tree 2. Nodes 1 and 2 compete with and are complimentary to each other, that is, a high BP at Node 1 implies a low BP at Node 2 and vice versa. Therefore, we will focus on the description of the results at Node 1, although we have shown the figures for all 3 Nodes. Using different values for n and P, we consider the following 4 scenarios:
- Scenario 1 (P = 0.6, n = 5). In this Scenario, we consider a situation in which the number of multilocus genes is relatively small (n = 5) and one of the 2 trees is favored. Because P > 0.5, Tree 1 is favored, and it is reasonable to expect high BPs at Node 1. When the procedures B1 and B2 are used, some data sets exhibit very high BPs (
95%) at Node 1; however, other data sets exhibit very low BPs (
5%) at Node 1 (fig. 3). A low BP at Node 1 is inconsistent with the fact that Tree 1 is favored (P = 0.6). The difference between the calculated values of the BPs for different data sets is more exaggerated when the B1 procedure is used. Of the 1,000 iterations in the B1 procedure, 199 data sets display
5% BP at Node 1. Although extremely low BPs are observed with B3 and B4, their proportions out of 1,000 iterations are smaller than those of B1 and B2. In contrast to B1–B4, B5 displays neither extremely high nor extremely low BPs at Node 1 (see Discussion).
- Scenario 2 (P = 0.6, n = 100). As in Scenario 1, Tree 1 is favored (P = 0.6); however, the number of multilocus genes is relatively large (n = 100). In figure 4, the histograms of the BPs for B1–B5 at Node 1 are concentrated in the higher range, which is consistent with the fact that Tree 1 is favored (P = 0.6). It is notable that some data sets (36 out of 1,000) for B1 exhibit very low BPs (
5%) at Node 1, despite the proportion being small. Because the intergene variations are taken into consideration, B3–B5 have a more scattered distribution than B1 and B2.
- Scenario 3 (P = 0.5, n = 5). The number of multilocus genes is relatively small (n = 5), and both trees are equally favored (P = 0.5). When the procedures B1 and B2 are used, a majority of the data sets display extremely high or low BPs at Node 1 (fig. 5). However, as in the result of Scenario 1, the proportions of the extreme BPs for B3 and B4 are smaller than those of B1 and B2. When B5 is used, the distribution of BPs at Node 1 is similar to that observed in Scenario 1, and most of the data sets are concentrated in the moderate range of BPs.
- Scenario 4 (P = 0.5, n = 100). As in Scenario 3, both trees are equally favored (P = 0.5). However, the number of multilocus genes is relatively large (n = 100). When the procedures B1 and B2 are used, a majority of the data sets display extremely high or low BPs at Node 1 (fig. 6). When the procedures B3–B5 are used, the distribution of BPs at Node 1 is scattered between 0 and 1 and is similar to uniform distribution (see Discussion).
|
|
|
|
Empirical Data Analysis
To compare the performance of the procedures B1–B5, we analyzed the sequence data published by Nishihara et al. (2007)
|
The total length of 2,789 data partitions is 1,011,870 nt. This implies that on an average, 1 data partition has approximately 395 nt. We found that short sequences can create a problem in our distance method because distance estimates have large variances (see Discussion). Therefore, from among the 2,789 data partitions, we selected partitions whose lengths are moderately large. In this data set, there are 508 and 103 data partitions that are longer than 500 and 1,000 nt, respectively, hereafter referred to as data sets A and B, respectively. The total lengths of the data sets A and B are 422,937 and 146,958 nt, respectively. For each data partition within the data sets A and B, we calculated the distances using the Tamura–Nei (TN; Tamura and Nei 1993
We adopted the value of 0.375 as the gamma rate heterogeneity parameter and calculated the TN distances using the formula of Nei and Kumar (2000
, p. 44; for justification of the value of 0.375, see Discussion). Reconstruction of NJ trees with sequence concatenation of the data sets A and B yielded Tree 3 as shown in figure 7. Reconstruction of NJ trees with separate analysis using equations (5) and (7) for the data sets A and B yielded Tree 1. Table 1 shows the BPs for the data sets A and B obtained using the 5 bootstrap procedures at Nodes 1–9 of figure 7.
|
| Discussion |
|---|
|
|
|---|
New Distance Methods Using Multilocus Sequence Data
Here, we suggest new procedures for determining the optimal tree topology and calculating BPs using the distance method and multilocus sequence data. For calculating BPs using the B4 and B5 approaches, we adopt the Bayesian idea of Efron et al. (1996)
The advantage of our procedures over the conventional Bayesian methods (e.g., Rannala and Yang 1996
; Ronquist and Huelsenbeck 2003
) is that they require considerably less computation and can deal with a larger sequence data set. The application of the conventional Bayesian methods is limited to a moderate number of genes and taxa because it requires heavy computation. Our procedures could be used as alternatives to the conventional Bayesian methods when large amounts of sequence data are analyzed.
The Importance of Considering Intergene Variations
Based on the results of figure 3, it is evident that the problem associated with concatenation (B1) can be very serious when the number of genes is small. When n is small, stochastic errors have a greater influence on the sampled genes, and all genes may be incidentally sampled from a tree topology that does not represent the true phylogenetic relationship. In such situations, B1 after sequence concatenation and B2 of stratified sampling may yield high BPs for incorrect subclades. Considering the existence of intergene variations, more care should be exercised during confidently formulating a conclusion based on a small number of genes. Even when all the sampled genes support identical tree topologies, if there are considerable intergene variations in terms of the distance, we reasonably expect that there may be other genes that support different topologies but are incidentally not selected. This expectation prevents us from formulating a confident decision until we obtain sufficient information, that is, a large number of genes. As a result, the BPs calculated using B3–B5 are distributed mostly in the intermediate range when n is small (fig. 3) and are concentrated in the higher range at Node 1 when n is large (fig. 4). When there is little variation or consistency among genes, the procedures B3–B5 can yield confident results even with a small number of genes. In all Scenarios, Node 3 is supported by neither Tree 1 nor Tree 2, which is consistent among all genes. Therefore, B3–B5 yield low BPs at Node 3 even with a small number of genes (figs. 3 and 5).
The poor performance of B1 after concatenation and that of B2 with stratified sampling is very clear in Scenario 4 (fig. 6). The number of genes is large (n = 100). Because P = 0.5, Trees 1 and 2 are equally probable. However, when B1 and B2 are used, a majority of the 1,000 iterations exhibit extremely high (
95%) or low (
5%) BPs at Node 1. Because intergene variations are considered, a more flat and widely scattered distribution is obtained using B3–B5. The distributions of BPs calculated using B3–B5 are similar to the uniform distribution between 0 and 1. This is against our intuitive expectation that BPs would be concentrated around 0.5 at Node 1. A similar problem was noted under the Bayesian framework (Suzuki et al. 2002
; Lewis et al. 2005
). When the multifurcating node is true, the posterior probabilities of a bifurcating node are uniformly distributed between 0 and 1 (Lewis et al. 2005
). The situation of P = 0.5 in Scenario 4 is somewhat similar to the situation of the multifurcating node that Lewis et al. (2005)
considered. When P = 0.5, the true
is located at the boundary between the space of Trees 1 and 2. The distance estimates
are symmetrically distributed around the boundary. If
is incidentally located in the space of Tree 1, a majority of the resampled estimates
s will be located in the space of Tree 1, resulting in a high BP at Node 1. If
is incidentally far away from the boundary, the BP at Node 1 can be extremely high. Similarly, if
is incidentally located in the space of Tree 2, the BP at Node 1 can be low, even extremely low. This intuitively explains why the distributions of the BPs at Node 1 calculated using B3–B5 converge to a uniform distribution between 0 and 1 instead of converging to a point mass at 0.5. If the
s are located exactly at the boundary between the space of Trees 1 and 2, the
s should be symmetrically distributed around the boundary, and the distributions of BPs should converge to a point mass at 0.5. However, this does not happen at all. However, large the amount of data, the
s are symmetrically distributed around the boundary and are not located exactly at it.
In Scenario 3, the BPs at Node 1 calculated using B5 are more concentrated around 0.5, whereas the BPs at Node 1 calculated using B3–B4 are rather uniformly distributed between 0 and 1. The deviation of B5 from the uniform distribution can be explained on the basis of poor approximation for the variance. The approximation of equation (8) is valid when the sequence lengths and number of genes are sufficiently large. Otherwise,
and
are not reliable approximates; therefore, B3 and B4 should be preferred over B5. When the number of genes is large, the distribution of the BP calculated using B5 is similar to the distributions of the BPs calculated using B3 and B4 as shown in Scenarios 2 and 4. This implies that the effects of centering and pivoting are relatively minor when the data set is large. Therefore, we conclude that B3 and B4 are sufficiently reliable for application in most circumstances, despite B5 being, theoretically, a more sophisticated procedure.
We note that different sequence lengths can be considered during phylogenetic reconstruction and 2-stage bootstrap procedures. By considering distance matrices weighted by sequence lengths, we can modify equation (5) and define
as follows:
|
|
Analysis of Mammalian Data
Rate Heterogeneity among Sites
Nishihara et al. (2007)
observed that tree reconstruction by the NJ method (Saitou and Nei 1987
) with the TN model (Tamura and Nei 1993
) yields a misleading tree topology in which rodents (mouse and rat) are placed basally and are grouped together with opossum. They noted that the unreasonable tree reconstruction by the above method may be due to long-branch attraction. Here, we note that ignoring the rate heterogeneity among sites may be another possible explanation for the misleading tree topology. All the data partitions considered for analysis by Nishihara et al. are protein-coding sequences. In general, the third codon position evolves faster than the first and second positions. Therefore, it is appropriate to take into consideration the rate heterogeneity among sites during the calculation of the pairwise distances. It is very important to consider rate heterogeneity among sites during tree reconstruction and ignoring it has been noted to possibly result in the reconstruction of an incorrect phylogeny (e.g., Yang 1995
).
The rate heterogeneity among sites is typically modeled with a gamma distribution whose mean and variance are 1 and 1/
, respectively (Yang 1993
). In principle, it is not possible to estimate
by the distance method, and we applied the ML method to determine the appropriate
for calculating the pairwise distances. PAML (Yang 2007
) was used to estimate
using the TN model (Tamura and Nei 1993
) for the concatenated sequences of the 2,789 data partitions. The estimated
s for Trees 1, 2, and 3 of figure 7 are 0.375, 0.375, and 0.374, respectively. Similarly,
s for the concatenated sequences of the data sets A and B were estimated, and the estimated
s for Trees 1, 2, and 3 of the data set A are 0.363, 0.363, and 0.362, respectively, and those of the data set B are 0.371, 0.372, and 0.370, respectively. The value of 0.375 that we adopted as an
parameter for calculating the pairwise distances is qualitatively similar to the
s estimated from the concatenated data partitions. When
= 0.375 is considered during the calculation of the pairwise distances, the rodent group is properly positioned within the Boreotheria group. The NJ trees for 2,789 data partitions reconstructed using the TN model with sequence concatenation and with separate analysis were observed to be Trees 3 and 1, respectively. This result is consistent with that of Nishihara et al. (2007)
. They applied the ML method with the General Time Reversible (Yang 1994a
) model, and the ML trees reconstructed by sequence concatenation and by separate analysis were observed to be Trees 3 and 1, respectively.
Variations among the Data Partitions
Because different data partitions have been subject to different evolutionary mechanisms, the optimal
for different data partitions will be different. However, we did not consider this variation in calculating the pairwise distances and applied
= 0.375 to all data partitions. It is possible to estimate
s for separate data partitions by the ML method and to apply the separate
s for calculating the pairwise distances for separate data partitions. However, this approach would be impractical when a huge amount of data need to be analyzed, and the ML method also requires a lot of computational time. Our general recommendation is to apply the ML method to concatenated sequences to obtain the appropriate value for
; this is relatively easy even when a huge amount of sequence data need to be analyzed. As shown in our results, the
estimates appear to be quite robust for different trees and different lengths of concatenated data.
We note that a substantial portion of the 2,789 data partitions comprises relatively short sequences. For example, there are 645 (1,991) data partitions whose lengths are smaller than 200 (400) nt. In our distance method, we found that short sequences could cause a problem because distance estimates have a large variance. In the analysis using the ML method conducted by Nishihara et al. (2007)
, they adopted the resampling estimated log-likelihood (RELL; Kishino and Hasegawa 1989
) procedure. That is, instead of resampling the sequence data and recalculating the likelihood scores for the resampled data, they resampled the likelihood scores obtained from the original sequence data. When the sequences are short, the RELL procedure may underestimate the variance because it ignores the variance of the estimation conducted using resampled data. In our distance method, pairwise distances are recalculated for resampled sequence data sets. Because the distance estimates from short sequences are subject to large variations, BPs are also subject to large variations. Although Tree 1 represents the NJ tree reconstructed using the 2,789 data partitions and
of equation (5), the BPs at Nodes 1–6 obtained by the 2-stage bootstrap procedures are relatively low (data not shown). Because of the large variance in the distance estimates, the distance matrices from the resampled data are distributed over the space of many trees, resulting in relatively low BPs for a specific clade. To reduce the problem caused by short sequences, we arbitrarily set the threshold value as 500 nt (for data set A) and 1,000 nt (for data set B) for selection of data partitions. With moderately long data partitions, the NJ trees were identical to those obtained from all 2,789 data partitions and the BPs at Nodes 1–6 were high (table 1).
Problem Associated with Sequence Concatenation
The phylogenetic relationship among Boreotheria, Xenarthra, and Afrotheria remains controversial, and Nishihara et al. (2007)
noted that a greater amount of sequence data and a more appropriate taxon sampling are required to resolve the phylogenetic relationship among these 3 taxa. Nishihara et al. criticized the sequence concatenation approach because it presents a 100% BP for the clade of Xenarthra and Afrotheria (Node 9 of fig. 7). In our analysis, we obtained results that are consistent with those of Nishihara et al. When the number of data partitions is moderate (data set B), the BP at Node 9 calculated using B1 is not extreme; however, when the number of data partitions is large (data set A), it is 100.0%. This leads us to arrive at an incorrect conclusion, which is also demonstrated in our simulation study. For data sets A and B, the BPs at Nodes 7 and 9 calculated using B2–B5 were moderately high, which is consistent with the results of Nishihara et al.s ML method. Whereas there is a drastic difference between B1 and B2–B5 (table 1), the differences among B2–B5 are relatively minor. However, this does not generally imply that B2 is as acceptable as B3–B5. It appears that the effect of separate analysis is dominant over the effect of the 2-stage bootstrap procedures in this example involving mammalian data. As shown in the simulation study, there is a potential problem associated with B2 because it ignores the intergene variations. Therefore, we emphasize that B3–B5 are generally preferable over B1 and B2.
Gene Tree versus Species Tree
In our procedures, we assume that
is consistent with the species tree. Therefore, the species tree can be estimated using
—the arithmetic mean of
. The gene trees reconstructed using
are distributed around the species tree, and we assume that systematic bias has a minor effect on the distribution of
.
Recently, it was noted that gene trees are not necessarily consistent with the species tree due to various evolutionary events, such as incomplete lineage sorting and horizontal gene transfer (Degnan and Rosenberg 2006
). In such situations, the species tree cannot be estimated using a consensus gene tree. Our tree reconstruction method using equations (5) and (7) does not consider systematic bias caused by incomplete lineage sorting and horizontal gene transfer. However, it is notable that incomplete lineage sorting is generally not a major concern when the species are moderately distant (e.g., Degnan and Rosenberg 2006
). Moreover, only a small portion of the genomic loci may be subject to horizontal gene transfer during the evolution of higher organisms. Tree estimation using equations (5) and (7) is suitable for situations in which the effect of systematic bias is minor, and the consensus gene tree is consistent with the species tree.
| Concluding Remarks |
|---|
|
|
|---|
Here, we introduce new procedures for estimating phylogeny and calculating BPs using multilocus sequence data in the context of the distance method. By theoretical investigation, simulation studies, and empirical data analysis, we demonstrated that there is a potential problem associated with the sequence concatenation and that intergene variations should be considered during the measurement of phylogenetic uncertainty. Although B5 is the most sophisticated procedure, its application appears to be limited to data sets comprising a large number of genes and long sequences. Therefore, B3 and B4 would be preferable in most circumstances. Because our distance methods require considerably less computation, they are good alternatives to the Bayesian and likelihood methods when huge amounts of sequence data need to be analyzed.
| Acknowledgements |
|---|
|
|
|---|
The author thanks Jeffrey L. Thorne, Asger Hobolth, Sang Chul Choi, Hirohisa Kishino, Marcy Uyenoyama, and 2 anonymous reviewers for their valuable comments and Hidenori Nishihara for providing data. The author was supported by Japan Society for the Promotion of Science (EYS B-18770216).
| Footnotes |
|---|
Marcy Uyenoyama, Associate Editor
| References |
|---|
|
|
|---|
Adachi J, Waddell PJ, Martin W, Hasegawa M. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast. J Mol Evol (2000) 50:348–358.[Web of Science][Medline]
Anderson TW. Introduction to multivariate statistical analysis (1984) 2nd ed. Hoboken (NJ): John Wiley & Sons, Inc. 596.
Cao Y, Fujiwara M, Nikaido M, Okada N, Hasegawa M. Interordinal relationships and timescale of eutherian evolution as inferred from mitochondrial genome data. Gene (2000) 259:149–158.[CrossRef][Web of Science][Medline]
Cao Y, Sorenson MD, Kumazawa Y, Mindell DP, Hasegawa M. Phylogenetic position of turtles among amniotes: evidence from mitochondrial and nuclear genes. Gene (2000) 259:139–148.[CrossRef][Web of Science][Medline]
Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genet (2006) 2:762–768.[Web of Science]
Edwards SV, Liu L, Pearl DK. High-resolution species trees without concatenation. Proc Natl Acad Sci USA (2007) 102:4436–4441.[CrossRef]
Efron B, Halloran E, Holmes S. Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci USA (1996) 93:13429–13434.
Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution (1985) 39:783–791.[CrossRef][Web of Science]
Hall P, Wilson SR. Two guidelines for bootstrap hypothesis testing. Biometrics (1991) 47:757–762.[CrossRef][Web of Science]
Hallstrom BM, Kullberg M, Nilsson MA, Janke A. Phylogenomic data analyses provide evidence that Xenarthra and Afrotheria are sister groups. Mol Biol Evol (2007) 24:2059–2068.
Jukes TH, Cantor CR. Evolution of protein molecules. In: Mammalian protein metabolism—Munro HN, ed. (1969) New York: Academic Press. 21–132.
Kishino H, Hasegawa M. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. J Mol Evol (1989) 29:170–179.[CrossRef][Web of Science][Medline]
Kolaczkowski B, Thornton JW. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature (2004) 431:980–984.[CrossRef][Medline]
Lewis PO, Holder MT, Holsinger KE. Polytomies and Bayesian phylogenetic inference. Syst Biol (2005) 54:241–253.
Nei M, Kumar S. Molecular evolution and phylogenetics (2000) New York: Oxford University Press.
Nikaido M, Cao Y, Harada M, Okada N, Hasegawa M. Mitochondrial phylogeny of hedgehogs and monophyly of Eulipotyphla. Mol Phylogenet Evol (2003) 28:276–284.[CrossRef][Web of Science][Medline]
Nishihara H, Okada N, Hasegawa M. Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biol (2007) 8:R199.1–R199.10.
Queiroz AD, Gatesy J. The supermatrix approach to systematics. Trends Ecol Evol (2006) 22:34–41.[CrossRef]
Rannala B, Yang Z. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J Mol Evol (1996) 43:304–311.[Web of Science][Medline]
Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.
Saitou N, Nei N. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol (1987) 4:406–425.[Abstract]
Seo T-K, Kishino H, Thorne JL. Incorporating gene-specific variation when inferring and evaluating optimal evolutionary tree topologies from multilocus sequence data. Proc Natl Acad Sci USA (2005) 102:4436–4441.
Shimodaira H, Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol (1999) 16:1114–1116.[Web of Science]
Suzuki Y, Glazko GV, Nei N. Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proc Natl Acad Sci USA (2002) 99:16138–16143.
Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol (1993) 10:512–526.[Abstract]
Yang Z. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol (1993) 10:1396–1401.[Abstract]
Yang Z. Estimating the pattern of nucleotide substitution. J Mol Evol (1994a) 39:105–111.[Web of Science][Medline]
Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol (1994b) 39:306–314.[CrossRef][Web of Science][Medline]
Yang Z. Evaluation of several methods for estimating phylogenetic trees when substitution rates differ over nucleotide sites. J Mol Evol (1995) 40:689–697.[CrossRef][Web of Science]
Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol (2007) 24:1586–1591.
Yoder AD, Yang Z. Estimation of primate speciation dates using local molecular clocks. Mol Biol Evol (2000) 17:1081–1090.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
L. Liu, L. Yu, D. K. Pearl, and S. V. Edwards Estimating Species Phylogenies Using Coalescence Times among Sequences Syst Biol, October 1, 2009; 58(5): 468 - 477. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Liu and S. V. Edwards Phylogenetic Analysis in the Anomaly Zone Syst Biol, August 1, 2009; 58(4): 452 - 460. [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


:=(d











