

MBE Advance Access originally published online on September 13, 2006
Molecular Biology and Evolution 2006 23(12):2274-2278; doi:10.1093/molbev/msl116

© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Letters

The Influence of Phylogenetic Uncertainty on the Detection of Positive Darwinian Selection

Marcio R. Pie

Departamento de Zoologia, Universidade Federal do Paraná, Curitiba, PR, Brazil

E-mail: pie{at}ufpr.br.


    Abstract
 
The power of maximum likelihood tests of positive selection on protein-coding genes depends heavily on detecting and accounting for potential biases in the studied data set. Although the influence of transition:transversion and codon biases has been investigated in detail, little is known about how inaccuracy in the phylogeny used during the calculations affects the performance of these tests. In this study, 3 empirical data sets are analyzed using sets of simulated topologies corresponding to low, intermediate, and high levels of phylogenetic uncertainty. The detection of positive selection was largely unaffected by errors in the underlying phylogeny. However, the number of sites identified as being under positive selection tended to be overestimated.

Key Words: adaptive evolution • molecular phylogeny • likelihood ratio tests

The identification of the selective pressures that shape genetic variation has become a major goal of molecular evolution studies over the past few decades (Yang and Bielawski 2000; Yang and Nielsen 2000; Yang et al. 2000; Nielsen 2001; Wong et al. 2004). Traditionally, selection on protein-coding genes is assessed by estimating ω, the ratio between nonsynonymous (dN) and synonymous (dS) substitution rates. Positive selection is identified whenever ω > 1 (i.e., dN is higher than dS), whereas the cases of ω = 1 and ω < 1 indicate neutral evolution and purifying selection, respectively.
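The classification above can be written down directly. The following is only a minimal sketch: the function name `classify_selection` and the tolerance `tol` are illustrative assumptions, not part of any software used in this study, and in practice ω > 1 is established with a statistical test rather than read off a point estimate.

```python
def classify_selection(dn: float, ds: float, tol: float = 1e-9) -> str:
    """Classify the selective regime from omega = dN / dS.

    `tol` is a hypothetical tolerance for treating omega as equal to 1.
    """
    if ds <= 0:
        raise ValueError("dS must be positive to compute omega")
    omega = dn / ds
    if abs(omega - 1.0) <= tol:
        return "neutral"
    return "positive" if omega > 1.0 else "purifying"
```

For example, `classify_selection(0.05, 0.2)` corresponds to ω = 0.25 and is labeled as purifying selection.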

Following the development of a variety of tests of positive selection (see Yang and Bielawski 2000 and included references), considerable attention has been devoted to understanding how several potential sources of error might affect the performance of those tests, such as biases in transition/transversion rate ratio (Li et al. 1985; Li 1993; Pamilo and Bianchi 1993) and codon usage (Goldman and Yang 1994; Yang and Nielsen 2000). For instance, ignoring the transition/transversion bias can lead to an overestimation of dS and a consequent underestimation of ω (Li et al. 1985). However, virtually all currently available methods that use phylogenetic information in the estimation of selection parameters rely on the assumption that the phylogeny of the studied sequences is known with certainty, a condition that might not be met in real data sets. In this study, the influence of violating this assumption is investigated, namely, how the use of suboptimal phylogenetic trees affects inferences of positive Darwinian selection based on codon models (Nielsen and Yang 1998; Yang et al. 2000). Although this source of bias has been addressed by Suzuki and Gojobori (1999) and Yang et al. (2000), no study to date has investigated this issue systematically.

Three genes were selected for this study: isoeugenol-O-methyltransferase (IEMT, 310 codons, average divergence of 23.2%), an enzyme involved in the production of floral volatile compounds (Barkman 2003); CD45 (331 codons, average divergence of 6.9%), a highly expressed surface protein of mammal lymphocytes (Filip and Mundy 2004); and pantophysin (209 codons, average divergence of 7.32%), a vesicle-trafficking protein found in fish (Pogson and Mesa 2004). Both pantophysin and CD45 are thought to be under the influence of positive selection across the entire data set. IEMT, on the other hand, seems to have evolved under positive selection from within a clade of genes that are mostly under purifying selection, the caffeic acid-O-methyltransferases (COMT). The CD45 alignment was obtained from the original study, whereas the other data sets were aligned without gaps. Their phylogenetic relationships based on unweighted parsimony are shown in figure 1.


 
FIG. 1.— Phylogenetic relationships among the sequences used in the present study based on unweighted maximum parsimony. (A) CD45; (B) COMT/IEMT; and (C) pantophysin. Numbers above nodes represent bootstrap values following 10,000 pseudoreplicates (10 random taxon addition replicates in each turn). Note that the CD45 phylogeny places the chimpanzee as more closely related to the gorilla than to human, a relationship that disagrees with our current understanding of primate systematics. However, that was the best topology under unweighted parsimony and is shown that way to maintain consistency across the analyses.

 
Three levels of phylogenetic uncertainty were determined as follows. First, 50,000 unrooted topologies were simulated using a Markov model (or all 10,395 possible topologies in the case of the CD45 data set). Tree lengths of all studied topologies were then evaluated using unweighted maximum parsimony and ranked from lowest to highest. Finally, 20 topologies were sampled from the first, fifth, and tenth percentiles to represent low, intermediate, and high levels of phylogenetic uncertainty, respectively (fig. 2). The use of tree lengths as a proxy for phylogenetic uncertainty is based on the assumption that shorter trees by parsimony are more likely to be correct than longer trees, in addition to having higher log likelihood under the likelihood criterion, as is commonly observed in molecular phylogenetics. Tree simulation and evaluation were conducted using the software PAUP* 4.0b10 (Swofford 1998).
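The ranking and sampling steps can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually run in PAUP*: the function names are hypothetical, and "the p-th percentile" is read here as the band of ranks between (p − 1)% and p% of the sorted tree lengths. The first helper uses the standard (2n − 5)!! count of unrooted bifurcating topologies, which gives 10,395 for 8 taxa, matching the figure quoted for the CD45 data set.

```python
import random


def num_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted bifurcating topologies, (2n - 5)!! for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


def sample_by_percentile(tree_lengths, percentile, n_samples=20, seed=0):
    """Sample indices of topologies whose length rank lies in one percentile band.

    Assumption: the band for percentile p covers ranks from (p - 1)% to p%
    of the topologies after sorting by parsimony tree length (shortest first).
    """
    ranked = sorted(range(len(tree_lengths)), key=lambda i: tree_lengths[i])
    n = len(ranked)
    lo = (percentile - 1) * n // 100
    hi = percentile * n // 100
    band = ranked[lo:hi]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(band, min(n_samples, len(band)))
```

Calling `sample_by_percentile(lengths, 1)`, `sample_by_percentile(lengths, 5)`, and `sample_by_percentile(lengths, 10)` on the same list of tree lengths would yield the low, intermediate, and high uncertainty sets under this reading.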


 
FIG. 2.— Frequency distribution of tree lengths for each data set used in the present study. The tree-length ranges for the first, fifth, and tenth percentiles were IEMT: 1,361–1,549, 1,609–1,617, and 1,643–1,647; CD45: 209–232, 265–268, and 276–278; and pantophysin: 386–440, 462–464, and 472–474, respectively. See text for details.

 
Tests for positive selection were conducted based on the maximum likelihood approach of Yang et al. (2000) as implemented in PAML 3.14b (Yang 1997). In this approach, several models of codon substitution are fitted to the data, and their relative fit is evaluated using likelihood ratio tests (LRTs). The first step is to test for the existence of positive selection (i.e., the presence of sites with ω > 1) in the data set. This is achieved by contrasting a null model that does not allow sites with ω > 1 with a more general model in which that condition is allowed. Statistical evidence for positive selection is obtained when the model that allows for positive selection fits significantly better than the null model. Significance is assessed using the likelihood ratio statistic, calculated as twice the difference in log-likelihood scores (2ΔlnL) between the 2 models, which is compared with a χ² distribution with degrees of freedom equal to the difference in the number of estimated parameters between the models. Three different LRTs were conducted. The first comparison is between M0 (1 ratio), a model that fits a single ω averaged over all sites, and M3 (discrete), which has 3 discrete classes of sites, each with its own ω. The second is a comparison between M1a (nearly neutral model), which allows for 2 site classes (0 < ω0 < 1 or ω1 = 1), and M2a (selection), which has an additional site class (ω > 1). The third test is a comparison between a model of beta-distributed selective pressures, which allows for 10 site classes, each with ω < 1 (beta, M7), and another model with 11 site classes, one of which allows for ω > 1 (M8). This last comparison is one of the most stringent tests of positive selection (Anisimova et al. 2001). Equilibrium codon frequencies were calculated from the average nucleotide frequencies at the 3 codon positions. Calculations using codon frequencies as free parameters provided very similar qualitative behavior (not shown) and are available from the author upon request. Stop codons were removed from all sequences prior to the analyses.
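As an illustration of the LRT arithmetic described above, the sketch below computes 2ΔlnL and its χ² tail probability using only the Python standard library. The function names are hypothetical, the closed-form tail probability holds for even degrees of freedom only, and the degrees of freedom listed in `LRT_DF` are an assumption based on the usual parameter counts for these model pairs rather than values stated in the text.

```python
import math

# Assumed degrees of freedom (difference in number of free parameters)
# for the three model comparisons; not taken from the text itself.
LRT_DF = {"M0 vs M3": 4, "M1a vs M2a": 2, "M7 vs M8": 2}


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):  # sum of (x/2)^k / k! for k < df/2
        term *= half / k
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(lnl_null: float, lnl_alt: float, df: int):
    """Return (2 * delta lnL, p value) for a pair of nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf_even_df(stat, df)
```

With hypothetical log likelihoods of -1000.0 for the null model and -995.0 for the alternative, `likelihood_ratio_test(-1000.0, -995.0, LRT_DF["M7 vs M8"])` gives a statistic of 10.0, which exceeds the 5% critical value of 5.99 for 2 degrees of freedom.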

Once evidence for positive selection was obtained, further tests were performed to identify the codon sites under selection. This evaluation was obtained by using the 2 methods implemented in the M2a, M3, and M8 models: naïve empirical Bayes (NEB) and Bayes empirical Bayes (BEB) (Yang et al. 2005). Positions with a high probability of being part of the ω > 1 class are inferred as more likely to be under the influence of positive selection.
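The site-level calculation can be illustrated with a simplified NEB-style posterior: given the estimated proportion of each site class and the likelihood of the data at one codon site under each class, the posterior probability of a class is proportional to their product. This is a sketch with hypothetical names and inputs, not PAML's implementation; BEB additionally accounts for uncertainty in the parameter estimates (Yang et al. 2005), which is not shown here.

```python
def neb_posterior(site_likelihoods, class_proportions):
    """Posterior probability of each site class for a single codon site.

    P(class k | data) is proportional to p_k * f(data | omega_k), where the
    proportions p_k are plugged in as fixed estimates (the NEB idea).
    """
    if len(site_likelihoods) != len(class_proportions):
        raise ValueError("one likelihood is needed per site class")
    weighted = [p * lik for p, lik in zip(class_proportions, site_likelihoods)]
    total = sum(weighted)
    if total <= 0:
        raise ValueError("likelihoods must give a positive total")
    return [w / total for w in weighted]
```

A site would then be flagged as a candidate for positive selection when the posterior for the ω > 1 class exceeds a chosen cutoff.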

In general, the likelihood ratio statistic increased with increasing inaccuracy in the topology used (table 1), suggesting the possibility of a corresponding increase in type I statistical errors. However, the outcomes of the LRTs of positive selection were unaffected by this trend, indicating that all 3 tests of positive selection (M0 vs. M3, M1a vs. M2a, and M7 vs. M8) are robust to errors in the underlying phylogeny. In other words, as long as the topology is not "too wrong," the results of these tests are reliable. Parameter estimates under each model were also robust, with no apparent systematic bias (table 2).



 
Table 1 LRT Statistics of Positive Selection with Small (1%), Intermediate (5%), and High (10%) Levels of Phylogenetic Uncertainty

 


 
Table 2 Parameter Estimates under Different Models of Sequence Evolution with Increasing Levels of Phylogenetic Uncertainty

 
If one assumes that the best estimates were provided by the topologies with the smallest tree lengths according to maximum parsimony, phylogenetic uncertainty caused considerable overestimation in the number of sites under positive selection in the CD45 data set, both using NEB and BEB (table 3). A similar (yet much weaker) effect was observed in the pantophysin data set (table 3). An inspection of figure 2 indicates that the frequency distribution of tree lengths in the pantophysin data set was more skewed than in the CD45 data set, suggesting a higher phylogenetic signal. However, it is unclear whether the poorer performance of NEB and BEB in the CD45 data set is due to the lower phylogenetic signal or the smaller number of sequences.



 
Table 3 Average Number of Sites under Positive Selection According to the NEB and BEB Methods, and Respective Standard Errors in Each Data Set

 
Simulation studies have shown that tests of positive selection can be powerful even with sequences as short as 50 codons (Whelan and Goldman 1999; Anisimova et al. 2001). However, such short data sets are unlikely to provide reliable information on the phylogenetic relationships among the studied sequences. A suitable approach in such cases would be to infer the phylogeny of the respective taxa using additional sequences from other regions. Alternatively, phylogenetic uncertainty can be explicitly incorporated into the analysis using a Bayesian approach.


    Acknowledgements
 
T.J. Barkman, N.I. Mundy, L.C. Filip, G.H. Pogson, and K.A. Mesa kindly shared the data sets used in their original studies, and M.K. Tschá provided assistance in the compilation of the simulation results.


    Footnotes
 
Ziheng Yang, Associate Editor


    References
 

    Anisimova M, Bielawski JP, Yang Z. (2001) Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol Biol Evol 18:1585–1592.

    Barkman TJ. (2003) Evidence for positive selection on the floral scent gene isoeugenol-O-methyltransferase. Mol Biol Evol 20:168–172.

    Filip LC and Mundy NI. (2004) Rapid evolution by positive Darwinian selection in the extracellular domain of the abundant lymphocyte protein CD45 in primates. Mol Biol Evol 21:1504–1511.

    Goldman N and Yang Z. (1994) A codon-based model of nucleotide substitution for protein-coding genes. Mol Biol Evol 11:725–736.

    Li WH. (1993) Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol 36:96–99.

    Li W-H, Wu C-I, Luo CC. (1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitutions considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol 3:418–426.

    Nielsen R. (2001) Statistical tests of selective neutrality in the age of genomics. Heredity 86:641–647.

    Nielsen R and Yang Z. (1998) Likelihood models for detecting positively selected amino-acid sites and applications to the HIV-1 envelope gene. Genetics 148:929–936.

    Pamilo P and Bianchi NO. (1993) Evolution of the Zfx and Zfy genes—rates and interdependence between the genes. Mol Biol Evol 10:271–282.

    Pogson GH and Mesa KA. (2004) Positive Darwinian selection at the pantophysin (Pan I) locus in marine gadid fishes. Mol Biol Evol 21:65–75.

    Suzuki Y and Gojobori T. (1999) A method for detecting positive selection at single amino acid sites. Mol Biol Evol 16:1315–1328.

    Swofford DL. (1998) PAUP*: phylogenetic analysis using parsimony (*and other methods). (Sinauer Associates, Sunderland (MA)).

    Whelan S and Goldman N. (1999) Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics. Mol Biol Evol 16:1292–1299.

    Wong WS, Yang YZ, Goldman N, Nielsen R. (2004) Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 168:1041–1051.

    Yang Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–556.

    Yang Z and Bielawski JP. (2000) Statistical methods for detecting molecular adaptation. Trends Ecol Evol 15:496–503.

    Yang Z and Nielsen R. (2000) Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 17:32–43.

    Yang Z, Nielsen R, Goldman N, Pedersen A-MK. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.

    Yang Z, Wong WS, Nielsen R. (2005) Bayes empirical Bayes inference of amino acid sites under positive selection. Mol Biol Evol 22:1107–1118.

Accepted for publication September 7, 2006.





