

MBE Advance Access originally published online on January 14, 2008
Molecular Biology and Evolution 2008 25(6):1007-1015; doi:10.1093/molbev/msn005

© The Author 2008. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

The McDonald–Kreitman Test and Slightly Deleterious Mutations

Jane Charlesworth* and Adam Eyre-Walker*,†

* Centre for the Study of Evolution, University of Sussex, Brighton, United Kingdom
† National Evolutionary Synthesis Center, Durham, North Carolina

E-mail: a.c.eyre-walker@sussex.ac.uk


    Abstract
 
It is possible to estimate the proportion of substitutions that are due to adaptive evolution using the numbers of silent and nonsilent polymorphisms and substitutions in a McDonald and Kreitman-type analysis. Unfortunately, this estimate of adaptive evolution is biased downward by the segregation of slightly deleterious mutations. It has been suggested that 1 way to cope with the effects of these slightly deleterious mutations is to remove low-frequency polymorphisms from the analysis. We investigate the performance of this method theoretically. We show that although removing low-frequency polymorphisms does indeed reduce the bias in the estimate of adaptive evolution, the estimate is always downwardly biased, often to the extent that one would not be able to detect adaptive evolution, even if it existed. The method is reasonably satisfactory, only if the rate of adaptive evolution is high and the distribution of fitness effects for slightly deleterious mutations is very leptokurtic. Our analysis suggests that adaptive evolution could be quite prevalent in humans (>8%) and still not be detectable using current methodologies. Our analysis also suggests that the level of adaptive evolution has probably been underestimated, possibly substantially, in both bacteria and Drosophila.

Key Words: adaptive evolution • neutral theory • deleterious mutations


    Introduction
 
The McDonald and Kreitman (1991, MK) test and its derivatives (Fay et al. 2001; Bustamante et al. 2002, 2005; Smith and Eyre-Walker 2002; Sawyer et al. 2003; Bierne and Eyre-Walker 2004) are commonly used methods for testing for the presence and measuring the level of adaptive evolution. These methods compare diversity within a species with the divergence between species at 2 types of site. The mutations at 1 of these categories of site are assumed to be neutral, whereas mutations at the other sites are assumed to be strongly deleterious, neutral, or strongly advantageous. The 2 types of site might be synonymous and nonsynonymous sites in a protein-coding gene, for example, or intron and untranslated region (UTR) sites. Under these assumptions, it is possible to test whether adaptive evolution has occurred and to estimate the proportion of substitutions, at the sites subject to selection, that were a consequence of adaptive evolution.

Unfortunately, both the test for adaptive evolution and the estimate of the level of adaptive evolution are affected by slightly deleterious mutations. A slightly deleterious mutation is a mutation upon which purifying selection acts only very weakly so that its fate is determined by both selection and random genetic drift. If there are slightly deleterious mutations segregating in the population, then it becomes more difficult to detect adaptive evolution and the level of adaptive evolution is underestimated (Fay et al. 2001). Because there is ample evidence that some nonsynonymous (Rand and Kann 1996; Nachman 1998; Cargill et al. 1999; Fay et al. 2001, 2002; Hughes 2005; Charlesworth and Eyre-Walker 2006) and noncoding (Andolfatto 2005; Drake et al. 2006; Asthana et al. 2007) mutations are slightly deleterious, this can be a problem for estimating the amount of adaptive evolution that has occurred.

Fay et al. (2001) have suggested a very simple way in which one can control at least some of the effects of slightly deleterious mutations in MK-type analyses: remove polymorphisms segregating at low frequencies. Because slightly deleterious mutations are likely to segregate at lower frequencies in the population than neutral variants, removing low-frequency variants preferentially rids the analysis of slightly deleterious mutations. This approach seems to work. In data sets in which there is evidence of slightly deleterious mutations, removing low-frequency polymorphisms increases the proportion of substitutions that are estimated to have been subject to adaptive evolution (e.g., Fay et al. 2001, 2002; Bierne and Eyre-Walker 2004; Andolfatto 2005; Charlesworth and Eyre-Walker 2006).

However, there are a number of issues with this approach. First, there has been no systematic attempt to investigate the performance of this method. Does the method recover the true level of adaptive evolution if enough low-frequency polymorphisms are excluded? If it does not, how biased are the estimates? Second, are there any guiding principles we can use to determine the frequency below which we should exclude polymorphisms from the analysis? So far, a variety of different cutoff frequencies have been used; for example, Fay et al. (2001, 2002) define rare polymorphisms as segregating at a frequency <15% and <12.5% for humans and Drosophila, respectively, whereas Zhang and Li (2005) use a cutoff of 15%, Charlesworth and Eyre-Walker (2006) use only the most common class of polymorphisms (>33%), and Bierne and Eyre-Walker (2004) and Proschel et al. (2006) only exclude singleton polymorphisms.

In this manuscript, we investigate how the removal of low-frequency polymorphisms affects the estimate of adaptive evolution when there are slightly deleterious mutations segregating. We address this question from a theoretical perspective.


    Theory
 
We will phrase the problem in terms of nonsilent and silent sites, where these categories of site might be, for example, nonsynonymous and synonymous sites or 5' UTR and intron sites. Let the number of nonsilent substitutions per site be dn, the number of nonsilent polymorphisms per site be pn, the number of silent substitutions per site be ds, and the number of silent polymorphisms per site be ps. If we assume that all silent mutations are neutral and all nonsilent mutations are strongly deleterious, neutral, or strongly advantageous, then it can be shown that the proportion of nonsilent substitutions that were fixed by positive selection can be estimated as

Formula (1)
(Charlesworth 1994; Fay et al. 2001; Smith and Eyre-Walker 2002). Here we investigate how this estimated value of α is altered by the inclusion of slightly deleterious mutations and the removal of variants segregating at different frequencies. We compare these estimates with the true level of adaptive evolution (αtrue).
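As a minimal sketch of equation 1, assuming it takes the standard McDonald–Kreitman form α = 1 − (ds pn)/(dn ps), the estimator can be written as follows. The function name and the example values are hypothetical and serve only as an illustration.

```python
def mk_alpha(dn, ds, pn, ps):
    """Estimate alpha = 1 - (ds * pn) / (dn * ps) from per-site counts."""
    if dn <= 0 or ps <= 0:
        raise ValueError("dn and ps must be positive")
    return 1.0 - (ds * pn) / (dn * ps)

# Hypothetical per-site values, chosen only to illustrate the arithmetic.
example = mk_alpha(dn=0.02, ds=0.10, pn=0.001, ps=0.010)
```

With these illustrative inputs, pn/ps is half of dn/ds, so half of the nonsilent substitutions would be attributed to adaptive evolution.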

Let us imagine that we have sampled n sequences from a diploid population that is stationary in size and at equilibrium. Assuming that silent mutations are neutral, the expected number of silent polymorphisms segregating per site in i of n sequences is

Formula (2)
(Watterson 1975), where θ = 4Neu, u is the nucleotide mutation rate, and Ne is the effective population size. Note that we assume here that we do not know the direction in which the mutation occurred (for example, whether a site segregating a C and a T was a C -> T or a T -> C mutation); therefore, polymorphisms segregating in i and n-i sequences are indistinguishable. The number of nonsilent polymorphisms segregating in i of n sequences is
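For reference, the standard neutral expectation for a folded site frequency spectrum that is consistent with these definitions can be written as below. This is a reconstruction from the surrounding text, not a transcription of equation 2.

```latex
P_s(i) = \theta\left(\frac{1}{i} + \frac{1}{n-i}\right), \qquad 1 \le i < \frac{n}{2},
\qquad\text{and}\qquad
P_s(i) = \frac{\theta}{i} \quad\text{when } i = n - i .
```

The second case avoids counting the same class twice when the two indistinguishable classes coincide.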

Formula (3)
where

Formula
S = 4Nes and s is the strength of selection acting in favor of a mutation. H(S, x) is half the time a new mutation spends between a frequency of x and x + dx (Wright 1938) (it is only half the time because a factor of 2 has been subsumed into the parameter θ), and Q(n, i, x) is the probability of observing a mutation that is segregating at a frequency of x in i or n-i chromosomes. Z(V, S) is the distribution of S, that is, the distribution of fitness effects governed by parameters contained in the vector V. If we select an appropriate distribution of fitness effects, we can allow a proportion of mutations to be slightly deleterious. We also examine cases in which some mutations are slightly advantageous.
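The inner integral over x can be sketched numerically for a single value of S. The forms of H and Q below are assumptions: H is the standard diffusion sojourn density with the factor of 2 absorbed into θ, and Q is the folded binomial sampling probability. The function names and the midpoint-rule integration are illustrative choices, not the authors' implementation.

```python
import math

def sel_ratio(S, x):
    """(1 - exp(-S(1-x))) / (1 - exp(-S)), evaluated stably for either sign of S."""
    if abs(S) < 1e-9:
        return 1.0 - x                      # neutral limit
    if S > 0:
        return math.expm1(-S * (1.0 - x)) / math.expm1(-S)
    a = -S
    return math.exp(-a * x) * math.expm1(-a * (1.0 - x)) / math.expm1(-a)

def H(S, x):
    """Assumed sojourn density (factor of 2 absorbed into theta)."""
    return sel_ratio(S, x) / (x * (1.0 - x))

def Q(n, i, x):
    """Probability that a variant at frequency x is seen in i or n-i of n sequences."""
    c = math.comb(n, i)
    if 2 * i == n:                          # the two classes coincide
        return c * x**i * (1.0 - x)**i
    return c * (x**i * (1.0 - x)**(n - i) + x**(n - i) * (1.0 - x)**i)

def folded_count(S, n, i, theta=1.0, steps=4000):
    """Midpoint-rule integral of theta * H(S, x) * Q(n, i, x) over 0 < x < 1."""
    h = 1.0 / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h
        total += H(S, x) * Q(n, i, x)
    return theta * total * h
```

Under these assumptions, setting S = 0 recovers the neutral folded expectation θ(1/i + 1/(n−i)), and a deleterious S yields fewer polymorphisms in every class. Averaging `folded_count` over a distribution of S would give the full double integral.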

The expressions for the numbers of silent and nonsilent substitutions are

Formula (4)

Formula (5)
where

Formula
and t is the divergence time of the 2 species being considered. R(S) is 2N times the probability of fixation of a new advantageous mutation of selective strength s, where N is the census population size (Kimura 1962, 1983).
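A minimal sketch of the fixation term follows, assuming R(S) takes the usual diffusion form S/(1 − e^−S), which equals 1 for a neutral mutation. The ratio dn/ds is then the average of R(S) over the distribution of fitness effects; here that distribution is a hypothetical list of discrete classes rather than a continuous function.

```python
import math

def fixation_ratio(S):
    """R(S) = S / (1 - exp(-S)), with the neutral limit R(0) = 1."""
    if abs(S) < 1e-9:
        return 1.0
    if S < 0:
        # Equivalent form that cannot overflow for strongly deleterious S.
        return S * math.exp(S) / math.expm1(S)
    return S / -math.expm1(-S)

def dn_over_ds(dfe):
    """Average R(S) over a discrete DFE given as (S, proportion) pairs."""
    return sum(weight * fixation_ratio(S) for S, weight in dfe)

# Hypothetical DFE: 20% neutral, 80% strongly deleterious.
ratio = dn_over_ds([(0.0, 0.2), (-1000.0, 0.8)])
```

In this illustrative case only the neutral class contributes to divergence, so dn/ds equals the neutral proportion.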

Distribution of Fitness Effects
We incorporate slightly deleterious mutations within our model by assuming that the distribution of fitness effects for nonsilent mutations is a continuous function. We could alternatively have modeled them by assuming that there was a group of mutations that were slightly deleterious but all subject to the same strength of selection. However, this model seems unrealistic because there is evidence that the strength of selection acting upon nonsynonymous mutations follows a continuous distribution (e.g., Nielsen and Yang 2003; Piganeau and Eyre-Walker 2003; Sawyer et al. 2003; Yampolsky et al. 2005; Eyre-Walker et al. 2006; Loewe and Charlesworth 2006).

We consider 4 distributions of fitness effects that contain a proportion of mutations that are slightly deleterious and also proportions of mutations that are effectively neutral and strongly deleterious. First, we consider the gamma distribution that has been widely used to model the distribution of fitness effects (Keightley 1994, 1996; Nielsen and Yang 2003; Piganeau and Eyre-Walker 2003; Eyre-Walker et al. 2006; Loewe and Charlesworth 2006). There is evidence that this distribution of fitness effects fits data from human nonsynonymous single nucleotide polymorphisms (SNPs) (Eyre-Walker et al. 2006); it is also the distribution of fitness effects predicted by Fisher's geometrical model (Martin and Lenormand 2006; Gu 2007). The gamma distribution is governed by 2 parameters, a shape parameter, β, and the mean strength of selection, S̄:

Formula (6)
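One standard way to write a gamma density with shape β and mean S̄ is shown below, where x = −S > 0 is the strength of selection against a deleterious mutation. Whether equation 6 uses exactly this parameterization is an assumption.

```latex
Z(\beta, \bar{S};\, x) = \frac{(\beta/\bar{S})^{\beta}}{\Gamma(\beta)}\, x^{\beta - 1}\, e^{-\beta x/\bar{S}}, \qquad x = -S > 0 .
```

Small values of β concentrate mass both near zero and in a long tail of strongly deleterious effects, which is the leptokurtic case discussed in the text.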

Second, we consider the lognormal distribution that has recently been suggested as an alternative to the gamma distribution by Loewe and Charlesworth (2006). This distribution seems to fit data from Drosophila rather better than the gamma distribution (Loewe and Charlesworth 2006). This distribution is also governed by 2 parameters, the mean strength of selection, S̄, and a shape parameter, σ.

Formula (7)
where

Formula

However, both the gamma and lognormal are unrealistic distributions because one might reasonably expect there to be slightly advantageous back mutations if there are slightly deleterious mutations (Charlesworth and Eyre-Walker 2007). For example, if a site is fixed for T and a slightly deleterious C mutation arises with selection strength −s, then it seems reasonable to assume that if the site subsequently becomes fixed for C, by random genetic drift, then a new T mutation would have a selective strength +s. As a consequence, Piganeau and Eyre-Walker (2003) have suggested modifying the gamma distribution of fitness effects in the following manner. Let us assume that every site can be occupied by 2 alleles, A1 that has an advantage of +s relative to A2 and A2 that has a disadvantage of −s relative to A1. If we assume the mutation rate is very low such that Neu << 1, then it can be shown that the equilibrium frequency of the A1 allele is

Formula (8)
(Li 1987; Bulmer 1991). This is equivalently the proportion of similar sites that are fixed for A1 or the time for which a site is fixed for the A1 allele. If we assume that the distribution of the absolute value of S is gamma or lognormal, the realized distribution of fitness effects becomes, respectively:

Formula (9)
Piganeau and Eyre-Walker (2003) refer to the gamma distribution version of this distribution as the partially reflected gamma (PRG); it seems appropriate to call the lognormal equivalent the partially reflected lognormal (PRLN). Some examples of these distributions are given in figure 1.
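The logic of the partial reflection can be sketched as follows, assuming the fixation term has the form R(S) = S/(1 − e^−S). This is a reconstruction of the argument and may differ in notation from equations 8 and 9. For a site with |S| = S > 0, the ratio of the fixation rates toward and away from A1 fixes the equilibrium proportion of sites occupied by A1:

```latex
\frac{R(S)}{R(-S)} = e^{S}
\quad\Longrightarrow\quad
f_{A_1} = \frac{e^{S}}{1 + e^{S}} = \frac{1}{1 + e^{-S}} .
```

A new mutation is advantageous (+S) only if it arises at a site currently fixed for A2, and deleterious (−S) only if it arises at a site fixed for A1. Weighting the distribution of |S| by these proportions gives a single expression valid for both signs of S:

```latex
Z_{\mathrm{PR}}(S) = \frac{Z(|S|)}{1 + e^{S}}, \qquad -\infty < S < \infty .
```

This density integrates to 1, and most of its mass lies on the deleterious side because sites spend most of their time fixed for the advantageous allele.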


 
FIG. 1.— Examples of (a) the gamma, (b) the PRG, (c) the lognormal, and (d) the PRLN distributions. The distributions shown are those used in figures 2 and 3 for the case when αtrue = 0.25. Curves from top to bottom for the gamma and PRG distributions are for shape parameters of 1, 0.5, 0.25, and 0.125. Curves for the lognormal and PRLN are for shape parameters 8, 6, 4, and 2.

 
Adaptive Evolution
Under the gamma and lognormal distributions, there are no advantageous mutations; and under the PRG and PRLN distributions, advantageous mutations are weakly selected and compensate for slightly deleterious mutations. To include strongly advantageous substitutions, we do not model them explicitly within the distribution of fitness effects but allow a certain proportion, αtrue, of substitutions to be a consequence of adaptive evolution. Equation 5 then becomes

Formula (10)
Because we assume the mutations responsible for these substitutions are strongly selected, we can ignore them, to good approximation, in the expression for the number of nonsynonymous polymorphisms because they contribute relatively little to polymorphism (Smith and Eyre-Walker 2002).

Using equations 2–4 and 10, we can write an expression for the estimated proportion of substitutions which were due to strongly advantageous mutations when we remove mutations segregating in fewer than k sequences as

Formula (11)

This expression depends only on n, k, and V because θ and λ cancel out. This is the equation we use to study the effects of slightly deleterious and slightly advantageous mutations on the estimate of α and the effect that removing low-frequency polymorphisms has on this estimate. It should be noted that by using the equations above, we are assuming that we have an infinite amount of data because equations 2–4 and 10 give the expected values of pn, ps, dn, and ds. However, because α is essentially an odds ratio and there are well-established methods for obtaining unbiased estimates of odds ratios, such as the Mantel–Haenszel procedure, equation 11 also provides us with the expected value of αest when sample sizes are limited.
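The effect of the cutoff can be illustrated with a simplified sketch. It assumes the standard forms H(S, x) = (1 − e^−S(1−x))/(x(1−x)(1 − e^−S)) and R(S) = S/(1 − e^−S), folded binomial sampling, and dn/ds equal to the mean of R(S) divided by (1 − αtrue). The continuous distribution of fitness effects is replaced by a hypothetical three-class distribution, so this shows the behaviour of the estimator rather than reproducing the authors' calculations.

```python
import math

def sel_ratio(S, x):
    """(1 - exp(-S(1-x))) / (1 - exp(-S)), stable for either sign of S."""
    if abs(S) < 1e-9:
        return 1.0 - x
    if S > 0:
        return math.expm1(-S * (1.0 - x)) / math.expm1(-S)
    a = -S
    return math.exp(-a * x) * math.expm1(-a * (1.0 - x)) / math.expm1(-a)

def folded_count(S, n, i, steps=2000):
    """Expected polymorphisms (theta = 1) seen in i or n-i of n sequences."""
    c = math.comb(n, i)
    h = 1.0 / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h
        if 2 * i == n:
            q = c * x**i * (1.0 - x)**i
        else:
            q = c * (x**i * (1.0 - x)**(n - i) + x**(n - i) * (1.0 - x)**i)
        total += sel_ratio(S, x) / (x * (1.0 - x)) * q
    return total * h

def fixation_ratio(S):
    """R(S) = S / (1 - exp(-S)), with R(0) = 1."""
    if abs(S) < 1e-9:
        return 1.0
    if S < 0:
        return S * math.exp(S) / math.expm1(S)
    return S / -math.expm1(-S)

def alpha_est(dfe, n, k, alpha_true, steps=2000):
    """Expected alpha estimate when variants in fewer than k sequences are dropped."""
    classes = range(k, n // 2 + 1)          # folded classes that are retained
    pn = sum(w * folded_count(S, n, i, steps) for S, w in dfe for i in classes)
    ps = sum(folded_count(0.0, n, i, steps) for i in classes)
    dn_ds = sum(w * fixation_ratio(S) for S, w in dfe) / (1.0 - alpha_true)
    return 1.0 - (pn / ps) / dn_ds

# Hypothetical three-class DFE: (S, proportion of nonsilent mutations).
dfe = [(0.0, 0.10), (-5.0, 0.20), (-500.0, 0.70)]
estimates = [alpha_est(dfe, n=8, k=k, alpha_true=0.25) for k in range(1, 5)]
```

Under these assumptions the estimate rises as more low-frequency classes are removed but stays below the true value, because at every frequency below 1 a deleterious class contributes proportionally more to polymorphism than it does to divergence.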

It is important to be clear about how we define adaptive substitutions. Under the PRG and PRLN distributions, half the substitutions will be slightly advantageous and half slightly deleterious, even if we assume there are no strongly advantageous mutations. These weakly selected adaptive substitutions are not included in the calculation of α because they represent mutations that compensate for the effects of slightly deleterious mutations; they therefore represent substitutions that do not lead to a net increase in fitness. The parameter α, therefore, represents the proportion of substitutions that leads to a net increase in fitness. However, it would be relatively easy to include these compensatory mutations in the calculation of α if we wished because αtotal = αstrong + (1 − αstrong)/2, where αtotal is the estimate of α including both strong and compensatory adaptive substitutions and αstrong is the proportion of substitutions which were strongly adaptive.

Numerical Considerations
Equation 3 involves a double integral that is difficult to evaluate accurately. However, Welch et al. (J.J. Welch, D. Waxman, A. Eyre-Walker personal communication) have shown that it is possible to express the integral over S in terms of the difference between Hurwitz zeta functions for the gamma and PRG distributions. A different simplification is possible for the lognormal and PRLN by expressing the integral over x in terms of hypergeometric functions. In both cases, this reduces the integral to a single dimension, in x for the gamma and PRG distributions and in S for the lognormal and PRLN.

Parameterization
The potential parameter space to explore is very large, so we decided to focus our analysis on distributions of fitness effects that were consistent with the pattern of evolution that we observe in protein-coding sequences. Given a value of αtrue and the shape parameter of the distribution, we found the value of S̄ that would yield a value of dn/ds of 0.2, the approximate value observed in several species, for example, drosophilids and mammals (Eyre-Walker et al. 2002). We also investigated the cases where dn/ds was 0.02, 0.1, 0.3, and 0.8; these gave very similar results, so we only report those from dn/ds = 0.2. We chose to investigate shape parameters of 0.125, 0.25, 0.5, and 1 for the gamma and PRG distributions because these values seem to be consistent with what is observed for protein-coding sequences (Piganeau and Eyre-Walker 2003; Eyre-Walker et al. 2006; Loewe and Charlesworth 2006; Keightley and Eyre-Walker 2007, though see Nielsen and Yang 2003). The lognormal distribution has been less extensively used, and so there is less information about realistic values of the shape parameter than for the gamma distributions. The only estimate we have is from the analysis of Loewe and Charlesworth (2006), who estimate the shape parameter to be 5.3 (95% confidence interval [CI] of 2.2 to 7.8) in Drosophila. However, the shape parameter cannot be too small or the mean strength of selection has to be unrealistically small to generate the required dn/ds value; for example, S̄ = 4 for αtrue = 0.25, dn/ds = 0.2, and σ = 1. We investigated shape parameters of 2, 4, 6, and 8.
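The calibration step for the single-sided gamma distribution can be sketched as follows. The sketch assumes R(S) = S/(1 − e^−S), a gamma density with shape β and mean S̄ for the strength of deleterious selection, and dn/ds equal to the mean of R(S) divided by (1 − αtrue). Under those assumptions the mean of R(S) has a closed form in terms of the Hurwitz zeta function, β b^β ζ(β + 1, 1 + b) with b = β/S̄. The function names, the series used for the zeta function, and the bisection search are illustrative choices.

```python
import math

def hurwitz_zeta(s, q, terms=50):
    """Hurwitz zeta for s > 1, q > 0: direct sum plus an Euler-Maclaurin tail."""
    head = sum((q + k) ** (-s) for k in range(terms))
    a = q + terms
    tail = a ** (1.0 - s) / (s - 1.0) + 0.5 * a ** (-s) + s * a ** (-s - 1.0) / 12.0
    return head + tail

def mean_fixation_ratio(beta, s_bar):
    """E[R(S)] when -S is gamma distributed with shape beta and mean s_bar."""
    b = beta / s_bar
    return beta * b ** beta * hurwitz_zeta(beta + 1.0, b + 1.0)

def dn_ds(beta, s_bar, alpha_true):
    """Expected dn/ds when a proportion alpha_true of substitutions is adaptive."""
    return mean_fixation_ratio(beta, s_bar) / (1.0 - alpha_true)

def calibrate_s_bar(beta, alpha_true, target=0.2, lo=1e-3, hi=1e9):
    """Bisect on log(s_bar); dn/ds falls as the mean strength of selection rises."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if dn_ds(beta, mid, alpha_true) > target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

s_bar = calibrate_s_bar(beta=0.25, alpha_true=0.25)
```

For β = 0.25 and αtrue = 0.25, this sketch returns a mean strength of selection of the same order as the value of 800 quoted later in the text for dn/ds = 0.2, and evaluating `dn_ds(0.25, 800.0, 0.25)` gives a ratio close to 0.2.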


    Results
 
Using our equations, we can calculate the expected value of αest, the proportion of substitutions estimated to be due to adaptive evolution, as a function of the true value of α, the number of sequences sampled, and the cutoff frequency, defined as the number of sequences including and below which we ignore polymorphisms (e.g., for a cutoff frequency of 1, we remove all singletons). In figures 2 and 3, we show the results from the gamma and PRG distributions for 2 values of αtrue, 0.25 and 0.5, when 8, 32, or 128 sequences have been sampled, for 4 different distributions of fitness effects, all of which will give a dn/ds value of 0.2. Results using the lognormal and PRLN are qualitatively very similar (results not shown). The results for cases in which dn/ds is either smaller than or larger than dn/ds = 0.2 are also very similar (results not shown).


 
FIG. 2.— The expected estimated value of α for a single-sided gamma distribution for different cutoff frequencies. The cutoff frequency is the number of sequences including and below which polymorphisms are excluded. Each graph shows α estimated for distributions with shape parameters, going from top to bottom, of 0.125 (diamonds), 0.25 (stars), 0.5 (squares), and 1 (triangles) and corresponding S̄ values that give dn/ds = 0.2. The left-hand column of graphs is for αtrue = 0.25 and the right-hand column for αtrue = 0.50. The rows of graphs are for sample sizes of 8, 32, and 128 sequences (top to bottom).

 

 
FIG. 3.— The expected estimated value of α for the PRG distribution. The cutoff frequency is the number of sequences including and below which polymorphisms are excluded. Each graph shows α estimated for distributions with shape parameters, going from top to bottom, of 0.125 (diamonds), 0.25 (stars), 0.5 (squares), and 1 (triangles) and corresponding S̄ values that give dn/ds = 0.2. The left-hand column of graphs is for αtrue = 0.25 and the right-hand column for αtrue = 0.50. The rows of graphs are for sample sizes of 8, 32, and 128 sequences (top to bottom).

 
Several patterns are evident. As expected, removing low-frequency polymorphisms from the MK analysis increases the estimate of α toward its true value. However, the estimated value of α does not approach the true value unless the true level of adaptive evolution is relatively high (e.g., αtrue = 0.5) and the distribution of fitness effects is very leptokurtic (β small). This has 2 related consequences. First, the estimate of α is always an underestimate, and this underestimation can be large if adaptive evolution is rare and/or the distribution of fitness effects is relatively platykurtic. Second, there may be no evidence of adaptive evolution even when there has been substantial adaptive evolution. For example, we would not detect any adaptive evolution even if 25% of all substitutions were due to adaptive evolution if β ≥ 0.5. The kurtosis of the distribution affects the degree to which α is underestimated because more leptokurtic distributions have a smaller proportion of polymorphisms that are slightly deleterious.

Although the estimate of α continues to increase as the cutoff frequency increases, most of the benefit of this procedure appears to be achieved by removing polymorphisms below 15%. This may therefore be taken as a useful rule-of-thumb for analyses of real data sets.
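In practice the 15% rule has to be converted into a count for a given sample size. A minimal sketch, assuming the cutoff is the largest number of sequences whose frequency is still strictly below the threshold (the function name is hypothetical):

```python
from fractions import Fraction

def cutoff_count(n, freq=Fraction(15, 100)):
    """Largest count i (out of n sequences) whose frequency i/n is below freq."""
    i = 0
    while Fraction(i + 1, n) < freq:
        i += 1
    return i

# Cutoff counts for the three sample sizes considered in the text.
cutoffs = {n: cutoff_count(n) for n in (8, 32, 128)}
```

For 32 sequences this excludes polymorphisms present in 4 or fewer sequences. Exact fractions are used so that sample sizes for which 15% falls on a whole number are not affected by floating-point rounding.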

Surprisingly, the estimated value of α seems to be only weakly dependent upon the number of sequences that have been sampled. So sampling more sequences will only give one more power to detect adaptive evolution insofar as there will be more polymorphisms detected. It will generally be more fruitful to sequence longer stretches of DNA rather than more copies of the same sequence.


    Discussion
 
We have shown that removing low-frequency polymorphisms from an MK-type analysis is not a particularly good method for dealing with the effects of slightly deleterious mutations, although it is certainly much better than doing nothing. Only if adaptive evolution is quite frequent and the distribution of fitness effects for deleterious mutations is quite leptokurtic is the method efficient. So what implications does this have for the estimates of adaptive evolution that have already been obtained?

Two separate analyses have found essentially no evidence of adaptive evolution in protein-coding sequences between humans and chimpanzees (Chimpanzee-Sequencing-and-Analysis-Consortium 2005; Zhang and Li 2005) (though see Gojobori et al. 2007); in both cases, the proportion of nonsynonymous substitutions estimated to be due to adaptive evolution is close to zero and not significantly different from zero. However, as our analysis suggests, MK-type analyses are likely to have trouble detecting adaptive evolution if the level is quite low and/or the distribution of fitness effects is relatively platykurtic. The distribution of fitness effects for nonsynonymous mutations in humans has been estimated to be quite leptokurtic; Eyre-Walker et al. (2006) and Keightley and Eyre-Walker (2007), fitting a gamma distribution of fitness effects to human SNPs, estimate the shape parameter to be approximately 0.19, so this at least should make detecting adaptive evolution easier. However, to investigate the question in detail, we found the true value of α, above which the estimated value of α would be greater than zero, taking the most extreme distributions estimated by Keightley and Eyre-Walker (2007). If we assume a gamma distribution of fitness effects with β = 0.10 and Formula and exclude polymorphisms below 15%, the cutoff frequency suggested by our analysis and used by Fay et al. (2001) and Zhang and Li (2005), we find that αtrue would have to be greater than 8% if 8 sequences have been sampled and 9% if 128 sequences have been sampled. If we assume a gamma distribution of fitness effects with β = 0.29 and Formula , then αtrue would have to be greater than 21% or 22% if 8 or 128 sequences have been sampled, respectively. In reality, because estimates of α are odds ratios and so tend to have large CIs, it would be difficult to detect adaptive evolution unless it was rather greater than these limits.

In Drosophila melanogaster and its related species, several analyses suggest that the proportion of nonsynonymous substitutions driven by positive selection is ~50% (Smith and Eyre-Walker 2002; Bierne and Eyre-Walker 2004; Andolfatto 2005; Welch 2006); Sawyer et al. (2003, 2007) put the proportion even higher at over 90%. In contrast, a recent analysis of data from Drosophila miranda has found no evidence of adaptive evolution (Bachtrog and Andolfatto 2006). Interestingly, there are striking differences between the rate of adaptive evolution in genes that have sex-biased expression and those which do not in D. melanogaster (Proschel et al. 2006). Recent analyses suggest that the distribution of fitness effects in D. melanogaster is significantly less leptokurtic than it is in humans; the shape parameter of a gamma distribution of fitness effects is estimated to be approximately 0.32 (Keightley and Eyre-Walker 2007). This suggests that α may have been substantially underestimated in this species.

We recently estimated the level of adaptive evolution in the protein-coding sequences of enteric bacteria using the genomic sequences of 6 strains of Escherichia coli and 6 strains of Salmonella enterica (Charlesworth and Eyre-Walker 2006). We estimated the proportion of nonsynonymous substitutions that had been adaptive between the 2 species using polymorphism data either from E. coli or from S. enterica. In both cases, we observed a similar pattern. As we removed singletons and then singletons and doubletons, the estimate of adaptive evolution increased, but it showed no sign of reaching an asymptote (fig. 4). This pattern suggests that our estimate of α might have been higher if we had more genomes at our disposal so that we were able to exclude polymorphisms segregating at higher frequencies. To investigate this, we found the parameters of a gamma distribution that would yield a pn/ps value of 0.024, the value observed in E. coli (note that we take advantage of the fact that we have polymorphism data and can therefore parameterize the model in terms of pn/ps, which saves us the trouble of having to alter the parameters of the distribution of fitness effects for different levels of adaptive evolution).


 
FIG. 4.— Estimates of α from enteric bacteria using polymorphism data from Escherichia coli (black dots) or Salmonella enterica (gray dots). Data from Charlesworth and Eyre-Walker (2006).

 
In figure 5, we plot the estimated value of α when the true value is 0.5 or 0.75 for 4 distributions of fitness effects, which are compatible with the observed value of pn/ps and sample sizes of 6 and 12 sequences. Several patterns are apparent. First, the theoretical predictions do not fit the data very well; in particular, the theory predicts that the value of α should tend to asymptote even when 6 sequences have been sampled, but this is not observed in the data; this may be due to sampling error or some feature of the data that is not accounted for. Second, the data are consistent with a distribution of fitness effects that is relatively platykurtic (β > 0.5); if the distribution is leptokurtic, then the estimated value of α increases but not as rapidly as we see in the real data. Critically, with such a nonleptokurtic distribution, the estimated value of α is considerably below the true value. In order to investigate this in more detail, we searched for values of β, S̄, and αtrue that would fit the observed values of α when all polymorphisms are considered and when just the highest frequency polymorphisms are used. For E. coli, the values are β = 0.62, S̄ = 2900, and αtrue = 0.74. For S. enterica, the values are β = 0.8, S̄ = 520, and αtrue = 0.65. It therefore seems likely that we have underestimated the level of adaptive evolution in enteric bacteria, and this underestimation may be quite large; the true value of α may be 50% greater than we estimated. However, there are clearly some features of the data that we are not accounting for, so some caution should be exercised. It is not clear whether sampling more genomes will help; our theoretical analysis suggests that it will not, but the data indicate the opposite.


 
FIG. 5.— Estimates of α for the single-sided gamma distribution for different cutoff frequencies. Each graph shows α estimated for distributions with shape parameters, going from top to bottom, of 0.25 (diamonds), 0.5 (stars), 0.75 (squares), and 1 (triangles), and corresponding S̄ values which give pn/ps = 0.024. The left-hand column of graphs is for αtrue = 0.50 and the right-hand column for αtrue = 0.75. The rows of graphs are for sample sizes of 6 and 12 sequences (top to bottom).

 
The poor performance of the exclusion method in situations where adaptive evolution is rare or the distribution of fitness effects is relatively platykurtic suggests that we need to develop new methods to take account of slightly deleterious mutations in MK-type analyses. There are 2 obvious possibilities. The first is to use outgroup information, where it is available, to orient the SNPs, that is, to infer whether the SNP is an X -> Y or Y -> X mutation. We can investigate this by using the unfolded site frequency spectra, that is, by changing equation 2 to

Formula (12)
and altering part of equation 3 as

Formula (13)
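For reference, the standard unfolded counterparts of the neutral expectation and the sampling probability can be written as below. These are reconstructions from the surrounding definitions, not transcriptions of equations 12 and 13.

```latex
P_s(i) = \frac{\theta}{i}, \qquad
Q(n, i, x) = \binom{n}{i}\, x^{i} (1 - x)^{n - i}, \qquad 1 \le i \le n - 1 .
```

Here i counts the sequences carrying the derived allele, so classes above n/2 become distinguishable from their low-frequency mirror images.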

Examples are shown in figure 6. Several points are evident. As you progressively remove low-frequency variants from the analysis, the estimate of α appears to approach its true value, although it is still a little downwardly biased. However, this is a little deceptive because orienting the SNPs only slightly improves performance if you consider the same frequency cutoff; for example, if we consider the case where αtrue = 0.25, β = 0.25, and S̄ = 800 (for dn/ds = 0.2) and we have 32 sequences, then if we exclude SNPs that are present in 4 or fewer sequences (i.e., below 15%), the estimate of α for nonoriented SNPs is 0.051 and for oriented SNPs it is 0.064. Orienting SNPs works much better because we can potentially set a cutoff above 50%. There are also some potential difficulties with implementing a method that orients SNPs. We should take into account that there will be error in orienting SNPs and that the level of this error will differ between synonymous sites and nonsynonymous sites because the latter are more conserved.


FIG. 6.— The expected estimated value of {alpha} for the single-sided gamma distribution for {alpha}true = 0.25 (left column) or {alpha}true = 0.5 (right column). In these examples, the sample size is 32 and the shape parameters of the gamma distribution are 0.125 (diamonds), 0.25 (stars), 0.5 (squares), and 1 (triangles) going from top to bottom.

 
An alternative way in which we might estimate the level of adaptive evolution, while taking account of slightly deleterious mutations, is to estimate the distribution of fitness effects of the slightly deleterious mutations while simultaneously estimating the level of adaptive evolution. This can be done relatively simply by extending methods for estimating the distribution of fitness effects, such as those developed by Eyre-Walker et al. (2006) and Keightley and Eyre-Walker (2007).

We have suggested that polymorphisms below 15% be excluded from MK-type analyses when there is evidence that some nonsilent polymorphisms are slightly deleterious. Ideally, this evidence should not come from the MK analysis itself but from an independent source, for example, by comparing the average allele frequencies of nonsilent and silent polymorphisms; if the former is significantly lower than the latter, then it seems likely that some of the nonsilent mutations are slightly deleterious and that the 15% cutoff should be employed. Of course, removing low-frequency polymorphisms increases the variance of the estimate of {alpha}, but this is a price that has to be paid to obtain a less-biased estimate. Hopefully, by providing a recommended cutoff frequency, we will remove the temptation to search for the frequency that yields the highest value of {alpha} because this is statistically difficult to defend.
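As a rough illustration of this recommendation, the Python sketch below applies a 15% cutoff before computing an MK-type estimate. It assumes the standard estimator {alpha} = 1 - (ds pn)/(dn ps), where dn and ds are the numbers of nonsilent and silent substitutions and pn and ps the numbers of nonsilent and silent polymorphisms. All function names and counts are hypothetical toy values, not data from any study, and the comparison of mean frequencies is shown without the significance test that a real analysis would need.

```python
def alpha_estimate(dn, ds, pn, ps):
    """MK-type estimate: alpha = 1 - (ds * pn) / (dn * ps)."""
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be nonzero")
    return 1.0 - (ds * pn) / (dn * ps)


def count_at_or_above(minor_counts, n, cutoff=0.15):
    """Number of SNPs whose minor-allele frequency is at least `cutoff`."""
    return sum(1 for k in minor_counts if k / n >= cutoff)


def mean_frequency(minor_counts, n):
    """Average minor-allele frequency across SNPs."""
    return sum(k / n for k in minor_counts) / len(minor_counts)


# Hypothetical minor-allele counts in a sample of 20 sequences.
n = 20
nonsilent = [1, 1, 1, 2, 2, 4, 5, 8]
silent = [1, 2, 4, 4, 5, 6, 8, 10]
dn, ds = 60, 100  # hypothetical substitution counts

# Independent check: are nonsilent SNPs at lower average frequency?
nonsilent_rarer = mean_frequency(nonsilent, n) < mean_frequency(silent, n)

# Estimate with all polymorphisms, then with those below 15% excluded.
alpha_all = alpha_estimate(dn, ds, len(nonsilent), len(silent))
pn_kept = count_at_or_above(nonsilent, n)
ps_kept = count_at_or_above(silent, n)
alpha_cut = alpha_estimate(dn, ds, pn_kept, ps_kept)
```

With these toy numbers the estimate rises once the low-frequency polymorphisms are excluded, but it rests on far fewer polymorphisms, which is the increase in variance noted above.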

Finally, it should be emphasized that there are many other factors, besides slightly deleterious mutations, that can potentially bias our estimate of {alpha}. Of particular concern are increases or decreases in population size, which, in combination with slightly deleterious mutations, can lead to either over- or underestimates of {alpha} (McDonald and Kreitman 1991; Eyre-Walker 2002). However, balancing selection on nonsilent mutations can also lead to an underestimate of {alpha}, and selection upon silent mutations to an overestimate.


    Acknowledgements
 
We are very grateful to John McDonald and 2 anonymous referees, one of whom suggested orienting SNPs, for comments. The authors were supported by the Biotechnology and Biological Sciences Research Council (J.C. and A.E.W.) and the National Evolutionary Synthesis Center (A.E.W.).


    Footnotes
 
John H. McDonald, Associate Editor


    References
 

    Andolfatto P. Adaptive evolution of non-coding DNA in Drosophila. Nature (2005) 437:1149–1152.

    Asthana S, Noble WS, Kryukov G, Grant CE, Sunyaev S, Stamatoyannopoulos JA. Widely distributed noncoding purifying selection in the human genome. Proc Natl Acad Sci USA (2007) 104:12410–12415.

    Bachtrog D, Andolfatto P. Selection, recombination and demographic history in Drosophila miranda. Genetics (2006) 174:2045–2059.

    Bierne N, Eyre-Walker A. Genomic rate of adaptive amino acid substitution in Drosophila. Mol Biol Evol (2004) 21:1350–1360.

    Bulmer M. The selection-mutation-drift theory of synonymous codon usage. Genetics (1991) 129:897–907.

    Bustamante CD, Fledel-Alon A, Williamson SH. (14 co-authors). Natural selection on protein-coding genes in the human genome. Nature (2005) 437:1153–1157.

    Bustamante CD, Nielsen R, Sawyer SA, Olsen KM, Purugganan MD, Hartl DL. The cost of inbreeding in Arabidopsis. Nature (2002) 416:531–534.

    Cargill M, Altshuler D, Ireland J. (16 co-authors). Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet (1999) 22:231–238.

    Charlesworth B. The effect of background selection against deleterious mutations on weakly selected, linked variants. Genet Res (1994) 63:213–227.

    Charlesworth J, Eyre-Walker A. The rate of adaptive evolution in enteric bacteria. Mol Biol Evol (2006) 23:1348–1356.

    Charlesworth J, Eyre-Walker A. The other side of the nearly neutral theory, evidence of slightly advantageous back-mutations. Proc Natl Acad Sci USA (2007) 104:16992–16997.

    Chimpanzee-Sequencing-and-Analysis-Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature (2005) 437:69–87.

    Drake JA, Bird C, Nemesh J. (11 co-authors). Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat Genet (2006) 38:223–227.

    Eyre-Walker A. Changing effective population size and the McDonald-Kreitman test. Genetics (2002) 162:2017–2024.

    Eyre-Walker A, Keightley PD, Smith NGC, Gaffney D. Quantifying the slightly deleterious model of molecular evolution. Mol Biol Evol (2002) 19:2142–2149.

    Eyre-Walker A, Woolfit M, Phelps T. The distribution of fitness effects of new deleterious amino acid mutations in humans. Genetics (2006) 173:891–900.

    Fay J, Wyckoff GJ, Wu C-I. Positive and negative selection on the human genome. Genetics (2001) 158:1227–1234.

    Fay J, Wyckoff GJ, Wu C-I. Testing the neutral theory of molecular evolution with genomic data from Drosophila. Nature (2002) 415:1024–1026.

    Gojobori J, Tang H, Akey JM, Wu CI. Adaptive evolution in humans revealed by the negative correlation between polymorphism and fixation phases of evolution. Proc Natl Acad Sci USA (2007) 104:3907–3912.

    Gu X. Stabilizing selection of protein function and distribution of selection coefficient among sites. Genetica (2007) 130:93–97.

    Hughes AL. Evidence for abundant slightly deleterious polymorphisms in bacterial populations. Genetics (2005) 169:533–538.

    Keightley PD. The distribution of mutation effects on viability in Drosophila melanogaster. Genetics (1994) 138:1315–1322.

    Keightley PD. Nature of deleterious mutation load in Drosophila. Genetics (1996) 144:1993–1999.

    Keightley PD, Eyre-Walker A. Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics (2007) 177:1–11.

    Kimura M. On the probability of fixation of mutant genes in a population. Genetics (1962) 47:713–719.

    Kimura M. The neutral theory of molecular evolution (1983) Cambridge: Cambridge University Press.

    Li W-H. Models of nearly neutral mutations with particular implications for the nonrandom usage of synonymous codons. J Mol Evol (1987) 24:337–345.

    Loewe L, Charlesworth B. Inferring the distribution of mutational effects on fitness in Drosophila. Biol Lett (2006) 2:426–430.

    Martin G, Lenormand T. A general multivariate extension of Fisher's geometrical model and the distribution of mutation fitness effects across species. Evolution Int J Org Evolution (2006) 60:893–907.

    McDonald JH, Kreitman M. Adaptive evolution at the Adh locus in Drosophila. Nature (1991) 351:652–654.

    Nachman MW. Deleterious mutations in animal mitochondrial DNA. Genetica (1998) 102:61–69.

    Nielsen R, Yang Z. Estimating the distribution of selection coefficients from phylogenetic data with applications to mitochondrial and viral DNA. Mol Biol Evol (2003) 20:1231–1239.

    Piganeau G, Eyre-Walker A. Estimating the distribution of fitness effects from DNA sequence data: implications for the molecular clock. Proc Natl Acad Sci USA (2003) 100:10335–10340.

    Proschel M, Zhang Z, Parsch J. Widespread adaptive evolution of Drosophila genes with sex-biased expression. Genetics (2006) 174:893–900.

    Rand DM, Kann LM. Excess amino acid polymorphism in mitochondrial DNA: contrasts among genes from Drosophila, mice and humans. Mol Biol Evol (1996) 13:735–748.

    Sawyer S, Kulathinal RJ, Bustamante CD, Hartl DL. Bayesian analysis suggests that most amino acid replacements in Drosophila are driven by positive selection. J Mol Evol (2003) 57:S154–S164.

    Sawyer SA, Parsch J, Zhang Z, Hartl DL. Prevalence of positive selection among nearly neutral amino acid replacements in Drosophila. Proc Natl Acad Sci USA (2007) 104:6504–6510.

    Smith NGC, Eyre-Walker A. Adaptive protein evolution in Drosophila. Nature (2002) 415:1022–1024.

    Watterson GA. On the number of segregating sites. Theor Popul Biol (1975) 7:256–276.

    Welch JJ. Estimating the genome-wide rate of adaptive protein evolution in Drosophila. Genetics (2006) 173:821–837.

    Wright S. The distribution of gene frequencies under irreversible mutation. Proc Natl Acad Sci USA (1938) 24:253–259.

    Yampolsky LY, Kondrashov FA, Kondrashov AS. Distribution of the strength of selection against amino acid replacements in human proteins. Hum Mol Genet (2005) 14:3191–3201.

    Zhang L, Li W-H. Human SNPs reveal no evidence of frequent positive selection. Mol Biol Evol (2005) 22:2504–2507.

Accepted for publication December 15, 2007.




