MBE Advance Access originally published online on January 3, 2008
Molecular Biology and Evolution 2008 25(3):568-579; doi:10.1093/molbev/msm284
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Mutation-Selection Models of Codon Substitution and Their Use to Estimate Selective Strengths on Codon Usage

* Department of Biology, Galton Laboratory, University College London, London, United Kingdom
Department of Biology, University of Copenhagen, Copenhagen, Denmark
E-mail: z.yang{at}ucl.ac.uk.
| Abstract |
|---|
|
|
|---|
Current models of codon substitution are formulated at the levels of nucleotide substitution and do not explicitly consider the separate effects of mutation and selection. They are thus incapable of inferring whether mutation or selection is responsible for evolution at silent sites. Here we implement a few population genetics models of codon substitution that explicitly consider mutation bias and natural selection at the DNA level. Selection on codon usage is modeled by introducing codon-fitness parameters, which together with mutation-bias parameters, predict optimal codon frequencies for the gene. The selective pressure may be for translational efficiency and accuracy or for fine-tuning translational kinetics to produce correct protein folding. We apply the models to compare mitochondrial and nuclear genes from several mammalian species. Model assumptions concerning codon usage are found to affect the estimation of sequence distances (such as the synonymous rate dS, the nonsynonymous rate dN, and the rate at the 4-fold degenerate sites d4), as found in previous studies, but the new models produced very similar estimates to some old ones. We also develop a likelihood ratio test to examine the null hypothesis that codon usage is due to mutation bias alone, not influenced by natural selection. Application of the test to the mammalian data led to rejection of the null hypothesis in most genes, suggesting that natural selection may be a driving force in the evolution of synonymous codon usage in mammals. Estimates of selection coefficients nevertheless suggest that selection on codon usage is weak and most mutations are nearly neutral. The sensitivity of the analysis on the assumed mutation model is discussed.
Key Words: codon substitution model codon usage mutation selection synonymous substitution
| Introduction |
|---|
|
|
|---|
In protein-coding genes, synonymous codons that code for the same amino acid do not appear at the same frequency (Ikemura 1981
In slowly growing organisms with small population sizes such as vertebrates, natural selection may be inefficient and indeed its effect on codon usage is controversial (see, e.g., Duret 2002
for a review). In contrast to results for bacteria, yeast, and Drosophila, strong evidence for selection on codon usage is lacking in vertebrates. For example, Kanaya et al. (2001)
found a correspondence between codon bias and tRNA gene copy number (a proxy for tRNA concentration) in Schizosaccharomyces pombe and C. elegans but not in Xenopus laevis and Homo sapiens; in the later species, highly expressed genes such as ribosomal genes and histone genes do not have strong codon bias. Some studies (e.g., Musto et al. 2001
) found a correlation between codon bias and putative expression levels (as measured by expressed sequence tag frequencies), but this correlation could be explained by transcription-coupled repair (Duret 2002
).
Besides selection for translational efficiency and accuracy, recent experimental work suggests that the selective pressure on codon usage may also be due to the need for an optimal translation kinetics, to ensure correct protein folding. Protein folding is thought to be cotranslational, occurring at the same time the protein is translated from the mRNA (Frydman 2001
). The use of preferred and unpreferred codons may affect the rate at which the protein is translated. The translation kinetics may be important in separating temporally folding events during protein synthesis on the ribosome, thus ensuring "beneficial" interactions and avoiding "unwanted" interactions within the growing peptide, to achieve high yield of the correctly folded protein. Kimchi-Sarfaty et al. (2007)
reported that certain synonymous mutations in the multidrug resistance 1 gene resulted in altered drug and inhibitor interactions. They found similar mRNA and protein levels but altered protein conformations between the "wild type" and mutant protein products and hypothesized that the incorporation of rare synonymous codons may have affected the timing of folding. This form of selection differs from translational selection in that preferred codons are not always advantageous if the optimal folding requires a slow translation. It is unclear how important such selection for protein folding is to the evolutionary process of protein-coding genes.
A number of authors have studied population genetics models in which the proportions of synonymous codons are modeled as the product of interactions between mutation bias, natural selection, and genetic drift (Kimura 1983
; Li 1987
; Bulmer 1991
; McVean and Charlesworth 1999
). McVean and Vieira (1999)
applied maximum likelihood (ML) to fit such a model to counts of synonymous codons for 2-fold amino acids in protein-coding genes in several Drosophila species, to estimate parameters of mutation bias and selective pressure. The analysis does not consider the evolutionary relationships among species, which may provide useful information concerning relative mutation rates between nucleotides. This model was extended by McVean and Vieira (2001)
to analyze synonymous differences between different species, with nonsynonymous differences ignored. Nielsen et al. (2007)
implemented a codon-substitution model in which a mutation is favored or disfavored by natural selection depending on whether it changes an unpreferred codon into a preferred one or vice versa. The model was applied to Drosophila protein–coding genes to obtain ML estimates of parameters measuring the strength of selection. This method requires a priori partitioning of synonymous codons into preferred and unpreferred categories and also assumes only one selection coefficient to accommodate selection on codon usage.
In this paper, we implement a few new models of codon substitution that relax those assumptions. Our motivations for this study are 2-fold. First, we devise a likelihood ratio test (LRT) of neutral evolution of codon usage to infer possible effects of natural selection. Whereas many previous studies have performed correlation analysis to test the various predictions of the mutation and selection theory of codon usage bias (see above), the LRT addresses this problem directly. Our model also provides direct measurements of selection acting on silent sites. Second, we examine the effects of model assumptions about codon usage on estimation of sequence distances such as dS, dN, and their ratio
= dN/dS. There has been considerable interest in the use of the
ratio to detect positive selection affecting protein evolution, and some concerns have been expressed as to whether this inference is affected by natural selection acting on silent sites (Kreitman and Akashi 1995
; Yang and Bielawski 2000
). We analyze 2 sets of data to address these issues, the first of the human and chimpanzee mitochondrial protein–coding genes and the second of 5,639 protein-coding genes from the 5 mammalian species: human, chimpanzee, macaque, mouse, and rat.
| Theory |
|---|
|
|
|---|
A Mutation-Selection Model of Codon Substitution
We construct a model of codon substitution by specifying the instantaneous rate of substitution from sense codons I = i1i2i3 to J = j1j2j3, where i1 is the nucleotide at the first position in codon I, and so on. We assume that point mutations occur independently at nucleotide sites and thus the rate is zero if I and J differ at more than 2 or 3 codon positions (Goldman and Yang 1994
jk. We explicitly model the process of one codon substituting another codon, that is, mutation, selection on the DNA (selection on codon usage), and selection on the protein.
Mutation Bias
Let the mutation rate from nucleotides i to j be µij per generation. The mutation model applies to all 3 codon positions, although the base compositions at the 3 positions may differ. We use the general time reversible (GTR or REV) model (e.g., Yang 1994
) to describe the mutation process so that
, with aij = aji for all i
j. Here
reflects mutation bias; if
is large, mutations are biased toward T. One of the mutation-bias parameters is redundant, and we scale them so that
. If the HKY mutation model (Hasegawa et al. 1985
) is used,
if i and j differ by a transition and
if i and j differ by a transversion, with
to be the transition/transversion rate ratio. Our analysis below is based mostly on the HKY model, but GTR is used in some analyses to examine the robustness of the results.
Selection on Codon Usage
We model selection on codon usage by introducing a fitness parameter fI for codon I. The selection coefficient for the mutation that changes the wild type codon I into a new mutant codon J is thus sIJ = fJ – fI. The probability of fixation of the mutation is
, where N is the effective chromosomal population size (Fisher 1930
; Wright 1931
; Kimura 1957
). Let FI = 2NfI be the scaled fitness of codon I, and SIJ = 2NsIJ = 2N(fJ – fI) = FJ – FI be the scaled selection coefficient. As the number of the I
J mutations in a generation is
, the substitution rate from codons I to J is given as
|
| (1) |
J mutation to the fixation probability of a neutral mutation, with h(SIJ) < 1, = 1 and > 1 for deleterious mutations (with SIJ < 0), neutral mutations (SIJ = 0), and advantageous mutations (SIJ > 0), respectively.
When the model is applied to sequence data from different species, we have in this study assumed that the effective population size N and the selection coefficients are the same among lineages. Those assumptions can be relaxed at the expense of including more parameters (McVean and Vieira 2001
; Nielsen et al. 2007
).
Selection on the Protein
To describe selection on the protein, we multiply the substitution rate by
if and only if the mutation is nonsynonymous (Goldman and Yang 1994
; Yang and Nielsen 1998
). Thus,
is the nonsynonymous/synonymous substitution rate ratio. The use of one single
to describe selection on the protein is very simplistic. However, previous models that incorporate amino acid chemical properties to specify codon substitution rates achieved only moderate (although statistically significant) improvements to the model's fit to data, and furthermore, such models produced rather similar estimates of mutation parameters to the simple model of one
ratio (Goldman and Yang 1994
; Yang et al. 1998
). Here our focus is on the effect of selection on synonymous codon usage. We also implement the site models that assume variable
ratios among codons in the gene (Nielsen and Yang 1998
; Yang et al. 2000
).
To summarize, the substitution rate from codons I to J is specified as
![]() | (2) |
,
,
, and
), 60 scaled fitness parameters, and
. The sequence distance t or branch lengths on the tree are additional parameters to be estimated from the data.
After the Q matrix is constructed, the stationary distribution of the Markov chain,
= {
1,
2, ...,
61}, is given by the system of linear equations
Q = 0, subject to the constraint that the
js sum to one. This distribution can also be calculated directly (see eq. 4 below). The matrix is then multiplied by a constant so that the "average" rate is one:
. The transition probability matrix P(t) = eQt is calculated following standard theory. (Note that we have used
J, where the subscript J is a codon to indicate the equilibrium frequency of codon J, and
, where the subscript j is a nucleotide to represent the mutation-bias parameter in the HKY or GTR mutation models.)
The Markov model of codon substitution specified by equation (2) is time reversible. To show this, it is sufficient to write the rate matrix as a product of a symmetrical matrix and a diagonal matrix (e.g., Yang 2006
, p. 33–34). The rate qIJ in equation (2) for a synonymous change can be rewritten as
![]() | (3) |
is the product of the mutation-bias parameters for the 2 unchanged nucleotides (i.e.,
if I = TCA and J = TCG). The quantity in the square brackets, denoted AIJ, satisfies AIJ = AJI for all I
J, whereas the quantity in the parentheses is a function of J only. The rate qIJ when the I
J substitution is nonsynonymous can be written in this form as well. Thus, the rate matrix Q = {qIJ} can be written as a product of a symmetrical matrix {AIJ} and a diagonal matrix so that the Markov process is time reversible, with the stationary frequency for codon J given as
|
| (4) |
. This result makes it clear that the stationary codon frequencies are determined by both mutation bias (represented by
) and selection on codon usage (represented by
). The model is referred to below as the FMutSel model. It may also be noted that instead of the codon fitness parameters (FJ), one may use the codon frequencies (
J) as parameters. The latter parametrization is convenient for an approximate implementation to be described below.
An LRT of Selection on Codon Usage
We implement a special case of the mutation-selection model of codon substitution (eq. 2), in which all synonymous codons (codons that encode the same amino acid) have the same fitness. Thus, instead of 60 (=61 – 1) codon fitness parameters for the universal genetic code, only 19 (=20 – 1) amino acid fitness parameters are used. The model assumes that the amino acid frequencies are determined by the functional requirements of the protein, but there is no fitness difference among the synonymous codons. From the theory above (eq. 4), the relative frequencies of synonymous codons are determined solely by the mutational-bias parameters. This model is referred to as FMutSel0.
An LRT can be constructed by comparing models FMutSel0 against FMutSel. Twice the log-likelihood difference between the 2 models is compared with the
2 distribution with degree of freedom = 60 – 19 = 41 for the universal code (or 40 for the vertebrate mitochondrial code). This constitutes a test of the null hypothesis that codon usage is due to mutation bias alone and not to selection acting at silent sites.
Measurements of Selection on Codon Usage
As our model explicitly separates mutation bias from selection affecting codon usage, we devise a few measures of the strength of natural selection on codon usage. Imagine observing the Markov process of codon substitution at any site (any codon triplet) for an infinitely long time. In a proportion
I of the time, the wild-type codon at the site in the population is codon I. The mutation (from codon I) to codon J, which changes the nucleotides ik into jk at codon position k and which has scaled fitness SIJ = FJ – FI, occurs at the rate
. Averaged over time, the proportion of the I
J mutation among all mutations is
![]() | (5) |
J.
One may then calculate the proportion of advantageous mutations among all mutations as
|
| (6) |
|
| (7) |
The strength of positive selection on an average advantageous mutation may be measured by
|
| (8) |
![]() | (9) |
J mutation among all advantageous mutations. Here
is defined only if the I
J mutation is advantageous, with SIJ > 0. Similarly, the strength of negative selection may be measured by the average SIJ among deleterious mutations with SIJ < 0.
One may also calculate the proportion of advantageous mutations among all "substitutions," that is, among those mutations that have passed the filtering by natural selection. This can be calculated using equation (6), with the proportion mIJ calculated using equation (5) but with
replaced by
or
IqIJ (eq. 2). Because the substitution process is reversible, the proportion of advantageous mutations among substitutions is exactly
.
An Approximate Implementation
In the FMutSel and FMutSel0 models, the codon fitness and amino acid fitness parameters are estimated by numerical optimization under ML. We also implement approximate versions of these models by fixing the predicted codon or amino acid frequencies to the observed frequencies in the sequence data. These are referred to as "FMutSel-F" and "FMutSel0-F," respectively. This strategy reduces the number of parameters to be estimated by numerical iteration by 60 under FMutSel-F for the universal genetic code and by 19 under FMutSel0-F. Early models concerning codon usage, such as F1 x 4, F3 x 4, and Fcodon, were all implemented using the observed base or codon frequencies as parameter estimates (Yang 1997
). For fair comparison, they are now also implemented using proper numerical optimization of the frequency parameters. Models implemented using the approximation are referred to using the suffix "-F" (e.g., F1 x 4-F).
| Analysis of Real Data |
|---|
|
|
|---|
We analyze 2 sets of data. The first consists of the mitochondrial genes of the human (GenBank accession number D38112) and the chimpanzee (D38113 [GenBank] ) of Horai et al. (1995)
The second set of data consists of the 5,639 human–chimpanzee–macaque–mouse–rat quintet alignments of orthologous genes from the macaque genome-sequencing project (Rhesus Macaque Genome Sequencing and Analysis Consortium 2007
). Codons that had alignment gaps in at least one species are removed. The data were analyzed as the primate pair of human and macaque genes, the rodent pair of mouse and rat genes, as well as the quintet including all 5 species. Our objectives in those analyses are to conduct the LRT of neutral evolution at silent sites and to estimate the coefficients of selection acting on codon usage.
Effects of the Model of Codon Usage on Distance Estimation
The log-likelihood values and estimates of sequence distances are shown in table 1 for the human and chimpanzee mitochondrial data set. The assumed mutation model is HKY, but different models are used concerning codon usage. The F1 x 4, F3 x 4, and Fcodon models specify the codon-substitution rate to be proportional to the frequency of the target codon, with the codon frequencies calculated using the 4 nucleotide frequencies (F1 x 4), the nucleotide frequencies at the 3 codon positions (F3 x 4), or with all codon frequencies treated as free parameters (Fcodon) (Yang 1997
). The F1 x 4MG model was proposed by Muse and Gaut (1994)
and assumes that the codon-substitution rate is proportional to the frequency of the target nucleotide. F3 x 4MG is an extension of F1 x 4MG and uses different base frequencies at the 3 codon positions. F1 x 4MG and F1 x 4 predict the same equilibrium codon frequencies, as do F3 x 4MG and F3 x 4.
|
The new FMutSel model has a much higher log-likelihood value than all the old models, indicating better fit to the data. Note that except for F1 x 4MG, which is equivalent to FMutSel with all codons having the same fitness, none of the other old models are nested within FMutSel and the
2 distribution cannot be used to compare them. However, use of the Akaike information criterion (Akaike 1974
We are interested in whether model assumptions concerning codon usage affect estimation of the distances between 2 protein-coding genes. The familiar nonsynonymous and synonymous distances dN and dS are calculated according to Goldman and Yang (1994)
. Previous studies have found that those distances are sensitive to assumptions about codon usage (e.g., Yang and Nielsen 1998
, 2000
). Estimates of dN are very similar among models, but estimates of dS vary considerably. Estimates of the
ratio differ by 2-folds among models. Nevertheless, the new FMutSel model produced estimates that are within the range of the old estimates. The estimates of
under the commonly used F3 x 4 and Fcodon models are slightly smaller than that under FMutSel.
Distances
and
are the number of nonsynonymous substitutions per nonsynonymous site and the number of synonymous substitutions per synonymous site, respectively, based on the "physical site" definition of sites (Yang 2006
: eq. 2.20). These distances are more stable across models, as noted previously.
is the number of nucleotide substitutions per site at the third codon position before selection on the protein, whereas
is the number of nucleotide substitutions per 4-fold degenerate site, estimated from the codon model under ML (Yang 2006
, p. 63–64). Distances d3B and d4 are very similar to each other and their estimates are also similar among different models of codon usage (table 1). See Yang (2006)
and Bierne and Eyre-Walker (2003)
for a discussion of those distances in analysis of codon usage bias.
Overall, estimates of sequence distances and
ratio under the old models, especially models F3 x 4 and Fcodon, are similar to estimates under the new FMutSel model. We also note that FMutSel produced almost identical results to FMutSel-F, indicating that the approximation of fixing the equilibrium codon frequencies at their observed values worked well in the data set. FMutSel-F has a big computational advantage and may be useful in real data analysis.
Test of Selection on Synonymous Codon Usage
We applied the LRT of neutral evolution of codon usage to nuclear genes from the mammalian species. The FMutSel and FMutSel0 models are fitted to each of the 5,639 genes for the human–macaque pair, the mouse–rat pair, and the 5-species quintet. The histograms of the log-likelihood difference between the 2 models (
) are shown in figure 1. Table 3 lists the number and proportion of genes in which the LRT is significant. At the 5% level, the null hypothesis of neutral evolution is rejected in 87%, 90%, and 94% of genes for the primate pair, the rodent pair, and the quintet, respectively. The differences in the proportions appear to reflect the information content in the data sets rather than any real biological differences between primates and rodents. The mouse–rat pair is more divergent than the human–macaque pair so that the data are more informative and the test has higher power. Similarly, the quintet data are most informative so that the null hypothesis is rejected in the greatest number of genes. The analysis thus provides statistical evidence that synonymous codon usage in most genes is influenced by natural selection. Nevertheless, the LRT may be sensitive to the mutation model assumed in the FMutSel and FMutSel0 models, and we suggest caution should be exercised in interpreting those results (see Discussion).
|
|
We also conducted the LRT by comparing FMutSel0-F against FMutSel-F, using the approximation of fixing equilibrium codon frequencies at their observed values. This approximate test produced very similar results to those of figure 1. The test statistics (

) calculated using the 2 procedures are plotted against each other in figure 2 for the quintet data sets.
|
The Distribution of Selection Coefficients
We used the FMutSel model to calculate the proportions of mutations with different selective coefficients (S), generating an estimation of the distribution of S among new mutations. For this analysis, we use 4 large data sets: the concatenated mitochondrial genes from the human and chimpanzee and the concatenated nuclear genes for the human–macaque pair, the mouse–rat pair, and the quintet. We used both model M0 (1-ratio), which assumes the same
ratio for all codons, and M3 (discrete), which assumes 2 site classes in proportions p0 and p1 with different
ratios
0 and
1 (Yang et al. 2000
ratio is highly variable among codons. Nevertheless, estimates of the mutation bias parameters (
) and codon fitness parameters (not shown) are very similar between the 2 models in each of the 4 large data sets (table 2).
|
We used parameter estimates obtained under model M0 (1-ratio) to calculate the scaled selective coefficients (S) for mutations that involve 2 codons differing at exactly one position and thus have nonzero rates. Those are the possible mutations allowed by the model, and their probabilities of occurrences are given by equation (5). There are 526 and 508 such mutations (codon pairs) for the universal and mitochondrial codes, respectively. The S values for those mutations were binned into 21 bins to generate a histogram, with the mid value in each bin used as the representative for that bin and with the proportion for the bin calculated as the sum of proportions (mIJ in eq. 5) of all mutations falling into that bin. The results are shown in figure 3a. The proportion (P+) of advantageous mutations among all mutations is shown in table 2, as well as the average selective coefficients of advantageous and deleterious mutations (
and
). Because preferred codons with higher fitness are more common and most mutations lead to unpreferred codons with lower fitness, the distribution of S among new mutations is skewed to the left, with the proportion
. The proportion of advantageous mutations among substitutions is higher than P+ because an advantageous mutation has a higher fixation probability and makes a greater contribution to substitutions than does a deleterious mutation. Indeed, the proportions of advantageous mutations among substitutions is
, due to the reversibility of the substitution model.
|
The estimates of
and
are greater and thus selection on silent sites is stronger in the mitochondrial genes than in the nuclear genes (table 2). In the former,
31% of new mutations are advantageous, whereas in the latter, the proportion is 37–40%. The much lower
ratios in the mitochondrial genes than in the nuclear genes indicate that the mitochondrial proteins are under much stronger selective constraint than the nuclear proteins. The difference is more striking when one considers the fact that the effective population size for mitochondrial genes is
that of the nuclear genes and that selection is less efficient in smaller populations. The higher efficiency of selection in mtDNA, with respect to both codon usage and protein evolution, may be due to the fact that the haploid mitochondrial genome makes it easy to remove recessive mutations, whereas they may remain hidden in the heterozygous state in nuclear genes. Another possible explanation is the hypothesis of selection for translational accuracy, which predicts stronger selection on codon usage on highly conserved proteins or on highly conserved amino acids in a protein because the fitness cost of translational misincorporation should depend on how the amino acid change affects protein function (Akashi 1994It should be noted that in our model, all S values are nonzero, and P+ in table 2 includes mutations with S only very slightly positive, the evolutionary dynamics of which may be indistinguishable from that of neutral mutations. For example, mutations with |S| > 2 are rare in all data sets. The estimated proportions of mutations with S > 2 and S < –2 are 0.2% and 1.7%, respectively, for the mitochondrial genes, 0.2% and 1.6% for the human–macaque pair, 0.1% and 1.0% for the mouse–rat pair, and 0.1% and 0.9% for the quintet. Thus, although the LRT rejects the null model of neutral evolution of silent sites, selection on codon usage is mostly weak, and most mutations appear to be nearly neutral with respect to selection on codon usage.
We are also interested in how natural selection on codon usage changes the fitness distribution of mutations, that is, how mutations of different fitness contribute to substitutions. A histogram of S after filtering by natural selection on codon usage can be generated using the same procedure as described above, except that the proportion mIJ is calculated using equation (5), with
replaced by
. The resulting histograms (fig. 3b) show the proportion of mutations with scaled fitness S that has survived natural selection on codon usage. Similarly, If we replace
by
IqIJ in equation (5), the resulting histograms (fig. 3c) represent the proportion of mutations with fitness S among observed substitutions, that is, among mutations that have passed the filtering by selection both on codon usage bias and on amino acid replacements. Because of the detailed balance condition of the reversible Markov model of substitution, the distributions in figure 3b and c are all symmetrical. Note that here the distinction between selection on codon usage and selection on amino acid replacements is more conceptual than temporal, with no implication that one necessarily occurs before the other.
| Discussion |
|---|
|
|
|---|
Mechanistic Models of Codon Usage and Protein Evolution
A number of authors have studied the frequencies of synonymous codons for 2-fold degenerate amino acids as the result of interactions between mutation, genetic drift, and natural selection (Kimura 1983
1 mutation in the allele-0 population is s = f1 – f0 and that of the 1
0 mutation in the allele-1 population is –s. At mutation-selection-drift equilibrium, the probability density of the frequency p of allele 1 is given as
|
| (10) |
|
| (11) |
0 +
1 = 1. If we assume that the same selective pressure applies to synonymous codons for all 2-fold degenerate amino acids in a gene,
1 will be the proportion of preferred codons in the gene. The contributions of mutation and selection to the equilibrium frequencies of synonymous codons are apparent from equation (11). This may also be considered a special case of equation (4), which gives the equilibrium distribution of the codon-substitution process.
McVean and Vieira (1999)
used equation (11) to analyze observed counts of preferred codons for 2-fold amino acids in several Drosophila species, fitting binomial models by ML. The analysis used information on codon usage but ignored differences between species. McVean and Vieira (2001)
implemented a population genetics model that is very similar to equation (2) to describe substitutions between synonymous codons between species. The authors analyzed between-species synonymous differences to estimate the strength of natural selection on synonymous codon usage, with nonsynonymous differences ignored. The FMutSel models extend the work of McVean and Vieira to a full codon substitution model, which is suitable for comparative analysis of protein-coding genes from multiple species.
Previous models of codon substitution (Goldman and Yang 1994
; Muse and Gaut 1994
) aim to describe nucleotide substitutions and do not explicitly accommodate mutation bias and natural selection acting on the DNA level. The models may thus be ill suited for studying the forces and mechanisms of the evolutionary process at silent sites. The mutation-selection models implemented in this paper address this drawback, by introducing parameters that explicitly describe mutation bias and natural selection acting on codon usage. We suggest that such models, with the easy interpretation of the model parameters, may be very useful for studying the process of molecular sequence evolution.
There has been considerable interest in incorporating fitness effects of new mutations in constructing substitution models for phylogenetic analysis. Halpern and Bruno (1998)
considered a codon-substitution model in which at every amino acid site in the protein, different amino acids have different fitness and thus different equilibrium frequencies. The model was developed for distance calculation but is not practical for real data analysis due to its use of too many parameters. Moses et al. (2003)
adapted the theory to describe nucleotide substitutions and to estimate site-specific substitution rates in noncoding regulatory elements such as transcription factor–binding sites. Note that from equation (4), we have
|
| (12) |
The FMutSel model also has similarities to the site-class models of amino acid replacement implemented by Koshi et al. (1999)
, which assume that different site classes have different amino acid frequencies and different substitution patterns and that in each site class, every amino acid J has its own "propensity" FJ. Koshi et al. (1999
, eq. 4) applied a truncation on the substitution rate, equivalent to fixing h(SIJ) = 1 whenever the difference in propensity SIJ = FJ – FI > 0. Like FMutSel, this model is also time reversible, with the same equilibrium distribution, where the frequency of amino acid J is proportional to
. Except for the truncation mentioned above, the model of Koshi et al. (1999)
can be given a population genetics interpretation, with the propensity interpreted as the scaled fitness FJ. However, the truncation of rates means that the model assumes that an advantageous mutation is fixed at the same rate as a neutral mutation, which is unrealistic biologically. A similar criticism was made by Thorne et al. (2007)
.
More recent work by Yu and Thorne (2006)
, Thorne et al. (2007)
, and Choi et al. (2007)
assigned a fitness to the sequence when they developed mutation-selection models to describe the evolution of RNA or protein sequences. An advantage of those models is that they allow dependence among sites due to RNA or protein structural constraints.
We note that there has been some debate in the literature concerning whether use of the
ratio to detect natural selection acting on the protein (for reviews, see Yang and Bielawski 2000
; Yang 2002
) requires the assumption of neutral evolution at silent sites. Many authors take it for granted that this assumption is needed. A concern is that if selection acts on codon usage, codon models may be misled to produce an
ratio greater than one because selection on silent sites has reduced dS and not because positive selection has elevated dN. From the mutation-selection models implemented in this paper, it is clear that the assumption is not necessary and it is possible to use the
ratio to detect positive selection acting on the protein even if silent sites are under natural selection, as assumed in FMutSel. Comparison between dS and dN is a contrast between the rates before and after the action of selection on the protein (Yang 2006
, eq. 2.19) so that the comparison is valid whether evolution at silent sites is driven by mutation or selection. In this regard, selection on silent sites may be more accurately described as selection on the DNA level as it affects both silent and replacement sites.
Sensitivity of the LRT to the Mutation Model
The mutation-selection model of codon substitution makes many simplistic assumptions about the evolutionary process. For our purpose of testing for selection acting on silent sites, the most worrying assumptions appear to be those concerning the mutation process as the mutation-bias and codon-fitness parameters are expected to be highly correlated in such an analysis. Indeed, the effects of the 2 would be virtually impossible to separate if we had used only information on codon frequencies (see eqs. 4 and 11).
To examine the impact of the assumed mutation model on the LRT of selection on codon usage, we implemented the GTR mutation model (e.g., Yang 1994
). The codon frequency parameters are estimated using the observed frequencies rather than by ML iteration. Application of the LRT under the GTR model to the mammalian data produced results very similar to those obtained under HKY. The proportions of genes for which the LRT is significant under GTR (table 3) are slightly lower (by 1–2%) than under HKY. Figure 4 plots the test statistic (
) for the 2 mutation models for the quintet data sets. The results suggest that the LRT may not be very sensitive to the assumed mutation model.
|
However, the estimates of codon-fitness parameters for the concatenated data under the 2 mutation models are very different (results not shown). This is the case even though both mutation models predicted very similar codon frequency parameters, which closely match the observed frequencies. Our estimates of the selection coefficients are affected by the mutation model. Thus, we found that the LRT is somewhat insensitive to the assumed mutation model but the estimates of codon fitness parameters are.
Both HKY and GTR assume independent mutations at nucleotide sites. There is considerable evidence suggesting that the mutation rate of a nucleotide may depend on neighboring nucleotides (e.g., Bulmer 1986
; Hwang and Green 2004
; Siepel and Haussler 2004
). One well-known example of such context effects is the high mutation rate of CpG dinucleotides in mammalian genomes. As the cytosine in CpG is prone to methylation and deamination, CpG dinucleotides have a very high rate of mutating into TpG (Scarano et al. 1967
). With such mutational context effects, both the null and alternative hypotheses (FMutSel0 and FMutSel) in the LRT are violated, but the 2 models may not be affected to the same extent, in which case the violation of assumptions may cause the test to generate excessive false positives. For example, FMutSel0 predicts that the relative frequencies of 4-fold degenerate codons encoding the same amino acid are given by the mutation-bias parameters (
), independent of the encoded amino acid. If the mutation rate and pattern at the third codon position depend on the nucleotides at the first and second positions, FMutSel0 may fit the data poorly, but FMutSel may still achieve a reasonable fit because of its use of a separate codon fitness parameter FJ for each target codon J. Although both FMutSel0 and FMutSel make use of information from nonsynonymous differences as well as synonymous differences, the test may nevertheless be sensitive to such mutational context effects. It has also been suggested that one mutation event may affect multiple nucleotides and the assumption of independent mutations may be unrealistic (e.g., Yang et al. 1998
; Whelan and Goldman 2004
). However, those studies typically analyze substitutions instead of mutations, and the apparent double or triple substitutions may reflect artifacts of the inadequate substitution model rather than true double or triple mutations. The models developed here concern the mutation process, and it would appear that double or triple mutations, if not rare, should affect the 2 models in similar ways. At any rate, the sensitivity of the LRT to violations of the assumed mutation model is not well understood and merits further research.
We consider several strategies that may alleviate the confounding effect of mutation and selection. The first is to make certain assumptions concerning either the mutation or the selection process. For example, the method of Nielsen et al. (2007)
required prior knowledge of preferred and unpreferred codons and also assumed the same selective strength acting on all codons. The latter assumption may be unrealistic in some data sets. A second strategy is to analyze pseudogenes or noncoding DNA to estimate mutation parameters and then use them in the mutation-selection model of codon substitution to analyze coding genes. Similarly, one may analyze coding and neighboring noncoding regions jointly, with the same mutation-bias parameters applied to both regions and the selection parameters applied to the coding regions only. This requires that the same mutation process operates in both coding and noncoding regions, an assumption that may be violated due to translation-coupled repair (Duret 2002
). A third strategy, suitable for joint analysis of many genes from the same set of species, is to assume that the mutation parameters are shared among genes or at least among genes with similar codon usage bias or GC content at the third codon positions, whereas the strengths of selection on codon usage may differ among genes. In this paper, we analyzed the 5,639 mammalian genes separately, fitting 66 or more parameters to each gene, so that the model is rather parameter-rich. Finally, developing models that explicitly accommodate mutational context effects may also be very useful in improving the realism of the models implemented here. In this regard, our likelihood model provides a natural framework for such extensions.
| Program Availability |
|---|
|
|
|---|
The new FMutSel and FMutSel0 models developed in this paper are implemented independently by the 2 authors for error checking. All models described in this paper are implemented in the CODEML program in PAML 4 (Yang 2007
| Acknowledgements |
|---|
|
|
|---|
We thank 3 referees for many useful comments. This study is supported by a grant from the Biotechnological and Biological Sciences Research Council to Z.Y. and grants from FNU (Danish Natural Science Research Council) and Danmarks Grundforskningsfond to R.N.
| Footnotes |
|---|
Jeffrey Thorne, Associate Editor
| References |
|---|
|
|
|---|
Akaike H. A new look at the statistical model identification. IEEE Trans Automat Contr (1974) 19:716–723.[CrossRef]
Akashi H. Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics (1994) 136:927–935.[Abstract]
Akashi H. Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics (1995) 139:1067–1076.[Abstract]
Bennetzen JL, Hall BD. Codon selection in yeast. J Biol Chem (1982) 257:3026–3031.
Bierne N, Eyre-Walker A. The problem of counting sites in the estimation of the synonymous and nonsynonymous substitution rates: implications for the correlation between the synonymous substitution rate and codon usage bias. Genetics (2003) 165:1587–1597.
Bulmer M. Neighboring base effects on substitution rates in pseudogenes. Mol Biol Evol (1986) 3:322–329.[Abstract]
Bulmer M. Coevolution of codon usage and transfer RNA abundance. Nature (1987) 325:728–730.[CrossRef][Medline]
Bulmer MG. The selection-mutation-drift theory of synonymous codon usage. Genetics (1991) 129:897–907.[Abstract]
Castillo-Davis CI, Hartl DL. Genome evolution and developmental constraint in Caenorhabditis elegans. Mol Biol Evol (2002) 19:728–735.
Choi SC, Hobolth A, Robinson DM, Kishino H, Thorne JL. Quantifying the impact of protein tertiary structure on molecular evolution. Mol Biol Evol (2007) 24:1769–1782.
Dunn KA, Bielawski JP, Yang Z. Substitution rates in Drosophila nuclear genes: implications for translational selection. Genetics (2001) 157:295–305.
Duret L. Evolution of synonymous codon usage in metazoans. Curr Opin Genet Dev (2002) 12:640–649.[CrossRef][Web of Science][Medline]
Duret L, Mouchiroud D. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, Arabidopsis. Proc Natl Acad Sci USA (1999) 96:4482–4487.
Fisher R. The distribution of gene ratios for rare mutations. Proc R Soc Edinb (1930) 50:205–220.
Frydman J. Folding of newly translated proteins in vivo: the role of molecular chaperones. Annu Rev Biochem (2001) 70:603–647.[CrossRef][Web of Science][Medline]
Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol (1994) 11:725–736.[Abstract]
Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol (1998) 15:910–917.[Abstract]
Hasegawa M, Cao Y, Yang Z. Preponderance of slightly deleterious polymorphism in mitochondrial DNA: replacement/synonymous rate ratio is much higher within species than between species. Mol Biol Evol (1998) 15:1499–1505.
Hasegawa M, Kishino H, Yano T. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol (1985) 22:160–174.[CrossRef][Web of Science][Medline]
Horai S, Hayasaka K, Kondo R, Tsugane K, Takahata N. Recent African origin of modern humans revealed by complete sequences of hominoid mitochondrial DNAs. Proc Natl Acad Sci USA (1995) 92:532–536.
Hwang DG, Green P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci USA (2004) 101:13994–14001.
Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes. J Mol Biol (1981) 146:1–21.[CrossRef][Web of Science][Medline]
Ikemura T. Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol (1985) 2:13–34.[Abstract]
Kanaya S, Yamada Y, Kinouchi M, Kudo Y, Ikemura T. Codon usage and tRNA genes in eukaryotes: correlation of codon usage diversity with translation efficiency and with CG-dinucleotide usage as assessed by multivariate analysis. J Mol Evol (2001) 53:290–298.[CrossRef][Web of Science][Medline]
Kimchi-Sarfaty C, Oh JM, Kim IW, Sauna ZE, Calcagno AM, Ambudkar SV, Gottesman MM. A "silent" polymorphism in the MDR1 gene changes substrate specificity. Science (2007) 315:525–528.
Kimura M. Some problems of stochastic processes in genetics. Ann Math Stat (1957) 28:882–901.[CrossRef]
Kimura M. The neutral theory of molecular evolution (1983) Cambridge: Cambridge University Press.
Koshi JM, Mindell DP, Goldstein RA. Using physical-chemistry-based substitution models in phylogenetic analyses of HIV-1 subtypes. Mol Biol Evol (1999) 16:173–179.[Abstract]
Kreitman M, Akashi H. Molecular evidence for natural selection. Annu Rev Ecol Syst (1995) 26:403–422.[Web of Science]
Kurland CG. Translational accuracy and the fitness of bacteria. Annu Rev Genet (1992) 26:29–50.[Web of Science][Medline]
Li W-H. Models of nearly neutral mutations with particular implications for nonrandom usage of synonymous codons. J Mol Evol (1987) 24:337–345.[CrossRef][Web of Science][Medline]
McVean GA, Charlesworth B. A population genetics model for the evolution of synonymous codon usage: patterns and predictions. Genet Res (1999) 74:145–158.[CrossRef][Web of Science]
McVean GA, Vieira J. The evolution of codon preferences in Drosophila: a maximum-likelihood approach to parameter estimation and hypothesis testing. J Mol Evol (1999) 49:63–75.[CrossRef][Web of Science][Medline]
McVean GA, Vieira J. Inferring parameters of mutation, selection and demography from patterns of synonymous site evolution in Drosophila. Genetics (2001) 157:245–257.
Moriyama EN, Powell JR. Codon usage bias and tRNA abundance in Drosophila. J Mol Evol (1997) 45:514–523.[CrossRef][Web of Science][Medline]
Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB. Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evol Biol (2003) 3:19.[CrossRef][Medline]
Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol (1994) 11:715–724.[Abstract]
Musto H, Cruveiller S, D'Onofrio G, Romero H, Bernardi G. Translational selection on codon usage in Xenopus laevis. Mol Biol Evol (2001) 18:1703–1707.
Nielsen R, Bauer DuMont VL, Hubisz MJ, Aquadro CF. Maximum likelihood estimation of ancestral codon usage bias parameters in Drosophila. Mol Biol Evol (2007) 24:228–235.
Nielsen R, Yang Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics (1998) 148:929–936.
Rhesus Macaque Genome Sequencing and Analysis Consortium. Evolutionary and biomedical insights from the Rhesus macaque genome. Science (2007) 316:222–234.
Scarano E, Iaccarino M, Grippo P, Parisi E. The heterogeneity of thymine methyl group origin in DNA pyrimidine isostichs of developing sea urchin embryos. Proc Natl Acad Sci USA (1967) 57:1394–1400.
Sharp PM, Averof M, Lloyd AT, Matassi G, Peden JF. DNA sequence evolution: the sounds of silence. Philos Trans R Soc Lond B Biol Sci (1995) 349:241–247.[Web of Science][Medline]
Sharp PM, Li WH. The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol Biol Evol (1987) 4:222–230.[Abstract]
Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol (2004) 21:468–488.
Thorne JL, Choi SC, Yu J, Higgs PG, Kishino H. Population genetics without intraspecific data. Mol Biol Evol (2007) 24:1667–1677.
Whelan S, Goldman N. Estimating the frequency of events that cause multiple-nucleotide changes. Genetics (2004) 167:2027–2043.
Wright S. Evolution in Mendelian populations. Genetics (1931) 16:97–159.
Yang Z. Estimating the pattern of nucleotide substitution. J Mol Evol (1994) 39:105–111.[Web of Science][Medline]
Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci (1997) 13:555–556.
Yang Z. Inference of selection from multiple species alignments. Curr Opin Genet Dev (2002) 12:688–694.[CrossRef][Web of Science][Medline]
Yang Z. Computational molecular evolution (2006) Oxford (UK): Oxford University Press.
Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol (2007) 24:1586–1591.
Yang Z, Bielawski JP. Statistical methods for detecting molecular adaptation. Trends Ecol Evol (2000) 15:496–503.[CrossRef][Medline]
Yang Z, Nielsen R. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J Mol Evol (1998) 46:409–418.[CrossRef][Web of Science][Medline]
Yang Z, Nielsen R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol (2000) 17:32–43.
Yang Z, Nielsen R, Goldman N, Pedersen A-MK. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics (2000) 155:431–449.
Yang Z, Nielsen R, Hasegawa M. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol (1998) 15:1600–1611.[Abstract]
Yu J, Thorne JL. Dependence among sites in RNA evolution. Mol Biol Evol (2006) 23:1525–1537.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
M. Welch, A. Villalobos, C. Gustafsson, and J. Minshull You're one in a googol: optimizing genes for protein expression J R Soc Interface, August 6, 2009; 6(Suppl_4): S467 - S476. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Rodrigue, C. L. Kleinman, H. Philippe, and N. Lartillot Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons Mol. Biol. Evol., July 1, 2009; 26(7): 1663 - 1676. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. J. Dunham, V. G. Dugan, E. K. Kaser, S. E. Perkins, I. H. Brown, E. C. Holmes, and J. K. Taubenberger Different Evolutionary Trajectories of European Avian-Like and Classical Swine H1N1 Influenza A Viruses J. Virol., June 1, 2009; 83(11): 5485 - 5494. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. J. Weadick and B. S.W. Chang Molecular Evolution of the {beta}{gamma} Lens Crystallin Superfamily: Evidence for a Retained Ancestral Function in {gamma}N Crystallins? Mol. Biol. Evol., May 1, 2009; 26(5): 1127 - 1142. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Subramanian Temporal Trails of Natural Selection in Human Mitogenomes Mol. Biol. Evol., April 1, 2009; 26(4): 715 - 717. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. M. Moses and R. Durbin Inferring Selection on Amino Acid Preference in Protein Domains Mol. Biol. Evol., March 1, 2009; 26(3): 527 - 536. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Anisimova and C. Kosiol Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models Mol. Biol. Evol., February 1, 2009; 26(2): 255 - 271. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. dos Reis and L. Wernisch Estimating Translational Selection in Eukaryotic Genomes Mol. Biol. Evol., February 1, 2009; 26(2): 451 - 461. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Delport, K. Scheffler, and C. Seoighe Models of coding sequence evolution Brief Bioinform, January 1, 2009; 10(1): 97 - 109. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. C. Choi, B. D Redelings, and J. L Thorne Basing population genetic inferences and models of molecular evolution upon desired stationary distributions of DNA or protein sequences Phil Trans R Soc B, December 27, 2008; 363(1512): 3931 - 3939. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Kryazhimskiy, G. A Bazykin, J. Plotkin, and J. Dushoff Directionality in the evolution of influenza A haemagglutinin Proc R Soc B, November 7, 2008; 275(1650): 2455 - 2464. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Rodrigue, N. Lartillot, and H. Philippe Bayesian Comparisons of Codon Substitution Models Genetics, November 1, 2008; 180(3): 1579 - 1591. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





under the null model, the critical values for 








