Skip Navigation


MBE Advance Access originally published online on August 16, 2007
Molecular Biology and Evolution 2007 24(11):2362-2372; doi:10.1093/molbev/msm169
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Material
Right arrow All Versions of this Article:
24/11/2362    most recent
msm169v2
msm169v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gilchrist, M. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gilchrist, M. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Combining Models of Protein Translation and Population Genetics to Predict Protein Production Rates from Codon Usage Patterns

Michael A. Gilchrist

Department of Ecology and Evolutionary Biology, University of Tennessee, Knoxville

E-mail: mikeg{at}utk.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Addendum
 Acknowledgements
 References
 
Genes are often biased in their codon usage. The degree of bias displayed often changes with expression level and intragenic position. Numerous indices, such as the codon adaptation index, have been developed to measure this bias. Although the expression level of a gene and index values are correlated, the heuristic nature of these metrics limits their ability to explain this relationship. As an alternative approach, this study integrates mechanistic models of cellular and population processes in a nested manner to develop a stochastic evolutionary model of a protein's production rate (SEMPPR). SEMPPR assumes that the evolution of codon bias is driven by selection to reduce the cost of nonsense errors and that this selection is counteracted by mutation and drift. Through the application of Bayes' theorem, SEMPPR generates a posterior probability distribution for the protein production rate of a given gene. Conceptually, SEMPPR's predictions are based on the degree of adaptation to reduce the cost of nonsense errors observed in the codon usage pattern of the gene. As an illustration, SEMPPR was parameterized using the Saccharomyces cerevisiae genome and its predictions tested using available empirical data. The results indicate that SEMPPR's predictions are as reliable index based ones. In addition, SEMPPR's output is more easily interpreted and its predictions could be improved through refinements of the models upon which it is built.

Key Words: codon bias • nonsense errors • fitness landscape • synonymous substitution • evolution


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Addendum
 Acknowledgements
 References
 
Codon usage bias is the selective (i.e., nonuniform) usage of particular synonymous codons within a genetic sequence (Ikemura 1981Go; Bennetzen and Hall 1982Go; Sharp and Li 1987Go). The nature and strength of codon bias varies between species as well as within a genome and across tissues. Correlating protein or mRNA abundance to codon bias has been the subject of numerous studies (Gouy and Gautier 1982Go; Ikemura 1985Go; Gygi et al. 1999Go; Greenbaum et al. 2003Go). Such approaches take a descriptive approach in linking codon usage and abundance. In contrast, this study demonstrates how codon usage bias and the average protein production rate of a gene can be linked mechanistically. This is done through the development of a stochastic evolutionary model of a protein's production rate (SEMPPR). SEMPPR is built by integrating nested models of intracellular and population-level processes together and applying Bayes' theorem.

Although it is well accepted that codon bias is positively correlated with gene expression level, the importance of different possible sources of selection is still unclear (Bulmer 1991Go; Kurland 1992Go; Akashi 1994Go, 1995Go; Eyre-Walker 1996Go; Akashi and Eyre-Walker 1998Go; Comeron and Kreitman 2002Go; Elf et al. 2003Go; Kliman et al. 2003Go; Qin et al. 2004Go; Gilchrist and Wagner 2006Go; Plotkin et al. 2006Go). One hypothesis is that codon bias results from selection to minimize the probability and associated costs of nonsense errors (i.e., the premature termination of protein translation).

A major cost of a nonsense error is the amount of energy invested into assembling the incomplete polypeptide (Bulmer 1991Go; Kurland 1992Go; Eyre-Walker 1996Go). Because this cost is related to the length of the nascent peptide when the error occurs, selection on codon usage against nonsense errors should increase with codon position along a sequence (Bulmer 1991Go; Kurland 1992Go; Qin et al. 2004Go; Gilchrist and Wagner 2006Go). This, in turn, leads to the prediction of increasing codon bias with codon position. The prediction of increasing bias with position is unique to the nonsense errors acting as a hypothesized source of selection on codon usage.

Cursory support for the hypothesis that nonsense errors help shape codon usage patterns comes from experimental data indicating that many ribosomes fail to complete translation of a given protein (Manley 1978Go; Jorgensen and Kurland 1990Go; Kurland 1992Go). The detection of intragenic patterns of codon bias also provides support for this hypothesis. For example, Hooper and Berg (2000)Go found evidence for an increase in bias with position in Escherichia coli. More recently, Qin et al. (2004)Go detected clear patterns of increasing codon usage bias with position in Saccharomyces cerevisiae and other microorganisms. Similar patterns have been detected indirectly in S. cerevisiae as well (Gilchrist and Wagner 2006Go).

These findings indicate that nonsense errors are costly enough to result in detectable patterns of adaptive codon usage in the sequences of many genes. Given that such patterns of codon usage exist, one is inclined to ask, "What information can be taken from such patterns and how can it be properly interpreted?" This study attempts to answer this question by developing a stochastic model, SEMPPR. Essentially, SEMPPR makes inferences about the production rate of a gene based on its elevation on the fitness landscape of protein production costs.

It should be noted that SEMPPR only considers the effects of synonymous substitutions on a protein's production cost. This is because SEMPPR takes the amino acid sequence of a gene as a given piece of information. Thus, the term "adaptation," as used here, does not refer in anyway to the function of the protein encoded. Instead, here the adaptation of a codon sequence refers the state of its expected cost of producing a protein relative to the minimal possible cost (i.e., the codon sequence, which uses only "optimal" codons).

If nonsense errors are an important source of selection driving the evolution of codon bias, then the strength of this selection on a gene should be a function of the production rate of the protein it encodes. This selection will be counteracted by mutation and drift. Intuitively, genes observed to be higher on their fitness landscape are more likely to be under stronger selection than genes observed to be lower on their fitness landscape. By employing Bayes' theorem, the work developed here formalizes this intuition in the form of a nested stochastic model SEMPPR. Although the utility of SEMPPR is demonstrated using S. cerevisiae, the models and ideas behind its formulation are quite general, making it theoretically applicable to other organisms.

In addition to being able to predict the production rate of a gene from sequence data, SEMPPR should provide a means for calculating the selective force for or against a synonymous substitution. The force is expected to vary as a function of codon position and gene expression level, both of which can be calculated using SEMPPR. Further, because SEMPPR links genotype and fitness in a clear and mechanistic way, the model provides a concrete system for the future examination of more fundamental topics in evolutionary biology, such as the shape and nature of fitness landscapes and behavior of populations on these landscapes.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Addendum
 Acknowledgements
 References
 
Definitions for all symbols used in this study are listed in Table 1. The presentation of SEMPPR is broken into two parts. The first part focuses on its construction from models of protein production and allele fixation. The second part illustrates how to implement SEMPPR using the 5,847 verified genes of the model organism S. cerevisiae. The performance of SEMPPR is then evaluated by comparing its predictions with empirical estimates of protein production rates for individual genes using high-throughput data.


View this table:
[in this window]
[in a new window]

 
Table 1 List of Symbols Used

 
Model Development
In this section, it is shown how the degree to which the allele observed to be fixed in a population reduces the cost of nonsense errors can be used to make inferences about the production rate of the protein it encodes. This is done by mapping genotype to phenotype, phenotype to fitness, and then fitness to fixation probability. As a last step, Bayes' theorem is applied to the fixation probability. The resulting output from SEMPPR is a posterior probability distribution for the protein production rate of a gene based on its observed codon sequence.

From Genotype to Phenotype: Codon Usage and the Cost of Protein Production
When considering the cost of producing a protein in the face of nonsense errors, it is critical to recognize that for every protein which is completely translated, a number of incomplete proteins are also produced. These incomplete proteins are the result of nonsense errors. The cost of these nonsense errors is a function of their expected number and the length of the incomplete proteins. This cost, along with that of the completed protein, must be accounted for when calculating the total cost of protein production.

In the protein translation model presented in Gilchrist and Wagner (2006)Go, a gene is represented as a vector of codons Formula = {c1, c2,...cn}, where ci is the elongation rate of the ith codon and n is the number of codons to be translated. Specific values for these parameters can be estimated from sequence and experimental data (see Gilchrist and Wagner 2006Go and below).

Based on these assumptions, the probability of a ribosome successfully translating up to and including the ith codon in the sequence Formula is

Formula (1)
where b is a constant background nonsense error rate. Although there are data which suggest that b may vary from codon to codon (Freistroffer et al. 2000Go), for simplicity b is assumed to be constant.

From this definition, it follows that the expected energetic cost of a nonsense error is

Formula (2)
The terms within the summation in equation (2) represent the cost of translating up to the ith codon, the probability a ribosome reaches the ith codon, and the probability a nonsense error occurs at that codon, respectively. The terms a1 and a2 represent the cost of ribosome initiation and peptide elongation and are equal to 4 and 2 high-energy phosphate bonds (~P), respectively. The factor 1/(1 – {sigma}n (Formula )) scales the expected cost in the summation by the probability a nonsense error will occur at some point during the translation process. This factor was neglected in the formulation of {xi} originally derived in Gilchrist and Wagner (2006)Go.

The cost function {xi}(Formula ) represents the expected "cost" of a single nonsense error for the given codon sequence Formula . This cost function does not take into account the expected "number" of errors that might occur for each complete protein produced. Letting the index i represent the number of failures until a successful completion (which can range from 0 to {infty}) and {eta}(Formula ) represent the total expected cost of producing a complete protein from Formula then,


Formula (3)


Formula (4)

The first 2 terms in the summation of equation (3) represent the probability of i failures occurring before the next successful translation. The last term in the summation represents the expected cost of these i failures and 1 success.

From Phenotype to Fitness: Linking the Energetics of Protein Production to Fitness
The term {eta}(Formula ) represents the expected cost of producing a single, complete protein (protein production cost, for brevity). The rate at which an organism must allocate energy to the production of this protein is the product of this cost and the target production rate of completed proteins {phi}. Presumably, the energy an organism has at its disposal is limited. Consequently, the fitness of an allele will be some function of the energy it requires to produce the protein it encodes. Specifically, SEMPPR assumes that the fitness w of an allele decreases exponentially with the energy expenditure rate it requires, that is,

Formula (5)
where q is a scaling term and a3 is a proportionality constant whose value will cancel out in the next calculation.

Unlike a linear assumption, the assumption of a decreasing exponential fitness function ensures that the fitness of an allele never drops below 0. This assumption is also consistent with those made in the population genetics model employed below. This assumption is less restrictive than one might assume given that when Formula, the decrease in w({eta}) with {eta} is approximately linear.

From Fitness to Fixation: Calculating the Probability of Observing an Allele
The next step in formulating SEMPPR is linking fitness and fixation probability of an allele with its given codon sequence. Recently, a number of different researchers have independently derived the stationary distribution for gene fixation (Berg and Lässig 2003Go; Berg et al. 2004Go; Sella and Hirsh 2005Go). Using the formulation presented by Sella and Hirsh (2005)Go, for a haploid organism with a large effective population size, Formula, the probability a specific allele with fitness w is fixed at a given locus is approximately,

Formula (6)
where g(w) represents the probability density function of the set of all possible alleles for a given locus. For reasons of simplicity, it is assumed the mutation rates between different genotypes are symmetric; this assumption can be relaxed in future studies (Sella and Hirsh 2005Go).

Using the above definitions, the fixation probability of allele in equation (6) can be reformulated on a per locus basis. That is, the fixation probability of a particular allele i with a codon sequence Formula i and a corresponding protein production cost {eta}i = {eta}(Formula i) is

Formula (7)
where the parameters q and Ne are implicitly given and Formula represents the probability density function of protein production costs for the focal gene across its synonymous sequence space (i.e., the number of different codon sequences that can produce the same amino acid sequence). The term |Formula | represents the size of this space. This size is equal to the product of the number of codon synonymous at each amino acid position along an entire sequence. Using S. cerevisiae's genome as an example, these synonymous space sequences range in magnitude from 108 for very short genes to >102300 for very long genes. The terms {eta}min and {eta}max represent the minimum and maximum protein production costs for the amino acid sequence corresponding to the focal locus. Conceptually, these values correspond to the highest peak and lowest valley of a gene's fitness landscape. Although {eta} values are actually discrete, the set of their sizes is extremely large relative to their range, and the difference between the ith and the i + 1th highest values is presumably very small. Therefore, for simplicity the distribution of {eta} values are approximated in equation (7) with a continuous distribution.

From Fixation to Production: Inferring Production from Adaptation via Bayes' Theorem
Equation (7) calculates the probability that an allele with a particular codon sequence becomes fixed at a particular locus conditional on the protein production rate {phi} and other information such as q and Ne. In contrast, SEMPPR calculates the posterior probability density function of a production rate given the protein production cost of the observed codon sequence {eta}obs. Therefore, in order to define SEMPPR, Bayes' theorem is applied to equation (7) producing,

Formula (8)
where {eta}obs is the protein production cost of the observed sequence, f({phi}) represents the prior probability of {phi}, and {phi}max represents the maximum possible protein production rate. Due to the fact that Formula does not conform to any standard distribution, the proper conjugate prior distribution, if it exists at all, is unknown. Numerical analyses indicate that reasonable the prior distributions f({phi}) include a flat distribution, a truncated log-normal distribution or a composite log-normal distribution. These analyses also indicate that as long as {phi}max is sufficiently large such that Formula and f({phi}) are very small and decreasing at {phi}max, its exact value has no discernible effect on the predictions made by SEMPPR.

Model Application
This second part illustrates SEMPPR's implementation and performance using the model organism S. cerevisiae. Implementing SEMPPR, involved a number of distinct steps. First, the model of protein translation upon which SEMPPR is based was parameterized. Second, the observed protein production cost {eta} was directly calculated for each gene and the distribution of its expected protein production costs g({eta}) was estimated by randomly sampling from the distribution. Third, empirical measurements summarized in Beyer et al. (2004)Go were used to calculate a set of observed protein production rates. The above results, along with a proxy for Ne, were used to estimate the fitness function's scaling factor q. With this done, summary statistics of the posterior distributions for protein production Formula produced by SEMPPR were compared with their corresponding empirical estimates. To provide a context for these results, the relationship between various codon bias indices to the same empirical data were also evaluated. In addition, internal evaluations of the reliability of the empirical data were also conducted.

On a technical note, initial development and evaluation of SEMPPR was carried out using Mathematica 5.2 (Wolfram Research Inc. 2005Go). For computational efficiency and portability, these routines were then reimplemented on a Linux workstation in PERL and C using standard libraries. This final version of the code is available in the supplementary data (Supplementary Material online). Updates and bug fixes can be found at www.tiem.utk.edu/~mikeg/software.

Translation Model Parameterization
The model of protein translation was parameterized as in Gilchrist and Wagner (2006)Go. Briefly, information in Ikemura (1985)Go, Percudani et al. (1997)Go, and Akashi (2003)Go was used to estimate of the concentration of each tRNA species in S. cerevisiae. These values were scaled by a proportionality constant so that the average elongation rate across all codons was 10 amino acids/s after adjusting for wobble (Thomas et al. 1988Go; Curran and Yarus 1989Go; Kruger et al. 1998Go). The nonsense error rate was set to b = 0.00515/s after Jorgensen and Kurland (1990)Go and Tsung et al. (1989)Go.

Coding and Amino Acid Sequences
The 12 May 2006 release of the S. cerevisiae genome was downloaded from Saccharomyces genome database (SGD) (Dolinski et al. 2006Go). Analyses were restricted to the 5,847 verified nuclear genes that lack internal stops. The codon sequence presented in the SGD database was assumed to be fixed in the population and used to calculate {eta}obs for each locus. The amino acid sequence for each gene was used to estimate its distribution of protein production costs g({eta}) as described below.

Simulating the Production Cost Distribution
Given the enormity of the synonymous sequence spaces, the distribution of {eta} values for a given gene, g({eta}), was estimated through simulation. For each gene, 10,000 simulated codon sequences were constructed by randomly selecting one of the possible synonymous codons for each amino acid in the sequence. The protein production cost {eta} for each randomly constructed sequence was then calculated.

Despite the fact that g({eta}) could not be estimated directly, calculation of the upper and lower bounds for each locus was still possible. Specifically, the minimal protein production cost {eta}min is achieved when a codon sequence contains only the fastest translating codons for the given amino acid sequence of a gene. Similarly, the maximal protein production cost {eta}max is achieved when a codon sequence contains only the slowest translating codons.

The cost distribution g({eta}) for each gene was approximated by rescaling a beta distribution as,

Formula (9)
where {Delta}{eta} = {eta}max{eta}min and {Gamma} represents the gamma function. The parameters of rescaled beta distribution, {alpha} and ß, were estimated using a Bayesian approach with a flat, improper prior of f({alpha}) = 1 and f(ß) = 1. Despite the use of an improper prior, the posterior distributions of {alpha} and ß are proper, and the posterior modes were used for the estimates of the parameters. The values for the posterior modal values of {alpha} and ß for each gene can be found in supplementary table S1 (Supplementary Material online).

Using a rescaled beta distribution for g({eta}) results in the denominator of equation (7) taking the form of a confluent hypergeometric function: 1F1({alpha}; {alpha} + ß, {Delta}{eta}z), where Formula. Evaluating 1F1 can be numerically intensive, especially when {Delta}{eta}z is large. However, Butler and Wood (2002)Go provide a Laplace approximation of this function, which was used in this work. This approximation results in 6th order accuracy but is approximately 2.5 orders faster to calculate.

To illustrate the sampling of g({eta}) and the fitting of a rescaled beta distribution to this data, the simulation results, along with the observed protein production costs {eta}obs, are presented in figure 1a for 4 focal genes: YLR038C, YOR253W, YGL123W, and YDL126C. These genes were arbitrarily chosen to cover a wide range of the observed values of ({alpha}, ß) (supplementary fig. S1a, Supplementary Material online) and are involved in distinctly different biological functions. For example, YLR038C is part of the mitochondrial electron transport chain, and YDL126C is involved in the retrotransportation of ubiquinated proteins from the endoplasmic reticulum to the cytosol. As figure 1a illustrates, the rescaled beta distribution is able to capture the shape of the observed distribution of the simulated population of {eta} very well.


Figure 1
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— (a) Probability distributions of protein production costs {eta} for 4 focal genes: YLR038C, YOR253W, YGL123W, and YDL126C. Distributions were estimated by fitting a rescaled beta distribution to a set of 10,000 randomly created synonymous sequences. In order to facilitate their presentation, each gene's {eta} values are presented relative to their individual {eta}min and {eta}max values. The protein production cost values for the sequence observed in SGD are indicated as {eta}obs. (b) Posterior probability distributions of the protein production rate {phi} for the same 4 genes. Note that ordering the genes by their observed costs {eta}obs does not necessarily correspond to their ordering with regard to their modal value and that the width of the curves varies between genes.

 
The overall complexity of the posterior distribution in equations (7) and (8) precludes most direct analysis. However, at least one analytic insight can be made. By examining the slope of Formula at {phi}=0 and under the assumption of a flat prior for f({phi}), it follows that the posterior mode of {phi} is at the 0 boundary when the protein production cost of the wild-type allele is greater than the average value across genotype space, that is Formula . Work with more complex priors suggests that this result is quite general.

Empirical Estimates of Protein Production Rates
Beyer et al. (2004)Go provide estimates of the protein production rates for much of the S. cerevisiae genome. Conceptually, protein production rates can be broken into 2 components (the translation rate per mRNA and mRNA abundance, {tau} and m, respectively), such that {phi}={tau}xm. The estimates of {tau} and m for individual genes are obtained from a combination of mRNA ribosome occupancy and abundance measurements presented in Beyer et al. (2004)Go.

The estimates of the translation rate per mRNA {tau} are derived from mRNA ribosome occupancy measurements. These measurements come from 2 separate studies by Arava et al. (2003)Go and MacKay et al. (2004)Go. Because the error in the occupancy data is poorly understood, Beyer et al. (2004)Go simply averaged these measures and then multiplied by 10 amino acids/sec to generate a translation rate per mRNA. It should be noted that this method of estimating {tau} ignores any impact heterogeneity in codon elongation rates or nonsense errors may have on ribosome occupancy. Given that the occupancy measurements are dependent on florescence and migration through a density gradient, the error in these measurements is likely to be log-normally distributed. As a result, the geometric mean, as opposed to the arithmetic mean, of the observations was used as the estimate of {tau}, that is, Formula . Using the arithmetic mean has no substantial effect on the evaluation of SEMPPR's performance.

The estimates of mRNA abundance m are based on Affymetrix GeneChips measurements from multiple labs. The authors modified some of the measurements to counteract the saturation effect observed in high-expression genes (see Beyer et al. 2004Go for more details). Due to the large variation in mRNA abundance measurements between experiments, Beyer et al. (2004)Go propose using the median values as the best estimate of mRNA abundance. Examination of the error in the mRNA abundances reveals that it is also log-normally, rather than normally, distributed. This result reinforces the wisdom of this choice (supplementary figure, Supplementary Material online). This is because for a log-normally distributed random variable M where M ~ log-N(µ, {sigma}), the central parameter µ and median of the distribution are equivalent. Thus, assuming log-normal measurement error with a flat prior uninformative prior on ln(m) is consistent with previous analyses and allows the generation of posterior probability intervals (PIs) for the mRNA abundance of each gene. Consequently, the posterior mode for M was used as the estimate of mRNA abundance for an individual gene, that is, Formula . Thus, the empirical estimate of a gene's protein production rate was Formula

Estimating q from Multiple Genes
The value of q was independently estimated for the 4,633 S. cerevisiae genes, with empirical estimates of the protein production rate Formula and the mode of SEMPPR's prediction both greater than 0. Maximum likelihood estimates (MLEs) of q was calculated using equation (7) and by assuming Formula and using the effective population size of Saccharomyces paradoxus Ne = 1.36 x 107 as a proxy for S. cerevisiae (Wagner 2005Go). Assuming that the error in the estimates of mRNA and ribosome occupancy are both log-normally distributed, it follows that the empirically estimated protein production rate Formula for an individual gene will be lognormally distributed around its true value and, consequently, the estimates of q should also be log-normally distributed around its actually value. As expected, the distribution of these MLE estimates of q appears to be log-normally distributed around the geometric mean of Formula(fig. 2), which was also the value used to parameterize SEMPPR. This set of independent estimates of q was used to calculate the upper and lower 95% confidence interval (CI) for q as 3.39 x 10–8 and 8.49 x 10–6, respectively.


Figure 2
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Individual MLEs of the scaling coefficient q from 4,633 Saccharomyces cerevisiae genes. MLE values were the ones, which maximized the likelihood of the empirically estimated protein production rate Formula given the observed protein production cost {eta}obs using equation (8).

 
Given the 200-fold range in the CI for q and the use of Ne based on S. paradoxus, it is worth noting how these terms affect SEMPPR's predictions. In general, increasing values of q or Ne is equivalent to increasing the importance of selection relative to drift on allele fixation probabilities either through stronger selection, q, or a larger population size, Ne. Given the fact that q, Ne, and {phi} are multiplied by one another in Goequation (7), for a given observed protein production cost {eta}obs increasing either q or Ne by some factor x rescales the probability distribution function for {phi} produced by SEMPPR by 1/x. Indeed, the difference between Ne for S. cerevisiae and S. paradoxus is ultimately absorbed in the estimates of q. Thus, the 1 unknown parameter in this analysis, q, and 1 questionable parameter, Ne, work together as a scaling factor across all of SEMPPR's predictions (supplementary fig. S4, Supplementary Material online). Because SEMPPR's predictions are evaluated on a log–log scale, changing either Ne or q has no effect on the correlation coefficient between the predicted and observed values.


    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Addendum
 Acknowledgements
 References
 
Evaluating SEMPPR
Predicting Empirical Estimates of {phi}
Having all of the necessary components in hand, SEMPPR was used to calculate the posterior distributions of the protein production rate, Formula, for each of the 5,847 genes analyzed. Figure 1b illustrates SEMPPR's distributions for the 4 focal genes mentioned earlier. The cumulative distribution functions for these and the 5,843 other verified S. cerevisiae genes are presented in supplementary table S2 (Supplementary Material online). The results from SEMPPR can be used to calculate any of a number of statistics such as the 95% PI or measures of central tendencies such as the posterior mode, geometric mean, or arithmetic mean. All 3 measures of central tendencies are highly correlated with one another on a log-log scale (i.e., the correlations coefficients {rho} range from 0.96 to 0.99). These values are presented in supplementary table S3 (Supplementary Material online). For simplicity, the geometric mean of SEMPPR's posterior distribution was chosen to represent the predicted protein production rate, that is, Formula

Because there is uncertainty in both the predicted and observed production rates, Formula and Formula, standard goodness of fit models cannot be used. Instead, the performance of SEMPPR's predictions was evaluated by the correlation between the Formula and Formula, indicating how much of the variation in Formula is explained by Formula. The correlation coefficient {rho} between these predicted and the empirical estimates of protein production across the S. cerevisiae genome is 0.661 (fig. 3). This high correlation coefficient indicates that SEMPPR's Formula values are highly correlated with the empirically Formula values.


Figure 3
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Scatter plot of empirical estimates of a protein's production rates Formula versus the geometric mean of SEMPPR's predictions Formula for 5,432 genes in Saccharomyces cerevisiae. Color scheme indicates whether or not the 95% PI of the observed and predicted rates of {phi} overlapped. Green points are where the predicted 95% PI were below the empirically based 95% PI. Black points are where the predicted 95% PI overlapped with the empirically based 95% PI. Blue points are where the predicted 95% PI were above the empirically based 95% PI. The correlation coefficient {rho} between the observed and predicted values was 0.661.

 
For comparison, similar correlation analyses were conducted with 4 commonly used indices of codon usage bias: the Codon Adaptation Index (CAI) (Sharp and Li 1987Go), Codon Bias Index (Bennetzen and Hall 1982Go), Frequency of Optimal Codons (Ikemura 1981Go, 1985Go), and Effective Number of Codons (Wright 1990Go). The correlation coefficients between these indices and Formula range from 0.574 to 0.657 (fig. 4). Thus even in its initial formulation, SEMPPR is able to predict the expression level of genes as well as the standard indices.


Figure 4
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Comparison of various indices and the logarithm of the empirical estimates of a protein's production rate Formula including their correlation coefficients {rho}. (a) CAI (Sharp and Li 1987Go), (b) Codon bias index (Bennetzen and Hall 1982Go), (c), Frequency of optimal codons Fop (Ikemura 1981Go, 1985Go), and (d) the inverse of the effective number of codons 1/Nc (Wright 1990Go). Index values were calculated using the program CodonW (Peden 2005Go).

 
Overlap between SEMPPR's Predictions and Estimates
In addition to having a higher correlation with the empirical data, SEMPPR has 2 distinct advantages over index-based predictions. The first advantage is that SEMPPR's predictions are in units of a biologically meaningful and measurable process, that is, protein production rate of a gene {phi}. The second advantage is the probabilistic nature of SEMPPR's predictions. The fact that the frequency at which SEMPPR's 95% PI overlaps with empirical estimates of {phi} can be calculated illustrates both of these advantages.

Because Formula values are themselves estimates with relatively large error, 2 separate analyses were conducted. In the first analysis, the error in estimating Formula was ignored and whether or not the SEMPPR's 95% PI overlaps with Formula was determined. Because the first analysis ignores the substantial measurement error in Formula, the performance of SEMPPR's predictions will be underestimated. To account for part of the measurement error, a second analysis was conducted where the measurement error in mRNA abundance Formula was used as a proxy for the measurement error in Formula.

In the first analysis in which the error in Formula was ignored, SEMPPR's 95% PI for {phi} overlap with Formula for 42.4% of the genes. For 36.7% and 20.9% of the genes, SEMPPR's 95% PI are below and above Formula, respectively. Because q acts as a scaling factor on SEMPPR's predictions, changing q will change the range of SEMPPR's 95% PI. Figure 5a illustrates how this behavior affects the overlap between SEMPPR's predictions and Formula. The overlap between SEMPPR's 95% PI and the observed Formula values peaks at 45% when q = 1.9 x 10–7 and has equal frequencies (27.9%) of above and below errors at q = 2.96 x 10–7.


Figure 5
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 5.— Illustration of how changing the scaling coefficient q affects the degree of overlap between the empirically observed and predicted production rates, Formula and Formula, respectively. (a) Assumes that the empirical measurements are without any error, that is, Formula. (b) Uses estimated error in mRNA abundance measurements Formula as a proxy for the error in Formula. This error is actually a lower bound for the error in Formula because it does not consider the error introduced by translation rate per mRNA Formula.

 
In the second analysis, the 95% PI of the mRNA measurements m for each gene was calculated based on the Affymetrix GeneChip measurements used in Beyer et al. (2004)Go. The error was assumed to be log-normally distributed with a flat prior on log10(m). Only genes with more than 17 measurements were included. The 95% PI values were then multiplied by the estimated per mRNA translation rate Formula to generate proxy 95% PI for Formula. Given that there is measurement error in the estimates of both m and {tau}, the PIs for m are underestimates of the uncertainty in Formula. In this second analysis, SEMPPR's 95% PI for {phi} overlap with the Formula 95% PI for 76.0% of the genes. For 18.6% and 6.41% of the genes, SEMPPR's 95% PI are below and above Formula, respectively. As before, the degree of overlap between SEMPPR's predictions and the empirical data is dependent on the value of the scaling parameter q (fig. 5b). The overlap between SEMPPR's and the Formula's 95% PI peaks at 78.8% when q = 2.35 x 10–7 and have equal frequencies (10.9%) of above and below errors at q = 2.70 x 10–7.


    Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Addendum
 Acknowledgements
 References
 
The idea of linking gene expression and codon bias within a framework of selection, mutation, and drift is not new (e.g., see Bulmer 1991Go; Kurland 1992Go; Kliman and Hey 1994Go; Comeron and Kreitman 2002Go). What is new in this work is the idea of linking these processes in an explicitly mechanistic manner to develop SEMPPR. SEMPPR begins by linking the codon sequence of a genotype to its phenotype, that is, its protein production cost. The phenotype is then linked to fitness, and the probability of an allele being fixed is calculated using a population genetics model. As a last step Bayes' theorem is used, resulting in SEMPPR's ability to make inferences about a gene's protein production rate based on its observed codon sequence.

Because SEMPPR's is in part built on a model of population genetics, some of the metaphors from that field can be used for a more intuitive understanding of how SEMPPR works. Conceptually, the sequences with the minimal and maximal protein production costs, {eta}min and {eta}max, represent the highest peak and lowest point of a gene's protein production cost fitness landscape. The location of the observed sequence's cost {eta}obs on this landscape reflects the combined processes of selection, mutation, and drift. The height of the peaks and valleys of the fitness landscape are scaled by production rate of the gene {phi}. As a result, the difference in energetic usage between the highest peak and lowest valley of the fitness landscapes is small for genes with low production rates and large for genes with high production rates.

Intuitively, proteins with higher production rates are expected to show greater degrees of adaptation due to the increased importance of selection. Although this is essentially what is observed, Goequation (8) indicates that SEMPPR's inferences about the production rate of a gene is not just a function of the absolute or relative distance between the observed {eta}obs and the minimal {eta}min protein production costs. Instead, SEMPPR's inferences are based on where {eta}obs lies with respect to the entire set of possible protein production costs g({eta}) (fig. 1 and supplementary fig. S3 [Supplementary Material online]).

The strong performance of SEMPPR suggests that the one analytical result presented here, where the mode posterior production rate is 0 when an observed allele's protein production cost is less than the expected value from g({eta}), provides a conceptually clear and useful criteria for identifying pseudogenes (which by definition have 0 expression).

More generally, the process of linking genotype to phenotype and phenotype to fitness developed here provides an explicit framework for studying the structure and evolutionary behavior on a biologically realistic fitness landscape. Similar studies on fitness landscapes for transcription factor binding (Berg et al. 2004Go) and protein thermostability (Bastolla et al. 2006Go) have already produced interesting results, suggesting that the study of fitness landscapes on a molecular scale will become an increasingly important area of research.

One result from the development of SEMPPR is that, from the perspective of adaptation with respect to nonsense errors, codon bias should be a function of a gene's protein production rate. This idea is in contrast to previous studies, which have tried to link codon bias to the mRNA abundance of a gene (which affects its net protein production rate) or its protein abundance (which is affected by its net protein production rate). Recognizing that it makes biological sense to link codon usage to protein production rate rather than just mRNA or protein abundance is an important conceptual advantage of using mechanistic models. Despite their importance in the identification and initial description of codon bias, it is quality that cannot be obtained using heuristically derived metrics.

Although the population genetics model SEMPPR use requires an exponential relationship between trait values and fitness, because the scaling coefficient q appears to be very small, one could actually approximate the exponential fitness function as a linear function using a Taylor Series. Doing so yields the selection coefficient used in traditional population genetics of approximately Formula. For most observed alleles this translates into selection coefficients relative to the optimal allele ranging from 10–6 to 10–4. Given the large effective population size of S. cerevisiae, if one were simply considering competition between 2 alleles, then one would expect the optimal allele to eventually emerge and go to fixation because Formula. However, because there are a large number of differences between the observed allele and the optimal allele, it is impossible to mutate directly from one to another. Instead, the relevant selection coefficients are between alleles that are one mutational step from one another. These coefficients are likely to be much smaller than previously published estimates in E. coli (Hartl et al. 1994Go) or Drosophila (Akashi 1995Go).

Although small in size, because codon bias exists, these selection coefficients are clearly still evolutionarily significant. Further, because the substitution of suboptimal codons at the 5' end of a sequence has a greater negative effect on fitness than a similar substitution in the 3' end, the selection coefficients between the nearest mutational neighbors of an allele are not equal.

The size and nature of the selection coefficients against synonymous substitutions are clearly very subtle. However, these coefficients are important to consider when building phylogenetic models of nucleotide substitution. The framework developed here should provide a means of calculating selection coefficients and, thereby, making it possible to take such effects into account.

Currently, new techniques are being developed to account for such selection against synonymous substitutions by weighting the importance of a substitution based on codon position or specific codon type (Yang and Nielsen 2000Go; Bierne and Eyre-Walker 2003Go; Hirsh et al. 2005Go). However, these techniques are observational, rather than deductive, in nature. Further, none of these techniques take into account the effect of both expression level and codon position. By revisiting the models underlying SEMPPR, it should be possible to calculate the fixation probability of a particular codon given its position and its gene's estimated production rate. These substitution probabilities could then be incorporated into models used for phylogenetic reconstructions. These values could also be incorporated into more recent approaches testing for positive selection (e.g., Zhang et al. 2006Go).

In terms of the predictive performance of SEMPPR, the protein production rates it predicts are well correlated with the available empirical data ({rho} = 0.661 on a log–log scale). This strong correlation is in spite of the fact that the uncertainty of the empirical estimates of the protein production rate of a gene is quite large. For example, variation in the mRNA abundance measurements from Affymetrix GeneChips often ranges over 3 orders of magnitude between trials. The ribosome occupancy data upon which the per mRNA translation rates are based also appear to be quite noisy. For example, the correlation coefficient between the log-transformed data from Arava et al. (2003)Go and MacKay et al. (2004)Go have a correlation coefficient of {rho} = 0.55. Thus, it appears that the quality of SEMPPR's predictions is of similar or higher quality than those based on high-throughput data set. Further, as noted previously, the standard method by which the per mRNA translation rate {tau} is inferred from ribosomal occupancy data ignores the subtle effects of codon heterogeneity and nonsense errors. Thus taking these effects into account when estimating {tau} should improve the correlation between SEMPPR and the observed protein production rates.

The predictive performance of SEMPPR is similar to the that of the standard codon bias indices. However, SEMPPR has two important advantages. First is that its output is more interpretable from a biological and statistical perspective. The second is that the assumptions underlying models upon which it is built can always be refined further. For example, in the current formulation SEMPPR assumes that only fully translated proteins are functional. The assumption that only complete proteins have functionality is likely to be overly restrictive because most proteins are likely to maintain most or all of their functionality if all but the last few amino acids are translated. Indirect evidence for this idea is found in the fact that codon bias appears to decrease at the end of a coding sequence (Qin et al. 2004Go). In contrast, preliminary work with SEMPPR where the final 25 or 50 codons are ignored results in only minor changes to its output. Nevertheless, a more sophisticated analysis might reveal information on the general relationship between protein functionality and completeness.

One likely way SEMPPR could be improved is by refining the protein translation and population genetics models that underlie it. For example, the protein translation model assumes that ribosomes moving along a transcript do not interfere with one another. This assumption likely holds for most eukaryotes but is unlikely to hold for many prokaryotes. Works by Chou (2003)Go and Basu and Chowdhury (2007)Go suggest ways by which inter-ribosomal interference could be incorporated into the translation model used here. The translation model could also be further refined by allowing slight variations in the background nonsense error rates between different codons. Similarly, the population genetics model could be further refined to incorporate nonuniform mutation rates between codons.

One significant shortcoming of SEMPPR that does not have such a clear solution is its implicit assumption of evolutionary independence between codons and loci. This assumption ignores effects from forces such as interference selection, which also appears to play a role in the evolution of codon bias (Comeron and Kreitman 2002Go).

Ideally, in the future it will be possible to use SEMPPR to make quantitative predictions about protein production rates from the genomes of completely sequenced nonmodel organisms. This is because applying SEMPPR across a genome only requires knowledge of a relatively small number of parameters such as codon elongation rates and the background nonsense error rate of an organism. Currently, tRNA copy number can be accurately estimated from genomic data (Lowe and Eddy 1997Go; Hallin and Ussery 2004Go), and Akashi (2001)Go shows that elongation rates for different codons can be inferred from such data. In addition, work with CAI by Carbone et al. (2003)Go suggests that the codon elongation rates could be estimated from sequence data alone. Alternatively, information on these parameters from both tRNA copy number and sequence data could be combined using a Bayesian approach. Given the hundreds of nonmodel microbial organisms now being sequenced, SEMPPR could be enormously useful for understanding the ecological roles played by these less studied but nonetheless ecologically important species. This use of SEMPPR is similar in spirit to the work by Carbone et al. (2005)Go, but SEMPPR's predictions would be more mechanistically based and promise to be easier to interpret.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Addendum
 Acknowledgements
 References
 
Supplementary data, figures S1–S4, and tables S1–S3 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Addendum
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Addendum
 Acknowledgements
 References
 
The index values calculated by CodonW in the original Advanced Access version of this manuscript posted on August 16, 2007 had a significant number of errors that appear to be due to the long ORF description fields found in SGD's orf_coding. fasta file. These errors caused all of the indices to appear to perform substantially worse than they actually do. This problem was corrected on September 6, 2007.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Addendum
 Acknowledgements
 References
 
I would like to thank Aaron King, Jeffery Thorne, Lou Gross, Igor Zhulin, Russell Zaretzki, and Andreas Wagner for their helpful discussions with this work. I would also like to thank the 2 anonymous reviewers for their constructive and encouraging criticisms and A. Beyer for providing me with his database of Affymetrix measurements. In addition I am very grateful to Morten Kloster for pointing out the errors in the codon index values used in the original Advanced Access version of this manuscript. Premal Shah generated the PERL and C code. Financial support was provided by the University of Tennessee, Knoxville.


    Footnotes
 
Lauren McIntyre, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Addendum
 Acknowledgements
 References
 

    Akashi H. Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics (1994) 136:927–935.[Abstract]

    Akashi H. Inferring weak selection from patterns of polymorphism and divergence at silent sites in Drosophila DNA. Genetics (1995) 139:1067–1076.[Abstract]

    Akashi H. Gene expression and molecular evolution. Curr Opin Genet Dev (2001) 11:660–666.[CrossRef][Web of Science][Medline]

    Akashi H. Translational selection and yeast proteome evolution. Genetics (2003) 164:1291–1303.[Abstract/Free Full Text]

    Akashi H, Eyre-Walker A. Translational selection and molecular evolution. Curr Opin Genet Dev (1998) 8:688–693.[CrossRef][Web of Science][Medline]

    Arava Y, Wang YL, Storey JD, Liu CL, Brown PO, Herschlag D. Genome-wide analysis of mRNA translation profiles in Saccharomyces cerevisiae. Proc Natl Acad Sci USA (2003) 100:3889–3894.[Abstract/Free Full Text]

    Bastolla U, Porto M, Roman HE, Vendruscolo M. A protein evolution model with independent sites that reproduces site-specific amino acid distributions from the protein data bank. BMC Evol Biol (2006) 6. doi:10.1186/1471-2148-6-43.

    Basu A, Chowdhury D. Traffic of interacting ribosomes: effects of single-machine mechanochemistry on protein synthesis. Phys Rev E (2007) 75. doi:10.1103/PhysRevE-75.021902.

    Bennetzen JL, Hall BD. Codon selection in yeast. J Biol Chem (1982) 257:3026–3031.[Abstract/Free Full Text]

    Berg J, Lässig M. Stochastic evolution and transcription factor binding sites. Biophysics (2003) 48:S36–S44.

    Berg J, Willmann S, Lassig M. Adaptive evolution of transcription factor binding sites. BMC Evol Biol (2004) 4. Art. No. 021902 Part 1. doi:10.1186/1471-2148-4-42.

    Beyer A, Hollunder J, Nasheuer HP, Wilhelm T. Post-transcriptional expression regulation in the yeast Saccharomyces cerevisiae on a genomic scale. Mol Cell Proteomics (2004) 3:1083–1092.[Abstract/Free Full Text]

    Bierne N, Eyre-Walker A. The problem of counting sites in the estimation of the synonymous and nonsynonymous substitution rates: implications for the correlation between the synonymous substitution rate and codon usage bias. Genetics (2003) 165:1587–1597.[Abstract/Free Full Text]

    Bulmer M. The selection-mutation-drift theory of synonymous codon usage. Genetics (1991) 129:897–907.[Abstract]

    Butler RW, Wood ATA. Laplace approximations for hypergeometric functions with matrix argument. Ann Stat (2002) 30:1155–1177.[CrossRef]

    Carbone A, Kepes F, Zinovyev A. Codon bias signatures, organization of microorganisms in codon space, and lifestyle. Mol Biol Evol (2005) 22:547–561.[Abstract/Free Full Text]

    Carbone A, Zinovyev A, Kepes F. Codon adaptation index as a measure of dominating codon bias. Bioinformatics (2003) 19:2005–2015.[Abstract/Free Full Text]

    Chou T. Ribosome recycling, diffusion, and mRNA loop formation in translational regulation. Biophys J (2003) 85:755–773.[Web of Science][Medline]

    Comeron JM, Kreitman M. Population, evolutionary and genomic consequences of interference selection. Genetics (2002) 161:389–410.[Abstract/Free Full Text]

    Curran JF, Yarus M. Rates of aminoacyl-trans-RNA selection at 29 sense codons invivo. J Mol Biol (1989) 209:65–77.[CrossRef][Web of Science][Medline]

    Dolinski K, Balakrishnan R, Christie KR, et al, (20 co-authors). Saccharomyces genome database [Internet] (2006) [released 2006 May 12]. Available from: ftp://ftp.yeastgenome.org/yeast/.

    Elf J, Nilsson D, Tenson T, Ehrenberg M. Selective charging of tRNA isoacceptors explains patterns of codon usage. Science (2003) 300:1718–1722.[Abstract/Free Full Text]

    Eyre-Walker A. Synonymous codon bias is related to gene length in Escherichia coli: selection for translational accuracy? Mol Biol Evol (1996) 13:864–872.[Abstract]

    Freistroffer DV, Kwiatkowski M, Buckingham RH, Ehrenberg M. The accuracy of codon recognition by polypeptide release factors. Proc Natl Acad Sci USA (2000) 97:2046–2051.[Abstract/Free Full Text]

    Gilchrist MA, Wagner A. A model of protein translation including codon bias, nonsense errors, and ribosome recycling. J Theor Biol (2006) 239:417–434.[CrossRef][Web of Science][Medline]

    Gouy M, Gautier C. Codon usage in bacteria—correlation with gene expressivity. Nucleic Acids Res (1982) 10:7055–7074.[Abstract/Free Full Text]

    Greenbaum D, Colangelo C, Williams K, Gerstein M. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol (2003) 4:117–124.[CrossRef][Medline]

    Gygi SP, Rochon Y, Franza BR, Aebersold R. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol (1999) 19:1720–1730.[Abstract/Free Full Text]

    Hallin PF, Ussery DW. CBS genome atlas database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics (2004) 20:3682–3686.[Abstract/Free Full Text]

    Hartl DL, Moriyama EN, Sawyer SA. Selection intensity for codon bias. Genetics (1994) 138:227–234.[Abstract]

    Hirsh AE, Fraser HB, Wall DP. Adjusting for selection on synonymous sites in estimates of evolutionary distance. Mol Biol Evol (2005) 22:174–177.[Abstract/Free Full Text]

    Hooper SD, Berg OG. Gradients in nucleotide and codon usage along Escherichia coli genes. Nucleic Acids Res (2000) 28:3517–3523.[Abstract/Free Full Text]

    Ikemura T. Correlation between the abundance of Escherichia-coli transfer-RNAs and the occurrence of the respective codons in its protein genes—a proposal for a synonymous codon choice that is optimal for the Escherichia-coli translational system. J Mol Biol (1981) 151:389–409.[CrossRef][Web of Science][Medline]

    Ikemura T. Codon usage and transfer-RNA content in unicellular and multicellular organisms. Mol Biol Evol (1985) 2:13–34.[Abstract]

    Jorgensen F, Kurland CG. Processivity errors of gene-expression in Escherichia coli. J Mol Biol (1990) 215:511–521.[CrossRef][Web of Science][Medline]

    Kliman RM, Hey J. The effects of mutation and natural-selection on codon bias in the genes of Drosophila. Genetics (1994) 137:1049–1056.[Abstract]

    Kliman RM, Irving N, Santiago M. Selection conflicts, gene expression, and codon usage trends in yeast. J Mol Evol (2003) 57:98–109.[CrossRef][Web of Science][Medline]

    Kruger MK, Pedersen S, Hagervall TG, Sorensen MA. The modification of the wobble base of tRNA(glu) modulates the translation rate of glutamic acid codons in vivo. J Mol Biol (1998) 284:621–631.[CrossRef][Web of Science][Medline]

    Kurland CG. Translational accuracy and the fitness of bacteria. Annu Rev Genet (1992) 26:29–50.[Web of Science][Medline]

    Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res (1997) 25:955–964.[Abstract/Free Full Text]

    MacKay VL, Li XH, Flory MR. (12 co-authors). Gene expression analyzed by high-resolution state array analysis and quantitative proteomics–response of yeast to mating pheromone. Mol Cell Proteomics (2004) 3:478–489.[Abstract/Free Full Text]

    Manley JL. Synthesis and degradation of termination and premature-termination fragments of beta-galactosidase in vitro and in vivo. J Mol Biol (1978) 125:407–432.[CrossRef][Web of Science][Medline]

    Peden J. Codonw (2005) v.1.4.2. Available from: http://sourceforge.net/projects/codonw/ (accessed on 2007 Sept 1).

    Percudani R, Pavesi A, Ottonello S. Transfer RNA gene redundancy and translational selection in Saccharomyces cerevisiae. J Mol Biol (1997) 268:322–330.[CrossRef][Web of Science][Medline]

    Plotkin JB, Dushoff J, Desai MM, Fraser HB. Codon usage and selection on proteins. J Mol Evol (2006) 63:635–653.[CrossRef][Web of Science][Medline]

    Qin H, Wu WB, Comeron JM, Kreitman M, Li WH. Intragenic spatial patterns of codon usage bias in prokaryotic and eukaryotic genomes. Genetics (2004) 168:2245–2260.[Abstract/Free Full Text]

    Sella G, Hirsh AE. The application of statistical physics to evolutionary biology. Proc Natl Acad Sci USA (2005) 102:9541–9546.[Abstract/Free Full Text]

    Sharp PM, Li WH. The codon adaptation index—a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res (1987) 15:1281–1295.[Abstract/Free Full Text]

    Thomas LK, Dix DB, Thompson RC. Codon choice and gene-expression—synonymous codons differ in their ability to direct aminoacylated-transfer RNA-binding to ribosomes invitro. Proc Natl Acad Sci USA (1988) 85:4242–4246.[Abstract/Free Full Text]

    Tsung K, Inouye S, Inouye M. Factors affecting the efficiency of protein-synthesis in Escherichia coli—production of a polypeptide of more than 6000 amino-acid residues. J Biol Chem (1989) 264:4428–4433.[Abstract/Free Full Text]

    Wagner A. Energy constraints on the evolution of gene expression. Mol Biol Evol (2005) 22:1365–1374.[Abstract/Free Full Text]

    Wolfram Research Inc. Mathematica. version 5.2 (2005) Champaign, (IL): Wolfram Research Inc.

    Wright F. The effective number of codons used in a gene. Gene (1990) 87:23–29.[CrossRef][Web of Science][Medline]

    Yang ZH, Nielsen R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol (2000) 17:32–43.[Abstract/Free Full Text]

    Zhang Z, Li J, Yu J. Computing ka and ks with a consideration of unequal transitional substitutions. BMC Evol Biol Art (2006) 6.

Accepted for publication August 1, 2007.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Material
Right arrow All Versions of this Article:
24/11/2362    most recent
msm169v2
msm169v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gilchrist, M. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gilchrist, M. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?