MBE Advance Access originally published online on January 4, 2007
Molecular Biology and Evolution 2007 24(3):836-844; doi:10.1093/molbev/msl212
Research Articles |
Gene Expression and Protein Length Influence Codon Usage and Rates of Sequence Evolution in Populus tremula
Umeå Plant Science Centre, Department of Ecology and Environmental Science, Umeå University, Umeå, Sweden
E-mail: pelle{at}wallace.emg.umu.se.
| Abstract |
|---|
Codon bias is generally thought to be determined by a balance between mutation, genetic drift, and natural selection on translational efficiency. However, natural selection on codon usage is considered to be a weak evolutionary force and selection on codon usage is expected to be strongest in species with large effective population sizes. In this paper, I study associations between codon usage, gene expression, and molecular evolution at synonymous and nonsynonymous sites in the long-lived, woody perennial plant Populus tremula (Salicaceae). Using expression data for 558 genes derived from expressed sequence tag (EST) libraries from 19 different tissues and developmental stages, I study how gene expression levels within single tissues as well as across tissues affect codon usage and rates of sequence evolution at synonymous and nonsynonymous sites. I show that gene expression has direct effects on both codon usage and the level of selective constraint of proteins in P. tremula, although in different ways. Codon usage of a gene is primarily determined by how highly expressed the gene is, whereas rates of sequence evolution are primarily determined by how widely expressed genes are. In addition to the effects of gene expression, protein length appears to be an important factor influencing virtually all aspects of molecular evolution in P. tremula.
Key Words: codon bias, gene expression, Populus, translational selection
| Introduction |
|---|
Nonrandom codon usage, or codon bias, is a common phenomenon in a wide variety of organisms, including prokaryotes, animals, and plants (see reviews in Akashi 2001
GC) occur more often than others across the genome of an organism, by local variations in the base composition, such as that represented by the strong isochore structure seen, for instance, in mammals (Eyre-Walker and Hurst 2001
Natural selection on codon usage is considered to be a weak evolutionary force (Nes ≈ 1, where s is the selection coefficient), and selection is therefore expected to be most efficient in species with large effective population sizes (Ne), such as prokaryotes and unicellular eukaryotes (e.g., Ikemura 1985; Sharp and Li 1986). In species with low Ne, genetic drift should be the predominant force shaping codon usage and overpower translational selection on codon variants. For instance, mutational biases have generally been believed to be the driving force of codon usage in many mammals, where Ne is usually low (Francino and Ochman 1999
). Nevertheless, levels of gene expression have been shown to be positively correlated with codon bias in a number of different eukaryotic organisms, such as Caenorhabditis elegans, Drosophila melanogaster, and Arabidopsis thaliana (Powell and Moriyama 1997
; Duret and Mouchiroud 1999
; Marais et al. 2001
). Additional support for translational selection comes from strong associations between tRNA abundance and codon bias, where synonymous codons preferentially used in highly expressed genes correspond to the most abundant tRNAs (Ikemura 1985
; Moriyama and Powell 1997
; Duret 2000
).
In addition to influencing codon usage in organisms, gene expression can also influence the rate of evolutionary change in a lineage. For instance, in many bacteria and unicellular eukaryotes, gene expression and codon bias are negatively correlated with substitution rates at synonymous sites (dS; Akashi 2001
). This pattern is consistent with translational selection constraining the rate of evolution at synonymous sites because in highly expressed genes, with high codon bias, most mutations will be to unpreferred codons that are slightly deleterious. This results in a reduction in the rate of synonymous substitutions in highly expressed genes and ultimately to a correlation between gene expression and dS. In multicellular eukaryotes, however, evidence for a correlation between dS and gene expression is weaker; such a correlation has been established in Drosophila (Powell and Moriyama 1997
; Bierne and Eyre-Walker 2003
; Marais et al. 2004
, although the evidence supporting it has been questioned, Dunn et al. 2001
). Recent studies in Arabidopsis (Wright et al. 2004
) and in mammals (Duret and Mouchiroud 2000
) have, however, failed to establish this pattern.
Gene expression has also recently been shown to be correlated with substitution rates at nonsynonymous sites (dN) in several different species. The underlying causes of these observations remain unclear, and several alternative explanations have been put forward. For instance, highly expressed genes, showing strong codon bias, could be under selective constraint to optimize the efficiency and accuracy of protein synthesis (Akashi 2001
). Alternatively, highly expressed genes are likely to be involved in a larger number of biochemical processes (Kuma et al. 1995
) or are expressed in a greater number of different tissues (Duret 2000
) than lowly expressed genes. High-expression genes are therefore expected to experience greater selective constraints, resulting in a reduced substitution rate of nonsynonymous mutations.
In this paper, I study associations between codon usage, gene expression, and molecular evolution at synonymous and nonsynonymous sites in the long-lived, woody perennial plant Populus tremula (Salicaceae). I use data on gene expression from EST libraries derived from several different tissues and/or developmental stages. This allows for studies of how gene expression both within single tissues as well as across tissues affects codon usage and sequence evolution at synonymous and nonsynonymous sites. The results show that both expression level and breadth are important factors determining codon usage in P. tremula as both traits are positively correlated to codon bias. However, protein evolution appears to be primarily determined by expression breadth, consistent with the idea that genes expressed in many tissues experience higher selective constraints.
| Material and Methods |
|---|
Sequence Retrieval and Alignments
Putative coding sequences were obtained for 811 P. tremula genes from PopulusDB (Sterky et al. 2004
Estimation of Codon Bias and tRNA Abundance
Estimates of codon bias, measured as the frequency of optimal codons (Fop), were obtained for all 558 sequences in the data set using the program CodonW (version 1.4.2, http://codonw.sourceforge.net/). Correspondence analysis of synonymous codon usage was also performed using CodonW. Differences in codon usage between highly expressed genes and lowly expressed genes were based on comparisons of the most extreme genes from the first axis of the correspondence analysis (Chiapello et al. 1998; Wright et al. 2004). To calculate Fop, optimal codons were inferred from the pattern of codon usage in Populus described in Sterky et al. (2004). CodonW was also used to estimate GC content for the complete coding sequences (GC), at third positions (GC3s) and in noncoding 5' and 3' UTRs (GCnc).
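The two codon-level measures used here can be illustrated with a minimal sketch. It is not CodonW itself: it assumes a simplified Fop (optimal codons divided by all codons that have a synonymous alternative, so Met, Trp, and stop codons are excluded) and a simplified GC3s (G or C at the third position of those same codons). The set of optimal codons and the example sequence are hypothetical placeholders, not the Populus set from Sterky et al. (2004).

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_codons(cds):
    """Split an in-frame CDS into codons; drop Met, Trp, stop, and ambiguous codons."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [c for c in codons if CODON_TABLE.get(c, "*") not in ("M", "W", "*")]


def fop(cds, optimal_codons):
    """Frequency of optimal codons among codons with a synonymous alternative."""
    codons = synonymous_codons(cds)
    if not codons:
        return float("nan")
    return sum(c in optimal_codons for c in codons) / len(codons)


def gc3s(cds):
    """Fraction of G or C at third positions of codons with a synonymous alternative."""
    codons = synonymous_codons(cds)
    if not codons:
        return float("nan")
    return sum(c[2] in "GC" for c in codons) / len(codons)


# Hypothetical inputs for illustration only.
HYPOTHETICAL_OPTIMAL = {"GCC", "AAG"}
example_cds = "ATGGCTGCCAAGAAATGGTAA"  # Met Ala Ala Lys Lys Trp stop
print(fop(example_cds, HYPOTHETICAL_OPTIMAL), gc3s(example_cds))
```

In the example, only the two Ala and two Lys codons enter the denominators, which is why short sequences give coarse estimates of both quantities.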
The P. trichocarpa genome sequence (Tuskan et al. 2006) was scanned for the number of tRNA genes using the program tRNAscan-SE (http://selab.janelia.org/tRNAscan-SE/). To enable comparisons with the preferred codons identified by CodonW, tRNAs were grouped by codons following the wobble rules for eukaryotes (Percudani 2001). The tRNA genes identified in the P. trichocarpa genome generally follow the wobble rules because every codon is for the most part decoded by a single class of tRNAs. Some unexpected tRNA genes were identified, but whether these represent true genes, pseudogenes, or just sequencing errors is presently not clear (Tuskan et al. 2006). Using the abundance of tRNA genes as a substitute for available levels of tRNAs is justified because tRNA gene copy numbers are generally correlated with cellular levels of tRNAs in both prokaryotes and eukaryotes (Kanaya et al. 1999).
Expression Profiles
Libraries containing approximately 121,000 EST sequences from 19 different Populus tissues in various developmental stages were obtained from PopulusDB. These 19 libraries are described in detail in Sterky et al. (2004). The EST libraries from the version of PopulusDB used in this paper (downloaded on 25 May 2006) are also summarized in figure 2. The libraries are derived from a number of different Populus species (Sterky et al. 2004), but the low sequence divergence seen in coding regions between members of the genus Populus (>96% identity; Sterky et al. 2004; Ingvarsson 2005) suggests that this will not influence the estimation of expression profiles from the EST libraries.
The coding sequences in the complete data set were filtered with the XBlast program to mask out repetitive elements. Expression profiles were then obtained for all genes in the data set by BlastN searches against each library using stringent matching criteria. Alignments were required to show at least a 90% identity across 100 bp to be recorded as a match. The number of BlastN hits was used as a proxy for expression levels. Expression level for genes was recorded for each library separately, and the maximum expression of each gene across libraries was used as a measure of the global level of gene expression. In addition to expression level, I also estimated expression breadth, defined as the number of libraries where a gene scored at least one hit. One of the libraries (Y, virus-/fungus-infected leaves) was normalized before EST sequencing (see Sterky et al. 2004
Estimation of Substitution Rates
Synonymous and nonsynonymous nucleotide substitution rates per site were calculated for all genes using the maximum likelihood method of Goldman and Yang (1994), implemented in the Codeml program from the PAML package (version 3.14, Yang 1997). The estimation was performed assuming transition/transversion bias and with codon frequencies calculated from average nucleotide frequencies (F1×4). The substitution rate in noncoding regions (5' and 3' UTRs) was calculated assuming transition/transversion bias and unequal base frequencies using the Hasegawa–Kishino–Yano model (HKY85, Hasegawa et al. 1985) implemented in the BASEML program, also from the PAML package.
To test for evidence of positive selection, PAML was used to fit 2 models that allow the dN/dS ratio to vary between codons (models M7 and M8 from Yang et al. 2000). Model M7 allows the dN/dS ratio to vary across sites, but constrains the ratio to between 0 and 1, whereas model M8 adds an extra class of sites with positive selection (dN/dS > 1). A gene is assumed to be under positive selection if a likelihood ratio test comparing model M7 with model M8 is significant at P < 0.01.
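The likelihood ratio test described above can be sketched in a few lines. The sketch assumes that twice the difference in log-likelihood between M8 and M7 is compared with a chi-square distribution with 2 degrees of freedom, for which the upper-tail probability has the closed form exp(-x/2). The function name and the log-likelihood values in the example are hypothetical; in practice the values would be read from the Codeml output.

```python
import math


def lrt_m7_vs_m8(lnl_m7, lnl_m8, alpha=0.01):
    """Likelihood ratio test of M7 (null) against M8 (alternative), df = 2.

    Returns the test statistic 2 * (lnL_M8 - lnL_M7), its P value, and
    whether the gene would be flagged at the chosen significance level.
    """
    statistic = max(2.0 * (lnl_m8 - lnl_m7), 0.0)  # guard against tiny negative values
    p_value = math.exp(-statistic / 2.0)  # chi-square upper tail for exactly 2 df
    return statistic, p_value, p_value < alpha


# Hypothetical log-likelihoods chosen so that the statistic is 11.33.
stat, p, flagged = lrt_m7_vs_m8(lnl_m7=-1005.665, lnl_m8=-1000.0)
print(round(stat, 2), round(p, 4), flagged)
```

The closed form only holds for 2 degrees of freedom; a comparison with a different number of extra parameters would need a general chi-square survival function.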
Statistical Analyses
I used the statistical package R (Ihaka and Gentleman 1996) for all analyses. All correlations are based on the nonparametric Spearman rank correlation (ρ). This measure of association makes no distributional assumptions about the underlying data.
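For readers unfamiliar with the statistic, the following is a minimal sketch of a Spearman rank correlation, computed as the Pearson correlation of the ranks with tied values given their average rank. It is written in plain Python for illustration and is not the R routine used in the analyses; the example values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical EST hit counts and Fop values for five genes.
hits = [1, 2, 3, 4, 50]
fop_values = [0.30, 0.31, 0.35, 0.36, 0.40]
print(spearman_rho(hits, fop_values))
```

Because only the ordering of the values matters, the extreme hit count in the example has no more influence than any other observation.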
Given the potentially complex interplay of factors affecting both codon usage and rates of sequence evolution, traditional analysis techniques, like multiple regression or correlation analysis, are often unsuitable. One way to get around this problem is to use generalizations of regression models, such as path analysis, that specifically allow for interdependence between different variables. Path analysis is attractive because it allows for the construction of putatively causal schemes with multiple, possibly interdependent variables and also quantifies the importance of unmeasured factors affecting the dependent variables (Loehlin 2004).
Here I use path analysis to estimate direct and indirect effects of the measured variables on codon bias and the selective constraint of genes. The result of the path model is displayed in a path diagram where double-headed arrows indicate 2 variables that are associated with each other, but where no assumptions are made about the causality of this association. In contrast, single-headed arrows indicate presumed causal relationships, where the variable at the base of the arrow has a causal effect on the variable at the head of the arrow. These relationships are measured using standardized regression coefficients, so the relative strength of different factors can easily be compared. Before performing the path analysis, variables were either log or square root transformed to improve normality. The path analysis was performed using the sem package in R.
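The logic of separating direct from indirect effects can be illustrated with the simplest possible path model: two correlated predictors and one response. The sketch below is not the sem fit used in this paper; it assumes standardized variables and uses the textbook two-predictor formulas for standardized regression coefficients. As illustrative inputs it reuses three rank correlations reported in the Results (Fop with expression level, 0.460; Fop with expression breadth, 0.395; level with breadth, 0.471), so the output is not expected to match the fitted path coefficients, which also include protein length and use transformed variables.

```python
import math


def two_predictor_paths(r_y1, r_y2, r_12):
    """Path coefficients for y <- x1, y <- x2 with x1 <-> x2 correlated.

    r_y1, r_y2: correlations of the response with each predictor.
    r_12: correlation between the two predictors (double-headed arrow).
    """
    denom = 1.0 - r_12 ** 2
    b1 = (r_y1 - r_y2 * r_12) / denom  # direct effect of x1 on y
    b2 = (r_y2 - r_y1 * r_12) / denom  # direct effect of x2 on y
    r_squared = b1 * r_y1 + b2 * r_y2  # variance explained by both predictors
    return {
        "direct_1": b1,
        "direct_2": b2,
        "shared_1": b2 * r_12,  # part of r_y1 routed through the x1 <-> x2 correlation
        "shared_2": b1 * r_12,  # part of r_y2 routed through the x1 <-> x2 correlation
        "r_squared": r_squared,
        "residual_path": math.sqrt(1.0 - r_squared),  # unmeasured factors
    }


# Illustrative inputs: x1 = expression level, x2 = expression breadth, y = Fop.
paths = two_predictor_paths(r_y1=0.460, r_y2=0.395, r_12=0.471)
for name, value in paths.items():
    print(f"{name}: {value:.3f}")
```

Each observed correlation decomposes exactly into a direct path plus a component shared through the predictor correlation (direct_1 + shared_1 equals r_y1), which is the bookkeeping that a path diagram makes explicit.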
| Results and Discussion |
|---|
I have analyzed 558 genes from P. tremula for which an ortholog could be unambiguously identified in the P. trichocarpa genome sequence. Codon bias was estimated for all genes and data on gene expression were obtained from 19 different EST libraries covering a wide range of tissues and developmental stages. Substitution rates at synonymous and nonsynonymous sites and in 5' and 3' UTRs were also estimated for all genes using PAML.
Codon Bias and tRNA Abundance in Populus
A whole-genome scan for tRNA genes identified a total of 853 putative tRNA genes and an additional 44 pseudogenes, numbers that are similar to those found by Tuskan et al. (2006). Within codon classes there is a good correspondence between tRNA abundance and optimal codons, that is, codons that show significantly higher frequencies in highly expressed genes compared with weakly expressed genes. When codons are combined based on tRNA isoacceptors, the correlation between putatively "optimal" codons, based on the correspondence analysis of codon usage, and tRNA abundance in the Populus genome sequence is significantly positive (Spearman's rank correlation, ρ = 0.840, P < 0.001; table 1). With the exception of 2 amino acids (Lys and Pro), the optimal codon corresponds to the most abundant tRNA when isoaccepting codons are taken into account.
Codon bias, measured as the frequency of optimal codons (Fop), averaged 0.340 across the 558 genes in P. tremula. Codon bias was highly correlated with GC content at third positions (ρ = 0.607, P << 0.001), suggesting that optimal codons in P. tremula tend to end in G or C, similar to what has been found in other dicots (Chiapello et al. 1998
ρ = 0.019, P = 0.659) or between GC content at third positions and GC content in noncoding regions (ρ = 0.0167, P = 0.704). Base composition at synonymous sites thus appears to be effectively uncoupled from that of the surrounding noncoding regions, suggesting that mutational biases or transcription-coupled mutations are not driving codon usage in P. tremula and that codon bias is likely a product of translational selection (Duret 2002
Translational selection shaping codon usage in P. tremula is also supported by a strong positive correlation between codon bias and total expression level (ρ = 0.460, P << 0.001). In addition, the correlation between codon bias and expression breadth is positive and significant (ρ = 0.395, P << 0.001).
Total gene expression is known to be highly influenced by the number of different tissues a gene is expressed in (expression breadth) when expression data are calculated from pooled EST libraries (Akashi 2001; Urrutia and Hurst 2001). This could generate spurious correlations between codon bias and gene expression if expression breadth is the predominant force affecting codon usage. I tried to partly alleviate this problem by using the maximum expression across the different tissues as the measure of gene expression level. Despite this, expression level and expression breadth remain highly correlated (ρ = 0.471, P < 0.001, see also below).
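The two expression measures can be made concrete with a short sketch. It assumes that each BlastN hit is available as a (library, percent identity, alignment length) record, reads the matching criterion from the Methods as at least 90% identity over at least 100 bp, and then takes expression level as the maximum hit count in any single library and expression breadth as the number of libraries with at least one hit. The record layout, function names, and library labels are hypothetical.

```python
from collections import Counter


def count_hits(hits, min_identity=90.0, min_length=100):
    """Count accepted BlastN hits per library for one gene."""
    counts = Counter()
    for library, identity, length in hits:
        if identity >= min_identity and length >= min_length:
            counts[library] += 1
    return counts


def expression_profile(counts):
    """Return (expression level, expression breadth) from per-library hit counts."""
    level = max(counts.values(), default=0)  # maximum across libraries
    breadth = sum(1 for n in counts.values() if n >= 1)  # libraries with a hit
    return level, breadth


# Hypothetical hits for a single gene in three libraries.
example_hits = [
    ("A", 98.0, 250),
    ("A", 95.5, 120),
    ("B", 99.0, 300),
    ("B", 85.0, 400),  # rejected: identity below 90%
    ("C", 97.0, 60),   # rejected: alignment shorter than 100 bp
]
counts = count_hits(example_hits)
print(expression_profile(counts))
```

Using the maximum rather than the sum means that a gene with a few hits in many libraries does not automatically score a high expression level, although the two measures can still be correlated.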
However, the strong positive association between Fop and gene expression level (fig. 1) remains when separate analyses are done on the 19 different EST libraries. The observation of higher codon bias in genes with high gene expression seen in the total data set also holds for the data collected from the different tissues (fig. 2). The association between codon bias and gene expression is substantially stronger in tissues where gene expression is expected to be high, such as tissues where active cell division and growth are taking place (e.g., cambial zone and shoot meristem, the possible exception being young leaves). Conversely, the association between codon bias and gene expression is weak or even absent in tissues with low levels of active transcription (e.g., senescing leaves and dormant cambium). These observations are in general agreement with the positive association observed between codon bias and gene expression as more genes in actively growing tissues are highly expressed and should therefore be under stronger influence from translational selection. The results based on the individual libraries are less likely to be confounded by the possible effects of expression breadth, highlighting the direct association between codon bias and gene expression level, and strengthen the evidence for translational selection being an important force shaping codon usage in P. tremula. It is worth pointing out that the magnitude of the correlation between Fop and gene expression does not depend on the number of ESTs in the different libraries (ρ = 0.06, ns). Variation in the strength of the association between codon bias and gene expression between libraries therefore likely represents real biological differences and not simply differences between the different libraries in the ability to detect this association.
Codon bias is also negatively correlated with protein length (ρ = −0.155, P < 0.001). If the data are restricted to genes for which the entire coding region is available, the correlation is even stronger (ρ = −0.248, P < 0.001) and the effect remains after factoring out the effects of gene expression (partial ρ = −0.198, P < 0.01). These results are similar to those observed in a number of different organisms (C. elegans, D. melanogaster, and A. thaliana, Duret and Mouchiroud [1999]
Rates of Synonymous and Nonsynonymous Substitutions among Genes
Maximum likelihood estimates of divergence at synonymous sites (dS) vary by almost 2 orders of magnitude across the 558 genes (range, 2.2 × 10⁻³ to 1.45 × 10⁻¹), and 95% of the genes have dS values between 0.006 and 0.102. These values are of the same magnitude as estimates of dS obtained by Unneberg et al. (2005)
for P. tremula using a different set of genes.
The synonymous substitution rate is positively correlated with substitution rates in the surrounding noncoding 5' and 3' UTRs (dUTR, ρ = 0.179, P < 0.001). Interestingly, substitutions in the noncoding regions appear more constrained than substitutions at synonymous sites; the median dUTR/dS ratio is 0.81, suggesting that the substitution rate in noncoding regions is roughly 20% lower than at synonymous sites. This observation mirrors recent observations from both mammals and Drosophila that suggest stronger selective constraints on intergenic DNA than on synonymous sites (Halligan et al. 2004
; Osada et al. 2005
; Halligan and Keightley 2006
). Andolfatto (2005) and Halligan and Keightley (2006) found that mean divergence in 5' flanking regions in Drosophila was between 25% and 55% lower than divergence at synonymous sites. Similarly, Osada et al. (2005) showed that the substitution rate in 5' UTRs was 10–20% lower than at synonymous sites in humans. These studies suggest that strong selective constraint is acting on noncoding regions in the vicinity of protein-coding genes, and presumably 5' UTRs contain strongly conserved regions involved in gene regulation, such as transcription factor–binding sites. Whether strong selective constraint is also acting on UTRs in Populus remains to be determined, but the results are suggestive of such an effect.
Rates of nonsynonymous substitutions (dN) also vary substantially between genes, ranging from no variation, which was observed in 61 genes, to a maximum of 0.0479. Maximum likelihood estimates of dS and dN are also correlated across genes (ρ = 0.328, P << 0.001). However, variation in dN among genes is roughly 40% higher than variation in dS (CVdS = 0.69 and CVdN = 0.95), in line with the higher heterogeneity in functional constraints expected among different proteins.
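The comparison of variability rests on the coefficient of variation (standard deviation divided by the mean), and the "roughly 40%" figure follows directly from the two reported values. The sketch below shows both steps; it assumes the sample standard deviation, which the text does not specify, and the list of rates is a hypothetical placeholder.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumes a nonzero mean)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical per-gene rates, for illustration only.
example_rates = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(coefficient_of_variation(example_rates), 3))

# Relative excess of variation in dN over dS from the two reported CVs.
cv_ds, cv_dn = 0.69, 0.95
relative_excess = cv_dn / cv_ds - 1.0
print(f"{relative_excess:.1%}")
```

Because the coefficient of variation is scaled by the mean, it allows the spread of dN and dS to be compared even though dN values are much smaller on average.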
I calculated the dN/dS ratios for all genes and even though the values vary widely among genes, the median dN/dS ratio equals 0.175, suggesting strong functional constraints at most genes. Again, this is almost identical to the mean dN/dS ratio of 0.177 obtained by Unneberg et al. (2005) in a comparison of P. tremula versus P. trichocarpa. A total of 21 genes (3.8%) had dN/dS ratios that exceeded 1; however, only one gene had a dN/dS ratio that was significantly greater than one based on the comparison of models M7 and M8 in PAML (2ΔL = 11.33, degrees of freedom = 2, P = 0.0035). This gene is annotated as a putative D-isomer-specific 2-hydroxyacid dehydrogenase. This test has very low power to detect the action of positive selection, however, as few relatively closely related sequences are compared in each test (Anisimova et al. 2001).
There is no evidence for associations between gene expression, codon bias, and rates of synonymous substitutions (dS) in P. tremula. These results are similar to data from other multicellular eukaryotes, where few studies have documented a negative association between gene expression and dS (Akashi 2001
). A correlation between gene expression and dS is expected if selection on codon usage is very strong. Indeed, dS is negatively correlated with codon bias in many unicellular organisms where translational selection appears to be a major force shaping codon usage, presumably because of high Ne in these species. If selection on codon usage is weak, as is expected in most multicellular eukaryotes where Ne is relatively small, theory suggests that codon bias will exert only a minor influence on the synonymous substitution rate (McVean and Vieira 2001
). Moreover, the lack of a correlation could be due to methodological issues dealing with how dS is calculated (Bierne and Eyre-Walker 2003
).
Contrary to what was seen for dS, dN is negatively correlated with both codon bias (ρ = −0.144, P = 0.0012) and with the EST-based measures of gene expression (expression level, ρ = −0.143 and expression breadth, ρ = −0.238, P < 0.001 in both cases). Similar results hold also for dN/dS; in fact, the correlations between dN/dS and both codon bias (ρ = −0.176, P < 0.001) and gene expression are stronger than for dN (expression level, ρ = −0.198, P < 0.001, expression breadth, ρ = −0.280, P << 0.001). Both nonsynonymous substitution rates and levels of selective constraint are also significantly associated with gene length (dN: ρ = 0.221, P < 0.001, dN/dS: ρ = 0.194, P < 0.001).
Disentangling the Factors Affecting Codon Usage and Protein Evolution
Using the simple correlation analyses above, there is evidence for expression level, expression breadth, and protein length all being significantly associated with both codon bias and rates of nonsynonymous sequence evolution. However, these factors are also correlated with each other, and to disentangle the direct and indirect effects of these variables I used a path analysis approach. Path analysis is useful in these situations because it can explicitly accommodate correlations among variables.
As can be seen in figure 3, when the effects of other variables are factored out, expression level has the largest direct effect on codon bias. Although expression breadth and protein length also affect codon bias, the magnitude of these effects is roughly half that of the effect of expression level (fig. 3). In total, expression level, expression breadth, and protein length explain approximately 27% of the observed variation in codon bias among different genes.
Similar results have been found in Arabidopsis (Duret and Mouchiroud 1999
Codon bias is thought to reflect a balance between mutation, genetic drift, and natural selection favoring translation efficiency. Although natural selection plays an important role in generating codon bias, it is not clear whether selection acts on protein elongation rate, the cost of proofreading, or translational accuracy (Duret and Mouchiroud 1999). Regardless, selective differences between alternate synonymous codons are small and strong codon bias is only expected in species with sufficiently large population sizes. In that respect, it may seem somewhat puzzling that the evidence for codon bias is relatively strong in P. tremula, given that it is a woody perennial with fairly long generation times (15–20 years/generation). However, P. tremula is one of the more widespread plant species in the world and sequence-based estimates suggest that Ne ≈ 1 × 10⁶ to 5 × 10⁶ (Ingvarsson 2005), which is similar to estimates of Ne from D. melanogaster (Akashi 1997). Populus tremula is also obligately outcrossing and has very low levels of linkage disequilibrium (Ingvarsson 2005). Weakly selected mutations, such as alternate synonymous codons, are therefore expected to be less susceptible to the effects of interference between linked mutations (Comeron and Kreitman 2002).
Selective constraint on proteins, measured as dN/dS, appears to be primarily determined by expression breadth and protein length in P. tremula (fig. 3). Qualitatively similar results were obtained if dN was used instead of dN/dS (results not shown). Importantly, when the effects of expression breadth are taken into account, the relationships between dN/dS and expression level and between dN/dS and codon bias are essentially zero (fig. 3). These results suggest that expression breadth, rather than expression level, is the primary factor affecting the selective constraint of proteins in Populus and that translational selection apparently plays no role in this process (figs. 3 and 4).
Similarly, there is no discernible effect of codon bias on protein evolution (fig. 3). It is conceivable, however, that gene expression has no direct influence on selective constraint but rather that the effect of gene expression is mediated through its effect on codon bias. This hypothesis amounts to constraining the direct paths connecting expression level and expression breadth to dN/dS, in figure 3, to zero. Because this "reduced" model is nested within the model depicted in figure 3, the 2 models can be compared using a likelihood ratio test. This test, however, shows that the constrained model where gene expression variables do not directly influence dN/dS is substantially less likely than a model where these variables are allowed to directly influence protein evolution (2ΔL = 26.05, P < 0.001).
The observation of lower dN/dS ratios in broadly expressed genes is consistent with selective constraints on proteins expressed in many tissues (fig. 4). If translational selection were the main force driving amino acid substitutions in Populus, there should be little or no effect of expression breadth once expression level has been factored out. Rather, the analysis suggests that expression breadth is the predominant force affecting protein evolution, with little residual effect of expression level. This is consistent with strong purifying selection on genes expressed in many tissues, either because they must function in a wider set of biochemical environments, because they are involved in a greater array of biochemical pathways (Kuma et al. 1995), or because mutations in genes expressed in many tissues have greater effects on fitness (Duret and Mouchiroud 1999). Wright et al. (2004) also found that expression breadth exerted stronger influence than gene expression level on dN in A. thaliana, and Duret and Mouchiroud (2000) attributed the more than 3-fold lower nonsynonymous substitution rate seen in ubiquitously expressed genes in human–rodent comparisons to greater selective constraints acting on these genes.
Drummond et al. (2006) suggest that a single underlying factor, representing the number of translation events and determined by roughly equal contributions of gene expression, codon bias, and protein abundance, is the main determinant of sequence evolution in yeast. Because yeast is unicellular, expression breadth cannot, by definition, affect protein evolution. Drummond et al. (2006) included a measure of dispensability, that is, how essential a gene is for the total fitness of an organism, which might be related to expression breadth in multicellular organisms. Nevertheless, Drummond et al. (2006) showed that dispensability had essentially no effect on protein evolution, in contrast to earlier studies (Wall et al. 2005). On the other hand, Lemos et al. (2005) showed that protein evolution in D. melanogaster is negatively correlated with the number of interactions a given protein is involved in. Proteins involved in a greater number of interactions are also likely to be broadly expressed and hence under stronger stabilizing selection (Lemos et al. 2005).
Protein length appears to play a significant role in shaping both codon bias and selective constraint of proteins in P. tremula (fig. 3). There is a strong negative relationship between protein length and codon bias (fig. 3). Several earlier studies have also documented strong effects of gene length on codon bias in a variety of organisms (e.g., Duret and Mouchiroud 1999; Lemos et al. 2005; Stenoien 2005). Protein length is quantitatively the largest force affecting the selective constraint of proteins in P. tremula (fig. 3). The effect of protein length on sequence evolution in P. tremula is positive, and purifying selection is hence weaker in genes with longer coding regions. The results from P. tremula mirror those obtained by Lemos et al. (2005), who also showed a significant positive correlation between gene length and protein evolution in D. melanogaster that was virtually independent of gene expression level. At present, one can only speculate about the reasons behind these observations. Theoretical models that have studied the effect of gene length on sequence evolution have focused on the consequences of interference between different selected sites, known as the Hill–Robertson effect (Comeron et al. 1999; McVean and Charlesworth 2000; Marais et al. 2005). Interference selection predicts reduced efficiency of natural selection in longer genes where many sites can be subject to simultaneous selection. This prediction thus fits with observations from P. tremula where longer genes appear to experience weaker selective constraint (i.e., higher dN/dS ratios).
Several studies have documented a negative correlation between protein length and mRNA abundance (Coghlan and Wolfe 2000; Jansen and Gerstein 2000; Urrutia and Hurst 2003; Lemos et al. 2005). To explain this observation, Jansen and Gerstein (2000) suggested that gene length might set an upper limit to mRNA abundance due to natural selection for metabolic efficiency. On the other hand, other studies in higher eukaryotes have failed to establish a direct relationship between protein length and gene expression. For instance, Duret and Mouchiroud (1999) found no association between protein length and gene expression in D. melanogaster and A. thaliana and even found a positive correlation between gene length and expression in C. elegans. A recent study by Ren et al. (2006) suggests that highly expressed genes in both rice and Arabidopsis have both more and longer introns and larger transcript sizes than genes with lower gene expression.
Interestingly, in Populus there is a weak and marginal statistically significant negative association between mRNA abundance, measured by gene expression level estimates from the cDNA libraries and protein length (ß = 0.086, P = 0.042; fig. 3). In addition, there is a negative correlation between protein length and gene expression breadth (ß = 0.124, P = 0.004). Both of these effects are independent of the effects of codon bias, indicating that the effect of protein length is direct and thus represents a phenomenon with real biological relevance. In addition, there is a weak, but negative, correlation between gene expression and number of introns in the coding sequences from the Populus data set (
= 0.070, P = 0.098). This correlation is marginally stronger (although still nonsignificant) if the data set is restricted to genes where the complete coding sequence is available (
= 0.109, P = 0.091). There is thus no evidence in Populus for the observation from rice and Arabidopsis of highly expressed genes being less compact than lower expressed genes (Ren et al. 2006
).
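To illustrate what it means for the effect of protein length to be independent of codon bias, the following minimal sketch computes standardized partial regression coefficients for a response regressed jointly on two predictors, using only their pairwise correlations. It is not the analysis used in this paper: the function names and the simulated values are hypothetical, and the simulated effect sizes are chosen arbitrarily.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardized_partial_betas(y, x1, x2):
    """Standardized coefficients of y regressed jointly on x1 and x2.

    Each coefficient is the effect of one predictor with the other held
    constant, expressed in standard-deviation units.
    """
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2


# Simulated (hypothetical) data: expression depends weakly and negatively
# on protein length and positively on codon bias.
rng = random.Random(0)
n_genes = 558
protein_length = [rng.gauss(0, 1) for _ in range(n_genes)]
codon_bias = [rng.gauss(0, 1) for _ in range(n_genes)]
expression = [
    -0.1 * length + 0.3 * bias + rng.gauss(0, 1)
    for length, bias in zip(protein_length, codon_bias)
]

beta_length, beta_bias = standardized_partial_betas(
    expression, protein_length, codon_bias
)
print(f"beta(length | bias) = {beta_length:.3f}")
print(f"beta(bias | length) = {beta_bias:.3f}")
```

A coefficient for protein length that remains nonzero after codon bias is included in the model is what is meant here by a direct effect.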
The result on the effects of protein length is tantalizing, as it suggests that protein length can be an important force shaping several aspects of molecular evolution. As such, it clearly deserves further study.
| Conclusions |
|---|
|
|
|---|
Here, I have shown that gene expression directly influences codon usage and protein evolution in P. tremula, although it does so in very different ways. The codon usage of a gene is primarily determined by how highly expressed the gene is, whereas protein evolution is primarily determined by how widely expressed the gene is. In addition, protein length appears to be an important factor influencing virtually all aspects of molecular evolution.
The forces affecting codon usage and protein evolution are complex, with an interplay of a large number of factors. It is worth stressing that I have not been able to measure all factors that may be relevant for understanding codon usage and protein evolution in P. tremula. In addition, several of the factors included in my analyses are measured with a great deal of imprecision, such as gene expression levels estimated from EST libraries or the usage of optimal codons. Given that even modest measurement errors can result in spurious correlations between traits (Drummond et al. 2006
), disentangling the causal relationships between traits measured at genome-wide scales is a daunting task, to say the least. However, as techniques used to gather genome-wide data are refined and new methods are devised to measure variables of interest, our understanding of the causal mechanisms affecting evolution at the genome level should increase.
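The point about measurement error can be illustrated with a small simulation. The sketch below is an assumption-laden toy example, not a reanalysis of the Populus data: two traits are both driven by a single latent variable and have no direct effect on each other, and all variable names and variances are hypothetical. Controlling for the latent variable itself removes the association between the traits, whereas controlling for a noisy measurement of it leaves a residual partial correlation (about one third in expectation under these particular variances).

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(1)
n = 5000

# Hypothetical latent variable (e.g. a true expression level) driving both traits.
latent = [rng.gauss(0, 1) for _ in range(n)]
trait_a = [v + rng.gauss(0, 1) for v in latent]
trait_b = [v + rng.gauss(0, 1) for v in latent]

# The same latent variable as it would be observed with measurement error.
latent_noisy = [v + rng.gauss(0, 1) for v in latent]

r_given_true = partial_corr(trait_a, trait_b, latent)
r_given_noisy = partial_corr(trait_a, trait_b, latent_noisy)
print(f"partial r, controlling for the true variable:  {r_given_true:.3f}")
print(f"partial r, controlling for the noisy variable: {r_given_noisy:.3f}")
```

In this toy setting, the second partial correlation could be mistaken for a direct relationship between the two traits even though none was simulated.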
| Acknowledgements |
|---|
|
|
|---|
This work has been funded by a grant from the Swedish Research Council (Vetenskapsrådet).
| Footnotes |
|---|
Kenneth Wolfe, Associate Editor
| References |
|---|
|
|
|---|
Akashi H. (1997) Codon bias evolution in Drosophila. Population genetics of mutation-selection drift. Gene 205:269–278.
Akashi H. (2001) Gene expression and molecular evolution. Curr Opin Genet Dev 11:660–666.
Akashi H and Eyre-Walker A. (1998) Translational selection and molecular evolution. Curr Opin Genet Dev 8:688–693.
Andolfatto P. (2005) Adaptive evolution of non-coding DNA in Drosophila. Nature 437:1149–1152.
Anisimova M, Bielawski JP, Yang ZH. (2001) Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol Biol Evol 18:1585–1592.
Bierne N and Eyre-Walker A. (2003) The problem of counting sites in the estimation of the synonymous and nonsynonymous substitution rates: implications for the correlation between the synonymous substitution rate and codon usage bias. Genetics 165:1587–1597.
Chiapello H, Fisacek F, Caboche M, Henaut A. (1998) Codon usage and gene function are related in sequences of Arabidopsis thaliana. Gene 209:GC1–GC38.
Coghlan A and Wolfe KH. (2000) Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae. Yeast 16:1131–1145.
Comeron JM and Kreitman M. (2002) Population, evolutionary and genomic consequences of interference selection. Genetics 161:389–410.
Comeron JM, Kreitman M, Aguade M. (1999) Natural selection on synonymous sites is correlated with gene length and recombination in Drosophila. Genetics 151:239–249.
Drummond D, Raval A, Wilke C. (2006) A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol 23:327–337.
Dunn KA, Bielawski JP, Yang ZH. (2001) Substitution rates in Drosophila nuclear genes: implications for translational selection. Genetics 157:295–305.
Duret L. (2000) tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal translation of highly expressed genes. Trends Genet 16:287–289.
Duret L. (2002) Evolution of synonymous codon usage in metazoans. Curr Opin Genet Dev 12:640–649.
Duret L and Mouchiroud D. (1999) Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, Arabidopsis. Proc Natl Acad Sci USA 96:4482–4487.
Duret L and Mouchiroud D. (2000) Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol Biol Evol 17:68–74.
Eyre-Walker A and Hurst LD. (2001) The evolution of isochores. Nat Rev Genet 2:549–555.
Francino HP and Ochman H. (1999) Isochores result from mutation not selection. Nature 400:30–31.
Goldman N and Yang ZH. (1994) Codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11:725–736.
Halligan DL, Eyre-Walker A, Andolfatto P, Keightley PD. (2004) Patterns of evolutionary constraints in intronic and intergenic DNA of Drosophila. Genome Res 14:273–279.
Halligan D and Keightley P. (2006) Ubiquitous selective constraints in the Drosophila genome revealed by a genome-wide interspecies comparison. Genome Res 16:875–884.
Hasegawa M, Kishino H, Yano T. (1985) Dating the human-ape split by a molecular clock of mitochondrial DNA. J Mol Evol 22:160–174.
Ihaka R and Gentleman R. (1996) R: a language for data analysis and graphics. J Comp Graph Stat 5:299–314.
Ikemura T. (1985) Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol 2:13–34.
Ingvarsson PK. (2005) Nucleotide polymorphism and linkage disequilibrium within and among natural populations of European aspen (Populus tremula L, Salicaceae). Genetics 169:945–953.
Jansen R and Gerstein M. (2000) Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed genes. Nucleic Acids Res 28:1481–1488.
Kanaya S, Yamada Y, Kudo Y, Ikemura T. (1999) Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238:143–155.
Kent WJ. (2002) BLAT - the BLAST-like alignment tool. Genome Res 12:656–664.
Kuma K, Iwabe N, Miyata T. (1995) Functional constraints against variations on molecules from the tissue-level: slowly evolving brain-specific genes demonstrated by protein-kinase and immunoglobulin supergene families. Mol Biol Evol 12:123–130.
Lemos B, Bettencourt BR, Meiklejohn CD, Hartl DL. (2005) Evolution of proteins and gene expression levels are coupled in Drosophila and are independently associated with mRNA abundance, protein length and number of protein-protein interactions. Mol Biol Evol 22:1345–1354.
Lin Y, Byrnes J, Hwang J, Li W. (2006) Codon-usage bias versus gene conversion in the evolution of yeast duplicate genes. Proc Natl Acad Sci USA 103:14412–14416.
Loehlin JC. (2004) Latent variable models: an introduction to factor, path and structural equation analysis (L. Erlbaum, Mahwah (NJ)).
Marais G, Domazet-Loso T, Tautz D, Charlesworth B. (2004) Correlated evolution of synonymous and nonsynonymous sites in Drosophila. J Mol Evol 59:771–779.
Marais G, Mouchiroud D, Duret L. (2001) Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes. Proc Natl Acad Sci USA 98:5688–5692.
Marais G, Nouvellet P, Keightley PD, Charlesworth B. (2005) Intron size and exon evolution in Drosophila. Genetics 170:481–485.
McVean GAT and Charlesworth B. (2000) The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation. Genetics 155:929–944.
McVean GA and Vieira J. (2001) Inferring parameters of mutation, selection and demography from patterns of synonymous site evolution in Drosophila. Genetics 157:245–257.
Moriyama EN and Powell JR. (1997) Codon usage bias and tRNA abundance in Drosophila. J Mol Evol 45:514–523.
Osada N, Hirata M, Tanuma R, et al. (11 co-authors). (2005) Substitution rate and structural divergence of 5' UTR evolution: comparative analysis between human and cynomolgus monkey cDNAs. Mol Biol Evol 22:1976–1982.
Percudani R. (2001) Restricted wobble rules for eukaryotic genomes. Trends Genet 17:133–135.
Powell JR and Moriyama EN. (1997) Evolution of codon usage bias in Drosophila. Proc Natl Acad Sci USA 94:7784–7790.
Ren X-Y, Vorst O, Fiers M, Stiekma WJ, Nap J-P. (2006) In plants, highly expressed genes are the least compact. Trends Genet 22:528–532.
Sharp PM and Li WH. (1986) An evolutionary perspective on synonymous codon usage in unicellular organisms. J Mol Evol 24:28–38.
Stenoien HK. (2005) Adaptive basis of codon usage in the haploid moss Physcomitrella patens. Heredity 94:87–93.
Sterky F, Bhalerao R, Unneberg P, et al. (19 co-authors). (2004) A Populus EST resource for plant functional genomics. Proc Natl Acad Sci USA 101:13951–13956.
Tuskan G, DiFazio S, Jansson S, et al. (111 co-authors). (2006) The genome of Black cottonwood Populus trichocarpa (Torr. & Gray). Science 313:1596–1604.
Unneberg P, Strömberg M, Lundeberg J, Jansson S, Sterky F. (2005) Analysis of 70000 EST sequences to study divergence between two closely related Populus species. Tree Genet Genome 1:109–115.
Urrutia AO and Hurst LD. (2001) Codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans, but this is not evidence for selection. Genetics 159:1191–1199.
Urrutia AO and Hurst LD. (2003) The signature of selection mediated by expression on human genes. Genome Res 13:2260–2264.
Wall DP, Hirsh AE, Fraser HB, Kumm J, Giaever G, Eisen MB, Feldman MW. (2005) Functional genomic analysis of the rates of protein evolution. Proc Natl Acad Sci USA 102:5483–5488.
Wright SI, Yau CBK, Looseley M, Meyers BC. (2004) Effects of gene expression on molecular evolution in Arabidopsis thaliana and Arabidopsis lyrata. Mol Biol Evol 21:1719–1726.
Yang Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555–556.
Yang ZH, Nielsen R, Goldman N, Pedersen AMK. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449.




as a function of expression breadth. Genes were grouped by the number of libraries within which gene expression could be detected. The correlation between median