MBE Advance Access originally published online on May 9, 2007
Molecular Biology and Evolution 2007 24(8):1744-1751; doi:10.1093/molbev/msm093

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Dual Modes of Natural Selection on Upstream Open Reading Frames

Daniel E. Neafsey and James E. Galagan

Microbial Analysis Group, Broad Institute of MIT and Harvard, Cambridge, Massachusetts

E-mail: neafsey{at}broad.mit.edu.


    Abstract
Upstream open reading frames (uORFs) are common features of eukaryotic genes, occurring in 10%–25% of 5' leader sequences. Upstream ORFs that have been subjected to experimental analysis have been generally found to decrease translational efficiency of the downstream coding sequence. Previous investigations of uORFs in mammals and yeast have detected uORFs conserved over long evolutionary distances, prompting speculation about the nature and cause of the natural selection underlying such conservation. We have analyzed uORFs in the basidiomycetous fungal pathogen Cryptococcus neoformans to discern the properties of this purifying selection. We find that uORFs in the Cryptococcus species complex are conserved at twice the expected rate, and we report 122 uORFs that are conserved among all four sequenced Cryptococcus strains. A significantly greater proportion of uORF losses occur via direct mutation to the uORF start codon than expected. This observation suggests that mutational disruption of a uORF that leaves the start codon intact may be selectively disadvantageous, perhaps because of the risk of premature translation initiation. Accounting for this constrained mode of loss and comparing the relative conservation of uORFs between the 5' leader and control sequences enables us to calculate that at least a third of uORFs may be conserved for their effects on translational efficiency. The remaining fraction may be conserved either by chance or as a result of selective pressure to prevent premature translation initiation from the uORF start codon. We find that the majority of conserved uORFs do not exhibit codon usage bias or conservation at the amino acid level, and therefore they do not likely encode bioactive peptides. Our analysis suggests that uORFs are an important and underappreciated mechanism of post-transcriptional gene regulation in eukaryotes.

Key Words: uORF • uAUG • conservation • translation • Cryptococcus


    Introduction
Microarrays have given the biological community abundant genome-wide data on rates of DNA transcription. The relative ease with which microarray data can now be acquired should not obscure the fact that transcription is not synonymous with expression. Indeed, there is growing evidence of significant variation in mRNA transcript half-life (Wang et al. 2002) and translational efficiency among genes (Serikawa et al. 2003; MacKay et al. 2004). To make fullest use of transcriptional data, then, it is imperative to understand what factors may intercede at the translation stage to decouple levels of transcription and expression.

Short open reading frames in the 5' leader sequences of genes, called upstream open reading frames (uORFs), are known to affect the translational efficiency of many eukaryotic genes (Morris and Geballe 2000; Meijer and Thomas 2002; Vilela and McCarthy 2003). Upstream ORFs are common genomic features, with estimates of uORF incidence ranging as high as 25% in mammalian genes (Crowe, Wang, and Rothnagel 2006) and 10%–22% in fungal genes (Galagan et al. 2005). Although some uORFs may augment expression by obscuring other cis-acting inhibitory elements (Geballe and Sachs 2000), most experimentally tested eukaryotic uORFs are translational repressors. Upstream ORFs have been shown to affect translational efficiency negatively through a variety of means, including ribosome-blocking by the encoded peptide, ribosome stalling at the uORF termination codon, induction of the nonsense-mediated decay (NMD) pathway, and failure of the ribosome to re-initiate at the genic translation start site after disengaging from the uORF (Gaba et al. 2001). Upstream ORFs that have been experimentally tested through cell-free translation assays or other means have been found to decrease the rate of translation up to 20-fold (Hinnebusch 2005), although some uORFs appear to have little impact, or a variable impact, on translation rates (e.g., Wang and Rothnagel 2004).

In accordance with the scanning model of translation initiation (Kozak 1994), it has been suggested that some uORFs may be conserved to prevent deleterious premature translation initiation from upstream AUG (uAUG) triplets (Iacono, Mignone, and Pesole 2005; Lynch, Scofield, and Hong 2005; Lynch 2006). Premature translation initiation leading to genic read-through would, at best, add extraneous peptides to the N-terminus of the encoded protein if the uAUG were in the same reading frame as the genic ORF, and, at worst, it would create a frameshift-induced nonsense mutation and entirely eliminate translation of the genic ORF. In this latter case, even if the uORF decreases the translation rate of the adjacent genic sequence, the phenotypic effect may be less severe than premature translation initiation, which results in the ribosome's reading through the genic translation start site. This hypothesis is supported by the observation that uAUGs are significantly under-represented in 5' leader sequences in mammals, yeast, and prokaryotes (Saito and Tomita 1999; Hahn et al. 2003; Churbanov et al. 2005; Iacono, Mignone, and Pesole 2005).

We have investigated the genomic distribution and conservation of uORFs in four recently sequenced strains of the fungal pathogen Cryptococcus neoformans. The Cryptococcus system is well suited for the analysis of uORF evolution. Phylogenetic analysis indicates a total synonymous divergence among strains comparable to that observed between human and mouse, meaning that the sequenced strains have diverged just enough to exhibit turnover in their uORF complements, but not so much that it isn't possible to accurately align their 5' leader sequences. In addition, approximately 23,000 full-length cDNAs have been sequenced for one of the strains (Loftus et al. 2005), permitting conservative estimation of the minimum extent of thousands of 5' leader and 3' trailer sequences.

We find that although uORFs are less common in 5' leader sequences than expected by chance, they are conserved at twice the expected rate. In addition, we report that uORFs in Cryptococcus exhibit little evidence of selection for length, codon bias, or amino acid content, and they are most likely conserved only at the level of the open reading frame. Analysis of the nature of the mutations by which uORFs have been lost in individual lineages allows us to estimate that although many uORFs may be conserved to insulate uAUGs and prevent premature translation initiation, at least a third of conserved uORFs are maintained because of their impact on translation efficiency. These observations suggest that uORFs are a widespread and important mechanism of post-transcriptional regulation in eukaryotes.


    Methods
Genome Alignment and 5' Leader Mapping
The genome assemblies of four strains of C. neoformans were obtained from the Web sites of the sequencing centers that produced them (strain JEC21: TIGR; strain WM276: Michael Smith Genome Center; strains H99 and R265: Broad Institute). Whole genome alignments were created using a multistep process, with strain JEC21 as a reference. First, pairwise alignments between JEC21 and the other sequenced strains were created using PatternHunter (Ma, Tromp, and Li 2002). Blocks of 4-way homologous contigs were then identified with a hierarchical synteny clustering algorithm. Multiple alignments of homologous regions were generated with Multi-LAGAN (Brudno et al. 2003). The 5' leader sequences and 3' trailer sequences (3' untranslated regions, UTRs) in the alignments were identified with a full-length cDNA library for JEC21 produced by TIGR (available at http://www.tigr.org/tdb/e2k1/cna1/). Each cDNA was matched to a region of the JEC21 genome using BLAT (Kent 2002). The 5' leader and 3' trailer sequences were inferred when the genomic region matching a cDNA sequence extended beyond the boundaries of a coding sequence in the TIGR annotation of the JEC21 strain. Although transcription may, in some cases, begin or end more distally from the coding sequence than suggested by the cDNA sequences, we believe these leader/trailer sequence boundaries to conservatively reflect the minimal extent of transcribed flanking regions. Only those 5' leader and 3' trailer sequences that show no evidence of introns in the BLAT results or TIGR annotation were used for uORF analysis. All analyses on uORFs were conducted with custom Perl scripts.

Analysis of uORF Incidence and Conservation
For the purposes of this analysis, a uORF was defined as an AUG triplet followed by at least one intervening codon and a stop codon. Upstream ORFs were permitted to overlap each other, and they were either contained entirely in the 5' leader sequence or allowed to overlap the downstream coding ORF by a single base pair. Upstream ORFs were considered to be conserved if, in the multiple alignment of orthologous leader sequences, all strains exhibited a start codon and a stop codon in the same position, and those start and stop codons were in the same frame relative to each other. Losses were inferred when a uORF was conserved in all strains but one, and gains were inferred when a uORF was present in only a single strain. This simple mode of parsimony inference is likely to underestimate the absolute loss/gain rate ratio because of cases where parallel loss events in multiple lineages are mistaken for gains, but it does not influence the conclusions we draw from comparing the relative rates of loss/gain observed between the 5' leader sequence and control sequences or the relative rates of different modes of uORF loss and gain.
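
The uORF definition above can be illustrated with a minimal Python sketch that scans a single leader sequence. This is not the authors' Perl code. The function name find_uorfs, the 0-based coordinates, and the restriction to uORFs contained entirely within the leader (the one-base-pair overlap with the coding ORF is not modeled) are assumptions made for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA/UAG/UGA

def find_uorfs(leader):
    """Return (start, end) pairs for uORFs in one leader sequence.

    Coordinates are 0-based and end-exclusive (hypothetical convention).
    A uORF is an ATG, at least one intervening codon, then the first
    in-frame stop codon. Overlapping uORFs are all reported.
    """
    seq = leader.upper().replace("U", "T")
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOPS:
                # Require at least one codon between start and stop.
                if pos - start >= 6:
                    uorfs.append((start, pos + 3))
                break
    return uorfs
```

Under this sketch, testing conservation would amount to checking that the same start and stop coordinates, in the same relative frame, are recovered from every aligned strain.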

Modes of uORF loss were divided into several categories for analysis. Mutations that disrupted the AUG start codon were tallied separately from mutations that disrupted the stop codon or reading frame. This latter category of non-AUG mutations was further refined to include only those mutations that created an "uninsulated" AUG with no stop codon between it and the coding sequence, and to exclude the non-AUG mutations that destroyed the original uORF but did not create an uninsulated uAUG owing to the presence of a downstream "backup" stop triplet. Two categories of uORF gain were tallied: cases in which a mutation created a new AUG codon upstream and in-frame with an existing stop codon, and cases in which a mutation put a pre-existing uAUG triplet in the context of an ORF. Rates of uORF loss were calculated by dividing the number of observed, diagnosable loss events by the number of conserved uORFs in the same sequence class.
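
The loss categories described above can be sketched as a small classifier. This is an illustrative reading of the text, not the authors' implementation. It assumes the caller already knows that the uORF has been lost in the lineage being examined and has mapped the ancestral start position onto that lineage's ungapped leader; the function name classify_loss and the category labels are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def classify_loss(derived_leader, start):
    """Classify how a uORF known to be lost in this lineage was disrupted.

    derived_leader: ungapped leader from the lineage that lost the uORF.
    start: 0-based position where the ancestral start codon maps.
    """
    seq = derived_leader.upper().replace("U", "T")
    if seq[start:start + 3] != "ATG":
        return "start_disrupted"
    # The AUG survives, so the stop codon or reading frame was hit.
    # Look for any in-frame stop before the end of the leader.
    for pos in range(start + 3, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOPS:
            return "backup_stop"       # a downstream stop still insulates the AUG
    return "uninsulated_uAUG"          # no stop between the AUG and the coding sequence
```

Only the "uninsulated_uAUG" outcome corresponds to the refined non-AUG category that the analysis retains.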

The expected incidence of uORFs in 5' leader and control sequence classes was calculated from the observed incidence of potential start (AUG) and stop (UAA/UAG/UGA) triplets in individual leader sequences. For each leader sequence, the numbers of empirically observed start and stop triplets were tallied, and then their relative order and frame were randomly permuted 1,000 times to generate a distribution from which an expected number of ORFs was derived.
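
A rough sketch of this permutation procedure follows. The text does not specify exactly how order and frame were permuted, so the example assumes one simple scheme: the observed numbers of start and stop triplets are kept, and their positions are redrawn uniformly without replacement along the leader. The function names, the seed parameter, and the choice to ignore overlaps between redrawn triplets are all assumptions.

```python
import random

def count_orfs(starts, stops):
    """Count starts whose first downstream in-frame stop leaves >= 1 intervening codon."""
    n = 0
    for s in starts:
        in_frame = [t for t in stops if t > s and (t - s) % 3 == 0]
        if in_frame and min(in_frame) - s >= 6:
            n += 1
    return n

def expected_orfs(starts, stops, leader_len, n_perm=1000, seed=0):
    """Mean ORF count over n_perm random placements of the observed triplets."""
    rng = random.Random(seed)
    k = len(starts) + len(stops)
    total = 0
    for _ in range(n_perm):
        # Draw k distinct triplet positions; the sample order is random,
        # so the first len(starts) positions act as randomly chosen starts.
        positions = rng.sample(range(leader_len - 2), k)
        total += count_orfs(positions[:len(starts)], positions[len(starts):])
    return total / n_perm
```

Summing the per-leader expectations across all leaders would give a genome-wide expected count to compare with the observed count.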

Initiation/Termination Context
The initiation context of conserved and nonconserved uORFs was compared to the initiation context of genic ORFs exhibiting high and low codon bias. Codon bias was evaluated with the ENC' statistic (Novembre 2002). "High" and "low" bias gene sets correspond, respectively, to the genes at least two standard deviations above and below the average codon bias across all genes. Ten nucleotide positions immediately upstream and seven nucleotide positions immediately downstream of the initiation codon were examined in pairwise comparisons among the different ORF classes (conserved uORF vs. nonconserved uORF, high codon bias genic ORF vs. low codon bias genic ORF) from strain JEC21. A heterogeneity chi-squared test was used to determine the significance of difference in nucleotide usage at each position.

The sequence context of uORF stop codons was also compared between conserved and nonconserved uORFs in the manner described above.

Conservation of uORF-Encoded Peptides
Conserved uORFs were evaluated for conservation at the amino acid level. The ratio of nonsynonymous to synonymous substitutions (Ka/Ks) was computed for all conserved uORFs and a subset of 12 conserved uORFs that were 20–99 codons in length (after Crowe, Wang, and Rothnagel 2006). For each set, all uORFs were trimmed of their start and stop codons and concatenated into a single sequence for each Cryptococcus strain to obtain an overall estimate of Ka/Ks across the set. The Ka/Ks ratio for each set was calculated with codeml model M0 in the PAML 3.14 package (Yang 1997). Codon usage bias for strain JEC21 was measured in each concatenated uORF set with the ENC' statistic (Novembre 2002).

Evaluation of Annotation Accuracy
The accuracy of predicted translation start sites in the TIGR annotation was measured in two ways. First, the conservation of the predicted genic translation initiation codon was compared to the conservation rate of the first AUG triplet encountered upstream and the first two AUG triplets encountered downstream of the predicted translation start site. Verification of predicted translation start sites in the TIGR gene calls using a comparative method was largely successful. The AUG codon at the predicted translation start site was conserved 80.8% of the time (4,118/5,095). The first and second internal AUG codons were, respectively, perfectly conserved at rates of 78.8% (4,019/5,095) and 80.2% (4,090/5,095). The first uAUG triplet was conserved only 52.5% of the time in leader sequences that contained a uAUG not in the context of a uORF (84/161; χ² test; P = 1.1E-18), suggesting that this class of triplets is much less likely to be part of the genic coding sequence.

Conservation of the intervals between AUG triplets was also used to evaluate annotation accuracy. The ratio of nonsynonymous to synonymous divergence (Ka/Ks) was computed for all of the AUG–AUG intervals defined by the triplets scored for conservation in the previous analysis with codeml model M0 in the PAML 3.14 package (Yang 1997). Supplementary Figure 1 is a frequency histogram of Ka/Ks estimates for three intervals: first upstream AUG to the predicted translation initiation codon (TIC), TIC to the first internal coding AUG, and second internal coding AUG to the third internal coding AUG. The short length of some of these sequence intervals generates noise in the Ka/Ks statistic, but it is clear that most AUG–AUG intervals within the genic coding region have Ka/Ks ratios less than the AUG–TIC intervals that are upstream of the predicted coding sequence. This result again indicates that the predicted translation start sites are correct in the majority of C. neoformans genes. A small number of genes exhibiting Ka/Ks ratios < 0.30 in their upstream AUG–AUG intervals were excluded from further analysis.


    Results
Upstream ORF Incidence and Conservation
The overall incidence and conservation of uORFs in non-spliced, cDNA-defined 5' leader sequences of strain JEC21 is illustrated in figure 1. The antisense strand of 5' leader sequences (5'–) and the sense and antisense strands of 3' trailer sequences (3'+; 3'–) were used as control sequences for measuring uORF incidence and conservation, as randomly occurring ORFs in these strands are presumably selectively neutral. A total of 249 of 2,167 (11.5%) 5' leader sequences contained at least 1 uORF. This rate is at least fourfold smaller than the incidence of uORFs in any of the control sequences [(5'–): 1,040/2,167 = 48%; (3'+): 1,660/2,698 = 61.5%; (3'–): 1,635/2,698 = 60.6%]. This result is lower than the 36% incidence reported by Iacono, Mignone, and Pesole (2005) and the 25% incidence reported by Crowe, Wang, and Rothnagel (2006) for human genes. Even though uORFs are less common than expected in C. neoformans 5' leader sequences, they are disproportionately conserved. Of a total of 453 uORFs in the 5' leader sequences, 122 were perfectly conserved in all 4 strains, for a conserved fraction of 27% (Supplementary Table 1). This is approximately twice the rate of conservation observed in any of the control sequence classes [(5'–): 322/2,904 = 11%; (3'+): 453/3,244 = 14%; (3'–): 534/3,683 = 14%].


FIG. 1.— Upstream ORF incidence and conservation. (A) Percentage of genes exhibiting at least one uORF in the sense strand of the 5'leader (5'+), compared to the percentage of genes exhibiting at least one ORF in the antisense strand of the 5'leader (5'–) and both strands of the 3'trailer (3'+, 3'–). 2,167 5'leader sequences and 2,698 3'trailer sequences were examined. (B) Percentage of uORFs conserved in four sequenced Cryptococcus genomes (5'+), compared to the percentage of control ORFs conserved (5'–, 3'+, 3'–).

 

Table 1 uORF incidence, conservation, and mutational modes of loss in 2,167 5' leaders and 2,698 3' trailers

 
The reduced incidence of uORFs in 5' leader sequences may be explained in large part by the reduced incidence of AUG triplets. Figure 2 indicates that the observed incidence of AUG triplets is significantly lower than expected given the nucleotide composition of the 5' leader sequences, similar to what has been found in mammals and yeast (Churbanov et al. 2005; Iacono, Mignone, and Pesole 2005). We observed only 655 instances of AUG triplets relative to 2,043 expected occurrences. Control triplets composed of the same nucleotides in a different order exhibited incidences much closer to the expected 2,043 occurrences (GAU: 1,940; AGU: 1,798; GUA: 1,614; UGA: 1,502; UAG: 1,522).
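
The expected triplet incidence used here, a simple product of component nucleotide frequencies, can be sketched as follows. The function name triplet_obs_exp and the choice to count every overlapping three-nucleotide window are assumptions. Because the expectation depends only on composition, every permutation of A, U, and G shares the same expected count, which is why a single expected value (2,043) applies to AUG and all of its control triplets.

```python
from collections import Counter
from math import prod

def triplet_obs_exp(seqs, triplet):
    """Return (observed, expected) counts of a triplet across non-empty sequences.

    Expected = number of 3-nt windows x product of the pooled
    frequencies of the triplet's three nucleotides.
    """
    seqs = [s.upper().replace("T", "U") for s in seqs]
    triplet = triplet.upper().replace("T", "U")
    counts = Counter(base for s in seqs for base in s)
    total = sum(counts.values())
    windows = sum(max(len(s) - 2, 0) for s in seqs)
    observed = sum(s[i:i + 3] == triplet
                   for s in seqs for i in range(len(s) - 2))
    expected = windows * prod(counts[b] / total for b in triplet)
    return observed, expected
```

With the counts reported above, 655 observed against 2,043 expected corresponds to an observed/expected ratio of about 0.32 for AUG.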


FIG. 2.— Triplet incidence in 5' leader sequences. The observed/expected ratio is much lower for AUG triplets than for permuted triplets of the same three nucleotides (χ² test; P = 0.0001). Expected triplet incidence was calculated as a simple product of component nucleotide frequencies. See text for actual frequencies.

 
Despite this reduced incidence of AUG triplets, uORFs are more common in the 5' leader sequence than expected by chance, given the observed incidence of potential start and stop triplets. Figure 3 illustrates that uORFs are 1.6 times more common than expected (216 observed; 134 expected) in 5' leader sequence after allowing for a shortage of AUG triplets, as determined by simulated shuffling of observed potential start and stop triplets. This means that the potential start and stop triplets in the 5' leader are in the correct orientation (start upstream of stop) and in the same reading frame much more frequently than expected by chance. Upstream ORFs are less common than expected in the sense strand of the 3'UTR, perhaps because of a greater than expected incidence of potential stop triplets in those sequences (data not shown). Averaging across the control sequences, the observed/expected ratio of uORF abundance was 0.95 (2,421 observed; 2,545 expected).


FIG. 3.— Upstream ORF incidence (observed/expected). Expected uORF incidence was computed using 1,000 random permutations of observed start and stop triplets for each leader/trailer sequence. A surplus of observed uORFs indicates that start and stop triplets are in the correct orientation (start before stop) and in the same reading frame more often than expected by chance. (observed/expected 5'+: 216/134; 5'–: 773/702; 3'+ 560/904; 3'–: 1,088/939).

 
Calculating the Fraction of Conserved Regulatory uORFs
These observations of conservation suggest that uORFs may be functional repressors of translation, but they may also be conserved to prevent premature translation initiation from uAUGs. The stop codon of such uORFs would effectively be acting as insulation between the uAUG and genic coding sequence.

We sought to determine the proportion of uORFs conserved for uAUG insulation rather than for a potential impact on translation. To do so we note that conservation of a uORF for an insulatory purpose would predict purifying selection on the stop codon and reading frame, but not on the uORF start codon (a uAUG). Mutations disrupting the start codons of such uORFs would presumably be free to drift to fixation. Conservation of a uORF for its effect on translational efficiency, however, requires purifying selection on all parts of the uORF.

Given these different expectations regarding conservation, we tested for the prevalence of insulatory uORFs by analyzing modes of uORF loss. We identified instances of uORFs recently disrupted by mutation and recorded whether or not the mutation disrupted the uORF start codon (Table 1). As expected for uORFs acting as uAUG insulators, a greater proportion of uORFs were lost because of start-disrupting mutations than expected (5'+: 35 of 41 losses; control ORFs: 632 of 973 losses; χ² test; P = 0.0062). Assuming that uORFs and control ORFs are subject to a similar profile of mutations, we infer that purifying selection has filtered out mutations that solely disrupted uORF stop codons or reading frames, thereby reducing the overall uORF loss rate by approximately 24% (observed losses = 41; expected losses = 54).
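
The arithmetic behind these figures can be sketched in a few lines of Python. The text does not state which χ² variant was used, so the example assumes an uncorrected Pearson test on the 2×2 table of start-disrupting versus other losses, and the statistic it produces need not reproduce the reported P value exactly. The derivation of the expected 54 losses (scaling the 35 start-disrupting losses by the control proportion of start-disrupting losses) is likewise an inference from the reported numbers, and the function name chi2_2x2 is hypothetical.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and P value (1 df) for [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of chi-squared with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Start-disrupting vs. other losses: 5' leader uORFs, then control ORFs.
stat, p = chi2_2x2(35, 41 - 35, 632, 973 - 632)

# If start-disrupting losses made up the control proportion of all losses,
# 35 such losses would imply this many losses in total.
expected_losses = 35 / (632 / 973)
reduction = 1 - 41 / expected_losses
```

Under these assumptions, expected_losses rounds to 54 and reduction to 0.24, in line with the values quoted above.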

We measured the rate of diagnosable uORF losses in the 5' leader and control sequence classes as 0.17 and 0.37, respectively, by dividing the observed number of losses by the number of conserved uORFs. This twofold difference in loss rates between 5' leader and control uORFs indicates that the 24% reduction in loss rate caused by selection to prevent premature translation initiation is insufficient to explain the degree of conservation observed for uORFs in the 5' leader sequence. We infer that the balance of the difference in loss rates may be due to selection to preserve the effects of the 5' leader uORFs on translation efficiency of the downstream genic ORF.

To determine the fraction of uORFs conserved for their impact on translation, we assume that the observed rate of uORF loss (Lo; 0.334 lost/conserved) is a mixture of two loss rates:

$$L_o = l_i\,(1 - x_r) + l_r\,x_r$$
where li corresponds to the loss rate for "insulatory" uORFs that are conserved only to prevent read-through translation, lr corresponds to the loss rate for repressor uORFs that are conserved for their impact on translation, and xr corresponds to the proportion of all 5' leader uORFs that are repressors. We conservatively assume that no repressor uORFs have been lost during the observed time span of Cryptococcus evolution (lr = 0). We further model the loss rate for insulatory uORFs (li) as equal to the observed loss rate for control ORFs (0.74 lost/conserved) with a 24% downward correction to simulate insulation selection. Given these loss rates, solving for xr yields a value of 0.41. That is, 41% of conserved uORFs are predicted to be conserved for their impact on translation. This estimate may be considered a lower bound, in that relaxing the requirement that all uORFs conserved for their impact on translation must be perfectly conserved across all 4 Cryptococcus species, or reducing the effect of insulation selection, would result in a higher estimate of the proportion of uORFs that are translational repressors. Thus, we find that a significant fraction of conserved uORFs are maintained in the genome for their inhibitory effect on translation.
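
As a worked check of this estimate, the mixture can be solved numerically. The sketch assumes the mixture takes the form Lo = li(1 − xr) + lr·xr, which is the form consistent with the symbol definitions and with the reported result; the function name repressor_fraction is hypothetical.

```python
def repressor_fraction(l_obs, l_ins, l_rep=0.0):
    """Solve L_o = l_i * (1 - x_r) + l_r * x_r for x_r."""
    if l_ins == l_rep:
        raise ValueError("loss rates must differ to identify the mixture")
    return (l_ins - l_obs) / (l_ins - l_rep)

# Control loss rate with the 24% downward correction for insulation selection.
l_i = 0.74 * (1 - 0.24)          # 0.5624 lost/conserved
x_r = repressor_fraction(0.334, l_i)
```

With lr = 0 this reduces to xr = 1 − Lo/li = 1 − 0.334/0.5624, which is approximately 0.41. Any nonzero lr below Lo, or a smaller insulation correction, raises the estimate, which is why the value is treated as a lower bound.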

Initiation/Termination Context
We compared the nucleotide usage 10 bp upstream and 7 bp downstream of the translation initiation codon between genes exhibiting high and low codon bias. Miyasaka (1999) detected a correlation between the degree of codon bias and the optimality of the initiation context in yeast genes, suggesting that selection can operate on both features simultaneously to affect translation efficiency. Further, Crowe, Wang, and Rothnagel (2006) found evidence of selection to optimize uORF initiation context for uORFs conserved between human and mouse. In Cryptococcus, genes exhibiting high codon bias exhibit significantly greater usage of adenine nucleotides in positions –3 (heterogeneity χ² test; P = 0.001) and –1 (heterogeneity χ² test; P = 0.001) relative to the first site of the translation initiation codon. Interestingly, conserved uORFs exhibit significantly greater usage of cytosine in position –6 relative to uORFs recently lost or gained in a given C. neoformans lineage (heterogeneity χ² test; P = 0.05). However, there was no significant difference in nucleotide usage between high and low codon bias genes at position –6, indicating that this result for conserved uORFs may be spurious. Thus we did not find evidence of strong selection to generate optimal or nonoptimal contexts for translation initiation at conserved uORFs in Cryptococcus.

In a similar manner, we also compared nucleotide usage in the vicinity of uORF stop codons between conserved and nonconserved uORFs. Grant and Hinnebusch (1994) found that ribosome reinitiation frequency following uORF translation at the GCN4 locus in yeast was strongly associated with the A/U richness of the final uORF codon and 10 base pairs following the stop codon. For Cryptococcus, we found little difference in nucleotide composition between the termination regions of conserved and nonconserved uORFs. Uracil nucleotides were significantly more common in the third position of the last codon of nonconserved uORFs relative to conserved uORFs (heterogeneity χ² test; P = 0.01), but adenine nucleotides were most commonly found at this location in conserved uORFs. We conclude that there is no strong selective pressure to modulate nucleotide composition in uORF termination regions in Cryptococcus.

Upstream AUG Conservation
We failed to find evidence that uAUGs are conserved outside the context of uORFs. We found 23 instances of unambiguous loss and 82 conserved instances of uAUGs in the 5' leader sequences that were not associated with uORFs, yielding a loss rate of 23/82 = 0.28. We found 1,359 cases of unambiguous loss and 3,410 conserved uAUGs across the control sequences, for a loss rate of 0.39. These rates were not significantly different (χ² test; P = 0.13), indicating that AUG codons are not selectively maintained outside the context of uORFs in Cryptococcus 5' leader sequences. Previous analyses have reported conservation of uAUG triplets in mammals and yeast (Churbanov et al. 2005; Iacono, Mignone, and Pesole 2005), but they did not distinguish whether the signal was independent of selection to maintain the integrity of uORFs.

Selection on uORF Length
We compared the size distribution of conserved and nonconserved uORFs from the 5' leader sequences and from the 5'– control sequence. Churbanov et al. (2005) and Iacono, Mignone, and Pesole (2005) report that observed uORFs in fungal and mammalian sequences are significantly shorter than expected. In Cryptococcus, the average length of nonconserved uORFs in both the sense and antisense strands of the 5' leader sequence is longer (63 and 43 bp, respectively) than that of conserved uORFs in both strands, but the average length of conserved uORFs is almost identical in the sense and antisense strands (30.5 bp and 29.1 bp, respectively). One would predict that conserved uORFs might be shorter than nonconserved uORFs by chance, because longer uORFs have a greater likelihood of incurring an indel that puts the start and stop codons out of frame. We did find, however, that conserved uORFs in the 5' leader were significantly enriched in the 15–27 bp range relative to nonconserved uORFs or conserved uORFs in the antisense strand (nonparametric bootstrapping; P = 2.7E-4). We tested to see whether uORF length was selectively maintained in the presence of potential "backup" stop triplets by tallying instances in which the original stop codon of a uORF was mutated or not mutated in the presence of a downstream, in-frame stop. We observed a smaller frequency of utilization of backup stops in the 5' leader sequence (30 utilized/50 unutilized = 0.60) compared to the control sequences (830 utilized/954 unutilized = 0.87), but this difference is not significant (χ² test; P = 0.11). This result suggests a lack of selective pressure to keep uORFs at their present size when there is an easy mutational path to lengthen them. We conclude that uORFs may be under selection for shorter length in Cryptococcus, but the evidence is weak.

Conservation of uORF-Encoded Peptides
Crowe, Wang, and Rothnagel (2006) report evidence of conservation at the amino acid level for mammalian uORFs greater than 20 amino acids in length, and they suggest that many uORFs may encode bioactive peptides. To test whether conserved Cryptococcus uORFs may also be subject to such selective constraint, we estimated the ratio of nonsynonymous to synonymous substitution rates (Ka/Ks) in a concatenation of all 124 conserved Cryptococcus uORFs, as well as in a concatenation of 12 "long" conserved uORFs that were at least 20 amino acids in length. A Ka/Ks ratio close to 1 indicates neutral evolution, whereas a Ka/Ks ratio close to 0 suggests purifying selection.

We measured a Ka/Ks ratio of 0.82 for all conserved uORFs and a ratio of 0.42 for 12 conserved uORFs that were greater than 20 amino acids in length. For comparison, the average Ka/Ks ratio for Cryptococcus genic sequences is 0.18, and only 5.4% of Cryptococcus genes have a Ka/Ks ratio greater than or equal to the value observed for the set of long conserved uORFs (see Methods).

Using the ENC' statistic (Novembre 2002), we also measured relative codon usage bias in each concatenated uORF set to detect selection for translational efficiency or accuracy at the codon level. While the set of 12 long uORFs exhibits a codon usage bias close to the median value for Cryptococcus genes (long uORF ENC' = 57.3; genic median ENC' = 55.8, where higher values signify lower codon bias), only 8.7% of Cryptococcus genes exhibit codon usage bias as weak or weaker than the usage bias observed across all conserved uORFs (ENC' = 59.3).

These results suggest that there may be weak purifying selection on the set of 12 conserved uORFs greater than 20 amino acids in length, but that the majority of conserved uORFs do not experience selection at the amino acid level or translational level and therefore are not likely to encode bioactive peptides.


    Discussion
The uORFs present in the 5' leader sequences of Cryptococcus and most other eukaryotes are conserved for at least two significant reasons. Approximately 40% of uORFs are maintained in the genome because their inhibitory effects on translation efficiency presumably increase organismal fitness. Others are conserved over short time periods but beyond neutral expectation as a result of "insulation selection." These latter uORFs may be subject to purifying selection only until they experience an AUG-inactivating mutation that drifts to fixation. Zhang and Dietrich (2005) recovered 18 well-conserved uORFs from the sequenced genomes of the more highly divergent Saccharomyces clade, suggesting that selection to preserve the translational effects of uORFs may persist over evolutionary timescales much longer than those observed in the Cryptococcus system. However, tests of conservation over long evolutionary distances do not account for biological changes in regulation over time and are therefore conservative with regard to estimating the proportion of uORFs subject to purifying selection as a result of regulatory function.

The level at which this purifying selection acts appears in most cases to be the open reading frame itself, rather than individual uORF codons. Several recent manuscripts have focused on uAUG triplets as a unit of selection and/or conservation (Churbanov et al. 2005; Crowe, Wang, and Rothnagel 2006), but we find no evidence that uAUG triplets are conserved outside the context of uORFs. Indeed, we find that when uORFs are lost via mutation, they are disproportionately destroyed through mutation to the uAUG that initiates their translation. Further, we find no strong evidence that uORF length is under selection. Upstream ORFs that are perfectly conserved among the four sequenced Cryptococcus strains are shorter than uORFs that are not conserved, but this effect most likely derives from the increased probability that longer uORFs will incur a frame-shifting indel mutation over time. Neither do we detect strong evidence that the majority of conserved uORFs are under selection at the peptide coding level or translational efficiency level. A set of 12 conserved uORFs that were greater than 20 amino acids in length exhibited codon usage comparable to that observed in genic sequences and showed a Ka/Ks ratio that suggests purifying selection, albeit at a level lower than 95% of annotated Cryptococcus genes. Most conserved uORFs are shorter than 20 amino acids, however, and show no evidence of purifying selection or codon usage bias, even when concatenated to enhance signal strength. So, while some longer uORFs may encode functional, bioactive peptides, most uORFs are conserved only so far as to maintain their open reading frame.
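The length argument in the preceding paragraph can be illustrated with a simple calculation. The sketch below assumes that each base pair independently has the same probability of being hit by a frame-disrupting indel over the divergence time, so that a uORF of length L stays intact with probability (1 − p)^L. The per-site probability used here is a hypothetical value chosen only for illustration, and the model ignores in-frame indels and compensating mutations.

```python
def prob_frame_intact(length_bp, per_site_indel_prob):
    """Probability that no site in a uORF of the given length incurs an indel."""
    return (1.0 - per_site_indel_prob) ** length_bp


# Hypothetical per-site indel probability over the divergence time (assumption).
HYPOTHETICAL_INDEL_PROB = 0.01

# Lengths near the average conserved (about 30 bp) and nonconserved (63 bp)
# sense-strand uORF lengths reported in the Results.
for length_bp in (30, 63):
    intact = prob_frame_intact(length_bp, HYPOTHETICAL_INDEL_PROB)
    print(f"{length_bp} bp uORF: P(frame intact) = {intact:.3f}")
```

Whatever value is chosen for the per-site probability, the shorter uORF is always more likely to survive, so a conserved set is expected to be enriched for short uORFs even without selection on length.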

Some thoroughly studied uORFs, such as those of GCN4 in yeast, facilitate context-dependent post-transcriptional gene regulation (Gaba et al. 2001; Arava et al. 2005). There is the possibility that the extended residence time in the genome afforded to most uORFs as a result of their protective role in preventing premature translation initiation might increase the likelihood that some uORFs could be exapted into regulatory mechanisms and conserved permanently (Lynch 2006). Such a hypothesis is difficult to test directly, but it is possible to imagine a context-dependent regulatory uORF evolving in a step-wise process under this scenario. A new uORF might initially fix in a population as a consequence of a neutral or nearly neutral impact on the translational efficiency of the downstream gene. Because of the need to keep the uAUG of the uORF insulated, this uORF would enjoy a residence time in the genome longer than typically expected for a neutral genomic feature. One could imagine a compensatory mutation occurring during this extended residence time (e.g., a strengthened transcription factor binding site) that could make any decrease in translational efficiency caused by the uORF beneficial. Subsequent mutations might then tune the uORF's impact on translation to an optimal level or make its effects on translation context dependent. In vitro translation assays to determine the precise effects of individual uORFs are already underway using uORFs from Cryptococcus, and they will likely be necessary to determine which features of a uORF determine its impact on translation in this organism.

Changes in translational efficiency may be compensated for by changes in transcription rates for constitutively expressed genes, as well. Just as transcriptional regulatory elements have been observed to exhibit redundancy and turnover (Tanay, Regev, and Shamir 2005), translational regulatory elements may also undergo flux over evolutionary time periods, leading to a complex interplay within and between transcriptional and translational factors. The lack of detectable selection on uORFs at the coding level or for initiation/termination sequence context suggests that promoter elements and uORFs may evolve in close proximity in 5' leader sequences with low interference.

The degree to which translational efficiency affects organismal fitness is unclear. Empirical evidence from yeast indicates that there is a great deal of variation in translational efficiency among genes (MacKay et al. 2004). Variation in codon bias among genes in other genomes suggests that selection differentially influences translational efficiency among genes, not just in yeast but in a host of organisms (Akashi 2001; Duret 2002; Chamary, Parmley, and Hurst 2006). If a significant fraction of the genes in every genome do not experience strong selection for optimal translational efficiency, as many analyses of codon bias suggest, then perhaps the abundance of uORFs in eukaryotic genomes may be explained by viewing uORFs as selectively neutral features despite their impact on translation. Studies examining variation in translational efficiency at the population level would cast much light on this question.

The existence of uORFs conserved over deep evolutionary time, however (Iacono, Mignone, and Pesole 2005; Zhang and Dietrich 2005), suggests that many uORFs are not selectively neutral features and that they are conserved precisely because of their potential to have a negative impact on translational efficiency. Post-transcriptional gene regulation may not be efficient in terms of cellular resources and energy, but it may sometimes offer a more expedient mechanism of changing gene expression than transcriptional modulation. Certain classes of genes, such as those prone to aggregation when overexpressed (DePristo, Weinreich, and Hartl 2005) or oncogenes (Mehta, Trotta, and Peltz 2006), might benefit from features such as uORFs that check their ultimate expression level. Genes whose expression level must be precisely regulated might also benefit from reduced translational efficiency, as it has been demonstrated that a high rate of transcription coupled with a low rate of translation can minimize noise in eukaryotic gene expression (McAdams and Arkin 1997; Blake et al. 2003; Fraser et al. 2004).

Upstream ORFs are common, easily identified features of eukaryotic genes. The subtle, dual nature of the selective forces underpinning their conservation underscores the need for clusters of related genome sequences in deciphering functional noncoding elements. Further, the emerging universality of uORFs and other mechanisms of post-transcriptional gene regulation underscores the need for full-length cDNA libraries and other resources to identify 5' leader sequences and 3' untranslated regions in newly sequenced genomes. Given the ubiquity of uORFs in genomes, it is clear that comprehending the mechanism of their impact on translation will ultimately be essential to understanding eukaryotic gene expression.


    Supplementary Material
Supplementary Table 1 and Figure 1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
This work was supported in part by funding from the National Science Foundation and the National Institute of Allergy and Infectious Diseases. We thank Matt Sachs, Scott Roy, and two anonymous reviewers for helpful comments on this manuscript.


    Footnotes
 
Laura Katz, Associate Editor


    References

    Akashi H. Gene expression and molecular evolution. Curr Opin Genet Dev. (2001) 11:660–666.

    Arava Y, Boas FE, Brown PO, Herschlag D. Dissecting eukaryotic translation and its control by ribosome density mapping. Nucleic Acids Res. (2005) 33:2421–2432.

    Blake WJ, Kaern M, Cantor CR, Collins JJ. Noise in eukaryotic gene expression. Nature. (2003) 422:633–637.

    Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. (2003) 13:721–731.

    Chamary JV, Parmley JL, Hurst LD. Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat Rev Genet. (2006) 7:98–108.

    Churbanov A, Rogozin IB, Babenko VN, Ali H, Koonin EV. Evolutionary conservation suggests a regulatory function of AUG triplets in 5'-UTRs of eukaryotic genes. Nucleic Acids Res. (2005) 33:5512–5520.

    Crowe ML, Wang XQ, Rothnagel JA. Evidence for conservation and selection of upstream open reading frames suggests probable encoding of bioactive peptides. BMC Genomics. (2006) 7:16.

    DePristo MA, Weinreich DM, Hartl DL. Missense meanderings in sequence space: a biophysical view of protein evolution. Nat Rev Genet. (2005) 6:678–687.

    Duret L. Evolution of synonymous codon usage in metazoans. Curr Opin Genet Dev. (2002) 12:640–649.

    Fraser HB, Hirsh AE, Giaever G, Kumm J, Eisen MB. Noise minimization in eukaryotic gene expression. PLoS Biol. (2004) 2:e137.

    Gaba A, Wang Z, Krishnamoorthy T, Hinnebusch AG, Sachs MS. Physical evidence for distinct mechanisms of translational control by upstream open reading frames. EMBO J. (2001) 20:6453–6463.

    Galagan JE, Calvo SE, Cuomo C, et al, (50 co-authors). Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae. Nature. (2005) 438:1105–1115.

    Geballe AP, Sachs MS. Translational control by upstream open reading frames. In: Sonenberg N, Hershey JWB, Mathews MB, eds. Translational Control of Gene Expression. (2000) Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. 595–614.

    Grant CM, Hinnebusch AG. Effect of sequence context at stop codons on efficiency of reinitiation in GCN4 translational control. Mol Cell Biol. (1994) 14:606–618.

    Hinnebusch AG. Translational regulation of GCN4 and the general amino acid control of yeast. Annu Rev Microbiol. (2005) 59:407–450.

    Iacono M, Mignone F, Pesole G. uAUG and uORFs in human and rodent 5' untranslated mRNAs. Gene. (2005) 349:97–105.

    Kent WJ. BLAT--the BLAST-like alignment tool. Genome Res. (2002) 12:656–664.

    Kozak M. Determinants of translational fidelity and efficiency in vertebrate mRNAs. Biochimie. (1994) 76:815–821.

    Loftus BJ, Fung E, Roncaglia P, et al, (54 co-authors). The genome of the basidiomycetous yeast and human pathogen Cryptococcus neoformans. Science. (2005) 307:1321–1324.

    Lynch M. The origins of eukaryotic gene structure. Mol Biol Evol. (2006) 23:450–468.

    Lynch M, Scofield DG, Hong X. The evolution of transcription-initiation sites. Mol Biol Evol. (2005) 22:1137–1146.

    Ma B, Tromp J, Li M. PatternHunter: faster and more sensitive homology search. Bioinformatics. (2002) 18:440–445.

    MacKay VL, Li X, Flory MR, et al, (12 co-authors). Gene expression analyzed by high-resolution state array analysis and quantitative proteomics: response of yeast to mating pheromone. Mol Cell Proteomics. (2004) 3:478–489.

    McAdams HH, Arkin A. Stochastic mechanisms in gene expression. Proc Natl Acad Sci USA. (1997) 94:814–819.

    Mehta A, Trotta CR, Peltz SW. Derepression of the Her-2 uORF is mediated by a novel post-transcriptional control mechanism in cancer cells. Genes Dev. (2006) 20:939–953.

    Meijer HA, Thomas AA. Control of eukaryotic protein synthesis by upstream open reading frames in the 5'-untranslated region of an mRNA. Biochem J. (2002) 367:1–11.

    Miyasaka H. The positive relationship between codon usage bias and translation initiation AUG context in Saccharomyces cerevisiae. Yeast. (1999) 15:633–637.

    Morris DR, Geballe AP. Upstream open reading frames as regulators of mRNA translation. Mol Cell Biol. (2000) 20:8635–8642.

    Novembre JA. Accounting for background nucleotide composition when measuring codon usage bias. Mol Biol Evol. (2002) 19:1390–1394.

    Serikawa KA, Xu XL, MacKay VL, Law GL, Zong Q, Zhao LP, Bumgarner R, Morris DR. The transcriptome and its translation during recovery from cell cycle arrest in Saccharomyces cerevisiae. Mol Cell Proteomics. (2003) 2:191–204.

    Tanay A, Regev A, Shamir R. Conservation and evolvability in regulatory networks: the evolution of ribosomal regulation in yeast. Proc Natl Acad Sci USA. (2005) 102:7203–7208.

    Vilela C, McCarthy JE. Regulation of fungal gene expression via short open reading frames in the mRNA 5' untranslated region. Mol Microbiol. (2003) 49:859–867.

    Wang XQ, Rothnagel JA. 5'-untranslated regions with multiple upstream AUG codons can support low-level translation via leaky scanning and reinitiation. Nucleic Acids Res. (2004) 32:1382–1391.

    Wang Y, Liu CL, Storey JD, Tibshirani RJ, Herschlag D, Brown PO. Precision and functional specificity in mRNA decay. Proc Natl Acad Sci USA. (2002) 99:5860–5865.

    Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. (1997) 13:555–556.

    Zhang Z, Dietrich FS. Identification and characterization of upstream open reading frames (uORF) in the 5' untranslated regions (UTR) of genes in Saccharomyces cerevisiae. Curr Genet. (2005) 48:77–87.

Accepted for publication May 7, 2007.



