MBE Advance Access originally published online on September 4, 2008
Molecular Biology and Evolution 2008 25(11):2483-2492; doi:10.1093/molbev/msn195
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Analyzing the Evolution of RNA Secondary Structures in Vertebrate Introns Using Kimura's Model of Compensatory Fitness Interactions
Department of Biology II, Section of Evolutionary Biology, Ludwig-Maximilians-University, Munich, Germany
E-mail: piskol{at}bio.lmu.de.
| Abstract |
|---|
|
|
|---|
Previous studies have shown that splicing efficiency, and thus maturation of pre-mRNA, depends on the correct folding of the RNA molecule into a secondary or higher order structure. When disrupted by a mutation, aberrant folding may result in a lower splicing efficiency. However, the structure can be restored by a second, compensatory mutation. Here, we present a logistic regression approach to analyze the evolutionary dynamics of RNA secondary structures. We apply our approach to a set of computationally predicted RNA secondary structures in vertebrate introns. Our results are consistent with the hypothesis of a negative influence of the physical distance between pairing nucleotides on the occurrence of covariations, as predicted by Kimura's model of compensatory evolution. We also confirm the hypothesis that longer local secondary structure elements (helices) can accommodate a larger number of covariations, wobbles, and mismatches. Furthermore, we find that wobbles and mismatches are more frequent in the middle of a helix, whereas covariations occur preferentially at the helix ends. The GC content is a major determinant of this pattern.
Key Words: RNA secondary structure covariation compensatory mutation introns
| Introduction |
|---|
|
|
|---|
After the introduction of the concept of epistatic fitness interactions by Haldane (1931)
RNA structures are comprised of pairing regions and unpaired parts of the RNA sequence. If the structure, and thus the function of an RNA molecule, is more important than its sequence, epistatic selection is expected to retain the form of the structure. Therefore, single mutations within the pairing region of an RNA should be deleterious and selected against, if they destroy the structure. The original conformation, however, can be restored by a mutation that creates a complementary base on the opposite strand. This structure-restoring mutation is called "compensatory." The complete process of reestablishing the pairing by mutations from one canonical base pair to another leads to a "covariation."
If a mutation occurs in introns or at synonymous positions, it is usually assumed to be neutral as it does not change the protein. However, selection is still possible in such regions due to the various stages the mRNA molecule has to pass through during its maturation. Thus, not only the primary sequence of an mRNA or pre-mRNA but also its secondary and tertiary structures may play a role and be subject to selective pressure. Our interest in this study is directed toward introns because they may show coevolutionary patterns that are less confounded by other processes than those of coding regions (i.e., they do not underlie the selective constraints imposed by the coding function of a sequence).
The analysis of the evolution of compensatory mutations can be based on a two-locus, two-allele model described by Stephan (1996)
. It incorporates Kimura's (1985)
idea of compensatory neutral mutations, which states that individual mutations are deleterious but harmless in certain Watson–Crick base pair configurations. Under the assumption of a randomly mating diploid population, 2 linked loci with alleles A, a at locus 1 and B, b at locus 2 are examined. Alleles A and B may mutate to a and b, respectively. However, no back mutations are allowed due to the small probability of multiple hits. On the passage from AB to ab, two different intermediate conformations aB and Ab are possible. Assuming a genic selection scheme, these may have fitnesses 1 – s1 and 1 – s2, whereas the initial and end states have fitness 1. The recombination fraction between both loci is described by parameter r.
In the context of RNA structures, A and B may be identified with the bases adenine (A) and uracil (U), respectively, whereas a and b are guanine (G) and cytosine (C). The intermediate configurations will then be AC and GU and assigned the selection coefficients s1 and s2, respectively (usually s1 > s2 > 0, as GU pairs are more stable than AC pairs).
The rate of compensatory evolution kc is defined as the inverse of the expected transition time from state AB to ab. Depending on the strength of selection against deleterious intermediate states, kc may be influenced by recombination. In the case of strong selection, kc decreases exponentially with increasing r. This is due to the fact that recombination removes newly established beneficial double mutants (ab) from the gene pool. In the case of weak selection, kc is independent of r (Stephan 1996
). Thus, under strong selection against intermediate states, it is predicted that the number of observed covariations within RNA structures decreases as the distance, and hence the recombination rate, between loci increases—the so-called distance effect (Stephan and Kirby 1993
; Stephan 1996
; Chen et al. 1999
).
With the growing availability of comparative genomics data and sophisticated means for RNA secondary structure prediction, we are able to test the distance effect hypothesis and other properties of RNA secondary structures. To this end, we used computationally predicted RNA secondary structures in vertebrate introns (Pedersen et al. 2006
). These structures were inferred from sequence alignments of eight species covering a wide phylogenetic range (from humans to teleost fishes).
| Materials and Methods |
|---|
|
|
|---|
Data Source (RNA Secondary Structures and Alignments)
RNA secondary structures (folds) were downloaded from the hg18.evofold track of the UCSC Table Browser (http://genome.ucsc.edu/) (Karolchik et al. 2004
We are interested in evolutionary events that took place at the level of local structural elements (helices) in an RNA fold. We therefore defined a helix as a region in the secondary structure that is encompassed on both sides by interior, hairpin, or multiloops (Zucker and Stiegler 1981
). For each of the helices, we annotated the alignment columns that contained covariations and also identified wobble (GT) and mismatch base pairs. Furthermore, we determined the length, the average substitution rate, and the GC content of each helix as well as the average free energy of each fold. The helix length was taken as the number of pairing brackets in the RNA structure of that region. The average substitution rate was obtained by calculating the expected number of substitutions between each pair of sequences in the alignment (Jukes and Cantor 1969
), divided by the number of compared positions and averaged over all pairwise comparisons. Here, only positions in the alignment that were not involved in a covariation were included. The GC content of a helix was calculated as the number of GC base pairs in the alignment of all species and divided by the number of all canonical base pairs (GC + AT). The average free energy was calculated using RNAeval (Hofacker et al. 1994
) for each of the sequences in the alignment and averaged over all sequences.
Fold Selection Criteria
The data source constitutes a large set of high-quality RNA secondary structures. However, we retained only helices from folds that satisfied the following criteria: 1) folds located in introns, 2) only one splice form per gene, 3) fold length
50 nt, 4) alignment of
7 species, 5) helix length
3 nt, and 6) the fraction of substitutions in all pairs of sequences per helix
3/4.
These criteria were chosen for the following reasons. Intronic regions were of special interest as they are not under the selective pressure to code for amino acids. Because it is assumed that the fragments under investigation have a certain function (as structures are conserved over a wide phylogenetic range), they are expected to compensate for disruptions in the RNA structure. This makes Kimura's (1985)
model of epistatic interactions applicable. The functionality of RNA secondary structure and the applicability of Kimura's model were demonstrated in many systems. For example, it was found that the disruption of a hairpin structure in intron 1 of the alcohol dehydrogenase gene in Drosophila melanogaster is associated with a significantly reduced splicing efficiency and protein production (Chen and Stephan 2003
). A compensatory mutation, however, restored the original efficiency. We chose only folds with a size of at least 50 nt because in such folds, recombination would have had a sufficiently high chance to act such that a distance effect would be expected. By choosing alignments of seven or eight species, we ensured a sufficiently wide phylogenetic range that allowed us to detect covariations. It also guaranteed a higher alignment quality due to the higher conservation in this region. The substitution rate between pairs of sequences in each helix was used as a hint for correctly aligned sequences. A wrongly aligned sequence might have led to an incorrect prediction of covariations. We therefore removed helices from the data set if any pairwise comparison of sequences in their alignment yielded more than 75% differences. In addition, we removed complete folds if, on average, the sequences showed an exceptionally high free energy when folding them according to the predicted RNA secondary structure with RNAeval. This was taken as a sign for a misalignment of at least one sequence. Sequences with a high free energy do not fit well into the predicted RNA structure based on the complete alignment and the chance for them to appear in nature in that fold is low.
Logistic Regression
To monitor factors that are responsible for the occurrence of covariations or substitution events in general (covariations, wobbles, or mismatches), we applied a logistic regression approach. We assumed that the occurrence of these events is independent of each other (e.g., the occurrence of a covariation at a certain position in the helix does not depend on the occurrence of covariations at other positions of the helix). Under this simplifying assumption, the probability of observing a certain number of covariations in a helix follows a binomial distribution:
|
| (1) |
in the least complex way. In the case of covariations, the probability of observing one covariation given several independent variables can be described by
|
| (2) |
|
| (3) |
These coefficients are determined by a linear predictor after applying the logit transformation,
|
| (4) |
In regression analysis, the insignificance of an independent variable xk in the model can be understood as conditional independence between response and influence. When modeling data with a binary response, these responses are strictly bounded in the interval [0, 1]. A linear regression model might predict unwanted values that are smaller than 0 or grater than 1, while the logistic curve asymptotes at 0 and 1. Logistic regression can further account for a nonconstant variance, it can be applied to data that is not normally distributed; and in addition, as in the case of covariations, rare events can be modeled (Crawley 2005
). Furthermore, exp(βk) can be interpreted as the multiplicative change in odds of observing an event given a change of variable xk by one unit (Supplementary Material online).
Logistic regression models can be set up using different numbers of independent variables. The more variables are used, the better the resulting fit. However, retention of some of these variables may not significantly improve the fit. To balance model simplicity and fit, we applied a model simplification procedure whenever necessary. Therefore, first the automatic procedure based on Akaike's information criterion in R (R Development Core Team 2006
) was used to determine insignificant variables. Close-to-significant variables that still remained in the model were then tested for their justification by analysis of deviance.
Independent Variables in Logistic Regression Analysis
When using logistic regression to identify forces that may be responsible for the occurrence of covariations, wobbles, or mismatches, one has to decide on variables that describe aspects of the RNA secondary structure. Kimura's (1985)
model suggests that at least three parameters are needed to explain the evolution by compensatory mutations: the rate of recombination between pairing nucleotides, the strength of selection against individual mutations, and the mutation rate. In our study, we used the distance in sequence between the two interacting nucleotides in a helix (x1) instead of the recombination rate. To describe the strength of selection, we used 1) the length of a helix (x2), 2) the distance of the mutated position to the end of the helix (x3), and 3) the average substitution rate in the helix (x4). The latter variable also covers properties of the mutation rate, as does the GC content of a helix that was included in our analysis as variable x5. Depending on whether the occurrence of covariations or wobbles/mismatches was under investigation, a slightly different measurement of these variables was applied. The most accurate description of these variables can be achieved by recording them for all pairing nucleotides in each sequence of the alignment one by one. The exact distance in sequence between two pairing positions and to the end of the helix were obtained in this way. This approach, however, is only applicable to wobbles and mismatches because their identification does not depend on the phylogenetic tree. In the case of covariations, it was not always discernable which species displayed the ancestral base pair and which one the base pair emerging from the covariation event. Thus, for the analysis, it would have been necessary to omit many covariations. We overcame this problem by sacrificing some accuracy in measuring the influencing variables and included all observations of covariation events. That is, we chose as response variable the presence or absence of a covariation in a column of the complete alignment. It was therefore not necessary to know in which species the substitutions occurred. However, to obtain the distance between two columns, it was necessary to average the distance over all sequences and also to average the distance to the end of the local secondary structure element. The remaining three variables helix length, average substitution rate, and GC content per helix were determined for covariations, wobbles, and mismatches as described in the Data Source (RNA Secondary Structures and Alignments).
To take the difference in the recombination rate between autosomes and X chromosome into consideration, the distances on the X chromosome were multiplied by 2/3. This compensates for the fact that males have only one X chromosome, which reduces the overall possibility of the X chromosome to recombine.
Some of the influencing variables are correlated with each other (supplementary material table S1, Supplementary Material online), which may lead to the effect of multicollinearity in the logistic regression analysis. The strongest correlation is between the position in a helix and its length (Pearson:
= 0.5344, P < 2.2 x 10–16). Multicollinearity may render the results of the logistic regression hard to interpret. It can lead to insignificant coefficients even though the independent variable is important and can also result in very large confidence intervals (CIs) for the regression coefficients (Belsley et al. 1980
). These intervals may contain 0 and make it impossible to tell the direction of the influence variable. The presence of multicollinearity was tested by the variance inflation factor (VIF) (Chatterjee and Price 1991
), which in case of multicollinearity exhibits values higher than 10 (Menard 1995
).
| Results |
|---|
|
|
|---|
Data sets and Identification of Covariations
We applied the selection criteria outlined in Materials and Methods to the complete data set, thus reducing it from 47,510- to 507-folds. These structures represent the first set of data to be analyzed and will be denoted as data set DS1. They contain 3,116 local structural elements (helices), of which 246 hold a total of 284 covarying columns in their alignments (i.e., some helices contain more than one covariation). In four of these columns, two covariations are found, respectively, which results in a total of 288 covariations. Yet, our logistic regression approach based on a binary response variable takes only the presence or absence of a covariation in a column into account regardless of the number of covariations at this position.
It is obvious that not all of these helices will underlie the same amount of evolutionary constraint. Some parts of the fold may carry an important role in the function of the complete RNA molecule, whereas others may be of less relevance. The helix that connects the 3' and 5' ends of the sequence might have a significant role because it determines the closure of the structure. Therefore, we collected all such helices from DS1 into a smaller data set DS2. In 21 of 507 folds from DS1, the closing helix was shorter than 3 nt and is thus not considered here. This resulted in a set of 486 closing helices, of which 40 contained a total of 51 columns with covariations.
These two data sets were analyzed for the evolutionary time point at which a covariation occurred. In DS1, only for 94 of the 284 columns could it be unambiguously determined which species contained the ancestral base pair and which the derived one. These covariations are shown in figure 1 at the branches of the tree. The number of covariations is expected to be identical in branches of the same length in the phylogenetic tree. However, the current data set shows a clear overrepresentation of covariations in branches leading to zebrafish and pufferfish. This finding may be attributed to relaxed selective constraints in these species due to whole-genome duplications as described by Christoffels et al. (2004)
and Volff (2005)
and references therein. Such reduced constraints play an important role in the evolution of duplicated genes (Ohta 1988
). They are expected to not only lead to a higher number of compensatory substitutions but also reduce the distance effect (Kimura 1985
).
|
In total, 18 of the 507 fragments in DS1 overlap with miRNA structures. To check whether these structures have different evolutionary constraints than the rest of the folds, the average substitution rates of the helices in miRNA and non-miRNA folds were compared. Because these rates did not show any significant difference (supplementary table S2, Supplementary Material online), all folds were retained in the data set. We also accounted for putatively different selective pressures between helices by including the average substitution rate into the models.
Logistic Regression Models for Covariations
We first investigated factors that influence the occurrence of covariations by setting up logistic regression models for DS1 and DS2 (table 1). In this configuration, the response variable corresponds to the presence or absence of a covariation in a column of the alignment, whereas the independent variables are chosen as described in Materials and Methods. Aside from the selection criteria that were applied above, we chose only columns that had an average distance of more than 50 nt. This ensured that recombination acting on these regions was sufficiently frequent. In both data sets, relatively few covariations were found—in DS1, 126 columns contain a covariation, whereas in 6,445 columns, covariations are absent (47 vs. 2,596 in DS2).
|
Data set DS1 shows borderline significance in the negative effect of the distance, whereas in DS2, the distance has a highly significant negative influence. This can be attributed to the effect of recombination, which removes newly established double mutants from the population. Its effect is stronger in DS2, which may indicate a stronger selection pressure on the helices in this data set. In figure 2 in DS2, the fitted probabilities of observing a covariation are plotted against the distance. The solid line resembles predicted probabilities at different distances based on the estimated coefficients in logistic regression and holding all other free variables constant at their mean. Due to the vast overrepresentation of columns that do not contain a covariation, the fitted probabilities are very low. Nonetheless, the negative trend induced by the distance variable is clearly visible.
|
The distance has a negative effect, while the length of a helix positively affects the occurrence of covariations. Both data sets show a significantly positive estimate. The same influence of this variable was also found by Parsch et al. (2000)
Furthermore, in both data sets, the distance to the end of a helix shows a significantly negative effect. It suggests that covariations have a tendency to appear at the ends of local structural elements.
The average substitution rate, on the other hand, does not play a role in the occurrence of covariations. This is not surprising as we calculated the substitution rate by excluding columns that contained covariations. The GC content influences covariations in a highly significant positive manner. Indeed, its effect is estimated to be the most reliable in the model (smallest p values).
In general, the estimated coefficients for all independent variables differ widely in their orders of magnitude due to the different scales of the influence variables. The VIF values for all of the coefficients are less than 1.3 (supplementary table S3, Supplementary Material online). Based on the common threshold of 10 for the VIF, this suggests that multicollinearity is not a problem in our analysis. Even if the CIs of each of the coefficients are large (table 2), none of them includes 0. This allows us to infer the direction of the influence variable even though no definitive conclusion about its exact value can be made.
|
For instance, given the CI for the distance variable in DS2 (–0.0190 to –0.0040), the increase in distance between two columns by one unit results in a relative change in the odds of observing a covariation by a factor between 0.9812 and 0.9960. Although these numbers represent only a small decrease for the change by one unit, a change by 50 units (nucleotides) would alter the initial odds by a factor between 0.3871 and 0.8184. The distance estimate from table 1 (–0.0107) would yield a change by a factor of 0.5857 (=exp(–0.0107)50). Therefore, the odds of observing a covariation are reduced by nearly one-half when the distance increases by only 50 nt—a rather strong effect.
Logistic Regression Models for Wobbles and Mismatches
We further tested how the influence variables in DS1 affect the occurrence of wobble and mismatch events. Therefore, the response variable was chosen to indicate the presence or absence of a wobble or mismatch, respectively. Again, all base pairs with a distance smaller than 50 nt were removed from the data. A total of 42,164 canonical base pairs, 4,739 wobble pairs, and 522 mismatch pairs were available for the logistic regression analysis. Table 3 shows the estimated model parameters for wobbles and mismatches after reducing the models as much as possible (removing insignificant influence variables). The distance between pairing nucleotides seems to play a role in the occurrence of wobbles; however, the estimated coefficient and thus the reduction in the odds of observing a wobble is much smaller than for covariations (exp(–0.0011)50 = 0.9465). The distance has no effect on mismatches. For both wobbles and mismatches, the helix length plays a significant role, which agrees with previous results for covariations and suggests that wobbles and mismatches tend to occur only in longer helices that have the capacity to tolerate slight disruptions of the structure. Furthermore, both models suggest that wobbles and mismatches are more frequent at greater distances from the helix ends (i.e., in the inner regions of the helix). This is in contrast to the estimates that were obtained for covariations. It shows that suboptimal base pairs may be tolerated in the middle, whereas the ends of helices tend to be preserved by covariation events. Both models agree on a significantly positive effect of the average substitution rate. This was expected as the substitution rate should directly influence divergence. The models, however, differ in the influence of the GC content, which appears to be an important factor only for mismatches. The estimated coefficient is smaller than its counterpart in the model for covariations, indicating that the occurrence of mismatches does not depend as strongly on the GC content. The occurrence of wobbles, on the other hand, does not depend on the GC content at all. Although a higher GC content may lead to a higher mutation rate (Smith et al. 2002
; Ochman 2003
), the probability of fixation of these mutations appears to be a limiting factor. A possible reason may be that the total number of wobbles in a helix (which on average is much higher than those of mismatches and covariations) is limited to preserve the stability of the helix.
|
For wobbles and also mismatches, none of the CIs for the coefficients overlap with 0, and the VIF values are also low (<1.49) (supplementary table S4, Supplementary Material online).
Distribution of GC Base Pairs over Helix Positions
In addition to the logistic regression analysis, we investigated several other characteristics of the data sets. These help to validate the reliability of the results and give explanations for effects seen in the logistic regression analysis. One of these factors is the GC content and its distribution over helix positions. In general, intronic regions in Drosophila show higher AT levels than neighboring exons (Chen and Stephan 2003
). Zhang (1998)
found the GC content in human exons to be 53%. If the observation of Chen and Stephan is also valid in humans, then a GC content lower than 53% would be expected. This was found particularly in introns of intermediate and large sizes (Kalari et al. 2006
), where the GC content of first introns was as low as
40%.
The average GC content of all folds in the original data set (Pedersen et al. 2006
) is 41.15%. This low value can be attributed to the adjustment of the EvoFold secondary structure prediction algorithm against predictions in GC-rich regions and was verified by a study on data from the ENCODE project (Washietl et al. 2007
). For the structures we selected, the average GC content was found to be 37.87% and 38.43% in DS1 and DS2, respectively. These values are even lower than the one obtained for the original data set and suggest that the folds were correctly annotated as intronic.
We were also interested in the distribution of GC nucleotides over helix positions. For functional RNAs, it is believed that the structure of the RNA is more important than its sequence. Hence, helices within a fold should retain their positions within the sequence. Because G and C are bound by three hydrogen bonds, they should be preferably located at helix ends to prevent those ends from breaking apart. On the other hand, as GC base pairs were found to be more mutable than AT base pairs (Smith et al. 2002
; Ochman 2003
); their distribution over helix positions may also play an important role in the location of compensatory mutations. Indeed, figure 3 shows that the distribution of AT nucleotides (light gray) over helix positions is shifted toward greater values, whereas GC nucleotides (dark gray) occur preferentially at positions 0 and 1. A one-sided Wilcoxon rank-sum test confirmed these results with P < 2.2 x 10–16 and 2.22 x 10–16 for DS1 and DS2, respectively. It supports the initial conjecture that GC pairs are necessary to maintain the stability of the ends of helical regions and thus ensure the form of the fold. Because the above observation is based solely on columns that did not change (columns with covariation events were excluded), the elevated GC content at these positions may account to some extent for the preferred occurrence of covariations at helix ends.
|
Distributions of Distance, Helix Length, and Distance to Helix End for Covariations and Wobbles
We also investigated the distributions of the distance, helix length, and the distance to the end of a helix with regard to covariations and wobbles more closely. For a detailed analysis of the distance, data set DS1 as well as DS2 were split into two categories depending on the presence or absence of a covariation in a column (fig. 4a) and the presence or absence of a wobble in a base pair (fig. 4b). Subsequently the distance distributions for both categories were compared using the Wilcoxon rank-sum test. The distances for covarying columns are significantly smaller than distances for columns that do not contain such a substitution event (Wilcoxon one-sided W = 347,685, P = 0.0028 [DS1]; W = 39,923, P = 2.389 x 10–16 [DS2]). For wobbles, the distance between wobble base pairs is only significantly smaller in DS1 (W = 9,645,880, P = 4.648 x 10–16), whereas no significant effect is seen in DS2 (W = 15,068,426, P = 0.2040). These values support the results found in the logistic regression analysis which showed only a very weak effect of distance on wobbles but a more pronounced one on covariations.
|
In the case of helix length as the variable under investigation, the categories were dependent on the presence or absence of covariations in the helix. The Wilcoxon rank-sum test was then applied to the helix length distributions for both groups (fig. 4c). The same was done for helices containing wobbles (fig. 4d). Helices containing covariations and wobbles show greater lengths than helices without these substitutions. This observation is significant for wobbles (Wilcoxon one-sided W = 1,696,052, P < 2.2 x 10–16). For covariations, the shift of the distribution containing covariations only exhibits a significant difference in DS1 (Wilcoxon one-sided W = 423,819.5, P = 5.203 x 10–16), whereas for DS2, it is not significant (Wilcoxon one-sided W = 9,966, P = 0.107).
A similar analysis was applied to the distance to the helix end for covariations and wobbles. For covariations, the two groups consisted of pairing columns containing a covariation and those columns lacking it (fig. 4e). In the case of wobbles, base pairs containing a wobble were compared with canonical base pairs (canonical base pairs from all columns that contained a covariation were excluded; fig. 4f). The Wilcoxon rank-sum test for DS1 and DS2 shows that the distribution of base pairs containing wobbles is shifted toward higher values (P < 2.2 x 10–16 for DS1 and DS2). This corresponds to the results obtained in logistic regression that the influence of the distance to the helix end showed a highly significant positive effect on the occurrence of wobbles. For covariations, the exact opposite effect is found. This is also in accordance with the regression analysis and demonstrates that covariations occur preferably at the ends of helices. A significant result is obtained for DS1 (Wilcoxon one-sided W = 2,079,410, P = 0.0038), whereas the result for DS2 is only marginally significant due to the small sample size (Wilcoxon one-sided W = 67,353, P = 0.0615). Significance in DS1 is largely due to columns containing covariations that are located at distance 0 to the helix end (just before an unpaired region). Removing them gives P = 0.074.
Base Pairs Involved in Covariations
The compensatory substitutions were also examined for the composition of nucleotide pairs that are involved in such events. We used all 288 covariation events and split the species into groups according to the distinct Watson–Crick base pair configurations for each event (table 4). Because the direction of substitutions was not taken into account, only the upper part of the table is filled. It can be seen that there is an excess of the base pair combinations AT, GC and TA, CG. These base pairs have the ability to mutate into each other via the intermediate step of a wobble pair (AT
GT
GC) by two consecutive transitions. This seems to be the most favorable path as it assures that the disruption of a structure is relatively weak and its functionality is not lost. Because the nucleotides in the wobble intermediate are still associated with each other by two hydrogen bonds, the stability is higher than for noncanonical base pairs. This results in a higher probability to experience a compensatory mutation because the intermediate variant is more frequent in the population. Based on the 94 covariations with known direction, we further observed a slight overrepresentation of covariations that remove GC base pairs from the structure.
|
Ochman (2003)
| Discussion |
|---|
|
|
|---|
A preliminary correlation analysis in a Drosophila data set (Drosophila 12 Genomes Consortium 2007
By applying logistic regression analysis, we were able to confirm the predicted distance effect for covariations. This effect is particularly strong for helices at the end of folds. According to Kimura's (1985)
model, this can be attributed to the circumstance that the distance has an impact on newly established double mutants by making them prone to recombination. Although there is no doubt about the negative influence of distance, great caution has to be taken in inferring the strength of the influence. Due to the large CIs for the distance variable, it is difficult to make quantitative predictions. Although the distance between pairing positions has a negative effect on covariations, the occurrence of wobbles and mismatches seems to be independent of it.
We were also able to confirm that the length of a helix exhibits a positive effect on the occurrence of covariations (Parsch et al. 2000)
. Furthermore, we found that wobbles and mismatches are more often present in longer helices. Although the majority of helices without substitutions are of length 3, the majority of helices with covariations or wobbles have a length of 4–6 nt (fig. 4c and d). Relaxed constraints in helices of these lengths seem to play an important role in this observation.
Another important variable we considered is the distance of a nucleotide site to the nearest helix end. Zucker and Sankoff (1984)
noticed that wobble base pairs occur frequently in the internal parts of helical structures. We found that this observation holds not only for wobble pairs but also for mismatches (table 3). Because these two classes of changes represent the majority of the variability in the alignment of a helix, we may conclude that lower selection pressure is present in these inner regions. One might expect that this higher variability in inner regions would ensure that covariations are more frequent in the middle of helices. However, the negative estimates for the distance to the helix end in table 1 clearly contradict this and show that covariations tend to be more often present at the ends of a helix, in particular at the outmost position (fig. 4e). A possible explanation for this observation may be found in transcription-directed mutagenesis (Burkala et al. 2007
). Accordingly, during transcription, the nontranscribed strand is present in a single-stranded form, which makes it susceptible to mutations. The advancing polymerase leads to supercoiling of the single-stranded DNA. Regions that contain inverted complements are thereby arranged into a secondary structure by the pairing of the DNA with itself. Unpaired or mispaired bases in this structure have a mutation rate many times higher than paired ones (Wright 2000
) as they are exposed by a greater extent to the soluble environment and thus to nucleotide altering enzymes (Burkala et al. 2007
). It was pointed out by Wright et al. (2003)
that positions located in close proximity to the stem are the most mutable ones. Therefore, mutations at the ends of helices should be compensated by a second mutation more frequently than at positions that are located within helices. The impact of transcription-directed mutagenesis cannot only be detected experimentally but also leaves signals over evolutionary times (Hoede et al. 2006
).
The GC content and its distribution along a helix seems to have a remarkable influence on the observed pattern of variation. The occurrence of covariations and mismatches does generally increase with the GC content. In addition to transcription-directed mutagenesis (described above), it may also contribute to the higher rate of covariations at helix ends as GC bases are more mutable and tend to occur at helix ends. Wobbles, however, are independent of the GC content of a helix, possibly because their high abundance in helices leads to a saturation effect such that new GT mutations can no longer go to fixation.
Finally, we compare our results on the evolution of secondary structures in vertebrates with observations from Drosophila. Both influences of distance and helix length were previously analyzed in Drosophila data. A study of intronic RNA secondary structures (Drosophila 12 Genomes Consortium 2007
) that were predicted by RNAalifold (Hofacker et al. 2002
) based on six Drosophila species found a negative correlation between distance of pairing regions and the rate of compensatory substitutions. Parsch et al. (2000)
noticed a positive correlation between helix length and the rate of compensatory evolution in helices of Adh introns and bicoid 3' UTR. They also showed a decrease in the rate of compensatory evolution with increasing distance between paired residues, thus confirming our findings. Furthermore, the bicoid structure was analyzed by Innan and Stephan (2001)
. They found the number of substitutions per site in unpaired regions to be higher than in paired regions. Our data shows the same pattern (supplementary table S5, Supplementary Material online).
Although the effects seen in the Drosophila data were generally based on small data sets, we rely here on a much larger set of structures. The recent availability of large Drosophila data sets from the Drosophila 12 Genomes Consortium (2007)
makes it now possible to perform the analysis we have done on vertebrate structures on Drosophila data as well. Using 125 high-quality RNA secondary structures that were derived from 12 Drosophila genomes (Stark et al. 2007
), we were able to confirm the negative influence of distance between pairing columns on the rate of compensatory evolution and also the distance to the end of a helix showed the same trend as was found in the vertebrate data.
| Supplementary Material |
|---|
|
|
|---|
Supplementary material, figures 1 and 3, tables S1–S5, and alignments of the analyzed folds are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
|
|
|---|
We thank the LMU STABLAB for assistance in the regression analysis and an anonymous reviewer for very valuable comments on the manuscript. This work was supported by grant DFG (Deutsche Forschungsgemeinschaft) Ste 325/8 to W.S.
| Footnotes |
|---|
Hideki Innan, Associate Editor
| References |
|---|
|
|
|---|
Belsley DA, Kuh E, Welsch RE. Regression diagnostics: identifying influential data and sources of collinearity (1980) New York: John Wiley & Sons, Ltd.
Blanchette M, Kent WJ, Riemer C, et al, (12 co-authors). Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res (2004) 14(4):708–715.
Burkala E, Reimers JM, Schmidt KH, Davis N, Wei P, Wright BE. Secondary structures as predictors of mutation potential in the lacZ gene of Escherichia coli. Microbiology (2007) 153(7):2180–2189.
Chatterjee S, Price B. Regression analysis by example (1991) 2nd ed. New York: John Wiley & Sons, Ltd.
Chen Y, Carlini DB, Baines JF, Parsch J, Braverman JM, Tanda S, Stephan W. RNA secondary structure and compensatory evolution. Genes Genet Syst (1999) 74(6):271–286.[CrossRef][Web of Science][Medline]
Chen Y, Stephan W. Compensatory evolution of a precursor messenger RNA secondary structure in the Drosophila melanogaster Adh gene. Proc Natl Acad Sci USA (2003) 100(20):11499–11504.
Christoffels A, Koh EGL, Chia J-M, Brenner S, Aparicio S, Venkatesh B. Fugu genome analysis provides evidence for a whole-genome duplication early during the evolution of ray-finned fishes. Mol Biol Evol (2004) 21(6):1146–1151.
Crawley MJ. Statistics—an introduction using R (2005) New York: John Wiley & Sons.
Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature (2007) 450:203–219.[CrossRef][Web of Science][Medline]
Haldane J. A mathematical theory of natural selection VIII. Metastable populations. Proc Camb Philol Soc (1931) 27:137–142.
Hoede C, Denamur E, Tenaillon O. Selection acts on DNA secondary structures to decrease transcriptional mutagenesis. PLoS Genet (2006) 2(11):e176.[CrossRef][Medline]
Hofacker IL, Fekete M, Stadler PF. Secondary structure prediction for aligned RNA sequences. J Mol Biol (2002) 319(5):1059–1066.[CrossRef][Web of Science][Medline]
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatsh Chem (1994) 125:167–188.[CrossRef]
Innan H, Stephan W. Selection intensity against deleterious mutations in RNA secondary structures and rate of compensatory nucleotide substitutions. Genetics (2001) 159(1):389–399.
Jukes T, Cantor C. Evolution of protein molecules. In: Munro HN, Mammalian protein metabolism (1969) New York: Academic Press. 21–132.
Kalari KR, Casavant M, Bair TB, Keen HL, Comeron JM, Casavant TL, Scheetz TE. First exons and introns—a survey of GC content and gene structure in the human genome. In Silico Biol (2006) 6(3):237–242.[Medline]
Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ. The UCSC Table Browser data retrieval tool. Nucleic Acids Res (2004) 32(Database issue):D493–D496.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res (2002) 12(6):996–1006.
Kimura M. The role of compensatory neutral mutations in molecular evolution. J Genet (1985) 64:7–19.[CrossRef]
Kirby DA, Muse SV, Stephan W. Maintenance of pre-mRNA secondary structure by epistatic selection. Proc Natl Acad Sci USA (1995) 92(20):9047–9051.
Menard S. Applied logistic regression analysis. (1995) Thousands Oaks (CA): SAGE Publications.
Ochman H. Neutral mutations and neutral substitutions in bacterial genomes. Mol Biol Evol (2003) 20(12):2091–2096.
Ohta T. Evolution by gene duplication and compensatory advantageous mutations. Genetics (1988) 120(3):841–847.
Parsch J, Braverman JM, Stephan W. Comparative sequence analysis and patterns of covariation in RNA secondary structures. Genetics (2000) 154(2):909–921.
Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, Kent J, Miller W, Haussler D. Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol (2006) 2(4):e33.[CrossRef][Medline]
R Development Core Team. R: a language and environment for statistical computing (2006) Vienna (Austria): R Foundation for Statistical Computing. Available from http://www.R-project.org.
Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res (2003) 13(1):103–107.
Smith NGC, Webster MT, Ellegren H. Deterministic mutation rate variation in the human genome. Genome Res (2002) 12(9):1350–1356.
Stark A, Lin MF, Kheradpour P, et al, (46 co-authors). Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature (2007) 450(7167):219–232.[CrossRef][Web of Science][Medline]
Stephan W. The rate of compensatory evolution. Genetics (1996) 144(1):419–426.[Abstract]
Stephan W, Kirby DA. RNA folding in Drosophila shows a distance effect for compensatory fitness interactions. Genetics (1993) 135(1):97–103.[Abstract]
Volff J-N. Genome evolution and biodiversity in teleost fish. Heredity (2005) 94(3):280–294.[CrossRef][Web of Science][Medline]
Washietl S, Pedersen JS, Korbel JO, et al, (24 co-authors). Structured RNAs in the ENCODE selected regions of the human genome. Genome Res (2007) 17(6):852–864.
Wright BE. A biochemical mechanism for nonrandom mutations and evolution. J Bacteriol (2000) 182(11):2993–3001.
Wright BE, Reschke DK, Schmidt KH, Reimers JM, Knight W. Predicting mutation frequencies in stem-loop structures of derepressed genes: implications for evolution. Mol Microbiol (2003) 48(2):429–441.[CrossRef][Web of Science][Medline]
Wright S. Evolution in Mendelian populations. Genetics (1931) 16(2):97–159.
Zhang MQ. Statistical features of human exons and their flanking regions. Hum Mol Genet (1998) 7(5):919–932.
Zucker M, Sankoff D. RNA secondary structures and their prediction. Bull Math Biol (1984) 46:591–621.[Web of Science]
Zucker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res (1981) 9(1):133–148.
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



