MBE Advance Access originally published online on May 23, 2006
Molecular Biology and Evolution 2006 23(8):1525-1537; doi:10.1093/molbev/msl015
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Article |
Dependence among Sites in RNA Evolution
Bioinformatics Research Center, North Carolina State University
E-mail: thorne{at}statgen.ncsu.edu.
| Abstract |
|---|
|
|
|---|
Although probabilistic models of genotype (e.g., DNA sequence) evolution have been greatly elaborated, less attention has been paid to the effect of phenotype on the evolution of the genotype. Here we propose an evolutionary model and a Bayesian inference procedure that are aimed at filling this gap. In the model, RNA secondary structure links genotype and phenotype by treating the approximate free energy of a sequence folded into a secondary structure as a surrogate for fitness. The underlying idea is that a nucleotide substitution resulting in a more stable secondary structure should have a higher rate than a substitution that yields a less stable secondary structure. This free energy approach incorporates evolutionary dependencies among sequence positions beyond those that are reflected simply by jointly modeling change at paired positions in an RNA helix. Although there is not a formal requirement with this approach that secondary structure be known and nearly invariant over evolutionary time, computational considerations make these assumptions attractive and they have been adopted in a software program that permits statistical analysis of multiple homologous sequences that are related via a known phylogenetic tree topology. Analyses of 5S ribosomal RNA sequences are presented to illustrate and quantify the strong impact that RNA secondary structure has on substitution rates. Analyses on simulated sequences show that the new inference procedure has reasonable statistical properties. Potential applications of this procedure, including improved ancestral sequence inference and location of functionally interesting sites, are discussed.
Key Words: RNA secondary structure dependence among sites substitution rate
| Introduction |
|---|
|
|
|---|
The connection of genotype to phenotype is central to evolution. Ideally, models of nucleotide substitution would fully reflect the impact of phenotype on the change over time of the underlying genotype. In practice, models of nucleotide sequence change are usually constructed with little regard to the phenotype.
Notable exceptions to the usual practice are models of nucleotide substitution that incorporate information about RNA secondary structure. RNA secondary structures change more slowly than do the sequences that fold into them (e.g., Dixon and Hillis 1993
; Gutell 1996
; Higgs 2000
). A variety of nucleotide substitution models have been constructed that exploit the slow change of ribosomal and transfer RNA secondary structures (e.g., Schöniger and von Haeseler 1994
; Tillier 1994
; Muse 1995
; Rzhetsky 1995
; Tillier and Collins 1995
, 1998
; Smith et al. 2004
). An assumption of these models is that RNA secondary structure is known and invariant over time. This allows all residues in an RNA sequence to be classified on the basis of whether they pair with another site in the sequence. For unpaired sites, conventional models of nucleotide substitution are adopted. For paired sites in a helical region of RNA secondary structure, evolution at the two positions is jointly modeled. The joint modeling of paired sites is performed because substitution events that restore or preserve hydrogen bonding can have higher rates than those that disrupt hydrogen bonding. These models that incorporate RNA secondary structure permit correlated nucleotide substitutions at paired sites, but they require unpaired positions to change independently. In addition, two sites that are paired are assumed by these models to evolve independently of other site pairs.
In contrast, the most successful techniques for approximating free energy of RNA secondary structure (e.g., Zuker and Stiegler 1981
, Zuker 1989
) do not independently handle different residue pairs in an RNA helix. Instead, the approximated free energy of an RNA secondary structure depends on the "stacking" interactions between residue pairs that are adjacent in an RNA helix. Because RNA secondary structure tends to be preserved during evolution and because the phenomena that assist energy-based RNA secondary structure prediction are likely to affect evolutionary patterns and rates, it seems worthwhile to incorporate these phenomena into probabilistic evolutionary models.
Stacking interactions between adjacent residue pairs in an RNA helix induce an evolutionary dependence among sites that cannot be handled by conventional evolutionary inference procedures. Other forms of dependence (e.g., the specific ordering of residues within a hairpin loop) are also considered by algorithms for approximating free energy of RNA structures but not by the aforementioned probabilistic descriptions of how RNA secondary structure influences sequence evolution. The assumption by conventional procedures of independent evolution among sites allows the likelihood for an entire data set of aligned sequences to be expressed as a product of individual site likelihoods, and each site likelihood can be determined by applying the pruning algorithm of Felsenstein (1981)
. The pruning algorithm relies on the ability to calculate the transition probability of observing a specific residue type at the end of a branch given the parameter values of the evolutionary model. With more general forms of evolutionary dependence among sites, transition probability calculations become intractable.
Starting with the pioneering work of Jensen and Pedersen (Jenson and Pedersen 2000
; Pedersen and Jensen 2001)
, there has been much recent effort to make statistical inferences about evolution when sequence sites do not change independently (e.g., Robinson et al. 2003
; Hwang and Green 2004
; Pedersen et al. 2004
; Siepel and Haussler 2004
; Rodrigue et al. 2005
). Here, we modify the sequence path approach of Robinson et al. (2003)
to formulate and explore a possibility where the relative rate of sequence evolution is affected by approximate free energy of RNA secondary structure. We discuss the strengths and weaknesses of our approach as well as some of its potential applications.
| Materials and Methods |
|---|
|
|
|---|
Parameterization
We introduce a Markovian model for evolution of RNA-encoding regions with constraints due to secondary structure. In terms of both parameterization and statistical inference, this study closely parallels the protein evolution work of Robinson et al. (2003)
The values of the 3N entries in each row of R that can be positive depend on how the corresponding substitutions affect the approximate free energy. The biologically plausible expectation is that a rate should be relatively high if the substitution improves the stability of RNA secondary structure. Likewise, if the secondary structure becomes less stable due to a specific substitution event, then the rate should be low. To assess the stability of a sequence folded into a specific secondary structure, we approximate the free energy of the sequence. This approximate free energy will be denoted E(i), and details for how it is calculated in our implementation will be provided below.
Otherwise, our parameterization is similar to that of widely used nucleotide models. We include parameters
A,
G,
C, and
T (
A +
G +
C +
T = 1 and these parameters are all positive) so that mutation rates to the 4 nucleotide types need not be equal. Here, we are studying the evolution of DNA sequences that encode RNAs. Although all thymines are transcribed to uracils, evolution occurs at the DNA level, and we therefore use
T rather than
U. The parameter
differentiates between transitions and transversions. The instantaneous substitution rate Rij is set to 0 if sequences i and j differ at more than one site. For cases where i and j differ by exactly one site that has type h (h
{A, C, G, T}) in j, the rate matrix entries are
![]() | (1) |
When the parameter s is zero, secondary structure does not affect the substitution rates, and our model reduces to the widely used "HKY" independent site model (Hasegawa et al. 1985
). The biologically reasonable s values are positive because these mean that low free energy of RNA secondary structure is favored by evolution. It is formally possible to have a negative value for s. This would indicate that higher free energies and unstable secondary structures are favored by evolution. This scenario is biologically implausible for most situations. A partial validation of our approach would be to determine whether data support positive values for s.
Stationary Probabilities of Sequences
For this model, the stationary probability of a sequence i with length N is
![]() | (2) |
represents all the parameters in this dependence model. Inspection verifies that detailed balance is fulfilled (p(i|
)Rij = p(j|
)Rji for all possible i and j), and therefore, the model is time reversible. The denominator of the equation for p(i|
) is the summation over all possible sequences with length N. When s = 0, the denominator is 1, and the stationary distribution becomes the product of nucleotide frequencies. If s is substantially above 0, the stationary distribution is concentrated among sequences with a particularly good fit between sequence and secondary structure. Likewise, if s is substantially below 0, the stationary distribution is characterized by a relatively small number of sequences with particularly high free energies. The dependence of this stationary distribution on s means that s can be estimated from a single sequence. Therefore, it is not necessary to have a data set with 2 or more related sequences to get information about the impact of RNA secondary structure on evolution.
Sequence Path Density
For conventional models of sequence change, the dimension of the substitution rate matrix is manageable (e.g., a codon-based model with a 61 x 61 rate matrix), and matrix exponentiation can be applied to calculate a transition probability. With the dependence structure adopted here, the rate matrix R is 4N x 4N, and it is not feasible to exponentiate R unless N is extremely small. To overcome this high dimensionality, we employ a sequence path approach (Jensen and Pedersen 2000
; Pedersen and Jensen 2001
; Robinson et al. 2003
; Rodrigue et al. 2005
). We do this by augmenting the observed sequence data at tips of a phylogenetic tree with a sequence path. For every branch on the tree, the sequence path contains information about how the sequence at the beginning of the branch is transformed by substitution events to the sequence at the end of the branch. The sequence path also specifies the exact times of these events.
Assume an unrooted tree topology that relates k observed sequences i1, i2, ..., ik at nodes 1, 2, ..., k. There will be I internal nodes on this tree, and they will be numbered k + 1, k + 2, ..., k + I. The unobserved sequences at these nodes will be denoted ik+1, ik+2, ..., ik+I.
Because our dependence model is time reversible, any node can be selected to root the tree. We treat node 1 as the root node, and then a relative time ordering can be imposed on all nodes of the tree and the nodes that begin and end a branch can therefore be designated. Except for node 1, each node on the tree ends exactly one branch. A sequence path
on a tree is the set of sequence paths on the different branches of the tree. The sequence path on the branch that ends at node a will be denoted
a, and the corresponding branch will be named branch a. Because node 1 serves as the root, it will be the only node that does not end a sequence path. Therefore, the sequence path on a tree (i.e.,
) consists of the collection of
2,
3, ...,
k+I.
Letting B(a) refer to the parental node of node a and letting
represent all parameters in the dependence model as well as the tree topology and branch lengths, we have, because of the Markov property of sequence change over time,
![]() | (3) |
) represents the prior density for the vector
of parameters. Rather than specifying the time duration of each branch as part of
, we let branch lengths vary on the tree by assigning a different rate-scaling factor to each branch on the tree. The rate scaling parameter for the branch ending at node a is a component of
and is denoted ua.
If the sequence at node B(a) and the sequence path
a from B(a) to a are known, then the sequence at a is also known. This means
![]() | (4) |
a and let ta(z) be the time of the zth substitution on this branch. At the beginning of the branch, we set ta(0) = 0. The time the branch ends will be ta(q + 1). Although different branches on a tree can obviously have different time durations, evolutionary rates and chronological times are confounded when only sequence data are available because only the product of rate and time can be estimated. In our case, we can set times of all branches to 1 because rate-scaling parameters can vary among branches. For this reason, we set ta(q + 1) = 1. The sequence after the zth substitution on branch a is defined as ia(z). By construction, ia(0) = iB(a) and ia(q + 1) = ia(q) = ia.
The rate at which a specific sequence v changes to a specific sequence k is Rvk. The total rate of changes that would alter sequence v is termed Rv, where
![]() | (5) |
The time until a substitution event changes sequence v is exponentially distributed with the rate parameter Rv. Given that there is a substitution affecting sequence v at some time point, the probability that v changes to k is Rvk/Rv. The sequence path density for branch a is then
![]() | (6) |
represents there not being a substitution event during the last time period [t(q), 1]. The sequence path density over a phylogeny can be simply calculated by applying equation (6) to each branch and then obtaining the product of all branch path densities.
MetropolisHastings Algorithm
Given the observed multiple-sequence data i1, i2, ..., ik, we construct a Markov chain on the state space of
and
via the MetropolisHastings algorithm (Metropolis et al. 1953
; Hastings 1970
). The stationary distribution of this Markov chain is p(
,
|i1, i2, ..., ik), the posterior distribution of interest. Samples from this Markov chain can be used to approximate the posterior density. Our approach for extending the Robinson et al. (2003)
inference technique to more than 2 sequences is similar to that outlined in Rodrigue et al. (2005)
.
The Markov chain is initiated at a randomly selected combination of (
(0),
(0)). Then, we propose random new values
' and
'. The probability density of proposing
' and
' from
and
will be denoted J(
',
'|
,
). With probability equal to the minimum of 1 and r, where
![]() | (7) |
(1),
(1)) along our Markov chain to be the proposed state (i.e.,
(1) =
',
(1) =
'). Otherwise,
(1) =
,
(1) =
. By repeating this procedure, a Markov chain with stationary density p(
,
|i1, i2, ..., ik) is formed.
In equation (7), the
and
terms can be calculated via equation (6). The J(
,
|
',
') and J(
',
'|
,
) terms are proposal densities that are described below. The remaining terms p(i1|
) and p(i1'|
') are more difficult to handle because the denominator of the stationary distribution in equation (2) is a sum over all possible sequences with length N. We approximate their ratio p(i1'|
')/p(i1|
) with the grid-based Gibbs sampling approach of Robinson et al. (2003)
.
Proposing 
One difference between our implementation and that of Robinson et al. (2003)
involves our treatment of the rate-scaling (ua) parameters. Robinson et al. (2003)
assigned independent prior distributions to their
parameter, their
parameters, and their rate scaling parameters. When s = 0, our model reduces to the HKY independent site model. In this situation, the branch length when s = 0 is
![]() | (8) |
Based on the above equation, one sees that
and the branch length have a positive a priori correlation when s = 0 and when prior distributions for the
parameters,
, and ua are all assigned independently. Likewise, the
parameters and the branch length are a priori correlated in this situation. These a priori correlations seem to us to be undesirable. It is not obvious, for example, that a high value of
should be associated with a large branch length. Although ba as defined above should not necessarily be interpreted as a branch length when s
0, the Robinson et al. (2003)
prior structure has
and ba positively correlated whether or not s = 0. To eliminate these prior correlations, we do not directly specify a prior on ua, and we instead choose to directly put independent prior distributions on the ba parameters,
, and the
parameters. These priors induce a prior on the rate-scaling parameters ua because
![]() | (9) |
Following Robinson et al. (2003)
, our Markov chain Monte Carlo (MCMC) implementation actually consists of various proposal distributions J(
',
'|
,
) and the Markov chain is formed by cycling through these proposal types. Each proposal can result in only slight differences between (
,
) and (
',
'). For example, one proposal type has
' =
and
' =
except that
'
and that the new rate-scaling factor ua' is determined via equation (9). To choose
', we sample from a uniform distribution that is determined by the current value
, a prespecified "window" length surrounding
, and any constraints on
such as a prespecified maximum or minimum value. We employ similar proposal steps that change only s or that change only ba for a specific node a. Our technique for proposing new values of
A,
C,
G, and
T is along similar lines and has been described earlier in Robinson et al. (2003)
.
Proposing Sequence Paths
Our current sequence path
and our proposed sequence path
' will differ only by the site path at a randomly selected site m. The proposed site path
'm is generated via a simple nucleotide substitution model that has independent changes among sites. Our implementation used the HKY independent change model (Hasegawa et al. 1985
). The topology and HKY parameters will be represented by
. We have
![]() | (10) |
term in equation (10) represents the probability of the interior node residues at site r conditional on
and the observed tip residues. To jointly sample internal node residues from the conditional density, we slightly modify the ancestral sequence reconstruction algorithm of Pupko et al. (2000)
and on the residues that begin and end the branch. To sample these site paths for each branch, we adopt the "forward simulation" procedure of Nielsen (2002)
so that none are extremely small. Our implementation somewhat reduces computation by storing site paths where the ending residue in the forward simulation procedure does not match the ending residue
that is desired. A particular cached site path can be used at some later time when an instance with a matching beginning and ending residue is needed.
Inference from a Single Sequence
Information about parameters
A,
C,
G, and
T is contained in equation (2), the formula for stationary probabilities of sequences. More interestingly, the stationary probability formula also contains information about the structural impact factor s. This means that inferences about s can be made solely based on the stationary probability of a single sequence. In the case of a single sequence, there is no information about
or ua, and the relevant parameters are
= {s,
A,
C,
G,
T}. For a single sequence i, the posterior distribution p(
|i) = p(i|
)p(
)/p(i) can be approximated by sampling
via a MetropolisHastings technique similar to, but much simpler than, the one described above. Accordingly, the acceptance ratio in equation (7) is modified to become
![]() | (11) |
')/p(i|
) again can be approximated by the grid-based Gibbs sampler (Robinson et al. 2003
Calculating RNA Free Energy
To approximate the free energy of an RNA molecule folded into a prespecified structure, computer subroutines were adopted from the Vienna RNA software package (Hofacker et al. 1994
). The default energy parameters in the Vienna RNA package (Mathews et al. 1999
) were used in our study. Many potential factors can contribute to an RNA secondary structure energy approximation. These factors include stacking interactions between adjacent sets of paired bases in an RNA helix and the specific nucleotide ordering in a hairpin loop. More details are well summarized in Hofacker et al. (1994)
. This means that the dependence among sites in an RNA sequence is effectively determined by multiple factors instead of by only the stacking interactions among adjacent base pairs. Some of these factors are infrequent, and we neglect them in our study. For example, we ignore the coaxial stacking energy of adjacent helices in multiloops (this is the "d1" option in the Vienna RNA package).
Our implementation requires a known canonical secondary structure, normally determined from comparative sequence analysis, as a reference structure to indicate which sites might be paired and which would not. Although the canonical structure indicates which sites might be paired, pairings that are energetically unfavorable are not forced. The only favorable pairings are considered to be A-U, C-G, G-C, G-U, U-A, and U-G. Making the energetically favorable pairings results in preliminary pairing assignments that are further processed by breaking any pairings that would result in an energetically unstable "helix" consisting of only a single base pair. Pairing assignments are finalized after considering the single-nucleotide bulges in the canonical structure. When these bulges in the canonical structure neighbor an energetically unfavorable pairing for the sequence, a single-nucleotide bulge could become a size-3 internal loop. When a size-3 internal loop might otherwise arise, our implementation determines whether the size-3 loop can be converted to a single-nucleotide bulge by adding an energetically favorable pairing. Specifically, imagine that a position k in the canonical structure is a single-nucleotide bulge and that position k + 1 in the canonical structure is paired with position m. If the pairing between k + 1 and m is energetically unfavorable, our implementation instead pairs position k and m if this pairing is energetically favorable. A similar modification to secondary structure is made if position k 1 rather than position k + 1 pairs with position m in the canonical structure.
Although there is no formal requirement by our statistical inference procedure that the secondary structure be known and constant over evolutionary time, computational feasibility becomes an issue when secondary structure is both unknown and allowed to change over time. The problem is that the grid-based Gibbs sampler and the calculation of sequence path densities each require free energy to be approximated for huge numbers of sequences. Algorithms for approximating free energy with unknown secondary structure require an amount of computation that is proportional to at least the cube of the sequence length. Many redundant calculations can be eliminated with a careful implementation of our procedure because most sequences for which free energy is approximated are different at only one position from an already evaluated sequence, but we have been unable to develop a computationally feasible implementation for the case where secondary structure is unknown and allowed to freely vary over time.
For our implementation, sequence positions have at most one other position prespecified by the canonical structure with which they might pair. This prespecification greatly reduces the necessary amount of calculation because the free energy of each folded sequence can be evaluated in an amount of time that is proportional to sequence length. In fact, calculations of free energy can become independent of sequence length for implementations that rely solely on the canonical structure for which pairings might occur. This possibility arises because free energy calculations are performed on sequences that are only one position different from a sequence for which the free energy has already been approximated. When only a single position differs, only terms contributing to the free energy that involves the single changed sequence position need to be considered.
| Analyses |
|---|
|
|
|---|
Prior Densities and Implementation Details
In all analyses, all combinations of nonnegative values for
A,
C,
G, and
T that satisfy the
A +
C +
G +
T = 1 were treated as being equally likely a priori. Prior densities were uniform on the interval (0, 1) for all ba and uniform on (0, 10) for
. Although biologically reasonable values of s are positive because they favor evolution of stable secondary structure, we center a uniform prior distribution for s about zero. By centering the s priors about zero, we can examine whether the sequence data have sufficient information to yield posterior densities for s that are concentrated in the positive values. Specifically, we set the prior distribution for s to be uniform on (1, 1). The width of this interval was chosen after some pilot experiments. The goal was to use a prior where the posterior distribution for s was not concentrated near either the lower or the upper endpoint of the interval. We did not want to make the interval too wide because the amount of computation needed for the grid-based Gibbs sampling can quickly grow as intervals grow in width. We adopted a grid with 9 equally spaced points along the s dimension and 12 equally spaced points between 0.14 and 0.36 for each of
A,
C, and
G. The
T dimension was not needed because the sum of the 4
parameters is constrained to be one. For each grid point, 200 sequences were sampled because pilot studies suggested that 200 sequences per grid point would suffice (data not shown).
For each MCMC cycle, we propose one update for the ba value at each node a, one update for
, and one for s. Each cycle includes 10 separate random selections of sites at which to propose site path updates to all branches. One of the 4 base frequency parameters is also the focus of a proposed update during each MCMC cycle. After examining substantial MCMC output to assess convergence of the Markov chains, we settled on 1 000 000 MCMC cycles in each analysis with the first 200 000 of these cycles treated as a "burn-in" period that is not included in the posterior approximation. We take posterior samples every 10 cycles for analysis.
To investigate whether Markov chains were sufficiently long to yield good posterior distribution approximations, extensive visual examination was made of trace plots that showed how sampled parameter values changed during MCMC runs (data not shown). For all MCMC analyses, at least 2 independent runs with different starting points were performed to check convergence. Our conclusion is that the Markov chains were long enough to yield good posterior approximations.
Eukaryotic 5S Ribosomal RNAs
Ribosomal 5S RNA (5S rRNA) is an integral component of the large ribosomal subunit in almost all known organisms. The conserved nature of its secondary structure, the existence of 5S rRNA sequence level variation, and the relatively short sequence length combine to make it a good choice for exploring the techniques introduced here.
Primary Sequences and Secondary Structure
We analyzed a data set with 8 eukaryotic 5S rRNA sequences, each with 119 nt. They are collected from the 5S ribosomal RNA database (Szymanski et al. 2002
) from 8 species (Cyanophora paradoxa, Marchantia polymorpha, Ustilago hordei, Schizosaccharomyces pombe, Tulasnella violea, Pinus silvestris, Medicago sativa, Zea mays). These 8 sequences were intentionally selected so as to avoid gaps in the alignment that relates them. Their canonical structure has been determined by combining knowledge from comparative analysis and experiments (Szymanski et al. 2002
) and is depicted in figure 1.
|
Tree Topology
The neighbor-joining method (Saitou and Nei 1987
|
Importance of RNA Secondary Structure
We examined the importance of helical and loop regions for stabilizing RNA secondary structure by applying several permutation schemes to the observed M. polymorpha sequence (fig. 3). Qualitatively similar results were obtained when applying these permutation schemes to the other 7 5S rRNA sequences that we analyzed here (data not shown). For each of the 100 000 sequences generated according to the simplest permutation scheme "complete permutation" (fig. 3A), the permuted sequences had exactly the same number of each of the 4 nucleotide types as the M. polymorpha sequence, but these types were randomly arranged in a different order. With conventional models of sequence change, the order of nucleotides along a sequence does not matter. The fact that the free energy of the actual M. polymorpha sequence is significantly lower than that of random sequences indicates that conventional evolutionary models are not capturing some important biological information.
|
The results of the "loop-only permutation" (fig. 3B) indicate that the order of nucleotides in 5S rRNA loop regions seems to be important, despite the fact that evolutionary models usually ignore the order. As previously noted, numerous dinucleotide evolutionary models incorporate RNA secondary structure but ignore stacking interactions (e.g., Schöniger and von Haeseler 1994
Inferences on Multiple Sequences
As a check of our software implementation, we analyzed the 5S data while forcing s = 0 (i.e., the HKY model) and then compared the results with those obtained with the HKY model and the software package MrBayes (Ronquist and Huelsenbeck 2003
). The prior distributions for these 2 analyses were almost identical. The only difference was that our implementation has a uniform prior distribution for
, whereas MrBayes has a more flexible and complicated prior structure for
. As shown in tables 1 and 2, the posterior approximations from MrBayes and our program are quite similar when s = 0.
|
|
More interestingly, we also estimated parameters when s was not forced to be 0 (tables 1 and 2). For this latter analysis, estimates of
A and
T are somewhat higher and
G and
C are somewhat lower than when s = 0. This is presumably because the
parameters reflect mutational tendencies and because G-C bonds stabilize RNA secondary structure more than do A-U bonds. Therefore, G-C bonds tend to be especially common in particularly stable RNA secondary structures. The observed frequencies of the 4 residue types will result, according to our model, from a balance between the
parameter values and the frequencies of the residue types in the sequences with low free energies. When s is close to 0, the balance favors the former. When s is large and positive, the balance favors the latter. When s is not forced to be 0, its posterior distribution is concentrated in the biologically plausible s > 0 region (table 2). This is the region of parameter space where evolution favors low free energy and stability of secondary structure. We view this as a partial validation of our approach.
For the purpose of additional validation, 5S rRNA evolution was simulated according to the dependence model. The simulation assumed the tree topology in figure 2. For the observed data, the posterior mean of s is estimated to be approximately 0.3661. One set of sequences was simulated for each case of s = 0.3661, s = 0, and s = 0.3661. For all other parameters, the posterior means inferred from the observed data were used to simulate. For each of the simulation scenarios, the posterior densities of s are concentrated close to their true values (table 3). Estimates of other parameters were also reasonable (data not shown).
|
Inferences on a Single Sequence and Sequence Pairs
In addition, inferences based on a single sequence were performed. For all 8 5S rRNA sequences, the posterior estimates of parameter s are depicted in table 4. It is not surprising that the 95% credibility intervals for s from the single-sequence analyses are wider than the credibility interval from the multiple-sequence analysis because the amount of information contained in one sequence is significantly less than that contained in multiple sequences. With these single-sequence analyses, there is a strong negative correlation between the free energy of the sequence being evaluated and the estimated posterior mean of s. This negative correlation occurs because observation of particularly stable sequences is evidence for particular strong influences of RNA secondary structure.
|
Unexpectedly, each of the single-sequence analyses yielded higher estimates of s than obtained from the 8-sequence analysis. To further explore this phenomenon, a 2-sequence data set was formed from each of the 28 pairwise combinations of the 8 sequences. The s estimates from the pairwise analyses (table 5) tended to be intermediate between the single-sequence estimates and the 8-sequence estimate. The cause of this may be related to the fact that the single-sequence s estimates depend solely on the stationary distribution of sequences and the prior distribution of s. In contrast, the s estimates from the multiple-sequence analyses are also influenced by the sequence path. Therefore, the stationary distribution seems to imply higher values of s than the sequence path information. One interesting possibility is that the evolutionary process has not reached stationarity. An additional possibility is that the model is inadequate. One improvement to the model would add rate heterogeneity among sites beyond that which is explained by RNA secondary structure. Increased rate heterogeneity among sites would not necessarily change the stationary distribution from that predicted by our model but it would alter the sequence path density. Model improvements might lead the sequence path density to favor values of s that are closer to those supported by the single-sequence analyses. We note that the observed disparity in s estimates between single- and multiple-sequence analyses should not be attributed to the prior distribution for s. Multiple-sequence analyses have more information about s than do single-sequence analyses. Rather than being closer, estimates of s from multiple-sequence analyses should tend to be farther from the prior mean of 0 than should estimates from single-sequence analyses.
|
Impact of Secondary Structure over Substitution Rates
With our model, the factor
![]() | (12) |
Because the 5S rRNA sequences have length 119 nucleotides, there are 357 = 119 x 3 sequences that differ from a given sequence i at a single position. For each of these 357 possible sequences j and for a particular value of s, the rate factor Aij can be calculated. Using the posterior mean estimate of s from the 8-sequence analysis, figure 4 depicts histograms of Aij values.
|
Several 5S rRNA sites are almost identical among all eukaryotic 5S rRNAs. With the numbering scheme of figure 1, these sites are 11, 41, 66, 74, 75, 76, 77, 90, and 99. In the 8-sequence data set, there is not one possible change affecting these highly conserved sites that leads to an Aij value larger than 1. This high frequency of Aij values less than 1 is not restricted to the most conserved sites. In fact, most Aij values are less than 1 (fig. 4). This is presumably because natural selection has shaped the compatibility of sequence and secondary structure.
Some Aij values do exceed 1. Possible changes in helical regions that have Aij > 1 are particularly interesting. Our analysis indicates that for all 8 sequences, site 7, site 28, and site 112 are predicted to have one or more possible changes that would lead to higher fitness of the entire sequence. A natural explanation would be that our model is failing to capture important structural and/or functional consequences of these possible changes. A formal possibility is that changes with Aij > 1 reflect the failure of evolution to optimize the genotype.
Ancestral Sequence Free Energy
Because we incorporate secondary structure, we anticipate that the ancestral sequences inferred with our model would be more accurate and have lower associated free energies than those from independent site models. For individual internal nodes, we compared the estimated posterior distributions of free energy from the dependence model with those when s = 0 (fig. 5). For internal nodes that are not close in evolutionary distance to any tip nodes (e.g., node 1 and node 2 of the tree in fig. 2), the ancestral sequences inferred from our dependence model tend to have lower free energy than those inferred from the independent-sites model. For internal nodes that are closely related to extant sequences, we do not find much separation between the s = 0 and s
0 free energy distributions.
|
| Discussion |
|---|
|
|
|---|
The free energy of an RNA sequence is an admittedly rough measure of the extent that a specific sequence is favored by evolution. Real RNA structures do not necessarily hold the configuration with minimum free energy (e.g., Doshi et al. 2004
For mRNA, the picture is less clear. Seffens and Digby (1999)
concluded that free energies associated with actual mRNA sequences are significantly lower than those of sequences produced via random permutation of the observed mRNA sequence. This conclusion contrasted with the belief that, because of the short life of mRNA, it is not so important to maintain a relative stable secondary structure.
Workman and Krogh (1999)
later argued that random simulated sequences should reflect counts of consecutive nucleotide (dinucleotide) pairs that are found in actual RNA sequences. Dinucleotide counts are important because they contain information about the stacking interactions between adjacent pairs of bases in an RNA helical region. Workman and Krogh (1999)
further concluded for mRNA, and even for tRNA and rRNA sequences, that the free energy of observed sequences is not significantly lower than that of random sequences with identical dinucleotide counts. The biological interpretation of results from random sequences with preserved dinucleotide counts needs to be done with caution because one might expect certain dinucleotides to be associated with stable RNA structure, and the observed frequencies of these dinucleotides might be partly a result of natural selection for stable RNA structure.
Recent work of Clote et al. (2005)
, based on large-scale analysis, indicates that the free energy is indeed much lower than that of random sequences for most structural RNAs. Even smaller RNA molecules such as the precursor of microRNA have lower free energy than random sequences (Bonnet et al. 2004
). In all these works, it should be remembered that the free energy of a sequence was obtained from secondary structure prediction algorithms based on energy minimization. This means that the predicted secondary structure with minimized free energy might be different from the actual one. Ideally one would evaluate the free energy of a sequence over its real structure or at least over structures similar to its real structure.
Models that allow correlated changes between paired sites in an RNA helix fit data better than those that ignore RNA secondary structure (e.g., Muse 1995
). Because our approach incorporates free energy, it may fit data better than models that incorporated secondary structure only by discriminating between sites in helices and sites in loops. As part of our future planned work, we intend to perform rigorous statistical comparisons of our model with alternatives. One way to do these model comparisons in a Bayesian framework is by estimating Bayes factors. There are a variety of techniques for estimating Bayes factors, and we are particularly interested in the thermodynamic integration procedure of Lartillot and Phillippe (Lartillot and Philippe 2004
, 2006
).
One limitation of the model introduced here is that it does not take into account RNA tertiary structure. An effect of this limitation is that pseudoknots are not permitted. Algorithms exist for approximating the energy of RNA structures with pseudoknots (e.g., Lyngsø and Pedersen 2000
). These algorithms tend to be computationally demanding for predicting RNA structure, but extending our approach to pseudoknots should be computationally feasible for cases where the canonical structure is known.
Although using the free energy of an RNA sequence in one specific structure seems to be a reasonable start, RNA structures of individual molecules actually change over time. The expected free energy of a sequence can be obtained (Miklós et al. 2005
) or approximated (Ding and Lawrence 2003
) by integrating over all possible secondary structures. A better evolutionary model might formulate substitution rates in terms of expected free energy of a sequence rather than in terms of the lowest free energy that can be attained by the sequence. Computational concerns mean that this expected free energy approach is not immediately practical, but this may be a good direction for future work.
Here, we have developed an evolutionary model that incorporates the effect of RNA secondary structure over substitution rates. The evolutionary dependence among sites is naturally integrated in our model. In a more general sense, we developed a model to effectively combine genotype and phenotype. In this specific case of RNA, the DNA encoding the RNA sequence is genotype, and the RNA secondary structure is phenotype. Because of the generality of the approach we adopt here, we can extend the statistical procedure to many other situations where phenotype interacts with genotype during evolution.
| Acknowledgements |
|---|
|
|
|---|
We thank Stéphane Aris-Brosou, Sang-Chul Choi, Asger Hobolth, Hirohisa Kishino, and Tae-Kun Seo for their help. This work was partially supported by National Science Foundation DEB-0445180 and National Institutes of Health R01 GM070806 to J.L.T. and by a Danish National Science Foundation grant to Rasmus Nielsen.
| Footnotes |
|---|
1 Present address: Bioinformatics Centre, University of Copenhagen, Universitetsparken 15, Copenhagen Ø, Denmark
Martin Embley, Associate Editor
| References |
|---|
|
|
|---|
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL. 2000. GenBank. Nucleic Acids Res 28:158.
Bonnet E, Wuyts J, Rouze P, Van de Peer Y. 2004. Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics 20:29117.
Clote P, Ferré F, Kranakis E, Krizanc D. 2005. Structural RNA has lower folding energy than random RNA of the same dinucleotide frequency. RNA 11:57891.
Ding Y, Lawrence CE. 2003. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 31:7280301.
Dixon MT, Hillis DM. 1993. Ribosomal RNA secondary structure: compensatory mutations and implications for phylogenetics analysis. Mol Biol Evol 10:25667.[Abstract]
Doshi K, Cannone J, Cobaugh C, Gutell R. 2004. Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics 5:105.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:36876.[CrossRef][Web of Science][Medline]
Gardner PP, Giegerich R. 2004. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics 65:140.
Gutell RR. 1996. Comparative sequence analysis and the structure of 16S and 23S RNA. In: Zimmermann RA, Dahlberg AE, editors. Ribosomal RNA: structure, evolution, processing and function in protein biosynthesis. Boca Raton, FL: CRC Press. p 1527.
Han K, Yanga B. 2003. PseudoViewer2: visualization of RNA pseudoknots of any type. Nucleic Acids Res 31:343240.
Hasegawa M, Kishino H, Yano T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22:16074.[CrossRef][Web of Science][Medline]
Hastings WK. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97109.
Higgs PG. 2000. RNA secondary structure: physical and computational aspects. Q Rev Biophys 33:199253.[CrossRef][Web of Science][Medline]
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. 1994. Fast folding and comparison of RNA secondary structures. Monatsh Chem 125:16788.[CrossRef]
Hwang DG, Green P. 2004. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci USA 101:139944001.
Jensen JL, Pedersen AK. 2000. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv Appl Probab 32:499517.[CrossRef]
Lanave C, Preparata G, Saccone C, Serio G. 1984. A new method for calculating evolutionary substitution rates. J Mol Evol 20:8693.[CrossRef][Web of Science][Medline]
Lartillot N, Philippe H. 2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 21:1095109.
Lartillot N, Philippe H. 2006. Computing Bayes factors using thermodynamic integration. Syst Biol 55:195207.[CrossRef][Medline]
Lyngsø RB, Pedersen CN. 2000. RNA pseudoknot prediction in energy-based models. J Comput Biol 7:40927.[CrossRef][Web of Science][Medline]
Mathews DH, Sabina J, Zuker M, Turner DH. 1999. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 288:91140.[CrossRef][Web of Science][Medline]
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. 1953. Equation of state calculations by fast computing machines. J Chem Phys 21:108792.[CrossRef]
Miklós I, Meyer IM, Nagy B. 2005. Moments of the boltzmann distribution for RNA secondary structures. Bull Math Biol 67:103147.[CrossRef][Web of Science][Medline]
Muse SV. 1995. Evolutionary analyses of DNA sequences subject to constraints on secondary structure. Genetics 139:142939.[Abstract]
Nielsen R. 2002. Mapping mutations on phylogenies. Syst Biol 51:72939.[CrossRef][Web of Science][Medline]
Pedersen A-MK, Jensen JL. 2001. A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol Biol Evol 18:76376.
Pedersen JS, Forsberg R, Meyer IM, Hein J. 2004. An evolutionary model for protein-coding regions with conserved RNA structure. Mol Biol Evol 21:191322.
Pupko T, Pe'er I, Shamir R, Graur D. 2000. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol Biol Evol 17:8906.
Robinson DM, Jones D, Kishino H, Goldman N, Thorne JL. 2003. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol 20:1692704.
Rodrigue N, Lartillot N, Bryant D, Philippe H. 2005. Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene 347:20717.[CrossRef][Web of Science][Medline]
Ronquist F, Huelsenbeck JP. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:15724.
Rzhetsky A. 1995. Estimating substitution rates in ribosomal RNA genes. Genetics 141:77183.[Abstract]
Saitou N, Nei M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:40625.[Abstract]
Schöniger M, von Haeseler A. 1994. A stochastic model for the evolution of autocorrelated DNA sequences. Mol Phylogenet Evol 3:2407.[CrossRef][Medline]
Seffens W, Digby D. 1999. mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Res 27:157884.
Siepel A, Haussler D. 2004. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol 21:46888.
Smith AD, Lui TWH, Tillier ERM. 2004. Empirical models for substitution in ribosomal RNA. Mol Biol Evol 21:41927.
Swofford D. 2002. PAUP*. Phylogenetic analysis using parsimony (*and other methods) Version 4. Sunderland, MA: Sinauer Associates.
Szymanski M, Barciszewska MZ, Erdmann VA, Barciszewski J. 2002. 5S ribosomal RNA database. Nucleic Acids Res 30:1768.
Tillier ERM. 1994. Maximum likelihood with multi-parameter models of substitution. J Mol Evol 39:40917.[CrossRef]
Tillier ERM, Collins R. 1995. Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol Biol Evol 12:715.[Web of Science]
Tillier ERM, Collins RA. 1998. High apparent rate of simultaneous compensatory basepair substitutions in ribosomal RNA. Genetics 148:19932002.
Workman C, Krogh A. 1999. No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res 27:481622.
Yu J, Thorne JL. 2006. Testing for spatial clustering of amino acid replacements within protein tertiary structure. J Mol Evol 62:68292.
Zuker M. 1989. On finding all suboptimal foldings of an RNA molecule. Science 244:4852.
Zuker M, Stiegler P. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 9:13348.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
W. Delport, K. Scheffler, and C. Seoighe Models of coding sequence evolution Brief Bioinform, January 1, 2009; 10(1): 97 - 109. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. C. Choi, B. D Redelings, and J. L Thorne Basing population genetic inferences and models of molecular evolution upon desired stationary distributions of DNA or protein sequences Phil Trans R Soc B, December 27, 2008; 363(1512): 3931 - 3939. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Baele, Y. Van de Peer, and S. Vansteelandt A Model-Based Approach to Study Nearest-Neighbor Influences Reveals Complex Substitution Patterns in Non-coding Sequences Syst Biol, October 1, 2008; 57(5): 675 - 692. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Yang and R. Nielsen Mutation-Selection Models of Codon Substitution and Their Use to Estimate Selective Strengths on Codon Usage Mol. Biol. Evol., March 1, 2008; 25(3): 568 - 579. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Rodrigue, H. Philippe, and N. Lartillot Uniformization for sampling realizations of Markov processes: applications to Bayesian implementations of codon substitution models Bioinformatics, January 1, 2008; 24(1): 56 - 62. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Rodrigue, H. Philippe, and N. Lartillot Exploring Fast Computational Strategies for Probabilistic Phylogenetic Analysis Syst Biol, October 1, 2007; 56(5): 711 - 726. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Howell, J. L. Elson, C. Howell, and D. M. Turnbull Relative Rates of Evolution in the Coding and Control Regions of African mtDNAs Mol. Biol. Evol., October 1, 2007; 24(10): 2213 - 2221. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. L. Thorne, S. C. Choi, J. Yu, P. G. Higgs, and H. Kishino Population Genetics Without Intraspecific Data Mol. Biol. Evol., August 1, 2007; 24(8): 1667 - 1677. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. C. Choi, A. Hobolth, D. M. Robinson, H. Kishino, and J. L. Thorne Quantifying the Impact of Protein Tertiary Structure on Molecular Evolution Mol. Biol. Evol., August 1, 2007; 24(8): 1769 - 1782. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





















