MBE Advance Access originally published online on October 12, 2006
Molecular Biology and Evolution 2007 24(1):159-170; doi:10.1093/molbev/msl144
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Evolutionary Model Selection with a Genetic Algorithm: A Case Study Using Stem RNA



* Department of Pathology, University of California, San Diego
Bioinformatics Research Center, North Carolina State University
School of Medicine, University of Wales, Swansea, United Kingdom
E-mail: spond{at}ucsd.edu.
| Abstract |
|---|
|
|
|---|
The choice of a probabilistic model to describe sequence evolution can and should be justified. Underfitting the data through the use of overly simplistic models may miss out on interesting phenomena and lead to incorrect inferences. Overfitting the data with models that are too complex may ascribe biological meaning to statistical artifacts and result in falsely significant findings.
We describe a likelihood-based approach for evolutionary model selection. The procedure employs a genetic algorithm (GA) to quickly explore a combinatorially large set of all possible time-reversible Markov models with a fixed number of substitution rates. When applied to stem RNA data subject to well-understood evolutionary forces, the models found by the GA 1) capture the expected overall rate patterns a priori; 2) fit the data better than the best available models based on a priori assumptions, suggesting subtle substitution patterns not previously recognized; 3) cannot be rejected in favor of the general reversible model, implying that the evolution of stem RNA sequences can be explained well with only a few substitution rate parameters; and 4) perform well on simulated data, both in terms of goodness of fit and the ability to estimate evolutionary rates. We also investigate the utility of several distance measures for comparing and contrasting inferred evolutionary models.
Using widely available small computer clusters, our approach allows, for the first time, to evaluate the performance of existing RNA evolutionary models by comparing them with a large pool of candidate models and to validate common modeling assumptions. In addition, the new method provides the foundation for rigorous selection and comparison of substitution models for other types of sequence data.
Key Words: RNA sequence evolution secondary structure model selection genetic algorithms multimodel inference
| Introduction |
|---|
|
|
|---|
It has long been recognized that in order to understand and quantify the evolution of molecular sequences, it is crucial to incorporate biologically important constraints and selective pressures into the stochastic model used to describe character substitution. For instance, secondary structure is vital to the function of many RNA molecules. The major evolutionary constraint on stem RNA is the preservation of canonical Watson–Crick (adenine–guanine and cytosine–uracil) pairings. However, not all mispaired dinucleotides are equally deleterious—for example, the GU pair is encountered often and is considered to be a weak pairing or a tolerated intermediate state between the two proper pairings. This relative simplicity stands in sharp contrast to protein data, where no single property or constraint can adequately capture major evolutionary forces, and ad hoc or empirical models are the current state of the art.
In the past decade, a number of probabilistic models for stem RNA have been described. These models consider the process of substitution on 16 characters formed by 2 nucleotide residues paired along an RNA stem. Early research efforts were directed either toward extending established models of nucleotide evolution (such as Jukes and Cantor 1969
; Hasegawa et al. 1985
) to correct for unequal rates of substitutions that preserve or disrupt canonical pairings or toward defining models on reduced character states. Tillier (1994)
and Tillier and Collins (1995)
described a Markov model for the evolution of 6 dinucleotide states (AU, UA, CG, GC, GU, UG). Muse (1995)
introduced a model with a single "coupling" parameter that rewarded substitutions from unpaired to paired states and penalized the reverse substitutions. Schöniger and von Haeseler (1994
, 1999)
and Rzhetsky (1995)
focused on modeling nucleotide substitution biases, unequal dinucleotide frequencies, and site-to-site rate variation in dinucleotides. Savill et al. (2001)
described a series of models derived from biological and thermodynamic constraints and conducted an extensive comparison of goodness of fit among several candidates. From a practical perspective, Telford et al. (2005)
have argued that the use of realistic models of RNA sequence evolution results in better phylogenetic trees and should be encouraged whenever possible.
In the standard maximum likelihood framework of comparative phylogenetic sequence analysis (Felsenstein 1981
), the general time-reversible model (REV) on the appropriate set of characters is usually considered to be the most complex model worth fitting. For example, empirical models of protein sequence evolution (e.g., Adachi and Hasegawa 1996
; Adachi et al. 2000
; Dimmic et al. 2002
) are often derived by fitting the protein REV model to a curated alignment and fixing substitution rates between different amino acids at their maximum likelihood estimates for subsequent analyses of alignments of similar genes or taxa. However, because the REV model posits a separate rate parameter for each pair of characters, it can be computationally challenging to fit, and available biological sequence data may be too sparse, leading to catastrophic loss of estimate precision and the possibility of consequent misinterpretation of the results. Furthermore, the "seed" alignment used to derive the model may not contain enough data to estimate relevant evolutionary processes for a whole family of sequences. In other words, an independent sample of similar sequences may lead to substantially different parameter estimates. Large sampling errors and lack of generalizability are frequently the result of model "overfitting." As a way to deal with these concerns, Whelan and Goldman (2001)
used approximate numerical methods to fit a single REV-type model jointly to an extensive collection of protein families, thus creating a much larger data set, but at the cost of smoothing over family-specific nuances.
One solution to the problem of overfitting is to group all substitutions into a few rate classes based, for instance, on a priori knowledge of biochemical properties (Muse 1995
; Savill et al. 2001
). However, when only considering a set of a priori models, a careful researcher must account for the possibility that among models that have been left out, there may exist some that have a better statistical fit to the data and lead to different biological conclusions. Kosakovsky Pond and Frost (2005a)
provide such an example in the context of estimating temporally variable selective pressure. From the standpoint of inferring the nature of the substitution process from the data, it may be preferable to estimate the patterns of dinucleotide replacement using a flexible statistical procedure and then to correlate rates of replacement with different types of change in molecular properties, rather than to fit a small collection of property-based a priori models and select the best one. To avoid statistical problems arising out of model misspecification (e.g., Kolaczkowski and Thornton 2004
), a rigorous model selection or multimodel inference procedure should be employed whenever possible. For DNA data, there are merely 203 nucleotide models to consider; hence, it is feasible to fit all of them and test the robustness of inference to the choice of a particular model (Burnham and Anderson 2003
; Posada and Buckley 2004
). However, the combinatorially large number of all possible RNA models renders the exhaustive search approach inapplicable. Alternatively, a Markov chain Monte Carlo (MCMC) procedure can sample from the space of nucleotide models (Alfaro and Huelsenbeck 2006
). MCMC-based approaches may be too computationally demanding for exploring the much larger model space of RNA models, and the appropriate choice of a prior distribution on the complex discrete space of models is far from obvious.
We present a maximum likelihood model selection procedure based on a genetic algorithm (GA). GAs have been successfully applied to the prediction of secondary RNA structures from sequence data (Shapiro and Navetta 1994
; Notredame et al. 1997
), but to the best of our knowledge, they have never been used for substitution model selection and validation. Evolutionary models selected by our GA approach provide a better fit than the state-of-the-art a priori models (Savill et al. 2001
). As expected, inferred substitution rates between different dinucleotides depend on whether a pairing is lost, gained, or maintained; on the number of nucleotide substitutions required; and on the specific nucleotides involved in the substitution. We find that models with few substitution rates (e.g., 4–6) outperform the REV model in terms of an information theoretical model selection criterion (small sample Akaike information criterion—AICc). The addition of site-to-site variation in substitution rates results in a further improvement in fit, but the estimated site-to-site rate distribution is fairly robust to the model of character substitution. Consistent with the results of Savill et al. (2001)
, the models that allow direct substitution between all nucleotide pairs outperform those that permit direct substitutions involving only a single nucleotide.
| Materials and Methods |
|---|
|
|
|---|
A Class of Dinucleotide Substitution Models
We consider stationary time-reversible Markov models of dinucleotide substitution. The most general model from this class, RNA REV, has 120 rate parameters—each describing the rate of substitution between a pair of different dinucleotides. For other models in the class, there are 1
C
119 different substitution rates, and some pairs of dinucleotides share the same rate. The case C = 1 reduces to the "equal-input model," in which all substitutions occur at the same rate. The instantaneous rate of change from dinucleotide i to dinucleotide j (where i, j range from 1 to 16 and the characters are indexed in alphabetical order, i.e., AA, AC, ..., UG, UU) is given by qij = rij
j, i
j. Here, rij is one of the C estimated rate parameters Rk, and
j denotes the equilibrium frequency of the dinucleotide labeled by j. Time reversibility implies rij = rji. The diagonal entries are defined so that the rate matrix describes a valid Markov process, that is, qii=–
k=1, k
i16qik. Because only products of evolutionary rates and times can be estimated using the standard phylogenetic likelihood framework, one of the C rates is not identifiable. We choose to normalize the rates by setting the expected number of substitutions per site (per unit time) to 1, that is,
i=116qii
i=1. Finally, the probability of substituting character i with j in time t
0 can be computed as Pr{i
j|t}=[exp(Qt)]ij, where Q=(qij)
is the rate matrix of the process, and the matrix exponential can be evaluated using one of the standard numerical algorithms (Moler and Loan 1978A model with C rates can be encoded by an array of 120 integers, with the first entry corresponding to the (AA, AC) pair, the second to (AA, AG), and so on. Each array entry ranges from 1 to C and determines which substitution rate Rk represents a given pair. To rule out degenerate and equivalent model parameterizations, we need to enforce 2 constraints.
- Each k = 1, ..., C must be present in the encoding vector at least once; otherwise we could reduce C and relabel the rates accordingly. For example, (1, 2, 2, 4, ...)
(1, 2, 2, 3, ...), assuming that 3 does not appear in the left-hand side parameterization.
- The first occurrence of 1 precedes the first occurrence of all other k; precedes the first occurrence of 2; precedes those of 3, 4, ...; and so on; otherwise we could reindex k in the order of first appearance and arrive at an equivalent model. For example, (2, 1, 4, 3, ...)
(1, 2, 3, 4, ...).
For C rate classes, there are 
S(120,k) possible models to consider, where
|
|
Model Selection Using a GA
The parameter space for our optimization problem consists of 2 components: a discrete partitioning of substitution rates into C different classes and a vector of real-valued parameters corresponding to branch lengths and substitution rates R1, ..., RC. We use a GA to search through the discrete component of the parameter space and conventional numerical optimization techniques to find maximum likelihood estimates of all other parameters, given a substitution model. To reduce computational effort, we initially estimate branch lengths using the equal-input model and readjust them every time the GA iteration improves the fitness score by more than a fixed number of points (e.g., 5) relative to the most recent model for which branch lengths have been estimated.
We use the CHC algorithm—cross-generational elitist selection, heterogeneous recombination, and cataclysmic mutation (Eshelman 1991
), an aggressive population-based hill climber. The algorithm operates on vectors of integers encoding substitution models (state vectors); hence, the minimum unit of "model evolution" is the assignment of a given dinucleotide pair to a rate class. If C > 2, this vector can either be manipulated directly or be converted to binary representation. The size of the population is fixed in our implementation—at twice the number of available processors. The CHC algorithm always retains the most fit individual and uses a radical recombination operator to rapidly explore the parameter space without collapsing it. The recombination operator of the CHC algorithm generates an offspring from 2 parent states by randomly choosing 1 of the 2 parental values in every position of the state vector. To ensure that the score of the best-fitting individual does not decrease across generations, the best-fitting individual from the current generation is always copied to the next one. To avoid entrapment in local optima, when the diversity of the population (measured as the relative difference in fitness scores between the best- and the worst-fitting individuals) drops below a certain threshold, a high proportion—15–35%—of positions are mutated in the encoding vectors of all but the best-fitting individual. After recombination and mutation operations, it may be necessary to convert the newly generated model to a canonical form described in the previous section.
The mutation phase is triggered if the absolute (or relative) difference between the fitness of the best- and the worst-fitting models in a generation decreases below a fixed threshold,
> 0. With probability µ (= 0.1), each entry in the model vectors of the current generation (excluding the best-fitting one) will be randomly replaced with an integer in 1, ..., C (excluding its present state). Individuals are chosen for mating with probabilities proportional either to their Akaike weights (see below) relative to other individuals in the current population or to their decreasing fitness ranking. AICc rank–based parent selection usually leads to slower convergence but better mixing of the populations, whereas Akaike weights are more suitable to quickly find a "good" solution because poorly fitting individuals almost never get a chance to reproduce.
We say that the algorithm has converged when the best fitness score remains constant for 100 consecutive generations. To verify convergence, we repeated some of the runs several times, with the initial population generated randomly every time. For models with many rate classes, convergence in a reasonable amount of time posed an issue; many runs terminated before the maximum was achieved (as measured by the best score among multiple runs); hence, we restricted our models to contain no more than 8 rate parameters. In addition, to safeguard against premature algorithm termination, we propose the following procedure: 1) run 5 instances of the GA for each data set and C using Akaike weight–based parent selection; 2) select 3 best individuals from each run, as well as the top 3 models found by the run for C – 1 rate classes (if C > 2); 3) perform an additional follow-up run with this initial population, using fitness rank–based parent selection; 4) report the output from the final run. A typical GA run in this setting will explore between 2,000 and 50,000 models and take between 1 and 48 h on a 32-node computer cluster.
The fitness of every model is measured by its AICc (Akaike 1974
; Sugiura 1978
). AICc of a model with p parameters applied to a sample of size s is defined as AICc=–2logL(
|data)+2p[s/(s–p–1)]. AICc is meaningful when the number of observations exceeds the number of model parameters by at least 2. In our case, we treat each dinucleotide alignment column as an independent observation; hence, the method requires at least B + C + 17 alignment columns (B is the number of branches in the phylogenetic tree; there are 15 character frequency parameters). Many alignments of paired RNA regions are short and may fail to meet this requirement—other information criteria, such as Schwartz's Bayesian Information Criterion (Schwarz 1978
), could be used in those instances. We deliberately selected a conservative measure that requires long sequences, to reduce the effects of model overfitting.
Reference Models
We compare GA-derived models with those previously used in literature. These models can be split into 3 classes
Extreme Cases
The least and most general models are in this class. The equal-input model assumes that all substitution rates qij = 1, whereas the most general—REV—estimates each rate independently. We also include the restrictions of those models that allow nonzero substitution rates only between dinucleotides differing by a single nucleotide (e.g., a one-step substitution). We denote the restricted models with the suffix "-1."
A Priori Structural Models
Savill et al. (2001)
found that one of the "structural" models (RNA16A) tended to provide the best description for the data. This model estimates 5 substitution rates qij:
s, a single nucleotide substitution between any 2 paired (AU, CG, GC, CU, UA, UG) states;
d, a double transition between AU and GC and UA and CG; ß, all other substitutions among paired states;
, a single nucleotide substitution between a paired and an unpaired state;
, a single nucleotide substitution between 2 unpaired states. At the suggestion of an anonymous reviewer, we also consider (for biological data), an extension of RNA16A, which places a 15-parameter general reversible structure of the 6 paired states and retains the
and
parameters of RNA16. This 17 rate parameter model is referred to as PairedREV.
Extensions of Nucleotide Models
We consider 2 models that reduce to standard nucleotide models under certain parameter constraints and allow nonzero rates for single nucleotide substitutions only. The Muse95 model (Muse 1995
) is an extension of the standard nucleotide substitution model Hasegawa-Kishino-Yano 1985 model (HKY85) to the state space of 16 dinucleotides. It includes a single coupling parameter (R) to account for the relative (to those substitutions that leave the pairing status unchanged) increase of rates that restore canonical pairings; 1/R reflects the relative paucity of those substitutions that break up canonical pairings. This model reduces to HKY85 when R = 1. Another model (NREV16), similar in spirit to the model of Schöniger and von Haeseler (1999)
, independently estimates all 5 nucleotide substitution biases and reduces to the standard reversible nucleotide model if dinucleotide frequencies are equal to the products of constituent nucleotide frequencies.
Model Interpretation
Once we have selected a model using the GA, it is important to assess the confidence we have in the choice, in terms of both how much better the model fits when compared with other candidates and whether we would select the same (or a similar) model again if other independent data sets were available. The comparison between the GA model, the equal-input model, and the REV model can be carried out using the standard likelihood ratio test, as these models are always nested; we caution, however, that because the model selection procedure is data dependent, the standard LRT may be inapplicable—hence, we strongly recommend the multimodel inference approach detailed below. To determine the significance of the difference in fit between models specified a priori (such as the RNA16A or the Muse95 models) and the GA model (which may not necessarily be nested), we rely on both AICc and the Shimodaira–Hasegawa test (Shimodaira and Hasegawa 1999
), using the difference in AICc rather than the difference in log likelihoods, in order to compare models with different numbers of parameters.
Rather than just base inference on the single best-fitting model, we also average over a set of evolutionary models. To this end, we calculate the Akaike weights (Akaike 1983
) for each model examined during a run of the GA. If we define
i=AIC
–min AICc, then the Akaike weight for model mi (M is the total number of fitted models) is given by
.
Model-averaged estimates of a substitution parameter can be found by computing the sum of estimates derived from each model, weighted by the appropriate wi. We can also assess the relative support for model i versus model j as the ratio of their Akaike weights (evidence ratios). By summing over the Akaike weights from largest to smallest, until the sum reaches or exceeds 1 –
, we can rapidly obtain an
-level confidence set of models. Alternative derivations of confidence sets, such as those based on evidence ratios—for example, all models with evidence ratio no less than c when compared with the best-fitting model—can also be considered. In general, it is difficult to ascertain that any given GA run (or a series of GA runs) achieves good coverage of the true confidence set, defined, for example, as that which would have been derived had all possible models been fitted. Theoretical results (Whitley 2001
) suggest that genetic algorithms sample regions of the search space with probability proportional to their fitness (at least asymptotically), so that highly likely models are going to be sampled with relatively high probability. In practice, we monitored model-averaged estimates for apparent convergence and found them to be robust to combining models from multiple GA runs.
Other information theoretical or model complexity scores can be used in place of AICc. Due to computational expense, we defer the detailed exploration of the effect of various fitness measures on model selection to a subsequent study. However, our preliminary results indicate that GA model selection is fairly robust to the choice of a specific information-based criterion (results not shown).
Effect of Site-to-Site Rate Variation
Variation of substitution rates along the alignment can potentially exert a confounding effect on the model matrix. In particular, it could be argued that if site-to-site rate variation is unaccounted for, rate biases found by the GA are simply attempting to correct for localized spatial rate heterogeneity. The issue can be addressed either by using the GA to search the space of models that allow for site-to-site rate variation or by investigating whether the effects of correcting for substitutional biases and rate variation are nearly independent in terms of log-likelihood score improvement. The former approach is rigorous but time consuming because the computational complexity is increased approximately B-fold for a model with B spatial rates, whereas the latter is faster but ad hoc. When the log-likelihood score improvements from the 2 effects grossly deviate from additivity, we can investigate the data further with a modified GA. The flexible and computationally efficient ß –
discrete distribution with 4 rate categories (Kosakovsky Pond and Frost 2005b
) is used to model rate heterogeneity in both approaches.
Rate Matrix Comparison
Qualitative Comparisons
Structural a priori and GA models partition 120 substitution rates into K disjoint subsets, and qualitative similarity between any 2 models can be compared using a clustering metric. One of the simplest measures at our disposal is the "Rand statistic" (Rand 1971
), which quantifies the similarity between two clusterings of the same set of objects. The Rand statistic on 2 clusterings A and B is defined as R(A,B)=(N00+N11)/(N00+N01+N10+N11). N00 is the number of pairs (of rates in the case of substitution models) that belong to different clusters in both A and B, N01 is the number of pairs that belong to the same cluster in A but different clusters in B, N10 is the number of pairs that belong to the same cluster in B but different clusters in A, and N11 is the number of pairs that belong to the same cluster in both A and B. For clusterings on S objects, the denominator is always equal to S(S – 1)/2, hence 0
R(A, B)
1. It is also easy to see that R(A, B) = 1 if and only if clusterings A and B are identical. We illustrate the use of the Rand statistic on the example of 2 nucleotide rate matrices that cluster the 6 available nucleotide substitution rates. TN93 model (Tamura and Nei 1993
) induces the clustering {
AC,
AT,
CG,
GT}, {
AG}, {
CT}, whereas the Kimura three state model yields {
AC,
GT}, {
AG,
CT}, {
AT,
CG}. There are 2 pairs of rates (
AC,
GT, and
AT,
CG) that belong to the same cluster in both models and, hence, N11 = 2. One can further check that N00 = 8, N01 = 1, N10 = 4, and the Rand statistic is 2/3.
To test whether A and B are more similar than expected by chance, we simulate the null distribution of the Rand statistic, by computing the statistic using A and simulated clusterings with the same number of components as B, but random allocations of elements to clusters (if we are comparing an a priori fixed model with a GA-derived one), or by computing the statistic on pairs of random clusterings (with the appropriate number of components) for the comparison of 2 GA-derived models. Lastly, having scaled the matrix to one expected substitution per unit time per dinucleotide, we set all those rate entries for which the 95% likelihood profile confidence interval overlaps 0 to 0 for clustering statistic computation purposes. This reduction is sensible because such small rates are effectively indistinguishable from 0 for alignment sizes and evolutionary times used in this study.
Correlation with A Priori Structures
All realistic models of stem RNA evolution should respect the basic property of favoring preservation and restoration of the canonical or weak GU pairings and penalizing substitutions to mispaired states. Additionally, one might inquire whether inferred models corroborate the intuition that single nucleotide substitutions should happen at a higher rate than doublet substitutions and that basic nucleotide substitution biases (e.g., transitions vs. transversions) are recapitulated.
Quantitative Comparisons
Although useful for coarse-grained comparison, clustering-based similarity analyses are unable to account for the magnitude of rate differences and their effect on the substitution process over a relevant period of time. For instance, 2 models may produce similar rate clusters but dramatically different estimates of rates for each cluster. Conversely, 2 models with differing clusterings can have similar rates between clusters and represent similar evolutionary processes.
We define the total variation metric (TVM) on two 16-state Markov processes as D(t;T1,T2)=(1/32)
a,b
{AA...UU}|T
(t)–T
(t)|, where T
denotes the probability of substituting dinucleotide a with dinucleotide b in time t using model Ti. TVM is a time-dependent distance metric between substitution models, taking values in [0, 1]. D(0) = 0, and as t
, D approaches the TVM computed on the stationary distributions of T1 and T2. For processes with the same stationary distribution, the metric, therefore, converges to 0 as time approaches infinity.
We propose to compute D(t) at 3 characteristic times:
- DL: the longest path in phylogenetic tree fitted using a model expected to produce reasonable estimates of branch lengths (e.g., REV).
- DM: the median branch length in the tree derived using REV.
- Dmax: the maximum total variation distance between 2 models over all possible times t
0, that is, Dmax=suptD(t;T1,T2).
- DM: the median branch length in the tree derived using REV.
Having obtained pairwise distances between the models, we can cluster them into similarity trees, for example, using neighbor joining (Saitou and Nei 1987
).
Implementation
Recent advances in computational power make the use of complex search algorithms feasible. All sequence analyses and model fitting were performed using the HyPhy (Kosakovsky Pond et al. 2005
) software on 32-node partitions of the IBM Blue-C computer at the University of Swansea, Wales. Given P available processors, P – 1 slave nodes were used to fit various models, and a single master node dispatched the jobs and assembled the results. The size of CHC population was set to 2P – 2 individuals. A single pass of GA required from several hours to 2 days based on the size of the alignment. HyPhy batch files needed to perform the analyses and result processing can be obtained from http://www.hyphy.org/pubs/rna-ga-models.tgz.
Sequence Alignments
Three previously published stem RNA alignments were used in this study (table 1). The alignments and trees stored in NEXUS format can be downloaded from http://www.hyphy.org/pubs/rna-ga-models.tgz.
|
Simulations
We simulated 3 sets of data replicates to evaluate the properties of GA-based model inference when the true models of evolution are known. Due to computational complexity of GA model searches, we only explored a few descriptive cases. Firstly, we fitted the nucleotide-reversible model to the deuterostome small subunit (SSU) RNA alignment (table 1) and simulated 20 parametric replicates (Simulation 1). One would expect that a GA search on a data set without secondary structure constraints would lead to the recovery of the nucleotide model of substitution. Secondly, we fitted the RNA16A to the same alignment and parametrically simulated 20 data sets (Simulation 2), to test whether a GA search is able to properly recover structural properties. Thirdly, we parametrically simulated 20 data sets using the RNA REV model fitted to the deuterostomes alignment as a test case when the true model is not well represented by any a priori matrices (Simulation 3).
| Results |
|---|
|
|
|---|
Model Fits and Comparisons
Biological Sequence Data
We fitted a collection of predefined and GA-inferred maximum likelihood models of dinucleotide substitution to 3 data sets: deuterostome SSU RNA, mammalian placental 12S RNA, and the HIV rev response element (RRE). Model fit results are summarized in table 2 for deuterostome SSU RNA and in tables S1 and S2 (see Supplementary Material online) for Mammalian 12S and HIV-1 RRE RNA, respectively.
|
We performed GA searches with C = 2, 3, ..., 8 rate classes for the deuterostomes data set but considered only C = 5 and C = 8 for the other 2 alignments. The most general existing structural model has 5 estimated rates, and in our experience (results not shown), fitting more than 8 rate classes is computationally unstable and does not lead to significant improvements in fit, at least for alignment sizes considered in this paper.
Based on AICc, models found by the GA were the best among all those considered, except for the smallest alignment (HIV RRE), where the GA model was second in AICc to the structural model with the fewest parameters (Muse95); however, the RRE alignment is very short, and the AICc penalty for adding even a single model parameter is rather steep. The GA model with as few as 3 rate classes was able to outperform all a priori models and the REV (or REV-1) model. In fact, most of the AICc improvement was realized with C
5 rate classes (table 2). The RNA16A structural model had the best AICc score among all a priori models but fitted poorly compared with the GA models. From comparison with the PairedREV model, we note that most of the improvement in likelihood scores did not come from fine-grain modeling of substitution rates between paired states but rather from allowing more flexibility in how paired–unpaired and unpaired–unpaired rates are described. The improvement in fit was not due to a small number of dinucleotide sites (outliers), but rather represented a contribution from many sites in the alignment, as demonstrated by the low P values obtained with Shimodaira–Hasegawa tests comparing a GA model with others (table S3, see Supplementary Material online) and the fact that a large proportion of sites (fig. S1, see Supplementary Material online) had a better likelihood under the GA model when compared with a priori structural models. The best-fitting GA models also more closely approximated the overall length of the tree (using tree length under REV as a reference) than any a priori model.
When we tabulated AICc scores explored by the GA, it became apparent that there existed a large number of candidate models, which explained the data better than any a priori models or the REV model (tables S3, fig. 1), suggesting that a priori models may miss a number of important properties of the evolutionary process but a range of GA models are able to capture those subtleties. Indeed, AICc thresholds for the inclusion in the 95% confidence set were such (5,842.49 for deuterostomes SSU, 64,676.8 for mammalian 12S, and 1,424.5 for HIV-1 RRE) that no (except the Muse95 model for the HIV RRE alignment) a priori models could be considered credible. Consequently, it appears that a priori models fail to capture important (at least from the standpoint of goodness of fit) alignment-specific evolutionary properties.
|
We found that permitting instantaneous dinucleotide changes gave a substantial improvement in AICc scores. However, the GA-1 model with 5 rate classes outperformed the RNA16A model in all 3 cases, indicating that the role of doublet substitutions among paired states, modeled by RNA16A, may be minor in comparison with a few doublet substitutions between paired and unpaired states and within unpaired states (see fig. 2 and figs. S2 and S3 in see Supplementary Material online). The overall AICc hierarchy was largely unchanged by the inclusion of site-to-site rate variation because adding the ß –
rate variation (Kosakovsky Pond and Frost 2005b
|
Rate Matrix Comparison
We compared rate matrices using both the Rand statistic and the TVM. Nearly all pairs of models that clustered using the Rand statistic (table S4, see Supplementary Material online) showed significant clustering patterns using TVM, even when applied to different data sets. This finding indicates that the GA models are driven to recover similar rate clustering structures, some of which are represented in the structural RNA16A model. However, for the comparisons involving general GA models, the nontrivial structure induced on the rates between double-substituted states reduced the Rand statistic; hence, much of the alignment-specific clustering may be due to these rates.
GA-derived models closely resembled the REV model, when compared using TVM for all 3 data sets, at the 3 characteristic times (fig. 3). Indeed, no a priori model was more similar to REV (REV-1) than the GA (GA-1) model. This finding reinforces the intuition that GA models are attempting to approximate the most general model with few parameters by recovering the most important substitution patterns. The clustering pattern among other models varied a great deal, both from data set to data set and between the 3 timescales.
|
Box-plots of best-fitting model rate estimates (fig. 4) demonstrate that GA models infer higher rates for substitutions that restore canonical pairings and for those that involve a single nucleotide, albeit there is little difference between transitions and transversions. For the larger mammalian 12S RNA data set, GA 8 and RNA REV (results not shown) infer a very similar pattern of rates, whereas for HIV RRE, there is more fluctuation, possibly reflective of the trend to "fit the noise" with overly parameter-rich models. Model-averaged rate estimates are very similar to those derived from the best-fitting models (results not shown). It is important to note that structural alignment algorithms, commonly used to prepare stem RNA data for phylogenetic analyses, can introduce a confounding bias toward pair preservation or recovery. However, such bias, if present, would clearly affect all those models of sequence evolution that treat the sequence alignment as fixed.
|
Simulated Sequence Data
GA-based models perform admirably well when the true evolutionary process is known. We fitted 8 models (equal input, equal input-1, Muse95, NREV16, RNA16A, REV-1, REV, and GA with 5 rate classes) to each simulated replicate. For computational expediency, a single GA run was performed in place of a more rigorous multiple-run procedure. A possible side effect of this is that the algorithm may have failed to converge in some instances, possibly biasing the results against GA-derived models.
Simulation 1
This simulation examined the fit of dinucleotide models to data generated under a nucleotide model with the same substitutional biases as the deuterostome SSU data set but without any paired states. Not surprisingly, the 2 models that can collapse to nucleotide-level evolution (Muse95 and NREV16) resembled the nucleotide model used for the simulation the most (table S5, see Supplementary Material online), as measured by TVM (regardless of the timescale), followed by general reversible models, and then the GA-derived model. However, the GA had the best AICc in 15 of the 20 cases. More importantly (fig. S4, see Supplementary Material online), GA models correctly inferred the lack of correlation between pairing gain/loss and the substitution rates and estimated the rate of doublet substitutions to be effectively zero. Lastly, appropriately scaled substitution rates taken from GA fits closely resembled the true rates (results not shown).
Simulation 2
This simulation used data generated under the structural RNA16A model, using parameter estimates derived from the deuterostome SSU data set. The GA model had the best AICc in all 20 cases (table S6, see Supplementary Material online). Whereas the fitted RNA16A model bore the closest resemblance to the generating model, when measured by TVM (regardless of the timescale used), the GA model was able to approximate the substitution process nearly as well as a much more parameter-rich REV (or REV-1) model. In 9 of the 20 cases, GA-derived transition matrices were second closest to the true model, using the maximum tree path timescale (the results were similar for the other 2 timescales).
Rate estimates from a GA model were similar to those used to simulate the data (fig. S5, see Supplementary Material online), although they were "smoothed" (smaller generating rate estimates were biased upward and larger generating rates were biased downward). This is not surprising because a slight misclassification of rate classes (e.g., incorrectly placing a few high rates into a lower rate class) will have just this effect. For larger, more informative data sets, one can expect this effect to decrease. The average Rand statistic between the true model and GA fits was 0.713 with P < 0.001 (permutation test), showing significant similarity between rate structures in the generating and the inferred models. On average, 64% of substitution rates were allocated to the correct rate class.
Simulation 3
This simulation employed data simulated under a fully reversible model using parameter estimates obtained from the deuterostomes SSU data set. The GA model had the best AICc score in all 20 cases and was the second closest (following the "correct" REV model) in terms of the TVM (table S7, see Supplementary Material online). The generating REV model had 66 (out of 120) distinct substitution rates. Clearly, any model with a small number of rates (e.g., RNA16A or GA 5) cannot be expected to approximate all individual rates well, especially for a fairly small alignment, but rather to approximate "groups" of rates (e.g., low, intermediate, high) and their relative proportions among all 120 rates. As seen in table S8 (see Supplementary Material online), the distribution of GA-derived rates is more similar to the distribution used to generate the data than those inferred using RNA16A. However, this difference is minor in terms of TVM performance.
These simulations point to several desirable properties of our GA approach. Firstly, as measured by AICc, our GA models are able to explain the data better than incorrect a priori models and frequently better than the model whose structure was used to simulate the data, when the "true" model is parameter rich. Secondly, rate matrices estimated by the GAs are less distant from the true model than incorrect a priori models and perform similarly to the much more parameter-rich REV model. Thirdly, the best model and model-averaged estimates of substitution rates recover the distribution of rates used to generate the data well.
| Discussion |
|---|
|
|
|---|
In recent years, researchers have come to realize that the choice of a particular Markov process used to model sequence evolution can and should be justified (e.g., Huelsenbeck et al. 2004
When used to select models of dinucleotide substitution applied for stem RNA, the GA procedure led to the consistent recovery of important biological properties from the data without making a priori assumptions about these properties. GA searches favored models that penalize evolution away from paired states and reward reconstitution of canonical (or weak) pairings and those substitutions that involve a single nucleotide. This finding validates most of the assumptions in existing structural RNA models and, conversely, establishes that the GA procedure generates results that make good evolutionary sense. However, it also appears that there exist data set–specific substitution rates, which are misclassified by the simpler models, because our GA models consistently fit biological data sets better than existing state-of-the-art models. Simulation results suggest that GA models can adequately represent vastly disparate evolutionary processes with as few as 5 rate parameters, mitigating the issue of data overfitting. Our models are computationally feasible and can be readily applied to biological data sets using small computer clusters.
Searching large model spaces can also be performed using MCMC approaches. For example, Huelsenbeck et al. (2004)
used a reversible jump MCMC sampler to explore nucleotide model space. MCMC approaches have the advantage that, if implemented correctly, models are sampled in proportion to their likelihood or posterior probability. However, this comes at a steep computational cost and may be unfeasible for substitution models consisting of a large number of states and may give results that are similar to those based on averaging over a set of models identified using a GA search. As computational power becomes more affordable, more research comparing GA approaches with MCMC approaches will be possible.
Unrestricted model selection and rate clustering open new avenues for quantifying and comparing evolutionary processes. Inferred substitution models can be compared qualitatively (clustering patterns) and quantitatively (total variation and other distance metrics). This enables a direct comparison of the structure of competing substitution models on the same data set or of best-fitting models between different, for instance nonhomologous, data sets, thus paving the way for metaevolutionary analyses, in which organisms can be compared in terms of the evolutionary processes that have acted on them, rather than simply based on sequence similarity.
| Supplementary Material |
|---|
|
|
|---|
Supplementary tables and figures are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
|
|
|---|
We thank Art Poon, Selene Zarate, and Andy Leigh Brown for many helpful suggestions during the preparation of the manuscript. This research was supported in part by the National Institutes of Health (AI43638, AI47745, and AI57167), the University of California University-wide AIDS Research Program (grant number IS02-SD-701), and a University of California, San Diego, Center for AIDS Research/National Institute of Allergy and Infectious Diseases Developmental Awards to S.D.W.F. and S.L.K.P (AI36214).
| Footnotes |
|---|
Robin Bush, Associate Editor
| References |
|---|
|
|
|---|
In Abramowitz M and Stegun I (Eds.). Handbook of mathematical functions with formulas, graphs, and mathematical tables, 9th printing (1972) (Dover, New York).
Adachi J and Hasegawa M. (1996) Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 42:4459–468.[Web of Science][Medline]
Adachi J, Waddell PJ, Martin W, Hasegawa M. (2000) Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol 50:4348–358.[Web of Science][Medline]
Akaike H. (1974) A new look at the statistical model identification. IEEE Trans Autom Control 119:716–723.[CrossRef]
Akaike H. (1983) Information measures and model selection. Int Stat Inst 44:139–149.
Alfaro ME and Huelsenbeck JP. (2006) Comparative performance of Bayesian and AIC-based measures of phylogenetic model uncertainty. Syst Biol 55:189–96.[CrossRef][Web of Science][Medline]
Burnham K and Anderson D. (2003) Model selection and multimodel inference 2nd ed. (Springer, New York).
Dimmic MW, Rest JS, Mindell DP, Goldstein RA. (2002) rtrev: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny. J Mol Evol 55:165–73.[CrossRef][Web of Science][Medline]
Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. (2006) Relaxed phylogenetics and dating with confidence. PLoS Biol 4:5e88.[CrossRef][Medline]
Eshelman LJ. (1991) The CHC adaptive search algorithm: how to do safe search when engaging in nontraditional genetic recombination. In Spatz BM (Ed.). Foundations of genetic algorithms(Morgan Kaufmann, San Mateo (CA)) pp. 265–283.
Felsenstein J. (1981) Evolutionary trees from DNA-sequences—a maximum-likelihood approach. J Mol Evol 17:368–376.[CrossRef][Web of Science][Medline]
Hasegawa M, Kishino H, Yano T. (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Mol Biol Evol 21:160–174.
Hue S, Clewley JP, Cane PA, Pillay D. (2005) Investigation of HIV-1 transmission events by phylogenetic methods: requirement for scientific rigour. AIDS 19:4449–450.[Medline]
Huelsenbeck JP, Larget B, Alfaro ME. (2004) Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol Biol Evol 21:61123–1133.
Jukes TH and Cantor CR. (1969) Evolution of protein molecules. In Munro HM (Ed.). Mammalian protein metabolism(Academic Press, New York) pp. 21–132.
Kolaczkowski B and Thornton JW. (2004) Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431:980–984.[CrossRef][Medline]
Kosakovsky Pond SL and Frost SD. (2005a) A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Mol Biol Evol 22:3478–485.
Kosakovsky Pond SL and Frost SD. (2005b) A simple hierarchical approach to modeling distributions of substitution rates. Mol Biol Evol 22:2223–234.
Kosakovsky Pond SLK, Frost SDW, Muse SV. (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics 21:5676–679.
Le S-Y, Zhang K, Maizel J, Jacob V. (2002) RNA molecules with structure dependent functions are uniquely folded. Nucleic Acids Res 30:163574–3582.
Moler C and Loan CV. (1978) Nineteen dubious ways to compute the exponential of a matrix. SIAM Rev 20:801–836.[CrossRef]
Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, O'Brien SJ. (2001) Molecular phylogenetics and the origins of placental mammals. Nature 409:6820614–618.[CrossRef][Medline]
Muse SV. (1995) Evolutionary analyses of DNA sequences subject to constraints on secondary structure. Genetics 139:1429–1439.[Abstract]
Notredame C, O'Brien E, Higgins D. (1997) Raga: RNA sequence alignment by genetic algorithm. Nucleic Acids Res 25:224570–4580.
Posada D and Buckley TR. (2004) Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst Biol 53:5793–808.[CrossRef][Web of Science][Medline]
Rand W. (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850.[CrossRef][Web of Science]
Rzhetsky A. (1995) Estimating substitution rates in ribosomal RNA genes. Genetics 141:2771–783.[Abstract]
Saitou N and Nei M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425.[Abstract]
Savill NJ, Hoyle DC, Higgs PG. (2001) RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum-likelihood methods. Genetics 157:1399–411.
Schöniger M and von Haeseler A. (1994) A stochastic model for the evolution of autocorrelated DNA sequences. Mol Phylogenet Evol 3:3240–247.[CrossRef][Medline]
Schöniger M and von Haeseler A. (1999) Toward assigning helical regions in alignments of ribosomal RNA and testing the appropriateness of evolutionary models. J Mol Evol 49:5691–698.[CrossRef][Web of Science][Medline]
Schwarz G. (1978) Estimating the dimension of a model. Ann Stat 6:461–464.[CrossRef]
Shapiro BA and Navetta J. (1994) A massively parallel genetic algorithm for RNA secondary structure prediction. J Supercomput 8:3195–207.[CrossRef]
Shimodaira H and Hasegawa M. (1999) Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol 16:1114–1116.[Web of Science]
Sugiura N. (1978) Further analysis of the data by Akaike's information criterion and the finite corrections. Commun Stat Theory Methods A7:13–26.[CrossRef]
Tamura K and Nei M. (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial-DNA in humans and chimpanzees. Mol Biol Evol 10:512–526.[Abstract]
Telford MJ, Wise MJ, Gowri-Shankar V. (2005) Consideration of RNA secondary structure significantly improves likelihood-based estimates of phylogeny: examples from the Bilateria. Mol Biol Evol 22:41129–1136.
Tillier REM. (1994) Maximum likelihood with multiparameter models of substitution. J Mol Evol 39:4409–417.[CrossRef]
Tillier REM and Collins RA. (1995) Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol Biol Evol 12:17–15.[Medline]
Whelan S and Goldman N. (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18:5691–699.
Whitley D. (2001) An overview of evolutionary algorithms: practical issues and common pitfalls. J Inf Softw Technol 43:817–831.[CrossRef]
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
W. Delport, K. Scheffler, and C. Seoighe Models of coding sequence evolution Brief Bioinform, January 1, 2009; 10(1): 97 - 109. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




