Skip Navigation


MBE Advance Access originally published online on October 12, 2006
Molecular Biology and Evolution 2007 24(1):159-170; doi:10.1093/molbev/msl144
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
24/1/159    most recent
msl144v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kosakovsky Pond, S. L.
Right arrow Articles by Frost, S. D. W.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kosakovsky Pond, S. L.
Right arrow Articles by Frost, S. D. W.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Evolutionary Model Selection with a Genetic Algorithm: A Case Study Using Stem RNA

Sergei L. Kosakovsky Pond*, Frank V. Mannino{dagger}, Michael B. Gravenor{ddagger}, Spencer V. Muse{dagger} and Simon D. W. Frost*

* Department of Pathology, University of California, San Diego
{dagger} Bioinformatics Research Center, North Carolina State University
{ddagger} School of Medicine, University of Wales, Swansea, United Kingdom

E-mail: spond{at}ucsd.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
The choice of a probabilistic model to describe sequence evolution can and should be justified. Underfitting the data through the use of overly simplistic models may miss out on interesting phenomena and lead to incorrect inferences. Overfitting the data with models that are too complex may ascribe biological meaning to statistical artifacts and result in falsely significant findings.

We describe a likelihood-based approach for evolutionary model selection. The procedure employs a genetic algorithm (GA) to quickly explore a combinatorially large set of all possible time-reversible Markov models with a fixed number of substitution rates. When applied to stem RNA data subject to well-understood evolutionary forces, the models found by the GA 1) capture the expected overall rate patterns a priori; 2) fit the data better than the best available models based on a priori assumptions, suggesting subtle substitution patterns not previously recognized; 3) cannot be rejected in favor of the general reversible model, implying that the evolution of stem RNA sequences can be explained well with only a few substitution rate parameters; and 4) perform well on simulated data, both in terms of goodness of fit and the ability to estimate evolutionary rates. We also investigate the utility of several distance measures for comparing and contrasting inferred evolutionary models.

Using widely available small computer clusters, our approach allows, for the first time, to evaluate the performance of existing RNA evolutionary models by comparing them with a large pool of candidate models and to validate common modeling assumptions. In addition, the new method provides the foundation for rigorous selection and comparison of substitution models for other types of sequence data.

Key Words: RNA sequence evolution • secondary structure • model selection • genetic algorithms • multimodel inference


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
It has long been recognized that in order to understand and quantify the evolution of molecular sequences, it is crucial to incorporate biologically important constraints and selective pressures into the stochastic model used to describe character substitution. For instance, secondary structure is vital to the function of many RNA molecules. The major evolutionary constraint on stem RNA is the preservation of canonical Watson–Crick (adenine–guanine and cytosine–uracil) pairings. However, not all mispaired dinucleotides are equally deleterious—for example, the GU pair is encountered often and is considered to be a weak pairing or a tolerated intermediate state between the two proper pairings. This relative simplicity stands in sharp contrast to protein data, where no single property or constraint can adequately capture major evolutionary forces, and ad hoc or empirical models are the current state of the art.

In the past decade, a number of probabilistic models for stem RNA have been described. These models consider the process of substitution on 16 characters formed by 2 nucleotide residues paired along an RNA stem. Early research efforts were directed either toward extending established models of nucleotide evolution (such as Jukes and Cantor 1969Go; Hasegawa et al. 1985Go) to correct for unequal rates of substitutions that preserve or disrupt canonical pairings or toward defining models on reduced character states. Tillier (1994)Go and Tillier and Collins (1995)Go described a Markov model for the evolution of 6 dinucleotide states (AU, UA, CG, GC, GU, UG). Muse (1995)Go introduced a model with a single "coupling" parameter that rewarded substitutions from unpaired to paired states and penalized the reverse substitutions. Schöniger and von Haeseler (1994Go, 1999)Go and Rzhetsky (1995)Go focused on modeling nucleotide substitution biases, unequal dinucleotide frequencies, and site-to-site rate variation in dinucleotides. Savill et al. (2001)Go described a series of models derived from biological and thermodynamic constraints and conducted an extensive comparison of goodness of fit among several candidates. From a practical perspective, Telford et al. (2005)Go have argued that the use of realistic models of RNA sequence evolution results in better phylogenetic trees and should be encouraged whenever possible.

In the standard maximum likelihood framework of comparative phylogenetic sequence analysis (Felsenstein 1981Go), the general time-reversible model (REV) on the appropriate set of characters is usually considered to be the most complex model worth fitting. For example, empirical models of protein sequence evolution (e.g., Adachi and Hasegawa 1996Go; Adachi et al. 2000Go; Dimmic et al. 2002Go) are often derived by fitting the protein REV model to a curated alignment and fixing substitution rates between different amino acids at their maximum likelihood estimates for subsequent analyses of alignments of similar genes or taxa. However, because the REV model posits a separate rate parameter for each pair of characters, it can be computationally challenging to fit, and available biological sequence data may be too sparse, leading to catastrophic loss of estimate precision and the possibility of consequent misinterpretation of the results. Furthermore, the "seed" alignment used to derive the model may not contain enough data to estimate relevant evolutionary processes for a whole family of sequences. In other words, an independent sample of similar sequences may lead to substantially different parameter estimates. Large sampling errors and lack of generalizability are frequently the result of model "overfitting." As a way to deal with these concerns, Whelan and Goldman (2001)Go used approximate numerical methods to fit a single REV-type model jointly to an extensive collection of protein families, thus creating a much larger data set, but at the cost of smoothing over family-specific nuances.

One solution to the problem of overfitting is to group all substitutions into a few rate classes based, for instance, on a priori knowledge of biochemical properties (Muse 1995Go; Savill et al. 2001Go). However, when only considering a set of a priori models, a careful researcher must account for the possibility that among models that have been left out, there may exist some that have a better statistical fit to the data and lead to different biological conclusions. Kosakovsky Pond and Frost (2005a)Go provide such an example in the context of estimating temporally variable selective pressure. From the standpoint of inferring the nature of the substitution process from the data, it may be preferable to estimate the patterns of dinucleotide replacement using a flexible statistical procedure and then to correlate rates of replacement with different types of change in molecular properties, rather than to fit a small collection of property-based a priori models and select the best one. To avoid statistical problems arising out of model misspecification (e.g., Kolaczkowski and Thornton 2004Go), a rigorous model selection or multimodel inference procedure should be employed whenever possible. For DNA data, there are merely 203 nucleotide models to consider; hence, it is feasible to fit all of them and test the robustness of inference to the choice of a particular model (Burnham and Anderson 2003Go; Posada and Buckley 2004Go). However, the combinatorially large number of all possible RNA models renders the exhaustive search approach inapplicable. Alternatively, a Markov chain Monte Carlo (MCMC) procedure can sample from the space of nucleotide models (Alfaro and Huelsenbeck 2006Go). MCMC-based approaches may be too computationally demanding for exploring the much larger model space of RNA models, and the appropriate choice of a prior distribution on the complex discrete space of models is far from obvious.

We present a maximum likelihood model selection procedure based on a genetic algorithm (GA). GAs have been successfully applied to the prediction of secondary RNA structures from sequence data (Shapiro and Navetta 1994Go; Notredame et al. 1997Go), but to the best of our knowledge, they have never been used for substitution model selection and validation. Evolutionary models selected by our GA approach provide a better fit than the state-of-the-art a priori models (Savill et al. 2001Go). As expected, inferred substitution rates between different dinucleotides depend on whether a pairing is lost, gained, or maintained; on the number of nucleotide substitutions required; and on the specific nucleotides involved in the substitution. We find that models with few substitution rates (e.g., 4–6) outperform the REV model in terms of an information theoretical model selection criterion (small sample Akaike information criterion—AICc). The addition of site-to-site variation in substitution rates results in a further improvement in fit, but the estimated site-to-site rate distribution is fairly robust to the model of character substitution. Consistent with the results of Savill et al. (2001)Go, the models that allow direct substitution between all nucleotide pairs outperform those that permit direct substitutions involving only a single nucleotide.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
A Class of Dinucleotide Substitution Models
We consider stationary time-reversible Markov models of dinucleotide substitution. The most general model from this class, RNA REV, has 120 rate parameters—each describing the rate of substitution between a pair of different dinucleotides. For other models in the class, there are 1 ≤ C ≤ 119 different substitution rates, and some pairs of dinucleotides share the same rate. The case C = 1 reduces to the "equal-input model," in which all substitutions occur at the same rate. The instantaneous rate of change from dinucleotide i to dinucleotide j (where i, j range from 1 to 16 and the characters are indexed in alphabetical order, i.e., AA, AC, ..., UG, UU) is given by qij = rij{pi}j, i != j. Here, rij is one of the C estimated rate parameters Rk, and {pi}j denotes the equilibrium frequency of the dinucleotide labeled by j. Time reversibility implies rij = rji. The diagonal entries are defined so that the rate matrix describes a valid Markov process, that is, qii=–{sum}k=1, k!=i16qik. Because only products of evolutionary rates and times can be estimated using the standard phylogenetic likelihood framework, one of the C rates is not identifiable. We choose to normalize the rates by setting the expected number of substitutions per site (per unit time) to 1, that is, {sum}i=116qii{pi}i=1. Finally, the probability of substituting character i with j in time t ≥ 0 can be computed as Pr{i->j|t}=[exp(Qt)]ij, where Q=(qij)Formula is the rate matrix of the process, and the matrix exponential can be evaluated using one of the standard numerical algorithms (Moler and Loan 1978Go). Equilibrium frequencies of character states can either be estimated by maximum likelihood or set to the proportions observed in the data. For computational expediency, we adopt the latter approach; when we have estimated the frequencies by maximum likelihood for comparison, we found that model fits did not improve significantly (results not shown).

A model with C rates can be encoded by an array of 120 integers, with the first entry corresponding to the (AA, AC) pair, the second to (AA, AG), and so on. Each array entry ranges from 1 to C and determines which substitution rate Rk represents a given pair. To rule out degenerate and equivalent model parameterizations, we need to enforce 2 constraints.

  1. Each k = 1, ..., C must be present in the encoding vector at least once; otherwise we could reduce C and relabel the rates accordingly. For example, (1, 2, 2, 4, ...) {equiv} (1, 2, 2, 3, ...), assuming that 3 does not appear in the left-hand side parameterization.
  2. The first occurrence of 1 precedes the first occurrence of all other k; precedes the first occurrence of 2; precedes those of 3, 4, ...; and so on; otherwise we could reindex k in the order of first appearance and arrive at an equivalent model. For example, (2, 1, 4, 3, ...) {equiv} (1, 2, 3, 4, ...).

For C rate classes, there are {sum}FormulaS(120,k) possible models to consider, where

Formula
is the "Stirling number" of the second kind (Abramowitz and Stegun 1972Go). S(N, k) enumerates the number of ways to partition N objects into k nonempty sets. Clearly, the number of models included in the search space is combinatorially large, for example, for C = 2 there are approximately 6.6 x 1035 models and for C = 3 approximately 3.0 x 1056. Consequently, an efficient heuristic search procedure must be employed to systematically explore the space of possible models. We also consider a subset of the model space comprising those models that allow only single nucleotide substitutions to occur instantaneously. This restriction reduces the maximum number of rate parameters to 48.

Model Selection Using a GA
The parameter space for our optimization problem consists of 2 components: a discrete partitioning of substitution rates into C different classes and a vector of real-valued parameters corresponding to branch lengths and substitution rates R1, ..., RC. We use a GA to search through the discrete component of the parameter space and conventional numerical optimization techniques to find maximum likelihood estimates of all other parameters, given a substitution model. To reduce computational effort, we initially estimate branch lengths using the equal-input model and readjust them every time the GA iteration improves the fitness score by more than a fixed number of points (e.g., 5) relative to the most recent model for which branch lengths have been estimated.

We use the CHC algorithm—cross-generational elitist selection, heterogeneous recombination, and cataclysmic mutation (Eshelman 1991Go), an aggressive population-based hill climber. The algorithm operates on vectors of integers encoding substitution models (state vectors); hence, the minimum unit of "model evolution" is the assignment of a given dinucleotide pair to a rate class. If C > 2, this vector can either be manipulated directly or be converted to binary representation. The size of the population is fixed in our implementation—at twice the number of available processors. The CHC algorithm always retains the most fit individual and uses a radical recombination operator to rapidly explore the parameter space without collapsing it. The recombination operator of the CHC algorithm generates an offspring from 2 parent states by randomly choosing 1 of the 2 parental values in every position of the state vector. To ensure that the score of the best-fitting individual does not decrease across generations, the best-fitting individual from the current generation is always copied to the next one. To avoid entrapment in local optima, when the diversity of the population (measured as the relative difference in fitness scores between the best- and the worst-fitting individuals) drops below a certain threshold, a high proportion—15–35%—of positions are mutated in the encoding vectors of all but the best-fitting individual. After recombination and mutation operations, it may be necessary to convert the newly generated model to a canonical form described in the previous section.

The mutation phase is triggered if the absolute (or relative) difference between the fitness of the best- and the worst-fitting models in a generation decreases below a fixed threshold, {lambda} > 0. With probability µ (= 0.1), each entry in the model vectors of the current generation (excluding the best-fitting one) will be randomly replaced with an integer in 1, ..., C (excluding its present state). Individuals are chosen for mating with probabilities proportional either to their Akaike weights (see below) relative to other individuals in the current population or to their decreasing fitness ranking. AICc rank–based parent selection usually leads to slower convergence but better mixing of the populations, whereas Akaike weights are more suitable to quickly find a "good" solution because poorly fitting individuals almost never get a chance to reproduce.

We say that the algorithm has converged when the best fitness score remains constant for 100 consecutive generations. To verify convergence, we repeated some of the runs several times, with the initial population generated randomly every time. For models with many rate classes, convergence in a reasonable amount of time posed an issue; many runs terminated before the maximum was achieved (as measured by the best score among multiple runs); hence, we restricted our models to contain no more than 8 rate parameters. In addition, to safeguard against premature algorithm termination, we propose the following procedure: 1) run 5 instances of the GA for each data set and C using Akaike weight–based parent selection; 2) select 3 best individuals from each run, as well as the top 3 models found by the run for C – 1 rate classes (if C > 2); 3) perform an additional follow-up run with this initial population, using fitness rank–based parent selection; 4) report the output from the final run. A typical GA run in this setting will explore between 2,000 and 50,000 models and take between 1 and 48 h on a 32-node computer cluster.

The fitness of every model is measured by its AICc (Akaike 1974Go; Sugiura 1978Go). AICc of a model with p parameters applied to a sample of size s is defined as AICc=–2logL({theta}|data)+2p[s/(sp–1)]. AICc is meaningful when the number of observations exceeds the number of model parameters by at least 2. In our case, we treat each dinucleotide alignment column as an independent observation; hence, the method requires at least B + C + 17 alignment columns (B is the number of branches in the phylogenetic tree; there are 15 character frequency parameters). Many alignments of paired RNA regions are short and may fail to meet this requirement—other information criteria, such as Schwartz's Bayesian Information Criterion (Schwarz 1978Go), could be used in those instances. We deliberately selected a conservative measure that requires long sequences, to reduce the effects of model overfitting.

Reference Models
We compare GA-derived models with those previously used in literature. These models can be split into 3 classes

Extreme Cases
The least and most general models are in this class. The equal-input model assumes that all substitution rates qij = 1, whereas the most general—REV—estimates each rate independently. We also include the restrictions of those models that allow nonzero substitution rates only between dinucleotides differing by a single nucleotide (e.g., a one-step substitution). We denote the restricted models with the suffix "-1."

A Priori Structural Models
Savill et al. (2001)Go found that one of the "structural" models (RNA16A) tended to provide the best description for the data. This model estimates 5 substitution rates qij: {alpha}s, a single nucleotide substitution between any 2 paired (AU, CG, GC, CU, UA, UG) states; {alpha}d, a double transition between AU and GC and UA and CG; ß, all other substitutions among paired states; {gamma}, a single nucleotide substitution between a paired and an unpaired state; {varepsilon}, a single nucleotide substitution between 2 unpaired states. At the suggestion of an anonymous reviewer, we also consider (for biological data), an extension of RNA16A, which places a 15-parameter general reversible structure of the 6 paired states and retains the {gamma} and {varepsilon} parameters of RNA16. This 17 rate parameter model is referred to as PairedREV.

Extensions of Nucleotide Models
We consider 2 models that reduce to standard nucleotide models under certain parameter constraints and allow nonzero rates for single nucleotide substitutions only. The Muse95 model (Muse 1995Go) is an extension of the standard nucleotide substitution model Hasegawa-Kishino-Yano 1985 model (HKY85) to the state space of 16 dinucleotides. It includes a single coupling parameter (R) to account for the relative (to those substitutions that leave the pairing status unchanged) increase of rates that restore canonical pairings; 1/R reflects the relative paucity of those substitutions that break up canonical pairings. This model reduces to HKY85 when R = 1. Another model (NREV16), similar in spirit to the model of Schöniger and von Haeseler (1999)Go, independently estimates all 5 nucleotide substitution biases and reduces to the standard reversible nucleotide model if dinucleotide frequencies are equal to the products of constituent nucleotide frequencies.

Model Interpretation
Once we have selected a model using the GA, it is important to assess the confidence we have in the choice, in terms of both how much better the model fits when compared with other candidates and whether we would select the same (or a similar) model again if other independent data sets were available. The comparison between the GA model, the equal-input model, and the REV model can be carried out using the standard likelihood ratio test, as these models are always nested; we caution, however, that because the model selection procedure is data dependent, the standard LRT may be inapplicable—hence, we strongly recommend the multimodel inference approach detailed below. To determine the significance of the difference in fit between models specified a priori (such as the RNA16A or the Muse95 models) and the GA model (which may not necessarily be nested), we rely on both AICc and the Shimodaira–Hasegawa test (Shimodaira and Hasegawa 1999Go), using the difference in AICc rather than the difference in log likelihoods, in order to compare models with different numbers of parameters.

Rather than just base inference on the single best-fitting model, we also average over a set of evolutionary models. To this end, we calculate the Akaike weights (Akaike 1983Go) for each model examined during a run of the GA. If we define {Delta}i=AICFormula–min AICc, then the Akaike weight for model mi (M is the total number of fitted models) is given by Formula .

Model-averaged estimates of a substitution parameter can be found by computing the sum of estimates derived from each model, weighted by the appropriate wi. We can also assess the relative support for model i versus model j as the ratio of their Akaike weights (evidence ratios). By summing over the Akaike weights from largest to smallest, until the sum reaches or exceeds 1 {alpha}, we can rapidly obtain an {alpha}-level confidence set of models. Alternative derivations of confidence sets, such as those based on evidence ratios—for example, all models with evidence ratio no less than c when compared with the best-fitting model—can also be considered. In general, it is difficult to ascertain that any given GA run (or a series of GA runs) achieves good coverage of the true confidence set, defined, for example, as that which would have been derived had all possible models been fitted. Theoretical results (Whitley 2001Go) suggest that genetic algorithms sample regions of the search space with probability proportional to their fitness (at least asymptotically), so that highly likely models are going to be sampled with relatively high probability. In practice, we monitored model-averaged estimates for apparent convergence and found them to be robust to combining models from multiple GA runs.

Other information theoretical or model complexity scores can be used in place of AICc. Due to computational expense, we defer the detailed exploration of the effect of various fitness measures on model selection to a subsequent study. However, our preliminary results indicate that GA model selection is fairly robust to the choice of a specific information-based criterion (results not shown).

Effect of Site-to-Site Rate Variation
Variation of substitution rates along the alignment can potentially exert a confounding effect on the model matrix. In particular, it could be argued that if site-to-site rate variation is unaccounted for, rate biases found by the GA are simply attempting to correct for localized spatial rate heterogeneity. The issue can be addressed either by using the GA to search the space of models that allow for site-to-site rate variation or by investigating whether the effects of correcting for substitutional biases and rate variation are nearly independent in terms of log-likelihood score improvement. The former approach is rigorous but time consuming because the computational complexity is increased approximately B-fold for a model with B spatial rates, whereas the latter is faster but ad hoc. When the log-likelihood score improvements from the 2 effects grossly deviate from additivity, we can investigate the data further with a modified GA. The flexible and computationally efficient ß – {Gamma} discrete distribution with 4 rate categories (Kosakovsky Pond and Frost 2005bGo) is used to model rate heterogeneity in both approaches.

Rate Matrix Comparison
Qualitative Comparisons
Structural a priori and GA models partition 120 substitution rates into K disjoint subsets, and qualitative similarity between any 2 models can be compared using a clustering metric. One of the simplest measures at our disposal is the "Rand statistic" (Rand 1971Go), which quantifies the similarity between two clusterings of the same set of objects. The Rand statistic on 2 clusterings A and B is defined as R(A,B)=(N00+N11)/(N00+N01+N10+N11). N00 is the number of pairs (of rates in the case of substitution models) that belong to different clusters in both A and B, N01 is the number of pairs that belong to the same cluster in A but different clusters in B, N10 is the number of pairs that belong to the same cluster in B but different clusters in A, and N11 is the number of pairs that belong to the same cluster in both A and B. For clusterings on S objects, the denominator is always equal to S(S – 1)/2, hence 0 ≤ R(A, B) ≤ 1. It is also easy to see that R(A, B) = 1 if and only if clusterings A and B are identical. We illustrate the use of the Rand statistic on the example of 2 nucleotide rate matrices that cluster the 6 available nucleotide substitution rates. TN93 model (Tamura and Nei 1993Go) induces the clustering {{theta}AC, {theta}AT, {theta}CG, {theta}GT}, {{theta}AG}, {{theta}CT}, whereas the Kimura three state model yields {{theta}AC, {theta}GT}, {{theta}AG, {theta}CT}, {{theta}AT, {theta}CG}. There are 2 pairs of rates ({theta}AC, {theta}GT, and {theta}AT, {theta}CG) that belong to the same cluster in both models and, hence, N11 = 2. One can further check that N00 = 8, N01 = 1, N10 = 4, and the Rand statistic is 2/3.

To test whether A and B are more similar than expected by chance, we simulate the null distribution of the Rand statistic, by computing the statistic using A and simulated clusterings with the same number of components as B, but random allocations of elements to clusters (if we are comparing an a priori fixed model with a GA-derived one), or by computing the statistic on pairs of random clusterings (with the appropriate number of components) for the comparison of 2 GA-derived models. Lastly, having scaled the matrix to one expected substitution per unit time per dinucleotide, we set all those rate entries for which the 95% likelihood profile confidence interval overlaps 0 to 0 for clustering statistic computation purposes. This reduction is sensible because such small rates are effectively indistinguishable from 0 for alignment sizes and evolutionary times used in this study.

Correlation with A Priori Structures
All realistic models of stem RNA evolution should respect the basic property of favoring preservation and restoration of the canonical or weak GU pairings and penalizing substitutions to mispaired states. Additionally, one might inquire whether inferred models corroborate the intuition that single nucleotide substitutions should happen at a higher rate than doublet substitutions and that basic nucleotide substitution biases (e.g., transitions vs. transversions) are recapitulated.

Quantitative Comparisons
Although useful for coarse-grained comparison, clustering-based similarity analyses are unable to account for the magnitude of rate differences and their effect on the substitution process over a relevant period of time. For instance, 2 models may produce similar rate clusters but dramatically different estimates of rates for each cluster. Conversely, 2 models with differing clusterings can have similar rates between clusters and represent similar evolutionary processes.

We define the total variation metric (TVM) on two 16-state Markov processes as D(t;T1,T2)=(1/32){sum}a,bisin{AA...UU}|TFormula(t)–TFormula(t)|, where TFormula denotes the probability of substituting dinucleotide a with dinucleotide b in time t using model Ti. TVM is a time-dependent distance metric between substitution models, taking values in [0, 1]. D(0) = 0, and as t->{infty}, D approaches the TVM computed on the stationary distributions of T1 and T2. For processes with the same stationary distribution, the metric, therefore, converges to 0 as time approaches infinity.

We propose to compute D(t) at 3 characteristic times:

DL: the longest path in phylogenetic tree fitted using a model expected to produce reasonable estimates of branch lengths (e.g., REV).
DM: the median branch length in the tree derived using REV.
Dmax: the maximum total variation distance between 2 models over all possible times t ≥ 0, that is, Dmax=suptD(t;T1,T2).

Having obtained pairwise distances between the models, we can cluster them into similarity trees, for example, using neighbor joining (Saitou and Nei 1987Go).

Implementation
Recent advances in computational power make the use of complex search algorithms feasible. All sequence analyses and model fitting were performed using the HyPhy (Kosakovsky Pond et al. 2005Go) software on 32-node partitions of the IBM Blue-C computer at the University of Swansea, Wales. Given P available processors, P – 1 slave nodes were used to fit various models, and a single master node dispatched the jobs and assembled the results. The size of CHC population was set to 2P – 2 individuals. A single pass of GA required from several hours to 2 days based on the size of the alignment. HyPhy batch files needed to perform the analyses and result processing can be obtained from http://www.hyphy.org/pubs/rna-ga-models.tgz.

Sequence Alignments
Three previously published stem RNA alignments were used in this study (table 1). The alignments and trees stored in NEXUS format can be downloaded from http://www.hyphy.org/pubs/rna-ga-models.tgz.


View this table:
[in this window]
[in a new window]

 
Table 1 Stem RNA Alignments Used for Model Selection Analyses

 
Simulations
We simulated 3 sets of data replicates to evaluate the properties of GA-based model inference when the true models of evolution are known. Due to computational complexity of GA model searches, we only explored a few descriptive cases. Firstly, we fitted the nucleotide-reversible model to the deuterostome small subunit (SSU) RNA alignment (table 1) and simulated 20 parametric replicates (Simulation 1). One would expect that a GA search on a data set without secondary structure constraints would lead to the recovery of the nucleotide model of substitution. Secondly, we fitted the RNA16A to the same alignment and parametrically simulated 20 data sets (Simulation 2), to test whether a GA search is able to properly recover structural properties. Thirdly, we parametrically simulated 20 data sets using the RNA REV model fitted to the deuterostomes alignment as a test case when the true model is not well represented by any a priori matrices (Simulation 3).


    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Model Fits and Comparisons
Biological Sequence Data
We fitted a collection of predefined and GA-inferred maximum likelihood models of dinucleotide substitution to 3 data sets: deuterostome SSU RNA, mammalian placental 12S RNA, and the HIV rev response element (RRE). Model fit results are summarized in table 2 for deuterostome SSU RNA and in tables S1 and S2 (see Supplementary Material online) for Mammalian 12S and HIV-1 RRE RNA, respectively.


View this table:
[in this window]
[in a new window]

 
Table 2 Model Fits for the Deuterostome SSU RNA Data Set

 
We performed GA searches with C = 2, 3, ..., 8 rate classes for the deuterostomes data set but considered only C = 5 and C = 8 for the other 2 alignments. The most general existing structural model has 5 estimated rates, and in our experience (results not shown), fitting more than 8 rate classes is computationally unstable and does not lead to significant improvements in fit, at least for alignment sizes considered in this paper.

Based on AICc, models found by the GA were the best among all those considered, except for the smallest alignment (HIV RRE), where the GA model was second in AICc to the structural model with the fewest parameters (Muse95); however, the RRE alignment is very short, and the AICc penalty for adding even a single model parameter is rather steep. The GA model with as few as 3 rate classes was able to outperform all a priori models and the REV (or REV-1) model. In fact, most of the AICc improvement was realized with C ≤ 5 rate classes (table 2). The RNA16A structural model had the best AICc score among all a priori models but fitted poorly compared with the GA models. From comparison with the PairedREV model, we note that most of the improvement in likelihood scores did not come from fine-grain modeling of substitution rates between paired states but rather from allowing more flexibility in how paired–unpaired and unpaired–unpaired rates are described. The improvement in fit was not due to a small number of dinucleotide sites (outliers), but rather represented a contribution from many sites in the alignment, as demonstrated by the low P values obtained with Shimodaira–Hasegawa tests comparing a GA model with others (table S3, see Supplementary Material online) and the fact that a large proportion of sites (fig. S1, see Supplementary Material online) had a better likelihood under the GA model when compared with a priori structural models. The best-fitting GA models also more closely approximated the overall length of the tree (using tree length under REV as a reference) than any a priori model.

When we tabulated AICc scores explored by the GA, it became apparent that there existed a large number of candidate models, which explained the data better than any a priori models or the REV model (tables S3, fig. 1), suggesting that a priori models may miss a number of important properties of the evolutionary process but a range of GA models are able to capture those subtleties. Indeed, AICc thresholds for the inclusion in the 95% confidence set were such (5,842.49 for deuterostomes SSU, 64,676.8 for mammalian 12S, and 1,424.5 for HIV-1 RRE) that no (except the Muse95 model for the HIV RRE alignment) a priori models could be considered credible. Consequently, it appears that a priori models fail to capture important (at least from the standpoint of goodness of fit) alignment-specific evolutionary properties.


Figure 1
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— AICc scores of all the models explored by a composite (5 independent populations + follow-up) GA run with 8 rate classes on the deuterostomes data set. Scores of various a priori models are indicated by labeled arrows. Note that the GA sampled a substantial number of suboptimal models, which nonetheless outperform all a priori models.

 
We found that permitting instantaneous dinucleotide changes gave a substantial improvement in AICc scores. However, the GA-1 model with 5 rate classes outperformed the RNA16A model in all 3 cases, indicating that the role of doublet substitutions among paired states, modeled by RNA16A, may be minor in comparison with a few doublet substitutions between paired and unpaired states and within unpaired states (see fig. 2 and figs. S2 and S3 in see Supplementary Material online). The overall AICc hierarchy was largely unchanged by the inclusion of site-to-site rate variation because adding the ß – {Gamma} rate variation (Kosakovsky Pond and Frost 2005bGo) among sites resulted in similar improvements in fit among all models. Site-to-site rate distribution parameter estimates were fairly stable among all good-fitting models, implying that site-to-site rate variation was not strongly correlated with substitution preferences.


Figure 2
View larger version (56K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— A graphical representation of fitted rate matrices for the deuterostomes data set. The shading of each cell reflects the relative magnitude of the appropriate rate in comparison with the maximal rate in the matrix; solid black corresponds to the value of 1, whereas solid white matches the value of 0; all matrices have been scaled to yield one expected substitution per dinucleotide site per unit time. All rates for substitutions involving a single nucleotide are marked by a circle. Rate matrices are partitioned into 4 quadrants: substitutions between paired (including the weakly paired GU, UG) states (top left), substitutions that gain a pairing (bottom left), substitutions that lose a pairing (top right), and substitutions between unpaired states (bottom right).

 
Rate Matrix Comparison
We compared rate matrices using both the Rand statistic and the TVM. Nearly all pairs of models that clustered using the Rand statistic (table S4, see Supplementary Material online) showed significant clustering patterns using TVM, even when applied to different data sets. This finding indicates that the GA models are driven to recover similar rate clustering structures, some of which are represented in the structural RNA16A model. However, for the comparisons involving general GA models, the nontrivial structure induced on the rates between double-substituted states reduced the Rand statistic; hence, much of the alignment-specific clustering may be due to these rates.

GA-derived models closely resembled the REV model, when compared using TVM for all 3 data sets, at the 3 characteristic times (fig. 3). Indeed, no a priori model was more similar to REV (REV-1) than the GA (GA-1) model. This finding reinforces the intuition that GA models are attempting to approximate the most general model with few parameters by recovering the most important substitution patterns. The clustering pattern among other models varied a great deal, both from data set to data set and between the 3 timescales.


Figure 3
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Models clustered by neighbor joining using the TVM based on 3 characteristic times, which are shown when appropriate. For the maximum TVM tree, the range (and median) of times (different for each pair of models) is shown. Long path tree nodes are labeled with model names and their AICc scores. GAs were run with C = 5 rate categories, except the general GA for the mammalian data set, which had C = 8 rate categories.

 
Box-plots of best-fitting model rate estimates (fig. 4) demonstrate that GA models infer higher rates for substitutions that restore canonical pairings and for those that involve a single nucleotide, albeit there is little difference between transitions and transversions. For the larger mammalian 12S RNA data set, GA 8 and RNA REV (results not shown) infer a very similar pattern of rates, whereas for HIV RRE, there is more fluctuation, possibly reflective of the trend to "fit the noise" with overly parameter-rich models. Model-averaged rate estimates are very similar to those derived from the best-fitting models (results not shown). It is important to note that structural alignment algorithms, commonly used to prepare stem RNA data for phylogenetic analyses, can introduce a confounding bias toward pair preservation or recovery. However, such bias, if present, would clearly affect all those models of sequence evolution that treat the sequence alignment as fixed.


Figure 4
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Correlation of substitution rates inferred by the various models and substitution types in various stem RNA data sets. All rate matrices have been scaled so that the expected number of substitutions per dinucleotide site per unit time is one. A "weak" loss or gain means that an unpaired or Watson–Crick state is changed to or from GU or UG, whereas "strong" losses or gains refer to substitutions between unpaired and Watson–Crick states. For presentation purposes, the rates are square root transformed due to the overabundance of low rates.

 
Simulated Sequence Data
GA-based models perform admirably well when the true evolutionary process is known. We fitted 8 models (equal input, equal input-1, Muse95, NREV16, RNA16A, REV-1, REV, and GA with 5 rate classes) to each simulated replicate. For computational expediency, a single GA run was performed in place of a more rigorous multiple-run procedure. A possible side effect of this is that the algorithm may have failed to converge in some instances, possibly biasing the results against GA-derived models.

Simulation 1
This simulation examined the fit of dinucleotide models to data generated under a nucleotide model with the same substitutional biases as the deuterostome SSU data set but without any paired states. Not surprisingly, the 2 models that can collapse to nucleotide-level evolution (Muse95 and NREV16) resembled the nucleotide model used for the simulation the most (table S5, see Supplementary Material online), as measured by TVM (regardless of the timescale), followed by general reversible models, and then the GA-derived model. However, the GA had the best AICc in 15 of the 20 cases. More importantly (fig. S4, see Supplementary Material online), GA models correctly inferred the lack of correlation between pairing gain/loss and the substitution rates and estimated the rate of doublet substitutions to be effectively zero. Lastly, appropriately scaled substitution rates taken from GA fits closely resembled the true rates (results not shown).

Simulation 2
This simulation used data generated under the structural RNA16A model, using parameter estimates derived from the deuterostome SSU data set. The GA model had the best AICc in all 20 cases (table S6, see Supplementary Material online). Whereas the fitted RNA16A model bore the closest resemblance to the generating model, when measured by TVM (regardless of the timescale used), the GA model was able to approximate the substitution process nearly as well as a much more parameter-rich REV (or REV-1) model. In 9 of the 20 cases, GA-derived transition matrices were second closest to the true model, using the maximum tree path timescale (the results were similar for the other 2 timescales).

Rate estimates from a GA model were similar to those used to simulate the data (fig. S5, see Supplementary Material online), although they were "smoothed" (smaller generating rate estimates were biased upward and larger generating rates were biased downward). This is not surprising because a slight misclassification of rate classes (e.g., incorrectly placing a few high rates into a lower rate class) will have just this effect. For larger, more informative data sets, one can expect this effect to decrease. The average Rand statistic between the true model and GA fits was 0.713 with P < 0.001 (permutation test), showing significant similarity between rate structures in the generating and the inferred models. On average, 64% of substitution rates were allocated to the correct rate class.

Simulation 3
This simulation employed data simulated under a fully reversible model using parameter estimates obtained from the deuterostomes SSU data set. The GA model had the best AICc score in all 20 cases and was the second closest (following the "correct" REV model) in terms of the TVM (table S7, see Supplementary Material online). The generating REV model had 66 (out of 120) distinct substitution rates. Clearly, any model with a small number of rates (e.g., RNA16A or GA 5) cannot be expected to approximate all individual rates well, especially for a fairly small alignment, but rather to approximate "groups" of rates (e.g., low, intermediate, high) and their relative proportions among all 120 rates. As seen in table S8 (see Supplementary Material online), the distribution of GA-derived rates is more similar to the distribution used to generate the data than those inferred using RNA16A. However, this difference is minor in terms of TVM performance.

These simulations point to several desirable properties of our GA approach. Firstly, as measured by AICc, our GA models are able to explain the data better than incorrect a priori models and frequently better than the model whose structure was used to simulate the data, when the "true" model is parameter rich. Secondly, rate matrices estimated by the GAs are less distant from the true model than incorrect a priori models and perform similarly to the much more parameter-rich REV model. Thirdly, the best model and model-averaged estimates of substitution rates recover the distribution of rates used to generate the data well.


    Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
In recent years, researchers have come to realize that the choice of a particular Markov process used to model sequence evolution can and should be justified (e.g., Huelsenbeck et al. 2004Go; Posada and Buckley 2004Go; Kosakovsky Pond and Frost 2005aGo; Drummond et al. 2006Go). At best, underfitting the data through the use of overly simplistic substitution models may miss out on biologically interesting phenomena and at worst can lead to incorrect inferences, for example, regarding phylogenetic relationships (e.g., Hue et al. 2005Go). Increasing availability of modern high-performance computers allows the development and testing of state-of-the-art methods. We have proposed a general mechanism for placing any a priori Markov model of evolution in the context of a combinatorially large set of competing evolutionary models. Our procedure is not biased by many prior assumptions and can infer desirable model properties based on well-established multimodel inference and model selection literature. A model selection procedure can also assign a level of support to each model, which can be used to average over models and overcome problems of overfitting the data that occur when many models are considered.

When used to select models of dinucleotide substitution applied for stem RNA, the GA procedure led to the consistent recovery of important biological properties from the data without making a priori assumptions about these properties. GA searches favored models that penalize evolution away from paired states and reward reconstitution of canonical (or weak) pairings and those substitutions that involve a single nucleotide. This finding validates most of the assumptions in existing structural RNA models and, conversely, establishes that the GA procedure generates results that make good evolutionary sense. However, it also appears that there exist data set–specific substitution rates, which are misclassified by the simpler models, because our GA models consistently fit biological data sets better than existing state-of-the-art models. Simulation results suggest that GA models can adequately represent vastly disparate evolutionary processes with as few as 5 rate parameters, mitigating the issue of data overfitting. Our models are computationally feasible and can be readily applied to biological data sets using small computer clusters.

Searching large model spaces can also be performed using MCMC approaches. For example, Huelsenbeck et al. (2004)Go used a reversible jump MCMC sampler to explore nucleotide model space. MCMC approaches have the advantage that, if implemented correctly, models are sampled in proportion to their likelihood or posterior probability. However, this comes at a steep computational cost and may be unfeasible for substitution models consisting of a large number of states and may give results that are similar to those based on averaging over a set of models identified using a GA search. As computational power becomes more affordable, more research comparing GA approaches with MCMC approaches will be possible.

Unrestricted model selection and rate clustering open new avenues for quantifying and comparing evolutionary processes. Inferred substitution models can be compared qualitatively (clustering patterns) and quantitatively (total variation and other distance metrics). This enables a direct comparison of the structure of competing substitution models on the same data set or of best-fitting models between different, for instance nonhomologous, data sets, thus paving the way for metaevolutionary analyses, in which organisms can be compared in terms of the evolutionary processes that have acted on them, rather than simply based on sequence similarity.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Supplementary tables and figures are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
We thank Art Poon, Selene Zarate, and Andy Leigh Brown for many helpful suggestions during the preparation of the manuscript. This research was supported in part by the National Institutes of Health (AI43638, AI47745, and AI57167), the University of California University-wide AIDS Research Program (grant number IS02-SD-701), and a University of California, San Diego, Center for AIDS Research/National Institute of Allergy and Infectious Diseases Developmental Awards to S.D.W.F. and S.L.K.P (AI36214).


    Footnotes
 
Robin Bush, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 

    In Abramowitz M and Stegun I (Eds.). Handbook of mathematical functions with formulas, graphs, and mathematical tables, 9th printing (1972) (Dover, New York).

    Adachi J and Hasegawa M. (1996) Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 42:4459–468.[ISI][Medline]

    Adachi J, Waddell PJ, Martin W, Hasegawa M. (2000) Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol 50:4348–358.[ISI][Medline]

    Akaike H. (1974) A new look at the statistical model identification. IEEE Trans Autom Control 119:716–723.[CrossRef]

    Akaike H. (1983) Information measures and model selection. Int Stat Inst 44:139–149.

    Alfaro ME and Huelsenbeck JP. (2006) Comparative performance of Bayesian and AIC-based measures of phylogenetic model uncertainty. Syst Biol 55:189–96.[CrossRef][ISI][Medline]

    Burnham K and Anderson D. (2003) Model selection and multimodel inference 2nd ed. (Springer, New York).

    Dimmic MW, Rest JS, Mindell DP, Goldstein RA. (2002) rtrev: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny. J Mol Evol 55:165–73.[CrossRef][ISI][Medline]

    Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. (2006) Relaxed phylogenetics and dating with confidence. PLoS Biol 4:5e88.[CrossRef][Medline]

    Eshelman LJ. (1991) The CHC adaptive search algorithm: how to do safe search when engaging in nontraditional genetic recombination. In Spatz BM (Ed.). Foundations of genetic algorithms(Morgan Kaufmann, San Mateo (CA)) pp. 265–283.

    Felsenstein J. (1981) Evolutionary trees from DNA-sequences—a maximum-likelihood approach. J Mol Evol 17:368–376.[CrossRef][ISI][Medline]

    Hasegawa M, Kishino H, Yano T. (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Mol Biol Evol 21:160–174.

    Hue S, Clewley JP, Cane PA, Pillay D. (2005) Investigation of HIV-1 transmission events by phylogenetic methods: requirement for scientific rigour. AIDS 19:4449–450.[Medline]

    Huelsenbeck JP, Larget B, Alfaro ME. (2004) Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol Biol Evol 21:61123–1133.[Abstract/Free Full Text]

    Jukes TH and Cantor CR. (1969) Evolution of protein molecules. In Munro HM (Ed.). Mammalian protein metabolism(Academic Press, New York) pp. 21–132.

    Kolaczkowski B and Thornton JW. (2004) Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431:980–984.[CrossRef][Medline]

    Kosakovsky Pond SL and Frost SD. (2005a) A genetic algorithm approach to detecting lineage-specific variation in selection pressure. Mol Biol Evol 22:3478–485.[Abstract/Free Full Text]

    Kosakovsky Pond SL and Frost SD. (2005b) A simple hierarchical approach to modeling distributions of substitution rates. Mol Biol Evol 22:2223–234.[Abstract/Free Full Text]

    Kosakovsky Pond SLK, Frost SDW, Muse SV. (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics 21:5676–679.[Abstract/Free Full Text]

    Le S-Y, Zhang K, Maizel J, Jacob V. (2002) RNA molecules with structure dependent functions are uniquely folded. Nucleic Acids Res 30:163574–3582.[Abstract/Free Full Text]

    Moler C and Loan CV. (1978) Nineteen dubious ways to compute the exponential of a matrix. SIAM Rev 20:801–836.[CrossRef]

    Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, O'Brien SJ. (2001) Molecular phylogenetics and the origins of placental mammals. Nature 409:6820614–618.[CrossRef][Medline]

    Muse SV. (1995) Evolutionary analyses of DNA sequences subject to constraints on secondary structure. Genetics 139:1429–1439.[Abstract]

    Notredame C, O'Brien E, Higgins D. (1997) Raga: RNA sequence alignment by genetic algorithm. Nucleic Acids Res 25:224570–4580.[Abstract/Free Full Text]

    Posada D and Buckley TR. (2004) Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst Biol 53:5793–808.[CrossRef][ISI][Medline]

    Rand W. (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850.[CrossRef][ISI]

    Rzhetsky A. (1995) Estimating substitution rates in ribosomal RNA genes. Genetics 141:2771–783.[Abstract]

    Saitou N and Nei M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425.[Abstract]

    Savill NJ, Hoyle DC, Higgs PG. (2001) RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum-likelihood methods. Genetics 157:1399–411.[Abstract/Free Full Text]

    Schöniger M and von Haeseler A. (1994) A stochastic model for the evolution of autocorrelated DNA sequences. Mol Phylogenet Evol 3:3240–247.[CrossRef][Medline]

    Schöniger M and von Haeseler A. (1999) Toward assigning helical regions in alignments of ribosomal RNA and testing the appropriateness of evolutionary models. J Mol Evol 49:5691–698.[CrossRef][ISI][Medline]

    Schwarz G. (1978) Estimating the dimension of a model. Ann Stat 6:461–464.

    Shapiro BA and Navetta J. (1994) A massively parallel genetic algorithm for RNA secondary structure prediction. J Supercomput 8:3195–207.[CrossRef]

    Shimodaira H and Hasegawa M. (1999) Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol 16:1114–1116.[ISI]

    Sugiura N. (1978) Further analysis of the data by Akaike's information criterion and the finite corrections. Commun Stat Theory Methods A7:13–26.[CrossRef]

    Tamura K and Nei M. (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial-DNA in humans and chimpanzees. Mol Biol Evol 10:512–526.[Abstract]

    Telford MJ, Wise MJ, Gowri-Shankar V. (2005) Consideration of RNA secondary structure significantly improves likelihood-based estimates of phylogeny: examples from the Bilateria. Mol Biol Evol 22:41129–1136.[Abstract/Free Full Text]

    Tillier REM. (1994) Maximum likelihood with multiparameter models of substitution. J Mol Evol 39:4409–417.[CrossRef]

    Tillier REM and Collins RA. (1995) Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol Biol Evol 12:17–15.[Medline]

    Whelan S and Goldman N. (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18:5691–699.[Abstract/Free Full Text]

    Whitley D. (2001) An overview of evolutionary algorithms: practical issues and common pitfalls. J Inf Softw Technol 43:817–831.[CrossRef]

Accepted for publication October 2, 2006.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow