Skip Navigation


MBE Advance Access originally published online on April 29, 2007
Molecular Biology and Evolution 2007 24(8):1667-1677; doi:10.1093/molbev/msm085
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
24/8/1667    most recent
msm085v2
msm085v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Thorne, J. L.
Right arrow Articles by Kishino, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Thorne, J. L.
Right arrow Articles by Kishino, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Population Genetics Without Intraspecific Data

Jeffrey L. Thorne*,{dagger}, Sang Chul Choi{dagger}, Jiaye Yu{ddagger}, Paul G. Higgs§ and Hirohisa Kishino||

* Wissenschaftskolleg zu Berlin, Institute for Advanced Study, Berlin, Germany
{dagger} Bioinformatics Research Center, North Carolina State University
{ddagger} Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
§ Department of Physics and Astronomy, McMaster University, Hamilton, Ontario, Canada
|| Laboratory of Biometrics, University of Tokyo, Tokyo, Japan

E-mail: thorne{at}statgen.ncsu.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Discussion
 Acknowledgements
 References
 
A central goal of computational biology is the prediction of phenotype from DNA and protein sequence data. Recent models of sequence change use in silico prediction systems to incorporate the effects of phenotype on evolutionary rates. These models have been designed for analyzing sequence data from different species and have been accompanied by statistical techniques for estimating model parameters when the incorporation of phenotype induces dependent change among sequence positions. A difficulty with these efforts to link phenotype and interspecific evolution is that evolution occurs within populations, and parameters of interspecific models should have population genetic interpretations. We show, with two examples, how population genetic interpretations can be assigned to evolutionary models. The first example considers the impact of RNA secondary structure on sequence change, and the second reflects the tendency for protein tertiary structure to influence nonsynonymous substitution rates. We argue that statistical fit to data should not be the sole criterion for assessing models of sequence change. A good interspecific model should also yield a clear and biologically plausible population genetic interpretation.

Key Words: dependence among sites • fixation probability • protein structure • RNA structure


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Discussion
 Acknowledgements
 References
 
The impact of phenotype on genotypic change over time is a central feature of evolution, and every effort should be made to assess it. Population genetics is rich with theory that can be exploited for identifying and characterizing natural selection. When a population has segregating genetic variation for a phenotypic trait of interest, the nature of natural selection affecting this trait can potentially be quantified.

Although the footprints of natural selection can linger in a genome even after an event such as a selective sweep has finished, it is difficult to get a refined picture of how selection affects a trait if there is little or no genetic variation for the trait. Members of the same species tend to have very similar genomes, and this similarity is sometimes a serious impediment for population genetic approaches. One perspective is that genetic variation within populations provides a clear picture of the flotsam and jetsam of evolution but sheds little light on biologically important events because these events are rare and because population variation represents only a very small slice of evolutionary time. This perspective is probably too extreme, but it may not be totally without merit.

A key allure of interspecific genome comparison is the increased chance that phenotypically important genetic variation will have accrued. Unfortunately, interspecific variation has its own difficulties. Although specialized statistical tools are widely employed in phylogenetics, these procedures usually lack an explicit connection to population genetic theory. This missing connection is problematic because evolution occurs within populations, and the deepest understanding of evolution therefore occurs when the evolutionary process is framed with respect to population genetics.

Ideally, the population genetic implications of interspecific variation could be inferred. Halpern and Bruno (1998)Go did pioneering work toward this end by attaching population genetic interpretations to models of sequence evolution that have been developed for interspecific studies (see also Nielsen and Yang 2003Go; Berg, Willmann, and Lässig 2004; Knudsen and Miyamoto 2005Go; Mustonen and Lässig 2005; Sella and Hirsh 2005Go). In doing so, they managed to estimate population genetic parameters without requiring intraspecific genetic variation.

The idea underlying the Halpern and Bruno technique is that the rate at which one sequence changes to another is the product of two factors: the rate at which the necessary mutation occurs and the probability that the new mutation is fixed in the population. For some phylogenetic models of sequence change, each parameter can be interpreted as influencing either only the mutation rate or only natural selection. If the rate of a sequence change can be separated into factors corresponding to mutation and to natural selection, then the factor associated with natural selection represents the probability of fixation. More accurately, rates in interspecific models are usually not scaled to changes per generation, and this means that the two factors are proportional to the mutation rate and proportional to the fixation probability. These proportionality constants are shared among possible sequence changes and are not serious obstacles.

After employing standard phylogenetic techniques to estimate parameters of a model for sequence change (see Felsenstein 2004Go), Halpern and Bruno leveraged this factorization of interspecific rates into mutation and natural selection terms. There is a large and relevant body of literature for expressing fixation probabilities in terms of population genetic parameters such as population size and relative fitness. Halpern and Bruno equated the fixation probabilities estimated from interspecific models and the population genetic formulae for fixation probabilities. The product of effective population size and difference in relative allele fitness could thereby be estimated from interspecific models of sequence change.

Halpern and Bruno introduced their approach to better interpret parameter estimates obtained from a codon-level interspecific model of DNA sequence change that permits variation of preferred amino acids among codon positions. The model they studied has all codons in a protein-coding sequence independently evolve. Because of the evolutionary independence, the protein-coding sequences with the highest fitness would be those having the most preferred amino acid at each codon position. This is an overly simplistic situation. Residues in actual proteins physically interact and they can covary in evolution. The fitness landscape induced by Halpern and Bruno's model is both biologically implausible and relatively uninteresting because of the evolutionary independence. One mostly unresolved and very important issue in evolution is the nature of adaptive landscapes and the movement of populations on these landscapes. Progress involving this topic has been slow, largely because mappings of genotype to phenotype and mappings of phenotype to fitness remain unresolved. Elucidation of the adaptive landscape will be facilitated by statistical inference via more realistic models of sequence change that incorporate effects of phenotype on genotypic (i.e., DNA sequence) change.

Models that do not assume that each DNA or protein molecule is a sequence of independently evolving units have recently been receiving much attention. Some of these models feature evolutionary dependence among positions that occurs when mutation rates are a function of not just the nucleotide types involved in the change but also the sequence at neighboring positions (e.g., Hwang and Green 2004; Siepel and Haussler 2004GoGo; Hobolth et al. 2006). Other models focus on evolutionary dependence among sequence positions that is attributable to phenotypes such as protein tertiary structure (Robinson et al. 2003; Rodrigue et al. 2005; Rodrigue, Lartillot, and Philippe 2006Go) and RNA secondary structure (Yu and Thorne 2006Go). This latter category of models has the potential to yield biologically plausible fitness landscapes.

In this article, we apply the Halpern and Bruno technique for connecting population genetics and phylogenetics. We focus on probabilistic models where rates of molecular evolution are partially determined by the phenotypic consequences of each particular change. These models lack the implausible assumption that a sequence consists of independently evolving nucleotides, amino acids, or codons. For each possible nucleotide substitution, the product of effective population size (N) and fitness difference between alleles (s) can be estimated. The value of the product affects how much influence the competing forces of genetic drift and natural selection are expected to have on the fate of a new mutation (e.g., Kimura 1962Go). Using models for protein–coding DNA that have independent evolution among codons, Nielsen and Yang (2003)Go estimated this product for each possible nonsynonymous change, and they then examined the resulting distribution. We follow their example, but with models that incorporate effects of RNA secondary structure and protein tertiary structure. Some features of our inferred distributions are biologically untenable. We argue that examining these distributions can be helpful for uncovering shortcomings of interspecific models of sequence change. The distributions of Ns estimates among possible changes to extant sequences can also be compared to the distribution of Ns estimates among changes that are actually inferred to have occurred during evolution. We perform and discuss this contrast for a 5S rRNA data set.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Discussion
 Acknowledgements
 References
 
Population Genetic Assumptions
Imagine a diploid organism with a constant effective population size N. Although the actual and effective population sizes could be substantially different, we ignore this possibility here. Mutations are considered to be uncommon enough that the fixation probability of a new variant at one sequence site is unlikely to be affected by polymorphisms at other sites.

We assume approximately multiplicative relative fitnesses of alleles/sequences. Therefore, a genotype consisting of sequences i and j would have relative fitness Formula where wi and wj are the relative fitnesses of the sequences i and j. We measure wj for all sequences j with reference to wi and will set wi = 1. We define sj = wj wi = wj – 1.

Let Zij be the event that a new mutant allele type j is eventually fixed when the initial condition is that the other 2N 1 alleles are type i. When sj2 is close enough to 0 to be neglected, diffusion theory approximates the probability of Zij as

Formula (1)
(Kimura 1962)Go. Fixation in the opposite direction can also be considered. Let Zji be the event that a new mutant allele of type i is eventually fixed when the initial condition is that the other 2N – 1 alleles are type j. Because terms of order sj2 and higher can be neglected,

Formula (2)

RNA Secondary Structure Model
Some interspecific models of sequence evolution that are useful for connecting genotype and phenotype assume the instantaneous rate of change Rij from a sequence i to a sequence j is 0 if i and j differ at more than one position. This amounts to restricting the mutation process to changing one position at a time. Among models with this restriction, we first examine the case where

Formula (3)

The purpose of the parameter u is to scale the rates to the desired measurement units. At the sole position where i and j differ, sequence j has nucleotide type h isin {A, C, G, T}. The {pi}h term represents a relative rate of mutations to nucleotide type h. In the absence of natural selection, {pi}h would be the expected proportion of nucleotide type h in a sequence ({pi}A + {pi}C + {pi}G + {pi}T = 1). The parameter {kappa} differentiates between transitions and transversions. This parameterization does not reflect tendencies for mutation rates to be context-dependent or to even be affected by the nucleotide type being substituted, but more realistic treatments of the mutation process could be added.

The E(i) and E(j) terms represent values for sequences i and j of some phenotypic trait and f summarizes how much the trait affects sequence evolution. If f = 0, the phenotype does not influence sequence change. If f > 0, evolution favors sequences associated with low phenotype scores. If f < 0, evolution favors sequences associated with high phenotype scores.

This model has been employed to study how sequence change is affected by RNA secondary structure (Yu and Thorne 2006Go). In this case, E(i) and E(j) represent free energies of sequences i and j as approximated by standard computational biology techniques (e.g., Zuker and Stiegler 1981; Zuker 1989GoGo; Hofacker et al. 1994). For types of RNA molecules such as rRNA and tRNA, stable secondary structure is advantageous and selection will tend to favor negative and low free energies. The parameter f relates free energy to patterns of sequence change. If f exceeds 0 (the biologically plausible scenario), evolution favors lower free energies. If f = 0, structure has no impact on change.

Population Genetic Interpretation of the Interspecific Model
Berg, Willmann, and Lässig (2004) showed how population genetic interpretations can be attached to phylogenetic models with rate structures such as that described by equation (3). Their mathematical treatment is both more thorough and more general than that presented here. Our less rigorous treatment is included here because we also want to consider the population genetic implications of interspecific models that slightly differ from the parameterization of equation (3), and the parameterization of equation (3) is a good starting point.

For the rate Rij of equation (3), the term associated with natural selection is ef(E(i)–E(j)). The fixation probability P(Zij) should be this term multiplied by some constant. When E(i) = E(j), the change from i to j is neutral. For neutral substitutions in a diploid population, the fixation probability is Formula . Because ef(E(i)–E(j))= 1 when E(i) = E(j), the constant should be Formula . Therefore, the interspecific model corresponds to a fixation probability of Formula when there is a new mutation from i to j. By applying similar arguments to the rate Rji, the fixation probability if there was a new mutation from j to i would be Formula .

Because we now have expressions for P(Zij) and P(Zji) based on the interspecific rates as well as based on diffusion theory, we can set the ratio of P(Zij)/P(Zji) from interspecific rates equal to the ratio from diffusion theory,

Formula (4)
The left side of equation (4) can be written e2f(E(i)–E(j)) whereas some algebra reduces the right side to e2 (2N–1)sj. This means

Formula (5)
Because diffusion theory requires N>>1,

Formula (6)
Via equation (6), parameter estimates of f from interspecific models can convert the phenotypic difference E(i) – E(j) into a difference of fitnesses scaled by twice the population size.

With the above reasoning, Formula and Formula represent fixation probabilities specified by the interspecific model. Because probabilities cannot exceed 1,

Formula (7)
and

Formula (8)
These inequalities imply |2Nsj| < log 2N, and they represent limitations of the interspecific rate parameterization in equation (3). The more direct parameterization of interspecific rates that we describe later does not share these limitations.

At equilibrium and when mutation is rare enough that populations are usually monomorphic, the probability that a population is monomorphic for sequence j can be written in terms of population size and relative fitnesses using equation (6). With the rates of equation (3), the stationary distribution of sequence j with length L is

Formula (9)
(Yu and Thorne 2006Go; see also Robinson et al. 2003; Berg, Willmann, and Lässig 2004). Here, {theta} = {{kappa}, f, {pi}A, {pi}C, {pi}G, {pi}T} is the vector representing all parameters. The jm and kn terms, respectively, represent the nucleotide types occupying position m of sequence j and position n of sequence k. The summation in the denominator is over all possible sequences with length L. Invoking equation (6), the stationary distribution can be rewritten as

Formula (10)
or, in terms of the relative fitness wj of sequence j, as

Formula (11)

Equations 9 and 10 illustrate the balance between mutation and natural selection. When f = 0, all possible sequences have equal fitness and mutation governs which sequences are most probable. When f is extremely positive, natural selection dominates mutation. In the RNA case, this means that the sequence with lowest free energy of its secondary structure will be the sequence with highest stationary probability.

A More Direct Connection Between Interspecific and Intraspecific Rates
The preceding sections illustrate how the model of Yu and Thorne (2006)Go can be assigned a population genetic interpretation. While it can be illuminating to make such an interpretation, the restrictions detailed in equations (7) and (8) represent weaknesses of the interpretation. A closer linkage to intraspecific variation can be established when the interspecific model is intentionally constructed to reflect population genetic considerations.

We again choose a reference sequence i with relative fitness 1 and again consider any other sequence j with relative fitness 1 + sj, but we now have a parameter a such that sj = a(E(i) E(j)) for all j. A restriction on a is that it cannot take values that make sj < –1. With the new model, rates of interspecific change will again be a product of mutation rate and fixation probability. Specifically,

Formula (12)
The 2N factors above are added to emphasize that any of the 2N copies of allele i might give rise to a new mutant allele j. We refer to this model as the "Direct RNA" model to emphasize that it directly integrates population genetics.

By simulation, Sella and Hirsh (2005)Go find that the approximation

Formula (13)
is more accurate than the conventional diffusion theory approximation of equation (1). When sj is close to 0, the two approximations are very similar. Using the rates of equation (12) with the Sella-Hirsh approximation yields a time reversible evolutionary model with stationary distribution

Formula (14)

Although adopting the Sella-Hirsh approximation with the rates of equation (12) would technically allow separate estimation of N and a (and therefore separate estimation of N and s), N and a will be difficult to separate in practice unless natural selection is very strong. Otherwise, most information in a data set will pertain to the product Na. Our implementation of the Direct RNA model assumes selective differences are small enough that Formula . This approximation yields evolutionary rates

Formula (15)
and an approximate stationary distribution

Formula (16)
This implementation allows only the product Na to be inferred.

The approximate stationary distribution of the Yu-Thorne model (eq. 9) becomes the implemented stationary distribution of the Direct RNA model (eq. 16) when f is replaced by 2Na. Although these distributions can be used to estimate f and 2Na from data sets consisting of a single sequence, the shared stationary distribution form means that single-sequence data sets have no information concerning the relative statistical fit of the two models. Another consequence of the shared form is that the posterior distribution of f will be identical to that of 2Na for a single-sequence data set, assuming the prior distributions for these parameters are identical. Multiple-sequence data sets contain information about f and 2Na beyond that in the stationary distributions, and therefore these data sets do permit statistical comparison of the Yu-Thorne and Direct RNA models.

Protein Tertiary Structure as Phenotype: The Robinson model
A variation of the rate parameterization in equation (3) was employed by Robinson et al. (2003) to study the effects of protein tertiary structure on the evolution of protein-coding genes. In this case, the tertiary structure is assumed to be experimentally determined and invariant for all proteins being studied. Sequences that are compatible with this known tertiary structure will tend to be favored by evolution. Robinson et al. (2003) and we measure sequence-structure compatibility with a scoring system originally developed for fold recognition of globular proteins (Jones, Taylor, and Thornton 1992; Jones 1999GoGo). The system produces two scores that assess compatibility of a translated DNA sequence i that is folded into a known tertiary structure. One score, E1(i), reflects solvent accessibility. This score is low if the hydrophilic amino acid residues in the translated sequence i are mostly on the surface of the folded protein and if the hydrophobic residues do not tend to be in environments with high solvent accessibility. The other score, E2(i), reflects the pairwise amino acid interactions that are induced by folding the translated sequence i into the known tertiary structure. This score is low if the induced amino acid interactions tend to be favorable, and it is high if the interactions are unfavorable. The parameters f1 and f2, respectively, represent the effects of solvent accessibility and pairwise interactions on sequence change. The parameters are both zero if structure and evolution are independent. The parameters are both positive if sequence-structure compatibility is favored during evolution. The rates are modeled as

Formula (17)
The {omega} parameter is intended to reflect forces on sequence change that are due to natural selection but that are external to the protein tertiary structure. The value of this parameter represents whether the external environment favors ({omega} > 1) or disfavors ({omega} < 1) new alleles. Unlike the f1 and f2 parameters, the contribution by {omega} to interspecific rates is shared by all new mutations but does not otherwise depend on the phenotype of the allele.

We start with the special case of the Robinson et al. (2003) model where the value of {omega} is 1. By following calculations resembling those that produced equation (6), we get

Formula (18)
Estimates of 2Nsj can be obtained when {omega} = 1 in almost the same way as with the Yu–Thorne RNA model.

When {omega} is not fixed to 1, algebraic expressions for 2Nsj are not simple because the relative fitnesses of sequences i and j depend on whether i is a new mutation from j or whether j is a new mutation from i. In other words, {omega} != 1 means that there is a fitness effect by virtue of being a new mutation regardless of the exact sequence resulting from the new mutation. To stress this, we write sj(i) to indicate the relative fitness of sequence j minus the relative fitness of sequence i when it is sequence j that is the new mutation.

The influence of {omega} on sj(i) and si(j) necessitates a slightly different strategy for linking population genetics and interspecific rates. One such strategy is to note that the fixation probabilities from the interspecific model and population genetics should be about the same

Formula (19)
and then using Formula for sj(i) close to 0. This leads to

Formula (20)
Via the above equation, estimated values of {omega}, f1, and f2 can be combined to yield numerical estimates of 2Nsj(i). We obtain satisfactory numerical estimates with the "uniroot function" of the R software package (Brent 1973; R Development Core Team 2006GoGo).

The Direct Protein Model
Features of the Robinson model of protein evolution and the Direct RNA model can be combined to form the "Direct Protein" model. To account for solvent accessibility and pairwise interactions, the Direct Protein model, respectively, has parameters a1 and a2. If allele i has relative fitness 1, allele j has relative fitness 1 + a1(E1(i) – E1(j)) + a2(E2(i) – E2(j)). Rates for this hybrid model are:

Formula (21)
As with the Direct RNA model, the Sella-Hirsh approximation for P(Zij) (see eq. 13) yields a time reversible evolutionary model with a stationary distribution that is similar in form to equation (14). Paralleling our implementation of the Direct RNA model, we again make the assumption that natural selection is weak, so that our Direct Protein model implementation has a rate structure resembling equation (15). This means only the products Na1 and Na2 can be estimated. Because the formulae and details of the Direct Protein model parallel those of the Direct RNA model, they are not included here. The Direct Protein model lacks a parameter that plays the role of the {omega} parameter in the Robinson model. A parameter a3 that is analogous to {omega} could be added by having the relative fitness of a new mutant allele j be 1 + a1(E1(i) – E1(j)) + a2(E2(i) – E2(j)) + a3, but we have not explored this possibility.

RNA Analysis
We previously studied 5S rRNA evolution with the Yu-Thorne model (Yu and Thorne 2006Go) and with sequences from the 5S ribosomal RNA database (Szymanski et al. 2002). To approximate free energies of 5S rRNA sequences, slight modifications to the Vienna RNA software package (Hofacker et al. 1994) were made as described in Yu and Thorne (2006Go). As part of that study, a 5S Zea mays rRNA sequence was used to estimate the parameter f. For a prior distribution for f that was uniform between –1 and 1, the Z. mays sequence yielded a posterior mean of about 0.4953 and a 95% credibility interval from 0.3480 to 0.6569. As noted above, the shared stationary distribution forms of equations (9) and (16)Go mean that the Z. mays sequence also yields a posterior mean of about 0.4953 for 2Na when the prior is uniform from –1 to 1.

The Z. mays and the other seven 5S rRNA sequences analyzed in the earlier study are each 119 residues in length. At each of the 119 positions, the nucleotide type occupying the position could potentially mutate into one of the other three types. Referring to the Z. mays sequence as sequence i, there are 357 = 3 x 119 neighboring sequences j for which the rate Rij of equation (3) is non-zero. For each of these 357 possible sequences j, the posterior mean estimate Formula can be multiplied by E(i) – E(j), and then equation (6) can yield an estimate of 2Nsj for this possible change. The mean, median, and standard deviation of the 357 2Nsj estimates are, respectively, –0.94, –0.54, and 1.21. Of the 357 changes, 112 do not affect the approximate free energy, and therefore these changes yield 2Nsj estimates of 0. The histogram consisting of the 245 non–zero 2Nsj estimates is shown in figure 1. We elected to indicate the 112 estimates that are exactly 0 on figure 1 separately because we feel that interpretation of the histogram would otherwise be greatly affected by the details of how it is drawn. The choice of the Z. mays sequence was arbitrary. Similar histograms were obtained when possible changes to other of the extant 5S rRNA sequences were considered (data not shown). We emphasize that the resemblance between Equations 9 and 16Go means that an identical distribution of 2Nsj estimates is produced by the Direct RNA model.


Figure 1
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Distribution of 2Ns estimates among possible changes to Zea mays 5S rRNA sequence. The 112 possible changes that yield a 2Nsj estimate of 0 are indicated with the arrow but are not otherwise represented in the histogram. Both the Yu-Thorne RNA model and the Direct RNA model yield this distribution of 2Ns estimates.

 
The Markov chain Monte Carlo (MCMC) procedure of Yu and Thorne (2006)Go randomly samples from the posterior distribution of sequence paths. A sequence path on a branch consists of all the substitution events that occurred to transform the ancestral sequence on the branch to the descendant sequence on the branch. The timing of each event is also specified by the sequence path on the branch. A sequence path on a tree consists of the set of branch paths that yield the observed tip sequences.

Because we had not retained sampled sequence paths in our earlier study with eight 5S rRNA sequences, we repeated the earlier analysis on this data set with the Yu-Thorne model, but this time we retained information pertaining to a randomly selected sequence path. Prior distributions for parameters and other implementation details are described by Yu and Thorne (2006)Go. The taxa from which the sequences are derived, the assumed tree topology, and the estimated branch lengths are indicated in figure 2. Although the Yu-Thorne model is time-reversible and does not by itself yield rooting information, the work of Arisue, Hasegawa, and Hasimoto (2005) suggests that the common ancestor of this 5S rRNA phylogeny corresponds to some location on the branch drawn with the dashed line in figure 2. Except for the events on this branch, all events on our randomly sampled sequence path can therefore be oriented with respect to time. This means that we know the sequence before and after each fixation event. When the events on the branch drawn with the dashed line are excluded, the selected sequence path had 156 events. For each of these events, we calculated E(i) – E(j), where i is the sequence before and j is the sequence after. Because the randomly sampled path coincided with a value of about 0.3599 for f, we approximate 2Nsj with 0.3599 x (E(i) – E(j)) for each of the 156 events. The mean, median, and standard deviation of the 2Nsj estimates, respectively, are 0.03, 0, and 1.18. Among the 156 events, 42 did not result in a change of approximate free energy. These events therefore yielded zero for the estimated 2Nsj. The Yu-Thorne results in figure 3 were produced by applying equation (6) to each of the 114 time-oriented events on our selected sequence path that did not yield a 2Nsj estimate of exactly zero. Although this histogram is based solely on one randomly sampled sequence path, and although there is substantial uncertainty regarding which sequence path generated the extant 5S rRNA sequences, our limited experience suggests that other sequence paths randomly sampled from the posterior distribution yield histograms that are qualitatively similar to figure 3 in that they are much closer to being symmetric about zero than are figure 1 and other histograms of 2Nsj estimates derived from possible changes to extant sequences.


Figure 2
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Topology relating 5S rRNA sequences. Branch lengths are scaled according to the number of changes in the randomly sampled sequence path that was analyzed. The branch with the dashed line is the one on which the most recent common ancestor of the 8 extent sequence is assumed to have occurred.

 

Figure 3
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Distribution of 2Ns estimates among changes on a randomly sampled sequence path. The eight 5S rRNA sequences were analyzed with the model of Yu and Thorne (2006) and a sequence path was randomly sampled. The entire sampled sequence path had 169 changes, but 13 changes could not be oriented with respect to time because they were located on the same branch as the common ancestral sequence. Of the remaining 156 changes, 42 yielded as 2Ns estimate of 0, and these are indicated with the arrow.

 
We also analyzed the eight 5S rRNA sequences with the Direct RNA model. To make the results as comparable as possible, the prior distribution for 2Na in this analysis was identical to the prior distribution for f in the Yu and Thorne (2006Go) analysis. The posterior mean and 95% credibility interval for f were about 0.3661 and (0.2992, 0.4382) (Yu and Thorne 2006Go), whereas values for 2Na were 0.3723 and (0.3006, 0.4520). When paths were randomly sampled using the Direct RNA model and then 2Nsj estimates based on the path were calculated, as was done to produce figure 3, results were qualitatively similar in that the Direct RNA model yielded 2Nsj estimates that were relatively symmetric about zero.

The posterior means for 2Na and f from the eight-sequence data set are lower than those from the single-sequence Z. mays data set. The same pattern holds when each of the other seven sequences in the eight-sequence data set are subjected to single-sequence analysis (Yu and Thorne 2006Go). This may be evidence of a violation of the stationarity assumption (Yu and Thorne 2006Go).

Protein Analysis
As an example of converting interspecific parameter estimates into 2Ns values, we consider the annexin V (also known as annexin A5) protein-coding genes from mouse (Mus musculus) and rat (Rattus norvegicus). We select this aligned sequence pair of 314 codons because it was analyzed with the model of equation (17) by (Robinson et al. 2003). In the previous study, parameters were estimated for the case where {omega} is not fixed at 1. We use the previous estimates for that case and also estimate parameters when {omega} = 1.

To obtain the {omega} = 1 estimates for the Robinson model, we followed the earlier study by using the same tertiary structure (PDB ID 1ALA), prior distributions for parameters, MCMC strategy, and diagnostics for convergence of the Markov chain. The only exception is that {omega} is fixed at 1 rather than being given a prior distribution. The prior distribution settings and the posterior distribution estimates for {omega}, f1 (the solvent accessibility impact parameter), and f2 (the pairwise interaction impact parameter) are summarized in table 1. Results for the Direct Protein model were obtained in a very similar fashion and are also listed in table 1.


View this table:
[in this window]
[in a new window]

 
Table 1 Prior and Posterior Distributions for the Mouse and Rat Annexin V Sequence Pair

 
Referring to the mouse DNA sequence coding for annexin V as i and fixing {omega} to be 1 for the Robinson model, we used equation (18) and the posterior means f1Formula 0.871 and f2Formula 0.0384 to estimate 2Nsj for each of the 2,058 sequences j that result from a single nonsynonymous change to i. The mean, median, and standard deviation of the 2Nsj estimates from the Robinson model with {omega} = 1 are, respectively, –0.21, –0.11, and 0.50. A similar approach was employed to obtain 2Nsj estimates for the Direct Protein model, and almost identical results were obtained. The mean, median, and standard deviation of the 2Nsj estimates are, respectively, –0.21, –0.11, and 0.51 when the posterior means 2Na1Formula 0.894 and 2Na2Formula 0.0379 are employed.

For each possible nonsynonymous change to i, equation (20) can be numerically evaluated with the posterior means of {omega}, f1, and f2 to yield an estimate of 2Nsj(i) for the Robinson model when {omega} != 1. The mean, median, and standard deviation of the 2,058 resulting values are respectively –1.68, –1.62, and 0.34. Figure 4 shows the distribution of the 2Nsj(i) estimates obtained from the Robinson model when {omega} = 1, the Robinson model when {omega} != 1, and the Direct Protein model.


Figure 4
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Distribution of 2Ns estimates among possible nonsynonymous changes to mouse annexin V. The darkest bars represent results from the Direct Protein model. The bars with intermediate shading are from the Robinson model when {omega} is fixed at 1. The lightest bars are from the Robinson model when {omega} is not fixed at 1.

 

    Discussion
 TOP
 Abstract
 Introduction
 Methods
 Discussion
 Acknowledgements
 References
 
Limitations of Population Genetic Assumptions
Nielsen and Yang (2003)Go studied an interspecific model with independent evolution among codons. The Robinson et al. (2003) model can be converted to that of Nielsen and Yang by allowing {omega} values to vary among codons and by forcing f1 and f2 to both equal 0. With an approach akin to equation (20), Nielsen and Yang produced estimates with their model of 4Nsj (and therefore 2Nsj) for possible nonsynonymous changes to extant protein-coding sequences. They noted that their population genetic interpretation of this model is valid both when the mutation rate is low enough to make contemporaneous polymorphisms at multiple sites unlikely and when recombination is high enough to destroy linkage disequilibrium between different polymorphic codons. As in the Nielsen and Yang case, the population genetic interpretations of the models studied here can be valid when the mutation rate is so low as to make contemporaneous polymorphisms at multiple sites unlikely.

However, the interspecific models studied here exhibit greater degrees of evolutionary dependence among sites. Even with high levels of recombination, the probability of fixation of a new nucleotide type at one polymorphic sequence position can be affected by polymorphism at another position if the two positions interact in some way via natural selection on phenotype. This interference occurs if the fitness difference between two variants at one polymorphic site depends on the nucleotide type occupying the other.

Although they exhibit more dependence among sequence sites than conventional interspecific models, the models explored here can still have a great deal of "conditional independence" among sites. For example, two amino acid residues in the same protein may be far from one another on the protein tertiary structure. The pairwise interactions between these two residues may be negligible and, conditional upon no changes at the amino acids occupying other protein positions, the codons corresponding to the two amino acids may evolve independently. If recombination is high enough to put polymorphisms at these two codons into linkage equilibrium, then fixation events at these two "conditionally independent" codons would not interfere with one another. Therefore, as with the Yang and Nielsen independent-codon model, the population genetic interpretation of our dependence-among-sites models is more likely to be valid for high recombination rates than for low rates. The valid range of recombination and mutation rates becomes increasingly restricted as the amount of "conditional independence" decreases.

Evolutionary models that allow simultaneous changes at paired sites in an RNA stem (Tillier and Collins 1998; Higgs 2000GoGo; Savill, Hoyle, and Higgs 2001) or that allow simultaneous changes at multiple codon positions (Whelan and Goldman 2004Go; Higgs, Hao, and Golding 2007) have been found to fit data better than models where each evolutionary event alters only a single sequence site. Some evidence for simultaneous changes could be an artifact of model misspecification. For example, it may be more crucial to maintain base complementarity in some RNA stem regions than in others. If sequence changes alter only a single position at a time, this would lead to heterogeneity among stem regions in the expected evolutionary time between disruption and restoration of base complementarity. When none of the evolutionary models being statistically compared can perfectly account for this heterogeneity because they either permit no heterogeneity or they only permit simultaneous changes at corresponding stem residues, the models with simultaneous change might fit data better.

However, we suspect that this sort of model misspecification is not the main reason why previous studies have found evidence for simultaneous changes at multiple positions. The population genetic theory of compensatory substitutions in RNA shows why double substitutions occur: compensatory changes are likely to spread through the population together and be fixed at the same time, even though the two mutations are unlikely to be simultaneous (Higgs 1998Go). The same population genetic reasoning could be applied to multiple simultaneous changes in proteins. Therefore, modifying the models considered here by allowing simultaneous changes at multiple positions is a good direction for future work, because these modifications would have a strong biological justification.

In our protein analyses and in the 5S rRNA analyses that produced figure 3, we have considered sequences from distantly related species and have assumed that effective population sizes have been constant subsequent to the common ancestor of the population. This assumption is certainly wrong but could be relaxed. Effective population sizes could be allowed to vary among, and even within, branches. One approach would be a hierarchical formulation in which the effective population size is assumed constant within branches (or other prespecified intervals of evolutionary time) but different between branches. The values of the effective population size on different branches could be treated as independent samples from some probability distribution.

Perhaps more appealingly, effective population sizes could be modeled as autocorrelated over time so that closely related evolutionary lineages would be expected to have similar effective population sizes. Changes in effective population size can influence the chronological rate of evolution because slightly deleterious new mutations are more likely to undergo fixation in small populations than in large ones (see Ohta 1972, 1987; Ohta and Tachida 1990GoGoGo). Therefore, there seems to be a great potential for merging this problem with recent statistical ideas for interspecific divergence time estimation in the absence of constant rates of molecular evolution (e.g., Kishino, Thorne, and Bruno 2001; Sanderson 2002; Drummond et al. 2006GoGoGo).

The most difficult obstacle to allowing effective population size to change over time is probably the treatment of the common ancestral sequence. If changes in effective population size are slow or uncommon, it might be acceptable to treat the common ancestral sequence as a realization from the stationary distribution of sequences and to specify a prior distribution for the parameters in the stationary distribution. The more challenging and more likely scenario may be that population size changes are too frequent for the stationary distribution to be reached. This would necessitate more sophisticated and presumably more complicated treatments of the common ancestral sequence.

Assessment of Population Genetic Inferences
The histograms of 2Ns values for possible changes to extant sequences (figs. 1 and 4Go) are sensible in that the means of the 2Ns values are negative in each case. We expect 2Ns to be negative for most changes to an extant sequence because these sequences are a compromise between the opposing forces of natural selection and mutation. Natural selection favors sequences with higher fitness, and observed sequences should tend to be associated with better phenotypes than possible mutants. The 2Ns distribution when {omega} is not constrained to 1 closely resembles the 2Ns distribution when {omega} is fixed at 1 (fig. 4), except that the former distribution is shifted toward more negative values of 2Ns. The shift occurs because the posterior mean of {omega} is much less than 1.

Differences between the 2Ns distribution for inferred 5S rRNA changes (fig. 3) and possible changes (fig. 1) are reasonable in some respects. The average value of 2Ns among possible substitutions (–0.94) is less than the average among inferred substitutions (0.03) and the 2Ns distribution for inferred substitutions is closer to being symmetric about zero. We expect this because deleterious mutations are relatively unlikely to be fixed. The 2Ns distribution for possible changes aims to summarize the selective effects of possible mutations, whereas the 2Ns distribution for inferred substitutions aims to summarize the selective impacts of fixation events.

Unfortunately, none of the histograms for possible changes to extant sequences seem biologically plausible. We should expect some fraction of possible changes to be lethal or extremely deleterious because 5S rRNA and annexin V are both important genes (e.g., Smith et al. 2001; Gerke and Moss 2002Go). Homologs of both genes are present in distantly related species and are presumably subject to strong natural selection. Lethal and very deleterious mutations do not get fixed during evolution, and they therefore correspond to Ns values that are –{infty} or are at least very low. Although limitations of our procedures might prevent Ns estimates of –{infty}, a realistic model of sequence evolution should at least yield some possible changes for which Ns estimates are more extreme than any we obtain.

Population geneticists have incorporated nonsynonymous polymorphism data to investigate the distribution of selection on possible nonsynonymous changes. Their techniques and conclusions vary somewhat, but recent population genetic studies (Piganeau and Eyre-Walker 2003Go; Yampolsky, Kondrashov, and Kondrashov 2005; Eyre-Walker, Woolfit, and Phelps 2006Go) find that a substantial proportion of possible nonsynonymous changes are extremely deleterious. These deleterious changes should have much lower 2Ns values than do any of the changes represented in figures 1 and 4. These population genetic studies indicate that both the mean and the median 2Ns values should be much lower than those in our histograms. Qualitatively, the Nielsen and Yang (2003)Go results are somewhat like our own, but our technique is related to theirs and is likely to share strengths and weaknesses.

Our inferred 2Ns distributions should be deemed much less reliable than those from the recent population genetic studies. To obtain our estimates, we require that sequences can be perfectly mapped to phenotypic scores and that effect of phenotype is perfectly captured by our interspecific rate models. We also require a very specific population genetic scenario to hold for long periods of evolutionary time. These requirements are suspect.

If other techniques yield much more reliable estimates for 2Ns distributions than can our approach and other descendants of the Halpern and Bruno (1998)Go approach (Nielsen and Yang 2003Go; Berg, Willmann, and Lässig 2004; Sella and Hirsh 2005Go; Mustonen and Lässig 2005), should the interspecific-based attempts be abandoned? Our conclusion is no. We believe the examination of inferred 2Ns distributions is a good way to interrogate interspecific models of sequence evolution. The phylogenetics literature has accumulated a wide variety of statistical techniques for comparing interspecific models of sequence change (see Felsenstein 2004Go; see also Lartillot and Philippe 2006Go), but these statistical techniques do not purport to use the large body of evolutionary knowledge from population genetics. If population genetic and other biological evidence can inform interspecific evolutionary models, the result should be better interspecific evolutionary models. If a possible sequence change is extremely deleterious, the interspecific rate for that change should be negligible. When distantly related proteins are compared, conventional interspecific models need to account for the fact that none of the ancestral sequences could have been lethal and all or almost all ancestral sequences should have been relatively fit. The disconnect between fitness and interspecific models of sequence change is large and can be better understood by considering population genetic evidence. We also believe that the value of interspecific sequence comparisons will increasingly become apparent in population genetics because interspecific comparisons are not subject to a paucity of observed genetic variation.

Influence of Parameterization on Population Genetic Interpretations
If population genetics and phylogenetics are to be reconciled, the choice of parameterization for interspecific models is crucial. For example, while its mathematical effects on interspecific rates are clear and simple, assigning a biological interpretation to the {omega} parameter in equation (17) is difficult. As noted above, {omega} != 1 means that we must consider which of two alleles originated from the other. If a population has 1 sequence j allele and 2N – 1 i alleles, {omega} != 1 means that the probability j is eventually fixed will depend on whether it is newly arisen from sequence i or whether j is the sole surviving copy of an allele that is ancestral to i. Under what biological circumstances should the evolutionary fates of two alleles depend on more than just the sequences that encode them and their frequencies within a population? Complicated scenarios that yield relative fitnesses being affected by which of two alleles is ancestral are certainly possible, but it seems to us that these scenarios may be the exception rather than the rule in biology. Reconciling the {omega} parameter of interspecific models with plausible population genetic interpretations may be difficult. Work on allowing {omega} values to vary among codons (e.g., Nielsen and Yang 1998Go; Yang et al. 2000) has been and continues to be fruitful, but we believe these strategies for improving the statistical fits of evolutionary models to data can be supplemented by more biologically interpretable models that have better connections to phenotype and its effects on evolutionary rates.

Koshi, Mindell, and Goldstein (1999) considered an interspecific model of amino acid replacement with independent evolution among sites. An interesting feature of their model is that different sites in a protein experience different amino acid replacement processes as a result of a variation among sites of preferred physicochemical properties. As with the models considered here, the rates of the Koshi model can be interpreted as a product of a factor representing mutation rate and a factor representing fixation probability. Fixation probabilities are lower in the Koshi model for amino acid replacements that yield a worse phenotype than for replacements that improve a phenotype, and rates are especially low for those replacements that yield a much worse phenotype. The Koshi model differs from the parameterizations here in its treatment of rates for changes that improve the phenotype. Whether a replacement greatly improves phenotype or only slightly improves it, the Koshi model employs the same fixation probability. The asymmetry of incorporating amount of phenotypic change into evolutionary rates when the changes are unfavorable but not when the changes are favorable may be hard to reconcile with an explicit population genetic framework.

Although it considers protein structure stability rather than the physicochemical properties examined by Koshi, Mindell, and Goldstein (1999), the Bastolla et al. (2006) model of protein evolution has a parameterization of evolutionary rates with the same type of rate asymmetry. This kind of parameterization is attractive because it yields a time-reversible evolutionary process (Koshi, Mindell, and Goldstein 1999). Nevertheless, it is unclear to us whether this asymmetric parameterization can satisfactorily be connected to population genetics.

Future Directions
Although there is abundant statistical evidence that they fit DNA and protein data better than did the pioneering models of amino acid replacement (Dayhoff and Eck 1968Go) and nucleotide substitutions (Jukes and Cantor 1969Go), widely used interspecific models retain glaring weaknesses. The biggest of these weaknesses may be the disconnect between genotype and phenotype. Many nucleotide substitutions result in phenotypic changes, and these changes can affect fitness. Along with mutation, natural selection is a major determinant of evolutionary rates, but mapping of genotype to fitness is absent in most interspecific models of sequence change and it is crude in the others.

A focus of computational biology is the automated prediction of phenotype from genotype. In silico systems for phenotypic predictions can only improve. A central task for evolutionary biologists should be to translate predicted phenotypes into relative fitnesses. To make the link between phenotype and fitness, population genetics must be considered. The reconciliation of population genetics and interspecific evolution is possible and would facilitate development of biologically informed models of sequence change.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Methods
 Discussion
 Acknowledgements
 References
 
We thank D. Robinson for software and data assistance, and J. Felsenstein, A. Hobolth, B. Redelings, and an anonymous reviewer for helpful comments. J.L.T. and S.C.C. were supported by National Institutes of Health. grant GM070806 and National Science Foundation grant D.E.B-0445180. P.G.H. was supported by the Canada Research Chairs Organization. H.K. was supported by the Japan Society for the Promotion of Science (J.S.P.S. grant SR B-16300086). J.Y. was supported by a Danish Natural Science Research Council (FNU) grant to R. Nielsen.


    Footnotes
 
Ziheng Yang, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Methods
 Discussion
 Acknowledgements
 References
 

    Arisue N, Hasegawa M, Hashimoto T. Root of the Eukaryota tree as inferred from combined maximum likelihood analyses of multiple molecular sequence data. Mol Biol Evol. (2005) 22:409–420.[Abstract/Free Full Text]

    Bastolla U, Porto M, Roman HE, Vendrulscolo M. A protein evolution model with independent sites that reproduces site-specific amino acid distributions from the Protein Data Bank. BMC Evol Biol (2006) 6:43.[CrossRef][Medline]

    Berg J, Willmann S, Lässig M. Adaptive evolution of transcription factor binding sites. BMC Evol Biol. (2004) 4:42.[CrossRef][Medline]

    Brent R. Algorithms for minimization without derivatives (1973) Englewood Cliffs, NJ: Prentice–Hall.

    Dayhoff MO, Eck RV. A model of evolutionary change in proteins. In: Atlas of protein sequence and structure—Dayhoff MO, Eck RV, eds. (1968) National Biomedical Research Foundation: Washington D.C. 33–41. 1967–68.

    Drummond AJ, Ho SY, Phillips MJ, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biol (2006) 4:e88.[CrossRef][Medline]

    Eyre-Walker A, Woolfit M, Phelps T. The distribution of fitness effects of new deleterious amino acid mutations in humans. Genetics (2006) 173:891–900.[Abstract/Free Full Text]

    Felsenstein J. Inferring phylogenies (2004) Sunderland, MA: Sinauer.

    Gerke V, Moss SE. Annexins: From structure to function. Physiol Rev. (2002) 82:331–371.[Abstract/Free Full Text]

    Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: Modeling site-specific residue frequencies. Mol Biol Evol. (1998) 15:910–917.[Abstract]

    Higgs PG. Compensatory neutral mutations and the evolution of RNA. Genetica. 102/103 (1998) 91–101.

    Higgs PG. RNA secondary structure: Physical and computational aspects. Quart Rev Biophys (2000) 33:199–253.[CrossRef][Web of Science][Medline]

    Higgs PG, Hao WQ, Golding GB. Identification of conflicting selective effects on highly expressed genes. Evol Bioinf (2007) 2:1–13.

    Hobolth A, Nielsen R, Wang Y, Wu F, Tanksley SD. CpG+CpNpG analysis of protein coding sequences from tomato. Mol Biol Evol. (2006) 23:1318–1323.[Abstract/Free Full Text]

    Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatsh Chem. (1994) 125:167–188.[CrossRef]

    Hwang DG, Green P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci USA (2004) 101:13994–14001.[Abstract/Free Full Text]

    Jones DT. GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol. (1999) 287:797–815.[CrossRef][Web of Science][Medline]

    Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature (1992) 358:86–89.[CrossRef][Medline]

    Jukes TH, Cantor CR. Evolution of protein molecules. In: Mammalian Protein Metabolism—Munro HN, ed. (1969) New York: Academic Press. 21–132.

    Kimura M. On the probability of fixation of mutant genes in a population. Genetics (1962) 47:713–719.[Free Full Text]

    Kishino H, Thorne JL, Bruno WJ. Performance of a divergence time estimation method under a probabilistic model of rate evolution. Mol Biol Evol. (2001) 18:352–361.[Abstract/Free Full Text]

    Knudsen B, Miyamoto MM. Using equilibrium frequencies in models of sequence evolution. BMC Evol Biol. (2005) 5:21.[CrossRef][Medline]

    Koshi JM, Mindell DP, Goldstein RA. Using physical-chemistry based mutation models in phylogenetic analyses of HIV-1 subtypes. Mol Biol Evol. (1999) 16:173–179.[Abstract]

    Lartillot N, Philippe H. Computing Bayes factors using thermodynamic integration. Syst Biol. (2006) 55:195–207.[CrossRef][Medline]

    Mustonen V, Lässig M. Evolutionary population genetics of promoters: Predicting binding sites and functional phylogenies. Proc Nat Acad Sci USA. (2005) 102:15936–15941.[Abstract/Free Full Text]

    Nielsen R, Yang Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics (1998) 148:929–936.[Abstract/Free Full Text]

    Nielsen R, Yang Z. Estimating the distribution of selection coefficients from phylogenetic data with applications to mitochondrial and viral DNA. Mol Biol Evol. (2003) 20:1231–1239.[Abstract/Free Full Text]

    Ohta T. Population size and rate of evolution. J Mol Evol. (1972) 1:305–314.[CrossRef][Web of Science][Medline]

    Ohta T. Very slightly deleterious mutations and the molecular clock. J Mol Evol. (1987) 26:1–6.[CrossRef][Web of Science][Medline]

    Ohta T, Tachida H. Theoretical study of near neutrality. I. Heterozygosity and rate of mutant substitution. Genetics (1990) 126:219–229.[Abstract]

    Piganeau G, Eyre-Walker A. Estimating the distribution of fitness effects from DNA sequence data: Implications for the molecular clock. Proc Natl Acad Sci USA (2003) 100:10335–10340.[Abstract/Free Full Text]

    R Development Core Team. R: A language and environment for statistical computing. (R Foundation for Statistical Computing,Vienna, 2005) (2006) (http://www.R-project.org).

    Robinson DM, Jones D, Kishino H, Goldman N, Thorne JL. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. (2003) 20:1692–1704.[Abstract/Free Full Text]

    Rodrigue N, Lartillot N, Bryant D, Philippe H. Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene (2005) 347:207–217.[CrossRef][Web of Science][Medline]

    Rodrigue N, Lartillot N, Philippe H. Assessing site-interdependent phylogenetic models of sequence evolution. Mol Biol Evol. (2006) 23:1762–1775.[Abstract/Free Full Text]

    Sanderson MJ. Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Mol Biol Evol. (2002) 19:101–109.[Abstract/Free Full Text]

    Savill NJ, Hoyle DC, Higgs PG. RNA sequence evolution with secondary structure constraints: Comparison of substitution rate models using maximum likelihood methods. Genetics (2001) 157:399–411.[Abstract/Free Full Text]

    Sella G, Hirsh AE. The application of statistical physics to evolutionary biology. Proc Natl Acad Sci USA (2005) 102:9541–9546.[Abstract/Free Full Text]

    Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. (2004) 21:468–488.[Abstract/Free Full Text]

    Smith MW, Meskausjas A, Wang P, Sergiev PV, Dinman JD. Saturation mutagenesis of 5S rRNA in Saccharomyces cerevisiae. Mol Cell Biol. (2001) 21:8264–8275.[Abstract/Free Full Text]

    Szymanski M, Barciszewska MZ, Erdmann VA, Barciszewski J. 5S ribosomal RNA database. Nucleic Acids Res. (2002) 30:176–178.[Abstract/Free Full Text]

    Tillier ERM, Collins RA. High apparent rate of simultaneous compensatory basepair substitutions in ribosomal RNA. Genetics (1998) 148:1993–2002.[Abstract/Free Full Text]

    Whelan S, Goldman N. Estimating the frequency of events that cause multiple-nucleotide changes. Genetics (2004) 167:2027–2043.[Abstract/Free Full Text]

    Yampolsky LY, Kondrashov FA, Kondrashov AS. Distribution of the strength of selection against amino acid replacements in human proteins. Hum Mol Genet (2005) 14:3191–3201.[Abstract/Free Full Text]

    Yang Z, Nielsen R, Goldman N, Pedersen A-MK. Codon-substitution models for variable selection pressure at amino acid sites. Genetics (2000) 155:431–449.[Abstract/Free Full Text]

    Yu J, Thorne JL. Dependence among sites in RNA evolution. Mol Biol Evol. (2006) 23:1525–1537.[Abstract/Free Full Text]

    Zuker M. On finding all suboptimal foldings of an RNA molecule. Science (1989) 244:48–52.[Abstract/Free Full Text]

    Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. (1981) 9:133–148.[Abstract/Free Full Text]

Accepted for publication April 18, 2007.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
N. Rodrigue, C. L. Kleinman, H. Philippe, and N. Lartillot
Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons
Mol. Biol. Evol., July 1, 2009; 26(7): 1663 - 1676.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
A. M. Moses and R. Durbin
Inferring Selection on Amino Acid Preference in Protein Domains
Mol. Biol. Evol., March 1, 2009; 26(3): 527 - 536.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
M. Anisimova and C. Kosiol
Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models
Mol. Biol. Evol., February 1, 2009; 26(2): 255 - 271.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
W. Delport, K. Scheffler, and C. Seoighe
Models of coding sequence evolution
Brief Bioinform, January 1, 2009; 10(1): 97 - 109.
[Abstract] [Full Text] [PDF]


Home page
Phil Trans R Soc BHome page
S. C. Choi, B. D Redelings, and J. L Thorne
Basing population genetic inferences and models of molecular evolution upon desired stationary distributions of DNA or protein sequences
Phil Trans R Soc B, December 27, 2008; 363(1512): 3931 - 3939.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
N. Rodrigue, N. Lartillot, and H. Philippe
Bayesian Comparisons of Codon Substitution Models
Genetics, November 1, 2008; 180(3): 1579 - 1591.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
Z. Yang and R. Nielsen
Mutation-Selection Models of Codon Substitution and Their Use to Estimate Selective Strengths on Codon Usage
Mol. Biol. Evol., March 1, 2008; 25(3): 568 - 579.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
24/8/1667    most recent
msm085v2
msm085v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Thorne, J. L.
Right arrow Articles by Kishino, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Thorne, J. L.
Right arrow Articles by Kishino, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?