Skip Navigation


MBE Advance Access originally published online on May 23, 2007
Molecular Biology and Evolution 2007 24(8):1769-1782; doi:10.1093/molbev/msm097
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
24/8/1769    most recent
msm097v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Choi, S. C.
Right arrow Articles by Thorne, J. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Choi, S. C.
Right arrow Articles by Thorne, J. L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Quantifying the Impact of Protein Tertiary Structure on Molecular Evolution

Sang Chul Choi*, Asger Hobolth*, Douglas M. Robinson*,{dagger}, Hirohisa Kishino{ddagger} and Jeffrey L. Thorne*,§

* Bioinformatics Research Center, North Carolina State University
{dagger} Statistical Genetics and Biomarkers Group, Bristol-Myers Squibb Co., Pennington, New Jersey
{ddagger} Laboratory of Biometrics, Graduate School of Agriculture and Life Sciences, University of Tokyo, Tokyo, Japan
§ Wissenschaftskolleg zu Berlin, Berlin, Germany

E-mail: thorne{at}statgen.ncsu.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Discussion
 Supplementary Material
 Appendix A
 Appendix B
 Acknowledgements
 References
 
To investigate the evolutionary impact of protein structure, the experimentally determined tertiary structure and the protein-coding DNA sequence were collected for each of 1,195 genes. These genes were studied via a model of sequence change that explicitly incorporates effects on evolutionary rates due to protein tertiary structure. In the model, these effects act via the solvent accessibility environments and pairwise amino acid interactions that are induced by tertiary structure. To compare the hypotheses that structure does and does not have a strong influence on evolution, Bayes factors were estimated for each of the 1,195 sequences. Most of the Bayes factors strongly support the hypothesis that protein structure affects protein evolution. Furthermore, both solvent accessibility and pairwise interactions among amino acids are inferred to have important roles in protein evolution. Our results also indicate that the strength of the relationship between tertiary structure and evolution has a weak but real correlation to the annotation information in the Gene Ontology database. Although their influences on rates of evolution vary among protein families, we find that the mean impacts of solvent accessibility and pairwise interactions are about the same.

Key Words: molecular evolution • protein structure impact • Gene Ontology • MCMC • Bayes factor


    Introduction
 TOP
 Abstract
 Introduction
 Discussion
 Supplementary Material
 Appendix A
 Appendix B
 Acknowledgements
 References
 
Most widely used models of sequence evolution exploit the assumption that individual sites or codons evolve independently from one another. The independence assumption dictates that a substitution at one site or codon does not influence substitution rates at other sites or codons. Unfortunately, the assumption of evolutionary independence among sites is not biologically plausible. In reality, the effect of a change at a particular site might cascade throughout the sequence, causing changes in substitution rates at other sites or codons. Nonsynonymous substitutions (i.e., those nucleotide substitutions that cause an amino acid replacement) are especially likely to affect rates at other positions, because the amino acid that occupies one site in a protein may have interactions with amino acids that are located nearby in the protein structure.

Despite strong evidence of a relationship between protein structure and protein change, evolutionary models of protein-coding genes tend to neglect protein structure. Protein tertiary structure changes very slowly over time (Chothia and Lesk 1986Go; Flores et al. 1993Go; Russell et al. 1997Go). If a nucleotide substitution results in an amino acid replacement that disrupts protein structure, it is likely to be selectively deleterious, and the rate at which such a substitution occurs should be low. Likewise, a nucleotide substitution should have a relatively high rate if it improves compatibility between protein sequence and structure. This is the intuition underlying a recently developed procedure for making evolutionary inferences when, owing to protein tertiary structure, codons do not evolve independently (Robinson et al. 2003Go).

Protein tertiary structure is phenotype whereas protein-coding DNA is genotype. Evolutionary models that lack dependence induce sterile adaptive landscapes where each codon or sequence position can be separately optimized. In contrast, models with dependence have the prospect of yielding biologically plausible fitness landscapes. Allowing dependent change among sequence positions is important because it permits a reasonable treatment of the effects of phenotype on evolution of genotype. A central goal of the newly emerging field of computational biology is to make in silico predictions of aspects of phenotype from genotype. The statistical strategies used in the study reported here can incorporate these predictive products of computational biology into evolutionary models and thereby better reveal the impact of the phenotype on evolution of the genotype.

Despite the obvious and important role for protein tertiary structure in shaping the evolution of protein-coding DNA sequences, tools for quantifying the evolutionary impact of tertiary structure remain primitive. One major difficulty is that natural selection to maintain protein tertiary structure induces evolutionary dependencies among sequence positions and these dependencies complicate statistical inference. Felsenstein's pruning algorithm (Felsenstein 1981Go) is the conventional mechanism for transforming continuous time Markov models of sequence evolution into a basis for statistical inference, but adapting the pruning algorithm to dependent changes among positions is problematic.

Recently, strategies have been developed for making inferences with models that have evolutionary dependence among sites (Jensen and Pedersen 2000Go; Pedersen and Jensen 2001Go; Hwang and Green 2004Go; Pedersen et al. 2004Go; Siepel and Haussler 2004Go; Christensen, Hobolth, and Jensen 2005Go). Protein tertiary structure as a source of dependent change among positions was the focus of Robinson et al. (2003)Go. The inference procedure of Robinson et al. (2003)Go was limited to data sets with two aligned protein-coding DNA sequences, but this was soon extended to data sets with three sequences (Robinson 2003Go) and data sets with three or more sequences (Rodrigue et al. 2005Go). Although these inference procedures require the restrictive assumption that all protein sequences being analyzed share a single known tertiary structure, their biggest limitation is probably that it can be computationally prohibitive to analyze a large number of highly diverged sequences.

It would be desirable to assess the impact of protein structure on protein evolution across a large set of protein families. This sort of assessment would shed light on how variable this impact is among protein families, and it might help determine if certain features of a protein family are correlated with the relationship between protein structure and evolution. Here, we show that information about the relationship between protein structure and evolution can be extracted from data sets consisting only of a single sequence per known protein structure.

Analyzing data sets with single sequences requires much less computation per data set than does analyzing multiple sequence data sets. As a result, we have been able to quantify the relationship between protein structure and evolution for 1,195 protein families. We find that incorporating tertiary structure into models of protein evolution almost always produces a much better fit to data. We explore how the influences on evolutionary rates of solvent accessibility and pairwise interactions vary among protein families. When averaged among protein families, we find that the sizes of the solvent accessibility and pairwise interaction impacts are similar. We also examine whether there is a correlation between annotation information in the Gene Ontology database and the strength of the structure–evolution relationship. We conclude that there is a weak but real correlation.

Evolutionary Model
We adopt and summarize the Robinson et al. (2003)Go model of protein evolution, referred to here as the dependent–sites model. It is designed so that the rate of change Rij from sequence i to sequence j depends both on the mutational process and on natural selection. Some parameters in the model are intended to capture the effects of mutation, while others are intended to reflect natural selection. The natural selection parameters are further subdivided into those that relate tertiary structure to evolution and those that reflect phenomena external to protein structure.

In the complete absence of natural selection, the probability of a neutrally evolving sequence is governed solely by the mutational process. In this neutral scenario, the frequencies of the four nucleotide types would be {pi}A, {pi}C, {pi}G, and {pi}T, where {pi}A + {pi}G + {pi}C + {pi}T = 1. To account for the differential occurrence of transition and transversion mutations, we add the transition/transversion ratio parameter {kappa} to the neutral model.

Effects on evolutionary rates that cannot be attributed to protein structure are captured by the parameter {omega}. This parameter represents the factor by which a nonsynonymous rate differs from what the rate would be if the change were actually synonymous. If {omega} were the only parameter to differentiate between neutral evolution and natural selection, the dependent–sites model would have no important distinction from other simple codon substitution models (e.g., Goldman and Yang 1994Go; Muse and Gaut 1994Go).

To represent the influence of tertiary structure, the dependent–sites model has two additional parameters. One parameter, s, captures the effects on evolutionary rates of solvent accessibility. The other, p, reflects the effects of pairwise amino acid interactions. When s and p are both zero, protein structure does not alter evolutionary rates. When both are positive, proteins evolve so as to be compatible with tertiary structure. It is formally possible that s or p can be negative. However, negative values for these parameters are not biologically plausible because they represent the situation where a protein is evolving so as to poorly fit its three-dimensional structure.

To measure or score how compatible a particular sequence is with a particular structure, a scoring system that was empirically defined for protein fold recognition is used (Jones, Taylor, and Thornton 1992Go; Jones and Thornton 1996Go). This scoring system is designed for measuring how well a specific protein sequence folds into a specific structure. Rather than using the sequence-structure compatibility measure to find the best structure, the dependent-sites model assumes a known protein structure that is invariant during evolution and the sequence-structure compatibility measure links rates of protein-coding gene evolution to tertiary structure.

The sequence–structure compatibility measure is designed for globular proteins and has two components. One assesses solvent accessibility and the other evaluates pairwise amino acid interactions. The solvent accessibility score assigned by folding an amino acid sequence i (or a translated DNA sequence i) into the known tertiary structure is denoted by Es(i), whereas the pairwise amino acid interaction score is Ep(i). Low values of Es(i) and Ep(i) represent better sequence–structure compatibility than high values. The scoring systems are akin to free energies in that the values of Es(i) and Ep(i) should be negative when sequence i is folded into its actual structure and positive otherwise.

The dependent–sites model does not allow instantaneous changes from a sequence i to a sequence j unless these sequences differ at only 1 nucleotide position. The rate Rij is zero if i and j differ at more than one position. Likewise, changes that introduce stop codons are not allowed. Letting h isin {A, C, G, T} be the nucleotide in sequence j at the sole position where i and j differ and letting u be a rate scaling factor, the rates of change are

Formula (1)

The length in nucleotides of sequence i will be L. Having im and kn, respectively, be the mth nucleotide of sequence i and the nth nucleotide of sequence k, this system of rates yields the following stationary distribution of sequences

Formula (2)
where the sum in the denominator is over all possible sequences k that have length L and that lack a premature stop codon (Robinson et al. 2003Go). The fact that the s, p, and {pi} parameters appear in equation (2) means that these parameters can be estimated without the computational burden that stems from analyzing multiple–sequence data sets. Because {kappa} and {omega} do not appear in equation (2),

Formula
This means estimation of {kappa} and {omega} requires two or more sequences.

Data Preparation
There were 32,987 experimentally determined protein tertiary structures deposited at the Protein Data Bank (PDB) database as of December 2, 2005 (Berman et al. 2000Go). Some protein families in this database are much more heavily represented than others. Our objective is to characterize a large and diverse sample of protein structures in order to begin to understand how the influence of tertiary structure varies in the protein universe. For this reason, we decided to partially avoid redundancy in the PDB by concentrating on the PDBselect list (Hobohm et al. 1992Go) available on September 15, 2006. This list identifies PDB protein chains with less than 25% sequence identity. For PDB structures with multiple protein models, we selected the representative conformation supplied by the OLDERADO database (Kelley and Sutcliffe 1997Go) at http://pqs.ebi.ac.uk/pqs-nmr.html. For those proteins not included in OLDERADO, the first structure conformation entry in PDB was selected.

We employed a resource (Martin 2005Go) that maps PDB structures to their corresponding protein sequences in the Swiss-Prot database (Boeckmann et al. 2003Go). Subsequently, we retrieved the DNA sequences corresponding to the protein sequences using the public Nucleotide Sequence Database (Kanz et al. 2005). This procedure provided us with a nonredundant data set of protein-coding DNA sequences that each had a corresponding protein tertiary structure in the PDB database. Each of the protein sequences was processed to produce output from the DSSP program (Kabsch and Sander 1983Go), from which solvent accessibility of protein chains can be generated. The DSSP data were used to determine the solvent accessibility scores Es(i) for a sequence i folded into a known structure. To be able to compute the pairwise interaction score Ep(i), the three-dimensional coordinates of the C–{alpha}, C–ß, O, and N atoms for each amino acid were extracted from the PDB data.

Because there is excessive uncertainty in doing evolutionary inference with short proteins, we eliminated protein-coding sequences with fewer than 200 nucleotides. We also removed the longest protein for reasons of computational feasibility. All results reported here are based on the remaining 1,195 proteins.

For each of these 1,195 protein families, the amino acid sequence information resulting from the translated DNA exactly matches the corresponding amino acid sequence information in the PDB database. Our requirement of exact matching serves to help eliminate PDB entries that are not naturally occurring. Because there are a variety of data collection issues that can cause terminal gaps, protein families were not removed from our data set if the alignments between the amino acid sequences from translated DNA and the PDB entries exhibited terminal gaps. However, only the DNA and structure information corresponding to the region of exact matching was used in our analyses.

Methods
Our estimation approach is a simplified version of that described by Robinson et al. (2003)Go. Because our data sets consist only of single sequences, we need not consider the nucleotide substitutions that separate a group of homologous protein-coding DNA sequences. Instead, our Bayesian inference procedure is based entirely on equation (2) and on the prior distributions of parameters. By having prior distributions of s and p that are independent of one another and independent of the nucleotide frequency prior, the posterior density for a sequence i is

Formula (3)
To simplify notation, the vector {theta} = {s, p, {pi}A, {pi}C, {pi}G, {pi}T} will be employed so that

Formula (4)

We selected a prior distribution for s that is uniform between –5 and 5 and a prior distribution for p that is uniform between –0.3 and 0.3. Although the biologically plausible values of s and p are positive, we intentionally specified the prior distributions so that positive values of s and p were not favored a priori. We therefore view posterior densities concentrated on the positive values of these parameters as partial validations of our model. The endpoints of the uniform priors for s and p were chosen from our prior experience with the dependent-sites model. We wanted prior intervals that were symmetric about 0 and that were as short as possible while still yielding very low posterior density near the interval endpoints. The incentive for having short prior intervals stems from the grid-based Gibbs sampler that we employed (see below and also see Robinson et al. 2003Go). The joint prior distribution of {pi}A, {pi}C, {pi}G, and {pi}T was uniform for all analyses reported here. Because the sum of these four parameter values must be one, this prior on the {pi} parameters is equivalent to a Dirichlet distribution with 4 categories that has the hyperparameter corresponding to each category set to one.

To approximate the posterior density, we employ a Metropolis-Hastings algorithm (Metropolis et al. 1953Go; Hastings 1970Go). The parameter values after t steps (t > 0) of the Markov chain will be denoted {theta}(t). The algorithm begins with any initial set of parameter values {theta}(0) that has nonzero posterior density. Thereafter, new states of the Markov chain are determined by randomly proposing new parameter values and then randomly accepting or rejecting the proposed state with the appropriate probability. Let the proposed parameter values at step t + 1 of the Markov chain be denoted {theta}'. These proposed values are obtained by randomly sampling from a probability density J({theta}'|{theta}(t)) that is conditional upon the current state of the Markov chain. A new proposal {theta}' is accepted with probability equal to the minimum of 1 and r, where

Formula (5)
and where J({theta}(t)|{theta}') is the probability density of proposing {theta}(t) if the current parameter values had actually been {theta}'. If the proposal is accepted {theta}(t+1) = {theta}', and otherwise {theta}(t+1) = {theta}(t).

The sums in the numerator and denominator of equation (5) are challenging to evaluate because they range over all DNA sequences that have the same length as sequence i. We approximate the ratio of these two sums via the grid-based Gibbs sampler of (Robinson et al. (2003)Go. Pilot experiments indicated that our implementation of this grid-based Gibbs sampler yielded satisfactory results (data not shown).

Rather than simultaneously proposing changes to all parameters represented by {theta}, our implementation has one proposal that changes only s, another that changes only p, and a third that changes the {pi} parameters. The proposed solvent accessibility parameter s' is sampled from a uniform distribution between s 1 and s + 1. When s – 1 is less than –5 and when s + 1 is greater than 5 (i.e., when there is a risk of proposing values beyond the boundaries of the uniform prior distribution), the lower or upper limits of the proposal density are shortened so that proposed values do not have prior densities of zero. The proposal density for p' is similar in nature to that for s', except that the uniform proposal distribution for p' is between p – 0.1 and p + 0.1. The proposal density for the {pi} parameters is a Dirichlet distribution of order four with parameters {pi}A x 100, {pi}C x 100, {pi}G x 100, and {pi}T x 100. The proposal acceptance proportions varied among the 1,195 protein families, but about 31% of s proposals, 15% of p proposals, and 24% of {pi} proposals were accepted.

We implemented the Metropolis-Hastings algorithm by cycling through the s, p, and {pi} proposals. We refer to one set of these 3 proposals as a cycle. To design a Metropolis-Hastings strategy that would reliably approximate the posterior distribution of {theta} and that would simultaneously be feasible for analyzing 1,195 single-sequence data sets, we arbitrarily selected 10 of the data sets for particularly careful examination. Each of the 10 different data sets was analyzed by initializing each of 5 different Metropolis-Hastings runs at a different set of initial parameter values {theta}(0).

With these pilot analyses, we obtained very similar estimates of the posterior distribution for the 5 different runs when we discarded the first 10,000 Metropolis-Hastings cycles and then sampled every 100 cycles until 1,000 samples had been obtained (data not shown). The Tracer program calculates an "effective sample size" from Metropolis-Hastings output (Rambaut and Drummond 2003Go). We found that the effective sample size tended to be almost the same as the actual sample size of 1,000 when Metropolis-Hastings runs had the burn-in of 10,000 cycles and the sampling frequency of 100 cycles as described above. Similarly, we found that consecutive Metropolis-Hastings samples have low correlation. The Metropolis-Hastings settings that we selected yielded potential scaling reduction statistic values (Gelman et al. 2003Go) that were less than 1.001 for all runs with the 10 data sets. A value of this statistic that is close to 1 indicates that increasing the number of Metropolis-Hastings cycles may not greatly improve posterior density approximations. Because our Metropolis-Hastings implementation seems well-behaved for the posterior density approximations of the 10 selected data sets, we elected to use the same Metropolis-Hastings settings to analyze the entire collection of 1,195 single-sequence data sets.

The Distribution of Tertiary Structure Impact Among Proteins
The Metropolis-Hastings analyses yield samples from the posterior density p({theta}|i) for each of the 1,195 protein-coding sequences i. These samples can then be used to estimate the posterior densities of individual parameters. The posterior means of si and pi for a protein family represented by sequence i can be approximated with the mean of the Metropolis-Hastings samples for that protein family. The estimated posterior means Formula and Formula will have little Monte Carlo error when the number of approximately independent samples from the joint posterior density of si and pi is large. As noted above, our pilot experiments suggest our Metropolis–Hastings run lengths are long enough to yield posterior density estimates with relatively small Monte Carlo errors.

The predominantly positive values for Formula and Formula in figure 1 can be viewed as a partial validation of our model because they represent situations where sequences evolve so as to be compatible with their tertiary structure. These positive values cannot be attributed to the prior distributions for s and p because the prior distributions were uniform and centered at 0. An important source of variation between the estimates Formula and Formula and the true values si and pi is the limited amount of information in a single-sequence data set. Therefore, credibility intervals for si and pi can be wide. Because of this variation, a histogram of the 1,195 posterior mean estimates of s (or p) is expected to be more variable than the distribution of the 1,195 actual but unknown values of s (or p). Also, unlike the distribution of actual parameter values, the distribution of the 1,195 posterior mean estimates will be influenced by prior distributions for parameters.


Figure 1
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Distributions of protein tertiary structure impact parameters: Maximum-likelihood estimates of the distributions among genes of s and p are superimposed on the histograms of the Formula and Formula estimates. (A) The solvent accessibility parameter s. (B) The pairwise interaction parameter p.

 
To obtain a maximum-likelihood estimate of the distribution of actual values, we assign the distribution a parametric form. Specifically, we use independent normal densities to describe the variability of s and p values among proteins. The independence assumption for s and p seems reasonable because, among the 1,195 protein families, the correlation between the estimated posterior means for s and p was only about 0.035. The details of our maximum-likelihood procedure can be found in Appendix A.

Figure 1A shows a histogram for the solvent accessibility parameter of the 1,195 posterior means Formula. The estimated normal density of true s values among proteins is a dashed line that is superimposed on the histogram. This estimated distribution has mean 0.735 whereas the mean of the histogram values is 0.771. The standard deviation of the estimated normal density (0.138) is less than the standard deviation of the histogram values (0.292) because the posterior means are error-prone estimates of the true values of s.

Only 11 of the 1,195 posterior means of s (i.e., about 0.9%) are negative. Because some of these posterior means may be negative as a result of estimation error, it is not surprising that an even smaller percentage (less than 0.01%) of the estimated normal density for s is less than zero. Of these 11 protein families with negative posterior means, 9 are actually integral membrane or membrane-associated proteins. The negative s estimates for these 9 families are not surprising because our sequence-structure compatibility measure (Jones, Taylor, and Thornton 1992Go; Jones and Thornton 1996Go) was empirically derived from globular proteins. Thus, this measure was not intended for evaluating integral membrane and membrane–associated proteins. The other two proteins with negative posterior means for s are a lipid–binding protein and a hypothetical protein.

The histogram with the 1,195 posterior means of the p parameter (fig. 1B) has an average of 0.038 and a standard deviation of 0.022, whereas the estimated normal density for p has a mean of 0.034 and a standard deviation of 0.013. The percentage of the estimated normal density for p that is less than 0 is about 0.5%. There are 12 (about 1.0%) protein families for which the estimated posterior mean of p was less than 0. We have not identified a common attribute among these 12 families.

The smaller standard deviation for the estimated normal density of p than for the 1,195 posterior means is expected because the variability among posterior means will be inflated as a result of estimation error. For the same reason, we expected the percentage of the estimated normal density below 0 to be substantially smaller than the percentage of posterior means below 0. The percentages are closer than we expected, and this is evidently attributable to the conflict between the asymmetry in the positively skewed histogram of posterior means for p and the symmetry of the assumed normal density for the distribution of p among protein families. A more flexible distributional form than the normal density may be warranted.

The posterior means for the solvent accessibility parameter estimates are more variable among short than long protein-coding sequences, but the distribution of posterior means of s among short sequences seems to be centered at about the same value as the distribution for long sequences (fig. 2A). This is presumably because short sequences contain less information about s. The posterior means for p are also more variable for short than long sequences (fig. 2B). However, the pattern for p differs from that for s. Long sequences tend to yield positive but relatively low posterior means for p. Short sequences tend to yield either low positive values or higher positive values for p. We are unsure how to explain this pattern, but it seems to be responsible for inducing the positive skew in the histogram of p posterior means (fig. 1B).


Figure 2
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Scatter plots of length of protein-coding DNA sequences against protein tertiary structure impact posterior means: (A) The solvent accessibility parameter s. (B) The pairwise interaction parameter p.

 
Bayes Factors
Rodrigue, Philippe, and Lartillot (2006)Go have performed a careful examination of how a wide range of evolutionary models fit three different multiple-sequence data sets. They conclude that adding sequence-structure compatibility measures substantially improves model fit, but that available sequence-structure measures are not sufficient by themselves to produce a good model fit. To supplement the sequence-structure compatibility measures, these authors find that further model improvements can be achieved via permitting additional rate variation among sites (Yang 1994Go; Yang, Nielsen, and Hasegawa 1998Go; Goldman and Whelan 2002Go) and additional protein-specific variation of amino acid composition (see Cao et al. 1994Go).

The single-sequence nature of our data sets and the associated computational demands prevent us from being able to carefully examine as many evolutionary model variants as did Rodrigue, Philippe, and Lartillot (2006)Go. Nevertheless, we were interested in the improvement in model fit that is contributed by sequence-structure compatibility measures. We compared a model (M1) that sets both the s and p parameters to 0 with a more general model (M2) that does not constrain the s and p values. We compare models by approximating Bayes factors. In our case, the Bayes factor for a protein family k with sequence ik is

Formula (6)
In the above equation, the numerator integrand is the stationary distribution of sequence ik multiplied by the prior density of the s, p, and nucleotide frequency parameters. The denominator integrand is the special case of equation (2), where s = 0 and p = 0 multiplied by the prior density of the nucleotide frequency parameters.

Rodrigue, Philippe, and Lartillot (2006)Go employed a promising technique known as thermodynamic integration (Lartillot and Philippe 2006Go) for Bayes factor approximation. In contrast, we implemented the technique of Chib and Jeliazkov (2001Go; see also Chib 1995Go) to approximate Bayes factors. Details of our implementation strategy are given in Appendix B.

Our Bayes factor approximations show that incorporating s and p improves model fit for the vast majority of our 1,195 protein families (fig. 3A). The approximation is less than 1 and thereby favors the more simple evolutionary model without protein structure (i.e., M1) for only 12 of the 1,195 data sets (about 1.0%). Twice the natural logarithm of the Bayes factor is between 0 and 10 for an additional 61 data sets (about 5.1%). The remaining 1,122 protein families have a Bayes factor approximation with twice the natural logarithm exceeding 10. These results are fully consistent with the idea that tertiary structure should be incorporated into evolutionary models. As the proteins get longer, the Bayes factors tend to increase (fig. 3B and 3C). The positive correlation (0.71) between length and twice the natural logarithm of the Bayes factor is presumably attributable to the fact that longer sequences represent more data.


Figure 3
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Distribution of Bayes factor estimates among proteins: (A) A histogram of twice the natural logarithm of the Bayes factor estimates. The dashed line separates the protein families for which twice the natural logarithm of the Bayes factor is less than 10 from those for which it exceeds 10. (B) Scatter plot of the nucleotide length of protein-coding DNA sequences against twice the natural logarithm of the Bayes factor estimate. (C) Scatter plot of the nucleotide length against the ratio of twice the natural logarithm of the Bayes factor estimate and nucleotide length.

 
The Effects of a Nonsynonymous Change
Because evolutionary rates from sequence i to sequence j are determined by s(Es(i) – Es(j)) and p(Ep(i) – Ep(j)) rather than solely by s and p (see eq. 1), we cannot conclude that solvent accessibility affects protein evolution more than pairwise interactions, simply because our estimates of s tend to be bigger than our estimates of p. We examined the value of s(Es(i) – Es(j)) and p(Ep(i) – Ep(j)) for each possible nonsynonymous substitution that could change an observed sequence i. If natural selection has played a strong role in shaping an observed sequence i, most of these values should be negative.

For the observed protein-coding DNA sequence i, suppose there are Ti possible nonsynonymous substitutions. Let ji,1, ji,2, ..., ji,Ti be the set of protein-coding sequences that are different from i by exactly 1 nonsynonymous change. One assessment of the degree of optimization by natural selection for sequence i is

Formula (7)
where

Formula (8)
and

Formula (9)
If protein tertiary structure has a strong evolutionary impact, the observed sequence i should tend to be more compatible with its structure than neighboring sequences. This would mean that rates of changes to neighboring sequences will tend to be small and the values of {Delta}i, {Delta}s,i, and {Delta}p,i should be less than 0 (see eq. 1).

Our estimates Formula, Formula, and Formula are obtained via Equations (7),(8), and (9), except that the estimated posterior means Formula and Formula for protein family i replace s and p. The mean of the Formula among the 1,195 protein families is about –0.083, the mean for Formula is about –0.091, and the mean for Formula is about –0.174. Figure 4 shows the variation of Formula, Formula, and Formula among protein families. Some of this variation is due to estimation errors. A more direct impression of the variation among protein families in the actual values of {Delta}i, {Delta}s,i, and {Delta}p,i might be obtained through Stein–type shrinkage estimation (Stein 1956Go). The maximum-likelihood estimates of the true distributions of s and p values among proteins (see fig. 1A) could serve as the prior distributions for the si and pi values of each protein family i. The resulting posterior mean estimates of si and pi could then be regarded as empirical Bayes estimates. Plugging the empirical Bayes estimates of si and pi values into Equations (7), (8), and (9) would yield approximations for {Delta}i, {Delta}s,i, and {Delta}p,i. Instead, we decided to focus on variation among the {Delta}i values with the permutation procedures that are described below.


Figure 4
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Average selective effects of nonsynonymous changes: (A) Histograms of Formula, Formula, and Formula. Alternate histogram categories are labelled, and the labels represent the central value of data points in the category. For example, the histogram category designated "0" represents all points in the interval from – 0.025 to 0.025. The leftmost histogram category (labelled "*") represents all points that are less than – 0.475. (B) Scatter plot of Formula versus Formula

 
Gene Ontology, Tertiary Structure, and Evolution
One possibility is that certain protein attributes are associated with particularly strong constraints on evolution due to tertiary structure. To explore this, we employed the Gene Ontology (i.e., GO) database (Ashburner et al. 2000Go; Camon et al. 2004Go). The three fundamental units of this database are "Cellular Component," "Molecular Function," and "Biological Process." Each unit is hierarchically organized into annotation terms, and the terms within each hierarchy that apply to a given protein family are stored in the GO database. To lessen the impact of individual protein families on our investigation of GO terms, we decided to only consider terms for which at least 6 of the 1,195 protein families have been annotated with the term and for which at least 6 have not been annotated with it. For the 209 terms that satisfied this criterion in the GO database on October 11, 2006, the numbers in the "Cellular Component," "Molecular Function," and "Biological Process" divisions were, respectively, 35, 65, and 109. The GO hierarchy places 23 of these 209 terms at the most general "Level 1" rank, and the distributions of the Formula values for these 23 terms is summarized in figure 5. A figure summarizing the distributions for all 209 GO terms is available in the Supplementary Material online.


Figure 5
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 5.— Boxplots of Formula estimates for the 23 Gene Ontology (GO) level-1 terms: the vertical lines indicate the median Formula value among the 456 protein families that were annotated with at least one of the 209 Gene Ontology terms. (A) Cellular Component terms. (B) Molecular Function terms. (C) Biological Process terms.

 
For each of these 209 GO terms, we calculated the sample median of the Formula among protein families associated with the GO term. We refer to these sample medians as "optimization measures" of the GO terms. We expect the optimization measure associated with a GO term to be negative because most possible nonsynonymous substitutions lessen sequence-structure compatibility.

A null hypothesis is that GO attributes are independent of the strength of the structure–evolution relationship. As a test statistic, we use the sample variance among the 209 optimization measures. We expect more variation among optimization measures if annotation and evolutionary influence of structure are related than if they are unrelated. The null distribution of the test statistic is simulated by fixing the GO attributes for each protein family and then calculating the sample variance of the optimization measures after permuting the 456 Formula among protein families that were assigned to at least one of the 209 GO terms. This permutation procedure preserves the complex hierarchical structure among GO terms while simultaneously being consistent with the null hypothesis. We simulated the null distribution of the test statistic by performing the permutation procedure 10,000 times. The proportion of the simulated test statistic values that were greater than the observed value was quite small (0.001). A small proportion (0.001) was also observed when the test statistic was the variance among GO terms of the mean Formula for each GO term. These small proportions suggest that there is an association between GO attributes and optimization measures.

However, the association does not appear to be strong. The null hypothesis that GO attributes are independent of the strength of the structure–evolution relationship could be violated either if certain attributes tend to have unusually high optimization measure values or if certain ones tend to have unusually low optimization values. These two violations have different implications. The alternative hypothesis that tertiary structure has a particularly strong impact on evolution of proteins with certain attributes would lead to unusually low optimization values. Corresponding to another alternative hypothesis, unusually high optimization measures might be generated either if tertiary structure is not particularly important for proteins with some GO attributes or if our evolutionary model is not very appropriate for proteins with some attributes.

Because of the different implications of unusually high versus unusually low optimization measures, we did two additional hypothesis tests. For the alternative hypothesis that leads to low values of optimization measures, our test statistic was the lowest optimization measure value among the 209 GO attributes. The lowest value was about –0.266 and was achieved by "carrier activity." Although this test statistic differs from the previous test that used the variance among optimization measures, our null hypothesis has not changed, and we therefore applied the same permutation procedure. The low value for "carrier activity" does not seem particularly noteworthy because more than 29% of the null distribution for the test statistic was even lower.

To examine the remaining alternative hypothesis, our test statistic was the highest optimization measure value among the GO attributes. This highest value was about –0.071 and was achieved by the GO term "extracellular region part." The proportion of simulated test statistic values that were greater than the observed "extracellular region part" value was 0.0077. Because only 6 protein families were annotated with "extracellular region part," and because there was considerable variation among the Formula estimates for these 6 families, we also looked at a modified test statistic that was the maximum among GO terms of the means of the Formula values for each term. When we used the mean rather than the median {Delta}i estimate for the GO terms, "extracellular region part" had a mean of –0.09844 and was not the GO term with the highest value. Using the maximum of the 209 means rather than the maximum of the 209 medians as the test statistic, 0.1269 of the permuted data sets yielded a higher value than –0.09844. Therefore, we are hesitant to draw strong conclusions about the "extracellular region part" GO term.

The Formula, Formula, and Formula values are computed by comparing observed sequences to sequences that are one nonsynonymous change different. These values therefore assess the local adaptation of an observed sequence relative to its sequence neighborhood. In contrast, our Bayes factor estimates are measures of global adaptation because they represent the overall evidence that tertiary structure influences protein evolution. Twice the natural logarithms of these Bayes factor estimates have an approximately linear relationship with sequence length (fig. 3B) and can be normalized by dividing by the sequence length (fig. 3C). Similar permutation tests to those described above were performed for each GO term when the test statistic was the median of twice the natural logarithm of the Bayes factor normalized for sequence length. The results using these global measures of adaptation were qualitatively similar to those based on the Formula, Formula, and Formula measures of local adaptation and are therefore not detailed here.

We also examined possible connections between GO terms and evolution by forming a 2 x 2 contingency table for each of the 209 GO attributes. Protein families were assigned to contingency table columns according to whether their Formula value was above or below the median Formula value among the 456 protein families that were annotated with at least one of the 209 GO terms. Protein families were assigned to rows according to whether they were (or were not) annotated with the GO term. The null hypothesis that protein annotation is independent of the Formula value was investigated by applying Fisher's exact test. We concentrated on the two-tailed alternative hypothesis that a GO term may be associated either with high or low Formula values.

The result was 209 P values. Because some of these P-values will be low by chance, we selected the lowest of the 209 values as a test statistic for evaluating the null hypothesis that the evolutionary impact of protein structure is independent of annotation. We approximated the sampling distribution of this test statistic with a permutation procedure similar to that described above. We found that 5% of the sampling distribution is less than 0.001165. The four GO terms in the actual data set with a P-value lower than 0.001165 are "intracellular" (P-value of 0.001161), "carbohydrate metabolism" (0.000371), "organelle" (0.000429), and "intracellular organelle" (0.000429). The P-values for "organelle" and "intracellular organelle" are necessarily the same because each is represented by exactly the same 65 families in our data set. These two GO terms and "intracellular" yield relatively low median Formula values, whereas "carbohydrate metabolism" has a median Formula value that is relatively high. Despite the statistical evidence of a relationship between the evolutionary effect of protein structure and these GO terms, we cannot offer a compelling biological explanation for why these terms stand out. Moreover, the strong statistical evidence of a relationship should not be misconstrued as evidence of a strong relationship. The lack of dramatic variation of median Formula values among GO terms argues against a strong relationship (see figure 5 and the Supplementary Material online).


    Discussion
 TOP
 Abstract
 Introduction
 Discussion
 Supplementary Material
 Appendix A
 Appendix B
 Acknowledgements
 References
 
The sequence-structure compatibility method that we adopt was designed for protein fold recognition, and it combines solvent accessibility and pairwise interaction information (Jones, Taylor, and Thornton 1992Go; Jones and Thornton 1996Go). We find strong evidence that incorporating this structural information improves evolutionary models. We also find substantial variation among protein families in the evolutionary impacts of solvent accessibility and pairwise interactions. This suggests that protein fold recognition might be improved if the contributions of solvent accessibility and pairwise interaction information are allowed to vary among protein folds.

Melo, Sánchez, and Sali (2002)Go conclude that both pairwise amino acid interactions and solvent accessibility can make important contributions to successful protein fold recognition, but that pairwise interactions are more important. Prior to this study, importance in evolution of pairwise interactions versus solvent accessibility had been largely unexamined. The slightly more extreme mean value of Formula among the 1,195 protein families hints that pairwise interactions tend to be more important in evolution than does solvent accessibility. However, Formula and Formula have comparable magnitude and the most important message emerging from these analyses is that pairwise interactions and solvent accessibility seem to have roughly comparable impacts on protein evolution.

Although there seems to be a real association between the evolutionary importance of tertiary structure and GO terms, the tendency is not strong. A less parametric study (Yu and Thorne 2006bGo) addressed whether the degree of spatial clustering of amino acid replacements on protein tertiary structure is associated with GO attributes. Strong association was not found in this previous study. We cannot exclude the possibility that associations between evolutionary impacts of tertiary structure and GO terms are strong albeit subtle and can only be found by employing more realistic evolutionary models.

The evolutionary model used in this study would be a very simple codon substitution model except for the s and p parameters that have been added to incorporate protein tertiary structure. Simple codon substitution models have been improved by relaxing the assumption that all codons share the same value of the {omega} parameter (see eq. 1) and therefore the same ratio between nonsynonymous and synonymous rates (Nielsen and Yang 1998Go; Yang and Nielsen 1998Go). These approaches allow different codons to be associated with different {omega} values via a variety of schemes (Yang et al. 2000Go; Yang and Nielsen 2002Go; Yang and Swanson 2002Go). The dependent–sites model could also be modified to permit variation of {omega} values among codons. The stationary distribution of sequences (eq. 2) would not be affected by this modification. This means there is a wide range of codon substitution models that would all yield identical estimates of the s and p parameters from single-sequence data sets.

As with variation of {omega} values among codons, adding variation of evolutionary rates among codons via the discretized gamma technique of Yang (1994)Go would not alter the stationary distribution of sequences. Rate variation parameters and {omega} parameters do not have the close connections to phenotype possessed by the s and p parameters. For interspecific studies, there has been surprisingly little work on explicitly incorporating aspects of phenotype into models of molecular evolution. This is disappointing because the connection between phenotype and sequence change is central to evolution. Only slight modifications to the statistical inference procedure adopted here are necessary for studying the link between evolution and other aspects of phenotype (e.g., see Yu and Thorne 2006aGo).

Because the stationary distribution of sequences is affected by mutation as well as natural selection, one of many worthwhile model improvements would be a more careful treatment of mutation. Recently, several techniques for making evolutionary inferences with context-dependent mutation patterns have been developed (Jensen and Pedersen 2000Go; Pedersen and Jensen 2001Go; Hwang and Green 2004Go; Pedersen et al. 2004Go; Siepel and Haussler 2004Go; Christensen, Hobolth, and Jensen 2005Go). It would be straightforward to incorporate context-dependent mutation into models of protein change.

There is a tension between mutation and natural selection. Mutation is a possible explanation for why observed sequences might not be phenotypically optimal. Another—and not mutually exclusive—possibility is finite evolutionary history. Evolution may not yet have achieved a stationary distribution of sequences. This is particularly likely if the nature of the evolutionary process experiences fundamental changes over time. For example, the optimal phenotypic value might change. For our inferences, we have ignored the possibility that evolution has not achieved a stationary distribution of sequences. If we are correct to ignore this possibility, lack of optimal sequence-structure compatibility would be explained by the balance between mutation and natural selection. It is this assumed balance that allows us to estimate the s and p parameters. Data sets consisting of multiple homologous sequences would contain information that might let the stationarity assumption be evaluated. The question would then be whether the observed sequences are characteristic of the kinds of changes that are inferred to have occurred since the most recent common ancestral sequence existed.

A remaining explanation for why observed sequences may not seem to be phenotypically optimal is simply that the evolutionary model is flawed. Obviously, measures of sequence-structure compatibility could be improved. More importantly, phenotype is not simply a matter of protein structure. Phenotypic effects of a protein also depend on the environment and on the amount and pattern of protein expression.

Differences in the relative rates of specific nonsynonymous changes seem easier to explain with tertiary structure than through levels of protein expression. However, there is a strong association between the expression rate of a protein and its overall rate of nonsynonymous change (Drummond, Raval, and Wilkie 2005Go). Moreover, expression rates in yeast seem to be more strongly correlated with the overall nonsynonymous rate of a gene than does protein structure (Bloom et al. 2006Go).

A satisfactory evolutionary model should reflect the fact that protein expression is phenotype, whereas the regulatory regions that influence expression are genotype. Evolutionary models that connect DNA sequences of regulatory regions and expression phenotypes are possible. Promising strategies for describing evolution of regulatory elements have already been proposed (Berg, Willmann, and Lässig 2004Go; Mustonen and Lässig 2005Go). An improvement upon the probabilistic model described here would have both sequence-structure compatibility measures and in silico predictions of protein expression simultaneously influence the evolution of the coding and regulatory regions of a gene.

Certainly, the evolutionary model used here could be improved in a variety of ways. However, we view the current status of probabilistic evolutionary models in an optimistic light. With model-based evolutionary inference, the evolutionary relevance of different biological features can be statistically evaluated and the most important features can be incorporated into improved evolutionary models. Now that statistical procedures exist for making inferences when phenotype and context-dependent mutation affect evolutionary rates, development of more realistic evolutionary models should proceed at a rapid pace.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Discussion
 Supplementary Material
 Appendix A
 Appendix B
 Acknowledgements
 References
 
Supplementary materials can be found online at the Molecular Biology and Evolution (http://www.mbe.oxfordjournals.org).


    Appendix A
 TOP
 Abstract
 Introduction
 Discussion
 Supplementary Material
 Appendix A
 Appendix B
 Acknowledgements
 References
 
To explain our maximum-likelihood estimation procedure for inferring the distribution of s and p values among proteins, we introduce additional notation. The 1,195 sequences will be denoted as i, whereas the 1,195 true values of s and p will be s = {s1, s2, ..., s1195} and p = {p1, p2, ..., p1195}. Here, si(g) and pi(g) will refer to the gth of G approximately independent samples from the joint posterior density of si and pi. For the 1,195 single-sequence analyses, G was 1,000. The mean and variance of the normal distribution of s values among proteins will be collectively denoted {alpha}, whereas the mean and variance of the normal distribution for p values among proteins will collectively be ß. The uniform prior distributions for s and p will be referred to as {alpha}0 and ß0.

The goal of the procedure is to find the values of {alpha} and ß that maximize the likelihood p(i|{alpha},ß) = p(i1, i2, ..., i1195|{alpha},ß). The likelihood approximation exploits the assumption that the probability of observing a sequence for a specific protein family depends on the s and p values of the family but, conditional upon the s and p values for the family, the probability of observing a sequence is independent both of the prior density for s and p and of the distribution of s and p among protein families. In other words,

Formula (10)

Formula (11)
Numerical optimization routines can find the {alpha} and ß that maximize the above approximate likelihood. These routines need not consider the factor p(i|{alpha}00) because it is not a function of {alpha} or ß. For our analyses, the simplex routine of Nelder and Mead (1965)Go as implemented by the GNU Scientific Library (Galassi et al. 2005Go) has proven satisfactory for numerical optimization.

The uniform prior densities specified by {alpha}0 and ß0 constrain s and p to a finite range, whereas the normal densities specified by {alpha} and ß have no such constraint. This difference in support between {alpha}0 and ß0 versus {alpha} and ß has the potential to affect inference. However, the difference in support has no important consequences for our analyses because the posterior densities for s and p have little mass near the boundaries of the uniform priors that are set by {alpha}0 and ß0.


    Appendix B
 TOP
 Abstract
 Introduction
 Discussion
 Supplementary Material
 Appendix A
 Appendix B
 Acknowledgements
 References
 
In this appendix we describe how to approximate the Bayes factor using Chib (1995)Go and Chib and Jeliazkov (2001)Go. Recall from equation (6) that the Bayes factor is given by

Formula
Chib (1995)Go notes that

Formula
so that for any {theta} we have

Formula (12)
Chib (1995)Go assumes the two terms in the numerator of equation (12) are easy to calculate and shows how to estimate the denominator from output of a Gibbs sampler. Chib and Jeliazkov (2001)Go extend Chib (1995)Go to allow estimating the denominator from Metropolis-Hastings output.

Our case is more complicated because the first term in the numerator, the stationary distribution of a sequence, cannot be calculated directly. However, ratios of stationary distributions can be approximated using Robinson et al. (2003)Go. If we let {theta}1 = {pi}1 and {theta}2 = (s2, p2, {pi}2) and apply equation (12), we obtain the Bayes factor

Formula (13)
The first term is a ratio of stationary distributions and can be approximated using Robinson et al. (2003)Go. The second term, the ratio of the priors of the two models, is also easily calculated. The last term is approximated using the Metropolis-Hastings output, as described in Chib and Jeliazkov (2001)Go.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Discussion
 Supplementary Material
 Appendix A
 Appendix B
 Acknowledgements
 References
 
We thank D. Jones for guidance. We also thank A. von Haeseler and two anonymous reviewers for their input. This research was supported by National Institutes of Health (NIH) grant GM070806 and National Science Foundation (NSF) grant D.E.B-0445180. S. C. Choi was additionally supported by Korea Science and Engineering Foundation grant M06-2003-000-10086-0. A. Hobolth was additionally supported by Danish Research Council grant 21-04-0375. H. Kishino was supported by the Japan Society for the Promotion of Science (JSPS) (SR B-16300086).


    Footnotes
 
Arndt von Haeseler, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Discussion
 Supplementary Material
 Appendix A
 Appendix B
 Acknowledgements
 References
 

    Ashburner M, Ball CA, Blake JA. (20 co-authors). Gene ontology: tool for the unification of biology. Nat Genet. (2000) 25:25–29.[CrossRef][Web of Science][Medline]

    Berg J, Willmann S, Lässig M. Adaptive evolution of transcription factor binding sites. BMC Evol Biol. (2004) 4:42.[CrossRef][Medline]

    Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. (2000) 28:235–242.[Abstract/Free Full Text]

    Bloom JD, Drummond DA, Arnold FH, Wilkie CO. Structural determinants of the rate of protein evolution in yeast. Mol Biol Evol. (2006) 23:1751–1761.[Abstract/Free Full Text]

    Boeckmann B, Bairoch A, Apweiler R. (12 co-authors). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. (2003) 31:365–370.[Abstract/Free Full Text]

    Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. (2004) 32:D262–266.[Abstract/Free Full Text]

    Cao Y, Adachi J, Janke A, Pääbo S, Hasegawa M. Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J Mol Evol. (1994) 39:519–527.[Web of Science][Medline]

    Chib S. Marginal likelihood from the Gibbs output. J Am Stat Assoc. (1995) 90:1313–1321.[CrossRef][Web of Science]

    Chib S, Jeliazkov I. Marginal likelihood from the Metropolis-Hastings output. J Am Stat Assoc. (2001) 96:270–281.[CrossRef][Web of Science]

    Chothia C, Lesk AM. The relation between the divergence of sequence and structure in proteins. EMBO J (1986) 5:823–826.[Web of Science][Medline]

    Christensen O, Hobolth A, Jensen J. Pseudo-likelihood analysis of codon substitution models with neighbor-dependent rates. J Comput Biol. (2005) 12:1166–1182.[CrossRef][Web of Science][Medline]

    Drummond DA, Raval A, Wilkie CO. A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol. (2005) 23:327–337.[CrossRef][Web of Science][Medline]

    Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Flores TP, Orengo CA, Moss DS, Thornton JM. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. (1993) 2:1811–1826.[Web of Science][Medline]

    Galassi M, Davies J, Theiler J, Gough B, Jungman G, Booth M, Rossi F. GNU Scientific Library Reference Manual (2005) Network Theory Ltd.

    Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. Chapman & Hall/CRC (2003).

    Goldman N, Whelan S. A novel use of equilibrium frequencies in models of sequence evolution. Mol Biol Evol. (2002) 19:1821–1831.[Abstract/Free Full Text]

    Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. (1994) 11:725–736.[Abstract]

    Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika (1970) 57:97–109.[Abstract/Free Full Text]

    Hobohm U, Scharf M, Schneider R, Sander C. Selection of a representative set of structures from the Brookhaven Protein Data Bank. Protein Sci. (1992) 1:409–417.[Web of Science][Medline]

    Hwang DG, Green P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci USA (2004) 101:13994–14001.[Abstract/Free Full Text]

    Jensen JL, Pedersen AMK. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv Appl Probab (2000) 32:499–517.[CrossRef]

    Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature (1992) 358:86–89.[CrossRef][Medline]

    Jones DT, Thornton JM. Potential energy functions for threading. Curr Opin Struc Biol. (1996) 6:210–216.[CrossRef][Web of Science][Medline]

    Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers (1983) 22:2577–2637.[CrossRef][Web of Science][Medline]

    Kanz C, Aldebert P, Althorpe N. (32 co-authors). The EMBL Nucleotide Sequence Database. Nucleic Acids Res. (2005) 33:D29–D33.[Abstract/Free Full Text]

    Kelley LA, Sutcliffe MJ. OLDERADO: On-line database of ensemble representatives and domains. Protein Sci. (1997) 6:2628–2630.[Web of Science][Medline]

    Lartillot N, Philippe H. Computing Bayes factors using thermodynamic integration. Syst Biol. (2006) 55:195–207.[CrossRef][Medline]

    Martin AC. Mapping PDB chains to UniProtKB entries. Bioinformatics (2005) 21:4297–4301.[Abstract/Free Full Text]

    Melo F, Sánchez R, Sali A. Statistical potentials for fold assessment. Protein Sci. (2002) 11:430–448.[CrossRef][Web of Science][Medline]

    Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. J Chem Phys. (1953) 21:1087–1092.[CrossRef]

    Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates with application to the chloroplast genome. Mol Biol Evol. (1994) 11:715–724.[Abstract]

    Mustonen V, Lässig M. Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci USA (2005) 102:15936–15941.[Abstract/Free Full Text]

    Nelder JA, Mead R. A simplex method for function minimization. Comput J (1965) 7:308–313.

    Nielsen R, Yang Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics (1998) 148:929–936.[Abstract/Free Full Text]

    Pedersen AMK, Jensen JL. A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol Biol Evol. (2001) 18:763–776.[Abstract/Free Full Text]

    Pedersen JS, Forsberg R, Meyer IM, Hein J. An evolutionary model for protein-coding regions with conserved RNA structure. Mol Biol Evol. (2004) 21:1913–1922.[Abstract/Free Full Text]

    Rambaut A, Drummond A. (2003) Tracer v1.3 URL http://evolve.zoo.ox.ac.uk.

    Robinson DM. D.R. EVOL: Three Dimensional Realistic Evolution. In: Ph.D. thesis, Bioinformatics (2003) North Carolina State University.

    Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. (2003) 20:1692–1704.[Abstract/Free Full Text]

    Rodrigue N, Lartillot N, Bryant D, Philippe H. Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene (2005) 347:207–217.[CrossRef][Web of Science][Medline]

    Rodrigue N, Philippe H, Lartillot N. Assessing site-interdependent phylogenetic models of sequence evolution. Mol Biol Evol. (2006) 23:1762–1775.[Abstract/Free Full Text]

    Russell RB, Saqi MAS, Sayle RA, Bates PA, Sternberg MJE. Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J Mol Biol. (1997) 269:423–439.[CrossRef][Web of Science][Medline]

    Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. (2004) 21:468–488.[Abstract/Free Full Text]

    Stein C. Inadmissibility of the usual estimator for the mean of a multivariate distribution. Proc Third Berkeley Symp Math Statist Prob (1956) 1:197–206.[Medline]

    Yang Z. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. (1994) 39:306–314.[CrossRef][Web of Science][Medline]

    Yang Z, Nielsen R. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J Mol Evol. (1998) 46:409–418.[CrossRef][Web of Science][Medline]

    Yang Z, Nielsen R. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol Biol Evol. (2002) 19:908–917.[Abstract/Free Full Text]

    Yang Z, Nielsen R, Goldman N, Pedersen AMK. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics (2000) 155:431–449.[Abstract/Free Full Text]

    Yang Z, Nielsen R, Hasegawa M. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol. (1998) 15:1600–1611.[Abstract]

    Yang Z, Swanson WJ. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol Biol Evol. (2002) 19:49–57.[Abstract/Free Full Text]

    Yu J, Thorne JL. Dependence among sites in RNA evolution. Mol Biol Evol. (2006a) 23:1525–1537.[Abstract/Free Full Text]

    Yu J, Thorne JL. Testing for spatial clustering of amino acid replacements within protein tertiary structure. J Mol Evol. (2006b) 62:682–692.[CrossRef][Web of Science][Medline]

Accepted for publication May 11, 2007.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
N. Rodrigue, C. L. Kleinman, H. Philippe, and N. Lartillot
Computational Methods for Evaluating Phylogenetic Models of Coding Sequence Evolution with Dependence between Codons
Mol. Biol. Evol., July 1, 2009; 26(7): 1663 - 1676.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
M. Anisimova and C. Kosiol
Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models
Mol. Biol. Evol., February 1, 2009; 26(2): 255 - 271.
[Abstract] [Full Text] [PDF]


Home page
Phil Trans R Soc BHome page
S. C. Choi, B. D Redelings, and J. L Thorne
Basing population genetic inferences and models of molecular evolution upon desired stationary distributions of DNA or protein sequences
Phil Trans R Soc B, December 27, 2008; 363(1512): 3931 - 3939.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
N. Rodrigue, N. Lartillot, and H. Philippe
Bayesian Comparisons of Codon Substitution Models
Genetics, November 1, 2008; 180(3): 1579 - 1591.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
Z. Yang and R. Nielsen
Mutation-Selection Models of Codon Substitution and Their Use to Estimate Selective Strengths on Codon Usage
Mol. Biol. Evol., March 1, 2008; 25(3): 568 - 579.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
24/8/1769    most recent
msm097v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Choi, S. C.
Right arrow Articles by Thorne, J. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Choi, S. C.
Right arrow Articles by Thorne, J. L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?