Skip Navigation


MBE Advance Access originally published online on September 28, 2007
Molecular Biology and Evolution 2007 24(12):2632-2647; doi:10.1093/molbev/msm190
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrowOA All Versions of this Article:
24/12/2632    most recent
msm190v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Saunders, C. T.
Right arrow Articles by Green, P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Saunders, C. T.
Right arrow Articles by Green, P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Authors.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Research Articles

Insights from Modeling Protein Evolution with Context-Dependent Mutation and Asymmetric Amino Acid Selection

Christopher T. Saunders* and Phil Green*,{dagger}

* Department of Genome Sciences, University of Washington
{dagger} Howard Hughes Medical Institute, Seattle, WA

E-mail: ctsa{at}u.washington.edu or phg{at}u.washington.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Appendix
 Acknowledgements
 References
 
We develop an approximate maximum likelihood method to estimate flanking nucleotide context-dependent mutation rates and amino acid exchange-dependent selection in orthologous protein-coding sequences and use it to analyze genome-wide coding sequence alignments from mammals and yeast. Allowing context-dependent mutation provides a better fit to coding sequence data than simpler (context-independent or CpG "hotspot") models and significantly affects selection parameter estimates. Allowing asymmetric (nonreciprocal) selection on amino acid exchanges gives a better fit than simple dN/dS or symmetric selection models. Relative selection strength estimates from our models show good agreement with independent estimates derived from human disease-causing and engineered mutations. Selection strengths depend on local protein structure, showing expected biophysical trends in helical versus nonhelical regions and increased asymmetry on polar–hydrophobic exchanges with increased burial. The more stringent selection that has previously been observed for highly expressed proteins is primarily concentrated in buried regions, supporting the notion that such proteins are under stronger than average selection for stability. Our analyses indicate that a highly parameterized model of mutation and selection is computationally tractable and is a useful tool for exploring a variety of biological questions concerning protein and coding sequence evolution.

Key Words: protein evolution • amino acid selection • context-dependent mutation • phylogenetic analysis • protein expression • protein structure


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Appendix
 Acknowledgements
 References
 
A better understanding of protein evolution would illuminate both protein molecular biology and the diversification of species and populations. Developing adequate evolutionary models remains, however, a daunting challenge, in part because the core processes of mutation and selection are quite complex. Mutation rates depend, for example, on flanking nucleotide context (Sved and Bird 1990Go; Blake et al. 1992Go; Hess et al. 1994Go; Morton et al. 1997Go); in mammals, the strongest such effect is the elevated mutation rate due to methyl cytosine deamination at CpG dinucleotides (Ehrlich and Wang 1981Go), but other context effects, such as an elevated rate at A nucleotides flanked by 2 pyrimidines, are also quantitatively important (Blake et al. 1992Go). Selection on nonsynonymous mutations depends upon the specific amino acids that are exchanged, reflecting differences in both physicochemical properties (Grantham 1974Go) and metabolic cost (Akashi and Gojobori 2002Go) and is moreover asymmetric in the sense that average selection strength on A -> B substitutions (where A and B are amino acids) may differ from that on B -> A substitutions (Yampolsky et al. 2005Go).

Models of coding sequence evolution, like models of any complex phenomenon, must make numerous simplifying assumptions to be tractable. Ultimately, however, it is important to incorporate as much biological complexity as possible, not only because failure to do so can bias analysis results but also because fully understanding this complexity is an important biological goal. Increasing model complexity poses 2 challenges: more parameters must be estimated, which requires larger training data sets and greater attention to ensure that the parameter space has been adequately scanned; and simplifying but biologically unjustified mathematical assumptions must be removed, which can require developing new algorithms to restore computational feasibility. An example of the latter challenge is that mutational context dependency violates the mathematically convenient assumption that nucleotide segments (single nucleotides or codons) evolve independently.

Here, we develop a model of coding sequence evolution that incorporates context-dependent mutation rates and asymmetric exchange-dependent selection on amino acid substitutions and use it to estimate mutation and selection parameters from genome-wide coding sequence alignments by likelihood-based phylogenetic analysis. Previous models have incorporated some but not all of these features. Many protein evolution models have confounded mutation and selection, making biological interpretation of the model parameters problematic: these include models used to generate sequence alignment matrices (Dayhoff et al. 1978Go; Henikoff and Henikoff 1992Go; Jones et al. 1992Go) and empirical maximum likelihood models of amino acid (Yang et al. 1998Go; Whelan and Goldman 2001Go) and codon (Kosiol et al. 2007Go) substitution. Methods that do deconvolute mutation and selection often represent selection by a single dN/dS parameter (Goldman and Yang 1994Go; Muse and Gaut 1994Go). Yang et al. (1998Go) developed codon substitution models allowing for symmetric amino acid exchange-dependent selection, with selection parameters either taken from a published "physicochemical distance" matrix or estimated from the data. They found that both exchange-dependent models outperform the use of a single dN/dS and that estimated selection parameters outperform the best physicochemical distance matrix. More recently, Tang et al. (2004Go) used a nonlikelihood-based method to estimate symmetric selection parameters, finding them to be similar across diverse organisms. None of the above models allow for context-dependent mutation.

A few studies have considered asymmetric selection. The EX matrix (Yampolsky and Stoltzfus 2005Go) quantifies asymmetric amino acid exchangeability from assays of engineered mutations. This approach may fail to capture the full spectrum of selection acting on amino acid exchanges in vivo, and it can only use data from a limited set of genes and organisms. It does not estimate mutation rates. Yampolsky et al. (2005Go) combined genetic disease mutation, polymorphism, and divergence data to estimate the distribution of selection coefficients for each type of amino acid exchange in recent human evolution, allowing different mutation rates at CpG dinucleotides as estimated from neutrally evolving sequence but not other types of context effects. Their approach does not appear applicable to broader evolutionary scales.

Several groups have developed context-dependent mutation models for neutrally evolving sequence (Arndt et al. 2003Go; Hwang and Green 2004Go; Jojic et al. 2004Go; Lunter and Hein 2004Go; Siepel and Haussler 2004Go; Arndt and Hwa 2005Go), but coding sequence models have been more limited. A reversible model that incorporates CpG-dependent mutation rates has been developed for analyzing coding sequence pairs (Jensen and Pedersen 2000Go; Christensen et al. 2005Go). Siepel and Haussler (2004Go) adapt their context-dependent models to coding sequence but using an approach that confounds mutation and selection. The Markov Chain Monte Carlo (MCMC) approach of Hwang and Green (2004Go) can be extended to incorporate selection, and we have recently done this (Hwang and Green, unpublished data). In comparison to the approach presented here, the MCMC method has the advantage of using an exact rather than approximate probability model, but it is computationally less efficient for large-scale data analysis and less easily adapted to likelihood-based applications.

Our approach derives context-dependent mutation rates and exchange-specific selection parameters from likelihood-based phylogenetic analysis of homologous coding sequences. The model we implement analyzes the evolution of "extended codons," which are contiguous 4-nt segments of coding sequence consisting of a codon together with the immediately following (3') nucleotide. Compared with codons, using extended codons substantially improves power to estimate context-dependent mutation rates and to deconvolute selection and mutation, largely because context for synonymous sites is included; in single codon models, context-dependent rates are only allowed at middle codon positions, where they are confounded with selection. Note, for example, that the elevated CpG mutation rate in mammals complicates estimation of selection on Thr(ACG) to Met(ATG) changes, with failure to account for mutational context potentially causing artifactual inferences of weak or positive selection. Substantial distortions in selection values can also arise from neglecting other (non-CpG) context effects, as we illustrate in this study.

We drop the assumption that sequences are in compositional equilibrium (stationarity), which has typically been made for mathematical convenience in previous work. Stationarity implies that sequence composition is invariant throughout the phylogenetic tree, which is incompatible with observations that isochores are decaying in mammals (Duret et al. 2002Go; Belle et al. 2004Go). Stationarity is usually enforced by constraining the rate matrix to be reversible and the sequence composition at the root to equal that of present day organisms. We drop both constraints. To avoid the computationally prohibitive task of searching a large space of possible root distributions, we have developed a novel procedure (the "per-branch least squares [BLS] method") for estimating a root distribution by backwards evolution along the tree.

Another technical innovation in our work is a "subgraph-caching" variant of the Felsenstein (1973) site likelihood calculation algorithm, which reuses the results of intermediate stages of the calculation. This is related to previously described subtree-caching procedures (Kosakovsky Pond and Muse 2004Go) but is substantially more efficient for analyzing long (genomic) alignments of sequences of codons or extended codons.

Our methods are in principle applicable to data sets involving any number of species or genes but are particularly suited to genome-scale analyses of a small number of organisms (up to approximately 5 mammalian genomes). Analyses in this paper pertain to 3 mammals (human, mouse, and rat) and 3 Saccharomyces species. For both data sets, nonstationary models that include context-dependent mutation and asymmetric selection yield significant likelihood improvements over simpler models. Mutation rate and selection parameter estimates both show a substantial dependence on the specific mutation model assumed, with selection estimates varying by 50% or more between models. Selection estimates from models incorporating context-dependent mutation and asymmetric selection give the best correlation to independent exchangeability estimates derived from human disease-causing and engineered mutations.

We also examine the dependence of selection estimates on protein structure. Selection patterns in helical regions recapitulate known biophysical trends. Polar {leftrightarrow} hydrophobic exchanges display a selection asymmetry which increases with degree of burial in the protein structure, quantifying the action of selection to maintain the hydrophobic core. We also show that the increased strength of selection acting on highly expressed genes relative to low expression genes is disproportionately biased toward buried regions of the protein, consistent with the idea that highly expressed proteins are under stronger pressure to fold properly.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Appendix
 Acknowledgements
 References
 
Model
We represent protein-coding sequences as sequences of "states" from an alphabet {Sigma} = {a1, a2, ...aN} of size N; the 2 alphabets used in this study are CODON, the set of sense codons of the standard genetic code (N = 61) (Goldman and Yang 1994Go; Muse and Gaut 1994Go), and C4+, the set of 4-mer sequences consisting of 3 sense codon nucleotides together with the first nucleotide of the next (3') codon (N = 244).

We model sequence evolution on a rooted bifurcating phylogenetic tree {tau} as a (possibly nonreversible) continuous-time homogeneous Markov substitution process with an N-dimensional vector {pi} of state frequencies at the root node and an N x N–dimensional rate matrix Q = {qi,j}, where qi,j is the instantaneous rate of change between states ai and aj. The qi,j are assumed to be of the form

Formula
where Formula represents (context-dependent) mutation rates, Formula represents selection on amino acid changes, and {delta}(i,j) is the number of aligned nucleotide differences between states ai and aj. Note that state substitutions can only occur via single nucleotide changes and the matrix diagonal is set such that each row sums to 0. We describe below several specific forms for Formula and Formula

In light of the fact that genome sequences are not at compositional equilibrium (Duret et al. 2002Go; Belle et al. 2004Go), we permit the root frequencies {pi} to differ from those of present day sequences. This allows for the possibility of changes in the evolutionary process having occurred, but only prior to the last common ancestor of the analyzed sequences (a single Q is assumed to apply on all branches since the time of the root).

In our analyses, the topology of the tree {tau} is assumed known. The set G of tree nodes consists of a set L of leaf nodes and a set H = G \ L (the set of elements of G that are not in L) of internal nodes; the set of branch lengths is denoted β.

We use {theta} = ({tau}, β, Q, {pi}) to denote a specific model. Model parameters are estimated from a gap-free alignment of L nucleotide sequences, each of length J, in which aligned nucleotides are considered homologous, and each observed sequence corresponds to a leaf node in the tree (so L = |L|). Each nucleotide sequence is mapped to a corresponding sequence of I = J/3 states from the alphabet {Sigma} (note that when {Sigma} = C4+, a single nucleotide can map to 2 successive states in the corresponding state sequence). The observed state sequences at L are collectively referred to as the known data D, and the ancestral state sequences at H are referred to as the missing data M.

Let di,l denote the observed state at site i (1 ≤ i ≤ I) and leaf node Formula and let mi,h denote the ancestral state at site i and node Formula The probability of the set Formula of observed aligned states at site i is a sum over all possible values of the missing data Formula at that site:

Formula (1)
Calculation of terms on the right requires the substitution probability matrix P(t) = {P(aj|ai,t)}i,j, where P(aj|ai,t) is the probability that state ai has changed to aj after time t. We calculate P(t)=eQt either from the Taylor series expansion Formula or from the eigendecomposition eQt = Ue{Lambda}tU–1, where {Lambda} = diag({lambda}1, {lambda}2 ... {lambda}N) and U satisfy Q = U{Lambda}U–1. The site probability equation (1) can be calculated using the Felsenstein (1973) pruning algorithm. We take advantage of particular features of the data analyzed in this study (relatively long alignments from a small set of species) by using a subgraph-caching variant of the Felsenstein algorithm, which stores and reuses values obtained at intermediate steps in calculating equation (1), namely, the conditional (on internal node state) probabilities of the observed data in subgraphs of the tree. This method (which is more fully described in the Appendix) is related to previously described subtree-caching methods (Kosakovsky Pond and Muse 2004Go) but is substantially more efficient for small trees because the largest possible subgraph which includes reusable information is often not a subtree.

To define the data probability for the full alignment allowing for possible state overlaps, we view (following Siepel and Haussler [2004Go]) the alignment as an instance of a first order Markov chain (in reversed order) whose probability is given by a product of transition probabilities

Formula (2)

Note that this only approximates a full context-dependency model, which requires a Markov field to represent (Hwang and Green 2004Go). For reasons of computational efficiency, we make additional simplifying assumptions as follows. For the CODON state space, where states do not overlap, the site probabilities are assumed to be independent, so P(di|di+1, {theta}) is replaced by P(di|{theta}) and equation (2) reduces to P(D|{theta}) = prodiP(di|{theta}).

For the C4+ state space, adjacent states cannot be considered even approximately independent because they share a nucleotide, and we instead proceed as follows. The C4+ state di,l at sequence position i and leaf node l can be represented as di,l = (ci,l, ai,l) where ci,l = (n1, n2, n3)i,l is the ith codon, ai,l = (n4)i,l is the adjacent nucleotide from the neighboring codon, and nk is the kth nucleotide in the C4+ state. The set of aligned observed states at site i may be represented di = (ci,ai), where ci = {ci,l|l isin L} and ai = {ai,l|l isin L}, and the site probability P(di|{theta}) then equals the joint probability P(ci,ai|{theta}). Because ai is the "left edge" of di+1, we approximate P(di|di+1, {theta}) by

Formula
Here, Q is the set of all possible alignments of L codons. To compute P(ai|{theta}) efficiently, we adapt the Felsenstein algorithm used to handle an unknown leaf state and sum over {(c,ai,l)|c isin CODON} for site i and leaf node l.

The likelihood equation (2) is then approximated by the product of the conditional codon probabilities: Formula.

We explore the consequences of making the above approximations by running the model on simulated sequences (see below).

Root Distribution
As noted above, we wish to allow the root state distribution {pi} to differ from that of the leaves. Unfortunately, a full search of the parameter space required to represent {pi} is in general impractical (for models using the C4+ state space, this space has dimension 243). Instead, we estimate {pi} by reversing the state substitution process along each branch of the tree (given the current estimates of β and Q), starting from the observed state distributions at leaf nodes. We denote this as the BLS root distribution estimation method, implemented as follows.

For any nonroot node n in the tree, let {delta}n denote the row vector representing the estimated state distribution at n, {sigma}(n) the parent of n, and βn the length of the branch connecting n and {sigma}(n). We wish to estimate the distribution {delta}{sigma}(n) at {sigma}(n). For the true model parameters and state distributions, the relation {delta}n = {delta}{sigma}(n)Pn holds, where Formula Accordingly, let fB(Q, βn, {delta}n) denote the estimate for {delta}{sigma}(n) that results in the least-squared error between these values: Formula where the minimization is over probability distributions {delta}. This calculation is a special case of the nonnegative least squares problem, solvable by the NNLS algorithm (Lawson and Hanson 1974Go), but because we require increased computational efficiency (because {pi} must be reestimated every time the model parameter values change) we instead proceed as follows. Let Formula. If w is a probability distribution we take fB(Q, βn, {delta}n) = w; otherwise, we find fB(Q, βn, {delta}n) by conjugate gradient minimization (Press et al. 1992Go) starting from the initial value w', where w' is the probability distribution obtained by setting all negative values of w to zero and scaling the remaining terms to sum to one.

We can now find the root distribution Formula where r is the root node, by the following recursion. For leaf nodes Formula we take {delta}n to be the observed state distribution at n and assign a weight wn equal to the observed state sequence length. For interior nodes Formula we recursively compute

Formula
where A(n) denotes the set of child nodes of n.

The method could probably be improved by optimizing a least squares fit against all observed data simultaneously rather than separately for each tree branch.

Mutation Rates
We allow mutation rates to depend upon the flanking nucleotides, denoting the rate at any position by µ(n, n', n, n+), where n and n', respectively, are the original and mutant nucleotides and n and n+ are the nucleotides immediately preceding and following the mutated position.

We consider 4 possibilities for µ. NONREV is the context-independent nonreversible 12-parameter nucleotide model (Yang 1994Go): µNONREV(n, n',n,n+) = rn,n', n != n' CPG adds 2 additional parameters to represent CpG transitions and transversions:

Formula
(Relative to other sites, CpGs display increased transversion as well as transition mutation rates [Hess et al. 1994Go].)

FACTORED-TRIPLET treats the preceding and following nucleotide context effects as independent multipliers: µFACTORED-TRIPLET(n, n', n, n+) = f-(n,n', n)f+(n, n', n+), where the partial-context functions f-(n, n', n) and f+(n, n', n+) use one parameter for every combination of arguments for which Formula 96 in all. Finally, TRIPLET allows a separate rate for each n, n',n,n+ where Formula requiring 192 parameters in all. Note that we do not impose a symmetry (reverse complement) condition on the rates because this is known to fail in transcribed sequence (Green et al. 2003Go).

For nucleotides at the left or right edge of a site, one of the flanking nucleotides is not contained within the site. In this case, we approximate the rate at the edge position as a weighted average over all possible nucleotides at the missing position. For example, for nucleotides n at the right edge of the site, we calculate

Formula
where the edge nucleotide distribution function Formula specifies the probability that nucleotide x follows n when n is at the right edge of the site, and N is the set of nucleotides. For nucleotides at the left edge of the site, the corresponding edge distribution function Formula gives the probability that x precedes n. For computational simplicity Formula and Formula are assumed to be independent of the phylogenetic branch and are calculated from the observed coding sequences. We tried allowing branch-specific values of Formula and Formula iteratively derived using state distributions at internal tree nodes as calculated by the BLS procedure, but found that this did not significantly improve the likelihood or parameter estimation accuracy on simulated sequences (data not shown).

We define the context-independent mutation rate associated to µ as

Formula
where the nucleotide frequencies Pnuc are calculated from observed sequences. We use the context-independent mutation rate to scale all mutation parameters such that Formula . The scaling implies that the actual number of free parameters in each mutation model is one less than the value given above. Using this scaling, the branch lengths β approximate the number of mutations per nucleotide position along the branch (the correspondence would be exact, if there were no mutational context dependency and the rate matrix were reversible). To more accurately estimate the number of neutral mutations per nucleotide on each branch, we simulate sequence evolution on the tree and record the number of substitutions occurring per nucleotide before selection is applied.

Selection on Amino Acid Substitutions
Selection on nonsynonymous mutations is represented as a multiplicative factor applied to the mutation rate:

Formula
where c and c', respectively, denote the starting and mutated codons, {alpha} maps codons to amino acids, and S is the set of stop codons. The nonsynonymous selection term {omega}(i,j) defines the multiplicative factor for mutations from amino acid i to j, where Formula and (i,j) is restricted to the set of 150 ordered pairs of amino acids having one or more codons that differ by a single nucleotide change. The SINGLE parameterization of {omega} uses the same value for all (i,j), corresponding to the dN/dS value customarily used in codon models. SYMMETRIC imposes {omega}(i,j) = {omega}(j,i) but allows a separate parameter value for each unordered pair (75 in all). Finally, ASYMMETRIC allows a separate parameter for each of the 150 ordered pairs, so that the strength of selection on reciprocal amino acid exchanges may be unequal.

The above formula needs to be modified slightly to represent selection acting on the final nucleotide n4 of the C4+ state because the codon containing this nucleotide is not specified by the state. To do this, we approximate the selection coefficient as a weighted average over all possible "completions" of the partial codon. Let n be the original nucleotide, n' be the mutated nucleotide, and (n, na, nb) the codon with nucleotides n, na, nb at position 1–3. The partial codon selection factor is

Formula
where the codon frequencies Pcodon are calculated from all observed sequences.

Summary Rates and Selection Factors
It is useful to define summary substitution rates q or selection factors {omega} between sets of amino acids (e.g., selection on hydrophobic to polar exchanges). We do this as follows.

Define the flux from state i to a different state j as {rho}ij = P(i)qij, where P(i) is the frequency of state i in observed sequences and qij is the substitution rate between states i and j, as defined above. The flux between 2 sets of states A and B is Formula and the summary substitution rate between A and B is Formula Given 2 amino acids, we can take A and B to be the sets of all states which translate to those amino acids, in which case qAB is called the summary substitution rate between the amino acids.

Summary selection factors are defined similarly. We now define flux between 2 sets of states A and B as Formula where {alpha}(i) is the amino acid translation of state i (for the C4+ alphabet, this is the translation of the completely embedded codon); in other words, only changes between nonsynonymous states are considered. We define the mutation rate Formulaij between states i and j, as the substitution rate obtained when all {omega}(i,j) are set equal to 1, and the corresponding mutation flux as Formulaij = P(i)Formulaij. The mutation flux between state sets A and B is restricted to nonsynonymous changes Formula The summary selection factor for exchanges between sets A and B is then Formula For A = B = {Sigma} (the complete state alphabet), the procedure yields the conventional dN/dS value.

Maximum Likelihood Estimation
We seek the model {theta} = ({tau}, β, Q, {pi}) assigning the highest probability to the observed data: Formula In our analyses, we typically assume a particular tree {tau} a priori and use the BLS method to estimate {pi} given Q, β and D, so likelihood maximization is reduced to a search over the parameters of Q and β. However, this still can be a formidable task: for the most elaborate parameterization of Q, there are 191 mutation parameters and 150 selection parameters, in addition to 2L-2 branch length parameters.

Likelihood maximization is performed using Powell's conjugate direction method as described in (Press et al. 1992Go), with the following modifications to improve efficiency: we initially run 10 Polak–Ribiere conjugate gradient iterations to improve starting parameter estimates, and at the end of each search iteration, we perform a line maximization in the direction {theta}n{theta}n – 1 (where {theta}n is the current iteration parameter vector), even when the conjugate direction inclusion criteria are not met. We tried likelihood maximization using both a quasi-Newton algorithm (BFGS) and the conjugate-gradient method and found them to be slower and to yield lower maxima. The EM algorithm (Dempster et al. 1977Go), as applied in Siepel and Haussler (2004Go), was also found to be less effective, in part because the subgraph-caching algorithm (see Appendix) cannot be efficiently applied to the EM expectation step.

Confidence Interval Calculation
Finding confidence intervals (CIs) by inversion of the Fisher information matrix (Sorensen and Gianola 2002aGo) produces unstable results for our highly parameterized models, and we instead use a more robust procedure based on inverting the log likelihood ratio (Sorensen and Gianola 2002bGo). Denote the model parameters by {theta} = {{theta}1, {theta}2, ..., {theta}j}, the likelihood function by L({theta}) = P(D|{theta}), and define the profile likelihood function for {theta}j as Formula where {theta}-j = {theta}\{{theta}j}, that is, by maximizing over all parameters other than {theta}j. The estimated 100(1 – {alpha})% CI for {theta}j is given by

Formula
where Formula denotes the maximum likelihood parameter estimates. By asymptotic maximum likelihood theory, the profile likelihood is approximately Gaussian in the neighborhood of Formulaj, with variance parameter that can be estimated by

Formula
for any small nonzero {delta}; this yields the estimated CI Formula

The approximate variance for summary parameters are calculated by treating the individual parameter estimates as independent random variables with mean equal to the maximum likelihood estimate and variance as calculated above. Comparisons of the summary CIs to exact CIs of simpler models suggest that this approximation is conservative, that is, causes overestimation of the CIs. For example, the CIs appearing in figure 7a (for summary dN/dS values from ASYMMETRIC, TRIPLET, C4+ models) are all larger than the corresponding exact CIs on the dN/dS parameters of SINGLE, NONREV, CODON models analyzing the same data.

Simulation
Our simulations for a model {theta} = ({tau}, β, Q, {pi}) use a continuous-time evolutionary process which exactly represents context-dependent mutations. Starting from a root sequence sampled from {pi}, the sequence evolves along each branch as follows: the substitution rate {phi}j at each nucleotide position j is calculated from the mutation and selection parameters; the substitution rate for the entire sequence is Formula where J is the number of nucleotide positions. The time until the next nucleotide substitution in the sequence is sampled from the exponential distribution Formula Draws from the distribution of {phi}j and the distribution of nucleotide substitution rates at each position determine the position and type of the substitution. The substitution rates at the substituted position and any contextually dependent positions are updated together with Formula and the process is repeated. Iteration stops when the branch length is exceeded.

Software Availability
The program CodeAxe used to perform the analyses in this study is available as source code from http://www.phrap.org.

Coding Sequence Data
For the mammalian coding sequence analysis, we have taken all complete coding sequence (CDS) annotations from the NCBI RefSeq database (Pruitt et al. 2005Go) for human (build 36), mouse (build 36), and rat (build 4). We also analyzed coding sequence from the Saccharomyces species Saccharomyces cerevisiae, Saccharomyces paradoxus, and Saccharomyces mikatae downloaded from (http://www.genome.wi.mit.edu/personal/manoli/yeasts/) (Kellis et al. 2003Go).

Alignment
For either set of organisms, the coding sequences were assigned to orthologous sets containing one sequence from each organism using a symmetric best Blast E-value technique (Tatusov et al. 1997Go). Coding sequences in each ortholog set were aligned as amino acid sequences using ClustalW (Thompson et al. 1994Go) and subsequently reverted to their original nucleotide sequence.

Quality Criteria
Orthologous coding sequence sets were not analyzed if any RefSeq entry contained incomplete CDS annotations, used a nonstandard translation table or had any other exception to the standard genetic code, or were annotated with any of the following terms: "artificial frameshift," "unclassified translation discrepancy," or "pseudogene."

Mammalian sequence alignments were also filtered to remove CpG islands, detected as maximally scoring segments in the human coding sequence having scores at least 50, under a scoring scheme that assigns +17 to CpGs and –1 to other dinucleotides (Lander et al. 2001Go). Islands were constrained to start at the 5' end of the sequence by requiring the maximal scoring segment to begin within the first 15 nt of the sequence and treating codons 5' of the segment start as part of the island. Codons in aligned orthologous mammalian sequences that intersected or preceded a human CpG island were removed from the analysis.

The remaining portion of each alignment was subject to a window quality filter to remove possible alignment artifacts. For each aligned codon, we defined a window of 10 preceding and 10 following codons. The aligned codon was removed from analysis if any gaps occurred in this window or if more than 5 of the 20 window codon positions contained nonsynonymous differences (including potential nonsynonymous differences involving ambiguous nucleotides). Ambiguous nucleotides were not allowed in any analyzed codons. After all filtration steps, the mammalian and Saccharomyces alignments respectively contained 5,128,530 and 1,666,416 aligned codon positions.

Structure Assignment
Translated human coding sequences at least 40 amino acids long from each ortholog set were aligned to the 95% sequence identity subset of the ASTRAL database v1.69 (Chandonia et al. 2004Go) using a single round PSI-Blast (v2.2.14) search (Altschul et al. 1997Go) with a final Smith–Waterman realignment step. Alignments were required to have an E value of less than 0.01, at least 30% amino acid sequence identity, a minimum length of 40 amino acids, and a maximum of 30% overlap with other ASTRAL domains already aligned (with smaller E value). For overlapping accepted ASTRAL domain matches, the lower E-value domain takes precedence in the overlap region.

For each ASTRAL domain, we used DSSP (Kabsch and Sander 1983Go) to assign secondary structure to each amino acid position. The DSSP secondary structure categories were reduced to 3 as follows: {alpha}, {pi}, and 3-10 helix were assigned to "helix," extended was assigned to "β sheet," and the remaining DSSP categories were assigned to "loop/other." Burial was assigned to all ASTRAL domains by using a simple contact number for each amino acid, the 14 Å Cβ density (the number of Cβ atoms within 14 Å of the Cβ of the measured position, with C{alpha} substituting for Cβ in Gly), which was divided into 3 quantiles to define buried, intermediate, and exposed states. This burial metric (using, however, 7 classes rather than 3) was previously found to be the best conserved measure of burial among distant homologs (Karchin et al. 2004Go). Using these methods, we were able to assign secondary structure and burial to 994,587 and 992,354 codon positions, respectively, in the mammalian sequence alignments.

Expression Level Categories
Expression level was approximated by counts of expressed sequence tags (ESTs) and cDNAs in gene-oriented clusters constructed as described in Garg and Green (2007Go). We first eliminated clusters lacking a one-to-one association with a Refseq gene. We then classified genes into 2 equal sized expression bins based on EST + cDNA count of the associated clusters: low expression genes had counts <65 (the median value), and high expression genes had counts ≥65.

Human Disease Propensity Matrix
We obtained 24,719 putative human disease-causing nonsynonymous mutations from the HGMD database (Stenson et al. 2003Go) (Phillips A and Cooper DN, personal communication November 2005) and 18,564 validated nonsynonymous single nucleotide polymorphisms (SNPs) from dbSNP, v127 (Sherry et al. 2001Go), using the corresponding chimpanzee base to polarize the mutation that gave rise to the SNP. We then estimated "relative human disease propensity" of each amino acid exchange type as the log ratio of the fraction of HGMD mutations that are of that type to the fraction of nonsynonymous SNPs of that type. This ratio is expected to inversely correlate to amino acid exchangeability, provided HGMD mutations have a distribution similar to that of all negatively selected nonsynonymous mutations and nonsynonymous SNPs have a distribution similar to that of all nonsynonymous mutations.

Simulation Analysis
Rooting a Nonstationary Model
In table 1, we compare the performance of 3 different methods to estimate the root distribution: equating to the distribution in observed sequences (FIXED), the BLS method (discussed above), and a full search over root distribution parameters (FREE). In the first part of this table, we use a CODON state space model with NONREV mutation (which is context-independent). The BLS root model performs nearly as well as the FREE root model, and both are much better than the FIXED model, as judged by relative likelihoods, ability to recover the root codon distribution and parameter error. The second part of table 1 shows that these trends also hold for C4+ state space models. We use the BLS root method in all remaining analyses in this paper.


View this table:
[in this window]
[in a new window]

 
Table 1 Comparison of Root Estimation Methods on Simulated Data

 
State Type
In table 2, we compare CODON and C4+ state space models. The models have similar performance for context-independent mutation (first 2 rows of table). Note that in this case, the likelihood for the C4+ model is approximate whereas that of CODON is exact. For context-dependent mutation rates (next 2 rows), C4+ models fare substantially better for both likelihood and parameter estimation accuracy, although the C4+ parameter estimation error is still notably higher than in the context-independent simulation. In similar analyses using an alternative 4-nt state type, C4–, composed of each codon and its preceding nucleotide, we found parameter estimation to be noticeably less accurate than for C4+, probably because the C4+ state contains both nucleotides flanking third codon positions which (because they undergo relatively little selection) provide most of the information regarding underlying mutation rates. Based on the above tests, we use C4+ models for all remaining analyses in this study.


View this table:
[in this window]
[in a new window]

 
Table 2 Comparison of Different State Spaces on Simulated Data

 
Normalized Error Distribution
We test the accuracy of our parameter and CI estimation (which involved several approximations) by observing the normalized error distribution of the fully parameterized C4+ model (fig. 1). For an exact likelihood model, this distribution should be asymptotically normal and unbiased with standard deviation (SD) 1. Although it does appear approximately normal, the SD is 1.306, larger than expected. Without normalizing, the SD of relative parameter error (SDre) is 0.0558. When the analysis is repeated on a simulated data set of one-fifth the size, the normalized error distribution is closer to standard normal (SD = 1.027), but with less accurate parameter estimates (SDre = 0.1061). This suggests that the approximations made in the model substantially distort estimation accuracy only when large quantities of data are analyzed, possibly by biasing the parameter estimates slightly.


Figure 1
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Normalized error histogram for all 346 parameters of the ASYMMETRIC, TRIPLET, C4+ model, from analysis of simulated data. Parameter estimates derived from an analysis of mammalian coding sequences using this model were used to simulate a data set of identical size (5,128,530 codon positions), and the simulated data were then analyzed using the same model. The normalized error is defined as the difference between the "true" (as assumed in the simulation) and estimated parameter values, divided by its estimated SD. The curve corresponds to a standard normal distribution.

 

    Results and Discussion
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusions
 Appendix
 Acknowledgements
 References
 
Model Parameterization
We compared evolutionary models of varying degrees of complexity by analyzing orthologous human, mouse, and rat coding sequences using the extended codon (C4+) model (fig. 2a). The more complex parameterizations have in nearly all cases much higher likelihoods. The most dramatic improvement is between NONREV and the CPG model, which has 2 additional mutation parameters representing CpG transition and transversion rates. The FACTORED-TRIPLET mutation model (which treats 5' and 3' nucleotide context effects as independent and multiplicative) shows a lesser but still substantial further improvement, reflecting non-CpG context dependencies in the data. The full TRIPLET model, which drops the independence assumption, shows a weaker improvement regardless of the selection model, suggesting that nonindependent context effects are of secondary importance and can be ignored when insufficient data are available for reliable estimation. Nonetheless, even the smallest likelihood improvement between FACTORED-TRIPLET and TRIPLET mutation (with ASYMMETRIC selection) yields a log likelihood difference of 6,711, which for 96 additional free parameters is highly significant by a likelihood ratio test.


Figure 2
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Comparison of mutation and selection models. Approximate log-likelihoods, relative to SINGLE, NONREV, are shown for (a) mammalian and (b) yeast coding sequence data sets, for various mutation and selection parameterizations of the C4+ model.

 
Elaborating the selection model from SINGLE (one dN/dS parameter) to SYMMETRIC (allowing a separate {omega}(i,j) parameter for each unordered amino acid pair (i,j) related by a single nucleotide change) shows a likelihood improvement comparable to that between NONREV and FACTORED-TRIPLET mutation models. Further generalization to the ASYMMETRIC model (allowing one parameter for each ordered amino acid pair) provides a more modest, but still significant improvement. Although selection is expected to be asymmetric at individual positions in a protein as a result of differing preferences resulting from structural or functional constraints, it is less clear that the average selection trends over the entire proteome should show any asymmetry. The likelihood improvements we see here suggest that asymmetric selection may indeed play a significant (if secondary) role at the proteome level. Further analyses below indicate that there are important asymmetric selection effects for exchanges between hydrophobic and polar amino acid sets.

Figure 2b reports a similar analysis of coding sequences for 3 Saccharomyces species, showing most of the same trends (the main exception being the lack of a major improvement with the CpG-only model, reflecting the fact that CpGs are not mutation hotspots in yeast). Although this data set is only about one-third the size of the mammalian data set, the likelihood improvements are still highly significant.

Phylogenetic trees with branch times trained from the ASYMMETRIC, TRIPLET model are shown in figure 3.


Figure 3
View larger version (4K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Phylogenetic trees for (a) mammalian and (b) yeast coding sequence data sets, with branch lengths derived from simulations of the ASYMMETRIC, TRIPLET, C4+ model. The known tree topology relating the species was assumed for the analysis. Branch lengths are in units of mutations per nucleotide before selection (see Methods).

 
In figure 2, the likelihood improvements resulting from elaborating the mutation and selection models are not additive; for example, without context-dependent mutation the log likelihood difference between SYMMETRIC and ASYMMETRIC selection in mammals is 13,690, but with FACTORED-TRIPLET mutation the change is only about half of that: 7,106. This suggests that a more richly parameterized model of selection can artifactually partially compensate for a simpler mutation model (or vice versa). By extension, it is important to recognize the possibility that other biological phenomena not represented by our model, such as intragenomic or temporal heterogeneity in mutation and selection, could be artifactually inflating the apparent likelihood improvements for these models.

In the above analyses, our parameter estimation allows the state distribution at the root to differ from that in present day organisms (i.e., we do not assume stationarity of the evolutionary process). We found that imposing approximate stationarity by equating the root state distribution to that of the leaves results in substantial likelihood decreases of 10,586.6 for mammals and 2,101.5 for yeast, using in both cases the ASYMMETRIC, TRIPLET model. In mammals, a nonstationary evolutionary process is also implied by studies of isochore evolution, which show a parallel decrease in the G + C content of G + C rich isochores in many mammalian genomes (Duret et al. 2002Go; Belle et al. 2004Go).

Parameterization of Mutation in Mammals
Although the need to account for accelerated CpG transition rates in mammals has become reasonably well accepted, it is apparent from the above analysis that weaker non-CpG context effects are also important. Figure 4 shows how mutation rates and selection factors change between the CPG and TRIPLET mutation models when allowing ASYMMETRIC selection. Figure 4a indicates that mutation rates in TRIPLET vary over roughly a 4-fold range compared with the corresponding CPG rate. Both models report a (non-CpG) transition/transversion rate ratio ({kappa}) of approximately 3 and show CpG transversion rates comparable to those of non-CpG transitions, consistent with prior observations (Hwang and Green 2004Go; Siepel and Haussler 2004Go).


Figure 4
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Comparison of context-dependent mutation model parameters. Parameters were estimated from mammalian coding sequences with the ASYMMETRIC, C4+ models incorporating either CPG or TRIPLET mutation. Error bars are estimated 95% CIs. (a) Mutation rate estimates. Each mutation rate distribution is separately scaled such that the expected mutation rate is 1 in independent and identically distributed nucleotide sequence (see Methods). (b) Selection factor estimates. Values which differ by greater than a factor of 1.5 between models are highlighted in red.

 
Our CpG transition estimates are only about 4 times higher than those of non-CpG transitions, whereas analyses of neutrally evolving sequence in primates typically show a 10- to –12-fold difference (Sved and Bird 1990Go; Chimpanzee Sequencing and Analysis Consortium 2005Go). This apparent discrepancy is unlikely to be model artifact because the C4+ model was able to successfully recover parameter estimates from simulated data having substantially higher CpG transition rates (data not shown). Instead, it appears to reflect: 1) the fact that our rate estimates are dominated by trends on the rodent lineage, where the CpG to non-CpG rate ratio is significantly lower as a result of differential generation-time effects (Hwang and Green 2004Go) and 2) the relatively high G + C content of mammalian coding sequences (International Human Genome Sequencing Consortium 2001), which results in a lower rate of CpG mutations (likely attributable to a dependency of cytosine deamination on DNA melting [Fryxell and Moon 2005Go] and/or higher biased gene conversion rates [Meunier and Duret 2004Go]).

The selection parameter estimates using CPG and TRIPLET mutation also show several substantial differences (greater than a factor of 1.5), implying that substitution rates for both restricted and permissive changes can be strongly influenced by non-CpG mutational context effects (fig. 4b). One illustrative example is {omega}(Phe,Tyr), which is 0.147 ± 0.0102 (95% CI) with CPG mutation, but 0.260 ± 0.0268 with TRIPLET mutation. This difference can be explained by the change of the NTN -> A mutation rate from the CPG model (0.0855 ± 0.00153) to the TRIPLET model, in which the nucleotide triplet contexts inducing the Phe -> Tyr substitution have substantially lower T -> A mutation rates: TTC -> A is 0.0366 ± 0.00331 and TTT -> A is 0.0443 ± 0.00395. These lower mutation rates for pyrimidine triplets are consistent with previous observations that the mutation rate tends to increase as a function of the number of purine/pyrimidine alternations in the triplet context (Hess et al. 1994Go; Hwang and Green 2004Go). This observation also holds for NTN -> A mutation rates in the TRIPLET model, where the mean mutation rate for contexts with zero YR or RY alternations is 0.0664 ± 0.00249, with 1 alternation is 0.0968 ± 0.00379, and with 2 alternations is 0.173 ± 0.00876.

Comparison to Independently Derived Selection Estimates
We compared our selection estimates {omega}(i,j) for amino acid exchanges with 2 independently derived measures of (asymmetric) amino acid exchangeability: 1) the EX matrix (Yampolsky and Stoltzfus 2005Go), which is derived from assaying the effects of 9,671 experimentally engineered mutations in 12 different proteins; and 2) a "human disease propensity" matrix derived from human heritable disease–associated amino acid substitutions in HGMD (Stenson et al. 2003Go), normalized by human polymorphism data to control for mutation rate effects (see Methods). The R2 of each set relative to the other is 0.348, or 0.428 when weighted by the human polymorphism count, so they can only be considered approximate standards.

In table 3, we summarize the weighted R2 for our species-averaged log selection values to both sets. There is a general trend of improving correlation with increasing complexity of the mutational model, except that TRIPLET gives little improvement over FACTORED-TRIPLET, consistent with the relatively low likelihood improvement between FACTORED-TRIPLET and TRIPLET (fig. 2), and (in the case of symmetric selection) the context-independent (NONREV) model shows higher correlation to the human disease propensity matrix. Table 3 also suggests that selection estimates are more reliable for asymmetric than for symmetric selection models when context-dependent mutation is allowed for, again consistent with our likelihood analyses (fig. 2). For comparison, we also compute the R2, relative to the validation sets, for 2 other published sets of exchangeability estimates (bottom rows of table 3): the "universal" evolutionary index (Tang et al. 2004Go), which estimates symmetric {omega}(i,j) values from aligned coding sequences from pairs of organisms, accounting for the transition/transversion bias and the content of codons and nucleotides in the analyzed sequences; and estimated mean selection coefficients (Yampolsky et al. 2005Go) derived from human disease, polymorphism, and human–chimpanzee divergence data.


View this table:
[in this window]
[in a new window]

 
Table 3 Comparison of Amino Acid Exchangeability

 
Structure Analysis
Previous studies using empirical models of protein-structure–dependent amino acid replacement (Thorne et al. 1996Go; Goldman et al. 1998Go) found replacement to be strongly associated with secondary structure and solvent accessibility. We investigate here the dependence of selection (rather than replacement) on structure, using mammalian coding sequences whose local structural properties (3-state secondary structure and burial) could be determined by alignment to a homologous structure (see Methods).

In figure 5, we compare selection patterns in helical and nonhelical regions. Amino acid exchanges showing significant differences between these regions are highlighted according to the change in {alpha}-helix propensity as defined by Chou and Fasman (1978Go). This figure shows that, in most cases, exchanges which are relatively more favored in helical regions increase the helical propensity and exchanges relatively more favored in nonhelical regions decrease it.


Figure 5
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 5.— Asymmetric amino acid exchangeability in helical versus nonhelical protein regions. The {omega}(i,j) selection parameters for exchanges from amino acid i to j are shown for helical and nonhelical regions ({omega}(i,j) <0.008 are excluded). Parameters were jointly estimated in an ASYMMETRIC, TRIPLET, C4+ model using shared mutation and branch length parameters but separate selection parameters and observed state distributions. The diagonal line indicates the best linear fit of all log selection values through the origin. Only selection values which significantly deviate from this line are shown. Exchanges are colored to indicate change in helical propensity according to the Chou-Fasman scale (Chou and Fasman 1978Go), with positive change indicating increased helical propensity. Vertical and horizontal lines indicate 95% CIs.

 
One exception involves Lys {leftrightarrow} Arg exchanges, where the model's selection estimates contradict the Chou–Fasman scale's preference for Lys in helices. However, a greater relative acceptance of Arg over Lys in helical regions is consistent both with assays of engineered mutations in helices (Blaber et al. 1993Go) as well as a helical propensity scale derived from predicted structures (Koehl and Levitt 1999Go).

As expected, selection on Gly {leftrightarrow} Ala exchanges is more favorable to Ala in helical regions than in nonhelical regions, but surprisingly, selection against Gly -> Ala exchanges within helical regions is still slightly stronger than that against Ala -> Gly. We note, however, that due to mutational asymmetry, the summary substitution rate of Gly -> Ala in helical regions (0.0135 ± 0.0030) is substantially higher than for Ala -> Gly (0.0071 ± 0.0017), consistent with the observed amino acid composition in helices.

We next analyzed hydrophobic and polar amino acid exchanges in subsets of the mammalian coding sequence set annotated by structural burial (fig. 6). Selection becomes more stringent on all exchanges with increasing burial, consistent with previous observations (Rennell et al. 1991Go). In addition, the selection asymmetry favoring polar to hydrophobic exchanges over the reciprocal exchange becomes more pronounced with burial. An overall asymmetry favoring polar to hydrophobic exchanges ignoring structural status has previously been noted (Yampolsky et al. 2005Go); however, we find this overall asymmetry to be sensitive to the precise amino acids chosen to constitute the hydrophobic set (data not shown), whereas the trend of increasing polar to hydrophobic bias as a function of burial is much more robust. Figure 6 also shows that exchanges within a set (hydrophobic or polar) are under weaker selection than exchanges between sets and that the increase in selection strength for polar exchanges has a greater correlation with burial than hydrophobic exchanges do. These results quantify how selection is both generally more restrictive in buried regions and biased toward a lower proportion of polar amino acids in the desolvated protein core.


Figure 6
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 6.— Analysis of the mammalian coding sequence data partitioned by structural burial, using the ASYMMETRIC, TRIPLET, C4+ model. "H" is a set of hydrophobic amino acids {Ala, Phe, Ile, Leu, Met, Val, Trp, Tyr}, and "P" is a set of polar amino acids {Asp, Glu, His, Lys, Asn, Gln, Arg, Ser, Thr}. Summary {omega} values between and within these sets are calculated from the model selection parameters as described in Methods.

 
Finally, we analyzed the joint effects of gene expression level and protein structure on selection. It is known that highly expressed genes tend to evolve more slowly than rarely expressed genes (Sharp 1991Go; Green et al. 1993Go; Pal et al. 2001Go; Herbeck et al. 2003Go; Rocha and Danchin 2004Go; Subramanian and Kumar 2004Go). We have previously proposed (Green et al. 1993Go) that this was due to "higher selective pressures to optimize the activities and structures of these (highly expressed) proteins and to minimize undesired interactions with other cellular components." Two more specific hypotheses can be formulated. The first proposes that selection acts primarily to maintain proper folding and stability, in order to increase protein yield and protect against aggregation of misfolded prod