MBE Advance Access originally published online on September 28, 2007
Molecular Biology and Evolution 2007 24(12):2632-2647; doi:10.1093/molbev/msm190
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Insights from Modeling Protein Evolution with Context-Dependent Mutation and Asymmetric Amino Acid Selection

* Department of Genome Sciences, University of Washington
Howard Hughes Medical Institute, Seattle, WA
E-mail: ctsa{at}u.washington.edu or phg{at}u.washington.edu.
| Abstract |
|---|
|
|
|---|
We develop an approximate maximum likelihood method to estimate flanking nucleotide context-dependent mutation rates and amino acid exchange-dependent selection in orthologous protein-coding sequences and use it to analyze genome-wide coding sequence alignments from mammals and yeast. Allowing context-dependent mutation provides a better fit to coding sequence data than simpler (context-independent or CpG "hotspot") models and significantly affects selection parameter estimates. Allowing asymmetric (nonreciprocal) selection on amino acid exchanges gives a better fit than simple dN/dS or symmetric selection models. Relative selection strength estimates from our models show good agreement with independent estimates derived from human disease-causing and engineered mutations. Selection strengths depend on local protein structure, showing expected biophysical trends in helical versus nonhelical regions and increased asymmetry on polar–hydrophobic exchanges with increased burial. The more stringent selection that has previously been observed for highly expressed proteins is primarily concentrated in buried regions, supporting the notion that such proteins are under stronger than average selection for stability. Our analyses indicate that a highly parameterized model of mutation and selection is computationally tractable and is a useful tool for exploring a variety of biological questions concerning protein and coding sequence evolution.
Key Words: protein evolution amino acid selection context-dependent mutation phylogenetic analysis protein expression protein structure
| Introduction |
|---|
|
|
|---|
A better understanding of protein evolution would illuminate both protein molecular biology and the diversification of species and populations. Developing adequate evolutionary models remains, however, a daunting challenge, in part because the core processes of mutation and selection are quite complex. Mutation rates depend, for example, on flanking nucleotide context (Sved and Bird 1990
B substitutions (where A and B are amino acids) may differ from that on B
A substitutions (Yampolsky et al. 2005Models of coding sequence evolution, like models of any complex phenomenon, must make numerous simplifying assumptions to be tractable. Ultimately, however, it is important to incorporate as much biological complexity as possible, not only because failure to do so can bias analysis results but also because fully understanding this complexity is an important biological goal. Increasing model complexity poses 2 challenges: more parameters must be estimated, which requires larger training data sets and greater attention to ensure that the parameter space has been adequately scanned; and simplifying but biologically unjustified mathematical assumptions must be removed, which can require developing new algorithms to restore computational feasibility. An example of the latter challenge is that mutational context dependency violates the mathematically convenient assumption that nucleotide segments (single nucleotides or codons) evolve independently.
Here, we develop a model of coding sequence evolution that incorporates context-dependent mutation rates and asymmetric exchange-dependent selection on amino acid substitutions and use it to estimate mutation and selection parameters from genome-wide coding sequence alignments by likelihood-based phylogenetic analysis. Previous models have incorporated some but not all of these features. Many protein evolution models have confounded mutation and selection, making biological interpretation of the model parameters problematic: these include models used to generate sequence alignment matrices (Dayhoff et al. 1978
; Henikoff and Henikoff 1992
; Jones et al. 1992
) and empirical maximum likelihood models of amino acid (Yang et al. 1998
; Whelan and Goldman 2001
) and codon (Kosiol et al. 2007
) substitution. Methods that do deconvolute mutation and selection often represent selection by a single dN/dS parameter (Goldman and Yang 1994
; Muse and Gaut 1994
). Yang et al. (1998
) developed codon substitution models allowing for symmetric amino acid exchange-dependent selection, with selection parameters either taken from a published "physicochemical distance" matrix or estimated from the data. They found that both exchange-dependent models outperform the use of a single dN/dS and that estimated selection parameters outperform the best physicochemical distance matrix. More recently, Tang et al. (2004
) used a nonlikelihood-based method to estimate symmetric selection parameters, finding them to be similar across diverse organisms. None of the above models allow for context-dependent mutation.
A few studies have considered asymmetric selection. The EX matrix (Yampolsky and Stoltzfus 2005
) quantifies asymmetric amino acid exchangeability from assays of engineered mutations. This approach may fail to capture the full spectrum of selection acting on amino acid exchanges in vivo, and it can only use data from a limited set of genes and organisms. It does not estimate mutation rates. Yampolsky et al. (2005
) combined genetic disease mutation, polymorphism, and divergence data to estimate the distribution of selection coefficients for each type of amino acid exchange in recent human evolution, allowing different mutation rates at CpG dinucleotides as estimated from neutrally evolving sequence but not other types of context effects. Their approach does not appear applicable to broader evolutionary scales.
Several groups have developed context-dependent mutation models for neutrally evolving sequence (Arndt et al. 2003
; Hwang and Green 2004
; Jojic et al. 2004
; Lunter and Hein 2004
; Siepel and Haussler 2004
; Arndt and Hwa 2005
), but coding sequence models have been more limited. A reversible model that incorporates CpG-dependent mutation rates has been developed for analyzing coding sequence pairs (Jensen and Pedersen 2000
; Christensen et al. 2005
). Siepel and Haussler (2004
) adapt their context-dependent models to coding sequence but using an approach that confounds mutation and selection. The Markov Chain Monte Carlo (MCMC) approach of Hwang and Green (2004
) can be extended to incorporate selection, and we have recently done this (Hwang and Green, unpublished data). In comparison to the approach presented here, the MCMC method has the advantage of using an exact rather than approximate probability model, but it is computationally less efficient for large-scale data analysis and less easily adapted to likelihood-based applications.
Our approach derives context-dependent mutation rates and exchange-specific selection parameters from likelihood-based phylogenetic analysis of homologous coding sequences. The model we implement analyzes the evolution of "extended codons," which are contiguous 4-nt segments of coding sequence consisting of a codon together with the immediately following (3') nucleotide. Compared with codons, using extended codons substantially improves power to estimate context-dependent mutation rates and to deconvolute selection and mutation, largely because context for synonymous sites is included; in single codon models, context-dependent rates are only allowed at middle codon positions, where they are confounded with selection. Note, for example, that the elevated CpG mutation rate in mammals complicates estimation of selection on Thr(ACG) to Met(ATG) changes, with failure to account for mutational context potentially causing artifactual inferences of weak or positive selection. Substantial distortions in selection values can also arise from neglecting other (non-CpG) context effects, as we illustrate in this study.
We drop the assumption that sequences are in compositional equilibrium (stationarity), which has typically been made for mathematical convenience in previous work. Stationarity implies that sequence composition is invariant throughout the phylogenetic tree, which is incompatible with observations that isochores are decaying in mammals (Duret et al. 2002
; Belle et al. 2004
). Stationarity is usually enforced by constraining the rate matrix to be reversible and the sequence composition at the root to equal that of present day organisms. We drop both constraints. To avoid the computationally prohibitive task of searching a large space of possible root distributions, we have developed a novel procedure (the "per-branch least squares [BLS] method") for estimating a root distribution by backwards evolution along the tree.
Another technical innovation in our work is a "subgraph-caching" variant of the Felsenstein (1973) site likelihood calculation algorithm, which reuses the results of intermediate stages of the calculation. This is related to previously described subtree-caching procedures (Kosakovsky Pond and Muse 2004
) but is substantially more efficient for analyzing long (genomic) alignments of sequences of codons or extended codons.
Our methods are in principle applicable to data sets involving any number of species or genes but are particularly suited to genome-scale analyses of a small number of organisms (up to approximately 5 mammalian genomes). Analyses in this paper pertain to 3 mammals (human, mouse, and rat) and 3 Saccharomyces species. For both data sets, nonstationary models that include context-dependent mutation and asymmetric selection yield significant likelihood improvements over simpler models. Mutation rate and selection parameter estimates both show a substantial dependence on the specific mutation model assumed, with selection estimates varying by 50% or more between models. Selection estimates from models incorporating context-dependent mutation and asymmetric selection give the best correlation to independent exchangeability estimates derived from human disease-causing and engineered mutations.
We also examine the dependence of selection estimates on protein structure. Selection patterns in helical regions recapitulate known biophysical trends. Polar
hydrophobic exchanges display a selection asymmetry which increases with degree of burial in the protein structure, quantifying the action of selection to maintain the hydrophobic core. We also show that the increased strength of selection acting on highly expressed genes relative to low expression genes is disproportionately biased toward buried regions of the protein, consistent with the idea that highly expressed proteins are under stronger pressure to fold properly.
| Methods |
|---|
|
|
|---|
Model
We represent protein-coding sequences as sequences of "states" from an alphabet
= {a1, a2, ...aN} of size N; the 2 alphabets used in this study are CODON, the set of sense codons of the standard genetic code (N = 61) (Goldman and Yang 1994
We model sequence evolution on a rooted bifurcating phylogenetic tree
as a (possibly nonreversible) continuous-time homogeneous Markov substitution process with an N-dimensional vector
of state frequencies at the root node and an N x N–dimensional rate matrix Q = {qi,j}, where qi,j is the instantaneous rate of change between states ai and aj. The qi,j are assumed to be of the form
![]() |
represents (context-dependent) mutation rates,
represents selection on amino acid changes, and
(i,j) is the number of aligned nucleotide differences between states ai and aj. Note that state substitutions can only occur via single nucleotide changes and the matrix diagonal is set such that each row sums to 0. We describe below several specific forms for
and
In light of the fact that genome sequences are not at compositional equilibrium (Duret et al. 2002
; Belle et al. 2004
), we permit the root frequencies
to differ from those of present day sequences. This allows for the possibility of changes in the evolutionary process having occurred, but only prior to the last common ancestor of the analyzed sequences (a single Q is assumed to apply on all branches since the time of the root).
In our analyses, the topology of the tree
is assumed known. The set
of tree nodes consists of a set
of leaf nodes and a set
=
\
(the set of elements of
that are not in
) of internal nodes; the set of branch lengths is denoted β.
We use
= (
, β, Q,
) to denote a specific model. Model parameters are estimated from a gap-free alignment of L nucleotide sequences, each of length J, in which aligned nucleotides are considered homologous, and each observed sequence corresponds to a leaf node in the tree (so L = |
|). Each nucleotide sequence is mapped to a corresponding sequence of I = J/3 states from the alphabet
(note that when
= C4+, a single nucleotide can map to 2 successive states in the corresponding state sequence). The observed state sequences at
are collectively referred to as the known data D, and the ancestral state sequences at
are referred to as the missing data M.
Let di,l denote the observed state at site i (1
i
I) and leaf node
and let mi,h denote the ancestral state at site i and node
The probability of the set
of observed aligned states at site i is a sum over all possible values of the missing data
at that site:
|
| (1) |
tU–1, where
= diag(
1,
2 ...
N) and U satisfy Q = U
U–1. The site probability equation (1) can be calculated using the Felsenstein (1973) pruning algorithm. We take advantage of particular features of the data analyzed in this study (relatively long alignments from a small set of species) by using a subgraph-caching variant of the Felsenstein algorithm, which stores and reuses values obtained at intermediate steps in calculating equation (1), namely, the conditional (on internal node state) probabilities of the observed data in subgraphs of the tree. This method (which is more fully described in the Appendix) is related to previously described subtree-caching methods (Kosakovsky Pond and Muse 2004
To define the data probability for the full alignment allowing for possible state overlaps, we view (following Siepel and Haussler [2004
]) the alignment as an instance of a first order Markov chain (in reversed order) whose probability is given by a product of transition probabilities
|
| (2) |
Note that this only approximates a full context-dependency model, which requires a Markov field to represent (Hwang and Green 2004
). For reasons of computational efficiency, we make additional simplifying assumptions as follows. For the CODON state space, where states do not overlap, the site probabilities are assumed to be independent, so P(di|di+1,
) is replaced by P(di|
) and equation (2) reduces to P(D|
) =
iP(di|
).
For the C4+ state space, adjacent states cannot be considered even approximately independent because they share a nucleotide, and we instead proceed as follows. The C4+ state di,l at sequence position i and leaf node l can be represented as di,l = (ci,l, ai,l) where ci,l = (n1, n2, n3)i,l is the ith codon, ai,l = (n4)i,l is the adjacent nucleotide from the neighboring codon, and nk is the kth nucleotide in the C4+ state. The set of aligned observed states at site i may be represented di = (ci,ai), where ci = {ci,l|l
L} and ai = {ai,l|l
L}, and the site probability P(di|
) then equals the joint probability P(ci,ai|
). Because ai is the "left edge" of di+1, we approximate P(di|di+1,
) by
![]() |
is the set of all possible alignments of L codons. To compute P(ai|
) efficiently, we adapt the Felsenstein algorithm used to handle an unknown leaf state and sum over {(c,ai,l)|c
CODON} for site i and leaf node l.
The likelihood equation (2) is then approximated by the product of the conditional codon probabilities:
.
We explore the consequences of making the above approximations by running the model on simulated sequences (see below).
Root Distribution
As noted above, we wish to allow the root state distribution
to differ from that of the leaves. Unfortunately, a full search of the parameter space required to represent
is in general impractical (for models using the C4+ state space, this space has dimension 243). Instead, we estimate
by reversing the state substitution process along each branch of the tree (given the current estimates of β and Q), starting from the observed state distributions at leaf nodes. We denote this as the BLS root distribution estimation method, implemented as follows.
For any nonroot node n in the tree, let
n denote the row vector representing the estimated state distribution at n,
(n) the parent of n, and βn the length of the branch connecting n and
(n). We wish to estimate the distribution 
(n) at
(n). For the true model parameters and state distributions, the relation
n = 
(n)Pn holds, where
Accordingly, let fB(Q, βn,
n) denote the estimate for 
(n) that results in the least-squared error between these values:
where the minimization is over probability distributions
. This calculation is a special case of the nonnegative least squares problem, solvable by the NNLS algorithm (Lawson and Hanson 1974
), but because we require increased computational efficiency (because
must be reestimated every time the model parameter values change) we instead proceed as follows. Let
. If w is a probability distribution we take fB(Q, βn,
n) = w; otherwise, we find fB(Q, βn,
n) by conjugate gradient minimization (Press et al. 1992
) starting from the initial value w', where w' is the probability distribution obtained by setting all negative values of w to zero and scaling the remaining terms to sum to one.
We can now find the root distribution
where r is the root node, by the following recursion. For leaf nodes
we take
n to be the observed state distribution at n and assign a weight wn equal to the observed state sequence length. For interior nodes
we recursively compute
![]() |
(n) denotes the set of child nodes of n. The method could probably be improved by optimizing a least squares fit against all observed data simultaneously rather than separately for each tree branch.
Mutation Rates
We allow mutation rates to depend upon the flanking nucleotides, denoting the rate at any position by µ(n, n', n–, n+), where n and n', respectively, are the original and mutant nucleotides and n– and n+ are the nucleotides immediately preceding and following the mutated position.
We consider 4 possibilities for µ. NONREV is the context-independent nonreversible 12-parameter nucleotide model (Yang 1994
): µNONREV(n, n',n–,n+) = rn,n', n
n' CPG adds 2 additional parameters to represent CpG transitions and transversions:
![]() |
FACTORED-TRIPLET treats the preceding and following nucleotide context effects as independent multipliers: µFACTORED-TRIPLET(n, n', n–, n+) = f-(n,n', n–)f+(n, n', n+), where the partial-context functions f-(n, n', n–) and f+(n, n', n+) use one parameter for every combination of arguments for which
96 in all. Finally, TRIPLET allows a separate rate for each n, n',n–,n+ where
requiring 192 parameters in all. Note that we do not impose a symmetry (reverse complement) condition on the rates because this is known to fail in transcribed sequence (Green et al. 2003
).
For nucleotides at the left or right edge of a site, one of the flanking nucleotides is not contained within the site. In this case, we approximate the rate at the edge position as a weighted average over all possible nucleotides at the missing position. For example, for nucleotides n at the right edge of the site, we calculate
|
|
specifies the probability that nucleotide x follows n when n is at the right edge of the site, and
is the set of nucleotides. For nucleotides at the left edge of the site, the corresponding edge distribution function
gives the probability that x precedes n. For computational simplicity
and
are assumed to be independent of the phylogenetic branch and are calculated from the observed coding sequences. We tried allowing branch-specific values of
and
iteratively derived using state distributions at internal tree nodes as calculated by the BLS procedure, but found that this did not significantly improve the likelihood or parameter estimation accuracy on simulated sequences (data not shown).
We define the context-independent mutation rate associated to µ as
|
|
Selection on Amino Acid Substitutions
Selection on nonsynonymous mutations is represented as a multiplicative factor applied to the mutation rate:
![]() |
maps codons to amino acids, and
is the set of stop codons. The nonsynonymous selection term
(i,j) defines the multiplicative factor for mutations from amino acid i to j, where
and (i,j) is restricted to the set of 150 ordered pairs of amino acids having one or more codons that differ by a single nucleotide change. The SINGLE parameterization of
uses the same value for all (i,j), corresponding to the dN/dS value customarily used in codon models. SYMMETRIC imposes
(i,j) =
(j,i) but allows a separate parameter value for each unordered pair (75 in all). Finally, ASYMMETRIC allows a separate parameter for each of the 150 ordered pairs, so that the strength of selection on reciprocal amino acid exchanges may be unequal.
The above formula needs to be modified slightly to represent selection acting on the final nucleotide n4 of the C4+ state because the codon containing this nucleotide is not specified by the state. To do this, we approximate the selection coefficient as a weighted average over all possible "completions" of the partial codon. Let n be the original nucleotide, n' be the mutated nucleotide, and (n, na, nb) the codon with nucleotides n, na, nb at position 1–3. The partial codon selection factor is
![]() |
Summary Rates and Selection Factors
It is useful to define summary substitution rates q or selection factors
between sets of amino acids (e.g., selection on hydrophobic to polar exchanges). We do this as follows.
Define the flux from state i to a different state j as
ij = P(i)qij, where P(i) is the frequency of state i in observed sequences and qij is the substitution rate between states i and j, as defined above. The flux between 2 sets of states
and
is
and the summary substitution rate between
and
is
Given 2 amino acids, we can take
and
to be the sets of all states which translate to those amino acids, in which case q
is called the summary substitution rate between the amino acids.
Summary selection factors are defined similarly. We now define flux between 2 sets of states
and
as
where
(i) is the amino acid translation of state i (for the C4+ alphabet, this is the translation of the completely embedded codon); in other words, only changes between nonsynonymous states are considered. We define the mutation rate
ij between states i and j, as the substitution rate obtained when all
(i,j) are set equal to 1, and the corresponding mutation flux as
ij = P(i)
ij. The mutation flux between state sets
and
is restricted to nonsynonymous changes
The summary selection factor for exchanges between sets
and
is then
For
=
=
(the complete state alphabet), the procedure yields the conventional dN/dS value.
Maximum Likelihood Estimation
We seek the model
= (
, β, Q,
) assigning the highest probability to the observed data:
In our analyses, we typically assume a particular tree
a priori and use the BLS method to estimate
given Q, β and D, so likelihood maximization is reduced to a search over the parameters of Q and β. However, this still can be a formidable task: for the most elaborate parameterization of Q, there are 191 mutation parameters and 150 selection parameters, in addition to 2L-2 branch length parameters.
Likelihood maximization is performed using Powell's conjugate direction method as described in (Press et al. 1992
), with the following modifications to improve efficiency: we initially run 10 Polak–Ribiere conjugate gradient iterations to improve starting parameter estimates, and at the end of each search iteration, we perform a line maximization in the direction
n –
n – 1 (where
n is the current iteration parameter vector), even when the conjugate direction inclusion criteria are not met. We tried likelihood maximization using both a quasi-Newton algorithm (BFGS) and the conjugate-gradient method and found them to be slower and to yield lower maxima. The EM algorithm (Dempster et al. 1977
), as applied in Siepel and Haussler (2004
), was also found to be less effective, in part because the subgraph-caching algorithm (see Appendix) cannot be efficiently applied to the EM expectation step.
Confidence Interval Calculation
Finding confidence intervals (CIs) by inversion of the Fisher information matrix (Sorensen and Gianola 2002a
) produces unstable results for our highly parameterized models, and we instead use a more robust procedure based on inverting the log likelihood ratio (Sorensen and Gianola 2002b
). Denote the model parameters by
= {
1,
2, ...,
j}, the likelihood function by L(
) = P(D|
), and define the profile likelihood function for
j as
where
-j =
\{
j}, that is, by maximizing over all parameters other than
j. The estimated 100(1 –
)% CI for
j is given by
![]() |
denotes the maximum likelihood parameter estimates. By asymptotic maximum likelihood theory, the profile likelihood is approximately Gaussian in the neighborhood of
j, with variance parameter that can be estimated by
![]() |
; this yields the estimated CI The approximate variance for summary parameters are calculated by treating the individual parameter estimates as independent random variables with mean equal to the maximum likelihood estimate and variance as calculated above. Comparisons of the summary CIs to exact CIs of simpler models suggest that this approximation is conservative, that is, causes overestimation of the CIs. For example, the CIs appearing in figure 7a (for summary dN/dS values from ASYMMETRIC, TRIPLET, C4+ models) are all larger than the corresponding exact CIs on the dN/dS parameters of SINGLE, NONREV, CODON models analyzing the same data.
Simulation
Our simulations for a model
= (
, β, Q,
) use a continuous-time evolutionary process which exactly represents context-dependent mutations. Starting from a root sequence sampled from
, the sequence evolves along each branch as follows: the substitution rate
j at each nucleotide position j is calculated from the mutation and selection parameters; the substitution rate for the entire sequence is
where J is the number of nucleotide positions. The time until the next nucleotide substitution in the sequence is sampled from the exponential distribution
Draws from the distribution of
j and the distribution of nucleotide substitution rates at each position determine the position and type of the substitution. The substitution rates at the substituted position and any contextually dependent positions are updated together with
and the process is repeated. Iteration stops when the branch length is exceeded.
Software Availability
The program CodeAxe used to perform the analyses in this study is available as source code from http://www.phrap.org.
Coding Sequence Data
For the mammalian coding sequence analysis, we have taken all complete coding sequence (CDS) annotations from the NCBI RefSeq database (Pruitt et al. 2005
) for human (build 36), mouse (build 36), and rat (build 4). We also analyzed coding sequence from the Saccharomyces species Saccharomyces cerevisiae, Saccharomyces paradoxus, and Saccharomyces mikatae downloaded from (http://www.genome.wi.mit.edu/personal/manoli/yeasts/) (Kellis et al. 2003
).
Alignment
For either set of organisms, the coding sequences were assigned to orthologous sets containing one sequence from each organism using a symmetric best Blast E-value technique (Tatusov et al. 1997
). Coding sequences in each ortholog set were aligned as amino acid sequences using ClustalW (Thompson et al. 1994
) and subsequently reverted to their original nucleotide sequence.
Quality Criteria
Orthologous coding sequence sets were not analyzed if any RefSeq entry contained incomplete CDS annotations, used a nonstandard translation table or had any other exception to the standard genetic code, or were annotated with any of the following terms: "artificial frameshift," "unclassified translation discrepancy," or "pseudogene."
Mammalian sequence alignments were also filtered to remove CpG islands, detected as maximally scoring segments in the human coding sequence having scores at least 50, under a scoring scheme that assigns +17 to CpGs and –1 to other dinucleotides (Lander et al. 2001
). Islands were constrained to start at the 5' end of the sequence by requiring the maximal scoring segment to begin within the first 15 nt of the sequence and treating codons 5' of the segment start as part of the island. Codons in aligned orthologous mammalian sequences that intersected or preceded a human CpG island were removed from the analysis.
The remaining portion of each alignment was subject to a window quality filter to remove possible alignment artifacts. For each aligned codon, we defined a window of 10 preceding and 10 following codons. The aligned codon was removed from analysis if any gaps occurred in this window or if more than 5 of the 20 window codon positions contained nonsynonymous differences (including potential nonsynonymous differences involving ambiguous nucleotides). Ambiguous nucleotides were not allowed in any analyzed codons. After all filtration steps, the mammalian and Saccharomyces alignments respectively contained 5,128,530 and 1,666,416 aligned codon positions.
Structure Assignment
Translated human coding sequences at least 40 amino acids long from each ortholog set were aligned to the 95% sequence identity subset of the ASTRAL database v1.69 (Chandonia et al. 2004
) using a single round PSI-Blast (v2.2.14) search (Altschul et al. 1997
) with a final Smith–Waterman realignment step. Alignments were required to have an E value of less than 0.01, at least 30% amino acid sequence identity, a minimum length of 40 amino acids, and a maximum of 30% overlap with other ASTRAL domains already aligned (with smaller E value). For overlapping accepted ASTRAL domain matches, the lower E-value domain takes precedence in the overlap region.
For each ASTRAL domain, we used DSSP (Kabsch and Sander 1983
) to assign secondary structure to each amino acid position. The DSSP secondary structure categories were reduced to 3 as follows:
,
, and 3-10 helix were assigned to "helix," extended was assigned to "β sheet," and the remaining DSSP categories were assigned to "loop/other." Burial was assigned to all ASTRAL domains by using a simple contact number for each amino acid, the 14 Å Cβ density (the number of Cβ atoms within 14 Å of the Cβ of the measured position, with C
substituting for Cβ in Gly), which was divided into 3 quantiles to define buried, intermediate, and exposed states. This burial metric (using, however, 7 classes rather than 3) was previously found to be the best conserved measure of burial among distant homologs (Karchin et al. 2004
). Using these methods, we were able to assign secondary structure and burial to 994,587 and 992,354 codon positions, respectively, in the mammalian sequence alignments.
Expression Level Categories
Expression level was approximated by counts of expressed sequence tags (ESTs) and cDNAs in gene-oriented clusters constructed as described in Garg and Green (2007
). We first eliminated clusters lacking a one-to-one association with a Refseq gene. We then classified genes into 2 equal sized expression bins based on EST + cDNA count of the associated clusters: low expression genes had counts <65 (the median value), and high expression genes had counts
65.
Human Disease Propensity Matrix
We obtained 24,719 putative human disease-causing nonsynonymous mutations from the HGMD database (Stenson et al. 2003
) (Phillips A and Cooper DN, personal communication November 2005) and 18,564 validated nonsynonymous single nucleotide polymorphisms (SNPs) from dbSNP, v127 (Sherry et al. 2001
), using the corresponding chimpanzee base to polarize the mutation that gave rise to the SNP. We then estimated "relative human disease propensity" of each amino acid exchange type as the log ratio of the fraction of HGMD mutations that are of that type to the fraction of nonsynonymous SNPs of that type. This ratio is expected to inversely correlate to amino acid exchangeability, provided HGMD mutations have a distribution similar to that of all negatively selected nonsynonymous mutations and nonsynonymous SNPs have a distribution similar to that of all nonsynonymous mutations.
Simulation Analysis
Rooting a Nonstationary Model
In table 1, we compare the performance of 3 different methods to estimate the root distribution: equating to the distribution in observed sequences (FIXED), the BLS method (discussed above), and a full search over root distribution parameters (FREE). In the first part of this table, we use a CODON state space model with NONREV mutation (which is context-independent). The BLS root model performs nearly as well as the FREE root model, and both are much better than the FIXED model, as judged by relative likelihoods, ability to recover the root codon distribution and parameter error. The second part of table 1 shows that these trends also hold for C4+ state space models. We use the BLS root method in all remaining analyses in this paper.
|
State Type
In table 2, we compare CODON and C4+ state space models. The models have similar performance for context-independent mutation (first 2 rows of table). Note that in this case, the likelihood for the C4+ model is approximate whereas that of CODON is exact. For context-dependent mutation rates (next 2 rows), C4+ models fare substantially better for both likelihood and parameter estimation accuracy, although the C4+ parameter estimation error is still notably higher than in the context-independent simulation. In similar analyses using an alternative 4-nt state type, C4–, composed of each codon and its preceding nucleotide, we found parameter estimation to be noticeably less accurate than for C4+, probably because the C4+ state contains both nucleotides flanking third codon positions which (because they undergo relatively little selection) provide most of the information regarding underlying mutation rates. Based on the above tests, we use C4+ models for all remaining analyses in this study.
|
Normalized Error Distribution
We test the accuracy of our parameter and CI estimation (which involved several approximations) by observing the normalized error distribution of the fully parameterized C4+ model (fig. 1). For an exact likelihood model, this distribution should be asymptotically normal and unbiased with standard deviation (SD) 1. Although it does appear approximately normal, the SD is 1.306, larger than expected. Without normalizing, the SD of relative parameter error (SDre) is 0.0558. When the analysis is repeated on a simulated data set of one-fifth the size, the normalized error distribution is closer to standard normal (SD = 1.027), but with less accurate parameter estimates (SDre = 0.1061). This suggests that the approximations made in the model substantially distort estimation accuracy only when large quantities of data are analyzed, possibly by biasing the parameter estimates slightly.
|
| Results and Discussion |
|---|
|
|
|---|
Model Parameterization
We compared evolutionary models of varying degrees of complexity by analyzing orthologous human, mouse, and rat coding sequences using the extended codon (C4+) model (fig. 2a). The more complex parameterizations have in nearly all cases much higher likelihoods. The most dramatic improvement is between NONREV and the CPG model, which has 2 additional mutation parameters representing CpG transition and transversion rates. The FACTORED-TRIPLET mutation model (which treats 5' and 3' nucleotide context effects as independent and multiplicative) shows a lesser but still substantial further improvement, reflecting non-CpG context dependencies in the data. The full TRIPLET model, which drops the independence assumption, shows a weaker improvement regardless of the selection model, suggesting that nonindependent context effects are of secondary importance and can be ignored when insufficient data are available for reliable estimation. Nonetheless, even the smallest likelihood improvement between FACTORED-TRIPLET and TRIPLET mutation (with ASYMMETRIC selection) yields a log likelihood difference of 6,711, which for 96 additional free parameters is highly significant by a likelihood ratio test.
|
Elaborating the selection model from SINGLE (one dN/dS parameter) to SYMMETRIC (allowing a separate
(i,j) parameter for each unordered amino acid pair (i,j) related by a single nucleotide change) shows a likelihood improvement comparable to that between NONREV and FACTORED-TRIPLET mutation models. Further generalization to the ASYMMETRIC model (allowing one parameter for each ordered amino acid pair) provides a more modest, but still significant improvement. Although selection is expected to be asymmetric at individual positions in a protein as a result of differing preferences resulting from structural or functional constraints, it is less clear that the average selection trends over the entire proteome should show any asymmetry. The likelihood improvements we see here suggest that asymmetric selection may indeed play a significant (if secondary) role at the proteome level. Further analyses below indicate that there are important asymmetric selection effects for exchanges between hydrophobic and polar amino acid sets. Figure 2b reports a similar analysis of coding sequences for 3 Saccharomyces species, showing most of the same trends (the main exception being the lack of a major improvement with the CpG-only model, reflecting the fact that CpGs are not mutation hotspots in yeast). Although this data set is only about one-third the size of the mammalian data set, the likelihood improvements are still highly significant.
Phylogenetic trees with branch times trained from the ASYMMETRIC, TRIPLET model are shown in figure 3.
|
In figure 2, the likelihood improvements resulting from elaborating the mutation and selection models are not additive; for example, without context-dependent mutation the log likelihood difference between SYMMETRIC and ASYMMETRIC selection in mammals is 13,690, but with FACTORED-TRIPLET mutation the change is only about half of that: 7,106. This suggests that a more richly parameterized model of selection can artifactually partially compensate for a simpler mutation model (or vice versa). By extension, it is important to recognize the possibility that other biological phenomena not represented by our model, such as intragenomic or temporal heterogeneity in mutation and selection, could be artifactually inflating the apparent likelihood improvements for these models.
In the above analyses, our parameter estimation allows the state distribution at the root to differ from that in present day organisms (i.e., we do not assume stationarity of the evolutionary process). We found that imposing approximate stationarity by equating the root state distribution to that of the leaves results in substantial likelihood decreases of 10,586.6 for mammals and 2,101.5 for yeast, using in both cases the ASYMMETRIC, TRIPLET model. In mammals, a nonstationary evolutionary process is also implied by studies of isochore evolution, which show a parallel decrease in the G + C content of G + C rich isochores in many mammalian genomes (Duret et al. 2002
; Belle et al. 2004
).
Parameterization of Mutation in Mammals
Although the need to account for accelerated CpG transition rates in mammals has become reasonably well accepted, it is apparent from the above analysis that weaker non-CpG context effects are also important. Figure 4 shows how mutation rates and selection factors change between the CPG and TRIPLET mutation models when allowing ASYMMETRIC selection. Figure 4a indicates that mutation rates in TRIPLET vary over roughly a 4-fold range compared with the corresponding CPG rate. Both models report a (non-CpG) transition/transversion rate ratio (
) of approximately 3 and show CpG transversion rates comparable to those of non-CpG transitions, consistent with prior observations (Hwang and Green 2004
; Siepel and Haussler 2004
).
|
Our CpG transition estimates are only about 4 times higher than those of non-CpG transitions, whereas analyses of neutrally evolving sequence in primates typically show a 10- to –12-fold difference (Sved and Bird 1990
The selection parameter estimates using CPG and TRIPLET mutation also show several substantial differences (greater than a factor of 1.5), implying that substitution rates for both restricted and permissive changes can be strongly influenced by non-CpG mutational context effects (fig. 4b). One illustrative example is
(Phe,Tyr), which is 0.147 ± 0.0102 (95% CI) with CPG mutation, but 0.260 ± 0.0268 with TRIPLET mutation. This difference can be explained by the change of the NTN
A mutation rate from the CPG model (0.0855 ± 0.00153) to the TRIPLET model, in which the nucleotide triplet contexts inducing the Phe
Tyr substitution have substantially lower T
A mutation rates: TTC
A is 0.0366 ± 0.00331 and TTT
A is 0.0443 ± 0.00395. These lower mutation rates for pyrimidine triplets are consistent with previous observations that the mutation rate tends to increase as a function of the number of purine/pyrimidine alternations in the triplet context (Hess et al. 1994
; Hwang and Green 2004
). This observation also holds for NTN
A mutation rates in the TRIPLET model, where the mean mutation rate for contexts with zero YR or RY alternations is 0.0664 ± 0.00249, with 1 alternation is 0.0968 ± 0.00379, and with 2 alternations is 0.173 ± 0.00876.
Comparison to Independently Derived Selection Estimates
We compared our selection estimates
(i,j) for amino acid exchanges with 2 independently derived measures of (asymmetric) amino acid exchangeability: 1) the EX matrix (Yampolsky and Stoltzfus 2005
), which is derived from assaying the effects of 9,671 experimentally engineered mutations in 12 different proteins; and 2) a "human disease propensity" matrix derived from human heritable disease–associated amino acid substitutions in HGMD (Stenson et al. 2003
), normalized by human polymorphism data to control for mutation rate effects (see Methods). The R2 of each set relative to the other is 0.348, or 0.428 when weighted by the human polymorphism count, so they can only be considered approximate standards.
In table 3, we summarize the weighted R2 for our species-averaged log selection values to both sets. There is a general trend of improving correlation with increasing complexity of the mutational model, except that TRIPLET gives little improvement over FACTORED-TRIPLET, consistent with the relatively low likelihood improvement between FACTORED-TRIPLET and TRIPLET (fig. 2), and (in the case of symmetric selection) the context-independent (NONREV) model shows higher correlation to the human disease propensity matrix. Table 3 also suggests that selection estimates are more reliable for asymmetric than for symmetric selection models when context-dependent mutation is allowed for, again consistent with our likelihood analyses (fig. 2). For comparison, we also compute the R2, relative to the validation sets, for 2 other published sets of exchangeability estimates (bottom rows of table 3): the "universal" evolutionary index (Tang et al. 2004
), which estimates symmetric
(i,j) values from aligned coding sequences from pairs of organisms, accounting for the transition/transversion bias and the content of codons and nucleotides in the analyzed sequences; and estimated mean selection coefficients (Yampolsky et al. 2005
) derived from human disease, polymorphism, and human–chimpanzee divergence data.
|
Structure Analysis
Previous studies using empirical models of protein-structure–dependent amino acid replacement (Thorne et al. 1996
In figure 5, we compare selection patterns in helical and nonhelical regions. Amino acid exchanges showing significant differences between these regions are highlighted according to the change in
-helix propensity as defined by Chou and Fasman (1978
). This figure shows that, in most cases, exchanges which are relatively more favored in helical regions increase the helical propensity and exchanges relatively more favored in nonhelical regions decrease it.
|
One exception involves Lys
Arg exchanges, where the model's selection estimates contradict the Chou–Fasman scale's preference for Lys in helices. However, a greater relative acceptance of Arg over Lys in helical regions is consistent both with assays of engineered mutations in helices (Blaber et al. 1993
As expected, selection on Gly
Ala exchanges is more favorable to Ala in helical regions than in nonhelical regions, but surprisingly, selection against Gly
Ala exchanges within helical regions is still slightly stronger than that against Ala
Gly. We note, however, that due to mutational asymmetry, the summary substitution rate of Gly
Ala in helical regions (0.0135 ± 0.0030) is substantially higher than for Ala
Gly (0.0071 ± 0.0017), consistent with the observed amino acid composition in helices.
We next analyzed hydrophobic and polar amino acid exchanges in subsets of the mammalian coding sequence set annotated by structural burial (fig. 6). Selection becomes more stringent on all exchanges with increasing burial, consistent with previous observations (Rennell et al. 1991
). In addition, the selection asymmetry favoring polar to hydrophobic exchanges over the reciprocal exchange becomes more pronounced with burial. An overall asymmetry favoring polar to hydrophobic exchanges ignoring structural status has previously been noted (Yampolsky et al. 2005
); however, we find this overall asymmetry to be sensitive to the precise amino acids chosen to constitute the hydrophobic set (data not shown), whereas the trend of increasing polar to hydrophobic bias as a function of burial is much more robust. Figure 6 also shows that exchanges within a set (hydrophobic or polar) are under weaker selection than exchanges between sets and that the increase in selection strength for polar exchanges has a greater correlation with burial than hydrophobic exchanges do. These results quantify how selection is both generally more restrictive in buried regions and biased toward a lower proportion of polar amino acids in the desolvated protein core.
|
Finally, we analyzed the joint effects of gene expression level and protein structure on selection. It is known that highly expressed genes tend to evolve more slowly than rarely expressed genes (Sharp 1991













