MBE Advance Access originally published online on March 7, 2007
Molecular Biology and Evolution 2007 24(5):1113-1121; doi:10.1093/molbev/msm044
Research Articles
Assessing the Determinants of Evolutionary Rates in the Presence of Noise

* Department of Biology, The University of Pennsylvania
The Broad Institute of Harvard University and MIT
E-mail: jplotkin{at}sas.upenn.edu.
| Abstract |
|---|
Although protein sequences are known to evolve at vastly different rates, little is known about what determines their rate of evolution. However, a recent study using principal component regression (PCR) has concluded that evolutionary rates in yeast are primarily governed by a single determinant related to translation frequency. Here, we demonstrate that noise in biological data can confound PCRs, leading to spurious conclusions. When equalizing noise levels across 7 predictor variables used in previous studies, we find no evidence that protein evolution is dominated by a single determinant. Our results indicate that a variety of factors (including expression level, gene dispensability, and protein-protein interactions) may independently affect evolutionary rates in yeast. More accurate measurements or more sophisticated statistical techniques will be required to determine which one, if any, of these factors dominates protein evolution.
Key Words: evolutionary rates; noise; PCA; PCR; expression levels; dN; dS
| Introduction |
|---|
Proteins span more than 3 orders of magnitude in their evolutionary rate, as quantified by the number of nonsynonymous substitutions per site. What determines a protein's rate of evolution has been actively debated over the past several decades. Early hypotheses suggested that the evolutionary rate of a protein is governed by at least 2 factors: the protein's level of functional constraint (i.e., the density of functionally active residues) and its overall importance (or "dispensability") to the organism (Wilson et al. 1977).
Arguments for the role of dispensability are mostly based on theory (Ohta 1973), as very little empirical data have been available until recently (Hurst and Smith 1999; Hirsh and Fraser 2001). A relationship between dispensability and evolutionary rate is hypothesized because mutations in essential proteins are more likely to be deleterious. Such mutations are purged from the population, thereby reducing the evolutionary rate of indispensable proteins (Ohta 1973).
In contrast to dispensability, the relationship between functional constraint and evolutionary rate has been studied empirically for over 40 years. Among the first to address the role of functional constraint was Ingram (1961), who observed that the polypeptide chains of hemoglobin should be differentially constrained depending on the number of other chains with which they physically interact (Ingram 1961). Similarly, when studying cytochrome c, Dickerson (1971) observed that surface residues that interact with other proteins tend to be highly conserved. Although functional constraints are difficult to measure directly, or even define precisely, many studies have used as a proxy the number of physical interactions in which a given protein participates. Authors have recently confirmed early hypotheses of Ingram and Dickerson on a much larger scale using curated sets of interacting protein crystal structures (Mintseris and Weng 2005).
The sudden plethora of sequenced genomes allows us to compare orthologous coding sequences from related species, estimate evolutionary rates, and ask what features of proteins covary with their evolutionary rates. The yeast Saccharomyces cerevisiae has emerged as a model system for systematically studying the determinants of evolutionary rates. Saccharomyces cerevisiae was the first fully sequenced eukaryote, and its genome remains the most comprehensively annotated. Additionally, S. cerevisiae has been the subject of thousands of functional genomic experiments producing diverse information for evolutionary investigation.
Previous studies have reported a variety of functional, biophysical, and fitness-related variables that correlate with the evolutionary rates of proteins (Drummond et al. 2006): proteins evolve more slowly if they have a higher number of mRNA molecules per cell (expression) (Green et al. 1993; Pal et al. 2001), if they have a higher number of protein molecules per cell (abundance) (Drummond et al. 2006), a higher codon adaptation index (CAI) (Pal et al. 2001; Wall et al. 2005), more protein-protein interactions (degree) (Fraser et al. 2002), a larger fitness effect upon gene knockout (dispensability) (Hirsh and Fraser 2001), shorter sequence length (Marais and Duret 2001), or a more central role in the interaction network (centrality) (Hahn and Kern 2005). But these predictor variables are themselves correlated with one another, raising the question of which variables are truly involved in determining evolutionary rates and which variables happen to covary simply because they are influenced by another, causal variable. It has been argued, for example, that the correlation between dispensability and evolutionary rate is simply a side effect of causal relationships between expression level and evolutionary rate and between expression level and dispensability (Pal et al. 2003).
Drummond et al. (2006) undertook a comprehensive analysis of the determinants of protein evolution in yeast. Their work represents a significant advance towards identifying the major, independent correlates of evolutionary rates (McInerney 2006; Pal et al. 2006; Rocha 2006). Prior to the work of Drummond et al. (2006), many authors had used the techniques of multiple regression and partial correlation to assess whether correlates of evolutionary rate are independent of one another (Fraser et al. 2002; Bloom and Adami 2003; Rocha and Danchin 2004). Drummond et al. (2006) demonstrated that collinearity of predictor variables and measurement noise can cause partial correlations to yield spuriously significant results. In lieu of partial correlations, and in order to remove collinearity, Drummond et al. (2006) used a principal component regression (PCR) to analyze evolutionary rates. Surprisingly, they found that a single component, composed almost entirely of expression level, abundance, and CAI, explained far more variation in evolutionary rates than any other component. On the basis of these results, Drummond et al. proposed that a single determinant, namely, selection against translation-error-induced protein misfolding, dominates protein evolution in yeast (Drummond et al. 2005, 2006). This recent finding has already changed how researchers think about protein evolution (Koonin and Wolf 2006; McInerney 2006; Pal et al. 2006; Rocha 2006).
Here, we reinspect the analyses of evolutionary rates presented by Drummond et al. (2006). We show that the PCR utilized by Drummond et al. can be confounded when predictor variables have been measured with different amounts of noise. We assess the amount of noise associated with each of the 7 predictor variables used in previous studies of yeast protein evolution. We show that after equalizing noise levels across the predictor variables, each predictor has a roughly equal contribution to the evolutionary rate of proteins, and there is no evidence for a single, dominant factor driving protein evolution. Finally, we present a simple mathematical model of evolutionary rates, with parameter values determined by empirical yeast data. Our model demonstrates that the apparent predominance of translational selection as the determinant of yeast protein evolution may be a spurious artifact arising from the variable accuracy of functional genomic measurements.
| Results |
|---|
We have reanalyzed the data sets studied by Drummond et al. (2006)
Given the large number of variables correlated with evolutionary rate, we wish to know which correlations are independent of one another. For example, let R, E, and D denote evolutionary rate, mRNA expression level, and degree of protein-protein interactions, respectively. If these variables are noise free, then the partial correlation coefficient r(D, R | E) describes the relationship between degree and rate, controlling for expression. In practice, however, we have only noisy estimates of these variables, denoted E* and D*. Noise in expression data, for example, arises both from variability in expression levels between cells and from inaccuracies in measurement. We refer to both sources of variability as noise.
Drummond et al. (2006) have shown that the method of partial correlations can yield spuriously significant results when applied to noisy variables; that is, r(D*, R | E*) may show a significant departure from zero even when r(D, R | E) equals zero. In other words, as a result of noise, the relationship between interaction degree and evolutionary rate may appear to be independent of expression levels, even when the underlying, noiseless variables are uncorrelated when controlling for expression. In order to avoid the pitfalls of partial correlations, which are caused by noise and collinearity, Drummond et al. regressed evolutionary rates against the principal components of the 7 predictor variables. The principal component regression corrects for collinearity among predictor variables, but the PCR implicitly assumes that all predictors have been measured with the same amount of noise. In the following sections, we show that the assumption of equal noise is not valid for the data sets analyzed by Drummond et al., and we demonstrate that the apparent dominant determinant of protein evolution may be an artifact of this invalid assumption.
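The spurious partial correlation described above can be reproduced with a small simulation. The following is a minimal sketch and not the analysis of Drummond et al.: it assumes a hypothetical toy model in which D and R each depend only on E plus independent Gaussian noise, so the true partial correlation r(D, R | E) is zero. The coefficients (0.7 and -0.7), the noise level, and the function names are illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation r(x, y | z)."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate(n_genes=20000, noise_sd=1.0, seed=1):
    """Toy data (assumed model): D and R depend on E only, so r(D, R | E) = 0."""
    rng = random.Random(seed)
    E = [rng.gauss(0, 1) for _ in range(n_genes)]
    D = [0.7 * e + rng.gauss(0, 1) for e in E]
    R = [-0.7 * e + rng.gauss(0, 1) for e in E]
    E_star = [e + rng.gauss(0, noise_sd) for e in E]  # noisy measurement of E
    return partial_corr(D, R, E), partial_corr(D, R, E_star)


if __name__ == "__main__":
    clean, noisy = simulate()
    print(f"r(D, R | E)  = {clean:+.3f}")
    print(f"r(D, R | E*) = {noisy:+.3f}")
```

Controlling for the noisy E* removes only part of the shared dependence on E, so the second value departs clearly from zero even though D and R are conditionally independent by construction.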
Quantifying Noise in Functional Genomic Data
Some measurable features of genes are virtually free of noise, such as gene length and CAI. But other variables contain significant noise, such as mRNA expression levels, protein abundance, the fitness effects of knockouts, and the number of protein-protein interactions. How can we quantify and estimate the amount of noise associated with each of these genome-wide measurements?
The most straightforward way to quantify noise is to calculate the correlation coefficient between 2 (or more) independent measurements of the same quantity. For example, mRNA expression levels in yeast are quite reproducible: the correlation between 2 independent measurements of mRNA expression levels, using the same oligonucleotide arrays, has been reported as r_expr = 0.72 (n = 5,555) (Drummond et al. 2005). We compared expression levels measured by different investigators 3 years apart (Holstege et al. 1998; Causton et al. 2001) and found an even stronger correlation: r = 0.90 (n = 5,460). Nevertheless, we will use the value r_expr = 0.72 so that our analysis is conservative and our estimate of expression noise agrees with that of Drummond et al. (2005).
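To illustrate how a replicate correlation summarizes noise, here is a minimal sketch on synthetic data. It assumes a unit-variance true signal measured twice with independent Gaussian noise of standard deviation s, in which case the expected replicate correlation is 1 / (1 + s^2). The function names and the simulated data are assumptions for illustration only; no yeast data are used.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def noise_sd_for(reliability):
    """Noise s.d. giving the requested replicate correlation for a unit-variance signal."""
    return math.sqrt(1.0 / reliability - 1.0)


def replicate_correlation(reliability, n_genes=20000, seed=2):
    """Simulate two independent measurements and return their correlation."""
    rng = random.Random(seed)
    sd = noise_sd_for(reliability)
    truth = [rng.gauss(0, 1) for _ in range(n_genes)]
    first = [t + rng.gauss(0, sd) for t in truth]
    second = [t + rng.gauss(0, sd) for t in truth]
    return pearson(first, second)


if __name__ == "__main__":
    for target in (0.72, 0.56, 0.11):
        print(target, round(replicate_correlation(target), 3))
```

Under this assumed model, a replicate correlation of 0.72 corresponds to noise with a standard deviation of roughly 0.62 times that of the signal, whereas a correlation of 0.11 corresponds to noise nearly 3 times larger than the signal.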
Protein abundance data are apparently less noisy than mRNA abundance data. In the only systematic study reporting abundances for a large number of yeast proteins, Ghaemmaghami et al. (2003) measured 206 proteins in triplicate. Assuming the noise to be normally distributed, the average correlation in abundance between any 2 sets of 206 replicates is r = 0.98. This estimate of noise for protein abundances is not as conservative as for mRNA abundances above, because these replicates were performed using the same method by the same investigators. But this estimate nevertheless suggests that protein abundance measurements in yeast contain relatively little noise. In order to be conservative (see Discussion), we will assume that protein abundances contain approximately the same amount of noise as mRNA levels, r_abund = 0.72, even though this assumption likely overestimates the noise in protein abundances.
Gene dispensability data in yeast are more noisy than protein abundances or expression levels. The correlation coefficient between 2 independent measurements of the fitness effects of knockouts (Warringer et al. 2003; Deutschbauer et al. 2005) made using the same set of viable single-gene deletion strains is r_disp = 0.56 (n = 4,156). In order to facilitate comparison with Drummond et al. (2006), we analyze the same data set of gene dispensabilities (Deutschbauer et al. 2005), despite the fact that the other data set correlates more strongly with evolutionary rates (Wall et al. 2005).
The number, or degree, of protein-protein interactions is by far the most noisy variable analyzed in this study and previous related studies. Bloom and Adami (2003) recently summarized the results from 9 protein interaction data sets. Performing all 36 possible pairwise comparisons of these data sets, we find that as many pairs exhibit negative correlations as exhibit positive correlations. Despite the discouraging discordance among the 9 protein interaction data sets, it is well established that some data sets contain less noise than others (Kemmeren et al. 2002; von Mering et al. 2002). Protein interaction data sets assembled from low-throughput studies are difficult to use in this context because of the unknown but certainly extreme bias in which proteins have been studied by individual investigators (Reguly et al. 2006). For this reason, we chose to compare 2 of the highest quality high-throughput data sets that were generated by a single method: mass spectrometry (Gavin et al. 2002; Ho et al. 2002). Any systematic biases inherent in the mass spectrometry method will artificially inflate the correlation between these 2 measurements, leading to a conservative underestimate of noise. The observed agreement between the 2 measurements of protein interaction degree is r_degree = 0.11 (n = 524). Whereas this correlation is much greater than the median correlation among all 36 pairs of interaction data sets (r = 0.002), the reproducibility of the protein interaction data is much lower than for all other types of data in this study.
Protein interactions measured by mass spectrometry (Ho et al. 2002) are very similar to the composite data set of interactions analyzed by Drummond et al. (2006). Using the data from Ho et al. in place of the interaction data analyzed by Drummond et al. does not significantly alter the PCR of dN (compare fig. 1a and b, below). Therefore, because we can estimate the noise in the mass spectrometry data, but not in the composite data set used by Drummond et al., we will use the spectrometry data for all subsequent analyses. (Results are unchanged if we use the composite data set of interactions in our analyses instead of the mass spectrometry data.)
Lastly, we assume that a protein's centrality in the interaction network, a quantity based entirely on protein-protein interaction data, has the same noise level as the underlying interaction data. Relaxing this assumption by raising or lowering the amount of noise in the centrality data does not significantly affect our results.
Equalizing Noise across Predictor Variables
As we have shown above, the 7 correlates of evolutionary rate analyzed by Drummond et al. (2006) contain widely different amounts of noise. What conclusions, then, can we draw from PCRs, given that the PCR method assumes equal noise across all predictors? One way to answer this question is by artificially equalizing the noise levels across the predictor variables and repeating the PCR analysis. (A second way to answer the question is presented in a subsequent section.) We focus on the PCR because this is the technique employed by Drummond et al. (2006) to reach their conclusions about protein evolution in yeast.
The degree of protein interactions is by far the most noisy of the 7 predictor variables in our study. In order to match the level of noise in interaction degree, we can add an appropriate amount of extra noise to each of the other 6 predictors. As described in the Appendix, we can solve analytically for the appropriate amount of Gaussian noise to be added to each predictor so that the resulting variables have the same amount of noise as degree, namely, r_degree = 0.11.
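The noise-equalization step can be checked numerically. The sketch below is an illustration on synthetic data, not the authors' code. It assumes the closed form sigma = sqrt(r_E / r_degree - 1) for the standard deviation of the added Gaussian noise, which follows from requiring r_E / (1 + sigma^2) = r_degree for unit-variance variables. All function names and the simulated replicates are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def scale(x):
    """Center a variable and normalize it to unit variance."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return [(v - m) / sd for v in x]


def added_noise_sd(r_e, r_target):
    """Extra noise s.d. so that replicates with correlation r_e drop to r_target."""
    return math.sqrt(r_e / r_target - 1.0)


def equalize_demo(r_e=0.72, r_target=0.11, n_genes=20000, seed=3):
    """Return the replicate correlation before and after adding extra noise."""
    rng = random.Random(seed)
    truth = [rng.gauss(0, 1) for _ in range(n_genes)]
    base_sd = math.sqrt(1.0 / r_e - 1.0)  # measurement noise giving correlation r_e
    e1 = [t + rng.gauss(0, base_sd) for t in truth]
    e2 = [t + rng.gauss(0, base_sd) for t in truth]
    sigma = added_noise_sd(r_e, r_target)
    a = [v + rng.gauss(0, sigma) for v in scale(e1)]
    b = [v + rng.gauss(0, sigma) for v in scale(e2)]
    return pearson(e1, e2), pearson(a, b)


if __name__ == "__main__":
    before, after = equalize_demo()
    print(f"replicate correlation before: {before:.3f}, after: {after:.3f}")
```

In this simulation the added noise is drawn independently for each replicate, imitating 2 independent measurements that each carry their own extra noise.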
Figure 1 shows PCR analyses of the original predictor variables alongside analyses of modified predictors whose noise levels have been equalized. There are 3 features of each regression that are important to note: 1) the amount of variation in evolutionary rate explained by the dominant component; 2) the amount of variation explained by the second most dominant component; and 3) the loadings of predictor variables on components.
As figure 1 shows, variable noise among predictors dramatically affects the PCR analysis of evolutionary rates. Without correcting for variable noise, the PCR identifies a single component that explains at least 20-fold more variance in evolutionary rate than any other component (Drummond et al. 2006); whereas after equalizing noise levels, the dominant component does not explain significantly more variance than the subdominant component (i.e., not more than 2 standard deviations). In other words, after correcting for noise levels, there is no evidence of a single, dominating determinant of evolutionary rates in yeast.
Variable noise levels affect the PCR analysis in several other important ways. Without correcting for noise, the dominant explanatory component consists almost exclusively of translation-related variables: mRNA expression, protein abundance, and CAI. The translation-related variables each explain more than 4 times as much of the total variation in evolutionary rate as any of the other predictor variables. These results have been interpreted as conclusive evidence that translational selection governs the rate of protein evolution (Drummond et al. 2006). By contrast, after equalizing noise levels, the dominant explanatory component contains roughly equal loadings from all 7 predictor variables (fig. 1c), and each of the predictor variables explains roughly the same amount of total variation in evolutionary rate (within one standard deviation). In other words, after correcting for noise levels, there is no evidence of a dominant, translation-related determinant of evolutionary rates.
A Simple Model of Evolutionary Rates
As with other techniques for multiple regression, the PCR method assumes equal noise levels across predictors, and it is sensitive to violations of this assumption. As seen above, when we equalize noise levels among predictor variables, the resulting PCR paints a very different picture of yeast protein evolution than has been previously reported (Drummond et al. 2006). These results still raise the following question: given the known amount of noise associated with each predictor variable, how much variation in evolutionary rate would be explained by the underlying, noiseless predictors?
In this section, we provide one possible answer to this question by using a simple mathematical model. We present a phenomenological model of evolutionary rates, and we choose parameters consistent with the most important features of the observed yeast data. The purpose of this model is not to recapitulate every detail of the empirical data but rather to explore what underlying patterns are consistent with the salient features of the observed, noisy data.
For the sake of simplicity, we focus on the 4 variables that explain the most rate variation: expression (E), abundance (A), CAI (C), and protein-protein interaction degree (D). We demonstrate that the observed, noisy data are consistent with a model in which there are multiple independent determinants of evolutionary rates and in which the underlying (noiseless) protein interactions explain more variance in evolutionary rate than expression level, abundance, or CAI.
We specify our model so as to reflect several important biological features of protein evolution (Drummond et al. 2006): 1) mRNA expression, protein abundance, and CAI all covary because they all reflect, in part, the amount of translation events experienced by a gene; 2) expression, abundance, and CAI also covary with the degree of protein-protein interactions, for reasons unrelated to translation; 3) the amount of translation and the degree of protein interactions both influence the evolutionary rate. These features lead to the following model equations:
[equations not preserved in source]
Here, α and β quantify the relative importance of translation versus other sources of variation in the predictor variables. In our model, the evolutionary rate (R) is determined by the amount of translation (Z2) and by the variation in protein interactions unrelated to other variables (Z6). (Similar results are obtained under related models, such as R = αZ1 + βZ2; see also Supplementary Materials online.)
In addition to the underlying model, we also specify equations that describe noisy versions of the predictor variables representing measurements of expression (E*), measurements of abundance (A*), measurements of CAI (C*), and measurements of interaction degree (D*):
[equations not preserved in source]
We choose the 4 parameters of our model so as to match the most important empirical features of the yeast data: 1) the noise in mRNA expression data, r_expr = 0.72; 2) the noise in protein interaction data, r_deg = 0.11; 3) the correlation between measured expression levels and evolutionary rate, r(E*, R) = 0.56 (n = 2,840); and 4) the correlation between measured interaction degree and evolutionary rate, r(D*, R) = 0.23 (n = 692). Using a straightforward parameterization technique (see Appendix), we find parameters so as to match all 4 of these observed features in the yeast data.
It is instructive to compare a PCR analysis of the underlying variables in our model against a PCR analysis of the noisy variables, which represent measurable quantities. When applied to the noisy variables (fig. 2a), the PCR indicates a dominant explanatory component consisting almost entirely of the translation-related variables: expression (E*), abundance (A*), and CAI (C*). Each of these variables explains significantly more variation in evolutionary rate than protein interactions, which appear in a secondary minor component; this situation is analogous to that seen in the real data (fig. 1a and b). By contrast, when applied to the underlying noiseless variables, the PCR reveals a dramatically different picture (fig. 2b): no single component dominates evolutionary rates. Moreover, the true degree of protein interactions explains significantly more variation in rate than expression, abundance, or CAI.
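The qualitative contrast between a PCR on noisy and on precise predictors can be sketched in a few dozen lines. The example below does not use the authors' model equations, which are not reproduced here. It assumes a hypothetical toy model in which the rate depends equally on two independent latent factors, translation (T) and interactions (I); three predictors (E, A, C) measure T with little noise, and one predictor (D) measures I with adjustable noise. Principal components are obtained with a plain Jacobi eigenvalue routine, so that only the Python standard library is needed. All names and parameter values are illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardize(x):
    """Center a variable and scale it to unit variance."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return [(v - m) / sd for v in x]


def jacobi_eigenvectors(matrix, sweeps=30):
    """Eigenvectors (columns of the result) of a symmetric matrix, by Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-14:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # column rotation: a <- a J, v <- v J
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # row rotation: a <- J^T a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    return v


def pcr_variance_explained(predictors, response):
    """Share of variance in `response` explained by each component, largest first."""
    cols = [standardize(x) for x in predictors]
    m = len(cols)
    corr = [[pearson(cols[i], cols[j]) for j in range(m)] for i in range(m)]
    vecs = jacobi_eigenvectors(corr)
    shares = []
    for k in range(m):
        scores = [sum(vecs[i][k] * cols[i][g] for i in range(m))
                  for g in range(len(response))]
        shares.append(pearson(scores, response) ** 2)
    return sorted(shares, reverse=True)


def toy_pcr(degree_noise_sd, n_genes=5000, seed=4):
    """Assumed toy model: rate depends equally on translation (T) and interactions (I)."""
    rng = random.Random(seed)
    T = [rng.gauss(0, 1) for _ in range(n_genes)]
    I = [rng.gauss(0, 1) for _ in range(n_genes)]
    R = [-(t + i) for t, i in zip(T, I)]
    E, A, C = ([t + rng.gauss(0, 0.3) for t in T] for _ in range(3))
    D = [i + rng.gauss(0, degree_noise_sd) for i in I]
    return pcr_variance_explained([E, A, C, D], R)


if __name__ == "__main__":
    for sd in (0.3, 2.8):
        print("degree noise s.d.", sd, [round(s, 3) for s in toy_pcr(sd)])
```

With precisely measured degree (noise s.d. 0.3), the two leading components explain similar shares of the rate variance. With a degree noise s.d. of 2.8 (a replicate correlation near 0.11 under this toy model), a single translation-related component appears to dominate, even though both latent factors contribute equally by construction.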
The simple model developed here is consistent with the observed noise levels in yeast genomic data and with the observed correlations with evolutionary rates. Under this model, protein interaction degree explains significantly more variation in evolutionary rates than expression level, abundance, or CAI. Nevertheless, if one were to analyze the noisy versions of predictor variables (disregarding the fact that such a PCR analysis violates the assumption of equal noise levels), one would reach the opposite conclusion. Thus, our model highlights the danger of using a principal component regression to analyze noisy biological data without accounting for known variation in noise levels across predictor variables.
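A rough back-of-the-envelope check of why the noisy and noiseless pictures can differ uses Spearman's classical correction for attenuation. This is a standard textbook formula, not the parameterization used in this paper. It assumes independent measurement noise in the predictor and ignores noise in the evolutionary rate itself. The reliabilities and observed correlations are the magnitudes quoted in the text; the function name is an assumption.

```python
import math


def disattenuated(r_observed, reliability):
    """Spearman's correction: correlation expected if the predictor were noise free."""
    return r_observed / math.sqrt(reliability)


# Observed correlation with rate, and replicate correlation, as quoted in the text.
expr = disattenuated(0.56, 0.72)    # measured expression vs. rate
degree = disattenuated(0.23, 0.11)  # measured interaction degree vs. rate

print(f"expression: {expr:.2f}, degree: {degree:.2f}")
```

Under these assumptions the corrected correlations are about 0.66 for expression and 0.69 for interaction degree, so the large gap between the observed values (0.56 versus 0.23) could be accounted for by measurement noise alone. This is only a plausibility check and does not establish which variable matters more.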
Other Evidence on the Determinants of Evolutionary Rates
The preceding sections demonstrate that the PCR is not robust to violating the assumption of equal noise levels across predictor variables. This weakness is not unique to the PCR; it is likely shared by most other multiple regression techniques. As a result of this difficulty, and given the known variation in noise levels across predictors, there is no evidence at present for a single determinant of evolutionary rates in yeast.
In this section, we demonstrate another related line of evidence against our ability to deduce a single determinant of evolutionary rates: namely, the PCR depends strongly on which predictors are included in the regression.
A gene's expression, abundance, and CAI are all related to the total amount of translation events it experiences (Drummond et al. 2006). Therefore, Drummond et al. interpret the dominant explanatory component in their regression (composed in equal parts of expression, abundance, and CAI) as the amount of translation, and they conclude that selection against translation-error-induced protein misfolding is the predominant determinant of protein evolution in yeast (Drummond et al. 2005, 2006). If these conclusions were robust, a PCR analysis of the same data excluding CAI, for example, should yield a very similar result, except that the resulting dominant component would consist of abundance and expression.
As seen in figure 3, a regression excluding CAI paints a very different picture of protein evolution than expected under the translational-selection hypothesis. According to this regression, there is no evidence that translational processes dominate protein evolution. Instead, multiple independent components explain significant variation in evolutionary rates. Moreover, the degree of protein interactions explains more variance in evolutionary rate than protein abundances, and it is more strongly represented in the first component. Regressions excluding mRNA expression, protein abundance, or combinations of these variables yield very similar results. None of these results would be observed if the PCR method were robust and if translational processes dominated protein evolution.
Finally, we note that other techniques for analyzing collinear predictors, such as the sliced inverse regression (Duan and Li 1991
| Discussion |
|---|
What independent factors influence the rate of protein evolution remains an outstanding question. The work of Drummond et al. (2006)
We emphasize that the limitations of the PCR in the face of predictor variables with different degrees of noise are not unique to the PCR method. The same limitations apply to virtually all methods of multiple regression, which typically assume equal noise levels across predictors. We have chosen to focus on the PCR only because this was the method used by Drummond et al. (2006) to reach their conclusions about protein evolution in yeast.
Our analysis has formally demonstrated a concept that is intuitively clear: standard regression techniques cannot meaningfully compare the explanatory power of predictors when the predictors contain different amounts of measurement noise. It is less clear how best to deal with this difficulty, which comes hand in hand with diverse genomic data. Ideally, information about the known noise levels of predictors should be incorporated into statistical methods when partitioning phenotypic variance into independent contributions. Unfortunately, we know of no method that performs such a partitioning while accounting for variable noise levels. The simplistic approach used here for equalizing noise levels may seem Draconian because it throws out some signal contained in the less noisy predictors. Nevertheless, this conservative approach is necessary until more sophisticated statistical methods, which account for variable noise levels, are developed. Indeed, the inadequacy of the PCR in this context is further demonstrated by the fact that its results are not robust to removing predictor variables from the regression.
Our procedure for equalizing noise levels is conservative with respect to our conclusions because we have used one of the lowest estimates of noise in protein interaction data and some of the highest estimates of noise in expression and abundance data. In particular, we have estimated expression noise using the same correlation coefficient reported by Drummond et al. (2005), and our estimate of noise in protein interactions is significantly smaller than the median estimate across 36 pairwise comparisons.
Aside from adding noise to empirical data, we have also presented a simple model of evolutionary rates that is consistent with the observed yeast data. The purpose of this model is not to recapitulate all details of the empirical data but rather to explore what underlying patterns are consistent with the salient features of the observed, noisy data. According to this model, the underlying, noiseless variable for protein interaction degree explains more variance in evolutionary rate than all other variables, despite the fact that a PCR analysis on the noisy, measured variables would yield the opposite conclusion. We emphasize that this model does not establish that protein-protein interactions actually have a stronger influence on yeast evolutionary rates than expression, abundance, or CAI. Rather, the model simply demonstrates that we cannot yet rule out the possibility of an important, independent role for protein-protein interactions (or other noisy variables) in determining evolutionary rates.
We emphasize that our analysis does not rule out the possibility of a single determinant for evolutionary rates. Indeed, in the future, we may conclude definitively that translational processes explain more variance in evolutionary rates than any other feature of yeast proteins. At present, however, given the variable amounts of noise associated with existing genome-wide measurements, PCRs do not provide sufficient evidence to reach this conclusion. It appears that more accurate measurements, or more sophisticated statistical techniques, will be required to tease apart the underlying determinants of protein evolution.
| Appendix |
|---|
Data Sets
All data sets were taken directly from Drummond et al. (2006)
Equalizing Noise Levels across Data Sets
Before applying regressions, all variables were log transformed (except dispensability), centered, and variance normalized, as in Drummond et al. (2006). Our results remain essentially unchanged using rank regressions instead of parametric regressions.
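The preprocessing described above (log transform, centering, variance normalization) can be sketched as follows. This is an illustration only: the function name, the use of the natural logarithm, and the sample (n - 1) variance are assumptions, and the example values are made up. After variance normalization the base of the logarithm no longer matters.

```python
import math


def preprocess(values, log_transform=True):
    """Optionally log transform, then center and scale to unit sample variance."""
    if log_transform:
        values = [math.log(v) for v in values]  # requires strictly positive input
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


# Hypothetical expression-like values spanning several orders of magnitude.
z_expr = preprocess([1.0, 10.0, 100.0, 1000.0])

# Dispensability is left untransformed, so the log step is skipped for it.
z_disp = preprocess([0.2, 0.9, 0.5, 1.0], log_transform=False)
```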
Of the data sets used in this study, the protein-protein interaction data are the least precise. In order to equalize noise across predictor variables, we add an appropriate amount of noise to each transformed variable and then rescale each variable by its variance. In order to match the noise level in protein interaction data, r_degree = 0.11, we must add enough extra noise to each other variable so that, if we were to add the noise twice independently (so as to imitate 2 independent measurements, each with its own source of noise), the resulting correlation would equal r_degree.
To be more explicit, consider a predictor variable E with original noise level r_E > r_degree, given by the correlation between 2 independent measurements: r_E = r(E_1, E_2). We must find the value of σ such that

$$ r\big(\mathrm{scale}(E_1) + \sigma Z_1,\ \mathrm{scale}(E_2) + \sigma Z_2\big) = r_{\mathrm{degree}}, $$

where Z_1 and Z_2 are independent Gaussian variables with unit variance, independent of E. Letting A = scale(E_1) + σZ_1 and B = scale(E_2) + σZ_2 and substituting into the expression for the correlation coefficient, we must solve

$$ \frac{\mathbb{E}[AB] - \mathbb{E}[A]\,\mathbb{E}[B]}{\sqrt{\mathrm{Var}(A)\,\mathrm{Var}(B)}} = r_{\mathrm{degree}}. $$

Because A and B both have expected value zero, and each Z_i is uncorrelated with E_1 and E_2, our equation reduces to

$$ \frac{\mathbb{E}[\mathrm{scale}(E_1)\,\mathrm{scale}(E_2)]}{1 + \sigma^2} = r_{\mathrm{degree}}. $$

Because the numerator in the equation above equals r_E, we may write

$$ \sigma = \sqrt{\frac{r_E}{r_{\mathrm{degree}}} - 1}. $$

This equation gives a simple expression for the amount of noise, σ, that we must add to a variable E so as to equalize its noise level with that of protein interaction degree.
For our variables of interest, we have r_expr = r_abund = 0.72, r_degree = r_centrality = 0.112, r_disp = 0.561, and r_CAI = r_length = 1. As a result, σ_expr = σ_abund = 2.329, σ_degree = σ_centrality = 0, σ_disp = 2.003, and σ_CAI = σ_length = 2.815. Figure 1c shows the mean results from PCR analyses of the predictor variables after adding noise. Standard deviations were calculated from >2000 independent random draws.
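As a check on the arithmetic, the amounts of added noise can be recomputed from the replicate correlations listed above. The sketch assumes the closed form σ = sqrt(r_E / r_degree - 1), obtained by requiring r_E / (1 + σ^2) = r_degree for unit-variance variables; the dictionary keys and function name are illustrative.

```python
import math

R_DEGREE = 0.112  # replicate correlation of interaction degree (target noise level)

# Replicate correlations quoted in the text for the 7 predictors.
reliabilities = {
    "expr": 0.72, "abund": 0.72,
    "degree": 0.112, "centrality": 0.112,
    "disp": 0.561,
    "CAI": 1.0, "length": 1.0,
}


def added_noise_sd(r_e, r_target=R_DEGREE):
    """Standard deviation of the Gaussian noise to add to a unit-variance variable."""
    return math.sqrt(r_e / r_target - 1.0)


sigma = {name: added_noise_sd(r) for name, r in reliabilities.items()}

for name, value in sigma.items():
    print(f"sigma_{name} = {value:.3f}")
```

The computed values agree with those reported above to within about 0.001; small differences in the last digit are expected because the quoted correlations are themselves rounded.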
Principal Components Regressions
All regressions were performed in R (www.r-project.org). In all cases, we have retained all the components in the PCRs. Although there are techniques designed to assess the appropriate number of "nondegenerate" components, such techniques are inherently subjective (Jackson 1993), and so their utility in this context is unclear.
Parameterizing the Model
Our phenomenological model of expression level (E), abundance (A), CAI (C), protein interaction degree (D), and evolutionary rate (R) depends on 4 parameters according to the equations
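[Model equations not recovered in this copy.]

The model also specifies the measured quantities: 2 independent measurements of expression level ($E^*_1$, $E^*_2$), a measurement of abundance ($A^*$), a measurement of CAI ($C^*$), and 2 independent measurements of interaction degree ($D^*_1$, $D^*_2$):

[Measurement equations not recovered in this copy.]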
In these equations, each $W_i$ is an independent Gaussian variable, and we have conservatively assumed that abundance data are as noisy as expression data ($n_A = n_E$). We wish to choose the 4 parameters $\alpha$, $\beta$, $n_E$, and $n_D$ so as to match important features of the yeast data: 1) the noise in mRNA expression data, $r_{\mathrm{expr}} = 0.72$; 2) the noise in protein interaction data, $r_{\mathrm{deg}} = 0.11$; 3) the correlation between measured expression levels and evolutionary rate, $r(E^*_1, R) = 0.56$; and 4) the correlation between measured interaction degree and evolutionary rate, $r(D^*_1, R) = 0.23$. In other words, our parameters should satisfy the following 4 conditions as precisely as possible:

$$r(E^*_1, E^*_2) = 0.72, \qquad r(D^*_1, D^*_2) = 0.11, \qquad r(E^*_1, R) = 0.56, \qquad r(D^*_1, R) = 0.23.$$
Under the assumptions of our model, we can write analytic expressions for the left-hand sides of these equations in terms of our 4 parameters:

[Analytic expressions not recovered in this copy.]
| Supplementary Material |
|---|
|
|
|---|
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
|
|
|---|
We thank D. A. Drummond, C. O. Wilke, and A. Raval for productive conversations and contributions to the manuscript. J.B.P. acknowledges support from the Burroughs Wellcome Fund. Note added in proof: Kim and Yi (Genetica 2007, DOI 10.1007/s10709-006-9125-2) have recently and independently demonstrated that principal component regression can yield spurious results when predictor variables contain different amounts of noise.
| Footnotes |
|---|
Michele Vendruscolo, Associate Editor
| References |
|---|
|
|
|---|
Bloom JD, Adami C. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol Biol (2003) 3.
Causton HC, Ren B, Koh SS, Harbison CT, Kanin E, Jennings EG, Lee TI, True HL, Lander ES, Young RA. Remodeling of yeast genome expression in response to environmental changes. Mol Biol Cell (2001) 12:323-337.
Deutschbauer AM, Jaramillo DF, Proctor M, Kumm J, Hillenmeyer ME, Davis RW, Nislow C, Giaever G. Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast. Genetics (2005) 169:1915-1925.
Dickerson RE. The structures of cytochrome c and the rates of molecular evolution. J Mol Evol (1971) 1:26-45.
Drummond D, Bloom JD, Adami C, Wilke CO, Arnold FH. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci USA (2005) 102:14338-14343.
Drummond DA, Raval A, Wilke CO. A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol (2006) 23:327-337.
Duan N, Li KC. Slicing regression: a link-free regression method. Ann Stat (1991) 19:505-530.
Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW. Evolutionary rate in the protein interaction network. Science (2002) 296:750-752.
Gavin AC, Bosche M, Krause R, et al. (38 co-authors). Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature (2002) 415:141-147.
Ghaemmaghami S, Huh WK, Bower K, Howson RW, Belle A, Dephoure N, O'Shea EK, Weissman JS. Global analysis of protein expression in yeast. Nature (2003) 425:737-741.
Green P, Lipman D, Hillier L, Waterston R, States D, Claverie JM. Ancient conserved regions in new gene sequences and the protein databases. Science (1993) 259:1711-1716.
Hahn MW, Kern AD. Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol Biol Evol (2005) 22:803-806.
Han JD, Bertin N, Hao T, et al. (11 co-authors). Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature (2004) 430:88-93.
Hirsh AE, Fraser HB. Protein dispensability and the rate of evolution. Nature (2001) 411:1046-1049.
Hirsh AE, Fraser HB, Wall DP. Adjusting for selection on synonymous sites in estimates of evolutionary distance. Mol Biol Evol (2005) 22:174-177.
Ho Y, Gruhler A, Heilbut A, et al. (46 co-authors). Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature (2002) 415:180-183.
Holstege FC, Jennings EG, Wyrick JJ, Lee TI, Hengartner CJ, Green MR, Golub TR, Lander ES, Young RA. Dissecting the regulatory circuitry of a eukaryotic genome. Cell (1998) 95:717-728.
Hurst LD, Smith NG. Do essential genes evolve slowly? Curr Biol (1999) 9:747-750.
Ingram VM. Gene evolution and the haemoglobins. Nature (1961) 189:704-708.
Jackson DA. Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches. Ecology (1993) 74:2204-2214.
Kemmeren P, van Berkum NL, Vilo J, Bijma T, Donders R, Brazma A, Holstege FC. Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol Cell (2002) 9:1133-1143.
Koonin EV, Wolf YI. Evolutionary systems biology: links between gene evolution and function. Curr Opin Biotechnol (2006) 17:481-487.
Marais G, Duret L. Synonymous codon usage, accuracy of translation, and gene length in Caenorhabditis elegans. J Mol Evol (2001) 52:275-280.
McInerney JO. The causes of protein evolutionary rate variation. Trends Ecol Evol (2006) 21:230-232.
Mintseris J, Weng Z. Structure, function, and evolution of transient and obligate protein-protein interactions. Proc Natl Acad Sci USA (2005) 102:10930-10935.
Ohta T. Slightly deleterious mutant substitutions in evolution. Nature (1973) 246:96-98.
Pal C, Papp B, Hurst LD. Highly expressed genes in yeast evolve slowly. Genetics (2001) 158:927-931.
Pal C, Papp B, Hurst LD. Genomic function: rate of evolution and gene dispensability. Nature (2003) 421:496-497.
Pal C, Papp B, Lercher MJ. An integrated view of protein evolution. Nat Rev Genet (2006) 7:337-348.
Reguly T, Breitkreutz A, Boucher L, et al. (20 co-authors). Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. J Biol (2006) 5:11.
Rocha EP, Danchin A. An analysis of determinants of amino acids substitution rates in bacterial proteins. Mol Biol Evol (2004) 21:108-116.
Rocha EPC. The quest for the universals of protein evolution. Trends Genet (2006) 22:412-416.
von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P. Comparative assessment of large-scale data sets of protein-protein interactions. Nature (2002) 417:399-403.
Wall DP, Hirsh AE, Fraser HB, Kumm J, Giaever G, Eisen MB, Feldman MW. Functional genomic analysis of the rates of protein evolution. Proc Natl Acad Sci USA (2005) 102:5483-5488.
Warringer J, Ericson E, Fernandez L, Nerman O, Blomberg A. High-resolution yeast phenomics resolves different physiological features in the saline response. Proc Natl Acad Sci USA (2003) 100:15724-15729.
Wilson A, Carlson SS, White TJ. Biochemical evolution. Annu Rev Biochem (1977) 46:573-639.