

MBE Advance Access originally published online on August 4, 2006
Molecular Biology and Evolution 2006 23(11):2039-2048; doi:10.1093/molbev/msl081

© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Protein Evolution Is Faster Outside the Cell

Karin Julenius*,† and Anders Gorm Pedersen‡

* Division of Matrix Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden
† Stockholm Bioinformatics Center, SCFAB, Stockholm University, Stockholm, Sweden
‡ Center for Biological Sequence Analysis, BioCentrum-DTU, Technical University of Denmark, Lyngby, Denmark

E-mail: karinjul@sbc.su.se.


    Abstract
 
Some proteins are highly conserved across all species, whereas others diverge significantly even between closely related species. Attempts have been made to correlate the rate of protein evolution to amino acid composition, protein dispensability, and the number of protein–protein interactions, but in all cases, conflicting studies have shown that the theories are hard to confirm experimentally. The only correlation that is undisputed so far is that highly/broadly expressed proteins seem to evolve at a lower rate. Consequently, it has been suggested that correlations between evolution rate and factors like protein dispensability or the number of protein–protein interactions could be just secondary effects due to differences in expression. The purpose of this study was to analyze mammalian proteins/genes with known subcellular location for variations in evolution rates. We show that proteins that are exported (extracellular proteins) evolve faster than proteins that reside inside the cell (intracellular proteins). We find weak, but significant, correlations between evolution rates and expression levels, percentage of tissues in which the proteins are expressed (expression broadness), and the number of protein interaction partners. More important, we show that the observed difference in evolution rate between extra- and intracellular proteins is largely independent of expression levels, expression broadness, and the number of protein–protein interactions. We also find that the difference is not caused by an overrepresentation of immunological proteins or disulfide bridge–containing proteins among the extracellular data set. We conclude that the subcellular location of a mammalian protein has a larger effect on its evolution rate than any of the other factors studied in this paper, including expression levels/patterns. 
We observe a difference in evolution rates between extracellular and intracellular proteins for a yeast data set as well and again show that it is completely independent of expression levels.

Key Words: evolution rate • subcellular localization • expression level • protein connectivity • immunological protein • disulfide bridge


    Introduction
 
Some proteins are highly conserved across all species, whereas others diverge significantly even between closely related species (Li and Graur 1991). In order to understand what causes these differences, attempts to correlate the protein evolution rate to a number of different factors have been made. In a previous study involving 60 mammalian genes, the rate of evolution was found to correlate to amino acid composition of the protein (Graur 1985), but this finding was later convincingly disputed (Tourasse and Li 2000). For a yeast (Hirsh and Fraser 2001) and a bacterial (Jordan et al. 2002) data set, the rate of evolution was found to be correlated to protein dispensability, indicating that essential genes evolve slower than nonessential genes, in agreement with the neutral theory of molecular evolution (Kimura 1979). However, conflicting studies in yeast (Pal et al. 2003; Yang et al. 2003), bacteria (Rocha and Danchin 2004), and mouse (Hurst and Smith 1999) show that more work is needed to prove this as a general rule. Using the yeast protein interaction network, it has been shown that the evolution rate is correlated to the protein connectivity, indicating that proteins involved in a large number of different protein–protein contacts evolve at a lower rate compared with those that have fewer interaction partners (Fraser et al. 2002; Fraser and Hirsh 2004), whereas others, using the same data, found the results inconclusive (Bloom and Adami 2003; Jordan et al. 2003). It has been shown that there is an inverse relationship between evolution rate and age of a gene for human–mouse orthologous protein pairs (Alba and Castresana 2005), whereas others claim that this is an artifact due to the difficulty in identifying protein orthologs in distantly related species for fast-evolving genes (Elhaik 2006).
The only undisputed correlation so far is that highly/broadly expressed proteins seem to evolve at a lower rate, and this has been shown for mammals (Duret and Mouchiroud 2000; Jordan et al. 2004; Zhang and Li 2004), yeast (Pal et al. 2001), and bacteria (Rocha and Danchin 2004) alike. The observed correlations to protein dispensability and protein connectivity could possibly be explained as secondary effects caused by differences in expression because essential genes tend to have higher/broader expression (Hurst and Smith 1999; Pal et al. 2003; Rocha and Danchin 2004) and highly expressed proteins are overrepresented in protein interaction networks due to the protein concentration dependence of proteomics-based techniques (Bloom and Adami 2003; Jordan et al. 2003). Recently, a study on the yeast, worm, and fly interactomes concluded that proteins that have a more central position in the network evolve more slowly and are more likely to be essential for survival (Hahn and Kern 2005). A study on yeast concluded that gene dispensability and gene expression levels have independent, significant effects on the rate of protein evolution (Wall et al. 2005), and a combined analysis of 7 different factors concluded that the most important factor in yeast, explaining nearly half the variation in the rate of protein evolution, is translational selection (as measured by expression level, codon adaptation index, and protein abundance) (Drummond et al. 2005).

In a previous study, we investigated mammalian mucin-type O-glycosylation sites aiming, among other things, to investigate whether glycosylated serines and threonines are more likely to be evolutionarily conserved than other serines and threonines (Julenius et al. 2005). As a negative data set, we tried using serines and threonines present in nuclear proteins. Because mucin-type O-glycosylation is only found in extracellular proteins or extracellular domains of membrane proteins, we could be certain that the nuclear serines and threonines would not be glycosylated. During the collection of these data, it quickly became apparent that the serines and threonines of nuclear proteins were in fact much more conserved than the serines and threonines of the O-glycosylated proteins (regardless of whether or not they were glycosylated). In this serendipitous manner we became interested in investigating the rate of protein evolution in different subcellular compartments. Since then, others have found similar indications (Winter et al. 2004; Aris-Brosou 2005), although no one has yet made this the main topic of a study.

The aim of this study was to investigate whether there was a significant difference in the evolution rate of proteins with different subcellular localizations. We found for a mammalian data set that extracellular proteins evolve faster than intracellular proteins. We also found that the differences in evolution rates between extra- and intracellular proteins are independent of expression levels, number of protein interaction partners, presence of disulfide bridges, and whether or not the protein is involved in the immune response. We observe a difference in evolution rates between extra- and intracellular proteins for a yeast data set as well and show that it is again completely independent of expression levels.


    Materials and Methods
 
Amino Acid Sequence Analysis
Mammalian proteins were extracted from Swiss-Prot, release 43 (Boeckmann et al. 2003), and sorted according to the "subcellular location" comment. Only exact and complete matches to one of the following phrases were included: "Nuclear.", "Cytoplasmic.", "Secreted.", "Extracellular.", "Type I membrane protein.", "Type II membrane protein.", and "Integral membrane protein." The two terms Secreted and Extracellular are synonymous according to personal communications with a Swiss-Prot annotator and are referred to as "extracellular" throughout the article. Orthologs were identified using the first part of the Swiss-Prot ID (e.g., APA1_HUMAN, APA1_BOVIN, and APA1_MOUSE were taken to be orthologs), and each protein set was only represented once. The number of proteins in each orthology group varied between 2 and 40 (median 3; mean 3.2). For comparison, we also performed the analysis exclusively on the proteins where we had access to orthologs from human, mouse, and rat. All other species were discarded for this analysis. Multiple alignments of the amino acid sequences were generated using ClustalW (Thompson et al. 1994).
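The ID-based ortholog grouping described above can be illustrated with a short sketch. This is not the authors' script: the function name `group_orthologs` and the example entry `ALBU_HUMAN` are hypothetical, and dropping groups with fewer than two members is an assumption based on the stated minimum group size of 2.

```python
from collections import defaultdict

def group_orthologs(entry_names):
    """Group Swiss-Prot entry names by the part before the underscore.

    Assumption: entries sharing the same prefix (e.g. APA1_HUMAN and
    APA1_MOUSE) are treated as orthologs, as described in the text.
    """
    groups = defaultdict(list)
    for name in entry_names:
        prefix, _, _species = name.partition("_")
        groups[prefix].append(name)
    # Assumption: a group needs at least two members to be aligned.
    return {p: sorted(members) for p, members in groups.items() if len(members) >= 2}

# Hypothetical input list; ALBU_HUMAN has no partner and is dropped.
entries = ["APA1_HUMAN", "APA1_BOVIN", "APA1_MOUSE", "ALBU_HUMAN"]
orthologs = group_orthologs(entries)
print(orthologs)
```

Grouping on the name prefix is simple but relies entirely on consistent Swiss-Prot naming; it does not verify orthology by sequence.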

For each alignment, the sequence diversity was quantified using Nei's sequence diversity measure, π (Nei and Li 1979). π is defined as the average fraction of differing sites between all possible pairwise comparisons of the sequences; a sequence diversity of 0.05 in a population of sequences thus means that on average any pair of sequences will be different at 5% of their sites. Signal peptides and propeptides (according to Swiss-Prot annotation) were not included in the analysis. The amino acid diversities of the extracellular, transmembrane, and cytosolic segments were calculated individually for transmembrane proteins. Information about membrane topology was extracted from Swiss-Prot annotation although this was not experimentally verified for all entries.
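As an illustration of the definition above, the following minimal sketch computes the average pairwise fraction of differing sites for a small alignment. The function names and the toy sequences are hypothetical, and skipping columns where either sequence has a gap is an assumption; the text does not say how gaps were treated.

```python
from itertools import combinations

def pairwise_difference(a, b):
    """Fraction of aligned sites at which two sequences differ.

    Assumption: columns where either sequence has a gap ('-') are skipped.
    """
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else 0.0

def sequence_diversity(alignment):
    """Average pairwise difference over all sequence pairs in the alignment."""
    pairs = list(combinations(alignment, 2))
    return sum(pairwise_difference(a, b) for a, b in pairs) / len(pairs)

# Toy alignment of three hypothetical orthologs, 8 sites each.
alignment = ["MKTAYIAK", "MKTAYIAR", "MKSAYIAK"]
pi = sequence_diversity(alignment)
print(round(pi, 4))
```

Here the three pairs differ at 1, 1, and 2 of 8 sites, so the diversity is the mean of 0.125, 0.125, and 0.25.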

Amino acid–specific measures of sequence conservation were also calculated. For each aligned position, the frequency of the most prevalent amino acid residue was determined. Sequence conservation for this individual amino acid residue was defined as the fraction of sequences with identical amino acid residue in the aligned position. For each protein alignment (and subcellular location in the case of membrane proteins), we calculated average degrees of conservation for the types of amino acid residues present in the sequence. Again, signal peptides and propeptides were excluded. For each of the subcellular categories, these averages were used to calculate an overall average as well as the standard deviation (SD). Assuming normal distribution, the SD was used to estimate a 95% confidence interval for the true overall average.
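The per-residue conservation measure and the confidence interval can be sketched as follows. This is an interpretation, not the authors' code: the function names are hypothetical, gaps are ignored when counting (an assumption), and the interval uses the common normal approximation of mean ± 1.96 · SD / √n, which the text does not spell out.

```python
from collections import Counter, defaultdict
from math import sqrt
from statistics import mean, stdev

def conservation_by_residue(alignment):
    """Average conservation per residue type for one alignment.

    For each column, the most prevalent residue is found and the column's
    conservation is the fraction of (non-gap) sequences carrying it.
    """
    per_residue = defaultdict(list)
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # assumption: gaps ignored
        if not residues:
            continue
        top, count = Counter(residues).most_common(1)[0]
        per_residue[top].append(count / len(residues))
    return {res: mean(vals) for res, vals in per_residue.items()}

def confidence_interval_95(values):
    """95% CI of the mean, assuming normality (1.96 * SD / sqrt(n))."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return m - half_width, m + half_width

# Toy alignment and hypothetical per-alignment averages.
cons = conservation_by_residue(["MKC", "MRC", "MKC"])
low, high = confidence_interval_95([0.9, 0.8, 1.0, 0.9])
print(cons, (round(low, 3), round(high, 3)))
```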

DNA Sequence Analysis
DNA sequences were identified through cross-references to GenBank (Benson et al. 2004) in the Swiss-Prot entries. Only DNA sequences that were consistent with the corresponding amino acid sequence were accepted, leading to a slight reduction in data set size (table 1). The DNA sequences were aligned according to the existing protein alignment using RevTrans (Wernersson and Pedersen 2003), and neighbor-joining trees were created using ClustalW (Thompson et al. 1994). For membrane proteins, DNA sequences were divided into extracellular, transmembrane, and cytosolic fragments. The fragments were concatenated with fragments from the same cellular compartment and protein. Only membrane protein sequences of total length 80 aa or more after concatenation were accepted for further analysis. One dN/dS ratio per alignment was estimated using codeml in the phylogenetic analysis by maximum likelihood (PAML) software package (Yang 1997) (model M0).


 
Table 1 The Number of Protein Alignments for the Mammalian Data Sets

 
Permutation Test
We used permutation tests to investigate whether the means of two dN/dS distributions were significantly different, that is, whether the observed difference could have arisen by chance. For the sake of clarity, we will call the distribution with the lower mean A and the distribution with the higher mean B. A and B were merged, and a new distribution with the same size as B was constructed by randomly drawing from the combined distribution. We repeated the drawing process 10,000 times, and if the mean of the randomly drawn distribution was equal to or larger than the mean of B fewer than 100 times out of the 10,000, the means of A and B were found to be significantly different with P < 0.01. Testing only B is sufficient: the mean of all values (A + B) is a constant and remains the same for all permutations, so if the mean of a random distribution of the same size as B is smaller than the mean of B, it follows that the mean of the remaining distribution must be larger than the mean of A.
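The procedure can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name, the fixed random seed, and the example dN/dS values are hypothetical, and drawing without replacement is an assumption suggested by the reference to the "remaining distribution".

```python
import random

def permutation_test(a, b, n_draws=10_000, seed=0):
    """Estimate P for the difference in means of two samples.

    Returns the fraction of random draws (of the size of the higher-mean
    sample) whose mean is at least as large as the observed higher mean.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility
    mean = lambda xs: sum(xs) / len(xs)
    if mean(a) > mean(b):      # make B the distribution with the higher mean
        a, b = b, a
    pooled = list(a) + list(b)
    observed = mean(b)
    hits = 0
    for _ in range(n_draws):
        # Assumption: draw without replacement from the merged values.
        sample = rng.sample(pooled, len(b))
        if mean(sample) >= observed:
            hits += 1
    return hits / n_draws

# Hypothetical dN/dS values, for illustration only.
intracellular = [0.05, 0.08, 0.10, 0.07, 0.12, 0.06, 0.09, 0.11]
extracellular = [0.25, 0.30, 0.22, 0.35, 0.28, 0.40, 0.27, 0.33]
p_value = permutation_test(intracellular, extracellular)
print(p_value < 0.01)
```

With 10,000 draws, fewer than 100 hits corresponds to the P < 0.01 threshold used in the text.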

Expression Data Analysis
Expression data for the human and/or mouse genes were collected from GNF SymAtlas v 0.8.0 (Su et al. 2002). The expression level of a protein was taken as a single mean over all measured expression levels, that is, over all tissues, all DNA probes, and, if available, both organisms (mouse and human). The broadness of tissue expression was estimated by calculating the fraction of tissues in which the genes are positively expressed for every DNA probe individually and averaging over the probes of that protein. Positive expression in a tissue was defined as those cases where a gene displayed at least 20% of its maximum expression and at the same time had an absolute expression of at least 100.
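The broadness calculation can be illustrated as below. The function names and the probe values are hypothetical; taking "maximum expression" to mean the highest value of the same probe across tissues is an assumption.

```python
def probe_broadness(levels, rel_cutoff=0.20, abs_cutoff=100.0):
    """Fraction of tissues with positive expression for one probe.

    A tissue counts as positive if its level is at least 20% of the probe's
    maximum (assumed: maximum across tissues) and at least 100 in absolute terms.
    """
    peak = max(levels)
    positive = [x for x in levels if x >= rel_cutoff * peak and x >= abs_cutoff]
    return len(positive) / len(levels)

def expression_broadness(probes):
    """Average the per-probe broadness over all probes of a protein."""
    return sum(probe_broadness(p) for p in probes) / len(probes)

# Two hypothetical probes measured in four tissues each.
probes = [[1000, 250, 150, 90], [500, 120, 80, 400]]
broadness = expression_broadness(probes)
print(broadness)
```

In this toy case the first probe is positive in 2 of 4 tissues and the second in 3 of 4, giving an average broadness of 0.625.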

Analysis of Number of Protein–Protein Interactions
In order to determine the number of protein–protein interactions for members of our data sets, we used a comprehensive database constructed in connection with another current project (Kasper Lage et al., unpublished data). Briefly, this database has been made by pooling human interaction data from a number of the largest databases and then increasing coverage by transferring data from model organisms. All interactions in the database have been assessed for trustworthiness using a score that relies on network topology and furthermore takes into account that interactions from large-scale experiments generally contain more false positives than the interactions from small-scale experiments and that interactions are more reliable if they have been reproduced in more than one independent interaction experiment.

Immunological Proteins
To investigate the contribution of immunological proteins involved in recognizing pathogens to the evolution rate of extracellular proteins, 225 Gene Ontology terms indicating possible involvement in immunological processes and pathogen binding were identified. The Swiss-Prot entries of all orthologs of every extracellular protein were searched for the presence of these terms and any one hit identified that particular protein as immunological. A total of 95 proteins for the amino acid sequence analysis and 90 proteins for the DNA sequence analysis were identified using this method.

Yeast Data Set
Evolution rates (estimated by comparison to Saccharomyces bayanus, Saccharomyces mikatae, and Saccharomyces paradoxus) (Wall et al. 2005), gene dispensability data (Deutschbauer et al. 2005), and expression levels (Holstege et al. 1998) for Saccharomyces cerevisiae were downloaded from the electronic supplement of Drummond et al. (2005). As gene dispensability measurement, we used both average growth rates of the homozygous deletion strains and whether the gene was essential or not. Lists of yeast genes with the subcellular localizations "nucleus," "cytoplasm," "extracellular," and "cell wall" were downloaded from the Comprehensive Yeast Genome Database (Guldener et al. 2005). Because exported proteins were sparse, additional extracellular and cell wall proteins were identified using Saccharomyces Genome Database (Christie et al. 2004).


    Results
 
Analysis of Diversity in Different Subcellular Compartments
From Swiss-Prot, we collected a total of 2,723 mammalian proteins where the subcellular location was known. Specifically, the following location categories were included: nuclear, cytoplasmic, transmembrane, and extracellular (table 1). The 2,723 groups of orthologs consisted of between 2 and 40 proteins each (median group size: 3; mean: 3.2). For every single group of orthologs, a multiple alignment was constructed, and Nei's sequence diversity π (Nei and Li 1979) was calculated (fig. 1A). The sequence diversity is the average pairwise difference in a set of sequences, and we here use it as a simple measure of the rate of evolution: all other things being equal, the diversity will be higher in alignments of more rapidly evolving sequences.


 
FIG. 1. Violin plot of amino acid sequence diversity for proteins from different sides of the cellular membrane. π is the average diversity when making pairwise comparisons between all the sequences in a multiple alignment. High π indicates dissimilar amino acid sequences of orthologs from different species; low π indicates similar sequences. A violin plot is a combination of a box plot and a density plot (Hintze and Nelson 1998). The white dot indicates the median and the black bar the first and third quartile. (A) All mammalian proteins present in Swiss-Prot. (B) Only proteins for which human, mouse, and rat orthologs were all available, with all other species excluded.

 
For globular proteins we observe that the average diversity varies between the different subcellular compartments in the following order: nuclear < cytoplasmic < extracellular (fig. 1A). Interestingly, a similar trend is seen for different parts of transmembrane proteins where diversity is low in transmembrane regions, higher for cytosolic regions, and highest for extracellular regions (fig. 1A). The mean diversities for the above-mentioned categories are significantly different in all cases (table 2). To rule out the possibility that these results are biased due to uneven taxonomic coverage in the data sets for different subcellular compartments, we also analyzed the diversity in a set of alignments that all contained only human, mouse, and rat sequences (table 1). The results from this analysis are essentially identical to the results of the full data analysis (cf. figs. 1A and 1B).


 
Table 2 95% Confidence Intervals of the Mean for the Mammalian Data Sets

 
In conclusion, the analysis of sequence diversity in globular and transmembrane proteins in a large mammalian data set shows that extracellular (parts of) proteins evolve more rapidly than intracellular (parts of) proteins.

Analysis of Conservation of Individual Amino Acids
In order to see if the differences are attributed to certain amino acid residues, we also calculated an amino acid–specific measure of sequence conservation. For each aligned position, the sequence conservation with respect to the most prevalent amino acid residue in that position was calculated. The results were averaged for each type of amino acid residue within each subcellular location category, and confidence levels were calculated. For the membrane proteins, the extracellular, the transmembrane, and the cytosolic parts were averaged individually. We observe a statistically significant difference where extracellular proteins show lower sequence conservation for every individual residue type except for cysteine as compared with intracellular proteins (fig. 2).


 
FIG. 2. Amino acid conservation for proteins from different sides of the cellular membrane. Fraction of identical sites is plotted for different amino acids, plus one total for all amino acids. Nuclear proteins in magenta (■), cytoplasmic proteins in black (◆), extracellular proteins in blue (×), and membrane proteins: membrane part in green (□), cytosolic part in orange (▲), and extracellular part in brown (•). The vertical bars show the 95% confidence interval of the means.

 
Analysis of Selective Pressure in Different Subcellular Compartments
In order to rule out that the differences in evolution rates between proteins from different subcellular compartments are merely due to differences in mutation rates, we also analyzed the corresponding DNA sequences. When analyzing DNA sequences for molecular evolution purposes, one distinguishes between the rate of synonymous mutations per synonymous site (dS, mutations that do not change the corresponding amino acid) and the rate of nonsynonymous mutations per nonsynonymous site (dN, mutations that do). The ratio of these rates (dN/dS) provides information about the selective pressure acting on the investigated set of sequences. If no selection is acting on the encoded protein, then the synonymous and nonsynonymous rates per site will be the same, leading to a dN/dS = 1. Similarly, dN/dS < 1 indicates that the protein has been under negative (purifying) selection, whereas dN/dS > 1 indicates the presence of positive, adaptive selection (Yang 1997, 2002).

Through cross-referencing from Swiss-Prot (Boeckmann et al. 2003) to GenBank (Benson et al. 2004), we gathered as many DNA sequences as possible corresponding to the protein data sets (table 1). Using PAML (Yang 1997), we estimated one dN/dS ratio per protein alignment. The resulting distributions of dN/dS ratios for proteins of different subcellular compartments are shown as a violin plot in figure 3. Analogous to the π results, we observe that the mean of the dN/dS ratio varies between the different cellular compartments in the following order: nuclear < cytoplasmic < extracellular for globular proteins and transmembrane regions < cytoplasmic regions < extracellular regions for cellular membrane proteins. The means of the distributions are significantly different in all cases (table 2). Most cytosolic and nuclear proteins have a low dN/dS ratio (<0.2), whereas extracellular proteins show a wider distribution of different dN/dS ratios with a higher overall mean. For membrane proteins we see the same tendency, although the differences are less distinct. The difference in dN/dS on either side of the cellular membrane shows that, on average, DNA coding for intracellular proteins is under more strict negative (purifying) selection than DNA coding for extracellular proteins.


 
FIG. 3. Violin plot of gene evolution rates for proteins from different sides of the cellular membrane. dN is the number of nonsynonymous substitutions per nonsynonymous site, and dS is the number of synonymous substitutions per synonymous site. High dN/dS indicates positive selection, and low dN/dS indicates negative selection. A violin plot is a combination of a box plot and a density plot (Hintze and Nelson 1998). The white dot indicates the median and the black bar the first and third quartile.

 
Impact of Expression Level on Results
To rule out that the differences in evolution rates between extracellular and intracellular (cytoplasmic and nuclear) proteins that we have discovered are secondary effects from differences in expression level, we investigated the effects of gene expression levels on the rate of gene evolution in our data set. Expression data from a publicly available database, GNF SymAtlas v 0.8.0 (Su et al. 2002), were collected for as many proteins as possible (table 1). No correlation was found between dN/dS and expression level, but a weak correlation between dN/dS and expression broadness was seen (r = –0.17, r² = 0.029) that was significant. The negative correlation shows that proteins with broad expression have lower evolution rate, in agreement with previous reports for mammalian proteins (Duret and Mouchiroud 2000). An r² of 0.029 indicates that approximately 3% of dN/dS variation can be explained by differences in expression broadness. For sequence diversity, π, weak but significant correlations were found to both expression level (r = –0.11, r² = 0.012) and expression broadness (r = –0.27, r² = 0.073). The results indicate that at the most, 7% of the variation in evolution rate can be explained by differences in expression characteristics.
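The link between the correlation coefficient and the fraction of variation explained can be checked with a short sketch. The helper `pearson_r` is a plain textbook Pearson correlation written for illustration; the text does not state which correlation measure or software was used.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Squaring the reported r values reproduces the reported r² values
# (to three decimals), i.e. the fraction of variation explained.
reported = {-0.17: 0.029, -0.11: 0.012, -0.27: 0.073}
explained = {r: round(r ** 2, 3) for r in reported}
print(explained)

# A perfectly anticorrelated toy example gives r = -1.
r_toy = pearson_r([1, 2, 3], [3, 2, 1])
```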

We see differences in expression in different subcellular compartments. Expression levels are higher on average in cytoplasmic proteins (1.1 × 10³–1.8 × 10³) as compared with nuclear (4.9 × 10²–6.2 × 10²) and extracellular proteins (5.0 × 10²–7.0 × 10²), and expression broadness is highest in nuclear proteins (0.40–0.44), followed by cytoplasmic (0.35–0.40) and extracellular proteins (0.21–0.26). To further prove that our results are independent of expression, we divided the proteins into three groups of approximately equal sizes depending on their expression levels. Within each category, the differences in means of dN/dS and π for intra- and extracellular proteins were statistically significant (table 3). The same result was obtained for division into three groups depending on the broadness of the tissue expression (table 3). Because the extracellular and cytosolic parts of a membrane protein have the same expression characteristics, the analysis was not performed for membrane proteins. We also performed the opposite experiments and divided the proteins into three categories depending on the subcellular localization (nuclear, cytoplasmic, and extracellular). Within each category, no statistically significant correlations were found between dN/dS and either expression level or expression broadness.
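The stratified comparison can be sketched as follows: proteins are sorted by expression level, split into three groups of roughly equal size, and the mean dN/dS is compared between locations within each group. The record layout, field names, and values are hypothetical, and the significance testing applied in the study is omitted here.

```python
def terciles(records, key):
    """Split records into three groups of approximately equal size by key."""
    ordered = sorted(records, key=key)
    n = len(ordered)
    cuts = [0, n // 3, 2 * n // 3, n]
    return [ordered[cuts[i]:cuts[i + 1]] for i in range(3)]

def mean_rate_by_location(group):
    """Mean dN/dS per location category within one expression group."""
    result = {}
    for location in ("intracellular", "extracellular"):
        rates = [r["dnds"] for r in group if r["location"] == location]
        if rates:
            result[location] = sum(rates) / len(rates)
    return result

# Hypothetical records: expression rises steadily, location alternates.
records = [
    {"expression": 100 * (i + 1),
     "location": "extracellular" if i % 2 else "intracellular",
     "dnds": 0.30 if i % 2 else 0.10}
    for i in range(12)
]
groups = terciles(records, key=lambda r: r["expression"])
means = [mean_rate_by_location(g) for g in groups]
print(means)
```

If the location effect were only a by-product of expression, the difference between locations would shrink or vanish within each expression group.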


 
Table 3 95% Confidence Intervals of the Mean for the Mammalian Data Sets Divided into Groups Depending on the Expression Level/Broadness

 
Impact of Number of Protein–Protein Interactions on Results
To see whether differences in protein connectivity for proteins in different subcellular compartments could be causing the differences in evolution rates between extracellular and intracellular (cytoplasmic and nuclear) proteins, we investigated the effects of protein connectivity on the rate of gene evolution in our data set. The number of protein–protein interactions for extracellular, cytoplasmic, and nuclear proteins was extracted from a database constructed in connection with another current project (Kasper Lage et al., unpublished data). The number of proteins for which protein–protein interaction data could be extracted is shown in table 1. The correlation between number of interaction partners and evolution rate was low but significant (r = –0.16, r² = 0.025) whether we use selective pressure, dN/dS, or sequence diversity, π, as a measure of evolution rate. The negative correlation shows that proteins with a high number of protein interaction partners (high connectivity in the network) have lower evolution rate, in agreement with previous reports (Fraser et al. 2002; Fraser and Hirsh 2004; Hahn and Kern 2005). An r² of 0.025 indicates that less than 3% of evolution rate variation can be explained by differences in the number of interaction partners.

We see differences in protein connectivity in different subcellular compartments. Cytoplasmic proteins have, on average, highest connectivity (9.4–16.8), followed by nuclear (8.4–10.8) and extracellular proteins (0.85–1.4). This somewhat contradicts previous findings that proteins involved in transcription and replication (typically nuclear) are among the proteins with the highest average number of interaction partners (Kunin et al. 2004). To further prove that our results are largely independent of number of interaction partners, we divided the proteins into three groups depending on their number of interaction partners. The categories (of approximately equal size) were as follows: 1) no interaction partners, 2) one interaction partner, and 3) more than one interaction partner. Within each category, the differences in means of dN/dS and π for intra- and extracellular proteins were statistically significant (table 3). We conclude that the differences in evolution rate between proteins from different subcellular compartments are not caused by differences in the number of protein–protein interactions.

Impact of Immunoproteins on Results
Some extracellular proteins are involved in immunological processes and possibly in recognizing pathogens and therefore could be under positive, adaptive selection (Hurst and Smith 1999; Castillo-Davis et al. 2004). In order to check to what extent immunological proteins involved in recognizing pathogens contribute to the observed differences in evolution rate between intracellular and extracellular proteins, we removed proteins with possible roles in the immune response from the analysis of extracellular proteins using Gene Ontology information in the Swiss-Prot entries. The remaining extracellular proteins showed slightly slower evolution than the complete extracellular data set, but the distributions of evolution rates between cytoplasmic and extracellular proteins excluding immunological proteins are still significantly different (table 2). We conclude that selective pressure caused by interaction with pathogens is not the main cause for the observed phenomenon.

Impact of Disulfide Bridges on Results
It has been proposed that the existence of disulfide bridges in an extracellular protein allows for a faster rate of evolution by stabilizing the structure (Hegyi and Bork 1997). To investigate whether this could explain the observed difference in evolution rate between intra- and extracellular proteins, we looked for pairs of conserved cysteines, indicating the possible presence of disulfide bridges, in the set of extracellular proteins. Among the 485 extracellular proteins in our amino acid analysis, 101 do not have a pair of conserved cysteines, making the presence of disulfide bonds highly unlikely (83 out of 427 for the DNA analysis). Contrary to the expectation, these disulfide-free proteins show no evidence of being more constrained evolutionarily. In fact, the proteins in this group are evolving slightly faster than the remaining extracellular proteins (figs. 1 and 2). Therefore, our data do not support the theory that the presence of disulfide bridges leads to faster protein evolution. The distributions of evolution rates between cytoplasmic and extracellular proteins excluding those with possible disulfide bonds are still significantly different (table 2). We conclude that the presence of disulfide bonds does not explain the observed differences in evolution rate between intra- and extracellular proteins.
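The screen for conserved cysteine pairs can be illustrated with a small sketch. The function names and toy alignments are hypothetical, and treating a cysteine as "conserved" only when every sequence in the alignment has a cysteine in that column is an assumption; the text does not give the exact criterion.

```python
def conserved_cysteine_columns(alignment):
    """Indices of alignment columns in which every sequence has a cysteine."""
    return [
        i for i, column in enumerate(zip(*alignment))
        if all(residue == "C" for residue in column)
    ]

def may_have_disulfide(alignment):
    """True if at least one pair of conserved cysteines exists.

    A disulfide bridge needs two cysteines, so fewer than two conserved
    cysteine columns makes a conserved bridge unlikely.
    """
    return len(conserved_cysteine_columns(alignment)) >= 2

# Toy alignments: the first keeps two cysteines in all sequences,
# the second loses one of them in one sequence.
with_pair = ["ACDCE", "ACDCE", "GCDCE"]
without_pair = ["ACDSE", "ACDCE"]
print(may_have_disulfide(with_pair), may_have_disulfide(without_pair))
```

This test only flags the possibility of a disulfide bridge; it says nothing about whether the two cysteines are actually bonded in the folded protein.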

Analysis of Yeast Data
Because yeast is the most studied organism when it comes to evolution rates and their correlation with other factors, we also investigated whether we could find any differences in evolution rates between intracellular and extracellular proteins in a yeast data set. Evolution rates as measured by dN/dS ratios (Wall et al. 2005) for 48 extracellular and cell wall proteins, 1,568 cytoplasmic proteins, and 1,216 nuclear proteins are shown as violin plots in figure 4A. We see no significant difference between nuclear and cytoplasmic proteins, but the mean dN/dS for the extracellular proteins is significantly higher (table 4). Expression levels (Holstege et al. 1998) for 108 extracellular and cell wall proteins, 2,713 cytoplasmic proteins, and 2,032 nuclear proteins are shown in figure 4B. The means of the expression levels are significantly different (table 4) in the following order: nuclear < cytoplasmic < extracellular proteins. It is known that there is a strong anticorrelation between expression levels and evolution rates in yeast (highly expressed genes evolve more slowly). The faster evolution of exported proteins could be explained by differences in expression levels only if the expression levels were lower for the exported proteins, which they are not. Therefore, the difference in evolution rates between extra- and intracellular proteins in yeast is not explained by expression levels.
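
Table 4 reports 95% confidence intervals of the mean. As a rough illustration of how such intervals can be computed and compared, the sketch below uses a normal approximation (mean ± 1.96 standard errors). The exact procedure behind table 4 is not stated in this section, so the approximation, the function names, and the toy values are all assumptions.

```python
import math
import statistics


def mean_ci95(values):
    """Normal-approximation 95% confidence interval of the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    half_width = 1.96 * sem
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical dN/dS values for two compartments (not data from the paper).
extracellular = [0.21, 0.25, 0.19, 0.30, 0.27]
cytoplasmic = [0.08, 0.11, 0.09, 0.12, 0.10]
print(intervals_overlap(mean_ci95(extracellular), mean_ci95(cytoplasmic)))
```

Non-overlapping intervals are a conservative indication that two means differ; overlapping intervals do not by themselves show that the means are equal.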


FIG. 4. Violin plots of gene evolution rates and expression levels for yeast proteins from different sides of the cellular membrane. (A) Distribution of dN/dS ratios (evolution rates) (Wall et al. 2005) and (B) distribution of expression levels (Holstege et al. 1998). (C) Distribution of gene dispensability (average growth rates of homozygous deletion strains) (Deutschbauer et al. 2005). A violin plot is a combination of a box plot and a density plot (Hintze and Nelson 1998). The white dot indicates the median and the black bar the first and third quartiles.

 


 
Table 4. 95% Confidence Intervals of the Mean for the Yeast Data Set

 
Gene dispensability as measured by average growth rates of the homozygous deletion strains (Deutschbauer et al. 2005) for 115 extracellular and cell wall proteins, 2,245 cytoplasmic proteins, and 1,427 nuclear proteins is shown in figure 4C. Few of the extracellular proteins give rise to a slower growing yeast strain when deleted, and the means of the growth rates are significantly different in the different compartments (table 4). Many genes in the data set used are classified as essential because they give rise to a strain with no detectable growth when deleted, and they are therefore excluded from the preceding analysis. The fraction of essential genes differs between the subcellular compartments (table 4). When comparing dN/dS in the different compartments for essential genes only, we find very similar distributions (data not shown). However, this part of the analysis can only be seen as preliminary because there are very few essential extracellular genes in yeast. We conclude that the differences in evolution rates between extra- and intracellular proteins in yeast may (in part) be caused by differences in gene dispensability.


    Discussion
 
Analysis of Diversity in Different Subcellular Compartments
We have shown that the evolution rate of extracellular proteins is faster than the evolution rate of intracellular proteins. Specifically, this was shown by analyzing diversity in alignments of proteins from various subcellular compartments. Diversity can be used as a simple measure of evolution rate because it will, all other things being equal, be higher in alignments of rapidly evolving proteins. The observed difference in diversity was not an artifact caused by uneven taxonomic coverage in the different data sets as shown by analyzing data where all alignments contained only the sequences from man, mouse, and rat. It was also not caused by immunoproteins or proteins with disulfide bonds being overrepresented in the extracellular data. Interestingly, different parts of transmembrane proteins displayed a similar trend, with the extracellular parts evolving more rapidly than the intracellular and transmembrane parts.
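
To make concrete why diversity tracks evolution rate, the sketch below computes one simple diversity measure for an alignment: the mean fraction of differing positions over all sequence pairs, ignoring gapped positions. The exact diversity measure used in the study is defined in Materials and Methods and may differ; this function and its name are illustrative assumptions.

```python
from itertools import combinations


def pairwise_difference(seq_a, seq_b):
    """Fraction of ungapped aligned positions at which two sequences differ."""
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip positions with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0


def alignment_diversity(alignment):
    """Mean pairwise difference over all pairs of aligned sequences."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_difference(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy alignments (not real data).
slow = ["MKLAV", "MKLAV", "MKLAI"]
fast = ["MKLAV", "MRLSV", "MQIAI"]
print(alignment_diversity(slow) < alignment_diversity(fast))
```

With the same set of species in every alignment, as in the man, mouse, and rat control described above, a higher value of such a measure reflects more accumulated substitutions over the same divergence time.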

Concerning the results, two details confirm previous knowledge and indicate that the approach is sound. 1) It has previously been observed that transmembrane regions of membrane proteins are highly conserved between species (Donnelly et al. 1993; Tourasse and Li 2000; Stevens and Arkin 2001). 2) Cysteine is the one amino acid residue that goes against the trend and is actually more highly conserved in the extracellular environment than in the intracellular environment. This can be explained by the fact that cysteines form structurally important disulfide bonds in extracellular proteins, but not in intracellular proteins due to the reducing environment inside a cell.

Analysis of Selective Pressure
The results of the diversity analysis were confirmed and elaborated upon by analyzing the selective pressure acting on genes encoding proteins from different subcellular compartments. Specifically, we estimated the dN/dS ratio (the ratio between the rate of nonsynonymous substitutions per nonsynonymous site and the rate of synonymous substitutions per synonymous site) for the genes encoding the previously examined proteins. We find that genes encoding extracellular proteins on average have a higher dN/dS ratio than genes encoding intracellular proteins. This means that amino acid–changing mutations are more likely to be accepted in extracellular proteins than in intracellular proteins and shows that the results of the amino acid–based analysis are caused by differences in selective pressure and not by differences in the mutation rate of genes from different compartments.
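
The definition of the dN/dS ratio given above can be written out as a small calculation. The sketch below uses raw counts of substitutions and sites without any correction for multiple substitutions at the same site, so it is only a simplified illustration of the definition and not the estimation procedure used in the study. The function name and the numbers are hypothetical.

```python
def dn_ds_ratio(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """dN/dS from raw counts: (Nd / N) / (Sd / S), with no multiple-hit correction."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    dn = nonsyn_subs / nonsyn_sites  # nonsynonymous substitutions per nonsynonymous site
    ds = syn_subs / syn_sites        # synonymous substitutions per synonymous site
    if ds == 0:
        raise ValueError("dS is zero, so the ratio is undefined")
    return dn / ds


# Hypothetical counts for one gene pair (not data from the paper).
ratio = dn_ds_ratio(nonsyn_subs=10, nonsyn_sites=700, syn_subs=30, syn_sites=300)
print(round(ratio, 3))
```

A ratio well below 1, as in this toy case, indicates that amino acid–changing substitutions are accepted less often than synonymous ones, that is, purifying selection. Because dS serves as the baseline, differences in mutation rate between genes largely cancel out of the ratio.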

Effects of Gene Expression, Gene Dispensability, and Protein Connectivity
We find that expression levels, expression broadness, gene dispensability (yeast), and protein connectivity (mammals) are all correlated with the evolution rate, with weak but significant correlations. We also find that expression levels, expression broadness, gene dispensability, and protein connectivity have different distributions in the different subcellular compartments. Protein connectivity, the number of essential genes, and gene expression broadness are all lower for extracellular proteins as compared with intracellular proteins, and the growth rate of deletion strains is higher. However, only expression broadness may explain the observed differences in evolution rates between cytoplasmic and nuclear proteins. Gene expression is broader for nuclear than for cytoplasmic proteins in the mammalian data set, which correlates with a lower evolution rate. Dividing the data into different subsets based on expression level, expression broadness, or protein connectivity showed that none of these is a major reason for the differences found (table 3). Unfortunately, the available yeast data did not permit us to do a similar analysis for gene dispensability. On the other hand, gene dispensability (the very same data we used) has been shown to have little impact on evolution rate in yeast (Drummond et al. 2005), and the nuclear yeast proteins are not evolving more slowly than the cytoplasmic ones, something we would expect from the gene dispensability distributions.
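
The subset analysis mentioned above (table 3) can be sketched as a stratified comparison: sort proteins by a covariate such as expression level, split them into bins, and compare extra- and intracellular evolution rates within each bin. The bin count, record layout, and function name below are assumptions made for illustration; they are not the procedure behind table 3.

```python
from statistics import fmean


def stratified_mean_difference(records, n_bins=3):
    """Per-bin difference in mean rate (extra minus intra).

    records: list of (covariate, rate, group) tuples, group is "extra" or "intra".
    Returns one value per bin, or None where a bin lacks one of the groups.
    """
    ordered = sorted(records, key=lambda r: r[0])
    size = len(ordered)
    results = []
    for b in range(n_bins):
        chunk = ordered[b * size // n_bins:(b + 1) * size // n_bins]
        extra = [rate for _, rate, group in chunk if group == "extra"]
        intra = [rate for _, rate, group in chunk if group == "intra"]
        if extra and intra:
            results.append(fmean(extra) - fmean(intra))
        else:
            results.append(None)
    return results


# Hypothetical (expression level, evolution rate, compartment) records.
records = [
    (1, 0.3, "extra"), (2, 0.1, "intra"),
    (3, 0.4, "extra"), (4, 0.2, "intra"),
    (5, 0.5, "extra"), (6, 0.3, "intra"),
]
print(stratified_mean_difference(records))
```

If the difference between compartments keeps the same sign and a similar size in every bin, the covariate is unlikely to be the main reason for the overall difference.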

It is also possible that the effects are additive, so that a combination of several of the mentioned factors is causing the differences. However, we do not believe gene expression to be an important cause, for the following two reasons. 1) The yeast data show contradicting distributions. 2) The correlations between expression and evolution rate are higher when measured by sequence diversity than when measured by selective pressure. (This is probably due to the fact that preferred codons are more important in highly/broadly expressed genes, giving rise to an overall lower mutation rate.) If differences in expression were a contributing cause of the differences in evolution rate in different compartments, we would see the same difference for the two evolution rate measures, but we do not. As a comparison, the correlation between protein connectivity and evolution rate is also independent of the evolution rate measure.

Other Causes for Different Evolution Rates
Given the observed differences in selective pressure, the differences in evolution rate seem to be caused by extracellular proteins being less constrained than intracellular ones, possibly with an added component of positive selection for some extracellular proteins. In this context, it seems conceivable that intracellular proteins could be relatively constrained because of the complexity of the cellular chemistry. Moreover, intracellular pathways have probably been relatively stable during the evolution of the mammal species investigated here, whereas the systems for intercellular communication and organization are likely to have undergone considerable change in the same period of time. We have shown that the difference was not caused by the presence of immunoproteins or proteins containing disulfide bonds in the extracellular data set. However, extracellular proteins with no direct role in the immune response are also potential targets for infecting pathogens, and this could add an element of positive selection to any extracellular host protein. If there is an inverse relationship between evolution rate and age of a gene (Alba and Castresana 2005), this could be another reason for the differences between intra- and extracellular proteins because most extracellular proteins probably have evolved with multicellular organisms and therefore are younger on average (compare with the small number of exported proteins in yeast, which is a unicellular eukaryote).

Many extracellular proteins, as well as extracellular parts of membrane-bound proteins, consist of evolutionarily mobile sequence modules (Hegyi and Bork 1997). These mosaic proteins are generally believed to have evolved by exon shuffling (Kolkman and Stemmer 2001), and it is thought that this may have played a role in the evolution of multicellularity. Many modules in extracellular proteins contain disulfide bridges, which stabilize the fold and have therefore been suggested to allow for a faster mutation rate (Hegyi and Bork 1997). Although our data contradict this particular theory, it is possible that the modularity and the exon shuffling may have led to an increased rate of evolution in these proteins.

In recent years, the connection between protein folding/misfolding and evolution has been the focus of much interest (Dobson 1999, 2003; Depristo et al. 2005). A range of debilitating human diseases is associated with protein misfolding events, some of which are associated with protein aggregation resulting in insoluble agglomerates called amyloid plaques. Early-formed species from the aggregation process of otherwise non–disease-associated proteins have been shown to be cytotoxic (Bucciantini et al. 2002), indicating that there is an inherent toxicity to the aggregates themselves. In agreement with this observation, there is evidence that evolutionary selection has tended to avoid amino acid sequences, such as alternating polar and hydrophobic residues, that favor a β-sheet structure of the type seen in amyloid fibrils (Broome and Hecht 2000). Numerous safety mechanisms are in place to protect the organism from misfolded proteins. These are slightly different in nature for intra- and extracellular proteins. Intracellular proteins reside in a crowded environment where a misfolded protein can be refolded with chaperones or marked for degradation with ubiquitin, while exported proteins come in contact with chaperones mainly in the endoplasmic reticulum and Golgi. Quality control is rigorous in the secretory pathway (Hammond and Helenius 1995), but once outside the cell, the environment is less crowded and the concentration of proteases is low. How these different control mechanisms affect the evolution of intra- and extracellular proteins is hard to predict, but it is possible that a perturbation in protein stability can be more easily accommodated in an extracellular protein. It seems likely that amyloid or other protein aggregates would be more harmful inside the cell than outside. Contradicting this is the fact that the typical amyloid diseases, Alzheimer's disease and Creutzfeldt–Jakob disease, both occur extracellularly, but it is important to note that both diseases occur at an age where the genes have already been passed on to the next generation, and they are therefore less likely to have impacted evolution to any large extent (Dobson 2002).

The horizontal transfer of genes is an important evolutionary mechanism in bacterial genomes. A recent study shows that horizontally transferred genes are integrated at the periphery of the metabolic networks, whereas central parts remain evolutionarily stable (Pal et al. 2005). Our data confirm previous findings that proteins involved in a large number of different protein–protein contacts (central in the interactome) evolve at a lower rate than those that have fewer interaction partners (peripheral in the interactome), which is another example of genes that are central to a system on a molecular level undergoing slower evolution than genes that are peripherally involved in the system. An example of this on an organ/tissue level is the finding that members of the acetylcholine receptor family involved in the central nervous system show slower evolution than the members involved in the peripheral nervous system (Miyata et al. 1994). On the level of the organism, neural- or brain-specific genes display lower evolution rates than the other members of the same gene families (Miyata et al. 1994), and the brain may be regarded as very central to the system of a mammalian organism. Correspondingly, on the cell level, proteins that are exported may be regarded as peripheral, whereas intracellular proteins may be regarded as central. We speculate that this may be a general rule that will be true on many levels.

We have shown that a major determinant of how rapidly a protein evolves is its subcellular location, with extracellular proteins evolving significantly faster than intracellular proteins. This is an important finding, especially because so many other plausible determinants have been difficult to prove relevant.


    Acknowledgements
 
We thank Jens Lagergren for guidance concerning the permutation test. This work was supported by The Danish National Research Foundation, the Danish Center for Scientific Computing, and Knut and Alice Wallenbergs Foundation.


    Footnotes
 
Douglas Crawford, Associate Editor


    References
 

    Alba MM and Castresana J. (2005) Inverse relationship between evolution rate and age of mammalian genes. Mol Biol Evol 22:598–606.

    Aris-Brosou S. (2005) Determinants of adaptive evolution at the molecular level: the extended complexity hypothesis. Mol Biol Evol 22:200–9.

    Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. (2004) GenBank: update. Nucleic Acids Res 32:D23–6.

    Bloom JD and Adami C. (2003) Apparent dependence of protein evolution rate on number of interactions is linked to biases in protein-protein interactions data sets. BMC Evol Biol 3:21.

    Boeckmann B, Bairoch A, Apweiler R, et al. (12 co-authors). (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31:365–70.

    Broome BM and Hecht MH. (2000) Nature disfavors sequences of alternating polar and non-polar amino acids: implications for amyloidogenesis. J Mol Biol 296:961–8.

    Bucciantini M, Giannoni E, Chiti F, Baroni F, Formigli L, Zurdo J, Taddei N, Ramponi G, Dobson CM, Stefani M. (2002) Inherent toxicity of aggregates implies a common mechanism for protein misfolding diseases. Nature 416:507–11.

    Castillo-Davis CI, Kondrashov FA, Hartl DL, Kulathinal RJ. (2004) The functional genomic distribution of protein divergence in two animal phyla: coevolution, genomic conflict, and constraint. Genome Res 14:802–11.

    Depristo MA, Weinreich DM, Hartl DL. (2005) Missense meanderings in sequence space: a biophysical view of protein evolution. Nat Rev Genet 6:678–87.

    Deutschbauer AM, Jaramillo DF, Proctor M, Kumm J, Hillenmeyer ME, Davis RW, Nislow C, Giaever G. (2005) Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast. Genetics 169:1915–25.

    Dobson CM. (1999) Protein misfolding, evolution and disease. Trends Biochem Sci 24:329–32.

    Dobson CM. (2002) Getting out of shape. Nature 418:729–30.

    Dobson CM. (2003) Protein folding and misfolding. Nature 426:884–90.

    Donnelly D, Overington JP, Ruffle SV, Nugent JH, Blundell TL. (1993) Modeling alpha-helical transmembrane domains: the calculation and use of substitution tables for lipid-facing residues. Protein Sci 2:55–70.

    Drummond DA, Raval A, Wilke CO. (2005) A single determinant dominates the rate of yeast protein evolution. Mol Biol Evol 23:327–37.

    Duret L and Mouchiroud D. (2000) Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol Biol Evol 17:68–74.

    Elhaik E, Sabath N, Graur D. (2006) The "inverse relationship between evolution rate and age of Mammalian genes" is an artifact of increased genetic distance with rate of evolution and time of divergence. Mol Biol Evol 23:1–3.

    Fraser HB and Hirsh AE. (2004) Evolution rate depends on number of protein-protein interactions independently of gene expression level. BMC Evol Biol 4:13.

    Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW. (2002) Evolution rate in the protein interaction network. Science 296:750–2.

    Graur D. (1985) Amino acid composition and the evolution rates of protein-coding genes. J Mol Evol 22:53–62.

    Guldener U, Munsterkotter M, Kastenmuller G, et al. (20 co-authors). (2005) CYGD: the Comprehensive Yeast Genome Database. Nucleic Acids Res 33:D364–8.

    Hahn MW and Kern AD. (2005) Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol Biol Evol 22:803–6.

    Hammond C and Helenius A. (1995) Quality control in the secretory pathway. Curr Opin Cell Biol 7:523–9.

    Hegyi H and Bork P. (1997) On the classification and evolution of protein modules. J Protein Chem 16:545–51.

    Hintze JL and Nelson RD. (1998) Violin plots: a box plot-density trace synergism. Am Stat 52:181–4.

    Hirsh AE and Fraser HB. (2001) Protein dispensability and rate of evolution. Nature 411:1046–9.

    Holstege FC, Jennings EG, Wyrick JJ, Lee TI, Hengartner CJ, Green MR, Golub TR, Lander ES, Young RA. (1998) Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95:717–28.

    Hurst LD and Smith NG. (1999) Do essential genes evolve slowly? Curr Biol 9:747–50.

    Jordan IK, Marino-Ramirez L, Wolf YI, Koonin EV. (2004) Conservation and coevolution in the scale-free human gene coexpression network. Mol Biol Evol 21:2058–70.

    Jordan IK, Rogozin IB, Wolf YI, Koonin EV. (2002) Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res 12:962–8.

    Jordan IK, Wolf YI, Koonin EV. (2003) No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol Biol 3:1.

    Julenius K, Molgaard A, Gupta R, Brunak S. (2005) Prediction, conservation analysis, and structural characterization of mammalian mucin-type O-glycosylation sites. Glycobiology 15:153–64.

    Kimura M. (1979) The neutral theory of molecular evolution. Sci Am 241:98–100, 102, 108 passim.

    Kolkman JA and Stemmer WP. (2001) Directed evolution of proteins by exon shuffling. Nat Biotechnol 19:423–8.

    Kunin V, Pereira-Leal JB, Ouzounis CA. (2004) Functional evolution of the yeast protein interaction network. Mol Biol Evol 21:1171–6.

    Li WH and Graur D. (1991) Fundamentals of molecular evolution. (Sinauer Associates, Sunderland, MA).

    Miyata T, Kuma K, Iwabe N, Nikoh N. (1994) A possible link between molecular evolution and tissue evolution demonstrated by tissue specific genes. Jpn J Genet 69:473–80.

    Nei M and Li WH. (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc Natl Acad Sci USA 76:5269–73.

    Pal C, Papp B, Hurst LD. (2001) Highly expressed genes in yeast evolve slowly. Genetics 158:927–31.

    Pal C, Papp B, Hurst LD. (2003) Genomic function: rate of evolution and gene dispensability. Nature 421:496–7, discussion 497–8.

    Pal C, Papp B, Lercher MJ. (2005) Adaptive evolution of bacterial metabolic networks by horizontal gene transfer. Nat Genet 37:1372–5.

    Rocha EP and Danchin A. (2004) An analysis of determinants of amino acids substitution rates in bacterial proteins. Mol Biol Evol 21:108–16.

    Stevens TJ and Arkin IT. (2001) Substitution rates in alpha-helical transmembrane proteins. Protein Sci 10:2507–17.

    Su AI, Cooke MP, Ching KA, et al. (14 co-authors). (2002) Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci USA 99:4465–70.

    Thompson JD, Higgins DG, Gibson TJ. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–80.

    Tourasse NJ and Li WH. (2000) Selective constraints, amino acid composition, and the rate of protein evolution. Mol Biol Evol 17:656–64.

    Wall DP, Hirsh AE, Fraser HB, Kumm J, Giaever G, Eisen MB, Feldman MW. (2005) Functional genomic analysis of the rates of protein evolution. Proc Natl Acad Sci USA 102:5483–8.

    Wernersson R and Pedersen AG. (2003) RevTrans: multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res 31:3537–9.

    Winter EE, Goodstadt L, Ponting CP. (2004) Elevated rates of protein secretion, evolution, and disease among tissue-specific genes. Genome Res 14:54–61.

    Yang J, Gu Z, Li WH. (2003) Rate of protein evolution versus fitness effect of gene deletion. Mol Biol Evol 20:772–4.

    Yang Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–6.

    Yang Z. (2002) Inference of selection from multiple species alignments. Curr Opin Genet Dev 12:688–94.

    Zhang L and Li WH. (2004) Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol Biol Evol 21:236–9.

Accepted for publication July 24, 2006.



