MBE Advance Access originally published online on August 24, 2005
Molecular Biology and Evolution 2006 23(1):30-39; doi:10.1093/molbev/msi249
© The Author 2005. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oupjournals.org

Research Article

Protein Function, Connectivity, and Duplicability in Yeast

Anuphap Prachumwat* and Wen-Hsiung Li†

* Committee on Genetics, University of Chicago; and † Department of Ecology and Evolution, University of Chicago

E-mail: whli@uchicago.edu.


    Abstract
Protein-protein interaction networks have evolved mainly through connectivity rewiring and gene duplication. However, how protein function influences these processes and how a network grows in time have not been well studied. Using protein-protein interaction data and genomic data from the budding yeast, we first examined whether there is a correlation between the age and connectivity of yeast proteins. A steady increase in connectivity with protein age is observed for yeast proteins except for those that can be traced back to Eubacteria. Second, we investigated whether protein connectivity and duplicability vary with gene function. We found a higher average duplicability for proteins interacting with external environments than for proteins localized within intracellular compartments. For example, proteins that function in the cell periphery (mainly transporters) show a high duplicability but are lowly connected. Conversely, proteins that function within the nucleus (e.g., transcription, RNA and DNA metabolisms, and ribosome biogenesis and assembly) are highly connected but have a low duplicability. Finally, we found a negative correlation between protein connectivity and duplicability.

Key Words: protein interaction network • protein connectivity • gene duplicability • network evolution • protein localization


    Introduction
Biological processes, which contribute to the phenotypes of living cells, are wired by interaction networks of various cellular components such as proteins, DNA, RNA, and metabolites. Such network data, especially protein-protein interactions in the budding yeast (Saccharomyces cerevisiae), can now be generated in a high-throughput manner, allowing large-scale analyses. We are interested in the yeast protein interaction network, which, like many nonbiological networks, is organized into a small-world and scale-free topology (Barabasi and Oltvai 2004). A small world has a high probability that any two neighbors of a node are connected with each other, while a scale-free topology shows a power-law distribution of node connectivities (for a review, see Barabasi and Oltvai 2004) and contributes to a high tolerance to disturbance (Albert and Barabasi 2000).

Barabasi and Albert (1999) proposed that growth of a network with a preferential attachment behavior is sufficient to explain the emergence of a scale-free network topology. This model requires that a new node preferentially connects to a well-connected node, predicting that old nodes should tend to have a higher connectivity than young ones. This prediction, however, was not supported by a recent analysis of the yeast protein network by Kunin, Pereira-Leal, and Ouzounis (2004), who therefore suggested that to understand the scale-free topology of the protein network, protein function should also be taken into account.

In this study, we use a larger data set or a higher-quality data set than that of Kunin, Pereira-Leal, and Ouzounis (2004) to re-examine the prediction of the preferential attachment model by checking whether a correlation exists between the age and connectivity of yeast proteins. We also investigate whether protein connectivity and gene duplicability vary with gene function. Because yeast, which is a single-cell organism, inhabits a wide range of environmental niches, genetic diversity for proteins that are exposed to or interact with extracellular environments may confer benefits to the organism. As duplication may increase such diversity (or produce a new adaptive function, e.g., Francino 2005), we hypothesize a higher duplicability for proteins exposed to extracellular environments than for those localized to intracellular compartments. Moreover, because gene duplication plays a major role in network growth (e.g., Barabasi and Albert 1999; Pastor-Satorras, Smith, and Sole 2003) and, conversely, connectivity may affect gene duplicability, we investigate whether a relationship exists between protein connectivity and duplicability.


    Materials and Methods
Protein-Protein Interaction Data
Protein-protein interaction pairs are collected from various high-throughput experiments (Fromont-Racine et al. 2000; Newman, Wolf, and Kim 2000; Uetz et al. 2000; Dress et al. 2001; Ito et al. 2001; Gavin et al. 2002; Ho et al. 2002; Tong et al. 2002) and databases (Munich Information Center for Protein Sequences, Database of Interacting Proteins, Biomolecular Interaction Network Database, and Yeast Protein Database). This collection (denoted by ALL_K) includes 5,015 proteins and 16,747 interactions. Because high-throughput interaction data come with high false-positive rates, we also use a set of high-confidence data (denoted by BaderSTD) from Bader et al. (2004) that comprises 2,759 proteins and 5,785 interactions. Further, "true interactions" inferred from many small-scale experiments are also considered (denoted by SSE). Given that SSE is a small data set, we combine it with BaderSTD to obtain a larger high-confidence data set (denoted by SSBader). Descriptive statistics of these data are shown in table 1. The connectivity (denoted by k) of a protein in a network of interest is defined as the number of interactions of the protein with other proteins in that network. In addition to using the mean and median of k as measures of connectivity for the proteins in a category of interest, we also use the proportion of hubs in the category. We define a hub as a protein with k ≥ a, where a is 5 or 7 (the two cutoff points give similar results). We show the results of analyses on SSBader and ALL_K but not the results on the other data sets because they are essentially the same.
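
The definitions of k and of a hub given above can be illustrated with a minimal Python sketch. The function names, the toy interaction list, and the decision to skip self-interactions and repeated pairs are assumptions made for illustration only; they are not taken from the data sets used in this study.

```python
from collections import defaultdict

def connectivity(interactions):
    """Return {protein: k}, where k is the number of distinct partners."""
    partners = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # assumption: self-interactions are not counted
        partners[a].add(b)  # sets ignore repeated reports of the same pair
        partners[b].add(a)
    return {protein: len(p) for protein, p in partners.items()}

def hub_proportion(k_by_protein, category, cutoff=5):
    """Fraction of proteins in `category` with k >= cutoff (5 or 7 in the text)."""
    ks = [k_by_protein[p] for p in category if p in k_by_protein]
    if not ks:
        return 0.0
    return sum(k >= cutoff for k in ks) / len(ks)

# Hypothetical toy network; protein names are placeholders.
toy = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
       ("A", "F"), ("B", "C"), ("B", "A")]
k = connectivity(toy)
```

In this toy network, protein A has five distinct partners and so qualifies as a hub under the cutoff a = 5, while B and C each have two partners.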


Table 1 Descriptive Statistics of the Protein-Protein Interaction Data Sets Used in This Study

 
Classification of Proteins into Age Groups
For each yeast protein, we identified homologous proteins from other genomes that have been sequenced. These homologous groups of yeast proteins were obtained from KOG and COG (Tatusov et al. 2003), Inparanoid (O'Brien, Remm, and Sonnhammer 2005), Génolevures (Dujon et al. 2004), Kellis et al. (2003), Cliften et al. (2003), and Kunin, Pereira-Leal, and Ouzounis (2004). Although yeast proteins can be assigned into 10 age categories (groups) by their shared ancestral origins (10 lineages) from these orthologous groups (fig. 1), this categorization gives a small number of proteins for some categories. For statistical purposes, we classify yeast proteins into five age categories (denoted by I–V; fig. 1 and table 2); we exclude the 380 spurious open reading frames (ORFs) defined by both Kellis et al. (2003) and Ghaemmaghami et al. (2003).
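
One way to read this classification is that a protein is assigned to the oldest lineage in which an ortholog is detected. The sketch below illustrates that reading only. The group labels follow the names used in the Results, but the function, the input format, and the collapsing of the 10 lineages into these labels are hypothetical simplifications, not the authors' pipeline.

```python
# Age groups ordered from oldest to youngest (labels taken from the text).
GROUPS_OLD_TO_YOUNG = [
    "Eubacteria",
    "Archaea",
    "Plasmodium-Plants-Animals",
    "Microspora-S. pombe-Saccharomyces complex",
]
# Youngest group: orthologs only within Saccharomyces sensu stricto, or none.
YOUNGEST = "Saccharomyces sensu stricto or no ortholog"

def age_group(groups_with_ortholog):
    """Return the oldest group in which the protein has a detected ortholog."""
    present = set(groups_with_ortholog)
    for group in GROUPS_OLD_TO_YOUNG:
        if group in present:
            return group
    return YOUNGEST
```

For example, a protein with orthologs in Archaea and in animals would be placed in the Archaea group, and a protein with no detected ortholog would fall in the youngest group.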



FIG. 1.— The evolutionary path leading to the yeast (Saccharomyces cerevisiae) is shown in thick branches on this species tree. Yeast protein age is inferred by the presence of an ortholog in other species. The oldest age group includes yeast proteins that can be traced back to eubacterial genomes, while the youngest one includes proteins with orthologs only within the Saccharomyces sensu-stricto species or without any ortholog in the other genomes. Horizontal dashed lines represent the age groups and are numbered by I, II, III, IV, and V. The tree is not drawn according to scale.

 

Table 2 Descriptive Statistics of Each Age Group for the Number of Proteins (t) and Median and Mean k Values

 
Identification of Duplicate and Singleton Genes
The whole set of S. cerevisiae protein sequences was downloaded from SGD (http://www.yeastgenome.org/). Duplicate genes were identified as described in Gu et al. (2003) (E < 10^-10). A singleton was defined as a gene with only one copy in the genome.

Protein Subcellular Localization and Biological Process
The protein localization profile for S. cerevisiae grown in synthetic medium (downloaded from http://yeastgfp.ucsf.edu; Huh et al. 2003) is combined with the subcellular localization defined by the Gene Ontology (GO) classification (downloaded from SGD on April 5, 2005). Mislocalization of some proteins from Huh et al. (2003) is corrected according to the authors' supplementary data. The GO subcellular localization categories are translated to the subcellular localization categories of Huh et al. (2003) because GO subcellular localizations are at a deeper level than those from Huh et al. (2003) (e.g., GO distinguishes between the membrane and lumen of the mitochondrion, while Huh et al. [2003] do not). The GO extracellular category, which contains a small number of proteins, is merged into the cell periphery category. A protein is associated with more than one localization category if it is found in multiple localizations (e.g., shuttle and transport proteins). Biological processes of each ORF are assigned according to GO Slim, which classifies proteins so as to give a high-level view of their functions (downloaded from SGD on April 5, 2005).

Measures of Gene Duplicability
Similar to Marland et al. (2004), for each category (i.e., a subcellular localization category or a biological process) under study, the number of unique types of genes is defined as the number of singletons plus the number of duplicated gene types in that category. The number of duplications per gene (n) is the total number of genes divided by the total number of unique types of genes. The proportion of unduplicated genes (P) is the proportion of singletons in the total number of unique types of genes. While n roughly indicates the average number of paralogs per gene in the category, 1 – P denotes the proportion of gene types that have been duplicated. Both n and 1 – P can be used as measures of gene duplicability (Yang, Lusk, and Li 2003). In addition, we also consider the proportion of duplicate genes in each category (Q). Q and n are less desirable than P because they can be strongly affected by the presence of large gene families.
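
The three measures can be written out as a short Python sketch. It assumes that each gene carries a genome-wide family identifier (a singleton being a family of size one) and that a gene counts as a duplicate whenever its family has more than one member anywhere in the genome; the function name, input format, and toy data are illustrative assumptions rather than the procedure of Gu et al. (2003).

```python
from collections import Counter

def duplicability(category_genes, family_of):
    """Compute n, P, and Q for one category.

    category_genes: genes assigned to the category.
    family_of: genome-wide {gene: family_id}; a singleton has its own family.
    """
    genome_sizes = Counter(family_of.values())      # family size in the genome
    genes = list(category_genes)
    families = {family_of[g] for g in genes}        # unique types of genes
    singletons = sum(1 for f in families if genome_sizes[f] == 1)
    duplicates = sum(1 for g in genes if genome_sizes[family_of[g]] > 1)
    return {
        "n": len(genes) / len(families),   # total genes / unique types
        "P": singletons / len(families),   # proportion of unduplicated types
        "Q": duplicates / len(genes),      # proportion of duplicate genes
    }

# Hypothetical toy genome: F1 has three members, F4 two, F2 and F3 one each.
family_of = {"g1": "F1", "g2": "F1", "g3": "F1", "g4": "F2",
             "g5": "F3", "g6": "F4", "g7": "F4"}
result = duplicability(["g1", "g2", "g3", "g4", "g5", "g6"], family_of)
```

In the toy category, the three-member family F1 raises both n and Q, whereas P depends only on how many gene types are singletons, which is why P is less sensitive to large gene families.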

Our statistical analyses are conducted in R (version 2.0.1, http://www.r-project.org/). The statistical tests used are Fisher's exact test and the Mann-Whitney test (also called the Wilcoxon rank sum two-sample test). In contrast to the parametric two-sample t test, the Mann-Whitney test is a nonparametric method that replaces the protein connectivity data by ranks, which reduces the influence of outliers. It is more appropriate than the t test because protein connectivities are not normally distributed.
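
To make the two tests concrete, the following Python sketch implements them with the standard library only. It is not the R code used in this study: the Mann-Whitney P value here comes from a normal approximation with a tie correction and no continuity correction, so it may differ from the output of R's wilcox.test, and all function names are illustrative.

```python
import math
from math import comb

def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided P from a normal approximation)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    total = n1 + n2
    rank_sum_x, tie_term = 0.0, 0.0
    i = 0
    while i < total:
        j = i
        while j < total and pooled[j][0] == pooled[i][0]:
            j += 1                              # pooled[i:j] share one value
        midrank = (i + 1 + j) / 2               # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((total + 1) - tie_term / (total * (total - 1)))
    if var == 0:
        return u, 1.0                           # all values tied
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return u, math.erfc(abs(z) / math.sqrt(2))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum over all tables at least as unlikely as the observed one.
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)
```

Because the rank-based statistic depends only on the ordering of the k values, a single protein with a very large k shifts U no more than any other protein ranked above the comparison group.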


    Results
Origins of Proteins and Their Connectivity
To determine whether the connectivity (k) correlates with the age of a protein, the mean and median k values for each age group are obtained. It appears that young proteins (e.g., those found in yeasts only) have a lower mean k than those in the older age groups (e.g., Archaea and Plasmodium-Plants-Animals) for both the complete data set (ALL_K) and the high-confidence data set (SSBader) (table 2 and fig. 2A and B). However, those proteins traceable to Eubacteria show a lower mean k and a slightly lower median k than those in the Archaea group (table 2 and fig. 2A and B). Further, the younger age groups have a lower proportion of hubs than the older age groups, except the Eubacteria group, which shows a lower proportion of hubs than the Archaea and the Plasmodium-Plants-Animals groups (fig. 2C).



FIG. 2.— The patterns of connectivity (k) for each age group in the ALL_K (A) and the SSBader (B) are represented by mean (black bars) and median (white bars) k. The P values from the Mann-Whitney test performed between the adjacent age groups are indicated under the graphs. The bar marked by * indicates a P value of 0.068. (C) The proportion of hubs (proteins with k ≥ 5) among proteins in the same age group also indicates a level of connectivity for each age group. Similar patterns are observed for both the ALL_K and the SSBader, but only the ALL_K is shown. The P values from Fisher's exact test performed between the adjacent age groups are indicated under the graph.

 
Performing the Mann-Whitney test on these data, we first ask whether two adjacent age groups have different connectivities. The test shows that the Eubacteria age group has a significantly lower k than Archaea in both data sets (P < 5 × 10^-8; fig. 2A and B). The Archaea age group has a significantly higher k than the Plasmodium-Plants-Animals group in ALL_K (P = 2 × 10^-4), though the difference is only marginally significant in SSBader (P = 0.068). Second, we pick an age group as a pivot group and perform two tests: (1) between this pivot group and the older proteins and (2) between the pivot group and the younger proteins. The tests reveal that the Eubacteria group does not show a different k from the rest of the proteins in the network. The other groups show a significantly different k from their older and/or younger counterparts (P << 0.006; data not shown). Clearly, the oldest proteins (the Eubacteria group) do not have the highest k in the protein network, and for this reason there is no positive correlation between connectivity and age across all groups. However, a significant correlation is seen when the Eubacteria group is excluded.

Protein Function and Connectivity
In the following analysis, we consider protein localization and perform the Mann-Whitney test on both data sets; although we show only the results for SSBader, a similar pattern is observed for ALL_K. Note that the mean k values for the proteins localized to nucleus and nucleolus are 6.85 and 8.81, respectively, which are significantly higher than the mean k (5.33) for the whole network (P < 5 × 10^-6, table 3). Some other localization categories such as cytoplasm, mitochondrion, cell periphery, and endoplasmic reticulum show a significantly lower k than the other proteins (P < 0.003, table 3).


Table 3 Descriptive Statistics for the Number of Proteins (t) and Mean and Median k in the SSBader Data Set When Categorized by Subcellular Localization and Biological Process

 
Similarly, when biological processes are considered, proteins involved in protein biosynthesis and catabolism, ribosome biogenesis and assembly, DNA and RNA metabolisms, and transcription show a significantly higher k than the proteins involved in other biological processes (mean and median k are greater than 5.33 and 3, respectively; P < 5 × 10^-6, table 3). Although proteins involved in lipid, carbohydrate, and amino acid metabolisms and cellular respiration show a significantly lower k than the average in SSBader (table 3), only lipid metabolism proteins show a significantly lower k in ALL_K; nonetheless, the proteins in the other three categories still have a low k (data not shown).

Protein Function Versus Connectivity Within the Same Age Group
It is interesting to ask whether within the same age group the function of a protein affects its connectivity. To answer this question, we categorize proteins by their localization or biological processes for each protein age group and perform the Mann-Whitney test between a functional group of interest and the rest within the same age group (only mean k values for functional categories are shown in Supplementary Fig. 1S, Supplementary Material online). Proteins localized to nucleus and nucleolus show a significantly higher k in the Eubacteria and Archaea age groups; proteins localized to nucleus also show a significantly higher k in the Plasmodium-Plants-Animals and Microspora–Schizosaccharomyces pombe–Saccharomyces complex groups (P < 7 × 10^-4). For biological processes, proteins involved in ribosome biogenesis and assembly, RNA metabolism, and protein catabolism are significantly more highly connected than proteins with other functions in the Eubacteria, Archaea, and Plasmodium-Plants-Animals groups. Although the younger age groups (IV and V) mostly do not show a significant difference in connectivity among biological process categories (probably because of small sample sizes), proteins involved in transcription show a significantly higher k than those in the other biological processes in the Microspora–S. pombe–Saccharomyces complex age group. Proteins involved in carbohydrate and amino acid and derivative metabolisms show a significantly lower k than other proteins in the Eubacteria group, while proteins involved in cell wall and membrane organization and biogenesis are lowly connected in the Microspora–S. pombe–Saccharomyces complex group.

Protein Function and Duplicability
We investigate the proportion of unduplicated genes (P) for each localization category. A low P value indicates a high duplicability. The P values are significantly lower in the cell periphery, bud, and vacuole categories but significantly higher in nucleus and nucleolus (P < 0.003, table 4); all tests in this section are Fisher's exact tests. The categories with a significantly lower P value have a higher proportion of duplicate genes (Q) than that of the whole genome and vice versa (P < 0.003, table 4). Q also indicates a duplicability significantly different from the average in cytoplasm (higher) and spindle pole (lower). The significantly high duplicability in cell periphery is also revealed by the number of duplications per gene (n = 1.44, compared with n = 1.21 for the whole-genome average). Conversely, the n values are relatively low (between 1.03 and 1.11) for mitochondrion, nucleus, nucleolus, and spindle pole.


Table 4 Duplication Patterns of Proteins Localized to 16 Subcellular Compartment Categories as Measured by the Proportion of Duplicates (Q) and the Proportion of Unduplicated Genes (P)

 
When biological processes are considered, we find that about one-quarter of yeast proteins are uncharacterized. Among the remaining proteins, duplicates in carbohydrate metabolism, generation of precursor metabolites and energy, protein biosynthesis and catabolism, transport, and response to stress are significantly overrepresented, whereas in DNA metabolism, RNA metabolism, transcription, and ribosome biogenesis and assembly, duplicates are significantly underrepresented (P < 0.002, table 5). Among all proteins annotated with their biological processes, those involved in transport, protein biosynthesis and catabolism, RNA metabolism, transcription, protein modification, and DNA metabolism are the most highly represented (between 7% and 17%). Relative to the whole-proteome average, these categories show either a high or a low number of duplicates (table 5). Generally speaking, low P values are supported by high Q values. Duplicates in the unknown biological process category, however, are significantly underrepresented (P < 0.002).


Table 5 Distribution of Duplicates for Each Biological Process That Is Defined According to GO Slim Classification

 
Protein Connectivity and Duplicability
Figure 3A shows that P is positively correlated with both mean and median k for biological processes (R^2 = 0.35 and 0.45 for mean and median k, respectively, P < 0.002). A similar pattern is also observed when we consider only significant categories from table 3 (R^2 = 0.66 and 0.79) or table 5 (R^2 = 0.74 and 0.83 for mean and median k, respectively, all P < 0.008). Moreover, this pattern is also found when the proportion of hubs is used as a measure of connectivity (R^2 = 0.43, P = 0.0001; fig. 3B). In addition, we observe essentially the same results when using protein localization categories and/or the Q values (data not shown). Furthermore, there are, on average, ~8% higher duplicabilities in the nonhub proteins than in the hub proteins (P = 79% and 88% and Q = 30% and 22% for the nonhubs and hubs, respectively, P < 1 × 10^-6). This pattern suggests that proteins with a lower connectivity have, on average, a higher gene duplicability.
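
For readers who wish to reproduce this kind of category-level comparison, the R^2 of a simple linear regression between a connectivity measure and P can be computed as in the Python sketch below. The function name is illustrative and the input values are made-up placeholders, not the values behind figure 3.

```python
def r_squared(xs, ys):
    """R^2 of a simple least-squares line, i.e., the squared Pearson correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Placeholder values: one (mean k, P) pair per hypothetical functional category.
mean_k = [1.0, 2.0, 3.0, 4.0]
prop_unduplicated = [1.0, 3.0, 2.0, 4.0]
fit = r_squared(mean_k, prop_unduplicated)
```

Each point in such a regression is a whole functional category rather than a single protein, so the R^2 describes the trend across categories and not the protein-level relationship.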



FIG. 3.— A positive correlation between connectivity (k) and proportion of unduplicated proteins (P) for the biological process classification. Because a similar trend is observed for median k, only (A) mean k and (B) the proportion of hubs (k ≥ 5) are shown. The trend lines are provided for only visualization.

 
A summary of the protein connectivity and gene duplicability of nuclear, cytoplasmic, and external and cell peripheral proteins is shown in table 6. In general, nuclear proteins are highly connected but show a low duplicability, while external and cell peripheral proteins show a high duplicability but are lowly connected. The connectivity and gene duplicability of cytoplasmic proteins are between those of the nuclear and the external and cell peripheral proteins.


Table 6 A Summary of Protein Connectivity (k) and Gene Duplicability (1–P) for Nuclear, Cytoplasmic, and External and Cell Peripheral Proteins Categorized by Functions

 

    Discussion
Our finding that proteins in the oldest group (the Eubacteria group) do not exhibit higher connectivities (k) than proteins in the Archaea and Plasmodium-Plants-Animals groups is similar to that of Kunin, Pereira-Leal, and Ouzounis (2004). However, the connectivities of the pre-Eukaryotes group (the union of the Eubacteria and Archaea) are, on average, only slightly lower than those of the Plasmodium-Plants-Animals group (i.e., the Crown-Eukaryotes in the study of Kunin, Pereira-Leal, and Ouzounis [2004]). Moreover, proteins in the Archaea age group show a significantly higher k than those in the Plasmodium-Plants-Animals age group (fig. 2). Thus, only the Eubacteria group contradicts the prediction of the preferential attachment model, and actually a positive correlation between age and k is seen when the Eubacteria group is excluded (table 2 and fig. 2).

The higher protein connectivity for the Archaea and Plasmodium-Plants-Animals age groups than for the Eubacteria group could be due to connection gains through new gene creation (e.g., gene duplication or gene fusion). Possibly, during the early evolution of eukaryotic cells whose nucleus evolved from Archaea, proteins for eukaryotic cell formation might have increased in number, and some became hubs for such functional modules (e.g., fig. 2C). Moreover, domain shuffling and length extension (which increase protein complexity) of proteins in the Archaea and Plasmodium-Plants-Animals groups could have added new connections for these proteins.

A constraint by gene function may influence protein network evolution (Kunin, Pereira-Leal, and Ouzounis 2004). To investigate this, we defined protein function by both localization and biological processes according to the GO annotation. Because localization partly determines the function of a protein, a combination of localization and biological process increases confidence in our function classification. Proteins involved in transcription, RNA metabolism, protein biosynthesis and catabolism, and ribosome biogenesis and assembly tend to be highly connected. Although the majority of our results are consistent with those reported by Kunin, Pereira-Leal, and Ouzounis (2004), translational proteins (e.g., protein biosynthesis and catabolism) are highly connected, contrary to their finding. In support of our observation, the majority of these proteins localized to nucleus and nucleolus are highly connected. On the other hand, proteins localized to cell periphery and vacuole are lowly connected (tables 3 and 6).

It appears that protein function affects connectivity across protein age groups (see "Protein Function Versus Connectivity Within the Same Age Group"). This pattern, however, may have resulted from the emergence time of these highly connected protein functions because proteins that emerged in the same evolutionary period tend to interact with one another (Qin et al. 2003), and proteins with similar functions are likely clustered (von Mering et al. 2002). We find that the emergence time of a protein contributes partly to the high k for only some gene functions. For example, transport and RNA metabolism categories have comparable numbers of proteins (and prevalently emerged) in the Eubacteria and Plasmodium-Plants-Animals age groups, but transport proteins are not highly connected (Supplementary Table 1S and Fig. 1SB, Supplementary Material online). Biological processes with proteins that largely emerged in the Eubacteria group (e.g., carbohydrate, amino acid and derivative metabolisms, and generation of precursor metabolites and energy) are also relatively lowly connected (Supplementary Table 1S and Fig. 1SB, Supplementary Material online). Likewise, proteins localized in cell periphery, cytoplasm, endoplasmic reticulum, nucleus, and nucleolus largely emerged in the Eubacteria and Plasmodium-Plants-Animals age groups, but only those localized in nucleus and nucleolus are coincidentally highly connected (Supplementary Table 1S and Fig. 1SA, Supplementary Material online). This finding supports the view of Kunin, Pereira-Leal, and Ouzounis (2004) that age alone is not sufficient to explain the observed connectivities of proteins and that protein function also needs to be considered. Importantly, the evidence that for almost all of the function categories proteins in the Eubacteria group show a lower k than those in the Archaea and Plasmodium-Plants-Animals groups (Supplementary Fig. 1S, Supplementary Material online) confirms our previous finding.

The observed patterns of gene duplication suggest that duplicate genes in the yeast are unequally represented in both subcellular localization and biological process categorizations (tables 4–6). A higher duplicability is observed for proteins localized to cell periphery, bud, vacuole, and cytoplasm and for proteins involved in transport, carbohydrate metabolisms, protein biosynthesis and catabolism, response to stress, and generation of precursor metabolites and energy, but not for proteins in other subcellular compartments or biological processes. Some functions such as transcription, DNA and RNA metabolisms, and ribosome biogenesis and assembly have a low duplicability. From these observations, we suggest that gene function is a major determinant of gene duplicability in S. cerevisiae.

Duplicate genes of some functions may not have a good chance to confer selective advantages, leading to a low gene duplicability. Proteins involved in transcription, DNA and RNA metabolisms, and ribosome biogenesis and assembly may face such a constraint. For example, duplication of a global transcription regulator likely affects many downstream genes, presumably being deleterious in the majority of cases and leading to a slim chance of duplicate survival. These functions (e.g., ribosome biogenesis and assembly) may also be constrained by the dosage balance of protein complexes (Papp, Pal, and Hurst 2003; Yang, Lusk, and Li 2003). However, other factors may affect gene duplicability because the proportion of transcription proteins is higher in multicellular organisms than in yeast (Babu et al. 2004). Moreover, the pattern that yeast's duplicate genes, especially those retained from the whole-genome duplication, tend to have a higher gene complexity (measured by protein length, number of domains, or number of cis-regulatory elements) than other genes leads to the conclusion that gene complexity may contribute to duplicate retention (He and Zhang 2005). However, analyzing protein length in our data set, we find that in approximately half of the functional categories duplicates are longer than singletons, and in a few of these cases the difference is statistically significant (data not shown).

Our results (table 6) support the hypothesis that a higher duplicability for proteins interacting with fluctuating external environments may confer benefits to the organism. For example, in yeast, nutrient capture through the cell periphery is the first stage of cell growth, and so the chance that duplication of a gene in this process is beneficial is high. A high duplicability for proteins localized to cell periphery is also seen in fruit fly, nematode, mouse, and humans (unpublished data), along with an increase in the total numbers of these proteins from yeast to nematode and fruit fly (Hazkani-Covo et al. 2004). Moreover, the majority of highly duplicated genes in bacterial or multicellular eukaryotic genomes encode various types of membrane or secreted proteins such as membrane transporters, receptors, and secreted signaling molecules (Kondrashov et al. 2002). Together, these results support a higher duplicability for proteins that interact with external environments.

Living in a habitat where nutrients are often scarce, yeasts inevitably compete among themselves or with other species for limited nutrients. Therefore, duplication of a transport protein may be advantageous because it increases the efficiency of nutrient uptake. Similarly, the substrate transport between subcellular compartments or even in or out of the cell is a basic requirement of eukaryotic cells. In addition to nutrient uptake, yeast transporters play diverse roles such as drug resistance, salt tolerance, control of cell volume, efflux of undesirable metabolites, and sensing of extracellular nutrients (Van Belle and Andre 2001). A high duplicability of transport proteins is also observed in bacterial genomes (Gevers et al. 2004). Therefore, duplication of such a protein may increase the chance of functional specialization or diversification.

Using transporter subfamilies characterized phylogenetically (De Hertogh et al. 2002), we find a unique set of transporters in the mitochondrion but a shared set between the cell periphery and the vacuole. In the cell periphery and the vacuole, three subfamilies are present in high numbers: the yeast amino acid transporters (YATs), the drug H+ antiporters (DHAs), and the sugar porters (SPs). In particular, the DHAs directly interact with, and protect the cell from, a number of extracellular compounds that are growth inhibitory or unusual in natural environments (Sá-Correia and Tenreiro 2002). Most DHAs are characterized as nonessential because of their functional redundancy and overlapping specificities (Rogers et al. 2001; Giaever et al. 2002). Furthermore, these genes are activated only by environmental stress factors. In general, DHAs and a large number of YATs and SPs are undetected under normal growth conditions. The SPs are usually involved in the first step of carbohydrate metabolism after di- and trisaccharides are hydrolyzed outside the cell. Therefore, the variety and efficiency of transporters directly affect the metabolic and growth rates of yeast. Furthermore, a high duplicability in yeast metabolism, especially in the central metabolism and in pathways upstream of the central metabolism, has been observed (Marland et al. 2004).

Although recent evidence for the prevalence of partial duplications of yeast protein complexes (i.e., a large fraction of protein complexes with a strong homology to others) lends support to functional specialization (Pereira-Leal and Teichmann 2005), how protein connectivity affects gene duplicability is unclear. The preferential attachment model also does not suggest any bias in the duplicability of a node type (hub vs. nonhub). Our results suggest that highly connected proteins (i.e., hubs) have a low duplicability (fig. 3 and table 6). Despite its high tolerance against random perturbation, the protein network depends mainly on its hubs for its integrity and is sensitive to targeted hub removal (Albert, Jeong, and Barabasi 2000). Indeed, lethality increases threefold if a hub is deleted (Jeong et al. 2001; Han et al. 2004). Along with these observations, a slow evolutionary rate (Fraser 2005) and highly conserved orthologs (Wuchty 2004; Fraser 2005) for hubs suggest a strong selection pressure on them. Duplication of a hub is likely deleterious because it affects a large number of proteins (i.e., a high pleiotropy), especially when its partners participate in different functions (an intermodule hub). However, the pleiotropy is likely reduced if such a hub is situated within a functional module (an intramodule hub). Recently, however, a greater constraint on intramodule than intermodule hubs was found (Fraser 2005). Below, we discuss this issue further.
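As a rough illustration of how duplicability can be contrasted between hubs and nonhubs, the sketch below counts interaction partners from an undirected edge list and reports the fraction of duplicated proteins in each class. It assumes that duplicability is measured as the proportion of proteins in a group that have at least one paralog. The hub cutoff (`hub_threshold`), the function names, and the toy network are hypothetical and do not reproduce the data or thresholds of this study.

```python
def degrees(edges):
    """Count distinct interaction partners per protein (undirected edges)."""
    partners = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return {protein: len(p) for protein, p in partners.items()}


def duplicability(proteins, duplicated):
    """Fraction of `proteins` found in the set `duplicated` (None if empty)."""
    proteins = list(proteins)
    if not proteins:
        return None
    return sum(p in duplicated for p in proteins) / len(proteins)


def hub_vs_nonhub(edges, duplicated, hub_threshold):
    """Return (hub duplicability, nonhub duplicability).

    A protein is treated as a hub if it has at least `hub_threshold`
    partners; the cutoff is an assumption chosen by the caller.
    """
    deg = degrees(edges)
    hubs = [p for p, k in deg.items() if k >= hub_threshold]
    nonhubs = [p for p, k in deg.items() if k < hub_threshold]
    return duplicability(hubs, duplicated), duplicability(nonhubs, duplicated)


# Hypothetical toy network and paralog annotation (NOT real yeast data).
edges = [
    ("H1", "A"), ("H1", "B"), ("H1", "C"), ("H1", "D"),
    ("H2", "A"), ("H2", "B"), ("H2", "E"),
    ("C", "D"), ("A", "A"),
]
duplicated = {"A", "C", "E", "H2"}

hub_dup, nonhub_dup = hub_vs_nonhub(edges, duplicated, hub_threshold=3)
print(hub_dup, nonhub_dup)
```

Because the result depends on where the hub cutoff is placed, a real analysis would normally repeat the comparison over several thresholds or over connectivity bins.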

A hub protein may be part of a large (stable) protein complex; in this case, a dosage increase caused by a single-gene duplication would likely affect the balance of complex formation (Veitia 2002). A larger proportion of the intramodule hubs (81%) than of the intermodule hubs (18%) are in a complex. Conversely, the majority of the intermodule hubs are mediators, regulators, or adapters (Han et al. 2004). These intermodule hubs globally integrate signals between functional modules and are likely to localize to various subcellular compartments. Duplication of an intermodule hub can destroy the network integrity and disrupt the information flow because of a subsequent interaction change or misexpression of a duplicate. Using a small data set characterized by Han et al. (2004), we find that the intermodule hubs show a slightly lower duplicability (12.6%) than the intramodule hubs (16.3%). This is contrary to Fraser's (2005) observation. Further research is needed to find out whether the duplicability of a hub is more constrained within or between functional modules. It is, however, clear that the chance that a duplicate of an intramodule or intermodule hub survives is usually lower than the average gene duplicability in the genome.
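Because the hub data set is small, a difference such as the one between the two hub classes above would need an exact test before it could be interpreted. The following is a minimal sketch of a two-sided Fisher's exact test on a 2 x 2 table of duplicated and nonduplicated hubs. It is not the test reported in this study; the function name is hypothetical, and the counts are invented for illustration because the underlying sample sizes are not given here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Rows are the two groups; columns are duplicated and nonduplicated
    counts. The P value sums the hypergeometric probabilities of all
    tables that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the first cell equals x with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so ties with the observed table are counted.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts (NOT the counts behind the percentages in the text):
# intermodule hubs: 10 duplicated, 70 not; intramodule hubs: 16 and 84.
p_value = fisher_exact_two_sided(10, 70, 16, 84)
print(p_value)
```

A large P value in such a test would indicate that the two hub classes cannot be distinguished in duplicability at the available sample size.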


    Supplementary Material
 
Supplementary Table 1S and Figure 1S are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
 
We thank V. Kunin for sending us data and R. Lusk and M. Chou for their help in the protein interaction data collection, Y.-W. Chang for her help in the gene function classification, and G. Morris, J. Yang, and Z. Gu for helpful discussions. We are grateful to two anonymous reviewers for their valuable comments. This study was supported by the International Balzan Foundation.


    Footnotes
 
Takashi Gojobori, Associate Editor


    References
 

    Albert, R., and A. L. Barabasi. 2000. Topology of evolving networks: local events and universality. Phys. Rev. Lett. 85:5234–5237.

    Albert, R., H. Jeong, and A. L. Barabasi. 2000. Error and attack tolerance of complex networks. Nature 406:378–382.

    Babu, M. M., N. M. Luscombe, L. Aravind, M. Gerstein, and S. A. Teichmann. 2004. Structure and evolution of transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14:283–291.

    Bader, J. S., A. Chaudhuri, J. M. Rothberg, and J. Chant. 2004. Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol. 22:78–85.

    Barabasi, A. L., and R. Albert. 1999. Emergence of scaling in random networks. Science 286:509–512.

    Barabasi, A. L., and Z. N. Oltvai. 2004. Network biology: understanding the cell's functional organization. Nat. Rev. Genet. 5:101–113.

    Cliften, P., P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston, B. A. Cohen, and M. Johnston. 2003. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301:71–76.

    De Hertogh, B., E. Carvajal, E. Talla, B. Dujon, P. Baret, and A. Goffeau. 2002. Phylogenetic classification of transporters and other membrane proteins from Saccharomyces cerevisiae. Funct. Integr. Genomics 2:154–170.

    Drees, B. L., B. Sundin, E. Brazeau et al. (22 co-authors). 2001. A protein interaction map for cell polarity development. J. Cell Biol. 154:549–571.

    Dujon, B., D. Sherman, G. Fischer et al. (19 co-authors). 2004. Genome evolution in yeasts. Nature 430:35–44.

    Francino, M. P. 2005. An adaptive radiation model for the origin of new gene functions. Nat. Genet. 37:573–577.

    Fraser, H. B. 2005. Modularity and evolutionary constraint on proteins. Nat. Genet. 37:351–352.

    Fromont-Racine, M., A. E. Mayes, A. Brunet-Simon et al. (11 co-authors). 2000. Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins. Yeast 17:95–110.

    Gavin, A. C., M. Bosche, R. Krause et al. (38 co-authors). 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415:141–147.

    Gevers, D., K. Vandepoele, C. Simillon, and Y. Van de Peer. 2004. Gene duplication and biased functional retention of paralogs in bacterial genomes. Trends Microbiol. 12:148–154.

    Ghaemmaghami, S., W. K. Huh, K. Bower, R. W. Howson, A. Belle, N. Dephoure, E. K. O'Shea, and J. S. Weissman. 2003. Global analysis of protein expression in yeast. Nature 425:737–741.

    Giaever, G., A. M. Chu, L. Ni et al. (74 co-authors). 2002. Functional profiling of the Saccharomyces cerevisiae genome. Nature 418:387–391.

    Gu, Z., L. M. Steinmetz, X. Gu, C. Scharfe, R. W. Davis, and W. H. Li. 2003. Role of duplicate genes in genetic robustness against null mutations. Nature 421:63–66.

    Han, J. D., N. Bertin, T. Hao et al. (11 co-authors). 2004. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430:88–93.

    Hazkani-Covo, E., E. Y. Levanon, G. Rotman, D. Graur, and A. Novik. 2004. Evolution of multicellularity in Metazoa: comparative analysis of the subcellular localization of proteins in Saccharomyces, Drosophila and Caenorhabditis. Cell Biol. Int. 28:171–178.

    He, X., and J. Zhang. 2005. Gene complexity and gene duplicability. Curr. Biol. 15:1016–1021.

    Ho, Y., A. Gruhler, A. Heilbut et al. (20 co-authors). 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183.

    Huh, W. K., J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman, and E. K. O'Shea. 2003. Global analysis of protein localization in budding yeast. Nature 425:686–691.

    Ito, T., T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA 98:4569–4574.

    Jeong, H., S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature 411:41–42.

    Kellis, M., N. Patterson, M. Endrizzi, B. Birren, and E. S. Lander. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241–254.

    Kondrashov, F. A., I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. 2002. Selection in the evolution of gene duplications. Genome Biol. 3:research0008.1–0008.9.

    Kunin, V., J. B. Pereira-Leal, and C. A. Ouzounis. 2004. Functional evolution of the yeast protein interaction network. Mol. Biol. Evol. 21:1171–1176.

    Marland, E., A. Prachumwat, N. Maltsev, Z. Gu, and W. H. Li. 2004. Higher gene duplicabilities for metabolic proteins than for nonmetabolic proteins in yeast and E. coli. J. Mol. Evol. 59:806–814.

    Newman, J. R., E. Wolf, and P. S. Kim. 2000. A computationally directed screen identifying interacting coiled coils from Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA 97:13203–13208.

    O'Brien, K. P., M. Remm, and E. L. Sonnhammer. 2005. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 33(Database Issue):D476–D480.

    Papp, B., C. Pal, and L. D. Hurst. 2003. Dosage sensitivity and the evolution of gene families in yeast. Nature 424:194–197.

    Pastor-Satorras, R., E. Smith, and R. V. Sole. 2003. Evolving protein interaction networks through gene duplication. J. Theor. Biol. 222:199–210.

    Pereira-Leal, J. B., and S. A. Teichmann. 2005. Novel specificities emerge by stepwise duplication of functional modules. Genome Res. 15:552–559.

    Qin, H., H. H. Lu, W. B. Wu, and W. H. Li. 2003. Evolution of the yeast protein interaction network. Proc. Natl. Acad. Sci. USA 100:12820–12824.

    Rogers, B., A. Decottignies, M. Kolaczkowski, E. Carvajal, E. Balzi, and A. Goffeau. 2001. The pleitropic drug ABC transporters from Saccharomyces cerevisiae. J. Mol. Microbiol. Biotechnol. 3:207–214.

    Sá-Correia, I., and S. Tenreiro. 2002. The multidrug resistance transporters of the major facilitator superfamily, 6 years after disclosure of Saccharomyces cerevisiae genome sequence. J. Biotechnol. 98:215–226.

    Tatusov, R. L., N. D. Fedorova, J. D. Jackson et al. (17 co-authors). 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41.

    Tong, A. H., B. Drees, G. Nardelli et al. (16 co-authors). 2002. A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 295:321–324.

    Uetz, P., L. Giot, G. Cagney et al. (20 co-authors). 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403:623–627.

    Van Belle, D., and B. Andre. 2001. A genomic view of yeast membrane transporters. Curr. Opin. Cell Biol. 13:389–398.

    Veitia, R. A. 2002. Exploring the etiology of haploinsufficiency. Bioessays 24:175–184.

    von Mering, C., R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork. 2002. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417:399–403.

    Wuchty, S. 2004. Evolution and topology in the yeast protein interaction network. Genome Res. 14:1310–1314.

    Yang, J., R. Lusk, and W. H. Li. 2003. Organismal complexity, protein complexity, and gene duplicability. Proc. Natl. Acad. Sci. USA 100:15661–15665.

Accepted for publication August 18, 2005.



