

MBE Advance Access originally published online on September 15, 2006
Molecular Biology and Evolution 2006 23(12):2467-2473; doi:10.1093/molbev/msl121

© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Preferential Duplication in the Sparse Part of Yeast Protein Interaction Network

Li Li, Yingwu Huang, Xuefeng Xia and Zhirong Sun

MOE Key Laboratory of Bioinformatics, State Key Laboratory of Biomembrane and Membrane Biotechnology, Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing, China

E-mail: sunzhr@mail.tsinghua.edu.cn


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Gene duplication is an important mechanism driving the evolution of biomolecular networks. Thus, a strong relationship is expected between a gene's duplicability and the interactions of its protein product with other proteins in the network. We studied this question in the context of the protein interaction network (PIN) of Saccharomyces cerevisiae. We found that duplicates have, on average, a significantly lower clustering coefficient (CC) than singletons, and that the proportion of duplicates (PD) decreases steadily with CC. Furthermore, using functional annotation data, we observed a strong negative correlation between PD and the mean CC for functional categories. By partitioning the network into modules and assigning each protein a modularity measure Qn, we found that the CC of a protein is a reflection of its modularity. Moreover, the core components of complexes identified in a recent high-throughput experiment, characterized by high CC, have a lower PD than the attachments. Subsequently, 2 types of hubs were identified by their degree, CC, and Qn. Although the PD of intramodular hubs is much lower than the network average, the PD of intermodular hubs is comparable to, or even higher than, the network average. Our results suggest that high CC, and thus high modularity, poses strong evolutionary constraints on gene duplicability, and that gene duplication occurs preferentially in the sparse part of PINs.

Key Words: gene duplicability • yeast • clustering coefficient • protein interaction network • modularity • network evolution


    Introduction
Gene duplication has long been thought to be a primary source of material for the origin of evolutionary novelties (Taylor and Raes 2004). With the rapid accumulation of genomic data, much attention has been focused on the genome-wide study of gene duplication. Gene duplication is prevalent in all 3 domains of life as an important mechanism for generating new genes and new functions. Although much progress has been made on some aspects of gene duplication, such as the extent of gene duplication in model organisms (Rubin et al. 2000; Gu et al. 2002) and the rates at which duplicates are generated (Lynch and Conery 2000; Gao and Innan 2004), many basic questions concerning gene duplication remain unclear, and some are even controversial. For example, what are the driving forces and major constraints in the generation and evolution of duplicates?

It is known that duplicates, produced by gene duplication, are the outcome of a complicated process with multiple stages, including generation, fixation in the population, preservation, and further divergence. Several factors have been reported to be related to gene duplicability. Among them are gene complexity (He and Zhang 2005), gene function (Conant and Wagner 2002; Marland et al. 2004; Prachumwat and Li 2006), gene essentiality (Gu et al. 2003; He and Zhang 2006), dosage effects (Papp et al. 2003), mRNA expression level (Davis and Petrov 2004, 2005), and alternative splicing (Kopelman et al. 2005; Su et al. 2006).

Biological processes are rarely performed by single isolated molecules. Instead, they typically involve the coordinated activity of many molecules forming a neighborhood in biomolecular networks, and the function of a protein must be understood in the context of its interactions with other proteins in the cell (Eisenberg et al. 2000). Many characteristics of proteins have been shown to be related to their topological features in biomolecular networks. Hence, it is reasonable to hypothesize that there is intensive interplay between the duplication of a gene and its environment in the protein interaction network (PIN). Although this question has been investigated in some recent studies (Wagner 2003; Hughes and Friedman 2005; Prachumwat and Li 2006), no consensus has been reached. There are 2 major difficulties in answering this question: the limited quality of protein interaction data on the one hand and the method of analysis on the other. In this study, we conducted detailed analyses of the possible influences of topological features on the propensity of a gene to duplicate in the PIN of Saccharomyces cerevisiae. Our results show that the mean CC of duplicates is significantly lower than that of singletons in the networks. Furthermore, we present a hypothesis to explain the different behaviors of CC and degree in influencing gene duplicability from the perspective of the modular organization of the network.


    Materials and Methods
Protein Interaction Networks
We used 4 data sets, including 1 combined data set, 2 data sets from small-scale experiments, and 1 data set from a high-throughput (HT) experiment, to study the relationship between gene duplicability and topological features separately. PIN information for S. cerevisiae was obtained from the Database of Interacting Proteins (DIP) (Salwinski et al. 2004). It covers results from both small-scale and HT experiments. For the HT pull-down assays, the spoke model (Bader and Hogue 2002) was used to extract pairwise interactions. The database downloaded from http://dip.doe-mbi.ucla.edu/ (release September 2005) contains 15,233 pairwise physical interactions between 4,761 proteins (denoted by DIP). HT data on protein interactions are known to have a relatively high false-positive rate, so we also used the manually curated data set at the Munich Information Center for Protein Sequences (MIPS) (Guldener et al. 2006). After excluding data from genome-scale experiments, we obtained 2,575 physical interactions for 1,630 proteins (denoted by MIPS_SS). The other data set from small-scale experiments is the "LC" data set in BioGRID (Stark et al. 2006). It contains information on 3,018 proteins, connected by 9,398 interactions (denoted by BioGRID_LC). The main advantage of this data set is that it is extracted and manually curated from more than 3,000 online publications. The HT data set we used is the high-throughput mass spectrometry–based protein interaction data set from Gavin et al. (2006). First, protein pairs were extracted from the purifications. Then, if both members of a pair are part of the same complex identified in the experiment, they are considered to interact physically. The resultant network contains 6,531 interactions between 1,430 proteins (denoted by Gavin06).

The degree (denoted by k) of a node (protein) in an interaction network is defined as the number of interactions of the node with other nodes in that network. For a node of degree k, its clustering coefficient (CC) is defined as 2N/[k(k - 1)], where N is the number of interactions among the node's k neighbors and k(k - 1)/2 is the number of possible interactions among those neighbors. CC thus measures the interconnectivity among the neighbors of a node. A CC of 1 means that all the neighbors of a node are fully interconnected. The sparse part of the network is characterized by low CC. The distribution of CC is a measure of how clustered a network is. Because CC is not defined for nodes with a degree of 1, these nodes are excluded from the correlation analysis of CC and duplicability.
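The definition above can be illustrated with a minimal Python sketch. The toy edge list and the function names (`build_adjacency`, `clustering_coefficient`) are hypothetical and are not taken from the study; the code only applies the formula CC = 2N/[k(k - 1)] and returns `None` where CC is undefined.

```python
from itertools import combinations

def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from pairs."""
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def clustering_coefficient(adjacency, node):
    """Return CC = 2N / (k(k - 1)) for `node`, or None when its degree is below 2."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return None  # CC is undefined for nodes with degree 0 or 1
    # N: number of interactions observed among the node's k neighbors
    n_links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2 * n_links / (k * (k - 1))

# Hypothetical toy network: A has neighbors B, C, D, and only B-C interact.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
adj = build_adjacency(toy_edges)
for protein in sorted(adj):
    print(protein, clustering_coefficient(adj, protein))
```

In this toy network, A has 1 of 3 possible interactions among its neighbors, so its CC is 1/3, whereas E has degree 1 and is therefore left out of any CC-based comparison.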

Identification of Duplicates and Singletons
Saccharomyces cerevisiae protein sequences were downloaded from the Saccharomyces Genome Database (SGD) (http://www.yeastgenome.org/). Duplicate genes were identified following the procedure of Gu et al. (2002). An all-against-all FASTA search was conducted for the whole set of S. cerevisiae protein sequences. Genes fulfilling the same criteria on alignable region and identity as those used by Gu et al. (2002) were identified as duplicate genes. We refer to all other genes as singletons, each of which has only 1 copy in the genome. In total, 1,815 proteins were identified as duplicates, each of which has at least 1 homolog. Among them, 1,350 are present in DIP. After identifying the homologous genes, we used the single-linkage algorithm to cluster genes into gene families.
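Single-linkage clustering of homologous pairs amounts to taking connected components of the homology graph. The following Python sketch shows one way to do this with a union-find structure. The gene names and the list of homologous pairs are hypothetical, and the FASTA search and the criteria of Gu et al. (2002) are assumed to have been applied beforehand.

```python
def single_linkage_families(genes, homolog_pairs):
    """Cluster genes into families: two genes share a family if a chain of
    homologous pairs links them (single linkage)."""
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent pointers to the family representative, halving paths.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in homolog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two families

    families = {}
    for g in genes:
        families.setdefault(find(g), set()).add(g)
    return list(families.values())

def split_duplicates_and_singletons(families):
    """Genes in families of size >= 2 are duplicates; the rest are singletons."""
    duplicates = set().union(*[f for f in families if len(f) > 1])
    singletons = set().union(*[f for f in families if len(f) == 1])
    return duplicates, singletons

# Hypothetical input: G1-G2 and G2-G3 pass the homology criteria; G4 has no hit.
genes = ["G1", "G2", "G3", "G4"]
pairs = [("G1", "G2"), ("G2", "G3")]
families = single_linkage_families(genes, pairs)
duplicates, singletons = split_duplicates_and_singletons(families)
print(sorted(duplicates), sorted(singletons))
```

Note that G1 and G3 end up in the same family even though they are not a direct hit of each other, which is the defining behavior of single linkage.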

Protein Function Annotations
To study the relation between gene function and the propensity of a gene to duplicate, we obtained annotations for S. cerevisiae from the MIPS Functional Catalogue (FunCat) database (funcat-2.0 data version 20062005) (Ruepp et al. 2004). FunCat consists of 28 main categories, of which 19 are available for S. cerevisiae. Each main functional branch is organized as a hierarchical, treelike structure. After excluding the category of "unclassified proteins" and the 2 categories with too few entries (fewer than 30), we proceeded with the analyses for the remaining 16 categories.

Fitness measurements were obtained from an HT study (Steinmetz et al. 2002) that measured the growth of each strain in a nearly complete collection of yeast single-gene–deletion mutants. Following Gu et al. (2003), we used the lowest fitness value across 5 growth conditions (YPD, YPDGE, YPE, YPG, and YPL) for each strain. Haploinsufficient and haplosufficient genes were identified following the procedure of He and Zhang (2006).

Statistics
Our statistical analyses and plotting were conducted using R (version 2.2.0, http://www.r-project.org/). The statistical tests used in the study are the χ² test and the 2-sample Wilcoxon rank sum test, which is also known as the "Mann–Whitney test." In contrast to the parametric 2-sample t-test, the Wilcoxon test is nonparametric. It is therefore more appropriate than the t-test here, because topological quantities generally are not normally distributed.
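The study ran its tests in R. As an illustration only, the following Python sketch implements a 1-tailed rank sum (Mann–Whitney) test using a normal approximation with tie correction and no continuity correction. These choices, the function names, and the sample values are assumptions made for the example and may differ from what R's implementation does for a given sample.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def rank_sum_less(x, y):
    """1-tailed test that values in x tend to be smaller than values in y.
    Returns (U, p), with p from a normal approximation (assumption: the
    samples are large enough for the approximation to be reasonable)."""
    pooled = list(x) + list(y)
    n1, n2, n = len(x), len(y), len(pooled)
    ranks = average_ranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # all values identical: no evidence of a shift
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z <= z)
    return u, p

# Made-up CC values, for illustration only.
duplicates_cc = [0.0, 0.0, 0.05, 0.1, 0.2]
singletons_cc = [0.1, 0.15, 0.3, 0.4, 0.5, 1.0]
u_stat, p_value = rank_sum_less(duplicates_cc, singletons_cc)
print(u_stat, p_value)
```

Because the test uses only ranks, heavy ties at CC = 0 and a long right tail do not violate any distributional assumption, which is the reason given above for preferring it to the t-test.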


    Results
Proportion of Duplicates and CC
To determine whether topological features relate to the duplicability of genes, we compared the mean degree and CC of duplicates with those of singletons in the PINs (table 1). There is no significant difference between the mean degree of duplicates and that of singletons, consistent with a previous observation (Wagner 2003). We observed, however, that duplicates have a significantly lower mean CC than singletons in all 4 data sets (table 1). The mean CC of singletons is 35–64% higher than that of duplicates. For example, in DIP, the mean CC of singletons is 0.139, compared with 0.084 for duplicates (P < 10⁻⁷, 1-tailed Wilcoxon rank sum test). For 2 reasons, we show the results of further analyses on DIP only. First, DIP covers more of the proteome than the other data sets. Second, results on all the data sets are qualitatively similar. By grouping the proteins into 5 bins according to their CC, we found a monotonic relationship between mean CC and the proportion of duplicates (PD) (fig. 1). PD in the whole network is 0.28. As mean CC increases from 0 to 1, PD steadily decreases from 0.31 to 0.12. Compared with singletons, duplicates are more prevalent among proteins with smaller CC and less prevalent among proteins with greater CC (supplementary fig. 1, Supplementary Material online). These phenomena and further investigations (see below) show that the lower CC of duplicates is not due to the presence of a few outliers. Although the difference between duplicates and singletons in CC is distinct, no correlation was detected between the family size of duplicates and CC. This suggests that, for duplicates, the possible influence of CC on duplicability is independent of family size.



 
Table 1 Comparison between Duplicates and Singletons in the PINs

 

 
FIG. 1.— Monotonic relationship between CC and the PD. Proteins were grouped into 5 bins by their CC. PD in each group was calculated and plotted.
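The binning behind figure 1 can be sketched as follows. The text states only that proteins were grouped into 5 bins by CC, so the equal-width bins on [0, 1] used here are an assumption, and the function name and sample values are hypothetical.

```python
def proportion_of_duplicates_by_cc(cc_by_protein, duplicates, n_bins=5):
    """Group proteins into equal-width CC bins on [0, 1] and return the
    proportion of duplicates (PD) per bin; empty bins yield None."""
    totals = [0] * n_bins
    dup_counts = [0] * n_bins
    for protein, cc in cc_by_protein.items():
        if cc is None:
            continue  # degree-1 nodes have no defined CC
        index = min(int(cc * n_bins), n_bins - 1)  # CC = 1 falls in the last bin
        totals[index] += 1
        if protein in duplicates:
            dup_counts[index] += 1
    return [d / t if t else None for d, t in zip(dup_counts, totals)]

# Made-up values, for illustration only.
cc_values = {"P1": 0.0, "P2": 0.1, "P3": 0.5, "P4": 1.0, "P5": 0.95, "P6": None}
duplicate_set = {"P1", "P5"}
print(proportion_of_duplicates_by_cc(cc_values, duplicate_set))
```

Returning `None` for empty bins keeps them distinguishable from bins that contain proteins but no duplicates.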

 
CC is a measure of the relative intensity of the interconnectedness of a node's neighbors. Its reliability and practical meaning vary with the degree k of the node for which the CC is calculated. For example, for a node with a degree of 2, CC = 1 means that there is 1 interaction between its 2 neighbors. However, a node with a degree of 10 and a CC of 1 has 10(10 - 1)/2 = 45 interactions among its 10 neighbors. Although both nodes have equal CC, it is reasonable to consider the node with a degree of 10 to be more highly clustered and modular. Regarded as the mean of k(k - 1)/2 measurements, the CC of a node with higher degree is also more reliable in the sense that it is less sensitive to false-positive and false-negative interactions in the network. Thus, we speculated that the correlation between CC and PD should increase with degree. We found this to be the case after binning nodes into groups according to their degree (fig. 2). The extent of the differences in PD as a function of CC increases with degree. Specifically, among the proteins with degree greater than 5, those with a CC of 0 have a probability of 37.4% of being a duplicate, whereas this number is 10.1% if their CC is greater than 0.5. Such a difference of nearly fourfold in PD at the 2 extremes of CC further emphasizes CC as an indicator of duplication. Thus, by reducing the confounding effects caused by nodes with low degree, this observation highlights the monotonic relationship between CC and PD.
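The reliability argument can be made concrete with a few lines of arithmetic. A single false-positive or false-negative interaction among a node's neighbors shifts its CC by 1 divided by the number of neighbor pairs, so the shift shrinks quickly as degree grows. The function names below are hypothetical.

```python
def neighbor_pairs(k):
    """Number of possible interactions among the k neighbors of a node."""
    return k * (k - 1) // 2

def cc_shift_per_interaction(k):
    """Change in CC caused by gaining or losing one neighbor-neighbor interaction."""
    return 1 / neighbor_pairs(k)

for degree in (2, 3, 5, 10):
    print(degree, neighbor_pairs(degree), round(cc_shift_per_interaction(degree), 3))
```

For a node of degree 2, one erroneous interaction flips CC between 0 and 1, whereas for a node of degree 10 it moves CC by only 1/45.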


 
FIG. 2.— Relation of CC and the PD at different degrees. For groups with higher degree, the relation of PD and CC becomes more distinct. In the first 2 groups (degree of 2 and 3, respectively), some bins are empty because no entry has the required CC.

 
PD and Mean CC of Neighbors
CC is a local measure of the interconnectedness of a node's immediate neighbors. It would be interesting to see whether the relation between PD and cliquishness extends to a longer range. We therefore calculated the mean CC of each node's neighbors. To avoid recounting the pairs that had already been considered in calculating the node's own CC, these pairs were excluded from the calculation of the neighbors' CC. As expected, the neighbors of duplicates show a significantly lower mean CC than the neighbors of singletons (table 1). This further indicates that duplicates tend to appear in the sparse part of the PIN and that this effect is not limited to immediate neighborhoods.

PD versus CC for Functional Categories
Strong relationships between gene function and the propensity of a gene to undergo duplication have been reported in a variety of species (Conant and Wagner 2002; Blanc and Wolfe 2004; Prachumwat and Li 2006). Having revealed the relation between CC and PD at the topological level, we next checked whether this relation also exists at the level of functional categories. We investigated this question in the framework of the 16 functional categories of FunCat. As expected, we reproduced the previous observations on the uneven propensities of genes to duplicate for certain functions. For instance, genes involved in transcription have a lower PD than the network-wide average, whereas metabolism genes show markedly higher proportions of duplicates. Many function-specific arguments have been proposed to explain these phenomena (Conant and Wagner 2002; Blanc and Wolfe 2004; Prachumwat and Li 2006). By plotting PD against mean CC (fig. 3), we observed a strong negative correlation between them (Pearson's r = -0.84, P = 4 × 10⁻⁵, degrees of freedom = 14). Specifically, although the metabolism genes have higher duplicability than genes involved in transcription (45% vs. 23%), they are also characterized by a much smaller CC (0.09 vs. 0.19) (table 2).
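As a hedged illustration of how the reported correlation relates to its significance, the sketch below computes Pearson's r and converts a correlation into the t statistic t = r·sqrt(df / (1 - r²)), where df = n - 2 (14 for the 16 categories). The function names are hypothetical, and the per-category PD and CC values are not reproduced here; only the reported r and df are used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, df):
    """t statistic for testing r = 0 with df = n - 2 degrees of freedom."""
    return r * math.sqrt(df / (1 - r * r))

# Using only the values reported in the text: r = -0.84, df = 14.
print(round(t_statistic(-0.84, 14), 2))  # -5.79
```

A t statistic of about -5.79 on 14 degrees of freedom lies far in the tail of the t distribution, in line with the small P value reported above.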


 
FIG. 3.— A strong negative correlation between CC and the PD for functional categories.

 


 
Table 2 Distributions of Functional Categories as Measured by the PD and CC

 
Furthermore, we compared the mean CC of duplicates and singletons for each functional category to see whether the difference in CC between duplicates and singletons is function specific. As shown in table 2, the difference qualitatively exists for nearly all the categories.

The same analysis was also conducted in another functional annotation scheme, GO Slim (downloaded from SGD on 2 October 2005), and similar results were obtained (data not shown).

CC and Duplicability
Because of potential bias introduced at the postduplication stage, the observed enrichment of duplicates with small CC does not necessarily mean that proteins with small CC have higher duplicability. To disentangle the genuine factors influencing gene duplicability from this effect, we followed the strategy of He and Zhang (2006) and limited the following analyses to singletons in S. cerevisiae. In He and Zhang (2006), 2 groups of genes were identified according to whether their orthologs have duplicated in 4 related yeast species (Debaryomyces hansenii, Candida albicans, Yarrowia lipolytica, and Saccharomyces bayanus), and were denoted as group D (standing for duplicate) and group S (standing for singleton). We found the mean CC of group D (0.087) to be significantly lower than that of group S (0.159) (P = 0.01). The limited significance is partly due to the small sample size of group D, which has only 44 proteins with degree greater than 1 in DIP. It should be noted that the extent of the difference between group D and group S is similar to the difference between duplicates and singletons, indicating consistency of the 2 observations.

Another way to discern the relation of duplicability and CC is to make a comparison at the level of functional categories. Because the majority of the proteins are singletons, the consequence of gene duplication is unlikely to influence qualitatively the difference in the mean CC among categories. Therefore, the relationship of PD and CC at functional category level largely remains unchanged upon gene duplications. The observed strong negative correlation between PD and mean CC for the functional categories thus can be considered as a support of the relation between gene duplicability and CC. Taken together, our observations suggest a negative correlation between gene duplicability and CC.

CC and Modularity
Modularity is a key feature of cellular systems (Hartwell et al. 1999). As a characteristic of network modularity, the CC has been shown to be much higher in the PIN than in random networks. The influence of CC on gene duplicability might therefore lie in the modular organization of the network.

To analyze the relationship between CC and modularity quantitatively, we applied a recent module identification method (Newman 2006) to DIP. The method calculates network modularity using matrix eigenvalues and eigenvectors to divide networks into modules. An advantage of this method is that each node is assigned a measure Q, which is defined as the number of intramodular edges of the node minus the expected value. By dividing Q by degree k, we obtained a scaled variable (denoted by Qn) representing the extent of modularity of a node. For nodes with similar degree, higher Qn means higher modularity. By plotting CC against Qn for nodes with degree greater than 4, we observed a strong correlation between them (Spearman's ρ = 0.50, P < 10⁻¹⁵) (supplementary fig. 2, Supplementary Material online). Thus, although the network-average CC is a characteristic of network modularity, the CC of individual nodes can be considered a reflection of their topological modularity.
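The per-node measure can be sketched in Python under one explicit assumption: that the "expected value" is the one from the null model used in Newman's modularity, in which the expected number of edges between nodes i and j is k_i·k_j/(2m), with m the total number of edges. The text does not spell out the formula, so this is an interpretation rather than the authors' exact computation. The module assignment is taken as given (the eigenvector-based partitioning itself is not implemented here), and the function name and toy network are hypothetical.

```python
def node_modularity(adjacency, module_of):
    """Return {node: (Q, Qn)} where Q = (intramodular edges of the node)
    minus (expected number under the k_i * k_j / 2m null model, assumed here)
    and Qn = Q / k."""
    two_m = sum(len(neighbors) for neighbors in adjacency.values())  # 2m
    module_degree = {}  # total degree of each module
    for node, neighbors in adjacency.items():
        module = module_of[node]
        module_degree[module] = module_degree.get(module, 0) + len(neighbors)
    scores = {}
    for node, neighbors in adjacency.items():
        k = len(neighbors)
        if k == 0:
            continue  # isolated nodes carry no modularity information
        inside = sum(1 for v in neighbors if module_of[v] == module_of[node])
        expected = k * module_degree[module_of[node]] / two_m
        q = inside - expected
        scores[node] = (q, q / k)
    return scores

# Hypothetical toy network: two triangles (a, b, c) and (d, e, f) joined by c-d.
toy_adjacency = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
toy_modules = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
for node, (q, qn) in sorted(node_modularity(toy_adjacency, toy_modules).items()):
    print(node, round(q, 3), round(qn, 3))
```

In this toy network the bridging nodes c and d receive a lower Qn than the purely intramodular nodes, which is the behavior the scaled measure is meant to capture.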

A recent HT experiment (Gavin et al. 2006) identified 491 complexes in yeast and partitioned the proteins in complexes into 2 types: core components that are present in most isoforms and attachments that are present in only some of them. We mapped these proteins onto DIP and found that the mean CC of cores is 0.258, which is significantly higher than that of attachments (0.178, P < 10⁻⁵, 1-tailed Wilcoxon rank sum test). Moreover, the PD of cores is 14.3%, compared with 29.0% for attachments (χ² = 24.4, P < 10⁻⁶). Taken together, these observations suggest that the relation between CC and duplicability might be a reflection of the modular organization of the network. More analyses are given in the Discussion.
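The χ² comparison of cores and attachments can be sketched as a test on a 2 × 2 table (core or attachment by duplicate or singleton), which has 1 degree of freedom. That table layout is an assumption, as is the absence of a continuity correction; the underlying counts are not given in the text, so the first function is shown only for completeness and the example uses just the reported statistic. For 1 degree of freedom, the upper-tail probability equals erfc(sqrt(χ²/2)).

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table
    [[a, b], [c, d]], e.g. rows = core/attachment, columns = duplicate/singleton."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def chi_square_p_value_df1(chi2):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# Using only the statistic reported in the text.
print(f"{chi_square_p_value_df1(24.4):.1e}")
```

The resulting P value falls just below 10⁻⁶, in agreement with the bound reported above.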


    Discussion
In this work, we studied gene duplicability in the context of PINs and found a negative correlation between gene duplicability and CC. This relationship is accentuated for proteins with high connectivity. Such patterns suggest that the rate of gene duplication, and hence the duplication of protein products and the growth of the network, is inhomogeneous across the PIN. Specifically, gene duplication occurs preferentially in the sparse part of the PIN. Our results highlight the importance of studying the interactions among gene duplication, network topology, and evolution.

The Influence of Data Quality of PIN
It is known that HT methods of detecting protein–protein interactions may produce a significant fraction of false positives and that current knowledge of yeast protein–protein interactions is incomplete (i.e., the PIN contains false negatives) (Uetz and Finley 2005). Any result deduced from the analysis of a PIN may be influenced by this factor. To estimate the potential influence of incomplete and noisy protein interaction data on our findings, we mimicked the effects of false positives and false negatives by adding or removing 20% of interactions between randomly selected protein pairs. We generated 100 randomly perturbed samples each for removal and addition. The trend of decreasing PD with increasing CC remains qualitatively the same (supplementary figs. 3 and 4, Supplementary Material online). Moreover, the trend was observed not only in the data set containing results of both HT and small-scale experiments but also in the data sets containing small-scale experiments only. These 2 lines of evidence show the robustness of our results to noise in the data.
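The perturbation procedure can be sketched as follows. This is an illustration under assumptions rather than the authors' code: edges are treated as unordered pairs, added edges are drawn uniformly among protein pairs not already connected, and the function names and toy ring network are hypothetical.

```python
import random

def as_undirected(edges):
    """Represent interactions as a set of unordered pairs, dropping self-pairs."""
    return {frozenset(e) for e in edges if len(set(e)) == 2}

def remove_random_edges(edges, fraction, rng):
    """Mimic false negatives: drop `fraction` of the interactions at random."""
    pool = list(edges)
    n_remove = round(fraction * len(pool))
    return set(rng.sample(pool, len(pool) - n_remove))

def add_random_edges(edges, nodes, fraction, rng):
    """Mimic false positives: add `fraction` extra interactions between
    randomly selected protein pairs that are not yet connected."""
    nodes = list(nodes)
    perturbed = set(edges)
    target = len(perturbed) + round(fraction * len(perturbed))
    if target > len(nodes) * (len(nodes) - 1) // 2:
        raise ValueError("not enough protein pairs to add that many interactions")
    while len(perturbed) < target:
        perturbed.add(frozenset(rng.sample(nodes, 2)))
    return perturbed

# Hypothetical toy network: a ring of 10 proteins with 10 interactions.
rng = random.Random(0)
proteins = [f"P{i}" for i in range(10)]
network = as_undirected((proteins[i], proteins[(i + 1) % 10]) for i in range(10))
fewer = remove_random_edges(network, 0.2, rng)
more = add_random_edges(network, proteins, 0.2, rng)
print(len(network), len(fewer), len(more))  # 10 8 12
```

Repeating each call 100 times with different random states and recomputing PD per CC bin on every perturbed network would give the kind of robustness check described above.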

The yeast 2-hybrid assay is an important HT technology for detecting protein interactions. Although its fraction of false positives has been predicted to be high (Mrowka et al. 2001), it would be interesting to check whether our observations can be reproduced in data derived from this method. We therefore analyzed the result of an HT yeast 2-hybrid experiment (Ito et al. 2001). It contains information on 3,266 proteins, connected by 4,383 interactions. A major difference between this data set and the other data sets mentioned above is that it has a mean CC of just 0.037 and only 7% of the nodes have nonzero CC. Such a distribution of CC makes it an inappropriate sample for the analysis. As a result, no correlation between CC and PD was observed in this data set. Although the distribution of CC makes this outcome not unexpected, it is a caveat for our results on CC and PD in the other data sets.

Why Preferential Duplication in the Sparse Part of PIN?
Several factors are known to influence gene duplicability. To understand why more highly clustered proteins have lower duplicability, we explored the possible contributions of several of them. First, it has been suggested that less important genes in yeast have a higher duplicability (He and Zhang 2006). In a study investigating the relation of essentiality to topological characteristics, essentiality was shown to be positively correlated with CC in the PIN of yeast (Yu et al. 2004). We detected a negative correlation between the essentiality of a gene, measured by the fitness reduction upon deletion, and its propensity to be a duplicate, consistent with Gu et al. (2003). After essentiality is controlled for, the remaining influence of CC on duplicability decreases (data not shown). However, the influence of fitness on the relation of CC and duplicability may be overestimated because of functional compensation between paralogs, which generally reduces the measured essentiality of duplicates. Furthermore, to examine the influence of essentiality on the relation of CC and duplicability, we divided all genes into 4 groups according to their fitness effects and compared the mean CC of duplicates and singletons within each group (table 3). If the relation of CC and PD were a byproduct of the relation of essentiality and PD, it would be expected to disappear within the groups. Instead, the difference between duplicates and singletons in CC was preserved for groups of more important genes but became much less distinct for less important genes. As mentioned above, the fitness effect of a gene may change upon duplication. This effect is especially severe for duplicates measured as unimportant genes and thus may explain the less distinct difference in CC between duplicates and singletons in the less essential groups. Taken together, our observation might be partly attributed to the influence of the fitness effect.



 
Table 3 Mean CC of Duplicates and Singletons Belonging to Groups with Different Fitness Effect

 
Second, it has been suggested that haploinsufficient genes have a higher duplicability than haplosufficient genes because doubling the gene dosage of haploinsufficient genes is more likely to be beneficial immediately after gene duplication (Kondrashov and Koonin 2004). Nevertheless, we found no significant difference in CC between haploinsufficient and haplosufficient genes. The mean CC of haploinsufficient genes is 0.158, compared with 0.163 for haplosufficient genes (P = 0.23, 2-tailed Wilcoxon rank sum test). Hence, our observation cannot be explained by the higher duplicability of haploinsufficient genes.

Third, the duplicability of a gene is inevitably influenced by its function. Duplicates are overrepresented in certain functional categories and underrepresented in others (Conant and Wagner 2002; Blanc and Wolfe 2004; Prachumwat and Li 2006). As mentioned above, we recovered a similar phenomenon. Moreover, a strong negative correlation between mean CC and duplicability for functional categories was revealed, and this correlation exists not only at the level of functional categories but also at the level of the genes within nearly all the functional categories. Thus, rather than stating that our observation is influenced by the biased duplicability of some functional categories, we prefer to consider CC a good indicator of the gene duplicability of functional categories.

The above analyses suggest that the correlation between CC and duplicability can be explained only partly by most of the known factors influencing gene duplicability. Thus, it is worth seeking to understand this phenomenon from a network perspective. As shown in the Results, CC might influence gene duplicability as a measure of modularity. Modularity can promote or constrain duplication in different parts of the network because of the benefits or disadvantages introduced by the duplication.

From the perspective of topological features, we speculate that a modular network might contain 3 basic constituents: intramodular hubs, characterized by both high degree and high CC, representing the central elements of modules; intermodular hubs, with high degree and low CC, connecting nodes in different modules; and peripheral elements, characterized by moderate or small degree and CC. To give a snapshot of the duplicability of the different constituents, we used the following rough but reasonable cutoffs. Hubs are defined as proteins within the top 30% of the network as measured by degree. Among the hubs, those within both the top 30% as measured by CC and the top 30% as measured by Qn, from the module identification method of Newman (2006), were identified as intramodular hubs, and those within the bottom 30% by both measures were identified as intermodular hubs. Consequently, we identified 244 intramodular and 190 intermodular hubs, with 15% and 39% of them being duplicates, respectively, compared with the network average of 28%. Clearly, the choice of thresholds is somewhat arbitrary, but the results remain qualitatively the same even when we used different cutoffs to define hubs. A caveat of our procedure is that filtering by degree, CC, and Qn may not be sufficient to identify the different hubs, which may have temporal features not captured in a static map of protein interactions. There might therefore be some false positives in our result.
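The cutoffs described above can be sketched in Python. One detail is an assumption: the text does not say whether the 30% cutoffs for CC and Qn are taken among hubs or across the whole network, and this sketch ranks among hubs only. The function names, the rounding of group sizes, and the made-up values are hypothetical.

```python
def select_extreme(scores, fraction, highest=True):
    """Return the keys within the top (or bottom) `fraction` ranked by score.
    The group size is rounded to the nearest integer; ties are broken by
    sort order, both of which are choices made for this sketch."""
    ranked = sorted(scores, key=scores.get, reverse=highest)
    return set(ranked[:round(len(ranked) * fraction)])

def classify_hubs(degree, cc, qn, fraction=0.3):
    """Return (intramodular_hubs, intermodular_hubs) using 30% cutoffs."""
    hubs = select_extreme(degree, fraction)
    hub_cc = {p: cc[p] for p in hubs}
    hub_qn = {p: qn[p] for p in hubs}
    # Intramodular: high CC and high Qn; intermodular: low CC and low Qn.
    intra = select_extreme(hub_cc, fraction) & select_extreme(hub_qn, fraction)
    inter = (select_extreme(hub_cc, fraction, highest=False)
             & select_extreme(hub_qn, fraction, highest=False))
    return intra, inter

def proportion_of_duplicates(proteins, duplicates):
    """PD of a group of proteins, or None for an empty group."""
    return len(proteins & duplicates) / len(proteins) if proteins else None

# Made-up values for 20 proteins, for illustration only.
names = [f"P{i}" for i in range(20)]
degree = {p: i + 1 for i, p in enumerate(names)}
cc = {p: 0.2 for p in names}
qn = {p: 0.2 for p in names}
cc.update({"P19": 0.9, "P18": 0.8, "P17": 0.5, "P16": 0.4, "P15": 0.05, "P14": 0.0})
qn.update({"P19": 0.9, "P18": 0.7, "P17": 0.3, "P16": 0.5, "P15": 0.1, "P14": 0.0})
intra_hubs, inter_hubs = classify_hubs(degree, cc, qn)
print(sorted(intra_hubs), sorted(inter_hubs))
```

Requiring agreement between CC and Qn makes each class more conservative than using either measure alone, at the cost of leaving some hubs unclassified.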

In summary, different network constituents show distinct propensities for gene duplication. Our observations suggest the following picture of network growth by duplication. Intramodular hubs represent the most stable and conservative part of the network, with little chance of duplication. Intermodular hubs are in the sparse and dynamic region of network evolution, not only because of a high rate of duplication but also in the sense that duplication of hubs may induce more interaction rewiring. Through duplications in these positions, the network evolves new cellular functions by reorganizing the connections among modules. The peripheral nodes constitute the major part of the network and grow at a moderate rate. The growth and modification of modules largely lie in this part because of its large population.

It is interesting to note that the result of another study (Fraser 2005), which focused on the evolutionary rates of the 2 types of hubs, is compatible with our observation on the evolutionary conservation of hubs, although different data and strategies were used to identify the hubs.

The Roles of Degree and CC in Deciding Gene Duplicability
Taken together, the above results provide valuable clues, on the basis of which we propose a hypothesis to explain the relation between topological characteristics and the conservation of proteins in the PIN. On the one hand, a high CC indicates high functional coherence and compactness between a node and its neighbors, which together form modules. On the other hand, a high CC, if accompanied by high degree, also indicates a central role of the node in the module. Thus, higher CC poses more severe constraints on protein evolution, including deletion, duplication, and changes in sequence. The situation is quite different for degree. Degree correlates with pleiotropy, so hubs experience more pleiotropic constraints. Nevertheless, higher degree is also a sign of higher intrinsic flexibility and thus of greater potential to evolve adaptive functions upon mutation. Consequently, we speculate that degree might impose 2 opposite effects on the evolution of a protein, which largely counteract each other. In the case of gene deletion and sequence divergence, the former effect may exceed the latter, leading to significantly reduced evolutionary rates for both types of hubs. In the case of duplication, however, pleiotropy may additionally facilitate the preservation of duplicates, as suggested by the subfunctionalization model (Force et al. 1999) and the "adaptive-conflict" model (Hughes 1994). Thus, the intermodular hubs, both free from the constraint imposed by CC and benefiting from pleiotropy, may have duplicability similar to, or even higher than, the network average. They may therefore play active roles in network evolution and contribute to the plasticity of biomolecular networks by organizing a limited number of modules to fulfill various cellular functions.


    Supplementary Material
Supplementary figures 1–4 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
We thank Dr. Jingdong Han for critical discussions, Dr. Mark Newman for providing the program of module identification, and Yun Zhou and Hu Chen for valuable comments. This work was supported by grants from the National Natural Science Foundation of China (No.90303017, No.90408019), Hi-Tech Research and Development 863 Program of China (No.2002AA234041), and Foundational Science Research Grant from the 973 project (No.2003CB715900).


    Footnotes
 
Jianzhi Zhang, Associate Editor


    References

    Bader GD and Hogue CWV. (2002) Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol 20:991–997.

    Blanc G and Wolfe KH. (2004) Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16:1679–1691.

    Conant GC and Wagner A. (2002) Genomehistory: a software tool and its application to fully sequenced genomes. Nucleic Acids Res 30:3378–3386.

    Davis JC and Petrov DA. (2004) Preferential duplication of conserved proteins in eukaryotic genomes. PLoS Biol 2:e55.

    Davis JC and Petrov DA. (2005) Do disparate mechanisms of duplication add similar genes to the genome? Trends Genet 21:548–551.

    Eisenberg D, Marcotte EM, Xenarios I, Yeates TO. (2000) Protein function in the post-genomic era. Nature 405:823–826.

    Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. (1999) Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545.

    Fraser HB. (2005) Modularity and evolutionary constraint on proteins. Nat Genet 37:351–352.

    Gao L-Z and Innan H. (2004) Very low gene duplication rate in the yeast genome. Science 306:1367–1370.

    Gavin A-C, Aloy P, Grandi P, et al. (32 co-authors). (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature 440:631–636.

    Gu Z, Cavalcanti A, Chen F-C, Bouman P, Li W-H. (2002) Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol Biol Evol 19:256–262.

    Gu Z, Steinmetz LM, Gu X, Scharfe C, Davis RW, Li W-H. (2003) Role of duplicate genes in genetic robustness against null mutations. Nature 421:63–66.

    Guldener U, Munsterkotter M, Oesterheld M, Pagel P, Ruepp A, Mewes H-W, Stumpflen V. (2006) MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res 34:D436–D441.

    Hartwell LH, Hopfield JJ, Leibler S, Murray AW. (1999) From molecular to modular cell biology. Nature 402:C47–C52.

    He X and Zhang J. (2005) Gene complexity and gene duplicability. Curr Biol 15:1016–1021.

    He X and Zhang J. (2006) Higher duplicability of less important genes in yeast genomes. Mol Biol Evol 23:144–151.

    Hughes AL. (1994) The evolution of functionally novel proteins after gene duplication. Proc R Soc Lond B Biol Sci 256:119–124.

    Hughes AL and Friedman R. (2005) Gene duplication and the properties of biological networks. J Mol Evol 61:758–764.

    Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98:4569–4574.

    Kondrashov FA and Koonin EV. (2004) A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet 20:287–290.

    Kopelman NM, Lancet D, Yanai I. (2005) Alternative splicing and gene duplication are inversely correlated evolutionary mechanisms. Nat Genet 37:588–589.

    Lynch M and Conery JS. (2000) The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.

    Marland E, Prachumwat A, Maltsev N, Gu N, Li W-H. (2004) Higher gene duplicabilities for metabolic proteins than for nonmetabolic proteins in yeast and E. coli. J Mol Evol 59:806–814.

    Mrowka R, Patzak A, Herzel H. (2001) Is there a bias in proteome research? Genome Res 11:1971–1973.

    Newman MEJ. (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA 103:8577–8582.

    Papp B, Pal C, Hurst LD. (2003) Dosage sensitivity and the evolution of gene families in yeast. Nature 424:194–197.

    Prachumwat A and Li W-H. (2006) Protein function, connectivity, and duplicability in yeast. Mol Biol Evol 23:30–39.

    Rubin GM, Yandell MD, Wortman JR, et al. (50 co-authors). (2000) Comparative genomics of the eukaryotes. Science 287:2204–2215.

    Ruepp A, Zollner A, Maier D, et al. (11 co-authors). (2004) The funcat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 32:5539–5545.

    Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res 32:D449–D451.

    Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. (2006) Biogrid: a general repository for interaction datasets. Nucleic Acids Res 34:D535–D539.

    Steinmetz LM, Scharfe C, Deutschbauer AM, et al. (11 co-authors). (2002) Systematic screen for human disease genes in yeast. Nat Genet 31:400–404.

    Su Z, Wang J, Yu J, Huang X, Gu X. (2006) Evolution of alternative splicing after gene duplication. Genome Res 16:182–189.

    Taylor JS and Raes J. (2004) Duplication and divergence: the evolution of new genes and old ideas. Annu Rev Genet 38:615–643.

    Uetz P and Finley RLJ. (2005) From protein networks to biological systems. FEBS Lett 579:1821–1827.

    Wagner A. (2003) How the global structure of protein interaction networks evolves. Proc R Soc Lond B Biol Sci 270:457–466.

    Yu H, Greenbaum D, Lu HX, Zhu X, Gerstein M. (2004) Genomic analysis of essentiality within protein networks. Trends Genet 20:227–231.

Accepted for publication September 12, 2006.




This article has been cited by other articles:

E. B. Dopman and D. L. Hartl. A portrait of copy-number polymorphism in Drosophila melanogaster. PNAS, December 11, 2007; 104(50): 19920-19925.

