Skip Navigation


MBE Advance Access originally published online on July 24, 2006
Molecular Biology and Evolution 2006 23(10):1937-1945; doi:10.1093/molbev/msl056
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
23/10/1937    most recent
msl056v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (11)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gu, X.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gu, X.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Article

A Simple Statistical Method for Estimating Type-II (Cluster-Specific) Functional Divergence of Protein Sequences

Xun Gu

Department of Genetics, Development and Cell Biology, Center for Bioinformatics and Biological Statistics, Iowa State University

E-mail: xgu{at}iastate.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Theory
 A Simple Robust Method
 Software and Examples
 Discussion
 Acknowledgements
 References
 
Predicting functional amino acid residues in silico is important for comparative genomics. In this paper, we focus on the issue of how to statistically identify cluster-specific amino acid residues that are related to the functional divergence after gene duplication. We approach this problem using a framework based on site-specific shift of amino acid property (type-II functional divergence), as opposed to site-specific shift of evolutionary rate (type-I functional divergence). An efficient statistical procedure is implemented to facilitate the development of phylogenomic database for cluster-specific residues of large-scale protein families. Our method has the following features: 1) statistical testing of the type-II functional divergence and 2) the site-specific Bayesian profile to measure how amino acid residues contribute to type-II (cluster-specific) functional divergence. Consequently, one may obtain the posterior probability for "functional" cluster-specific residues. Case studies are presented and indicate that radical cluster-specific residues are responsible for most of inferred type-II functional divergence, whereas conserved cluster-specific residues appear less than even those imperfect radical cluster-specific residues to this type of functional divergence.

Key Words: functional divergence • gene family evolution • cluster-specific sites


    Introduction
 TOP
 Abstract
 Introduction
 Theory
 A Simple Robust Method
 Software and Examples
 Discussion
 Acknowledgements
 References
 
Under the framework of phylogenomic annotation of gene function (Eisen and Fraser 2003Go), the importance of gene function can be measured quantitatively in terms of the functional constraints of the protein sequence (Kimura 1983Go). For instance, an amino acid residue is said to be functionally important if it is evolutionarily conserved. Therefore, change of the evolutionary conservation at a particular residue may indicate the involvement of functional divergence (Lichtarge et al. 1996Go; Gu 1999Go). Following this idea, many research groups have developed statistical methods for testing and predicting the functional divergence of a gene family, which indeed showed the association between sequence and functional or structural divergence (e.g., Lichtarge et al. 1996Go; Gu 1999Go, 2001Go, 2003Go; Knudsen and Miyamoto 2001Go; Landgraf et al. 2001Go; Wang and Gu 2001Go; Gaucher et al. 2002Go; Jordan et al. 2001Go; Lopez et al. 1999Go; Gribaldo et al. 2003Go; Madabushi et al. 2004Go; Gao et al. 2005Go; Rastogi and Liberles 2005Go; H Zhou, J Gu, SJ Lamont, X Gu, personal communication).

Furthermore, Gu (2001)Go made a distinction between 2 types of functional divergence. Type-I functional divergence results in site-specific rate shift (Gu 1999Go; Knudsen and Miyamoto 2001Go; Landgraf et al. 2001Go; Gaucher et al. 2002Go; Lopez et al. 2002). A typical case is an amino acid residue that is highly conserved in a subset of homologous genes but highly variable in a different subset of those homologous genes. Alternatively, type-II functional divergence results in the shift of cluster-specific amino acid property (Lichtarge et al. 1996Go; Gu 2001Go). Such divergence is exemplified by a radical shift of amino acid property, for example, positive versus negative charge differences at a homologous site that is otherwise evolutionally conserved between subtrees within a phylogeny. Note that these 2 types of functional divergence may have other names. For instance, the basic evolutionary trace approach (Lichtarge et al. 1996Go; Madabushi et al. 2004Go) has mainly focused on cluster-specific residues related to type-II functional divergence. Gribaldo et al. (2003)Go also looked at type-II functional divergence called as "constant-but-different." Meanwhile, the weighted evolutionary trace approach proposed by Landgraf et al. (2001)Go was similar to type-I functional divergence (Gu 1999Go).

Many studies have been published about the statistical significance of observed patterns, which is important as the research community tends to use them to infer functional divergence of proteins (e.g., Gu 1999Go, 2001Go; Knudsen and Miyamoto 2001Go; Landgraf et al. 2001Go; Gaucher et al. 2002Go). Although several methods for type-I functional divergence are available, as well as the software (e.g., Gu and Vander Velden 2002Go), the implementation of statistical testing for type II has not been well resolved.

In this paper, we develop a statistical method for type-II functional divergence. In a typical case of 2 gene clusters generated by a gene duplication event, type-II functional divergence results in site-specific shift of amino acid physiochemical property after the gene duplication. In this regard, "type II is also called cluster-specific functional divergence" (Lichtarge et al. 1996Go). Cluster-specific amino acid residues have been widely used in functional assays of protein families. For instance, Sun et al. (2002)Go identified essential amino acid changes in paired domain evolution of the PAX gene family. Given the growing diversity and number of protein sequences available, it is now practical and necessary to develop a phylogenetic database for cluster-specific amino acid residues of protein families. To this end, we have to address 2 related statistical issues. First, are the type-II (cluster-specific) changes statistically significant? And secondly, for observed cluster-specific amino acid residues, how can we statistically measure whether they are related to type-II functional divergence. We address these issues by developing the statistical method that is suitable for large-scale data analysis.


    Theory
 TOP
 Abstract
 Introduction
 Theory
 A Simple Robust Method
 Software and Examples
 Discussion
 Acknowledgements
 References
 
Type-II Functional Divergence (cluster-specific) in the Early Stage
In principle, the evolution of protein sequences of duplicate genes can be divided into 2 stages, the early (E) stage after gene duplication and the late (L) stage (fig. 1). We assume that functional divergence between duplicate genes has occurred in the E stage, whereas in the late (L) stage, the purifying selection plays a major role to maintain related but distinct functions of 2 duplicate genes (Ohno 1970Go; Kimura 1983Go; Force et al. 1999Go). Accordingly, we modify the "2-state model" (Gu 1999Go, 2001Go) specific to type-II (cluster-specific) functional divergence:

(i) In the E stage, an amino acid residue can be in either of 2 states: F0 (type II unrelated) and F1 (type II related). The probability of a residue being under F1 is P(F1) = {theta}II and that being under F0 is P(F0) = 1 – {theta}II. To distinguish it from the type-I functional divergence (Gu 1999Go), we call {theta}II "the coefficient of type-II functional divergence."
(ii) In the L stage, an amino acid residue is always under the state of F0, indicating no further type-II functional divergence. Amino acid substitutions in this stage are mainly under purifying selection.


Figure 1
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— (A) Two gene clusters after gene duplication. E and L are early and late stages of gene cluster A and B, respectively. (B) Type-I functional divergences: in the early stage, the evolutionary rate may increase for functional divergence–related change, resulting in shifted functional constraints between clusters A and B. Type-II functional divergence: in the early stage, the evolutionary rate may increase for functional divergence–related change, resulting in a radical shift in amino acid property but in the late stage is back to the same level of sequence conservation. (C) The phylogenetic tree of COX gene family, which was inferred by the Neighbor-Joining method, using amino acid sequences with Poisson distance. Bootstrapping values of more than 50% are presented. Modified from Gu (1999Go, 2001Go).

 
Substitution Models under F0 and F1
The pattern of amino acid substitutions during evolution, or the substitution model, relies on the states of functional divergence (F0/F1). The F0-substitution model largely reflects the conserved evolution of protein sequences, which can be empirically determined by the Dayhoff model (Dayhoff et al. 1978Go) or the Jones-Taylor-Thornton model (Jones et al. 1992Go). In contrast, under F1, radical amino acid substitutions may occur more frequently, apparently due to the functional divergence between duplicate genes (Lichtarge et al. 1996Go). To avoid overparameterization, we propose a simple F1-substitution model that can distinguish between the "radical" and "conserved" amino acid substitutions. First, we tentatively classify 20 amino acids into 4 groups: charge positive (K, R, and H), charge negative (D and E), hydrophilic (S, T, N, Q, C, G, and P), and hydrophobic (A, I, L, M, F, W, V, and Y). An amino acid substitution is called radical (denoted by R) if it changes from one group to another; otherwise it is called conserved, that is, within the group, denoted by C. It should be mentioned that the conservations of residues may be due to selection pressure for greater functions or for folding stabilities (Tseng and Liang 2006Go). The status of no substitution is denoted by N.

Secondly, we assume that, under state F0, the transition probability for a radical, conserved, or no substitution is given by

Formula

Formula
or

Formula 1(1)
respectively, where t is the evolutionary time, {lambda} is the substitution rate, and {pi}R (or {pi}C) is the proportion of radical (or conserved) substitutions in the total substitutions; {pi}R + {pi}C = 1. Apparently, equation (1) is an extended Poisson model of protein sequence evolution. Based on the Dayhoff PAM matrix, we empirically determined that {pi}R = 0.312 and {pi}C = 0.688. Indeed, without any functional divergence, conserved amino acid substitutions are more likely to occur, as expected by the theory of neutral evolution (Kimura 1983Go).

Next we consider the transition probabilities under F1 in the early stage, denoted by P(Y|F1) for Y = N, R, C. It should be noted that, according to our model (see above), an amino acid residue that has no change in the early stage is essentially unrelated to the type-II functional divergence. This argument implies P(N|F1) = 0. Similar to Eq.(1), one may choose P(R|F1) and P(C|F1) in the same forms as Formula 1 and a'C(1 – e{lambda}t), respectively. Together, these arguments directly lead to

Formula 1

Formula 1
and

Formula 2(2)
where aR = a'R/(a'R + a'C) and aC = 1 – aR. That is, aR (or aC) is the (F1) proportion of radical (or conserved) substitutions in total substitutions. Moreover, the F1-radical amino acid substitution (aR) can be much higher than that under F0 ({pi}R), as will be shown later.

Evolutionary Link between Early and Late Stages
The evolutionary link between early and late stages depends on the status of type-II (cluster-specific) functional divergence. Let {lambda}E and {lambda}L be the evolutionary rates in the E and L stages, respectively. The statistical framework we developed is under the following assumptions:

(i) A random variable u, called the rate component, varies among sites according to a standard gamma distribution

Formula 3(3)
The shape parameter {alpha} describes the strength of rate variation among sites, that is, a small value of {alpha} means a strong rate heterogeneity among sites, and {alpha} = {infty} means no rate variation among sites (Gu et al. 1995Go).

(ii) Under F0, the evolutionary rates in the early ({lambda}E) and late ({lambda}L) stages share the same rate component u. That is, {lambda}E = c1u and {lambda}L = c2u, where c1 and c2 are constant.
(iii) F1-amino acid substitutions in the early stage are independent of the rate component u, as indicated by equation (2). In other words, F1-amino acid substitutions have escaped from the ancestral functional constraint on the protein sequence.

Two Clusters by Gene Duplication
Consider the typical case of 2 clusters generated by a gene duplication event, each of which consists of several orthologous genes (fig. 1). Let X be the amino acid pattern of the late stage, the column (site) of the multiple alignments of the sequences. Let Y = (a, b) be the amino acid pattern of the early stage, the ancestral sequences of 2 internal nodes a and b. From the assumption (ii), the joint probability of X and Y under F0 is given by Formula 3 where P(Y|F0) is determined by equation (1) for Y = N, C, or R, respectively, P(X|Y) is the likelihood of the subtrees of the 2 clusters A and B, conditional on the ancestral states a and b, which can be constructed according to the Markov-chain property under a known phylogeny (Felseinstein 1981; Gu 2001Go), and the gamma distribution density for rate variation among sites {phi}(u) is given by equation (3). Similarly, from (iii), under F1 we have Formula 3 where P(Y|F1) is given by equation (2). Remembering that the probability of a site being under F1 is given by P(F1) = {theta}II, the coefficient of type-II functional divergence, we have the joint probability for X and Y as follows:

Formula 4(4)

Direct application of equation (4) for estimating {theta}II may face some difficulties because the amino acid pattern of early stage (Y) is unobservable. A straightforward solution is to invoke the ancestral sequence inference, for example, Yang et al. (1995)Go. Treating the ancestral sequences as inferred observations, the standard procedure for the likelihood analysis of protein sequence can be applied. In spite of nice statistical properties, this approach requires a detailed description of the model and sensitive to the statistical uncertainty in ancestral sequence inference. To solve this problem, we thus propose a simple but robust method that is computationally efficient, allowing genome-wide proteomic analysis.


    A Simple Robust Method
 TOP
 Abstract
 Introduction
 Theory
 A Simple Robust Method
 Software and Examples
 Discussion
 Acknowledgements
 References
 
Poisson Model in the Late Stage
Testing type-II functional divergence between 2 gene clusters (the early stage) utilizes the within-cluster amino acid patterns to examine the conservation in the late stage. Therefore, a Poisson-based model that counts the number (k) of substitutions may be sufficient for this purpose, where smaller values of k of substitutions in a gene cluster indicate high conservation. Formally, at a given amino acid residue, the number of substitutions in each cluster (A or B) follows a Poisson process, for example, for cluster A, we have

Formula 5(5)
with the same applying to pB(k), where TA (or TB) is the total evolutionary time of cluster A (or B), and {lambda}A (or {lambda}B) is the evolutionary rate of cluster A (or B), respectively. Hence, the early–late joint distribution can be specified as fij, Y = P(X = (i, j), Y), where i or j is the number of substitutions in cluster A or B. Under this model, P(X|Y) = pApB, which is independent of the early stage Y. Similar to the derivation of equation (4), we have Formula 5 and Formula 5 Together, one can show that the early–late distribution under the Poisson-based model is given by

Formula 6(6)
where P(Y|F0) is from equation (1) and P(Y|F1) from equation (2); here aN = P(N|F1) = 0.

Analytical Form of the Early–Late Distribution
First we consider the late-stage distribution, fij, the probability for i and j substitutions in clusters A and B, respectively. From equation (6), one can show that Formula 6 which is a specific version of bivariate negative binomial distribution,

Formula 7(7)
where Z = {alpha}/(DA + DB + {alpha}), ZA = DA/(DA + DB + {alpha}), and ZB = DB/(DA + DB + {alpha}); Formula 7 and Formula 7 are the total branch lengths of clusters A and B, respectively, and {alpha} is the gamma shape parameter.

Next we consider the early-stage distribution fY, the frequencies of 3 early-stage amino acid patterns for Y = N, R, or C. Because Formula 7 from equation (6) one can show

Formula 8(8)
and fN = 1 – fRfC. Moreover, let p = fR + fC be the proportion of amino acid differences (either radical or conserved) in the early stage, which is given by

Formula 9(9)
where Formula 9 is the branch length of the early stage.

We define W = {alpha}/(DA + DB + d + {alpha}), WA = DA/(DA + DB + d + {alpha}), and WB = DB/(DA + DB + d + {alpha}). Finally, we have shown that the joint distribution of early–late stages, fij, Y, can be expressed as follows:

Formula 9

Formula 9
and

Formula 10(10)
where Formula 10 is given by

Formula 11(11)

Estimation
Based on the likelihood principle, we have implemented the following algorithms to estimate unknown parameters for testing type-II functional divergence. Here we always assume that the phylogenetic tree of the gene family is known or can be reliably inferred.

Late-Stage Likelihood
The distribution of late-stage Qij is the probability of a site being i and j substitutions in the 2 clusters. As shown by equation (7), Qij depends on 3 (late-stage) parameters DA, DB, and {alpha}. We thus modified the likelihood method of Gu and Zhang (1997)Go to estimate them simultaneously, denoted by Formula 11 and Formula 11 respectively. Note that the algorithm of Gu and Zhang (1997)Go corrected the parsimony bias in counting the number of substitutions.

Likelihood for Estimating Early-Stage Parameters
Let nij, Y be the number of site with the pattern X = (i, j) and Y = N, R, or C. After treating 3 late-stage parameters as known, we develop a simple likelihood to estimate early-stage parameters {theta}II, aR/aC, and d. From equation (10), we have fij,S = fij,R + fij,C = Qij – (1 – {theta}II)Mij = Qij fij,N. Let nij,S = nij,R + nij,C. Thus, the log-likelihood function,


Formula

includes 2 unknown parameters {theta}II and d. Let Formula 11 is the total number of sites that have no change in the early stage. Under the p constraint of equation (9), the maximum likelihood estimate of {theta}II is given by Formula 11 where y is the solution of

Formula 13(13)
with d = –ln(1 p) + ln(1 – {theta}II). (Note that Mij depends on the parameter d, whereas Qij only depends on late-stage parameters that are treated as known). The iteration can start with the initial values of d(0) = –ln(1 – p) until convergence. Let L be the sequence length, Formula 13 and Formula 13 The sampling variance of Formula 13 can be calculated as follows:

Formula 14(14)
where Formula 14 When the estimates Formula 14 and Formula 14 are obtained, aR can be estimated from equation (8).

The proportion of amino acid differences between the internal nodes a and b represented by p can be computed as follows. First, we use the Bayesian algorithm (Zhang and Nei 1997Go) to infer the ancestral sequences of Y, which is a simplified version of Yang et al. (1995)Go in which the branch lengths of the phylogenetic tree are estimated using a least square method rather than the ML method. Then, we estimate p when each site in the inferred ancestral sequence receives the assignment of amino acid with the highest posterior probability. Simulations conducted by Zhang and Nei (1997)Go showed that this approach for estimating p is almost unbiased.

The U-Likelihood
This method utilizes amino acid sites that are universally conserved in both clusters, that is, i = j = 0. Let n00Y be the number of sites with Y = N (the U type), R, or C. Let n00 = n00N + n00R + n00C and f00 = f00N + f00R + f00C. Then, the log of U-likelihood can be written as


Formula

Let Formula 14 Similar to above, we have shown that the ML estimates of {theta}II and d are given by

Formula 14

Formula 16(16)

The sampling variance of the estimates Formula 16 is Var({theta}II) = f00N(1 – f00N)b2/N, where Formula 16 Because the U method largely relies on the universally conserved sites, it seems robust against the inaccuracy of ancestral sequence inference and sequence alignment.

Predicting Critical Amino Acid Residues: Empirical Bayesian Approach
The identification of which sites are responsible for these type-II (cluster-specific) functional differences is of great interest, if the coefficient of functional divergence ({theta}II) between early and late stages is significantly larger than 0. Here we develop a method of predicting such sites, which indeed can be further tested by experimentation, using molecular, biochemical, or transgenic approaches.

We wish to know the probability of state F1 in the early stage at a site, that is, P(F1|X, Y). According to the Bayesian law, we have

Formula 17(17)
where the prior probability of F1 in the early stage is given by P(F1) = {theta}II. Under the Poisson-based model, P(X = (i, j), Y|F1) and P(X = (i, j), Y|F0), and P(X = (i, j), Y) are given by equations (5) and (7), respectively. Noting that aY = 0 if Y = N, one can show

Formula 17

Formula 17
and

Formula 18(18)

One may find that it is simple to use the posterior probability ratio of F1 to F0, that is, R(F1|F0) = P(F1|X, Y)/P(F0|X, Y). After some algebras, we obtain

Formula 19(19)
where h = d/(DA + DB + d + {alpha}).

Statistical Evaluation of Cluster-Specific Sites
An important result from equation (19) is that the posterior ratio R(F1|F0) reaches its maximum if there is no amino acid substitution in each gene cluster but the amino acid is different between them, that is, i = j = 0 and Y != N. As usually observed, and assuming that the proportion of radical changes under F1 is higher than that under F0 such that aR/aC > {pi}R/{pi}C, we have

Formula 20(20)

Hence, a typical cluster-specific site indeed will receive a highest score for the type-II functional divergence, consistent with the intuitive biological interpretation. However, it should also be indicated that a high score could be statistically meaningless if {theta}II is not significantly larger than 0. Finally, we note that Formula 20 if Formula 20 This means that greater accuracy is achieved as more sequences are analyzed (i.e., increasing DA or DB). In practice, one may use this property to determine how many sequences are sufficient to achieve the statistical resolution of site prediction.


    Software and Examples
 TOP
 Abstract
 Introduction
 Theory
 A Simple Robust Method
 Software and Examples
 Discussion
 Acknowledgements
 References
 
The newly developed method for type-II functional divergence has been implemented in the software package DIVERGE2, which is available from our Web site http://xgu.gdcb.iastate.edu. We have distribution packages for both Microsoft Windows and Linux operating systems, including manual and example files. We have conducted several case studies to demonstrate its potential applications in understanding functional divergence of protein sequences.

Data Sets
We present case studies analyzing 3 gene families. 1) The cyclooxygenase (COX) enzymes catalyze a key step in the conversion of arachidonate to PGH2, the immediate substrate for a series of cell prostaglandin and thromboxane synthases. There are 2 tissue-specific isoforms in mammals: COX1 and COX2. Molecular cloning of COX2 led to a major investment by pharmaceutical companies in the development of selective inhibitors. The sequence alignment and the phylogenetic tree are the same as in Gu (2001)Go; also see figure 1C. 2) The caspase gene family is important for apoptosis (programmed cell death) and cytokine maturation, which has been studied extensively for type-I functional divergence (Wang and Gu 2001Go). And 3) we have also analyzed the duplicated isoforms (Gq and GS) of G-protein alpha subunits.

Testing the Significance of Type-II Functional Divergence
As expected, all these gene families show significant type-I functional divergence. Based on the phylogenetic tree in figure 1C, we estimated that the coefficient of type-II functional divergence between COX1 and COX2 duplicate genes {theta}II = 0.159 ± 0.036, which is statistically significant (P < 0.001). We also analyzed the duplicated isoforms (Gq and GS) of G-protein alpha subunits and found a similar pattern of type-II functional divergence (table 1). In contrast to Wang and Gu's (2001Go) finding for type-I functional divergence of caspase gene family, we found no evidence for type-II functional divergence between 2 major CED-3 and ICE subfamilies. This raises the question whether the relative importance of type-I and type-II functional divergences is associated with specific functional classes of the protein family.


View this table:
[in this window]
[in a new window]

 
Table 1 Summary of Functional-Divergence Analysis for COX and G-Protein Alpha Families

 
Site-Specific Profiles of Type-II (cluster-specific) Functional Divergence
For illustration, figure 2 shows the site-specific ratio profile of type-II functional divergence between COX1 and COX2. For 583 aligned sites, 492 (84%) sites have received the ratio score < 1, indicating that most sites are predicted to be unrelated to the type-II functional divergence. Moreover, we identified 28 radical cluster-specific sites that receive the highest posterior ratio score, that is, R(F1|F0)max = 7.17. In other words, if we select these radical cluster-specific sites as candidates for type-II functional divergence, the posterior probability for them is PII = 7.17/(1 + 7.17) = 87.8%, indicating that the prediction error (false-positive rate) is 12.2%. Actually, it is impressive that among 111 radical amino acid substitutions in the early stage after the gene duplication, about 29/111 {approx} 26% are potentially related to type-II functional divergence between COX1 and COX2.


Figure 2
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Site-specific profile for type-II functional divergence between COX1 and COX2, measured by the posterior ratio. Horizontal lines (1)–(4) indicate cluster-specific patterns in table 2.

 

View this table:
[in this window]
[in a new window]

 
Table 2 Functional Ranking of Several Cluster-Specific Patterns in the COX Gene Family

 
Effect of Radical Substitutions in the Early Stage
For the COX gene family, we found that the radical substitutions for type-II functional divergence in the early stage is about 2.7-fold increasing (aR/{pi}R = 2.7 in table 1). Consequently, an amino acid residue with a radical change between COX1 and COX2 may have a higher score than a conserved change for being type-II functional divergence related. As shown by table 2, the sites most likely exhibiting type-II behavior are the radical cluster-specific sites, whereas the conserved cluster-specific sites are less likely, as indicated by a low posterior probability (~ 0.35). This case study clearly shows the important role of statistical analysis, otherwise one cannot objectively justify whether one-less radical cluster-specific site (i.e., there is one amino acid substitution in the late stage) is more likely to be functional divergence related than conserved cluster-specific site.


    Discussion
 TOP
 Abstract
 Introduction
 Theory
 A Simple Robust Method
 Software and Examples
 Discussion
 Acknowledgements
 References
 
The importance of cluster-specific amino acid residues for understanding functional divergence of a protein family has been well recognized (Lichtarge et al. 1996Go; Sun et al. 2002Go). Because the computational prediction of these sites from a multiple alignment is straightforward, it is technically simple to develop a phylogenomic database for cluster-specific residues of large-scale protein families. The difficulty, however, is the statistical issue. Without developing a reliable statistical procedure to test the significance of observed cluster-specific residues, such phylogenomic effort could result in rapid accumulation of false-positive cases. Based on the statistical framework of type-II functional divergence, we have provided a practical solution for this problem.

We first test type-II functional divergence after the gene duplication. Rejection of the null hypothesis indicates that some conserved amino acid residues have experienced radical shift of amino acid property between the duplicate genes. The site-specific profile based on the posterior probabilities is useful to select residues that are type-II functional divergence related. Apparently, radical cluster-specific sites usually receive the highest scores for type-II functional divergence. However, previous computational methods have difficulty to rank ordering the conserved cluster-specific sites or imperfect radical cluster-specific sites (e.g., with one amino acid substitution within clusters) because the score system adopted was ad hoc, subject to arbitrary. Our software was developed to overcome these potential pitfalls. For instance, in the COX gene family (table 2), we found that the imperfect radical cluster-specific sites may be more important than conserved cluster-specific sites to understand the mechanism of functional divergence for this family.

As shown in tables 2 and 3, our case studies have demonstrated the critical role of amino acid classification in predicting type-II functional divergence as it determines radical or conserved change in the early stage. There have been extensive discussions about this issue (e.g., Atchley et al. 2005Go). As a starting point, the current system classifies 20 amino acids into 4 groups (charge positive, charge negative, hydrophilic, and hydrophobic) to characterize major types of radical amino acid substitutions, which may be sufficient for most protein families. However, the rationale of classification may be questionable in some cases. For instance, we categorize histidine (H) as charge positive. In fact, the isoelectric point of H is close to physiological pH (~ 7.5) and the pKa is ~ 6.0. So the positive charge on H is "very" sensitive to pH changes. As a result, in some areas of the cell, H is positively charged, whereas it is not the case in other areas. Indeed, the classification of 2 groups (positive and negatively charged residues) is somewhat exaggerated as, frequently, these residues are embedded in local environment that significantly modifies the pKa and make both groups of residues protonated. Consequently, they are equivalent to serve as H-bond donor or acceptor. Another example is site 419 in table 3: this site may be incorrectly highlighted as "radical" because glycine (G) and alanine (A) are the 2 smallest amino acids, if the functional constraints that act on this site require that only small amino acids occupy this site regardless of their physiochemical properties.


View this table:
[in this window]
[in a new window]

 
Table 3 Summary of Amino Acid Changes in 22 Radical Cluster-Specific Positions Associated with the Divergence of COX1 and COX2

 
Given these sophisticated cases, the software DIVERGE2 has the option of amino acid classification for the user. It should be emphasized here that biological interpretations of type-II site predictions are based on the specified amino acid classification. One possible improvement in the future is to adopt the hierarchical amino acid classification, for example, Murphy et al. (2000)Go and Li et al. (2003)Go; the former essentially clusters PAM/BLOSUM matrix, and the latter clusters residues based on biophysics of folding. Hence, such approach allows the user to group amino acid under a given cutoff, providing a systematic approach to test whether the predicted functional residues are sensitive to amino acid grouping.

Because our method relies on the ancestral sequence inference, the accuracy of ancestral state estimations may cause some potential problems. It should be noted that the statistical method we developed for estimating {theta}II and other parameters only utilized the frequencies of amino acid differences in the early stage, rather than the inferred ancestral characters. Computer simulations conducted by Zhang and Nei (1997)Go indicated that the frequencies of amino acid substitutions between ancestral nodes are almost unbiased, though the inferred amino acid residues at some sites may be incorrect. Moreover, in calculating the site-specific profile for prediction, we are more interested in radical/conserved cluster-specific sites or imperfect cluster-specific sites. In these cases, ancestral sequence inference is almost 100% correct. In addition, the number of amino acid substitutions in each cluster is estimated by Gu and Zhang's (1997)Go algorithm, and this approach accounts for multiple hits. Further, we have used a modified U method without invoking the ancestral sequence inference: the early-stage distance (d) between internal nodes (a and b) is approximated by the branch length estimated by the Neighbor-Joining phylogeny. In the cases of COX and G-alpha gene families, the results are very similar (not shown).

Among many statistical problems that remain to be solved in the phylogeny-based functional analysis of protein sequences, one common issue is how to model functional divergence of insertion and deletions (indels), which is important because the protein structure is more strongly affected by indels than by point substitutions. We are developing a Poisson model to integrate indel events into the framework of type-II functional divergence. This simple approach will treat indel as a 21st character, and its evolutionary rate will be indel-length (k) dependent. Gu and Li (1995)Go found a power law for the size distribution of indels. Thus, one may assume the rate of k indels can be written as µk ~ k–ß, where ß {approx} 2 (Gu and Li 1995Go).

To accomplish the goal of phylogenomic annotation of gene functions, we need many so-called evolutionary models of protein function to link between the biochemical–physiological perspective and the evolutionary pattern of sequence. Type-I and type-II functional divergences, for instance, are 2 special models for this purpose. The statistical method we have described in this article can be integrated into a relational protein database for cluster-specific amino acid residues of, for example, transcription factors, to help in the better understanding of protein function.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Theory
 A Simple Robust Method
 Software and Examples
 Discussion
 Acknowledgements
 References
 
The author is grateful to Jie Liang, Eric Gaucher, and Kent Vander Velden for constructive comments. This work was partially supported by the National Institutes of Health grants.


    Footnotes
 
Naoko Takezaki, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Theory
 A Simple Robust Method
 Software and Examples
 Discussion
 Acknowledgements
 References
 

    Atchley WR, Zhao J, Fernandes A, Drueke T. 2005. Solving the sequence ‘metric’ problem. Proc Natl Acad Sci 102:6395–400.[Abstract/Free Full Text]

    Dayhoff MO, Schwartz RM, Orcutt BC. 1978. A model of evolutionary change in proteins. In: Dayhoff MO, editor. Atlas of protein sequence structure. Volume 5. Suppl 3. Washington, DC: National Biomedical Research Foundation. p 342–52.

    Eisen MB, Fraser CM. 2003. Phylogenomics: intersection of evolution and genomics. Science 300:1706–7.[Abstract/Free Full Text]

    Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–76.[CrossRef][ISI][Medline]

    Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–45.[Abstract/Free Full Text]

    Gao X, Vander Velden K, Voytas DF, Gu X. 2005. SplitTester: software to identify domains responsible for functional divergence in protein family. BMC Bioinformatics 6:137.[Medline]

    Gaucher EA, Gu X, Miyamoto M, Benner S. 2002. Predicting functional divergence in protein evolution by site-specific rate shifts. Trend Biochem Sci 27:315–21.[CrossRef][ISI][Medline]

    Golding GB, Dean AM. 1998. The structural basis of molecular adaptation. Mol Biol Evol 15:355–69.[Abstract]

    Gribaldo S, Casane D, Lopez P, Philippea H. 2003. Functional divergence prediction from evolutionary analysis: a case study of vertebrate hemoglobin. Mol Biol Evol 20:1754–9.[Abstract/Free Full Text]

    Gu X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol Biol Evol 16:1664–74.[Abstract]

    Gu X. 2001. Maximum likelihood approach for gene family evolution under functional divergence. Mol Biol Evol 18:453–64.[Abstract/Free Full Text]

    Gu X. 2003. Functional divergence in protein (family) sequence evolution. Genetica 118:133–41.[CrossRef][ISI][Medline]

    Gu X, Fu YX, Li WH. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol Biol Evol 12:546–57.[Abstract]

    Gu X, Li WH. 1995. The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J Mol Evol 40:464–73.[CrossRef][ISI][Medline]

    Gu X, Vander Velden K. 2002. DIVERGE: phylogeny-based analysis for functional-structural divergence of a protein family. Bioinformatics 18:500–1.[Abstract/Free Full Text]

    Gu X, Zhang J. 1997. A simple method for estimating the parameter of substitution rate variation among sites. Mol Biol Evol 14:1106–13.[Abstract]

    Jones DT, Taylor WR, Thornton JM. 1992. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275–82.[Abstract/Free Full Text]

    Jordan K, Bishop GR, Gonzalez DS. 2001. Sequence and structural aspects of functional diversification in class I {alpha}-mannosidase evolution. Bioinformatics 17:965–76.[Abstract/Free Full Text]

    Kimura M. 1983. The neutral theory of molecular evolution. Cambridge: Cambridge University Press.

    Knudsen B, Miyamoto MM. 2001. A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc Natl Acad Sci 98:14512–7.[Abstract/Free Full Text]

    Landgraf R, Xenarios I, Eisenberg D. 2001. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol 307:1487–502.[CrossRef][ISI][Medline]

    Li X, Hu C, Liang J. 2003. Simplicial edge representation of protein structures and alpha contact potential with confidence measure. Proteins 53:792–805.[CrossRef][ISI][Medline]

    Lichtarge O, Bourne HR, Cohen FE. 1996. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 257:342–58.[CrossRef][ISI][Medline]

    Lopez P, Forterre P, Philippe H. 1999. The root of the tree of life in the light of the covarion model. J Mol Evol 49:496–508.[CrossRef][ISI][Medline]

    Madabushi S, Gross AK, Philippi A, Meng EC, Wensel TG, Lichtarge O. 2004. Evolutionary trace of G protein-coupled receptors reveals clusters of residues that determine global and class-specific functions. J Biol Chem 279:8126–32.[Abstract/Free Full Text]

    Murphy L, Wallqvist A, Levy R. 2000. Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein 13:149–52.

    Ohno S. 1970. Evolution by gene duplication. Berlin: Springer-Verlag.

    Rastogi S, Liberles D. 2005. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol 5:28.[CrossRef][Medline]

    Sun H, Merugu S, Gu X, Kang Y, Dickinson DP, Callaerts P, Li W-H. 2002. Identification of essential amino acid changes in paired domain evolution using a novel combination of evolutionary analysis and in vitro and in vivo studies. Mol Biol Evol 19:1490–500.[Abstract/Free Full Text]

    Tseng YY, Liang J. 2006. Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: a Bayesian Monte Carlo approach. Mol Biol Evol 23:421–36.[Abstract/Free Full Text]

    Wang Y, Gu X. 2001. Predicting functional divergence of caspase gene family. Genetics 158:1311–20.[Abstract/Free Full Text]

    Yang Z, Kumar S, Nei M. 1995. A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141:1641–50.[Abstract]

    Zhang J, Nei M. 1997. Accuracies of ancestral amino acid sequence inferred by the parsimony, likelihood, and distance method. J Mol Evol 44:S139–46.

Accepted for publication July 6, 2006.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
C. Hernandez-Sanchez, A. Mansilla, F. de Pablo, and R. Zardoya
Evolution of the Insulin Receptor Family and Receptor Isoform Expression in Vertebrates
Mol. Biol. Evol., June 1, 2008; 25(6): 1043 - 1053.
[Abstract] [Full Text] [PDF]


Home page
J. Biol. Chem.Home page
P. Courville, E. Urbankova, C. Rensing, R. Chaloupka, M. Quick, and M. F. M. Cellier
Solute Carrier 11 Cation Symport Requires Distinct Residues in Transmembrane Helices 1 and 6
J. Biol. Chem., April 11, 2008; 283(15): 9651 - 9658.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
K. Ye, G. Vriend, and A. P. IJzerman
Tracing evolutionary pressure
Bioinformatics, April 1, 2008; 24(7): 908 - 915.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
T. Matsuo
Rapid Evolution of Two Odorant-Binding Protein Genes, Obp57d and Obp57e, in the Drosophila melanogaster Species Group
Genetics, February 1, 2008; 178(2): 1061 - 1072.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
K. Ye, K. Anton Feenstra, J. Heringa, A. P. IJzerman, and E. Marchiori
Multi-RELIEF: a method to recognize specificity determining residues from multiple sequence alignments using a Machine-Learning approach for feature weighting
Bioinformatics, January 1, 2008; 24(1): 18 - 25.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
23/10/1937    most recent
msl056v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow