MBE Advance Access originally published online on November 23, 2006
Molecular Biology and Evolution 2007 24(2):349-351; doi:10.1093/molbev/msl181
Letter
Quaternary Structure Constraints on Evolutionary Sequence Divergence

* Centro de Estudios e Investigaciones, Universidad Nacional de Quilmes, Bernal, Argentina
Instituto Nacional de Investigaciones Fisicoquímicas Teóricas y Aplicadas, Universidad Nacional de La Plata, La Plata, Argentina
E-mail: jechave{at}inifta.unlp.edu.ar.
| Abstract |
|---|
The structurally constrained protein evolution (SCPE) model simulates protein divergence considering protein structure explicitly. The model is based on the observation that protein structure is more conserved during evolution than the sequences encoding that structure. In previous work, the SCPE model considered only the tertiary structure. Here we show that the performance of the model is enhanced when the oligomeric structure is taken into account. Our results agree with recent evolutionary studies of oligomeric proteins, which show that conservation of the quaternary structure imposes additional constraints on sequence divergence. The incorporation of protein-protein interactions into protein evolution models may be important in the study of quaternary protein structures and complex protein assemblies.
Key Words: SCPE; quaternary structure; sequence divergence
A major constraint in protein sequence divergence is the conservation of protein structure. This constraint is related to the selective pressure involved in the conservation of protein cores, which involve a complex network of interresidue noncovalent interactions (Russell and Barton 1994; Sali and Overington 1994). This network of interactions is among the predominant factors involved in the protein-folding process to obtain a stable fold (Lim and Sauer 1991; Lattman et al. 1994; Xu et al. 1998; Vendruscolo et al. 2000). These observations explain the fact that the amino acid substitution pattern for a given site depends on its structural environment and also that residue substitutions in interacting sites are correlated (Overington et al. 1990; Overington 1992; Pollock and Taylor 1997).
Noncovalent interactions between amino acid side chains are important for the correct assembly of folded chains into multichain proteins. The protein-protein interactions in these complex proteins can be permanent or transient. In the first case, proteins exist only in their complexed form, which is usually very stable. Transient complexes, on the other hand, associate and dissociate in vivo according to the environment or to the presence of external factors and involve proteins that also exist as independent entities (Jones and Thornton 1996; Nooren and Thornton 2003). The emerging picture suggests that the residues involved in these protein-protein interactions are evolutionarily constrained because of the selective pressure to conserve the structure of the complex and thereby ensure the conservation of biological activity (Ofran and Rost 2003; Caffrey et al. 2004; Halperin et al. 2004; Li et al. 2004; Mintseris and Weng 2005). It was also found that for close homologues (30-40% or higher sequence identity), the protein-protein interactions are invariably the same (Aloy et al. 2003).
To study how protein structure modulates sequence divergence, we developed the structurally constrained protein evolution (SCPE) model (Parisi and Echave 2001). The SCPE model simulates sequence divergence constrained by conservation of protein structure. Recently, we successfully applied the SCPE model to representatives of the 4 main classes of protein fold (alpha, beta, alpha + beta, and alpha/beta) (Parisi and Echave 2005). Using substitution matrices derived from SCPE simulations (Fornasari et al. 2002), we found that the SCPE model outperforms site-independent models such as JTT (Jones et al. 1992). In all these studies, the SCPE model considered only the tertiary structure of the protein. Here we extend the model to include protein-protein interactions and show that the performance of the model improves.
We briefly describe the algorithm of the SCPE model (a more detailed description can be found elsewhere [Parisi and Echave 2001; Fornasari et al. 2002; Parisi and Echave 2005]). In SCPE simulations, trial sequences are generated by introducing a random mutation in a reference sequence, which at the beginning of the simulation is equal to a sequence of known structure. Mutations are introduced using an amino acid mutational rate matrix Qmut derived using the HKY model of DNA evolution (Hasegawa et al. 1985) and the universal genetic code (see below). For each trial, a score that measures the structural perturbation is calculated from the mean-field energies of the reference and trial sequences. Mean-field energies are calculated using a contact map representation of the protein structure and an empirical contact potential (Berrera et al. 2003). Trial sequences are then accepted or rejected using an acceptance probability function (P) to generate in each round a new reference sequence. The SCPE model has only one parameter that must be fit to the data for each homologous set; it is related to the degree of selection pressure for structural conservation.
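The accept/reject cycle described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the SCPE implementation: the toy hydrophobic contact potential, the uniform choice of mutations (instead of the Qmut matrix), the absolute energy difference used as the score, the exponential form of the acceptance probability, and the name `selection_param` are all assumptions introduced here for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWC")


def toy_potential(a, b):
    """Hypothetical contact energy: favorable only for hydrophobic pairs."""
    return -1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0


def mean_field_energy(seq, contacts):
    """Sum the contact potential over every (i, j) pair in the contact map."""
    return sum(toy_potential(seq[i], seq[j]) for i, j in contacts)


def scpe_step(reference, contacts, selection_param, rng):
    """Propose one point mutation; return (new_reference, accepted)."""
    site = rng.randrange(len(reference))
    # ASSUMPTION: uniform proposals stand in for the Qmut mutation matrix.
    new_aa = rng.choice([a for a in AMINO_ACIDS if a != reference[site]])
    trial = reference[:site] + new_aa + reference[site + 1:]
    # ASSUMPTION: score = absolute energy difference between trial and reference.
    score = abs(mean_field_energy(trial, contacts)
                - mean_field_energy(reference, contacts))
    # ASSUMPTION: acceptance probability decays exponentially with the score.
    p_accept = math.exp(-score / selection_param)
    if rng.random() < p_accept:
        return trial, True
    return reference, False


def simulate(start, contacts, selection_param, n_steps, seed=0):
    """Run n_steps rounds and record the reference sequence after each round."""
    rng = random.Random(seed)
    reference = start
    trajectory = [reference]
    for _ in range(n_steps):
        reference, _accepted = scpe_step(reference, contacts, selection_param, rng)
        trajectory.append(reference)
    return trajectory


# Toy run: an 8-residue sequence with three contacts.
trajectory = simulate("AVLKDEGI", [(0, 1), (1, 2), (2, 7)],
                      selection_param=0.5, n_steps=200)
print(trajectory[-1])
```

In this sketch a smaller `selection_param` rejects more perturbing mutations, which mimics stronger selection for structural conservation. Rejected rounds repeat the reference sequence in the trajectory, so the record can later be used to count both replacements and unchanged steps at each site.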
Site-specific substitution matrices are derived from a matrix of counts for each site: for $i \neq j$, $N^{p}_{ij}$ is half the number of mutational steps that result in either $i \to j$ or $j \to i$ amino acid replacements at site p, and $N^{p}_{ii}$ is the number of mutational steps for which amino acid i remains constant. The substitution rate matrix Qp is then obtained from these counts.
Finally, to avoid numerical problems, each Qp is recalculated using pseudocounts as described previously (Parisi and Echave 2005).
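The counting rule for a single site can be illustrated as follows. This is a minimal sketch that covers only the symmetric count matrix as defined in the text; the conversion of counts into a rate matrix and the pseudocount correction are not reproduced here, and the function name `site_count_matrix` and the input format (the amino acid occupying one site after each mutational step) are hypothetical.

```python
from collections import defaultdict


def site_count_matrix(states):
    """Build the symmetric count matrix for one site.

    `states` lists the amino acid at the site after each mutational step.
    Off-diagonal entries receive half a count in each direction, so that
    N[i, j] = N[j, i] = (steps i->j + steps j->i) / 2.
    Diagonal entries count the steps in which the amino acid is unchanged.
    """
    counts = defaultdict(float)
    for prev, curr in zip(states, states[1:]):
        if prev == curr:
            counts[(prev, prev)] += 1.0
        else:
            counts[(prev, curr)] += 0.5
            counts[(curr, prev)] += 0.5
    return dict(counts)


# Toy history for one site: A stays A, A->V, V stays V, V->A.
counts = site_count_matrix("AAVVA")
for pair in sorted(counts):
    print(pair, counts[pair])
```

Splitting each replacement evenly between the two directions makes the count matrix symmetric by construction, which matches the "half the number of steps" definition above.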
In this paper, a series of position-specific Qp were calculated using 2 alternative models: one considers only the tertiary structure of a protein (SCPEt) and the other considers its quaternary structure (SCPEq). Seven homooligomeric protein families were used as test systems for model comparisons. These families adopt different quaternary structures and also belong to different fold classes (table 1). As described previously (Parisi and Echave 2004, 2005), for each family a set of homologous DNA sequences was collected and a maximum parsimony (MP) topology was inferred using DNAPARS (DNA parsimony program) (Felsenstein 1993). These sequences were aligned, and Hasegawa-Kishino-Yano (HKY) parameters were estimated using hypothesis testing using phylogenies (HYPHY) (Pond et al. 2005). This alignment was translated using the universal code to obtain a protein alignment. The alignment length was adjusted to fit the reference sequence length. With this protein alignment, an MP topology was obtained using PROTPARS (Felsenstein 1993). This protein alignment and the derived MP topology are used to evaluate the likelihood of the models, as described below.
For each test system, we obtained, using SCPEt and SCPEq simulations, a set of Qp over a grid of values of the model parameter. With maximum likelihood (ML) calculations, we obtained the ML for each parameter value in the set. Then, both models were compared using parametric bootstrapping with a likelihood ratio test statistic (see Goldman 1993). The best parameter value for each model was used, and because the structural representation of the protein is not the same in SCPEt and SCPEq, the best value may differ between the two models. All the ML optimizations were performed independently for the different sites of the protein using the program HYPHY (Pond et al. 2005).
For each representative set of sequences, the statistic $2\delta_{\mathrm{data}} = 2(\ln \mathrm{ML}_{\mathrm{SCPEq}} - \ln \mathrm{ML}_{\mathrm{SCPEt}})$ was calculated, using SCPEt as the null hypothesis and SCPEq as the alternative. To assess the significance of $2\delta_{\mathrm{data}}$, we simulated 300 data sets (parametric bootstrapping) under the null hypothesis to obtain the $2\delta$ reference distribution. The significance of $2\delta_{\mathrm{data}}$ was then evaluated by calculating a Z score as follows:

$$Z = \frac{2\delta_{\mathrm{data}} - \mu}{\sigma},$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the $2\delta$ distribution obtained by parametric bootstrapping.
In the first column of table 2, we show the results of the comparison of both models using the likelihood ratio test. The SCPEq model outperforms the SCPEt model in all cases with high statistical significance ($P < 10^{-2}$). The main difference between the 2 models is that SCPEq takes into account the interactions between the chains in the oligomeric structure, so the constraints imposed on positions involved in intermonomer interactions differ from those of the SCPEt model, which does not consider these interactions. These positions, called quaternary positions (QP), are detected by the difference in the total number of contacts per position between the contact matrices obtained using the quaternary structure (SCPEq) and just the tertiary structure (SCPEt). Positions with the same number of contacts in the 2 models are called tertiary positions (TP). To study the reason for the enhanced performance of SCPEq over SCPEt, we studied the $2\delta_{\mathrm{data}}$ distributions for QP and TP. Using the Kolmogorov-Smirnov test, we found that these distributions are significantly different for all the protein families considered. This test was chosen because it has the advantage of making no assumption about the distribution of the data. Moreover, the average of $2\delta_{\mathrm{data}}$ for QP shows a bias toward positive values, whereas for the corresponding TP values these averages are approximately centered around $2\delta_{\mathrm{data}} = 0$, as can be seen in table 2. We should note, however, that in 1 family (4-oxalocrotonate tautomerase, see table 2), the average $2\delta_{\mathrm{data}}$ for TP departs from zero more than would be expected, probably indicating that the TPs could be influenced by contacts with quaternary sites. This hypothesis requires additional studies that will be addressed in the future.
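The QP/TP classification described above amounts to comparing contact counts per position between two contact lists. The sketch below assumes, purely for illustration, that each contact is stored as a pair of `(chain_id, position)` tuples and that chain "A" is the reference chain; these conventions and the function names are hypothetical.

```python
def contact_counts(n_sites, contacts, chain="A"):
    """Count the contacts made by each position of the reference chain."""
    counts = [0] * n_sites
    for (chain1, pos1), (chain2, pos2) in contacts:
        if chain1 == chain:
            counts[pos1] += 1
        if chain2 == chain:
            counts[pos2] += 1
    return counts


def classify_positions(tertiary_counts, quaternary_counts):
    """Label a position QP if its contact count changes in the oligomer, else TP."""
    if len(tertiary_counts) != len(quaternary_counts):
        raise ValueError("contact count lists must have the same length")
    return ["QP" if q != t else "TP"
            for t, q in zip(tertiary_counts, quaternary_counts)]


# Toy 4-residue chain: intrachain contacts only (tertiary structure).
tertiary = [(("A", 0), ("A", 1)), (("A", 1), ("A", 2)), (("A", 2), ("A", 3))]
# The same contacts plus two contacts with a second chain "B" (quaternary structure).
quaternary = tertiary + [(("A", 1), ("B", 3)), (("B", 0), ("A", 3))]

labels = classify_positions(contact_counts(4, tertiary),
                            contact_counts(4, quaternary))
print(labels)
```

Only contacts that involve the reference chain contribute to its counts, so contacts within other chains do not affect the classification.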
In summary, we have shown that the SCPEq model significantly outperforms SCPEt and that this improvement rests on the better modeling of QP when the quaternary structure is considered. This improvement shows the importance of the conservation of quaternary structure as one of the factors constraining sequence divergence during evolution. Thus, the oligomeric state of a protein should be taken into account to improve the quality of evolutionary models. Taking these constraints into account will improve our understanding of the forces involved in the formation and evolution of protein complexes.
| Acknowledgements |
|---|
We thank Jeff Thorne and an anonymous reviewer for their useful remarks, which resulted in an improved version of the manuscript. This work was partially supported by grants from Universidad Nacional de Quilmes, Agencia Nacional de Promoción Científica y Tecnológica, and Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET). G.P. and J.E. are members of CONICET.
| Footnotes |
|---|
Spencer V. Muse, Associate Editor
| References |
|---|
Aloy P, Ceulemans H, Stark A, Russell RB. (2003) The relationship between sequence and interaction divergence in proteins. J Mol Biol 332:989-998.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. (2000) The Protein Data Bank. Nucleic Acids Res 28:235-242.
Berrera M, Molinari H, Fogolari F. (2003) Amino acid empirical contact energy definitions for fold recognition in the space of contact maps. BMC Bioinformatics 4:8.
Caffrey DR, Somaroo S, Hughes JD, Mintseris J, Huang ES. (2004) Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci 13:190-202.
Felsenstein J. (1993) PHYLIP (phylogeny inference package). Version 3.5c. Distributed by the author. (Department of Genetics, University of Washington, Seattle (WA)).
Fornasari MS, Parisi G, Echave J. (2002) Site-specific amino acid replacement matrices from structurally constrained protein evolution simulations. Mol Biol Evol 19:352-356.
Goldman N. (1993) Simple diagnostic statistical tests of models for DNA substitution. J Mol Evol 37:650-661.
Halperin I, Wolfson I, Nussinov R. (2004) Protein-protein interactions; coupling of structurally conserved residues and of hot spots across interfaces. Implications for docking. Structure 12:1027-1038.
Hasegawa M, Kishino H, Yano T. (1985) Dating of the human-ape splitting by molecular clock of mitochondrial DNA. J Mol Evol 22:160-174.
Jones DT, Taylor WR, Thornton JM. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275-282.
Jones S and Thornton JM. (1996) Principles of protein-protein interactions. Proc Natl Acad Sci USA 93:13-20.
Lattman EE, Fiebig KM, Dill KA. (1994) Modeling compact denatured states of proteins. Biochemistry 33:6158-6166.
Li X, Keskin O, Ma B, Nussinov R, Liang J. (2004) Protein-protein interactions: hot spots and structurally conserved residues often locate in complemented pockets that pre-organized in the unbound states: implications for docking. J Mol Biol 344:781-795.
Lim WA and Sauer RT. (1991) The role of internal packing interactions in determining the structure and stability of a protein. J Mol Biol 219:359-376.
Mintseris J and Weng Z. (2005) Structure, function, and evolution of transient and obligate protein-protein interactions. Proc Natl Acad Sci USA 102:10930-10935.
Murzin AG, Brenner SE, Hubbard T, Chothia C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536-540.
Nooren IM and Thornton JM. (2003) Diversity of protein-protein interactions. EMBO J 22:3486-3492.
Ofran Y and Rost B. (2003) Analysing six types of protein-protein interfaces. J Mol Biol 325:377-387.
Overington J. (1992) Structural constraints on residue substitution. Genet Eng 14:231-249.
Overington J, Johnson MS, Sali A, Blundell TL. (1990) Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc Biol Sci 241:132-145.
Parisi G and Echave J. (2001) Structural constraints and emergence of sequence patterns in protein evolution. Mol Biol Evol 18:750-756.
Parisi G and Echave J. (2004) The structurally constrained protein evolution model accounts for sequence patterns of the LbetaH superfamily. BMC Evol Biol 4:41.
Parisi G and Echave J. (2005) Generality of the structurally constrained protein evolution model: assessment on representatives of the four main fold classes. Gene 345:45-53.
Pollock DD and Taylor WR. (1997) Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng 10:647-657.
Pond SL, Frost SD, Muse SV. (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics 21:676-679.
Russell RB and Barton GJ. (1994) Structural features can be unconserved in proteins with similar folds. An analysis of side-chain to side-chain contacts secondary structure and accessibility. J Mol Biol 244:332-350.
Sali A and Overington JP. (1994) Derivation of rules for comparative protein modeling from a database of protein structure alignments. Protein Sci 3:1582-1596.
Vendruscolo M, Mirny LA, Shakhnovich EI, Domany E. (2000) Comparison of two optimization methods to derive energy parameters for protein folding: perceptron and Z score. Proteins 41:192-201.
Xu J, Baase WA, Baldwin E, Matthews BW. (1998) The response of T4 lysozyme to large-to-small substitutions within the core and its relation to the hydrophobic effect. Protein Sci 7:158-177.