Molecular Biology and Evolution 18:750-756 (2001)
© 2001 Society for Molecular Biology and Evolution
ARTICLE
Structural Constraints and Emergence of Sequence Patterns in Protein Evolution
Universidad Nacional de Quilmes, Bernal, Argentina
| Abstract |
|---|
The aim of this work was to study the relationship between structure conservation and sequence divergence in protein evolution. To this end, we developed a model of structurally constrained protein evolution (SCPE) in which trial sequences, generated by random mutations at gene level, are selected against departure from a reference three-dimensional structure. Since at the mutational level SCPE is completely unbiased, any emergent sequence pattern will be due exclusively to structural constraints. In this first report, it is shown that SCPE correctly predicts the characteristic hexapeptide motif of the left-handed parallel ß helix (LßH) domain of UDP-N-acetylglucosamine acyltransferases (LpxA).
| Introduction |
|---|
Protein sequences diverge due to amino acid replacements with a mostly neutral effect on organism fitness (Kimura 1983).
| Materials and Methods |
|---|
The Model
In SCPE, trial sequences, generated by random mutations at the gene level, are selected against departure from a reference structure. The model is based on a sequence-structure distance score, Sdist, which depends on a reference native structure, and a parameter, Sdiv, which measures the degree of structural divergence tolerated by natural selection. The sequence-structure distance measure Sdist is calculated as follows. First, the trial sequence is forced to adopt the three-dimensional reference structure. Then, mean-field energies per position Etrial(p) and Eref(p) are calculated for the trial and reference sequences, respectively. Finally, $S_{dist} = \{\sum_{p} [E_{trial}(p) - E_{ref}(p)]^{2}\}^{1/2}$ is obtained. To calculate the mean-field energies, we used the PROSA II potential (Sippl 1993).
An SCPE simulation starts with a reference DNA sequence that codes for a reference protein of known three-dimensional structure. Then, each run involves the repetition of evolutionary time steps, which consist of the application of the following four operations. First, the DNA sequence of the previous time step is mutated by introducing a random nucleotide substitution into a randomly chosen sequence position (Jukes-Cantor model). Second, if the mutation introduces a stop codon, the mutated DNA is rejected; otherwise, the mutated DNA is translated, using the genetic code, to obtain a trial protein sequence. Third, the sequence-structure distance score, Sdist, is computed. Finally, the trial sequence is accepted only if Sdist is below the specified cut-off, Sdiv, which represents the degree of structural divergence allowed by natural selection.
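The four operations of a time step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_energy` is a hypothetical placeholder for the PROSA II mean-field energy profile, the function names are invented for this sketch, and a rejected trial is assumed to leave the sequence of the previous time step unchanged.

```python
import math
import random

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code; '*' marks stop codons.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate a DNA string (length divisible by 3) into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def toy_energy(protein):
    """Hypothetical per-position energy; a real run would use a structure-based potential."""
    return [float(ord(residue) % 7) for residue in protein]


def s_dist(trial, reference, energy):
    """Square root of the summed squared per-position energy differences."""
    e_trial, e_ref = energy(trial), energy(reference)
    return math.sqrt(sum((t - r) ** 2 for t, r in zip(e_trial, e_ref)))


def scpe_step(dna, ref_protein, energy, s_div, rng):
    """One evolutionary time step; returns the accepted DNA or the unchanged input."""
    pos = rng.randrange(len(dna))
    # Jukes-Cantor: every substitution to a different base is equally likely.
    new_base = rng.choice([b for b in BASES if b != dna[pos]])
    mutated = dna[:pos] + new_base + dna[pos + 1:]
    trial = translate(mutated)
    if "*" in trial:                       # stop codon introduced: reject
        return dna
    if s_dist(trial, ref_protein, energy) < s_div:
        return mutated                     # within tolerated divergence: accept
    return dna                             # too far from the reference: reject


def run_scpe(ref_dna, energy, s_div, n_steps, seed=0):
    """Repeat time steps starting from the reference DNA sequence."""
    rng = random.Random(seed)
    ref_protein = translate(ref_dna)
    dna = ref_dna
    for _ in range(n_steps):
        dna = scpe_step(dna, ref_protein, energy, s_div, rng)
    return dna
```

With `s_div = 0.0` no trial can satisfy Sdist < Sdiv, so the sequence never changes; with `s_div = math.inf` every mutation that does not introduce a stop codon is accepted.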
Test System
The SCPE model was tested on the left-handed parallel ß helix (LßH) domain of UDP-N-acetylglucosamine acyltransferases (LpxA), which displays a distinctive sequence pattern that is likely to result from structural constraints. The reference for SCPE simulations was the LpxA of Escherichia coli (Raetz and Roderick 1995) (fig. 1A and B; PDB code 1lxa). The sequences of the LßH domain of members of the LpxA family (fig. 1C) consist of the imperfect tandem repetition of hexapeptide units (Vaara 1992; Vuorio et al. 1994; Raetz and Roderick 1995). The hexapeptides are characterized by a high degree of conservation of the third position, which usually displays I, L, or V (a one-letter code is used to designate amino acids). Hexapeptide position 1 is also significantly conserved, although less so than position 3, whereas the other four hexapeptide sites (2, 4, 5, and 6) are not conserved. Figure 1B shows that the residues of conserved sites 1 and 3 point toward the inside of the beta helix, whereas those in variable positions point toward the outside. The LpxA family belongs to a larger superfamily of LßH acyltransferases. All members of this superfamily present the hexapeptide sequence motif, and those members whose structures have been determined display the LßH fold (Raetz and Roderick 1995; Kisker et al. 1996; Beaman et al. 1997; Beaman, Sugantino, and Roderick 1998; Brown et al. 1999). Thus, both the LßH structure and the hexapeptide motif are highly conserved, despite the considerable divergence in sequence and function observed in the LßH superfamily (Parisi, Fornasari, and Echave 2000).
Probability Distributions
To compare the outcome of our simulations with the sequence patterns of actual sequences, we used amino acid probability distributions and entropies. The probability distributions for each of the sequence site classes s were calculated as follows. First, the sequences to be used to estimate the distribution were aligned. Second, a matrix H was built, where H(p, a) = 1 if amino acid a is found at column p of the multiple-sequence alignment, and H(p, a) = 0 otherwise. Finally, $P(s, a) = \sum_{p \in s} H(p, a) / \sum_{p \in s} \sum_{a=1}^{20} H(p, a)$ was calculated, where $p \in s$ indicates that the sum is limited to sequence positions that belong to the same class.
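The site-class distribution described above can be computed with a few lines of Python. The sketch below is illustrative only: the function name, the `site_class` list that labels each alignment column, and the decision to ignore gap characters are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_class_distribution(alignment, site_class, target):
    """Return P(s, a) for site class `target` as a dict over the 20 amino acids.

    alignment  : list of equal-length aligned sequences
    site_class : site_class[p] is the class label of alignment column p
    H(p, a) is 1 if amino acid a occurs at least once in column p, else 0.
    """
    counts = Counter()
    for p, label in enumerate(site_class):
        if label != target:
            continue
        # Set of amino acids present in column p (gap symbols are dropped).
        present = {seq[p] for seq in alignment} & set(AMINO_ACIDS)
        counts.update(present)             # adds H(p, a) = 1 for each a present
    total = sum(counts.values())
    return {a: (counts[a] / total if total else 0.0) for a in AMINO_ACIDS}
```

For a hexapeptide repeat, one possible labeling is `site_class = [(p % 6) + 1 for p in range(n_columns)]`, which assumes that the alignment starts at hexapeptide position 1 and contains no insertions that shift the register.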
Let P(a; M) and P(a; D) be, respectively, a simulated distribution obtained with model M and the distribution obtained from experimental data set D. Then, the goodness of fit between the model and the data was measured using $z_P(M, D) = [\varepsilon(M, D) - \bar{\varepsilon}(M)]/\sigma_{\varepsilon}(M)$, where the error was defined as $\varepsilon(M, D) = \sum_{a} \{[P(a; M) - P(a; D)]^{2}/[P(a; M) + P(a; D)]\}$, and $\bar{\varepsilon}(M)$ and $\sigma_{\varepsilon}(M)$ are the average and standard deviation of the errors obtained from comparing pairs of simulated runs. From such simulations, the distribution of $z_P(M, D)$ was obtained numerically, and it was found that it could be fit by a normal distribution of zero mean and unit standard deviation. To compare the abilities of two models M0 and M1 to fit the observed amino acid distribution D, we used $z_P(M_0, M_1) = [z_P(M_0, D) - z_P(M_1, D)]/\sqrt{2}$, which has a normal distribution with zero mean and unit variance.
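The goodness-of-fit score can be illustrated with a short Python sketch. Several details here are assumptions rather than statements from the text: distributions are represented as dictionaries, the model-versus-data error is averaged over the simulated runs, and the two-model comparison divides by the square root of 2, which is the factor that gives unit variance if the two z values are independent standard normal variables.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def dist_error(p, q):
    """Sum over amino acids of (p - q)^2 / (p + q); terms with p + q = 0 are skipped."""
    total = 0.0
    for a in set(p) | set(q):
        pa, qa = p.get(a, 0.0), q.get(a, 0.0)
        if pa + qa > 0:
            total += (pa - qa) ** 2 / (pa + qa)
    return total


def z_p(model_runs, data):
    """z_P(M, D) from at least three simulated distributions and one observed one."""
    # Baseline: errors between all pairs of independent simulated runs.
    pair_errors = [dist_error(a, b) for a, b in combinations(model_runs, 2)]
    mu, sigma = mean(pair_errors), stdev(pair_errors)
    # Model-versus-data error, averaged over runs (an assumption of this sketch).
    err = mean(dist_error(run, data) for run in model_runs)
    return (err - mu) / sigma


def z_p_compare(z_m0, z_m1):
    """Compare two models; assumes independent z values with unit variance."""
    return (z_m0 - z_m1) / math.sqrt(2.0)
```

A data set that resembles the simulated runs as closely as they resemble each other yields a z value near zero, whereas a large positive value indicates a poor fit.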
Entropies
The variability of each site class was characterized using the site class entropy. These entropies were calculated from the amino acid probability distributions in the usual way using $S(s) = -\sum_{a=1}^{20} P(s, a) \ln P(s, a)$.
The entropies of a model M and experimental data set D were compared using $z_S(M, D) = |[S(D) - S(M)]/\sigma_S(M)|$, where S(D) is the entropy of D, S(M) is the entropy of M averaged over independent runs, and $\sigma_S(M)$ is the corresponding standard deviation. The cumulative distribution function was found numerically from simulations to be well fitted by $P(z_S < z) = 2\Phi(z) - 1$, where $\Phi(z)$ is the normal cumulative distribution with zero mean and unit variance. As in the previous section, the abilities of two models M0 and M1 to fit the same data D can be compared using $z_S(M_0, M_1) = [z_S(M_0, D) - z_S(M_1, D)]/\sigma_{\Delta}$, where $\sigma_{\Delta}$ is the standard deviation of the difference in the numerator, so that the distribution of $z_S(M_0, M_1)$ is approximately normal with zero mean and unit variance.
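The entropy and its comparison score can be sketched as follows. The function names are illustrative, and `p_value` simply rearranges the stated cumulative distribution, P(zS < z) = 2Φ(z) - 1, into the probability of observing a score at least as large as z.

```python
import math
from statistics import mean, stdev


def entropy(dist):
    """Entropy with natural logarithm; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def z_s(model_entropies, data_entropy):
    """|S(D) - mean S(M)| divided by the standard deviation of S(M) over runs."""
    return abs((data_entropy - mean(model_entropies)) / stdev(model_entropies))


def normal_cdf(z):
    """Standard normal cumulative distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_value(z):
    """P(z_S >= z) = 1 - [2 * Phi(z) - 1] = 2 * [1 - Phi(z)]."""
    return 2.0 * (1.0 - normal_cdf(z))
```

A fully conserved site class has entropy 0, and a class in which all 20 amino acids are equally likely has entropy ln 20, about 3.0.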
| Results and Discussion |
|---|
We begin by exploring the relationship between sequence divergence and constraint for structure conservation. While the hexapeptide motif is very well conserved in the LpxA family, sequences do diverge, showing as little as 40% identity (Vuorio et al. 1994). [...] (Sdiv → 0), sequences cannot diverge at all, whereas in the limit of unconstrained evolution (Sdiv → ∞), they lead to effectively random sequences. Since Sdiv measures the tolerance of the environment to structural divergence, it is expected to depend on the protein's function. This suggests the interesting possibility of a connection between figure 2 and the recent observation of a sigmoidal dependence of function similarity on sequence similarity (Wilson, Kreychman, and Gerstein 2000).
We further tested whether the SCPE model can reproduce the characteristic variability pattern of the hexapeptide motif. The site entropy was used as a measure of the variability of a given hexapeptide site. Figure 3 shows that hexapeptide sites 1 and 3 are significantly conserved, whereas sites 2, 4, 5, and 6 are almost free to vary. It can be seen from figure 3 that for Sdiv = 6, the SCPE variability pattern is in very good agreement with the LpxA family. More importantly, the agreement is much better than that of the reference LpxA of E. coli (Sdiv = 0), which is the only information SCPE has about the LpxA family, since no member of the LßH superfamily was part of the database used to fit the PROSA II potential (Sippl 1993). [...] (Sdiv = ∞) with the same number of amino acid substitutions as the Sdiv = 6 simulation. Comparison between the Sdiv = 6 and Sdiv = ∞ cases shows that the Sdiv = 6 pattern is mostly the result of structural constraints, rather than memory effects. A similar SCPE-experimental accord was found for intermediate constraints in the range 5 < Sdiv < 10 (data not shown).
Table 1 shows a quantitative comparison of the entropies shown in figure 3. From the fourth row of this table, it is seen that SCPE with Sdiv = 6 fits the experimental LpxA entropies significantly better than E. coli (Sdiv = 0) for most sites. An exception is site 1, for which the LpxA of E. coli gives better results than the Sdiv = 6 SCPE simulations. However, when other members of the LßH superfamily are considered in the determination of the experimental pattern, Sdiv = 6 SCPE simulations also give significantly better results for hexapeptide site 1, as can be seen from the last two columns of table 1. The last row of table 1 shows that SCPE with Sdiv = 6 gives significantly better results than the unconstrained case (Sdiv = ∞) for almost all hexapeptide sites, with all sites except site 4 supporting the rejection of the unconstrained model in favor of the constrained one at significance levels below 10%.
As a final assessment, the ability of SCPE to predict the correct amino acid probability distributions for the different hexapeptide sites was evaluated. Figure 4 shows that SCPE with Sdiv = 6 (and 5 < Sdiv < 10, not shown) is in very good agreement with the observed LpxA amino acid distributions. As for variability patterns, discussed in the previous paragraphs, this is in contrast with the poorer accord found between the LpxA family and either the reference protein (LpxA of E. coli; Sdiv = 0) or the unconstrained evolution (Sdiv = ∞) case. For the key hexapeptide site 3, Sdiv = 6 SCPE simulations reveal amino acids F, M, W, Y, and C, which are not present in the reference protein. Of these, F, M, and W are confirmed predictions, since they are also present in the LpxA family. Y, which does not appear in the LpxA distribution, is also a confirmed prediction, since we found it in other LßH proteins. In general, all upward triangles in figure 4 mark amino acids predicted by SCPE that, despite not being found in LpxA, are found in other LßH families. In contrast, downward triangles indicate differences between Sdiv = 6 SCPE and LpxA distributions that could not be found in the other LßH proteins considered. Note, however, that the probabilities of most downward-triangle amino acids are so small that they are not likely to be found in a sample the size of the LßH families considered. Moreover, it is interesting to note that even though downward-triangle amino acids may arise during evolution, they are selected against in Sdiv = 6 SCPE, as compared with the unconstrained case Sdiv = ∞.
In table 2, a quantitative comparison of the amino acid distributions of figure 4 is performed. The fourth row of table 2 shows that SCPE with Sdiv = 6 fits the LpxA distributions significantly better than SCPE with Sdiv = 0 for most hexapeptide sites. As with entropies, an exception is site 1, for which the LpxA distribution is closer to that of the LpxA of E. coli than to the Sdiv = 6 SCPE distributions. As before, the situation is reversed when other members of the LßH superfamily are considered in the determination of the experimental pattern (last two columns of table 2). The last row of table 2 shows that the Sdiv = 6 SCPE gives significantly better results than the unconstrained case (Sdiv = ∞) for the conserved hexapeptide sites 1 and 3, but that the unconstrained model cannot be significantly rejected in favor of the constrained one for the variable sites 2, 4, 5, and 6.
| Conclusions |
|---|
This report presented a novel and general model of structurally constrained protein evolution, developed to study the effects of structural constraints on sequence divergence. For the LßH domain of the LpxA family, using only the sequence and structure of one of its members, the model predicts the sequence patterns characteristic of the whole family with remarkable accuracy. Clearly, the general applicability of the SCPE model to other protein families remains to be studied, but this will take some time, since the model is computationally demanding. In this report, we aimed to present the model and show its applicability by studying one example case. From a mutational point of view, the present model treats all sites and all nucleotide replacements equivalently. Therefore, the observed biases in amino acid replacement patterns are a genuine outcome of the model, showing that they result naturally from constraining structural divergence.
Three considerations should be taken into account. First, SCPE is a neutral evolution model that cannot account for adaptive amino acid substitutions. However, this is not a serious drawback, since such replacements are very rare (Perutz 1983; Golding and Dean 1998). Second, SCPE does not explicitly consider the folding pathway, whereas folding constraints are known to result in sequence conservation (Shakhnovich, Abkevich, and Ptitsyn 1996; Li, Mirny, and Shakhnovich 2000). Nevertheless, this should not be a major shortcoming, since folding seems to be largely determined by the native structure (Baker 2000). Finally, it is important to stress that introducing mutations at the gene level, rather than protein level, apart from being more realistic, makes this model potentially useful for studying issues such as the effects of nucleotide substitution biases on amino acid sequence patterns or the effects of selection at protein level on nucleotide substitution patterns.
The SCPE model can easily be improved by using a nucleotide mutation model that is more realistic than the Jukes-Cantor model. Also, different energy functions can be used to calculate the sequence-structure distance score. Finally, in the present case we accepted all sequences with scores Sdist < Sdiv and rejected those with Sdist > Sdiv, but other dependencies of the probability of acceptance on Sdist could be used.
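To make the last point concrete, the sketch below contrasts the step rule used in this report with one possible smooth alternative. The logistic form and its `width` parameter are purely hypothetical choices for illustration; the text does not propose any specific functional form.

```python
import math
import random


def accept_step(s_dist, s_div):
    """Rule used in this report: accept with probability 1 if Sdist < Sdiv, else 0."""
    return 1.0 if s_dist < s_div else 0.0


def accept_logistic(s_dist, s_div, width=1.0):
    """Hypothetical smooth rule: acceptance probability decays around Sdiv."""
    x = (s_dist - s_div) / width
    if x > 700.0:                          # guard against overflow in math.exp
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


def is_accepted(s_dist, s_div, rule, rng):
    """Draw one accept/reject decision from the given acceptance rule."""
    return rng.random() < rule(s_dist, s_div)
```

For example, `is_accepted(5.0, 6.0, accept_logistic, random.Random(1))` draws a single decision; as `width` shrinks toward zero, the logistic rule approaches the step rule.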
Regarding the dynamics of the substitutional process under the SCPE model, some of the issues that are currently being addressed in our group are (1) the site-dependent amino acid substitution probabilities under the SCPE model and their comparison with current models of protein evolution, (2) substitutional rate variation among amino acid sites, (3) correlations between the evolution of different amino acid sites, and (4) effects of structural constraints on the patterns of nucleotide substitution.
Even though our aim in building the SCPE model was to gain a better understanding of the process of molecular evolution, this model can also be useful in addressing phylogenetic inference issues. Thus, the model can be used to generate large benchmark data sets for the assessment of current probabilistic models. Furthermore, SCPE may be used to obtain structure-dependent substitution matrices and build structure-based probabilistic models that can be used, in turn, for phylogenetic inference purposes. Both issues are currently being studied in our group.
| Acknowledgements |
|---|
This work was supported by the Universidad Nacional de Quilmes and the Fundación Antorchas. J.E. is a Researcher of CONICET and a Guggenheim Fellow.
| Footnotes |
|---|
William Taylor, Reviewing Editor
1 Abbreviation: SCPE, structurally constrained protein evolution.
2 Keywords: molecular evolution, protein evolution, simulation, model
3 Address for correspondence and reprints: Julián Echave, Universidad Nacional de Quilmes, Saenz Peña 180, B1876BXD Bernal, Argentina. je{at}unq.edu.ar
| Literature Cited |
|---|
Babajide, A., I. L. Hofacker, M. J. Sippl, and P. F. Stadler. 1997. Neutral networks in protein space: a computational study based on knowledge-based potentials of mean force. Fold Des. 2:261-269.
Bairoch, A., and R. Apweiler. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28:45-48.
Bajaj, M., and T. Blundell. 1984. Evolution and the tertiary structure of proteins. Annu. Rev. Biophys. Bioeng. 13:453-492.
Baker, D. 2000. A surprising simplicity to protein folding. Nature 405:39-42.
Beaman, T. W., D. A. Binder, J. S. Blanchard, and S. L. Roderick. 1997. Three-dimensional structure of tetrahydrodipicolinate N-succinyltransferase. Biochemistry 36:489-494.
Beaman, T. W., M. Sugantino, and S. L. Roderick. 1998. Structure of the hexapeptide xenobiotic acetyltransferase from Pseudomonas aeruginosa. Biochemistry 37:6689-6696.
Brown, K., F. Pompeo, S. Dixon, D. Mengin-Lecreulx, C. Cambillau, and Y. Bourne. 1999. Crystal structure of the bifunctional N-acetylglucosamine 1-phosphate uridyltransferase from Escherichia coli: a paradigm for the related pyrophosphorylase superfamily. EMBO J. 18:4096-4107.
Chothia, C., and A. M. Lesk. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823-826.
Flores, T. P., C. A. Orengo, D. S. Moss, and J. M. Thornton. 1993. Comparison of conformational characteristics in structurally similar protein pairs. Protein Sci. 2:1811-1826.
Golding, G. B., and A. M. Dean. 1998. The structural basis of molecular adaptation. Mol. Biol. Evol. 15:355-369.
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, England.
Kisker, C., H. Schindelin, B. E. Alber, J. G. Ferry, and D. C. Rees. 1996. A left-hand beta-helix revealed by the crystal structure of a carbonic anhydrase from the archaeon Methanosarcina thermophila. EMBO J. 15:2323-2330.
Koehl, P., and M. Levitt. 1999a. De novo protein design. I. In search of stability and specificity. J. Mol. Biol. 293:1161-1181.
Koehl, P., and M. Levitt. 1999b. De novo protein design. II. Plasticity in sequence space. J. Mol. Biol. 293:1183-1193.
Koradi, R., M. Billeter, and K. Wuthrich. 1996. MOLMOL: a program for display and analysis of macromolecular structures. J. Mol. Graph. 14:51-55, 29-32.
Koshi, J. M., and R. A. Goldstein. 1998. Models of natural mutations including site heterogeneity. Proteins 32:289-295.
Li, L., L. A. Mirny, and E. I. Shakhnovich. 2000. Kinetics, thermodynamics and evolution of non-native interactions in a protein folding nucleus. Nat. Struct. Biol. 7:336-342.
Liò, P., and N. Goldman. 1998. Models of molecular evolution and phylogeny. Genome Res. 8:1233-1244.
Mirny, L. A., and E. I. Shakhnovich. 1999. Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J. Mol. Biol. 291:177-196.
Naylor, G. J., and W. M. Brown. 1997. Structural biology and phylogenetic estimation [letter]. Nature 388:527-528.
Overington, J., M. S. Johnson, A. Sali, and T. L. Blundell. 1990. Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc. R. Soc. Lond. B Biol. Sci. 241:132-145.
Parisi, G., M. Fornasari, and J. Echave. 2000. Evolutionary analysis of gamma-carbonic anhydrase and structurally related proteins. Mol. Phylogenet. Evol. 14:323-334.
Perutz, M. F. 1983. Species adaptation in a protein molecule. Mol. Biol. Evol. 1:1-28.
Raetz, C. R., and S. L. Roderick. 1995. A left-handed parallel beta helix in the structure of UDP-N-acetylglucosamine acyltransferase. Science 270:997-1000.
Shakhnovich, E., V. Abkevich, and O. Ptitsyn. 1996. Conserved residues and the mechanism of protein folding. Nature 379:96-98.
Sippl, M. J. 1993. Recognition of errors in three-dimensional structures of proteins. Proteins 17:355-362.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Tourasse, N. J., and W. H. Li. 2000. Selective constraints, amino acid composition, and the rate of protein evolution. Mol. Biol. Evol. 17:656-664.
Vaara, M. 1992. Eight bacterial proteins, including UDP-N-acetylglucosamine acyltransferase (LpxA) and three other transferases of Escherichia coli, consist of a six-residue periodicity theme. FEMS Microbiol. Lett. 76:249-254.
Vuorio, R., T. Harkonen, M. Tolvanen, and M. Vaara. 1994. The novel hexapeptide motif found in the acyltransferases LpxA and LpxD of lipid A biosynthesis is conserved in various bacteria. FEBS Lett. 337:289-292.
Wilson, C. A., J. Kreychman, and M. Gerstein. 2000. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol. 297:233-249.
Wood, T. C., and W. R. Pearson. 1999. Evolution of protein sequences and structures. J. Mol. Biol. 291:977-995.
Xia, X., and W. H. Li. 1998. What amino acid properties affect protein evolution? J. Mol. Evol. 47:557-564.