Molecular Biology and Evolution 19:352-356 (2002)
© 2002 Society for Molecular Biology and Evolution
Site-Specific Amino Acid Replacement Matrices from Structurally Constrained Protein Evolution Simulations
Universidad Nacional de Quilmes, Saenz Peña 180, B1876BXD Bernal, Argentina
Evolutionary amino acid replacement rates depend on local structural environment (Overington et al. 1992
; Koshi and Goldstein 1995
). Recent models of protein evolution aim to take such site heterogeneity into account by using site-specific amino acid replacement matrices (Lio and Goldman 1998
; Thorne 2000
). This is a difficult task, mainly because of a lack-of-data problem: too many sequences would be needed to fit the large number of parameters required to specify the replacement matrix for each site. Two strategies have been used to tackle this problem. A first type of model assumes that protein sites can be classified into a limited number of structural classes. Then, class-specific replacement matrices are obtained by analyzing large databases of sequences (Thorne, Goldman, and Jones 1996
; Lio et al. 1998
). These models are promising, provided the assumption of universality of structural classes is a reasonable one. However, such replacement matrices cannot be fitted to the actual sequence family under analysis. A second strategy is to develop models that depend on a reduced number of parameters (Halpern and Bruno 1998
; Koshi and Goldstein 1998
). These models have the advantage that they can, in principle, be fitted to the analyzed family. On the other hand, it is possible that the high reduction in the number of parameters oversimplifies such models to be generally applicable. Here we report a third strategy, which allows the determination of site-specific replacement matrices adapted to the sequence family under study, using just one member of the known structure. The lack-of-data problem is overcome by using large amounts of simulated sequence data generated using a structurally constrained protein evolution (SCPE) model developed recently (Parisi and Echave 2001
). We apply the method to a case taken as an example and compare it with other models of protein evolution.
We will deal with Markov models that assume that sequence sites evolve independently of each other. Even though this assumption is unrealistic, it is most often indispensable if the model is to be used for phylogenetic inference purposes. For a given site, an independent-sites Markov model is completely characterized by a 20 x 20 matrix Q composed by the instantaneous time-independent replacement rates Qij between amino acids i and j. In general, Q is not a symmetric matrix: Qij
Qji. However, it represents a reversible process if it satisfies
![]() |
j is the equilibrium frequency of amino acid j. Using equation (1)
, any reversible model can be adapted to the analyzed sequence family by treating the equilibrium frequencies as free parameters to be estimated from the data under study. This is usually indicated by adding +F at the end of the model name. A more detailed description of independent-sites Markov models can be found in recent reviews (Lio and Goldman 1998
Probabilistic models can be compared using likelihood-based statistical tests (Goldman 1993
; Posada 2001
; Whelan, Lio, and Goldman 2001
). Let Pr(s|T,Q) be the probability of observing a certain sequence data set s = {s1,s2,...,sN}, given a tree T and a model Q. The maximum likelihood of the model, given the data is defined as the result of maximizing such a probability with respect to T (tree topology and branch lengths) and Q (free model parameters): L = maxT maxQPr(s|T,Q). Apart from the maximum likelihood, for model comparison, it is important to consider the number of free parameters. In this work, the relative support given by the data to a model is measured by the Bayesian information criterion (BIC), specified by BIC = - 2 ln L + npln ns, where np is the number of free parameters of the model evaluated and ns is the sequence length (Schwarz 1974
; Posada 2001
). The smaller the BIC, the better the fit of the model to the data.
As stated previously, our purpose is to obtain evolutionary replacement matrices using the SCPE model. This model is based on introducing random mutations at the gene level and selecting sequences against too much structural variation. Without making a priori assumptions regarding site heterogeneity in substitution patterns, SCPE naturally predicts site-specific amino acid frequency distributions (Parisi and Echave 2001
). Not being a site-independent model, SCPE cannot be used for maximum likelihood phylogenetic inference. However, it can be employed to generate enough simulated sequence data to determine site-specific substitution matrices. These matrices, in turn, can be used to build an independent-sites model of evolution, that we shall call independent-sitesstructurally constrained protein evolution (IS-SCPE).
The IS-SCPE site-specific replacement matrices are obtained in a straightforward manner by counting substitutions in SCPE simulations. To start, we need one member of the family under study whose structure is known. From this starting point, we generate simulated lineages: independent SCPE runs. We perform several runs of several mutational steps each to generate a database of accepted replacements. Next, we assume that sequence sites can be classified into c = 1,2,...,Nclass site classes, where 1
Nclass
Nsites. Then, for each class we set up a matrix of counts: for i
j, Nijc is half the number of mutational steps that result in either i
j or j
i amino acid replacements at site class c, and Niic is the number of mutational steps for which amino acid i remains constant (i
i replacement). Finally, for each class, the replacement matrix is obtained using
|
|
The set of Qc for all site classes constitutes the IS-SCPE model.
As an illustrative example, IS-SCPE is applied to the left-handed parallel ß helix (LßH) domain of UDP-N-acetylglucosamine acyltransferases (LpxA). From a search of SWISS-PROT (Bairoch and Apweiler 2000
) 25 LpxA sequences were found: LPXA_AQUAE, LPXA_BURAB, LPXA_CAMJE, LPXA_CHLMU, LPXA_CHLPN, LPXA_CHLTR, LPXA_CHRVI, LPXA_CYACA, LPXA_ECOLI, LPXA_HAEIN, LPXA_HELPJ, LPXA_HELPY, LPXA_NEIMA, LPXA_NEIMB, LPXA_PASMU, LPXA_PROMI, LPXA_PSEAE, LPXA_RICPR, LPXA_RICRI, LPXA_SALTY, LPXA_SYNY3, LPXA_VIBCH, LPXA_XYLFA, LPXA_YEREN, and Q9AQK4. The structure of the LpxA of Escherichia coli is known (Raetz and Roderick 1995
) (PDB code 1lxa). It shows a left-handed ß helix (LßH) domain (sites 1177) that displays a characteristic hexapeptide motif (Vaara 1992
; Vuorio et al. 1994
; Raetz and Roderick 1995
). Because of this, most sites can be classified into one of the six site classes, according to their position in the hexapeptides. These classes are designated i - 2, i - 1, i, i + 1, i + 2, i + 3. We aligned the 25 LpxA sequences using CLUSTAL W (Thompson, Higgins, and Gibson 1994
). The columns of the multiple alignment that correspond to the active site (Wyckoff and Raetz 1999
), loops, and the first and last incomplete coils of the LßH fold were not further considered because they are subject to different constraints than the rest of the hexapeptides. Each of the remaining 116 sites was classified into one of the six hexapeptide classes. With this alignment, we calculated a standard JTT distance matrix (Jones, Taylor, and Thornton 1992
) and obtained a phylogenetic tree, using module FITCH (Fitch and Margoliash 1967
) of PHYLIP 3.57c (Felsenstein 1993
). This tree was used for model comparison, taking advantage of the observation that fixing the tree does not affect the model selection procedure (Posada 2001
). Such model comparison is shown in table 1
, discussed subsequently. Robustness with respect to tree topology was verified by repeating the calculations on two other trees, obtained using the Neighbor-Joining method and the average linking clustering method, respectively, as implemented in the NEIGHBOR module of PHYLIP 3.57c (data not shown) (Felsenstein 1993
). All maximum likelihood calculations were performed with PAML (Yang 1997
).
|
Using LPXA_ECOLI as the starting point, we performed 50 independent SCPE runs of 15,000 mutational steps each. Then we obtained the IS-SCPE replacement matrices for each of the six hexapeptide structural classes, as described previously. These matrices are shown in figure 1 . It is clear from the figure that, as expected, these matrices are strongly site specific. Note that they are particularly sparse for site classes i - 2 and i, which are the most conserved sites that characterize the hexapeptide motif (Raetz and Roderick 1995
|
It is of interest at this point to see whether the IS-SCPE model is able to reproduce reasonably well the evolution of the LpxA family. To do this, we performed a model comparison, shown in table 1 . The second column of this table lists the models compared. JTT, JTT+
, JTT+F, and JTT+F+
are global models based on using the standard JTT replacement matrix (Jones, Taylor, and Thornton 1992
points out that a gamma distribution of replacement rates is used to describe site heterogeneity in replacement rates (one extra parameter needed to define the gamma distribution) (Yang 1993
c). Finally, the last two models of table 1
, JTT+Fc, use site-specific JTT matrices, obtained by adapting JTT to each site class using the class-specific amino acid distributions estimated from the analyzed data. As for IS-SCPE, we consider interclass rate variation (+Rc) and intraclass rate variation (+
c).
The last two columns of table 1
show the logarithm of the maximum likelihood and the BIC values for the different models compared. First of all we note, from comparing the column of BIC values, that IS-SCPE fits the observed sequences better than any of the other models considered. The best fit is obtained with IS-SCPE+Rc+
c: the model that consists of using site-specific replacement matrices obtained from SCPE simulations as described in this work and taking into account interclass (+Rc) and intraclass (+
c) rate heterogeneity. IS-SCPE gives better results than the global JTT because it takes into account site-specific replacement patterns. Regarding the site-specific JTT models (JTT+Fc models), the maximum likelihood values seem to show that they are better than the global JTT models (see column 4 of table 1
). However, the BIC values (column 5 of table 1
) demonstrate that two global modelsJTT+
and JTT+
+Fprovide a better fit than their site-specific counterparts JTT+Fc and JTT+Fc+
c. IS-SCPE is better than global JTT, whereas JTT+Fc is worse because JTT+Fc has 19 x 6 = 114 more parameters than IS-SCPE: the class-specific equilibrium frequencies (notice the nplnns term of the BIC statistic). In general, JTT+Fc (or any model that treats class frequencies as free parameters) will have 19 x nclasses more parameters than its IS-SCPE counterpart. Therefore, for a general case of a protein that is not as regular as LpxA, for which a larger number of classes may be needed, the gap between IS-SCPE and JTT+Fc models will rapidly increase in favor of IS-SCPE. Moreover, to mimic the site-specific substitution pattern, the JTT+Fc models depend on a reliable estimation of the 19 independent amino acid frequencies for each site class. Therefore, one faces again the lack-of-data problem: such models will be applicable only to large families or a small number of structural classes (or both) to which most sites belong, such as the present case. In contrast, the IS-SCPE replacement matrices are obtained using only one reference protein, being thus applicable independent of the size of the protein family under study or the number of site classes needed.
To summarize, we have shown that a whole set of site-dependent evolutionary amino acid replacement matrices can be obtained using just one member of known structure of the protein family under study. The method overcomes the lack-of-data problem by using large sequence data sets simulated using the SCPE model of protein evolution. The replacement matrices obtained constitute an IS-SCPE probabilistic model of the evolution of the family under study. As an illustration, we apply the IS-SCPE model to the LßH domain of the LpxA family, for which we show that it performs better than JTT, or even JTT corrected to take into account site-specific amino acid distributions.
Acknowledgements
We wish to acknowledge fruitful discussions with Pietro Lio and Nick Goldman, with whom we are currently working on a rigorous and detailed statistical assessment of IS-SCPE models using likelihood-ratio tests. This work was supported by the Universidad Nacional de Quilmes and the Fundación Antorchas. J.E. is a researcher of CONICET and a Guggenheim Fellow.
Footnotes
William Taylor, Reviewing Editor
Keywords: molecular evolution
protein evolution
simulation
model
substitution matrix ![]()
Abbreviations: SCPE, structurally constrained protein evolution; IS-SCPE, independent-sitesstructurally constrained protein evolution. ![]()
Address for correspondence and reprints: Julian Echave, Universidad Nacional de Quilmes, Saenz Peña 180, B1876BXD Bernal, Argentina. je{at}unq.edu.ar
. ![]()
References
Bairoch A., R. Apweiler, 2000 The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 Nucleic Acids Res 28:45-48
Felsenstein J., 1993 PHYLIP (phylogeny inference package) Distributed by the author, Department of Genetics, University of Washington, Seattle
Fitch W. M., E. Margoliash, 1967 Construction of phylogenetic trees Science 155:279-284
Goldman N., 1993 Statistical tests of models of DNA substitution J. Mol. Evol 36:182-198[Web of Science][Medline]
Halpern A. L., W. J. Bruno, 1998 Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies Mol. Biol. Evol 15:910-917[Abstract]
Jones D. T., W. R. Taylor, J. M. Thornton, 1992 The rapid generation of mutation data matrices from protein sequences Comput. Appl. Biosci 8:275-282
Koshi J. M., R. A. Goldstein, 1995 Context-dependent optimal substitution matrices Protein Eng 8:641-645
. 1998 Models of natural mutations including site heterogeneity Proteins Struct. Funct. Genet 32:289-295[Web of Science][Medline]
Lio P., N. Goldman, 1998 Models of molecular evolution and phylogeny Genome Res 8:1233-1244
Lio P., N. Goldman, J. L. Thorne, D. T. Jones, 1998 PASSML: combining evolutionary inference and protein secondary structure prediction Bioinformatics 14:726-733
Overington J., D. Donnelly, M. S. Johnson, A. Sali, T. L. Blundell, 1992 Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds Protein Sci 1:216-226[Web of Science][Medline]
Parisi G., J. Echave, 2001 Structural constraints and emergence of sequence patterns in protein evolution Mol. Biol. Evol 18:750-756
Posada D., 2001 The effect of branch length variation on the selection of models of molecular evolution J. Mol. Evol 52:434-444[Web of Science][Medline]
Raetz C. R., S. L. Roderick, 1995 A left-handed parallel beta helix in the structure of UDP-N-acetylglucosamine acyltransferase Science 270:997-1000
Schwarz G., 1974 Estimating the dimension of a model Ann. Stat 6:461-464
Thompson J. D., D. G. Higgins, T. J. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680
Thorne J. L., 2000 Models of protein sequence evolution and their applications Curr. Opin. Genet. Dev 10:602-605[Web of Science][Medline]
Thorne J. L., N. Goldman, D. T. Jones, 1996 Combining protein evolution and secondary structure Mol. Biol. Evol 13:666-673[Abstract]
Vaara M., 1992 Eight bacterial proteins, including UDP-N-acetylglucosamine acyltransferase (LpxA) and three other transferases of Escherichia coli, consist of a six-residue periodicity theme FEMS Microbiol. Lett 76:249-254[Medline]
Vuorio R., T. Harkonen, M. Tolvanen, M. Vaara, 1994 The novel hexapeptide motif found in the acyltransferases LpxA and LpxD of lipid A biosynthesis is conserved in various bacteria FEBS Lett 337:289-292[Web of Science][Medline]
Whelan S., P. Lio, N. Goldman, 2001 Molecular phylogenetics: state-of-the-art methods for looking into the past Trends Genet 17:262-272[Web of Science][Medline]
Wyckoff T. J. O., C. R. H. Raetz, 1999 The active site of Escherichia coli UDP-N-acetylglucosamine acyltransferase, chemical modification and site-directed mutagenesis J. Biol. Chem 274:27047-27055
Yang Z., 1993 Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites Mol. Biol. Evol 10:1396-1401[Abstract]
. 1997 PAML: a program package for phylogenetic analysis by maximum likelihood Comput. Appl. Biosci 13:555-556
Yang Z. H., R. Nielsen, M. Hasegawa, 1998 Models of amino acid substitution and applications to mitochondrial protein evolution Mol. Biol. Evol 15:1600-1611[Abstract]
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
S. Kryazhimskiy, G. A Bazykin, J. Plotkin, and J. Dushoff Directionality in the evolution of influenza A haemagglutinin Proc R Soc B, November 7, 2008; 275(1650): 2455 - 2464. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. O. Sassi, E. L. Braun, and S. A. Benner The Evolution of Seminal Ribonuclease: Pseudogene Reactivation or Multiple Gene Inactivation Events? Mol. Biol. Evol., April 1, 2007; 24(4): 1012 - 1024. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. S. Fornasari, G. Parisi, and J. Echave Quaternary Structure Constraints on Evolutionary Sequence Divergence Mol. Biol. Evol., February 1, 2007; 24(2): 349 - 351. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Rodrigue, H. Philippe, and N. Lartillot Assessing Site-Interdependent Phylogenetic Models of Sequence Evolution Mol. Biol. Evol., September 1, 2006; 23(9): 1762 - 1775. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Porto, H. E. Roman, M. Vendruscolo, and U. Bastolla Prediction of Site-Specific Amino Acid Distributions and Limits of Divergent Evolutionary Changes in Protein Sequences Mol. Biol. Evol., March 1, 2005; 22(3): 630 - 638. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




