Skip Navigation


MBE Advance Access originally published online on September 14, 2005
Molecular Biology and Evolution 2006 23(1):203-211; doi:10.1093/molbev/msj023
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
23/1/203    most recent
msj023v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Naya, H.
Right arrow Articles by Musto, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Naya, H.
Right arrow Articles by Musto, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Article

Inferring Parameters Shaping Amino Acid Usage in Prokaryotic Genomes via Bayesian MCMC Methods

Hugo Naya*,{dagger},{ddagger}, Daniel Gianola{ddagger}, Héctor Romero*,§, Jorge I. Urioste{dagger} and Héctor Musto*

* Laboratorio de Organización y Evolución del Genoma, Departamento de Biología Celular y Molecular, Facultad de Ciencias, Montevideo, Uruguay; {dagger} Departamento de Producción Animal y Pasturas, Facultad de Agronomía, Montevideo, Uruguay; {ddagger} Department of Animal Sciences, University of Wisconsin–Madison; and § Escuela Universitaria de Tecnología Médica, Facultad de Medicina, Montevideo, Uruguay

E-mail: hnaya{at}fcien.edu.uy.


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Supplementary Material
 Acknowledgements
 References
 
Molar content of guanine plus cytosine (G + C) and optimal growth temperature (OGT) are main factors characterizing the frequency distribution of amino acids in prokaryotes. Previous work, using multivariate exploratory methods, has emphasized ascertainment of biological factors underlying variability between genomes, but the strength of each identified factor on amino acid content has not been quantified. We combine the flexibility of the phylogenetic mixed model (PMM) with the power of Bayesian inference via Markov Chain Monte Carlo (MCMC) methods, to obtain a novel evolutionary picture of amino acid usage in prokaryotic genomes. We implement a Bayesian PMM which incorporates the feature that evolutionary history makes observed data interdependent. As in previous studies with PMM, we present a variance partition; however, attention is also given to the posterior distribution of "systematic effects" that may shed light about the relative importance of and relationships between evolutionary forces acting at the genomic level. In particular, we analyzed influences of G + C, OGT, and respiratory metabolism. Estimates of G + C effects were significant for amino acids coded by G + C or molar content of adenine plus thymine (A + T) in first and second bases. OGT had an important effect on 12 amino acids, probably reflecting complex patterns of protein modifications, to cope with varying environments. The effect of respiratory metabolism was less clear, probably due to the already reported association of G + C with aerobic metabolism. A "heritability" parameter was always high and significant, reinforcing the importance of accommodating phylogenetic relationships in these analyses. "Heritable" component correlations displayed a pattern that tended to cluster "pure" G + C (A + T) in first and second codon positions, suggesting an inherited departure from linear regression on G + C.

Key Words: Bayesian methods • MCMC • amino acid usage • genome evolution • linear models • GC content • optimal growth temperature


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Supplementary Material
 Acknowledgements
 References
 
The increasing availability of completely sequenced genomes enables the examination of differences in relative amino acid usage between species, which offers opportunities to obtain new insights into molecular evolution. Multivariate methods for exploring large-scale data sets have been employed in a number of studies in genomics. For instance, correspondence analysis (Benzécri 1976Go) and principal component analysis (Hotelling 1933Go) have been applied to characterize several genome properties, shedding light about the global picture of molecular evolution. Even though artifacts may emerge (Zavala et al. 2002Go), the multivariate procedures have been helpful to learn about genomic fundamentals (Perriere and Thioulouse 2002Go), such as prediction of coding regions (Fichant and Gautier 1987Go) and trends in amino acid composition (Garat and Musto 2000Go).

Recent work based on multivariate methods (Lobry 1997Go; Kreil and Ouzounis 2001Go; Tekaia, Yeramian, and Dujon 2002Go; Takeuchi et al. 2003Go) has established that molar content of guanine plus cytosine (G + C) and optimal growth temperature (OGT) are main factors shaping amino acid usage in prokaryotic genomes. The role of G + C can be expected from the genetic code structure because coding for a certain amino acid implies a bias toward G + C in nonsynonymous position of the respective codon. OGT explains avoidance of glutamine in thermophilic organisms because this amino acid and their tRNA synthetase are temperature sensitive (Cambillau and Claverie 2000Go). Aerobiosis is strongly associated to an increase in GC content in prokaryotes, with a concomitant evolutionary change in amino acid frequencies. Aerobic and anaerobic organisms have distinct distributions of their frequencies of amino acids susceptible to oxygen (FASO), with the median content being 1.5% lower in aerobics (Naya et al. 2002Go).

Quantitative evidence concerning the strength of the G + C and OGT factors is lacking. A model explaining variation in amino acid usage between genomes should contemplate the evolutionary history of the organisms, with pertinent factors included as explanatory variables. Evolutionary history implies a between-genome dependency (inducing correlation) in amino acid frequencies. In particular, the statistical distance between amino acid distributions of two species should be smaller when the pair of species is phylogenetically closer than when it is farther. Felsenstein (1985)Go, among others, noted this issue and suggested a method (Felsenstein independent contrast [FIC]) for contrasting two different traits in various organisms while accounting for phylogenetic relationships. The rationale of FIC is an evolutionary model known as "Brownian motion" in which the between-species variance increases linearly with time. Several other phylogenetic comparative methods have been proposed in the last two decades, e.g., spatial autoregressive procedures (Cheverud, Dow, and Leutenegger 1985Go), phylogenetic generalized least squares (Martins and Hansen 1997Go), and phylogenetic eigenvector regression (Diniz-Filho, Ramos De Sant'Ana, and Bini 1998Go).

An alternative consists of using statistical methods developed for quantitative genetic analysis. Animal breeders, for example, encounter the analogous feature of correlated observations due to coancestry relationships. The pioneering work of Henderson (1949Go, 1953Go, 1963Go) led to methods that account for correlations between observations taken in related individuals through fitting a linear mixed model in which effects of families or individuals are viewed as random effects. Animal breeders use best linear unbiased prediction (BLUP) for inferring the genetic merit of candidates to artificial selection and likelihood-based or Bayesian methods for estimating dispersion parameters (Sorensen and Gianola 2002Go). With BLUP, one can simultaneously estimate and predict fixed and random effects, respectively.

Lynch (1991)Go perceived the flexibility and potential of this approach for evolutionary studies. Using the phylogenetic mixed model (PMM), he recognized that traits sometimes evolve so fast that their phylogenetic past can be overcome (Housworth, Martins, and Lynch 2004Go). The PMM estimates the relative contribution of two sources of evolutionary change for a trait, i.e., the proportion of the correlation between two species that corresponds to relationships between species and a species-specific contribution. In the phylogenetic context, the matrix of coancestries used in animal breeding corresponds to the proportion of time that taxa share a common ancestor. Further, the coefficient of heritability h2 corresponds to the proportion of the total variance "due to" relationships among taxa. However, in the phylogenetic context, genetic and environmental effects contribute to both heritable and nonheritable fractions of the variance (Housworth, Martins, and Lynch 2004Go). An important feature of the PMM is its flexibility as it can shift from nonphylogenetic correlation model to FIC assumptions when estimating the h2 parameter. When an analysis is bivariate or multivariate, the PMM allows estimating phenotypic correlations between traits, as well as partitioning covariance into phylogenetically heritable and nonheritable components.

Despite the attractiveness of the PMM, various technical difficulties arise in practice. Contrary to the situation in animal breeding, where the relationship matrix usually involves thousands of animals, the number of taxa is much smaller in evolutionary studies. A small number of taxa lead to highly imprecise estimates of correlations between taxa, regardless of the amount of data collected (Lynch 1991Go; Housworth, Martins, and Lynch 2004Go). Furthermore, in likelihood-based inference, standard errors and likelihood ratio tests rely on asymptotic properties (Lynch 1991Go; Sorensen and Gianola 2002Go) which may not hold in finite samples.

A conceptually different approach is Bayesian analysis, suggested by Gianola and Fernando (1986)Go for animal breeding and quantitative genetic applications; here, all estimands are treated as random variables. These unknowns may include parameters, random effects, and even the model itself. The idea is to combine knowledge before data are observed (via a prior probability distribution) with information coming from the data into a posterior distribution. Results from a Bayesian analysis can be reported as the entire posterior density or through posterior summaries, such as the mean, median, mode, variance, percentiles, etc. In the Bayesian framework, uncertainty is casted in probabilistic language and without relying on asymptotic arguments.

For a long time, a major problem associated with implementing Bayesian inference was that of the difficulty (or impossibility) of analytical or numerical evaluation of high-dimension integrals, thus limiting its application to stylized problems only. However, the rediscovery of Markov Chain Monte Carlo (MCMC) methods such as "Gibbs sampling" (S. Geman and D. Geman 1984Go) has enabled scientists (e.g., Gianola, Rodriguez-Zas, and Shook 1994Go; Wang, Rutledge, and Gianola 1994Go) to circumvent this difficulty via sampling from the joint posterior distribution.

In this paper, we combine the flexibility of PMM with Bayesian inference via MCMC methods, to obtain a new evolutionary picture of amino acid usage in prokaryotic genomes. We focus on the posterior distribution of "fixed effects" shedding light about the relative importance and relationships of evolutionary forces at the genomic level. We also assess the impact of accounting for phylogenetic relationships and report what is seemingly the first set of estimates of the strength of the G + C and OGT forces in shaping amino acid frequencies. Finally, drawbacks of our model are discussed and possible improvements are suggested.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Supplementary Material
 Acknowledgements
 References
 
Data
Sequences of 89 completely sequenced prokaryotic genomes (see Supplementary Material online) for which it was possible to retrieve the aligned small-subunit rRNA from Release 8.1 of the Ribosomal Database Project II (Cole et al. 2003Go) were downloaded from the National Center for Biotechnology Information site (http://www.ncbi.nlm.nih.gov).

Response variables in our model were the mean amino acid frequencies (percent) for each genome. Explanatory variables chosen were G + C (percent), OGT (in degree celsius), and "respiration." Information for the explanatory variables was downloaded from the Genomemine site (http://www.genomics.ceh.ac.uk/cgi-bin/genomemine/gminemenu.cgi) and the German Collection of Microorganisms and Cell Cultures (http://www.dsmz.de/species/strains.htm); additional information was in data from Naya et al. (2002)Go. Because in the sequenced genomes there are "respiration" metabolisms with few representatives, we grouped genomes into three classes: aerobics, anaerobics, and facultatives (all not strictly aerobic or anaerobic organisms).

Phylogeny and the Relationship Matrix
A key component of our PMM is the symmetric relationship matrix between the 89 genomes; model restrictions imply positive definiteness of this variance-covariance matrix. When organisms have similar evolutionary rates, as in Lynch (1991)Go, each entry of the matrix may be construed as the proportion of evolutionary time shared by two species, with all diagonal elements being equal to one. However, current work with deep prokaryotic phylogenies involves different evolutionary rates, so inference concerning molecular evolution should account for this feature. This implies that diagonal elements are no longer equal to one and that they probably have different values, reflecting differences in evolutionary rates.

Another difficulty in analysis of deep phylogenies involves orthologous selection. For the sake of simplicity, we chose the rRNA small subunit as the orthologous set. The "Ribosomal Database Project" provides a well-defined and carefully aligned data set; from this database, we selected the longest sequence for each genome. Distances between sequences were obtained through the program "dnadist" of the PHYLIP 3.5 package (Felsenstein 1993Go) using Kimura's two-parameter model. Subsequently, a tree was constructed using the WEIGHBOR software (Bruno, Socci, and Halpern 2000Go) and rooted at the archaea-bacteria edge with TREEVIEW (Page 1996Go). Finally, the relationship matrix was obtained from this tree through the COMPARE 4.5 package (Martins 2003Go).

Another approach to building the relationship matrix consisted of using the concatenated alignments of a set of 12 highly likely orthologous genes present in all the genomes analyzed obtained via a Blast search (Altschul et al. 1997Go) across all genomes, picking up the best reciprocal hits. Subsequently, we aligned each set of orthologous sequences using ClustalW (Thompson, Higgins, and Gibson 1994Go) and concatenated them, obtaining a single alignment. Then a distance matrix was built using the program "protdist" of the PHYLIP 3.5 package (Felsenstein, 1993Go) using Dayhoff PAM matrix. After this, we followed the same steps as in the rRNA data. Use of this relationship matrix yielded similar estimates of the "fixed effects" and small differences for other parameters. Thus, all results presented subsequently were from the analysis using the rRNA relationship matrix.

The processes underlying genome evolution and amino acid composition are extremely complex. It is very difficult to capture all complexity in the relationship matrix, and the choice of any of the above trees is an approximation. Other methods of obtaining the tree could have been used, such as those based on gene content.

Highly related species (or strains) may cause the relationship matrix to have nearly incomplete rank (noninvertible). Hence, some genomes were excluded, thus ending with 89 genomes, deemed representatives of the different factors in the model.

The Multivariate Linear Additive Genetic Model
In a mixed model, distinction is made between random and fixed effects, depending on whether or not the levels of a factor can be viewed as a draw from some distribution, typically assumed to be multivariate normal. This distinction is not needed in a Bayesian setting as all parameters are regarded as random variables. However, many frequentist estimators can be obtained as special cases of the Bayesian linear model (Sorensen and Gianola 2002Go). Due to computational limitations, all analyses described below were carried out with a bivariate model (pairs of amino acids). However, the procedures can be generalized to any desired number of traits.

Consider two observed traits Y and Z (amino acid mean frequencies for two amino acids in a prokaryotic genome or taxon, e.g., Ile and Leu), and let n be the number of taxa for which measurements on the pair of amino acids are available. Collect all observed values into vectors y and z, and put v = [y' z']', with order 2n x 1. Also, let ß = [ß'y ß'z]', a = [a'y a'z]', and e = [e'y e'z]', where ßy (ßz) is a vector of order p of "fixed effects" influencing trait Y (Z); ay (az) is a vector of order q of phylogenetically heritable values for trait Y (Z), noting that q may be smaller or larger than n; and ey (ez) is a vector of residual effects of order n for trait Y (Z). Our "fixed effects" are G + C and OGT (fitted as "continuous" covariables) for which we estimate regression slopes and the factor "respiration" mode having three levels: aerobic (Aer), anaerobic (Ana) and facultative (Fac, consisting of all other organisms, as noted earlier).

The following linear model was fitted:

(1)
where

(2)
are known incidence matrices relating observations to location effects (ß and a) for both traits. A bivariate normal distribution of the data was assumed, with bivariate pairs (Y, Z) being mutually independent across individuals, conditionally on the location effects. The residual distribution had null mean vector and covariance matrix

(3)
where and are residual variances for traits Y and Z, respectively, and {sigma}e,yz is the residual covariance between traits Y and Z. Pairs of residuals of different genomes were mutually independent, implying that the variance-covariance matrix of e (equivalently, the covariance matrix of v, given ß and a) can be expressed as where {otimes} denotes Kronecker product and In is an identity matrix of order n x n.

Prior Distributions
There is no information available on quantitative effects of factors shaping amino acid usage. Therefore, we chose vague (Bernardo 1997Go) prior distributions for the effects contained in ß; this also facilitates sampling from the posterior distribution. We assumed a proper (bounded) uniform distribution as prior for the vector of "fixed effects" ß, with density

(4)
The bounds were [–1, 1] for the regression coefficients (bGC and bOGT) and [–20, 100] for the general mean and the "respiration" effects.

The vector of phylogenetic values was assigned, a priori, the multivariate normal distribution

(5)
where A is the phylogenetic relationship matrix of order q x q and

(6)
is the covariance matrix of phylogenetically heritable effects on Y and Z.

We assigned independent two-dimensional scaled inverted Wishart distributions (IW2) to the Re and G0 (co)variance matrices, with the respective densities being

(7)
and

(8)
where k = 2 in the bivariate model and {nu}e, {nu}a, Ve, Va are hyperparameters of the distributions, assumed known. In the inverted Wishart distribution, {nu} is a shape parameter and V is a scale parameter. Usually, {nu} is viewed as a "degree of belief" in the prior; we assigned a low value to this parameter (it must be greater than two to ensure a proper distribution), to convey large prior uncertainty about the values of the Re and G0 matrices. Entries of the scale parameter matrices were set equal to the sample phenotypic variance-covariance matrix. This procedure was employed because there is no other information available on possible values of G0 and Re; this assignment does not bias, a priori, the partitioning of (co)variability into phylogenetic and residual components.

As usual, the heritability of each amino acid was defined as the proportion of the total variance accounted by phylogenetic relationship, that is:

(9)
and

(10)
are the heritable and nonheritable correlations, respectively.

Fully Conditional Posterior Distributions
We constructed a Gibbs sampler (e.g., S. Geman and D. Geman 1984Go) to obtain samples from marginal posterior distributions of unknown quantities. A detailed derivation of fully conditional distributions of location and dispersion parameters is in Sorensen and Gianola (2002)Go. Following these authors, let

(11)

(12)

(13)
and

(14)

The conditional (given dispersion parameters) posterior means of ß and a can be obtained by solving the linear system

(15)
where

(16)

Samples of {theta} are drawn from the multivariate normal process

(17)

Now let

(18)
where ey (ez) are residuals of the model for trait Y (Z). Updates of the Gibbs sampler for the residual covariance matrix are obtained from

(19)

Similarly, define

(20)
and updates for the phylogenetic covariance matrix are drawn from

(21)

We implemented a Gibbs sampler using the statistical package R (http://cran.r-project.org/). The sampler was run as a single chain, and convergence was assessed using the "CODA" and "BOA" packages in R. Length of the burn-in period was determined from several pilot runs using a diagnostic test developed by Geweke (1992)Go. Chains for different analyses stabilized rapidly, suggesting fast convergence to the posterior distribution. A conservative burn-in was set to 20,000 iterations for all runs, and samples for inference were obtained from the subsequent 20,000 iterations, with a thinning interval of 10.

Bivariate analyses were run for all 190 pairs of amino acids. Posterior plots were based on 50,000 samples, collected after a burn-in period of 50,000 iterations. For reasons of space, only plots for the Ile-Leu bivariate analysis are shown.


    Results and Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Supplementary Material
 Acknowledgements
 References
 
Posterior densities of location parameters were unimodal and nearly symmetric, as shown in figure 1 for Ile and Leu, which was typical. Close agreement between mean, median, and mode was consistent with the lack of skewness of the posterior distributions. Because there were 19 sets of central tendency estimates for each location parameter of each amino acid, we took as "best estimate" the average of the posterior means from the 19 runs; all 19 values were very close to each other. Likewise, for each run, the posterior probability that a "fixed" effect was larger (or smaller, when coefficients were negative) than zero was computed, and these posterior probabilities were averaged over the 19 runs as well.



View larger version (28K):
[in this window]
[in a new window]
 
FIG. 1.— Posterior densities of regression coefficients: bGC in left panels; bOGT in right panels; Ile and Leu in top and bottom panels, respectively.

 
Table 1 gives a summary of "fixed effects" estimates for each amino acid together with their respective codons. As expected from the genetic code, the signs of the partial regression coefficients for each amino acid on G + C (bGC) are (in general) in agreement with the composition at first and second bases of the corresponding codons. The amino acid with the highest value of bGC was Ala (CGN) and the most strongly negative was Lys (AAR), followed by Ile, the only amino acid coded by three codons (AUY–AUA) in which eight of nine sites are A or U. The amino acids with positive bGC values and posterior probability greater than 0.99 were Ala (GCN), Arg (CGN–AGR), Gly (GGN), Val (GUN), Pro (CCN), Asp (GAY), and Thr (ACN) in decreasing order of slope, while Trp (UGG), Gln (CAR), and His (CAY) were with probability greater than 0.95. Likewise, the most strongly negative regressions (with probability greater than 0.99) were Lys (AAR), Ile (AUY–AUA), Asn (AAY), Phe (UUY), Tyr (UAY), and Ser (UCN–AGY), while Glu (GAR) had a negative slope with probability greater than 0.95. Surprisingly, several amino acids without pure G + C, or molar content of adenine plus thymine (A + T), at first and second codon positions appeared to be correlated with G + C. However, in the case of Val, as pointed out by Kreil and Ouzounis (2001)Go, an explanation may be that Ile, Leu, Met, and Val form a group of conservatively interchangeable residues. Conservative replacement of Ile (to Val) from G + C pressure can lead to this association. This could also explain the absence of the expected correlation for Met. However, the effects of chance cannot be ruled out in a finite sample.


View this table:
[in this window]
[in a new window]
 
Table 1 Average (across 19 runs) Posterior Mean Estimates of Fixed Effects

 
The other non–pure amino acid with a positive value for bGC was Arg, which is coded by a quartet with G and C in addition to a duet with A and G in first and second positions, respectively. This may be understood in the light of (1) 10 of 12 possibilities in first and second base are G or C (if transfers were equally probable) and (2) the ample range in percentage of genome displayed by Arg (from 2.4% to 8.3% in our data).

The maximum and minimum G + C in the data set were 72.1% and 22.5%, respectively, giving an observed range of 49.6%. If all other factors in the model are kept constant, this range of G + C implies an expected range of 8.7% for Ala and 7.9% for Lys, in view of their respective bGC values; in the data, the observed range was 10.2% for Ala and 10.0% for Lys.

For OGT, eight amino acids had regression coefficients that were different from zero with probability greater than 0.99 and four with probability greater than 0.95. As expected (Kreil and Ouzounis 2001Go), Gln showed a negative relationship with temperature because this amino acid and its tRNA are both temperature sensitive. The slope for this amino acid (bOGT = –0.0173% per °C) implies a decrease of about 1.1% relative to all amino acids for the mesophiles-thermophiles range (37°C–100°C). On the other hand, Glu had a positive association with OGT (bOGT = 0.0293% per °C); this may be due to the fact that Glu is more thermally stable than Gln but similar with respect to other physicochemical properties.

The amino acid with the steepest regression on temperature was Val (bOGT = 0.0313% per °C). The implication is that the model predicts an increase in the percentage of Val of about 2.0% in thermophilic organisms with respect to mesophilics. Although this is in agreement with previous work (Kreil and Ouzounis 2001Go), the reason is not apparent. A possible explanation is that hydrophobic amino acids tend to increase in thermophilic organisms (Tekaia, Yeramian, and Dujon 2002Go). Additionally, Ser, Ala, Thr, and Asp were negatively correlated with OGT, and the posterior probability that the slopes were smaller than zero was greater than 0.99. Likewise, the regression of Asn on OGT was bOGT = –0.0124% per °C, and the posterior probability that this regression was smaller than zero was larger than 0.95. Lys, Arg, Leu, and Tyr appeared to be positively correlated with OGT (probability greater than 0.95).

"Respiration" had the least clear effects among the "fixed factors" considered. As mentioned earlier, the intercept includes the Ana effect, and the other two effects were estimated as contrasts between classes, i.e., Aer-Ana and Fac-Ana. The only amino acid for which the difference with respect to Ana organisms differed from zero with probability greater than 0.99 was Glu (Gln only in Fac-Ana). In the case of Glu, the Aer-Ana and Fac-Ana contrasts were negative. For Gln, the Fac-Ana difference was positive. The interpretation is complicated by the fact that our data set was limited and thermophilic organisms (in which Glu increases) were overrepresented within Ana. Leu, Ala, Gln, and Pro were avoided in anaerobic organisms with probability greater than 0.95, while Lys and Met were in excess.

If the six FASO amino acids are considered in the comparison between Aer and Ana, only Met was avoided in Aer, while Cys and Phe showed a small similar trend. His, Tyr, and Trp displayed the opposite trend, making it difficult to draw a conclusive statement about FASO departure from expected frequencies under G + C pressure.

Only Arg had a negative intercept trend, but it could not be considered to be different from zero. Significant negative intercepts are possible in our model, provided that the intercept estimates the amino acid frequency for anaerobic organisms at 0% G + C and 0°C. As amino acid frequencies must be at least zero, only Gibbs samples for which the intercept was at least null should have been accepted. For Arg, the estimated bGC was positive and approximately 10 times larger than bOGT (also positive); hence, even small departures from zero in G + C guarantee a positive frequency for the amino acid.

In some cases, dispersion parameters displayed skewed distributions, as shown for Ile and Leu in figure 2, suggesting limited amount of information for precise inference, and we adopted the posterior mode as a point estimate. However, all heritabilities were significantly greater than zero and ranged from 0.75 for Trp to 0.99 for Lys (table 2). This significance argues in favor of a statistical model incorporating phylogenetic information, a fact that may be expected in our case, in view of the phylogeny depth (Housworth et al. 2004Go), as values in the diagonal of the relationship matrix increase with evolutionary time.



View larger version (23K):
[in this window]
[in a new window]
 
FIG. 2.— Contour plot of the bivariate joint density of heritability of Ile and Leu (50,000 iterations after 50,000 of burn-in, thinning = 10).

 

View this table:
[in this window]
[in a new window]
 
Table 2 Amino Acid Heritabilities, Heritable and Nonheritable correlations

 
Correlations displayed a complex pattern, as summarized in the last four columns of table 2. For compactness of presentation, we omitted from this table the mean modal value of each correlation and classified the correlations (for each amino acid) as positive or negative. Amino acids can be grouped in a meaningful way as a function of G + C (A + T) purity in first and second bases of codons (Singer and Hickey 2000Go) because changes in these positions are intrinsically nonsynonymous. In table 2, we put in boldface amino acids "purely" coded by G + C in the first two positions of codons (Phe, Tyr, Met, Ile, Asn, Lys), and underlined amino acids corresponding to A + T–rich codons (Gly, Ala, Arg, Pro). Note that Leu is not included in the A + T–rich set as this only happens in the duet, and Arg is included as G + C rich because the quartet is rich. Recall that, in our model, G + C was included as a "fixed effect," so at least its linear influence in amino acids frequencies was accounted for. Then, all remaining heritable correlations for "pure" amino acids are interpretable as phylogenetically shared departures from linearity.

In general, the observed patterns of correlations are in agreement with the groups outlined above. On the other hand, the nonheritable fraction can be seen as changes in the tips and, hence, as recent evolutionary events. This should not be viewed as exclusively supporting the mutational viewpoint as amino acid–shared behaviors are results of genetically inherited and life history events. Furthermore, several phenomena related to the code structure are probably confounded. The negative correlation between Lys and Arg, one A + T rich and the other G + C rich, would enable genomes with strongly opposite mutational skew remain coding for amino acids with similar properties. This is an important feature when considering that charged residues play a critical role in stabilizing thermophilic proteins (Tekaia, Yeramian, and Dujon 2002Go).

While other non–pure coded amino acids appear to be correlated, Leu, Gln, Thr, and Val appear grouped in different combination several times. Nevertheless, these associations are probably spurious because correlations were not significant in the Bayesian sense. Furthermore, our runs were bivariate, and this might have hampered uncovering a more complex covariance pattern. However, as noted earlier, the approach used here can be extended to more than two variables. This is onerous, computationally, and a more elaborate and optimized software is required.

In our model, variation in phylogenetic values can be interpreted as reflecting differences in amino acid frequencies corresponding to a common evolutionary history, once influential known factors such as OGT or G + C are accounted for. Furthermore, differences in phylogenetic values ("breeding values" in the animal-breeding context) probably arise from selective forces, e.g., forces modifying frequencies of different genes (involved in integral membrane proteins, cytolosolic proteins, etc.) or maintaining amino acids at sites and regions of genes out of the frequencies that would be expected from values of "fixed effects" included in the model.

Additional analysis could involve an application of multivariate techniques, such as Principal Component Analysis to the covariance matrix between phylogenetic values for the 20 amino acids. Because the residual term stands for rapid changes in the tips, a multivariate analysis over the covariance matrix of mean "errors" for each genome may uncover convergent adaptive changes.

We discuss next some of the assumptions made in this study. First, the choice of priors is a central question in the Bayesian-frequentist controversy. As pointed out by Bernardo (1997)Go, truly noninformative priors do not exist. However, when prior information is scarce, it is possible to use prior distributions that convey vague information in some sense (for an interesting discussion in the animal-breeding context, see Blasco 2001Go). Furthermore, the Bayesian approach combines the "prior" with the "data" information in a natural manner, and as more data become available, the relative importance of the "prior" diminishes. Eventually, the likelihood of the data increasingly dominates inferences as more and more observations accrue. Under regularity conditions, it can be shown that the posterior distribution is strictly proportional to the likelihood function (e.g., Sorensen and Gianola 2002Go). We were not able to find estimates from previous work, so we adopted prior distributions which we believe are conservative. For example, the bounds of the prior distribution of "fixed effects" defined a fairly wide allowable range of values, and such boundary values are not inconsistent with previous knowledge in genomics (i.e., Lobry 1997Go). Also, prior uncertainty about dispersion parameters was handled via being conservative in the assignment of "degrees of belief" values in the inverted Wishart distributions.

Second, inferring deep phylogenies is a controversial matter. Evolution along billions of years in adaptable organisms, such as prokaryotes, can erase or confound even strong phylogenetic signals. Furthermore, phylogenetic reconstruction is usually made using different data sets and methods, e.g., consider the continuously changing topology of prokaryotic trees. Our choice of using the small-subunit RNA as the sequence carrying the phylogenetic information was based more on implementation considerations than on biological grounds. More work is needed to assess the impact of this choice on parameter estimates. An alternative with completely sequenced genomes is that of phylogenetic reconstruction from the minimal orthologous genes set. The results obtained using either approach, rRNA or minimal orthologous genes set, were conceptually similar. Furthermore, the uncertainty in the phylogenies could be readily incorporated, for example, using in the MCMC sample process different relationship matrices weighed by their probability.

Because the heritability values were rather high, we perform new analysis to increase our confidence in the results. Thus, we carried out 20 bivariate models with the frequencies of amino acids in the genomes and in the orthologous gene set. In this case, the heritabilities in the orthologous set were very similar to those in the rest of the genome. Heritable correlations were in general high and positive except for the scarce amino acids in the sets.

We also assumed that the G0 matrix does not evolve in time. Many examples challenge this assumption for truly genetic "G matrices" and phenotypic "P matrices" (see Steppan, Houle, and Phillips 2002Go, for a review) at the species and genus levels, although none of these are from prokaryotes. If G matrix evolution can be detected in prokaryotes at these taxonomic levels, our assumption may not hold true. Because the basic biochemical machinery seems to remain conserved along evolution, it is not unlikely that genetic restrictions at the amino acid usage level remain essentially unchanged in the evolution of prokaryotes. As more data become available, estimation and comparison of G matrices obtained at different phylogenetic levels may be feasible.

A possibly major deficiency in our model is that it assumes a single mechanism driving the amino acids frequencies based on mutational biases and that it can be inherited. A possible solution to this would be modeling G + C as a response variable treated jointly with each amino acid (or all, if it were computationally feasible) in a bivariate model. Here, one would estimate the genetic and residual correlations, with the former corresponding to the inherited mutational bias. However, as shown by Naya et al. (2002)Go, the G + C distribution of aerobic organisms is shifted with respect to that of anaerobic ones, and it is clearly not normal. These authors found that facultatives showed a bimodal distribution with larger variance than for aerobics or anaerobics (probably related to the uncertainty in the classification), suggesting that some biological phenomena are not properly taken into account by our model. These findings question the bivariate model implementation based on normal distributions used in the present study.

The enormous flexibility of Bayesian MCMC methods provides the means for overcoming at least some of the limitations mentioned above. Hierarchical models permit modeling different observations as arising from different distributions, conditionally to their origin (e.g., "respiration" modes in our example). Furthermore, data could be transformed differentially via introduction of "Box-Cox" parameters which, in turn, can be estimated itself from the observations. For each origin, this is along the lines of the Programa de Desarrollo de las Ciencias Básicas approach (Martins and Hansen 1997Go), in which a parameter associated with the strength of selective forces is estimated.

In conclusion, the PMM coupled with Bayesian MCMC techniques enable to obtain novel, model-based information about molecular evolution. In particular, it may be of interest to assess if evolutionary forces, such as G + C, have similar strength in different genes within genomes and, if differences arise, to quantify their extent. As more data become available, it will be possible to fit even more complex models than the one used in this paper.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Supplementary Material
 Acknowledgements
 References
 
List of organisms analyzed in the present study are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Supplementary Material
 Acknowledgements
 References
 
We thank Alejandro Zavala, Diego Gimeno, Yu-Mei Chang, and Guilherme J. M. Rosa for helpful discussions. H.N. was the recipient of a fellowship from PEDECIBA, Uruguay. This work was partially supported by award 7094 from "Fondo Clemente Estable," Uruguay, by the Wisconsin Agriculture Experiment Station, and by grants National Research Initiative Competitive Grants Program/United States Department of Agriculture 2003-35205-12833, National Science Foundation Department of Environmental Biology-0089742, and National Science Foundation Department of Mathematical Sciences-044371.


    Footnotes
 
Jianzhi Zhang, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Supplementary Material
 Acknowledgements
 References
 

    Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.[Abstract/Free Full Text]

    Benzécri, J. P. 1976. L'Analyse des Données, Tome 2: L'Analyse des Correspondances. Dunod, Paris.

    Bernardo, J. M. 1997. Noninformative priors do not exist: a discussion. J. Stat. Plan. Inference 65:159–189.[CrossRef]

    Blasco, A. 2001. The Bayesian controversy in animal breeding. J. Anim. Sci. 79:2023–2046.[Abstract/Free Full Text]

    Bruno, W. J., N. D. Socci, and A. L. Halpern. 2000. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 17:189–197.[Abstract/Free Full Text]

    Cambillau, C., and J. M. Claverie. 2000. Structural and genomic correlates of hyperthermostability. J. Biol. Chem. 275:32383–32386.[Abstract/Free Full Text]

    Cheverud, J. M., M. M. Dow, and W. Leutenegger. 1985. The quantitative assessment of phylogenetic constraints in comparative analyses: sexual dimorphism in body weight among primates. Evolution 39:1335–1351.[CrossRef][Web of Science]

    Cole, J. R., B. Chai, T. L. Marsh et al. (11 co-authors). 2003. The ribosomal database project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res. 31:442–443.[Abstract/Free Full Text]

    Diniz-Filho, J. A. F., C. E. Ramos De Sant'Ana, and L. M. Bini. 1998. An eigenvector method for estimating phylogenetic inertia. Evolution 52:1247–1262.[CrossRef]

    Felsenstein, J. 1985. Phylogenies and the comparative method. Am. Nat. 125:1–15.

    ———. 1993. PHYLIP (phylogeny inference package). Version 3.5c. Distributed by the author, Department of Genetics, University of Washington, Seattle.

    Fichant, G., and C. Gautier. 1987. Statistical method for predicting protein coding regions in nucleic acid sequences. Comput. Appl. Biosci. 3:287–295.[Abstract/Free Full Text]

    Garat, B., and H. Musto. 2000. Trends of amino acid usage in the proteins from the unicellular parasite Giardia lamblia. Biochem. Biophys. Res. Commun. 279:996–1000.[CrossRef][Web of Science][Medline]

    Geman, S., and D. Geman. 1984. Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6:721–741.

    Geweke, J. 1992. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. Pp. 169–193 in J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, eds. Bayesian statistics 4. Clarendon Press, Oxford.

    Gianola, D., and R. L. Fernando. 1986. Bayesian methods in animal breeding theory. J. Anim. Sci. 63:217–244.[Abstract/Free Full Text]

    Gianola, D., S. Rodriguez-Zas, and G. E. Shook. 1994. The Gibbs sampler in the animal model: a primer. Pp. 47–56 in J. L. Foulley and M. Molenat, eds. Séminaire modele animal. INRA Departament de Genetique Animale, La Colle sur Loup, France.

    Henderson, C. R. 1949. Estimation of changes in herd environment. J. Dairy Sci. 32:706–715.

    ———. 1953. Estimation of variance and covariance components. Biometrics 9:226–252.[CrossRef][Web of Science]

    ———. 1963. Selection index and expected genetic advance. Pp. 141–153 in W. D. Hanson and H. F. Robinson, eds. Statistical genetics and plant breeding. National Academy of Sciences–National Research Council, Washington, D.C.

    Hotelling, H. 1933. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 24:417–441.[CrossRef][Web of Science]

    Housworth, E. A., E. P. Martins, and M. Lynch. 2004. The phylogenetic mixed model. Am. Nat. 163:84–96.[CrossRef][Medline]

    Kreil, D. P., and C. A. Ouzounis. 2001. Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res. 29:1608–1615.[Abstract/Free Full Text]

    Lobry, J. R. 1997. Influence of genomic G+C content on average amino-acid composition of proteins from 59 bacterial species. Gene 205:309–316.[CrossRef][Web of Science][Medline]

    Lynch, M. 1991. Methods for the analysis of comparative data in evolutionary biology. Evolution 45:1065–1080.[CrossRef][Web of Science]

    Martins, E. P. 2003. COMPARE 4.5: statistical analysis of comparative data. Computer programs distributed by the author (http://compare.bio.indiana.edu).

    Martins, E. P., and T. F. Hansen. 1997. Phylogenies and the comparative method: a general approach to incorporating phylogenetic information into the analysis of interspecific data. Am. Nat. 149:646–667[erratum in Am. Nat. 153:448].[CrossRef][Web of Science]

    Naya, H., H. Romero, A. Zavala, B. Alvarez, and H. Musto. 2002. Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J. Mol. Evol. 55:260–264.[CrossRef][Web of Science][Medline]

    Page, R. D. M. 1996. TREEVIEW: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 12:357–358.[Free Full Text]

    Perriere, G., and J. Thioulouse. 2002. Use and misuse of correspondence analysis in codon usage studies. Nucleic Acids Res. 30:4548–4555.[Abstract/Free Full Text]

    Singer, G. A., and D. A. Hickey. 2000. Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol. 17:1581–1588.[Abstract/Free Full Text]

    Sorensen, D., and D. Gianola. 2002. Likelihood, Bayesian, and MCMC methods in quantitative genetics. Springer-Verlag, Berlin.

    Steppan, S. J., D. Houle, and P. C. Phillips. 2002. Comparative quantitative genetics: evolution of the G matrix. Trends Ecol. Evol. 17:320–327.[CrossRef]

    Takeuchi, F., Y. Futamura, H. Yoshikura, and K. Yamamoto. 2003. Statistics of trinucleotides in coding sequences and evolution. J. Theor. Biol. 222:139–149.[CrossRef][Web of Science][Medline]

    Tekaia, F., E. Yeramian, and B. Dujon. 2002. Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 297:51–60.[CrossRef][Web of Science][Medline]

    Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.[Abstract/Free Full Text]

    Wang, C. S., J. J. Rutledge, and D. Gianola. 1994. Bayesian analysis of mixed linear models via Gibbs sampling with an application to litter size in Iberian pigs. Genet. Sel. Evol. 26:91–115.[CrossRef][Web of Science]

    Zavala, A., H. Naya, H. Romero, and H. Musto. 2002. Trends in codon and amino acid usage in thermotoga maritima. J. Mol. Evol. 54:563–568.[CrossRef][Web of Science][Medline]

Accepted for publication September 13, 2005.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
23/1/203    most recent
msj023v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Naya, H.
Right arrow Articles by Musto, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Naya, H.
Right arrow Articles by Musto, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?