Skip Navigation


MBE Advance Access originally published online on December 5, 2006
Molecular Biology and Evolution 2007 24(2):611-619; doi:10.1093/molbev/msl190
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrowOA All Versions of this Article:
24/2/611    most recent
msl190v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Jørgensen, F. G.
Right arrow Articles by Clark, A. G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Jørgensen, F. G.
Right arrow Articles by Clark, A. G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2006 The Authors
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (
http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Research Articles

Heterogeneity in Regional GC Content and Differential Usage of Codons and Amino Acids in GC-Poor and GC-Rich Regions of the Genome of Apis mellifera

Frank Grønlund Jørgensen*, Mikkel Heide Schierup* and Andrew G. Clark{dagger}

* Bioinformatics Research Center, Department of Genetics and Ecology, University of Aarhus, Aarhus, Denmark
{dagger} Department of Molecular Biology and Genetics, Cornell University

E-mail: frank{at}birc.au.dk.


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
The honeybee (Apis mellifera) has a genome with a wide variation in GC content showing 2 clear modal GC values, in some ways reminiscent of an isochore-like structure. To gain insight into causes and consequences of this pattern, we used a comparative approach to study the genome-wide alignment of primarily coding sequence of A. mellifera with Drosophila melanogaster and Anopheles gambiae. The latter 2 species show a higher average GC content than A. mellifera and no indications of bimodality, suggesting that the GC-poor mode is a derived condition in honeybee. In A. mellifera, synonymous sites of genes generally adopt the GC content of the region in which they reside. A large proportion of genes in GC-poor regions have not been assigned to the honeybee assembly because of the low sequence complexity of their genome neighborhood. The synonymous substitution rate between A. mellifera and the other species is very close to saturation, but analyses of nonsynonymous substitutions as well as amino acid substitutions indicate that the GC-poor regions are not evolving faster than the GC-rich regions. We describe the codon usage and amino acid usage and show that they are remarkably heterogeneous within the honeybee genome between the 2 different GC regions. Specifically, the genes located in GC-poor regions show a much larger deviation in both codon usage bias and amino acid usage from the Dipterans than the genes located in the GC-rich regions.

Key Words: Apis mellifera • GC content • isochore • codon bias • amino acids


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
Genome-wide sequencing at high coverage of several Drosophila species (Adams et al. 2000Go; Richards et al. 2005Go), the mosquito Anopheles gambiae (Holt et al. 2002Go), the silkworm Bombyx mori (Xia et al. 2004Go), and finally the honeybee Apis mellifera (Honeybee Genome Sequencing Consortium 2006Go) presents us with an unprecedented opportunity to investigate the evolutionary forces underlying genome-wide patterns of substitution, rearrangements, codon, and amino acid usage and the way they change over time in insects.

The honeybee is a member of the Hymenoptera order, one of 11 orders of holometabolous insects. All other insect species sequenced thus far at relatively high coverage belong to one of the following 3 orders: Diptera (Drosophila sp. and Anopheles sp.), Lepidoptera (B. mori), or Coleoptera (Tribolium castaneum). The exact phylogenetic relationships between these 3 orders and Hymenoptera have been hard to resolve, but recently large genome-scale investigations have provided evidence for a basal position of Hymenoptera (Savard et al. 2006Go). Combining the new evidence for a basal position of Hymenoptera among the holometabolous insects sequenced thus far with divergence times estimated using a molecular clock, it is believed that the most recent common ancestor of Diptera and Hymenoptera existed, approximately, 330–350 MYA (Gaunt and Miles 2002Go; Savard et al. 2006Go).

Analysis of the complete draft sequence of the honeybee genome has revealed several unique features compared with the other insect genomes sequenced to date. Most relevant to our study are the following 3 observations: an extremely low overall GC content, a relatively slow rate of molecular evolution, and a high frequency of CpG dinucleotides (Honeybee Genome Sequencing Consortium 2006Go).

One of the very basic differences between the genomes of amniotes and most other groups of multicellular organisms is the occurrence of relatively homogeneous GC-rich and GC-poor stretches of DNA, often referred to as isochores (Bernardi et al. 1985Go). Even though the occurrence of isochores has been known for several decades, there are still several controversies regarding their origin, evolution, and extent (Eyre-Walker and Hurst 2001Go; Clay and Bernardi 2005Go; Cohen et al. 2005Go; Duret et al. 2006Go). Although the origin and evolution of isochores is still a controversial subject, the existence of clustering of high-GC and low-GC regions within the genomes of mammals and birds is generally accepted. In insects such as Drosophila melanogaster, the heterogeneity in GC content appears to be far less evident. Initial analyses of the honeybee genome revealed much greater AT richness compared with D. melanogaster. Furthermore, there were large regions of homogeneous GC content varying from low- to high-GC content along the genome, resembling isochore structures (Honeybee Genome Sequencing Consortium 2006Go).

A comparative analysis of relatively distantly related species can be more reliable if we restrict the analysis to coding regions. In coding regions there is less ambiguity about alignment, and comparison of synonymous and nonsynonymous sites allows an a priori discrimination between functional classes. Here we examine patterns of GC content variation and the effects it has on codon usage bias and amino acid usage in greater detail by comparing the coding genome of the honeybee with the coding genomes of 2 dipteran genomes, D. melanogaster and An. gambiae.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
Generation of Data
The multi-Z alignments (Blanchette et al. 2004Go) of the 8 insect species: D. melanogaster (dm2), Drosophila simulans (droSim1), Drosophila yakuba (droYak1), Drosophila ananassae (droAna1), Drosophila pseudoobscura (dp3), Drosophila virilis (droVir1), Drosophila mojavensis (droMoj1), An. gambiae (anoGam1), and A. mellifera (apiMel1) were downloaded from the UCSC genome browser (Kent et al. 2002Go). These multi-Z alignments were screened for alignments with at least 90 aligned basepairs annotated as coding sequence from the 3 species: D. melanogaster, An. gambiae, and A. mellifera. All other alignments and sequences from the remaining 5 insect species were discarded. This resulted in a total of 22,541 alignments with 6,198,531 bp (average alignment length ~275 bp) that were used in the subsequent analyses.

Annotation of Data
Using the refgene file for the dm2 version of D. melanogaster annotations, alignments were annotated as 3' untranslated regions (UTR), 5' UTR, exons, introns, or intergenic. For exons, the reading frame of A. mellifera was inferred from the reading frame of D. melanogaster. All annotated reading frames from the refgene file in D. melanogaster were checked for start and stop codons. We found a few cases where the reading frame had shifted 1 or 2 bases to the left or right. These were all corrected before the annotation was done, so as to fulfill the criteria that all annotated genes have a start and a termination codon and no internal stop codons.

Estimation of GC Content and Test for Bimodality in Third Codon Position GC Content
The GC content of a given fragment was estimated as the proportion of Gs and Cs separately for each codon position of each alignment. To test whether the apparent bimodality in honeybee in the third codon position was significant, we fitted 2 different models to our data set: (M1) a beta distribution with 2 parameters (a and b) and (M2) a mixture of 2 beta distributions with 5 parameters (p, a1, b1, a2, and b2), where p is the percentage of sequences that belong to the first beta distribution with parameters (a1, and b1) and (1 – p) is the percentage of sequences that belong to the second beta distribution with parameters (a2 and b2). Both models were fitted using the nlminb linear optimization method implemented in the statistical program package R (R Development Core Team 2005). A standard likelihood ratio test of twice the difference in log likelihood of the 2 models M1 and M2 with 3 degrees of freedom (df) was used to compare the 2 models. The 2 probability densities specified by the best-fitting model 2 were used to divide the data into 2 different classes based on the GC content of third codon position in the honeybee sequence.

Codon Usage at 4-Fold Degenerate Amino Acids in Low- and High-GC Areas
From the GC content of the third codon position, all alignments were divided into 2 classes representing the 2 modes in the bimodal distribution: low-GC regions and high-GC regions. All positions where one of the 4-fold degenerate amino acids (glycine, proline, valine, threonine, and alanine) or one of the 6-fold degenerate amino acids (leucine, arginine, and serine) have been conserved between D. melanogaster and A. mellifera were extracted for codon usage bias analyses.

Inference of Equilibrium GC Content under Inferred Substitution Models
The 5-codon substitution matrices that are estimated for the 4-fold degenerate amino acids were transformed into relative rate matrices QAA | AA = (glycine, proline, valine, threonine, and alanine) that satisfy the condition that each row sums to 1: Ri->A + Ri->C + Ri->G + Ri->T = 1, where i is one of the 4 nucleotides (A, C, G, and T). Each index in the matrix represents a relative rate of change from one nucleotide to another:

Formula

These rate matrices were multiplied with the observed nucleotide frequencies represented by the vector NAA, giving the actual counts of the third codon position nucleotide in each of the five 4-fold degenerate amino acids:

Formula

The dot product N'AAQ'AA=N'AA, where N'AA represents the vector of codon frequencies after one time step of evolution following the substitution rates of QAA. The total number of codons in N'AA will be equal to the total current number of codons in NAA. This procedure was iterated using the new nucleotide count matrix N'AA until equilibrium was reached. The resulting codon distributions and GC content at equilibrium were determined and current deviations from equilibrium were evaluated separately for each amino acid in a {chi}2 distribution with 3 df.

Differential Amino Acid Usage in High– and Low–GC Content Regions
All aligned amino acids where we have an unambiguous amino acid (no "?" or "-") for all 3 species were counted and divided into the 2 groups determined by the GC content of the third codon position in honeybee. The average GC content of all codons encoding each amino acid was estimated using an assumption of equal synonymous codon frequencies within each amino acid.

Estimation of Evolutionary Distances and Correlation with GC Content
For each 3-species alignment we estimated the 3 different pairwise p distances (DrosophilaApis, DrosophilaAnopheles, and AnophelesApis) for the 3 different codon positions as well as for the encoded amino acids.

Correlation between Third Codon Position GC Content and Regional GC Content
The regional GC content was estimated by dividing the genome into 20-kb windows with 10-kb overlap. The GC content for each region was estimated as a simple count. Unknown nucleotides (N) in the sequence were skipped so the GC content was based on fewer bases in those regions. If any region has less than 15 kb of unambiguous nucleotides, we define the GC content to be missing. The position of each sequence on the honeybee genome was determined by blasting the sequence against the genome using Megablast (Zhang et al. 2000Go) with the following settings: word size (w) = 24, minimum e value (e) = 0.000001. The best hit with the highest score was used, but if a hit could not be found or if the regional GC content was missing the alignment was discarded in this analysis.


    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
The overall genomic average GC content of the different autosomes of the 3 species studied here are 33–39% (range over 16 linkage groups) for A. mellifera, 42–43% for D. melanogaster, and 44–45% for An. gambiae, so the clear outlier is the honeybee with its exceptionally low average GC content.

Figure 1 shows the distributions of GC contents of each alignment of the 3 species for each codon position separately. The assembly of the honeybee genome contains sequences that did not map unambiguously to a known linkage group, and these are currently just labeled the "unknown group" (Honeybee Genome Sequencing Consortium 2006Go). We have decided to split the data of the third codon position into 2 groups, 1 containing all the sequences that have already been assembled and placed onto one of the 16 honeybee linkage groups and the other containing the remaining sequences that have not so far been placed onto any linkage groups; please see the Discussion for a possible explanation of this interesting pattern. Unless otherwise specified, the remaining analyses in this paper deals with the entire alignment data set, but we have also conducted similar analyses on the 2 different subsets of the data and the results all appear to be robust across subsets of the data. Looking at the sequences that map to a known linkage group, we see that the GC content distribution of third codon position in honeybee has a clear bimodal shape, not observed in the 2 dipteran species. Adding the sequences from the unknown group does not change this observation but makes the data in the low-GC mode more abundant.


Figure 1
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Histograms of GC content distribution of the 3 different codon positions. Apis mellifera (top), Drosophila melanogaster (middle), and Anopheles gambiae (bottom). The third codon position is divided into 2 different categories of roughly equal sizes depending on whether the sequence in A. mellifera is assigned to one of the 16 linkage groups or not (unknown). For the first and second codon position the distributions of the 2 different categories are so similar that the 2 subsets of the data are concatenated and their total is shown.

 
A likelihood ratio test with 3 df shows that a mixture model (M2) of 2 beta distributions fit the data significantly better than a single beta distribution (M1) (P < 10–100). The best fit of the M2 mixture model has the following 5 parameters: p = 0.479, a1 = 5.528, b1 = 22.504, a2 = 4.041, and b2 = 3.063. A scaled histogram of our data and the density function of the fitted M2 model as well as the 2 individual beta distributions that make up the mixture model are shown in figure 2. The intersection of the 2 individual beta distributions is approximately 33% GC content. Excluding the unknown linkage groups from the data results in a similar mixture model with a slightly higher intersection around 37% GC. This slightly higher intersection point is mainly due to a wider variance on the low mode in the smaller data set. We conducted the following analysis using GC-cutoff values of 30%, 33%, 35%, 37%, and 40% and because the results are highly similar, we have chosen only to present results using a cutoff of 33% GC determined by the intersection of the 2 individual beta distributions in M2 based on the full data set. The results using the other GC values can be obtained from the authors on request.


Figure 2
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Density plot of the GC content of the third codon position. The histogram shows the distribution of sequences with different GC contents. The solid line represents the density function of the best fit of the 2-component beta mixture model M2. The dotted line shows the 2 individual beta distributions that make up the mixture model.

 
We estimated the 3 pairwise p distances in the 3-species unrooted tree (D. melanogaster, An. gambiae, and A. mellifera) at the first, second, and third codon positions separately for each of the 2 GC content classes (table 1). Table 1 also includes the pairwise p distances estimated at the amino acid level. The third codon position is highly saturated in all comparisons and therefore not very informative. The first and second codon positions as well as the amino acids are not highly saturated, and from that we see that the low-GC region does not show a higher substitution rate in honeybee.


View this table:
[in this window]
[in a new window]

 
Table 1. Pairwise p Distances in Low- and High-GC Regions

 
Using the estimated p distances for the individual alignments that have been summarized in table 1, we calculated the correlation between the evolutionary rate of the honeybee in an alignment and the GC content of the honeybee in the corresponding alignment. The mean GC-content of the alignments for each of the three species in low and high GC-regions respectively is summarized in table 2. There are no strongly significant correlations between the rate of evolution and the GC content of the honeybee sequence based on any of the 3 types of data: first codon position, second codon position, and amino acids (Pearson product–moment correlation: cor = –0.01, P = 0.1198; cor = 0.03, P = 1.5 x 10–6; and cor = 0.02, P = 0.0207), respectively.


View this table:
[in this window]
[in a new window]

 
Table 2. GC Contents of the 3 Species

 
To investigate the consequences of the bimodality seen in the GC distributions all 4-fold degenerate and 6-fold degenerate amino acids, where the amino acid is conserved among D. melanogaster and A. mellifera, were extracted and divided into the 2 logically exclusive categories based on the GC content of the honeybee sequence they came from (low GC < 33% and high GC ≥ 33%). For each 4-fold amino acid, we have a 4 x 4 matrix showing the minimum number of substitutions that are required to explain the data. Table 3 shows 10 matrices for the 5 different amino acids for low- and high-GC content, respectively. For the low–GC content sequences, it is clear that all five 4-fold degenerate amino acids show a similar pattern, a generally very strong change in usage from GC-rich codons in Drosophila to AT-rich codons in honeybee. In total, we observe 114,781 changes from GC-rich codons in Drosophila to AT-rich codons in honeybee compared with 9,048 changes in the other direction, a ratio of approximately 12.7. For the high–GC content sequences the trend is in the same direction, but the pattern is much less clear with 41,653 and 28,541 changes, respectively, which results in a ratio of approximately 1.5. Even though the trend is not as strong as in the low-GC regions, it is still highly significant (chi-square = 2,449, P < 10–6).


View this table:
[in this window]
[in a new window]

 
Table 3. Codon Substitution Matrices for 4-Fold Degenerate Amino Acids

 
Table 4 shows the 6 corresponding matrices for the three 6-fold degenerate amino acids. A pattern similar to that seen in the 4-fold degenerate sites is seen in the "4-fold" part of all three 6-fold degenerate amino acids. The amino acids located in a low-GC region generally use the codons with A or T in 3rd position. For leucine and arginine, only 1 mutation is necessary to go from the 4-fold codons to the "2-fold" codons (see table 4), whereas serine requires a mutation in both first and second codon position to change from the 4-fold codons to the 2-fold codons. Because it requires 2 mutations, we do not expect many changes from a 4-fold codon to a 2-fold codon in serine, which agrees well with what we observe. Interestingly, the most extreme ratio of changes between 2 synonymous codons is found in the low-GC region in leucine in a case where it requires 2 mutations (CTG {leftrightarrow} TTA). We observe a ratio of 16,971/84 an approximately 200-fold greater tendency for TTA in A. mellifera and CTG in D. melanogaster compared with the one having CTG in A. mellifera and TTA in D. melanogaster. In contrast, the ratio seen in the high-GC regions is "only" 2,902/373, approximately 8. Another interesting observation is that the first-position changes between synonymous codons in leucine and arginine on average show a much stronger bias toward the AT-rich first-position codons in honeybee in the low-GC regions compared with the high-GC regions; 15.7 versus 4.0 in leucine and 6.4 versus 3.6 in arginine, respectively.


View this table:
[in this window]
[in a new window]

 
Table 4. Codon Substitution Matrices for 6-Fold Degenerate Amino Acids

 
The counts of substitutions presented in table 3 allow estimates of relative rates of substitutions, and from these we can model how the base composition will change under each of the 4 x 4 substitution matrices over time. We use these substitution matrices to find the equilibrium codon frequencies for each of the five 4-fold degenerate amino acids. The resulting third codon position GC content at equilibrium and the current deviation from that for each of the 5 amino acids at both high- and low-GC content is shown in table 5. We find that the high-GC regions seem to be evolving to a GC content fairly similar to what is observed, whereas the low-GC regions are evolving to a GC content only slightly lower. That is, high-GC regions appear to be closer to equilibrium (F-test [total]: 19.00, P = 0.019). In the regions of low-GC content, there remains a pronounced trend for the substitution matrices to drive the GC content to be lower still. Thus, it is the low-GC regions that appear to be in the process of the greatest change. We note that proline does not show this tendency: in fact, for proline the low–GC content region is actually a bit closer to equilibrium than the high–GC content region. This could in part be due to sampling because of the relative rarity of proline.


View this table:
[in this window]
[in a new window]

 
Table 5. Deviation from Equilibrium Codon Frequencies and GC Content

 
The amino acid alignments were used to study the differential usage of amino acids in the low- and high-GC regions of A. mellifera compared with the average of the 2 dipterans. The result of this analysis is shown in figure 3, which shows a strong tendency in the low-GC regions to use more GC-poor amino acids and less GC-rich amino acids. In the high-GC regions in A. mellifera this trend is much less pronounced, resulting in a significantly different amino acid usage in the 2 different regions of the genome of A. mellifera.


Figure 3
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Differential amino acid usage in low- and high-GC regions. The bar plot shows the usage frequency of the 20 different amino acids in low-GC regions (black bars) and high-GC regions (gray bars), based on the third codon position of the Apis mellifera sequence. The amino acids are listed in order of descending GC content estimated as the mean GC frequency of all possible codons for a given amino acid. Equal codon frequencies within each amino acid are assumed when calculating the mean amino acid–GC content. The deviation is simply the percentage gain or loss compared with the mean of the 2 dipteran species.

 
Dividing the honeybee genome into windows of 20 kb with 10-kb overlap, we mapped each individual sequence onto its specific region and measured the correlation between the 20-kb regional GC content and the GC content of the third codon position of the given honeybee sequence; this plot is shown in figure 4. The correlation is highly significant and positive (Pearson product–moment correlation: cor = 0.68, P < 10–6). Using only 4-fold degenerate sites, where the encoded amino acid is conserved between D. melanogaster and A. mellifera, or using different regional window sizes yields very similar results (data not shown).


Figure 4
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Correlation between third codon position and regional GC content. Third codon position GC content is estimated from the honeybee sequence for each alignment and mapped to a specific region in the linkage groups. The regional GC content is estimated for 20-kb sliding windows with 10-kb overlap along the linkage groups.

 

    Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
One of the major problems when comparing very distantly related species is that only short fragments can be aligned with high confidence. The last common ancestor of A. mellifera and D. melanogaster probably existed at least 300 MYA, sufficiently long ago that the substitution process at all neutrally evolving sites is very close to saturation (Gaunt and Miles 2002Go; Savard et al. 2006Go). For this reason, we focused on coding regions where the more slowly evolving first and second codon positions provide us with a trustworthy framework for alignments. These annotated alignments will provide us with a reasonably neutrally evolving set of aligned nucleotides in the third codon position. Using third codon position instead of intronic or intergenic sequences reduces the risk of analyzing nonorthologous alignments. On the other hand, by treating third codon positions as neutrally evolving sites, we introduce other biases that might be hard to control. For one thing, it is known that D. melanogaster and many other species show strong biases in synonymous codon usage, making the evolution of the third codon position constrained even in 4-fold degenerate amino acids (Shields et al. 1988Go; Moriyama and Hartl 1993Go). Whether this is also the case in A. mellifera is to our knowledge still not well known. It is important to remember that only 76% of all substitutions possible in the third codon position are synonymous. Although it is worth noting that all amino acids, except tryptophan and methionine, which are only encoded by 1 codon, have the possibility of being encoded by a codon with a G/C or an A/T nucleotide in the third position. So, even though the third position is not evolving perfectly neutrally, the choice of using third codon positions to study the evolution of GC content does not seem unreasonable to us.

Random shotgun sequences of whole genomes generally produce a residual set of scaffolds that may have many sequence reads in the assembly, but they fail to connect to any known gene or other mappable feature of the genome. These unmapped scaffolds are often enriched for sequences embedded in heterochromatin (Adams et al. 2000Go), and given the GC poorness of most heterochromatic repeats, the unmapped scaffolds tend to be GC poor. The third codon positions of coding sequence in the unmapped contigs in A. mellifera have a very low GC content. Part of the reason for this is most likely that a very high or low GC content will result in a relatively lower sequence complexity, which might make it hard to uniquely place such a sequence on a specific place in the genome. Because the third codon positions were found to be a good indicator of the regional GC content, these coding sequences should probably be placed on the honeybee genome in regions with relatively low sequence complexity. We have very few trustworthy alignments of intronic and intergenic regions because reasonable alignments could usually only be produced if they had a clear and unique overlap with a coding region, but the shape of the GC content distribution of these 2 noncoding regions resembles the distribution of the unknown third codon position fairly well.

This leaves us to explain why such a markedly bimodal distribution of GC content is seen only in the honeybee genome. One peak of the bimodal distribution is approximately equal in GC content to that found in the 2 dipterans, whereas the new peak has markedly lower GC content. One explanation could be that the honeybee generally experiences a very strong mutational bias toward A/T nucleotides. If selection is maintaining the GC content of third codon positions through a very strong codon usage bias in some genes or regions of the genome, we would expect the evolutionary rate in these regions to be significantly slower than other regions. We have no evidence that this is the case. On the contrary, table 1 actually indicates that the low-GC regions evolve a bit slower than the high-GC regions. The reason for this is not obvious, but we can speculate that it might be because stronger negative selection has acted on the regions that are changing in GC content and therefore these changes more often result in different amino acids. The observation that a significantly different distribution of amino acids is now being used in the low-GC regions compared with the high-GC regions supports this idea (fig. 3).

Another explanation of the bimodality in GC content we observe could be that for some currently unknown reason 2 markedly different mutational patterns exist in different regions of the genome. In some regions, a very strong bias toward A/T mutations makes this part of the genome extremely AT rich, whereas an unbiased mutation pattern exists in the other parts of the genome. If this were the case, we would expect to see the following 4 things: 1) a change in synonymous codon usage proportional to the change in GC content in that respective region, 2) a significant shift in amino acid usage to more GC-poor amino acids (depending on the magnitude of the mutation bias and the strength of the selective pressure on conserving the protein structure), 3) third codon position GC content resembling the GC content of the surrounding intergenic regions, and 4) a greater difference between flies and honeybee in GC content in the first codon position compared with the second codon position (because the first codon position changes are not all nonsynonymous). We also expect the AT bias to be significantly more pronounced in the low-GC regions compared with the high-GC regions.

We observe all 4 attributes in our data. Figure 1 shows a significantly lower GC content in the honeybee when looking at the first codon position compared with that of the second codon position. From table 4, we observe that the AT bias is much stronger in the low-GC regions as expected. Furthermore, figure 4 shows a very strong correlation between the regional GC content and third codon position GC content. This coupled with the very strong codon usage bias and amino acid bias we see in the low-GC regions indicates that the GC content of the third codon position could merely be a product of the mutational processes acting on the specific regions the genes reside in and that the very strong codon usage bias and amino acid bias we see in the low-GC regions are simply a consequence of an extremely high A/T bias in the mutation process. This mutation process might simply be so strong that it overwhelms the negative selection that acts to conserve the amino acid composition and the codon usage bias in the different proteins.

In sum, our analysis suggests that 2 very different mutational processes could be acting in different regions of the honeybee genome. One has a pattern of substitutions similar to that in D. melanogaster and An. gambiae. The other, which appears to be unique to honeybee, is driving a portion of the genome to decreased GC content resulting in a very different usage of synonymous codons and in the amino acids used in the encoded proteins in those regions. Both the low- and high-GC regions seem to evolve to be more AT rich, but the low-GC regions seem to be further from equilibrium. The reason for this is possibly the very rigid division into low- and high-GC regions used in this study. Looking at figure 2, it is easy to see that the 2 different beta distributions used to model the different modes in GC content have a rather large overlapping region. Therefore, reducing this complex bimodal distribution to a dichotomous partitioning may be an artificially simple characterization. We have conducted these analyses using different GC-cutoff values and even using a trinomial splitting where the middle region is not used in the analyses. The results are very similar to those reported here, so we do not expect these uncertainties to bias our conclusions.

The consequences of this apparent heterogeneity in mutation processes acting on the A. mellifera genome are still poorly understood, but our analysis indicates that it has had a significant effect on both codon and amino acid usage in the low-GC regions. Given the regional nature of the shift in the substitution process, and the fact that it appears so strong as to change the amino acid usage of a large fraction of the proteins caught in these regions, it seems likely that the change is systemic, changing a process of substitution rather than a novel pressure acting independently on many gene regions. Although the evidence is so far indirect, it appears as though there is a regional shift in the substitution process due to a shift in the underlying mutation process. At the moment we have no direct evidence to explain the underlying mechanism, but one can speculate that this may be due to regional recruitment of an altered DNA polymerase, a regionally maintained chromatin state, and/or occupancy of chromatin-binding proteins. More studies of possible mechanisms driving the GC content shift are certainly warranted, and further analysis will be possible when additional insect genome sequences become available.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 
We gratefully acknowledge the Honeybee Genome Sequencing Consortium for generating a valuable set of data and analysis resources and making it available to us. We would like to thank Bo Eskerod Madsen for many useful discussions throughout the duration of this study as well as 2 anonymous reviewers for several useful comments and suggestions. Chris Elsik, Eran Elhaik, and Dan Graur did much of the GC content analysis for the honeybee genome paper and have a manuscript of their own under review. Only the vagaries of publication timing allowed our paper to appear in print before theirs.

FGJ acknowledges funding from the Faculty of Science, University of Aarhus. MHS was supported by grant no. 2052-01-0032 from the Danish Agricultural Science Research Council. AGC was supported by grant no. NIH R01 GM64590 Funding for the Open Access publication was provided by these grants.


    Footnotes
 
Takashi Gojobori, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Acknowledgements
 References
 

    Adams MD, Celniker SE, Holt RA, et al. (195 co-authors). (2000) The genome sequence of Drosophila melanogaster. Science 287:2185–2195.[Abstract/Free Full Text]

    Bernardi G, Olofsson B, Filipski J, Zerial M, Salinas J, Cuny G, Meunier-Rotival M, Rodier F. (1985) The mosaic genome of warm-blooded vertebrates. Science 228:953–958.[Abstract/Free Full Text]

    Blanchette M, Kent WJ, Riemer C, et al. (12 co-authors). (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14:708–715.[Abstract/Free Full Text]

    Clay O and Bernardi G. (2005) How not to search for isochores: a reply to Cohen et al. Mol Biol Evol 22:2315–2317.[Abstract/Free Full Text]

    Cohen N, Dagan T, Stone L, Graur D. (2005) GC composition of the human genome: in search of isochores. Mol Biol Evol 22:1260–1272.[Abstract/Free Full Text]

    Duret L, Eyre-Walker A, Galtier N. (2006) A new perspective on isochore evolution. Gene 385:71–74.[CrossRef][ISI][Medline]

    Eyre-Walker A and Hurst LD. (2001) The evolution of isochores. Nat Rev Genet 2:549–555.[CrossRef][ISI][Medline]

    Gaunt MW and Miles MA. (2002) An insect molecular clock dates the origin of the insects and accords with palaeontological and biogeographic landmarks. Mol Biol Evol 19:748–761.[Abstract/Free Full Text]

    Holt RA, Subramanian GM, Halpern A, et al. (123 co-authors). (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science 298:129–149.[Abstract/Free Full Text]

    Honeybee Genome Sequencing Consortium. (2006) Insights into social insects from the genome of the honeybee Apis mellifera. Nature 443:931–949.[CrossRef][Medline]

    Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. (2002) The human genome browser at UCSC. Genome Res 12:996–1006.[Abstract/Free Full Text]

    Moriyama EN and Hartl DL. (1993) Codon usage bias and base composition of nuclear genes in Drosophila. Genetics 134:847–858.[Abstract]

    R Development Core Team. (2006) R: A Language and Environment for Statistical Computing [Internet] R Foundation for Statistical Computing, Vienna, Austria Available from: http://www.R-project.org.

    Richards S, Liu Y, Bettencourt BR, et al. (52 co-authors). (2005) Comparative genome sequencing of Drosophila pseudoobscura: chromosomal, gene, and cis-element evolution. Genome Res 15:1–18.[Abstract/Free Full Text]

    Savard J, Tautz D, Richards S, Weinstock GM, Gibbs RA, Werren JH, Tettelin H, Lercher MJ. (2006) Phylogenomic analysis reveals bees and wasps (Hymenoptera) at the base of the radiation of Holometabolous insects. Genome Res 16:1334–1338.[Abstract/Free Full Text]

    Shields DC, Sharp PM, Higgins DG, Wright F. (1988) "Silent" sites in Drosophila genes are not neutral: evidence of selection among synonymous codons. Mol Biol Evol 5:704–716.[Abstract]

    Xia Q, Zhou Z, Lu C, et al. (83 co-authors). (2004) A draft sequence for the genome of the domesticated silkworm (Bombyx mori). Science 306:1937–1940.[Abstract/Free Full Text]

    Zhang Z, Schwartz S, Wagner L, Miller W. (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol 7:203–214.[CrossRef][ISI][Medline]

Accepted for publication November 27, 2006.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Proc. Natl. Acad. Sci. USAHome page
A. Zayed and C. W. Whitfield
A genome-wide signature of positive selection in ancient and recent invasive expansions of the honey bee Apis mellifera
PNAS, March 4, 2008; 105(9): 3421 - 3426.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrowOA All Versions of this Article:
24/2/611    most recent
msl190v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Jørgensen, F. G.
Right arrow Articles by Clark, A. G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Jørgensen, F. G.
Right arrow Articles by Clark, A. G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?