Molecular Biology and Evolution 18:982-986 (2001)
© 2001 Society for Molecular Biology and Evolution
ARTICLE |
Synonymous Codon Bias Is Not Caused by Mutation Bias in G+C-Rich Genes in Humans
Centre for the Study of Evolution and School of Biological Sciences, University of Sussex, Brighton, England
| Abstract |
|---|
It has been suggested that synonymous codon bias is a consequence of mutation bias in mammals. We tested this hypothesis in humans using single-nucleotide polymorphism data. We found a pattern of polymorphism which was inconsistent with the mutation bias hypothesis in G+C-rich genes. However, the data were consistent with the action of natural selection or biased gene conversion. Similar patterns of polymorphism were also observed in noncoding DNA, suggesting that natural selection or biased gene conversion may affect large tracts of the human genome.
| Introduction |
|---|
It is well established that selection acts on synonymous codon use in many groups of organisms, including bacteria, fungi, and insects (Sharp et al. 1992
We can test whether synonymous codon bias is caused by mutation bias using population genetic data. Let u be the mutation rate from G : C base pairs to A : T base pairs, and let v be the mutation rate in the opposite direction. If mutation rates are low (i.e., $N_e u \ll 1$ and $N_e v \ll 1$, where $N_e$ is the effective population size) and constant, and no other evolutionary forces affect base composition, then the equilibrium frequency of G : C base pairs in a sequence is $f = v/(v + u)$ (Sueoka 1962). Therefore, the probability that we will observe an A or T mutation segregating at a site which was ancestrally G or C, henceforth referred to as a GC→AT mutation, is $M_{GC \to AT} = f u H(n)$, where $H(n)$ is the probability of observing a neutral mutation in a sample of n sequences, and the probability of observing a G or C mutation at a site which was ancestrally A or T, henceforth referred to as an AT→GC mutation, is $M_{AT \to GC} = (1 - f) v H(n)$. It is not difficult to show, by substituting $f = v/(v + u)$, that $M_{GC \to AT} = M_{AT \to GC} = u v H(n)/(u + v)$; i.e., the number of AT→GC mutations segregating in a sample is expected to be equal to the number of GC→AT mutations if mutation bias is the sole cause of synonymous codon bias (Eyre-Walker 1997, 1999).
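The equality of the two expectations can be checked numerically. The short Python sketch below is an illustration added here, not part of the original analysis; the function name and the example mutation rates are hypothetical, and `h_n` is only a placeholder for H(n), whose value cancels when the two directions are compared.

```python
def neutral_expectations(u: float, v: float, h_n: float = 1.0):
    """Expected segregating mutations under mutation bias alone.

    u   : mutation rate from G:C to A:T base pairs
    v   : mutation rate from A:T to G:C base pairs
    h_n : placeholder for H(n); any positive value gives the same comparison
    """
    f = v / (v + u)                    # equilibrium frequency of G:C base pairs
    m_gc_to_at = f * u * h_n           # M_{GC->AT} = f u H(n)
    m_at_to_gc = (1.0 - f) * v * h_n   # M_{AT->GC} = (1 - f) v H(n)
    return f, m_gc_to_at, m_at_to_gc


if __name__ == "__main__":
    # Illustrative (made-up) rates: unbiased, AT-biased, and GC-biased mutation.
    for u, v in [(1e-8, 1e-8), (3e-8, 1e-8), (1e-8, 4e-8)]:
        f, m1, m2 = neutral_expectations(u, v)
        print(f"u={u:g} v={v:g} f={f:.3f} M_GC->AT={m1:.3e} M_AT->GC={m2:.3e}")
```

Whatever the bias between u and v, the two expectations agree, which is why an excess in either direction is evidence against the mutation bias hypothesis.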
A recent analysis showed that there were more GC→AT mutations than AT→GC mutations segregating at synonymous sites in mammalian MHC genes, suggesting that mutation bias was not solely responsible for synonymous codon bias (Eyre-Walker 1999). However, it was not possible to demonstrate conclusively that the data conformed to the infinite-sites model (the requirement that mutation rates are low), and the results lacked generality, since for each species, all the studied genes came from a small region of a single chromosome.
A large number of single-nucleotide polymorphisms (SNPs) from human protein-coding genes, dispersed throughout the genome, have recently been published (Cargill et al. 1999; Hacia et al. 1999). For many of these SNPs, the corresponding sites have been sequenced in chimpanzees. Since the divergence between humans and chimpanzees is low (~0.015 at fourfold-degenerate synonymous sites; Eyre-Walker and Keightley 1999), the chimpanzee sequence can be used to infer the ancestral state in humans (i.e., whether an SNP segregating X and Y is due to an X→Y or a Y→X mutation). Furthermore, the average nucleotide diversity at fourfold-degenerate sites in human genes is sufficiently low (~0.001; Li and Stadler 1991; Cargill et al. 1999) for the data to conform to the infinite-sites model, even at CpG dinucleotides, which mutate approximately 10–20 times as fast as other sites (Bulmer 1986; Sved and Bird 1990).
In this paper, we test the mutation bias hypothesis (i.e., whether mutation bias is responsible for synonymous codon bias) in humans by analyzing the pattern of polymorphism in synonymous SNPs.
| Materials and Methods |
|---|
Data
SNP data from two recent studies were obtained from their respective websites (http://waldo.wi.mit.edu/cvar_snps/ for Cargill et al. [1999
GC→AT or AT→GC according to the mutation which generated them. We ignored those SNPs at which A/T or G/C were segregating and the few sites for which ancestral state reconstruction was ambiguous (if the chimpanzee site was polymorphic or if the chimpanzee nucleotide differed from both human nucleotides). We obtained the sequence containing each SNP by using the accession number from the Human SNP database (www-genome.wi.mit.edu/SNP/human/index.html) or by using the SNP-flanking sequences in a BLAST search. Annotations in the GenBank sequences allowed us to classify the SNPs into four classes: there were 125 synonymous, 60 intron, 60 3' untranslated region (UTR), and 49 anonymous STS SNPs. All of the synonymous and intron SNPs came from Cargill et al. (1999)
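The classification rule described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the function name, the return labels, and the convention of passing `None` for a polymorphic or missing chimpanzee site are all assumptions made for this example.

```python
from typing import Optional, Tuple

STRONG = {"G", "C"}  # bases of a G:C pair
WEAK = {"A", "T"}    # bases of an A:T pair


def classify_snp(human_alleles: Tuple[str, str],
                 chimp_allele: Optional[str]) -> Optional[str]:
    """Return 'GC->AT', 'AT->GC', or None if the SNP is ignored.

    chimp_allele is None when the chimpanzee site is polymorphic or
    unavailable (a convention assumed here, not taken from the paper).
    """
    a, b = (x.upper() for x in human_alleles)
    if chimp_allele is None or a == b:
        return None
    ancestral = chimp_allele.upper()
    if ancestral not in (a, b):
        return None  # chimpanzee differs from both human nucleotides
    derived = b if ancestral == a else a
    if ancestral in STRONG and derived in WEAK:
        return "GC->AT"
    if ancestral in WEAK and derived in STRONG:
        return "AT->GC"
    return None  # A/T or G/C segregating: not informative about G+C content


if __name__ == "__main__":
    print(classify_snp(("C", "T"), "C"))  # ancestral C, derived T
    print(classify_snp(("A", "T"), "A"))  # ignored
```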
CpG Islands
CpG islands were identified by calculating the expected number of CpGs from the base composition and comparing it with the observed number. If the observed/expected ratio was >50%, the SNP was inferred to be in a CpG island. For the CpG analysis, we used the longest available contiguous sequence. If the sequence length was >600 bp, a sliding-window analysis was performed (window length = 300 bp, step length = 50 bp), and the maximum observed/expected value among windows overlapping the SNP was taken. For shorter contiguous sequences, we took the observed/expected value for the entire sequence.
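A minimal Python sketch of this procedure follows. It is an illustration under stated assumptions rather than the authors' implementation: the expected CpG count is taken to be (number of C × number of G) / length, a common convention that the text does not spell out; the function names are hypothetical; and the fallback for a SNP lying beyond the last full window is a choice made here.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio for one sequence or window."""
    s = seq.upper()
    n_c, n_g = s.count("C"), s.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    expected = n_c * n_g / len(s)  # assumed formula for the expected CpG count
    observed = s.count("CG")
    return observed / expected


def snp_cpg_ratio(seq: str, snp_index: int, window: int = 300,
                  step: int = 50, min_len_for_windows: int = 600) -> float:
    """Maximum observed/expected ratio over windows overlapping the SNP."""
    if len(seq) <= min_len_for_windows:
        return cpg_obs_exp(seq)  # short sequence: use it whole
    ratios = [
        cpg_obs_exp(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
        if start <= snp_index < start + window
    ]
    if not ratios:
        # Assumption: a SNP beyond the last full window uses the final `window` bp.
        ratios.append(cpg_obs_exp(seq[-window:]))
    return max(ratios)


def in_cpg_island(seq: str, snp_index: int, threshold: float = 0.5) -> bool:
    """Apply the >50% observed/expected criterion stated in the text."""
    return snp_cpg_ratio(seq, snp_index) > threshold
```

Taking the maximum over overlapping windows makes the call permissive: a SNP counts as lying in a CpG island if any 300-bp window containing it exceeds the threshold.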
| Results and Discussion |
|---|
Synonymous SNPs
There are 125 GC↔AT synonymous SNPs in the data set of Cargill et al. (1999)
GC→AT mutations segregating at synonymous sites (88 GC→AT and 37 AT→GC mutations, P < 0.00001). The excess of GC→AT mutations is particularly evident in those genes which preferentially use G- and C-ending codons (for SNPs in exons with GC3 > 0.6, 60 GC→AT and 11 AT→GC mutations, P < 0.00001) (table 1); there is no evidence of an excess of GC→AT mutations in genes with low GC3.
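Under the mutation bias hypothesis the two classes of mutation are expected in a 1:1 ratio, so counts such as those above can be compared against that expectation. The text does not state which test produced the reported P values; the sketch below assumes an exact two-sided binomial test with success probability 0.5, so its output need not match the published figures exactly. The function name is hypothetical.

```python
from math import comb


def two_sided_binomial_p(n_gc_to_at: int, n_at_to_gc: int) -> float:
    """Exact two-sided binomial P value against a 1:1 expectation.

    With p = 0.5 the distribution is symmetric, so the two-sided value
    is twice the upper tail at the larger of the two counts.
    """
    n = n_gc_to_at + n_at_to_gc
    k = max(n_gc_to_at, n_at_to_gc)
    if 2 * k == n:
        return 1.0  # perfectly balanced counts (or no data)
    tail = sum(comb(n, i) for i in range(k, n + 1))
    return min(1.0, 2 * tail / 2 ** n)


if __name__ == "__main__":
    # Counts quoted in the text: all genes, and exons with GC3 > 0.6.
    for label, gc_at, at_gc in [("all genes", 88, 37), ("GC3 > 0.6", 60, 11)]:
        print(label, two_sided_binomial_p(gc_at, at_gc))
```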
Sampling Bias
While this result would seem to be inconsistent with the mutation bias hypothesis in the G+C-rich genes, there are a number of explanations for the excess of GC→AT mutations which need to be considered: sampling bias, hypermutable sites, and a recent change in the pattern of mutation. It seems unlikely that our results were due to biases in the methods used to detect the SNPs (i.e., ascertainment bias), for several reasons. First, Cargill et al. (1999)
GC→AT mutations to increase with increasing G+C content, as we see in the data (table 1); since under the mutation bias hypothesis we expect equal numbers of GC→AT and AT→GC mutations at all G+C contents, we would therefore expect a similar level of ascertainment bias at all compositional levels. Third, a similar excess of synonymous GC→AT mutations was observed in MHC genes, where the mutations were detected by a different method, direct sequencing (Eyre-Walker 1999
Hypermutation
Hypermutable sites potentially have two effects: they could lead to problems with parsimony, and they could violate the infinite-sites assumption. In each case, if the hypermutable sites had elevated rates of AT→GC mutation, they would tend to generate an excess of GC→AT mutations, as we see in the data. The reasons for this rather counterintuitive behavior are fully discussed elsewhere (Eyre-Walker 1998, 1999). However, three lines of evidence suggest that hypermutable sites were not responsible for the excess of GC→AT mutations we observed. First, we are not aware of any evidence of AT→GC hypermutable sites in mammals; the one well-known class of hypermutable sites, CpG dinucleotides, is expected to cause a bias in the opposite direction to that required to explain the data: CpG dinucleotides generate C→T and G→A transitions at elevated rates, and such mutations will tend to appear as T→C and A→G changes, respectively, in the data (Eyre-Walker 1998, 1999). Second, it is possible to demonstrate that the excess in GC→AT mutations is not due to a problem with parsimony, since we can dispense with the chimpanzee sequence and infer the direction of mutation from the frequencies of the alleles segregating at a site; the rarer allele is assumed to be more recent. This method is unbiased under the null hypothesis (synonymous codon bias is caused by mutation bias) and the infinite-sites assumption (Eyre-Walker 1999). Using allele frequencies, we infer that there have been 65 GC→AT mutations, compared with 37 AT→GC mutations over all genes (P = 0.007) and 45 GC→AT versus 15 AT→GC mutations (P = 0.0001) for genes with GC3 > 0.6; the sample sizes are smaller because frequency data are available for only a subset of the SNPs. Third, the infinite-sites assumption would only be seriously compromised in this context if the rate of mutation were some 100 times as high as the average nucleotide diversity observed (Eyre-Walker 1999), and with that level of hypermutability, we would expect to see an excess of GC→AT substitutions inferred by parsimony (Eyre-Walker 1998) over even short timescales, such as the divergence along the human lineage since we split from chimpanzees (Eyre-Walker and Keightley 1999). In a sample of 28 genes sequenced in humans, chimpanzees, and gorillas (Eyre-Walker and Keightley 1999), there have been identical numbers of GC→AT and AT→GC synonymous substitutions along the human lineage (22 substitutions in each direction inferred by parsimony, 18 GC→AT and 15 AT→GC substitutions for genes with GC3 > 0.6), just as we expect for a sequence of stationary base composition.
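The frequency-based inference used in the second argument above (the rarer allele is assumed to be the more recent) can be sketched as follows. This is an illustrative example only: the function name, the dictionary input format, and the decision to skip sites with equal allele counts are assumptions introduced here.

```python
from typing import Dict, Optional

STRONG = {"G", "C"}  # bases of a G:C pair
WEAK = {"A", "T"}    # bases of an A:T pair


def polarize_by_frequency(allele_counts: Dict[str, int]) -> Optional[str]:
    """Infer mutation direction from sample counts of two alleles.

    The commoner allele is treated as ancestral and the rarer as derived.
    Returns 'GC->AT', 'AT->GC', or None when the site is uninformative.
    """
    if len(allele_counts) != 2:
        return None
    (a, count_a), (b, count_b) = allele_counts.items()
    if count_a == count_b:
        return None  # no rarer allele, so direction cannot be inferred
    ancestral, derived = (a, b) if count_a > count_b else (b, a)
    ancestral, derived = ancestral.upper(), derived.upper()
    if ancestral in STRONG and derived in WEAK:
        return "GC->AT"
    if ancestral in WEAK and derived in STRONG:
        return "AT->GC"
    return None  # A/T or G/C segregating


if __name__ == "__main__":
    print(polarize_by_frequency({"C": 18, "T": 2}))
    print(polarize_by_frequency({"A": 17, "G": 3}))
```

Because this rule needs no outgroup, it sidesteps any parsimony error in the chimpanzee comparison, at the cost of excluding SNPs without frequency data.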
Mutation Pattern
The excess of GC→AT mutations segregating in human SNPs could be the result of a recent change in the mutation pattern from a GC bias to an AT bias, but this seems unlikely for three reasons. First, a change in the mutation pattern would manifest itself as an excess of GC→AT substitutions over AT→GC substitutions unless the change in the mutation pattern had been very recent. As we showed above, there appear to have been similar numbers of GC→AT and AT→GC substitutions along the human lineage since the split from chimpanzees. Second, a dramatic change in the mutation pattern is required to explain the data. For example, there are 18 GC→AT mutations and 4 AT→GC mutations for the SNPs in exons with GC3 between 70% and 80%, and the change in the mutation process needed to cause this pattern would eventually reduce GC3 to ~40% (calculated using eq. 8 in Eyre-Walker [1997]). Third, we would require several independent changes in the mutation pattern in the same direction to explain the excess of GC→AT synonymous polymorphisms in the MHC genes of other mammals (Eyre-Walker 1999).
Selection and Biased Gene Conversion
It therefore seems that mutation bias is not responsible for synonymous codon bias in human genes. However, there are at least two other possibilities: natural selection and biased gene conversion. Biased gene conversion is a process which leads to the biased transmission of alleles; for example, if biased gene conversion is very strong and G+C-biased, 100% of all gametes from a C/T heterozygote will be C. Both selection and biased gene conversion are expected to generate an excess of GC→AT mutations. This can be seen using the following simple argument: Let us imagine there is no mutation bias, and selection has elevated the G+C content of a sequence to 80%. Since there is no mutation bias, 80% of the new mutations will be GC→AT, and 20% will be AT→GC (ignoring G↔C and A↔T mutations). Unfortunately, the situation is more complicated, because selection may affect the probability of detecting a mutation; for example, if directional selection had elevated the G+C content to 80% in the previous example, each GC→AT mutation would be slightly deleterious, while each AT→GC would be slightly advantageous; we would therefore expect to detect the AT→GC mutations more readily, because they would segregate at slightly higher frequencies, on average, than the GC→AT mutations.
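The first step of this argument is simple arithmetic and can be written out directly. The sketch below is an added illustration with a hypothetical function name; it computes only the share of newly arising mutations in each direction and deliberately ignores the differences in detection probability just described.

```python
def fraction_new_gc_to_at(f: float, u: float, v: float) -> float:
    """Share of new GC<->AT mutations that are GC->AT.

    f : current frequency of G:C base pairs
    u : mutation rate from G:C to A:T base pairs
    v : mutation rate from A:T to G:C base pairs
    """
    gc_to_at = f * u          # new mutations arising at G:C sites
    at_to_gc = (1.0 - f) * v  # new mutations arising at A:T sites
    return gc_to_at / (gc_to_at + at_to_gc)


if __name__ == "__main__":
    # No mutation bias (u == v) but G+C content held at 80% by some other force.
    print(fraction_new_gc_to_at(f=0.8, u=1e-8, v=1e-8))
    # Mutation bias alone: f sits at its equilibrium v / (u + v).
    u, v = 1e-8, 4e-8
    print(fraction_new_gc_to_at(f=v / (u + v), u=u, v=v))
```

When base composition is held above its mutational equilibrium, new mutations are predominantly GC→AT; at the mutational equilibrium the two directions balance.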
To demonstrate formally that selection and biased gene conversion are expected to generate an excess of GC→AT mutations, we derived the expected proportion of GC→AT mutations segregating in a sample of sequences, $P_{GC \to AT}$, under two models: a model of weak directional selection, which is equivalent to a model of biased gene conversion (Nagylaki 1983); and a model of strong stabilizing selection. Let f′ (or f″) be the frequency of sites fixed for G : C base pairs, u be the mutation rate from G : C to A : T base pairs, and v be the mutation rate in the opposite direction. We will assume that selection or biased gene conversion favors high G+C. First, consider weak directional selection and biased gene conversion, two processes which can be described by a single parameter s, since they are dynamically identical (Nagylaki 1983). Under semidominant directional selection, s is the strength of selection in favor of G : C base pairs, and under biased gene conversion, s is the strength of biased gene conversion, where (s + 1)/2 of the alleles from a G : C/A : T heterozygote are G or C. If mutation rates are low enough that the infinite-sites assumption holds (i.e., $N_e u \ll 1$, $N_e v \ll 1$), the equilibrium proportion of sites fixed for G : C in a diploid is
|
|
GC→AT and AT→GC mutations are given by
|
|
GC→AT mutations segregating in the sample is simply
|
|
GC→AT or AT→GC, is deleterious. If we assume that selection is symmetrical about the optimum, then each mutation will be subject to the same level of selection; let the strength of selection be s against the mutation. Then, we have
|
|
GC→AT mutations under both models, and when selection favors increased A+T, we expect a deficit of GC→AT mutations. This is likely to be the pattern we expect under most models of selection, since the stabilizing- and directional-selection models lie at opposite ends of a continuum; as selection becomes weak in the stabilizing-selection model, mutation pressure will push the population away from the optimum; if selection becomes very weak, then the population will be sufficiently far below the optimum that the model becomes a weak directional-selection model.
CpG Dinucleotides
While both selection and biased gene conversion are consistent with the data presented here, there are few data which can discriminate between them at present. We can test two simple selective hypotheses: that selection is acting on synonymous codon use, but only to maintain (1) CpG islands, ~1-kb sequences which have high levels of the dinucleotide CpG and high G+C content, or (2) methylated CpG dinucleotides. Both CpG islands and methylated CpGs have been implicated in the regulation of gene expression (Lewis and Bird 1991
GC→AT mutations is very apparent for both CpG islands and CpG dinucleotides (CpG islands: 14 GC→AT mutations and 1 AT→GC mutation, P = 0.0005; CpG dinucleotides: 47 GC→AT and 15 AT→GC mutations, P = 0.0001 at SNPs segregating C/T at a site flanked 3' by G, or G/A at a site flanked 5' by C), there is an excess of GC→AT mutations both for non-CpG island DNA and for dinucleotides other than CpG (non-CpG island: 73 GC→AT and 36 AT→GC mutations, P = 0.0005; other dinucleotides: 41 GC→AT and 15 AT→GC mutations, P = 0.023).
Noncoding DNA
It is likely that whatever affects synonymous codon bias also affects large regions of the genome, since in mammals synonymous codon bias is correlated with the base composition of the chromosomal region in which the gene is situated; i.e., GC3 is strongly correlated with the G+C content of the 5' and 3' UTR regions, introns, and isochores (Bernardi et al. 1985; Clay et al. 1996). As expected, there is an excess of GC→AT mutations segregating in intron, 3' UTR, and anonymous STS sequences (i.e., STS sequences which are not known to be within or flanking a protein-coding sequence), particularly in those sequences which are G+C-rich (table 2). It therefore seems that either natural selection or biased gene conversion also affects the base composition of G+C-rich noncoding DNA and therefore has a profound effect on the structure of the human genome, since large sections of the genome are G+C-rich, while others are G+C-poor (Bernardi 1995).
| Acknowledgements |
|---|
We thank Eric Lander, Francis Collins, and their groups for making their data available, and Gil McVean, Laurence Hurst, and Peter Keightley for comments and helpful discussion. This work was supported by the BBSRC (N.G.C.S., A.E.-W.) and the Royal Society (A.E.-W.).
| Footnotes |
|---|
Manolo Gouy, Reviewing Editor
1 Abbreviations: EST, expressed sequence tag; MHC, major histocompatibility complex; SNP, single-nucleotide polymorphism; STS, sequence tagged site; UTR, untranslated region.
2 Keywords: human; synonymous codons; mutation bias
3 Address for correspondence and reprints: Adam Eyre-Walker, Centre for the Study of Evolution and School of Biological Sciences, University of Sussex, Brighton BN1 9QG, United Kingdom. a.c.eyre-walker{at}sussex.ac.uk
| Literature Cited |
|---|
Bernardi, G. 1995. The human genome: organization and evolutionary history. Annu. Rev. Genet. 29:445–476.
Bernardi, G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas, G. Cuny, M. Meunier-Rotival, and F. Rodier. 1985. The mosaic genome of warm blooded vertebrates. Science 228:953–958.
Bulmer, M. 1986. Neighbouring base effects on substitution rates in pseudogenes. Mol. Biol. Evol. 3:322–329.
Bulmer, M. 1991. The selection-mutation-drift theory of synonymous codon usage. Genetics 129:897–907.
Cargill, M., D. Altshuler, J. Ireland et al. (17 co-authors). 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22:231–238.
Clay, O., S. Caccio, Z. Zoubak, D. Mouchiroud, and G. Bernardi. 1996. Human coding and noncoding DNA: compositional correlations. Mol. Phylogenet. Evol. 5:2–12.
Duret, L., and D. Mouchiroud. 1999. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc. Natl. Acad. Sci. USA 96:4482–4487.
Eyre-Walker, A. 1997. Differentiating selection and mutation bias. Genetics 147:1983–1987.
Eyre-Walker, A. 1998. Problems with parsimony in sequences of biased base composition. J. Mol. Evol. 47:686–690.
Eyre-Walker, A. 1999. Evidence of selection on silent site base composition in mammals: potential implications for the evolution of isochores and junk DNA. Genetics 152:675–683.
Eyre-Walker, A., and P. D. Keightley. 1999. High genomic deleterious mutation rates in hominids. Nature 397:344–347.
Filipski, J. 1987. Correlation between molecular clock ticking, codon usage, fidelity of DNA repair, chromosome banding and chromatin compactness in germline cells. FEBS Lett. 217:184–186.
Gouy, M., and C. Gautier. 1982. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10:7055–7074.
Hacia, J. G., J.-B. Fan, O. Ryder et al. (16 co-authors). 1999. Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays. Nat. Genet. 22:164–167.
Ikemura, T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2:13–34.
Lewis, J., and A. P. Bird. 1991. DNA methylation and chromatin structure. FEBS Lett. 205:155–159.
Li, W.-H., and L. A. Stadler. 1991. Low nucleotide diversity in man. Genetics 129:513–523.
Li, W.-H., M. Tanimura, and P. M. Sharp. 1987. An evaluation of the molecular clock hypothesis using mammalian DNA sequences. J. Mol. Evol. 25:330–342.
Nagylaki, T. 1983. Evolution of a finite population under gene conversion. Proc. Natl. Acad. Sci. USA 80:6278–6281.
Sawyer, S. A., and D. L. Hartl. 1992. Population genetics of polymorphism and divergence. Genetics 132:1161–1176.
Sharp, P. M., C. J. Burgess, A. T. Lloyd, and K. J. Mitchell. 1992. Selective use of termination and variation in codon choice. Pp. 397–425 in D. L. Hatfield, B. J. Lee, and R. M. Pirtle, eds. Transfer RNA in protein synthesis. CRC Press, Boca Raton, Fla.
Sueoka, N. 1962. On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl. Acad. Sci. USA 48:582–592.
Sueoka, N. 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85:2653–2657.
Sved, J., and A. P. Bird. 1990. The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc. Natl. Acad. Sci. USA 87:4692–4696.
Wolfe, K. H., P. M. Sharp, and W.-H. Li. 1989. Mutation rates differ among regions of the mammalian genome. Nature 337:283–285.