MBE Advance Access originally published online on October 20, 2006
Molecular Biology and Evolution 2007 24(1):23-25; doi:10.1093/molbev/msl156
Letters
Methylation-Dependent Transition Rates Are Dependent on Local Sequence Lengths and Genomic Regions
Virginia Institute for Psychiatric and Behavioral Genetics and Center for the Study of Biological Complexity, Virginia Commonwealth University
E-mail: zzhao{at}vcu.edu.
Abstract
Recently, Fryxell and Moon (2005) examined methylation-dependent transition rates (5mC deamination rates), calculated as the difference between the CpG and GpC transition rates, using 4,437 transition mutations in CpG or GpC dinucleotides. They concluded that 5mC deamination rates were highly dependent on local GC content but not on the length of the local sequence over which GC content was calculated or on the genomic region where the mutations occurred. Here, we reexamined these statements using 292,216 CpG→TpG/CpA and GpC→GpT/ApC mutations, about 66 times as much data. Contrary to Fryxell and Moon's conclusions, our analysis indicated that 5mC deamination rates in the human genome depended on both the local sequence length and the genomic region. We also offer some explanations for the difference between their conclusions and ours.
Key Words: CpG; GpC; mutation rate; single nucleotide polymorphisms; GC content; genomic regions
CpG dinucleotides are subject to global methylation in mammalian genomes. The transition rate of methylated CpG (5mCpG) to TpG is 10- to 50-fold higher than that of other transitional changes (Sved and Bird 1990; Fryxell and Moon 2005). Because cytosines in GpC dinucleotides are not methylated in mammalian genomes (Razin and Riggs 1980), the difference between the CpG transition rate, measured by the number of CpG→TpG/CpA changes per CpG dinucleotide in a sequence, and the GpC transition rate, measured by the number of GpC→GpT/ApC changes per GpC dinucleotide, represents the rate of methylation-dependent transition, or 5mC deamination rate. By applying this approach to human single nucleotide polymorphism (SNP) data, Fryxell and Moon (2005) indicated that the 5mC deamination rate was exponentially dependent on local GC content. Importantly, in their plots of log10(5mC deamination rate) versus GC content of SNP neighboring sequences, the slopes from linear regression analysis were close to –3.0, the ideal slope predicted from DNA melting as a function of base composition (Fryxell and Zuckerkandl 2000). They further concluded that the slope of –3.0 neither depended on the lengths of the DNA sequences over which GC content was measured nor was specifically caused by the genomic regions (e.g., exons, introns, and differential methylation of CpG islands) where the SNPs were located (Fryxell and Moon 2005, see Conclusions).
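The rate definition above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the code used in either study: the function names and the example counts are hypothetical, and dinucleotides are counted on a single strand (CpG and GpC are each their own reverse complement).

```python
def count_dinucleotide(sequence: str, dinucleotide: str) -> int:
    """Count occurrences of a dinucleotide such as 'CG' (CpG) or 'GC' (GpC)."""
    return sequence.upper().count(dinucleotide)


def transition_rate(n_transitions: int, n_dinucleotides: int) -> float:
    """Number of observed transitions per dinucleotide of the given type."""
    if n_dinucleotides <= 0:
        raise ValueError("at least one dinucleotide is required")
    return n_transitions / n_dinucleotides


def deamination_rate(cpg_transitions: int, cpg_sites: int,
                     gpc_transitions: int, gpc_sites: int) -> float:
    """5mC deamination rate = CpG transition rate - GpC transition rate."""
    return (transition_rate(cpg_transitions, cpg_sites)
            - transition_rate(gpc_transitions, gpc_sites))


# Hypothetical counts, for illustration only (not data from the study).
example_rate = deamination_rate(cpg_transitions=300, cpg_sites=1000,
                                gpc_transitions=40, gpc_sites=2000)
```

In this sketch the GpC rate acts as the methylation-independent baseline that is subtracted from the CpG rate.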
Our recent studies indicated that the sequence context pattern observed in the local environment of SNPs depended on many factors such as genomic regions, sequence lengths, and SNP types (Zhao and Zhang 2006a, 2006b). Fryxell and Moon's study used only 4,437 SNPs, including 2,568 in intergenic regions, 1,222 in introns, 187 in exons, 260 in 5′ untranslated regions (UTRs), 52 in 3′ UTRs, 145 in CpG islands, and 3 others. Their general conclusions were based only on the observations from all 4,437 SNPs and from the 2,568 intergenic SNPs, not on observations within each genomic region. Therefore, further investigation is warranted. Here, we reexamined this issue using a much larger data set extracted from the 8,353,499 SNPs recently released in the dbSNP database (ftp://ftp.ncbi.nih.gov/snp/, Build 124). To ensure the high quality of our data, we selected only those SNPs that were biallelic, noninsertion/deletion, validated, and uniquely mapped in human nonrepetitive sequences and whose ancestral alleles were reliably inferred by comparison with their homologous sequences in the chimpanzee genome (see Methods). These processes resulted in 292,216 CpG→TpG/CpA and GpC→GpT/ApC SNPs, about 66 times as much data as Fryxell and Moon's sample. These SNPs were further categorized by genomic region.
We first examined whether the 5mC deamination rates were dependent on the lengths of local sequences. Using the same method as Fryxell and Moon (2005), we calculated the rates of CpG→TpG/CpA per CpG dinucleotide and GpC→GpT/ApC per GpC dinucleotide in SNP sequences whose lengths were 101, 201, 401, 601, and 1,001 nt (fig. 1A). In each length category (e.g., 101 nt), each SNP sequence consisted of the polymorphic site (1 nt) and equal lengths (50 nt each) of 5′ and 3′ flanking sequence. We then calculated the difference between the CpG and GpC transition rates (fig. 1B) and plotted log10(5mC deamination rate) versus GC content for each length category (fig. 1C). Linear regression analysis confirmed the previous findings that the 5mC deamination rates are inversely correlated with GC content (Bernardi et al. 1985; Bernardi 1995). However, our slope values varied with SNP sequence length. Specifically, they increased (became less negative) as the sequence lengths decreased. For example, the slope values increased from –3.1 to –1.2 when the lengths decreased from 1,001 to 101 nt (table 1). We noted that when the SNP sequence length was 601 nt, the slope value (–2.8) was the same as the value observed by Fryxell and Moon, whose SNP sequences had a modal average length of 564 nt.
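The window-and-regression procedure described above can be illustrated with the following Python sketch. It is a simplified illustration under stated assumptions, not the pipeline used in this study: GC content is assumed to be expressed as a fraction between 0 and 1, positions are 0-based, the function names are hypothetical, and the example values are synthetic.

```python
import math


def snp_window(chrom_seq: str, snp_pos: int, length: int) -> str:
    """Return a window of odd `length` centered on the 0-based SNP position."""
    if length % 2 == 0:
        raise ValueError("length must be odd so that both flanks are equal")
    flank = length // 2  # e.g., 50 nt on each side for a 101-nt window
    start, end = snp_pos - flank, snp_pos + flank + 1
    if start < 0 or end > len(chrom_seq):
        raise ValueError("window extends beyond the sequence")
    return chrom_seq[start:end]


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def slope_of_log_rate(gc_values, rates):
    """Least-squares slope of log10(rate) regressed on GC content."""
    ys = [math.log10(r) for r in rates]  # rates must be positive
    n = len(gc_values)
    mean_x = sum(gc_values) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in gc_values)
    if sxx == 0:
        raise ValueError("GC values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc_values, ys))
    return sxy / sxx


# Synthetic rates constructed to have a slope of -3 (not data from the study).
gc_bins = [0.30, 0.40, 0.50, 0.60]
synthetic_rates = [10 ** (-1.0 - 3.0 * gc) for gc in gc_bins]
slope = slope_of_log_rate(gc_bins, synthetic_rates)
```

Repeating the calculation with window lengths of 101, 201, 401, 601, and 1,001 nt would give one slope per length category, which is the comparison summarized in table 1.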
We next examined whether the 5mC deamination rates were dependent on genomic regions. The above analysis was performed for the SNPs in each genomic region. The CpG→TpG/CpA rates in intergenic and intronic regions were nearly the same; however, they were remarkably higher than those in other genomic regions. Conversely, the GpC→GpT/ApC rates were nearly the same among all genomic regions (supplementary fig. S1, Supplementary Material online). The slope values in the intergenic regions and introns were always similar regardless of the sequence lengths (table 1), reflecting that these 2 noncoding regions are nearly selectively neutral and have similar GC content. Moreover, the slope values in the overall genome were similar to those in the intergenic and intronic regions. This is largely because the SNPs in the intergenic and intronic regions accounted for 85% of the total SNPs. However, the slopes in the CpG islands were much smaller (or greater in absolute value) than those in the intergenic or intronic regions for all length categories. This reflects the lack of methylation and suppression of 5mC deamination in CpG islands (Zhao and Zhang 2006a).
We now consider why some conclusions in Fryxell and Moon (2005) differed from our observations. First, they stated that the slope of –3.0 did not depend on the sequence lengths over which GC content was calculated because the results from their 2 analyses, one based on SNP sequence length (564 nt) and the other based on chimpanzee genomic contig length (163 kb), were essentially the same. However, large nucleotide composition biases were found at the few sites immediately adjacent to SNPs, whereas small biases were detected within 200 nt on each flanking side of SNPs (Zhao and Boerwinkle 2002). Their 2 analyses failed to reveal the difference in slopes because both lengths were longer than 400 nt. Second, after they found that the slopes were close to –3.0 using SNPs in the overall genome and in intergenic regions, they implied that the slopes would remain unchanged regardless of the inclusion or exclusion of exons, introns, or CpG islands. This generalization was not based directly on an analysis of SNPs in each specific region; rather, it reflected the fact that intergenic and intronic SNPs accounted for 85% of the total SNPs in their study.
In summary, contrary to Fryxell and Moon's conclusions, our analysis indicated that the 5mC deamination rates, measured by the difference between the CpG and GpC transition rates, in the human genome were dependent on both local sequence length and genomic region.
Methods
We used the SNP data set prepared in Jiang and Zhao (2006). The ancestral alleles of these SNPs were inferred by mapping human SNPs to the chimpanzee genome using the MegaBlast program (Zhang et al. 2000). The ancestral alleles of a total of 1,785,712 SNPs were inferred under the stringent criteria (Jiang and Zhao 2006). Among them, there were 292,216 transition mutations that occurred in ancestral CpG or GpC dinucleotides.
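To make the definition of a transition in an ancestral CpG or GpC dinucleotide concrete, the following Python sketch classifies a single SNP from its ancestral allele, derived allele, and the two neighboring bases. It is an illustration only: the function name and labels are hypothetical, and returning both labels when a site lies in a CpG and a GpC at once (e.g., the C in GCG) is our assumption, because the handling of such sites is not specified here.

```python
def classify_transition(five_prime: str, ancestral: str,
                        derived: str, three_prime: str) -> list:
    """Label a SNP as a CpG and/or GpC transition from its ancestral context.

    C>T at the C of a dinucleotide and G>A at its G are the same change
    viewed from opposite strands, so both are counted.
    """
    five_prime, ancestral = five_prime.upper(), ancestral.upper()
    derived, three_prime = derived.upper(), three_prime.upper()
    labels = []
    if ancestral == "C" and derived == "T":
        if three_prime == "G":   # C is the first base of an ancestral CpG
            labels.append("CpG>TpG")
        if five_prime == "G":    # C is the second base of an ancestral GpC
            labels.append("GpC>GpT")
    elif ancestral == "G" and derived == "A":
        if five_prime == "C":    # G is the second base of an ancestral CpG
            labels.append("CpG>CpA")
        if three_prime == "C":   # G is the first base of an ancestral GpC
            labels.append("GpC>ApC")
    return labels
```

Any SNP for which the function returns an empty list is either a transversion, a transition starting from A or T, or a C/G transition outside a CpG or GpC context.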
To identify these transition mutations in intergenic, intronic, and exonic regions, we compared the SNP locations with the coordinates of each well-categorized (e.g., known) intergenic, intronic, and exonic region based on the human gene annotation information from the ENSEMBL database (ftp://ftp.ensembl.org/pub/, version 32.35e) (Jiang and Zhao 2006). Because the annotations for UTRs were not readily available in the ENSEMBL database, we used the nonredundant human UTR data from UTResource (http://www.ba.itb.cnr.it/UTR/, Release 21). The mutations in 5′ and 3′ UTRs were identified by comparing the SNP locations with the coordinates of 5′ and 3′ UTRs in human chromosomes. The SNPs in CpG islands were identified according to the procedure in Jiang and Zhao (2006). We identified 176,518 transition mutations in intergenic regions, 71,456 in introns, 4,168 in exons, 8,027 in 5′ UTRs, 28,714 in 3′ UTRs, and 6,642 in CpG islands.
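Comparing SNP locations with region coordinates amounts to an interval lookup. The Python sketch below shows one way to do this for a single chromosome; it is not the procedure of Jiang and Zhao (2006). It assumes 1-based inclusive coordinates and nonoverlapping intervals within one annotation track, and all names and coordinates are hypothetical. Tracks that may overlap others, such as CpG islands, would need a separate index.

```python
from bisect import bisect_right


def build_region_index(intervals):
    """Sort (start, end, label) intervals and keep their starts for bisection."""
    ordered = sorted(intervals)
    starts = [start for start, _, _ in ordered]
    return starts, ordered


def region_of(position: int, index):
    """Return the label of the interval containing `position`, or None."""
    starts, ordered = index
    i = bisect_right(starts, position) - 1  # last interval starting <= position
    if i >= 0:
        start, end, label = ordered[i]
        if start <= position <= end:
            return label
    return None


# Hypothetical annotation for one chromosome (coordinates are made up).
index = build_region_index([
    (1, 99, "intergenic"),
    (100, 199, "exon"),
    (200, 499, "intron"),
])
snp_regions = [region_of(pos, index) for pos in (50, 150, 200, 500)]
```

Sorting once and bisecting keeps each lookup logarithmic in the number of annotated regions, which matters when several hundred thousand SNPs are assigned.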
Supplementary Material
Supplementary figure S1 is available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgements
We are grateful to Wen-Hsiung Li for comments on the manuscript. This study was supported by the Thomas F. and Kate Miller Jeffress Memorial Trust Fund.
Footnotes
Naruya Saitou, Associate Editor
References
Bernardi G. (1995) The human genome: organization and evolutionary history. Annu Rev Genet 29:445–476.
Bernardi G, Olofsson B, Filipski J, Zerial M, Salinas J, Cuny G, Meunier-Rotival M, Rodier F. (1985) The mosaic genome of warm-blooded vertebrates. Science 228:953–958.
Fryxell KJ and Moon WJ. (2005) CpG mutation rates in the human genome are highly dependent on local GC content. Mol Biol Evol 22:650–658.
Fryxell KJ and Zuckerkandl E. (2000) Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol Biol Evol 17:1371–1383.
Jiang C and Zhao Z. (2006) Mutational spectrum in the recent human genome inferred by single nucleotide polymorphisms. Genomics 88:527–534.
Lander ES, Linton LM, Birren B, et al. (255 co-authors). (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921.
Razin A and Riggs AD. (1980) DNA methylation and gene function. Science 210:604–610.
Sved J and Bird A. (1990) The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc Natl Acad Sci USA 87:4692–4696.
Zhang Z, Schwartz S, Wagner L, Miller W. (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol 7:203–214.
Zhao Z and Boerwinkle E. (2002) Neighboring-nucleotide effects on single nucleotide polymorphisms: a study of 2.6 million polymorphisms across the human genome. Genome Res 12:1679–1686.
Zhao Z and Zhang F. (2006a) Sequence context analysis of 8.2 million single nucleotide polymorphisms in the human genome. Gene 366:316–324.
Zhao Z and Zhang F. (2006b) Sequence context analysis in the mouse genome: single nucleotide polymorphisms and CpG island sequences. Genomics 87:68–74.