MBE Advance Access originally published online on June 24, 2007
Molecular Biology and Evolution 2007 24(9):1991-2000; doi:10.1093/molbev/msm128
Research Articles
Features and Trend of Loss of Promoter-Associated CpG Islands in the Human and Mouse Genomes

* Department of Psychiatry and Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, Virginia, USA
Key Laboratory of Cellular and Molecular Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
Graduate School, Chinese Academy of Sciences, Beijing, China
Department of Ecology and Evolution, University of Chicago, Chicago, Illinois, USA
E-mail: zzhao{at}vcu.edu.
Abstract
CpG islands (CGIs) are often considered as gene markers, but the number of CGIs varies among mammalian genomes that have similar numbers of genes. In this study, we investigated the distribution of CGIs in the promoter regions of 3,197 human-mouse orthologous gene pairs and found that the mouse genome has notably fewer CGIs in the promoter regions and less pronounced CGI characteristics than does the human genome. We further inferred the ancestral state of CGIs using the dog genome as a reference and examined the nucleotide substitution pattern and the mutational direction in the conserved regions of human and mouse CGIs. The results reveal many losses of CGIs in both genomes, but the loss rate in the mouse lineage is two to four times the rate in the human lineage. We found an intriguing feature of CGI loss, namely that the loss of a CGI usually starts from erosion at both edges and gradually moves towards the center. We found functional bias in the genes that have lost promoter-associated CGIs in the human or mouse lineage. Finally, our analysis indicates that the association of CGIs with housekeeping genes is not as strong as previously estimated. Our study provides a detailed view of the evolution of promoter-associated CGIs in the human and mouse genomes, and our findings are helpful for understanding the evolution of mammalian genomes and the role of CGIs in gene function.
Key Words: CpG island, evolution, promoter, DNA methylation, functional bias, human, mouse
Introduction
CpG dinucleotides tend to be methylated and are greatly under-represented in the vertebrate genome. However, CpG sites may cluster in GC-rich regions to form CpG islands (CGIs) (Bird 1986
CGIs have been commonly used to estimate the number of genes in a genome (Antequera and Bird 1993). The gene numbers (e.g., 30,000) as well as the genome sizes were found to be similar in the human, mouse, and rat (Venter et al. 2001; Waterston et al. 2002; Gibbs et al. 2004). However, the number of CGIs varies among these genomes. For example, Waterston et al. (2002) estimated 27,000 and 15,500 CGIs in the non-repetitive portions of the human and mouse genomes, respectively. In a systematic survey using Takai and Jones' (2002) algorithm, we found that the number of CGIs in the dog genome (58,300) was nearly three times that in the rat (19,600) or mouse (20,500) genome (unpublished data), even though the number of dog genes was estimated to be smaller than those of human and mouse genes (Lindblad-Toh et al. 2005). This motivated us to investigate the detailed association of CGIs with genes and the evolution of CGIs. One important question is whether the CGIs associated with genes have undergone gains or losses during vertebrate evolution. A few previous studies based on a limited number of genes suggested more frequent losses of CGIs in the mouse lineage than in the human lineage (Aissani and Bernardi 1991; Antequera and Bird 1993; Matsuo et al. 1993). Comparing CpG and TpG/CpA dinucleotides in three genes that have CGIs in humans but not in mice, Antequera and Bird (1993) suggested an evolutionary scenario in which some CpG islands in mice became de novo methylated and then were progressively lost due to the high deamination rate at the newly methylated CpG sites, leading to TpGs and CpAs. Whether this scenario applies to the whole mouse genome and to other vertebrate genomes remains unclear. In this study, we performed a large-scale analysis of CGIs covering the promoter regions of 3,197 human and mouse orthologous gene pairs. We found that human genes have a much higher presence of promoter-associated CGIs and that these CGIs have more pronounced CGI characteristics in the human genome, consistent with a previous finding (Matsuo et al. 1993). Our study of nucleotide substitutions at the CpG sites and their mutational direction suggests frequent losses of CGIs in both genomes but a much faster rate in the mouse lineage. Our analysis also revealed an intriguing feature: erosion of a CGI usually starts from both edges and moves towards the center. We also studied the functional bias of genes that have lost CGIs.
Materials and Methods
Human-Mouse Orthologous Genes
We used the 3,750 human-mouse orthologous gene pairs prepared by Iwama and Gojobori (2004)
300 bp (Pesole et al. 1999
1 kb (Zhao and Zhang 2006a
Human-Mouse-Dog Orthologous Genes
Among the 3,197 human-mouse orthologous genes, 2,717 were annotated as orthologous to dog genes in the NCBI HomoloGene database (ftp://ftp.ncbi.nih.gov/pub/HomoloGene/, build 52). We retrieved the sequences and annotations of the 2,717 dog genes from the NCBI dog genome database (ftp://ftp.ncbi.nih.gov/genomes/Canis_familiaris/, build 2.1). We prepared the dog data using the same procedures as for the human-mouse orthologous genes. We discarded 47 genes and retained 2,670 human-mouse-dog orthologous genes (see Supplementary table S2).
Identification of CGIs
CGIs were identified using the CpG island searcher program (CpGi130) available at http://cpgislands.usc.edu/. We used the search criteria: GC content ≥55%, ObsCpG/ExpCpG ≥0.65, and length ≥500 bp. These criteria can effectively identify CGIs associated with genes (Takai and Jones 2002). Further, we excluded short interspersed repeats (e.g., Alu), which typically have a sequence length of 80-400 bp (Takai and Jones 2002; Lopez-Giraldez et al. 2006).
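As a minimal sketch of how the three search criteria combine, the Python functions below test a single candidate sequence. The sketch assumes that each threshold is a lower bound and that ObsCpG/ExpCpG follows the Gardiner-Garden and Frommer (1987) definition, (number of CpGs × sequence length) / (number of Cs × number of Gs). The function names are hypothetical, and the sliding-window search performed by CpGi130 is not reproduced here.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def meets_cgi_criteria(seq: str, min_gc: float = 0.55,
                       min_obs_exp: float = 0.65, min_len: int = 500) -> bool:
    """Check one candidate region against the three thresholds (assumed inclusive)."""
    return (len(seq) >= min_len
            and gc_content(seq) >= min_gc
            and cpg_obs_exp(seq) >= min_obs_exp)
```

For instance, `meets_cgi_criteria("CG" * 250)` holds for this artificial 500-bp sequence, whereas a 200-bp copy of the same repeat fails on length alone.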
Genomic Global Alignment
For each human-mouse orthologous gene pair, we used BLAST 2 Sequences (Tatusova and Madden 1999) to find the local alignments. We used the same parameters as in Iwama and Gojobori (2004), which are (1) mismatch penalty = -2, (2) word size = 7, (3) hit length > 7 bp, (4) sequence identity ≥70%, (5) hit strand in the same direction, and (6) only one hit in the local alignments. A total of 109,003 conserved blocks covering 6,145,266 bp were identified among these genes. In our analysis, we focused on the CGIs that are within or overlap with the 2-kb sequences upstream of the first coding site and we grouped them into the following three categories: (1) 5,865 blocks (650,669 bp) each of which was mapped to both human and mouse CGIs (H+M+), (2) 2,930 blocks (153,721 bp) that were mapped to human CGIs but mouse non-CGIs (H+M-), and (3) 1,154 blocks (39,579 bp) that were mapped to mouse CGIs but human non-CGIs (H-M+).
Inference of Mutational Direction in SNPs
The human and mouse reference SNPs were downloaded from the NCBI dbSNP database (ftp://ftp.ncbi.nih.gov/snp/, build 126). The details of SNP data processing, inference of ancestral alleles of SNPs, and identification of SNPs in genomic regions have been described in the Materials and Methods in Jiang and Zhao (2006). Briefly, we first identified the SNPs that were bi-allelic, uniquely mapped in the genome, and located in the CGIs that are within or overlap with the 2-kb upstream sequences of the genes. Next, we inferred the ancestral alleles of SNPs by mapping the human SNPs to the chimpanzee genome and the mouse SNPs to the rat genome using the MegaBlast program (Zhang et al. 2000). The chimpanzee and rat genome reference sequences were downloaded from the NCBI RefSeq database (ftp://ftp.ncbi.nih.gov/genomes/, the chimpanzee genome: build 2, version 1; the rat genome: build 4, version 1). For each SNP, its uniquely matched nucleotide in the outgroup species (chimpanzee or rat) was taken as its ancestral allele (Jiang and Zhao 2006). There were a total of 5,624 human SNPs and 646 mouse SNPs whose ancestral alleles could be inferred in these CGIs.
Gene Expression
Human and mouse gene expression data in the second version of Gene Expression Atlas were directly obtained from Andrew Su (Su et al. 2004). We consider a gene to be expressed in a tissue when its average difference (AD) value passes the cutoff of 200 (Yang, Su, and Li 2005).
Gene Ontology (GO) Annotation
Among the 3,197 human and mouse orthologous pairs, we obtained the GO annotations of 2,976 human genes and 2,977 mouse genes from the EBI Gene Ontology Annotation (GOA) database (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/). To simplify the analysis, all GO terms were mapped to the goslim-generic subset (ftp://ftp.geneontology.org/pub/go/GO_slims/goslim_generic.obo), which contains the top-level GO terms.
Results
Distribution of Promoter-Associated CGIs in Human and Mouse Orthologous Genes
We used 3,197 human-mouse orthologous pairs from the 3,750 pairs prepared by Iwama and Gojobori (2004)
We grouped the 3,197 gene pairs by the presence of promoter-associated CGIs. There were 1,393 pairs (43.6%) that had at least one CGI in both the human and mouse genes, 293 pairs (9.2%) that had at least one CGI in the human gene but none in the mouse gene, 87 pairs (2.7%) that had at least one CGI in the mouse gene but none in the human gene, and 1,424 pairs (44.5%) that had no CGI in either the human or mouse gene (table 1). We denoted these four groups as H+M+, H+M-, H-M+, and H-M-, respectively. This distribution indicates that most of the orthologous gene pairs (43.6% + 44.5% = 88.1%) had the same presence or absence of CGIs in the promoter regions. Importantly, among the 1,686 human genes having CGIs in the promoter regions, 17.4% of their mouse orthologues had no promoter-associated CGI. In contrast, among the 1,480 mouse genes having CGIs, only 5.9% of their human orthologues had no CGI in the promoter regions.
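The percentages in this paragraph follow directly from the four group counts. The short Python sketch below recomputes them for illustration only; the variable names are ours and are not part of the original analysis.

```python
# Gene-pair counts for the four presence/absence groups reported in the text.
counts = {"H+M+": 1393, "H+M-": 293, "H-M+": 87, "H-M-": 1424}

total = sum(counts.values())
shares = {group: round(100 * n / total, 1) for group, n in counts.items()}

# Genes carrying at least one promoter-associated CGI in each species.
human_with_cgi = counts["H+M+"] + counts["H+M-"]
mouse_with_cgi = counts["H+M+"] + counts["H-M+"]

# Share of CGI-bearing genes whose orthologue in the other species lacks a CGI.
pct_mouse_lacking = round(100 * counts["H+M-"] / human_with_cgi, 1)
pct_human_lacking = round(100 * counts["H-M+"] / mouse_with_cgi, 1)

print(total, shares, pct_mouse_lacking, pct_human_lacking)
```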
The comparison above indicates that the human genes have a stronger association with CGIs in the promoter regions and that these human CGIs have more pronounced characteristics. This was further confirmed when we plotted the 1,257 orthologous pairs by CGI length, GC content, ObsCpG/ExpCpG, and CpG density per 100 bp (fig. 1). These orthologous pairs were selected for comparison because each gene in a pair had one and only one promoter-associated CGI. The opposite distribution pattern was observed for the Obs/Exp ratios of TpGs and CpAs (fig. 2A, B). However, the Obs/Exp ratios of GpCs differed much less between the human and mouse genes (fig. 2C).
Because of the unknown ancestral information for these human and mouse CGIs, the observations above suggest three possible evolutionary scenarios: (1) many genes in the human and mouse genomes have lost CGIs and the mouse lineage has a faster loss rate; (2) both the human and mouse genes have gained CGIs and the human lineage has a faster gain rate; and (3) since there are more cases of H+M– than H–M+, the human genes have gained more CGIs, whereas the mouse genes have lost more CGIs. In the next three subsections, we examined these competing scenarios using the dog as a reference species and studying the substitution pattern and mutational direction in the CGI regions.
Promoter-Associated CGIs in Human-Mouse-Dog Orthologous Genes
We identified 2,670 human-mouse-dog orthologous genes from the above 3,197 genes. We found that 1,261 of these 2,670 genes had promoter-associated CGIs in the dog genome (denoted as D+). These genes were further categorized into four groups: D+H+M+ (872 genes), D+H+M– (156 genes), D+H–M+ (37 genes), and D+H–M– (196 genes) (table 1). In the D+H+M– and D+H–M+ groups, the parsimony principle suggests that the CGI was lost in one lineage (human or mouse) rather than gained independently in the other two lineages. Thus, the CGIs associated with these genes have likely been lost in the human or mouse lineage. The number of D+H+M– genes is 4.2 times that of D+H–M+ genes. This compares to a smaller ratio (293 H+M– genes / 87 H–M+ genes = 3.4 times) when the ancestral information was not considered (table 1). The result indicates many more mouse-specific losses than human-specific losses.
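The parsimony reasoning for genes that carry a CGI in the dog can be written as a small decision rule. The Python sketch below is illustrative only: the function name and the returned labels are hypothetical, and it simply encodes the rule that one loss is preferred over two independent gains.

```python
def infer_event(dog: bool, human: bool, mouse: bool) -> str:
    """Most parsimonious reading of a CGI presence pattern (dog as reference)."""
    if not dog:
        return "not informative (no CGI in dog)"
    if human and mouse:
        return "retained in human and mouse"
    if human and not mouse:
        return "lost in mouse lineage"      # D+H+M-
    if mouse and not human:
        return "lost in human lineage"      # D+H-M+
    return "ambiguous (absent in both human and mouse)"  # D+H-M-


# Gene counts per group, as reported in the text.
group_sizes = {"D+H+M+": 872, "D+H+M-": 156, "D+H-M+": 37, "D+H-M-": 196}
mouse_to_human_loss_ratio = group_sizes["D+H+M-"] / group_sizes["D+H-M+"]
print(round(mouse_to_human_loss_ratio, 1))
```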
Substitutions in the Conserved Blocks in Promoter-Associated CGIs
Because the sequence divergence between human and mouse is generally high, we examined the substitution pattern only in the conserved blocks in the CGI regions. We identified 9,949 conserved blocks that were mapped to CGIs that lie within or overlap with the 2-kb upstream sequences of the start codon site (see Materials and Methods). These conserved blocks included 5,865 blocks (size range: 13 – 3,026 bp; average size: 111 bp; total length: 650,669 bp), 2,930 blocks (size range: 13 – 1,469 bp; average size: 52 bp; total length: 153,721 bp), and 1,154 blocks (size range: 13 – 734 bp; average size: 34 bp; total length: 39,579 bp) that were, respectively, mapped to the CGIs in the H+M+ group, to the CGIs in the H+M– group, and to the CGIs in the H–M+ group. To study the evolution of CGIs, we examined substitutions at the CpG sites, in particular between CpG and TpG/CpA dinucleotides, in the aligned sequences of the conserved blocks. In the H+M+ group, there were 8,157 human CpGs that were aligned to mouse TpGs/CpAs, significantly more than the 5,839 mouse CpGs that were aligned to human TpGs/CpAs (χ² test, P = 1.8 × 10^-85). Similarly, in the H+M– group, the number of human CpGs with mouse TpGs/CpAs (2,261) was significantly more than that (771) of mouse CpGs with human TpGs/CpAs (χ² test, P = 2.9 × 10^-161). In contrast, in the H–M+ group, the number of human CpGs with mouse TpGs/CpAs (237) was significantly smaller than that (328) of mouse CpGs with human TpGs/CpAs (χ² test, P = 0.0001). Importantly, we consistently observed more TpGs than CpAs in each group (table 2).
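The reported P values are consistent with a one-degree-of-freedom chi-square test of the two counts against equal expected frequencies. The text does not spell out the test design, so the Python sketch below rests on that assumption; it uses only the standard library, and the function name is hypothetical.

```python
import math


def chi2_equal_counts(a: int, b: int) -> tuple[float, float]:
    """Chi-square statistic and P value (1 d.f.) for H0: both counts are equal."""
    expected = (a + b) / 2
    chi2 = (a - expected) ** 2 / expected + (b - expected) ** 2 / expected
    # For 1 d.f., the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Human-CpG/mouse-TpG-or-CpA count versus the reverse, per group.
for group, (human_cpg, mouse_cpg) in {
    "H+M+": (8157, 5839),
    "H+M-": (2261, 771),
    "H-M+": (237, 328),
}.items():
    chi2, p = chi2_equal_counts(human_cpg, mouse_cpg)
    print(f"{group}: chi2 = {chi2:.1f}, P = {p:.1e}")
```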
The observations above suggest that (1) in the H+M+ and H+M– groups, the mouse overall has accumulated many TpGs/CpAs and lost many CpGs, and (2) in the H–M+ group, the human overall has accumulated TpGs/CpAs and lost CpGs. This is consistent with the loss-of-CGIs model, in which a much higher rate of deamination at 5mCpGs leads to an overabundance of TpGs/CpAs and a deficiency of CpGs (Antequera and Bird 1993
Mutational Direction at the CpG Sites in Promoter-Associated CGIs
The difference between the numbers of CpGs and TpGs/CpAs tends to support the loss of CGIs. However, this assumes that the nucleotide changes from CpG to TpG/CpA. The opposite (i.e., gain of CGIs) would be true if these nucleotide changes ran in the opposite direction (e.g., TpG → CpG). Therefore, we inferred the mutational direction of the SNPs in these promoter-associated CGIs by comparing with their ancestral alleles in an outgroup species, the chimpanzee or the rat. There were 5,624 SNPs whose ancestral allele could be inferred in the human promoter-associated CGIs. Among them, 1,548 had the ancestral CpG allele and 744 had the derived CpG allele. Moreover, the number of CpG → TpG/CpA changes (823) is 3.1 times that of TpG/CpA → CpG changes (268) (table 3). Under the assumption of no CpG effect, we would expect these numbers to be similar. The difference is even more striking in the mouse, in which there were 131 CpG → TpG/CpA changes compared to only 26 TpG/CpA → CpG changes (Supplementary table S3).
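Once the ancestral allele of a SNP is known from the outgroup, assigning a direction to a change at a CpG site is a simple lookup. The Python sketch below illustrates this under the assumption that each SNP has already been reduced to an ancestral and a derived dinucleotide; the function name and labels are hypothetical and do not describe the authors' pipeline.

```python
DECAY_PRODUCTS = {"TG", "CA"}  # deamination products of CpG on either strand


def classify_cpg_change(ancestral: str, derived: str) -> str:
    """Label a dinucleotide change as CpG loss, CpG gain, or other."""
    anc, der = ancestral.upper(), derived.upper()
    if anc == "CG" and der in DECAY_PRODUCTS:
        return "CpG->TpG/CpA"
    if anc in DECAY_PRODUCTS and der == "CG":
        return "TpG/CpA->CpG"
    return "other"


# Loss-to-gain ratios from the counts reported in the text.
human_ratio = 823 / 268
mouse_ratio = 131 / 26
print(round(human_ratio, 1), round(mouse_ratio, 1))
```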
Thus, the examination of the ancestral state of the CGIs in the human-mouse orthologous genes and the substitution pattern and mutational direction at the CpG sites in promoter-associated CGIs strongly suggest a higher rate of loss than gain of CGIs in both the human and mouse genomes and a faster rate of loss in the mouse lineage. Under this scenario, some CpG islands became de novo methylated. The high transition rate of 5mCpG to TpG, which is estimated to be 10- to 50-fold higher than other transitional changes (Sved and Bird 1990
A/T changes but only 1,020 A/T → G/C changes in the human CGI regions. This difference is even larger in the mouse CGI regions: 431 G/C → A/T changes compared to 92 A/T → G/C changes.
Characteristics of CGI Loss
We attempted to uncover how a CGI might become lost. For each CGI, we evenly dissected it into 11 segments, including the central part (segment 0), 5 segments on the 5' side (–5, –4, –3, –2, and –1), and 5 segments on the 3' side (+1, +2, +3, +4, and +5) (fig. 3A). Then, for each segment, we calculated the GC content, the ObsCpG/ExpCpG ratio, and the CpG density. Strikingly, these statistics tend to be the highest in the central segment and decrease in both the 5' and 3' directions (fig. 3B-D). For example, the GC content in human CGIs decreases from 70% in segment 0 to 50-52% in segments –5 and +5. This observation is consistent with the previous finding that GC content and CpG frequency peak at the transcription start site and decline symmetrically towards both the 5' and 3' directions (Aerts et al. 2004; Saxonov, Berg and Brutlag 2006). This distribution pattern was similarly observed in mouse CGIs (fig. 3B-D). Note that the values of the parameters in the edge segments of mouse CGIs were always smaller than those of human CGIs. We further separated CGIs into three groups by their lengths (<1 kb, 1-2 kb, and >2 kb), and similar patterns were consistently observed in these three groups (Supplementary fig. S1). Further, we measured the Obs/Exp ratios for TpG and for CpA. As expected, we found that both the ObsTpG/ExpTpG and ObsCpA/ExpCpA ratios increase from the central segment to the edges, though the extent of change was not as strong as that of ObsCpG/ExpCpG (Supplementary fig. S2). This pattern was similarly observed in mice.
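The segment-wise profile can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: segment boundaries are placed by rounding multiples of length/11 (the text says only that CGIs were "evenly dissected"), CpGs that straddle a boundary are not counted, and the function names are hypothetical.

```python
def split_into_segments(seq: str, n: int = 11) -> list[str]:
    """Cut a sequence into n near-equal, contiguous segments."""
    bounds = [round(i * len(seq) / n) for i in range(n + 1)]
    return [seq[bounds[i]:bounds[i + 1]] for i in range(n)]


def segment_profile(seq: str, n: int = 11) -> dict[int, tuple[float, float]]:
    """Map segment label (-5..+5 for n = 11) to (GC content, CpGs per 100 bp)."""
    half = n // 2
    profile = {}
    for label, segment in zip(range(-half, half + 1), split_into_segments(seq, n)):
        s = segment.upper()
        if not s:  # only possible for sequences shorter than n bases
            profile[label] = (0.0, 0.0)
            continue
        gc = (s.count("G") + s.count("C")) / len(s)
        cpg_per_100bp = 100 * s.count("CG") / len(s)
        profile[label] = (gc, cpg_per_100bp)
    return profile


# Toy sequence with a CpG-rich core flanked by AT-rich edges.
toy = "AT" * 100 + "CG" * 20 + "AT" * 100
print(segment_profile(toy)[0], segment_profile(toy)[-5])
```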
The above distribution pattern reflects a general feature of CGIs. However, it may also imply that the erosion of a CpG island starts at the edges of the CGI, and then progressively decays towards the center.
Overrepresentation of GO Terms in H–M+ and H+M– Genes
We next asked which functional classes of genes are more likely to lose CGIs during evolution. We examined the distribution of characteristic GO terms associated with the genes in the H+M+, H–M+ and H+M– groups. Table 4 shows the GO terms that are more represented in the H–M+ or H+M– group compared to the H+M+ group. For the mouse genes having CGIs (M+), their human orthologous genes might be more likely to lose CGIs when the genes are associated with enzyme regulator activity, chromosome, chromatin binding, transport, DNA binding, transport activity, and cytoplasmic membrane-bound vesicle. For the human genes having CGIs (H+), their mouse orthologous genes might be more likely to lose CGIs when the genes are associated with cell, receptor activity, physiological process, actin binding, and cytoskeletal protein binding.
To understand the relevance of these GO terms to gene function, we performed an analysis to identify the GO terms that are significantly enriched in the housekeeping or tissue-specific genes in the human and mouse. We used the second version of Gene Expression Atlas, which surveyed gene expression in 79 human and 61 mouse tissues (Su et al. 2004
The results above indicate a tendency of loss of promoter-associated CGIs in tissue-specific genes relative to the housekeeping genes in the mouse genome; however, this feature is not observed in the human genome.
Discussion
A couple of early surveys showed that mouse genes had likely lost some CGIs during the course of evolution (Antequera and Bird 1993
Methylation and CGI Loss
CpG islands are usually unmethylated in a genome, especially in the promoter regions (Antequera 2003). However, recent studies found that CGIs may be methylated under abnormal conditions or even in normal cells. Aberrant hypermethylation in promoter-associated CGIs, which causes transcriptional silencing and disruption of gene function, has been found to be associated with many diseases including various types of cancers (Herman et al. 1994; Esteller et al. 2000a; Esteller et al. 2000b; Leung et al. 2001), mental disorders (Abdolmaleky et al. 2004; Abdolmaleky et al. 2005), developmental defects (Bell and Felsenfeld 2000; Maher and Reik 2000), and autoimmune diseases (Attwood, Yung, and Richardson 2002). Importantly, Graff et al. (1997) reported that de novo methylation initiated at the two ends of CGIs and progressively encompassed the entire CGI in the E-cadherin tumor suppressor gene. This supports our finding that erosion of a CGI starts from both edges and moves towards the center. Moreover, Yamada et al. (2004) reported that 31 of the 149 CGIs studied were fully methylated even in normal peripheral blood cells. In a recent genome-wide survey of methylation in the promoter sequences of somatic and germline cells, Weber et al. (2007) found an association of DNA methylation in CpG-poor promoters in the germline with an increased loss of CpG dinucleotides, implying that the characteristics of the CGIs have weakened or even vanished in the course of evolution.
Characteristics of CGI Loss
An examination of CGI characteristics in 11 segments of CGIs revealed that erosion of a CGI moves from the edges towards the center (fig. 3). This observation might be influenced by the CGI search algorithm, which evaluates a 200-bp window each time, combines the consecutive 200-bp windows, and then retracts at both ends until it meets the criteria of a CGI (Takai and Jones 2002). We modified the algorithm by fixing the 5' end, but the results were nearly the same (data not shown). Furthermore, we observed the same pattern in the short CGIs (<1 kb). In these short CGIs, the segment length we used to divide a CGI into 11 segments was less than half of the window size (200 bp) used in the CGI search algorithm. However, we still found that the value of each measurement in the neighboring segments consistently decreased in both the 5' and 3' directions (see Supplementary fig. S1). Moreover, although the strongest CGI characteristics (e.g., ObsCpG/ExpCpG, CpG density) were found in the central segment(s) in most cases, they could be in any segment (see Supplementary table S6 for a list of genes having the highest CpG density (per 100 bp) in different CGI segments). Thus, the erosion from both edges towards the center is likely a characteristic of CGI loss.
Rate of CGI Loss
We found 1,795 CGIs in the promoter regions of 1,686 human genes and 1,553 CGIs in 1,480 mouse genes. Interestingly, the number of CGIs per gene in human (1.06) is similar to that (1.05) in mouse. Here we roughly estimate the CGI loss rates in the human and mouse lineages, which split about 80 million years (Myr) ago. The dog lineage is commonly thought to have branched off earlier than the primate-rodent divergence (e.g., Springer et al. 2003; Ostrander, Giger, and Lindblad-Toh 2006), so we used it as an outgroup to infer the number of CGI losses in the human or mouse lineage. Among the 1,261 genes having CGIs in the dog genome, there were 164 CGIs in the 156 D+H+M– genes, 41 CGIs in the 37 D+H–M+ genes, 226 CGIs in the 196 D+H–M– genes, and 935 CGIs in the 872 D+H+M+ genes (table 1). Based on the parsimony principle, we assume that the 164 CGIs in the D+H+M– group were lost in the mouse lineage and the 41 CGIs in the D+H–M+ group were lost in the human lineage. For the 226 CGIs in the D+H–M– group, we assume, for simplicity, that half of them (113) represent gains in the dog lineage plus losses in the common ancestor of human and mouse before the divergence of the human and mouse lineages and half of them represent losses in both the human and mouse lineages. Under this assumption, there were 164 + 41 + 113 + 935 = 1,253 CGIs in the common ancestor of human and mouse. Then, the rate of loss in the human lineage is estimated to be (41 + 113)/1,253/80 Myr = 1.5 losses per CGI per billion years while that in the mouse lineage is (164 + 113)/1,253/80 Myr = 2.8 losses per CGI per billion years. According to these estimates the rate of loss in the mouse lineage is ~1.9 times the rate in the human lineage. This estimate is lower than the above estimate of 4.1 times but it is probably less reliable because it involves more assumptions.
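The arithmetic behind these rate estimates can be retraced with a few lines of Python. The sketch below uses only the counts and assumptions stated in the paragraph; the variable names are ours. Note that the unrounded rates give a mouse-to-human ratio close to 1.8, whereas dividing the rounded rates (2.8/1.5) gives about 1.9.

```python
# CGI counts in dog-CGI-bearing genes, as reported in the text.
lost_in_mouse_only = 164   # D+H+M-
lost_in_human_only = 41    # D+H-M+
absent_in_both = 226       # D+H-M-
retained_in_both = 935     # D+H+M+

# Simplifying assumption from the text: half of the D+H-M- CGIs were present
# in the human-mouse ancestor and later lost in both lineages.
lost_in_both = absent_in_both // 2

ancestral_cgis = (lost_in_mouse_only + lost_in_human_only
                  + lost_in_both + retained_in_both)
divergence_myr = 80


def loss_rate_per_gyr(losses: int) -> float:
    """Losses per ancestral CGI per billion years."""
    return losses / ancestral_cgis / divergence_myr * 1000


human_rate = loss_rate_per_gyr(lost_in_human_only + lost_in_both)
mouse_rate = loss_rate_per_gyr(lost_in_mouse_only + lost_in_both)
print(ancestral_cgis, round(human_rate, 1), round(mouse_rate, 1),
      round(mouse_rate / human_rate, 2))
```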
The higher CGI loss rate in the mouse lineage is consistent with the view that the GC content has decreased at a higher rate in GC-rich sequences in the rodent lineage than in the human lineage (e.g., Mouchiroud, Gautier, and Bernardi 1988; Duret et al. 2002; Gu and Li 2006). Smith and Eyre-Walker (2002) examined the nucleotide polymorphisms in mice and found significantly more G/C → A/T than A/T → G/C changes, indicating that the GC content is decreasing in the mouse genome. In our analysis, we found significantly more G/C → A/T than A/T → G/C changes in both the mouse (431 vs. 92, P = 1.0 × 10^-49) and the human (3,440 vs. 1,020, P = 1.6 × 10^-287) genomes, but a stronger extent of GC content decrease in the mouse. Similar results were observed when we excluded the SNPs at the CpG sites or CpG → TpG/CpA changes, or when we normalized with the nucleotide content in the sequences (data not shown). Thus, the losses of CGIs in the human and especially the mouse genome are probably mainly due to a general tendency of decreasing GC content in the two genomes.
The higher CGI loss rate in the mouse lineage may reflect a weaker selective constraint in the mouse promoter regions than in the human promoter regions. This is consistent with our observation that CGIs are more likely to be lost in the tissue-specific genes in the mouse genome, a feature not observed in the human genes (table 4). The weaker selective constraint in mouse CGIs may also be related to unique rodent traits such as short generation time and small body size, which may allow a relaxed control of gene regulation (Matsuo et al. 1993).
CGIs and Gene Expression
An early analysis of 375 human genes found that all housekeeping and widely expressed genes, but only 25% of tissue-specific genes, have a CGI covering the transcription start site (TSS) (Larsen et al. 1992). More recently, Ponger et al. (2001) used expressed sequence tag (EST) data and estimated that 90% of housekeeping genes have promoter-associated CGIs. We reexamined this issue using our gene list and the second version of Gene Expression Atlas, which surveyed gene expression in 79 human and 61 mouse tissues (Su et al. 2004). Under the definition that a housekeeping gene is expressed in all tissues, we found that 316 of the 474 human housekeeping genes and 241 of the 418 mouse housekeeping genes had at least one CGI associated with the promoter regions. These frequencies (human 67%, mouse 58%) are much lower than the previous estimates. Further, if we define a gene to be widely expressed when it is expressed in more than 80% of tissues and to be tissue-specific when it is expressed in less than 20% of tissues, then for widely expressed genes 61% (836 of 1,374) of the human genes and 55% (555 of 1,009) of the mouse genes have promoter-associated CGIs, while for tissue-specific genes, the corresponding frequencies are 45% (256 of 575) for the human genes and 38% (417 of 1,100) for the mouse genes. The lists of these genes are provided in Supplementary tables S7–12 and the results in each group (H+M+, H+M–, H–M+, H–M–, D+H+M+, D+H+M–, D+H–M+, and D+H–M–) are shown in Supplementary table S13. These results suggest that the difference between the association of CGIs with the TSS or promoter regions of housekeeping genes and that of tissue-specific genes is not as large as previously estimated. However, one should be cautious in this type of comparison because the results depend on the CGI searching algorithm used, the definition of promoter-associated CGIs, the complexity of promoters (e.g., alternative promoters), and the gene expression data used.
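The three expression classes used here can be expressed as simple predicates over a gene's per-tissue AD values. The Python sketch below is a hypothetical illustration: it assumes that a tissue counts as expressing the gene when AD is at least 200 (whether the cutoff is inclusive is our assumption), and the function names are ours. Under these definitions a housekeeping gene also satisfies the widely expressed criterion.

```python
AD_CUTOFF = 200  # expression cutoff; treated as inclusive here (assumption)


def tissues_expressed(ad_values: list[float], cutoff: float = AD_CUTOFF) -> int:
    """Number of tissues in which the gene is called expressed."""
    return sum(1 for value in ad_values if value >= cutoff)


def is_housekeeping(ad_values: list[float]) -> bool:
    """Expressed in all surveyed tissues."""
    return tissues_expressed(ad_values) == len(ad_values)


def is_widely_expressed(ad_values: list[float]) -> bool:
    """Expressed in more than 80% of tissues (integer arithmetic avoids rounding)."""
    return 5 * tissues_expressed(ad_values) > 4 * len(ad_values)


def is_tissue_specific(ad_values: list[float]) -> bool:
    """Expressed in fewer than 20% of tissues."""
    return 5 * tissues_expressed(ad_values) < len(ad_values)


# Share of housekeeping genes with a promoter-associated CGI, from the text.
human_hk_pct = round(100 * 316 / 474)
mouse_hk_pct = round(100 * 241 / 418)
print(human_hk_pct, mouse_hk_pct)
```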
Conclusion
This study represents the first genome-level investigation of CpG island evolution in the promoter regions of human and mouse genes. Using three approaches (inference of the ancestral state of CGIs, examination of the substitution pattern, and inference of the mutational direction), we found numerous losses of promoter-associated CGIs in both the human and mouse genomes, but the loss rate in the mouse lineage is two to four times the rate in the human lineage. We identified a novel feature of CGIs suggesting that the erosion of a CGI usually starts from both ends and moves towards the center of the CGI. Moreover, we found functional bias in the genes that have lost promoter-associated CGIs in the human or mouse lineage. Finally, our analysis indicated that the association of CGIs with housekeeping genes is weaker than previously estimated. These findings are important for the study of gene function and vertebrate genome evolution.
Supplementary Material
Supplementary tables S1-13 and figures S1-2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Acknowledgements
We thank Hisakazu Iwama and Takashi Gojobori for kindly sending us their alignments of gene sequences, Andrew Su for kindly sending us the Gene Expression Atlas data, and Miaojun Han for assistance in sequence analysis. We thank the two anonymous reviewers for valuable comments. This project was supported by the Thomas F. and Kate Miller Jeffress Memorial Trust Fund and a NARSAD Young Investigator Award (to Z. Zhao).
Footnotes
1 These authors contributed equally to this work
Naoko Takezaki, Associate Editor
References
Abdolmaleky HM, Cheng KH, Russo A, et al. Hypermethylation of the reelin (RELN) promoter in the brain of schizophrenic patients: a preliminary report. Am J Med Genet B Neuropsychiatr Genet. (2005) 134:60–66.[Medline]
Abdolmaleky HM, Smith CL, Faraone SV, Shafa R, Stone W, Glatt SJ, Tsuang MT. Methylomics in psychiatry: modulation of gene-environment interactions may be through DNA methylation. Am J Med Genet B Neuropsychiatr Genet. (2004) 127:51–59.[Medline]
Aerts S, Thijs G, Dabrowski M, Moreau Y, De Moor B. Comprehensive analysis of the base composition around the transcription start site in Metazoa. BMC Genomics (2004) 5:34.[CrossRef][Medline]
Aissani B, Bernardi G. CpG islands, genes and isochores in the genomes of vertebrates. Gene (1991) 106:185–195.[CrossRef][ISI][Medline]
Antequera F. Structure, function and evolution of CpG island promoters. Cell Mol Life Sci. (2003) 60:1647–1658.[CrossRef][ISI][Medline]
Antequera F, Bird A. Number of CpG islands and genes in human and mouse. Proc Natl Acad Sci USA (1993) 90:11995–11999.
Attwood JT, Yung RL, Richardson BC. DNA methylation and the regulation of gene transcription. Cell Mol Life Sci. (2002) 59:241–257.[CrossRef][ISI][Medline]
Bell AC, Felsenfeld G. Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature (2000) 405:482–485.[CrossRef][Medline]
Bird A. DNA methylation patterns and epigenetic memory. Genes Dev. (2002) 16:6–21.
Bird AP. CpG-rich islands and the function of DNA methylation. Nature (1986) 321:209–213.[CrossRef][Medline]
Bird AP. CpG islands as gene markers in the vertebrate nucleus. Trends Genet. (1987) 3:342–347.[CrossRef][ISI]
Duret L, Semon M, Piganeau G, Mouchiroud D, Galtier N. Vanishing GC-rich isochores in mammalian genomes. Genetics (2002) 162:1837–1847.
Esteller M, Avizienyte E, Corn PG, Lothe RA, Baylin SB, Aaltonen LA, Herman JG. Epigenetic inactivation of LKB1 in primary tumors associated with the Peutz-Jeghers syndrome. Oncogene (2000a) 19:164–168.[CrossRef][ISI][Medline]
Esteller M, Silva JM, Dominguez G, et al. Promoter hypermethylation and BRCA1 inactivation in sporadic breast and ovarian tumors. J Natl Cancer Inst (2000b) 92:564–569.
Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. (1987) 196:261–282.
Gibbs RA, Weinstock GM, Metzker ML, et al. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature (2004) 428:493–521.
Graff JR, Herman JG, Myohanen S, Baylin SB, Vertino PM. Mapping patterns of CpG island methylation in normal and neoplastic cells implicates both upstream and downstream regions in de novo methylation. J Biol Chem. (1997) 272:22322–22329.
Gu J, Li WH. Are GC-rich isochores vanishing in mammals? Gene (2006) 385:50–56.
Herman JG, Latif F, Weng Y, et al. Silencing of the VHL tumor-suppressor gene by DNA methylation in renal carcinoma. Proc Natl Acad Sci USA (1994) 91:9700–9704.
Iwama H, Gojobori T. Highly conserved upstream sequences for transcription factor genes and implications for the regulatory network. Proc Natl Acad Sci USA (2004) 101:17156–17161.
Jiang C, Zhao Z. Mutational spectrum in the recent human genome inferred by single nucleotide polymorphisms. Genomics (2006) 88:527–534.
Jones PA, Baylin SB. The fundamental role of epigenetic events in cancer. Nat Rev Genet. (2002) 3:415–428.
Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature (2001) 409:860–921.
Larsen F, Gundersen G, Lopez R, Prydz H. CpG islands as gene markers in the human genome. Genomics (1992) 13:1095–1107.
Leung WK, Yu J, Ng EK, To KF, Ma PK, Lee TL, Go MY, Chung SC, Sung JJ. Concurrent hypermethylation of multiple tumor-related genes in gastric carcinoma and adjacent normal tissues. Cancer (2001) 91:2294–2301.
Lindblad-Toh K, Wade CM, Mikkelsen TS, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature (2005) 438:803–819.
Lopez-Giraldez F, Andres O, Domingo-Roura X, Bosch M. Analyses of carnivore microsatellites and their intimate association with tRNA-derived SINEs. BMC Genomics (2006) 7:269.
Maher ER, Reik W. Beckwith-Wiedemann syndrome: imprinting in clusters revisited. J Clin Invest (2000) 105:247–252.
Matsuo K, Clay O, Takahashi T, Silke J, Schaffner W. Evidence for erosion of mouse CpG islands during mammalian evolution. Somat Cell Mol Genet. (1993) 19:543–555.
Mouchiroud D, Gautier C, Bernardi G. The compositional distribution of coding sequences and DNA molecules in humans and murids. J Mol Evol. (1988) 27:311–320.
Ostrander EA, Giger U, Lindblad-Toh K. The Dog and Its Genome (2006) New York: Cold Spring Harbor Laboratory Press.
Pesole G, Liuni S, Grillo G, Ippedico M, Larizza A, Makalowski W, Saccone C. UTRdb: a specialized database of 5' and 3' untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. (1999) 27:188–191.
Ponger L, Duret L, Mouchiroud D. Determinants of CpG islands: expression in early embryo and isochore structure. Genome Res. (2001) 11:1854–1860.
Saxonov S, Berg P, Brutlag DL. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci USA (2006) 103:1412–1417.
Smith NG, Eyre-Walker A. The compositional evolution of the murid genome. J Mol Evol. (2002) 55:197–201.
Springer MS, Murphy WJ, Eizirik E, O'Brien SJ. Placental mammal diversification and the Cretaceous-Tertiary boundary. Proc Natl Acad Sci USA (2003) 100:1056–1061.
Su AI, Wiltshire T, Batalov S, et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA (2004) 101:6062–6067.
Sved J, Bird A. The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc Natl Acad Sci USA (1990) 87:4692–4696.
Takai D, Jones PA. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc Natl Acad Sci USA (2002) 99:3740–3745.
Tatusova TA, Madden TL. BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett (1999) 174:247–250.
Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science (2001) 291:1304–1351.
Waterston RH, Lindblad-Toh K, Birney E, et al. Initial sequencing and comparative analysis of the mouse genome. Nature (2002) 420:520–562.
Weber M, Hellmann I, Stadler MB, Ramos L, Paabo S, Rebhan M, Schubeler D. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat Genet. (2007) 39:457–466.
Yamada Y, Watanabe H, Miura F, Soejima H, Uchiyama M, Iwasaka T, Mukai T, Sakaki Y, Ito T. A comprehensive analysis of allelic methylation status of CpG islands on human chromosome 21q. Genome Res. (2004) 14:247–266.
Yamashita R, Suzuki Y, Sugano S, Nakai K. Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity. Gene (2005) 350:129–136.
Yamashita R, Suzuki Y, Wakaguri H, Tsuritani K, Nakai K, Sugano S. DBTSS: DataBase of Human Transcription Start Sites, progress report 2006. Nucleic Acids Res. (2006) 34:D86–89.
Yang J, Su AI, Li W-H. Gene expression evolves faster in narrowly than in broadly expressed mammalian genes. Mol Biol Evol. (2005) 22:2113–2118.
Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol. (2000) 7:203–214.
Zhao Z, Jiang C. Methylation-dependent transition rates are dependent on local sequence lengths and genomic regions. Mol Biol Evol. (2007) 24:23–25.
Zhao Z, Zhang F. Sequence context analysis of 8.2 million single nucleotide polymorphisms in the human genome. Gene (2006a) 366:316–324.
Zhao Z, Zhang F. Sequence context analysis in the mouse genome: Single nucleotide polymorphisms and CpG island sequences. Genomics (2006b) 87:68–74.