MBE Advance Access originally published online on March 21, 2006
Molecular Biology and Evolution 2006 23(6):1203-1216; doi:10.1093/molbev/msk008
Research Article
Strong Regional Biases in Nucleotide Substitution in the Chicken Genome
Department of Evolution, Genomics and Systematics, Evolutionary Biology Centre, Uppsala University, Uppsala, Sweden
E-mail: websterm{at}tcd.ie.
Abstract
Interspersed repeats have emerged as a valuable tool for studying neutral patterns of molecular evolution. Here we analyze variation in the rate and pattern of nucleotide substitution across all autosomes in the chicken genome by comparing the present-day CR1 repeat sequences with their ancestral copies and reconstructing nucleotide substitutions with a maximum likelihood model. The results shed light on the origin and evolution of large-scale heterogeneity in GC content found in the genomes of birds and mammals (the isochore structure). In contrast to mammals, where GC content is becoming homogenized, heterogeneity in GC content is being reinforced in the chicken genome. This is also supported by patterns of substitution inferred from alignments of introns in chicken, turkey, and quail. Analysis of individual substitution frequencies is consistent with the biased gene conversion (BGC) model of isochore evolution, and it is likely that patterns of evolution in the chicken genome closely resemble those in the ancestral amniote genome, when it is inferred that isochores originated. Microchromosomes and distal regions of macrochromosomes are found to have elevated substitution rates and a more GC-biased pattern of nucleotide substitution. This can largely be accounted for by a strong correlation between GC content and the rate and pattern of substitution. The results suggest that an interaction between increased mutability at CpG motifs and fixation biases due to BGC could explain increased levels of divergence in GC-rich regions.
Key Words: isochore, base composition, chicken, mutation, recombination
Introduction
Base composition is heterogeneous within a wide variety of eukaryotic genomes, characterized by local similarities in GC content within genomic regions and significant differences between regions (Nekrutenko and Li 2000).
Although the genomes of most eukaryotes exhibit some spatial heterogeneity in GC content, the genomes of mammals, birds, and reptiles exhibit more extreme variation in GC content, commonly known as the isochore structure (Filipski, Thiery, and Bernardi 1973; Bernardi, Hughes, and Mouchiroud 1997; Hughes, Zelus, and Mouchiroud 1999; Hughes, Clay, and Bernardi 2002). These genomes were originally described as mosaics of isochores with different GC contents, comprising long regions (>300 kb) where GC content is relatively homogeneous, separated by distinct boundaries (Bernardi 2000). Subsequent whole-genome analyses have indicated that this structure does not exist in the strict sense (Nekrutenko and Li 2000; IHGSC 2001). Nonetheless, it is clear that the genomes of mammals, birds, and reptiles are highly heterogeneous in GC content and have acquired GC-rich regions. In this article, we refer to such regions as GC-rich isochores. The phylogenetic distribution of GC-rich isochores suggests that they were acquired in the amniote lineage after the split with amphibians (the genomes of Xenopus species appear uniformly AT rich; Bernardi 2000). This isochore structure could potentially be influenced by natural selection. As GC-rich isochores have been observed in both warm- and cold-blooded animals, it seems unlikely that selection for increased thermal stability is responsible, as initially suggested (Bernardi et al. 1985). However, it is possible that selection could act on variation in GC content to optimize genomic structure due to the effects of GC content on physical properties of DNA and chromatin-level effects on control of gene expression (Vinogradov 2003, 2005).
It is likely that variation in neutral processes has played a large part in generating variation in GC content. A number of hypotheses have been put forward to explain how patterns of mutation could be variable and lead to heterogeneous patterns of GC content, for example, related to replication timing (Wolfe, Sharp, and Li 1989) or variation in efficiency of repair (Filipski 1987; Sueoka 1988). However, much recent evidence in mammals points to a role for recombination in generating variation in GC content (Galtier et al. 2001; Birdsell 2002; Galtier 2003; Montoya-Burgos, Boursot, and Galtier 2003; Webster et al. 2005). In humans, the equilibrium GC content (GC*), defined as the stable GC content toward which a genomic region is evolving, was found to correlate with the crossover rate, indicating that recombination is a major factor influencing variation in substitution pattern (Meunier and Duret 2004). Many reports support the idea that recombination is mutagenic (Lercher and Hurst 2002; Hellmann et al. 2003), and it is possible that this mutagenic effect also produces a bias toward mutations that incorporate G:C nucleotide pairs. However, it is most likely that recombination facilitates the accumulation of GC nucleotides through biased gene conversion (BGC), which results in a bias toward fixation of G:C alleles at sites that are polymorphic for A:T and G:C alleles (Eyre-Walker 1993; Galtier et al. 2001; Birdsell 2002). It has been demonstrated that BGC has dynamics identical to weak directional selection (Nagylaki 1983). This process is believed to act either by biased repair of heteroduplexes formed by gene conversion or by meiotic drive (Marais 2003).
Fixation biases toward G:C alleles have been detected by comparison of patterns of mutational changes in divergence and polymorphism data in both human and mouse (Duret et al. 2002; Smith and Eyre-Walker 2002; Webster, Smith, and Ellegren 2003), using modified versions of the McDonald-Kreitman test for selective neutrality (McDonald and Kreitman 1991). Furthermore, some studies in which the direction of mutations giving rise to single-nucleotide polymorphisms could be determined have noted that mutations from A or T to G or C (AT→GC) segregate at significantly higher frequencies than the opposite type (GC→AT) (Duret et al. 2002; Lercher et al. 2002; Webster and Smith 2004; Webster et al. 2005), also indicating that AT→GC mutations have an increased probability of fixation. It should be noted that the mechanisms described so far are not mutually exclusive: it is possible that biases in patterns of both mutation and fixation exist. Similarly, although it is unlikely that selection is acting on millions of single-nucleotide changes to alter GC content in vertebrates, it is plausible that selection acts on the processes creating this variation, such as recombination intensity (Otto and Lenormand 2002).
Perhaps surprisingly, a number of recent reports suggest that isochores in mammals are being homogenized (Duret et al. 2002; Smith, Webster, and Ellegren 2002; Arndt, Petrov, and Hwa 2003; Webster, Smith, and Ellegren 2003). This is particularly apparent in GC-rich regions, which are tending toward much lower GC contents. This was first suggested by the analysis of substitutions at synonymous sites in genes along lineages within three different mammalian orders (rodents, artiodactyls, and primates) using appropriate outgroups (Duret et al. 2002). Although the extent to which this homogenization is occurring in all mammalian lineages is unclear (Alvarez-Valin et al. 2004), a maximum likelihood (ML) analysis of 41 genes in up to 66 diverse mammals strongly supports this effect in early mammalian evolution (Belle et al. 2004). By analysis of patterns of substitution in human interspersed repeats of many different ages, Arndt, Petrov, and Hwa (2003) demonstrated a shift in the pattern of substitution, from isochore preserving to homogenizing, that occurred around the time of the mammalian radiation. A homogenization of GC content has also been observed in primate noncoding alignments (Smith, Webster, and Ellegren 2002; Webster, Smith, and Ellegren 2003; Meunier and Duret 2004) and in analysis of patterns of substitution in Alu repeats (Webster et al. 2005). Note that a recent study arguing against the homogenization of GC content in primates and rodents (Antezana 2005) is based on a highly unreliable method of inferring substitution patterns (Duret 2006).
This trend toward homogenization of GC content suggests that the forces responsible for creating and maintaining isochores in the ancestral amniote have declined in efficacy in mammals. Assuming that BGC is the main factor generating variation in GC content, at least three factors could be involved. Firstly, it is possible that the enzymes involved in heteroduplex repair have changed in mammalian lineages so that their preference for incorporating G:C pairs is altered (a change in the repair bias). Secondly, because BGC leads to a bias in the fixation of certain alleles (comparable to natural selection), it is sensitive to changes in effective population size (Ne). Hence, a reduction in Ne would reduce the effects of BGC. Thirdly, chromosomal rearrangements could affect variation in GC by altering the intensity of recombination. In particular, there is likely to be a requirement for one crossover per meiosis per chromosome arm (Pardo-Manuel de Villena and Sapienza 2001), which results in smaller chromosomes having higher average rates of recombination (Meunier and Duret 2004). Hence, changes in chromosome size could affect recombination rate (and hence the strength of BGC).
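The dependence of BGC on Ne can be illustrated with a short numerical sketch. Because BGC behaves like weak directional selection, the standard diffusion approximation for the fixation probability of a new allele, relative to a neutral one, can be borrowed: S / (1 - exp(-S)), with S = 4·Ne·b. The conversion bias `b`, the function name, and the parameter values below are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math


def relative_fixation_probability(S: float) -> float:
    """Fixation probability of a new allele relative to a neutral allele.

    S is the population-scaled bias (assumed here to be 4 * Ne * b for a
    diploid population with an additive bias b). S > 0 favours the allele,
    S < 0 disfavours it. Uses the diffusion approximation S / (1 - exp(-S)).
    """
    if abs(S) < 1e-9:
        return 1.0  # neutral limit
    if S < 0:
        # Algebraically identical form that avoids overflow for very negative S.
        e = math.exp(S)
        return -S * e / (1.0 - e)
    return S / (1.0 - math.exp(-S))


b = 1e-5  # hypothetical per-generation conversion bias toward G:C
for ne in (1_000, 10_000, 100_000):
    S = 4 * ne * b
    at_to_gc = relative_fixation_probability(S)    # favoured by BGC
    gc_to_at = relative_fixation_probability(-S)   # disfavoured by BGC
    print(f"Ne={ne:>7}  S={S:.2f}  AT->GC x{at_to_gc:.3f}  GC->AT x{gc_to_at:.3f}")
```

Under these assumptions the asymmetry between AT→GC and GC→AT fixation grows as exp(S), so a tenfold reduction in Ne shrinks the fixation bias considerably even if the repair bias itself is unchanged.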
Chromosomes in the chicken genome (Gallus gallus) are variable in size, and the autosomes are classified into 5 large macrochromosomes (56–188 Mb), 5 intermediate chromosomes (21–34 Mb), and 28 small microchromosomes (<100 kb to 19 Mb). In contrast, mammalian chromosomes tend to be longer and more similar in size. For example, the human genome has 22 autosomes (47–246 Mb) (IHGSC 2001; ICGSC 2004). There is evidence that the ancestral amniote genome closely resembled the chicken, implying that microchromosomes fused together during the evolution of the premammalian karyotype. Several studies point to an extremely slow rate of chromosomal evolution in the avian lineage compared with mammals (Bush et al. 1977; Burt et al. 1999; Burt 2002). Recently, Bourque et al. (2005) estimated that the number of interchromosomal rearrangements between chicken and a putative mammalian ancestor only slightly exceeds the number inferred in the mouse lineage, although the evolutionary distance is more than fivefold greater.
Recombination varies over an eightfold range among chicken chromosomes, and microchromosomes have elevated recombination rates and GC content. This extreme variation makes the chicken an ideal model for understanding the effects of recombination and GC content on substitution rate. Furthermore, as the chicken karyotype is similar to the ancestral amniote karyotype, analysis of the forces affecting GC content in the chicken genome can shed light on the forces responsible for generating GC-rich isochores in birds, reptiles, and mammals.
Comparative genomic studies have suggested that mutation rates are elevated in microchromosomes and subtelomeric regions compared with the rest of the chicken genome (ICGSC 2004; Axelsson et al. 2005). However, as many genomic features are variable between micro- and macrochromosomes, the precise causes of these observations are unclear. For instance, recombination, GC content, and the number of CpG motifs are all higher on microchromosomes. One possibility is that recombination is directly mutagenic (Lercher and Hurst 2002; Hellmann et al. 2003; Filatov 2004; ICGSC 2004). Alternatively, GC-rich regions could be more mutable simply because the rate of GC→AT mutation is high (Smith, Webster, and Ellegren 2002) or because they possess more hypermutable CpG sites (Hurst and Williams 2000). Fixation biases due to BGC can also alter substitution rates (Piganeau et al. 2002). One way to understand the relative contribution of these various processes is to analyze individual substitution frequencies separately. For example, if double-stranded breaks associated with recombination increase the rate of all types of transitions and transversions, then those mutations that do not alter GC content will also be affected (i.e., A↔T or G↔C transversions) (Filatov 2004). In contrast, BGC only affects mutations that do alter GC content (AT→GC and GC→AT).
Patterns of divergence in interspersed repeats can be used to examine variability in rates and patterns of substitution during evolution (Arndt, Petrov, and Hwa 2003; Arndt, Hwa, and Petrov 2005; Webster et al. 2005). Whereas a large proportion of human and other mammalian genomes is made up of interspersed repeats (40%–50%), less than 9% of the chicken genome is classified as interspersed repeats (ICGSC 2004; Wicker et al. 2005). However, 80% of this consists of the CR1 element (6.4% of the genome). CR1 is a long interspersed nuclear element with close similarity to the mammalian L1 element. So far, no intact copies closely resembling a CR1 master copy have been observed in the chicken genome, and only one full-length open reading frame was found in the initial chicken genome analysis, suggesting that CR1 is unlikely to be currently active in the chicken genome. A full-length CR1 is 4.5 kb, but the vast majority (99.4%) are truncated from their 5' end to around 1.2 kb. RepBase contains 22 CR1 master sequences, divided into 11 families (Jurka 2000; ICGSC 2004).
Here we reconstruct patterns of nucleotide substitution in a genome-wide sample of CR1 repeats by comparing each repeat sequence found in the chicken genome with its respective master copy, inferred to be its ancestral sequence. The results shed light on the causes of heterogeneity in GC content and variation in substitution rates observed in the chicken genome. In order to confirm our findings using an independent source of data, we also performed a comparative analysis of 34 intron sequences in chicken (G. gallus), turkey (Meleagris gallopavo), and Japanese quail (Coturnix japonica). As the ancestral sequence is not known in this case, we inferred patterns of substitution since the common ancestor of chicken and turkey using a parsimony approach (Meunier and Duret 2004).
Methods
Inference of Substitution Patterns from Interspersed Repeats
We reconstructed patterns of substitution in the chicken genome by comparison of all copies of a particular repeat family with their inferred ancestral copies. Interspersed repeats are noncoding and should therefore be free from functional constraints. It is commonly assumed that after insertion in any genomic location, they become inactive and begin to accumulate neutral substitutions. As repetitive elements are abundant in vertebrate genomes, they can provide large amounts of raw data with which to estimate variation in patterns of substitution. The ancestral copy of each sequence can be estimated by using the master copy defined in RepBase (Jurka 2000).
Knowledge of the ancestral sequence permits use of an ML approach to estimate the substitution patterns (Arndt, Burge, and Hwa 2003). This reconstructs substitution frequencies, correcting for multiple hits, for the four transversions, two transitions, and CpG transitions. These seven rates comprise all possible mutational changes, assuming strand complementarity and that there are no important context effects other than CpG mutability. As many repeats are highly diverged from their ancestral sequence, it is crucial to take into account multiple hits. This is particularly important at CpG sites, where mutation rates are elevated up to 10 times due to methylated cytosine mutagenesis (Yang et al. 1996; Templeton et al. 2000). Indeed, it is impossible to account for the effects of CpG hypermutability by selectively removing sites because sites that are not associated with CpGs in the ancestral sequence may still be involved with CpG mutations during evolution due to mutations at neighboring sites and/or multiple hits. Furthermore, some mutations at CpG sites may not be due to CpG hypermutability.
Alignment of CR1 Repeats with Master Sequences
We searched the draft assembly of the chicken genome (WASHUC1) using RepeatMasker (http://www.repeatmasker.org/) on the default settings. Generation and analysis of alignments were done using self-written Perl programs. We used a sliding window of 5 Mb across chromosomes. To prevent the subtelomeric regions of the five macrochromosomes being excluded from the analysis, we first removed a distal 5-Mb segment from each end of the macrochromosomes and divided the remainder into nonoverlapping 5-Mb segments. We concatenated the alignments made by RepeatMasker of each CR1 repeat with its identified master copy (taken from a library containing 22 CR1 master copies) within each segment. We made 11 such alignments for each segment by dividing initial alignments into the 11 CR1 families (shown in fig. 4 of ICGSC 2004). Hence, for each genomic segment, a set of 11 long pairwise alignments was produced, where one sequence consisted of concatenated master copies from a particular repeat family and the other of the concatenated repeat sequences from the particular genomic segment identified as descendants of those copies. We calculated the GC content and frequency of CpG motifs present in the nonrepetitive, nongenic sequence within each 5-Mb segment. This was done using the repeat-masked sequence from which genes were masked using the annotations from the initial genome sequence analysis. We also calculated the proportion of sites in exons within each segment using these annotations, which we term exon density.
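The segmentation scheme can be sketched as follows. This is an illustrative Python reconstruction, not the Perl code used in the study; the function name, the 0-based half-open coordinates, and the decision to keep a final window shorter than 5 Mb are all assumptions made for the example.

```python
def segment_chromosome(length_bp, window_bp=5_000_000, split_distal=False):
    """Divide a chromosome into nonoverlapping windows.

    Returns (start, end, label) tuples with 0-based, half-open coordinates.
    If split_distal is True (as described for macrochromosomes), a distal
    window is first taken from each end and the remainder is tiled.
    Assumption: a trailing window shorter than window_bp is kept as is.
    """
    segments = []
    take_distal = split_distal and length_bp > 2 * window_bp
    start, end = 0, length_bp
    if take_distal:
        segments.append((0, window_bp, "distal"))
        start, end = window_bp, length_bp - window_bp
    pos = start
    while pos < end:
        stop = min(pos + window_bp, end)
        segments.append((pos, stop, "interior"))
        pos = stop
    if take_distal:
        segments.append((length_bp - window_bp, length_bp, "distal"))
    return segments


# Hypothetical 22-Mb macrochromosome: two distal windows plus the tiled interior.
for seg in segment_chromosome(22_000_000, split_distal=True):
    print(seg)
```

Concatenated alignments would then be built per window and per CR1 family, so each window contributes up to 11 pairwise alignments.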
Estimation of Regional Substitution Pattern in the Chicken Genome
Alignments with fewer than 5,000 bases were excluded from the analysis of substitution rates. Estimates of the frequencies of the seven different substitution events (see above) were obtained using the ML approach described by Arndt, Burge, and Hwa (2003).
Each substitution frequency represents the relative frequency of each event per potentially mutable site (e.g., the G:C→A:T rate is the frequency of this type of substitution at positions which are G or C in the sequence). To estimate the predicted relative contribution of each substitution frequency to the present-day substitution rate, we multiplied the time-averaged substitution frequency in each region by the proportion of A:T or G:C base pairs in the noncoding, nonrepetitive region in each particular 5-Mb genomic window. For example, the relative contribution of G:C→A:T changes to the expected substitution rate in a particular genomic region is equal to the G:C→A:T substitution frequency in that region multiplied by its GC content. To calculate the CpG rates, we multiplied the relevant CpG rate by the number of CpG sites in the present-day flanking sequence (this assumes that all CpG sites are methylated). We refer to these as net predicted rates.
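The calculation of net predicted rates amounts to weighting each per-site frequency by the abundance of its target sites. A minimal Python sketch is given below; the dictionary keys, the function name, the use of a per-site CpG density rather than a raw CpG count, and the numerical values are illustrative assumptions rather than the study's own code or data.

```python
# Substitution classes grouped by the base pair they originate from.
AT_ORIGIN = ("A:T>T:A", "A:T>C:G", "A:T>G:C")
GC_ORIGIN = ("C:G>G:C", "G:C>T:A", "G:C>A:T")
CPG = "CpG transition"


def net_predicted_rates(freqs, gc_content, cpg_density):
    """Weight per-site substitution frequencies by target-site abundance.

    freqs       : dict of the seven time-averaged substitution frequencies
    gc_content  : fraction of G:C pairs in the flanking sequence (0..1)
    cpg_density : CpG sites per base in the flanking sequence (assumes all
                  CpG sites are methylated, as in the text)
    """
    at_content = 1.0 - gc_content
    net = {name: freqs[name] * at_content for name in AT_ORIGIN}
    net.update({name: freqs[name] * gc_content for name in GC_ORIGIN})
    net[CPG] = freqs[CPG] * cpg_density
    return net


# Hypothetical frequencies for one 5-Mb window.
example = {
    "A:T>T:A": 0.02, "C:G>G:C": 0.02,
    "A:T>C:G": 0.03, "A:T>G:C": 0.10,
    "G:C>T:A": 0.04, "G:C>A:T": 0.15,
    CPG: 1.0,
}
rates = net_predicted_rates(example, gc_content=0.40, cpg_density=0.01)
print(rates, sum(rates.values()))
```

Summing the seven weighted terms gives a net predicted substitution rate for the window, which can then be expressed relative to the genomic average.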
Analysis of Substitution Pattern in Human Alu Repeats
In order to compare the relationship between GC* and GC content in the chicken and human genomes, we reanalyzed a comparable data set of human repeats presented in Webster et al. (2005). In this data set, concatenated alignments were made of Alu repeats with the human genome divided into segments with the boundaries halfway between genetic markers with known crossover frequencies (average length of segments was 595 kb). We used the same correction for repeat element age described above for CR1 repeats (Arndt, Hwa, and Petrov 2005) to correct for the age of each Alu repeat using the average transversion frequencies of AluJ, AluS, and AluY repeats. The resulting estimates of each of the substitution frequencies were then used to calculate GC* for each genomic segment using forward simulation (Arndt, Burge, and Hwa 2003).
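To make the notion of GC* concrete, the sketch below uses a deliberately simplified two-state model in which each site is either A:T or G:C and neighbor-dependent CpG effects are ignored. It is not the model of Arndt, Burge, and Hwa (2003) used in this study; the function names and rate values are hypothetical. In this simplified setting the equilibrium has a closed form, GC* = u / (u + v), where u is the combined AT→GC frequency and v the combined GC→AT frequency, and a forward iteration converges to the same value.

```python
def gc_star(at_to_gc, gc_to_at):
    """Equilibrium GC content of a two-state model (CpG effects ignored)."""
    return at_to_gc / (at_to_gc + gc_to_at)


def simulate_gc(gc0, at_to_gc, gc_to_at, steps):
    """Deterministic forward iteration of expected GC content."""
    gc = gc0
    for _ in range(steps):
        # Gain from A:T sites becoming G:C, loss from G:C sites becoming A:T.
        gc = gc + (1.0 - gc) * at_to_gc - gc * gc_to_at
    return gc


u, v = 0.02, 0.03  # hypothetical per-step substitution frequencies
print(gc_star(u, v))                 # closed-form equilibrium
print(simulate_gc(0.55, u, v, 1000)) # forward iteration from a GC-rich start
```

A region whose present GC content lies above its GC* is expected to lose GC over time, and one below its GC* to gain it, which is the basis for comparing GC content with GC* across segments.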
Analysis of Pattern of Substitution in Intron Alignments
Sequence data from 34 orthologous introns in chicken and turkey, spread over the genome, were previously presented by Axelsson et al. (2005). For the purpose of this study, the orthologous intron sequence in Japanese quail was obtained using the same laboratory methods. Alignment of orthologous sequences was performed using ClustalW (Thompson, Higgins, and Gibson 1994) under the default settings and then checked manually. Details of all alignments are presented in Supplementary Table 1 (Supplementary Material online). We first performed a pairwise analysis of substitutions between all three species. This indicated that divergence between quail and either chicken or turkey was approximately 20% higher than between chicken and turkey. We therefore considered quail to be the outgroup of chicken and turkey, as also indicated by previous studies (Dimcheff, Drovetski, and Mindell 2002).
We analyzed substitutions along the chicken and turkey lineages using a parsimony approach. In order to minimize misinference caused by homoplasy at hypermutable CpG sites, we followed the protocol developed by Meunier and Duret (2004). Accordingly, we considered three classes of sites: (1) CpG free, (2) CpG ancestral, and (3) all other sites. We estimated the four transversion and two transition rates from the first site class. The CpG transition rate was estimated using the second class. We used these seven substitution rates to derive the GC* for each alignment using the sequence evolution model of Arndt, Burge, and Hwa (2003).
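The core of the parsimony step can be sketched in a few lines of Python. This is a simplified illustration only: it polarizes substitutions with the quail outgroup and tallies whether they change GC content, but it omits the three CpG site classes of the Meunier and Duret (2004) protocol. The function names and the toy sequences are hypothetical, and sequences are assumed to be aligned, equal in length, and uppercase.

```python
from collections import Counter

GC_BASES = {"G", "C"}


def classify(ancestral, derived):
    """Label a substitution by its effect on GC content."""
    if ancestral in GC_BASES and derived not in GC_BASES:
        return "GC>AT"
    if ancestral not in GC_BASES and derived in GC_BASES:
        return "AT>GC"
    return "GC-neutral"


def infer_substitutions(chicken, turkey, quail):
    """Assign substitutions to the chicken or turkey lineage by parsimony."""
    counts = {"chicken": Counter(), "turkey": Counter()}
    for c, t, q in zip(chicken, turkey, quail):
        if any(base not in "ACGT" for base in (c, t, q)):
            continue  # skip gaps and ambiguous bases
        if c == t:
            continue  # no change inferred on either ingroup lineage
        if t == q:
            counts["chicken"][classify(q, c)] += 1  # change on chicken lineage
        elif c == q:
            counts["turkey"][classify(q, t)] += 1   # change on turkey lineage
        # If all three differ, the ancestral state is ambiguous and is skipped.
    return counts


# Toy alignment, not real intron data.
result = infer_substitutions("AGGTCA", "AAGCCG", "AAGTGG")
print(result)
```

In the full protocol, sites would additionally be partitioned by CpG context before rates are estimated, so that the CpG transition rate is obtained only from ancestrally CpG sites.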
We performed simulations of molecular evolution to estimate the error expected by using parsimony for estimating substitution rates from our intron data set. We simulated evolution along the phylogeny inferred for the three species using the observed chicken-turkey and chicken-quail divergences of 10% and 12%, respectively. In each simulation, the transition/transversion ratio was set to 2.75, and the CpG transition rate was 10 times greater than other transitions. We then introduced different biases in the relative rates of GC→AT and AT→GC substitutions, resulting in four parameter sets. The stationary dinucleotide base composition corresponding to each set of substitution rate parameters was obtained using the sequence evolution model of Arndt, Burge, and Hwa (2003). The four parameter sets corresponded to GC* values of 36%, 42%, 52%, and 60%. The stationary dinucleotide frequencies were used to generate random sequences, which served as a starting point for the simulations. Substitutions were subsequently allowed to accumulate on the chicken-turkey-quail phylogenetic tree, using the same substitution rate parameters (i.e., assuming that GC content remains at equilibrium). We then compared the parameter estimates obtained by applying the parsimony approach to the simulated sequences with the true parameter values to ascertain the accuracy of the parsimony approach.
Statistics
All statistical analyses were performed in R (http://www.r-project.org). Confidence intervals (CIs) were produced by bootstrapping. In order to determine the CIs for average transition frequencies for each individual CR1 family, we resampled with replacement concatenated alignments corresponding to each family from the entire chicken genome with 10,000 replicates. CIs for the rate and pattern of substitution in different chromosome classes were derived in a similar way by randomly resampling time-averaged estimates of each individual substitution frequency in each chromosomal class. CIs for correlation coefficients were also calculated by bootstrap. To describe the relationship between flanking GC content and the individual substitution frequencies, we fitted quadratic equations. In order to correct for the correlation with GC content when calculating the difference in substitution rate on different chromosome classes, we used the residuals of the fitted curve between GC and the substitution pattern.
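As an illustration of the bootstrap procedure for correlation coefficients, a percentile-bootstrap sketch is given below. The analyses in this study were performed in R; this Python version, including the function names, the percentile method, the number of replicates, and the simulated data, is an assumption made purely for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient; raises ValueError if a variance is zero."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("zero variance")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(xs, ys, replicates=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson's r, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    while len(stats) < replicates:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(pearson_r([xs[i] for i in idx], [ys[i] for i in idx]))
        except ValueError:
            continue  # degenerate resample (all values identical); draw again
    stats.sort()
    lower = stats[int((alpha / 2) * replicates)]
    upper = stats[int((1 - alpha / 2) * replicates) - 1]
    return lower, upper


# Simulated GC content and GC* values for 40 hypothetical segments.
rng = random.Random(1)
gc = [0.35 + 0.15 * rng.random() for _ in range(40)]
gc_eq = [1.4 * g - 0.15 + rng.gauss(0, 0.02) for g in gc]
print(pearson_r(gc, gc_eq), bootstrap_ci(gc, gc_eq, replicates=2_000))
```

The same resampling loop could be reused for other statistics, such as the regression gradient, by swapping the function applied to each resample.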
Results
We divided the chicken genome into 5-Mb segments, resulting in a total of 191 blocks. For each block, we constructed 11 concatenated alignments corresponding to repeats within each major CR1 repeat family aligned with their corresponding master sequence. In total, 2,131 concatenated alignments were produced, which reduced to 1,881 when those under 5,000 bp were removed from the data set. The GC content of the CR1 master copies in RepBase ranges from 52.9% to 56.9%. Both full-length master copies and truncated CR1 repeats are rich in CpG motifs (full-length copies contain 128.8 CpG motifs and truncated copies contain 27.2), indicating that it is crucial to accurately consider the CpG mutation process. A summary of the data set is shown in table 1.
In general, there is a good correspondence between the genome-wide estimates of average transition and transversion substitution frequencies from the 11 CR1 families (fig. 1). However, different CR1 families have different ages. To measure the extent to which different genomic regions accumulate changes at different rates, we calculated a time-averaged estimate for each of the seven substitution frequencies in each 5-Mb block. To do this, we corrected each ML estimate of the individual substitution frequency with the genome-wide average transversion frequency of the particular CR1 family as described in Methods. These time-corrected estimates of each of the seven substitution frequencies in each 5-Mb genomic segment were used in all further analyses. Note, however, that this correction for differences in age does not affect the overall "pattern" of substitution (e.g., estimates of GC*) as all substitution frequencies are affected equally.
There is a significant correlation between the nonrepetitive nongenic GC content of each segment and GC* (fig. 2A; Pearson's r = 0.898; P < 10⁻⁴; 95% CI 0.866–0.928 by bootstrap). A 1:1 relationship between GC content and GC* (shown on graph) is expected if base composition is stable along the chicken lineage. As the gradient of the linear regression line is significantly greater than one (1.39; P < 10⁻⁴; 95% CI 1.27–1.55), the heterogeneity between genomic regions appears to be increasing. In order to make a comparison with the human lineage, we made a similar analysis of a data set from humans using Alu repeats (fig. 2B). A significant correlation between GC content and GC* is observed (r = 0.614; P < 10⁻⁴; 95% CI 0.583–0.642). As has been previously demonstrated, the gradient of the regression line between GC content and GC* is much less than one (0.242; P < 10⁻⁴; 95% CI 0.227–0.258), indicating that GC content is becoming homogenized on the human lineage (Webster et al. 2005).
We also analyzed the pattern of substitution in 34 intron alignments from chicken, turkey, and quail. The total number of aligned bases was 22.5 kb. Figure 3 shows a significant correlation between the average GC content of each intron and GC* (r = 0.618; P < 10⁻⁴; 95% CI 0.374–0.790). The gradient of this line is not significantly different from one (0.837; 95% CI 0.451–1.24). However, the gradient is significantly greater than the gradient from Alu repeats in figure 2B (P = 0.006). This is therefore consistent with the data from chicken CR1 repeats, indicating that GC content is either stable or reinforced along the chicken lineage. We performed simulations to test the reliability of using parsimony to estimate GC* from the intron alignments. When the substitution parameters were set so that GC* remained at 36%, parsimony estimated GC* to be 0.4% higher. With values of 42%, 52%, and 60%, the parsimony method estimated GC* to be 0.6%, 1.4%, and 2.1% lower, respectively. This indicates that the expected error is small within the range of GC values used in this study. In addition, there is a trend toward parsimony leading to false inference of homogenization of GC content, as has been previously demonstrated (Eyre-Walker 1998).
Figure 4 shows the relationship between GC content and each of the seven individual substitution frequencies. We fitted quadratic equations to each graph to describe this relationship. Figure 4A shows this relationship for the two transversions that do not affect GC content (A:T→T:A and C:G→G:C). Neither of these rates exhibits much variation with GC content. However, for the transversions that do affect GC content (fig. 4B), the GC-increasing transversion (A:T→C:G) shows a strong increase with GC content, whereas the opposite trend is shown by the GC-decreasing transversion (G:C→T:A). A similar trend is exhibited by the two transitions (fig. 4C). The GC-increasing transition (A:T→G:C) shows a strong increase with GC content, whereas the GC-decreasing one (C:G→T:A) decreases with increasing GC. The CpG transition rate seems to increase in regions of higher GC content (fig. 4D). There is large variance in this measure, which could reflect difficulties in accurate estimation because there are fewer CpG sites. In a similar analysis of the human genome, Arndt, Hwa, and Petrov (2005) [...] G:C and G:C→T:A transversion frequencies in GC-poor regions (<35%). This is not observed in our data set.
In order to determine the predicted net effect of the estimated substitution pattern on substitution rate in each genomic segment, we multiplied each rate by the GC or AT content of the genomic segment (or CpG content in the case of the CpG rate). Figure 5A shows the calculated net predicted rate in each region due to the substitutions that do not affect GC content (A:T→T:A or C:G→G:C). This rate is virtually unchanged across regions of different GC contents. As seen from figures 4B and C, both of the GC-increasing (AT→GC) substitution frequencies show a similar (positive) relationship with GC content, whereas the GC-decreasing (GC→AT) substitution frequencies both show a negative relationship with GC content. Figure 5B shows the net predicted effect on the AT→GC rate. Despite the fact that the number of A:T nucleotides is lower (by definition) in regions of high GC content, the substitution rate due to AT→GC changes increases with GC content. Hence, the increased substitution frequency of AT→GC changes in regions of high GC content is strong enough to counteract the paucity of A:T nucleotides. Figure 5C shows the net predicted effect on the GC→AT rate. When the CpG transitions are also included, the relationship is almost identical to the AT→GC relationship in figure 5B. This congruence of the relationships between both AT→GC and GC→AT with GC content is consistent with the relationship between GC content and GC* shown in figure 2A: the net effects of AT→GC and GC→AT substitutions in changing GC content cancel out, and GC content remains relatively stable. When CpG transitions are excluded, the net predicted effect on substitution rate of GC→AT substitutions shows very little variation with GC content. This indicates that the presence of additional methylated CpG sites is an important factor increasing mutation rate in GC-rich regions in chicken.
When all seven substitution frequencies are used to estimate the net predicted substitution rate relative to the genomic average, there is a strong positive correlation with GC content (r = 0.832; P < 10⁻⁴; 95% CI 0.781–0.876). Figure 6 shows the linear regression fitted to these data. There is a more than twofold variation in these rates between 5-Mb blocks in genomic regions with low and high GC content, indicating that there is substantial variation in rates of single-nucleotide mutation and fixation across the chicken genome.
To investigate the effect of genomic location on the rate and pattern of nucleotide substitution, we partitioned the 5-Mb blocks into those on microchromosomes, macrochromosomes, and intermediate chromosomes. The distal portions of macrochromosomes (defined as 5 Mb encompassing the subtelomeric region at each end of the chromosome) were considered separately. A shorter definition of subtelomeric regions was not used due to lack of data. As we have shown previously (Axelsson et al. 2005
There are also significant differences in GC* estimated for different chromosomal regions (fig. 8A). The average GC* in microchromosomes is 47%, whereas for macrochromosomes it is 37%, with intermediate chromosomes about halfway between (42%). Pairwise comparisons between macrochromosomes, intermediate chromosomes, and microchromosomes are all highly significant (P < 10⁻⁴), and distal regions of macrochromosomes have significantly higher GC* than the remainder of macrochromosomes (P < 10⁻⁴). The distal portions of macrochromosomes appear similar to microchromosomes in GC*. When we corrected for flanking GC content by examining the residuals of the linear regression between GC content and GC*, the difference in GC* between genomic regions almost completely disappears (fig. 8B). None of the pairwise comparisons are now significant except for that between the distal and nondistal regions of macrochromosomes (P < 0.009).
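The residual-based correction described above can be sketched as follows: fit a least-squares line of GC* on flanking GC content across all blocks, then compare the mean residual per chromosomal category. This is an illustrative outline only; the function names, region labels, and numbers are hypothetical and do not reproduce the analysis or data of this study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def residuals(xs, ys):
    """Observed y minus the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def mean_by_group(values, labels):
    """Mean of `values` within each label (e.g. chromosomal category)."""
    sums, counts = {}, {}
    for value, label in zip(values, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


if __name__ == "__main__":
    # Hypothetical 5-Mb blocks: flanking GC content, GC*, and region label.
    gc = [0.37, 0.38, 0.42, 0.43, 0.47, 0.48]
    gc_star = [0.36, 0.38, 0.42, 0.42, 0.47, 0.47]
    region = ["macro", "macro", "intermediate", "intermediate", "micro", "micro"]
    print(mean_by_group(residuals(gc, gc_star), region))
```

If the category means of the residuals are close to zero, differences in GC* between categories are largely explained by differences in GC content, which is the interpretation given in the text.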
Exon density has been previously examined as a potential correlate of the rate and pattern of nucleotide substitution (Arndt, Hwa, and Petrov 2005
| Discussion |
|---|
Inferring Substitutions from CR1 Repeats
We reconstructed patterns of nucleotide substitution along the chicken lineage by comparing CR1 repeats with their inferred ancestral sequence. Calibrating the molecular clock with an estimate of substitution rate from Alu repeats in mammals suggests that an average transversion frequency of 0.1 corresponds to 35 MYA (Kapitonov and Jurka 1996
155 MYA, with the majority of repeats being inserted 50–125 MYA. Phylogenetic analysis of the ancestral sequences suggests that CR1 repeats are distantly related to mammalian L3 elements and that some chicken CR1 repeat families predate the chicken-turtle split (210–230 MYA; Hedges and Poling 1999
We used an ML method that can accurately reconstruct neutral substitution patterns, including the neighbor-dependent CpG rate, to infer patterns of evolution in CR1 repeats. This is important because it takes into account multiple hits and can model the effect of CpG mutations, which may also affect sites that are not ancestrally CpG. As with other analyses of this type, it is necessary to assume a star phylogeny, which implies that insertions of particular CR1 families occur in rapid bursts followed by inactivity. This is a generally accepted model for the evolution of vertebrate interspersed repeats (Kapitonov and Jurka 1996; Jurka 2000). We make the assumption that the master copy in RepBase identified by RepeatMasker at each insertion site is the true ancestral sequence and that all differences from the descendant sequences accumulated through neutral substitutions subsequent to insertion.
The set of 22 CR1 master copies currently available in RepBase was constructed by improving on a previously defined set of transposable elements using the program RECON (Bao and Eddy 2002) as part of the analysis of the completed chicken genome (ICGSC 2004). The RECON program has been demonstrated to recover the known families of transposable elements in the human genome with high accuracy (Bao and Eddy 2002). The master copies in RepBase should therefore correspond well to the ancestral sequences of the CR1 repeats in the chicken genome. However, we cannot rule out the possibility that as yet unidentified master copies have acted as secondary source elements, as may occur in human Alu repeats (Cordaux et al. 2004). Errors in identification of the correct master element would mean that some of the estimated substitutions actually occurred between the identified master copy and the true ancestor, rather than accumulating neutrally subsequent to insertion. This would result in genome-wide errors and would therefore lead to inference of a more homogenized substitution pattern. As we observe strong regional biases in the pattern of substitution, we argue that it is unlikely that the presence of unidentified master copies has strongly influenced the results.
Another potential problem with the use of repeat elements to infer patterns of substitution is that they may not be representative of noncoding sequence in general. In particular, CR1 elements are rich in CpG sites, which may experience higher degrees of methylation in transposable elements than in surrounding noncoding DNA (Meunier et al. 2005). This could cause rates of CpG mutability to be slightly elevated in CR1 elements. Furthermore, ectopic gene conversion could occur between CR1 repeats, which could bias the pattern of neutral substitution, possibly leading to elevated estimates of GC* (Galtier 2003). However, as the effects of CpG mutability and gene conversion are likely to influence patterns of substitution at all CR1 repeats in a similar fashion, we do not expect them to contribute to the significant regional biases in patterns of nucleotide substitution that we infer.
Evolution of Isochores
In contrast to mammals, where GC content is becoming homogenized, patterns of molecular evolution in CR1 repeats indicate that genomic heterogeneity in GC content is increasing along the chicken lineage. Our analysis of patterns of substitution along the chicken and turkey lineages using intronic alignments also indicates that the forces maintaining variation in GC content are much stronger than in mammals.
What are the forces responsible for variation in GC content? As the phylogenetic distribution of GC-rich isochores indicates that they have a common origin in the amniote common ancestor (see Introduction), it is likely that similar processes govern the evolution of GC content in mammals and birds. In primate noncoding alignments, a strong correlation between recombination rate and GC* indicates that recombination drives the evolution of GC content (Meunier and Duret 2004). Unfortunately, fine-scale recombination maps are not available for the chicken genome. In humans, recombination is known to be highly variable and rapidly evolving, even on the kilobase scale (Kauppi, Jeffreys, and Keeney 2004; McVean et al. 2004; Ptak et al. 2004). As GC content is known to correlate with recombination in birds (Hurst, Brunton, and Smith 1999; Galtier et al. 2001; ICGSC 2004) and a variety of other organisms (Birdsell 2002), it is the best available measure for local rates of recombination. Insight into the effect of recombination on the pattern of nucleotide substitution can therefore be gained by examining the variation of individual substitution rates with GC content. Those substitutions that affect GC content (AT→GC or GC→AT) all show a strong relationship with GC content. The AT→GC substitution frequency increases with GC content, whereas the GC→AT substitution frequency decreases with GC content (fig. 4B and C). These findings are compatible with an increased bias toward fixation of G:C over A:T alleles in regions of higher recombination. This is consistent with a strong effect of BGC in regions of high recombination. It should, however, be noted that so far there is no direct evidence to suggest that BGC is an important process in birds.
What could be responsible for the differences in the evolution of GC content between mammals and birds? The karyotype of chicken is divided into macro- and microchromosomes and characterized by extreme variation in GC content and recombination rate. The BGC hypothesis suggests that these factors are linked because smaller chromosomes tend to have higher recombination rates and hence experience a higher intensity of BGC, which results in elevated GC content. There is good evidence from comparative genomics that the ancestral amniote karyotype was similar to that of the chicken (Burt et al. 1999; Burt 2002; ICGSC 2004; Bourque et al. 2005). This could suggest that a GC-reinforcing pattern of nucleotide substitution, as inferred in the chicken genome, was also present in the ancestral amniote genome. The ancestral bony vertebrate genome (450 MYA) has been estimated to have contained 12 chromosomes by comparison of human and Tetraodon (ICGSC 2004; Jaillon et al. 2004). Hence, it appears that the ancestral amniote karyotype evolved after this split. Heterogeneity in GC content could then have arisen due to increased variability in recombination rates.
Although it seems likely that the major trends for GC content are decay in mammals and reinforcement in birds, the detailed picture is probably far more complex. Chromosome number and genome size vary both within and between mammalian and avian orders, and extremes of recombination rate are therefore possible in a variety of species. In general, avian genomes have a large number of chromosomes (2n is usually 60–80) and genome sizes of roughly 1–2 billion base pairs. Relative to mammals, large-scale genome duplications and rearrangements are infrequent (Shetty, Griffin, and Graves 1999; Bourque et al. 2005). Mammalian genomes are roughly 2–4 billion base pairs in size but exhibit large variability in chromosome numbers (Gregory 2005). This large variation in karyotype suggests that many mammalian genomes may have regions with high recombination rates. It is therefore quite possible that GC-rich isochores are being preserved or reinforced in parts of some mammalian genomes. Likewise, they may be found in lineages outside amniotes: both a GC repair bias and a correlation between GC content and recombination have also been observed in many species unrelated to amniotes, including amphibians, plants, fish, yeast, and bacteria (Birdsell 2002). However, the broad picture that is emerging is one whereby strong variation in GC content arose in the ancestor of birds, mammals, and reptiles and this heterogeneity has been maintained or reinforced along some lineages, whereas others show a tendency for homogenization.
Determinants of Nucleotide Substitution Rate
The strong isochore structure of the chicken genome and the finding that this heterogeneity in GC content is being reinforced have important consequences in generating variation in mutation and substitution rate across the genome. Some recent studies have suggested that recombination is mutagenic in humans (Lercher and Hurst 2002; Hellmann et al. 2003). To examine this potential effect, Filatov (2004) analyzed rates and patterns of substitution in the highly recombining human p-arm pseudoautosomal region. In order to exclude the potential effects of BGC, which only affects AT→GC and GC→AT mutations, only A→T and G→C substitutions were considered. As these rates were elevated in the pseudoautosomal region, it was concluded that an additional factor such as a mutagenic effect of recombination was important. To examine this using the current data set, we studied variation with GC content of the same transversion frequencies (A→T and G→C). These substitution frequencies show little variation with GC content. Indeed, all the predicted net variation in substitution rates is due to AT→GC and GC→AT substitutions. If a process such as recombination increases all forms of mutation in GC-rich regions, we would expect substitutions that do not affect GC to also change. As this is not observed, recombination may not be mutagenic in chicken. Alternatively, it is possible that it does not cause A→T or G→C mutations or that some information is lost by using GC content as a proxy for recombination.
The main trends of variation in substitution frequencies are a strong increase in AT→GC substitutions with GC content and a corresponding decrease in the opposite GC→AT type (fig. 4B and C). This is consistent with the action of BGC, which favors the fixation of G:C over A:T alleles in regions of higher recombination rate (and GC content). However, as shown in figure 5B and C, the net effect of these opposing substitutions on GC content is expected to cancel out (i.e., the relationship between GC content and the net predicted AT→GC and GC→AT substitution rates is roughly the same). This results in GC content remaining approximately stable, as evidenced by a good correspondence between GC* and GC content in fig. 2A (although the gradient is actually significantly greater than one). In concordance with the findings of Axelsson et al. (2005) and ICGSC (2004), the net predicted effect of the variation in substitution frequencies is a strong elevation of substitution rate in microchromosomes and other GC-rich regions, such as the distal portions of macrochromosomes encompassing the subtelomeric regions. Examination of the residuals of the correlations between GC content and the rate and pattern of substitution indicates that the majority of variation in nucleotide substitution between chromosome types and distal regions can be accounted for by GC content. We therefore have no evidence to suggest that there are any qualitative differences in the mode of evolution between these genomic regions.
What is the cause of higher substitution rates in GC-rich regions? It is likely that the patterns result from a complex interaction of factors influencing mutation and fixation. From figure 5C, it seems clear that there is an increased number of predicted changes due to CpG mutations in GC-rich regions. This is not unexpected, as CpG sites are highly mutable and more frequent in GC-rich regions. However, there is also a corresponding increase in AT→GC substitutions (fig. 5B). It is likely that BGC is important in generating this increase as it favors the fixation of G:C alleles over A:T in regions of high recombination. Hence, this could lead to a dynamic situation where the high rate of decay of CpG sites is balanced by BGC, which creates new CpG sites in GC-rich regions. So far this situation has not been explicitly modeled, although good simulations exist where fixation biases are uniform across the genome (Piganeau et al. 2002). Notably, recombination and CpG motifs are strongly correlated in the human genome (Kong et al. 2002). This could be explained by BGC favoring the fixation of G:C alleles and thus generating new CpG sites. The correlation between divergence and recombination reported in humans and related species (Hellmann et al. 2003) could also be partly due to this effect. Further simulations would be helpful in understanding this dynamic process.
| Supplementary Material |
|---|
Supplementary Table 1 is available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
This work was supported by the Swedish Research Council and Science Foundation Ireland. We thank Arian Smit and Robert Hubley for help with RepeatMasker and Ken Wolfe, Gavin Conant, and Marie Sémon for critical reading of the manuscript.
| Footnotes |
|---|
1 Present address: Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin, Ireland.
Aoife McLysaght, Associate Editor
| References |
|---|
Alvarez-Valin, F., O. Clay, S. Cruveiller, and G. Bernardi. 2004. Inaccurate reconstruction of ancestral GC levels creates a "vanishing isochores" effect. Mol. Phylogenet. Evol. 31:788–793.
Antezana, M. A. 2005. Mammalian GC content is very close to mutational equilibrium. J. Mol. Evol. 61:834–836.
Arndt, P. F., C. B. Burge, and T. Hwa. 2003. DNA sequence evolution with neighbor-dependent mutation. J. Comput. Biol. 10:313–322.
Arndt, P. F., T. Hwa, and D. A. Petrov. 2005. Substantial regional variation in substitution rates in the human genome: importance of GC con







