MBE Advance Access originally published online on November 27, 2007
Molecular Biology and Evolution 2008 25(2):362-374; doi:10.1093/molbev/msm261
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Exponential Decay of GC Content Detected by Strand-Symmetric Substitution Rates Influences the Evolution of Isochore Structure
,
,1
,1

* Department of Computer Science and Systems Analysis, Miami University, Ohio
Department of Microbiology, Miami University, Ohio
Center for Comparative Genomics and Bioinformatics, Pennsylvania State University
Institute of Chemistry, Karl-Franzens University Graz, Graz, Austria
|| Institute for Theoretical Biology, Humboldt University, Berlin, Germany
E-mail: jkarro{at}acm.org.
| Abstract |
|---|
|
|
|---|
The distribution of guanine and cytosine nucleotides throughout a genome, or the GC content, is associated with numerous features in mammals; understanding the pattern and evolutionary history of GC content is crucial to our efforts to annotate the genome. The local GC content is decaying toward an equilibrium point, but the causes and rates of this decay, as well as the value of the equilibrium point, remain topics of debate. By comparing the results of 2 methods for estimating local substitution rates, we identify 620 Mb of the human genome in which the rates of the various types of nucleotide substitutions are the same on both strands. These strand-symmetric regions show an exponential decay of local GC content at a pace determined by local substitution rates. DNA segments subjected to higher rates experience disproportionately accelerated decay and are AT rich, whereas segments subjected to lower rates decay more slowly and are GC rich. Although we are unable to draw any conclusions about causal factors, the results support the hypothesis proposed by Khelifi A, Meunier J, Duret L, and Mouchiroud D (2006. GC content evolution of the human and mouse genomes: insights from the study of processed pseudogenes in regions of different recombination rates. J Mol Evol. 62:745–752.) that the isochore structure has been reshaped over time. If rate variation were a determining factor, then the current isochore structure of mammalian genomes could result from the local differences in substitution rates. We predict that under current conditions strand-symmetric portions of the human genome will stabilize at an average GC content of 30% (considerably less than the current 42%), thus confirming that the human genome has not yet reached equilibrium.
Key Words: GC content isochore decay neutral substitution rates strand symmetry genomic equilibrium
| Introduction |
|---|
|
|
|---|
The (local) "GC content," or the percentage of GC base pairs in a given genomic region, is highly variable across many mammalian genomes (Bernardi 2000
It is well established that a genome is subject to a varying "neutral substitution rate"—the rate at which neutral (functionless) DNA undergoes changes that become fixed in the population. This rate varies considerably, differing between organisms, between sexes, and across the genome (Lander et al. 2001
; Waterston et al. 2002
; Ellegren et al. 2003
; Hardison et al. 2003
; Gibbs et al. 2004
; von Grünberg et al. 2004
; Arndt et al. 2005
; Gaffney and Keightley 2005
; Lindblad-Toh et al. 2005
; Taylor et al. 2006
). Many of these studies have investigated the interdependence between variation in the (neutral) substitution rate and the variation in GC content, but simple explanations of causation have not emerged.
Various studies have also established an asymmetry in the rates at which different substitutions occur; GC base pairs become AT base pairs far more frequently than the reverse (Lander et al. 2001
; Arndt et al. 2005
). Using realistic estimates for these rates, the imbalance implies that the GC content at equilibrium will be considerably below the current GC content. Thus, one would predict from the rate asymmetry that GC content must be decreasing. It further seems likely that those areas subject to higher substitution rates experience faster GC-content decay. However, these predictions have not been easy to test as calculating time-resolved substitution rates is a difficult task (Hardison et al. 2003
; von Grünberg et al. 2004
; Arndt et al. 2005
; Gaffney and Keightley 2005
).
Further complicating matters, local substitution rates are not the only factor determining the change in GC content and its eventual equilibrium. In both Meunier and Duret (2004)
and Duret et al. (2006)
it was argued that recombination plays a role in the decay of GC content, that the modern human genome has not yet reached its equilibrium point, and that the equilibrium point of any given genomic region is determined by the interplay of the substitution and recombination rates. Though there is some disagreement (Alvarez-Valin et al. 2004
; Antezana 2005
), several studies have supported these positions (Lercher et al. 2002
; Webster et al. 2003
; Belle et al. 2004
; Duret 2006
; Khelifi et al. 2006
). However, the rate of GC-content decay has not been determined, nor has the relative importance of the substitution and recombination rates in influencing that decay. How fast is GC-content decaying in a given region? What is the importance of substitution, as opposed to recombination, in determining that rate?
The human genome currently has a genome-wide average GC content of approximately 43% (42% chimpanzee, 42% macaque, 41% dog). We have applied 2 independent methods for determining rates of substitutions across mammalian genomes to better understand the history and predict the future of these genomes. We limited our analysis to the 620 Mb of human DNA in which the rate for each type of substitution is the same on the complementary strand—regions referred to as strand symmetric (Green et al. 2003
). This allowed us to apply a simple but powerful model for determining substitution rates. With these rates, we predict and confirm that the local GC content decays exponentially over time toward a local equilibrium value GC* at a rate determined by the local rate of substitution. We predict that these strand-symmetric areas will stabilize their average GC content at 30% (±4%) for each of these genomes, supporting the notion that none of these genomes are currently at equilibrium. Furthermore, the equilibrium point of a region is determined by the local rates of substitution and recombination, as well as by the Biased Gene Conversion (BGC) mechanism (Nagylaki 1983
; Galtier et al. 2001
; Duret et al. 2002
), thus independently confirming the results of Meunier and Duret (2004)
. Our studies agree with the prediction that each genome is evolving toward a new isochore structure (as proposed by Khelifi et al. 2006
). They also suggest that the current isochore structure could result from the action of locally varying rates on an ancestral genome, regardless of that genome's GC-content pattern.
| Methods and Materials |
|---|
|
|
|---|
Whereas calculating the GC content of a sequenced genome is simple, estimating substitution rates and GC* are more difficult tasks. Substitution rates are estimated using 2 different methods. The first is the "LogDet" method (Barry and Hartigan 1987
Estimating neutral substitution rates is necessarily based on nucleotides known to be nonfunctional. There are 2 sources of such data frequently used: 4-fold degenerate coding sites and interspersed repeats (Hardison et al. 2003
). We selected the repeat data because of its quantity. Repeats make up approximately 45% of the genome (providing significantly more points of estimation than do the 4-fold degenerate coding sites) and, as a whole, are uniformly distributed (Lander et al. 2001
)—allowing us to average out the effect of location bias. It is generally believed that the substitution rates of repeat bases are reflective of the local neutral substitution rate, though this is not completely agreed upon (Hardison et al. 2003
; Arndt, Burge, and Hwa 2003
, Gaffney and Keightley 2005
).
Further, instead of fitting our model parameters to interspecies alignments, we follow the lead of Arndt in using the alignment of modern interspersed repeat sequences to their RepeatMasker-derived ancestral sequences (Smit A, Hubley R, Green P, unpublished data). This allows us to discard the normal Markov chain constraint of reversibility used as a basis for many calculations. Like the Arndt studies, we also assume strand-symmetric substitution: that both strands of the genome segment undergo any given type of substitution at the same rate, hence complementary substitution rates are equal (e.g., the rate of
substitutions is equal to that of
substitutions). Strand symmetry is not a universal condition; it is clearly disrupted by selective pressure and has also been shown to break down in transcribed regions and around replication origins (Green et al. 2003
, Touchon et al. 2005
). We will shortly describe a technique of identifying where this assumption is valid (or is a valid approximation) and can use this technique to identify regions where our predictions can be tested.
Notation
Like many studies, we will break the genome into nonoverlapping windows, or partitions (Hardison et al. 2003
; Gaffney and Keightley 2005
). The (instantaneous state transition) rate matrix defining a given Markov chain will be denoted by q, and we use q
to denote the rate matrix that has been estimated exclusively from the partition
(i.e., from the repeats falling within
),
will denote the mean substitution rate of
(averaged over the time span under investigation), and
will denote the partition's GC content at time t.
Any method based on a Markov chain imposes certain constraints on q. For example, Jukes and Cantor (1968)
require that all off-diagonal elements be equal, Hasegawa et al. (1985)
group the substitution types into 4 different parameter classes, and the fully reversible model allows for 6 parameters configured to enforce "full reversibility" (Ewens and Grant 2005
). The constraints of our own model are fully defined by the strand-symmetric property. As a result, we have a 6-parameter model—that is, 6 independent rates appearing in the rate matrix (see Appendix, eq. 7, for the matrix layout). These 6 rates are labeled q1, ..., q6 (defined in table 1). When referring to the components of the matrix q
, we will denote these rates by
. This strand-symmetric Markov chain was first proposed by Sueoka (1995)
and later studied by Lobry and Lobry (1999)
.
|
Estimating GC-Content Equilibrium (GC*)
The importance of limiting our analysis to a strand-symmetric q is as follows. We derive a value

from q
—specifically, 
is the sum of the rates of all substitutions between an AT base and a GC base in either direction (see eq. 9 in the Appendix). It is shown in Lobry and Lobry (1999)
is strand symmetric and time independent, we have
|
| (1) |
|
| (2) |
is subject to strand-symmetric substitution (and not affected by issues such as elevated CpG mutation rates and recombination, which we will deal with shortly), then equation (1) means that the GC content of
must be decaying exponentially with time, at a rate dictated by the product of 
and the local substitution rate
. Further, the local GC content is converging to the value
. Thus,
is the GC content in partition
that the genome will eventually reach at equilibrium (with notation GC* chosen to be consistent with that used in Meunier and Duret 2004
can be thought of as the "excess GC content" contained by partition
on its way to equilibrium.
One would like to test the prediction by picking 2 points in time and plugging in the relevant values into equation (1). However, we know the GC content for partition
at only one point in time: "now." In order to reflect this, we calibrate time such that t = 0 refers to the present, with t < 0 then representing points in the past, and rewrite equation (1) as
|
| (3) |
(denoting the current GC content) appearing on the left and
(denoting the GC content T time units in the past) occurring on the right. In the derivation of this equation, we are assuming both strand symmetry and time independence of the rate matrix—the model cannot be reliably used for analysis unless we can verify that the target region conforms to these assumptions. But in those regions where this is the case, it follows that the local GC content is decaying over time toward the value
at an exponential rate determined by
.
Equation (3) describes the interdependence of 3
-dependent (i.e., locally dependent) functions: the local GC equilibrium value
, the local GC content
, and the product of the local substitution rate
and 
. The fourth local function appearing in the equation,
, is not accessible and will produce statistical noise. So equation (3) describes a curve in 3-dimensional space spanned by
,
and
. To visualize this curve, we will have to consider a projection of it onto a plane that is spanned by just 2 of these 3 local quantities.
Models of Substitution
We estimated substitution rates using 2 independent methods: a Markov chain–based method and the LogDet method (Barry and Hartigan 1987
; Gu and Li 1996
; Baake and von Haeseler 1999
; Sumner and Jarvis 2006
). By comparing the estimations derived from each method, we are able to both identify strand-symmetric regions of the genome and verify that our results on these regions are independent of certain differing assumptions.
Our Markov chain–based method, which assumed strand symmetry, estimates the rates
in q
and the mean local substitution rate
from the repeat alignments falling within partition
. We use standard fitting techniques to estimate the state-transition probability matrix P
(t) (see eq. 6 in the Appendix). In other words, [P
(t)]ij denotes the probability of a base with content i at time –t0 becoming a base with content j at time –t0 + t. The problem is more complex than with the more common applications of Markov chain theories (e.g., as used in Hardison et al. 2003
or Gaffney and Keightley 2005
) because we are dealing with alignments corresponding to repeat families of different ages. Thus, we will denote a given repeat family as
, the age of that family as t
, and can then estimate P
(t
) for different families
that intersect partition
. Given a set of alignments corresponding to repeat family
, we estimate both
and the mean local substitution rate
from a maximum-likelihood fit to the matrices P
(t
). The final step in the application of our model is to define and calculate a new value
, the deviation of
's substitution rate from the genome mean substitution rate
. (Note that whenever we are interested in genome-wide values, we allow for just one partition comprising the whole genome; the index
is then dispensable and has been omitted.) Of course, when applying this locally, the procedure is dependent on choosing an appropriate resolution, as reflected in the partition size. We experimented with resolutions ranging from 160-kb windows to 1-Mb windows. The accumulated length of all repeats used for our analysis ranged from 20% to 30% of the genome, depending on the organism. Examples of the resulting local substitution rate variations are shown and discussed in the Appendix.
In contrast to the previously described approach, LogDet makes no assumptions about strand symmetry or places any constraints on the rate matrix. Let R
(t) be the rate matrix at time t and let
be the average of the R
(t) matrices over the appropriate time interval. The LogDet model is based directly on the matrix
, hence makes no assumptions about the changes in R
(t) or on the relationship between elements of
. Let µ
be the arithmetic mean rate based on
, which can be written as
. Barry and Hartigan (1987)
and Zharkikh (1994)
each showed that there is a direct connection between the product µ
t and the transitional probability matrix P(t), which when applied to our P
(t
) may be written as
|
| (4) |

is a "model-free" measure for the time distance t
in partition
(i.e., it avoids the constraining assumptions of other models, as discussed previously). We are able to compute LogDet times genome-wide, d
, and in every partition, d
, and then can combine both sets of numbers by computing the percentage deviation in partition
from the genome-wide mean,
d
=
(d
– d
)/d
(where 
stands for an average over all
's).
Materials
Genome builds used were downloaded from the University of California, Santa Cruz browser (Kent et al. 2002
): hg18 (human), cf2 (dog), pt2 (chimpanzee), and rm2 (macaque). Human repeat information was extracted by RepeatMasker v. 3.1.2 and RM database version 20051025. Chimpanzee repeat information was extracted by RepeatMasker v. 3.1.3, RM database version 20060120.
All software tools created for this project will be provided to interested parties on request.
Statistics
Linear regressions are characterized with Pearson's moment correlation coefficient (denoted by rp) at a P value of less than 10–8. Each correlation passed all standard diagnostic tests to ensure that rp is a legitimate characterization of the fit (results not shown).
| Results |
|---|
|
|
|---|
Identifying strand-symmetric regions
Our core prediction, equation (3), follows directly from our 2 assumptions: strand symmetry and the time independence of the rate matrix. To verify the prediction with genomic data, we must find regions on the genome that conform to both assumptions. We do this by comparing the local rates, as determined by our method, against the prediction of the previously discussed LogDet method—which relaxes both assumptions. Only in areas where the 2 predictive methods agree can we expect the assumptions to hold. It is worth noting that as both models estimate average rates over some time period, we are actually locating segments that experienced rates which "effectively" meet these assumptions over that period. Thus, we will identify both those segments that conformed to the assumptions uniformly and those that fail to do so at any instant but that experienced rate variations over time that produced the same effect.
As the LogDet estimations are independent of a specific model of substitution-rate relations, they are ideal for testing the assumptions of time constancy and strand symmetry of q made in the first model. We show in the supplementary material (eq. S18, S19, and S23–S25, Supplementary Material online) that if these assumptions and approximations have no effect on the result, we should find that 1) for a genome-wide fit, the estimated d
from equation (4) must be equal to
and that 2) for the partition-wide fits,
d
must be equal to
obtained with the strand-symmetric model. Figure 1 shows both correlation diagrams for the human genome d
versus
for 812 different repeat families and
d
versus
for the 21,000 partitions covering the genome (160-kb window). For the comparison d
versus
in figure 1a, we obtain a (Pearson's) correlation coefficient as high as rp = 0.997 (human), with 50% of all families showing a difference smaller than 2% to the LogDet times. Equally good results were obtained for the other genomes; rp = 0.997 (chimpanzee), rp = 0.996 (dog), and rp = 0.994 (macaque). This strong agreement between the predictions shows that our genome-wide assumptions, including strand symmetry, are valid. Apparently, the effect of regions where these assumptions do not hold averages out in such a genome-wide analysis.
|
When performing the same investigation on a local scale, a different picture emerges. For
d
versus
in figure 1b, we obtain a correlation coefficient of rp = 0.92. Here, 50% of all partitions show a difference between
d
and
that is larger than 64% of
in that partition. By means of this plot, we can now clearly identify those partitions where our model assumptions are justified. We have selected those regions where the difference between
and
d
is smaller than 5%. For the human genome, we discarded almost 75% of all 160-kb windows, leaving approximately 5,000 partitions representing 620 Mb of the entire human genome. Similarly, applying the same criterion to the corresponding figures for the other genomes, we obtained 560 Mb for the dog genome, 730 Mb for the macaque genome, and 680 Mb for the chimpanzee genome (see table 2). In the following, we will limit our analysis to only these partitions—where both methods produce approximately the same values (within 5%). The correlations hardly differ whether we show
or LogDet rates
d
, hence the results are independent of the method used to compute the neutral substitution rate variation.
|
A full list of identified strand-symmetric regions is included in the supplementary materials (Supplementary Material online).
Genome-Wide, Strand-Symmetric GC Equilibrium Value Is at about 30% for Various Mammalian Genomes
We next look at the results of fitting q both globally and locally. In table 1, we show the results of fitting the elements of q on a global scale, giving us a picture of the general trend of substitution rates similar to that in Lander et al. (2001)
. We observe that the transversion rate away from GC base pairs, q4, is higher than any other transversion rate and that the transition rate away from GC base pairs, q6, is twice the reverse rate q5 (
and
transitions) and 5 times a typical transversion rate. Also listed are results obtained after masking CpG sites on the consensus sequences of each repeat family, leading to a small shift of the rate q6.
When we fit P
(t
) in a specific partition
, we obtain
and the rates
that allow us to estimate the GC equilibrium values GC*
(for those partitions conforming to our assumptions). Averaging over these partitions, we obtain an estimate for the genome-wide GC* that is representative for 620 Mb of the human genome (560–730 Mb for the other genomes). This GC* value is compared with the current GC content on today's genome in table 2. We observe that for the investigated species, our model predicts that the equilibrium point GC* is about 30%, whereas the mean GC content on today's genome is at about 42%, implying that these genomes are far away from the stationary state. We stress that locally GC
* (estimated with a variance of 0.04) shows considerable deviations from the genome-wide value GC*.
Exponential GC-Content Decay Verified
We are now in the position to determine the accuracy of prediction (3).
, the modern GC content, is easily calculable.
, 
, and
can be calculated by the application of our model, with which we can determine the argument of the exponential function in (3) relative to the time distance
:
. This allows us to plot
on the left-hand side of equation (3) versus the argument of the exponential function on the right-hand side of this equation, as we do in figure 2 for both human and dog. As we have already observed, equation (3) describes a curve in a space spanned by 3 local functions:
,
and
. Thus, figure 2 can be considered to be a projection of this 3-dimensional data cloud onto a plane spanned alone by
and
(i.e., onto a plane where
in equation (3) is fixed to some value). This should then result in an exponential relationship between
and
. Indeed, we see this to be the case in figure 2. We also note an apparent convergence value of GC content to the predicted GC* values from table 2, which suggests that we should examine excess GC content (i.e., the GC content relative to GC*). Taking the logarithm of the excess GC content, we expect to find that
has a linear relationship to
(see eq. 3). We check this in figure 3 for 4 mammalian genomes. A very strong correlation is found, with coefficients as high as rp = –0.85 for human, chimpanzee, and macaque and rp = –0.82 for dog (table 2). We also show the results of a least squares fit, providing us with estimates for the parameters b and
in the fitting formula
, where
. All values are given in table 2. Figure 3e is an alternative to figure 3a in which we do not use repeats (the basis for our calculation of distance) in the calculation of GC content to ensure that the calculations on each axis are independent. The minimal change between the 2 figures shows this is not a problem.
|
|
In contrast to the Arndt model of substitution rates (Arndt, Burge, and Hwa 2003
Including Recombination Rates
In table 1 we see that q1 + q5 gives the substitution rate of AT base pairs to GC base pairs, whereas q4 + q6 gives the substitution rate of GC base pairs to AT base pairs. Our method provides us with estimates for these rates in every partition
. BGC, on the other hand, is known to have the effect of increasing the
substitution rate by an amount proportional to the local recombination rate (Meunier and Duret 2004
; Duret 2006
; Galtier and Duret 2007
), which we will denote by 
. Having this in mind, we may expect to find that the rates
are locally increased by β
(where β is some proportionality constant), whereas the
rates,
, should be locally reduced by the same amount. Thus, we expect that
and
with 2 constants c1 and c2 that are not material to our argument. From these relations, it follows that the difference
should be equal to c1 – c2 + 2β
. Figure 4 correlates
against 
(using sex-averaged deCODE recombination rates from Kong et al. 2002
, resolved here with 1 Mb windows). As a result, we see a small, but statistically significant, correlation with rp = 0.33 (contained by the 95% confidence interval [0.29, 0.38]). We have also correlated the recombination rate with
which is just
in equation (9) and found no significant correlation.
|
Meunier and Duret (2004)
and the recombination rate. As observed by these authors, there are a number of reasons why we cannot expect this correlation to be large. Recombination appears to vary on a scale much smaller than 1 Mb and thus its rate averages out over scales greater than 1 Mbb; the 2 variables correlated here reflect processes operating on different time scales. Although recombination rates may change rapidly, the
can be traced back to rates that are actually time averages over long evolutionary periods. Additionally, in our study, the genomic segments under analysis were picked based on strand symmetry and time constancy. We have no reason to believe that characteristic is correlated with recombination; thus, our plot likely superimposes partitions supporting a strong correlation to others supporting no correlations at all.
Asymmetric Regions of the Genome
Our approach allows us to easily distinguish between parts of the genome that follow a strand-symmetric, time-independent substitution rate model and those that do not. In figure 5, we graph GC content (on the x axis) against our substitution rate estimator (on the y axis) for both our selected 620 Mb of the human genome (black points, lighter fit) and the complementary set (dark dots, dark fit). It is evident that by selecting strand-symmetric areas we have been able to discard data that grossly deviate from the exponential decay curve, but it also becomes obvious that we have discarded a bulk of data that were compatible with our picture (the majority of red data points are covered by the black data points).
|
| Discussion |
|---|
|
|
|---|
In our investigation, we have concentrated on regions of the genome that we know to be subjected to strand-symmetric substitution rates. Selective pressure will frequently produce substitution asymmetries (Frank and Lobry 1999
) of the exponential decay, then we can conclude that local GC content is decaying over the genome, with genomic areas subjected to higher decay rates (i.e. experiencing a faster decay) having a disproportionately lower GC content than those subjected to lower decay rates. This implies that, regardless of the initial distribution of local GC content, even in the extreme case of no initial isochore structure, the exponential relationship of GC content to local substitution rate will inevitably lead to the establishment of some sort of isochore structure simply because of the regional differences in decay rates and the equilibrium points—a finding similar to that of Khelifi et al. (2006)
, such as those extracted from any Markov chain–based model (Ewens and Grant 2005
at time –T. In fact,
is presumably different in every partition, inducing a vertical shift for every data point in figures 2 and 3. Surprisingly, these shifts are small enough that they do not obscure the exponential decay. Figure 2 allows us to roughly estimate an upper limit of
by the vertical distance of each data point from the fitted curve.
Comparisons with Previous Studies
The shape of our correlation is different than that found in previous studies, but it is not contradictory; our analysis just presents a clearer picture as we are able to strip away a source of noise inherent in the other studies. Consider the works by Hardison et al. (2003)
, Hellmann et al. (2003
, 2005)
, and Belle et al. (2004)
. In the study by Hardison et al., we see a quadratic relation between GC content and substitution rate, where local substitution rate is calculated using the multispecies, repeat-based tAR statistic. The latter study by Hellmann et al. finds a similar result (measuring substitution rates with human–chimpanzee divergence) but reduces this to a linearly decreasing correlation when introducing CpG content as a second variable in their regression model. In figure 5, we explain both these results with our model. We see that when using the regions that are not strand symmetric (or not identifiable as such), we replicate the shape of the Hardison curve. Concentrating on the strand-symmetric regions, we find the negative correlation predicted by Hellmann—but are able to better resolve the shape of the curve. In the Belle paper, the authors point out that an exponential decay would make sense (an observation also made by Gu and Li 2006
), but they find the explanation incompatible with their data (calculated using the method of Galtier and Gouy 1998
). However, their results reflect factors not relevant to our analysis as they base their rate estimations on the alignment of coding sequences, which are not subject to strand-symmetric substitution and are shaped by constraint-related forces.
The question of whether the genome has reached its GC-content equilibrium has been addressed in a number of studies. The works of Lercher et al. (2002)
and Webster et al. (2003)
both support our assertion that the genome is in a state of decay through the analysis of SNP evidence, and the Webster et al. (2005)
results are also consistent with this hypothesis. Both Alvarez-Valin et al. (2004)
and Antezana (2005)
have taken the opposite position, though the methodology of the latter is shown to be faulty in Duret (2006)
.
A line of studies including Duret et al. (2002)
, Meunier and Duret (2004)
, Khelifi et al. (2006)
, and Duret (2006)
have made significant contributions to the understanding of GC-content decay, the projected equilibrium point, and the underlying causes of decay; it is important to consider how our results fit into the picture resulting from these works. Together, these studies have looked at the relationship between local current GC content and
, as well as the effect of recombination rates on these variables (presumably through the mechanism of BGC). They find pairwise correlations between recombination rate,
, and the GC content of
, and then argue a causal relationship starting from recombination and with
as a mediator.
Our analysis is consistent with their work, though it does not lead to any conclusions about causality. Consider the formula equation (2) for the equilibrium GC content. This expression for GC* results directly from balancing net rates q15AT* = q46GC*, where q15 = q1 + q5 is the
rate, q46 = q4 + q6 is the
rate, and AT* = 1 – GC* is the equilibrium AT content. If q46 were equal to q15, then GC* would be 1/2. As q46 increases, the balance shifts and GC* decrease. We find that in the genome-wide average, the GC
AT rate is roughly twice the reverse rate, implying that genome-wide GC* is roughly 1/3. Of course, this argument applies not only for the whole genome but also for every partition
: a ratio
to
larger than one implies a local
falling below one-half.
Following Duret et al. (2002)
and Meunier and Duret (2004)
, we incorporate the BGC mechanism as follows. Consider a time and a location
where the recombination rate has reached a high enough level to make a difference. We expect a GC-biased fixation process and thus an increase of the local AT
GC rate by an amount proportional to the local recombination rate, implying that the reverse rate GC
AT decreases by the same amount. The sum of these 2 rates,
, is not affected, but their difference is reduced, leading to an upward shift of the local
back toward half. Thus, by changing the local rates, the BGC process will not alter the exponential decay behavior (which depends on 
and
) but simply shifts the local
upward to an extent that is directly proportional to the recombination rate. This effect is then responsible for correlation between
and the recombination rate found by Meunier and Duret (2004)
.
By the definition of an "equilibrium point," the GC content must be headed toward
. Over the 620 Mb of the human genome we have investigated, we have not found any partition where the GC content is currently less then
, and in the 14 Mb investigated by Meuiner and Duret, they found only 2 Mb where this was the case. So, in practice, the local GC content in
is decreasing toward the lesser
value. Although a high recombination rate is able to, through BGC, raise the local AT
GC rate over some period of time, this change can stop the decay of local GC only if it is strong enough to shift
above the local GC-content level—an effect we do not see.
We recall that equation (3) describes the interdependence of 3 local functions:
,
(local GC content), and the value
. When projecting the 3-dimensional plot onto a plane by setting any one of the variables to a constant, we see the association between the remaining 2 variables found by Duret. To this point, we have only studied the projection derived from setting
to a constant, resulting in our exponential curves. The other projections are of interest in their own right but are quite complicated and beyond the scope of this study.
Other Points of Consideration
We need to address our assumption of independent substitution rates in neighboring sites—specifically that of elevated CpG substitution rates, which are known to be significantly higher than those of other substitution rates (Arndt and Hwa 2004
; Arndt et al. 2005
). To investigate this effect, we have followed the common practice of masking out CpG cites (Meunier and Duret 2004
; Gaffney and Keightley 2005
; Gu and Li 2006
; Taylor et al. 2006
) and found no difference in our results, as illustrated by a comparison of figure 3a against that of figure 3f and in table 1—a result consistent with that of Gu and Li (2006)
. The idea that these rates would have no effect on our calculation seems unlikely in light of results such as those of the Arndt studies (Arndt, Burge, and Hwa 2003
; Arndt, Petrov, and Hwa 2003
; Arndt et al. 2005
), so the more plausible conclusion is that the CpG effect is not compatible with one of the characteristics used to pick our partition set. In other words, if a partition conforms to our assumptions, then the CpG effect is minimal within that partition.
Our analysis has been extended to 3 other genomes: chimpanzee, macaque, and dog. We have obtained almost identical results for all, with differences only in
(the slope of the red curve in fig. 3) for dog compared with the other 3 species. Notably missing from this analysis are the murids. Because mouse and rat are subjected to a much higher substitution rate (Waterston et al. 2002
; Gibbs et al. 2004
), RepeatMasker has considerably less power to recognize older repeats. For our purposes, the amount of sufficiently older repeat data is less than is needed to get a clear picture of the decay in those genomes.
We also investigated the possibility that SNPs occurring within repeats could introduce error into our calculations. An SNP occurring in a repeat is reflective of mutation rate but not necessarily reflective of substitution rate; using SNPs to calculate substitution rates could bias the results. To check for this, we masked out all SNP locations, using information from dbSNP build 125 downloaded from the University of Santa Cruise Genome Browser (Sherry et al. 2001
; Kent et al. 2002
), and found no significant changes (data not shown).
Finally, we addressed whether our calculation of repeat substitution rates could be biased by the faulty reconstruction of ancestors by RepeatMasker. To check this, we repeated the experiment based on different subsets of repeat families, both manually and randomly chosen. Each analysis resulted in the same conclusions. (See supplementary materials for details, Supplementary Material online.)
Our results show a clear, strong exponential correlation between the substitution rate pattern and GC content in strand-symmetric areas of the genome, which confirms a prediction that is a direct consequence of a strand-symmetric rate model. Further investigations of the relationship, with an aim toward determining causality, are certainly worthwhile. A study of our identified regions of strand symmetry (or, more appropriately, the complementary set of regions) may also prove valuable. And although we cannot label any factor as being causal, it is potentially revealing to consider equation (3) under a model in which the substitution rate drives GC-content decay. The location dependence of the substitution rate pattern then allows us to predict the evolution of the isochore structure: it decays faster toward
in regions of high substitution rate and lower in those of low rate. This also implies that if the genome started from a uniform GC content (e.g., with no isochore structure), then the substitution rate pattern would inevitably lead to a local variation of the GC content. However, this model is only speculation; we could also envision a model in which GC content filled the role of determining the local substitution rate.
| Supplementary Material |
|---|
|
|
|---|
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Appendix |
|---|
|
|
|---|
In the following, we give a brief description of our model, the approximations used in its application, and the tests performed to check the validity of those approximations. A detailed report is given in the supplementary materials (Supplementary Material online).
The Rate Model
Our underlying model is a nonhomogeneous Markov chain, similar to that used in other standard approaches. The 4-dimensional time-dependent state vector (pA(t), pC(t), pG(t), pT(t)), defining the probability of being in each state at time t, evolves according to a substitution rate matrix R(t) having components Rij(t). Recall that every possible R(t) must satisfy
, which allows us to define a mean substitution rate m by averaging over all components of this rate matrix:
. To make this m explicit in the following expressions, we replace Rij by mqij and scale the matrix q such that
.
The transition probability matrix propagating a state vector at time –t0 to time –t0 + t reads in its most general form
|
| (5) |
. In our application, we cannot consider the most general case and instead approximate
by the product of
by the product of
by the product of
, with the constants qij obtained from a maximum-likelihood fit, as explained further below. Underlying this approximation is the assumption that though rates may be time dependent, the ratio of any 2 rates within the rate matrix does not change with time. This approximation is necessary in order to be able to perform fits; with the comparison in figure 1 to LogDet times (which do not depend on this approximation), we have ensured that we consider only those partitions where this approximation has a negligible effect (for a discussion of this approximation, see section A2 in supplementary material, Supplementary Material online).
To estimate rates on a local scale, we break the genome into Z partitions, representing each partition by an index
. The index is dropped whenever Z = 1, that is, when we want to indicate that a quantity is based on a genome-wide analysis. Using this partitioning notation and applying our approximation, equation (5) can be written as
|
| (6) |
(the rate matrix estimated from partition
), we use a strand-symmetric rate model (Sueoka 1995
![]() | (7) |
defined in table 1. Every strand-symmetric matrix can be transformed into a block-diagonal matrix consisting of two 2 x 2 blocks, q+ and q– (Lobry and Lobry 1999|
| (8) |
representing the A + T and G + C content at time t. Note that q
+ depends just on
and
which correspond to a A,T
G,C transition and a G,C
A,T transition, respectively (see table 1). The set of differential equations associated with this reduced 2 x 2 system,
can be integrated (see Lobry and Lobry 1999|
| (9) |
Calculation of Local Substitution Rates
All estimates of rates are based on the transitional probability matrix P(t) on the left-hand side of equation (5) which we fit to RepeatMasker-generated repeat data. Low-complexity repeats were removed. We were then left with M repeat families, where M = 812 (human), M = 434 (dog), M = 825 (chimpanzee), and M = 779 (macaque).
The RepeatMasker data provide the reconstructed ancestor of each repeat family
(
= 1, ..., M) as well as a pairwise alignment between each modern instance and the family's reconstructed ancestor, which was inserted into the genome at time –t
. Alignment positions involving gaps are discarded. For each repeat family
and each partition
, we determine from the alignments 1) a 4 x 4 matrix k
counting the number of each substitution type, 2) a 4 x 1 vector
such that
which is the number of nucleotides that were in state i at time –t
. It then follows that
is the maximum-likelihood fit of P based on the alignments.
These P
(t
) matrices represent the input which we want to use to estimate local substitution rates. We have 2 alternatives: the method of LogDet time distance, equation (4), and a method based on the matrix in equation (7) and the approximation in equation (6) (von Grünberg et al. 2004
). The LogDet time connects the transitional probability matrix P(t) for a transition over time t on the left-hand side of equation (5) with the product d of the time t and the arithmetic mean µ rate of
on the right-hand side of equation (5). Recalling that
with
i being the eigenvalues of
, the LogDet formula in equation (4) for d = µt results directly from the following consideration:
|
| (10) |
(t
), the LogDet formula equation (4) provides us with time distances d
for partition
and family
, and, when performing a genome-wide analysis, with d
as the time distance from the moment when family
was inserted into the genome. Averaging (d
– d
)/d
over all repeat families, we obtain
d
, the percentage deviation of the local substitution rate in partition
relative to the mean.
Whereas the LogDet time produces model-free time distances, the second method is based on the rate model in equation (7) and the approximation we made in going from equation (5) to equation (6) that
can be split into
and a constant matrix q. Based on equation (6), we then 1) set Z = 1 and perform a genome-wide maximum-likelihood fit of the M numbers
and the 6 rates
in q on the right-hand side of equation (6) to the matrices P(t
) on the left-hand side of equation (6) obtained from the repeat data as described above; 2) partition the genome and calculate a maximum-likelihood fit of
and the rates
in the expression
on the right-hand side of equation (6) to our data in P
(t
), taking the values of
from the genome-wide fit in the previous step. Partitions were sized to include 40 kb of repeat bases per partition (Z = 21,000), resulting in a typical window size of 160-kb window at highest resolution. Our rates were then filtered for noise using a technique discussed in Peifer et al. (2003)
and averaged over all M repeat families. We report relative rates:
|
| (11) |
d
as done in figure 1b. An alternative way of testing our computations is to compare it with the results of more established methods. For a few repeat families, we compare our age estimates
against those computed by Khan et al. (2006)
|
|
| Acknowledgements |
|---|
|
|
|---|
The authors would like to thank Svitlana Tyekucheva, Kateryna Makova, Adam Eyre-Walker, and Webb Miller for their helpful comments; Richard C. Burhans and Nathan Coraor for their technical support; and Laura Tabacca for editing. John Karro worked under the support of National Institutes of Health (NIH) grant 5K01HG003315. Martin Peifer received financial support from the Austrian Science Foundation (FWF) under project title P18762. Ross Hardison was supported by NIH grant 5R01DK065806. This project is funded, in part, under a grant with the Pennsylvania Department of Health using tobacco settlement funds. The department specifically disclaims responsibility for any analyses, interpretations, or conclusions.
| Footnotes |
|---|
1 These authors contributed equally to this work.
| References |
|---|
|
|
|---|
Alvarez-Valin F, Clay O, Cruveiller S, Bernardi G. Inaccurate reconstruction of ancestral GC levels creates a "vanishing isochores" effect. Mol Phylogenet Evol (2004) 31:788–793.[CrossRef][Web of Science][Medline]
Antezana MA. Mammalian GC content is very close to mutational equilibrium. J Mol Evol (2005) 61:834–836.[CrossRef][Web of Science][Medline]
Arndt PF, Burge CB, Hwa T. DNA sequence evolution with neighbor-dependent mutation. J Comput Biol (2003a) 10:313–322.[CrossRef][Web of Science][Medline]
Arndt PF, Hwa T. Regional and time-resolved mutation patterns of the human genome. Bioinformatics (2004) 20:1482–1485.
Arndt PF, Hwa T, Petrov DA. Substantial regional variation in substitution rates in the human genome: importance of GC content, gene density, and telomere-specific effects. J Mol Evol (2005) 60:748–763.[CrossRef][Web of Science][Medline]
Arndt PF, Petrov DA, Hwa T. Distinct changes of genomic biases in nucleotide substitution at the time of Mammalian radiation. Mol Biol Evol (2003b) 20:1887–1896.
Baake E, von Haeseler A. Distance measures in terms of substitution processes. Theor Popul Biol (1999) 55:166–175.[CrossRef][Web of Science][Medline]
Barry D, Hartigan J. Asynchronous distance between homologous DNA sequences. Biometrics (1987) 43:261–276.[CrossRef][Web of Science][Medline]
Belle EM, Duret L, Galtier N, Eyre-Walker A. The decline of isochores in mammals: an assessment of the GC content variation along the mammalian phylogeny. J Mol Evol (2004) 58:653–660.[CrossRef][Web of Science][Medline]
Bernardi G. Isochores and the evolutionary genomics of vertebrates. Gene (2000) 241:3–17.[CrossRef][Web of Science][Medline]
Duret L. The GC content of primates and rodents genomes is not at equilibrium: a reply to Antezana. J Mol Evol (2006) 62:803–806.[CrossRef][Web of Science][Medline]
Duret L, Eyre-Walker A, Galtier N. A new perspective on isochore evolution. Gene (2006) 385:71–74.[CrossRef][Web of Science][Medline]
Duret L, Mouchiroud D, Gautier C. Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores. J Mol Evol (1995) 40:308–317.[CrossRef][Web of Science][Medline]
Duret L, Semon M, Piganeau G, Mouchiroud D, Galtier N. Vanishing GC-rich isochores in mammalian genomes. Genetics (2002) 162:1837–1847.
Ellegren H, Smith NG, Webster MT. Mutation rate variation in the mammalian genome. Curr Opin Genet Dev (2003) 13:562–568.[CrossRef][Web of Science][Medline]
Ewens WJ, Grant GR. Statistical methods in bioinformatics: an introduction. (2005) New York: Springer Science.
Frank AC, Lobry JR. Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms. Gene (1999) 238:65–77.[CrossRef][Web of Science][Medline]
Gaffney DJ, Keightley PD. The scale of mutational variation in the murid genome. Genome Res (2005) 15:1086–94.
Galtier N, Duret L. Adaptation or biased gene conversion? Extending the null hypothesis of molecular evolution. Trends Genet (2007) 23:273–277.[CrossRef][Web of Science][Medline]
Galtier N, Gouy M. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol Biol Evol (1998) 15:871–879.[Abstract]
Galtier N, Piganeau G, Mouchiroud D, Duret L. GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics (2001) 159:907–911.
Gibbs R, Weinstock G, Metzker M, Muzney D, Sondergren E, et al. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature (2004) 428:493.[CrossRef][Medline]
Green P, Ewing B, Miller W, Thomas PJ, Green ED. Transcription-associated mutational asymmetry in mammalian evolution. Nat Genet (2003) 33:514–517.[CrossRef][Web of Science][Medline]
Gu J, Li WH. Are GC-rich isochores vanishing in mammals? Gene (2006) 385:50–56.[CrossRef][Web of Science][Medline]
Gu X, Li W. Bias-corrected paralinear and logdet distances and tests of molecular clocks and phylogenies under nonstationary nucleotide frequencies. Mol Biol Evol (1996) 13:1375–1383.[Abstract]
Hardison RC, Roskin KM, Yang S, et al. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res (2003) 13:13–26.
Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol (1985) 22:160–174.[CrossRef][Web of Science][Medline]
Hellmann I, Ebersberger I, Ptak SE, Paabo S, Przeworski M. A neutral explanation for the correlation of diversity with recombination rates in humans. Am J Hum Genet (2003) 72:1527–1535.[CrossRef][Web of Science][Medline]
Hellmann I, Prufer K, Ji H, Zody MC, Paabo S, Ptak SE. Why do human diversity levels vary at a megabase scale? Genome Res (2005) 15:1222–1231.
Jukes TH, Cantor CR. Evolution of protein molecules. In: Mammalian Protein Metabolism—Munro HN, ed. (1969) New York: Academic Press. 21–132.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res (2002) 12:996–1006.
Khan H, Smit A, Boissinot S. Molecular evolution and tempo of amplification of human LINE-1 retrotransposons since the origin of primates. Genome Res (2006) 16:78–87.
Khelifi A, Meunier J, Duret L, Mouchiroud D. GC content evolution of the human and mouse genomes: insights from the study of processed pseudogenes in regions of different recombination rates. J Mol Evol (2006) 62:745–752.[CrossRef][Web of Science][Medline]
Kong A, Gudbjartsson DF, Sainz J, et al. A high-resolution recombination map of the human genome. Nat Genet (2002) 31:241–247.[CrossRef][Web of Science][Medline]
Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature (2001) 409:860–921.[CrossRef][Medline]
Lercher MJ, Smith NG, Eyre-Walker A, Hurst LD. The evolution of isochores: evidence from SNP frequency distributions. Genetics (2002) 162:1805–1810.
Lindblad-Toh K, Wade CM, Mikkelsen TS, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature (2005) 438:803.[CrossRef][Medline]
Lobry JR, Lobry C. Evolution of DNA base composition under no-strand-bias conditions when the substitution rates are not constant. Mol Biol Evol (1999) 16:719–723.[Abstract]
Meunier J, Duret L. Recombination drives the evolution of GC-content in the human genome. Mol Biol Evol (2004) 21:984–990.
Mouchiroud D, D'Onofrio G, Aissani B, Macaya G, Gautier C, Bernardi G. The distribution of genes in the human genome. Gene (1991) 100:181–187.[CrossRef][Web of Science][Medline]
Nagylaki T. Evolution of a finite population under gene conversion. Proc Natl Acad Sci USA (1983) 80:6278–6281.
Peifer M, Timmer J, Voss H. Non-paraetric identification of non-linear oscillating systems. J Sound Vibration (2003) 267:1157–1167.[CrossRef]
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res (2001) 29:308–311.
Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev (1999) 9:657–663.[CrossRef][Web of Science][Medline]
Sueoka N. Intrastrand parity rules of DNA base composition and usage biases of synonymous codons. J Mol Evol (1995) 40:318–325.[CrossRef][Web of Science][Medline]
Sumner J, Jarvis P. Using the tangle: a consistent construction of phylogenetic distance matrices for quartets. Math Biosciences (2006) 204:49–67.[CrossRef]
Taylor J, Tyekucheva S, Zody M, Chiaromonte F, Makova KD. Strong and weak male mutation bias at different sites in the primate genomes: insights from the human-chimpanzee comparison. Mol Biol Evol (2006) 23:565–573.
Touchon M, Nicolay S, Audit B, Brodie of Brodie EB, d'Aubenton Carafa Y, Arneodo A, Thermes C. Replication-associated strand asymmetries in mammalian genomes: toward detection of replication origins. Proc Natl Acad Sci USA (2005) 102:9836–9841.
von Grünberg HH, Peifer M, Timmer J, Kollmann M. Variations in substitution rate in human and mouse genomes. Phys Rev Lett (2004) 93:208102.[CrossRef][Medline]
Waterston RH, Lindblad-Toh K, Birney E, et al. Initial sequencing and comparative analysis of the mouse genome. Nature (2002) 420:520–562.[CrossRef][Medline]
Webster MT, Smith NG, Ellegren H. Compositional evolution of noncoding DNA in the human and chimpanzee genomes. Mol Biol Evol (2003) 20:278–286.
Webster MT, Smith NG, Hultin-Rosenberg L, Arndt PF, Ellegren H. Male-driven biased gene conversion governs the evolution of base composition in human alu repeats. Mol Biol Evol (2005) 22:1468–1474.
Zharkikh A. Estimation of evolutionary distances between nucleotide sequences. J Mol Evol (1994) 39:315–329.[CrossRef][Web of Science][Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
C. F. Mugal, H.-H. von Grunberg, and M. Peifer Transcription-Induced Mutational Strand Bias and Its Effect on Substitution Rates in Human Genes Mol. Biol. Evol., January 1, 2009; 26(1): 131 - 142. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Peifer, J. E. Karro, and H. H. von Grunberg Is there an acceleration of the CpG transition rate during the mammalian radiation? Bioinformatics, October 1, 2008; 24(19): 2157 - 2164. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


with the fitting parameters b and 




of repeat type
.

