MBE Advance Access originally published online on April 6, 2006
Molecular Biology and Evolution 2006 23(6):1318-1323; doi:10.1093/molbev/msk017
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Article |
CpG + CpNpG Analysis of Protein-Coding Sequences from Tomato




* Bioinformatics Research Center, North Carolina State University;
Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark; and
Department of Plant Biology, Cornell University
E-mail: asger{at}daimi.au.dk.
| Abstract |
|---|
|
|
|---|
We develop codon-based models for simultaneously inferring the mutational effects of CpG and CpNpG methylation in coding regions. In a data set of 369 tomato genes, we show that there is very little effect of CpNpG methylation but a strong effect of CpG methylation affecting almost all genes. We further show that the CpNpG and CpG effects are largely uncorrelated. Our results suggest different roles of CpG and CpNpG methylation, with CpNpG methylation possibly playing a specialized role in defense against transposons and RNA viruses.
Key Words: codon model context dependency CpG + CpNpG methylation
| Introduction |
|---|
|
|
|---|
In plants, methylation of cytosine in CpG and CpNpG sites occur with much higher frequency than in other sequence patterns (Bender 2003
In this paper, we develop statistical models for analysis of patterns of CpG and CpNpG deficiency in coding regions. Two factors make the development of such models more difficult than most standard models. Firstly, the pattern of codons usage and selection acting on nonsynonymous mutation must be taken into account while accounting for CpG and CpNpG hypermutation. Secondly, the assumption made in classical sequence evolution of independence among sites is violated by the CpG and CpNpG effects. We develop a new statistical method for analyzing CpG and CpNpG hypermutation from a single sequence based on context-dependent codon models. Formulating the model on the codon level allows us to take codon bias into account.
The proposed codon model is an extension of the model introduced by Jensen and Pedersen (2000)
. The Jensen and Pedersen (2000)
model significantly improved the description of HIV sequence evolution by extending the Goldman and Yang (1994)
codon model to take CpG hypermutation into account. Siepel and Haussler (2004)
and Huttley (2004)
also incorporated methylation into codon models and found that methylation has a significant effect on the evolution of protein-coding genes from mammals. Huttley (2004)
not only considered CpG hypermutation but also included the dinucleotides CpA and CpT in the analysis. Sved and Bird (1990)
and Lunter and Hein (2004)
developed CpG mutation models on the nucleotide level and analyzed pseudogene evolution in human and noncoding evolution between human and mouse, respectively.
We develop a context-dependent codon model to investigate the relative effect of CpG and CpNpG mutation and the degree to which mutation rates in these two types of sequence motifs are correlated. Using this new method, we investigate the CpG and CpNpG methylation effects of 369 tomato genes. We show that there is a very strong deficiency of CpG sites. Thus, there are significantly fewer CpG sites in the genes than predicted from the codon usage. While there is a strong CpG deficiency, there seems to be little or no evidence for CpNpG hypermutation in the investigated genes. Additionally, we show that the effect of CpG and the effect of CpNpG hypermutation seem to be largely uncorrelated.
| Materials and Methods |
|---|
|
|
|---|
Context-Dependent Nucleotide Model
In order to understand the context-dependent "codon" model, it is useful first to consider a context-dependent "nucleotide" model. The evolution of a DNA sequence is often described as a stationary homogeneous reversible continuous time Markov process. In the site-independent model, the general time-reversible model has rate matrix (e.g., Yap and Speed 2004
![]() |
), where
![]() |
) is the diagonal matrix with
= (
A,
G,
C,
T) on the diagonal. We observe that the detailed balance condition diag(
)Q = Q*diag(
) is fulfilled, where * denotes vector transpose, and thus
is the stationary distribution.
Now suppose the flanking nucleotides of the evolving site are fixed during evolution and consider the case where the left nucleotide is an A and the right nucleotide is a C. In order to model the higher substitution rate away from CpGs, we divide each off-diagonal term in the row corresponding to substitution rates from C with a parameter
< 1 so that the rate matrix becomes
![]() |
![]() |
A,
G, 
C,
T), which means that we observe fewer C's than in the site-independent model.
In the flanking situation where the left nucleotide is a C and the right nucleotide is a G, we also need to divide each off-diagonal term in the row corresponding to G by
. We therefore obtain the following rate matrix and corresponding symmetric and diagonal matrices
![]() |
A, 
G, 
C,
T). In this case, we observe fewer C's and G's than in the site-independent model.
We now move from the fixed flanking situation to the general case. A change in the nucleotide sequence x = (x1, ..., xn) consists of a change of one nucleotide only, and the rate matrix is no longer a 4 x 4 matrix but is a 4n x 4n matrix. Consider the rate from sequence x to sequence
where x and
are the same except at position k. The new nucleotide is denoted as
Following the ideas from above, the rate from x to
is determined by two main components.
Firstly, there is the 4 x 4 substitution rate matrix Q, where the rates do not depend on the neighboring codons. This component corresponds to the model one would use had there been no interaction among nucleotides. We assume that the site-independent part of the model is reversible with stationary distribution
, such that detailed balance diag(
)Q = Q*diag(
) is fulfilled.
Secondly, there is the CpG component, determined by the parameter
, that introduces dependence among nucleotides. If
< 1, the component introduces higher substitution rates from CpG pairs. If
> 1, the component introduces lower substitution rates, and if
= 1, there is no CpG effect. Consider the nucleotide triplet (y1, y2, y3) and suppose y2 undergoes a change. If (y1, y2) or (y2, y3) are CG pairs and
< 1 (
> 1), the substitution rate for a change should increase (decrease); and if
= 1, the substitution rate should remain unchanged. We therefore define the function
![]() |
, if y2 is a member of a CG pair and 1 otherwise. The function 1CG(y1, y2) is the indicator function for the event (y1, y2) = (C, G).
The rate
for a change from sequence x to sequence
thereby depends on xk,
and the neighboring pairs xk1 and xk+1 and is given by
![]() |
![]() |
,
) is a normalizing constant and x0 and xn+1 are fixed. This expression is expected from the stationary distributions derived above for the fixed flanking situations.
We can use this expression for the stationary distribution to analyze the CpG effect. Indeed, we can estimate the parameters
and
= (
A,
G,
C,
T) from a single sequence using, for example, maximum likelihood, and if
is significantly smaller than 1, we may conclude that CpG methylation has played a role during the evolution of the sequence.
Context-Dependent Codon Model
In this section, we formulate the model from the previous section on the codon level and include the CpNpG effect.
We write a codon sequence x of n codons as x = (x1, ..., xn) with
where the upper index u in
indicates the position within the kth codon, and we assume that the boundary codons of a sequence are a start codon (x0 = ATG) and a stop codon (xn+1
TAA, TGA, TGG). Now consider the rate from sequence x to a new sequence
which equals x except at one position in codon xk. The new codon is denoted
and the rate from x to
is determined by two main components.
Firstly, there is the codon substitution rate matrix Q, where the rates do not depend on the neighboring codons. This component corresponds to the site-independent part of the model. We assume that the model is reversible so that detailed balance
aQ(a, b) =
bQ(b, a) is fulfilled. Here
a is the frequency of codon a.
Secondly, there is the CpG + CpNpG component determined by the parameters
CG and
CNG. This component introduces dependence among codons. If
CG < 1, the parameter introduces higher substitution rates from CG pairs; if
CG > 1, the parameter introduces lower substitution rates; and if
CG = 1, there is no methylation CG effect. As for the nucleotide model, we define the function
![]() |
CG, if y2 is a member of a CG pair and 1 otherwise.
Similarly, consider the nucleotide pentet (y1, ..., y5) and suppose y3 undergoes a change. If (y1, y2, y3) or (y3, y4, y5) are CNG triplets, the rate should increase (
CNG < 1), decrease (
CNG > 1), or remain unchanged (
CNG = 1). We therefore define the function
![]() |
The rate
for a change from sequence x to sequence
thereby depends on xk,
and the neighboring pairs
and
and is given by
![]() |
![]() | (1) |
CG,
CNG,
) is a normalizing constant. When
CG =
CNG = 1, we get Z = 1, and the stationary distribution for a sequence is the product of the codon frequencies along the sequence. Thus, the last term in equation (1) takes codon bias into account. The two terms in equation (1) that involves
CG and
CXG introduces a Markovian dependence structure along the sequence. If, for example,
CG < 1, it is less likely to observe the dinucleotide CG across a codon boundary than if
CG = 1.
From the stationary distribution (1), we obtain the log likelihood
![]() |
CNG = 1 and a model allowing
CNG to be a free parameter, to a
2(1) distribution.
Data
We considered 369 protein-coding genes from tomato. The genes were selected from a tomato expressed sequence tagassembled unigene set in the Solanaceae Genome Network database (http://www.sgn.cornell.edu/content/sgn_data.pl#Solanumlycopersicum). Gene sequences were subjected to untranslated region trimming and manual correction of sequencing errors based on multiple sequence alignments of the corresponding Arabidopsis ortholog. The URL for the data is ftp://ftp.sgn.cornell.edu/COSII/Rasmus_s_cleantomatoseq.fasta.
| Results |
|---|
|
|
|---|
Tomato Gene Analysis
The tomato genes are analyzed using the stationary distribution of the context-dependent codon model described above. We assume common codon frequencies but gene-specific CpG + CpNpG effects. In order to find the maximum likelihood estimates, we used an iterative procedure where we updated the CpG and CpNpG parameters for each gene using the current estimate of the common codon frequencies and updated the common codon frequencies using the current estimates of the CpG and CpNpG parameters.
The left histogram in figure 1 shows the strong effect of CpG mutation in these genes. The mean value of
is 0.375 and the median is 0.326. The fact that
in almost all cases suggest that the CpG effect is acting on all the genes. As can be seen from the right histogram in figure 1, in only 30 of the 369 genes was the test for
CG = 1 against the alternative that
CNG
1 not rejected at the 5% significance level. We used the likelihood ratio test statistics
and a
2(1) distribution to calculate the P values.
|
The left histogram in figure 2 shows the maximum likelihood estimates of
CNG. It is evident that the mean value of
CNG is close to 1 (the mean is 1.02 and the median is 0.99), and the CpNpG effect is therefore less pronounced. Indeed, for each gene, we tested
CNG = 1 against the alternative that
CNG
1, and the P values are almost uniformly distributed (right plot in fig. 2). We conclude that the CpNpG effect does not affect the stationary distribution of the sequences.
|
To confirm that these results are not an artifact of the statistical models, we conducted a simple analysis of the number of CpG and CpNpG sites spanning codon boundaries. We estimated codon frequencies from the observed codon counts of 369 protein-coding sequences from tomato. From the observed codon frequencies, we calculated the frequency of nucleotide C in third codon position (0.1714) and nucleotide G in first codon position (0.3332). The 369 coding sequences have 100,060 codon boundaries, and we thus expect 0.1714 x 0.3332 x 100060 = 5,715 CpGs over codon boundaries (table 1). However, we only observe 2,674 CpGs over codon boundaries. A similar analysis can be performed for CpNpG patterns (see table 1). In this case, the boundary is either immediately after the C or just before the G. As is seen from table 1, we also observe fewer CpNpGs over codon boundaries, but the effect is not very pronounced.
|
These results confirm that the predominant mutational effect of methylation in the coding regions analyzed here is on CpG dinucleotides and not on CpNpG motifs.
Finally, we consider the joint distribution of
and
as shown in figure 3. This plot shows no correlation between the CpG and CpNpG effects (the correlation coefficient between the CpG and CpNpG estimates is only 0.06).
|
In order to verify this conclusion in more detail, we considered the two issues of biasedness and standard errors (SEs) of the estimates. Maximum likelihood estimates are asymptotically unbiased, and the sample size is relatively large in our application. The vast majority of the genes have codon lengths in the range 100400. Thus, we expect the maximum likelihood estimates to be approximately unbiased. We used profile maximum likelihood confidence intervals (CIs) to determine the SEs for
CG and
CNG for each gene. For
CG, the 95% CIs are not very variable between genes and in the range
The
CNG CIs are also rather stable but somewhat larger. The intervals are typically in the range
Thus, neither biasedness nor reliability of the estimates should be an issue when interpreting figure 3. | Discussion |
|---|
|
|
|---|
While comparative sequence analysis traditionally have been used to investigate patterns of substitution, there is also considerable information about these patterns in a single sequence. In traditional models of sequence evolution, which assume independence among sites, only three independent mutation parameters can be estimated from a single sequence. However, by increasing the state space of the model, context-dependent effects can be estimated. In particular, using context-dependent models, we were here able to estimate the rate of CpG and CpNpG hypermutation.
We found that mutation rates in CpG and CpNpG sites were uncorrelated in the tomato genome. This supports the idea that separate methyltransferases are responsible for CpG and CpNpG methylation activity (e.g., Pradhan and Adams 1995
). Furthermore, we found an apparent absence of CpNpG hypermutation. This observation supports the notion that CpNpG methylation, in contrast to CpG methylation, almost never occurs in genic regions. CpNpG methylation may possibly play a role in silencing of genomic elements such as transposable and retrotransposable elements (e.g., Papa et al. 2001
). CpNpG methylation is controlled by the CMT3 methyltransferase gene (e.g., Lindroth et al. 2001
, Cao and Jacobsen 2002
). CMT3 mutants display a wild-type morphology but exhibit both decreased CpNpG methylation and reactivated expression of endogenous retrotransposon sequences (Lindroth et al. 2001
). In contrast, the pervasiveness of CpG avoidance in the tomato genes investigated here suggests that CpG methylation in plant genes does not have as radical an effect on expression as CpNpG methylation. The vast majority of all genes show a strong tendency toward CpG avoidance, suggesting that CpG methylation is pervasive in expressed genes.
Lindroth et al. (2001)
observed an almost complete genomic loss of CpNpG methylation in CMT3 null mutants. However, the mutants were morphologically normal, even after five generations of inbreeding. They hypothesized that CpNpG and CpG methylation may act in a partially redundant fashion to silence most genes. Our results suggest instead different roles of CpNpG and CpG methylation. While CpG methylation may play a role in both normal regulation of gene expression and defense against viruses, the role of CpNpG methylation may be more specialized in the defense against transposons and RNA viruses. No phenotypic effects are, therefore, expected from reduced CpNpG methylation in the absence of viral or transposon activity.
| Appendix |
|---|
|
|
|---|
Normalizing Constant for Stationary Distribution
In order to calculate the normalizing constant Z(
CG,
CNG,
), we first write the stationary distribution (1) as
![]() |
![]() |
We can evaluate this directly, but for large n, we can obtain a good approximation as follows. Consider a spectral expansion of T with eigenvalues
(there are 61 sense codons) in decreasing order. Let the right and left eigenvectors of T be rj and lj so that Trj = µjrj and
where * denotes vector transpose, and suppose the eigenvectors are normalized so that
Then we have
and
Recalling that µ1 is the largest eigenvalue, we get
and thereby
![]() |
| Acknowledgements |
|---|
|
|
|---|
We thank the associate editor and three anonymous reviewers for their constructive suggestions. A.H. and R.N. are supported by the Danish Research Council.
| Footnotes |
|---|
Naoka Takezaki, Associate Editor
| References |
|---|
|
|
|---|
Bender, J. 2003. DNA methylation and epigenetics. Annu. Rev. Plant Biol. 55:4168.[CrossRef]
Cao, X., and S. E. Jacobsen. 2002. Locus-specific control of asymmetric and CpNpG methylation by the DRM and CMT3 methyltransferase genes. Proc. Natl. Acad. Sci. USA 99:1649116498.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725736.[Abstract]
Huttley, G. A. 2004. Modeling the impact of DNA methylation on the evolution of BRCA1 in mammals. Mol. Biol. Evol. 21:17601768.
Jensen, J. L., and A.-M. K. Pedersen. 2000. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv. Appl. Probab. 32:499517.[CrossRef]
Lindroth, A. M., X. Cao, J. P. Jackson, D. Zilberman, C. M. McCallum, S. Henikoff, and S. E. Jacobsen. 2001. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 15:20772080.
Lunter, G. A., and J. Hein. 2004. A nucleotide substitution model with nearest-neighbour interactions. Bioinformatics 20S:i216i223.[Abstract]
Ng, H., and A. Bird. 1999. DNA methylation and chromatin modification. Curr. Opin. Genet. Dev. 9:158.[CrossRef][Web of Science][Medline]
Papa, C. M., N. M Springer, M. G. Muszynski, R. Meeley, and S. M. Kaeppler. 2001. Maize chromomethylase Zea methyltransferase2 is required for CpNpG methylation. Plant Cell 13:19191928.
Pradhan, S., and R. L. P. Adams. 1995. Distinct CG and CNG DNA methyltransferases in Pisum sativum. Plant J. 7:471481.[Medline]
Siepel, A., and D. Haussler. 2004. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21:468488.
Sved, J., and A. Bird. 1990. The expected equilibrium of the CpG dinucleotide in vertebrate genomes under a mutation model. Proc. Natl. Acad. Sci. USA 87:46924696.
Yap, V. B., and T. P. Speed. 2004. Modeling DNA base substitution in large genomic regions from two organisms. J. Mol. Evol. 58:1218.[CrossRef][Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
M. Anisimova and C. Kosiol Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models Mol. Biol. Evol., February 1, 2009; 26(2): 255 - 271. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. R. Morton, V.-u.-N. Dar, and S. I. Wright Analysis of Site Frequency Spectra from Arabidopsis with Context-Dependent Corrections for Ancestral Misinference Plant Physiology, February 1, 2009; 149(2): 616 - 624. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Delport, K. Scheffler, and C. Seoighe Models of coding sequence evolution Brief Bioinform, January 1, 2009; 10(1): 97 - 109. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Wang, A. Diehl, F. Wu, J. Vrebalov, J. Giovannoni, A. Siepel, and S. D. Tanksley Sequencing and Comparative Analysis of a Conserved Syntenic Segment in the Solanaceae Genetics, September 1, 2008; 180(1): 391 - 408. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. L. Thorne, S. C. Choi, J. Yu, P. G. Higgs, and H. Kishino Population Genetics Without Intraspecific Data Mol. Biol. Evol., August 1, 2007; 24(8): 1667 - 1677. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||














Right: histogram of P values for testing 
Right: histogram of P values for testing 






