MBE Advance Access originally published online on November 22, 2006
Molecular Biology and Evolution 2007 24(2):522-531; doi:10.1093/molbev/msl179
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Estimating the Neutral Rate of Nucleotide Substitution Using Introns

* European Molecular Biology LaboratoryEuropean Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, England
Graduate School of Biological, Medical and Veterinary Sciences, University of Cambridge, Cambridge, England
E-mail: birney{at}ebi.ac.uk.
| Abstract |
|---|
|
|
|---|
Evolutionary biologists frequently rely on estimates of the neutral rate of evolution when characterizing the selective pressure on protein-coding genes. We introduce a new method to estimate this value based on intron nucleotide substitutions. The new method uses a metascript model that considers alternative splicing forms and an algorithm to pair orthologous introns, which we call Introndeuce. We compare the intron method with a widely used method that uses observed substitutions in synonymous coding nucleotides, by using both methods to estimate the neutral rate for humandog and mouserat comparisons. The estimates of the 2 methods correlate strongly (rS = 0.75), but cannot be considered directly equivalent. We also investigate the effect of alignment error and G + C content on the variance in the intron method: in both cases there is an effect, and it is species-pair specific. Although the intron method may be more useful for shorter evolutionary distances, it is less useful at longer distances due to the poor alignment of less-conserved positions.
Key Words: introns neutral rate synonymous substitution mammals
| Introduction |
|---|
|
|
|---|
Evolutionary biologists have long had an interest in determining which regions of the genome are under selective pressure, and if so, what kind of selective pressure affects them (Nei and Kumar 2000
The most sophisticated models of selective pressure (Yang and Nielsen 2000
) use the evolutionary distances dN (nonsynonymous) and dS (synonymous) and label the ratio of the two
= dN/dS. Some researchers (Hurst 2002
) instead use the notations KA, KS, and KA/KS ratio and perhaps different models, but they all rely on the same underlying set of assumptions. These models usually assume that dS, the number of synonymous substitutions per potentially synonymous site, represents neutral changes unaffected by selective pressure. In contrast, the models assume that dN, the number of nonsynonymous substitutions per potentially nonsynonymous site, represents differences affected by selective pressure. Dividing dN by dS to get the nonsynonymous/synonymous rate ratio
will yield a quantity that assesses selective pressure after correction for the local variation in neutral evolution. One must estimate this ratio locally because the character of neutral evolution varies in different regions of the genome. For any given gene (or portion of a gene), if
> 1, the gene is under positive selection, whereas if
< 1, the gene is under purifying (or negative) selection.
Using dS as a measure of the neutral rate of evolution presents several problems. First, the underlying assumption that synonymous coding nucleotides can change freely is not correct in all cases. For example, synonymous sites may overlap with exonic splicing enhancers (Cartegni et al. 2002
) or silencers (Chamary et al. 2006
). Synonymous sites may also be under selective pressure due to their effects on mRNA stability, either through G + C content or the appearance or avoidance of certain sequence motifs (Chamary et al. 2006
). In some animals, there is definitely selection in synonymous coding sites in the form of codon bias related to tRNA abundance (Chamary et al. 2006
), but whether it affects synonymous coding sites in mammals is unclear (Bernardi 1995
, 2000
; Graur and Li 2000
; Iida and Akashi 2000
). There is also indirect evidence that selection affects synonymous sites that shows in comparison of base composition and evolutionary rates with other potentially neutral types of sequence (Chamary et al. 2006
). Second, the usually unstated assumption that synonymous coding sites undergo the same mutational processes as other sites in the genome ignores the fact that they occur next to a limited number of flanking nucleotides and therefore will be unequally affected by effects such as CpG hypervariability (Hardison et al. 2003
). Finally, there are so few synonymous coding nucleotides for a given gene that it can be difficult to precisely estimate the number of changes per nucleotide. This is especially true when comparing species such as humans and chimpanzees, which are so closely related that there has not been enough time for many of these sites to change.
In order to overcome primarily the issue of low numbers of synonymous coding sites, some researchers have estimated the neutral rate using either ancestral repeats near genes (Chimpanzee Sequencing and Analysis Consortium 2005
) or substitutions in introns (Castresana 2002
). Introns in homologous genes provide natural and well-defined boundaries for alignment regions, and the recent release of several high-quality vertebrate genome sequences allows for a robust comparison across several species. One cannot assume that intron changes are always neutral, but one cannot assume that synonymous coding nucleotide changes are always neutral either. Intron substitutions still provide a useful alternative for dealing with the small-sample error inherent in the synonymous coding nucleotide model.
Despite the previous use of neutral rates estimated using intron nucleotide substitutions, no one has yet performed a thorough comparison with a neutral rate estimate based on synonymous coding nucleotides along with an investigation of the evolutionary distances at which using introns is sensible. This article aims to examine properties of a neutral rate estimate based on intron substitutions compared with one based on synonymous coding sites by highlighting their similarities and differences, with particular regard to their use in estimates of selective pressure.
We will call the estimated neutral rate from nucleotide substitution in introns dI, defined as the number of intron substitutions per intron nucleotide. We wish to estimate dI for a large portion of the genome, but the mechanics involved are not trivial as one must first identify truly orthologous intronic sequences to compare. When studying tens of thousands of gene pairs, it is not practical to use handcrafted alignments. Naive assumptions about the congruence of gene structures between species, however, are prone to differences in gene annotations or changes in the biological transcript structure. One cannot, for example, assume that the first intron in a mouse gene is orthologous to the first intron in its orthologous rat gene. In addition, the existence of alternative splice forms in eukaryotic genes increases the challenge in determining which intron regions are homologous. We found the common approach of using the longest translation of a gene as a stand-in for the whole gene inadequate as it ignores the selective pressure on introns that contain nonconstitutive exons. To address this need, we created a model of gene evolution, which we call "metascript," that incorporates alternative splice forms and considers varying selective pressures on introns and coding sequence.
We chose to analyze data from mouserat and humandog comparisons. Evolutionary biologists believe that the mouse and rat lineages diverged 1635 MYA (Adkins et al. 2001
; Springer et al. 2003
) and the human and dog lineages diverged 9095 MYA (Springer et al. 2003
). Although the humanmurid divergence occurred later (8588 MYA) (Springer et al. 2003
), the humanmurid evolutionary distance is longer than the distance between human and dog, probably partly due to shorter generation time in the murid lineage (Li et al. 1996
).
| Materials and Methods |
|---|
|
|
|---|
Ortholog Identification, Genomic Sequence, and Transcript Predictions
We used Ensembl Compara version 28 (Birney et al. 2006
|
The Metascript Model of Alternative Splicing
We constructed a metatranscript or metascript model for each gene, which incorporates all of its alternative splicing forms. An artificial example is shown in figure 1. Genomic regions expressed as protein in some or all of the predicted transcripts are called metaexons, and regions that never express as protein are called metaintrons. Each metaintron has a phase code, determined by the phases of the coding sequence upon interruption by the metaintron's constituent introns. A metaintron's phase code can indicate that its constituent introns start in phases 0, 1, or 2, the untranslated region, or a specific mixture of these phases. Frameshift errors and introns shorter than 50 nt each get special phase codes. We represent the metascript model as a string of nucleotides and phase codes.
|
The Introndeuce Algorithm for Pairing Orthologous Introns
To identify orthologous metaintrons, we perform a SmithWaterman alignment (Smith and Waterman 1981
Software Availability
We have made available a Python (http://www.python.org/) package that constructs metascripts from Ensembl gene models and implements the Introndeuce algorithm (http://www.ebi.ac.uk/
hoffman/software/metascript).
Estimation of dN and dS
We estimated the proportion of different nonsynonymous coding nucleotides pN and proportion of different synonymous coding nucleotides pS for each gene pair using the NeiGojobori method (Nei and Gojobori 1986
), based on PSW alignments of the longest translations of the genes. We then estimated the number of nonsynonymous substitutions dN and the number of synonymous substitutions dS based on pN and pS, respectively, using the JukesCantor model (Jukes and Cantor 1964
):
|
|
Estimation of dI
We took the nucleotide sequence of each orthologous metaintron identified by Introndeuce and masked out the first 10 nt and the last 30 nt of each metaintron to exclude conserved intron splicing signals. We then used BlastZ (Schwartz et al. 2003
) with a reduced stringency (reducing the maximal scoring pair score threshold to 2,200) to align the sequences. Before making further calculations, we masked 5 nt from both edges of each aligned block, in order to reduce edge wander effects and decrease uncertainty about the correctness of an accepted alignment column. If we let Id be the number of mismatches in the remaining aligned sequence and I the number of matches plus mismatches, we estimate a proportion of differing intron nucleotides pI by dividing the 2 quantities:
|
|
We then estimate dI using the JukesCantor model as above. We also estimated an alternative version dI,maskedges=0 without masking the edges of aligned blocks. We calculated the relative error
between the 2 with this equation:
|
|
Phylogenetic Tree Construction
We used MartShell (http://www.biomart.org/) to identify all the human members of the MAGE family, ENSF00000000336, in Ensembl Compara version 30, and estimated dS and dI as above for each possible gene pairing. We converted the pairwise d values into distance matrices, filling in a distance of 1 where the distance could not be estimated because the sequences were too divergent. We then used the FITCH program of PHYLIP 3.64 (Felsenstein 2005
) through the Pylip interface (http://www.ebi.ac.uk/
hoffman/software/pylip/) to generate a tree that we visualized with TreeView 1.6.6 (Page 1996
).
| Results |
|---|
|
|
|---|
We developed a new algorithm (Introndeuce) for robustly assigning orthologous introns in the presence of alternative splicing, without requiring genomic alignments. For each gene, we project all exons into genomic coordinates and produce a novel sequence-like model called a metascript (fig. 1). The metascript is the concatenation of the nucleotide sequence of all annotated exonic regions with phase codes to indicate the phase of the intervening introns. The Introndeuce algorithm then aligns the exonic sequence and intronic phase codes of the resulting metascripts with an extension of standard dynamic programing methods. This results in the pairing of introns if and only if the surrounding exonic sequence is truly orthologous. As expected, most introns pair with consistent phases between the species considered. The 15,176 orthologs between human and dog and the 16,183 orthologs between mouse and rat produced 96,476 humandog paired metaintrons in 10,443 gene pairs and 105,560 mouserat paired metaintrons in 12,566 gene pairs (table 1). In the metascripts that could be produced, the median number of metaintrons is 7 in humans and dogs and 6 in the murids. We then aligned these metaintrons using BlastZ individually.
We estimated dS and dI values for the ortholog pairs with at least 1 aligned metaintron (table 1 and fig. 2) using the JukesCantor model (Jukes and Cantor 1964
) in an identical manner for each statistic (see Materials and Methods). The median dS value is 0.370 for humandog and 0.212 for mouserat, whereas the corresponding median dI values are lower at 0.305 and 0.158, respectively. This may be due to a lower substitution rate or the effects of more selective constraint in introns than in synonymous coding sites (Chen and Li 2001
). As expected, the variance of the dI values is smallerthe median absolute deviations of the humandog and mouserat dS values (0.113 and 0.062, respectively) are 2 to 3 times the median absolute deviations of the corresponding dI values (0.053 and 0.022). Even while considering extreme outliers, the dI range (humandog: [0.0285, 0.700]; mouserat: [0, 0.535]) is still smaller than the dS range (humandog: [0.0259, 4.813]; mouserat: [0, 2.328]).
|
Figure 2 shows a scatterplot of dS versus dI values for humandog and mouserat gene pairs. As expected, these 2 variables are strongly correlated when including the data from both species pairs (rS = 0.75). One can clearly see, however, that a different model for each species pair would better fit the data from that species pair, despite lower correlation on a perspecies-pair basis (humandog rS = 0.57; mouserat rS = 0.46) than when examining all the gene pairs together. This means that one cannot universally predict a dI value from a dS value without reference to a particular species pair. One can see a clear separation between the ranges of dI values for the 2 species pairs analyzed, whereas the dS ranges overlap quite a bit. Additionally, within the same species pairs, the variance of dI values is smaller. This suggests that dI might provide a more distinctive characterization of genome-wide neutral evolutionary distance for species pairs in this range, which would allow better phylogenetic tree construction. When looking at 8,095 genes that have a 1:1:1:1 ortholog relationship in humandogmouserat, we find that
|
|
|
|
There are 2 plausible reasons for the systematically different estimates of neutral substitution in synonymous coding nucleotides and introns if one assumes that neutral mutation occurs at the same rate in exonic and intronic sequence. First, the 2 different kinds of sequence are subject to different kinds of selective pressures. Some of these selective pressures, such as those discussed in Introduction, will differently affect the fixation of point substitutions at certain sites in introns and synonymous coding sequence. Additionally, indels are much more likely to be selected against in coding sequence. The other possibility is that alignment effects cause the systematic difference. Even when corrected for by masking the edges of aligned blocks, edge wander effects may still lead to an observed similarity within the same species pair that underestimates the divergence of the 2 species.
Variability in dI and dS Measures
We decided to look at the variability of dS and dI in a number of ways. First, we examined the variance for each data point as calculated by an analytical formula. The variance of a single JukesCantor distance (representing the error in that particular estimation rather than the dispersion of the whole population as discussed earlier) varies inversely with the number of nucleotides examined (Nei and Kumar 2000
). Because of this and the fact that I is generally much greater than S, the error for a single estimate of dI (which is the square root of the variance and called sdI) usually will be lower than the error for a single estimate of dS (called sds). This is the case for 22,541 gene pairs or 98% of those examined.
We then investigated the effects of other factors, such as G + C content, on the differences between dI and dS, as well as their variability. To visualize the difference between dI and dS and the effect of G + C content on this difference, we created Tukey mean-difference plots (Cleveland 1993
) (fig. 3) split at the quartiles of G + C content. We measured G + C content in the longest translation of the human or mouse gene of an orthologous gene pair as G + C is highly correlated across the species pairs we used (humandog: Spearman's rank-order correlation coefficient rS = 0.95; mouserat: rS = 0.97). These plots allow a visual assessment of shift between dN and dS, by analyzing deviation from the zero line in both distance and slope. As the neutral evolutionary distance between 2 orthologs increases, the difference between dI and dS becomes more pronounced (dS increases proportionally more than dI). The median dI dS is below 0 for both species pairs for all ranges of G + C content, indicating that median dS is larger than median dI. The slope indicates the difference in variance between dS and dI.
|
One of the key reasons for estimating an evolutionary distance from introns is that the number of intron sites I is an order of magnitude higher than the number of synonymous coding sites S or the number of nonsynonymous coding sites N, where fractional values in S or N indicate partial degeneracy. The distribution of the number of each kind of site in a gene pair is positively skewed.
A striking difference between the humandog comparison and the mouserat comparison is the effect of G + C content. As G + C content (qG + C) measured in human increases, humandog dI increases (rS = 0.41; significant at P < 0.001). Humandog dS increases with G + C content as well but much less of the proportion of variation in dS is attributable to a change in G + C content as the correlation is weaker (rS = 0.27; P < 0.001). Mouserat dS and dI decrease as mouse G + C content increases, but the correlation between G + C content and dS is much weaker (rS = 0.15; P < 0.001) as is the correlation with dI (rS = 0.11; P < 0.001). This effect is best seen in the comparison of the first quarter of G + C content to the fourth quarter in figure 3 showing dramatically increased variance in the humandog dI value at high G + C. This effect is unchanged if the G + C content is measured with dog or rat. This change in responsiveness to GC content between the species pairs suggests dI is a more labile measure over evolutionary time (see Discussion).
One benefit of the dS measure is that the information-rich amino acid alignment provides a robust scaffold for identifying orthologous synonymous nucleotides. This contrasts with nucleotide alignments, where the placement of substitutions and indels is determined by an alignment program, which must penalize gaps and substitutions. This process is therefore more error-prone in particular in the placement of substitutions near insertions where the highest-scoring alignment is less likely to reflect the true evolutionary relationships between bases, an effect known as "edge wander" (Holmes and Durbin 1998
). We generally estimated dI by counting only substitutions in aligned blocks at least 5 nt away from the nearest indel (dI,maskedges=5) to remove these errors. To examine how well this strategy removed errors, we also estimated it by counting all substitutions in the intron nucleotide alignment (dI,maskedges=0). For mouserat values, both versions of dI are very close (fig. 4) and the median relative error
between the 2 measurements is 1.7%. The 2 methods of estimating dI for humandog values produce values that are close with median
= 3.7%, but this is twice the median for mouserat. Absolute error increases with the mean of the 2 dI methods for humandog (rS = 0.67; P < 0.001) but actually decreases very slightly for mouserat (rS = 0.04; P < 0.001). The increase in variance indicates that the humandog distance might be at the edge of where such alignment artifacts will dominate (see Discussion).
|
To examine the effects of selective pressures on supposedly synonymous sites, we examined dS and dI for genes with known functional elements in synonymous sites. We inspected a set of 34 human genes where synonymous coding mutations can alter splicing (Cartegni et al. 2002
Effect on the Estimation of Selection
One of the major uses of dS is to calculate a ratio
= dN/dS to determine what sort of selective pressure a region is under. Here, we will refer to this established ratio as
S and define an equivalent that uses intron substitutions to estimate the nonneutral rate of evolution
I = dN/dI. Looking at the human and dog genes with the largest
I, we see a variety of kinds of genes, including transcription factors, RNA-binding proteins, reproductive-related genes, and genes of unknown function. The human and dog genes with the largest
S include immune-related genes and similar kinds of genes to those with the highest RNA- and DNA-binding genes and the genes with the highest
I. Surprisingly only 2 genes, C1orf198 and CENPA, are shared among the top 10
I and top 10
S groups, suggesting that many of the previous noted outliers in
S are potentially due to variance in neutral rate estimation and not due to changes in nonsynonymous rate. No genes are shared among the top 10
I and top 10
S groups for mouse and rat.
When broader groups are considered, such as the
I and
S values above the 95th percentile, more genes (humandog: 341/523 = 65%; mouserat: 397/629 = 63%) are shared in both lists, but a substantial number are still unique. In contrast, when considering the
I and
S values below the 5th percentile, many more genes are shared in both lists (humandog: 475/523 = 91%; mouserat: 606/629 = 96%). One can explain the asymmetry by noting that as dN and, therefore,
increase, the difference between dS and dI also increases. This means that especially when considering genes under positive selection, it is particularly important to use both the coding site and intron methods to determine neutral rate.
We found overrepresented Gene Ontology (Gene Ontology Consortium 2006
) terms using GO-TermFinder (Boyle et al. 2004
), with a cutoff of P = 0.01 and annotations by GOA (Camon et al. 2004
). Humandog genes with
I values above the 95th percentile (but with
S values below the 95th percentile) significantly overrepresent only biological process terms involved in immunity and defense (immune response; defense response; response to pest, pathogen, or parasite). So do humandog genes above the 95th percentile exclusively for
S (response to biotic stimulus; defense response; immune response). The mouserat genes above the 95th percentile exclusively for
S overrepresent only similar terms, but no terms are significantly overrepresented in the genes above 95th percentile exclusively for
I.
Use of dI for the Investigation of Paralog Relationships
Both dS and dI have applications beyond orthologs in examining the relationship between paralogous genes. For recently evolved gene families, dS is likely to be small (often 0) between 2 paralogs due to the short evolutionary time separating the 2 genes. In these cases, using a dI measure becomes crucial to understand gene relationships because dI is usually greater than dS (fig. 3) at small evolutionary distances. Figure 5 shows the MAGE family of paralogs (Chomez et al. 2001
), where different trees are found when using dS or dI measures. As expected, the dI tree resolves some of the recent duplication events, including the cases where the coding sequence is identical at the DNA level. For more diverged sequences, however, the dI measurement appears to be saturated due to edge wander. This, along with the arbitrary assignment of distances that cannot be estimated, produces artificially small branch lengths.
|
| Discussion |
|---|
|
|
|---|
In order to sensibly compare dS with dI, one must have a large number of correctly paired orthologous introns. Previously, researchers have 1) annotated orthologous introns manually, 2) assumed the colinearity of introns in orthologous genes with identical numbers of introns (Castresana 2002
We have introduced a new method (Introndeuce), which automatically pairs orthologous introns in the absence of detectable intronic alignments with modest computational requirements. This method may have applications beyond this paper. For example, metascript alignment is an elegant way to consider gene-level selection when alternative splicing is present. Previous approaches ignored the selective pressure on intron regions due to nonconstitutive exons in these regions.
Our large amount of data has allowed us to investigate the properties of dI compared with dS. It is clear that the 2 variables are correlated but that they are measuring different properties of the genomeone cannot consider dI directly equivalent to dS because dI would systematically underestimate dS as it increases. Our opinion is that both measures have their flaws, both from a conceptual perspective of potential nonneutral bases in each case and from a pragmatic issue of alignability and observed mutations.
One surprise has been the more marked species difference in dI compared with dS. The dI measure shows far greater change in median between humandog versus mouserat, when compared with dS. Although the greater variance of dS when compared with dI in both species pairs must explain this partially, figure 2 clearly shows that the relationship between dS and dI is not equivalent in the 2 species pairs. We have investigated both alignment artifact effects (fig. 4) and G + C content effect (fig. 3) to explain this difference. It seems clear that dI has more specific variation due to G + C effects in humandog but not in mouserat. Humans and dogs have more closely related G + C content distributions, which are more likely to reflect the boreoeutherian ancestor, and in these species evolutionary distance rates are correlated with G + C content along chromosomal positions (Lindblad-Toh et al. 2005
). The variation in G + C content in genomes is a complex phenomenon that probably interacts with the dI measure in multiple ways, such as their shared correlation with local recombination rate (Eyre-Walker 1992
; Fullerton et al. 2001
; Lander et al. 2001
; Lercher and Hurst 2002
; Waterston et al. 2002
; Hardison et al. 2003
; Hellmann et al. 2003
), hypermutability of CpG dinucleotides, which more frequently occur in elevated G + C areas (Fryxell and Moon 2005
), biased gene conversion to GC (Marais 2003
) causing G + C content elevation in regions of high recombination, increased short interspersed nuclear element insertion in elevated G + C regions (Jurka 1997
), and other effects. In particular, Webber and Ponting (2005)
have observed elevated G + C content and humandog dS values in dog genes in subtelomeric and pericentromeric regions or in syntenic blocks of <4 Mb, as well as human genes in subtelomeric regions, whereas noting that these phenomena are much less striking in the mouse and rat genomes. Due to the correlation between dS and dI, similar phenomena may affect dI, although the CpG hypermutability will affect intron sites differently from synonymous coding sites as noted earlier. Genome-wide changes in mutation rates and genomic landscape are more likely to affect dI than dS as runs of intron nucleotides have fewer constraints than runs of coding site nucleotides, which include nonsynonymous sites. Consequently dI measures are less useful over broader evolutionary time.
Preliminary research into estimating dI values for humanmouse, humanrat, and Tetraodon nigroviridisTakifugu rubripes species pairs indicated that the edge wander problem was too severe to trust the dI measure (data not shown). We surmised that there had been too much neutral evolutionary change at these kinds of distances for the nucleotide substitution dI method to be useful. Judging by the increase in variance as dI rises in the alignment artifact investigation (fig. 4), the humandog evolutionary distance (76% median identity in intronic local alignments) is close to the limit of where dI measures are useful.
For evolutionary distances shorter than humanmouse, dI may be a more useful measurement than dS. The small number of synonymous coding sites becomes more crucial when there is less time over which one can observe changes. In particular, we have shown that it can be used to make the evolutionary relationship between young paralogs less ambiguous when compared with a phylogeny generated with the use of dS. Although dI can provide a less ambiguous tree with greater distances for closely related paralogs than a dS tree, for more distantly related paralogs the distances are far shorter than those estimated with dS. The dI measurement is probably saturated for distances this far out, but dS is useful in this range. Additionally, the difference between dI and dS varies for different kinds of genes. Genes that have RNA-editing sites, distant intron branch points (Gooding et al. 2006
), or other conserved intronic elements will have a particularly low dI, whereas genes with conserved DNA and RNA regulatory elements in their exons will have a low dS.
We used BlastZ local alignments to identify the orthologous nucleotides within introns. Because we were only inspecting aligned blocks anyway, we needed only the high-scoring areas of alignment that a local alignment algorithm could produce. Preliminary investigations with a global alignment algorithm, LAGAN (Brudno et al. 2003
), convinced us not to pursue a global alignment approach as it forces even more edge wander. Replacing BlastZ with LAGAN in this analysis led to a bimodal distribution of dI values for gene pairs that had and did not have sufficient alignable intron sequence (data not shown).
In our work, we have used a relatively simple JukesCantor model in a consistent manner for both dS and dI. More sophisticated models are available but require more parameters to be estimated, such as transition/transversion bias (Nei and Kumar 2000
). Another advantage of the NeiGojobori method we used over more sophisticated models such as that of Yang and Nielsen (2000)
is that the latter has parameters to account for codon bias, for which there is no comparable analog in introns. The lack of parameters to estimate simplifies the analysis, and NeiGojobori is similar to maximum likelihood if transition/transversion and codon-usage bias are ignored (Yang and Bielawski 2000
). Ignoring transition/transversion bias leads to overestimation of dS because transitions at the third position of a codon are more likely to be synonymous (Ina 1995
; Nei and Kumar 2000
). This may partially explain the tendency of dI to underestimate dS.
The difference in dS and dI suggests that one should use caution when investigating outliers of
S or
I because many of these outliers are not consistent between the 2 measures of neutral rate. We suggest that wherever appropriate, such as within reasonably close species, researchers should quote both values as the most robust outliers will have support from both the
S and
I measures.
Comparison with Previous Research
Researchers have suggested still other methods to estimate the neutral rate; we summarize some of these in table 2. The most similar method to the intron method in this paper is by Castresana (2002)
, who also suggested a method to estimate an evolutionary distance using intron substitutions in aligned blocks and used it on a manually selected set of 63 humanmouse gene pairs with 504 introns. It differed from the method described in this paper in several notable ways. First, he did not use an algorithm like Introndeuce to identify orthologous introns, instead assigning intron orthology colinearly on orthologous genes and discarding genes that had different number of genes in either species. This will not work in cases of intron gain or loss or differences in annotated gene structure, although it might have worked for the small hand-selected set Castresana used. His method also did not take alternative splicing into account, which would affect the results. Castresana uses a NeedlemanWunsch global alignment (Needleman and Wunsch 1970
) followed by Gblocks (Castresana 2000
) to identify aligned blocks based both on distance from gaps and anchoring by highly conserved positions. This differs from our approach of BlastZ local alignment followed by identifying aligned blocks solely on the basis of distance from gaps. Castresana uses the HKY model of evolution (Hasegawa et al. 1985
), which is more sophisticated than the JukesCantor model we use. However, it would be possible to use the alignments produced by both methods as input to any model of DNA evolution.
|
Although we tested our method on a genome-wide 23,009 gene pairs, Castresana's input data were winnowed from a larger hand-selected set of 77 gene pairs (Jareborg et al. 1999
values in the top 95th percentile for both methods. Also the correlation between dS and dI is much better in our data set (rS = 0.75 overall and rS = 0.46 for mouserat, as opposed to rS = 0.34 for the humanmouse set). Castresana's data also leads him to claim that genes with fast-evolving exons have fast-evolving introns. We used our method at closer evolutionary distances than humanmouse and have concluded that it is not actually useful at this distance due to edge wander problems. This is borne out by the lower correlation at this distance. | Conclusion |
|---|
|
|
|---|
We have investigated the properties of using introns to estimate the neutral rate of nucleotide substitution. We have developed an efficient method to automatically pair orthologous intron sequences. We have shown that this intron-based rate estimation is correlated with but not directly equivalent to a synonymous nucleotide measure. In particular, the dI measure has more species-specific properties, suggesting that it is more labile over evolutionary time. These differences significantly affect the prediction of positive and purifying selection. We therefore strongly recommend that both dS and dI measures are used when possible for evolutionary analysis of closely related species such as those within the primates or within the rodents, but not used for more distant comparisons, in particular, when other genome properties, such as G + C bias, have changed. As researchers sequence an increasing number of primate genomes (http://www.genome.gov/19517271), the use of dI will become ever more important in understanding the evolution of primate genes and their implications for human biology.
| Acknowledgements |
|---|
|
|
|---|
This material is based upon work supported under a National Science Foundation Graduate Research Fellowship. We would like to thank Nick Goldman, Martin Hammond, Daniel Zerbino, Ben Paten, Alison Meynert, and 2 anonymous reviewers for helpful suggestions and the Wellcome Trust Sanger Institute Informatics Systems Group for computer support.
| Footnotes |
|---|
Lauren McIntyre, Associate Editor
| References |
|---|
|
|
|---|
Adkins RM, Gelke EL, Rowe D, Honeycutt RL. (2001) Molecular phylogeny and divergence time estimates for major rodent groups: evidence from multiple genes. Mol Biol Evol 18:777791.
Bernardi G. (1995) The human genome: organization and evolutionary history. Annu Rev Genet 29:445476.[CrossRef][ISI][Medline]
Bernardi G. (2000) Isochores and the evolutionary genomics of vertebrates. Gene 241:317.[CrossRef][ISI][Medline]
Birney E, Andrews D, Caccamo M, et al. (50 co-authors). (2006) Ensembl 2006. Nucleic Acids Res 34:D556D561.
Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. (2004) GO::TermFinderopen source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20:37103715.
Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S. (2003) LAGAN and multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13:721731.
Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R. (2004) The Gene Ontology Annotation (GOA) database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 32:D262D266.
Cartegni L, Chew SL, Krainer AR. (2002) Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nat Rev Genet 3:285298.[CrossRef][ISI][Medline]
Castresana J. (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17:540552.
Castresana J. (2002) Estimation of genetic distances from human and mouse introns. Genome Biol 3: research0028.
Chamary JV, Parmley JL, Hurst LD. (2006) Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat Rev Genet 7:98108.[CrossRef][ISI][Medline]
Chen FC and Li WH. (2001) Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am J Hum Genet 68:444456.[CrossRef][ISI][Medline]
Chiaromonte F, Yap VB, Miller W. (2002) Scoring pairwise genomic sequence alignments. Pac Symp Biocomput 7:115126.
Chimpanzee Sequencing and Analysis Consortium. (2005) Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437:6987.[CrossRef][Medline]
Chomez P, De Backer O, Bertrand M, De Plaen E, Boon T, Lucas S. (2001) An overview of the MAGE gene family with the identification of all human members of the family. Cancer Res 61:55445551.
Cleveland WS. (1993) Visualizing data. (Hobart Press, Summit (NJ)).
Emes RD, Beatson SA, Ponting CP, Goodstadt L. (2004) Evolution and comparative genomics of odorant- and pheromone-associated genes in rodents. Genome Res 14:591602.
Eyre TA, Ducluzeau F, Sneddon TP, Povey S, Bruford EA, Lush MJ. (2006) The HUGO Gene Nomenclature database, 2006 updates. Nucleic Acids Res 34:D319D321.
Eyre-Walker A. (1992) The role of DNA replication and isochores in generating mutation and silent substitution rate variance in mammals. Genet Res 60:6167.[ISI][Medline]
Felsenstein J. (2005) PHYLIP (phylogeny inference package) [Internet]. (University of Washington, Seattle, Seattle (WA)) [cited 2006 November 1]. Available from: http://evolution.genetics.washington.edu/phylip.html.
Fryxell KJ and Moon WJ. (2005) CpG mutation rates in the human genome are highly dependent on local GC content. Mol Biol Evol 22:650658.
Fullerton SM, Bernardo Carvalho A, Clark AG. (2001) Local rates of recombination are positively correlated with GC content in the human genome. Mol Biol Evol 18:11391142.
Gene Ontology Consortium. (2006) The Gene Ontology (GO) project in 2006. Nucleic Acids Res 34:D322D326.
Gooding C, Clark F, Wollerton MC, Grellscheid SN, Groom H, Smith CW. (2006) A class of human exons with predicted distant branch points revealed by analysis of AG dinucleotide exclusion zones. Genome Biol 7:R1.[CrossRef][Medline]
Graur D and Li W-H. (2000) Fundamentals of molecular evolution. (Sinauer Associates, Sunderland (MA)).
Hardison RC, Roskin KM, Yang S, et al. (17 co-authors). (2003) Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res 13:1326.
Hasegawa M, Kishino H, Yano T. (1985) Dating of the humanape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22:160174.[CrossRef][ISI][Medline]
Hellmann I, Ebersberger I, Ptak SE, Paabo S, Przeworski M. (2003) A neutral explanation for the correlation of diversity with recombination rates in humans. Am J Hum Genet 72:15271535.[CrossRef][ISI][Medline]
Holmes I and Durbin R. (1998) Dynamic programming alignment accuracy. J Comput Biol 5:493504.[ISI][Medline]
Hurst LD. (2002) The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet 18:486.[CrossRef][ISI][Medline]
Iida K and Akashi H. (2000) A test of translational selection at silent sites in the human genome: base composition comparisons in alternatively spliced genes. Gene 261:93105.[CrossRef][ISI][Medline]
Ina Y. (1995) New methods for estimating the numbers of synonymous and nonsynonymous substitutions. J Mol Evol 40:190226.[CrossRef][ISI][Medline]
Jareborg N, Birney E, Durbin R. (1999) Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res 9:815824.
Jukes TH and Cantor CR. (1964) Evolution of protein molecules. In Munro HN and Allison JB (Eds.). Mammalian protein metabolism(Academic Press, New York) pp. 21132.
Jurka J. (1997) Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons. Proc Natl Acad Sci USA 94:18721877.
Lander ES, Linton LM, Birren B, et al. (254 co-authors). (2001) Initial sequencing and analysis of the human genome. Nature 409:860921.[CrossRef][Medline]
Lercher MJ and Hurst LD. (2002) Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet 18:337340.[CrossRef][ISI][Medline]
Li WH, Ellsworth DL, Krushkal J, Chang BH, Hewett-Emmett D. (1996) Rates of nucleotide substitution in primates and rodents and the generation-time effect hypothesis. Mol Phylogenet Evol 5:182187.[CrossRef][ISI][Medline]
Lindblad-Toh K, Wade CM, Mikkelsen TS, et al. (235 co-authors). (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438:803819.[CrossRef][Medline]
Marais G. (2003) Biased gene conversion: implications for genome and sex evolution. Trends Genet 19:330338.[CrossRef][ISI][Medline]
Needleman SB and Wunsch CD. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443453.[CrossRef][ISI][Medline]
Nei M and Gojobori T. (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 3:418426.[Abstract]
Nei M and Kumar S. (2000) Molecular evolution and phylogenetics. (Oxford University Press, Oxford).
Ogurtsov AY, Sunyaev S, Kondrashov AS. (2004) Indel-based evolutionary distance and mousehuman divergence. Genome Res 14:16101616.
Page RD. (1996) TreeView: an application to display phylogenetic trees on personal computers. Comput Appl Biosci 12:357358.
Rogozin IB, Lyons-Weiler J, Koonin EV. (2000) Intron sliding in conserved gene families. Trends Genet 16:430432.[CrossRef][ISI][Medline]
Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W. (2003) Humanmouse alignments with BLASTZ. Genome Res 13:103107.
Seeburg PH. (2002) A-to-I editing: new and old sites, functions and speculations. Neuron 35:1720.[CrossRef][ISI][Medline]
SmitAFA,HubleyR, Green P. 1996. RepeatMasker Seattle: Institute for Systems Biology. [Internet]. Available from: http://www.repeatmasker.org/.
Smith TF and Waterman MS. (1981) Identification of common molecular subsequences. J Mol Biol 147:195197.[CrossRef][ISI][Medline]
Springer MS, Murphy WJ, Eizirik E, O'Brien SJ. (2003) Placental mammal diversification and the Cretaceous-Tertiary boundary. Proc Natl Acad Sci USA 100:10561061.
Stoltzfus A, Logsdon JM Jr.,, Palmer JD, Doolittle WF. (1997) Intron "sliding" and the diversity of intron positions. Proc Natl Acad Sci USA 94:1073910744.
Tufte ER. (2001) The visual display of quantitative inform




