MBE Advance Access originally published online on December 19, 2007
Molecular Biology and Evolution 2008 25(4):634-642; doi:10.1093/molbev/msm281

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Mammalian Nonsynonymous Sites Are Not Overdispersed: Comparative Genomic Analysis of Index of Dispersion of Mammalian Proteins

Seong-Ho Kim and Soojin V. Yi

School of Biology, Georgia Institute of Technology

E-mail: soojinyi@gatech.edu.


    Abstract
It is often stated that patterns of nonsynonymous rate variation among mammalian lineages are more irregular than expected, or overdispersed, under the neutral model, whereas synonymous sites conform to the neutral model. Here we reexamined genome-wide patterns of the variance to mean ratio, or index of dispersion (R), of substitutions in proteins from human, mouse, and dog. Contrary to the prevailing notion, we found that the mean index of dispersion for nonsynonymous sites of mammalian proteins is not significantly different from 1. We propose that earlier analyses were biased because the data included disproportionately more protein hormones, which tend to be more dispersed than genes in other functional categories. Synonymous sites exhibit a greater degree of dispersion than nonsynonymous sites, although this dispersion is similar to earlier estimates and is potentially due to errors associated with correction for multiple hits. Overall, our analysis identifies a strong genome-wide generation-time effect and natural selection as important determinants of among-lineage variation of protein evolutionary rates. Furthermore, patterns of lineage-specific selective constraint are consistent with the nearly neutral model of molecular evolution.

Key Words: index of dispersion • lineage effects • neutral theory • nearly neutral theory • comparative genomics • gene ontology


    Introduction
A commonly used tool to evaluate the neutral theory of molecular evolution is the so-called index of dispersion (Kimura 1983; Gillespie 1986, 1989, 1991; Ohta 1995; Nielsen 1997; Cutler 2000; Wilke 2004). This statistic measures the ratio of variance to mean of the number of substitutions among lineages, which is expected to follow a Poisson distribution according to the strict neutral theory (Kimura 1983). Because the variance of a Poisson distribution equals its mean, the index of dispersion should be 1 under neutrality.

Formally, the index of dispersion is usually estimated as R = Var(N_i)/E(N_i), where N_i is the estimated number of substitutions in the i-th lineage. Studies of the index of dispersion began early in the history of molecular evolution (Langley and Fitch 1974; Kimura 1983). However, those earlier studies (before Gillespie's [1989] analysis) suffered from several methodological problems. The first issue arises from estimating the underlying phylogeny. Earlier studies approximated the mammalian radiation as a "star" phylogeny (Kimura 1983) or reconstructed ancestral sequences using parsimony (Langley and Fitch 1974). Such methods can introduce errors in branch lengths due to phylogenetic inaccuracy. Another potential problem is systematic variation of substitution rates among lineages, due to neutral factors such as generation times or mutation rates. Furthermore, methods to correct for multiple hits introduce additional errors (Bulmer 1989; Nielsen 1997).
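The definition above can be illustrated with a minimal Python sketch. The function name and the example counts are hypothetical, and the use of the unbiased sample variance (dividing by n - 1) is an assumption; the text does not state which variance estimator the earlier studies used.

```python
from statistics import mean, variance


def index_of_dispersion(counts):
    """Return R = Var(N_i) / E(N_i) for substitution counts across lineages."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean number of substitutions is zero; R is undefined")
    # statistics.variance is the unbiased sample variance (divides by n - 1);
    # this choice of estimator is an assumption of the sketch.
    return variance(counts) / m


# Hypothetical substitution counts for one gene in three lineages.
example_counts = [12, 30, 18]
r_example = index_of_dispersion(example_counts)
```

With these made-up counts the mean is 20 and the sample variance is 84, so R is 4.2; counts that are identical across lineages give R = 0.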

Gillespie (1989, 1991) improved upon these issues by using 3 species only, thereby avoiding problems associated with phylogenetic reconstruction (because there is only 1 possible star phylogeny of 3 species). Another significant technical improvement by Gillespie (1989) was the usage of lineage-weighting factors to account for the so-called lineage effect, which refers to the species-specific average rate of evolution.

Gillespie (1989) then analyzed 20 proteins from 3 mammalian species and estimated that R for nonsynonymous sites was 6.75, much larger than the neutral expectation of 1. Gillespie's (1989) R for synonymous sites was 4.64, considered only marginally different from 1. Thus, it was concluded that in mammals, nonsynonymous sites evolve nonneutrally, in "episodic" fashion, whereas synonymous sites deviated only marginally from the neutral pattern (Gillespie 1989, 1991). This view has since remained authoritative and is commonly reiterated in the literature (p. 231 in Li 1997; Cutler 2000; Wilke 2004).

Ohta (1995) analyzed a larger data set (49 mammalian proteins) and observed that R for synonymous sites (5.89) was comparable with, if slightly greater than, that for nonsynonymous sites (5.60). Interestingly, Zeng et al. (1998) showed that in Drosophila, R for nonsynonymous sites was 1.64, much less than that in mammalian proteins. In contrast, they found that R for synonymous sites was 4.37, significantly greater than 1. Zeng et al. (1998) interpreted this as evidence that different evolutionary forces affect synonymous site evolution in Drosophila and mammals.

Sequencing of several mammalian genomes brought an opportunity to examine genome-wide patterns of variance of evolutionary rates among lineages. Methods to correct for multiple hits have also significantly improved, and new analytical tools to investigate the roles of selection and mutation in protein evolution have become available. In addition, theoretical analyses of selective and neutral models of "overdispersion" have provided new insights into the behavior of the index of dispersion under a variety of evolutionary models (Gillespie 1993; Cutler 2000; Bastolla et al. 2003; Wilke 2004). Our study is motivated by these advances, aiming to gain a comprehensive understanding of the variance of the number of nucleotide substitutions in mammalian proteins.

Here, we further improve upon Gillespie's methods of estimating the index of dispersion in mammalian proteins by 1) using a codon-substitution model to correct for multiple hits, 2) deriving a statistically superior estimate of index of dispersion, and 3) using a large number of loci to estimate "lineage effects." For this purpose, we used genomic data from 3 lineages, primates, carnivores, and rodents.

Our analyses provide genome-wide lineage-weighting factors for each lineage, which clearly represent generation-time effects. The numbers of nonsynonymous and synonymous substitutions in each lineage are consistent with the nearly neutral theory of molecular evolution. Interestingly, our genome-wide analysis of index of dispersion reveals that the mean index of dispersion for mammalian nonsynonymous sites is substantially lower than previous estimates and not different from one.


    Materials and Methods
Sequence Data
We used data from 3 mammalian lineages, Primata (Homo sapiens), Rodentia (Mus musculus), and Carnivora (Canis familiaris). We obtained known gene sequences from these species from Ensembl (http://www.ensembl.org). The numbers of sequences from each species are 22,719 for H. sapiens, 26,452 for M. musculus, and 8,253 for C. familiaris. Next, we used the OrthoMCL program (Li et al. 2003) to identify orthologs among the 3 mammalian lineages with default parameters except for a more stringent E value of 10⁻¹⁰. This step identified 3,493 orthologs. We first aligned the amino acid sequences of these genes using ClustalW (Thompson et al. 1994) and then translated the alignments back to DNA sequence alignments.
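The back-translation step, in which a protein alignment is converted into a codon alignment, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, gaps are assumed to be written as "-", and the coding sequence is assumed to contain exactly one codon per aligned residue (no stop codon). The source does not describe the tool actually used for this step.

```python
def back_translate(aligned_protein, cds):
    """Thread the codons of an unaligned coding sequence onto an aligned protein.

    Each residue consumes the next codon; each gap character "-" becomes "---".
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = len(aligned_protein) - aligned_protein.count("-")
    if n_residues != len(codons):
        raise ValueError("number of residues does not match number of codons")
    out = []
    codon_index = 0
    for symbol in aligned_protein:
        if symbol == "-":
            out.append("---")
        else:
            out.append(codons[codon_index])
            codon_index += 1
    return "".join(out)


# Hypothetical toy example: two residues separated by one alignment gap.
codon_alignment = back_translate("M-K", "ATGAAA")
```

Aligning at the protein level and then threading codons keeps gaps in multiples of three, so the reading frame is preserved for codon-based divergence estimates.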

Estimating Sequence Divergence
We used a codon-based maximum likelihood method to estimate the numbers of pairwise synonymous and nonsynonymous substitutions, using PAML (Yang 1997). Because estimation of sequence divergence is not reliable when pairwise sequence divergence is at or beyond saturation, we placed upper limits on pairwise dN and dS to curate the data set before calculating the index of dispersion for each gene.

As the constraint becomes more stringent, the index of dispersion decreases for both synonymous and nonsynonymous sites because the number of substitutions can "limit" the variance of evolutionary rates (Cutler 2000). In the remainder of the paper, we mainly discuss results obtained when we restrict our data to cases in which pairwise dN and dS are less than 1.5, unless noted otherwise. All qualitative results remained the same when we did not constrain our data or used the more stringent criterion of dN, dS < 1. We then obtained lineage-specific numbers of substitutions from the pairwise estimates, as in Gillespie (1989) and Ohta (1995).
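As a rough sketch of how lineage-specific values can be recovered from pairwise estimates on a 3-species star phylogeny, the following Python snippet applies the standard three-point relation (each pairwise distance is the sum of two branch lengths). That this is exactly the procedure of Gillespie (1989) and Ohta (1995) is an assumption here; the function names and the example distances are hypothetical.

```python
def passes_constraint(pairwise_values, cutoff=1.5):
    """Return True if every pairwise dN or dS estimate is below the cutoff."""
    return all(d < cutoff for d in pairwise_values)


def star_branch_lengths(d_ab, d_ac, d_bc):
    """Solve d_ab = a + b, d_ac = a + c, d_bc = b + c for the branches a, b, c."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


# Hypothetical pairwise distances: human-mouse, human-dog, mouse-dog.
d_hm, d_hd, d_md = 0.6, 0.4, 0.7
if passes_constraint([d_hm, d_hd, d_md]):
    human, mouse, dog = star_branch_lengths(d_hm, d_hd, d_md)
```

For these made-up distances the branches are 0.15 (human), 0.45 (mouse), and 0.25 (dog), and each pair of branches sums back to the corresponding pairwise distance.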

Calculating the Indices of Dispersions
Reliable estimation of sequence divergence is critical for our calculation of the indices of dispersion. As mentioned above, we placed upper limits on pairwise dN and dS. In addition, we excluded genes with poorly aligned sequences, using a 95 percentile cutoff on the overall distribution of the proportion of ungapped aligned sites in the 3-species alignments. This cutoff corresponds to alignments in which the proportion of ungapped aligned sites in any pairwise alignment is less than 88%. Employing different cutoff values did not change our conclusions. The final number of alignments used to calculate the indices of dispersion (with the constraint of pairwise dN and dS < 1.5) is 2,932.

Gene Ontology
We downloaded gene ontology (GO) identities from Ensembl (http://www.ensembl.org) for each gene of H. sapiens, M. musculus, and C. familiaris. We then used the AmiGO database (http://www.godatabase.org) to extract the explanation for each GO identity. All GO identities were then categorized according to biological process, cellular component, and molecular function. The expectations of the index of dispersion and the number of synonymous and nonsynonymous substitutions of each GO identity were calculated.

An Improved Measure of Index of Dispersion, Rb
We improved upon Gillespie's (1989) commonly used measure of the index of dispersion. We first obtained the minimum variance unbiased estimator (MVUE) of the mean number of substitutions for a particular locus (Formula). We also derived an unbiased estimator of the variance of the number of substitutions for a particular locus (Formula) and proved that it has less variance than that of Gillespie's estimator. Therefore, we can improve Gillespie's expression (Formula), obtaining a statistic that has lower variance than Gillespie's (1989) statistic. Details are presented in the Appendix.


    Results
Numbers of Synonymous and Nonsynonymous Substitutions in 3 Mammalian Lineages and the Genome-Wide Lineage-Weighting Factors
We obtained lineage-specific numbers of nucleotide substitutions for each gene. Figure 1 shows the mean numbers of synonymous and nonsynonymous substitutions per site in the 3 mammalian genomes. The branch lengths are the longest for the mouse (Rodentia), followed by the dog (Carnivora) and the human (Primata), for both synonymous and nonsynonymous sites (fig. 1). Our estimates of lineage-specific branch lengths and pairwise divergences are well in accord with previous estimates (Makalowski and Boguski 1998; Mouse Genome Sequencing Consortium 2002; Lindblad-Toh et al. 2005).


FIG. 1.— The star phylogeny of the 3 lineages analyzed in this study. Branch lengths are the estimated numbers of substitution per site, calculated using a maximum likelihood method based on a codon-substitution model using PAML.

 
The order of lineage-specific branch lengths is inversely related to the generation time of each lineage, supporting a generation-time effect; current estimates of generation times of human, dog, and mouse are 25, 4, and 0.5 years, respectively (Keightley and Eyre-Walker 2000). Furthermore, the generation-time effect is more conspicuous for synonymous substitutions than for nonsynonymous substitutions (the ratio of branch lengths of Rodentia:Carnivora:Primata is 2.78:1.45:1 for synonymous sites, compared with 2.14:1.17:1 for nonsynonymous sites), suggesting that synonymous sites are more strongly influenced by neutral factors that differ among lineages, such as mutation rates.

Table 1 shows the lineage-weighting factors and the average numbers of nonsynonymous and synonymous substitutions per lineage. The lineage-weighting factors are calculated by dividing the number of lineage-specific substitutions by the mean number of substitutions in all 3 lineages, to account for the genome-wide differences in branch lengths between lineages (Gillespie 1989). Previous studies of the index of dispersion in mammals (Gillespie 1989; Ohta 1995) used Artiodactyla instead of Carnivora for comparisons. It appears that Artiodactyla and Carnivora are both intermediate between rodents and primates in terms of their relative branch lengths and the resulting lineage effects. For example, among 49 genes, the lineage-weighting factors for Primata:Rodentia:Artiodactyla were 0.61:1.58:0.82 for synonymous sites (0.75:1.28:0.97 for nonsynonymous sites) (Ohta 1995). These numbers are similar to those obtained from the current analyses (table 1).
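The weighting calculation described above can be sketched in Python. As an illustration, the snippet uses the relative synonymous branch lengths quoted in the text (Rodentia:Carnivora:Primata = 2.78:1.45:1) in place of actual substitution counts; because the weights are ratios to the mean, the overall scale cancels. The function name is hypothetical, and the resulting values are not taken from table 1 and may differ from it.

```python
from statistics import mean


def lineage_weights(branch_lengths):
    """Divide each lineage's value by the mean over all lineages."""
    m = mean(branch_lengths.values())
    return {lineage: value / m for lineage, value in branch_lengths.items()}


# Relative synonymous branch lengths quoted in the text (Primata set to 1).
synonymous_relative = {"Rodentia": 2.78, "Carnivora": 1.45, "Primata": 1.0}
weights = lineage_weights(synonymous_relative)
rounded = {lineage: round(w, 2) for lineage, w in weights.items()}
```

By construction the weights average to 1 across the 3 lineages; here they come out at roughly 1.59, 0.83, and 0.57 for Rodentia, Carnivora, and Primata.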


Table 1 Lineage Weights (Numbers of Substitutions) Estimated from Genome-Wide Comparison of 3 Species

 
Genome-Wide Index of Dispersion (R) in Mammals
Now we examine the genome-wide indices of dispersion in mammals. We calculated the indices of dispersion using 3 different methods: Kimura's (1983) method (Rk), Gillespie's (1989) method (Rg), and our new method (Rb), which improves upon Gillespie's method (see Materials and Methods and Appendix for details). Table 2 presents a reanalysis of the 20 loci used in Gillespie's (1989) analysis. Overall, Gillespie's estimator and the new estimator appear similar, whereas Kimura's estimator gives the largest dispersion. This is because Kimura's measure does not correct for lineage effects (see below).


Table 2 Comparison of the 3 Methods of Estimating the Index of Dispersion (R), Using 20 Loci as Included in table 1 of Gillespie (1989)

 
We present genome-wide mean indices of dispersion for synonymous and nonsynonymous sites obtained by these 3 methods in table 3. Indices of dispersion when different levels of constraints were used are also presented for comparison. Note that Rk, which does not correct for lineage-specific weights, always shows the largest dispersion. As we correct for lineage effect using lineage-weighting factor (i.e., in case of Rg and Rb), indices decrease. The behaviors of Rg and Rb are similar. As expected, when we constrain data to allow smaller pairwise dN and dS, indices decrease.
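Why correcting for lineage effects lowers the index can be seen in a small Python sketch. It assumes, purely for illustration, that the correction divides each lineage's count by its weighting factor before the variance to mean ratio is taken; the exact Rg and Rb estimators are defined in Gillespie (1989) and in the Appendix and are not reproduced here. All names and numbers are hypothetical.

```python
from statistics import mean, variance


def dispersion(counts):
    """Variance to mean ratio, using the unbiased sample variance."""
    return variance(counts) / mean(counts)


def weight_adjusted_dispersion(counts, weights):
    """Rescale each lineage's count by its weight, then take variance / mean.

    Dividing by the weight is an assumption of this sketch, not the
    estimator derived in the paper.
    """
    adjusted = [n / w for n, w in zip(counts, weights)]
    return dispersion(adjusted)


# Hypothetical gene whose counts differ only through genome-wide lineage effects.
counts = [10, 20, 30]
weights = [0.5, 1.0, 1.5]
r_unweighted = dispersion(counts)
r_adjusted = weight_adjusted_dispersion(counts, weights)
```

In this toy case the unweighted ratio is 5, whereas the adjusted ratio is 0, because all among-lineage variation is absorbed by the weights.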


Table 3 Mean Index of Dispersion for Synonymous and Nonsynonymous Sites

 
Our new estimates of the index of dispersion are 8.04 for synonymous sites and 4.42 for nonsynonymous sites, for the data set in which pairwise dN and dS were less than 1.5. All results were qualitatively similar when data obtained under different constraints were used or when we considered genes with greater than 100 synonymous sites only, to avoid errors introduced by short genes (results not shown).

We tested whether the estimates of indices are significantly different from the neutral expectation of 1. Kimura (1983) suggested using a chi-square distribution for this purpose. Gillespie (1989) showed, by simulation, that the chi-square approximation is appropriate for nonsynonymous sites but not for synonymous sites. Therefore, we used critical values obtained in Gillespie (1989) for our test (table 4). According to these critical values, the new measure of index of dispersion for nonsynonymous sites (referred to as Rbn) is significantly different from 1 for 960 (at 5% significance level) genes out of 2,932 genes in the whole data set. In comparison, the index of dispersion for synonymous sites (referred to as Rbs) is significantly different from 1 for 875 (at 5% significance level) genes. The index of dispersion for synonymous sites (Rbs = 8.04) is larger than those of Gillespie (1989, R = 4.6) and Ohta (1995, R = 5.89). With increasing constraint, the new estimate of the index of dispersion for synonymous sites approaches that in Ohta (table 3). At 1% significance level, mean indices of dispersion for both synonymous and nonsynonymous sites are statistically equivalent to 1.
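For orientation only, the chi-square approach can be sketched for the 3-lineage case in Python. The sketch assumes the standard Poisson dispersion test, in which (n - 1)R is compared with a chi-square distribution with n - 1 degrees of freedom; with n = 3 the upper tail probability of a chi-square variable with 2 degrees of freedom has the closed form exp(-x/2). The tests reported in the text instead use the simulated critical values of Gillespie (1989) (table 4), which this snippet does not reproduce. Function names are hypothetical.

```python
import math


def chi_square_p_value_three_lineages(r):
    """Upper tail probability for R estimated from n = 3 lineages."""
    statistic = 2 * r  # (n - 1) * R with n = 3
    # Survival function of a chi-square variable with 2 degrees of freedom.
    return math.exp(-statistic / 2)


def critical_r_three_lineages(alpha):
    """Smallest R rejected at level alpha under the same approximation."""
    return -math.log(alpha)


p_example = chi_square_p_value_three_lineages(3.0)
cutoff_5_percent = critical_r_three_lineages(0.05)
cutoff_1_percent = critical_r_three_lineages(0.01)
```

Under this approximation the 5% and 1% cutoffs for R are about 3.00 and 4.61; these are not the values of table 4.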


Table 4 Cutoff values for Significance of the Indices of Dispersion

 
The index of dispersion is greater for synonymous sites than for nonsynonymous sites, the opposite of the previously reported pattern in mammals (Gillespie 1989) and qualitatively similar to the pattern observed in Drosophila (Zeng et al. 1998). At the least, we do not observe the pattern concluded by the influential analysis of Gillespie (1989), namely that nonsynonymous sites show clear overdispersion whereas synonymous sites evolve neutrally.

It should be noted that correcting for multiple hits introduces another source of variance for estimating the index of dispersion (Bulmer 1989; Gillespie 1989; Ohta 1995; Nielsen 1997). Hence, we consider our estimates as overestimates compared with true values. It follows that our conclusion that mammalian nonsynonymous sites are not overdispersed is a conservative one (see Discussion). In comparison, the index of dispersion for synonymous sites may be substantially affected by errors due to multiple hit correction.

Genome-Wide Factors on Mammalian Index of Dispersion
Overdispersion of protein molecular evolution in mammals has been attributed to effects of genetic drift and natural selection. Gillespie (1989) emphasized the role of natural selection and changing environments in overdispersion of the protein molecular clock. In other words, episodic positive selection will inflate the variance of evolutionary rates. On the other hand, strong purifying selection can suppress the variance of evolutionary rates.

The opposing effects of positive and negative natural selection on variance of evolutionary rates predict that index of dispersion and selective constraint (measured as the ratio of rates of nonsynonymous substitution to synonymous substitutions, dN/dS) should be correlated. Indeed, we found that the index of dispersion for nonsynonymous sites is strongly positively correlated with dN/dS (Spearman's rank correlation coefficient ρ between Rbn and dN/dS = 0.32, P < 10⁻⁴). This relationship did not change when we changed the constraint in the data set. In contrast, the correlation between Rbs and dN/dS was not significant, suggesting that natural selection on amino acid sequence is not a strong determinant of variance of synonymous rates.
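Spearman's rank correlation, the statistic reported above, is the Pearson correlation of the ranks of the two variables. A minimal Python sketch follows; the function names and the toy values are hypothetical, ties receive average ranks, and no significance test is included.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread_x = sum((a - mx) ** 2 for a in rx) ** 0.5
    spread_y = sum((b - my) ** 2 for b in ry) ** 0.5
    return covariance / (spread_x * spread_y)


# Hypothetical per-gene values of Rbn and dN/dS.
rbn_values = [1.2, 3.5, 0.8, 6.1]
dn_ds_values = [0.10, 0.20, 0.05, 0.30]
rho_example = spearman_rho(rbn_values, dn_ds_values)
```

Because only ranks are used, the statistic is insensitive to the skewed distributions typical of R and dN/dS.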

We showed that for approximately 60% of genes, variance of nonsynonymous rates can be explained by statistical fluctuations under a simple Poisson model. This suggests that the majority of mammalian proteins are under purifying selection that is strong enough to suppress variance of evolutionary rates caused by other factors. This can be further demonstrated by the observation that average selective constraint for the genes whose Rbn is significantly greater than 1 (at 1% significance level) is weaker than in the rest of the data. For example, between human and mouse, average dN/dS of genes whose Rbn > 1 is 0.173, whereas the mean selective constraint is 0.127 for the whole data. In comparison, average selective constraint for genes whose Rbn is not significantly different from 1 is 0.114. The difference between mean selective constraints of the 2 groups of genes (divided by whether Rbn is greater than 1 or not) is highly significant in all 3 pairwise species comparisons (P < 10⁻⁴ in all comparisons). Furthermore, we show in the next section that genes encoding essential functions such as transcription regulation tend to exhibit a lower index of dispersion than proteins in other functional categories, which attests to the effect of purifying selection in suppressing variance of nonsynonymous rates.

According to the neutral models of overdispersion (Takahata 1987; Ohta 1992), the numbers of substitutions among lineages can vary above the Poisson expectation due to factors such as varying effective population sizes, weakly deleterious mutations, and "fluctuating neutral space." For example, in Ohta's (1992, 1995) theory, different lineages may experience different rates of slightly deleterious substitutions, if their effective population sizes differ. According to this hypothesis, a significant correlation between the mean number of substitutions and the index of dispersion is expected (Ohta 1995; Zeng et al. 1998). This relationship is expected to be much stronger when an unweighted measure of index of dispersion (e.g., Rk) is used.

Indeed, Rk and the mean number of substitutions are strongly correlated in our data (correlation coefficient greater than 0.6 for both synonymous and nonsynonymous sites). When we use weighted measures of index of dispersion (i.e., Rg or Rb), we still find strong correlation between the mean number of substitutions and the index of dispersion for both synonymous (e.g., in case of Rb, Spearman's rank correlation coefficient ρ = 0.33, P < 10⁻⁴) and nonsynonymous (ρ = 0.39, P < 10⁻⁴) sites. This suggests that lineage-specific factors for each gene (even after removing the genome-wide lineage effects) are significant determinants of rate variation among lineages, for both synonymous and nonsynonymous sites.

Lineage-specific branch lengths (as compared with the mean number of substitutions analyzed above) and the indices of dispersion should also be correlated according to this hypothesis (Zeng et al. 1998). Consistent with this idea, all 3 lineages showed significant correlation between dN or dS and Rb (table 5). We also observe a significant correlation between Rbs and Rbn (ρ = 0.1790, P < 10⁻⁴). Hence, synonymous and nonsynonymous substitutions in each protein appear to be evolving under similar evolutionary forces. As Gillespie (1989) pointed out, this may reflect gene-specific mutation rates.


Table 5 Spearman's Rank Correlations between Lineage-Specific Branch Lengths (dS or dN) and Indices of Dispersion

 
In conclusion, we find evidence that variance in the numbers of nucleotide substitutions in nonsynonymous sites among the 3 mammalian lineages is caused by positive and negative selection, lineage-specific effects, and mutation rates. Variance in the numbers of synonymous substitutions can be partially explained by gene-specific mutation rates and potential lineage-specific effects.

Relationship between Index of Dispersion and GO
Recent efforts to classify proteins according to their functions (The Gene Ontology Consortium 2000) have advanced our understanding of specific selective forces acting on different functional classes of genes (e.g., The Chimpanzee Sequencing and Analysis Consortium 2005). The index of dispersion may vary according to functional classifications of proteins because the types and the strength of natural selection may vary among functional categories or because proteins encoding specific functions are subject to different neutral factors. To test this hypothesis, we divided our data into different GO categories. We chose GO terms with more than 10 genes per GO term and examined indices of dispersion for synonymous and nonsynonymous sites. The results are shown in table 6.


Table 6 Mean Index of Dispersion for Proteins in Different GO Terms (Number of Proteins Analyzed)

 
Proteins belonging to different GO terms exhibit different degrees of dispersion in nonsynonymous sites. Among the GO terms examined, GO0006508 (proteolysis) and GO0005179 (hormone activity) show the greatest dispersion (6.95 and 8.11, respectively). In contrast, GO0006355 (regulation of transcription) and GO0003700 (transcription factor activity) exhibited little variation in evolutionary rates among lineages (Rbn = 2.15 and 2.18, respectively). Synonymous sites showed relatively little difference among GO terms, with the exception of the GO0005634 (nucleus), which exhibited a large index of dispersion (Rbs = 15.96).

We tested whether we may observe such high and low values of index of dispersion by chance, by randomly choosing the same number of genes from the total sample and calculating the index of dispersion for that group. We repeated this procedure 1,000 times. We found that the index of dispersion for nonsynonymous sites for the GO0006508 (proteolysis) was marginally significant by this analysis (P = 0.09). Similarly, the average index of dispersion of the GO0005179 (hormone activity) is greater than obtained by this bootstrapping approach at 8% level. In other words, genes belonging to the GO0005179 tend to show overdispersion of nonsynonymous sites compared with the rest of the genome.
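The resampling test described above can be sketched in Python as follows. The function name, the default seed, and the toy values are hypothetical, and drawing genes without replacement is an assumption; the text says only that the same number of genes was chosen at random from the total sample 1,000 times.

```python
import random
from statistics import mean


def resampling_p_value(all_r, category_r, n_rep=1000, seed=0):
    """Fraction of random gene sets whose mean R is at least the observed mean."""
    rng = random.Random(seed)
    observed = mean(category_r)
    set_size = len(category_r)
    hits = 0
    for _ in range(n_rep):
        # Draw the same number of genes as in the GO term (without replacement).
        sample = rng.sample(all_r, set_size)
        if mean(sample) >= observed:
            hits += 1
    return hits / n_rep


# Hypothetical genome-wide indices and one GO term with unusually high values.
genome_r = [1.0] * 100
go_term_r = [5.0] * 10
p_high = resampling_p_value(genome_r, go_term_r)
```

A small value indicates that a gene set of that size rarely reaches the observed mean by chance; the analogous test for unusually low dispersion would count samples at or below the observed mean.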

We propose that different degrees of overdispersion for proteins in different functional categories can explain the qualitative discrepancy between our results and those of Gillespie (1989), who concluded that mammalian nonsynonymous sites were significantly overdispersed. In Gillespie's data set of 20 genes, 4 proteins (prolactin, parathyroid, glycoprotein hormone alpha subunit [GPHA], and growth hormone) belong to GO0005179 (hormone activity) (see table 2). Hence, his estimate of R may have been inflated due to functional bias in the data. If we exclude those 4 genes from his data set, the average R for nonsynonymous sites becomes 3.02, only marginally different from 1 and similar to our genome-wide estimate. Zeng et al. (1998) also noticed that Gillespie's data set contained many protein hormones and proposed that sampling bias may have inflated the estimate of the index of dispersion in mammalian nonsynonymous sites.

According to our bootstrapping analysis, synonymous sites of proteins in the GO0005634 (nucleus) exhibit significant (P = 0.02) overdispersion. This observation is intriguing and demands future investigation of causes of synonymous rate variation in mammals (see Discussion).


    Discussion
Analysis of index of dispersion has served at least 2 important functions. First, the usage of lineage-weighting factors has been informative to understand patterns and causes of lineage-specific evolutionary rates. In the current study, we provide updated estimates of the lineage factors and the genome-wide index of dispersion, obtained using a large number of loci from 3 mammalian genomes and improving upon previous statistical methods. Our measures of genome-wide lineage-weighting factor support the generation-time effect; species with longer generation time tend to have smaller lineage-weighting factor. Generation-time effect is more prominent in synonymous sites than in nonsynonymous sites, consistent with the idea that synonymous sites are under greater influence of neutral force, such as mutation rates.

We observe that selection strength has been reduced in the primate lineage compared with the other 2 mammalian lineages. The mean dN/dS is larger in the primate lineage (0.185) than it is in rodents (0.118) or in carnivores (0.136). This is consistent with the idea that reduced effective population size (Ne) can promote fixation of weakly deleterious mutations (Ohta 1992, 1997). Even though the ancestral effective population sizes of different mammalian lineages are not known, the current effective population size of rodents is considered to be in the range of 10⁵ (Ideraabdullah et al. 2004), whereas in primates it is on the order of 10⁴ (Takahata 1993; Harding et al. 1997; Chen and Li 2001). The Ne of carnivores is less well understood, although some studies have indicated an Ne of several times 10⁴ (Nei and Graur 1984; Spong et al. 2000). Therefore, reduction of Ne in primates may have accelerated fixation of mildly deleterious amino acid substitutions.

Second, different values of indices of dispersion for synonymous and nonsynonymous sites were taken as evidence of a fundamental difference in the underlying evolutionary mechanisms acting on them. In Gillespie's (1989) influential analysis of mammalian proteins, synonymous sites appeared to have an index of dispersion not different from 1, whereas nonsynonymous sites had a greater than expected index of dispersion under the neutral model. Based upon this observation, it has often been stated that in mammals, synonymous sites are mostly evolving in neutral fashion, whereas nonsynonymous sites are overdispersed due to nonneutral forces (Gillespie 1989, 1991; Li 1997; Cutler 2000; Wilke 2004).

Contrary to the aforementioned view, we here show that mammalian nonsynonymous sites are on the whole not overdispersed as previously thought. Genes subject to stronger purifying selection (as measured by dN/dS) tend to have lower index of dispersion, suggesting that negative (purifying) selection is an important force to reduce variance of evolutionary rates among lineages. In addition, genes in specific functional categories as defined by GO terms exhibit different among-lineage variation of evolutionary rates. For example, genes involved in regulation of transcription tend to show lower index of dispersion, whereas protein hormones tend to have large index of dispersion (table 6). In particular, earlier estimates may have been biased because they included analyses of protein hormones, which tend to exhibit greater variation among lineages than the rest of the genome.

Many protein hormones exhibit lineage-specific episodic molecular evolution in mammals (Wallis 2001; Maston and Ruvolo 2002; Opazo et al. 2005; Yi and Li 2007). The observation that protein hormones generally show significant overdispersion is therefore consistent with Gillespie's (1986, 1989, 1991) proposal that bursts of adaptive evolution can cause overdispersion. However, even though this hypothesis is well in accord with the data presented here, there are some theoretical problems with this proposal. Namely, to cause an overdispersion, the rate of environmental fluctuation should be roughly the same as the rate of a substitution (Gillespie 1993; Cutler 2000), which is estimated to be in millions of years for a typical nonsynonymous change in mammals (Smith and Eyre-Walker 2003), appearing too slow compared with the perceived timescale of ecological habitat changes (see Smith and Eyre-Walker [2003] for alternative possibilities).

Estimates of the index of dispersion can be inflated due to errors associated with correction for multiple hits, even with sophisticated evolutionary models (Bulmer 1989; Goldman 1994; Nielsen 1997; Yang and Nielsen 1998). Given that the number of nonsynonymous substitutions per site is generally far below saturation within mammals, multiple hit correction is not likely to be a significant source of error for nonsynonymous sites. For example, Gillespie (1989) showed, by simulation, that with the Jukes–Cantor method, the inflation of the index of dispersion due to multiple hit correction is only approximately 10% for nonsynonymous sites. Note that variance due to multiple hit correction makes our main conclusion (that the mean index of dispersion for mammalian nonsynonymous sites is not significantly different from 1) a conservative one. However, for synonymous sites, which have generally undergone a much greater number of substitutions than nonsynonymous sites, errors associated with multiple hit correction can inflate the index of dispersion substantially.
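The dependence of this error on the level of divergence can be illustrated with the Jukes–Cantor correction itself. The following is a minimal sketch and not the simulation performed by Gillespie (1989): it uses the standard Jukes–Cantor distance and its large-sample (delta-method) variance, and compares that variance with the raw binomial variance of the observed proportion of differing sites. The function names and the example proportions (0.03 and 0.3) are hypothetical choices made for illustration only.

```python
import math


def jc_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def jc_variance(p, n_sites):
    """Large-sample (delta-method) variance of the Jukes-Cantor distance."""
    return p * (1 - p) / (n_sites * (1 - 4 * p / 3) ** 2)


def inflation_factor(p):
    """Variance of the corrected distance relative to the binomial variance of p."""
    return 1 / (1 - 4 * p / 3) ** 2


# Hypothetical divergence levels: low (nonsynonymous-like) and high (synonymous-like).
for p in (0.03, 0.3):
    print(p, round(jc_distance(p), 4), round(inflation_factor(p), 3))
```

Under these assumptions, the correction amplifies sampling noise only slightly when few sites differ and much more strongly as the proportion of differing sites grows, which is the qualitative pattern described above.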

Thus, we refrain from overinterpreting our observation that the index of dispersion is generally greater for synonymous sites than for nonsynonymous sites. Nevertheless, it should be noted that a variety of factors that affect synonymous rates in a lineage- and gene-specific manner have been discovered in recent years. For example, neutral evolutionary rates can vary greatly due to different CpG contents (Kim et al. 2006) or changes in recombination rates (Perry and Ashworth 1999; Montoya-Burgos et al. 2003; Meunier and Duret 2004; Yi and Li 2005). Synonymous sites may be under selective constraint related to mechanisms such as RNA editing, microRNA binding, and conservation of splice signals (Chamary et al. 2006). Remarkably, a synonymous polymorphism in the human MDR1 locus has been shown to change the protein's biochemical properties, in this case its ability to pump certain molecules (Kimchi-Sarfaty et al. 2007). Thus, synonymous sites are potentially subject to positive selection. If so, this has significant implications for several aspects of molecular evolutionary analysis. For example, the assumption of a molecular clock for synonymous sites in evolutionary inference, and the use of synonymous rates as a reference to detect positive selection, need to be applied with caution. Analyses of proteins with significantly overdispersed synonymous sites may offer insights into these issues.


    Conclusions
As stated above, our study refutes the canonical view that mammalian nonsynonymous sites exhibit episodic evolution. One caveat of our study is that we considered only genes that have remained single copy since the divergence of the 3 mammalian lineages. Lineage-specific gene duplication occurs frequently, and duplicated genes may acquire new functions and undergo episodic molecular evolution.

Several possible causes of overdispersion for synonymous and nonsynonymous sites are discussed. The fact that different functional categories of proteins can exhibit different degrees of among-lineage variation of rates attests to the role of natural selection in the dispersion of evolutionary rates. The index of dispersion may be useful in determining selective and neutral factors that lead to episodic, positive selection for both synonymous and nonsynonymous sites of mammalian proteins.


    Appendix
Statistical Properties of the New Measure of Index of Dispersion, Rb
Let the number of substitutions at the m-th locus on the i-th lineage be Nm,i. Assuming that the Nm,i's are independent random variables, we will write their moments in a form that allows the removal of lineage effects: Formula with probability mass function (pmf) Formula.

We now consider testing the null hypothesis Formula for all i, versus the alternative hypothesis Formula. For brevity, we substitute Formula in all expressions above. Then H0: λm,i = µm for all i, versus Formula, where Formula are random variables with pmf Formula.

A New Estimator of Mean
We aim to find the minimum variance unbiased estimator (MVUE) of the mean number of substitutions for a particular locus (Formula m) under H0. Because λm,i = µm for all i, we have the log-likelihood function

Formula

Because we have an exponential family, we obtain

Formula (1)
using Lemma 7.3.1 on p. 312 in Casella and Berger (1990).

As a first guess, consider the statistic Formula. The pmf of Formula and Formula. Thus,

Formula (2)
Therefore, Formula is an unbiased estimator of Formula. We next calculate

Formula (3)

According to equation (2), we find Formula. Then, by the Cramér–Rao theorem (theorem 7.3.1, p. 308, Casella and Berger 1990) and equations (1) and (3), Formula.

Therefore, Formula is an MVUE, that is, a best unbiased estimator. Furthermore, we can infer that Formula is the unique MVUE because Formula is a complete sufficient statistic.

Alternatively, we can see directly that the variance of the new estimator of the mean number of substitutions for a particular locus under H0 is no greater than that of Gillespie's (1989) estimator. Gillespie's estimator of the mean number is Formula and the new estimator is Formula. The difference Formula of the variances of the 2 estimators is Formula. We then have Formula from the fact that the arithmetic mean is greater than or equal to the harmonic mean.
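The variance comparison can be checked numerically. The following is a minimal sketch, not the authors' implementation. Because the formulas are not reproduced in this text, it rests on stated assumptions: that Nm,i is Poisson with mean wi µm under H0, with lineage weights wi > 0 summing to n (as in the variance section that follows), that Gillespie's estimator is the average of the weight-corrected counts, (1/n) Σ Nm,i/wi, and that the new estimator is the pooled average, (1/n) Σ Nm,i. The function names and example weights are hypothetical.

```python
from fractions import Fraction


def normalize_weights(raw):
    """Scale positive lineage weights so that they sum to n, the number of lineages."""
    n = len(raw)
    total = sum(raw)
    return [Fraction(x) * n / total for x in raw]


def var_weight_corrected_mean(weights, mu):
    """Variance of (1/n) * sum(N_i / w_i) when N_i ~ Poisson(w_i * mu) (assumed form)."""
    n = len(weights)
    return mu * sum(1 / w for w in weights) / n**2


def var_pooled_mean(weights, mu):
    """Variance of (1/n) * sum(N_i) when N_i ~ Poisson(w_i * mu) (assumed form)."""
    n = len(weights)
    return mu * sum(weights) / n**2


weights = normalize_weights([1, 2, 3])  # hypothetical lineage effects
mu = Fraction(10)                       # hypothetical mean number of substitutions
v_corrected = var_weight_corrected_mean(weights, mu)
v_pooled = var_pooled_mean(weights, mu)
print(v_corrected, v_pooled, v_corrected - v_pooled)
```

Under these assumptions the difference equals (µm/n)[(1/n) Σ 1/wi − 1], which is nonnegative because the wi have arithmetic mean 1 and therefore harmonic mean at most 1. The two variances coincide only when all lineage weights are equal.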

A New Estimator of Variance
We would next like to find an estimator of the variance that has less variance than Gillespie's estimator. Let Formula be Gillespie's estimator of the variance and Formula the new estimator of the variance, where Σ Formula wi = n, wi > 0 for all i, E(Nm,i) = wi µm, and Var(Nm,i) = wi σ Formula.

First, we look for the mean of the new estimator of the variance. The mean of Formula, revealing that the new estimator is an unbiased estimator of the variance.

Hence, the difference of 2 variances above is

Formula
where

Formula
Therefore, Formula because µm ≥ 0 under H0 and A, B, and C are greater than or equal to zero.


    Acknowledgements
We thank the members of the Yi laboratory, Dr. Koichiro Tamura, and 2 anonymous reviewers for valuable comments. Dr. Tomoko Ohta has kindly provided comments and pointed out a mistake in an earlier version of the manuscript. This study is supported by funds from the Georgia Institute of Technology.


    Footnotes
 
1 Present address: Division of Biostatistics, School of Medicine, Indiana University, Indianapolis, IN.

Koichiro Tamura, Associate Editor


    References
    Bastolla U, Porto M, Roman HE, Vendruscolo M. Connectivity of neutral networks, overdispersion, and structural conservation in protein evolution. J Mol Evol (2003) 56:243–254.

    Bulmer M. Estimating the variability of substitution rates. Genetics (1989) 123:615–619.

    Casella G, Berger RL. Statistical inference (1990) Belmont (CA): Duxbury Press.

    Chamary JV, Parmley JL, Hurst LD. Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat Rev Genet (2006) 7:98–108.

    Chen FC, Li W-H. Genomic divergence between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am J Hum Genet (2001) 68:444–456.

    Cutler DJ. Understanding the overdispersed molecular clock. Genetics (2000) 154:1403–1417.

    Gillespie JH. Natural selection and the molecular clock. Mol Biol Evol (1986) 3:138–155.

    Gillespie JH. Lineage effects and the index of dispersion of molecular evolution. Mol Biol Evol (1989) 6:636–647.

    Gillespie JH. The causes of molecular evolution (1991) Oxford (UK): Oxford University Press.

    Gillespie JH. Substitution processes in molecular evolution I. Uniform and clustered substitutions in a haploid model. Genetics (1993) 134:971–981.

    Goldman N. Variance to mean ratio, R(t), for Poisson processes on phylogenetic trees. Mol Phylogenet Evol (1994) 3:230–239.

    Harding RM, Fullerton SM, Griffiths RC, Bond J, Cox MJ, Schneider JA, Mouline DS, Clegg JB. Archaic African and Asian lineages in the genetic ancestry of modern humans. Am J Hum Genet (1997) 60:772–789.

    Ideraabdullah FY, de la Casa-Esperon E, Bell TA, Detwiler DA, Magnuson T, Sapienza C, de Villena FP-M. Genetic and haplotype diversity among wild-derived mouse inbred strains. Genome Res (2004) 14:1880–1887.

    Keightley PD, Eyre-Walker A. Deleterious mutations and the evolution of sex. Science (2000) 290:331–333.

    Kim S-H, Elango N, Warden CD, Vigoda E, Yi S. Heterogeneous genomic molecular clocks in primates. PLoS Genet (2006) 2:e163.

    Kimchi-Sarfaty C, Oh JM, Kim I-W, Sauna ZE, Calcagno AM, Ambudkar SV, Gottesman MM. A "silent" polymorphism in the MDR1 gene changes substrate specificity. Science (2007) 315:525–528.

    Kimura M. The neutral theory of molecular evolution (1983) Cambridge (UK): Cambridge University Press.

    Langley CH, Fitch WM. An examination of the constancy of the rate of molecular evolution. J Mol Evol (1974) 3:161–177.

    Li L, Stoeckert CJJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res (2003) 13:2178–2189.

    Li W-H. Molecular evolution (1997) Sunderland (MA): Sinauer.

    Lindblad-Toh K, Wade CM, Mikkelsen TS, et al. (46 co-authors). Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature (2005) 438:803–819.

    Makalowski W, Boguski MS. Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc Natl Acad Sci USA (1998) 95:9407–9412.

    Maston GA, Ruvolo M. Chorionic gonadotropin has a recent origin within primates and an evolutionary history of selection. Mol Biol Evol (2002) 19:320–335.

    Meunier J, Duret L. Recombination drives the evolution of GC-content in the human genome. Mol Biol Evol (2004) 21:984–990.

    Montoya-Burgos JI, Boursot P, Galtier N. Recombination explains isochores in mammalian genomes. Trends Genet (2003) 19:128–130.

    Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature (2002) 420:520–562.

    Nei M, Graur D. Extent of protein polymorphism and the neutral mutation theory. Evol Biol (1984) 17:73–118.

    Nielsen R. Robustness of the estimator of the index of dispersion for DNA sequences. Mol Phylogenet Evol (1997) 7:346–351.

    Ohta T. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst (1992) 23:263–286.

    Ohta T. Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. J Mol Evol (1995) 40:56–63.

    Ohta T. Role of random genetic drift in the evolution of interactive systems. J Mol Evol (1997) 44:S9–S14.

    Opazo JC, Palma RE, Melo F, Lessa EP. Adaptive evolution of the insulin gene in Caviomorph rodents. Mol Biol Evol (2005) 22:1290–1298.

    Perry J, Ashworth A. Evolutionary rate of a gene affected by chromosomal position. Curr Biol (1999) 9:987–989.

    Smith NG, Eyre-Walker A. Partitioning the variation in mammalian substitution rates. Mol Biol Evol (2003) 20:10–17.

    Spong G, Johansson M, Björklund M. High genetic variation in leopards indicates large and long-term stable effective population size. Mol Ecol (2000) 9:1773–1782.

    Takahata N. On the overdispersed molecular clock. Genetics (1987) 116:169–179.

    Takahata N. Allelic genealogy and human evolution. Mol Biol Evol (1993) 10:2–22.

    The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature (2005) 437:69–87.

    The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nat Genet (2000) 25:25–29.

    Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.

    Wallis M. Episodic evolution of protein hormones in mammals. J Mol Evol (2001) 53:10–18.

    Wilke CO. Molecular clock in neutral protein evolution. BMC Genetics (2004) 5:25.

    Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci (1997) 13:555–556.

    Yang Z, Nielsen R. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J Mol Evol (1998) 46:409–418.

    Yi S, Li W-H. Molecular evolution of recombination hotspots and highly recombining pseudoautosomal regions in hominoids. Mol Biol Evol (2005) 22:1223–1230.

    Yi S, Li W-H. Episodic molecular evolution of some protein hormones in primates and its implications for primate adaptation. In: Ravosa MJ, Dagosto M, eds. Primate origins: adaptations and evolution (2007) New York: Springer. 739–773.

    Zeng L, Comeron J, Chen B, Kreitman M. The molecular clock revisited: the rate of synonymous vs. replacement change in Drosophila. Genetica (1998) 102–103:369–382.

Accepted for publication December 13, 2007.

