MBE Advance Access originally published online on September 4, 2008
Molecular Biology and Evolution 2008 25(12):2567-2577; doi:10.1093/molbev/msn194
Research Articles
Patterns of Divergence among Conifer ESTs and Polymorphism in Pinus sylvestris Identify Putative Selective Sweeps



* Department of Evolutionary Functional Genomics, Uppsala University, S-752 36 Uppsala, Sweden
Department of Biology, University of Oulu, FIN-90014 Oulu, Finland
Department of Biological Statistics and Computational Biology, Cornell University
E-mail: anna.palme{at}ebc.uu.se.
Abstract
Finding genes that are under positive selection is a difficult task, especially in non-model organisms. Here, we have analyzed expressed sequence tag (EST) data from 4 species (Pinus pinaster, Pinus taeda, Picea glauca, and Pseudotsuga menziesii) to investigate selection patterns during their evolution and to identify genes likely to be under positive selection. To confirm selection, population samples of these genes have been sequenced in Pinus sylvestris, a species that was not included in the EST data set. The estimates of branch-specific Ka/Ks (nonsynonymous/synonymous substitution rates) across all genes in the EST data set were similar to or smaller than estimates from other higher plant species. There was no evidence for the traditional indication of positive selection, Ka/Ks above 1. However, several lines of evidence based on polymorphism patterns suggest that genes with high Ka/Ks (0.20–0.52) in the EST data set are in fact more affected by positive selection in P. sylvestris than genes with low Ka/Ks (0.01–0.04). The high Ka/Ks genes have a lower level of polymorphism and more negative Tajima's D than the low Ka/Ks genes. Further, in the high Ka/Ks group, the Hudson–Kreitman–Aguade test is significant. This suggests that the EST data set is a good starting point for finding genes under positive selection in conifers and that even moderate Ka/Ks values could be indicative of selection. A group of 5 genes with high Ka/Ks collectively show evidence for positive selection within P. sylvestris.
Key Words: selection, candidate genes, Ka/Ks, dn/ds, Pinus sylvestris, Scots pine
Introduction
Identifying genes under positive directional or balancing selection has been a long-standing goal in evolutionary biology, as these genes are the basis of adaptation. Even though several approaches have been applied to this task, this is still not easy even in model organisms with dense maps or full genome sequences. In non-model organisms, the task is even more challenging. Genome scans have been used (e.g., Vasemägi et al. 2005
The ratio of nonsynonymous and synonymous substitution rates, Ka/Ks, is a widely used indicator of selection (e.g., Tiffin and Hahn 2002; Barrier et al. 2003; Roth and Liberles 2006). Neutrally evolving genes are expected to have ratios close to 1, whereas purifying selection removes disadvantageous nonsynonymous mutations and thus decreases Ka/Ks. Positive directional or balancing selection can result in Ka/Ks ratios above 1 by increased fixation of nonsynonymous mutations. The majority of genes have Ka/Ks ratios well below 1 (Barrier et al. 2003; Roth and Liberles 2006) because purifying selection is the most common selective force. As the ratio is usually calculated for a whole gene or gene region, positive selection has to act on many of the sites to produce an overall value above 1, which makes the criterion of Ka/Ks > 1 very stringent. Ratios above 1, suggesting strong positive selection, have mainly been found in genes related to biological interactions, for example genes involved in the immune system (Adams et al. 2000; Hedrick et al. 2002), self-incompatibility (Awadalla and Charlesworth 1999; Roth and Liberles 2006), or in genes expressed in sexual organs (Swanson et al. 2001).
Here, the aim is to investigate if a high interspecies Ka/Ks ratio below 1 can be used as a criterion for identification of genes affected by positive selection. We study the evolutionary patterns within P. sylvestris in 2 groups of genes, one with high and one with low interspecies Ka/Ks ratio, but both with a Ka/Ks ratio well below 1. The difference between the groups is mainly due to Ka, as they have similar Ks and thus presumably similar mutation rates. We hypothesize that the differences within P. sylvestris between the 2 groups are due to differences in the selective forces that act on them. As the same individuals and populations have been investigated in all genes, they have all been subject to the same population history. Shared patterns found across most genes are therefore likely to have been caused by their common history and population structure. Selective forces, on the other hand, act on specific sites and their effect should only be evident at the selected site and linked regions (Przeworski 2002; Charlesworth 2006). However, the evolutionary process has a high stochastic variance (Nordborg 2001) and therefore coalescent trees, and consequently polymorphism patterns, are highly variable even under the same model, which seriously complicates inferences.
There are at least 3 potential explanations for differences in Ka/Ks, barring the simple explanation of random variation in divergence estimates. The selective constraint could be weaker in the high Ka/Ks than the low Ka/Ks group, or positive directional or balancing selection could be acting on some sites in the genes in the high group and thereby increase the Ka/Ks ratio. Thus, we would expect the low group to be under strong purifying selection and the high group to be under relaxed purifying selection (Hypothesis 1), purifying selection in combination with directional positive selection (Hypothesis 2), or purifying selection in combination with balancing selection (Hypothesis 3).
The 3 alternative hypotheses give rise to predictions on patterns of polymorphism in the high and low Ka/Ks groups within P. sylvestris, assuming that the basic patterns of selection on a locus are maintained over time. Under this assumption, the same processes that gave rise to the Ka/Ks values in the above species are still going on in many conifers, and could leave their footprint in P. sylvestris. First, both purifying and positive directional selection decrease the levels of diversity at selected as well as linked sites (Thomson 1977; Kaplan et al. 1989; Charlesworth et al. 1993; Hudson and Kaplan 1995), whereas balancing selection would increase the diversity (Hudson and Kaplan 1988; Charlesworth 2006). However, purifying selection is only expected to have a substantial impact on silent diversity in regions of restricted recombination or in inbreeding species (Charlesworth et al. 1993; Hudson and Kaplan 1995). Given the overall high recombination rate in genic areas of P. sylvestris (Dvornyk et al. 2002; Pyhäjärvi et al. 2007) and the outcrossing mating system, we would not expect an overall significant decrease in diversity due to purifying selection, except at loci closely linked to the site under purifying selection.
Under the neutral model, the neutral mutation rate is expected to govern both interspecies divergence and intraspecies diversity, and therefore a correlation between the 2 is expected (Hudson et al. 1987). This same pattern of correlation will also hold for genes with purifying selection, but positive directional selection or balancing selection are expected to affect the relationship between diversity and divergence. The Hudson–Kreitman–Aguade (HKA) test (Hudson et al. 1987) was developed to test for this deviation by investigating if the pattern of divergence and diversity is different among loci. A group of genes experiencing positive selection of different strengths and timing, such as under Hypothesis 2 or 3, would be expected to deviate from neutral predictions.
Purifying selection can cause negative Tajima's D (Tajima 1989) but the study of Charlesworth et al. (1995) indicates that background selection is unlikely to cause significantly negative Tajima's D or Fu and Li's D, especially if selection is strong and population sizes are large. The relationship between the strength of purifying selection and Tajima's D is not straightforward. Charlesworth et al. (1995) show that weak selection causes a larger change in the frequency spectrum and more significant Tajima's D values than stronger selection does. This is in part backed up by Akashi (1999), whose simulations indicate that the power to detect purifying selection with Tajima's D increases with increasing strength of selection when the absolute selection levels are low, but then starts decreasing as the strength of selection increases. To make predictions from these simulation studies to our data is not straightforward, as we do not know the strength of selection, dominance levels, or effective population size. However, neither strong nor weak purifying selection would have high power to cause significant Tajima's D considering the sample size and length of the sequences. A selective sweep on the other hand will have a strong impact on Tajima's D, which will become negative and its variance will decrease (Braverman et al. 1995; Simonsen et al. 1995).
In conclusion, under Hypothesis 1 (relaxed purifying selection in the high Ka/Ks group), we expect there to be little difference in Tajima's D or the HKA test between the 2 groups. Under this hypothesis, the high group could display higher levels of silent diversity in regions closely linked to selected sites and about equal levels of silent diversity in the other sites. Under Hypothesis 2 (directional positive selection), we would expect the high group to have more negative Tajima's D and lower silent diversity than the low group and also significant HKA test (if there is enough power), whereas under Hypothesis 3 balancing selection would cause a more positive Tajima's D, higher levels of diversity in the high than in the low group, and significant HKA test.
In this paper, we want to specifically examine the following questions: (I) Is the level of purifying selection in conifers similar to that in angiosperms, and consistent among different genera? (II) Are the genes with high Ka/Ks in the expressed sequence tag (EST) data set more likely to be under positive selection in P. sylvestris than genes with low Ka/Ks or randomly chosen genes? (III) Can we identify candidate genes for positive selection in P. sylvestris? Question II is of special importance because this would enable us to use EST data sets to select genes that have a higher probability to be under positive selection within species, and thus simplify the search for such genes in non-model species. Further, we also begin to address the issue of how frequent the traces of directional selection are in conifer genomes.
Materials and Methods
The EST Data Set
The public EST libraries of Picea glauca and Pseudotsuga menziesii (GenBank) were used to construct longer "unigene" (unique gene) sequences with methods and programs developed for the Solanaceae Genomics Network EST database (Wright M, unpublished data). In the case of Pinus taeda and Pinus pinaster, unigenes were downloaded directly from http://dendrome.ucdavis.edu/treegenes and http://cbi.labri.u-bordeaux.fr/outils/SAM/, respectively. The address is presently http://dendrome.ucdavis.edu/treegene. We have used a "reciprocal best match" strategy to identify putatively orthologous unigenes. Briefly, the collection of unigene sequences for all 4 species was searched (Blast) against each of the other species in both directions, and the best hit for each unigene was recorded for each pairwise comparison. If a collection of 4 unigene sequences, one from each species, matched each other as the best hit for all of the respective reciprocal comparisons, the unigenes were deemed to represent orthologous sequences in the 4 species. By including only orthologs, we get comparable data sets in each gene, whereas also including paralogs would in some cases introduce more complex and different gene trees, longer evolutionary timescales, and the issue of evolution after gene duplication. The coding regions were identified with a Bayesian method and frequently confirmed by homology to known protein sequences (via BlastX). Sequences with reading frame uncertainties or internal STOP codons were removed and a cutoff value of 90 bp was applied, resulting in a data set with 138 orthologs.
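The reciprocal best match strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the input layout (lists of query, subject, score tuples keyed by an ordered species pair) and the function names `best_hits` and `ortholog_sets` are hypothetical, and ties are resolved by keeping the first hit encountered.

```python
from itertools import combinations


def best_hits(hits):
    """Map each query unigene to its highest-scoring subject.

    `hits` is a list of (query, subject, score) tuples, for example parsed
    from tabular Blast output (hypothetical input layout). On ties, the
    first hit encountered is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def ortholog_sets(species, hits_by_pair):
    """Return tuples with one unigene per species that are mutual best hits
    in every pairwise comparison, in both directions.

    `hits_by_pair[(a, b)]` holds the hits of species-a queries against
    species b; both (a, b) and (b, a) must be present for every pair.
    """
    best = {pair: best_hits(h) for pair, h in hits_by_pair.items()}
    first = species[0]
    results = []
    for seed in sorted(best[(first, species[1])]):
        # Candidate partners are the seed's best hit in each other species.
        members = {first: seed}
        for other in species[1:]:
            members[other] = best[(first, other)].get(seed)
        if None in members.values():
            continue
        # Require reciprocity for all species pairs, not only those with the seed.
        if all(
            best[(a, b)].get(members[a]) == members[b]
            and best[(b, a)].get(members[b]) == members[a]
            for a, b in combinations(species, 2)
        ):
            results.append(tuple(members[s] for s in species))
    return results
```

Requiring reciprocity for every pair, rather than only for pairs involving the seed species, mirrors the condition in the text that the unigenes match each other as best hits in all of the respective reciprocal comparisons.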
Analysis of the EST Data Set
Pairwise estimates of Ka, Ks, and Ka/Ks were calculated for all possible species comparisons, using the method of Pamilo and Bianchi (1993). Ka/Ks was used as selection criterion when choosing genes for further sequencing (see below).
To make branch-specific estimates of Ka/Ks, the free-ratio branch model implemented in the program codeml from the package PAML 3.15 (Yang 1997) was used for each gene in the EST data set. The codon frequencies were estimated from the average nucleotide frequencies and parameters (e.g., Ka/Ks, transition/transversion ratio, branch lengths) were estimated by maximum likelihood. An unrooted tree mirroring known relationships among the 4 species was assumed (Wang et al. 2000; Grotkopp et al. 2004). To investigate if Ka/Ks is conserved through time, we conducted linear regression analysis between Ka/Ks estimates on different branches. As low divergence leads to uncertain estimates, cases where Ks was below 0.05 were excluded. On the branches leading to the 2 pine species, very few Ka/Ks estimates were left after applying the 0.05 cutoff, due to the low divergence. Therefore, results from the branch-specific analysis are only presented for 3 branches: the internal branch separating the pine species from P. glauca and P. menziesii (109 genes with Ks > 0.05), the branch leading to P. glauca (71 genes with Ks > 0.05), and the branch leading to P. menziesii (128 genes with Ks > 0.05). The average Ka/Ks was estimated by averaging across all the genes for each branch, and differences among branches were evaluated with the Wilcoxon signed rank test using R 2.6.0 (http://www.R-project.org). To calculate overall Ka/Ks for the pine species, the divergence estimates on the branch leading to P. pinaster and the branch leading to P. taeda were pooled (54 genes with Ks > 0.05).
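The filtering and averaging step can be illustrated with a short Python sketch. It assumes that per-gene Ka and Ks estimates for one branch are available as (ka, ks) tuples; that layout and the name `branch_mean_ratio` are hypothetical. The 0.05 cutoff is the one stated in the text.

```python
from statistics import mean

KS_CUTOFF = 0.05  # cutoff on Ks stated in the text


def branch_mean_ratio(estimates, ks_cutoff=KS_CUTOFF):
    """Average Ka/Ks over the genes whose Ks exceeds the cutoff.

    `estimates` is a list of (ka, ks) tuples for a single branch
    (hypothetical layout). Returns (mean Ka/Ks, number of genes retained).
    """
    ratios = [ka / ks for ka, ks in estimates if ks > ks_cutoff]
    if not ratios:
        raise ValueError("no genes pass the Ks cutoff")
    return mean(ratios), len(ratios)
```

Returning the number of retained genes alongside the mean makes it easy to see how strongly the cutoff thins out short branches, such as those leading to the 2 pine species.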
Choice of Genes for Sequencing in P. sylvestris
Among the long EST unigenes, 2 groups of genes were selected from the extremes of the average pairwise Ka/Ks distribution, one with high (0.20–0.52) and one with low (0.01–0.04) average Ka/Ks. The averages were calculated across all possible comparisons among the 4 species in the EST data set (table 1). All selected genes had more than 300 bp EST sequence, hits in GenBank, correct reading frame according to GenBank comparison and no apparent orthology problems. The difference in Ka/Ks between the 2 groups was largely due to Ka and not Ks. Average Ka in the low Ka/Ks group was 0.0052 (range 0.0020–0.0078), only about 8% of the average in the high Ka/Ks group: 0.066 (range 0.051–0.116). The average divergence at synonymous sites was about equal in the 2 groups: 0.22 (range 0.16–0.32) and 0.21 (range 0.18–0.27) in the low and the high group, respectively (table 1).
Primer Construction
Primers for amplification of the selected genes in P. sylvestris were constructed from full-length unigene sequences and generally the sequences of P. taeda and/or P. pinaster were used as templates for the primers. The program OLIGO 5.0 (Wojciech Rychlik, National Biosciences Inc.) was used for primer design. The primers were tested in 8 P. sylvestris individuals and those that amplified one single PCR product were used for this study (supplementary table S1, Supplementary Material online). These primers were also used for amplification in other Pinus species.
Samples
Seed samples of P. sylvestris (subgenus Pinus) from 8 different European locations were obtained from the Finnish Forest Research Institute. The locations are Northern Finland (latitude 67°11', longitude 24°03'), Southern Finland (60°52', 21°20'), Sweden (56°28', 15°55'), Poland (50°41', 20°05'), Austria (47°26', 16°29'), France (48°45', 07°50'), Turkey (39°27', 30°18'), and Spain (37°22', 02°50'W). For sample sizes for each gene, see table 2. As there is a very low level of population differentiation across the investigated region (average FST of –0.01, see table 2), we chose to largely treat the whole data set as one single population. However, as described in Pyhäjärvi et al. (2007), the regions have different glacial and postglacial histories and display differences in, for example, the frequency spectrum and recombination rate. Some analyses have therefore been made separately on the following subgroups: the northern populations (the Swedish and Finnish populations), the central populations (Poland, Austria, and France), Spain, and Turkey. Additional seed samples were obtained from Pinus lambertiana and Pinus ponderosa.
DNA Extraction and Sequencing
DNA was extracted from megagametophytes (haploid tissue) with the FastDNA Kit (Qbiogene Inc., Carlsbad, CA). PCR was performed with DyNAzyme EXT DNA Polymerase (Finnzymes, Espoo, Finland) following one of 2 protocols: touchdown PCR or PCR with a single annealing temperature (see supplementary table S1, Supplementary Material online). The amplified product was cleaned with the MinElute 96 UF PCR Purification Kit (Qiagen, Valencia, CA) and then sequenced with the ABI PRISM Big Dye Terminator v 3.1 Cycle Sequencing Kit (Applied Biosystems, Foster City, CA). The sequencing reaction was cleaned with Sephadex G-50 using Multiscreen 96-well filtration plates (Millipore, Billerica, MA) and then analyzed on a 3730 DNA Analyzer (Applied Biosystems).
Sequence Analysis
DNA sequences from each individual were edited and assembled in Sequencher (Gene Codes Corporation, Ann Arbor, MI). Multiple sequence alignment was conducted in ClustalX 1.83 (Thompson et al. 1997) and if necessary edited by hand in BioEdit 7.0.5.2 (Hall 1999). Intron and exon boundaries were assigned after comparison with GenBank sequences and the EST unigenes. Pinus sylvestris sequences have been deposited in GenBank under accession numbers EU999244–EU999589, P. lambertiana sequences under accession numbers EU999628, EU999638, EU999649, EU999665, EU999668, EU999678, and EU999690, and P. ponderosa sequences under accession numbers EU999635, EU999642, EU999653, EU999660, EU999672, EU999686, and EU999695.
Analysis of diversity and divergence in each gene was conducted in DNAsp 4.10 (Rozas et al. 2003) and a paired sign test was used to test for significant differences in nucleotide diversity among different geographic regions. A Mann–Whitney test was applied to test if the ratio between diversity and divergence was different between the high and the low group. Analysis of variance and estimation of FST were done in Arlequin 2.000 (Schneider et al. 2000) and the significance of FST was estimated by permutation of haplotypes among populations. The HKA program (distributed by Jody Hey, http://lifesci.rutgers.edu/~heylab) was used to conduct the HKA test (Hudson et al. 1987) on multiple loci as well as multilocus tests of Tajima's D (see below). The McDonald–Kreitman test was performed in DNAsp with EST unigenes as the second species (P. taeda, P. glauca, P. menziesii).
Several neutrality tests based on the frequency spectrum were conducted in DNAsp (Fu and Li's D and F and Fu's F statistic) but we will only present more extensive results on Tajima's D (D; Tajima 1989) and Fay and Wu's H (H; Fay and Wu 2000). D is presented as it generally has higher power to detect a selective sweep than other such tests (Simonsen et al. 1995; Przeworski 2002). H was chosen because it is supposed to be specific to a selective sweep (Fay and Wu 2000), even if later studies have shown that the claim that H is not affected by demography does not hold (Przeworski 2002). The H test was performed on the whole sequence with P. lambertiana if available, and otherwise with P. ponderosa as an outgroup. The EST unigenes from P. taeda, P. glauca, and P. menziesii were also used as outgroups, and then the test was performed only on overlapping coding sequence. Tajima's D on silent sites was estimated manually according to Tajima (1989), as well as the ratio between D and its minimum value, Dmin, for total and silent sites (Schaeffer 2002). The latter was included because the value of D depends on the number of segregating sites.
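For readers who want to reproduce a manual estimate, the following Python sketch computes Tajima's D from the sample size, the number of segregating sites, and the mean number of pairwise differences, using the constants given in Tajima (1989). The Dmin function rests on an assumption that should be checked against Schaeffer (2002): it takes the minimum of D to be the value obtained when every segregating site is a singleton, so that the mean pairwise difference equals 2S/n. The function names are illustrative only.

```python
from math import sqrt


def tajimas_d(n, seg_sites, pi):
    """Tajima's D for n sequences, S = seg_sites segregating sites, and
    pi = mean number of pairwise differences (per sequence, not per site)."""
    if n < 4:
        raise ValueError("the variance of D is zero for fewer than 4 sequences")
    if seg_sites < 1:
        raise ValueError("D is undefined without segregating sites")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


def tajimas_d_min(n, seg_sites):
    """Minimum of D, assuming it is reached when all segregating sites are
    singletons (each singleton contributes 2/n to the mean pairwise difference)."""
    return tajimas_d(n, seg_sites, 2.0 * seg_sites / n)


def d_over_dmin(n, seg_sites, pi):
    """Ratio D/Dmin, which is less dependent on the number of segregating sites."""
    return tajimas_d(n, seg_sites, pi) / tajimas_d_min(n, seg_sites)
```

Because Dmin is negative, D/Dmin is positive when D is negative and approaches 1 as the frequency spectrum becomes dominated by singletons.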
Significance of H was estimated in DNAsp by coalescent simulations given
and assuming no within-gene recombination (1,000 replicates), which makes the test conservative. Significance of total and silent D was estimated by coalescent simulations in the HKA program (distributed by Jody Hey, http://lifesci.rutgers.edu/~heylab). All tests conducted on several genes were corrected for multiple testing by the sequential Bonferroni procedure. A Mann–Whitney test was used to investigate if the groups with the high and low Ka/Ks ratio were significantly different with regard to D, H, and estimates of nucleotide diversity (π), and linear regression was used to investigate if there is a correlation between Ka/Ks and D, D/Dmin, or H.
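The multiple-testing correction can be sketched as follows. The text does not specify which variant of the sequential Bonferroni procedure was applied; this sketch assumes the usual step-down form (Holm's procedure), in which the k-th smallest of m P values is compared with alpha / (m - k + 1). The function name is hypothetical.

```python
def sequential_bonferroni(p_values, alpha=0.05):
    """Step-down sequential Bonferroni (Holm) correction.

    Returns a list of booleans in the original order, True where the
    corresponding null hypothesis is rejected at the table-wide level alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest P value is tested against alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # all larger P values are retained as non-significant
    return rejected
```

Compared with a plain Bonferroni correction, which tests every P value against alpha / m, the step-down form is never less powerful while still controlling the table-wide error rate.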
The effect of variation in number of sequences, sequence length, and number of segregating sites between the high and the low Ka/Ks group on differences in D among the groups was studied with coalescent simulations conducted with mlcoalsim v1.21 (Ramos-Onsins and Mitchell-Olds 2007) with 5,000 iterations. The simulations were conducted separately on the northern and central populations but not performed in Spain and Turkey due to low sample sizes. The number of sites, the number of samples, and
were given as in the data and a least-squares estimate of recombination from Pyhäjärvi et al. (2007) was used. This was based on genes not included here, but on sequences from the same populations. An equal recombination rate per base pair was assumed across genes. The simulations were conducted with both a neutral stationary panmictic model and a bottleneck model (Pyhäjärvi et al. 2007) that was deemed to fit the data better than the standard neutral equilibrium: a 0.006 × 4Ne long bottleneck with 1% of the current population size starting 0.1 × 4Ne generations ago. All size changes were modeled as instantaneous. The mlcoalsim program was also used to test the significance of D and H against the bottleneck model.
Results
The EST Data Set
The frequency distribution of Ka/Ks displays a similar overall pattern on the different branches analyzed: the internal branch, the branch leading to P. glauca, and the branch leading to P. menziesii (fig. 1). Most of the genes have Ka/Ks ratios below 0.1, indicating strong purifying selection and no genes have a ratio above 1. The average branch-specific Ka/Ks was 0.12, 0.14, and 0.15 for the internal branch, the branch leading to P. glauca, and the branch leading to P. menziesii, respectively, and no significant difference was detected among branches with the Wilcoxon signed rank test (P > 0.05). The combined average of the 2 branches leading to the pine species was 0.10.
Linear regression analysis indicates that there is a correlation between Ka/Ks estimates on different branches. This correlation is significant but low when comparing estimates for the internal branch both to estimates on the branch leading to P. glauca (R2 = 0.24, P = 0.00010) and to the branch leading to P. menziesii (R2 = 0.24, P = 1.8 × 10⁻⁷) but only marginally so for Ka/Ks on the P. glauca and P. menziesii branches (R2 = 0.06, P = 0.052).
Nucleotide Diversity in P. sylvestris
Under neutrality or only purifying selection, we would expect the high and the low Ka/Ks group to have the same level of synonymous diversity as they show similar levels of synonymous divergence (table 1). Against this expectation, the high Ka/Ks group has a lower level of nucleotide diversity than the low Ka/Ks group (table 2). The medians of the 2 groups are significantly different for total and silent nucleotide diversity (Mann–Whitney test, P < 0.05).
In contrast to the expectation of similar synonymous diversity, we a priori expected the 2 groups to be different with regard to nonsynonymous diversity, as the high Ka/Ks group has larger nonsynonymous divergence. This is also what we find in P. sylvestris where the nonsynonymous diversity was significantly elevated in the high relative to the low group (Mann–Whitney test, P < 0.05).
The diversity is not evenly distributed across the sampled area. The total level of diversity as well as the silent diversity was the highest in the north, followed by the Turkish population, whereas nonsynonymous diversity was the highest in the Spanish population. However, there were no significant differences among regions (P > 0.05). Spain also has a higher average πa/πs ratio than the other regions: 0.66 compared with 0.19, 0.15, and 0.22 for the north, central, and Turkish areas, respectively. Locus-specific analysis of molecular variance was performed for each variable site, and FST was calculated. No significant FST was identified for any variable site in either Ka/Ks group. Gene-specific FSTs are presented in table 2.
Diversity and Divergence
A multilocus HKA test was conducted on P. sylvestris using EST data from P. taeda, P. glauca, or P. menziesii as the second species. When the high Ka/Ks group was analyzed, the HKA test gave a significant result when using P. taeda (P = 0.04) and P. glauca (P = 0.04) but not with P. menziesii (P = 0.16). The loci causing the largest deviation from neutral expectation were 207 and 175. Gene 207 displayed lower observed polymorphism than expected, whereas gene 175 had the opposite pattern. The test was far from significant in all cases when the low Ka/Ks group was analyzed (P = 0.23, 0.72, 0.41, respectively). Observe that the power in the former test is much higher than in the latter due to larger sample size and more polymorphic sites in coding regions.
The high and the low Ka/Ks group show different ratios between diversity and divergence, largely due to the differences in diversity mentioned above. The high Ka/Ks group consistently shows a lower diversity/divergence ratio than the low group for synonymous sites, when divergence was estimated between P. sylvestris and P. taeda, P. glauca, or P. menziesii (Pi(s)/Ks for the high–low group: 0.11–0.76, 0.012–0.016, 0.008–0.016) but the difference was not significant in any individual case (Mann–Whitney, P > 0.05). The McDonald–Kreitman test, which tests the prediction that πa/πs and Ka/Ks are identical, was only conducted on half the genes due to low levels of polymorphism or divergence in the coding regions analyzed. The remaining genes had low power for the same reason, and none of them deviated significantly from neutrality (data not shown).
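The logic of the McDonald–Kreitman test can be illustrated with a small, self-contained Python sketch. It compares counts of nonsynonymous and synonymous polymorphisms with counts of nonsynonymous and synonymous fixed differences in a 2 × 2 table. The use of a two-sided Fisher's exact test here is an assumption made for illustration (DNAsp offers several test statistics), and the function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one.
    return sum(
        p for p in (prob(x) for x in range(low, high + 1))
        if p <= p_obs * (1 + 1e-9)
    )


def mcdonald_kreitman(pn, ps, dn, ds):
    """Return (neutrality index, P value) for nonsynonymous/synonymous
    polymorphisms (pn, ps) and fixed differences (dn, ds)."""
    if ps == 0 or dn == 0 or ds == 0:
        ni = float("nan")  # the index is undefined with empty cells
    else:
        ni = (pn / ps) / (dn / ds)
    return ni, fisher_exact_two_sided(pn, ps, dn, ds)
```

A neutrality index below 1 indicates an excess of nonsynonymous fixed differences relative to polymorphism, the pattern expected under positive selection. With few polymorphic or divergent sites in the table, the exact test has little power, which is consistent with the low power noted above.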
Selection Tests within P. sylvestris
We applied several tests to search for the footprints of selection within P. sylvestris. The overall pattern shows on average negative values for D, H (table 3), and other such summary statistics of the frequency spectrum (data not shown). Importantly, the results were different for the high and the low Ka/Ks group.
The 2 groups are significantly different with regard to both silent and total D (table 3), silent and total D/Dmin, and other frequency spectrum summary statistics (Mann–Whitney test, P < 0.05). In all cases, the average value for the high group was more negative than the value for the low group. We also tested for significant differences in H between the groups, but here the power was lower because of unavailable outgroup sequences or absence of variation in the overlapping sequences in P. sylvestris. H was not significantly different between groups (Mann–Whitney test, P > 0.05), but the average was more negative in the high than the low group, which is in agreement with the results of the tests without an outgroup.
The high Ka/Ks group has significantly lower variance in D than expected under neutrality (the observed variance, 0.098, is significantly lower than that expected from simulations, 0.88: P < 0.05). In the low group, there was no significant difference.
There is a correlation between Ka/Ks and several statistics measuring deviations in the frequency spectrum. Linear regression indicates significant negative correlations between Ka/Ks(P. sylvestris–P. glauca) and silent D (P = 0.0006), silent D/Dmin (P = 0.017), total D (P = 0.003), and H (P = 0.03) but not total D/Dmin (P = 0.05). Ka/Ks(P. sylvestris–P. menziesii) was significantly correlated with silent D (P = 0.0004), silent D/Dmin (P = 0.004), total D (P = 0.006), and total D/Dmin (P = 0.003) but not with H (P = 0.61). In all cases, the higher the Ka/Ks, the more negative the statistics were.
Figure 2 displays the relationship between diversity and D. It is clear that the high and the low Ka/Ks groups show different patterns in both diversity and D, but whereas the diversity measures overlap between the groups, D does not. Importantly, among the genes that have similar levels of diversity, the high group displays decidedly more negative D values than the low Ka/Ks group.
The geographic distribution shows that D and H are most negative in the north followed by the central regions and least negative in the southern populations. This is in agreement with earlier findings (Pyhäjärvi et al. 2007).
To test if the differences in the number of sequences, sequence length, and number of segregating sites could in combination cause the differences between the high and the low Ka/Ks groups, coalescent simulations were conducted (conditioned on number of sequences, sequence length, and number of segregating sites). The neutral stationary panmictic model gives D values very close to 0: –0.04 and –0.02 for the northern high and low group and –0.02 and –0.00 for the central high and low group, respectively. As also demonstrated by Pyhäjärvi et al. (2007)
observed values) = 0.007; PH = 0.0048). In the central area, H was significantly different (PH = 0.0076) but not D (PD = 0.11). In the low group, no significant differences were found. Individual genes do not deviate significantly from the model after correction for multiple testing (P > 0.05).
Discussion
Conifer Selection Patterns
The EST data set clearly shows that purifying selection is the dominating force acting on the genes during the investigated time periods, as Ka/Ks is generally well below 1 (fig. 1). The average Ka/Ks ratios (0.10–0.15) are equal to or lower than those found in other plant species, such as 0.21 between Arabidopsis thaliana and Arabidopsis lyrata (Barrier et al. 2003
None of the genes show the traditional indication of positive selection: Ka/Ks > 1. In a comparison between A. thaliana and A. lyrata (Barrier et al. 2003), about 5% of the genes had Ka/Ks above 1, whereas no such genes were found in a comparison between A. thaliana and B. rapa (Tiffin and Hahn 2002). A contributing factor to the differences could be low divergence between the former but not the latter species pair, resulting in higher variation and Ka/Ks > 1 created by chance in the former case. In our analysis, we use a cutoff value of 0.05 for Ks to decrease errors due to low divergence.
The absence of genes with Ka/Ks above 1 does not exclude positive selection on these genes during the evolutionary time studied here. Ka/Ks above 1 across the whole sequence is a very conservative indication of positive selection. Some sites may have been under positive selection during some time periods, whereas the majority has been evolving under purifying selection. This can potentially lead to an increased Ka/Ks that is still well below 1. The mode of evolution as measured by the Ka/Ks is conserved across evolutionary time, as shown by correlations between Ka/Ks estimates from different branches. This is most likely caused by a constant level of selective constraint. However, there is a high level of unexplained variation in Ka/Ks (R2 = 0.06–0.24) which is probably largely due to the high level of variation inherent in the evolutionary process but positive selection in some branches and not others could be a contributing factor.
Are the High Ka/Ks Genes under Positive Selection in P. sylvestris?
Population history and structure in P. sylvestris have been studied in other papers (Sinclair et al. 1999; Cheddadi et al. 2006; Pyhäjärvi et al. 2008) and it has been suggested that the nucleotide variation found in nuclear genes can best be explained by an ancient bottleneck (Pyhäjärvi et al. 2007). This manifests itself in, for example, negative overall D and H (Pyhäjärvi et al. 2007) and thus our a priori expectations were that these measures should be negative, which is also what we find here (table 3).
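For readers less familiar with the statistic, the following is a minimal sketch of Tajima's D computed from per-site allele counts, following the formulas of Tajima (1989). The function name and the toy inputs are illustrative assumptions; the values reported in this paper were not produced with this code.

```python
from math import sqrt


def tajimas_d(allele_counts, n):
    """Tajima's D for a sample of n sequences.

    allele_counts holds one entry per segregating site: the number of
    sequences carrying one of the two alleles (1 <= k <= n - 1).
    """
    s = len(allele_counts)
    if n < 4 or s == 0:
        raise ValueError("need n >= 4 and at least one segregating site")
    if any(k < 1 or k > n - 1 for k in allele_counts):
        raise ValueError("each count must lie between 1 and n - 1")

    # Mean number of pairwise differences (pi).
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between pi and Watterson's estimator, scaled by its SD.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy inputs: an excess of rare variants versus intermediate frequencies.
d_rare = tajimas_d([1, 1, 1, 1, 1], n=10)
d_intermediate = tajimas_d([5, 5, 5, 5, 5], n=10)
```

An excess of rare variants, as expected after a bottleneck followed by expansion or after a selective sweep, gives a negative D, whereas variants at intermediate frequency give a positive D.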
Given what we know about the population history of P. sylvestris, we cannot regard a significantly negative D or H, tested against the neutral model, as conclusive evidence for positive selection. We address this problem in 2 ways: (I) comparing groups of genes that share the same population structure and history and (II) testing significance against a bottleneck model. In the former case, we compare one group with high and one with low interspecies Ka/Ks but similar neutral mutation rate (similar Ks). The patterns in the low Ka/Ks group are very similar to those reported by Pyhäjärvi et al. (2007) for genes studied in the same populations (see table 4), suggesting that low Ka/Ks genes display a pattern similar to genes randomly chosen with respect to their Ka/Ks ratio, whereas the high group deviates from this pattern.
As described in the Introduction, there are 3 possible scenarios that could potentially cause differences in Ka/Ks: (1) relaxed selective constraint in the high group, (2) positive selection in the high group, or (3) balancing selection in the high group. These scenarios lead to different predictions that can be used to separate them.
We observe a significant difference in D (table 3) and other statistics quantifying deviations in the frequency spectrum between the high and the low Ka/Ks group, and the high group consistently displays a more negative value. This is in accordance with the positive selection hypothesis and should not be confounded by differences in neutral mutation rate as Ks is the same in the 2 groups. In addition, the largely negative values of H in the high group (tables 3 and 4) suggest that selective sweeps should have occurred rather recently (Przeworski 2002). The balancing selection hypothesis would result in more positive D values in the high group, whereas under the relaxed constraint hypothesis D would not differ greatly among the groups and neither should H. We also generally observe a negative correlation between Ka/Ks and D and between Ka/Ks and H, which is suggestive of an increasing effect of positive selection in P. sylvestris as Ka/Ks increases.
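The H statistic referred to above can be sketched in the same way. The code below implements the unnormalized H of Fay and Wu (2000) as the difference between the pairwise diversity and an estimator weighted toward high-frequency derived variants. It assumes that derived alleles have already been identified with an outgroup; the function name and inputs are illustrative, and this is not necessarily the exact variant used for the tables in this paper.

```python
def fay_wu_h(derived_counts, n):
    """Unnormalized Fay and Wu's H for a sample of n sequences.

    derived_counts holds, for each segregating site, the number of
    sequences carrying the derived allele (1 <= k <= n - 1).
    """
    if n < 2 or not derived_counts:
        raise ValueError("need n >= 2 and at least one segregating site")
    if any(k < 1 or k > n - 1 for k in derived_counts):
        raise ValueError("each count must lie between 1 and n - 1")

    denom = n * (n - 1)
    # Pairwise diversity.
    pi = sum(2.0 * k * (n - k) / denom for k in derived_counts)
    # Estimator that gives most weight to high-frequency derived alleles.
    theta_h = sum(2.0 * k * k / denom for k in derived_counts)
    return pi - theta_h


# Toy inputs: derived alleles near fixation versus rare derived alleles.
h_high_freq = fay_wu_h([9, 9, 8], n=10)
h_low_freq = fay_wu_h([1, 1, 2], n=10)
```

Derived alleles at high frequency, the pattern left by recent hitchhiking, drive H negative, while rare derived alleles give a positive H.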
To address the problem with unequal levels of variation and sampling sizes in the 2 groups, we calculated D/Dmin and conducted simulations. The 2 groups are still significantly different when using D/Dmin instead of D, suggesting that this difference is not caused by variation in the number of segregating sites (Schaeffer 2002). In addition, even genes with similar diversity in the 2 groups have different D (fig. 2). The simulations, conditioned on observed data on segregating sites, sequence number, and lengths, suggest that the differences are not explained by these factors.
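A rough sketch of the D/Dmin scaling follows. It assumes that Dmin is the value D would take if every segregating site were a singleton, which is the minimum for a given sample size and number of segregating sites. Because D and Dmin share the same denominator, the ratio reduces to a ratio of numerators. The function name and the toy inputs are illustrative assumptions.

```python
def d_over_dmin(allele_counts, n):
    """Tajima's D divided by its minimum for the same n and S.

    allele_counts holds one entry per segregating site (1 <= k <= n - 1).
    Dmin is taken as D when all S sites are singletons (an assumption).
    """
    s = len(allele_counts)
    if n < 4 or s == 0:
        raise ValueError("need n >= 4 and at least one segregating site")
    if any(k < 1 or k > n - 1 for k in allele_counts):
        raise ValueError("each count must lie between 1 and n - 1")

    a1 = sum(1.0 / i for i in range(1, n))
    theta_w = s / a1  # Watterson's estimator
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    pi_min = 2.0 * s / n  # pi when every site is a singleton

    # The standard deviation terms of D and Dmin are identical and cancel.
    return (pi - theta_w) / (pi_min - theta_w)


# Same frequency pattern, different numbers of segregating sites.
ratio_few_sites = d_over_dmin([1] * 5, n=10)
ratio_many_sites = d_over_dmin([1] * 20, n=10)
ratio_intermediate = d_over_dmin([5] * 4, n=10)
```

With all singletons the ratio is 1 whether there are 5 or 20 segregating sites, which is why the scaled statistic is less sensitive to the number of segregating sites than D itself.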
There is a significant difference in silent diversity between the high and the low Ka/Ks group (table 2), even though the EST data indicated no difference in neutral mutation rate between the groups (table 1). The observed lower silent diversity in the high than in the low Ka/Ks group is in accordance with the positive selection hypothesis, as selective sweeps decrease the level of diversity, whereas the alternative hypotheses are not compatible with this pattern. Under the hypothesis of relaxed purifying selection, we expect equal or higher levels of silent variation in the high group, and under balancing selection, we would also expect more variation in the high group.
Additional support for positive selection in the high group is given by the HKA test, which indicates that there was a significant deviation from neutral expectations in the high Ka/Ks group but not in the low group. Note, however, that the power to detect deviations in the low group is lower due to smaller sample sizes. The high group has a lower (nonsignificant) diversity/divergence ratio independent of outgroup, which is also in accordance with positive directional selection.
In conclusion, several lines of evidence suggest that the group of genes with high Ka/Ks is more affected by directional positive selection than the low Ka/Ks group. This demonstrates that multispecies EST data sets can be a useful tool in non-model organisms for identifying groups of genes with increased probability of being under positive selection. However, a valid question in this context is what kind of directional selection do we identify with this method and in what kind of biological processes do we expect these genes to be involved? We have used high average Ka/Ks and not Ka/Ks on a single branch or a single species comparison as a selection criterion. This means that we are actively choosing genes that are maintaining high Ka/Ks throughout the gene tree, implying that they are affected not only by a single selection event but also by repeated directional selection during their evolution.
Both Andolfatto (2007) and Macpherson et al. (2007) have made similar findings of a relationship between Ka and frequency spectrum–based selection tests, comparing Drosophila melanogaster and Drosophila simulans. These results suggest that repeated selective sweeps are a frequent occurrence in the genome. Our results do not permit estimating the overall frequency, but in a rather small data set of short sequences, we did detect selection. Thus, selective sweeps seem to occur readily, despite the high levels of recombination in the coding areas (Pyhäjärvi et al. 2007).
Whether these genes undergoing repeated directional selection are involved in any particular biological process is unclear. The classic cases of frequent/constant selection involve biological interactions, and whereas cases such as MHC or self-incompatibility genes involve balancing selection and can therefore be excluded in this case, other cases of arms races can involve directional selection (Buckling and Rainey 2002). Local adaptation could also potentially result in repeated directional selection if the same genes are frequently involved in adaptations to changing environmental conditions. There is some evidence that this is occurring in natural populations as the gene FLC has been implicated in local adaptations both in Arabidopsis (Werner et al. 2005) and in Capsella (Slotte 2007).
There are many examples of genes that have undergone selection during different time periods or within separate lineages (e.g., Carginale et al. 2004; Guillet-Claude et al. 2004; Borrelli et al. 2006). However, individual tests for selection at different timescales seem to have a low overlap in what genes they identify (reviewed for humans in Biswas and Akey 2006). This may be due partly to statistical issues or, alternatively, many genes may be affected by selection only during some parts of their history, and would therefore not be picked up by the strategy used here.
Candidates for Positive Selection?
There are patterns suggestive of positive selection in all genes in the high Ka/Ks group but we cannot provide conclusive evidence in any of the individual cases. In some of these genes, D is significantly different from the neutral model (table 3) but this is most likely not the appropriate model for P. sylvestris (Pyhäjärvi et al. 2007), and while they collectively depart significantly from the bottleneck model (table 4) none of the individual genes do. Thus, although several lines of evidence suggest that there is positive selection in the high Ka/Ks group (see above), we cannot single out any specific genes within this group. The only gene where selection has been confirmed in other studies is phytocyanin, where some sites appear to have been evolving under positive selection during their evolution in conifers (Palmé AE, Pyhäjärvi T, Wachowiak W, and Savolainen O, unpublished data). This is also the only gene that is marginally significant when tested against the bottleneck model (central region, PH = 0.056).
Previous studies of genes in P. sylvestris have not identified strong candidates for positive selection among the 16 genes studied (Dvornyk et al. 2002; Garcia-Gil et al. 2003; Pyhäjärvi et al. 2007; Savolainen and Pyhäjärvi 2007). Here, we have identified a group of 5 possible candidates. The fact that selection has been implicated both within P. sylvestris and as an explanation for high Ka/Ks suggests that selection has been acting repeatedly on different sites and time periods during the evolution of conifers. However, to confirm and understand the potential role of these genes in selection, more work is needed, especially because their functions are putative or completely unknown (table 1).
| Supplementary Material |
|---|
Supplementary table S1 is available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
We gratefully acknowledge financial support from the Academy of Finland and the Biosciences and Environment Research Council to O.S. We thank Tanja Pyhäjärvi for discussions on simulations and for providing an Excel spreadsheet for the calculation of silent Tajima's D, Soile Finne for skillful technical assistance, Martin Lascoux for discussions, and the Finnish Forest Research Institute for seed samples. O.S. thanks the Aquadro laboratory at Cornell University for hospitality and inspiring discussions during a short sabbatical.
| Footnotes |
|---|
John H McDonald, Associate Editor
| References |
|---|
Adams EJ, Stewart C, Glenys T, Parham P, Adams E. Common chimpanzees have greater diversity than humans at two of the three highly polymorphic mhc class I genes. Immunogenetics (2000) 51:410–424.
Akashi H. Inferring the fitness effects of DNA mutations from polymorphism and divergence data: statistical power to detect directional selection under stationarity and free recombination. Genetics (1999) 151:221–238.
Andolfatto P. Hitchhiking effects of recurrent beneficial amino acid substitutions in the Drosophila melanogaster genome. Genome Res (2007) 17:1755–1762.
Awadalla P, Charlesworth D. Recombination and selection at Brassica self-incompatibility loci. Genetics (1999) 152:413–425.
Barrier M, Bustamante CD, Yu J, Purugganan MD. Selection on rapidly evolving proteins in the Arabidopsis genome. Genetics (2003) 163:723–733.
Biswas S, Akey JM. Genomic insights into positive selection. Trends Genet (2006) 22:437–446.
Bonin A, Taberlet P, Miaud C, Pompanon F. Explorative genome scan to detect candidate loci for adaptation along a gradient of altitude in the common frog (Rana temporaria). Mol Biol Evol (2006) 23:773–783.
Borrelli L, De Stasio R, Filosa S, Parisi E, Riggio M, Scudiero R, Trinchella F. Evolutionary fate of duplicate genes encoding aspartic proteinases. Nothepsin case study. Gene (2006) 368:101–109.
Braverman JM, Hudson RR, Kaplan NL, Langley CH, Stephan W. The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics (1995) 140:783–796.
Buckling A, Rainey PB. Antagonistic coevolution between a bacterium and a bacteriophage. Proc R Soc Lond B Biol Sci (2002) 269:931–936.
Carginale V, Trinchella F, Capasso C, Scudiero R, Riggio M, Parisi E. Adaptive evolution and functional divergence of pepsin gene family. Gene (2004) 333:81–90.
Charlesworth B, Morgan MT, Charlesworth D. The effect of deleterious mutations on neutral molecular variation. Genetics (1993) 134:1289–1303.
Charlesworth D. Balancing selection and its effects on sequences in nearby genome regions. PLoS Genetics (2006) 2:e64.
Charlesworth D, Charlesworth B, Morgan MT. The pattern of neutral molecular variation under the background selection model. Genetics (1995) 141:1619–1632.
Cheddadi R, Vendramin GG, Litt T, et al, (12 co-authors). Imprints of glacial refugia in the modern genetic diversity of Pinus sylvestris. Glob Ecol Biogeogr (2006) 15:271–282.
Dvornyk V, Sirviö A, Mikkonen M, Savolainen O. Low nucleotide diversity at the pal1 locus in the widely distributed Pinus sylvestris. Mol Biol Evol (2002) 19:179–188.
Fay JC, Wu C-I. Hitchhiking under positive Darwinian selection. Genetics (2000) 155:1405–1413.
Garcia-Gil MR, Mikkonen M, Savolainen O. Nucleotide diversity at two phytochrome loci along a latitudinal cline in Pinus sylvestris. Mol Ecol (2003) 12:1195–1206.
Gonzalez-Martinez SC, Ersoz E, Brown GR, Wheeler NC, Neale DB. DNA sequence variation and selection of tag single-nucleotide polymorphisms at candidate genes for drought-stress response in Pinus taeda L. Genetics (2006) 172:1915–1926.
Grotkopp E, Rejmanek M, Sanderson M, Rost T. Evolution of genome size in pines (Pinus) and its life-history correlates: supertree analyses. Evolution (2004) 58:1705–1729.
Guillet-Claude C, Isabel N, Pelgas B, Bousquet J. The evolutionary implications of knox-I gene duplications in conifers: correlated evidence from phylogeny, gene mapping, and analysis of functional divergence. Mol Biol Evol (2004) 21:2232–2245.
Hall T. Bioedit: a user-friendly biological sequence alignment editor and analysis program for windows 95/98/NT. Nucleic Acids Symp Ser (1999) 41:95–98.
Hedrick PW, Lee RN, Garrigan D. Major histocompatibility complex variation in red wolves: evidence for common ancestry with coyotes and balancing selection. Mol Ecol (2002) 11:1905–1913.
Heuertz M, De Paoli E, Källman T, Larsson H, Jurman I, Morgante M, Lascoux M, Gyllenstrand N. Multilocus patterns of nucleotide diversity, linkage disequilibrium and demographic history of Norway spruce [Picea abies (L.) karst]. Genetics (2006) 174:2095–2105.
Hudson RR, Kaplan NL. The coalescent process in models with selection and recombination. Genetics (1988) 120:831–840.
Hudson RR, Kaplan NL. Deleterious background selection with recombination. Genetics (1995) 141:1605–1617.
Hudson RR, Kreitman M, Aguade M. A test of neutral molecular evolution based on nucleotide data. Genetics (1987) 116:153–159.
Kaplan NL, Hudson RR, Langley CH. The "hitchhiking effect" revisited. Genetics (1989) 123:887–899.
Macpherson JM, Sella G, Davis JC, Petrov DA. Genomewide spatial correspondence between nonsynonymous divergence and neutral polymorphism reveals extensive adaptation in Drosophila. Genetics (2007) 177:2083–2099.
Nordborg M. Coalescent theory. In: Handbook of statistical genetics—Balding D, Bishop M, Cannings C, eds. (2001) Chichester (UK): John Wiley & Sons, Ltd. 179–212.
Pamilo P, Bianchi N. Evolution of the Zfx and Zfy genes: rates and interdependence between the genes. Mol Biol Evol (1993) 10:271–281.
Przeworski M. The signature of positive selection at randomly chosen loci. Genetics (2002) 160:1179–1189.
Pyhäjärvi T, García-Gil MR, Knürr T, Mikkonen M, Wachowiak W, Savolainen O. Demographic history has influenced nucleotide diversity in European Pinus sylvestris populations. Genetics (2007) 177:1713–1724.
Pyhäjärvi T, Salmela MJ, Savolainen O. Colonization routes of Pinus sylvestris inferred from distribution of mitochondrial DNA variation. Tree Genet Genomes (2008) 4:247–254.
Ramos-Onsins SE, Mitchell-Olds T. Mlcoalsim: multilocus coalescent simulations. Evol Bioinform (2007) 2:41–44.
Roth C, Liberles DA. A systematic search for positive selection in higher plants (Embryophytes). BMC Plant Biol (2006) 6:12.
Rozas J, Sanchez-DelBarrio JC, Messeguer X, Rozas R. DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics (2003) 19:2496–2497.
Savolainen O, Pyhäjärvi T. Genomic diversity in forest trees. Curr Opin Plant Biol (2007) 10:162–167.
Schaeffer S. Molecular population genetics of sequence length diversity in the Adh region of Drosophila pseudoobscura. Genet Res (2002) 80:163–175.
Schneider S, Roessli D, Excoffier L. Arlequin ver. 2000: a software for population genetics data analysis (2000) Geneva (Switzerland): University of Geneva.
Simonsen KL, Churchill GA, Aquadro CF. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics (1995) 141:413–429.
Sinclair WT, Morman JD, Ennos RA. The postglacial history of Scots pine (Pinus sylvestris L.) in Western Europe: evidence from mitochondrial DNA variation. Mol Ecol (1999) 8:83–88.
Slotte T. Evolution of flowering time in the tetraploid Capsella bursa-pastoris (Brassicaceae) (2007) Uppsala (Sweden): Acta Universitatis Uppsaliensis, Uppsala University.
Swanson WJ, Clark AG, Waldrip-Dail HM, Wolfner MF, Aquadro CF. Evolutionary EST analysis identifies rapidly evolving male reproductive proteins in Drosophila. Proc Natl Acad Sci USA (2001) 98:7375–7379.
Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics (1989) 123:585–595.
Thompson J, Gibson T, Plewniak F, Jeanmougin F, Higgins D. The Clustal X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res (1997) 24:4876–4882.
Thomson G. The effect of a selected locus on linked neutral loci. Genetics (1977) 85:753–788.
Tiffin P, Hahn MW. Coding sequence divergence between two closely related plant species: Arabidopsis thaliana and Brassica rapa ssp. Pekinensis. J Mol Evol (2002) 54:746–753.
Vasemägi A, Nilsson J, Primmer CR. Expressed sequence tag-linked microsatellites as a source of gene-associated polymorphisms for detecting signatures of divergent selection in Atlantic salmon (Salmo salar L.). Mol Biol Evol (2005) 22:1067–1076.
Wang X-Q, Tank DC, Sang T. Phylogeny and divergence times in Pinaceae: evidence from three genomes. Mol Biol Evol (2000) 17:773–781.
Werner JD, Borevitz JO, Uhlenhaut NH, Ecker JR, Chory J, Weigel D. Frigida-independent variation in flowering time of natural Arabidopsis thaliana accessions. Genetics (2005) 170:1197–1207.
Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci (1997) 13:555–556.