MBE Advance Access originally published online on December 28, 2007
Molecular Biology and Evolution 2008 25(2):438-446; doi:10.1093/molbev/msm273
Research Articles |
Inferring Selection in Partially Sequenced Regions
Department of Molecular Biology and Genetics, Cornell University
E-mail: jjensen{at}ucsd.edu.
| Abstract |
|---|
A common approach for identifying loci influenced by positive selection involves scanning large portions of the genome for regions that are inconsistent with the neutral equilibrium model or represent outliers relative to the empirical distribution of some aspect of the data. Once identified, partial sequence is generated spanning this more localized region in order to quantify the site-frequency spectrum and evaluate the data with tests of neutrality and selection. This method is widely used as partial sequencing is less expensive with regard to both time and money. Here, we demonstrate that this approach can lead to biased maximum likelihood estimates of selection parameters and reduced rejection rates, with some parameter combinations resulting in clearly misleading results. Most significantly, for a commonly used sample size in Drosophila population genetics (i.e., n = 12), the estimate of the target of selection has a large mean square error and the strength of selection is severely underestimated when the true selected site has not been sampled. We propose sequencing approaches that are much more likely to accurately localize the target and estimate the strength of selection. Additionally, we examine the performance of a commonly used test of selection under a variety of recurrent and single sweep models.
Key Words: selective sweeps; natural selection; composite likelihood; recurrent selection
| Introduction |
|---|
There is considerable interest in using population genetic approaches to identify regions of the genome that underlie population- or species-specific adaptations. These approaches can also be used to address basic evolutionary questions such as the relative importance of adaptive and demographic factors in shaping patterns of genome variability. The rapid increase in our ability to survey population level nucleotide variability for larger sample sizes and for larger portions of the genome yields increasing statistical power to distinguish among alternative population genetic models. At the same time, because more tests are being performed by each study (with more power per test), the chance of identifying false positives also increases dramatically.
Methods for identifying regions influenced by positive selection from sequence data rely on the expectation that the substitution of a strongly selected advantageous mutation alters the frequencies of linked neutral variation (Maynard Smith and Haigh 1974; Kaplan et al. 1989; Stephan et al. 1992). These approaches can generally be divided into 2 classes. The first involves a scan in which outlier loci are identified that are not compatible with neutrality under some plausible demographic model (e.g., Schlotterer 2002; Kauer et al. 2003; Storz et al. 2004; Tenaillon et al. 2004; Altshuler et al. 2005; Bauer DuMont and Aquadro 2005; Ometto et al. 2005; Stajich and Hahn 2005; Wright et al. 2005). A related approach to detect selected loci has been to summarize the empirical, genome-wide background site-frequency spectrum from which outliers are identified (e.g., Nielsen et al. 2005; Williamson et al. 2005), though model-based comparisons are often necessary in order to assess significance. An important addition to this framework has been the ability to correct for the ascertainment bias introduced from choosing loci based on the presence of "sweep-like" characteristics (Thornton and Jensen 2007).
By identifying markers with skewed distributions or decreased variation, subsequent sequencing studies may be directed in order to determine if the observed patterns are consistent with a sweep hypothesis (e.g., Bauer DuMont and Aquadro 2005; Beisswanger et al. 2006; Pool et al. 2005; Jensen et al. 2007). Although partial sequencing is often used to quickly screen these large regions identified as being near putative selective sweeps and to better localize the target, the optimal way to sample these identified regions has not been systematically investigated.
We examine both models in which the age of a single selective sweep is fixed and known, as well as a model of recurrent selective sweeps. In the case of the former, we here assume that the departure originally detected, providing the motivation for regional localization, truly represents selection. As such, we suggest that these results be used in combination with the genome scan localization procedure of Thornton and Jensen (2007). We ask how best to sample these localized regions in order to obtain accurate estimates of selection parameters as well as provide available methods with enough information to reject neutrality. In the case of the latter, we assume that a randomly selected region has been sequenced and we determine the power of existing tests to reject neutrality when there is a background rate of selective sweeps in which advantageous mutations are uniformly distributed across a chromosome. We examine a wide range of parameter combinations, including those that are relevant for both Drosophila and humans. This analysis suggests that modifications to current strategies for sampling regions believed to be shaped by a selective sweep can lead to a greater accuracy of parameter estimates.
| Methods |
|---|
Modeling Selective Sweeps
We model positive selection using coalescent simulations for a region of M nucleotides, as described in equations 1–7 of Thornton and Jensen (2007). At time τ in the past (measured in units of 4N generations), a beneficial allele has fixed in the population at position X. For cases where the selected site is within the region, 1 ≤ X ≤ M. For models of recurrent sweeps (see below), X may lie outside the M nucleotides.
In addition to what was implemented in Thornton and Jensen (2007), we also simulate stochastic trajectories of beneficial alleles, conditioning on their reaching fixation in the population (Coop and Griffiths 2004; Przeworski et al. 2005). For a beneficial mutation at frequency x at time t, x jumps to either of two values over a short time step [expressions missing in source]. The term µ(x) is the infinitesimal mean change in allele frequency of the conditional process. For the case of genic selection considered here and conditional on the ultimate fixation of the beneficial mutation, µ(x) has a closed form [expression missing in source].
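The trajectory step described above can be sketched in code. This is an illustration under stated assumptions, not the authors' implementation: time is assumed to be measured in units of 2N generations, the two jump targets are assumed to be x + µ(x)δt ± sqrt(x(1 − x)δt) taken with equal probability, and µ(x) is assumed to be αx(1 − x)/tanh(αx) with α = 2Ns, which is the drift of a genic-selection diffusion conditioned on fixation under that scaling. The function names, the default step size, and the clamping at the boundaries are hypothetical choices.

```python
import math
import random


def conditioned_drift(x, alpha):
    """Assumed infinitesimal mean of the fixation-conditioned process (alpha > 0)."""
    return alpha * x * (1.0 - x) / math.tanh(alpha * x)


def simulate_sweep_trajectory(alpha, pop_size, dt=None, rng=None,
                              max_steps=10_000_000):
    """Forward-simulate an allele frequency path from 1/(2N) towards fixation.

    Returns the list of visited frequencies. Frequencies are clamped to
    [1/(2N), 1] because the discretised jump can overshoot either boundary.
    """
    rng = rng or random.Random()
    x_min = 1.0 / (2 * pop_size)
    x_max = 1.0 - x_min
    dt = dt if dt is not None else 1.0 / (4 * pop_size)

    x = x_min
    path = [x]
    for _ in range(max_steps):
        if x >= x_max:  # treated as fixed
            break
        jump = math.sqrt(x * (1.0 - x) * dt)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        x_new = x + conditioned_drift(x, alpha) * dt + sign * jump
        x = min(1.0, max(x_min, x_new))
        path.append(x)
    return path


if __name__ == "__main__":
    path = simulate_sweep_trajectory(alpha=1000, pop_size=10_000,
                                     rng=random.Random(1))
    print(len(path), path[0], path[-1])
```

Reversing the returned path gives the backward-in-time trajectory that a structured coalescent simulation would condition on.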
Recurrent Selective Sweeps
We also considered a model of selective sweeps occurring in the genome at a rate determined by the expected number of sweeps per recombination unit in the last 4N generations (Kaplan et al. 1989; Braverman et al. 1995). Our implementation follows that described in Przeworski (2002), with 2 modifications. First, the allele frequency trajectory of the selected site is determined stochastically, as described above. Second, we allow for selective sweeps both within the region of M nucleotides and at linked sites. We do this because we simulate relatively large neutral regions (M = 10^4), and the probability of a sweep within that region may not be negligible when the sweep rate is large, assuming a constant rate across the genome. Similarly, it is important to consider sweeps outside of the M nucleotides as they will impact patterns of variation within the region under investigation. In this model, the time until the next selective phase is entered is exponentially distributed with a rate [expression missing in source] that involves ρ_bp, the scaled recombination rate between adjacent base pairs. Given that a selective phase is entered, the selected site is located within the M nucleotides with a probability [expression missing in source]; otherwise it is located at a linked site up to a maximum genetic distance [expression missing in source] on either side of the sampled region (see Kaplan et al. 1989; Durrett and Schweinsberg 2004, for details).
We estimated the power to reject the equilibrium neutral model using 2 sample sizes (n = 12 and 50) and 90 parameter combinations generated by considering all combinations of the model parameters [symbols and values missing in source]. These parameters cover cases where we expect hitchhiking effects to be minimal (sweep rate of 10^-7, 2Ns = 100) to those where the effect should be substantial (sweep rate of 10^-5, 2Ns = 5,000). For these simulations, we used N = 10^6.
For each simulated replicate, we also calculated P values for the D statistic of Tajima (1989) and the H statistic of Fay and Wu (2000), using 1-tailed tests (of the lower tail) for both statistics. In order to make the power estimates of D and H comparable with those from the composite likelihood ratio test (CLRT), we assumed that
is known precisely.
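For readers who want to reproduce the two summary statistics, a minimal sketch follows. It computes Tajima's D and Fay and Wu's H from an unfolded site-frequency spectrum using the standard published definitions; the function names and the input layout (a list whose entry i − 1 counts sites at which the derived allele occurs i times in a sample of n chromosomes) are assumptions made for this illustration, and the sketch is not the software used in the study.

```python
import math


def theta_pi(sfs):
    """Mean pairwise difference from an unfolded SFS (sfs[i-1] = sites with i derived copies)."""
    n = len(sfs) + 1
    pairs = n * (n - 1) / 2.0
    return sum(i * (n - i) * c for i, c in enumerate(sfs, start=1)) / pairs


def theta_h(sfs):
    """Fay and Wu's estimator, which weights high-frequency derived alleles."""
    n = len(sfs) + 1
    pairs = n * (n - 1) / 2.0
    return sum(i * i * c for i, c in enumerate(sfs, start=1)) / pairs


def fay_wu_h(sfs):
    """H = theta_pi - theta_H; strongly negative values indicate excess high-frequency derived alleles."""
    return theta_pi(sfs) - theta_h(sfs)


def tajimas_d(sfs):
    """Tajima's D for a sample of n >= 4 chromosomes; NaN when there are no segregating sites."""
    n = len(sfs) + 1
    s = sum(sfs)
    if s == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (theta_pi(sfs) - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    singleton_heavy = [20] + [0] * 10  # n = 12, all variants are singletons
    print(tajimas_d(singleton_heavy), fay_wu_h(singleton_heavy))
```

A spectrum dominated by rare alleles, as in the usage lines, gives a negative D, which is why the lower tail is the relevant rejection region.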
Sampling
In order to evaluate the effects of partial data for the single sweep data sets, a number of sampling schemes were evaluated. First, 1,000 replicates of complete 10-kb data sets were simulated for n = 50, 2Ns = 0, 100, 500, and 1,000; ρ_bp = 0.05 and 0.1; θ = 15 and 75; and sweep ages τ = 0.001, 0.01, and 0.02 (in units of 4N generations). Then, using these data, partial data sets were parsed in 4 configurations: 5 or 2.5 kb of sequence distributed across the 10-kb region, including for each a scenario in which the selected site does and does not fall in a sampled region (fig. 1). In all cases, the target of selection is at position X = 5,000, and there is sequencing on both sides of the target. These parameters were chosen for their relevance to a significant portion of the Drosophila melanogaster genome (e.g.,
ρ_bp = 0.05 means a recombination rate of 1.25 × 10^-8/base pair/generation over a 10-kb region for Ne = 1 × 10^6, and θ = 75 means µ = 1.87 × 10^-9/base pair/generation over a 10-kb region for Ne = 1 × 10^6). Additionally, the size of the region (10 kb) was chosen as it encompasses perturbations of the site-frequency spectrum produced after a selective sweep, for the values of 2Ns here considered (Kim and Stephan 2002).
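The conversion between per-base-pair rates and the scaled parameters quoted above can be checked with a few lines of code. This sketch assumes the usual diploid conventions, namely a per-base-pair scaled recombination rate of 4·Ne·r and a region-wide scaled mutation rate of 4·Ne·µ·L; the function names are hypothetical.

```python
def scaled_recombination_per_bp(n_e, r_per_bp):
    """Scaled recombination rate between adjacent base pairs, assumed to be 4 * Ne * r."""
    return 4 * n_e * r_per_bp


def scaled_mutation_for_region(n_e, mu_per_bp, region_bp):
    """Region-wide scaled mutation rate, assumed to be 4 * Ne * mu * L."""
    return 4 * n_e * mu_per_bp * region_bp


def per_bp_rate_from_scaled(n_e, scaled_value, region_bp=1):
    """Invert either scaling to recover a per-base-pair, per-generation rate."""
    return scaled_value / (4 * n_e * region_bp)


if __name__ == "__main__":
    n_e = 1_000_000
    # Recombination: 1.25e-8 per bp per generation maps to roughly 0.05.
    print(scaled_recombination_per_bp(n_e, 1.25e-8))
    # The inverse mapping recovers the per-bp rate from the scaled value.
    print(per_bp_rate_from_scaled(n_e, 0.05))
```

The same inverse function applies to a region-wide θ by passing the region length in base pairs as `region_bp`.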
In order to replicate a likely empirical approach, we simulated a second round of sequencing by adding data around the predicted target and then reanalyzing the data set. This was done by ensuring that the predicted target had at least 0.5 kb of sequence on either side, which, depending on where the prediction was made relative to the initial segments, meant adding anywhere between 0.5 and 1 kb of new data.
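The amount of new sequence implied by this rule depends only on how the 1-kb window around the predicted target overlaps the fragments already in hand. A small sketch of that bookkeeping is given below; the fragment coordinates are hypothetical (the actual layout is shown in fig. 1), intervals are treated as half-open, and the existing fragments are assumed not to overlap one another.

```python
def new_sequence_needed(fragments, predicted_target, flank=500,
                        region=(0, 10_000)):
    """Base pairs of new sequence needed so the target has `flank` bp on each side.

    `fragments` is a list of non-overlapping half-open (start, end) intervals
    that have already been sequenced.
    """
    window_start = max(region[0], predicted_target - flank)
    window_end = min(region[1], predicted_target + flank)
    already_covered = 0
    for start, end in fragments:
        overlap = min(end, window_end) - max(start, window_start)
        already_covered += max(0, overlap)
    return (window_end - window_start) - already_covered


if __name__ == "__main__":
    # Hypothetical layout: five 0.5-kb fragments spread over a 10-kb region.
    fragments = [(0, 500), (2375, 2875), (4750, 5250), (7125, 7625), (9500, 10_000)]
    print(new_sequence_needed(fragments, 5000))  # target inside a fragment
    print(new_sequence_needed(fragments, 6200))  # target in an unsequenced gap
```

With this layout, a prediction in the middle of an existing fragment requires 0.5 kb of new data and a prediction far from any fragment requires the full 1 kb, matching the range given in the text.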
Statistics
Let α̂ and X̂ be the maximum likelihood estimates (MLEs) of the strength of selection parameter (α = 2Ns) and the target of selection, respectively. These parameter estimates are found via maximization of the composite likelihood function of Kim and Stephan (2002), one term of which is given by equation (5) of Kim and Stephan (2002) [likelihood expressions missing in source].
Two statistics were utilized to evaluate the MLEs of X and α. First, in order to measure any biases in the predicted location of selection introduced by partial sampling, relative bias (RB) was determined from 1,000 replicates, conditional on rejecting neutrality, as:
[equation missing in source]
Second, in order to measure deviations from the expected values, the relative mean square error (RMSE) was determined as:
[equation missing in source]
The RB and RMSE of α̂ were determined in an identical way.
| Results |
|---|
Rejecting Neutrality in Favor of Selection—Single Sweep Model
Applying the CLRT to our partial and complete data sets, we see that with less data the null is rejected less often (supplementary table 1, Supplementary Material online). For high recombination (ρ_bp = 0.1) and θ = 75, with a complete 10 kb of sequence, the neutral model is rejected in favor of the sweep model in 95–97% of simulated sweep data sets when the strength of selection α = 2Ns is very large (α ≥ 500), and in approximately 77–82% of cases when α = 100 for very recent sweeps (τ = 0.001 in units of 4N generations). Predictably, as τ increases, these rejection rates decrease (supplementary table 2, Supplementary Material online [τ = 0.01] and supplementary table 3, Supplementary Material online [τ = 0.02]). In partially sequenced regions when the target of selection has been sampled, rates of rejection are nearly equivalent, except for α = 100. When the target has not been sampled, these rejection rates are uniformly lower, with neutrality rejected approximately 92%, 83%, and 19% of the time for α = 1,000, 500, and 100, respectively, for the 5-kb data set, where n = 50, θ = 75, and τ = 0.001. The primary factor determining rejection remains whether or not the site of selection has been sequenced (supplementary tables 1–3, Supplementary Material online). Thus, although the ability to detect selection is diminished under all partial sampling schemes, the effect is simply to make the test more conservative with respect to rejecting neutrality.
Rejecting Neutrality in Favor of Selection—Recurrent Sweep Model
The simulation results presented above assume a single selective sweep fixing at time τ in the past. For considering the power of the CLRT when applied to genome scan data, it is appropriate to consider a model where τ is a random variable determined by the rate of sweeps in the genome (per recombination unit per 4N generations), α = 2Ns, and ρ = 4Nr.
The parameters of this model have important implications. If the rate of sweeps is high, then there may be many recent sweeps across the genome which existing methods could have power to detect. However, if the rate is this great, then there is an appreciable probability that sweeps are occurring on already swept backgrounds. This multiple-sweep effect will result in very different patterns in the site-frequency spectrum (Kim 2006). If the rate of sweeps is low, then many sweeps will be old enough that patterns of variability will have recovered (Przeworski 2002). As a consequence, the CLRT has low power to reject the null model, unless both the rate of sweeps and the strength of selection are large (e.g., fig. 2). Further, Tajima's D was observed to be generally more powerful than the CLRT and the power of Fay and Wu's H was never estimated to be greater than 10% (supplementary table 4, Supplementary Material online). These results are qualitatively similar to those of Przeworski (2002). Further, power was higher in regions of low recombination (fig. 2, supplementary table 4 [Supplementary Material online]) and increased with larger sample size. Parameter combinations for which a test's power is observed to exceed 0.5 are noted in bold on supplementary table 4 (Supplementary Material online).
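Power in this setting is simply the fraction of replicates simulated under selection whose statistic falls in the lower-tail rejection region defined by neutral simulations. A minimal sketch of that calculation follows; the function names, the 5% significance level, and the use of an empirical lower-tail P value are assumptions made for illustration rather than a description of the authors' code.

```python
from bisect import bisect_right


def lower_tail_p_value(observed, sorted_null):
    """Fraction of null statistics that are less than or equal to the observed value."""
    return bisect_right(sorted_null, observed) / len(sorted_null)


def estimate_power(null_stats, sweep_stats, level=0.05):
    """Fraction of sweep replicates rejected by a 1-tailed (lower tail) test."""
    sorted_null = sorted(null_stats)
    rejections = sum(
        lower_tail_p_value(stat, sorted_null) <= level for stat in sweep_stats
    )
    return rejections / len(sweep_stats)


if __name__ == "__main__":
    null_stats = list(range(100))   # stand-in for statistics from neutral simulations
    sweep_stats = [0, 4, 5, 50]     # stand-in for statistics from sweep simulations
    print(estimate_power(null_stats, sweep_stats))
```

The same function applies to D, H, or any other statistic for which small values indicate a departure in the direction expected under a sweep.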
Inferring the Target of Selection
Among the single sweep data sets for which the CLRT rejected neutrality in favor of selection, we evaluated the accuracy of target prediction as measured by the RMSE, as well as the RB in the MLEs of the target of selection (as described in the Methods section). When a high recombination (ρ_bp = 0.1) region is fully sequenced, θ = 75, and the sweep is very recent (τ = 0.001), the estimate of the target is within the correct 1-kb window that encompasses the true target with probability 0.89, 0.87, and 0.84 for α = 1,000, 500, and 100, respectively, for n = 12 (representative cases illustrated in fig. 3). In the 5-kb partially sequenced regions in which the target has been sampled, these probabilities are similar except for low α, in which the probability drops to around 0.65, regardless of the sample size. When the target has not been sampled, however, the situation is considerably different. For a commonly used sample size (n = 12), very large selection coefficients (α = 500, 1,000), recent sweeps, and having sequenced regions immediately flanking the true target, the MLE only has a probability of roughly 1/3 of being within the correct 1-kb window. Figure 3 visualizes these results for a subsample of our data. Full results across all parameter combinations are presented in supplementary tables 1–3 (Supplementary Material online).
As the partial data sets are simply subsamples of the complete data sets, it is possible to examine directly the benefits of complete versus incomplete sequencing. For example, figure 4 summarizes the improvement in the MLE of the target of selection of a complete, 10-kb data set over a data set in which only half of the region has been sequenced (5 kb), but the true target of selection (at position 5 kb) has not been sampled. Consistent with the RMSEs presented in supplementary tables 1–3 (Supplementary Material online), we see a wide range of target predictions when the site of selection has not been sampled and a relatively small range in the complete data set. In order to further explore this issue, we selected a small number of scenarios and fixed the number of segregating sites between the complete and partial data sets in order to determine if the performance is based simply on the fact that the complete data sets have approximately twice the number of segregating sites as the 5/10 kb data sets. Under this scheme, we observed results that are very similar to our fixed θ results presented above. We note, however, that this example is illustrative only because fixing S creates the problem that Pr(S|θ) would be drastically different between the partial and complete data sets. The average number of segregating sites produced under each set of parameters is given in supplementary tables 1–3 (Supplementary Material online).
Examining the relative bias, we observe no significant skew in the prediction of the location of the target under any sampling scenario (supplementary tables 1–3, Supplementary Material online). In order to evaluate whether the performance in these complete 10-kb data sets was being maximized by sequencing symmetrically around the target, we also evaluated otherwise identical data sets with the target at position 1 kb rather than 5 kb. There were no significant differences with regard to either RB or RMSE.
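The two accuracy summaries used throughout this section can be sketched as follows. Because the defining expressions did not survive in this copy of the Methods, the forms below are assumptions: RB is taken as the mean deviation of the estimates from the true value divided by the true value, and RMSE as the mean squared deviation from the true value divided by the square of the true value. Function names are hypothetical.

```python
def relative_bias(estimates, true_value):
    """Assumed form: (mean of estimates - true value) / true value."""
    mean_estimate = sum(estimates) / len(estimates)
    return (mean_estimate - true_value) / true_value


def relative_mean_square_error(estimates, true_value):
    """Assumed form: mean squared deviation from the truth / true value squared."""
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return mse / true_value ** 2


if __name__ == "__main__":
    # Toy target estimates (bp) for a true target at position 5,000.
    target_estimates = [4000, 5000, 6000]
    print(relative_bias(target_estimates, 5000))
    print(relative_mean_square_error(target_estimates, 5000))
```

In the toy example the estimates are symmetric about the truth, so the RB is zero while the RMSE is not; under these assumed forms a small bias can coexist with a large error, which is the pattern reported for the unsampled-target data sets.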
In order to better replicate a typical empirical approach, we examined a sample of the above described scenarios (n = 12, ρ_bp = 0.1, θ = 75, and τ = 0.001) to determine the extent to which target prediction is improved by "resequencing" around the predicted target (fig. 5, supplementary table 5 [Supplementary Material online]). For the data sets consisting of five 0.5-kb regions, we added a sixth fragment encompassing the predicted target (by taking it from the corresponding 10-kb data set), both for scenarios in which the true target has, and has not, been sampled. Note that we simply ensure that there is 1 kb of data surrounding the predicted target, so, depending on whether this happens to overlap with an existing fragment, this additional data could represent between 0.5 and 1 kb of new sequence (see Methods).
There are 3 observations of particular interest. First, there is a strong correlation between the target predictions from the first and second rounds of sampling, particularly when the true target was not originally sampled. This is owing to the fact that the second sampling does not represent an independent draw; rather, it is simply an addition of a relatively small amount of data. Second, in data sets in which the true target was not originally sampled, this additional sequencing makes a measurable improvement in a proportion of replicates. This is shown clearly in figure 5 by the horizontal grouping centered around 5 kb, demonstrating a wide range of primary target predictions and more accurate secondary MLEs. However, it is worth noting that the improvement seen by resequencing is scarcely comparable with the accuracy associated with complete sequence, where the RMSE for X̂ is 0.0548 for the resequenced data set and 0.0083 for the complete data sets, when α = 500 (supplementary table 5, Supplementary Material online). Finally, the MLEs are not investigated under the recurrent selection model as localization would not be attempted if the pattern of hitchhiking was not initially detectable. As shown in the power analysis (supplementary table 4, Supplementary Material online), the probability of rejection under recurrent hitchhiking models rarely exceeds 10% for the CLRT. In the cases where rejections do occur, the same limitations of partial sequencing for target site estimation are expected as were described for the single sweep model.
Estimating the Strength of Selection
Evaluating the MLEs for data sets in which neutrality was rejected in favor of the selection model, we determine the RB in the estimated strength of selection (supplementary tables 1–3, Supplementary Material online). We observe a stark underestimate of α under all scenarios of partial sequencing (i.e., the RB for α̂ is nearly always negative regardless of recombination rate, whether the target has been sampled, sampling scheme, or sample size). In regions of high recombination for the complete 10-kb data sets, we observe only a small RB in these estimates across all sample sizes and selection coefficients. As τ increases, however, this bias becomes increasingly negative, owing to the assumption of the CLRT that the sweep has just ended. However, the RMSE on these estimates remains large even in the completely sequenced data sets. We observe similar relative biases across all partial sequencing scenarios, with relatively little difference between samples in which the target has and has not been sampled. The variance of the estimate may be decreased slightly by having a larger sample size (note that the numerator of the RMSE expression is equivalent to the variance of the estimator, thus a smaller RMSE implies a smaller variance). As with rates of rejection and the MLE of X, the performance is consistently, though mildly, worse across all scenarios when the recombination rate of the region is reduced by half (supplementary tables 1–3, Supplementary Material online).
Application to Data
The challenges associated with target site prediction are illustrated by 2 recent experimental data sets. First is the putative sweep around the wapl region of D. melanogaster, which was inferred from partial data (roughly 6 kb of total data distributed in 12 fragments across a 110-kb region for a sample size of n = 12; Beisswanger et al. 2006). We evaluated the ability of the MLE to accurately estimate the location of the target of selection by generating, via parametric bootstrap, 1,000 sweep replicates using the African parameters given in Beisswanger et al. (2006) (location of sequenced regions, mutation and recombination parameters, the selection coefficient, and the target of selection). We found that target prediction is very poor in this case, with only a 20% chance of the target being placed within the correct 10-kb window and a 2% chance of being in the proper 1-kb window. We note that the 95% confidence interval (constructed using the percentile method) on their estimate of the target spans nearly 68 kb for τ = 0, or 65% of the region (fig. 6). We also see a grouping of target predictions to fragments where sequence data exist, emphasizing that there is no information about X where data are missing. For comparison, we also set the age of the sweep (τ) to the minimum value necessary to be consistent with their ancestral sweep hypothesis (τ = 0.019 in units of 4N generations, based on Bayesian estimates of the colonization time presented in Thornton and Andolfatto 2006). In this case, the 95% confidence intervals span 90 kb or approximately 81% of the region examined.
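The percentile-method interval and the window-placement probabilities reported above can both be computed directly from the bootstrap estimates of the target. The sketch below is illustrative only: the function names are hypothetical, the percentile indices use a simple floor and ceiling rule, and "the correct window" is assumed to mean the member of a set of consecutive, non-overlapping windows (starting at coordinate 0) that contains the true target.

```python
import math


def percentile_interval(bootstrap_estimates, coverage=0.95):
    """Percentile-method interval: the central `coverage` share of sorted estimates."""
    ordered = sorted(bootstrap_estimates)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0
    lower_index = int(math.floor(tail * (n - 1)))
    upper_index = int(math.ceil((1.0 - tail) * (n - 1)))
    return ordered[lower_index], ordered[upper_index]


def fraction_in_true_window(estimates, true_target, window_bp):
    """Share of estimates that fall in the same fixed window as the true target."""
    true_bin = int(true_target // window_bp)
    hits = sum(int(e // window_bp) == true_bin for e in estimates)
    return hits / len(estimates)


if __name__ == "__main__":
    # Toy bootstrap estimates of the target position (bp).
    estimates = [4500, 5200, 5999, 6000, 9000]
    print(percentile_interval(estimates))
    print(fraction_in_true_window(estimates, true_target=5500, window_bp=1000))
```

The width of the returned interval, divided by the length of the surveyed region, gives the "fraction of the region spanned" figure used in the text.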
The putative sweep downstream of the Notch locus in D. melanogaster (Bauer DuMont and Aquadro 2005
Based on these combined results, we propose that parametric bootstrapping to obtain confidence intervals is appropriate for quantifying uncertainty in parameter estimates and is informative when presenting and interpreting results from the CLRT. We note that as the CLRT is widely used in tandem with a recently proposed goodness-of-fit test (Jensen et al. 2005
| Discussion |
|---|
Simulations were used to investigate the effects of different sequencing sampling strategies on the ability to detect signatures of hitchhiking along a recombining chromosome, particularly using the CLRT proposed by Kim and Stephan (2002)
and
, the signature of selection observable in the data is a reduction in diversity and an excess of rare alleles, rather than an excess of high-frequency derived alleles. Only very recent sweeps appear to be detectable using the CLRT because for older sweeps, the pattern of variation will have recovered somewhat, as noted by Przeworski (2002).
The cases where D had high power to reject the null model were for high rates of strong sweeps in regions of low recombination. It is useful to consider what the rate of sweeps must be in order for the power to reject the null model to be high. For the case of
= 5,000,
= 10–5, and
= 10 (i.e., very strong and very common), sweeps are occurring on average every
0.008 time units (4N generations) for the 10
7 bp region examined here, and Tajima's D rejects the null model 87.7% of the time (for large sample sizes in regions of relatively low recombination; fig. 8). Such frequent and strong sweeps would have nearly chromosome-wide effects on levels of variability (Braverman et al. 1995), and it remains to be determined if such a large mutation rate to strongly selected mutations is biologically reasonable. A further discussion is found in Thornton et al. (2007).
For single, recent selective sweeps, we found that the Kim and Stephan (2002)
Additionally, whereas sampling only partial segments across the region leads to lower rates of rejection and higher RMSEs for both the estimated strength of selection (α̂) and the estimated target (X̂), the principal factor dictating performance remains whether the target has been sampled. Thus, smaller data sets are shown to be undesirable if for no other reason than this effectively decreases the probability of sampling the target. With regard to sample size, although n = 12 summarizes the site-frequency spectrum sufficiently to provide accurate MLEs in a complete 10-kb data set, and does reasonably well in partial data sets in which the target has been sampled, there is a marked difference between small and large sample sizes when the target has not been sequenced. Importantly, it is unwise to reason that the target has been placed accurately just because it falls within a sequenced segment.
By adding an additional sequenced fragment encompassing the initially predicted target of selection, we examined the relative benefit of follow-up sequencing aimed at refining the true target's location. We observe a strong correlation between primary and secondary predictions, though we note a marked improvement in a small proportion of the resequenced data sets. Thus, the addition of more data around the first estimated target, X̂, particularly when the target was not originally sampled, leads to small improvements but is far less reliable than an initial analysis based on complete sequencing.
Thus, although partial sequencing has often been employed for reasons both financial and practical, we demonstrate that when regions are localized through initial marker screens, complete sequencing offers far superior results in terms of the probability of rejecting neutrality in favor of selection, as well as in estimating the selection coefficient and target of selection. Although this proposal may seem sequence intensive, we note that the approach that comes a distant second with regard to all of these measures (sequencing half of the region for n = 50) represents more than a 2-fold increase in data generation given that complete sequencing performs well for sample sizes of n = 12 (e.g., 5 kb for n = 50 represents 250 kb of total sequencing vs. 10 kb for n = 12 that represents 120 kb).
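The sequencing-effort comparison in the closing example is simple arithmetic, reproduced here as a short check. The function name is hypothetical; the inputs are the two designs named in the text.

```python
def total_sequencing_kb(kb_per_line, n_lines):
    """Total sequence generated: length surveyed per line times the number of lines."""
    return kb_per_line * n_lines


partial_large_sample = total_sequencing_kb(5, 50)    # half the region, n = 50
complete_small_sample = total_sequencing_kb(10, 12)  # full region, n = 12
fold_increase = partial_large_sample / complete_small_sample

if __name__ == "__main__":
    print(partial_large_sample, complete_small_sample, round(fold_increase, 2))
```

The ratio is slightly above 2, which is the "more than a 2-fold increase" referred to above.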
| Supplementary Material |
|---|
Supplementary tables 1–5 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
We appreciate fruitful discussion with the Aquadro lab, particularly Vanessa Bauer DuMont, as well as comments on the manuscript from Yuseob Kim and several reviewers. This research was supported by National Institutes of Health grant GM36431 to C.F.A., National Science Foundation grant DMS-0201037 to R. Durrett, C. F. Aquadro, and R. Nielsen, a Sloan postdoctoral fellowship in Computational Molecular Biology to K.R.T, and a National Science Foundation postdoctoral fellowship in Biological Informatics to J.D.J.
| Footnotes |
|---|
1 Present address: Department of Ecology, Behavior and Evolution, University of California, San Diego.
2 Present address: Department of Ecology and Evolution, University of California, Irvine.
Michael Nachman, Associate Editor
| References |
|---|
Bauer DuMont V, Aquadro CF. Multiple signatures of positive selection downstream of notch on the X chromosome in Drosophila melanogaster. Genetics (2005) 171:639–653.
Beisswanger S, Stephan W, De Lorenzo D. Evidence for a selective sweep in the wapl region of Drosophila melanogaster. Genetics (2006) 172:265–274.
Braverman JM, Hudson RR, Kaplan NL, Langley CH, Stephan W. The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics (1995) 140:783–796.
Coop G, Griffiths RC. Ancestral inference on gene trees under selection. Theor Popul Biol (2004) 66:219–232.
Durrett R, Schweinsberg J. Approximating selective sweeps. Theor Popul Biol (2004) 66:129–138.
Ewens W. Mathematical Population Genetics I. Theoretical Introduction (2004) 2nd. Springer-Verlag, New York.
Fay J, Wu C-I. Hitchhiking under positive Darwinian selection. Genetics (2000) 155:1405–1413.
Jensen JD, Bauer DuMont V, Ashmore AB, Gutierrez A, Aquadro CF. Patterns of variability and divergence at the diminutive gene region of Drosophila melanogaster. Genetics (2007) 177:832–840.
Jensen JD, Kim Y, Bauer DuMont V, Aquadro CF, Bustamante CD. Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics (2005) 170:1401–1410.
Kaplan NL, Hudson RR, Langley CH. "The hitchhiking effect" revisited. Genetics (1989) 123:887–899.
Kauer MO, Dieringer D, Schlotterer C. A microsatellite variability screen for positive selection associated with the "out of Africa" habitat expansion of Drosophila melanogaster. Genetics (2003) 165:1137–1148.
Kim Y. Allele frequency distribution under recurrent selective sweeps. Genetics (2006) 172:1967–1978.
Kim Y, Stephan W. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics (2002) 160:765–777.
Maynard Smith J, Haigh J. The hitch-hiking effect of a favorable gene. Genet Res (1974) 23:23–35.
Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, Bustamante CD. Genomic scans for selective sweeps using SNP data. Genome Res (2005) 15:1566–1575.
Ometto L, Glinka S, De Lorenzo D, Stephan W. Inferring the effects of demography and selection on Drosophila melanogaster populations from a chromosome-wide scan of DNA variation. Mol Biol Evol (2005) 22:2119–2130.
Pool JE, Bauer DuMont V, Mueller JL, Aquadro CF. A scan of molecular variation leads to the narrow localization of a selective sweep affecting both afrotropical and cosmopolitan populations of Drosophila melanogaster. Genetics (2005) 172:1093–1105.
Przeworski M. The signature of positive selection at randomly chosen loci. Genetics (2002) 160:1179–1189.
Przeworski M, Coop G, Wall JD. Signatures of positive selection on standing variation. Evolution (2005) 59:2312–2323.
Schlotterer C. A microsatellite-based multilocus screen for the identification of local selective sweeps. Genetics (2002) 160:753–763.
Stajich ES, Hahn MW. Disentangling the effects of demography and selection in human history. Mol Biol Evol (2005) 22:63–73.
Stephan W, Wiehe THE, Lenz MW. The effect of strongly selected substitutions on neutral polymorphism: analytical results based on diffusion theory. Theor Popul Biol (1992) 41:237–254.
Storz JF, Payseur BA, Nachman MW. Genome scans of DNA variability in humans reveal evidence for selective sweeps outside of Africa. Mol Biol Evol (2004) 21:1800–1811.
Tajima F. Statistical method for testing the neutral mutation hypothesis. Genetics (1989) 123:437–460.
Tenaillon MI, U'Ren J, Tenaillon O, Gaut BS. Selection versus demography: a multilocus investigation of the domestication process in maize. Mol Biol Evol (2004) 21:1214–1225.
The International HapMap Consortium. A haplotype map of the human genome. Nature (2005) 437:1299–1320.
Thornton KR, Andolfatto P. Approximate bayesian inference reveals evidence for a recent, severe, bottleneck in non-African populations of Drosophila melanogaster. Genetics (2006) 172:1607–1619.
Thornton KR, Jensen JD. Controlling the false-positive rate in multilocus genome scans for selection. Genetics (2007) 175:737–750.
Thornton KR, Jensen JD, Becquet C, Andolfatto P. Progress and prospects in mapping recent selection in the genome. Heredity (2007) 98:340–348.
Williamson SH, Hernandez R, Fledel-Alon A, Zhu L, Nielsen R, Bustamante CD. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc Natl Acad Sci USA (2005) 102:7882–7887.
Wright SI, Bi IV, Gaut BS. The effect of artificial selection on the maize genome. Science (2005) 308:1310–1314.
. Note that we have indicated the positions of their sequenced fragments, as well as the number of segregating sites observed in each region.
. Note that the entire 10.5-kb region was completely sequenced in all lines (n = 15) for this USA population sample, and 147 segregating sites were observed.
= 1/2N,
, and
. The estimates of power are taken from supplementary table 4 (Supplementary Material online) for the case n = 50,