Skip Navigation


MBE Advance Access originally published online on November 2, 2007
Molecular Biology and Evolution 2008 25(1):199-206; doi:10.1093/molbev/msm239
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
25/1/199    most recent
msm239v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Johnson, P. L. F.
Right arrow Articles by Slatkin, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Johnson, P. L. F.
Right arrow Articles by Slatkin, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Accounting for Bias from Sequencing Error in Population Genetic Estimates

Philip L. F. Johnson* and Montgomery Slatkin{dagger}

* Biophysics Graduate Group, University of California, Berkeley
{dagger} Department of Integrative Biology, University of California, Berkeley

E-mail: plfjohnson{at}berkeley.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Acknowledgements
 References
 
Sequencing error presents a significant challenge to population genetic analyses using low-coverage sequence in general and single-pass reads in particular. Bias in parameter estimates becomes severe when the level of polymorphism (signal) is low relative to the amount of error (noise). Choosing an arbitrary quality score cutoff yields biased estimates, particularly with newer, non-Sanger sequencing technologies that have different quality score distributions. We propose a rule of thumb to judge when a given threshold will lead to significant bias and suggest alternative approaches that reduce bias.

Key Words: sequencing error • population genetics • bias • quality score


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Acknowledgements
 References
 
Ever since the advent of high-throughput sequencing with the Human Genome Project, investigators have faced a trade-off between data quality and cost of acquisition (Bouck et al. 1998Go; Olson and Green 1998Go). Too much error will swamp the true signal of genetic diversity, but too little data will lead to increased variance in parameter estimates.

Fifteen years ago, Clark and Whittam (1992)Go recognized that sequencing error could be a problem for evolutionary analysis, and other groups tackled error in the context of alignments (States 1992Go) and genetic maps (Lincoln and Lander 1992Go); however, sequencing analysis software and technology have since changed in 2 key ways. First, standard algorithms now provide accurate estimates of the probability that a given base call is incorrect (Ewing and Green 1998Go). Second, newer, non-Sanger technologies such as pyrosequencing (Margulies et al. 2005Go) and oligonucleotide array–based resequencing (Frazer et al. 2004Go) have different error rate distributions than capillary-based Sanger sequencers.

One widely used technique for maintaining sequence quality involves picking an arbitrary quality cutoff, discarding all bases with quality below the threshold, and, in some studies, verifying the remaining data by manual inspection. Sometimes, this strategy is employed in tandem on forward and reverse reads that cover the same base twice (e.g., Brown et al. 2004Go; Ryynänen and Primmer 2004Go). In this case, the effective threshold will range between the single-read threshold probability and the square of that probability, depending on the degree to which errors in the 2 reads are independent.

From a quantitative standpoint, eliminating bases below a threshold is equivalent to censoring a portion of the data (machine-called low quality plus human-called low quality) and then trusting what remains. If we assume that quality scores are independent of base identity, this censoring produces data with values missing at random and will not affect parameter estimates (Little and Rubin 2002Go). However, the second step of trusting what remains creates a more fundamental problem in that parameter estimates made on the basis of these data will be biased by any remaining error. In particular, error will often cause sites that are truly nonpolymorphic to appear polymorphic but with only a single chromosome having a different allele. Estimators that are particularly sensitive to such singleton sites (e.g., Watterson's estimator of {theta}; 1975Go) thus will be particularly susceptible to bias.

Here, we apply simple analytic models to determine when this threshold strategy is acceptable versus when it leads to substantially biased results. In the latter situation, a modified estimation framework that incorporates quality scores may be able to recover an unbiased estimator (Johnson and Slatkin 2006Go). Although we refer to sequencing error in this report, the general principle holds for any type of error leading to base miscalls, including ancient DNA degradation (Pääbo 1989Go; Briggs et al. 2007GoGo) or phantom mutation hotspots (Brandstatter et al. 2005Go). We start by analyzing output from 3 sequencing technologies (traditional Sanger, pyrosequencing by 454 Life Sciences, and microarray data) to find their quality score distributions. Given these distributions, we quantify the bias in 4 estimators caused by using a threshold score with each of these technologies.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Acknowledgements
 References
 
First, we present theory that describes how a number of parameters are affected by error. Then, we show how the theory can be used to create improved estimators. Finally, we review data from 3 major sequencing technologies to determine their error distributions.

Theory
The underlying concept behind our calculations is straightforward: error can cause truly fixed (nonpolymorphic) sites to appear segregating (polymorphic) and vice versa. However, error can also change one polymorphic site into a different polymorphic site by altering the frequency at which the 2 alleles appear in the sample. Estimators that depend on the allele frequency spectrum must take this last type of change into account along with the more obvious fixed-to-polymorphic and polymorphic-to-fixed changes.

We split our observed data between apparent fixed sites and apparent polymorphic sites, without subdividing the polymorphic sites on the basis of how many alleles are observed. Although apparent polymorphic sites with more than 2 alleles are likely the result of at least one error, they are also likely to be truly polymorphic. Thus, these polyallelic sites form a biased subsample of all sites, and a simple strategy of discarding such sites will lead to bias in parameter estimates. We avoid this problem and simplify the theory by considering bi-, tri-, and quadallelic sites all to be segregating.

We highlight 4 estimators of 3 basic parameters: Watterson's estimator of {theta} (1975Go), Tajima's estimator of {theta} (1983Go), Tajima's D (1989)Go, and Weir and Cockerham's estimator of FST (1984Go). The expected value of each estimator is calculated in turn, given the true value of the parameter and the distribution of quality scores. For some parameters, we use coalescent simulations instead of analytic theory. We assume that an error causes a switch to any of the other 3 nucleotides with equal probability.

First, we outline our notation. We define {theta} = 4Neµ, where Ne is the effective population size and µ is the mutation rate per nucleotide per generation. Note, this means {theta} is a per-site measure rather than a per-locus measure. A phred style (Ewing and Green 1998Go) quality score, Q, for a single nucleotide maps directly to an error probability via the relation Pr(error) = 10Q/10, so we will henceforth refer to error probabilities rather than quality scores. The distribution of error probabilities is represented by G, with the average per-site error probability Formula . Observed values incorporating error are distinguished from true values without error by the subscripts o and t (i.e., So and St represent observed and true numbers of segregating sites). Our sample consists of n chromosomes, each of which is L nucleotides long.

Tajima's Formula
This estimator of {theta} uses the average number of pairwise nucleotide differences (often denoted {pi}) between sequences in a sample of size n (Tajima 1983Go):

Formula
where {pi}ij is the number of differences per nucleotide between sequences i and j. Without error, this is an unbiased estimator (Formula ) With error:

Formula
Because the {pi}ij are identically distributed for all i != j, a and b represent arbitrary sequences where a != b. We write the observed differences ({pi}ab,o) as a function of the true differences ({pi}ab,t), those sites that were originally mismatched and error made matching (Xk) and those sites that were originally matching and error made mismatch (Yk). To maintain {pi} (and thereby {theta}) as a per-site measure, we divide by L:

Formula
where m is the number of true mismatches between 2 sequences, Xk ~ Bernoulli(p1), Yk ~ Bernoulli(p2), p1 = Pr(observed match|true mismatch, {g1, g2}), p2 = Pr(observed mismatch|true match, {g1, g2}), and g1, g2 ~ G. Note that the probabilities p1 and p2 are random variables in their own right that depend on the error probabilities, g1 and g2, at a particular site. Continuing our derivation and using Wald's equation (Ross 1996Go),

Formula (1)

The match-given-mismatch event of p1 can arise either when exactly one base switches to the other (1 – g1)g2(1/3) + (1 – g2)g1(1/3) or when both bases switch to the same new nucleotide g1g2(2/3)(1/3):

Formula
The mismatch-given-match event of p2 can arise either when exactly one base switches to another (1 – g1)g2(3/3) + (1 – g2)g1(3/3) or when both bases switch to different nucleotides g1g2(3/3)(2/3):

Formula
Going back to equation (1), we can now calculate the expectations of p1 and p2. Because we assume that the error probabilities are independent, these expectations depend only on the average error probability, {varepsilon}:

Formula (2)

Watterson's Formula
This estimator of {theta} uses the number of segregating sites in a sample of size n (Watterson 1975Go):

Formula
where S is the number of segregating sites. Without error, this is also an unbiased estimator. With error:

Formula (3)
The derivation for Formula is analogous to the derivation of equation (1) with 2 differences. First, So refers to the actual number of segregating sites, so its expected value does not have the factor (1/L). Second, the match/mismatch probabilities (now denoted q1 and q2) are conditional on the sample size, n, and defined within the context of segregating sites such that q1 = Pr(observed fixed|true segregating, n, {gk}) and q2 = Pr(observed segregating|true fixed, n, {gk}), where {gk} is a set of n error probabilities each distributed according to G. For the reason noted earlier, we do not distinguish between sites observed to segregate with 2 alleles versus those observed with 3 or 4 alleles.

We start by calculating q1:

Formula
where fn(i) is the probability of a mutation segregating at frequency i in a sample of size n (i.e., the frequency spectrum): fn(i) = (1/i)/a1,n. Taking the first term in the above summation, we observe a fixed site given a particular frequency of true segregation if 1 of 3 events occurs: all of one type switch to the other Formula or vice versa Formula or all bases of both types switch to a new, third allele Formula

The q2 event (segregating-given-fixed) will arise if any error occurs Formula , except for the case when all bases switch to the same new allele Formula

As with equation (1), we assume the error probabilities are independent, so the expected values of q1 and q2 depend only on the average error probability, {varepsilon}:

Formula (4)

Variance of Watterson's Formula s
Watterson (1975)Go derived the variance in his estimator of {theta} without error, given a finite number of sites, L, and sample size of n:

Formula
We use the notation from the Formula expectation calculations above. With error:

Formula (5)
Remember,

Formula
where Xi ~ Bernoulli(q1) and Yi ~ Bernoulli(q2). Now we condition on the true number of segregating sites, St, and, for brevity, no longer explicitly write the conditionals on G, L, n, {theta}:

Formula (6)
Working with the first term:

Formula (7)
Working with the second term from equation (6), we use the fact that errors are independent of each other and move the variance inside the summations:

Formula (8)
We calculate Var[Xi] by further conditioning on q1:

Formula
The derivation of the variance of Yi follows identically to yield Formula . Continuing from equation (8) and by Wald's equation (Ross 1996Go):

Formula (9)
We calculate the overall variance of Formula s by starting with equation (5) and substituting, in turn, equations (6, 7, 9, and 4).

Tajima's D
Tajima's D statistic is defined as the normalized difference between the 2 estimators of {theta} for a given length of sequence, L:

Formula
where CS,L,n is defined such that, under neutrality, Formula (Tajima 1989Go). With error:

Formula
We approximate this expectation by taking the limit as the length of the sequence goes to infinity (which means S -> {infty} as well, although S/L -> constant) and using a first-order Taylor series expansion:

Formula (10)
Now we assume neutrality and replace the numerator with the expected values calculated earlier (eqs. 1 and 3). The denominator reduces to be proportional to Formula which we also know from equation (3).

We evaluate our approximations via coalescent simulations (Hudson 2002Go) to which we add "sequencing errors" with fixed probability {varepsilon}.

Wright's FST
FST is defined as the proportion of heterozygosity found between subpopulations:

Formula
where W is the average within-subpopulation heterozygosity and T is the total population heterozygosity.

We use Weir and Cockerham's (1984)Go estimator for r subpopulations, each of size n:

Formula (11)
where Formula is the average allele frequency across the total sample and s2 is the estimated variance in allele frequency across the subpopulation samples:

Formula
We combine information from multiple sites by summing the numerator and denominator across sites, as suggested by Weir and Cockerham (1984)Go.

The expected value of Formula ST with error is difficult to calculate because it is a ratio of random variables. We again turn to simulations (Hudson 2002Go) with n = 10 samples from each of r = 10 island subpopulations with symmetric migration rate M = 4Nem. Given this idealized situation, FST {approx} 1/(1 + M(r – 1)/r) (Cockerham and Weir 1987Go), so simulating for a range of values of M corresponds to a range of values for FST.

Example Unbiased Estimators
Given the average amount of error remaining in an observed data set, {varepsilon}, and the exact analytic derivations for Formula s and Formula {pi} we can construct unbiased estimators that take into account sequencing error by inverting these equations:

Formula (12)
substituting equation (2) for Formula

Formula (13)
substituting equation (4) for Formula

Data
We calculated quality score distributions arising from 3 different sequencing technologies (traditional Sanger, pyrosequencing denoted "454," microarray denoted "Chip") by downloading publicly available data from the National Center for Biotechnology Information (NCBI) Trace Archive (http://www.ncbi.nlm.nih.gov/Traces) on 17 May 2007. To ensure a fair comparison, we took the Sanger and 454 data from the same sequencing center (Joint Genome Institute [JGI]) and the same species (monkey flower) using the queries "SPECIES_CODE = ‘MIMULUS GUTTATUS’ AND CENTER_NAME = ‘JGI’ AND TRACE_TYPE_CODE = ‘WGS’" and "SPECIES_CODE = ‘MIMULUS GUTTATUS’ AND CENTER_NAME = ‘JGI’ AND TRACE_TYPE_CODE = ‘454.’" We limited the Sanger query to the first 200,000 reads to reduce the load on the server. The only Chip project currently in the Trace Archive is from the Center for Rodent Genetics/Perlegen's resequencing of 15 mouse strains, and we downloaded the first 200,000 reads from that project with the query "TRACE_TYPE_CODE = ‘CHIP.’"


    Results
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Acknowledgements
 References
 
Figure 1 shows histograms of the quality scores from the 3 types of sequences (Sanger, 454, Chip) downloaded from the NCBI Trace Archive (see Methods). These distributions represent all scores found in the reads, without any trimming. Clearly these distributions have different shapes, which means that, given a fixed minimum quality score, the amount of error remaining in the data will depend on the sequencing technology. Because the 454 and Chip distributions are skewed toward lower quality, these 2 methods will have more error remaining than Sanger reads. For example, taking the typical threshold value of Q = 30 ({equiv}g = 1/1,000 chance of an error), we find that the mean error remaining after discarding data with quality scores less than 30 is approximately {varepsilon} = 1/10,000 for Sanger, {varepsilon} = 3/10,000 for 454, and {varepsilon} = 5/10,000 for Chip. For the analyses below, we take each base in a given individual to be sequenced only once and apply these mean error rates directly. Note that, because the relationship between quality score, Q, and error probability, g, is nonlinear (g = 10Q/10; Ewing and Green 1998Go), the mean error, Formula is not the same as the error corresponding to the mean quality score.


Figure 1
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Quality score distributions for (A) Sanger, (B) 454 (pyrosequencing), and (C) Chip (Perlegen microarray). Qualities for 454 were truncated at 60 (Pr(>60) = 1.2 x 10–4) for ease of comparison. The multimodal nature of the 454 distribution is not exclusive to our data from JGI. A similar distribution arises from 454 sequencing of a mammoth bone conducted in a different laboratory (Poinar et al. [2006Go]; data not shown).

 
Given these remaining rates of error, we want to quantify the bias in estimates of population genetic parameters. First, we analyze 2 estimators of the scaled mutation rate, {theta}, by comparing the expected value of each estimator with and without error for a sample size of n = 10 (figs. 2A and B). In a neutrally evolving, random-mating population without error, both the estimator based on pairwise differences, Formula {pi} (Tajima 1983Go), and the estimator based on segregating sites, Formula S (Watterson 1975Go), are unbiased, so this comparison is equivalent to comparing the estimator with error to the true value of {theta}. Sequencing error has the potential to affect the latter estimator (fig. 2B) more than the former (fig. 2A) because the number of segregating sites does not distinguish between a site segregating because of a single individual in the sample (i.e., a singleton, likely caused by an error) versus a site segregating at a higher frequency in the sample. As a result, the expected value of Formula S increases as the sample size, n, increases (see eq. 4 in Methods; note n = 10 in fig. 2B).


Figure 2
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Expected value of parameter estimates with various levels of error (corresponding to Sanger, 454, Chip with quality threshold of 30) compared with expected value without error. (A) Figure 2 with error (analytic results) relative to without (={theta}); expectation is independent of sample size, n, and sequence length, L. (B) Figure 2S, where n = 10 (analytic results) with error relative to without (={theta}); expectation is independent of L. (C) Tajima's D where true D = 0, n = 10, and L = 100 kb (lines show approximate analytic results; symbols show average of 100,000 replicate simulations). Dotted line represents 5% level of significance for test of neutrality from beta distribution of Tajima (1989)Go. (D) FST with error relative to without as calculated from 10 samples from each of 10 subpopulations, where L = 10 kb and migration M = 4Nem = 1 (average of 100,000 replicate simulations). Results shown for migration parameter M = 4Nem = 1, although no qualitative difference was seen in the tested range (M from 1 to 10).

 
Next, we combine these 2 estimators to calculate Tajima's D, a statistic that tests the null hypothesis of neutrality by taking the normalized difference between Formula {pi} and Formula S (fig. 2C). Our analytic derivation (eq. 10; shown with lines in fig. 2C) makes several approximations (see Methods) but qualitatively matches simulation results (shown with points). The threshold for assessing a significant departure from neutrality is relatively strict (0.05 level of significance illustrated by the horizontal dotted line in fig. 2C), so error will not make D appear artificially significant unless the true {theta} is extremely low.

Finally, we look at a measure of population subdivision, FST (Wright 1951Go), that compares heterozygosity within putative subpopulations versus the total population (fig. 2D). An FST value of 1 corresponds to fixed differences between subpopulations, whereas an FST value of 0 corresponds to the same allele frequencies in all subpopulations. In theory, error could obscure true subdivision (decreasing Formula ST) by changing sites with fixed differences into sites with polymorphism within each subpopulation—the same outcome that would result if the subpopulations were truly interbreeding. However, error could also enhance true subdivision (increasing Formula ST) by changing sites with the same allele frequencies across populations into sites with different allele frequencies across populations. The net effect of error will depend on the amount of true subdivision, the amount of mutation, and the amount of error. Our simulations of n = 10 samples from each of r = 10 subpopulations for migration rates ranging from 1 to 10 (corresponding to FST from ca. 0.5 to 0.1) resulted in error exclusively causing a decrease in Formula ST. For the parameters we explored, the relative amount of bias in Formula ST appears to depend only on {theta} and not on the true FST.


    Discussion
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Acknowledgements
 References
 
The precise boundaries of the bias zone depend on the particular parameter being estimated, the form of the estimator, and the distribution of quality scores. As illustrated by the 2 estimators of {theta}, a greater sensitivity toward singletons leads to increased bias in Formula S over Formula {pi}. The estimator based on segregating sites, Formula S, generally has lower variance than the one based on pairwise differences, Formula {pi} (Watterson 1975Go; Tajima 1983Go); however, the unequal bias introduced by sequencing error can lead to a higher mean squared error (=bias2 + var) for Formula S—unexpectedly making Formula {pi} the preferred estimator (fig. 3A). As the sample size increases beyond the n = 10 used here, the bias in Formula S increases as well, whereas the bias of Formula {pi} remains constant. When we take the difference between the 2 estimators in the form of Tajima's D, it is less affected by error because both estimators are biased in the same direction. The relative bias in Formula ST depends primarily on the total population heterozygosity (i.e., {theta}) rather than on the amount of subdivision, a fact that we discuss below. Although this relative bias in Formula ST can be substantial, its affect on the conclusion of subdivision (or no subdivision) depends on the absolute value of FST, which depends, in turn, on the within-subpopulation heterozygosity.


Figure 3
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Mean squared error of Figure 3S relative to Figure 3{pi} for n = 10 sequences 100 kb in length calculated from 100,000 replicate simulations. Values above 1 (the dotted line) indicate Figure 3{pi} is preferred, and values below indicate Figure 3S is preferred. (A) Comparison for standard estimators. (B) Comparison for new, unbiased estimators (eqs. 12 and 13).

 
The behavior of Formula ST with error warrants further discussion. Error tends to add minimally polymorphic sites where a single base in a single subpopulation has a different allele; when we apply equation (11) (see Methods) to such a site, we find 0 in the numerator and 1/(nr) in the denominator. When m such sites are combined with more normal polymorphic sites, the true numerator remains unchanged (m·0 = 0), whereas the true denominator increases by m/(nr). Thus, the relative change in Formula ST depends only on the magnitude of m/(nr) versus the true denominator. Because the true denominator includes both the within- and between-subpopulation variances (Weir and Cockerham 1984Go), it is not directly affected by the true amount of subdivision, which explains why the relative amount of bias appears unaffected by the true FST.

Given a function that calculates the expected value of an estimator with error, inverting it should yield an unbiased estimator. When the function has a closed form, as in the case of Formula {pi} and Formula S, this process for creating an unbiased estimator is straightforward (see eqs. 12 and 13). When the expected value with error cannot be calculated analytically, simulations can be used to approximate the expected value across a wide range of parameters, and then the resulting table of values can be inverted and interpolated to arrive at an approximate unbiased estimate for a given observation. However, all these unbiased estimators will still contain the variance from sequencing error in addition to the variance from the genealogical process. In the case of our {theta} example, this additional variance means that the unbiased Formula {pi} will still sometimes be preferred over the unbiased Formula S (fig. 3B).

The diversity of different populations spans orders of magnitude, but, in general, sequencing error will only be a concern when looking at within-species diversity. Estimates of human diversity fall a little below {theta} = 0.001 (Crawford et al. 2005Go), but species with larger or smaller effective population sizes will have correspondingly larger or smaller {theta}. For instance, endangered species can have extremely low diversity as a result of inbreeding (e.g., Atlantic salmon {theta} {approx} 5 x 10–4 [Ryynänen and Primmer 2004Go] and bog turtle {theta} {approx} 10–4 [Rosenbaum et al. 2007Go]). Microbial populations, on the other hand, have enormous population sizes with the potential for high diversity (Allen et al. 2007Go), although near-clonal populations with very low diversity can also exist (Strous et al. 2006Go).

As a rule of thumb, an uncorrected estimate will be biased significantly if n{varepsilon} ≥ {theta}, where n is the sample size, {varepsilon} is the average error remaining in the data, and {theta} is defined on a per-site basis (if {theta} is defined per-locus, then n{varepsilon}L ≥ {theta}). However, estimators that focus on singletons as a means of detecting selection (e.g., Fu and Li's D [1993]Go) or population growth will encounter trouble much earlier. Given low diversity, a particular problem arises from "single-pass" data in which each nucleotide is sequenced at most once from a particular individual. Metagenomics and ancient DNA projects both fit into this category, along with any high-throughput sequencing where experiments cannot be repeated, either because of expense or insufficient quantity of biological sample.

The proper way to avoid bias is to explicitly incorporate quality scores into the parameter estimation framework, either by correcting the biased estimator as demonstrated above or by taking a likelihood-based approach as we did in an earlier study working with metagenomics data (Johnson and Slatkin 2006Go). An alternative method of reducing bias involves raising the threshold quality score and discarding even more data. However, not only does this latter strategy retain some bias (albeit a smaller amount) but it also leads to an increase in variance through the reduction in data. Thus, we urge empiricists to use threshold-based estimators with caution and theoreticians to develop estimators that avoid the problem altogether by accounting for data quality.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Acknowledgements
 References
 
Thanks to Weiwei Zhai for helpful discussions and to Michael Jordan for pointing out the broader connection to missing data problems. This research was supported by National Institutes of Health grant R01-GM40282 to M.S.


    Footnotes
 
Yoko Satta, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Acknowledgements
 References
 

    Allen EE, Tyson GW, Whitaker RJ, Detter JC, Richardson PM, Banfield JF. Genome dynamics in a natural archaeal population. Proc Natl Acad Sci USA (2007) 104:1883–1888.[Abstract/Free Full Text]

    Bouck J, Miller W, Gorrell JH, Muzny D, Gibbs RA. Analysis of the quality and utility of random shotgun sequencing at low redundancies. Genome Res (1998) 8:1074–1084.[Abstract/Free Full Text]

    Brandstatter A, Sanger T, Lutz-Bonengel S, Parson W, Beraud-Colomb E, Wen B, Kong QP, Bravi CM, Bandelt HJ. Phantom mutation hotspots in human mitochondrial DNA. Electrophoresis (2005) 26:3414–3429.[CrossRef][Web of Science][Medline]

    Briggs AW, Stenzel U, Johnson PLF, et al, (11 co-authors). Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci USA (2007) 104:14616–14621.[Abstract/Free Full Text]

    Brown G, Gill G, Kuntz R, Langley C, Neale D. Nucleotide diversity and linkage disequilibrium in loblolly pine. Proc Natl Acad Sci USA (2004) 101:15255–15260.[Abstract/Free Full Text]

    Clark AG, Whittam TS. Sequencing errors and molecular evolutionary analysis. Mol Biol Evol (1992) 9:744–752.[Abstract]

    Cockerham CC, Weir BS. Correlations, descent measures: drift with migration and mutation. Proc Natl Acad Sci USA (1987) 84:8512–8514.[Abstract/Free Full Text]

    Crawford DC, Akey DT, Nickerson DA. The patterns of natural variation in human genes. Annu Rev Genomics Hum Genet (2005) 6:287–312.[CrossRef][Web of Science][Medline]

    Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res (1998) 8:186–194.[Abstract/Free Full Text]

    Frazer KA, Wade CM, Hinds DA, Patil N, Cox DR, Daly MJ. Segmental phylogenetic relationships of inbred mouse strains revealed by fine-scale analysis of sequence variation across 4.6 mb of mouse genome. Genome Res (2004) 14:1493–1500.[Abstract/Free Full Text]

    Fu YX, Li WH. Statistical tests of neutrality of mutations. Genetics (1993) 133:693–709.[Abstract]

    Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics (2002) 18:337–338.[Abstract/Free Full Text]

    Johnson PLF, Slatkin M. Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome Res (2006) 16:1320–1327.[Abstract/Free Full Text]

    Lincoln SE, Lander ES. Systematic detection of errors in genetic linkage data. Genomics (1992) 14:604–610.[CrossRef][Web of Science][Medline]

    Little RJ, Rubin DB. Statistical analysis with missing data (2002) 2nd ed. Hoboken (NJ): John Wiley & Sons.

    Margulies M, Egholm M, Altman WE, et al, (56 co-authors). Genome sequencing in microfabricated high-density picolitre reactors. Nature (2005) 437:376–380.[Medline]

    Olson M, Green P. A "quality-first" credo for the Human Genome Project. Genome Res (1998) 8:414–415.[Free Full Text]

    Pääbo S. Ancient DNA: extraction, characterization, molecular cloning, and enzymatic amplification. Proc Natl Acad Sci USA (1989) 86:1939–1943.[Abstract/Free Full Text]

    Poinar HN, Schwarz C, Qi J, et al, (13 co-authors). Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science (2006) 311:392–394.[Abstract/Free Full Text]

    Rosenbaum PA, Robertson JM, Zamudio KR. Unexpectedly low genetic divergences among populations of the threatened bog turtle (Glyptemys muhlenbergii). Conserv Genet (2007) 8:331–342.[CrossRef]

    Ross SM. Stochastic Processes. (1996) 2nd ed. New York: John Wiley & Sons, Inc.

    Ryynänen HJ, Primmer CR. Distribution of genetic variation in the growth hormone 1 gene in Atlantic salmon (Salmo salar) populations from Europe and North America. Mol Ecol (2004) 13:3857–3869.[CrossRef][Medline]

    States DJ. Molecular sequence accuracy: analysing imperfect data. Trends Genet (1992) 8:52–55.[Web of Science][Medline]

    Strous M, Pelletier E, Mangenot S, et al, (37 co-authors). Deciphering the evolution and metabolism of an anammox bacterium from a community genome. Nature (2006) 440:790–794.[CrossRef][Medline]

    Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics (1983) 105:437–460.[Abstract/Free Full Text]

    Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics (1989) 123:585–595.[Abstract/Free Full Text]

    Watterson GA. On the number of segregating sites in genetical models without recombination. Theor Popul Biol (1975) 7:256–276.[CrossRef][Web of Science][Medline]

    Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution (1984) 38:1358–1370.[CrossRef][Web of Science]

    Wright S. The genetical structure of populations. Ann Eugen (1951) 15:323–354.

Accepted for publication October 27, 2007.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
X. Liu, T. J. Maxwell, E. Boerwinkle, and Y.-X. Fu
Inferring Population Mutation Rate and Sequencing Error Rate Using the SNP Frequency Spectrum in a Sample of DNA Sequences
Mol. Biol. Evol., July 1, 2009; 26(7): 1479 - 1490.
[Abstract] [Full Text] [PDF]


Home page
Microbiol. Mol. Biol. Rev.Home page
V. Kunin, A. Copeland, A. Lapidus, K. Mavromatis, and P. Hugenholtz
A Bioinformatician's Guide to Metagenomics
Microbiol. Mol. Biol. Rev., December 1, 2008; 72(4): 557 - 578.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
M. Lynch
Estimation of Nucleotide Diversity, Disequilibrium Coefficients, and Mutation Rates from High-Coverage Genome-Sequencing Projects
Mol. Biol. Evol., November 1, 2008; 25(11): 2409 - 2419.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
R. Burgess and Z. Yang
Estimation of Hominoid Ancestral Population Sizes under Bayesian Coalescent Models Incorporating Mutation Rate Variation and Sequencing Errors
Mol. Biol. Evol., September 1, 2008; 25(9): 1979 - 1994.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
G. Achaz
Testing for Neutrality in Samples With Sequencing Errors
Genetics, July 1, 2008; 179(3): 1409 - 1424.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
25/1/199    most recent
msm239v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Johnson, P. L. F.
Right arrow Articles by Slatkin, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Johnson, P. L. F.
Right arrow Articles by Slatkin, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?