MBE Advance Access originally published online on November 2, 2007
Molecular Biology and Evolution 2008 25(1):199-206; doi:10.1093/molbev/msm239
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Accounting for Bias from Sequencing Error in Population Genetic Estimates

* Biophysics Graduate Group, University of California, Berkeley
Department of Integrative Biology, University of California, Berkeley
E-mail: plfjohnson{at}berkeley.edu.
| Abstract |
|---|
|
|
|---|
Sequencing error presents a significant challenge to population genetic analyses using low-coverage sequence in general and single-pass reads in particular. Bias in parameter estimates becomes severe when the level of polymorphism (signal) is low relative to the amount of error (noise). Choosing an arbitrary quality score cutoff yields biased estimates, particularly with newer, non-Sanger sequencing technologies that have different quality score distributions. We propose a rule of thumb to judge when a given threshold will lead to significant bias and suggest alternative approaches that reduce bias.
Key Words: sequencing error population genetics bias quality score
| Introduction |
|---|
|
|
|---|
Ever since the advent of high-throughput sequencing with the Human Genome Project, investigators have faced a trade-off between data quality and cost of acquisition (Bouck et al. 1998
Fifteen years ago, Clark and Whittam (1992)
recognized that sequencing error could be a problem for evolutionary analysis, and other groups tackled error in the context of alignments (States 1992
) and genetic maps (Lincoln and Lander 1992
); however, sequencing analysis software and technology have since changed in 2 key ways. First, standard algorithms now provide accurate estimates of the probability that a given base call is incorrect (Ewing and Green 1998
). Second, newer, non-Sanger technologies such as pyrosequencing (Margulies et al. 2005
) and oligonucleotide array–based resequencing (Frazer et al. 2004
) have different error rate distributions than capillary-based Sanger sequencers.
One widely used technique for maintaining sequence quality involves picking an arbitrary quality cutoff, discarding all bases with quality below the threshold, and, in some studies, verifying the remaining data by manual inspection. Sometimes, this strategy is employed in tandem on forward and reverse reads that cover the same base twice (e.g., Brown et al. 2004
; Ryynänen and Primmer 2004
). In this case, the effective threshold will range between the single-read threshold probability and the square of that probability, depending on the degree to which errors in the 2 reads are independent.
From a quantitative standpoint, eliminating bases below a threshold is equivalent to censoring a portion of the data (machine-called low quality plus human-called low quality) and then trusting what remains. If we assume that quality scores are independent of base identity, this censoring produces data with values missing at random and will not affect parameter estimates (Little and Rubin 2002
). However, the second step of trusting what remains creates a more fundamental problem in that parameter estimates made on the basis of these data will be biased by any remaining error. In particular, error will often cause sites that are truly nonpolymorphic to appear polymorphic but with only a single chromosome having a different allele. Estimators that are particularly sensitive to such singleton sites (e.g., Watterson's estimator of
; 1975
) thus will be particularly susceptible to bias.
Here, we apply simple analytic models to determine when this threshold strategy is acceptable versus when it leads to substantially biased results. In the latter situation, a modified estimation framework that incorporates quality scores may be able to recover an unbiased estimator (Johnson and Slatkin 2006
). Although we refer to sequencing error in this report, the general principle holds for any type of error leading to base miscalls, including ancient DNA degradation (Pääbo 1989
; Briggs et al. 2007![]()
) or phantom mutation hotspots (Brandstatter et al. 2005
). We start by analyzing output from 3 sequencing technologies (traditional Sanger, pyrosequencing by 454 Life Sciences, and microarray data) to find their quality score distributions. Given these distributions, we quantify the bias in 4 estimators caused by using a threshold score with each of these technologies.
| Methods |
|---|
|
|
|---|
First, we present theory that describes how a number of parameters are affected by error. Then, we show how the theory can be used to create improved estimators. Finally, we review data from 3 major sequencing technologies to determine their error distributions.
Theory
The underlying concept behind our calculations is straightforward: error can cause truly fixed (nonpolymorphic) sites to appear segregating (polymorphic) and vice versa. However, error can also change one polymorphic site into a different polymorphic site by altering the frequency at which the 2 alleles appear in the sample. Estimators that depend on the allele frequency spectrum must take this last type of change into account along with the more obvious fixed-to-polymorphic and polymorphic-to-fixed changes.
We split our observed data between apparent fixed sites and apparent polymorphic sites, without subdividing the polymorphic sites on the basis of how many alleles are observed. Although apparent polymorphic sites with more than 2 alleles are likely the result of at least one error, they are also likely to be truly polymorphic. Thus, these polyallelic sites form a biased subsample of all sites, and a simple strategy of discarding such sites will lead to bias in parameter estimates. We avoid this problem and simplify the theory by considering bi-, tri-, and quadallelic sites all to be segregating.
We highlight 4 estimators of 3 basic parameters: Watterson's estimator of
(1975
), Tajima's estimator of
(1983
), Tajima's D (1989)
, and Weir and Cockerham's estimator of FST (1984
). The expected value of each estimator is calculated in turn, given the true value of the parameter and the distribution of quality scores. For some parameters, we use coalescent simulations instead of analytic theory. We assume that an error causes a switch to any of the other 3 nucleotides with equal probability.
First, we outline our notation. We define
= 4Neµ, where Ne is the effective population size and µ is the mutation rate per nucleotide per generation. Note, this means
is a per-site measure rather than a per-locus measure. A phred style (Ewing and Green 1998
) quality score, Q, for a single nucleotide maps directly to an error probability via the relation Pr(error) = 10–Q/10, so we will henceforth refer to error probabilities rather than quality scores. The distribution of error probabilities is represented by G, with the average per-site error probability
. Observed values incorporating error are distinguished from true values without error by the subscripts o and t (i.e., So and St represent observed and true numbers of segregating sites). Our sample consists of n chromosomes, each of which is L nucleotides long.
Tajima's ![]()
This estimator of
uses the average number of pairwise nucleotide differences (often denoted
) between sequences in a sample of size n (Tajima 1983
):
|
|
ij is the number of differences per nucleotide between sequences i and j. Without error, this is an unbiased estimator (
![]() |
ij are identically distributed for all i
j, a and b represent arbitrary sequences where a
b. We write the observed differences (
ab,o) as a function of the true differences (
ab,t), those sites that were originally mismatched and error made matching (Xk) and those sites that were originally matching and error made mismatch (Yk). To maintain
(and thereby
) as a per-site measure, we divide by L: |
|
Bernoulli(p1), Yk
Bernoulli(p2), p1 = Pr(observed match|true mismatch, {g1, g2}), p2 = Pr(observed mismatch|true match, {g1, g2}), and g1, g2
G. Note that the probabilities p1 and p2 are random variables in their own right that depend on the error probabilities, g1 and g2, at a particular site. Continuing our derivation and using Wald's equation (Ross 1996
![]() | (1) |
The match-given-mismatch event of p1 can arise either when exactly one base switches to the other (1 – g1)g2(1/3) + (1 – g2)g1(1/3) or when both bases switch to the same new nucleotide g1g2(2/3)(1/3):
|
|
|
|
:
|
| (2) |
Watterson's ![]()
This estimator of
uses the number of segregating sites in a sample of size n (Watterson 1975
):
![]() |
![]() | (3) |
We start by calculating q1:
|
|
The q2 event (segregating-given-fixed) will arise if any error occurs
, except for the case when all bases switch to the same new allele
As with equation (1), we assume the error probabilities are independent, so the expected values of q1 and q2 depend only on the average error probability,
:
![]() | (4) |
Variance of Watterson's
s
Watterson (1975)
derived the variance in his estimator of
without error, given a finite number of sites, L, and sample size of n:
![]() |
|
| (5) |
|
|
Bernoulli(q1) and Yi
Bernoulli(q2). Now we condition on the true number of segregating sites, St, and, for brevity, no longer explicitly write the conditionals on G, L, n,
:
|
| (6) |
![]() | (7) |
|
| (8) |
![]() |
![]() | (9) |
Tajima's D
Tajima's D statistic is defined as the normalized difference between the 2 estimators of
for a given length of sequence, L:
|
|
|
|
as well, although S/L
constant) and using a first-order Taylor series expansion:
|
| (10) |
We evaluate our approximations via coalescent simulations (Hudson 2002
) to which we add "sequencing errors" with fixed probability
.
Wright's FST
FST is defined as the proportion of heterozygosity found between subpopulations:
|
|
We use Weir and Cockerham's (1984)
estimator for r subpopulations, each of size n:
|
| (11) |
|
|
The expected value of
ST with error is difficult to calculate because it is a ratio of random variables. We again turn to simulations (Hudson 2002
) with n = 10 samples from each of r = 10 island subpopulations with symmetric migration rate M = 4Nem. Given this idealized situation, FST
1/(1 + M(r – 1)/r) (Cockerham and Weir 1987
), so simulating for a range of values of M corresponds to a range of values for FST.
Example Unbiased Estimators
Given the average amount of error remaining in an observed data set,
, and the exact analytic derivations for
s and ![]()
we can construct unbiased estimators that take into account sequencing error by inverting these equations:
![]() | (12) |
![]() | (13) |
Data
We calculated quality score distributions arising from 3 different sequencing technologies (traditional Sanger, pyrosequencing denoted "454," microarray denoted "Chip") by downloading publicly available data from the National Center for Biotechnology Information (NCBI) Trace Archive (http://www.ncbi.nlm.nih.gov/Traces) on 17 May 2007. To ensure a fair comparison, we took the Sanger and 454 data from the same sequencing center (Joint Genome Institute [JGI]) and the same species (monkey flower) using the queries "SPECIES_CODE = MIMULUS GUTTATUS AND CENTER_NAME = JGI AND TRACE_TYPE_CODE = WGS" and "SPECIES_CODE = MIMULUS GUTTATUS AND CENTER_NAME = JGI AND TRACE_TYPE_CODE = 454." We limited the Sanger query to the first 200,000 reads to reduce the load on the server. The only Chip project currently in the Trace Archive is from the Center for Rodent Genetics/Perlegen's resequencing of 15 mouse strains, and we downloaded the first 200,000 reads from that project with the query "TRACE_TYPE_CODE = CHIP."
| Results |
|---|
|
|
|---|
Figure 1 shows histograms of the quality scores from the 3 types of sequences (Sanger, 454, Chip) downloaded from the NCBI Trace Archive (see Methods). These distributions represent all scores found in the reads, without any trimming. Clearly these distributions have different shapes, which means that, given a fixed minimum quality score, the amount of error remaining in the data will depend on the sequencing technology. Because the 454 and Chip distributions are skewed toward lower quality, these 2 methods will have more error remaining than Sanger reads. For example, taking the typical threshold value of Q = 30 (
g = 1/1,000 chance of an error), we find that the mean error remaining after discarding data with quality scores less than 30 is approximately
= 1/10,000 for Sanger,
= 3/10,000 for 454, and
= 5/10,000 for Chip. For the analyses below, we take each base in a given individual to be sequenced only once and apply these mean error rates directly. Note that, because the relationship between quality score, Q, and error probability, g, is nonlinear (g = 10–Q/10; Ewing and Green 1998
|
Given these remaining rates of error, we want to quantify the bias in estimates of population genetic parameters. First, we analyze 2 estimators of the scaled mutation rate,
, by comparing the expected value of each estimator with and without error for a sample size of n = 10 (figs. 2A and B). In a neutrally evolving, random-mating population without error, both the estimator based on pairwise differences,
(Tajima 1983
. Sequencing error has the potential to affect the latter estimator (fig. 2B) more than the former (fig. 2A) because the number of segregating sites does not distinguish between a site segregating because of a single individual in the sample (i.e., a singleton, likely caused by an error) versus a site segregating at a higher frequency in the sample. As a result, the expected value of
|
Next, we combine these 2 estimators to calculate Tajima's D, a statistic that tests the null hypothesis of neutrality by taking the normalized difference between
and
is extremely low.
Finally, we look at a measure of population subdivision, FST (Wright 1951
), that compares heterozygosity within putative subpopulations versus the total population (fig. 2D). An FST value of 1 corresponds to fixed differences between subpopulations, whereas an FST value of 0 corresponds to the same allele frequencies in all subpopulations. In theory, error could obscure true subdivision (decreasing
ST) by changing sites with fixed differences into sites with polymorphism within each subpopulation—the same outcome that would result if the subpopulations were truly interbreeding. However, error could also enhance true subdivision (increasing
ST) by changing sites with the same allele frequencies across populations into sites with different allele frequencies across populations. The net effect of error will depend on the amount of true subdivision, the amount of mutation, and the amount of error. Our simulations of n = 10 samples from each of r = 10 subpopulations for migration rates ranging from 1 to 10 (corresponding to FST from ca. 0.5 to 0.1) resulted in error exclusively causing a decrease in
ST. For the parameters we explored, the relative amount of bias in
ST appears to depend only on
and not on the true FST.
| Discussion |
|---|
|
|
|---|
The precise boundaries of the bias zone depend on the particular parameter being estimated, the form of the estimator, and the distribution of quality scores. As illustrated by the 2 estimators of
, a greater sensitivity toward singletons leads to increased bias in
. The estimator based on segregating sites,
(Watterson 1975
the preferred estimator (fig. 3A). As the sample size increases beyond the n = 10 used here, the bias in
remains constant. When we take the difference between the 2 estimators in the form of Tajima's D, it is less affected by error because both estimators are biased in the same direction. The relative bias in
) rather than on the amount of subdivision, a fact that we discuss below. Although this relative bias in
|
The behavior of
Given a function that calculates the expected value of an estimator with error, inverting it should yield an unbiased estimator. When the function has a closed form, as in the case of ![]()
and
S, this process for creating an unbiased estimator is straightforward (see eqs. 12 and 13). When the expected value with error cannot be calculated analytically, simulations can be used to approximate the expected value across a wide range of parameters, and then the resulting table of values can be inverted and interpolated to arrive at an approximate unbiased estimate for a given observation. However, all these unbiased estimators will still contain the variance from sequencing error in addition to the variance from the genealogical process. In the case of our
example, this additional variance means that the unbiased ![]()
will still sometimes be preferred over the unbiased
S (fig. 3B).
The diversity of different populations spans orders of magnitude, but, in general, sequencing error will only be a concern when looking at within-species diversity. Estimates of human diversity fall a little below
= 0.001 (Crawford et al. 2005
), but species with larger or smaller effective population sizes will have correspondingly larger or smaller
. For instance, endangered species can have extremely low diversity as a result of inbreeding (e.g., Atlantic salmon
5 x 10–4 [Ryynänen and Primmer 2004
] and bog turtle
10–4 [Rosenbaum et al. 2007
]). Microbial populations, on the other hand, have enormous population sizes with the potential for high diversity (Allen et al. 2007
), although near-clonal populations with very low diversity can also exist (Strous et al. 2006
).
As a rule of thumb, an uncorrected estimate will be biased significantly if n
, where n is the sample size,
is the average error remaining in the data, and
is defined on a per-site basis (if
is defined per-locus, then n
L
). However, estimators that focus on singletons as a means of detecting selection (e.g., Fu and Li's D [1993]
) or population growth will encounter trouble much earlier. Given low diversity, a particular problem arises from "single-pass" data in which each nucleotide is sequenced at most once from a particular individual. Metagenomics and ancient DNA projects both fit into this category, along with any high-throughput sequencing where experiments cannot be repeated, either because of expense or insufficient quantity of biological sample.
The proper way to avoid bias is to explicitly incorporate quality scores into the parameter estimation framework, either by correcting the biased estimator as demonstrated above or by taking a likelihood-based approach as we did in an earlier study working with metagenomics data (Johnson and Slatkin 2006
). An alternative method of reducing bias involves raising the threshold quality score and discarding even more data. However, not only does this latter strategy retain some bias (albeit a smaller amount) but it also leads to an increase in variance through the reduction in data. Thus, we urge empiricists to use threshold-based estimators with caution and theoreticians to develop estimators that avoid the problem altogether by accounting for data quality.
| Acknowledgements |
|---|
|
|
|---|
Thanks to Weiwei Zhai for helpful discussions and to Michael Jordan for pointing out the broader connection to missing data problems. This research was supported by National Institutes of Health grant R01-GM40282 to M.S.
| Footnotes |
|---|
Yoko Satta, Associate Editor
| References |
|---|
|
|
|---|
Allen EE, Tyson GW, Whitaker RJ, Detter JC, Richardson PM, Banfield JF. Genome dynamics in a natural archaeal population. Proc Natl Acad Sci USA (2007) 104:1883–1888.
Bouck J, Miller W, Gorrell JH, Muzny D, Gibbs RA. Analysis of the quality and utility of random shotgun sequencing at low redundancies. Genome Res (1998) 8:1074–1084.
Brandstatter A, Sanger T, Lutz-Bonengel S, Parson W, Beraud-Colomb E, Wen B, Kong QP, Bravi CM, Bandelt HJ. Phantom mutation hotspots in human mitochondrial DNA. Electrophoresis (2005) 26:3414–3429.[CrossRef][Web of Science][Medline]
Briggs AW, Stenzel U, Johnson PLF, et al, (11 co-authors). Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci USA (2007) 104:14616–14621.
Brown G, Gill G, Kuntz R, Langley C, Neale D. Nucleotide diversity and linkage disequilibrium in loblolly pine. Proc Natl Acad Sci USA (2004) 101:15255–15260.
Clark AG, Whittam TS. Sequencing errors and molecular evolutionary analysis. Mol Biol Evol (1992) 9:744–752.[Abstract]
Cockerham CC, Weir BS. Correlations, descent measures: drift with migration and mutation. Proc Natl Acad Sci USA (1987) 84:8512–8514.
Crawford DC, Akey DT, Nickerson DA. The patterns of natural variation in human genes. Annu Rev Genomics Hum Genet (2005) 6:287–312.[CrossRef][Web of Science][Medline]
Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res (1998) 8:186–194.
Frazer KA, Wade CM, Hinds DA, Patil N, Cox DR, Daly MJ. Segmental phylogenetic relationships of inbred mouse strains revealed by fine-scale analysis of sequence variation across 4.6 mb of mouse genome. Genome Res (2004) 14:1493–1500.
Fu YX, Li WH. Statistical tests of neutrality of mutations. Genetics (1993) 133:693–709.[Abstract]
Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics (2002) 18:337–338.
Johnson PLF, Slatkin M. Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome Res (2006) 16:1320–1327.
Lincoln SE, Lander ES. Systematic detection of errors in genetic linkage data. Genomics (1992) 14:604–610.[CrossRef][Web of Science][Medline]
Little RJ, Rubin DB. Statistical analysis with missing data (2002) 2nd ed. Hoboken (NJ): John Wiley & Sons.
Margulies M, Egholm M, Altman WE, et al, (56 co-authors). Genome sequencing in microfabricated high-density picolitre reactors. Nature (2005) 437:376–380.[Medline]
Olson M, Green P. A "quality-first" credo for the Human Genome Project. Genome Res (1998) 8:414–415.
Pääbo S. Ancient DNA: extraction, characterization, molecular cloning, and enzymatic amplification. Proc Natl Acad Sci USA (1989) 86:1939–1943.
Poinar HN, Schwarz C, Qi J, et al, (13 co-authors). Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science (2006) 311:392–394.
Rosenbaum PA, Robertson JM, Zamudio KR. Unexpectedly low genetic divergences among populations of the threatened bog turtle (Glyptemys muhlenbergii). Conserv Genet (2007) 8:331–342.[CrossRef]
Ross SM. Stochastic Processes. (1996) 2nd ed. New York: John Wiley & Sons, Inc.
Ryynänen HJ, Primmer CR. Distribution of genetic variation in the growth hormone 1 gene in Atlantic salmon (Salmo salar) populations from Europe and North America. Mol Ecol (2004) 13:3857–3869.[CrossRef][Medline]
States DJ. Molecular sequence accuracy: analysing imperfect data. Trends Genet (1992) 8:52–55.[Web of Science][Medline]
Strous M, Pelletier E, Mangenot S, et al, (37 co-authors). Deciphering the evolution and metabolism of an anammox bacterium from a community genome. Nature (2006) 440:790–794.[CrossRef][Medline]
Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics (1983) 105:437–460.
Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics (1989) 123:585–595.
Watterson GA. On the number of segregating sites in genetical models without recombination. Theor Popul Biol (1975) 7:256–276.[CrossRef][Web of Science][Medline]
Weir BS, Cockerham CC. Estimating F-statistics for the analysis of population structure. Evolution (1984) 38:1358–1370.[CrossRef][Web of Science]
Wright S. The genetical structure of populations. Ann Eugen (1951) 15:323–354.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
X. Liu, T. J. Maxwell, E. Boerwinkle, and Y.-X. Fu Inferring Population Mutation Rate and Sequencing Error Rate Using the SNP Frequency Spectrum in a Sample of DNA Sequences Mol. Biol. Evol., July 1, 2009; 26(7): 1479 - 1490. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Kunin, A. Copeland, A. Lapidus, K. Mavromatis, and P. Hugenholtz A Bioinformatician's Guide to Metagenomics Microbiol. Mol. Biol. Rev., December 1, 2008; 72(4): 557 - 578. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Lynch Estimation of Nucleotide Diversity, Disequilibrium Coefficients, and Mutation Rates from High-Coverage Genome-Sequencing Projects Mol. Biol. Evol., November 1, 2008; 25(11): 2409 - 2419. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Burgess and Z. Yang Estimation of Hominoid Ancestral Population Sizes under Bayesian Coalescent Models Incorporating Mutation Rate Variation and Sequencing Errors Mol. Biol. Evol., September 1, 2008; 25(9): 1979 - 1994. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Achaz Testing for Neutrality in Samples With Sequencing Errors Genetics, July 1, 2008; 179(3): 1409 - 1424. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
















