MBE Advance Access originally published online on July 24, 2008
Molecular Biology and Evolution 2008 25(10):2181-2187; doi:10.1093/molbev/msn163
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
The Effect of Ancient DNA Damage on Inferences of Demographic Histories

* Department of Biology, Evolutionary Biology, Copenhagen University, Copenhagen, Denmark
Departments of Biology and Statistics, University of California, Berkeley
E-mail: eaxelsson{at}bio.ku.dk.
| Abstract |
|---|
|
|
|---|
The field of ancient DNA (aDNA) is casting new light on many evolutionary questions. However, problems associated with the postmortem instability of DNA may complicate the interpretation of aDNA data. For example, in population genetic studies, the inclusion of damaged DNA may inflate estimates of diversity. In this paper, we examine the effect of DNA damage on population genetic estimates of ancestral population size. We simulate data using standard coalescent simulations that include postmortem damage and show that estimates of effective population sizes are inflated around, or right after, the sampling time of the ancestral DNA sequences. This bias leads to estimates of increasing, and then decreasing, population sizes, as observed in several recently published studies. We reanalyze a recently published data set of DNA sequences from the Bison (Bison bison/Bison priscus) and show that the signal for a change in effective population size in this data set vanishes once the effects of putative damage are removed. Our results suggest that population genetic analyses of aDNA sequences, which do not accurately account for damage, should be interpreted with great caution.
Key Words: aDNA genealogy coalescent process DNA damage demography
| Introduction |
|---|
|
|
|---|
Ever since genetic material from an extinct species was first analyzed in 1984 (Higuchi et al. 1984
At the core of all aDNA work remains the postmortem instability of nucleic acid. Over time DNA is degraded and fragmented into smaller and smaller pieces, in the end leaving little or no amplifiable templates. As a consequence, PCR becomes very sensitive to contamination from other DNA sources. Apart from fragmentation, postmortem DNA damage can also lead to alteration of the individual bases that are present in the sequences generated post-PCR. Termed miscoding lesions, these can be explained by a combination of base degradation to derivatives that are subsequently misreplicated by polymerase enzymes, plus the compounded effects of innate enzyme infidelity on typically low numbers of initial PCR templates (Hansen et al. 2001
; Binladen et al. 2006
). With regards to base degradation, although there has been some controversy as to exactly which base lesions are formed from such alterations, much evidence now points toward the deamination of cytosine (C) to uracil (U) as the dominant source of damage (Stiller et al. 2006
; Briggs et al. 2007
; Brotherton et al. 2007
; Gilbert, Binladen, et al. 2007
). During subsequent enzymatic replication (e.g., during PCR), these uracils are replaced with thymines (T), via an intermediate step where uracil leads to adenine incorporation opposite on the complementary DNA strand (Pääbo 1989
). Ultimately, this process leads to the generation of C
T miscoding lesions. With regards to innate enzyme error, the principal problem is that, although error may be typically low, when PCR is initiated from very small numbers of templates, any error that occurs early during the reaction cycle might lead to a substantial number of descendent molecules containing the error (e.g., Hofreiter et al. 2001
). In turn, when these are sequenced, the effect of this error may be severe. Due to both these problems therefore, the usefulness of aDNA studies often become critically dependent on precautions taken with respect to the status of the template DNA.
In recent years, population genetic theory has undergone a phase of rapid development. In particular, this progress has meant that coalescent theory now forms the basis of most population genetic methods. Furthermore, for studies including aDNA, or rapidly evolving organisms such as viruses, extensions to this theory now allow for inferences from data sets including temporally separated samples (Rodrigo et al. 1999
). At the heart of these analyses lie inferences of population genetic parameters that influence the appearance of the genealogy of a locus. In this study, we investigate the effect of DNA damage, using a simple model of base degradation as an example, on inferred genealogical structure and its subsequent consequence for demographic inferences.
Miscoding lesions may impact the shape of inferred genealogies in 2 ways. First, false variation can cause misleading relationships among samples (Ho, Heupink, et al. 2007
). Second, it may inflate the length of lineages in the tree connected to the leaf nodes. Here we focus on the latter of the 2 effects and illustrate how this problem may affect the inference of past population size change. In the standard coalescent model (e.g., Kingman 1982a
, 1982b
; Hudson 1983
), coalescences occur at rate n(n – 1)/4Ne, where n is the number of ancestral lineages and Ne is the effective population size. Genealogies of populations of large size are thus expected to exhibit relatively long coalescent times, whereas the opposite is expected for small populations. The predominant effect of miscoding lesions is to extend inferred branch lengths at around the time where the aDNA sequences were sampled. The artifactually elongated branch lengths will be interpreted as a reduction in coalescence rates and consequently also an increase in population size. In figure 1, the genealogy of a sample including both modern and ancient sequences is depicted. The example shows how genetic variation added to the sample as a result of damage accumulates on edges connecting ancient specimens to the rest of the tree. Simultaneously, both external branches leading to modern samples and internal branches of the genealogy are left largely unaffected by the false diversity. In the case of only aDNA, we would likewise predict artifactual inferences of population growth. However, for population samples including both damaged aDNA and modern DNA, we would expect coalescent-based methods to suggest demographic scenarios where an initial phase of growth is followed by a relatively recent decline, even though the actual population size was kept constant.
|
| Materials and Methods |
|---|
|
|
|---|
To investigate this prediction further, we simulate serial samples of DNA sequences evolving under a constant population size and no recombination (most population studies including ancient samples published to date are based on mitochondrial loci). We assume a mutation rate of 10–7 mutations per site per year, occurring according to the Jukes and Cantor (1969)
T and G
A transitions. Although such damage can arise due to both base degradation and enzyme sequencing infidelity, we simulate our rate of damage using values that relate to observed occurrences of base degradation only. Calculated from an exceptionally well-preserved, frozen (and thus relatively thermally young) mammoth bone, this rate is likely a conservative underestimate of the rate of base degradation in many ancient bone samples. Thus, we highlight that in a situation where all, or a subset, the aDNA sequences in a data set are generated without any form of replication, the damage rate used here likely represent a relatively modest number of miscoding lesions per sequence. In contrast, should replication of some form be applied to the data (including the generation of sequence from multiple PCRs or cloned sequences), damage may have less of an impact than that simulated here (cf. Hofreiter et al. 2001). The simulated damage corresponds to deamination of cytosine as the sole cause of DNA degradation, however, due to the complementary nature of the genetic code, and the fact that a change may have originated on either of the 2 strands, both types of transitions are simulated (Hansen et al. 2001| Results |
|---|
|
|
|---|
Figure 2a shows the results of a scenario in which the 25 ancient samples are of equal age set to 50,000 years. Moving from past toward present times, we see an initial phase where estimates suggest a constant population size. This phase probably represents coalescent events primarily involving internal and thus undamaged branches of the genealogy. Then, in line with expectations from arguments presented above, a phase of spurious but dramatic population growth follows as we move closer to the sampling time of the ancient sequences. A peak is reached before the actual age of the aDNA, likely corresponding to a burst in coalescent events involving damaged sequences. Such events can only take place after the time point at which ancient samples were included to the genealogy. The growth phase is next replaced by a rather steep decline. The fact that we observe a relatively large population size still some time after having passed 50,000 years is probably partly reflected in the way the generalized skyline plots averages population size estimates across a number of coalescent events (Drummond et al. 2005
|
We next simulate 20 new genealogies, this time without the addition of miscoding lesions to the ancient sequences. Figure 2b presents the results of these simulations, and this time we observe a flat line in good agreement with our constant-sized simulation settings. In order to imitate scenarios that resemble real data sets more closely, we also run simulations where aDNA ages are drawn randomly from a uniform distribution between 0 and 100,000 years. The results of these simulations, with and without damage, are shown in figure 2c and d, respectively. The general trend is similar to that of the previous simulations, although the false signal of population growth and decline is now prolonged as a result of the diversity in sample age. These simulations suggest that, unless stringent measures are taken to minimize their effects, miscoding lesions that can be expected to be present in sequences sampled from ancient specimens have the potential to create spurious signals of population size change. Furthermore, this signal has a characteristic shape where an initial apparent population growth is followed by a decline.
Next, in order to examine the effect of different damage rates on inferences of historical populations sizes, we perform simulations at 3 different damage rates (10–6, 10–7, and 10–8 miscoding lesions per site per year). Although explicit damage rates have rarely been calculated in past studies, previous observations on base degradation alone have indicated that the rate of cytosine deamination is up to 10x lower in ancient frozen hair than frozen bone (reflecting a rate of 10–8) (Gilbert, Tomsho, et al. 2007
), whereas the higher rate of 10–6 might be observed in extremely old nonfrozen materials or where significant levels of enzyme driven error are present. Again, we use 20 simulations for each rate and ancient sequences are of equal age (50,000 years). Figure 3 shows the estimated population size at the time of the sampling of the ancient sequences divided by the true population size (bias) plotted for the 3 damage rates. The results clearly show that the population size bias is affected by the damage rate and thus corroborates our finding that aDNA damage cause coalescent-based methods to overestimate past population sizes.
|
Reanalysis of Bison Data
Our analyses demonstrate that caution should be exercised when drawing conclusions from population genetic studies that include damaged sequences. Attempting to correlate the demographic history of species with for instance human activity or climate change might in some cases be premature and give misleading results. In the light of our findings, it might thus be worthwhile revisiting some of the published data sets in order to understand whether damage may have influenced previous results. To do this, we have chosen the largest yet published aDNA data set, on the steppe bison (Bison bison/Bison priscus) (Shapiro et al. 2004
T and G
A transitions could have been removed. However, as frequent and spurious T
C and A
G transitions have been observed in several aDNA data sets, generated on a wide range of substrates by a wide range of research groups (Gilbert et al. 2003
|
Clearly, the result of the analysis of the reduced data set might simply be due to a reduction in statistical accuracy and power due to the elimination of a large number of segregating sites (79%). Therefore in order to test if the reduced data set still holds enough statistical power to allow for the analysis, we perform several additional analyses. First, we produce 20 data sets where 79% of the segregating sites have been removed randomly without regards to the mutational type. The signal characteristic to the original set of sequences now reappears in these data sets (fig. 4b), indicating that the diverging results of the reduced and original data sets, respectively, are not explained by a simple lack of statistical power but more likely by the presence of miscoding lesions in the original bison data set—lesions that are not present in the reduced data set. Second, using an appropriate transition:transversion ratio, we simulate 3 different data sets consisting of 22 modern and 169 ancient, 700 bp long, sequences each (ages are uniformly distributed between 0 and 50,000 years), in the end harboring similar amounts of genetic variation as the real bison data set (
200 segregating sites). We then infer the demographic histories of these data sets prior to, and after, the removal of segregating transitions. Figure 5 displays the results of these simulations. Panel (a) shows the skyline plot of a sample taken from a population that recently increased 20-fold in size. As evident from panel (b), BEAST is able to recover the signal of population growth even after the removal of all segregating transitions from this data set. Panels (c) and (d) display the results of a scenario where the genealogy was simulated without population size change and again the shape of the skyline plots before (c) and after (d) removing all sites segregating as transitions both reflect the simulated demographic history well. In the third test, we once again simulate a sample from a population evolving under constant size; however, this time we also damage ancient sequences as described previously. Panel (e) displays how the population size of this simulated data is falsely inferred to have changed through time, while a signal consistent with the true demographic history is captured after the removal of segregating transitions prior to the analyses (panel f). The results of these simulations suggest that we are able to recover the demographic history of the bison using sites segregating as transversions only.
|
Finally, we infer the demographic history of the steppe bison using the 22 modern bison sequences included in the data set analyzed by Shapiro et al. (2004)
| Discussion |
|---|
|
|
|---|
It is clear that DNA damage may be an important factor influencing population genetic studies. Although the preservation of some specimens used in aDNA studies is so good that damage rates may be negligible, this situation is rare in general and extremely rare in population studies. As such damage must be taken into account for most situations. There are several possible methods for eliminating the effects of DNA damage. A first is to treat aDNA extracts with the enzyme uracil-N-glycosylase (Pääbo 1989
, which models DNA damage (Ho, Heupink, et al. 2007
| Acknowledgements |
|---|
|
|
|---|
E.W. and E.A. were supported by the Danish National Science Foundation and the Villum Kann Rasmussen Foundation.
| Footnotes |
|---|
Lisa Matisoo-Smith, Associate Editor
| References |
|---|
|
|
|---|
Binladen J, Wiuf C, Gilbert MT, et al, (11 co-authors). Assessing the fidelity of ancient DNA sequences amplified from nuclear genes. Genetics (2006) 172:733–741.
Briggs AW, Stenzel U, Johnson PL, et al, (11 co-authors). Patterns of damage in genomic DNA sequences from a Neandertal. Proc Natl Acad Sci USA (2007) 104:14616–14621.
Brotherton P, Endicott P, Sanchez JJ, Beaumont M, Barnett R, Austin J, Cooper A. Novel high-resolution characterization of ancient DNA reveals C > U-type base modification events as the sole cause of post mortem miscoding lesions. Nucleic Acids Res (2007) 35:5717–5728.
Drummond AJ, Rambaut A. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol (2007) 7:214.[CrossRef][Medline]
Drummond AJ, Rambaut A, Shapiro B, Pybus OG. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol Biol Evol (2005) 22:1185–1192.
Gilbert MT, Binladen J, Miller W, Wiuf C, Willerslev E, Poinar H, Carlson JE, Leebens-Mack JH, Schuster SC. Recharacterization of ancient DNA miscoding lesions: insights in the era of sequencing-by-synthesis. Nucleic Acids Res (2007) 35:1–10.
Gilbert MT, Hansen AJ, Willerslev E, Rudbeck L, Barnes I, Lynnerup N, Cooper A. Characterization of genetic miscoding lesions caused by postmortem damage. Am J Hum Genet (2003) 72:48–61.[CrossRef][Web of Science][Medline]
Gilbert MT, Tomsho LP, Rendulic S, et al, (30 co-authors). Whole-genome shotgun sequencing of mitochondria from ancient hair shafts. Science (2007) 317:1927–1930.
Green RE, Krause J, Ptak SE, et al, (11 co-authors). Analysis of one million base pairs of Neanderthal DNA. Nature (2006) 444:330–336.[CrossRef][Web of Science][Medline]
Hansen AJ, Willerslev E, Wiuf C, Mourier T, Arctander P. Statistical evidence for miscoding lesions in ancient DNA templates. Mol Biol Evol (2001) 18:262–265.
Higuchi R, Bowman B, Freiberger M, Ryder OA, Wilson AC. DNA sequences from the quagga, an extinct member of the horse family. Nature (1984) 312:282–284.[CrossRef][Web of Science][Medline]
Ho SY, Heupink TH, Rambaut A, Shapiro B. Bayesian estimation of sequence damage in ancient DNA. Mol Biol Evol (2007) 24:1416–1422.
Ho SY, Phillips MJ, Cooper A, Drummond AJ. Time dependency of molecular rate estimates and systematic overestimation of recent divergence times. Mol Biol Evol (2005) 22:1561–1568.
Ho SY, Shapiro B, Phillips MJ, Cooper A, Drummond AJ. Evidence for time dependency of molecular rate estimates. Syst Biol (2007) 56:515–522.[CrossRef][Web of Science][Medline]
Hofreiter M, Jaenicke V, Serre D, Haeseler Av A, Paabo S. DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA. Nucleic Acids Res (2001) 29:4793–4799.
Hofreiter M, Serre D, Rohland N, Rabeder G, Nagel D, Conard N, Munzel S, Paabo S. Lack of phylogeography in European mammals before the last glaciation. Proc Natl Acad Sci USA (2004) 101:12963–12968.
Hudson RR. Properties of a neutral allele model with intragenic recombination. Theor Popul Biol (1983) 23:183–201.[CrossRef][Web of Science][Medline]
Johnson SS, Hebsgaard MB, Christensen TR, et al, (15 co-authors). Ancient bacteria show evidence of DNA repair. Proc Natl Acad Sci USA (2007) 104:14401–14405.
Jukes TH, Cantor CR. Evolution of protein molecules. In: Mammalian protein metabolism—Munro HN, ed. (1969) New York: Academic Press. 21–132.
Kingman JFC. On the genealogy of large populations. J Appl Probab (1982a) 19:27–43.[CrossRef]
Kingman JFC. The coalescent. Stoch Proc Appl (1982b) 13:235–248.[CrossRef]
Noonan JP, Coop G, Kudaravalli S, et al, (11 co-authors). Sequencing and analysis of Neanderthal genomic DNA. Science (2006) 314:1113–1118.
Noonan JP, Hofreiter M, Smith D, Priest JR, Rohland N, Rabeder G, Krause J, Detter JC, Paabo S, Rubin EM. Genomic sequencing of Pleistocene cave bears. Science (2005) 309:597–599.
Pääbo S. Ancient DNA: extraction, characterization, molecular cloning, and enzymatic amplification. Proc Natl Acad Sci USA (1989) 86:1939–1943.
Poinar HN, Schwarz C, Qi J, et al, (13 co-authors). Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science (2006) 311:392–394.
Rodrigo AG, Shpaer EG, Delwart EL, Iversen AK, Gallo MV, Brojatsch J, Hirsch MS, Walker BD, Mullins JI. Coalescent estimates of HIV-1 generation time in vivo. Proc Natl Acad Sci USA (1999) 96:2187–2191.
Shapiro B, Drummond AJ, Rambaut A, et al, (27 co-authors). Rise and fall of the Beringian steppe bison. Science (2004) 306:1561–1565.
Stiller M, Green RE, Ronan M, et al, (19 co-authors). Patterns of nucleotide misincorporations during enzymatic amplification and direct large-scale sequencing of ancient DNA. Proc Natl Acad Sci USA (2006) 103:13578–13584.
Wagner A, Blackstone N, Cartwright P, Dick M, Misof B, Snow P, Wagner GP, Bartels J, Murtha M, Pendleton J. Surveys of gene families using polymerase chain-reaction—PCR selection and PCR drift. Syst Biol (1994) 43:250–261.
Willerslev E, Cappellini E, Boomsma W, et al, (30 co-authors). Ancient biomolecules from deep ice cores reveal a forested southern Greenland. Science (2007) 317:111–114.
Willerslev E, Cooper A. Ancient DNA. Proc Biol Sci (2005) 272:3–16.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
A. Rambaut, S. Y.W. Ho, A. J. Drummond, and B. Shapiro Accommodating the Effect of Ancient DNA Damage on Inferences of Demographic Histories Mol. Biol. Evol., February 1, 2009; 26(2): 245 - 248. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


). All simulated genealogies consist of 25 modern and 25 ancient sequences sampled from a population evolving under a constant population size (5,000 individuals). (a) All ancient sequences are sampled 50,000 years before present with damage as described in the main text. (b) Same as (a) but without damage. (c) The ages of the ancient sequences are drawn randomly from a uniform distribution ranging from 0 to 100,000 years, with damage as described in the main text. (d) Same as (c) but without damage.


