MBE Advance Access originally published online on April 16, 2008
Molecular Biology and Evolution 2008 25(7):1503-1511; doi:10.1093/molbev/msn095
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Bayesian Inference of Errors in Ancient DNA Caused by Postmortem Degradation

* Department of Forest Sciences/Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
Genome Center and Department of Evolution and Ecology, University of California, Davis
E-mail: brannala{at}ucdavis.edu.
| Abstract |
|---|
|
|
|---|
Methods for extracting and amplifying sequences using ancient DNA (aDNA) can be prone to errors caused by postmortem modifications of the DNA strand. A new statistical method is developed for predicting errors in aDNA sequences caused by such processes. In addition to the canonical DNA substitution model parameters, a discrete Markov chain is used to describe nucleotide substitutions occurring via postmortem degradation of the aDNA sequences. A computer program, BYPASSR-degr, was developed implementing the method and was used in subsequent analyses of simulated data sets under the new model. Simulation studies show that the new method can be powerful and accurate in identifying damaged sites. The method is applied to analyze aDNA sequences of Etruscans, Adélie penguins, and horses. No significant signals of degradation were observed at any sites of the aDNA sequences we analyzed.
Key Words: Bayesian phylogenetic inference molecular evolution Markov process site-specific rates Metropolis–Hastings algorithm ancient DNA DNA damage gene amplification
| Introduction |
|---|
|
|
|---|
Recent advances in molecular genetics allow DNA to be extracted, amplified, and sequenced from ancient tissues (Pääbo 1989
G)/(T
C) (type I) and (C
T)/(G
A) (type II) (Hansen et al. 2001
The continuous improvement of amplification techniques has reduced the number of such artifacts, but the precise rate, or pattern, of occurrence of miscoding lesions remains difficult to estimate. An approximate rate of postmortem damage was calculated by Hofreiter et al. (2001)
by comparing the PCR products of ancient samples with a database reference sequence. These authors concluded that miscoding lesions are unlikely to be more frequent than 0.1%. The overall number of transitions attributed to DNA damage processes is suspected to be inflated because it may include some errors caused by the PCR amplification technique itself (Gilbert et al. 2007
). An even smaller number of miscoding lesions, mimicking substitutions that cause evolutionary changes, influence phylogenetic analyses aiming at estimating the probability that particular sites have undergone changes. Considering an alignment of reduced length aDNA sequences (typically a few hundred nucleotides), miscoding lesions can lead to higher estimated substitution rates at the degraded sites and consequent overestimates of overall levels of polymorphism.
As experimental procedures improve, the rates of enzymatic errors are being reduced leaving miscoding lesions as the most likely cause of misincorporated nucleotides in aDNA samples. Computational and statistical approaches aimed at addressing this problem are currently too simplistic. One early method, for example, uses parsimony principles to construct median-joining networks of the clones (using a weighted distance measure between 2 sequences obtained by counting the number of differences); the resulting sequences, inferred by assuming the minimum number of changes and clustered into a network, are taken to represent the unsampled sequences from which the observed sequences were derived (Bandelt et al. 1999
). The postprocessing of this method was recently improved by calculating a statistic for each of the sequences in the median vector based on current knowledge concerning the types of substitutions characterized as miscoding lesions (Helgason et al. 2007
). The network-based method suffers from several weaknesses, the main one being that it ignores uncertainties in the reconstructed ancestral sequences. A Bayesian solution to this problem was pursued by Ho et al. (2007)
who introduced a parameter that describes the nucleotide error rate at the tips for sequences incorporated into a general model used for phylogenetic inference. A weakness of this approach is that it is not sufficiently specific to account for the biases in degradation-induced nucleotide change evident from recent experimental analyses. The Ho et al. (2007)
method assumes that the same substitution process applies to both evolutionary substitutions and degradation damage because it only allows for a difference in the rate of substitution at tips and not a difference in the relative rates of substitutions to different nucleotides. Thus, substantial differences in frequency among different types of miscoding lesions (e.g., type I vs. type II) (Gilbert et al. 2007
) are not accounted for and there is a greater risk of removing genuine polymorphisms as damage. Empirical evidence clearly indicates that the patterns of nucleotide substitution are very different under these 2 processes, and this should be explicitly accounted for in the model.
In the present work, we address the problem of misincorporated nucleotides in aDNA data by extending the flexible framework for modeling DNA substitution processes described in Mateiu and Rannala (2006)
to explicitly model aDNA errors. In addition to the canonical DNA substitution model parameters, a discrete Markov chain is used to describe nucleotide substitutions occurring via postmortem degradation of the aDNA sequences. A discrete Markov chain is the appropriate formulation because the DNA degradation process does not appear to depend on time (branch lengths) and instead depends on the conditions of preservation and so on.
| Theory |
|---|
|
|
|---|
Rate Heterogeneity, Data Augmentation, and Uniformization
In molecular phylogenetics, site-specific substitution rates can be integrated into a Bayesian formulation by allowing the Metropolis–Hastings algorithm to integrate over the unobserved rate values, for which a prior was specified. In our formulation, following Yang 1993
![]() | (1) |
) is the prior density of rates for the mth site (with
specifying the variance of the prior on rates), f(w|
, µ) is the birth–death prior density of branch lengths, w = {wl}, with sampling parameter fixed at 0.15,
and µ are the parameters of the birth–death prior on branch lengths (Yang and Rannala 1997
represents the parameters of the substitution model, and x = {xml} is a matrix of l aligned nucleotide sequences of m sites (where xml is the nucleotide present at site m of sequence l).
Mateiu and Rannala 2006
introduced 2 additional vectors of random variables: the numbers of transitions on each branch and the unobserved ancestral nucleotides at the internal nodes of the tree. Explicit modeling of the number of substitutions on the tree allowed us to use the uniformization procedure as an efficient alternative to calculate transition probabilities along the branches of the tree. The detailed description of the transformation of a continuous time Markov nucleotide substitution process into an equivalent Poisson substitution process is given by Mateiu and Rannala (2006)
. By treating the nucleotides at the internal nodes as random variables in the chain, one avoids the need to calculate the conditional probabilities on subtrees (as in the pruning algorithm [Felsenstein 1981
]) but instead directly evaluates the ancestral nucleotides in a Markov chain Monte Carlo (MCMC) step.
Using the notation x–={x
} for a matrix of the l – 2 ancestral nucleotide sequences on the tree and M = {Mlm} for Mlm number of transitions at site m on branch l of a phylogenetic tree T, the augmented likelihood is
![]() | (2) |
|
|
Modeling Miscoding Lesions
The miscoding lesions generated during amplification of an aDNA template are predominantly characterized by 4 types of substitutions with 2 phenotypic outcomes: (A
G)/(T
C) and (C
T)/(G
A) (Binladen et al. 2006
). Miscoding lesions were detected in tissues thousands of years old (Willerslev et al. 2003
) as well as museum samples tens or hundreds of years old and even samples of 4-year-old dried tissues (Pääbo 1989
). As the accumulation of substitutions is not a strict function of time, the generation of miscoding lesions cannot be modeled in the same way as the substitution process on the branches of a phylogenetic tree. Instead, a discrete Markov process in which the 4 possible substitutions are allowed with a small rate is a simple and straightforward way to describe the process. The transition probability matrix for this process is
![]() |
Substitutions caused by the degradation process are captured by parameter q of our model, whereas the unlikely transversions are represented by parameter z. Adding the degradation process to the preexisting phylogenetic tree, we can think of degradation as a substitution process happening on an edge that connects the sequence extracted from the aDNA with a hypothetical sequence that existed at the instant in the past at which the organism died. In figure 1, this is represented by evolving the aDNA sequences from the hypothetical nodes to the tips along the "degradation" edges, whereas the nucleotides at hypothetical nodes A', B', and C' are obtained according to their probabilities in the usual stochastic formulation of the DNA substitution model. Formally written, for a mixture of s contemporary sequences and U aDNA sequences, each of length n, the augmented likelihood becomes
![]() | (3) |
is a nucleotide at the hypothetical sequence u at site v.
|
Metropolis–Hastings Algorithm
In our model, the degradation process is a time-independent process and the age of the aDNA is irrelevant. The BYPASSR program (available from http://www.rannala.org [Mateiu and Rannala 2006
|
| (4) |
0 and B is a normalizing constant
|
| (5) |
|
|
(a0+1), respectively. One method for sampling from the Dirichlet is to draw y1, y2, y3 from independent gamma distributions with common scale and shape parameters a1 = a0 x x1, a2 = a0 x x2, a3 = a0 x x3 where for each yi, x
=yi/
![]() | (6) |
Next, a nucleotide at a random site and hypothetical node is chosen as a candidate for a proposed change. The likelihood ratio is the product of 2 fractions. The first term is given by the substitution probabilities in degradation matrix D corresponding to the proposed and current nucleotide at the hypothetical node. The second fraction is the ratio of transition probabilities along the branch connecting the hypothetical node with its parent node, a process described by the uniformized substitution matrix M. The acceptance ratio in this case is written as
|
|
Simulation Analysis of Statistical Performance
Simulation studies were used to evaluate the performance of our new method by examining the accuracy of estimates of parameters of the site degradation model (for which the true values are known), the accuracy of the method in identifying sites that are known to have undergone degradation, and so on. Currently, most extracted aDNA sequences have been mitochondrial and analyses have focused on comparisons of relatively closely related sequences for which a molecular clock assumption is likely to be satisfied; we therefore focused on testing the model and program using data generated under a strict molecular clock. A program was written in C++ to simulate random clock-like trees. The program EVOLVER (PAML) (Yang 2007
) was then used to generate sequences on the simulated trees under a specified DNA substitution model and Gamma distribution parameter
, allowing rate heterogeneity across sites. A second program was developed, which continues to "evolve" the sequences for the nodes corresponding to ancient data. The parameters p, q, and z are set to specific values and a proportion q of the total sites at the hypothetical nodes (randomly chosen) are allowed to degrade according to the Markov chain model of degradation outlined above. The location and the types of degradation changes are stored for postanalysis comparison. The simulated data sets were analyzed with the specific objectives of assessing the accuracy of the new method in recovering sites in the simulated data known to be degraded, investigating the extent of bias in the estimates of site-specific rates when degradation processes are ignored and evaluating the optimal proportion of aDNA sequences in a data set necessary to recover (with high probability) the original nucleotides at the damaged sites.
Initially, 6 data sets of 20, 30, 40, and 50 sequences, each comprising 500 sites, were generated for random trees of total length 1 (in units of expected substitutions). A general time-reversible (GTR) model with parameters a = 1, b = 2, c = 3, d = 4, e = 5 and nucleotide frequencies
T = 0.1,
C = 0.2,
A = 0.3, and
G = 0.4 was assumed. The shape parameter,
, of the Gamma distribution was varied allowing different levels of rate variation among sites,
= 0.3, 0.5, or 1.0. In all the data sets, the number of aDNA sequences was 10 and the degradation parameter q was set to be either q = 0.005, 0.05, or 0.2, with z = 0 in all cases. In total, 36 alignments were analyzed with BYPASSR-degr, using 6 million iterations in the "burn-in" phase and 6 million iterations in the sampling phase (during which 2,000 samples were collected). Besides the site-specific rates, branch lengths, and GTR parameter estimates, we examined the posterior means of the nucleotides at the aDNA nodes. If the method is highly accurate, the nucleotide with the largest posterior mean should match the nucleotide known (from the simulation) to be present immediately prior to the point at which degradation occurs. The detailed results of each run are shown in table 1. The last 2 columns in the tables represent the method's ability to identify true changes calculated as the proportion of damaged sites identified when a posterior probability of 0.95 is used as the criterion for accepting an alternative nucleotide at an aDNA site and the false-positive rate calculated as the proportion of sites that were incorrectly identified as damaged.
|
We are interested in comparing the site-specific substitution rates inferred when damaged data are analyzed using a model that allows for the presence of degraded sites with those obtained using a model that does not allow for the possibility of degraded sites. We expect a significant difference between the 2. On the other hand, we expect to obtain similar results when we use degraded data with the degradation model implementation (BYPASSR-degr) and data without degradation, both analyzed using the BYPASSR-degr implementation of the degradation model. The similarity is evident from the correct assignment of the nucleotides at the degraded nodes and sites that recreates the sequences before the inclusion of the damaged nucleotides. A typical result of this model testing approach is shown in figure 2, in which data sets generated with
= 0.5 and degradation matrix parameters p = 0.95 and q = 0.05 are analyzed. A good fit in this case indicates that the model is accurate in the estimation of site-specific substitution rates, even in the presence of degraded sites.
|
An important difference is observed between the posterior mean site-specific substitution rates obtained when data with incorporated errors are analyzed using a model that integrates over the uncertainty in the aDNA data versus a model that ignores the degraded nucleotides (fig. 3). In the later case, the presence of 5% degraded sites creates the appearance of a higher number of substitutions in the data which is reflected in a higher
(
20seqs = 1.49 ± 0.24,
30seqs = 1.30 ± 0.19,
40seqs = 1.67 ± 0.29,
50seqs = 1.45 ± 0.24) and longer branch lengths (fig. 3B panels). The comparison between the mean posterior substitution rates in the 2 situations (degraded data analyzed with BYPASSR-degr and BYPASSR) shows a significant reduction of rate variation among sites that causes the lower rates to be higher and vice versa (fig. 3A panels).
|
A larger simulation study was performed on data sets with additional sequences and different proportions of aDNA. Following the same procedure as described in the previous paragraphs, we started with random trees of either 50, 100, or 150 taxa and generated data sets with 1/2 or 1/4 of the sequences representing aDNA. For each combination, 2 values of the degradation parameter, q, were used: q = 0.01 and 0.05, producing 24 data sets in total. Table 2 shows the results of our simulations. The results show that the power to detect damaged sites increases with increasing numbers of taxa and/or an increase in the proportion of aDNA sequences present in the sample. The estimates of
based on the posterior means for these analyses tend to slightly overestimate the true value, probably due to the relatively short branch lengths and small numbers of sites examined. In this case, the prior can be expected to influence the posterior potentially leading to some bias.
|
Analysis of Etruscan HVR-I Mitochondrial aDNA
A genetic study on the remains of 80 Etruscans, the pre-Roman population of Italy, was published in 2004 by Vernesi et al. (2004)
|
For each of the data sets, the phylogenetic tree was obtained by maximum likelihood using the HKY85 substitution model and 4 categories for the discrete gamma distribution approximation (Yang 2007
The results of the runs are summarized in tables 4 and 5. The degree of degraded sites in these data sets thus appears very low as evidenced by a q parameter with posterior mean q < 10–3. The posterior means across independent MCMC runs are highly consistent indicating convergence, with the exception of run 9 for data set 2, which appears to have an inflated value for z. A small number of sites showed a weak signal of degradation with posterior probability (averaged across runs) for an alternative nucleotide present in the aDNA sequences
0.95.
|
|
Previous analysis of the HVR-I region from Etruscan remains have found several sites to be prone to postmortem damage or to show high substitution rates (Vernesi et al. 2004
Analysis of Adélie Penguin and Horse aDNA
Two of the data sets analyzed by Ho et al. (2007)
were also analyzed using our method for purposes of comparison. The first data set analyzed is from the study of Lambert et al. (2002)
of 345 bp from the mtDNA control region (hypervariable region I) of Adélie penguins. We analyzed 107 nonredundant sequences from a set of 379 contemporary sequences (accession numbers AF474412
[GenBank]
–AF474791
[GenBank]
) as well as 92 aDNA sequences (Genbank accession numbers AF474887
[GenBank]
–AF474792
[GenBank]
). From the total of 107 contemporary + 92 aDNA sequences, 2 smaller data sets were created for the use in the analysis using the even (set 1: 57 DNA + 43 aDNA) and odd (set 2: 49 DNA + 49 aDNA) sequence numbers. Sequences were aligned using ClustalW, and the tree was obtained using the HKY85 model with 8 categories for the gamma distribution of among-site rate variation, as implemented in PAML (Yang 2007
). In analyzing the first data set, 258 sites were used in the analysis following the exclusion of 33 gaps. In the second data set, 276 sites were used after the exclusion of 37 gaps. BYPASSR-degr was run on these data using 6 million iterations of burn in and sampling. The mean posterior estimates of the degradation model parameters are shown in table 6. There were no individual sites with a posterior probability of degradation greater than 0.95.
|
The second data set analyzed is from the study of Vila et al. (1989)
|
| Discussion |
|---|
|
|
|---|
We have developed a novel approach to infer properties of the degradation process in aDNA and have incorporated this process in the context of a model with continuous variation of substitution rates among sites. In the limited analyses we have carried out using simulated data, BYPASSR-degr, the implementation of the model proposed by us performed very well in identifying the damaged sites (even if there are many damaged sites spread across several sequences) and obtaining reasonably precise estimates of the other tree parameters (i.e., branch lengths, site-specific rate, GTR model parameters, etc.). In general, the type I error rate was very low, and the method was not prone to spurious detection of nonexistent degradation errors. By contrast, Ho et al. (2007)
| Acknowledgements |
|---|
|
|
|---|
We thank Giorgio Bertorelle for bringing the problem of miscoding lesions in aDNA to our attention. Support for this research was provided by a grant from the Canadian Institutes of Health Research (MOP 44064) to B.H.R.
| Footnotes |
|---|
Jeffery Thorne, Associate Editor
| References |
|---|
|
|
|---|
Bandelt HJ, Forster P, Rohl A. Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol (1999) 16:37–48.[Abstract]
Binladen J, Wiuf C, Gilbert M, Bunce M, et al, (11 co-authors). Assessing the fidelity of ancient DNA sequences amplified from nuclear genes. Genetics (2006) 172:733–741.
Cooper A, Poinar H. Ancient DNA: do it right or not at all. Science (2000) 289:1139.[Web of Science][Medline]
Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol (1981) 17:368–376.[CrossRef][Web of Science][Medline]
Gelman A, Carlin J, Stern H, Rubin D. Bayesian data analysis (2004) Boca Raton (FL): Chapman & Hall/CRC.
Gilbert M, Binladen J, Miller W, Wiuf C, Willerslev E, Poinar H, Carlson JE, Leebens-Mack JH, Schuster SC. Recharacterization of ancient DNA miscoding lesions: insights in the era of sequencing-by-synthesis. Nucleic Acids Res (2007) 35:1–10.
Hansen A, Willerslev E, Wiuf C, Mourier T, Arctander P. Statistical evidence for miscoding lesions in ancient DNA templates. Mol Biol Evol (2001) 18:262–265.
Helgason A, Palsson S, Lalueza-Fox C, Ghosh S, (10 co-authors). A statistical approach to identify ancient template DNA. J Mol Evol (2007) 65:92–102.[CrossRef][Web of Science][Medline]
Ho S, Heupink T, Rambaut A, Shapiro B. Bayesian estimation of sequence damage in ancient DNA. Mol Biol Evol (2007) 24:1416–1422.
Hoelzel AR. Ancient genomes. Genome Biol (2005) 6:239.[CrossRef][Medline]
Hofreiter M, Jaenicke V, Serre D, Haeseler A, Pääbo S. DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA. Nucleic Acids Res (2001) 29:4793–4799.
Kimura M. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J Mol Evol (1980) 16:111–120.[CrossRef][Web of Science][Medline]
Lambert DM, Ritchie PA, Millar CD, Holland B, Drummond AJ, Baroni C. Rates of evolution in ancient DNA from Adélie penguins. Science (2002) 295:2270–2273.
Mateiu L, Rannala B. Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation. Syst Biol (2006) 55:259–269.[CrossRef][Web of Science][Medline]
Pääbo S. Ancient DNA: extraction, characterization, molecular cloning, and enzymatic amplification. Proc Natl Acad Sci USA (1989) 86:1939–1943.
Pääbo S, Poinar H, Serre D, Jaenicke-Despres V, et al, (10 co-authors). Genetic analyses from ancient DNA. Annu Rev Genet (2004) 38:645–679.[CrossRef][Web of Science][Medline]
Vernesi C, Caramelli D, Dupanloup I, et al, (13 co-authors). The Etruscans: a population-genetic study. Am J Hum Genet (2004) 74:694–704.[CrossRef][Web of Science][Medline]
Vila C, Leonard JA, Gotherstrom A, Marklund S, Sandberg K, et al. Widespread origins of domestic horse lineages. Science (1989) 291:474–477.[CrossRef]
Willerslev E, Hansen A, Binladen J, Brand T, et al, (11 co-authors). Diverse plant and animal genetic records from Holocene and Pleistocene sediments. Science (2003) 300:791–795.
Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol (1993) 10:1396–1401.[Abstract]
Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol (2007) 24:1586–1591.
Yang Z, Rannala B. Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo method. Mol Biol Evol (1997) 14:717–724.[Abstract]
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
S. Guimaraes, S. Ghirotto, A. Benazzo, L. Milani, M. Lari, E. Pilli, E. Pecchioli, F. Mallegni, B. Lippi, F. Bertoldi, et al. Genealogical Discontinuities among Etruscan, Medieval, and Contemporary Tuscans Mol. Biol. Evol., September 1, 2009; 26(9): 2157 - 2166. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








