Skip Navigation


MBE Advance Access originally published online on April 16, 2008
Molecular Biology and Evolution 2008 25(7):1503-1511; doi:10.1093/molbev/msn095
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
25/7/1503    most recent
msn095v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Mateiu, L. M.
Right arrow Articles by Rannala, B. H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Mateiu, L. M.
Right arrow Articles by Rannala, B. H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Bayesian Inference of Errors in Ancient DNA Caused by Postmortem Degradation

Ligia M. Mateiu* and Bruce H. Rannala{dagger}

* Department of Forest Sciences/Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
{dagger} Genome Center and Department of Evolution and Ecology, University of California, Davis

E-mail: brannala{at}ucdavis.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Theory
 Discussion
 Acknowledgements
 References
 
Methods for extracting and amplifying sequences using ancient DNA (aDNA) can be prone to errors caused by postmortem modifications of the DNA strand. A new statistical method is developed for predicting errors in aDNA sequences caused by such processes. In addition to the canonical DNA substitution model parameters, a discrete Markov chain is used to describe nucleotide substitutions occurring via postmortem degradation of the aDNA sequences. A computer program, BYPASSR-degr, was developed implementing the method and was used in subsequent analyses of simulated data sets under the new model. Simulation studies show that the new method can be powerful and accurate in identifying damaged sites. The method is applied to analyze aDNA sequences of Etruscans, Adélie penguins, and horses. No significant signals of degradation were observed at any sites of the aDNA sequences we analyzed.

Key Words: Bayesian phylogenetic inference • molecular evolution • Markov process • site-specific rates • Metropolis–Hastings algorithm • ancient DNA • DNA damage • gene amplification


    Introduction
 TOP
 Abstract
 Introduction
 Theory
 Discussion
 Acknowledgements
 References
 
Recent advances in molecular genetics allow DNA to be extracted, amplified, and sequenced from ancient tissues (Pääbo 1989Go). Conclusions drawn from a study of ancient DNA (aDNA) often generate a lot of interest in the scientific community, especially when they do not correspond to prior expectations. This is particularly true when human remains are analyzed. Recent criticisms focus on the possibility of contamination of the ancient samples with modern DNA (Hoelzel 2005Go). To reduce this possibility, meticulous DNA extraction procedures are followed and researchers adhere to a strict set of procedural guidelines (Cooper and Poinar 2000Go). However, the validity of an ancient sample may also be compromised by postmortem damage (Hoelzel 2005Go). In living organisms, DNA damage is repaired by various enzymatic mechanisms. However, once the metabolic pathways of a cell cease to operate, the DNA molecules begin a progressive decay. The decay rate is influenced by a variety of factors related to the environment and the storage conditions. Biochemical processes subsequent to cell death cause the reduction of nucleotide sequence information in many ways: breakage of the DNA into 100- to 500-bp fragments, fragmentation of bases and sugars, loss of amino groups, and so on (Pääbo et al. 2004Go). Several of these postmortem aDNA modifications can block amplification during polymerase chain reaction (PCR), whereas others allow PCR products to be obtained, but with incorrect bases incorporated and maintained in the amplification products. These kinds of PCR artifacts, termed miscoding lesions, are commonly represented by 2 types of transitions: (A -> G)/(T -> C) (type I) and (C -> T)/(G -> A) (type II) (Hansen et al. 2001Go), the second type being observed more frequently in nuclear and mitochondrial DNA (Binladen et al. 2006Go).

The continuous improvement of amplification techniques has reduced the number of such artifacts, but the precise rate, or pattern, of occurrence of miscoding lesions remains difficult to estimate. An approximate rate of postmortem damage was calculated by Hofreiter et al. (2001)Go by comparing the PCR products of ancient samples with a database reference sequence. These authors concluded that miscoding lesions are unlikely to be more frequent than 0.1%. The overall number of transitions attributed to DNA damage processes is suspected to be inflated because it may include some errors caused by the PCR amplification technique itself (Gilbert et al. 2007Go). An even smaller number of miscoding lesions, mimicking substitutions that cause evolutionary changes, influence phylogenetic analyses aiming at estimating the probability that particular sites have undergone changes. Considering an alignment of reduced length aDNA sequences (typically a few hundred nucleotides), miscoding lesions can lead to higher estimated substitution rates at the degraded sites and consequent overestimates of overall levels of polymorphism.

As experimental procedures improve, the rates of enzymatic errors are being reduced leaving miscoding lesions as the most likely cause of misincorporated nucleotides in aDNA samples. Computational and statistical approaches aimed at addressing this problem are currently too simplistic. One early method, for example, uses parsimony principles to construct median-joining networks of the clones (using a weighted distance measure between 2 sequences obtained by counting the number of differences); the resulting sequences, inferred by assuming the minimum number of changes and clustered into a network, are taken to represent the unsampled sequences from which the observed sequences were derived (Bandelt et al. 1999Go). The postprocessing of this method was recently improved by calculating a statistic for each of the sequences in the median vector based on current knowledge concerning the types of substitutions characterized as miscoding lesions (Helgason et al. 2007Go). The network-based method suffers from several weaknesses, the main one being that it ignores uncertainties in the reconstructed ancestral sequences. A Bayesian solution to this problem was pursued by Ho et al. (2007)Go who introduced a parameter that describes the nucleotide error rate at the tips for sequences incorporated into a general model used for phylogenetic inference. A weakness of this approach is that it is not sufficiently specific to account for the biases in degradation-induced nucleotide change evident from recent experimental analyses. The Ho et al. (2007)Go method assumes that the same substitution process applies to both evolutionary substitutions and degradation damage because it only allows for a difference in the rate of substitution at tips and not a difference in the relative rates of substitutions to different nucleotides. Thus, substantial differences in frequency among different types of miscoding lesions (e.g., type I vs. type II) (Gilbert et al. 2007Go) are not accounted for and there is a greater risk of removing genuine polymorphisms as damage. Empirical evidence clearly indicates that the patterns of nucleotide substitution are very different under these 2 processes, and this should be explicitly accounted for in the model.

In the present work, we address the problem of misincorporated nucleotides in aDNA data by extending the flexible framework for modeling DNA substitution processes described in Mateiu and Rannala (2006)Go to explicitly model aDNA errors. In addition to the canonical DNA substitution model parameters, a discrete Markov chain is used to describe nucleotide substitutions occurring via postmortem degradation of the aDNA sequences. A discrete Markov chain is the appropriate formulation because the DNA degradation process does not appear to depend on time (branch lengths) and instead depends on the conditions of preservation and so on.


    Theory
 TOP
 Abstract
 Introduction
 Theory
 Discussion
 Acknowledgements
 References
 
Rate Heterogeneity, Data Augmentation, and Uniformization
In molecular phylogenetics, site-specific substitution rates can be integrated into a Bayesian formulation by allowing the Metropolis–Hastings algorithm to integrate over the unobserved rate values, for which a prior was specified. In our formulation, following Yang 1993Go, substitution rates are modeled assuming a continuous, unit mean, gamma density prior. Conditional on a known topology, T, and assuming a molecular clock, the joint posterior density of site-specific rates, substitution model paramaters, and branch lengths is

Formula (1)
where r = {rm} is a vector of site-specific rates (of length n), with rm being the rate at site m, f (rm|{alpha}) is the prior density of rates for the mth site (with {alpha} specifying the variance of the prior on rates), f(w|{lambda}, µ) is the birth–death prior density of branch lengths, w = {wl}, with sampling parameter fixed at 0.15, {lambda} and µ are the parameters of the birth–death prior on branch lengths (Yang and Rannala 1997Go) (for which we used uniform hyperpriors), {theta} represents the parameters of the substitution model, and x = {xml} is a matrix of l aligned nucleotide sequences of m sites (where xml is the nucleotide present at site m of sequence l).

Mateiu and Rannala 2006Go introduced 2 additional vectors of random variables: the numbers of transitions on each branch and the unobserved ancestral nucleotides at the internal nodes of the tree. Explicit modeling of the number of substitutions on the tree allowed us to use the uniformization procedure as an efficient alternative to calculate transition probabilities along the branches of the tree. The detailed description of the transformation of a continuous time Markov nucleotide substitution process into an equivalent Poisson substitution process is given by Mateiu and Rannala (2006)Go. By treating the nucleotides at the internal nodes as random variables in the chain, one avoids the need to calculate the conditional probabilities on subtrees (as in the pruning algorithm [Felsenstein 1981Go]) but instead directly evaluates the ancestral nucleotides in a Markov chain Monte Carlo (MCMC) step.

Using the notation x={xFormula} for a matrix of the l – 2 ancestral nucleotide sequences on the tree and M = {Mlm} for Mlm number of transitions at site m on branch l of a phylogenetic tree T, the augmented likelihood is

Formula (2)
According to the uniformized Markov process, the probability of Mlm transitions at site m on branch l is Poisson with probability distribution

Formula

Modeling Miscoding Lesions
The miscoding lesions generated during amplification of an aDNA template are predominantly characterized by 4 types of substitutions with 2 phenotypic outcomes: (A -> G)/(T -> C) and (C -> T)/(G -> A) (Binladen et al. 2006Go). Miscoding lesions were detected in tissues thousands of years old (Willerslev et al. 2003Go) as well as museum samples tens or hundreds of years old and even samples of 4-year-old dried tissues (Pääbo 1989Go). As the accumulation of substitutions is not a strict function of time, the generation of miscoding lesions cannot be modeled in the same way as the substitution process on the branches of a phylogenetic tree. Instead, a discrete Markov process in which the 4 possible substitutions are allowed with a small rate is a simple and straightforward way to describe the process. The transition probability matrix for this process is

Formula
where each line has to sum to 1 and the rows and columns represent T, C, A, and G nucleotides. Most of the nucleotides are expected to not be affected by degradation and this is manifested by a value of p close to 1. This is a discrete time analog of the Kimura (1980)Go 2-parameter model. A more complex Markov model could be easily incorporated using the same general framework if needed, including a model in which each possible substitution has a unique rate. Furthermore, one could allow each site to have a different degradation transition matrix. To avoid overparameterizing the model, we assume a global matrix of degradation rates in the sequel. We note that it is possible that in some cases the degree of aDNA damage may be approximately predicted by measures such as the thermal age of a sample, for example, which is a combination of the mean environmental temperature and absolute age. If a series of samples have been preserved at the same, consistent temperature (e.g., in permafrost), then the amount of postmortem damage could potentially be predicted from age. To allow for such effects one could reintroduce a time-dependent model to test for age-dependent rates. However, such detailed information is currently not widely available for most aDNA samples.

Substitutions caused by the degradation process are captured by parameter q of our model, whereas the unlikely transversions are represented by parameter z. Adding the degradation process to the preexisting phylogenetic tree, we can think of degradation as a substitution process happening on an edge that connects the sequence extracted from the aDNA with a hypothetical sequence that existed at the instant in the past at which the organism died. In figure 1, this is represented by evolving the aDNA sequences from the hypothetical nodes to the tips along the "degradation" edges, whereas the nucleotides at hypothetical nodes A', B', and C' are obtained according to their probabilities in the usual stochastic formulation of the DNA substitution model. Formally written, for a mixture of s contemporary sequences and U aDNA sequences, each of length n, the augmented likelihood becomes

Formula (3)
where u is the hypothetical sequence and xFormula is a nucleotide at the hypothetical sequence u at site v.


Figure 1
View larger version (6K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Phylogenetic tree illustrating the model used to accommodate postmortem degradation in the analysis of aDNA. Degradation edges (shown in red) connect ancestral sequences A', B', and C' (the sequences existing at the time of death) with the sampled sequences A, B, and C. The sequences at nodes D and E are contemporary.

 
Metropolis–Hastings Algorithm
In our model, the degradation process is a time-independent process and the age of the aDNA is irrelevant. The BYPASSR program (available from http://www.rannala.org [Mateiu and Rannala 2006Go]) was modified to distinguish between aDNA and contemporary DNA samples and to allow the addition of the hypothetical ancestral nodes for aDNA sequences. The new program BYPASSR-degr performs these tasks and is available at www.rannala.org. The nucleotides at the hypothetical nodes become random variables in the chain together with the degradation parameters p, q, and z for which we used a uniform prior. A Dirichlet distribution was used to propose new values for the parameters p, q, and z in the MCMC. The probability density function of the Dirichlet distribution for a vector of 3 parameters, x = (x1 = p, x2 = q, x3 = z), is

Formula (4)
where a = (a1, a2, a3) is the parameter vector with ai ≥ 0 and B is a normalizing constant

Formula (5)
and

Formula
The marginal means and variances of the distribution are ai/a0 and ai(a0ai)/aFormula(a0+1), respectively. One method for sampling from the Dirichlet is to draw y1, y2, y3 from independent gamma distributions with common scale and shape parameters a1 = a0 x x1, a2 = a0 x x2, a3 = a0 x x3 where for each yi, xFormula=yi/Formula (Gelman et al. 2004Go). We propose values for the parameters from a Dirichlet with means equal to the current parameter values while a0 is a scaling parameter. Once the new set of degradation parameters is proposed, the Metropolis–Hasting ratio is calculated as

Formula (6)
with the likelihood ratio evaluated across all sites m at hypothetical nodes n.

Next, a nucleotide at a random site and hypothetical node is chosen as a candidate for a proposed change. The likelihood ratio is the product of 2 fractions. The first term is given by the substitution probabilities in degradation matrix D corresponding to the proposed and current nucleotide at the hypothetical node. The second fraction is the ratio of transition probabilities along the branch connecting the hypothetical node with its parent node, a process described by the uniformized substitution matrix M. The acceptance ratio in this case is written as

Formula
where a and a' are the current and proposed nucleotides at the randomly chosen hypothetical node, b is the nucleotide at the end of the edge connecting the hypothetical node to the ancient nucleotide, and c is the nucleotide at the same site at the parent node of the chosen hypothetical node.

Simulation Analysis of Statistical Performance
Simulation studies were used to evaluate the performance of our new method by examining the accuracy of estimates of parameters of the site degradation model (for which the true values are known), the accuracy of the method in identifying sites that are known to have undergone degradation, and so on. Currently, most extracted aDNA sequences have been mitochondrial and analyses have focused on comparisons of relatively closely related sequences for which a molecular clock assumption is likely to be satisfied; we therefore focused on testing the model and program using data generated under a strict molecular clock. A program was written in C++ to simulate random clock-like trees. The program EVOLVER (PAML) (Yang 2007Go) was then used to generate sequences on the simulated trees under a specified DNA substitution model and Gamma distribution parameter {alpha}, allowing rate heterogeneity across sites. A second program was developed, which continues to "evolve" the sequences for the nodes corresponding to ancient data. The parameters p, q, and z are set to specific values and a proportion q of the total sites at the hypothetical nodes (randomly chosen) are allowed to degrade according to the Markov chain model of degradation outlined above. The location and the types of degradation changes are stored for postanalysis comparison. The simulated data sets were analyzed with the specific objectives of assessing the accuracy of the new method in recovering sites in the simulated data known to be degraded, investigating the extent of bias in the estimates of site-specific rates when degradation processes are ignored and evaluating the optimal proportion of aDNA sequences in a data set necessary to recover (with high probability) the original nucleotides at the damaged sites.

Initially, 6 data sets of 20, 30, 40, and 50 sequences, each comprising 500 sites, were generated for random trees of total length 1 (in units of expected substitutions). A general time-reversible (GTR) model with parameters a = 1, b = 2, c = 3, d = 4, e = 5 and nucleotide frequencies {pi}T = 0.1, {pi}C = 0.2, {pi}A = 0.3, and {pi}G = 0.4 was assumed. The shape parameter, {alpha}, of the Gamma distribution was varied allowing different levels of rate variation among sites, {alpha} = 0.3, 0.5, or 1.0. In all the data sets, the number of aDNA sequences was 10 and the degradation parameter q was set to be either q = 0.005, 0.05, or 0.2, with z = 0 in all cases. In total, 36 alignments were analyzed with BYPASSR-degr, using 6 million iterations in the "burn-in" phase and 6 million iterations in the sampling phase (during which 2,000 samples were collected). Besides the site-specific rates, branch lengths, and GTR parameter estimates, we examined the posterior means of the nucleotides at the aDNA nodes. If the method is highly accurate, the nucleotide with the largest posterior mean should match the nucleotide known (from the simulation) to be present immediately prior to the point at which degradation occurs. The detailed results of each run are shown in table 1. The last 2 columns in the tables represent the method's ability to identify true changes calculated as the proportion of damaged sites identified when a posterior probability of 0.95 is used as the criterion for accepting an alternative nucleotide at an aDNA site and the false-positive rate calculated as the proportion of sites that were incorrectly identified as damaged.


View this table:
[in this window]
[in a new window]

 
Table 1 Estimates for the Parameters of the Degradation Model with a Fixed Number of 10 aDNA Sequences

 
We are interested in comparing the site-specific substitution rates inferred when damaged data are analyzed using a model that allows for the presence of degraded sites with those obtained using a model that does not allow for the possibility of degraded sites. We expect a significant difference between the 2. On the other hand, we expect to obtain similar results when we use degraded data with the degradation model implementation (BYPASSR-degr) and data without degradation, both analyzed using the BYPASSR-degr implementation of the degradation model. The similarity is evident from the correct assignment of the nucleotides at the degraded nodes and sites that recreates the sequences before the inclusion of the damaged nucleotides. A typical result of this model testing approach is shown in figure 2, in which data sets generated with {alpha} = 0.5 and degradation matrix parameters p = 0.95 and q = 0.05 are analyzed. A good fit in this case indicates that the model is accurate in the estimation of site-specific substitution rates, even in the presence of degraded sites.


Figure 2
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— The correlation between the mean posterior rates obtained from degraded data (+) using BYPASSR-degr versus original data set before including the errors (–) analyzed with BYPASSR is calculated for simulated data having 20, 30, 40, and 50 sequences and generated with {alpha} = 0.5, p = 0.95, and q = 0.05 (data set 5 in table 1). Note that {alpha} is the scale parameter of the gamma distribution used to model among-site rate variation, p is the probability that a site does not undergo degradation, q is the rate of type I and II transitions, and z is the rate of transversions.

 
An important difference is observed between the posterior mean site-specific substitution rates obtained when data with incorporated errors are analyzed using a model that integrates over the uncertainty in the aDNA data versus a model that ignores the degraded nucleotides (fig. 3). In the later case, the presence of 5% degraded sites creates the appearance of a higher number of substitutions in the data which is reflected in a higher {alpha} ({alpha}20seqs = 1.49 ± 0.24, {alpha}30seqs = 1.30 ± 0.19, {alpha}40seqs = 1.67 ± 0.29,{alpha}50seqs = 1.45 ± 0.24) and longer branch lengths (fig. 3B panels). The comparison between the mean posterior substitution rates in the 2 situations (degraded data analyzed with BYPASSR-degr and BYPASSR) shows a significant reduction of rate variation among sites that causes the lower rates to be higher and vice versa (fig. 3A panels).


Figure 3
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— The panels at left (A) show the correlation between the mean posterior rates obtained from degraded data (+) using BYPASSR-degr versus BYPASSR (that ignores the presence of damaged sites). The panels at right (B) show the comparison between the mean posterior branch lengths in the same situations as in panels (A). The simulated data sets have 20, 30, 40, and 50 sequences (from top to bottom) and were generated with {alpha} = 0.5, p = 0.95, and q = 0.05 (data set 5 in table 1). Note that {alpha} is the scale parameter of the gamma distribution used to model among-site rate variation, p is the probability that a site does not undergo degradation, q is the rate of type I and II transitions, and z is the rate of transversions.

 
A larger simulation study was performed on data sets with additional sequences and different proportions of aDNA. Following the same procedure as described in the previous paragraphs, we started with random trees of either 50, 100, or 150 taxa and generated data sets with 1/2 or 1/4 of the sequences representing aDNA. For each combination, 2 values of the degradation parameter, q, were used: q = 0.01 and 0.05, producing 24 data sets in total. Table 2 shows the results of our simulations. The results show that the power to detect damaged sites increases with increasing numbers of taxa and/or an increase in the proportion of aDNA sequences present in the sample. The estimates of {alpha} based on the posterior means for these analyses tend to slightly overestimate the true value, probably due to the relatively short branch lengths and small numbers of sites examined. In this case, the prior can be expected to influence the posterior potentially leading to some bias.


View this table:
[in this window]
[in a new window]

 
Table 2 Estimates for the Parameters of the Degradation Model When the Proportion aDNA:DNA is 1:1 or 1:3

 
Analysis of Etruscan HVR-I Mitochondrial aDNA
A genetic study on the remains of 80 Etruscans, the pre-Roman population of Italy, was published in 2004 by Vernesi et al. (2004)Go. Mitochondrial DNA was extracted from bones following strict criteria to avoid contamination or other possible artifacts in the data. The authors decided on a final set of 27 sequences of the HVR-I region (360 nt in length), obtained from a consensus of multiple clones, as clean and reliable for further detailed analysis. The authors provided us with these sequences as well as DNA sequences of the same mitochondrial region from contemporary populations (106 Basques, 69 Cornish, 45 Druz, 240 Saami, 74 Sardinians, and 49 Tuscans) to investigate the possibility of damaged sites. In the first step of our analysis, we extracted nonidentical sequences belonging to the contemporary population samples resulting in an alignment of 208 sequences of 360 nt each. We then assembled 2 smaller data sets by choosing a subset of sequences from the contemporary data set. Table 3 shows the number of sequences analyzed from each population. For each of the data sets, we chose every third and fifth sequence from the complete alignment of 208 sequences. By varying the number of sequences in the data sets representing the same DNA region, we expect to obtain similar posterior densities of site-specific rates and the similar inferred locations for damaged sites in the aDNA sequences.


View this table:
[in this window]
[in a new window]

 
Table 3 The Numbers of aDNA and Contemporary HVR-I Sequences Analyzed for 6 Contemporary Populations Using BYPASSR-degr

 
For each of the data sets, the phylogenetic tree was obtained by maximum likelihood using the HKY85 substitution model and 4 categories for the discrete gamma distribution approximation (Yang 2007Go). BYPASSR-degr was used to analyze the sequences under a molecular clock assumption, using the newly implemented theory, with 30–40 million iterations in the burn-in stage and the same number of iterations in the sampling stage, during which 2,000 samples were collected. Ten independent chains were run for each of the data sets. Multiple chains were run for the same data set with different initial values to assess convergence.

The results of the runs are summarized in tables 4 and 5. The degree of degraded sites in these data sets thus appears very low as evidenced by a q parameter with posterior mean q < 10–3. The posterior means across independent MCMC runs are highly consistent indicating convergence, with the exception of run 9 for data set 2, which appears to have an inflated value for z. A small number of sites showed a weak signal of degradation with posterior probability (averaged across runs) for an alternative nucleotide present in the aDNA sequences ≤0.95.


View this table:
[in this window]
[in a new window]

 
Table 4 Posterior Means with the Standard Deviations for the Gamma Distribution Parameter {alpha}, Tree Length (TL), and Degradation Model Parameters p, q, and z obtained from the analysis of 27 Etruscan aDNA + 70 Contemporary DNA (upper part) and 27 aDNA + 42 DNA (lower part)

 

View this table:
[in this window]
[in a new window]

 
Table 5 Posterior Mean, Standard Deviation, and Highest Posterior Density Interval of the Substitution Rate, r, at Sites of the Etruscan Sequences

 
Previous analysis of the HVR-I region from Etruscan remains have found several sites to be prone to postmortem damage or to show high substitution rates (Vernesi et al. 2004Go). The posterior mean and highest posterior density interval (averaged over runs) of the substitution rate at these sites are shown in table 5. Among the 24 sites, only sites 270 and 261 have posterior mean above 1 (the site rates average set by the prior), whereas the others have a posterior mean substitution rate that is lower than the average rate at the remaining sites in the sequence. Note that all rates are relative with the mean rate across all sites of the sequence being equal to 1.

Analysis of Adélie Penguin and Horse aDNA
Two of the data sets analyzed by Ho et al. (2007)Go were also analyzed using our method for purposes of comparison. The first data set analyzed is from the study of Lambert et al. (2002)Go of 345 bp from the mtDNA control region (hypervariable region I) of Adélie penguins. We analyzed 107 nonredundant sequences from a set of 379 contemporary sequences (accession numbers AF474412 [GenBank] –AF474791 [GenBank] ) as well as 92 aDNA sequences (Genbank accession numbers AF474887 [GenBank] –AF474792 [GenBank] ). From the total of 107 contemporary + 92 aDNA sequences, 2 smaller data sets were created for the use in the analysis using the even (set 1: 57 DNA + 43 aDNA) and odd (set 2: 49 DNA + 49 aDNA) sequence numbers. Sequences were aligned using ClustalW, and the tree was obtained using the HKY85 model with 8 categories for the gamma distribution of among-site rate variation, as implemented in PAML (Yang 2007Go). In analyzing the first data set, 258 sites were used in the analysis following the exclusion of 33 gaps. In the second data set, 276 sites were used after the exclusion of 37 gaps. BYPASSR-degr was run on these data using 6 million iterations of burn in and sampling. The mean posterior estimates of the degradation model parameters are shown in table 6. There were no individual sites with a posterior probability of degradation greater than 0.95.


View this table:
[in this window]
[in a new window]

 
Table 6 Posterior Means with the Standard Deviations for the Gamma Distribution Parameter {alpha}, Tree Length (TL), and Degradation Model Parameters p, q, and z Obtained from the Analysis of the Adélie Penguins Data Set 1 at Top (57 DNA + 43 aDNA) and Data Set 2 at Bottom (49 DNA + 49 aDNA) (10 independent runs each)

 
The second data set analyzed is from the study of Vila et al. (1989)Go of 348 bp from the mtDNA control region. We analyzed 33 contemporary sequences (Genbank accession numbers AF326635 [GenBank] –AF326667 [GenBank] ) and 12 aDNA sequences (accession numbers AF326668 [GenBank] –AF326679 [GenBank] ). Sequences were aligned using ClustalW, and the tree was obtained using the HKY85 model with 8 categories for the gamma distribution of among-site rate variation, as implemented in PAML (Yang 2007Go). In total, 345 sites were used in the analysis after the exclusion of 3 gaps. BYPASSR-degr was run on these data using 4 million iterations of burn in and sampling. The mean posterior estimates of the degradation model parameters are shown in table 7. There were no individual sites with a posterior probability of degradation greater than 0.95.


View this table:
[in this window]
[in a new window]

 
Table 7 Posterior Means with the Standard Deviations for the Gamma Distribution Parameter {alpha}, Tree Length (TL), and Degradation Model Parameters p, q, and z Obtained from the Analysis of 12 Horse aDNA + 33 Contemporary DNA (10 independent runs)

 

    Discussion
 TOP
 Abstract
 Introduction
 Theory
 Discussion
 Acknowledgements
 References
 
We have developed a novel approach to infer properties of the degradation process in aDNA and have incorporated this process in the context of a model with continuous variation of substitution rates among sites. In the limited analyses we have carried out using simulated data, BYPASSR-degr, the implementation of the model proposed by us performed very well in identifying the damaged sites (even if there are many damaged sites spread across several sequences) and obtaining reasonably precise estimates of the other tree parameters (i.e., branch lengths, site-specific rate, GTR model parameters, etc.). In general, the type I error rate was very low, and the method was not prone to spurious detection of nonexistent degradation errors. By contrast, Ho et al. (2007)Go found that their method was prone to false-positive degradation errors for at least some of their simulation conditions. This difference in the performance of the methods could be due to the use of a more realistic model in our study, although further simulation analyses of the 2 methods are probably warranted. The results of our simulation analyses suggest that efficient recovery of the model parameters is possible when the number of aDNA sequences is sufficiently large for the model of degradation to be well defined and the number of contemporary sequences sufficiently large that information about the underlying substitution process is available. We have applied the method to analyze a data set comprised of Etruscan aDNA and contemporary sequences. By choosing different sets of contemporary sequences in addition to the aDNA sequences and by running multiple chains for each data set, we were able to evaluate the performance of the MCMC method in obtaining estimates of the parameters of the degradation model. Our analysis revealed no significant signals of degradation in the Etruscan aDNA. The fact that one of our runs analyzing the Etruscan data produced a small but potentially misleading inflation of the parameter z due to likely nonconvergence, emphasizes the importance of conducting multiple independent runs when analyzing a single data set using MCMC to confirm convergence. We further applied the method to analyze contemporary and ancient mtDNA sequences from Adélie penguins and horses and found no significant evidence for degradation errors in either data set. In conclusion, with sufficient numbers of sequences, it appears possible to identify sites in aDNA that have experienced degradation errors using the method presented in this paper. However, the 3 data sets we analyzed all suggest extremely low rates of degradation-induced nucleotide substitutions, suggesting that degradation may be less of a problem for aDNA sequence data than was previously supposed.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Theory
 Discussion
 Acknowledgements
 References
 
We thank Giorgio Bertorelle for bringing the problem of miscoding lesions in aDNA to our attention. Support for this research was provided by a grant from the Canadian Institutes of Health Research (MOP 44064) to B.H.R.


    Footnotes
 
Jeffery Thorne, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Theory
 Discussion
 Acknowledgements
 References
 

    Bandelt HJ, Forster P, Rohl A. Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol (1999) 16:37–48.[Abstract]

    Binladen J, Wiuf C, Gilbert M, Bunce M, et al, (11 co-authors). Assessing the fidelity of ancient DNA sequences amplified from nuclear genes. Genetics (2006) 172:733–741.[Abstract/Free Full Text]

    Cooper A, Poinar H. Ancient DNA: do it right or not at all. Science (2000) 289:1139.[Web of Science][Medline]

    Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Gelman A, Carlin J, Stern H, Rubin D. Bayesian data analysis (2004) Boca Raton (FL): Chapman & Hall/CRC.

    Gilbert M, Binladen J, Miller W, Wiuf C, Willerslev E, Poinar H, Carlson JE, Leebens-Mack JH, Schuster SC. Recharacterization of ancient DNA miscoding lesions: insights in the era of sequencing-by-synthesis. Nucleic Acids Res (2007) 35:1–10.[Abstract/Free Full Text]

    Hansen A, Willerslev E, Wiuf C, Mourier T, Arctander P. Statistical evidence for miscoding lesions in ancient DNA templates. Mol Biol Evol (2001) 18:262–265.[Free Full Text]

    Helgason A, Palsson S, Lalueza-Fox C, Ghosh S, (10 co-authors). A statistical approach to identify ancient template DNA. J Mol Evol (2007) 65:92–102.[CrossRef][Web of Science][Medline]

    Ho S, Heupink T, Rambaut A, Shapiro B. Bayesian estimation of sequence damage in ancient DNA. Mol Biol Evol (2007) 24:1416–1422.[Abstract/Free Full Text]

    Hoelzel AR. Ancient genomes. Genome Biol (2005) 6:239.[CrossRef][Medline]

    Hofreiter M, Jaenicke V, Serre D, Haeseler A, Pääbo S. DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA. Nucleic Acids Res (2001) 29:4793–4799.[Abstract/Free Full Text]

    Kimura M. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J Mol Evol (1980) 16:111–120.[CrossRef][Web of Science][Medline]

    Lambert DM, Ritchie PA, Millar CD, Holland B, Drummond AJ, Baroni C. Rates of evolution in ancient DNA from Adélie penguins. Science (2002) 295:2270–2273.[Abstract/Free Full Text]

    Mateiu L, Rannala B. Inferring complex DNA substitution processes on phylogenies using uniformization and data augmentation. Syst Biol (2006) 55:259–269.[CrossRef][Web of Science][Medline]

    Pääbo S. Ancient DNA: extraction, characterization, molecular cloning, and enzymatic amplification. Proc Natl Acad Sci USA (1989) 86:1939–1943.[Abstract/Free Full Text]

    Pääbo S, Poinar H, Serre D, Jaenicke-Despres V, et al, (10 co-authors). Genetic analyses from ancient DNA. Annu Rev Genet (2004) 38:645–679.[CrossRef][Web of Science][Medline]

    Vernesi C, Caramelli D, Dupanloup I, et al, (13 co-authors). The Etruscans: a population-genetic study. Am J Hum Genet (2004) 74:694–704.[CrossRef][Web of Science][Medline]

    Vila C, Leonard JA, Gotherstrom A, Marklund S, Sandberg K, et al. Widespread origins of domestic horse lineages. Science (1989) 291:474–477.[CrossRef]

    Willerslev E, Hansen A, Binladen J, Brand T, et al, (11 co-authors). Diverse plant and animal genetic records from Holocene and Pleistocene sediments. Science (2003) 300:791–795.[Abstract/Free Full Text]

    Yang Z. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol (1993) 10:1396–1401.[Abstract]

    Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol (2007) 24:1586–1591.[Abstract/Free Full Text]

    Yang Z, Rannala B. Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo method. Mol Biol Evol (1997) 14:717–724.[Abstract]

Accepted for publication April 11, 2008.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
S. Guimaraes, S. Ghirotto, A. Benazzo, L. Milani, M. Lari, E. Pilli, E. Pecchioli, F. Mallegni, B. Lippi, F. Bertoldi, et al.
Genealogical Discontinuities among Etruscan, Medieval, and Contemporary Tuscans
Mol. Biol. Evol., September 1, 2009; 26(9): 2157 - 2166.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
25/7/1503    most recent
msn095v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Mateiu, L. M.
Right arrow Articles by Rannala, B. H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Mateiu, L. M.
Right arrow Articles by Rannala, B. H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?