Skip Navigation


MBE Advance Access originally published online on February 22, 2006
Molecular Biology and Evolution 2006 23(5):988-996; doi:10.1093/molbev/msj111
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
23/5/988    most recent
msj111v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ewing, G.
Right arrow Articles by Rodrigo, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ewing, G.
Right arrow Articles by Rodrigo, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Proceedings of the SMBE Tri-National Young Investigators' Workshop 2005

Coalescent-Based Estimation of Population Parameters When the Number of Demes Changes Over Time

Greg Ewing*,{dagger} and Allen Rodrigo*,{dagger}

* The Allan Wilson Centre for Molecular Ecology and Evolution, University of Auckland, Private Bag 920191, Auckland, New Zealand; and {dagger} The Bioinformatics Institute, University of Auckland, Private Bag 920191, Auckland, New Zealand

E-mail: a.rodrigo{at}auckland.ac.nz.


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Appendix. MCMC Move Types
 Acknowledgements
 References
 
We expand a coalescent-based method that uses serially sampled genetic data from a subdivided population to incorporate changes to the number of demes and patterns of colonization. Often, when estimating population parameters or other parameters of interest from genetic data, the demographic structure and parameters are not constant over evolutionary time. In this paper, we develop a Bayesian Markov chain Monte Carlo method that allows for step changes in mutation, migration, and population sizes, as well as changing numbers of demes, where the times of these changes are also estimated. We show that in parameter ranges of interest, reliable estimates can often be obtained, including the historical times of parameter changes. However, posterior densities of migration rates can be quite diffuse and estimators somewhat biased, as reported by other authors.

Key Words: coalescent • migration • Markov chain Monte Carlo • Bayesian inference • serial samples • measurably evolving populations


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Appendix. MCMC Move Types
 Acknowledgements
 References
 
Inferring effective population sizes and migration rates from genetic data is a fundamental problem in population genetics. Arguably, the most successful approach is one based on the coalescent (Kingman 1982aGo, 1982bGo). The coalescent is a probabilistic description of a sample genealogy that takes account of the influence of population size, migration rates, population growth and decline, and recombination rates on the distributions of times to the most recent common ancestors. Recently, genetic data from measurably evolving populations (MEPs) (Drummond et al. 2003Go) that have samples spread over time as well as space are allowing more complex models to be considered. Serial samples allow the direct estimation of mutation rate (Drummond and Rodrigo 2000Go; Fu 2001Go; Rosenberg, Tsolaki, and Tanaka 2003Go). In Drummond et al. (2002)Go, a coalescent-based Bayesian Markov chain Monte Carlo (MCMC) procedure to estimate mutation rate simultaneously with effective population size was developed. This was subsequently extended to include coalescent migration models (Ewing, Nicholls, and Rodrigo 2004Go).

Ewing, Nicholls, and Rodrigo (2004)Go used a Bayesian MCMC coalescent-based approach to estimate asymmetric migration rates and population sizes, simultaneously with mutation rate and genealogy. In that paper, analysis on an example HIV data set revealed that mutation rate of recent samples was significantly different from that of earlier samples. In fact, it is not unreasonable to expect evolutionary parameters to change over time; indeed, it is one of the strengths of serial-sample analysis that we are able to estimate these changes (Drummond, Forsberg, and Rodrigo 2001Go).

In this paper, we extend the model in Ewing, Nicholls, and Rodrigo (2004)Go to allow for step-wise changes in evolutionary parameters and demographic structure over time, using a Bayesian MCMC approach. The times that these changes occur are also estimated. Our model allows all parameters to vary between intervals, including the number of subpopulations, mutation rate, and population size. By using a Bayesian inference framework, we can incorporate prior information about the timing of expected changes, in a straightforward manner.

Nielsen and Wakeley (2001)Go have previously treated a specific case of changing demographic structure with Bayesian MCMC. In their model, a single founding population splits at an estimated time into two populations with migration between them. Our model differs in several respects. First, no single population splits into other subpopulations, but rather subpopulations become "available" for migration at particular time intervals. In this respect, our approach is based on a colonization model rather than a vicariance model. We discuss the difference between these two models more fully in Model III: Changing Number of Demes, Constant Mutation Rate, Changing Population Size, and Changing Migration Rate.

In the Bayesian framework described here, we estimate all the relevant parameters of the model simultaneously. The implementation allows any number of time intervals and any migration structure with one or more demes within these intervals. In practice, limits on computation resources and data restrict the above choices. Further, our approach is designed to use data from MEPs (Drummond et al. 2003Go), incorporating serial samples and using a finite sites evolutionary model of sequence evolution (Felsenstein 1981Go).

We start by generalizing the coalescent model to take into account a fixed number of time intervals. In each time interval, we have a fixed number of demes, constant migration rates, and population sizes. These are allowed to vary between time intervals. The mutation rate in each interval is also allowed to vary between time intervals. The posterior density is written down and motivation for the choice of priors discussed. Briefly, we discuss some aspects of the MCMC implementation but leave the details to the Appendix. Finally, we present some simulation studies on four different demographic models and show that we can simultaneously estimate migration rates, population sizes, mutation rates, and the time intervals associated with historic parameter changes.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Appendix. MCMC Move Types
 Acknowledgements
 References
 
Coalescent with Parameter Step Changes
We now describe the coalescent probability density incorporating changes in population and mutation parameters, thereby extending the model in Ewing, Nicholls, and Rodrigo (2004)Go. We use a Fisher-Wright (Fisher 1930Go; Wright 1931Go) population model with constant population size. Using the Kingman coalescent (Kingman 1982aGo, 1982bGo) extended to include migration (Hudson 1990Go; Notohara 1990Go) and nonisochronous data (Drummond et al. 2002Go; Ewing, Nicholls, and Rodrigo 2004Go), we derive a probability density over migration-coalescent genealogy g. Readers are directed to figure 1 for clarification of notation used in this section.


Figure 1
View larger version (10K):
[in this window]
[in a new window]
 
FIG. 1.— A simplified genealogy to clarify the notations used throughout this paper. This genealogy has just three sequences all collected at different times (n = 3). There is just one migration event (m = 1) and two time intervals (A = 2). Time intervals represent periods when parameters are constant. Time interval boundaries do not necessarily correspond to sampling times. Time interval a = 1 spans the time (T0, T1), while time interval a = 2 spans (T1, T2). This produces a genealogy with six short intervals {delta}1{delta}6. Short intervals are periods between events, where events can be sampling times, coalescent events, migration events, and time interval boundaries.

 
The island model of migration is a model of p populations, or "demes," over A "time intervals" each with distinct population parameters. For Formula and Formula deme j is a panmictic population of Formula haploid individuals in time interval a. Time is measured backward from the present and hence increases into the past; time is measured in calendar units. Let Formula denote the "per capita" migration rate from deme i to j in interval a (time increases into the past, so in forward time the individual is moving from j to i).

The migration-coalescent genealogy g is defined as in Ewing, Nicholls, and Rodrigo (2004)Go. There are n leaf nodes (with label set Formula), n – 1 coalescent nodes (label set Formula) including the root (label Formula), and m migration nodes (label set Formula). There are A time intervals over the genealogy where population size, migration rates, and mutation rate are possibly different in each time interval. Time interval a is the interval (Ta–1, Ta], where T0 is the time of the most recent sample and TA = tR is the time of the root. Let Formula denote the set of all ancestral (i.e., nonleaf) node labels and Formula denote the set of all node labels. Samples are collected at different times and from different demes, within different time intervals. The leaf nodes therefore have a specified deme label, denoted by the vector Formula because we know the source deme of the sequence. Every edge in the tree also has an associated deme label, where the deme edge label is denoted by Formula. A deme label indicates that a lineage is currently in that deme and genealogy is constrained by leaf deme membership. We do not consider the case where some samples are from unknown demes. A path from a leaf node to the root includes migration nodes, representing legal migration events, and coalescent nodes where the coalescing edges must be in the same deme. Nodes are labeled from u = 1 to u = m + 2n 1 in order of increasing age and by least child label in case of ties, so that Formula A tree edge between u and v is denoted <u, v> and is directed toward the present. Let E denote the set of all edges in the tree graph. So <u, v> isin E implies tu ≥ tv.

Let tg equal the mean number of units of calendar time per generation, which is assumed to be constant across demes. We do not estimate tg, instead, we present results for a nominal tg, where we assume that tg = 1.

The migration-coalescent process is realized as follows. Beginning at the leaf nodes and as time increases into the past, each lineage in deme i and time interval a migrates independently of all other lineages at rate Formula to deme j. Each pair of lineages in deme i coalesces at instantaneous rate Formula where Formula The process terminates when the number of lineages equals one. With each event we associate a node, Formula and with each lineage between events an edge <u, v> isin E and with each edge a deme label Formula A visual representation can be seen by skipping ahead to figure 2. Here the leaf deme membership is shown with either a dashed line for one deme or a solid line for the other deme.


Figure 2
View larger version (8K):
[in this window]
[in a new window]
 
FIG. 2.— A single sampled genealogy for model III (see table 5). Note that the time interval boundary is constrained by the gray line lineage because there can be no gray lineages above the boundary.

 
We now write down the joint density Formula for a migration genealogy, where Formula and Formula are vectors of the model parameters. Consider the "short interval" (we use short interval to distinguish this from time intervals defined earlier) of time {delta}r = tr+1tr between consecutive events on the migration genealogy. Events are nodes, including "new" leaf nodes and also time interval boundaries. There are m + 2n + A 3 such short intervals (set Formula) on a migration genealogy Formula with m migration events). Let {lambda}ijr be the migration rate in short interval r. Let 1/{theta}ir be the coalescent rate for deme i in short interval r. That is Formula and Formula where Formula and Formula such that Formula For Formula and Formula let kir denote the number of lineages in deme i in short interval r. For Formula let Formula denote the set of demes omitting deme i. For each r, the short interval (tr, tr+1] contributes a factor Formula to the density, along with a second factor equal 1/{theta}ir or {lambda}ijr as the event type at time tr+1 is coalescent in deme i or Formula-migration. A short interval terminated by a leaf or a time interval boundary ends in a nonevent, and the second factor is one. Let Formula if the short interval r terminates on a coalescent {theta}ir, otherwise Formula Likewise let Formula if the short interval r terminates on the migration {lambda}ijr, otherwise Formula The overall density is

Formula 1(1)
When there is a time interval that does not contain a deme, we omit the deme from the set Formula 1 for those short intervals. That is, if deme u' is not valid in short interval r, we skip i = u' for the summation in the exponential. Note that if Formula 1 and Formula 1 for all Formula 1 the above expression reduces to that given in Ewing, Nicholls, and Rodrigo (2004)Go.

Mutation
Mutation rates are inferred with our method by incorporating the finite sites mutation model. Let µa be the mutation rate in time interval a with units of expected number of substitution per site per calendar time and define the vector Formula 1 We use the finite sites mutation model, with neutral or no selection and general time reversible substitution process of Felsenstein (1981)Go and Rodriguez et al. (1990)Go. The substitution process is a continuous time Markov process with states Formula 1 = {A, C, G, T}, a 4 x 1 vector of equilibrium probabilities {pi} and a 4 x 4 rate matrix Formula 1 normalized to generate one substitution per unit calender time (–{sum}c{pi}cQcc = 1). The substitution and migration processes are independent.

Each leaf node Formula 1 has associated with its nucleotide sequence data Formula 1 of length L with Formula 1 for w = 1,2,...,L. Additional gap characters, indicated by {phi}, are treated as unobserved sites. Let DFormula 1 be the n x L matrix of sequences on the leaves and Formula 1 be the (n – 1) x L matrix of unobserved ancestral sequences at the coalescent nodes. The set Formula 1 denotes the set of all possible ancestral sequences at the coalescent nodes.

The likelihood Formula 1 is defined and computed in the usual way with only a small modification, using node to node transition probabilities (Felsenstein 1981Go). For these purposes, migration nodes may be ignored, and we consider the topology in the traditional sense.

Let Formula 1 be the edge set for a graph Formula 1 obtained from g by ignoring migration nodes; that is, Formula 1 is the edge set containing those pairs Formula 1 of nodes of g that have the property that the path on g between node u and node v contains no nodes from Formula 1 (and <u, v> isin E*(g) implies tu ≥ tv).

For tutv > 0, a given interval of calendar time, let Formula equal the distance in substitutions per site over the edge <u, v>. If the path <u, v> is contained in the interval a (i.e., Formula 1 for some a), then Formula If the path however spans a time interval boundary (i.e., tv < Ta, tu > Ta for some a), then Formula This can be extended to the case where the path <u, v> spans more than two time intervals.

Let exp Formula denote the matrix exponential of the matrix Formula For Formula 1 <u, v> isin E*(g), and each Formula 1 we have Formula and P(DR,w = b) = {pi}b. The likelihood for genealogy Formula 1 and mutation rate vector µ is then


Formula 2

(2)
The sum Formula 1 in (2) is evaluated using pruning (Felsenstein 1981Go).

Bayesian Inference
The Bayesian posterior density is given by

Formula 3(3)
where P is the likelihood function, f the migration genealogy prior, p a prior on µ, {theta}, {lambda}, and T and z, an unknown normalization constant.

Prior selection is important for any Bayesian analysis. We do not consider this issue in detail here but simply give the motivation for the priors we have selected. A more detailed discussion of prior selection for this type of problem is given in Ewing, Nicholls, and Rodrigo (2004)Go and Drummond et al. (2002)Go, including sufficient conditions for the posterior density to be proper or finitely integrable.

Markov Chain Monte Carlo
We sample the posterior distribution h with Metropolis Hastings Markov chain Monte Carlo (MHMCMC) (Metropolis et al. 1953Go; Hastings 1970Go). Much of this implementation was reported earlier in Ewing, Nicholls, and Rodrigo (2004)Go. This has been extended to allow moves that take the time intervals into account. In particular, the joint scale move was modified from that in Ewing, Nicholls, and Rodrigo (2004)Go. A further migration event move was added to improve mixing. The effect of this move is to remove all the migration nodes off the genealogy and then generate a new set of migration nodes conditional on the existing coalescent topology. Note however that neither this move nor the joint scale moves are required for irreducibility, but significantly reduce convergence and mixing times. The Appendix describes the Metropolis Hastings algorithm with enough detail to reproduce the results presented here. Generally, we used the same MCMC moves as described in Ewing, Nicholls, and Rodrigo (2004)Go and Drummond et al. (2002)Go, with the additional constraints that the time intervals impose. We achieve this by simply using the moves as it is and checking after a move if the state is illegal. If it is the move is rejected in the usual way. Because this check can be done before the expensive likelihood calculation, rejections are fast and mixing time per CPU cycle is not adversely affected.

We collect N samples from the MCMC chain, which are samples from the posterior, where each sample is correlated to the previous sample. That is, each successive sample is not independent from the previous sample. In practice, this means that the number of independent samples is less than the total number of samples. We use the term effective sample size (ESS) to refer to the total number of effective independent samples. Increasing the chain length will increase the ESS and have the same variance-reducing effect on the mean as the same number of independent samples. The ESS is estimated as we have done previously in Ewing, Nicholls, and Rodrigo (2004)Go. That is, we estimate the integrated autocorrelation time {tau} of a parameter using the monotone sequence estimator described in Geyer (1992)Go. The ESS is then N/{tau} for that parameter. For detailed information on ESS and convergence issues, Geyer (1992)Go is an excellent review.

Simulations
In order to evaluate the method, we generated a set of simulated data with known parameter values and then recovered the posterior distributions from these data sets. Each simulated genealogy has between 70–90 sequences, with four sample times across a total of two demes. Sample times were set at 0, 0.05, 0.1, and 0.15, where we have scaled time to expected substitutions per site. All but one model used two time intervals with T1 = 0.075, in the center of the sample times. Sequences, 250 sites long, were generated with a fixed rate matrix as used previously in Ewing, Nicholls, and Rodrigo (2004)Go. We used short sequences primarily for speed, but generally simulated sequences give much tighter bounds on the genealogy than real data and would reflect a more realistic data set. Mutation rate was normalized to 1 except where noted below. We also set Formula 3 the number of time units per generation, to 1. This means that {theta} = N, the effective population size.

We used only flat bounded priors on the free parameters in this simulation study. Such diffuse priors offer the most challenging environment for MCMC convergence and mixing, and thus, if working correctly, provides good evidence that the method can move over the state space well enough for practical inference. Bounding priors provide the necessary conditions for the density to be proper (Ewing, Nicholls, and Rodrigo 2004Go). In this analysis, we bound Formula 3 < 500, 10 > µi > 10–9, and we bound the root height tR < 5.

The potential parameter space is large. Bayesian MCMC should give good indication when too many parameters are being fitted to the data via diffuse posterior densities on many parameters. Unfortunately, this leads to slow mixing times and makes data analysis difficult to be presented in a concise and informative way. We therefore restrict the parameter space while keeping a model that is complex enough to exhibit the relevant behaviors. We present results on models with no more than two subpopulations in any interval and constrain the migration and population sizes to be symmetric. That is Formula 3 and Formula 3 for all Formula 3 In words, we constrain the population sizes and migration rates for demes in the same interval to be equal. Simulation results of more complex models, such as asymmetric migration rates, have shown that the general trends and behaviors are captured by the simplified model. However, for the sake of brevity, the results are not presented here. For some insight into possible problems with asymmetric rates and population sizes, see Ewing, Nicholls, and Rodrigo (2004)Go.

A total of 30 independent simulated data sets were generated for each parameter set, and the starting state was randomized before estimation. To verify that the MCMC chain has reached equilibrium we implement several diagnostics. First, we check the estimated parameter values against the known values. We also check the trace of a parameter value for signs of a trend visually. Other checks are done as described in Geyer (1992)Go. The results are summarized in table 2 with the true values. Although, Bayesian MCMC inference aims to recover the posterior densities of parameters of interest, we need to provide point summary statistics that quantify the central tendency of these densities. We used a mode estimator on the marginal posterior density to do this. For continuous densities, the mode is estimated by the highest local density of a moving window of a fixed number of points, where this local density is calculated by the number of points divided by the distance between the first and last point. We simulated four distinct scenarios, each with a different demographic/migration pattern (table 1).


View this table:
[in this window]
[in a new window]
 
Table 2 Means of the Modes and Standard Deviation of the Modes for Relevant Parameters of 30 Simulated Data Sets Each

 

View this table:
[in this window]
[in a new window]
 
Table 1 A Summary of the Models Used in This Simulation Study

 

    Results
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Appendix. MCMC Move Types
 Acknowledgements
 References
 
From table 2, we see that generally good estimation of the parameters is possible, with the exception of migration rates, where we see strong bias in the modal estimator. We observed that for some simulated data sets, the posteriors can be very diffuse for migration, and in general, migration was the most poorly estimated parameter. It should be noted that this table gives the estimation ability and variance over "expected data" and "expected genealogies" under the migration coalescent density described in Coalescent with Parameter Step Changes. Any given genealogy does not necessarily have the bias indicated in table 2; in fact, there were simulated data sets for each model where the recovery of the truth was exceptional for all parameters and without bias.

Other authors have also noted biased estimation of migration rates using coalescent-based estimators (Beerli and Felsenstein 1999Go). One reason for the bias is the relatively low level of sequence variation in the data or within a given time interval. This translates into greater variation in topology, which in turn means that lineages from different demes are brought together. The net result is an inflation of estimated migration rates. The other reason that there may be an inflation of migration rate estimates relates to the length of the sampling interval. If the interval is too small, then it becomes difficult to place a migration event in that interval. This will be true regardless of whether there are low or high rates of migration. Consequently, that interval is uninformative with respect to migration rates and will support a very flat posterior distribution. However, we point out that even though these biases exist for our chosen estimator, the 95% highest posterior density (HPD) estimates do not reject the truth any more than we would expect. That is, the posterior density reflects the wide ranges of possible values in the cases of just a few migration events.

Model I: Two Demes, Constant Mutation Rate, Changing Population Sizes, and Changing Migration Rates
The first model has two time intervals, each with a different symmetric migration rate. The same two demes are present in both intervals, and there is a single fixed mutation rate for the whole genealogy. Ten sequences were generated for each deme in each of four time points for a total of 80 sequences. In table 2, we see that {theta}(1), {theta}(2), and {lambda}(1) estimate correctly with low error, in contrast to {lambda}(2), where {theta}(1) and {theta}(2) are the population sizes for the first and second time intervals, respectively, and {lambda}(1) and {lambda}(2) are the migration rates of the first and second time intervals, respectively. The 95% HPD intervals still contain the true values of {theta}(1), {theta}(2), {lambda}(1), and {lambda}(2). Figure 3 shows the marginal posterior density for {lambda}(2) from one of the estimated data sets. The density has a long tail out to the boundary and thus will be sensitive to the boundaries of the priors used. We note that the histogram is noisy and in this case, a larger ESS or longer MCMC runs, would tend to smooth the curve. However, this would not affect the general shape of this marginal posterior density.


Figure 3
View larger version (9K):
[in this window]
[in a new window]
 
FIG. 3.— Marginal probability density for {lambda}(2) for model I. The truth is 25. It can be seen that the density is finite out to the bounds.

 
Model II: Two Demes, Changing Mutation Rate, Changing Population Sizes, and Changing Migration Rates
In the second model, we add a single extra estimated parameter to the first model. We allow the mutation rate to vary in different time intervals, where the mutation rates are 1 and 0.5 in the older and more recent time intervals, respectively. Again from table 2, we note that the performance compared to the first model is almost identical. Furthermore, mutation in both time intervals is estimated accurately. Figure 4 shows the marginal posterior density for both mutation rates. It should be noted that in each sample interval, there are two sets of serial sequence samples. This means that µ is estimable in both intervals.


Figure 4
View larger version (9K):
[in this window]
[in a new window]
 
FIG. 4.— Marginal probability density for µ1 and µ2 for model II. The truth is 0.5 and 1, respectively.

 
Model III: Changing Number of Demes, Constant Mutation Rate, Changing Population Size, and Changing Migration Rate
This model comes closest to that proposed by Nielsen and Wakeley (2001)Go but differs in one very important aspect. Let us divide time to the root of the tree into two intervals, one from the present to some time T1 generations before the present which we will refer to as the "recent" interval and the other from T1 generations before present to the root of the tree which we refer to as the "ancient" interval. In our model, there is a single deme, A, in the ancient interval, but in the recent interval, a second deme, B, becomes available for colonization. In our MCMC implementation, we explicitly place migration events on the tree, so that lineages or fragments of lineages may be labeled as belonging to A or B. When we run the MCMC under our colonization model, a tree is always rejected if it has B-lineages in the recent interval that fail to migrate to A when T1 is reached. This means that colonization of B is stochastically determined by migration from A, once B becomes available. In contrast, with the Nielsen-Wakeley model, as lineages are traced backward in time, once T1 is reached, all B-lineages simply become A-lineages, that is, the two demes merge.

Our implementation means that estimates of both the migration rate in the recent interval, {lambda}(1), and the time T1 are upwardly biased because (1) the sampled trees must have B-to-A migrations and (2) the time T1 will tend to be pushed backward to accommodate these migrations. Table 2 confirms this, although as with other estimates, 95% HPD intervals enclose the true values of the parameters. Estimates of {theta} and µ are well estimated.

Model IV: Three Time Intervals, Changing Number of Demes, Constant Mutation Rate, and Changing Population Parameters
The last model is the most complicated, with potentially the slowest MCMC mixing time. It has three time intervals. In the first and last intervals, we have two demes with symmetric migration and population sizes ({lambda}(1) = {lambda}(3) = 5, {theta}(1) = {theta}(3) = 0.01). The center time interval has just one deme as for model III, and we constrain the topology to not allow "dashed lines" lineages in this time interval. Ten sequences per deme per time interval were used with a total of 70 sequences, and run lengths of the MCMC chains were 300 million states. From table 2, we see that the estimators provided good recovery of the truth for most parameters. There is, however, a large bias and variance for {lambda}(3), while {lambda}(1) seems to have lost the bias from the previous model and has a smaller variance.

There are several reasons for this. First, there is an uneven number of samples in each interval. The first interval has two demes and sample times for a total of 40 sequences, whereas last interval has only 20 sequences from two demes. Second, the third time interval has just one time sample. Together, these two factors account for the larger variance of {lambda}(3).

In many instances, the marginal posteriors densities of {lambda}(3) were flat and diffuse. In these cases, the 95% HPD is not significantly smaller than the prior. With a modal estimator, such as the one we use here, it can be difficult obtaining an estimate with a relatively flat posterior density, particularly when there is noise in the distribution, as one can expect when the ESS is not large. This accounts for the large bias in {lambda}(3) because many of the modes lie some distance away from the true value on flat and diffuse posterior densities.


    Discussion
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Appendix. MCMC Move Types
 Acknowledgements
 References
 
In this paper, we develop a coalescent-based Bayesian MCMC procedure that allows estimation of migration and mutation rates, effective population sizes, and times of these changes, when the number of demes changes over time. We have shown that simultaneously estimating population size, migration rates, mutation rates, and the times of demographic and evolutionary changes is possible. However, the efficiency of estimators varies with the complexity of the models and, as one expects, the amount and quality of data. In our simulations, the number of demes was allowed to vary over time, and although this affected the accuracy and bias of some estimators, in particular migration rates, informative inference was still possible. We found that the ability to make accurate estimates was strongly dependent on the quality of the data and the sampling regimes, and under some circumstances informative posterior densities on some parameters were not obtained. However, MCMC chains in most analysis converged and gave sufficient ESSs after 107–108 iterations, so that results can be obtained on a single desktop computer over a week.

Although the analysis presented here is for two demes, the code produced is not limited to two demes or to the demographic time histories presented here. We have shown that even for quite complicated models we can get meaningful results. Even though some parameters may not have informative posterior densities, it is important to take these effects into account when considering other parameters of interest. For example, excluding population subdivision from a model may lead to incorrect population history estimates.

Unfortunately, the variation of performance creates problems with practical data analysis. It is possible nonetheless to determine whether, and to what extent, parameter estimates for a given data set may suffer from bias. This may be done as follows. First run some MCMC chains on the data, sample several chain states, simulate data from the parameters at each state, and attempt to recover the known parameters on the simulated data. This iterated approach is analogous to the double bootstrap (Hall and Martin 1988Go; Rodrigo 1993Go; Efron, Halloran, and Holmes 1996Go).

We have not considered priors other than to ensure that the posterior distribution was proper. In any study on real data, prior sensitivity analysis is required. In this study, we chose uniform priors that typically make inference difficult. Nonetheless, the method worked sufficiently well under these circumstances. More informative priors will significantly improve estimation and recovery of posterior densities.

By allowing a changing number of demes, these methods permit analysis of complex models of migration. These include models of colonization, where new demes appear, or of dispersal and vicariance, when migration may be interrupted for a period of time by some barrier. However, our analysis indicated that our parameter estimates can sometimes be weak. The challenge that remains is to identify the experimental and sampling conditions (i.e., number of sequences, frequency of sampling) that will allow better estimates. Of course, our analysis relates to serially sampled data, and certainly with ancient DNA, it may not be possible to control the number of sequences, or indeed the time between sampling. However, with rapidly evolving pathogens, for example, HIV to which these analysis have previously been applied, the experimental design is certainly within the investigators' control.


    Appendix. MCMC Move Types
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Appendix. MCMC Move Types
 Acknowledgements
 References
 
We now present the details of our MHMCMC implementation. We do not cover the moves already presented in Drummond et al. (2002)Go and Ewing, Nicholls, and Rodrigo (2004)Go. We note that where such a move results in an illegal state due to constraints, the move is rejected as noted in the text. Generally, acceptance rates were quite low for some moves but still gave acceptable ESS per CPU cycle because of the very quick evaluation of the acceptance ratio. All tunable parameters were not difficult to be adjusted and gave acceptance rates of approximately 20%. The MCMC scheme used was reversible jump MCMC, see Green (1995)Go and Green, Hjort, and Richardson (2003)Go for further details. We follow the notation as in Ewing, Nicholls, and Rodrigo (2004)Go.

Suppose Xn = {psi}. We refer to n as the MCMC update counter. Xn+1 is determined in the following way. Move k is chosen, with probability r(k), from a fixed set of state operators. Let Formula 3 denote an ordered list of independent uniform variates. Let Wk({psi}, u) = {psi}' denote a move W generating the new state {psi}'. We suppose that there is k' for which r(k') > 0 and Wk'({psi}', u') = {psi} for some u'. With some probability {alpha}({psi}', {psi}) set Xn+1 = {psi}' (accept), and otherwise set Xn+1 = {psi} (reject). The acceptance probability {alpha} is chosen to ensure that the Markov process is reversible with respect to h. Following Green (1995)Go set

Formula 3
We define

Formula 3
so that {alpha}({psi}', {psi}) = min{1, Q({psi}', {psi})h({psi}')/h({psi})}.

Joint Scale Move
This move was needed to get acceptable µ-mixing and is almost identical to that presented previously in Ewing, Nicholls, and Rodrigo (2004)Go, and we add terms for time intervals. Fix 0 < ß < 1 and draw {delta} uniformly at random from [ß, 1/ß]. The candidate state {psi}' is

Formula 3
where Formula 3 The leaf times are not scaled because they are data. If the move produces an invalid state (child older than parent), the move is rejected. Otherwise, the move is its own inverse and Formula 3 We found that ß = 1.1–1.2 gave good acceptance ratios (about 20%).

Migration Regeneration
In this move, all migration events are removed and a new set of migration events generated. This allows large moves within state space and can be combined with other moves to aid mixing. We generate migration events like Nielsen (2002)Go generates mutation events, with some procedural differences.

Consider a genealogy g with all migrations removed. Let Formula 3 be the root of s and s', where Formula 3 We call the path <s, s'> a cherry and define Formula 3 to be the set of cherries on g. Restricting ourselves to Formula 3 the deme labels constrain the migration events on the path <s, s'>. Let j and j' be the deme labels of s and s', respectively. If j = j', there must be 0, 2, or more migration events on the path. Alternatively, if j != j', there must be one or more migration events on the path. Let Ma be the total migration rate in interval a and set Formula to the length of the path <s, s'>. We draw random variables {delta}i from the probability density Formula we set l = i – 1 and drop the last variable, so we have Formula 3 To increase the efficiency of the method, use a different distribution for {delta}1 if j != j', that is, Formula (Nielsen 2002Go). Let Formula 3 be the distance along the path <s, s'> from s of the new migration node x. Now that we have generated new migration nodes on the cherry, we must set the deme labels on the edges. First, we check to see if there is a legal deme labeling. If not we regenerate migration nodes, until there is a legal labeling or we exceed some maximum iteration limit. In the case that no legal number of demes is generated, we reject and update the counter. Otherwise, we choose a legal deme labeling with probability c(s, s', u). In the case of just two demes, there is just one deme labeling and c(s, s', {delta}). Now, we can determine the deme label at sr, and we continue to recurse through the cherries from the leaves up to the root, thereby generating migration events over the tree.

The probability density of generating migration events on cherry <s, s'> is Formula 3 and if j != j', we divide by Formula The probability density of generating migration events on the whole tree is Formula 3 Again, this move is its own inverse and

Formula 3

Acceptance rates were good, ranging from 5% to 80%. It makes calculation rapid and improves mixing significantly with high migration rate.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Appendix. MCMC Move Types
 Acknowledgements
 References
 
This work was partially supported by grants from the United States Public Health Service. Greg Ewing is supported by an Allan Wilson Centre doctoral scholarship. We thank Geoff Nicholls for considerable advice and help, particularly with the MCMC theory and implementation. Also, we would like to thank two anonymous reviewers who helped improve the manuscript.


    Footnotes
 
Yoko Satta, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion
 Appendix. MCMC Move Types
 Acknowledgements
 References
 

    Beerli, P., and J. Felsenstein. 1999. Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics 152:763–773.[Abstract/Free Full Text]

    Drummond, A., R. Forsberg, and A. Rodrigo. 2001. The inference of stepwise changes in substitution rates using serial sequence samples. Mol. Biol. Evol. 18:1365–1371.[Abstract/Free Full Text]

    Drummond, A., and A. Rodrigo. 2000. Reconstruction genealogies of serial samples under the assumption of a molecular clock using serial-sample UPGMA. Mol. Biol. Evol. 17:1807–1815.[Abstract/Free Full Text]

    Drummond, A. J., G. K. Nicholls, A. G. Rodrigo, and W. Solomon. 2002. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 161:1307–1320.[Abstract/Free Full Text]

    Drummond, A. J., O. G. Pybus, A. Rambaut, R. Forsberg, and A. G. Rodrigo. 2003. Measurably evolving populations. Trends Ecol. Evol. 18:481–488.[CrossRef]

    Efron, B., E. Halloran, and S. Holmes. 1996. Bootstrap confidence levels for phylogenetic trees. Biometrika 93:7085.

    Ewing, G. B., G. K. Nicholls, and A. G. Rodrigo. 2004. Using temporally spaced sequences to simultaneously estimate migration rates, mutation rate and population sizes in measurably evolving populations. Genetics 168:2407–2420.[Abstract/Free Full Text]

    Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–376.[CrossRef][Web of Science][Medline]

    Fisher, R. A. 1930. The genetical theory of natural selection. Clarendon Press, Oxford.

    Fu, Y. X. 2001. Estimating mutation rate and generation time from longitudinal samples of DNA sequences. Mol. Biol. Evol. 18:620–626.[Abstract/Free Full Text]

    Geyer, C. J. 1992. Practical Markov chain Monte Carlo. Stat. Sci. 7:473–511.

    Green, P. J. 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711–732.[Abstract/Free Full Text]

    Green, P. J., N. L. Hjort, and S. Richardson. eds. 2003. Highly structured stochastic systems. Oxford University Press, Oxford.

    Hall, P., and M. A. Martin. 1988. On bootstrap resampling and iteration. Biometrika 75:661–671.[Abstract/Free Full Text]

    Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109.[Abstract/Free Full Text]

    Hudson, R. R. 1990. Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:1–44.

    Kingman, J. 1982a. The coalescent. Stoch. Proc. Appl. 13:235–248.[CrossRef]

    ———. 1982b. On the genealogy of large populations. J. Appl. Probab. 19A:27–43.

    Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21:1087–1091.[CrossRef]

    Nielsen, R. 2002. Mapping mutations on phylogenies. Syst. Biol. 51:729–739.[CrossRef][Web of Science][Medline]

    Nielsen, R., and J. Wakeley. 2001. Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics 158:885–896.[Abstract/Free Full Text]

    Notohara, M. 1990. The coalescent and the genealogical process in geographically structured population. J. Math. Biol. 29:59–75.[Web of Science][Medline]

    Rodrigo, A. G. 1993. Calibrating the bootstrap test of monophyly. Int. J. Parasitol. 23:507–514.[CrossRef][Web of Science][Medline]

    Rodriguez, F., J. L. Oliver, A. Marin, and J. R. Medina. 1990. The general stochastic model of nucleotide substitution. J. Theor. Biol. 142:485–501.[Web of Science][Medline]

    Rosenberg, N. A., A. G. Tsolaki, and M. M. Tanaka. 2003. Estimating mutation rate and generation time from longitudinal samples of DNA sequences. Theor. Popul. Biol. 63:347–363.[CrossRef][Web of Science][Medline]

    Wright, S. 1931. Evolution in Mendelian populations. Genetics 16:97–159.[Free Full Text]

Accepted for publication February 7, 2006.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
23/5/988    most recent
msj111v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Ewing, G.
Right arrow Articles by Rodrigo, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ewing, G.
Right arrow Articles by Rodrigo, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?