MBE Advance Access originally published online on April 11, 2008
Molecular Biology and Evolution 2008 25(7):1459-1471; doi:10.1093/molbev/msn090
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Smooth Skyride through a Rough Skyline: Bayesian Coalescent-Based Inference of Population Dynamics

,
* Department of Statistics, University of Washington
Department of Biostatistics, School of Public Health, University of California, Los Angeles
Departments of Biomathematics and Human Genetics, David Geffen School of Medicine, University of California, Los Angeles
E-mail: vminin{at}u.washington.edu.
| Abstract |
|---|
|
|
|---|
Kingman's coalescent process opens the door for estimation of population genetics model parameters from molecular sequences. One paramount parameter of interest is the effective population size. Temporal variation of this quantity characterizes the demographic history of a population. Because researchers are rarely able to choose a priori a deterministic model describing effective population size dynamics for data at hand, nonparametric curve-fitting methods based on multiple change-point (MCP) models have been developed. We propose an alternative to change-point modeling that exploits Gaussian Markov random fields to achieve temporal smoothing of the effective population size in a Bayesian framework. The main advantage of our approach is that, in contrast to MCP models, the explicit temporal smoothing does not require strong prior decisions. To approximate the posterior distribution of the population dynamics, we use efficient, fast mixing Markov chain Monte Carlo algorithms designed for highly structured Gaussian models. In a simulation study, we demonstrate that the proposed temporal smoothing method, named Bayesian skyride, successfully recovers "true" population size trajectories in all simulation scenarios and competes well with the MCP approaches without evoking strong prior assumptions. We apply our Bayesian skyride method to 2 real data sets. We analyze sequences of hepatitis C virus contemporaneously sampled in Egypt, reproducing all key known aspects of the viral population dynamics. Next, we estimate the demographic histories of human influenza A hemagglutinin sequences, serially sampled throughout 3 flu seasons.
Key Words: coalescent smoothing effective population size Gaussian Markov random fields
| Introduction |
|---|
|
|
|---|
Accurate estimation of population size dynamics has important implications for public health and conservation biology (Pybus et al. 2003
Many coalescent-based estimation algorithms rely on simple parametric forms to characterize the evolution of the population size dynamics over time. Advantageously, these deterministic functions contain a relatively small number of parameters to be estimated (Kuhner et al. 1998
; Drummond et al. 2002
). However, justifying strong parametric assumptions can be difficult and may require laborious and computationally expensive testing of many candidate functional forms to find an appropriate description of the population size trajectory. An extreme alternative to parametric population size models is the classical skyline plot estimation proposed by Pybus et al. (2000)
. This estimation procedure relies on a piecewise constant population dynamics model. Because the number of free parameters in this model is equal to the number of independently distributed observations, the classical skyline plot approach results in very noisy estimates.
To arrive at a middle ground between overly stringent parametric and noisy classical skyline plot approaches, 3 extensions to the classical skyline plot estimation have been recently proposed. Strimmer and Pybus (2001)
develop generalized classical skyline plot estimation. These authors employ a model selection approach, based on the Akaike Information Criterion correction (AICc), to reduce the number of free parameters in the classical skyline plot. Drummond et al. (2005)
and Opgen-Rhein et al. (2005)
use multiple change-point (MCP) models to estimate the population size dynamics in a Bayesian framework. These methods approximate the effective population size trajectory with a step function, defined by estimable change-point locations and step heights. One of the main advantages of MCP models is the ease of incorporating them into a joint Bayesian estimation of genealogies and population genetics parameters as demonstrated by Drummond et al. (2005)
. Both proposed MCP models share the same weakness as they require fairly strong prior decisions. Drummond et al. (2005)
a priori fix the total number of change points, a critical parameter in their model that controls the smoothness of the population size trajectory. Opgen-Rhein et al. (2005)
bypass the problem of fixing the number of change points through reversible jump Markov chain Monte Carlo (MCMC) sampling (Green 1995
). However, these authors use an informative and very influential prior for the number of change points in their model. Therefore, in both MCP approaches choosing an appropriate level of smoothness of population size dynamics remains problematic.
We propose to smooth population size trajectories explicitly. We choose piecewise constant demographic model of Pybus et al. (2000)
as our point of departure. Our goal is to construct a smooth skyride through a rough classical skyline profile. Our construction is accomplished by imposing a Gaussian Markov random field (GMRF) smoothing prior on the parameters of the piecewise constant population size trajectory. We make our smoothing prior "time aware" to penalize effective population size changes between "small" consecutive (Ghedin et. al 2005) intercoalescent intervals more than changes between intervals of larger size. To achieve this desirable behavior of our smoothing prior, we equip each consecutive pair of intercoalescent intervals with an appropriate smoothing weight. We show through a simulation study that the time-aware prior is very effective in capturing important characteristics of "true" population size trajectories and is superior to a uniform, time-ignorant GMRF prior. The extension of Kingman's coalescent to serially sampled (heterochronous) data by Rodrigo and Felsenstein (1999)
opens the door for coalescent-based inference for measurably evolving populations (Drummond et al. 2003
). We show that the GMRF smoothing can be easily incorporated into analyses of both isochronous (contemporaneously sampled) and heterochronous data.
Through simulation, we compare performance of the Bayesian skyride with Opgen-Rhein's MCP (ORMCP) model (Opgen-Rhein et al. 2005
) and a Bayesian skyline plot, a MCP model implemented in the software package BEAST (Drummond et al. 2005
). In 3 simulation scenarios that we consider, we find that the Bayesian skyride performs as well or better than both MCP approaches. Although the small number of our simulations prevent us from a detailed comparison of the methods, we nevertheless can conclude that the Bayesian skyride is a competitive alternative to the MCP models and requires substantially weaker prior assumptions. We demonstrate the utility of the proposed method by applying it to 2 real data sets. We analyze isochronous sequences of hepatitis C virus (HCV) and demonstrate that the Bayesian skyride is able to recover all previously inferred characteristics of the Egyptian HCV population dynamics. We proceed with an investigation of intraseason population dynamics of human influenza virus. Our analysis of heterochronous influenza data from 3 seasons demonstrates that estimation of influenza population dynamics holds promise for predicting peak infection time within a flu season.
| Methods |
|---|
|
|
|---|
Coalescent Background
We start with a random population sample of n sequences. Coalescent theory provides a stochastic process that produces genealogies relating these sampled sequences. The process starts at sampling time t = 0 and proceeds backward in time as t increases, coalescing n individuals one pair at a time until the time to the most recent common ancestor (TMRCA) of the sample is reached (Kingman 1982
The effective population size is an abstract quantity that brings populations with different reproductive models to a "common denominator," namely the Wright–Fisher model (Kingman 1982
). One can obtain the census population size by appropriate scaling of the effective population size. Because the dynamics of the effective population size plays a very important role in shaping coalescent-based genealogies, it should be possible to solve the inverse problem and recover the effective population size trajectories from known genealogies.
We assume for the moment that g is a known genealogy relating the n sampled sequences. Suppose that function Ne(t) describes the time evolution of the effective population size as we move into the past. Given Ne(t), we need to compute the probability of observing g under the coalescent with variable population size. To achieve this, it suffices to construct a probability density function over the intercoalescent times u = (u2,...,un) induced by g, where uk = tk – tk – 1, tk is the time of the (n – k)th coalescent event for k = 2,...,n and tn = 0 is the time at which sequences are sampled (Felsenstein 1992
; Pybus et al. 2000
). Griffiths and Tavaré (1994)
show that the joint density of intercoalescent times can be obtained by multiplying conditional densities
|
| (1) |
Piecewise Demographic Model for Isochronous Data
Let us assume that all sampled sequences were collected effectively at the same time, meaning that differences between sampling times are negligible compared with the TMRCA of the sample. We start with a critical assumption that Ne(t) can change its value only at coalescent times, Ne(t) =
k for some
k > 0 and tk < t
tk–1, k = 2,...,n. Informally, we can plug the piecewise constant Ne(t) into equation (1) and arrive at
|
| (2) |
k is equivalent to estimating the rate of an exponential distribution after drawing only a single realization from this distribution. Maximizing likelihood function (2) with respect to
k yields
|
| (3) |
borrow their population sizes from the neighboring intervals. The authors choose
by maximizing a second-order extension of the AICc. The Bayesian skyline plot of Drummond et al. (2005)
Throughout this section, we have assumed that time is measured in units of generations. When analyzing isochronous data, branches of genealogies are often estimated in units of average number of substitutions per site. Even if we are able to estimate branches in units of clock time, the generation time may be unknown. However, this inability to identify time in units of generations does not limit estimation of the dynamics of demographic histories. If estimated intercoalescent intervals are not measured in units of generations and are rescaled as uk = cuk, then plugging u
s into the likelihood function (2), we can recover the rescaled effective population size trajectory N
(t)=cNe(t).
Piecewise Demographic Model for Heterochronous Data
We now turn to the piecewise demographic model for sequences sampled at sufficiently different time points. As before, we assume that genealogy g relating the sampled sequences is known and fixed. Moreover, branch lengths of g satisfy constraints imposed by sampling times s. The sampling times divide each intercoalescent interval k into subintervals wk = (wk0,...,wkjk), where jk
{0,...,n – 1} is the number of "distinct" sampling times occurring during interval k,
, and the interval that ends with the (n – k)th coalescent event is always indexed by k0. To each subinterval kj, we attach the number of lineages nkj present in the genealogy at the beginning of this interval. See figure 1 for an example genealogy with labeled intercoalescent intervals, subintervals, and numbers of lineages. For heterochronous data, we still assume that Ne(t) =
k for some
k > 0 and tk < t
tk–1 for k = 2,...,n.
|
Rodrigo and Felsenstein (1999)
![]() | (4) |
|
| (5) |
Working with heterochronous data has 2 major advantages. First, such data allow one to estimate branch lengths of genealogies in units of time and simultaneously estimate the mutation rate (Drummond et al. 2002
, 2003
). Secondly, the sampling times provide additional information for the effective population size estimation. Although the sampling subintervals do not allow one to observe more coalescence events, these subintervals serve as censored time-to-event data. Unfortunately, because there are at most n distinct sampling times, the improvement in the information content may not be dramatic. Therefore, estimation of intercoalescent interval-specific effective population sizes remains problematic without further modifications.
Temporally Smoothed Piecewise Demographic Model
Because the heterochronous likelihood (4) reduces to the isochronous likelihood (2) when jk = 0 for all k = 2,...,n, from now on we assume that that observed data come in the form of intercoalescent subintervals w = (w2,...,wn). We first transform the intercoalescent interval-specific effective population sizes onto the whole real line via
|
| (6) |
|
| (7) |
= (
2,...,
n).
We invoke a common assumption stating that the effective population size changes continuously through time. To infuse this minimal prior knowledge into our Bayesian model, we devise a GMRF prior distribution for the vector
. This prior penalizes the differences between components of
as
![]() | (8) |
is the overall precision of the GMRF and
k is the distance between sites k + 1 and k on the 1-dimensional lattice {2,...,n}. A uniform GMRF smoothing with equal distances
2 =
=
n–1 = 1 is often assumed for temporal and spatial smoothing. However, in the piecewise demographic model, the penalty for the dissimilarity between adjacent intercoalescent, interval-specific, effective population sizes should depend on the interval sizes. Therefore, we construct a time-aware GMRF prior based on midpoint distances between intercoalescent intervals,
|
| (9) |
We conclude our prior specification by assigning a gamma prior to the GMRF precision parameter
,
|
| (10) |
= β = 0.001 making prior (10) relatively uninformative, with expectation 1 and variance 1,000.
We estimate
and
by MCMC sampling from the posterior distribution of these parameters. Moreover, we implement our Bayesian skyride method in the software package BEAST to estimate effective population size trajectories and genealogies simultaneously. The details of our MCMC algorithm can be found in Appendix A. Appendix B contains instructions on using the Bayesian skyride in BEAST.
Testing Significance of the Effective Population Size Changes
It is common to assess significance of effective population size changes by visually inspecting quantiles of the marginal posterior distributions of effective population sizes. Such an informal approach can be deceptive. In using it, one attempts to draw conclusions about the difference of 2 possibly highly correlated random variables based on their marginal distributions. Moreover, the correlation between values of the effective population size cannot be ignored if we indeed believe that Ne(t) changes continuously through time. As an alternative, we propose to use Bayes factors to formally test the significance of effective population size changes when g is assumed to be known.
We start with a 1-sided hypothesis test. Suppose that we a priori fix 2 intercoalescent intervals i < j that correspond to some historical events of interest. Then, we may be interested in testing hypotheses H0:
i <
j versus H1:
i
j. The Bayes factor
|
| (11) |
i <
j. The GMRF prior (8) implies that
as long as the prior is proper.
Clearly, this 1-sided test is impractical if the direction of the effective population size change is irrelevant or if one is interested in testing effective population size differences among multiple intercoalescent intervals. Therefore, we also consider the hypothesis H0:
i1 =
=
ij, where {i1,...,ij} is a subset of {2,...,n}. The alternative hypothesis H1 states that not all
i1,...,
ij are equal. To estimate the Bayes factor in equation (11), we calculate 2 marginal likelihoods, Pr(w|H0) and Pr(w|H1), from the MCMC output using the harmonic mean estimator (Newton and Raftery 1994
). This Bayes factor estimation procedure requires sampling from the posterior distribution of (
,
) under j – 1 linear constraints on
imposed by hypothesis H0. Fortunately, sampling from GMRFs under linear constraints adds very little computational cost to the unconstrained sampling algorithm (Rue and Held 2005
). Because the harmonic mean estimator of the marginal likelihood is not the most efficient, one could also use the method of Chib and Jeliazkov (2001)
. Alternatively, it may be possible to adapt the generalized Savage–Dickey ratio of Verdinelli and Wasserman (1995)
to test the sharp hypothesis H0.
| Results |
|---|
|
|
|---|
Simulated Genealogies
We examine the ability of the Bayesian skyride to recover effective population size dynamics in a simulation study. Because the number of individuals in a population at time t = 0 affects only the time measurement units, we always start with N(0) = 1.0 in our simulations. First, we simulate a genealogy assuming the constant population size. Next, we use the molecular sequence evolution simulator of Rambaut and Grassly (1997)
|
Next, we simulate a genealogy assuming that Ne(t) = e–500t. Because we proceed in time from present to past in our simulations, the negative growth constant implies an exponentially growing population. This exponential growth is shown on the log scale as a straight dashed line in all plots of figure 3. All models perform reasonably well in this simulation. Here, as in the constant population size case, intercoalescent intervals near the root of the genealogy cause estimation problems. One "unusually" large intercoalescent interval in this region provokes the uniform Bayesian skyride, the ORMCP model, and the Bayesian skyline plot model into overestimating the effective population size in the proximity of this interval. However, the fixed-tree and BEAST time-aware Bayesian skyride methods are less prone to such overestimation. We believe that this desirable behavior results from our weighting scheme that prohibits rapid changes of the effective population size during short time periods.
|
Finally, we generate a genealogy from a coalescent process assuming that a population experiences a bottleneck during the population's evolutionary history. More specifically, we set
![]() | (12) |
|
To provide a comparative summary of the performance of the Bayesian skyride and MCP models, we report the percent error (PE) for all simulation scenarios. We define this PE as
|
| (13) |
|
Because running BEAST is time consuming even under a simple parametric demographic model, we use our simulation study to investigate the computational cost of incorporating the Bayesian skyride into BEAST. On a dual-processor Pentium 3.4 Ghz with 4 GB of RAM, the Bayesian skyride analysis took 2.75, 2.30, and 2.21 h for the constant, exponential growth, and bottleneck simulated data sets, respectively. The corresponding running times for the Bayesian skyline plot with 10 change points are 1.70, 1.81, and 1.58 h, indicating that the Bayesian skyride method is only slightly slower than the Bayesian skyline plot. We expect to further optimize our BEAST implementation and improve computational efficiency of the Bayesian skyride.
Population Dynamics of Egyptian HCV
We analyze 63 HCV sequences, sampled in 1993 in Egypt. Pybus et al. (2003)
use this data set to study the population dynamics of Egyptian HCV. The analyzed sequences are derived from the HCV E1 genomic region. Pybus et al. (2003)
argue that the approximately random sequence sampling, no sign of population substructure, and other properties of these HCV sequences make them very suitable for the coalescent analysis.
We perform a phylogenetic analysis of the HCV sequences using the BEAST software package (Drummond and Rambaut 2007
). Following Pybus et al. (2003)
, we use a strict molecular clock and the HKY substitution model (Hasegawa et al. 1985
). We put a coalescent prior with constant population size on genealogies. However, the wide uniform prior over interval [0, 1000] (measured in years) on the TMRCA together with the abundant phylogenetic information in the HCV sequences should limit the coalescent prior influence on estimated genealogies. To estimate branches in units of years, we use a previously estimated mutation rate in the HCV E1 genomic region, 7.9 x 10–4 substitutions/site/year (Pybus et al. 2001
). For estimation of the population dynamics with the fixed-tree Bayesian skyride, we use the summary, majority clade support genealogy with median node heights depicted in the top-left plot of figure 5. We also use the BEAST Bayesian skyride to estimate the effective population size trajectory and the HCV genealogy simultaneously.
|
The top-right and bottom-left plots of figure 5 show the posterior medians and 95% BCIs of Ne(t) under the time-aware BEAST and fixed-tree Bayesian skyride models. The similarity between the BEAST and fixed-tree Bayesian skyride results indicates that tree uncertainty does not play a significant role in the estimation of the Egyptian HCV population dynamics. Because time is measured in years, we estimate the effective population size scaled by the generation length per year. These scaled estimates are often interpreted as effective numbers of infections (Pybus et al. 2001
The exponential growth of HCV infections in the 20th century is the most remarkable aspect of the HCV evolution in Egypt. Pybus et al. (2003)
argue that the exponential growth of HCV infections is a result of intravenously administered parenteral antischistosomal therapy (PAT), practiced in Egypt from the 1920s to the 1980s. Our temporal smoothing procedure successfully recovers this exponential growth. However, the effective population size reconstruction is noisier in the time periods preceding the exponential growth phase due to lack of coalescent events. Pybus et al. (2003)
hypothesize that the effective number of HCV infections was constant before the start of the exponential growth phase. We test this hypothesis using the fixed-tree Bayesian skyride by constraining intercoalescent effective population sizes to be equal from the TMRCA to the year 1920. In terms of our model parameters, this hypothesis translates to H0:
2 =
=
3. The Bayes factor of 12,880 in favor of H0 decisively supports constant population size hypothesis of Pybus et al.
The posterior summary of the effective population size trajectory under the constrained Bayesian skyride is shown in the bottom-right plot of figure 5. Enforcing a constant population size prior to the 1920s generates long-range effects on the estimation of more recent population sizes due to the increase in the GMRF precision. Under the unconstrained Bayesian skyride, the effective population size trajectory shows a slight decrease in the period 1970–1993. However, the constrained model predicts a constant population size during these years. A gradual transition from the intravenous to oral administration of the PAT started in the 1970s (Frank et al. 2000
). We would like to test whether this transition caused a significant decrease in the effective number of HCV infections, seen in the estimates under the unconstrained Bayesian skyride. Our null hypothesis H0:
62 <
46, where index 46 corresponds to time period 1960–1977 and index 62 corresponds to time period 1992–1993. The Bayes factor of only 3 in favor of H0 suggests that the decay of the HCV effective population size in the 1970–1993 period is not statistically significant.
Intraseason Population Dynamics of Human Influenza
To study intraseason population dynamics of human influenza, we compile 3 data sets from sequences reported by Ghedin et al. (2005)
that correspond to the 3 flu seasons: 1999–2000, 2001–2002, and 2003–2004. The data sets consist of 48, 59, and 72 hemagglutinin sequences, respectively. All sequences derive from H3N2 isolates. Sequence sampling was restricted to New York State. This should diminish the effects of geographical population structure on estimated genealogies. Recombination also should not play a part in shaping intragenic influenza genealogies because homologous intragenic recombination is very rare in influenza viruses (Steinhauer and Skehel 2002
). Because sampling dates for all analyzed sequences are available, all 3 flu season data sets are heterochronous.
Similarly to the Egyptian HCV analysis, we obtain a posterior sample of genealogies using the BEAST constant population size model and Bayesian skyride. In the latter analysis, we estimate influenza effective population size trajectory simultaneously with the viral genealogy. Our model specification of nucleotide evolution is nearly identical to the HCV analysis. We test the molecular clock assumption using Bayes factors (Suchard et al. 2003
). After analyzing the intraseason data sets with a lognormal relaxed clock model, proposed by Drummond et al. (2006)
, we compute the Bayes factors in favor of the molecular clock hypothesis using a harmonic mean estimator. The estimated Bayes factors (season 1999–2000: 111,000, 2001–2002: 19, and 2003–2004: 16) strongly support the molecular clock hypothesis in all 3 data sets. Although the molecular clock together with the heterochronous nature of the data sets in principle allow us to estimate the mutation rates simultaneously with other model parameters, we find that these intrahost influenza data have little information about mutation rates. Therefore, for the mutation rate, we use informative lognormal prior, commensurate with previously obtained estimates of mutation rate in influenza hemagglutinin genes (Fitch et al. 1997
; Yang et al. 2007
).
From posterior samples of all model parameters in the intraseason data sets, we obtain maximum clade support genealogies with median node heights. We feed these genealogies, shown in the left column of figure 6, into our fixed-tree Bayesian skyride procedure. Branch lengths are measured in units of weeks because these units of time are commonly used for intraseason surveillance of flu epidemics. The middle column of figure 6 shows the fixed-tree Bayesian skyride estimates of Ne(t). Results of the BEAST Bayesian skyride, depicted in the right column of figure 6, significantly differ from the fixed-tree inference. Such discrepancy is a clear indication that genealogical uncertainty cannot be ignored during inference of influenza intraseason population dynamics. Interestingly, applying the fixed-tree Bayesian skyride to a random subset of genealogies, sampled under the constant population size model, does not reveal significant variation of the effective population size trajectory estimates (results not shown). This observation suggests that joint inference of genealogies and Ne(t) should be preferred even when the effect of ignoring genealogical uncertainty is not apparent.
|
Exponentially growing influenza populations suggest that the viral diversity was increasing before the start of all 3 flu seasons. The posterior medians of the influenza effective population size trajectories exhibit piecewise exponential shapes. However, the wide BCIs of Ne(t) prevent us from studying local features of these curves. These wide BCIs of the effective population size trajectories appropriately reflect the lack of information about influenza genealogies in the sampled sequences. We only point out that the BEAST Bayesian skyride results suggest that the influenza effective population size was growing slower in 2001–2002 than in 1999–2000 and 2003–2004. This is consistent with the Centers for Disease Control and Prevention surveillance data (http://www.cdc.gov/flu/) reporting a significant delay of flu activity during the 2001–2002 season.
Prior Sensitivity
In Bayesian modeling, it is important to investigate sensitivity of results to the prior assumptions of the model. We place a GMRF smoothing prior on log effective population sizes
. This prior informs our model only about the smoothness of the population size trajectory, leaving the task of determining an overall level of
to the likelihood. The GMRF precision parameter
regulates this degree of smoothness; unfortunately, expert knowledge of
is rarely known a priori. Using the Egyptian HCV data, we illustrate the prior and posterior distributions of log
in the left plot of figure 7. We use the log transformation to mitigate boundary effects near 0, facilitating interpretability of distributional summaries. The dramatic difference between the flat and diffuse prior density and highly peaked posterior histogram of log
suggests that our data alone contain more than sufficient information to estimate
.
|
Recall that in all our analyses we set
= β = 0.001 in
s prior density (10). To investigate the sensitivity of our results to these hyperprior parameter choices, we reanalyze the Egyptian HCV data using 5 different values of
: 0.001, 0.002, 0.005, 0.01, and 0.1, leaving β unchanged. For these values of
, the prior mean of
grows from 1, 2, 4, 10, to 100 respectively. We summarize posterior distributions of log
in the 5 boxplots on the right side of figure 7. These boxplots demonstrate that the posterior distribution of log
hardly changes when we alter the hyperprior parameter
. We conjecture that this remarkable robustness indicates that our Bayesian temporal smoothing model is very well suited for estimating population size dynamics. | Discussion |
|---|
|
|
|---|
We present the Bayesian skyride, a novel coalescent-based, statistical approach for estimating effective population size dynamics. In contrast to previously proposed methods, we explicitly incorporate smoothing into our Bayesian model via a GMRF prior. This strategy allows us to regularize the noisy skyline plot estimates of piecewise constant effective population size trajectories. We make our Bayesian skyride time aware using a weighting scheme based on the sizes of intercoalescent intervals. As with many other smoothing techniques, our GMRF prior has a precision parameter
that controls the strength of smoothness. Estimating this parameter from data alone can be challenging and often requires injection of prior knowledge (Bernardinelli et al. 1995
, eliminating the need of informative priors. Therefore, our method, in contrast to competitors, does not require subjective decisions from the user. The independence of intercoalescent intervals permits us to use very efficient, GMRF-tailored algorithms for sampling from the posterior distribution of the model parameters. This computational efficiency becomes even more critical for integration of the Bayesian skyride into a joint Bayesian estimation of genealogies and population genetics parameters.
We demonstrate that the Bayesian skyride can successfully reconstruct effective population size trajectories under the 3 simulated demographic scenarios. During our simulations, we find that our time-aware Bayesian skyride is superior to the uniform GMRF prior. Therefore, we use the former during our analyses of real data sets. The temporally smoothed scaled effective population size trajectory of the Egyptian HCV demographic history agrees with previous estimates of Ne(t) remarkably well (Pybus et al. 2003
; Drummond et al. 2005
). Using this example, we illustrate how to formally test one- and two-sided hypotheses in our Bayesian skyride framework. Next, we analyze the intraseason population dynamics of human influenza. We find that fixing a genealogy, estimated under the constant population size demographic model, leads to inadequate estimation of Ne(t). The striking difference between our fixed-tree and BEAST Bayesian skyride methods highlights the importance of joint estimation of the effective population size and the genealogy of sampled sequences.
Other building blocks of the coalescent model may also be needed to accurately determine influenza intraseason population dynamics. Although all influenza sequences were sampled in the same geographical location, the periodic migrational patterns of the virus may have a significant effect on shaping genealogies relating the sampled sequences. Omitting selection in our coalescent model can also lead to inaccurate estimation of population dynamics. A more detailed coalescent analysis of intrahost influenza evolution should resolve these difficulties.
The influenza example motivates the need for new statistical tools for quantifying commonalities in multiply observed demographic histories. Similar repeated evolutionary patterns occur during intrahost HIV evolution and have important medical implications (Shankarappa et al. 1999
). We envision analyzing such repeated patterns using a Bayesian hierarchical framework (Kitchen et al. 2004
). Such an approach will require an accurate alignment of observed time intervals and inclusion of external factors that may effect demographic dynamics. The Bayesian skyride is perfectly suitable for the inclusion of such external information as covariates in a generalized linear model framework (MacNab 2003
). This proposed methodology will enable statistical testing of environmental effects on demographic histories of populations.
| Appendix A: MCMC Sampling Scheme |
|---|
|
|
|---|
Fixed Genealogy
We first describe a sampling scheme for the fixed genealogy case. Here, we represent genealogy g through its vector of intercoalescent and sampling intervals w. To approximate the posterior
|
| (A1) |
,
), we first generate a candidate value for the GMRF precision,
* =
f, where f is drawn from a symmetric proposal distribution with density Pr(f)
f + 1/f defined on the interval [1/F, F]. The tuning constant F controls the distance between the proposed and current values of the GMRF precision. Conditional on
*, we propose a new state
* for the vector of log-effective population sizes using a Gaussian approximation to the full conditional density
|
| (A2) |
, the observed wks are distributed independently of each other (Rue et al. 2004
*,
*), we accept or reject it in a Metropolis–Hastings step (Metropolis et al. 1953
Incorporating Genealogical Uncertainty
So far, we have assumed that the genealogy g is known and fixed. However, we do not observe genealogies relating individuals randomly sampled from a population. Instead, we observe molecular sequence data for each individual on the tips of an unknown genealogy. Sequence data and genealogies are connected through the standard assumption that sequence characters are generated by a mutational process that acts along a hidden genealogy. Therefore, the complete likelihood of observing sequence data D is Pr(D|g, Q), where Q is a vector of mutational process model parameters. A priori, we assume that Q and g are independent. Probability distribution Pr(Q) depends on the parameterization of the mutational process model. We use the coalescent as a prior for g so that
|
| (A3) |
) is defined by equation (7). The posterior distribution of all model parameters becomes
|
| (A4) |
| Appendix B: BEAST Implementation |
|---|
|
|
|---|
We have implemented the time-aware Bayesian skyride in the BEAST software package (Drummond and Rambaut 2007
![]() |
| Acknowledgements |
|---|
|
|
|---|
We would like to thank John O'Brien for helpful discussions and his advice on the analysis of influenza evolution. We are grateful to Oliver Pybus and another anonymous reviewer for their constructive comments that greatly improved our manuscript. V.N.M. was supported by a Dissertation Year Fellowship from the University of California, Los Angeles Graduate Division. E.W.B. is supported by National Institutes of Health grant AI07370. M.A.S. is an Alfred P. Sloan Research Fellow.
| Footnotes |
|---|
Jody Hey, Associate Editor
| References |
|---|
|
|
|---|
Bernardinelli L, Clayton D, Montomoli C. Bayesian estimates of disease maps: how important are priors? Stat Med. (1995) 14:2411–2431.[Web of Science][Medline]
Biek R, Drummond A, Poss M. A virus reveals population structure and recent demographic history of its carnivore host. Science (2006) 311:538–541.
Chib S, Jeliazkov I. Marginal likelihood from the Metropolis-Hastings output. J Am Stat Assoc. (2001) 96:270–281.[CrossRef][Web of Science]
Drummond A, Ho S, Phillips M, Rambaut A. Relaxed phylogenetics and dating with confidence. PLoS Biology (2006) 4:e88.[CrossRef][Medline]
Drummond A, Nicholls G, Rodrigo A, Solomon W. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics (2002) 161:1307–1320.
Drummond A, Pybus O, Rambaut A, Forsberg R, Rodrigo A. Measurably evolving populations. Trends Ecol Evol. (2003) 18:481–488.[CrossRef]
Drummond A, Rambaut A. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. (2007) 7:214.[CrossRef][Medline]
Drummond A, Rambaut A, Shapiro B, Pybus O. Bayesian coalescent inference of past population dynamics from molecular sequences. Mol Biol Evol. (2005) 22:1185–1192.
Elliott P, Wakefield J, Best N, Briggs D. Spatial epidemiology: methods and applications. (2000) Oxford (UK): Oxford University Press.
Felsenstein J. Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genet Res. (1992) 59:139–147.[Web of Science][Medline]
Fitch W, Bush R, Bender C, Cox N. Long term trends in the evolution of H(3) HA1 human influenza type A. Proc Natl Acad Sci USA (1997) 94:7712–7718.
Frank C, Mohamed M, Strickland G, et al, (11 co-authors). The role of parenteral antischistosomal therapy in the spread of hepatitis C virus in Egypt. Lancet (2000) 355:887–891.[CrossRef][Web of Science][Medline]
Ghedin E, Sengamalay N, Shumway M, et al, (19 co-authors). Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution. Nature (2005) 437:1162–1166.[CrossRef][Medline]
Green P. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika (1995) 82:711–732.
Griffiths R, Tavaré S. Sampling theory for neutral alleles in a varying environment. Philos Trans R Soc Lond B Biol Sci. (1994) 344:403–410.[Web of Science][Medline]
Hasegawa M, Kishino H, Yano T. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. (1985) 22:160–174.[CrossRef][Web of Science][Medline]
Hastings W. Monte Carlo sampling methods using Markov chains and their applications. Biometrika (1970) 57:97–109.
Hein J, Schierup M, Wiuf C. Gene genealogies, variation and evolution (2005) New York: Oxford University Press.
Hudson R. Properties of a neutral allele model with intragenic recombination. Theor Popul Biol. (1983) 23:183–201.[CrossRef][Web of Science][Medline]
Kass R, Raftery A. Bayes factors. J Am Stat Assoc. (1995) 90:773–795.[CrossRef][Web of Science]
Kingman J. On the genealogy of large populations. J Appl Probab. (1982) 19:27–43.[CrossRef]
Kitchen C, Philpott S, Burger H, Weiser B, Anastos K, Suchard M. Evolution of human immunodeficiency virus type 1 coreceptor usage during antiretroviral therapy: a Bayesian approach. J Virol. (2004) 78:11296–11302.
Knorr-Held L, Rue H. On block updating in Markov random field models for disease mapping. Scand J Stat. (2002) 29:597–614.[CrossRef]
Krone S, Neuhauser C. Ancestral processes with selection. Theor Popul Biol. (1997) 51:210–237.[CrossRef][Web of Science][Medline]
Kuhner M, Yamato J, Felsenstein J. Maximum likelihood estimation of population growth rates based on the coalescent. Genetics (1998) 149:429–434.
MacNab Y. Hierarchical Bayesian modeling of spatially correlated health service outcome and utilization rates. Biometrics (2003) 59:305–316.[CrossRef][Web of Science][Medline]
Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E. Equation of state calculation by fast computing machines. J Chem Phys. (1953) 21:1087–1092.[CrossRef]
Newton M, Raftery A. Approximate Bayesian inference with the weighted likelihood bootstrap. J R Stat Soc Ser B (1994) 56:3–48.
Notohara M. The coalescent and the genealogical process in geographically structured population. J Math Biol. (1990) 29:59–75.[Web of Science][Medline]
Opgen-Rhein R, Fahrmeir L, Strimmer K. Inference of demographic history from genealogical trees using reversible jump Markov chain Monte Carlo. BMC Evol Biol. (2005) 5:6.[CrossRef][Medline]
Pybus O, Charleston M, Gupta S, Rambaut A, Holmes E, Harvey P. The epidemic behavior of the hepatitis C virus. Science (2001) 292:2323–2325.
Pybus O, Drummond A, Nakano T, Robertson B, Rambaut A. The epidemiology and iatrogenic transmission of hepatitis c virus in Egypt: a Bayesian coalescent approach. Mol Biol Evol. (2003) 20:381–387.
Pybus O, Rambaut A. GENIE: estimating demographic history from molecular phylogenies. Bioinformatics (2002) 18:1404–1405.
Pybus O, Rambaut A, Harvey P. An integrated framework for the inference of viral population history from reconstructed genealogies. Genetics (2000) 155:1429–1437.
Rambaut A, Grassly N. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. (1997) 13:235–238.
Rodrigo A, Felsenstein J. The evolution of HIV, chapter Coalescent approaches to HIV population genetics (1999) Baltimore (MD): Johns Hopkins University Press.
Rue H. Fast sampling of Gaussian Markov random fields. J R Stat Soc Ser B (2001) 63:325–338.[CrossRef]
Rue H, Held L. Gaussian Markov random fields: theory and applications. (2005) London: Chapman & Hall. (Monographs on Statistics and Applied Probability; vol. 104).
Rue H, Steinsland I, Erland S. Approximating hidden Gaussian Markov random fields. J R Stat Soc Ser B (2004) 66:877–892.[CrossRef]
Shankarappa R, Margolick J, Gange S, et al. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol. (1999) 73:10489–10502.
Shapiro B, Drummond A, Rambaut A, et al, (27 co-authors). Rise and fall of the Beringian steppe bison. Science (2004) 306:1561–1565.
Steinhauer D, Skehel J. Genetics of influenza viruses. Annu Rev Genet. (2002) 36:305–332.[CrossRef][Web of Science][Medline]
Strimmer K, Pybus O. Exploring the demographic history of DNA sequences using the generalized skyline plot. Mol Biol Evol. (2001) 18:2298–2305.
Suchard M, Weiss R, Sinsheimer J. Testing a molecular clock without an outgroup: derivations of induced priors on branch-length restrictions in a Bayesian framework. Syst Biol. (2003) 52:48–54.
Thorne J, Kishino H, Painter I. Estimating the rate of evolution of the rate of molecular evolution. Mol Biol Evol. (1998) 15:1647–1657.[Abstract]
Verdinelli I, Wasserman L. Computing Bayes factors using a generalization of the Savage-Dickey density ratio. J Am Stat Assoc. (1995) 90:614–618.[CrossRef][Web of Science]
Welch D, Nicholls G, Rodrigo A, Solomon W. Integrating genealogy and epidemiology: the ancestral infection and selection graph as a model for reconstructing host virus histories. Theor Popul Biol. (2005) 68:65–75.[CrossRef][Web of Science][Medline]
Yang Z, O'Brien J, Zheng X, Zhu H, She Z. Tree and rate estimation by local evaluation of heterochronous nucleotide data. Bioinformatics (2007) 23:169–176.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
J. C. Stack, J. D. Welch, M. J. Ferrari, B. U. Shapiro, and B. T. Grenfell Protocols for sampling viral sequences to study epidemic dynamics J R Soc Interface, February 10, 2010; (2010) rsif.2009.0530v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Ford, F. A. Matsen, and T. Stadler A Method for Investigating Relative Timing Information on Phylogenetic Trees Syst Biol, June 12, 2009; (2009) syp018v1. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||












