MBE Advance Access originally published online on September 6, 2006
Molecular Biology and Evolution 2006 23(12):2355-2360; doi:10.1093/molbev/msl106
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Robustness of Coalescent Estimators to Between-Lineage Mutation Rate Variation
Department of Genome Sciences, University of Washington, Seattle
E-mail: mkkuhner{at}gs.washington.edu.
| Abstract |
|---|
|
|
|---|
Data from HIV and from human neoplastic cells can show substantial between-lineage mutation rate variation even within a single population. Such variation may affect estimators of population quantities such as
= 4Neµ. Using simulated DNA data, I measured the effect of rate variation on recovery of
by the summary-statistic estimator of Watterson (Watterson GA. 1975. On the number of segregating sites in genetical systems without recombination. Theor Popul Biol. 7:256276) and the coalescent maximum likelihood algorithm LAMARC (Kuhner MK. 2006. LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics. Advance Access doi: 10.1093/bioinformatics/btk051). Watterson's estimator showed a downward bias, as expected, with high values of
. LAMARC's mean estimate was accurate for all tested values of
and rate variation except for a downward bias when rate variation was maximal (i.e., the slow rate was zero). LAMARC had consistently narrower confidence intervals (CIs) than Watterson's estimator. Both methods tended to reject the truth too often when rate variation was 8x or greater and independent among branches, as well as when variation was 4x or greater and correlated among branches. In the case of Watterson's estimate, this excess rejection was fully attributable to variation among genealogies in the amount of total branch length associated with the fast and slow rates. However, in the case of LAMARC, some excess rejection was still observed even when between-genealogy variation was taken into account. Both estimators are robust to modest rate variation; however, their use should be coupled with a statistical test to rule out extreme rate variation as the resulting CIs may not be reliable.
Key Words: coalescent mutation rate variation maximum likelihood
| Introduction |
|---|
|
|
|---|
A number of research groups have developed coalescent-based estimators of population parameters such as
= 4Nµ, migration rates, growth rates, recombination rate, and divergence times. Approaches have included summary-statistics measures (e.g., Watterson 1975It has generally been assumed that variation in mutation rate between lineages, although an issue in species phylogeny reconstruction, is not a significant concern within populations. Mutation rate is assumed to be a stable parameter, unlikely to vary substantially within a single species or over a short time. However, the current study was motivated by 2 systems in which between-lineage variation does appear to be significant: analysis of HIV lineages within a single patient and analysis of esophageal cells taken from Barrett's esophagus (BE), a human neoplasm, again within a single patient.
In HIV, differences in mutation rate of up to 100x have been reported between different tissue compartments within a patient (Salemi et al. 2005
). The virus can potentially be replicated either by its own machinery or by the human DNA replication machinery, with a huge impact on mutation rate, and these effects may be lineage specific.
In BE, neoplastic (potentially precancerous) cells show an elevated rate of microsatellite shift mutation (although BE does not meet the formal criteria for a microsatellite instability syndrome), and this rate may depend on mutations in tumor suppressor genes such as p16 and p53 (Maley et al. 2004
). Because the neoplastic cells show an elevated mutation rate compared with normal cells, the assumption that their rate will subsequently be stable throughout the neoplastic process is hard to justify.
It is likely that other viral and neoplastic systems will also show between-lineage rate heterogeneity, and it is important to know how coalescent estimators will react to this. In this study, I examine estimation of
= 4Neµ, a fundamental population-genetics parameter, using a likelihood-based genealogy sampler embodied in the LAMARC program (Kuhner 2006
) and the summary-statistic method of Watterson (1975)
, which counts the number of segregating sites.
The
parameter is a composite of the effective population size Ne and the mutation rate µ. If µ varies,
reflects its average across the sample history. This average is not merely a summary statistic;
determines the population's carrying capacity for neutral genetic variation and thus its susceptibility to genetic drift. Data from fast-evolving organisms sampled at multiple time points could instead be used to make separate estimates of Ne and µ (Rodrigo and Felsenstein 1999
), but variation in µ remains an issue. Drummond et al. (2006)
have described a relaxed-clock estimator, which samples among different mutation rates, drawn from a prior, for each lineage. However, it is not yet clear what level of rate variation must exist before the loss of power in relaxed-clock estimation is justified. This paper explores the performance of a naive strict-clock estimator in the face of increasing clock violation.
| Methods |
|---|
|
|
|---|
Data Simulation
Coalescent genealogies with and without between-lineage mutation rate variation were simulated using the program rantree.c (provided by Felsenstein J, personal communication).
Underlying values of
were chosen to correspond to the polymorphism levels observed in BE microsatellite data and in HIV DNA sequence data. Data sets of the same size and polymorphism level should be about equally informative in coalescent estimation of
as long as the data are not close to mutational saturation.
To simulate BE-like data, I chose values of
that produced mutation counts similar to those in a BE microsatellite data set (Maley C, personal communication). I used 2 mutation counts, leading to 2 values of
: the first was based only on the earliest data sample from each patient and the second on all data samples. My simulated data were not a close approximation of BE microsatellite data, as they were 4-state DNA rather than many-state microsatellites. However, at these values of
multiple hits at the same position are rare, and microsatellite and DNA data of the same degree of polymorphism should be equally informative. To simulate HIV-like data, I used a published estimate of
= 0.1008 from Rodrigo et al. (1999)
. I did not attempt to fine-tune my mutational model for HIV as again I was interested mainly in producing data with the same level of polymorphism.
I used 10 lineages with 2,000 bp of simulated DNA each and simulated 100 independent genealogies and DNA data sets for each condition. For cases with rate heterogeneity, I assigned each branch either a "fast" or a "slow" rate, chosen to have a mean of
. I used rate combinations with a 1x (no rate heterogeneity), 2x, 4x, and 8x spread, as well as an "infinite x" case, where the low rate was zero.
I considered both cases in which rates were assigned independently to each branch and cases in which there was a correlation between the rate of ancestor and descendant branch. In independent cases, each branch had a 50% chance to be in the fast or slow condition. To test whether my results were due to the production of trees with a large imbalance between fast and slow branches, I also generated data with exactly equal numbers of fast and slow branches in each genealogy. For cases with rate correlation between ancestor and descendent branches, I modified rantree.c so that when a descendent branch chose its rate, it chose its ancestor's rate 50% of the time and chose a rate at random (which could lead to choice of the ancestor's rate) 50% of the time.
DNA data were simulated on the genealogies using the program treedna.c (provided by Felsenstein J, personal communication) under a Kimura 2-parameter model (Kimura 1980
) with a transition/transversion ratio of 2.0.
Data Analysis
was estimated using a coalescent maximum likelihood Markov chain Monte Carlo (MCMC) approach as implemented in LAMARC v2.0.2 (Kuhner 2006
). I used the F84 (Kishino and Hasegawa 1989
) mutational model (a generalization of the model under which the data were generated) with base frequencies determined from the data. Run conditions were 10 initial chains of 11,000 steps each and 2 final chains of 41,000 steps each, discarding the first 1,000 steps of each chain and then sampling every 20th step.
LAMARC constructs approximate confidence intervals (CIs) around its maximum likelihood estimate (MLE) by a likelihood ratio test that assumes a
2 distribution.
I evaluated the estimated MLE and approximate CIs by considering the "true" underlying
to be the mean of the high and low mutation rates. To further explore the algorithm's performance, in some cases I also computed an "adjusted"
that reflected the actual contribution of the fast and slow rates to the genealogy. The "adjusted"
was the weighted average of fast and slow rates, where the weights corresponded to the proportion of branch length assigned to each category. "Adjusted"
can be thought of as the true
of a given tree realization, as opposed to the true
of the distribution from which the tree was drawn.
I also used Watterson's (1975)
summary-statistic estimator, which uses the number of variable sites in the sample to estimate
. Watterson's calculation estimates
per locus; I divided by the sequence length to obtain
per site. CI boundaries were approximated by the method of Hudson (1990)
.
Although these run lengths have been successful in estimation of
in the past, it is possible that they are inadequate in cases with model violation. To test this, I reran the case of
= 0.1008 and "infinite x" rate heterogeneity with a more intensive search: 2 Metropolis-coupled or "heated" searches in addition to the original. The heated searches had initial "temperatures" of 2 and 4 (the acceptance probabilities were raised to the power of 1/temperature) and swapping was attempted between the chains every 10th step. Temperatures were varied adaptively every 200th step to keep the rate of acceptance for swaps between 10% and 40%. I estimated the effective sample size (ESS) for 5 of these data sets using Tracer (Rambaut and Drummond 2003
) as a diagnostic of adequate mixing.
Finally, to test whether there was a systematic length bias in clocklike reconstruction of nonclocklike genealogies, I used the previously simulated data sets for the 1x, 8x, and "inf x" independent rate variation cases with the high value of
. I treated the highest likelihood genealogy found by LAMARC v2.0.2 (run with the true value of
as its driving value) as an estimate of the underlying genealogy and compared the total branch length of the true and estimated genealogies. For this application, a single chain of 20,000 steps was used. In this experiment, LAMARC was being used as a maximum likelihood phylogeny estimation algorithm. Its results when applied this way are broadly similar to those from phylogeny programs such as PAUP* and DNAMLK (Kuhner MK and Yamato JA, unpublished data).
| Results |
|---|
|
|
|---|
Figure 1A shows LAMARC results from simulations in which no correlation was imposed between ancestor and descendent lineages. Results have been normalized so that the true
is taken as 1.0 in all cases. Conditions range from equal rates ("1x") to a low rate of zero ("inf x"). The final condition "inf x=" shows results when the numbers of high- and low-rate lineages were constrained to be equal. Error bars show approximate 95% CIs. Little effect of rate variation was seen except for a downward trend with maximal (inf x) rate variation. CIs were narrower for the high value of
than for the others.
|
Figure 1B shows results when a correlation coefficient of 0.5 was imposed on the simulation, so that 75% of descendent lineages had the same rate as their ancestor. This condition was expected to be more difficult to analyze because there was less opportunity for high and low branches to balance one another out across a tree. Again, little effect of rate variation was seen except for a downwards bias in MLE and a narrowing of CIs in the inf x case.
Figure 2A and B show mean results and mean CI boundaries for Watterson's (1975)
estimator. The CIs were systematically wider than LAMARC's, mainly in the upper tail. Results for high
showed a consistent downwards bias as expected because Watterson's estimator relies on an infinite-sites model that fits such data poorly. There was no visible effect of rate heterogeneity.
|
Table 1 shows the frequency with which each estimator rejected the true underlying value of
. I additionally measured, for the case of high
and inf x rate heterogeneity, the "adjusted"
of each simulated genealogy as an average of the fast and slow rates weighted by the amount of genealogy length governed by each. Rejection of this adjusted
is shown as "adjusted" in the table. Both methods tended to reject the truth too often (more than 10 times in 100 data sets, though note that this figure does not correct for multiple trials), when rate variation was 8x or more in the independent case or 4x or more in the correlated case. Forcing the genealogies to have exactly equal numbers of fast- and slow-rate branches reduced but did not eliminate this excess rejection. Adjusting the "true
" to correspond to the weighted proportion of fast and slow rates in each genealogy eliminated excess rejection for Watterson's estimator, but some excess was still seen for LAMARC.
|
Figures 3 and 4 show the 100 individual LAMARC results from the low and high
uncorrelated 1x, 8x, and "inf x" cases. They have been sorted by MLE value. Error bars show the 95% approximate confidence limits. Dashed lines indicate the true
(0.1008). Excess rejections in the nonclocklike compared with the clocklike case can be seen in both tails, though small estimates with narrow CIs contribute the most rejections.
|
|
Finally, table 2 tests the hypothesis that LAMARC's results with high rate variation may be biased due to systematic bias in reconstructing a nonclocklike tree as a clocklike tree, by comparing total branch length between true and inferred genealogies for the high
case. No major difference between true and inferred genealogies can be seen; the reconstructed genealogy was longer than the true one about half the time.
|
| Discussion |
|---|
|
|
|---|
LAMARC uses a MCMC algorithm and as such needs to be run long enough to produce an adequate sample. To test whether excess rejection of the truth might be due to inadequate mixing in the more difficult cases, I reran the
= 0.1008, "inf x" variability case of table 1 with 3 Metropolis-coupled Markov chains ("heating"). The number of rejections of the truth changed from 21 to 20 due to fluctuation in one borderline case. Because the LAMARC runs were likelihood based and not Bayesian, the ESS statistic of Rambaut and Drummond (2003)
). However, I computed an ESS based on the variation in the likelihood of the genealogy with regard to the data. In cases that estimate only
, my experience is that mixing of the data likelihood is a good proxy for mixing of the entire search. For 5 randomly sampled cases from the heated runs, the mean ESS was 7,990 and the minimum was 7,902, generously exceeding Rambaut and Drummond's suggested conservative cutoff of 200. Although convergence of the MCMC sampler can never be proven, there is no sign that more extensive searching would make a substantial difference in these results.
Both methods had good average recovery of
. Watterson's estimator showed its expected bias when
is too large for its infinite-sites assumption. (This could be partially overcome by substituting a parsimony-based count of mutations for the simple count of segregating sites. I did not pursue this possibility.) The LAMARC (2006) estimator has an accurate mean MLE estimate for all cases except the most extreme rate variation, where some downward bias of the MLE is seen. However, for both methods rejection of the truth was increased over the nominal 5% level for higher values of rate variation, especially in the presence of autocorrelation. This effect appeared significant (more than 10 rejections per 100 cases, cutoff established by Monte Carlo simulation) for 8x and higher rate variation without correlation and 4x and higher rate variation with correlation (table 1). The high
value was associated with a higher rate of rejection.
Does this excess rejection consist mainly of high or low estimates of
? Figures 3 and 4, which show detailed results for LAMARC in the uncorrelated 1x, 8x, and "inf x" cases, show that both high and low estimates rejected the truth. A few of the high-value runs were extremely high; however, the bulk of the rejections come from low-value runs with too-narrow CIs.
One possible explanation for this excess rejection is that some genealogies by chance contain a large number of high-rate or low-rate branches and thus naturally generate a high or low estimate, which will reject the "truth." At the limit, a genealogy whose branches are all high rate cannot be expected to allow accurate recovery of the mean rate. (It should, of course, allow accurate recovery of the high rate; but if the underlying population shows a mix of rates, inference of the high rate as the mean rate will mislead us about the population values.) To test this, I forced the simulation code to assign an equal number of high and low rates for each genealogy ("inf x (equal proportions)"). This greatly reduced but did not eliminate excess rejection.
A genealogy with an equal number of high- and low-rate branches could still be mainly high rate if the low-rate branches are short tipward branches and the high-rate branches are long rootward branches. To test this, I computed for each genealogy a weighted average of the high and low rates, weighting by the total branch length governed by each rate. I then used each genealogy's weighted average rate as the "true rate" in assessing inclusion or exclusion of the truth ("inf x (adjusted)"). This adjustment (which would not, of course, be possible in real data analysis) eliminated excess rejection for Watterson's estimator, whereas it showed reduced but still significant excess rejection for LAMARC. It can be thought of as answering the question "Did we correctly infer the
of this specific genealogy realization?" rather than "Did we correctly infer the
of the distribution from which this genealogy was drawn?"
LAMARC's evolutionary model assumes equal-rate coalescent genealogies. As rate discrepancies among lineages increase, the true underlying genealogy becomes less and less like an equal-rate coalescent, and LAMARC may be misled by this discrepancy. I hypothesized that the clocklike reconstructions on which LAMARC relies might become systematically shorter than the nonclocklike genealogies they were attempting to reconstruct when rate heterogeneity was extreme. This was not the case, as shown in table 2. Although LAMARC, which is constrained to consider only clocklike trees, cannot be correctly recovering these strongly nonclocklike genealogies, this did not produce a large bias in recovered branch lengths that could explain LAMARC's depressed MLE and excess rejection.
Despite the observed bias, LAMARC's CIs are robust to at least a 2x difference among rates, and its MLE is biased only with the most drastic rate difference (a rate of zero for the slow lineages). With these values of
and sample size, a rate difference of 2x is difficult to detect with standard tests such as the likelihood ratio test, but a rate difference of 4x or greater can be detected fairly reliably (Kuhner MK, in preparation). This suggests that estimation of
should be paired with a test for rate heterogeneity such as a likelihood ratio test (Felsenstein 1981
) or parametric bootstrap (Goldman 1993
). If the test is nonsignificant, these methods can be used safely, whereas if significant heterogeneity is detected, their CIs may be too narrow. LAMARC's lower variance and ability to handle higher values of
than Watterson's estimator suggest that it should be preferred, except possibly when rate variation is extreme.
In cases where rate variation is substantial enough to disrupt estimates of
that rely on the molecular clock assumption, the relaxed-clock approach of Drummond et al. (2006)
should be preferred. However, the current results suggest that the additional complexity of a relaxed clock may not be justified if rate variation is modest. Drummond et al. (2006)
find that when there is little variation in rates, methods that assume a clock can be more powerful than relaxed-clock or nonclock methods.
| Acknowledgements |
|---|
|
|
|---|
I thank Jon Yamato for conducting the computer simulations, Jon Seger for providing code-implementing confidence bounds for Watterson's estimate, Eric Rynes for proofreading, and Carlo Maley and Marco Salemi for drawing my attention to the motivating data sets. I also thank the anonymous reviewers for their helpful comments. This work was supported by the National Institutes of Health grant 5R01GM51929-11.
| Footnotes |
|---|
John H. Mcdonald, Associate Editor
| References |
|---|
|
|
|---|
Drummond AJ, Ho SYW, Phillips MJ, Rambaut A. (2006) Relaxed phylogenetics and dating with confidence. PLoS Biol 4:e88.[CrossRef][Medline]
Drummond AJ, Nicholis GK, Rodrigo AG, Solomon W. (2002) Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 161:13071320.
Felsenstein J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368376.[CrossRef][Web of Science][Medline]
Goldman N. (1993) Statistical tests of models of DNA substitution. J Mol Evol 36:182198.[CrossRef][Web of Science][Medline]
Griffiths RC and Tavaré S. (1993) Sampling theory for neutral alleles in a varying environment. Proc R Soc Lond B Biol Sci 344:403410.
Hudson RR. (1990) Gene genealogies and the coalescent process. Oxf Surv Evol Biol 7:144.
Kimura M. (1980) A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111120.[CrossRef][Web of Science][Medline]
Kishino H and Hasegawa M. (1989) Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data and the branching order in Hominoidea. J Mol Evol 31:151160.[CrossRef]
Kuhner MK. (2006) LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics 22:6768770.
Kuhner MK, Yamato J, Felsenstein J. (1995) Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140:14211430.[Abstract]
Maley CC, Galipeau PC, Li X, Sanchez CA, Paulson TG, Reid BJ. (2004) Selectively advantageous mutations and hitchhikers in neoplasms. Cancer Res 64:34143427.
Nielsen R and Wakeley J. (2001) Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics 158:885896.
Rambaut A and Drummond AJ. (2003) Tracer v1.3 [Internet]. (University of Oxford [cited 2006 April 26], Oxford (UK)) Available from: http://evolve.zoo.ox.ac.uk/.
Rodrigo AG and Felsenstein J. (1999) Coalescent approaches to HIV population genetics. In Crandall K (Ed.). Molecular evolution of HIV(Johns Hopkins University Press, Baltimore (MD)) pp. 233272.
Rodrigo AG, Shpaer EG, Delwarts EL, Iversen AKN, Gallo MV, Brojatsch J, Hirsch MS, Walker BD, Mullins JI. (1999) Coalescent estimates of HIV-1 generation time in vivo. Proc Natl Acad Sci USA 96:21872191.
Salemi M, Lamers SL, Yu S, de Oliveira T, Fitch WM, McGrath MS. (2005) Phylodynamic analysis of human immunodeficiency virus type 1 in distinct brain compartments provides a model for the neuropathogenesis of AIDS. J Virol 79:1134311352.
Watterson GA. (1975) On the number of segregating sites in genetical systems without recombination. Theor Popul Biol 7:256276.[CrossRef][Web of Science][Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



