MBE Advance Access originally published online on May 7, 2007
Molecular Biology and Evolution 2007 24(8):1639-1655; doi:10.1093/molbev/msm081
Research Articles |
Fair-Balance Paradox, Star-tree Paradox, and Bayesian Phylogenetics
Department of Biology, Galton Laboratory, University College London, London, United Kingdom
E-mail: z.yang{at}ucl.ac.uk.
| Abstract |
|---|
|
|
|---|
The star-tree paradox refers to the conjecture that the posterior probabilities for the three unrooted trees for four species (or the three rooted trees for three species if the molecular clock is assumed) do not approach 1/3 when the data are generated using the star tree and when the amount of data approaches infinity. It reflects the more general phenomenon of high and presumably spurious posterior probabilities for trees or clades produced by the Bayesian method of phylogenetic reconstruction, and it is perceived to be a manifestation of the deeper problem of the extreme sensitivity of Bayesian model selection to the prior on parameters. Analysis of the star-tree paradox has been hampered by the intractability of the integrals involved. In this article, I use Laplacian expansion to approximate the posterior probabilities for the three rooted trees for three species using binary characters evolving at a constant rate. The approximation enables calculation of posterior tree probabilities for arbitrarily large data sets. Both theoretical analysis of the analogous fair-coin and fair-balance problems and computer simulation for the tree problem confirmed the existence of the star-tree paradox. When the data size n → ∞, the posterior tree probabilities do not converge to 1/3 each, but they vary among data sets according to a statistical distribution. This distribution is characterized. Two strategies for resolving the star-tree paradox are explored: (1) a nonzero prior probability for the degenerate star tree and (2) an increasingly informative prior forcing the internal branch length toward zero. Both appear to be effective in resolving the paradox, but the latter is simpler to implement. The posterior tree probabilities are found to be very sensitive to the prior.
Key Words: Lindley's paradox fair-balance paradox star-tree paradox prior clade probabilities
| Introduction |
|---|
|
|
|---|
Thanks to the implementation of efficient Markov chain Monte Carlo (MCMC) algorithms in the computer program MrBayes (Huelsenbeck and Ronquist 2001
In a simulation study, Suzuki, Glazko, and Nei (2002) generated data sets under the star tree for four species and analyzed them using MrBayes, which considers binary trees only. They found that the posterior probability for the inferred binary tree was often too high. The study used a wrong and simplistic model in the analysis, so that the problem was due in part to model violation. However, extreme posterior probabilities were observed in similar simulations without model violation (Cummings et al. 2003; Lewis, Holder, and Holsinger 2005; Yang and Rannala 2005). The failure of the posterior probabilities for the three binary trees to converge to 1/3 in large data sets simulated under the star tree is somewhat counterintuitive and is called the star-tree paradox (Lewis, Holder, and Holsinger 2005). The concern is not so much that the posterior tree probabilities differ from 1/3 as that they are sometimes either very small or very large when in fact no information is available to resolve the tree one way or another.
The posterior probability for a tree is the probability that the tree is true given the data, the prior, and the likelihood (substitution) model. There are thus three possible reasons for high tree probabilities: (1) errors, including numerical problems in the MCMC algorithm, which cause the posterior probabilities to be calculated incorrectly; (2) misspecification of the substitution model; and (3) misspecification and sensitivity of the prior. The first two reasons may be responsible for high posterior probabilities in some studies. In particular, use of simplistic and unrealistic models is known to inflate posterior probabilities for trees (e.g., Buckley 2002; Lemmon and Moriarty 2004; Huelsenbeck and Rannala 2004). However, high posterior probabilities have also been observed when the first two reasons clearly do not apply (Yang and Rannala 2005). This article deals with the third reason and studies the effect of prior specification on Bayesian phylogenetic inference.
The nature of the problem may be better understood by considering the analogous fair-coin problem (Lewis, Holder, and Holsinger 2005; Yang and Rannala 2005). Suppose a coin is fair, with the probability of heads θ0 = 1/2. We flip the coin n times and observe y heads. We then calculate the posterior probabilities (P– and P+) for two models in which the coin is either negatively or positively biased: H–: θ < 1/2 and H+: θ > 1/2. (It is inconsequential whether the true value θ = 1/2 is included in none, one, or both of the two models since a point value has zero probability in a continuous distribution.) We assign equal prior probabilities for H– and H+ and uniform priors for θ in each model. When n is large, we may expect P– and P+ to approach 1/2, but they do not. Instead P– varies considerably among data sets (all generated under θ0 = 1/2) even when n → ∞. This is referred to as the fair-coin paradox (Lewis, Holder, and Holsinger 2005). Indeed, the limiting distribution of P– when n → ∞ is the uniform U(0, 1) (Yang and Rannala 2005, equation 5). Figure 1 shows the histograms of P– when n = 10³ and 10⁶. Intuitively, even though the proportion of heads y/n becomes closer and closer to 1/2 when n increases, the number of heads y fluctuates around n/2 more and more wildly among data sets. Note that the variance of y/n is 1/(4n), and the variance of y is n/4. The posterior probability P– depends on the number as well as the proportion of heads.
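The fair-coin behavior described above can be checked with a short simulation. The sketch below is illustrative only and is not the computation used in the article: the function names are hypothetical, and it relies on the standard identity that, with equal model priors and uniform priors within each model, the posterior of the probability of heads is Beta(y + 1, n – y + 1), whose c.d.f. at 1/2 equals P(X ≥ y + 1) for X ~ Binomial(n + 1, 1/2).

```python
import random
from itertools import accumulate
from math import comb


def tail_table(n):
    """Return tails with tails[k] = sum over j >= k of C(n + 1, j), k = 0..n+1."""
    coeffs = [comb(n + 1, j) for j in range(n + 2)]
    # Accumulate from the right end, then restore the original order.
    return list(accumulate(reversed(coeffs)))[::-1]


def p_minus(y, n, tails=None):
    """Posterior probability of H-: theta < 1/2, given y heads in n tosses.

    Assumes equal prior model probabilities and uniform priors within each
    model, so that theta | y ~ Beta(y + 1, n - y + 1).
    """
    if tails is None:
        tails = tail_table(n)
    # Beta c.d.f. at 1/2 written as an exact binomial tail probability.
    return tails[y + 1] / 2 ** (n + 1)


def simulate(n, reps, seed=1):
    """Draw reps data sets from a fair coin and return P- for each one."""
    rng = random.Random(seed)
    tails = tail_table(n)  # computed once and reused for every replicate
    values = []
    for _ in range(reps):
        y = bin(rng.getrandbits(n)).count("1")  # y ~ Binomial(n, 1/2)
        values.append(p_minus(y, n, tails))
    return values


if __name__ == "__main__":
    vals = simulate(1000, 2000)
    # Count replicates in ten equal-width bins; a roughly flat profile is
    # what a near-uniform distribution of P- would look like.
    bins = [0] * 10
    for v in vals:
        bins[min(int(v * 10), 9)] += 1
    print(bins)
```

The exact binomial-tail form avoids any numerical integration, which keeps the sketch within the standard library; for much larger n a normal approximation would be the more practical choice.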
|
One has to consider how a sensible Bayesian analysis should behave in this problem. In a significance test, the P value has a uniform distribution U(0, 1) if the null hypothesis is true and the test is exact. The true null hypothesis is falsely rejected 5% of the time if the test is conducted at the 5% significance level. This is the case even with infinitely large data sets, if a fixed significance level is used. However, Bayesian statistics is a more "optimistic" and "aggressive" methodology (Efron 1998
θ0 = 1/2, one may sensibly expect P– and P+ to converge to 1/2 when n → ∞. Of course, P– should converge to 1 if θ0 < 1/2 (or to 0 if θ0 > 1/2). For the tree problem, the same argument suggests that if the true tree is the star tree, one would like the posterior probabilities for the three binary trees to converge to 1/3 each when the number of sites n → ∞.
. Here I take this position, as did Lewis, Holder, and Holsinger (2005)
, because problems of phylogeny reconstruction are intractable analytically. Numerical calculation of integrals becomes unreliable in large data sets while MCMC algorithms are too slow and too imprecise.
In this article I develop approximate methods to calculate the posterior probabilities (P1, P2, P3) for the three rooted trees for three species, using data of binary characters evolving at a constant rate. This is the simplest tree-reconstruction problem (Yang 2000), chosen here to make the analysis possible. The approximation allows Bayesian calculation in arbitrarily large data sets, without the need for MCMC algorithms. I conduct large-scale simulations, which confirm the existence of the star-tree paradox; when the data size n increases, the posterior tree probabilities do not converge to 1/3 each, but continue to vary among data sets according to a statistical distribution. This distribution is characterized. I then explore the sensitivity of Bayesian analysis to the prior and evaluate two strategies suggested to resolve the star-tree paradox. The first assigns a nonzero prior probability for the degenerate star tree (Lewis, Holder, and Holsinger 2005), and the second uses a prior to force the internal branch lengths to approach zero when n → ∞ (Yang and Rannala 2005). The behavior of posterior tree probabilities in large data sets is predicted by drawing an analogy with the fair-coin problem, and the predictions are confirmed numerically by computer simulation.
A synopsis is provided in the next section, which summarizes the major results of this study. The biologist reader may read this section, as well as the Discussion, and skip the Mathematical Analysis section.
| Biological Synopsis |
|---|
|
|
|---|
The Fair-coin and Fair-balance Problems
The fair-coin problem, as described above, has the same behavior as the fair-balance problem discussed by Yang and Rannala (2005)
θ ∼ beta(α, α), with mean 1/2 and variance 1/(8α + 4). This is the U(0, 1) prior when α = 1 but can be highly concentrated around 1/2 if α is large. As long as α is fixed, the posterior probability P– for the model of negative bias approaches the uniform distribution U(0, 1) when the number of coin tosses n → ∞.
Two strategies (priors) are considered to resolve the fair-coin paradox. In the first, α in the beta prior increases with n so that the prior variance of θ approaches 0, forcing θ to be more and more highly concentrated around 1/2. We require that P– approach 1/2 if the coin is fair, and 1 if the coin has a negative bias (or 0 if the coin has a positive bias). These requirements mean that the prior variance for θ should approach 0 faster than 1/n and more slowly than 1/n². In the second, a nonzero prior probability is assigned to the degenerate model of no bias H0: θ = 1/2. Then the posterior probability for H0 approaches 1 when n → ∞, and the method behaves as desired.
The Star-tree Problem
Defining the Problem
The three binary rooted trees for three species are shown in figure 2. The data are three sequences of binary characters, which are assumed to be evolving at a constant rate (that is, under the molecular clock) (Yang 2000). The data can be summarized as counts n0, n1, n2, n3 of site patterns xxx, xxy, yxx, and xyx, where x and y are any two distinct characters, while the total number of sites is n = n0 + n1 + n2 + n3. Each binary tree has two branch length parameters t0 and t1, measured by the expected number of changes per site. Intuitively, we can see the three variable patterns xxy, yxx, and xyx "support" the three binary trees τ1, τ2, and τ3, respectively. Indeed a likelihood analysis will choose tree τ1 as the maximum-likelihood tree if n1 is greater than both n2 and n3. Let p0, p1, p2, p3 be the expected site pattern probabilities, with p0 + p1 + p2 + p3 = 1. Then tree τ1 can be represented by p0 > p1 > p2 = p3, with two free parameters, whereas the star tree is p0 > p1 = p2 = p3 (Yang 2000). In a Bayesian analysis, we assign equal probabilities 1/3 to the three binary trees, and exponential priors with means µ0 and µ1 on the two branch lengths t0 and t1 in each binary tree (fig. 2).
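The relation between branch lengths and site-pattern probabilities can be made concrete with a small sketch. It is not taken from the article: it assumes a symmetric two-state substitution model scaled so that a branch of length t carries t expected changes per site, and it assumes that the tree grouping species 1 and 2 has internal branch t0 and tip branches t1 for species 1 and 2 (figure 2 is not reproduced here). The function names are hypothetical.

```python
from math import exp


def pattern_probs(t0, t1):
    """Return (p0, p1, p2, p3) for patterns xxx, xxy, yxx, xyx on ((1, 2), 3).

    Under the assumed symmetric two-state model, two sequences separated by
    path length d agree with probability (1 + exp(-2 d)) / 2.
    """
    c12 = exp(-4.0 * t1)         # species 1 and 2 are 2 * t1 apart
    c3 = exp(-4.0 * (t0 + t1))   # species 3 is 2 * (t0 + t1) from 1 and from 2
    p0 = (1.0 + c12 + 2.0 * c3) / 4.0   # xxx: all three identical
    p1 = (1.0 + c12 - 2.0 * c3) / 4.0   # xxy: species 3 differs
    p2 = (1.0 - c12) / 4.0              # yxx: species 1 differs
    p3 = p2                             # xyx: species 2 differs
    return p0, p1, p2, p3


def ml_tree(n1, n2, n3):
    """Return 1, 2, or 3 for the tree whose pattern count is strictly largest.

    Returns None when the largest count is tied, in which case no binary
    tree is preferred by this simple rule.
    """
    counts = [n1, n2, n3]
    best = max(counts)
    if counts.count(best) > 1:
        return None
    return counts.index(best) + 1


if __name__ == "__main__":
    print(pattern_probs(0.1, 0.2))   # binary tree: p0 > p1 > p2 = p3
    print(pattern_probs(0.0, 0.2))   # star tree:   p0 > p1 = p2 = p3
    print(ml_tree(40, 31, 29))
```

Setting t0 = 0 collapses the internal branch, which is why the star tree appears as the boundary case p1 = p2 = p3 of each binary tree.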
|
Star-tree Paradox
Posterior probabilities for the three binary trees (P1, P2, P3) were calculated from data sets simulated under the star tree, with n = 3 × 10³, 3 × 10⁶, or 3 × 10⁹ sites in the sequence. It is found that (P1, P2, P3) does not converge to (1/3, 1/3, 1/3) with the increase of n, confirming the star-tree paradox. Instead (P1, P2, P3) varies among data sets, according to a distribution f(P1, P2, P3), which is independent of the branch length t in the star tree and of the prior means µ0 and µ1 (see fig. 7 below). There are four modes in the distribution, such that in most data sets, either the three probabilities are all close to 1/3, or one of them is close to 1 and the other two are close to 0. Suppose we consider very high and very low posterior probabilities for binary trees as "errors" since the true tree is the star tree. In 4.2% (or 0.8%) of data sets, at least one of the three posterior probabilities is > 0.95 (or > 0.99), and in 17.3% (or 2.6%) of data sets, at least one of the three posterior probabilities is < 0.05 (or < 0.01). Those "error" rates appear too high, given that the data sets are arbitrarily large and are supposed to represent infinite data sets.
|
Two Strategies to Resolve the Star-tree Paradox
Further analysis of the tree problem is through an analogy with the fair-coin problem. Note that the fair-coin and fair-balance problems are analytically tractable, but the tree problem is not. My analysis of the tree problem is thus numerical verification by computer simulation, in which only a finite number of replicate data sets can be generated and each data set can only be of finite size. To see the analogy, it is more convenient to consider the site pattern probabilities as parameters in each binary tree instead of branch lengths t0 and t1. In the fair-coin problem, the data have a binomial distribution or multinomial distribution with two cells (corresponding to heads and tails). The two models of negative and positive bias assume that one cell probability is greater than the other, yet the truth (the fair-coin model) is that they are equal. In the star-tree problem, the data have a multinomial distribution with four cells (corresponding to the four site patterns). We compare three binary-tree models, which assume that one of three cell probabilities (for the three variable site patterns) is greater than the other two and that these other two are equal. The truth (the star tree) is that all three cell probabilities are equal. In other words, the three binary trees are represented by τ1: p1 > p2 = p3, τ2: p2 > p3 = p1, and τ3: p3 > p1 = p2, while the true star tree is τ0: p1 = p2 = p3. (The probability p0 for the constant pattern may be considered an unimportant nuisance parameter, shared by all four trees.) Both the proportions of heads and tails in the fair-coin problem and the proportions of the site patterns in the tree problem converge to their expected probabilities, with variances proportional to 1/n.
We apply the same two strategies as discussed above for the fair-coin problem to resolve the star-tree paradox. The first uses a prior on parameters in the model to force the binary tree to converge to the star tree, or to force the three cell probabilities p1, p2, p3 to approach equality (p1 = p2 = p3), when n → ∞. From the analysis of the fair-coin problem, the prior should force E(p1 – p2)² to approach 0 faster than 1/n but more slowly than 1/n². This means, as seen by translating the prior on cell probabilities into a prior on branch lengths t0 and t1, that the mean µ0 in the exponential prior for the internal branch length t0 should approach 0 faster than 1/√n but more slowly than 1/n. This prediction is only partially confirmed. Simulations confirm that to resolve the star-tree paradox (that is, for (P1, P2, P3) to converge to (1/3, 1/3, 1/3) if the star tree is the true tree), µ0 should approach 0 faster than 1/√n. Numerical problems (see later) have prevented confirmation that µ0 should approach 0 more slowly than 1/n for P1 to converge to 1 if tree τ1 is the true tree.
The second strategy assigns a nonzero prior probability π0 for the degenerate star tree (p1 = p2 = p3). Simulations confirm that when n → ∞, the posterior probability for the star tree approaches 1, and this prior indeed resolves the star-tree paradox. This result is expected from previous theoretical work. Indeed Dawid (1999) has studied the asymptotics of Bayesian model selection when the data size n → ∞. If all models considered in the Bayesian analysis are wrong, the probability for the model closest to the truth, as measured by the Kullback-Leibler divergence, approaches 1. If one model is correct and all others are wrong, the probability for the true model approaches 1. If several models are true, the probability for the true model with the fewest parameters approaches 1. The case where several models of the same dimension are true is not well specified. Dawid's proof assumes that the parameters are unbounded while here the star tree is at the boundary of the parameter space of the binary trees. However, the qualitative conclusions appear applicable to the tree problem. Here the data are generated under the star tree, so that all four trees are correct, but the star tree has one fewer parameter, and its posterior probability approaches 1.
| Discussion |
|---|
|
|
|---|
Does the Star-tree Paradox Exist?
Kolaczkowski and Thornton (2006)
First, KT06 simulated data sets with up to n = 10⁷ sites using a star tree of four species, with all four branch lengths equal. The data were analyzed using MrBayes to calculate posterior probabilities (P1, P2, P3) for the three binary unrooted trees without assuming the molecular clock. All five branch lengths in each binary tree are assigned the uniform prior U(0, 10). The variance in the posterior probability for a binary tree, say P1, was initially small, but increased with the increase of n to a stable value of about 0.06 for n of roughly 10³ or more (KT06, fig. 1b). The corresponding standard deviation (SD), √0.06 ≈ 0.24, is about the same as that obtained in this article for rooted trees of three species (0.2498; see figure 8a below). It is likely that these two values are indeed identical and that the three-species problem of figure 2, studied here, and the four-species problem with equal branch lengths in the star tree, studied by KT06, produce the same limiting distribution f(P1, P2, P3). It is also likely that the distribution in the four-species case is similarly independent of the branch length used in the star tree and the upper bound in the uniform prior for branch lengths in the binary trees. It would be interesting to know whether this invariance holds also when the four branches in the star tree have different lengths. At any rate, the failure of P1 to converge to 1/3 confirms the star-tree paradox. KT06 appeared to have mistaken a stable variance for zero variance when they claimed that their results disproved the star-tree paradox, and they were incorrect to conclude that "With infinite data, posterior probabilities give equal support for all resolved trees, and the rate of false inferences falls to zero." KT06 emphatically criticized the speculation of Lewis, Holder, and Holsinger (2005) that "Bayesian analyses become increasingly unpredictable" with the increase of data size when the true tree is the star tree. Technically, this speculation is confirmed rather than refuted by the result of KT06 (and by the results of this study), as the variance of P1 continues to increase with n, even though the amount of increase approaches zero (KT06, fig. 1b). Clearly, the variance cannot increase without limit, the absolute maximum being 2/9 (with an SD of √(2/9) = 0.4714), achieved if the posterior probabilities (P1, P2, P3) take only three sets of values, each with probability 1/3: (1, 0, 0), (0, 1, 0), and (0, 0, 1).
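The bound of 2/9 quoted above follows from simple arithmetic, which the following sketch spells out with exact fractions. It only restates the calculation in the text; the variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt

# In the extreme case, (P1, P2, P3) is (1, 0, 0), (0, 1, 0), or (0, 0, 1),
# each with probability 1/3, so P1 is 1 once and 0 twice.
p1_values = [Fraction(1), Fraction(0), Fraction(0)]
weight = Fraction(1, 3)

mean = sum(weight * v for v in p1_values)                     # E(P1) = 1/3
variance = sum(weight * (v - mean) ** 2 for v in p1_values)   # Var(P1) = 2/9
sd = sqrt(variance)                                           # about 0.4714

# For comparison, a stable variance of about 0.06 gives an SD of about 0.24.
sd_kt06 = sqrt(0.06)

if __name__ == "__main__":
    print(mean, variance, round(sd, 4), round(sd_kt06, 2))
```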
|
Second, KT06 examined the so-called type-I error rate in finite data sets of 5,000 sites, and found that when the true tree is the star tree, the posterior probability for a binary tree is > 95% (or > 99%) in less than 5% (or 1%) of data sets. The same pattern holds also for rooted trees in this study, although the posterior probability for a binary tree is < 5% (or < 1%) in more than 5% (or 1%) of data sets, as mentioned above. It is debatable whether such "error" rates are acceptable if they persist in arbitrarily large data sets. While it is appropriate to study so-called Frequentist properties of a Bayesian method, KT06 confused Bayesian posterior probabilities with Frequentist P values when they claimed that "posterior probabilities never produce strong support for incorrectly resolved phylogenies more often than they should." Bayesian statistics in general does not provide a guarantee of its performance under Frequentist criteria. KT06 also claimed that the "type-I" error rate decreased when n increased from 10³ to 10⁷ (KT06, fig. 2b). This result is inconsistent with the present study and appears to contradict their finding of an increasing and asymptotically stable variance in P1. The result may be due to numerical problems in the MCMC algorithms in the analysis of KT06.
Third, KT06 used MrBayes to analyze a data set consisting of the expected probabilities of the site patterns calculated under the star tree. This "infinite" data set gave 1/3 as the posterior probability for each binary tree. However, analysis of this average site is not meaningful, as it ignores the variation among data sets and the fact that the number of sites as well as the proportions of site patterns influences Bayesian analysis. In the fair-coin problem, the data set consisting of half heads and half tails would produce P– = P+ = 1/2, but this average coin toss tells us nothing about the behavior of the Bayesian method when n → ∞ (see fig. 1).
The position of KT06 toward the star-tree paradox is marred by errors in the analysis. The paradox concerns the performance of the Bayesian method in large or infinite data sets, so that finite data sets are not the real issue. Nevertheless the "error" rates in finite data sets are higher than KT06 suggested, because the method produced very small posterior probabilities too often (see above). KT06 expected the "error" rate to reduce to zero when the data size n → ∞, with the posterior tree probabilities approaching 1/3. This is the behavior of a sensible Bayesian analysis assumed by Lewis, Holder, and Holsinger (2005) and Yang and Rannala (2005), although KT06 failed to realize that the Bayesian method does not behave in this way.
Priors and Bayesian Phylogenetics
It is a curious fact that to resolve the fair-coin paradox, the prior probability π0 on the degenerate model of fair coin (H0: θ = 1/2) can be constant and independent of the data size, while the prior on parameter θ (the probability of heads) has to be increasingly concentrated around θ = 1/2, depending on the data size n. The difference appears to be due to the fact that any point mass has probability zero in a non-degenerate continuous distribution. Nevertheless, both may be viewed as priors on parameter θ in models of negative and positive biases (H– and H+) without considering H0 in the analysis. The degenerate-model prior is equivalent to assigning a mixture distribution on θ, with a component at θ = 1/2 in proportion π0 and another component from a continuous distribution in proportion 1 – π0. Similarly the star-tree prior π0 is equivalent to a mixture distribution for internal branch lengths in binary trees (with the star tree excluded from the Bayesian analysis), with a component at zero in proportion π0 and a component from the continuous exponential distribution in proportion 1 – π0. Implementation of the data size-dependent prior is simpler as it requires only a change to the prior mean for internal branch lengths (Yang and Rannala 2005). The star-tree prior is more complex because bifurcating and multifurcating trees have different numbers of branch length parameters so that algorithms such as reversible jump MCMC (Green 1995) are needed to deal with models of different dimensions (Lewis, Holder, and Holsinger 2005).
Both the star-tree prior and the data size-dependent prior may be criticized. Whether truly simultaneous speciation events ever occur in nature is debatable, and if they do not, assigning a prior probability to a model known to be false runs into a conceptual difficulty. Similarly, the use of data size (although not the data themselves) for prior specification may appear non-Bayesian. The prior is supposed to reflect information concerning the parameter before the data are analyzed and should ideally be independent of the data. Nevertheless, this ideal is often hard to achieve in "objective" Bayesian statistics when little information is available about the parameter. Both Jeffreys's prior (Jeffreys 1961) and the reference prior (Bernardo 1979) depend on the likelihood function or the experimental design. One may ask why one's prior ignorance concerning a parameter should depend on how one conducts the experiment to find out about the parameter. An extreme case is Bernardo's (1980) use of the data (not just data size) to specify the prior, although the idea did not appear to be warmly received in the ensuing discussions. Data size-dependent priors were discussed by Bartlett (1957), Davison (2003, pp. 586–587), and Cox (2006, pp. 42–43, 106–107), as a possible way of resolving Lindley's paradox (see below). One may argue that if data sets of such large sizes are needed to resolve the tree, the internal branch must be very short, so that it may be sensible to assume increasingly shorter internal branches in the prior in larger data sets. Yang and Rannala (2005) also discussed the use of empirical estimates of internal branch lengths from real data sets to specify the prior, and pointed out that almost all of the possible phylogenetic trees are wrong, and that most internal branch lengths in wrong trees are estimated to be zero.
The biologist reader should be aware that there have been longstanding fundamental disagreements among statisticians concerning principles of statistical inference. In particular, model selection is a difficult area for both Bayesian and Frequentist statistics, and it is also an area where the two approaches can draw very different conclusions from the same data. A brief overview of this controversy is provided in Yang (2006, section 5.1.3). As phylogeny reconstruction unfortunately falls into this class of difficult statistical inference problems (e.g., Yang et al. 1995), biologists may have to think about what constitutes a sensible behavior in a Bayesian phylogenetic analysis. Six decades ago, Egon S. Pearson (1947) wrote that "Hitherto the user has been accustomed to accept the function of probability theory laid down by the mathematicians; but it would be good if he could take a larger share in formulating himself what are the practical requirements that the theory should satisfy in application." This advice may be useful even today.
Almost all controversies surrounding Bayesian inference concern the prior, which is also the focus of this study. For the tree problem, one may take the position that the prior implemented in current computer programs is appropriate and then accept whatever properties Bayesian inference under the prior possesses. The expectation for the posterior tree probabilities to approach 1/3 when n → ∞ is then seen as false intuition and no paradox remains. This position may be natural to some Bayesian statisticians. Another position is to judge the method by its statistical properties. In the "objective" Bayesian method, it is a common practice to specify the prior such that the resulting Bayesian inference is deemed reasonable (e.g., Jeffreys 1961). I have taken that position in this study, motivated by the observation that the posterior tree probabilities are often too extreme. Two a priori criteria are set up: (1) if the true tree is the star tree, the probabilities for the three binary trees should approach 1/3 when n → ∞, and (2) if a binary tree is the true tree, its posterior probability should approach 1. Two strategies of prior specification are then found to meet those criteria.
Based on the perception that the posterior probabilities for trees or clades are often too high, some authors (e.g., Suzuki, Glazko, and Nei 2002; Simmons, Pickett, and Miya 2004) argued that the Bayesian posterior probabilities for trees or clades are not trustworthy, and alternative methods such as the bootstrap should be used to assess the reliability of estimated trees. Similarly Douady et al. (2003) suggested the bootstrapped Bayesian analysis, in which the Bayesian method is used to analyze bootstrap pseudo-data sets. The method then involves prohibitive computation and is also a strange mix of Bayesian and Frequentist methodologies. Instead, we consider the default priors implemented in current computer programs to be inappropriate and attempt to specify better priors to produce more reasonable posterior tree probabilities. Lewis, Holder, and Holsinger (2005) also emphasized the fact that a number of realistic evolutionary models have been implemented in the MrBayes program, making the method an attractive option for analyzing ever-increasing genetic data sets.
| Mathematical Analysis |
|---|
|
|
|---|
Bayesian Analysis of the Fair-balance Problem
The Fair-balance Problem
In the fair-coin problem, one may also assign an informative prior on the probability of heads: θ ∼ beta(α, α). The beta distribution with α > 1 has a mode at 1/2, so that the coin is more likely to be nearly even than seriously biased. For large α, beta(α, α) can be approximated by a normal distribution with mean 1/2 and variance 1/(8α + 4). The likelihood, given by the binomial probability of the number of heads y ∼ bi(n, θ), is approximated by the normal density y/n ∼ N(θ, 1/(4n)). The posterior θ|y ∼ beta(y + α, n – y + α) can be approximated by the normal distribution with mean
is fixed.) Thus we redefine
–
as the parameter, use the normal distribution to approximate the prior, the likelihood, and the posterior; and restate the problem as the following fair-balance problem. Suppose the data consist of n independent observations y1, y2, ..., yn, with yi
N(
,
2), where
is unknown and
2 is known. The yis may be measurement errors on a balance. Let
be the sample mean. The two models are then H–:
< 0 and H+:
> 0. In the prior we assign equal probabilities (
) for each model, and
N(0, 
2), truncated to the appropriate range in each model.
The posterior of
is then given by
|
, from which one can get the posterior probability for model H– as
![]() | (1) |
(·) is the cumulative distribution function (c.d.f.) of the standard normal distribution (Yang and Rannala 2005
Suppose the true parameter is
0. As
varies among data sets as
N(
0,
2/n), P– has the density
![]() | (2) |
–1(·) is the inverse c.d.f. of the standard normal distribution (Yang and Rannala 2005
, and
If the balance is fair and the true parameter
0 = 0, equation (2) becomes
![]() | (3) |
. If
is a constant, we have n
and f(P–)
1 when n
, so that P– converges to the uniform distribution U(0, 1) (see fig. 1). This is called the fair-balance paradox (Yang and Rannala 2005
when n
, but it fails to do so.
The Data Size-Dependent Prior
One of the ideas suggested in the discussions of Lindley's paradox (Lindley 1957
, see below) is to let the prior be increasingly informative with the increase of the data size. Consider
= c/n
as a prior for
, as a possible way for resolving the fair-balance paradox. From equation (3), it is clear that if 0 <
< 1, P– still converges to U(0, 1), even though this prior forces
to be closer and closer to 0 with the increase of n, converging to a point mass at
= 0 in the limit. If
= 1 so that
= c/n, f(P–) peaks at P– =
, but the distribution does not degenerate to a point mass at
(equation 3). Figure 3 shows a few densities when c = n
= 0.1, 1, and 2. Note that in this case the prior
N(0, c
2/n) and the likelihood
N(
,
2/n) have the same "precision" about
. When
> 1, f(P–)
0 for all values of P– except P– =
; that is, P– converges to a point mass at
. Thus to avoid the fair-balance paradox, we should have
> 1 in
= c/n
; the variance in the prior of
should approach 0 faster than 1/n.
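The effect of a data size-dependent prior can be illustrated numerically. The sketch below is a reconstruction under stated assumptions rather than the article's own formulas, whose symbols did not survive in this copy: it takes observations y_i ~ N(theta, sigma2), a prior theta ~ N(0, tau2) split evenly between the two models, and a prior variance tau2 = c / n**gamma. The names sigma2, tau2, gamma, and the helper functions are all hypothetical. Under these assumptions the standard conjugate-normal result gives P– = Φ(–√w · z), with z the standardized sample mean and w = n·tau2 / (n·tau2 + sigma2).

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.


def p_minus(ybar, n, sigma2, tau2):
    """P(theta < 0 | ybar) for y_i ~ N(theta, sigma2) and theta ~ N(0, tau2)."""
    w = n * tau2 / (n * tau2 + sigma2)   # shrinkage weight on the sample mean
    z = sqrt(n) * ybar / sqrt(sigma2)    # standardized sample mean
    return PHI(-sqrt(w) * z)


def p_minus_scaled(z, n, gamma, c=1.0, sigma2=1.0):
    """P- at a fixed z-score when the prior variance is c / n**gamma."""
    ybar = z * sqrt(sigma2 / n)
    return p_minus(ybar, n, sigma2, c / n ** gamma)


if __name__ == "__main__":
    n = 10 ** 8
    # Fair balance, data two standard errors above zero (z = 2).
    for gamma in (0.5, 1.0, 1.5):
        print("fair", gamma, round(p_minus_scaled(2.0, n, gamma), 4))
    # Biased balance with theta0 = -0.1, sample mean taken equal to theta0.
    for gamma in (1.5, 2.5):
        print("biased", gamma, round(p_minus(-0.1, n, 1.0, 1.0 / n ** gamma), 4))
```

With an exponent below 1 the prior has little influence and P– stays near Φ(–z); with an exponent above 1 it is pulled toward 1/2; and with an exponent above 2 the prior is so tight that even a genuinely biased balance is not detected, which matches the interval (1, 2) discussed in this section.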
|
The case of θ0 ≠ 0 (equation 2) is summarized in table 1. The statement of Yang and Rannala (2005) that P– converges to 1 if θ0 < 0 (or to 0 if θ0 > 0) irrespective of the exponent in the prior variance is inaccurate. Indeed the behavior of f(P–) depends on the exponent. To ensure that P– → 1 if θ0 < 0 (and P– → 0 if θ0 > 0), we require the exponent to be less than 2. Any exponent in the interval (1, 2) will produce sensible Bayesian inference by the criteria used here, and a smaller exponent corresponds to a more powerful analysis, as it produces higher posterior probabilities for the true model if the coin is biased. Figure 4a shows that the posterior probability P– calculated from a data set may be very sensitive to the prior or the value of the exponent. Furthermore, while f(P–) converges to a point mass at 1, 1/2, and 0 if the true θ0 < 0, = 0, and > 0, respectively, the rate of convergence depends on the exponent. Curves a & b in figure 5 show the density when θ0 = 0 and 0.1 at n = 1,000 when
=
is used.
|
|
|
The Degenerate-Model Prior
Another strategy is to assign a nonzero probability to the degenerate model H0: θ = 0 (Lewis, Holder, and Holsinger 2005). The three models H0, H–, and H+ are assigned prior probabilities π0, (1 – π0)/2, and (1 – π0)/2. The unknown parameter θ in H– and H+ is assigned a normal prior with mean 0 and a constant variance, truncated to the appropriate range.
The likelihood is given by x̄ | θ ~ N(θ, σ²/n). The marginal likelihoods are
![]() | (4) |
![]() | (5) |
0, and n
.
When the sample mean
varies among data sets according to N(
0,
2/n), with density
|
| (6) |
|
| (7) |
. Any P0 in the interval (0, 1/(1 + a)) corresponds to two
:
![]() | (8) |
![]() | (9) |
![]() | (10) |
0) depends on P0,
0, n
, and
If
0 = 0, equation (10) reduces to
![]() | (11) |
0 and n
.
Equations (10) and (11) can be used to confirm that as long as π0 > 0, f(P0) converges to a point mass at 1 when n → ∞ if θ0 = 0 (so that H0 is true), and that if θ0 ≠ 0 (so that H0 is false), f(P0) will converge to a point mass at 0, in which case one of P– and P+ (the one corresponding to the true model) will converge to 1. In other words, the probability for the correct model always converges to 1 when n → ∞. This is a special case of Dawid's (1999) general proof of the consistency of Bayesian model selection.
Here I consider the prior probability π0 as a way of resolving the fair-balance paradox and treat P0 as equal support for H– and H+. Thus (P0, P–, P+) calculated from any data set are converted to the pair (P0/2 + P–, P0/2 + P+). Then if θ0 = 0, we have P0 → 1, so that both components approach 1/2. Similarly, if θ0 ≠ 0, we have P0 → 0, so that one of the two components will approach 1. It is clear that use of the prior probability π0 resolves the fair-balance paradox.
Nevertheless, the Bayesian analysis may be very sensitive to the value of π0, and this sensitivity appears to be the nature of the problem. For example, for a data set of size n = 1,000 with x̄ = –0.05, the converted probability P0/2 + P– is 0.943, 0.683, 0.560, and 0.532 if π0 = 0, 1/10, 1/3, and 1/2, respectively (fig. 4b). Furthermore, while P0/2 + P– converges to a point mass at 1, 1/2, and 0 if the true θ0 < 0, = 0, and > 0, respectively, the convergence may be at very different rates depending on π0. Curves c & d in figure 5 show the density for θ0 = 0 and 0.1, with n = 1,000 when the prior
0 =
is used. This prior produces high posterior probabilities for the true model much more often and may be considered more powerful than the data size-dependent prior with
=
, that is,
(curves a & b in fig. 5).
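The three-model calculation in this section can be sketched as follows. The sketch assumes a normal likelihood for the sample mean with unit data variance and a zero-mean normal prior truncated to each half-line; the prior variance of 2 used in the example is an assumption, chosen only because it reproduces the four probabilities quoted above for n = 1,000 and a sample mean of –0.05. The function names are hypothetical.

```python
import math

def phi_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def three_model_posterior(xbar, n, pi0, prior_var, sigma=1.0):
    """Posterior probabilities (P0, P-, P+) for H0: theta = 0, H-: theta < 0,
    and H+: theta > 0, with prior probabilities pi0, (1 - pi0)/2, (1 - pi0)/2."""
    like_var = sigma ** 2 / n
    m0 = normal_pdf(xbar, like_var)                    # marginal likelihood of H0
    shrink = prior_var / (prior_var + like_var)
    z = xbar * math.sqrt(shrink / like_var)            # posterior mean / posterior SD
    m_any = normal_pdf(xbar, prior_var + like_var)     # untruncated marginal
    m_minus = 2.0 * m_any * phi_cdf(-z)                # prior truncated to theta < 0
    m_plus = 2.0 * m_any * phi_cdf(z)                  # prior truncated to theta > 0
    w0 = pi0 * m0
    wm = 0.5 * (1.0 - pi0) * m_minus
    wp = 0.5 * (1.0 - pi0) * m_plus
    total = w0 + wm + wp
    return w0 / total, wm / total, wp / total

def folded_support(p0, p_minus, p_plus):
    """Split P0 equally between H- and H+."""
    return p0 / 2.0 + p_minus, p0 / 2.0 + p_plus

if __name__ == "__main__":
    for pi0 in (0.0, 0.1, 1.0 / 3.0, 0.5):
        post = three_model_posterior(-0.05, 1000, pi0, prior_var=2.0)
        print(round(pi0, 3), round(folded_support(*post)[0], 3))
```

The strong dependence of the folded probability on the prior probability of the degenerate model is visible directly in the printed values.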
Lindley's Paradox
If we do not distinguish between models H– and H+ and define P1 = 1 – P0 = P– + P+, the problem becomes one of comparing a sharp null hypothesis H0: θ = 0 with a composite alternative hypothesis H1: θ ≠ 0. This is the case for Lindley's (1957; see also Jeffreys 1939) paradox. If
is fixed but n → ∞, then P0 → 1 (eq. 7). Lindley's paradox refers to the observation that in a data set, x̄ may differ sufficiently from 0 for H0 to be rejected by a significance test, while Bayesian analysis of the same data strongly supports H0 with posterior probability P0 ≈ 1. Thus the significance test and the Bayesian analysis draw opposite conclusions from the same data. Indeed, if large data sets are generated under the null model, such contradictions will occur in ~5% of data sets if the significance test is conducted at the 5% level. As discussed above, if H0 is true and n is large, P0 ≈ 1 in nearly every data set, but the significance test will still reject the true null hypothesis 5% of the time. This result appears to suggest flaws in the methodology of significance testing, as claimed by some Bayesian statisticians (e.g., Good 1982, p. 342; Press 2003, pp. 220–225; Berger 1985, pp. 144–157), rather than in Bayesian analysis, as suggested by, e.g., Bernardo (1980) and Shafer (1982). Furthermore, Davison (2003, pp. 586–587) and Cox (2006, pp. 42–43, 106–107) (see also Bartlett 1957) suggested the use of
= c/n, so that
N(0, c
2/n), to resolve Lindley's paradox. By the criteria used here, this prior is not acceptable as it causes f(P0) to fail to converge to the point mass at 1 when
0 = 0 (see equation 11)!
Nevertheless, whatever the true model or the observed data x̄, P0 can be made arbitrarily close to 1 by the use of a sufficiently diffuse prior, as P0 → 1 when the prior variance → ∞ in equation (7). Bayesian analysis in this case is extremely sensitive to the prior.
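Lindley's paradox can be illustrated with a short sketch. It assumes a normal likelihood for the sample mean and an untruncated zero-mean normal prior under H1; the function names, the default prior variance of 1, and the default prior probability of 1/2 for H0 are illustrative assumptions rather than values taken from the article.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def posterior_null(z, n, pi0=0.5, prior_var=1.0, sigma=1.0):
    """Posterior probability of H0: theta = 0 when the standardized sample
    mean sqrt(n) * xbar / sigma equals z."""
    xbar = z * sigma / math.sqrt(n)
    like_var = sigma ** 2 / n
    m0 = normal_pdf(xbar, like_var)               # marginal likelihood under H0
    m1 = normal_pdf(xbar, prior_var + like_var)   # marginal likelihood under H1
    return pi0 * m0 / (pi0 * m0 + (1.0 - pi0) * m1)

def two_sided_p_value(z):
    """Two-sided p-value of the z test, 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

if __name__ == "__main__":
    z = 1.96  # borderline rejection at the 5% level
    print("p-value:", round(two_sided_p_value(z), 4))
    for n in (10**2, 10**4, 10**6):
        print("n =", n, "P0 =", round(posterior_null(z, n), 4))
    # A more diffuse prior pushes P0 toward 1 even at a modest sample size.
    print("diffuse prior:", round(posterior_null(z, 100, prior_var=1e6), 4))
```

Holding the test statistic at the rejection boundary while n grows, or holding n fixed while the prior variance grows, both drive the posterior probability of the null toward 1.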
Bayesian Tree Estimation in the Three-Species Case
The Tree Problem
There are three (rooted) binary trees for three species (fig. 2): τ1 = ((12)3), τ2 = ((23)1), and τ3 = ((31)2). We consider binary characters, which evolve at a constant rate according to a stationary Markov process. The data are the counts n0, n1, n2, n3 of the site patterns xxx, xxy, yxx, and xyx. Let xi = ni/n, i = 0, 1, 2, 3, be the proportions of the site patterns. The data may be represented as n = {n1, n2, n3} or x = {x1, x2, x3}, where n = n0 + n1 + n2 + n3 is the total number of sites.
Under tree τ1, with branch lengths t0 and t1 (fig. 2), the probabilities of observing the four site patterns are
![]() | (12) |
For 0 ≤ t0, t1 ≤ ∞, we have p0 ≥ p1 ≥ p2 = p3 ≥ 0 and p0 + p1 + p2 + p3 = 1. The likelihoods under the three trees are
![]() | (13) |
We assign equal prior probability 1/3 to each binary tree and exponential priors with means µ0 and µ1 to t0 and t1: f(t0) = exp{–t0/µ0}/µ0 and f(t1) = exp{–t1/µ1}/µ1. The exponential priors appear more sensible than uniform priors since most branch lengths in real trees are small while very large branch lengths are rare. The marginal likelihood under tree τi is
|
| (14) |
|
| (15) |
Thus analysis of each data set requires evaluation of three two-dimensional integrals. (In contrast, the case of four species and no molecular clock requires evaluation of three five-dimensional integrals.) Yang and Rannala (2005) used Mathematica (Wolfram 2003) to calculate the integrals of equation (14) numerically. This was found to be unreliable for large data sets, with n ≥ 5,000, say. A difficulty is that the integrand is nearly a spike at its mode.
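For small data sets, the three two-dimensional integrals can be approximated by brute force. The sketch below is not the Mathematica calculation used in the article: it is a crude midpoint-rule integration on a fixed grid, carried out in log space. It assumes a symmetric two-state model with branch lengths measured in expected changes per site, which gives the pattern probabilities coded in `pattern_probs`; that parameterization, the grid settings, and all function names are assumptions made for illustration.

```python
import math

def pattern_probs(t0, t1):
    """Probabilities of patterns xxx, xxy, and yxx (= xyx) under tree ((12)3).

    Assumed: symmetric two-state model with a molecular clock; t1 is the age
    of the (1,2) ancestor and t0 the internal branch length.
    """
    e1 = math.exp(-4.0 * t1)
    e01 = math.exp(-4.0 * (t0 + t1))
    p0 = 0.25 + 0.25 * e1 + 0.5 * e01
    p1 = 0.25 + 0.25 * e1 - 0.5 * e01
    p2 = 0.25 - 0.25 * e1
    return p0, p1, p2

def log_marginal(n0, n_own, n_other, mu0=0.1, mu1=0.2, t_max=2.0, steps=150):
    """Log marginal likelihood of one binary tree by midpoint-rule integration.

    n_own is the count of the pattern grouping the tree's two sister species;
    n_other is the sum of the other two variable-pattern counts.
    """
    h = t_max / steps
    log_terms = []
    for i in range(steps):
        t0 = (i + 0.5) * h
        for j in range(steps):
            t1 = (j + 0.5) * h
            p0, p1, p2 = pattern_probs(t0, t1)
            log_lik = n0 * math.log(p0) + n_own * math.log(p1) + n_other * math.log(p2)
            log_prior = -t0 / mu0 - math.log(mu0) - t1 / mu1 - math.log(mu1)
            log_terms.append(log_lik + log_prior)
    top = max(log_terms)
    # log-sum-exp, then multiply by the area h*h of each grid cell
    return top + math.log(sum(math.exp(v - top) for v in log_terms)) + 2.0 * math.log(h)

def posterior_tree_probs(n0, n1, n2, n3):
    """Posterior probabilities of the three binary trees under equal priors."""
    logs = [log_marginal(n0, n1, n2 + n3),   # tree ((12)3)
            log_marginal(n0, n2, n3 + n1),   # tree ((23)1)
            log_marginal(n0, n3, n1 + n2)]   # tree ((31)2)
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    print(posterior_tree_probs(60, 20, 10, 10))
```

A fixed grid of this kind becomes useless once the integrand narrows to a spike, which is the difficulty noted above for large n.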
Two ideas appear promising. The first is to use the site pattern probabilities as parameters in the binary tree instead of t0 and t1 and construct conjugate priors on them. The second is to use large-sample approximations. The latter is explored in this study.
Approximate Calculation of Posterior Probabilities for Trees
We use Laplacian expansion (Copson 1965, pp. 36–47; Bender and Orszag 1999, pp. 261–276) to approximate the integral M1 for tree τ1 (eq. 14). The integrals M2 and M3 for trees τ2 and τ3 are calculated by a permutation of the counts n1, n2, n3. In a typical Bayesian estimation problem under a well-specified model, the likelihood function and the posterior density can be quite accurately approximated using a normal density in large data sets (Lindley 1980; Tierney and Kadane 1986). However, phylogenetic trees are different models (e.g., Yang et al. 1995). For any given data set, the maximum likelihood estimate (MLE) of t0 is zero in at least one tree, in which case the normal approximation breaks down. Instead, the rather tedious algorithms presented below were derived by trial and error, with intensive testing against Mathematica.
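For orientation, the textbook Laplace approximation for a two-dimensional integral with an interior mode is given below. This is the standard result, not a reconstruction of equations (16) to (19); the matrix A, defined as the negative Hessian of h at the mode, is notation introduced here and reflects an assumed sign convention.

```latex
I = \int_0^\infty\!\!\int_0^\infty f(t_0,t_1)\, e^{n h(t_0,t_1)}\, dt_0\, dt_1
\;\approx\;
f(\hat t_0,\hat t_1)\, e^{n h(\hat t_0,\hat t_1)}\, \frac{2\pi}{n\sqrt{\det A}},
\qquad
A = -\left.\frac{\partial^2 h}{\partial t_i\,\partial t_j}\right|_{(\hat t_0,\hat t_1)} .
```

The formula presumes that the mode lies well inside the quarter-plane relative to the n^{-1/2} width of the peak. When the MLE of t0 sits on the boundary t0 = 0, part of the Gaussian volume is cut off, which is why boundary cases need separate treatment.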
Rewrite equation (14) as
|
| (16) |
|
| (17) |
h/
ti, hij =
2h/
ti
tj, etc., to be the derivatives evaluated at the MLEs
and
(see Appendix). Let H = {hij} be the Hessian matrix. If H is positive-definite, we let
![]() |
|
|
=
01/(
0
1). We use the first few terms in the Taylor expansions of f and h as approximations
![]() | (18) |
The integral of equation (16) is the volume of the solid between the f·e^(nh) surface and the t0–t1 plane in the quarter-plane t0 > 0, t1 > 0. We consider three cases, depending on whether
> 0 and whether
= 0 (Yang 2000
, Tables 2 and 3). We assume that x0 >
.
|
Case I: x1 > (1 – x0)/3. We have
> 0, and
> 0, with
=
= 0 (Yang 2000
, where the likelihood surface is nearly that of a bivariate normal density function.
![]() | (19) |
is often close to 0, or
is small (say, <3), in which case equation (19) is not very reliable. The bivariate normal integral can then be calculated using the algorithm of Drezner and Wesolowsky (1990)
Case II: x1 = (1 – x0)/3. We have
= 0,
> 0, with
=
= 0. The integral is then half that in case I as the volume above the half plane t0 < 0 is missing.
|
| (20) |
Case III: x1 < (1 – x0)/3. We have
= 0 and
> 0, with
< 0 and
= 0. This situation is complex, and is broken into two cases, depending on whether the Hessian matrix H is positive definite.
In case IIIa, H is positive definite. We then use all second-order terms in the Taylor expansion of h.
![]() | (21) |
0,
, and we have
![]() | (22) |
0 >> 1, we may apply Watson's Lemma to approximate the integral in equation (22). Write this as
, where q(y) =
, with a =
0. From the MacLaurin expansion of q(y), we have
![]() | (23) |
![]() | (24) |
(·) is the probability density function (p.d.f.) of a standard normal variate. Thus
![]() | (25) |
(·).
However, if c = –nh0
0 is small (< 1), as may be the case if h0 is nearly zero, equation (25) is unreliable. Then I use the Gauss-Legendre quadrature to calculate the one-dimensional integral of equation (22) numerically, which was found to produce reliable results.
In case IIIb, x1 < (1 – x0)/3, so that
= 0 and
> 0, with h0 =
< 0 and h1 =
= 0, but H is not positive-definite. This case occurs mainly when the data are very unlikely on the tree and h0 is very negative. We then use the linear term for t0 and quadratic term for t1 in the Taylor expansion of h, as follows.
![]() | (26) |
Change variables from t1 to z =
, where
1 =
.
![]() | (27) |
, and thus the integral from
to
is nearly the same as from –
to
, while 1/(1 + a) = 1 – a + a2 – a3 + ... when |a| =
< 1. The last equality uses the result that if z is a random variable from the standard normal distribution, E(z^k) = 0 for odd k or (k – 1)(k – 3) ··· 3 · 1 for even k (e.g., Johnson et al. 1994).
Suppose in the data set, n1 > n2 > n3. Then M1 > M2 > M3. Calculation of M1 makes use of equation (19) for case I, and calculation of M3 makes use of equations (22) or (27) for case IIIa. Calculation of M2 uses each of these two cases about half of the time. Cases II (equation 20) and IIIb (equation 27) are rarely encountered.
The above discussion assumes that the priors on branch lengths are fixed, with µ0 and µ1 being fixed constants. When µ0 depends on the data size n, some modifications to the above algorithm are necessary.
The exact calculation using Mathematica is reliable for small data sets but unstable for large ones (say, with n > 5,000). The approximate calculation is the opposite: it is reliable for large data sets only, say with n ≥ 1,000. Figure 6 shows posterior tree probabilities calculated using the two methods, while table 2 shows the effect of sample size n on the approximation. On a 3.2GHz Pentium IV, analyzing 10^5 data sets took a few seconds using the approximate method and ~15 days using Mathematica. Both methods are much faster than MCMC for this small problem. The approximation allows us to calculate posterior tree probabilities for arbitrarily large data sets.
|
Simulation of Data
Consider simulation of data sets under tree τ1 with given branch lengths t0 and t1. Simulation under the star tree τ0 can be done using the same algorithm by fixing t0 = 0. The counts of sites follow a multinomial distribution with four cells: MN4(n; p0, p1, p2, p2), with cell probabilities given in equation (12). For large n, the data have approximately a trivariate normal distribution: n = (n1, n2, n3)
N3(n
0, nS0), where
![]() | (28) |
We have |nS0| = n^3 p0 p1 p2 p2, and
![]() | (29) |
![]() | (30) |
0)T = (x1 – p1, x2 – p2, x3 – p2)T, and T is the transpose.
The Cholesky decomposition of the variance matrix is given as nS0 = LLT, with
![]() | (31) |
![]() | (32) |
0 + Lz will be the desired counts of site patterns.
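The normal-approximation sampler described above can be sketched as follows. The covariance used is the standard multinomial form (diagonal of the cell probabilities minus their outer product) for the three variable patterns, and the Cholesky factor is computed numerically rather than from the closed form of equation (31). The pattern probabilities assume a symmetric two-state model; that parameterization and the function names are assumptions made for illustration.

```python
import math
import random

def pattern_probs(t0, t1):
    """Pattern probabilities (p0, p1, p2, p3) under tree ((12)3), assuming a
    symmetric two-state model with a clock; p2 equals p3."""
    e1 = math.exp(-4.0 * t1)
    e01 = math.exp(-4.0 * (t0 + t1))
    p2 = 0.25 - 0.25 * e1
    return 0.25 + 0.25 * e1 + 0.5 * e01, 0.25 + 0.25 * e1 - 0.5 * e01, p2, p2

def covariance(p):
    """Per-site covariance of the three variable-pattern indicators."""
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(3)]
            for i in range(3)]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a small positive-definite matrix."""
    size = len(a)
    low = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def simulate_counts(n, t0, t1, rng):
    """Approximate (real-valued) counts n1, n2, n3 for a data set of n sites."""
    p0, p1, p2, p3 = pattern_probs(t0, t1)
    probs = [p1, p2, p3]
    low = cholesky(covariance(probs))
    z = [rng.gauss(0.0, 1.0) for _ in range(3)]
    root_n = math.sqrt(n)
    return [n * probs[i] + root_n * sum(low[i][j] * z[j] for j in range(i + 1))
            for i in range(3)]

if __name__ == "__main__":
    rng = random.Random(2007)
    print(simulate_counts(3 * 10**9, 0.0, 0.2, rng))  # star tree: t0 = 0
```

The counts returned are real numbers rather than integers, which is harmless when n is very large but would need rounding or exact multinomial sampling for small n.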
Two Strategies to Resolve the Star-tree Paradox
We now consider the two priors for resolving the star-tree paradox, following our discussions of the fair-coin and fair-balance paradoxes above. The first is to let the prior mean for the internal branch length approach zero when the data size increases, and the second is to assign a nonzero prior probability π0 to the degenerate star tree.
Data Size-Dependent Prior
This forces the mean µ0 in the prior for the internal branch length t0 to approach 0, or, equivalently, forces the probabilities of the three variable site patterns p1, p2, and p3 to approach equality (p1 = p2 = p3), when n → ∞ (Yang and Rannala 2005). In the fair-coin problem, 1 – θ and θ are the two cell probabilities in the multinomial (binomial) distribution; the models of negative and positive bias are specified as H–: 1 – θ > θ and H+: 1 – θ < θ, while the fair-coin model is H0: 1 – θ = θ. The distance between H– (say) and H0 may be measured by |1 – θ – θ| = |1 – 2θ|. It was determined that the prior should force E(1 – 2θ)^2 or the variance of θ to approach 0 faster than 1/n but more slowly than 1/n^2. In the tree problem, the binary tree, say τ1, is represented by p1 > p2 = p3 while the star tree τ0 is p1 = p2 = p3, where p1, p2, p3 are three cell probabilities in a multinomial distribution. The distance between τ1 and τ0 can be measured by |p1 – p2|, and by analogy with the fair-coin problem, we require that the prior on branch lengths t0 and t1 force E(p1 – p2)^2 to approach 0 faster than 1/n but more slowly than 1/n^2.
Let µ0 = c/n
with
> 0. The prior for branch lengths t0 and t1 is given by the independent exponential distributions
|
| (33) |
= p1 – p2 as the new parameters in the binary tree; the two sets of parameters are related by equation (12). The prior distribution of p0 and
is obtained from equation (33) through a variable transform as
![]() | (34) |
2) =
. Thus µ0 should approach 0 faster than 1/√n but more slowly than 1/n; in other words, the exponent of n in µ0 must lie strictly between 1/2 and 1.
Degenerate-Model Prior π0
We assign a prior probability π0 > 0 for the star tree τ0, while the three binary trees are assigned prior probabilities π1 = π2 = π3 = (1 – π0)/3 (Lewis, Holder, and Holsinger 2005). The branch length t in the star tree is assigned the prior f(t) = exp{–t/µ1}/µ1. The marginal likelihood M0 under τ0 is a one-dimensional integral over t, similar to equation (14). This is reliably calculated by approximating the likelihood with a normal density, similarly to the calculation with equation (19). The marginal likelihoods for the three binary trees M1, M2, and M3 are calculated as before. Then πiMi, i = 0, 1, 2, 3, are rescaled to sum to one to give the posterior probabilities for all four trees. As the star tree is a special case of the three binary trees with one fewer parameter, all four trees are correct when the data are generated from the star tree. Thus we expect the posterior probability for the star tree τ0 to converge to 1 as the star-tree model has a lower dimension (Dawid 1999). Here we consider π0 as a way of resolving the star-tree paradox and divide P0 among the three binary trees to calculate their posterior probabilities
|
| (35) |
Thus P1, P2, P3 will converge to the point mass at (1/3, 1/3, 1/3) when n → ∞ if the data are generated under the star tree, and to (1, 0, 0) if the data are generated under the binary tree τ1.
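The rescaling and folding steps described above can be sketched as follows. The equal three-way split of the star-tree posterior is an assumption here, made by analogy with the two-way split in the fair-balance problem and consistent with the stated limit; the function names are hypothetical, and marginal likelihoods are passed on the log scale to avoid underflow.

```python
import math

def tree_posteriors(log_marginals, pi0):
    """Posterior probabilities (P0, P1, P2, P3) for the star tree and the three
    binary trees, given log marginal likelihoods [log M0, log M1, log M2, log M3]
    and a prior probability pi0 in (0, 1) for the star tree."""
    priors = [pi0] + [(1.0 - pi0) / 3.0] * 3
    logs = [math.log(p) + lm for p, lm in zip(priors, log_marginals)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def fold_star_tree(posteriors):
    """Divide the star-tree posterior P0 equally among the three binary trees."""
    p0 = posteriors[0]
    return [p + p0 / 3.0 for p in posteriors[1:]]

if __name__ == "__main__":
    post = tree_posteriors([-1000.0, -1003.0, -1004.5, -1004.5], pi0=0.25)
    print([round(p, 4) for p in fold_star_tree(post)])
```

When the star tree dominates, the folded values approach (1/3, 1/3, 1/3); when one binary tree dominates, they approach a unit vector.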
Simulation Results
The Star-tree Paradox
We use computer simulation to study the variation in posterior tree probabilities (P1, P2, P3) when data sets are generated under the star tree. The branch length is fixed at t = 0.2. Each of the 10^5 replicate data sets is analyzed using the Bayesian method to calculate P1, P2, P3, using equal prior probabilities (1/3) for the three binary trees and exponential priors for branch lengths with means µ0 = 0.1 and µ1 = 0.2 (equation 15). The distribution f(P1, P2, P3) across data sets is estimated by a kernel-density smoothing algorithm (Silverman 1986). Three sequence lengths are used: 3 × 10^3, 3 × 10^6, and 3 × 10^9. For n = 3 × 10^3, both exact calculation using Mathematica and the approximate method by Laplacian expansion are used, while for the two large data sizes, only the approximate method is used.
Figure 7 shows the joint density f(P1, P2, P3) for n = 3 × 10^3 and 3 × 10^9. Figure 8 shows three univariate densities derived from the same data, for P1, for Pmin = min(P1, P2, P3) and for Pmax = max(P1, P2, P3). For n = 3 × 10^3, the exact and approximate methods produced results that are indistinguishable, suggesting that the approximation is reliable. The results for n = 3 × 10^3, 3 × 10^6 (not shown), and 3 × 10^9 are very similar, indicating that for the parameter values used, n = 3 × 10^3 is close to infinity, although it is noticeable that the posterior probabilities tend to become more extreme (near 0 or 1) in larger data sets (fig. 8a). The SD for P1 is 0.2440 for n = 3 × 10^3 and 0.2498 for n = 3 × 10^6 and 3 × 10^9. In general, the means and SDs for P1, Pmin, and Pmax are identical to the fourth decimal place between n = 3 × 10^6 and 3 × 10^9.
For n = 3 × 10^9, data sets are also simulated using different values of the branch length t in the star tree (such as 0.1, 0.3, 0.4, 0.5, and 1.0), and they are analyzed using different prior means µ0 and µ1 (such as µ0 = 0.2, 0.5, 10 and µ1 = 0.1, 0.3, 0.7). The number of replicates is also raised to 10^7. As far as can be judged, the distribution f(P1, P2, P3) is independent of t, µ0 and µ1. The invariance of f(P1, P2, P3) to parameters t, µ0 and µ1 may be generally true as it parallels the fair-balance analysis in which the limiting distribution f(P–) is uniform, independent of parameter
in the prior
N(0, 
2). It also indicates that the distribution is unlikely to change when n increases beyond 3 × 10^9. In all cases examined, every Pi has mean 1/3 and SD 0.2498, and pairwise correlation coefficient –0.5000. The correlation should be exactly –1/2, according to the following symmetry argument (Peter Green, pers. comm.). From 1 = P1 + P2 + P3, we have
![]() | (36) |
. There are four modes in the distribution, at the center and the three corners of the ternary graph (fig. 7).
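The symmetry argument for the pairwise correlation can be written out explicitly. The lines below are a worked version of the argument and are not the article's typeset equation (36); they use only the constraint that the three probabilities sum to one and the fact that, under the star tree, the three are identically distributed with identical pairwise covariances.

```latex
0 = \operatorname{Var}(P_1 + P_2 + P_3)
  = 3\operatorname{Var}(P_1) + 6\operatorname{Cov}(P_1, P_2)
\;\Longrightarrow\;
\operatorname{Cov}(P_1, P_2) = -\tfrac{1}{2}\operatorname{Var}(P_1),
\qquad
\rho(P_1, P_2) = \frac{\operatorname{Cov}(P_1, P_2)}{\operatorname{Var}(P_1)} = -\tfrac{1}{2}.
```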
We now use the distributions of P1, Pmin and Pmax for n = 3 × 10^9 to examine how often the Bayesian method produces extreme posterior probabilities, assuming that this sequence length represents the limiting case of infinite data (fig. 8). Pmin has mean 0.1298 and SD 0.0769 while Pmax has mean 0.6319 and SD 0.1698. In 4.23% of data sets, Pmax > 0.95 (that is, at least one of the three posterior probabilities is > 0.95), and in 0.79% of data sets, Pmax > 0.99. In 17.3% of data sets, Pmin < 0.05 (that is, at least one of the three posterior probabilities is < 0.05), and in 2.6% of data sets, Pmin < 0.01. If we consider any particular binary tree, such as τ1, we find that the proportion of data sets in which P1 < 0.05 (or 0.01) is 8.1% (or 1.31%), and the proportion of data sets in which P1 > 0.95 (or 0.99) is 1.41% (or 0.26%). Because the true tree is the star tree, we would not want any binary tree to have either a very high or a very low posterior probability. The method appears to produce extreme posterior probabilities, especially very small ones, quite often.
Data Size-dependent Prior
This prior forces the mean µ0 of internal branch lengths to approach 0 when n → ∞. We let µ0 equal 0.1 divided by a power of n and use different values for the exponent. When the data are simulated under the star tree, the means of the posterior probabilities for the three binary trees are always 1/3. Figure 9a shows the SD of P1 for tree τ1 when the exponent is 0, 0.5, 0.51, 0.707, and 0.8. Our theoretical analysis suggests that the exponent has to be greater than 1/2 for P1 to converge to the point mass at 1/3. If the exponent is 0, the SD of P1 converges to 0.2498 when n → ∞; this is the case of the star-tree paradox discussed above. If the exponent is 0.5, the SD stabilizes at 0.064 instead of 0. Thus (P1, P2, P3) have a distribution, which depends on parameters such as the branch length t in the star tree, and µ1 and c in the prior (where c is the constant multiplying the power of 1/n in µ0). This is analogous to the case of a true parameter value of 0 and an exponent of 1 in table 1 for the fair-balance problem (fig. 3). When the exponent is 0.51, slightly larger than 1/2, the SD decreases monotonically from 0.0608 at n = 10^3 to 0.0522 at n = 3 × 10^9. The limit when n → ∞ should be 0, according to the theoretical analysis. If the exponent is 0.707 or 0.8, the SD clearly converges to 0 when n → ∞.
|
Results obtained when the data are simulated under the binary tree τ1 with t0 = 0.01 and t1 = 0.2 are shown in figure 9b. The theoretical analysis predicts that the exponent must be less than 1 for P1 to converge to the point mass at 1 when n → ∞. If the exponent is 0, 0.5, or 0.707 (all less than 1), the mean of P1 indeed converges to 1 while the SD converges to 0, so that the probability for the true model converges to 1 (fig. 9b). Numerical problems are encountered with larger exponents, so that the cases in which the exponent is close to or larger than 1 are not examined. Nevertheless, as long as the star-tree paradox is resolved (with the exponent greater than 1/2), small values of the exponent are preferred to larger ones, as small values lead to higher posterior probabilities for the true tree when the true tree is binary. Three convenient values for the exponent are 0.667, 0.707, and 0.75. These are the harmonic, geometric, and arithmetic means of 1/2 and 1, and may represent conservative, moderate, and liberal priors, respectively.
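The three suggested values can be checked with a few lines of arithmetic, taking the two bounds to be 1/2 and 1 (the lower bound is inferred from the quoted values). The function name is hypothetical.

```python
import math

def three_means(lower, upper):
    """Harmonic, geometric, and arithmetic means of two positive numbers."""
    harmonic = 2.0 / (1.0 / lower + 1.0 / upper)
    geometric = math.sqrt(lower * upper)
    arithmetic = (lower + upper) / 2.0
    return harmonic, geometric, arithmetic

if __name__ == "__main__":
    print([round(v, 3) for v in three_means(0.5, 1.0)])
```

Because the harmonic mean is the smallest of the three and the arithmetic mean the largest, the ordering of the means matches the ordering of the three quoted values.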
Degenerate-Model Star-tree Prior π0
Here a nonzero probability π0 is assigned to the degenerate star tree τ0, while the three binary trees have prior probabilities π1 = π2 = π3 = (1 – π0)/3. The posterior probabilities for the three binary trees are calculated using equation (35). We are interested in the behavior of the joint density f(P1, P2, P3) when the data size n → ∞ and when the data are generated under either the star tree or a binary tree.
A few different values are used for
0: 1/10, 1/4, and
. In every case, the joint density f(P1, P2, P3) converges to the point mass at (1/3, 1/3, 1/3) when n → ∞. For example, with t = 0.2 in the star tree and π0 = 0.25, µ0 = 0.1, and µ1 = 0.2 in the prior, the SD of P1 is calculated to be 0.125, 0.025, and 0.004 for n = 3 × 10^3, 3 × 10^6, and 3 × 10^9, respectively. The mean of the distribution is clearly 1/3, and the convergence of the SD to 0 means that the distribution is becoming degenerate to the point mass at the mean. When π0 = 0.1, the SD of P1 is 0.177, 0.044, and 0.007 for the three values of n, and the rate of convergence is slower than when π0 = 0.25.
Furthermore, analysis of data sets simulated under a binary tree with t0 > 0 confirms that when n increases, the posterior probability for the true binary tree approaches 1. In sum, use of the prior π0 resolves the star-tree paradox, as long as 0 < π0 < 1. This result is expected from Dawid's (1999) general proof of consistency of Bayesian model selection.
| Addendum |
|---|
|
|
|---|
Steel and Matsen (2007)
, the posterior probability for any binary tree, say, P1, does not converge to 1/3 and will maintain a strictly positive probability of being large (say, > 0.99). The result is consistent with this study, contra Kolaczkowski and Thornton (2006)
remains unknown.
| Appendix. Derivatives for Laplacian Expansion |
|---|
|
|
|---|
Consider tree τ1. The data can be represented as x0 = n0/n, x1 = n1/n, and the log likelihood is log L = nh, where
|
| (37) |
and
and note that
![]() | (38) |
![]() | (39) |
![]() | (40) |
![]() | (41) |
| Acknowledgements |
|---|
|
|
|---|
I am grateful to Professor Peter Green of the University of Bristol for pointing out that the correlation between any two posterior probabilities in the star-tree distribution is exactly –1/2. I thank Professor Philip Dawid (UCL) for very useful discussions, and Jim Mallet and Max Telford for comments on the first part of the manuscript. This study is supported by a grant from the Natural Environment Research Council (UK).
| Footnotes |
|---|
Arndt von Haeseler, Associate Editor
| References |
|---|
|
|
|---|
Bartlett MS. A comment on D.V. Lindley's paradox. Biometrika (1957) 44:533–534.
Bender CM, Orszag SA. Advanced Mathematical Methods for Scientists and Engineers: Asymptotic Methods and Perturbation Theory (1999) New York: Springer-Verlag.
Berger JO. Statistical Decision Theory and Bayesian Analysis (1985) New York: Springer-Verlag.
Bernardo JM. Reference posterior distributions for Bayesian inference. J R Stat Soc B (1979) 41:113–147.
Bernardo JM. A Bayesian analysis of classical hypothesis testing. In: Bayesian Statistics—Bernardo JM, DeGroot MH, Lindley DV, Smith AFM, eds. (1980) Valencia, Spain: Valencian University Press. 605–647.
Berry V, Gascuel O. On the interpretation of bootstrap trees: appropriate threshold of clade selection and induced gain. Mol Biol Evol (1996) 13:999–1011.
Bourlat SJ, Juliusdottir T, Lowe CJ, Freeman R, et al. Deuterostome phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida. Nature (2006) 444:85–88.
Buckley TR. Model misspecification and probabilistic tests of topology: evidence from empirical data sets. Syst Biol (2002) 51:509–523.
Copson ET. Asymptotic Expansions (1965) Cambridge, UK: Cambridge University Press.
Cox DR. Principles of Statistical Inference (2006) Cambridge, UK: Cambridge University Press.
Cummings MP, Handley SA, Myers DS, Reed DL, et al. Comparing bootstrap and posterior probability values in the four-taxon case. Syst Biol (2003) 52:477–487.
Davison AC. Statistical Models (2003) Cambridge, UK: Cambridge University Press.
Dawid AP. The trouble with Bayes factors. Research Report 202. In: Department of Statistical Science (1999) University College London.
Douady CJ, Delsuc F, Boucher Y, Doolittle WF, et al. Comparison of Bayesian and maximum likelihood bootstrap measures of phylogenetic reliability. Mol Biol Evol (2003) 20:248–254.
Drezner Z, Wesolowsky GO. On the computation of the bivariate normal integral. J Statist Comput Simul (1990) 35:101–107.
Efron B. R.A. Fisher in the 21st Century. Stat Sci (1998) 13:95–122.
Erixon P, Svennblad B, Britton T, Oxelman B. Reliability of Bayesian posterior probabilities and bootstrap frequencies in phylogenetics. Syst Biol (2003) 52:665–673.
Good IJ. Lindley's paradox. J Am Stat Assoc (1982) 77:342.
Green PJ. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika (1995) 82:711–732.
Hill ID. The normal integral. Appl Stat (1973) 22:424–427.
Huelsenbeck JP, Ronquist F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics (2001) 17:754–755.
Huelsenbeck JP, Rannala B. Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst Biol (2004) 53:904–913.
Jeffreys H. Theory of Probability (1939) Oxford, UK: Clarendon Press.
Jeffreys H. Theory of Probability (1961) Oxford, UK: Oxford University Press.
Johnson NL, Kotz S, Balakrishnan N. Continuous Univariate Distributions (1994) Volume 1. New York: Wiley.
Kolaczkowski B, Thornton JW. Is there a star tree paradox? Mol Biol Evol (2006) 23:1819–1823.
Lemmon AR, Moriarty EC. The importance of proper model assumption in Bayesian phylogenetics. Syst Biol (2004) 53:265–277.
Lewis PO, Holder MT, Holsinger KE. Polytomies and Bayesian phylogenetic inference. Syst Biol (2005) 54:241–253.
Li S, Pearl D, Doss H. Phylogenetic tree reconstruction using Markov chain Monte Carlo. J Am Statist Assoc (2000) 95:493–508.
Lindley DV. A statistical paradox. Biometrika (1957) 44:187–192.
Lindley DV. Approximate Bayesian methods. In: Bayesian statistics—Bernardo JM, DeGroot MH, Lindley DV, Smith AFM, eds. (1980) Valencia, Spain: Valencian University Press. 223–237.
Mau B, Newton MA. Phylogenetic inference for binary data on dendrograms using Markov chain Monte Carlo. J Computat Graph Stat (1997) 6:122–131.
Pearson ES. The choice of statistical tests illustrated on the interpretation of data classed in the 2 x 2 table. Biometrika (1947) 34:139–167.
Press SJ. Subjective and Objective Bayesian Statistics (2003) New Jersey: John Wiley & Sons.
Rannala B, Yang Z. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J Mol Evol (1996) 43:304–311.
Shafer G. Lindley's paradox. J Am Statist Assoc (1982) 77:325–334.
Silverman BW. Density Estimation for Statistics and Data Analysis (1986) London: Chapman and Hall.
Simmons MP, Pickett KM, Miya M. How meaningful are Bayesian support values? Mol Biol Evol (2004) 21:188–199.
Steel M, Matsen FA. The Bayesian "star paradox" persists for long finite sequences. Mol Biol Evol (2007) 24:1075–1079.
Suzuki Y, Glazko GV, Nei M. Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proc Natl Acad Sci USA (2002) 99:16138–16143.
Tierney L, Kadane JB. Accurate approximations for posterior moments and marginal densities. J Am Stat Assoc (1986) 81:82–86.
Wolfram S. Mathematica 5 (2003) Cambridge, UK: Cambridge University Press.
Yang Z. Complexity of the simplest phylogenetic estimation problem. Proc R Soc B: Biol Sci (2000) 267:109–116.
Yang Z. Computational Molecular Evolution (2006) Oxford, England: Oxford University Press.
Yang Z, Rannala B. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo Method. Mol Biol Evol (1997) 14:717–724.
Yang Z, Rannala B. Branch-length prior influences Bayesian posterior probability of phylogeny. Syst Biol (2005) 54:455–470.
Yang Z, Goldman N, Friday AE. Maximum likelihood trees from DNA sequences: a peculiar statistical estimation problem. Syst Biol (1995) 44:384–399.
[Figure caption fragment: ... or n = 10^6 times. The number of heads y in n tosses is used to calculate P–, assuming a uniform prior.]