MBE Advance Access originally published online on February 13, 2007
Molecular Biology and Evolution 2007 24(4):1075-1079; doi:10.1093/molbev/msm028
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
The Bayesian "Star Paradox" Persists for Long Finite Sequences
Biomathematics Research Centre, University of Canterbury, Christchurch, New Zealand
E-mail: m.steel{at}math.canterbury.ac.nz.
| Abstract |
|---|
|
|
|---|
The "star paradox" in phylogenetics is the tendency for a particular resolved tree to be sometimes strongly supported even when the data is generated by an unresolved ("star") tree. There have been contrary claims as to whether this phenomenon persists when very long sequences are considered. This note settles one aspect of this debate by proving mathematically that the chance that a resolved tree could be strongly supported stays above some strictly positive number, even as the length of the sequences becomes very large.
Key Words: phylogenetic trees Bayesian statistics star trees
| Introduction |
|---|
|
|
|---|
Two recent papers (Yang and Rannala 2005
"Analysis of ideal data sets does not indicate what will happen when very large data sets with some stochastic error are analyzed, but it does show that when infinite data are generated on a star tree, posterior probabilities are predictable, equally supporting each possible resolved tree."
Yang and Rannala (2005)
had attempted to simulate the large sample posterior distribution but ran into numerical problems and commented that it was "unclear" what the limiting distribution on posterior probabilities was as n became large.
In particular, all of the aforementioned papers have left open an interesting statistical question, which this short note formally answersnamely, does the Bayesian posterior probability of the 3 resolutions of a star tree on 3 taxa converge to 1/3 as the sequence length tends to infinity? That is, does the distribution on posterior probabilities for "very long sequences" converge on the distribution for infinite length sequences? We show that for most reasonable priors it does not. Thus the star paradox does not disappear as the sequences get longer.
As noted by Yang and Rannala (2005)
and Lewis et al. (2005)
, one can demonstrate such phenomena more easily for related simpler processes such as coin tossing (particularly if one imposes a particular prior). Here, we avoid this simplification as such results do not rigorously establish corresponding phenomena in the phylogenetic setting, which in contrast to coin tossing involves considering a parameter space of dimension greater than 1 (moreover, as we will see there is a complication that arises in the phylogenetic problem that is entirely absent from the coin-tossing problem). We also frame our main result so that it applies to a fairly general class of priors. Whether or not the star paradox is a practical concern for biologists is likely to depend heavily on the data, the priors, and the methods used to establish Bayesian posterior probabilities. The purpose of this paper is merely to indicate that from a mathematical perspective there is no reason to think that the star paradox will automatically vanish given long enough sequences. Some further comments and earlier references on the phenomenon have been described in the recent review paper by Alfaro and Holder (2006
, p. 3536).
| Analysis of the Star Tree Paradox for 3 Taxa |
|---|
|
|
|---|
On tree T1 (in fig. 1), let pi = pi(t0, t1), i = 0, 1, 2, 3, denote the probabilities of the 4 site patterns (xxx, xxy, yxx, xyx, respectively) under the simple 2-state symmetric Markov process (the argument extends to more general models, but it suffices to demonstrate the phenomena for this simple model). From equation (2) of Yang and Rannala (2005)
|
|
|
|
|
|
|
| (1) |
pi(t0, t1) with strict inequality unless t0 = 0 (or in the limit as t1 tends to infinity).
|
To allow maximal generality, we make only minimal assumptions about the prior distribution on trees and branch lengths. Namely, we assume that the 3 resolved trees on 3 leaves (trees T1, T2, T3 in fig. 1) have equal prior probability
and that the prior distribution on branch lengths t0, t1 is the same for each tree with a smooth joint probability density function that is bounded and everywhere nonzero. This condition applies, for example, to the exponential prior discussed by Yang and Rannala (2005)
Let n = (n0, n1, n2, n3) be the counts of the different types of site patterns (corresponding to the same patterns as for the Pi's). Thus n = 
ni is the total number of sites (i.e., the length of the sequences). Given a prior distribution on (t0, t1) for the branch lengths of Ti (for i = 1, 2, 3) let
be the posterior probability of tree Ti given the site pattern counts n. Now suppose the n sites are generated on a star tree T0 with positive branch lengths. We are interested in whether the posterior probability
could be close to 1 or whether the chance of generating data with this property goes to zero as the sequence length gets very large. We show that in fact the latter possibility is ruled out by our main result, namely, the following:
Theorem 2.1
Consider sequences of length n generated by a star tree T0 on 3 taxa with strictly positive edge length tand let n be the resulting data (in terms of site pattern counts). Consider any prior on the 3 resolved trees (T1, T2, T3) and their branch lengths that is tame (as defined above). For any
> 0, and each resolved tree Ti (i = 1, 2, 3), the probability that n has the property that
does not converge to 0 as n tends to infinity.
Proof of Theorem 2.1
Because of the considerable details involved in the proof, we present a brief, intuitive outline of the argument.
Firstly, we will show that the probability that a star tree generates a "moderate" excess of site patterns favoring any 1 of the 3 resolved trees stays bounded above zero as n (the sequence length) goes to infinity. Here, by moderate we mean that the excess is in the order of the square root of n (the standard deviation for site pattern counts). Conditioning on this event (and some related moderate events, collectively called Fc below), we would like to show that a given tree T has unusually high posterior probability for this data; this can be simplified to showing that the ratio of the posterior probability of the given resolved tree to either of the other 2 resolved trees is large. These ratios can in turn be conveniently expressed as a ratio of expectations of 2 closely related random variables (X, Yeq. 4). A crucial observation (based on symmetry considerations) is that X and Y are random variables determined just by T and the prior on its branch lengths (t0, t1)that is, we have reduced a problem involving 3 resolved trees and their branch length priors to an analysis of 2 quantities involving a single resolved tree and its branch length prior. To analyze the ratio of expectations (in the hope of showing it is large), it helps to focus on the distribution of the random variables (P0, P1) induced by the prior on (t0, t1) (later it is helpful to consider a further derived pair of random variables (P0, Z)). We do not need to calculate these distributions explicitly; indeed, working over the general class of priors is curiously helpful here, as it avoids the temptation to get bogged down in the detailed analysis of a particular distribution.
Considering the ratio of conditional expectations according to a distribution on (P0, P1) leads to a second helpful observation: if we condition on P0 taking a particular value, say p0, then p0 cancels out of the ratio (eq. 13), reducing what began as a 2-dimensional problem to a family of 1-dimensional computations. At this point a complication arises that requires some care to resolve (and which does not arise in the much simpler [fair] coin-tossing problem). Namely, the ratio of conditional probabilities is not always large (e.g., if P0 is conditioned to equal a value that approaches
which corresponds to long branches [site saturation], then all 3 resolved trees have approximately the same posterior probability for any data). Nevertheless, we are assuming that the data is generated by a star tree with finite nonzero branch lengths and so the probability q0 of the unvaried pattern (xxx) is a fixed number strictly between
and 1. If P0 were to differ from q0 too much then (for long sequences that satisfy the moderate events mentioned above) the posterior probability of any resolved tree conditional on such an extreme P0 value would be small, compared with a conditioned value of P0 close to q0this is, informally, what Claim (i) in the proof below says. Moreover, if P0 is conditioned to equal a value close to q0, then the ratio of conditional expectations can be shown to be large. This is essentially Claim (ii) in the proof below. These 2 claims can be combined by Lemma 2.2 to handle the complication described, and thereby, establish what we required, namely, that the ratio of (unconditional) expectations is large.
We now proceed to the formal detailsthe proofs of Claims (i) and (ii) and Lemma 2.2 are deferred to the Appendix.
Consider the star tree T0 with given branch lengths t
(as in fig. 1). Let (q0, q1, q2, q3) denote the probability of the 2 types of site patterns generated by T0 with these branch lengths. Note that q1 = q2 = q3 and so q0 = 1 3q1). Suppose we generate n sites on this tree, and let n0, n1, n2, n3 be the counts of the different types of site patterns (corresponding to the pi's).
and for i = 1, 2, 3, let
|
|
|
|
1
[2c, 4c] because
1 +
2 +
3 = 0. By standard stochastic arguments (based on the asymptotic approximation of the multinomial distribution by the multinormal distribution) event Fc has probability at least some value
' =
'(c) > 0 for all n sufficiently large (relative to c).
Given the data n = (n0, n1, n2, n3) write
for the probability of n assuming the data was generated on tree Ti with branch lengths t0, t1. The assumption of equality of priors across T1, T2, and T3 implies that
|
| (2) |
|
| (3) |
Now, as (t0, t1) are random variables with some prior density, when we view p0, p1, p2, p3 as random variables by virtue of their dependence on (t0, t1), we will write them as P0, P1, P2, P3 (note that Yang and Rannala [2005]
use Pi differently). With this notation, the posterior probability of T1 conditional on n can be written as
|
|
|
|
|
| (4) |
|
|
Lemma 2.2
Let X, Y be nonnegative continuous random variables, dependent on a third random variablethat takes values in an interval I = [a, b]. Suppose that for some interval I0 strictly inside I, and I1 = I I0 the following inequality holds:
and that for some constant B > 0, and all
(5) ![]()
I0,
Then,
(6) To apply this lemma, select a value s > 0 so that
Then let
Claim:
Let c > 1. Then conditional on the data n = (n0, n1, n2, n3) satisfying Fc, and for n sufficiently large, the following two inequalities hold:The proofs of these claims are given in the Appendix.
- (i)
- (ii) For all
To apply these claims, select
(this is finite by the assumption that the prior on (t0, t1) is smooth and everywhere nonzero). Then,
Combining Lemma 2.2 with Claims (i) and (ii) for this value of c we deduce that conditional on n satisfying Fc and for n sufficiently large,
As stated before, the probability that n satisfies Fc is at least
(7) ' =
'(c) > 0 for n sufficiently large, and so, by equation (4) and equation (7), we have
Similarly,
Now, because
it follows that, for n sufficiently large, and conditional an event of probability at least
' > 0, that
as claimed. This completes the proof.
| Concluding Remarks |
|---|
|
|
|---|
One feature of the argument we have provided is that it does not require stipulating in advance a particular prior on the branch lengthsthat is, the result is somewhat generic as it imposes relatively few conditions. Moreover, it seems possible to weaken these even further. For example, the requirement that the prior on (t0, t1) be everywhere nonzero could be weakened to simply being nonzero in a neighborhood of (0, t
) (thereby allowing, e.g., a uniform distribution on bounded range).
An interesting open question in the spirit of this paper is to explicitly calculate the limit of the posterior density f(P1, P2, P3) described in (Yang and Rannala 2005
). It may also be of interest to study posterior support for resolved trees when one weakens the molecular clock assumption on the star tree that generates the data. For example, one could imagine combinations of (non-clocklike) branch lengths that may lead to more frequent and/or stronger support for particular resolved trees.
| Appendix: Proof of Lemma 2.2 and Claims (i) and (ii) |
|---|
|
|
|---|
Proof of Lemma 2.2:
For W = X, Y we haveIn particular, for W = X we have:
(8) Note that equation (6) implies that
so
Taking W = Y in equation (8) and applying equation (5) gives us
(9) which combined with equation (9) gives the result.
Proof of Claim (i),
We will first bound
above. Let µ (n) = (q
q
q
q
)n. Now, conditional on n satisfying Fc we have
where p = (p0, p1, p2, p3) and q = (q0, q1, q2, q3), and dKL denotes KullbackLeibler distance. Now,
(10) (the first inequality is a standard one in probability theory). In particular, if p0
I1, then |q0 p0|>s>0. Moreover,
The right hand side can then be bounded above by rearranging equation (10) and using the lower bound on the KullbackLeibler distance, giving
In the reverse direction, we have
(11) where
and
for (p0, p1, p2) determined by some
(12) Now, the first term of the product in equation (12) converges to a constant as n grows (because p0 q0, p1 q1, and p2 q1 are each of order n1), whereas the condition Fc ensures that the second term decays no faster than
for a constant C1. Thus,
for a positive constant C2. The term B(n) is asymptotically proportional to n2. Summarizing, for a constant C3 > 0 (dependent just on t
)
which combined with equation (11) establishes Claim (i) for n sufficiently large.
In order to prove Claim (ii), we need some preliminary results.
Lemma 3.1
Let< 1. Then for each x > 0 there exists a value K = K(x) <
that depends continuously on x so that the following holds. For any continuous random variable Z on [0, 1] with a smooth density function f that satisfies f(1)
0 and |f'(z)| < B for all z
(
, 1], we have
for all
Proof.
LetThen,
Now,
where
denotes asymptotic equivalence (i.e., f (k)
g(k) if limk
f (k) / g (k) = 1). Using integration by parts,
Now, provided k > (1
)2, we have tk >
and so the absolute value of the second term on the right is at most
Consequently,
is bounded above by B times a term of order k2. A similar argument, again using integration by parts, shows that
is bounded above by B times a term of order k2, and the lemma now follows by some routine analysis.
Lemma 3.2
Let y = (1 + 2x)(1 x)2. Then, for x[0, 1) and m
3 we have
Proof.
and m2(1 y) = m2(3x2 2x3)
3m2x2, and for m
3 this upper bound is less or equal to the lower bound in the previous expression.
Proof of Claim (ii), for all p0
I0,
:
Write
as shorthand for
Note that, for any r, s > 0,
Consequently, if we let
then, by definition of the
i's,
(13)
Now, conditional on n satisfying Fc (and because P1
P2) the following 2 inequalities hold (recalling that
1 +
2 +
3 = 0),
![]() |
![]() | (14) |
Let U = (P1 P2) / (1 P0), which takes values between 0 and 1 because P1
P2. Because P1 + 2P2 = 1 P0, we can write
Thus,
and
Substituting these into equation (14), letting Z = (1 + 2U)(1 U)2 and noting that
gives
![]() |
3,
|
| (15) |
(P0, Z) is a smooth invertible mapping between (0,
)2 and its image within (
, 1) x (0, 1). Notice that Z becomes concentrated about 1 whenever P0 approaches
. However, for p0 in the interval I0 the conditional density f(Z|P0 = p0) is smooth, bounded away from 0, and its first derivative is also uniformly bounded above over this interval. Consequently, we may apply Lemma 3.1 (noting that the condition that n satisfies Fc ensures that
I0, |
|
| Acknowledgements |
|---|
|
|
|---|
M.S. thanks Ziheng Yang for suggesting the problem of computing the limiting distribution of posterior probabilities for 3-taxon trees. We also thank Paul O. Lewis and the anonymous referee for their helpful comments. This work is funded by the Allan Wilson Centre for Molecular Ecology and Evolution.
| Footnotes |
|---|
1 Present address: Biomathematics Research Centre, Department of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
Herve Philippe, Associate Editor
| References |
|---|
|
|
|---|
Alfaro ME and Holder MT. (2006) The posterior and prior in Bayesian phylogenetics. Annu Rev Evol Syst 37:1942.
Kolaczkowski B and Thornton JW. (2006) Is there a star tree paradox? Mol Biol Evol 23:18191823.
Lewis PO, Holder MT, Holsinger KE. (2005) Polytomies and Bayesian phylogenetic inference. Syst Biol 54:2241253.[CrossRef][ISI][Medline]
Yang Z and Rannala B. (2005) Branch-length prior influences Bayesian posterior probability of phylogeny. Syst Biol 54:3455470.[CrossRef][ISI][Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
L. Shavit, D. Penny, M. D. Hendy, and B. R. Holland The Problem of Rooting Rapid Radiations Mol. Biol. Evol., November 1, 2007; 24(11): 2400 - 2411. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Yang Fair-Balance Paradox, Star-tree Paradox, and Bayesian Phylogenetics Mol. Biol. Evol., August 1, 2007; 24(8): 1639 - 1655. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

> 0, and each resolved tree Ti (i = 1, 2, 3), the probability that n has the property that
that takes values in an interval I = [a, b]. Suppose that for some interval I0 strictly inside I, and I1 = I I0 the following inequality holds:

q
q
q
)n. Now, conditional on n satisfying Fc we have
< 1. Then for each x > 0 there exists a value K = K(x) <
0 and |f'(z)| < B for all z
denotes asymptotic equivalence (i.e., f (k) 

3m2x2, and for m 


