MBE Advance Access originally published online on September 17, 2008
Molecular Biology and Evolution 2008 25(12):2615-2626; doi:10.1093/molbev/msn209
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Conditional Gene Genealogies under Strong Purifying Selection
Department of Organismic and Evolutionary Biology, Harvard University
E-mail: wakeley{at}fas.harvard.edu.
| Abstract |
|---|
|
|
|---|
The ancestral selection graph, conditioned on the allelic types in the sample, is used to obtain a limiting gene genealogical process under strong selection. In an equilibrium, two-allele system with strong selection, neutral gene genealogies are predicted for random samples and for samples containing at most one unfavorable allele. Samples containing more than one unfavorable allele have gene genealogies that differ greatly from neutral predictions. However, they are related to neutral gene genealogies via the well-known Ewens sampling formula. Simulations show rapid convergence to limiting analytical predictions as the strength of selection increases. These results extend the idea of a soft selective sweep to deleterious alleles and have implications for the interpretation of polymorphism among disease-causing alleles in humans.
Key Words: natural selection coalescent ancestral graph Ewens sampling formula
| Introduction |
|---|
|
|
|---|
Hermisson and Pennings recently put forward the idea of a "soft selective sweep" and also studied the properties of these interesting events (Hermisson and Pennings 2005
Strong selection against one allele and in favor of another is modeled here using a coalescent approach (Kingman 1982a
, 1982b
; Hudson 1983
; Tajima 1983
). In general, it has proven difficult to incorporate selection into the coalescent and still maintain the analytical tractability and ease of application of the model. This is because different selected alleles have different rates of coalescence—they are not "exchangeable" (Cannings 1974
; Kingman 1982c
; Aldous 1985
)—and because the frequencies of alleles change over time by selection, mutation, and drift. One approach to the coalescent with selection is to condition rates of coalescence on allele frequencies and model changes in allele frequencies explicitly (Kaplan et al. 1988
; Barton et al. 2004
). A second approach, called the ancestral selection graph, was proposed by Krone and Neuhauser (1997)
and provides the framework for the work presented here.
The ancestral selection graph arises via an augmentation of the typical forward-time models of population genetics, the same models that yield the standard Wright–Fisher diffusion (Ewens 2004
). The augmented model is built in two layers. For the first layer, an exchangeable forward-time population process (obtained by imagining that the population is composed entirely of the fittest genotype) is run for an infinitely long time to produce a large graph. This graph contains all the ancestor–descendant relationships, birth and death events, etc., that occurred in the population. For the second layer, a small fraction of birth events are marked as being available only to the fittest genotype. Selection is enforced in a second run through the graph, in which allelic states are assigned and their fates are followed forward in time. Less-fit alleles are barred from using the marked birth events. The properties of the original model are preserved, but the inclusion of an exchangeable process in the two-layer model greatly facilitates the study of gene genealogies.
Although the population model behind the ancestral selection graph has been generalized considerably (Neuhauser and Krone 1997
; Neuhauser 1999
; Fearnhead 2006
), for simplicity, the present work is based on the original formulation of Krone and Neuhauser (1997)
. Two alleles, A1 and A2, experience mutations with probability u per generation, and allele A2 has fitness 1 + s relative to A1. The standard diffusion assumptions are made: time is measured in proportion to N generations and N tends to infinity, whereas u and s tend to zero, such that the dynamics depend on scaled mutation and selection parameters
and
. There is no recombination within the locus. Krone and Neuhauser (1997)
assumed a haploid Moran model of reproduction (Moran 1958
, 1962
), in which
= Nu and
= Ns, but the ancestral selection graph should hold for any model that has the standard diffusion as its limit. It may be assumed, without loss of generality, that A2 is the fitter of the two alleles (
> 0).
One small modification is made, which is to allow for asymmetric mutation in the following way. When a mutation occurs, it has probability
1 of producing an A1 allele and probability
2 = 1 –
1 of producing an A2 allele. Any asymmetric, two-allele model can be represented in this way, and such "parent-independent" mutation generalizes readily to multiple alleles; for example, see Stephens and Donnelly (2003)
. Note that this means some mutation events are "empty" (Baake and Bialowons 2008
), in the sense that they do not change the allelic type. This further augmentation of the model allows for a simplification (Fearnhead 2002
) that will be important in what follows.
When run for a long time, which is from an essentially infinite time in the past to the present, this two-allele population process will reach a statistical equilibrium. Here, the present time is defined as time t = 0 and the past as times t > 0. Analysis of this model, or its two-layer counterpart, shows that the frequency of A1 in the population at equilibrium, at time t = 0, has distribution
|
| (1) |
(Wright 1931
A sample of size n, taken from the population at the present time zero, contains n1 copies of allele A1 and n2 = n – n1 copies of allele A2 with probability
|
| (2) |
Crucially for the ancestral selection graph, equation (2) also holds at any time t in the past when there are n lineages ancestral to a present-day sample, subject to certain conditions which can be found in Donnelly and Kurtz (1999)
. Intuitively, this follows from the fact that the population has been evolving for an infinite length of time even before t and because present-day samples are taken at random with respect to genetic variation or any events that have occurred in the population.
The ancestral selection graph is obtained by following a random sample of genetic lineages from the present back into the past under the two-layer population model. Initially, the allelic types of the samples are not specified, and an ancestral graph is obtained by tracing back through the exchangeable population process. This proceeds from time zero back to the ultimate ancestor of the sample, which is reached the first time the entire sample is descended from a single lineage (Krone and Neuhauser 1997
). Each ancestral lineage experiences mutations at rate
/2, each pair of lineages coalesces with rate 1, and each lineage "branches" with rate
/2.
Branching events correspond to the marked birth events described above. When a branching event occurs, the lineage that experiences it splits into two lineages. Thus, the number of ancestral lineages can increase as they are followed back in time. Branching events capture the effect of selection in favor of A2. They must be included in the graph in order to have an ancestral process in which lineages are initially exchangeable. They are resolved in the second run through the graph, with allelic states specified, such that the gene genealogy of the sample is a bifurcating tree and A2 enjoys a higher fitness than A1. In order to resolve branching events and retrieve the gene genealogy of the sample, one of the two lineages emanating from each branching event is labeled the "incoming branch" and the other is labeled the "continuing branch" (Krone and Neuhauser 1997
). Only one of these will be included in the gene genealogy of the sample. There are four possible values for the allelic states (I, C) of the incoming and continuing branches—(A1, A1), (A1, A2), (A2, A1), and (A2, A2)—but these are not specified in the initial construction of the graph.
When the ultimate ancestor is reached, its type is drawn from the distribution h(x) and the lineages are traced forward in time to the present-day sample, changing type as needed when mutation events are encountered. Branching events are resolved as follows. If I = A2, then the incoming branch replaces the continuing branch, and the allelic state of the descendent lineage is A2. If I = A1, then the incoming branch does not replace the continuing branch. Instead, the descendent lineage inherits the state and ancestry of the continuing branch. Nonancestral lineages are discarded, and the result is a sample drawn from the joint distribution of allelic states and gene genealogies (Krone and Neuhauser 1997
).
The utility of the ancestral selection graph is not that it generates samples with allelic states in proportion to their probabilities, this is known and given by equation (2), but rather that it provides a tool for investigating the properties of gene genealogies under selection. However, the presence of branching events makes analysis and simulation difficult. Fortunately, the ancestral selection graph can also be used to model the ancestry of a sample conditional on allelic types (Slade 2000a
, 2000b
), and in this case, the problem of multiplying ancestral lineages is not so severe. Following the work of Slade (2000b)
, Fearnhead (2002)
, Stephens and Donnelly (2003)
, and Baake and Bialowons (2008)
, it is possible to describe a conditional ancestral selection graph in which branching events are minimized and in which superfluous lineages can be discarded upon mutation.
In this work, the conditional ancestral selection graph is used to investigate the analytical properties of gene genealogies of samples of known allelic type in the case when selection is very strong. This case has not yet been considered in the literature, likely due to the explosion of lineages which occurs in the unconditional ancestral selection graph when the selection parameter
is very large. In addition, simulations show a relatively small effect of selection on gene genealogies of random samples (Golding 1997
; Neuhauser and Krone 1997
; Przeworski et al. 1999
) and of samples of known allelic type when selection is moderate (Slade 2000a
, 2000b
), so it is of interest to investigate a case where selection should have a dramatic effect on the gene genealogy.
In the limit
, the ancestry of a random sample or a sample containing at most one deleterious allele (A1) is shown to be neutral. In contrast, genealogies of samples containing more than one strongly deleterious allele are very different than neutral genealogies. Their structure can be described and is identical to that of a soft selective sweep. That is, the distribution of the number of independently derived, deleterious mutant alleles in the sample is given by the Ewens sampling formula (Ewens 1972
). Interestingly, this is identical to the result for a different limiting model, of strong mutation–selection balance (Hartl and Campbell 1982
; Sawyer 1983
), that Reich and Lander (2001)
used in their interpretation of allelic diversity of human disease. Simulations show a rapid approach to limiting analytical predictions, occurring between about
= 1 and
= 100.
| Methods and Results |
|---|
|
|
|---|
The notion of "real" and "virtual" lineages (Krone and Neuhauser 1997
The conditional ancestral selection graph is derived from the fundamental recursive equation (Krone and Neuhauser 1997
; Slade 2000a
; Fearnhead 2002
) for the probability that an ordered set of lineages is composed of n1 real and v1 virtual lineages of allelic type A1 and n2 real and v2 virtual lineages of allelic type A2. This state is denoted (n1, n2, v1, v2), so that the sample itself would be represented as (n1, n2, 0, 0). The basic approach is to condition on the first step back in the ancestry of n = n1 + n2 + v1 + v2 lineages in the exchangeable process (the first layer of the ancestral selection graph) and to consider which patterns of ancestral states would produce the configuration (n1, n2, v1, v2) given each possible event. Equation (A1) in the Appendix gives the basic recursion. It is a straightforward application of ideas that are discussed in detail elsewhere (Krone and Neuhauser 1997
; Slade 2000b
; Fearnhead 2002
; Stephens and Donnelly 2003
; Baake and Bialowons 2008
), but the Appendix also includes a discussion of each term.
The probabilities in equation (A1) can be computed by a simple extension of equation (2),
|
| (3) |
+
+ n + v – 1)/2, where n = n1 + n2 and v = v1 + v2. A reduced conditional ancestral process is possible because two kinds of events allow a virtual lineage to be discarded and may be filtered out of the process (Slade 2000b
Slade (2000b)
found that if a branching event occurs in which the incoming branch has type I = A2 and the ancestral lineages not involved in the event all have the correct allelic types to produce the configuration (n1, n2, v1, v2), then it is unnecessary to create a new virtual lineage with state C. In particular, both C = A1 and C = A2 would yield the correct allelic types of the descendent lineages. Algebraically, these two possibilities can be collected and the simplification p(n1, n2, v1 + 1, v2) + p(n1, n2, v1, v2 + 1) = p(n1, n2, v1, v2) may be applied in lines six and eight of equation (A1).
Fearnhead (2002)
showed, similarly, that when a mutation event occurs on a virtual lineage, that lineage may be discarded. This follows from the assumption of parent-independent mutation, in which the parental allele may be of either type. Algebraically, p(n1, n2, v1, v2) + p(n1, n2, v1 – 1, v2 + 1) = p(n1, n2, v1 – 1, v2) and p(n1, n2, v1 + 1, v2 – 1) + p(n1, n2, v1, v2) = p(n1, n2, v1, v2 – 1), which may be applied in lines three and four, respectively, of equation (A1). Note that the corresponding algebraic simplifications will not be used in the first two lines of equation (A1) because the specific objects of study here are the allelic states of, and relationships among, the real lineages ancestral to the sample.
An ancestral process conditional on the allelic types is obtained by implementing these simplifications in equation (A1), collecting all the terms involving p(n1, n2, v1, v2) on the left-hand side, then dividing both sides by the result. The ancestral process thus obtained is akin to the one described by Stephens and Donnelly (2003)
but minimizes the number of virtual branches that need to be added to the graph. The total rate of events is given by
![]() | (4) |
and
and again n = n1 + n2 and v = v1 + v2. The result of these manipulations to equation (A1) is
![]() | (5) |
A2 mutation event on a real lineage, an A2
A1 mutation event on a real lineage, a branching event on any lineage in which the ancestor is always a virtual A1 lineage, and loss of a virtual A1 lineage by mutation or coalescence. The conditional ancestral process is a Markov jump chain which remains in state
for an exponentially distributed length of time, with mean
, and then jumps to a new state according to these six probabilities. This is a straightforward generalization of the algorithms in Fearnhead (2002)
The only case in which equation (5) leads easily to analytical results is when
= 0. In this neutral case, no virtual lineages of either type are produced, and the sample may be represented simply by (n1, n2). For small samples, the Markov jump chain gives simple systems of equations that can be solved for quantities of interest. For example, the expected times to common ancestry for the three possible samples of size two can be shown to be
![]() | (6) |
Gene Genealogies under Strong Selection
The transition probabilities in the conditional ancestral process described above, and the ones given by equation (A1) in the Appendix, depend on ratios of sampling probabilities. As noted by Stephens and Donnelly (2003)
, some time may be saved in performing simulations because the constant B in equation (1) cancels in these ratios and, thus, does not need to be calculated. Further, the presence of these ratios changes the probabilities of events substantially when selection is strong because unfit (A1) alleles are unlikely to be sampled. For example, the rate of branching is reduced in equation (5) because the ratio p(n1, n2, v1 + 1)/p(n1, n2, v1) becomes very small when
is large. Fairly simple expressions for these ratios are available when
is large, and this makes it possible to describe a limiting
conditional ancestral process.
The analysis follows from a uniform asymptotic (large
) expansion of Kummer's confluent hypergeometric function, which here is denoted 1F1[a;b; –
]. Specifically, if a > 0, b > 0, and
> 0, which will all be true here, then from equations 3.1.2 and 4.1.2 in Slater (1960)
,
![]() | (7) |
![]() | (8) |
For two sample configurations (
) and (n1, n2, v1), let
,
, c = 
1 + n1 + v1, and d =
+ n1 + n2 + v1. Equation (8) gives
![]() | (9) |
|
| (10) |
.
The rate
in equation (4) is the total rate of events in the conditional ancestral process, which occurs on a time scale proportional to N generations. Examination of
shows that the limiting conditional ancestral process for the present-day sample (n1, n2, 0) must be analyzed separately for n1 > 0 and for n1 = 0. In the first case, when there is at least one deleterious allele in the sample, the total rate of events depends linearly on
. Then, when
is large, the waiting time to an event will be of order
–1. In contrast, when n1 = 0, so that the sample contains only fit alleles, the total rate of events is
In this case, the waiting time to an event does not depend on
. Instead, it is of order 1 when
is large. This leads to a "separation of time scales" that is key to the analysis of equation (5) for large
.
Separation of Times Scales: Fast Processes
In the first case, when n1 > 0 (and v1 = 0), equation (10) allows for the following simplification of equation (5). Taking the limit of equation (5) as
, or ignoring terms of order
–1 and smaller, and simplifying give
|
| (11) |
–1 or smaller because they either lead to the production of an additional unfit allele or simply because
is of order
. Note that in this (n1 > 0) limiting process, all the lineages are real, no virtual lineages are produced, and no events occur among the A2 lineages, if there are any in the sample.
The two probabilities in equation (11) are identical in form to those of a fundamental stochastic process in population genetics, namely the process of tracing the ancestry of a sample under the infinite-alleles mutation model that gives the Ewens sampling formula (Ewens 1972
). Note that whichever event occurs above, the number of A1 lineages decreases by one. Then equation (11) may be reapplied with n1
n1 – 1, and so on, continuing until no A1 lineages remain. Counting the number, K, of (A1
A2) mutations leads to
|
| (12) |
is an unsigned Stirling number of the first kind. Equation (12) is identical to the probability function for the number of alleles in the Ewens sampling formula (Ewens 1972
1, gives the probability function for the numbers of A1 descendants in the sample of each of these k ancestral A2 alleles. As mentioned above, this result is identical to the result for soft selective sweeps; for example, see Pennings and Hermisson (2006a)
Again, the amount of time this takes will be of order
–1, which is negligible on the coalescent time scale of the ancestral selection graph, where one unit of time is proportional to N generations (N2/2 steps in the discrete Moran model). Thus, a sample containing n1 copies of A1, where n1 > 0 and n2 copies of A2 will quickly be converted into an ancestral sample of K + n2 real A2 lineages, where K is a random variable with 1 < K < n1 and the probability function P(K = k|n1) above.
Separation of Times Scales: Slow Processes
Of course, the above is only part of the ancestry of the sample. It is still necessary to trace the ancestry of the resulting k + n2 real A2 lineages back to their most recent common ancestor. For an ancestral sample of this sort, or for a present-day sample containing only fit alleles, a different limiting process arises. Without loss of generality, k may be omitted for simplicity, and the sample may be represented as (n1 = 0, n2 > 0, v1 = 0). In this case, using equations (10) and (5) and taking the limit, or ignoring terms of order
–1 and smaller, gives
|
|
The creation of this virtual A1 lineage induces a third case, similar to the case n1 > 0 above, but in which a different type of "fast" event is possible: the annihilation of the virtual lineage by mutation. When the ancestral configuration is (n1 = 0, n2, v1 = 1), the total rate of events is again of order
. An analysis like those above shows that the last term in equation (5) is 1 + O(
–1), and all other terms are O(
–1). In the limit
, the virtual lineage is annihilated with probability equal to one, and this happens in a negligible amount of time. The sample thus reverts immediately to state (n1 = 0, n2, v1 = 0) and the above "slow" process resumes.
This shows that a present-day sample or ancestral configuration comprised only of n2 copies of the fit allele, A2, undergoes a filtered ancestral process in which branching events occur, but the resulting virtual A1 lineages are instantly removed. Eventually, with probability equal to one in the limit
, a coalescent event will occur between two of the n2 A2 alleles. The waiting time to this event is exponentially distributed with rate equal to the total rate of events times the probability that the event is a coalescent event or
|
|
–1.
Comparing Analytical Predictions to Simulations
In the limiting ancestral process, analytical predictions can be made for any quantity of interest simply by conditioning on K, the number of A2 ancestors of the n1 copies of allele A1 in the sample. For example, the total length of the gene genealogy of the sample (n1, n2) is given by
|
| (13) |
K
n1, the expected length of the gene genealogy above is less than or equal to the neutral expectation for a sample of size n = n1 + n2. Due to equation (12), it will be greatest when
is large, so that K is equal to n1 with high probability. When
is small, E[Ttotal] will be close to the neutral expected value for a sample of size n = 1 + n2.
Further, let Ti be the time during which there are i lineages ancestral to the sample. Under the standard neutral coalescent, Ti has expectation 2/(i(i – 1)), and i ranges from 2 to n. Under the limiting conditional ancestral selection graph, the expected value is
|
| (14) |
i
n2, the expected value of Ti for any sample is given exactly by the neutral expected value. For samples in which n1 > 0, the expected value of
is also given by the neutral expected value because there must be at least one A2 ancestor of the n1 A1 alleles. On the other hand, the expected value of Ti may be much smaller than the neutral expected value for values of i such that n2 + 1 < i
n1 + n2 because the fast coalescence/mutation process described above may cause some coalescent intervals to be skipped with high probability. That is, given a value of K = k, Ti will be equal to zero for n2 + k < i
n1 + n2.
The process described by equation (5) is straightforward to simulate. The simulations presented below were done in Mathematica (Wolfram 1999)
, version 5.2. The Mathematica notebook used to generate the results is available from the author upon request. A single simulation run begins with a sample (n1, n2, 0) and ends in a random configuration, either (1, 0, v1) or (0, 1, v1), in which there is only one real lineage, which is the most recent common ancestor of the sample. Exponential waiting times with means
are generated conditional upon each configuration of ancestral lineages encountered, and transitions are implemented stochastically according to the probabilities in equation (5). This algorithm is very similar to that of Stephens and Donnelly (2003)
, the only difference being that the simplifications due to Slade (2000b)
and Fearnhead (2002)
have been implemented here but were not used in Stephens and Donnelly (2003)
.
The two main aims of the simulations are to assess the convergence of various quantities to the limiting
predictions, such as equations (13) and (14) above, and to illustrate how conditional gene genealogies depend on
. The program was also tested against available analytical results. In particular, with
= 0, the average pairwise coalescence times become closer and closer to the predictions of equation (6) as the number of replicates increases (results not shown). Further, when
is very large, the simulation conforms to limiting (
) analytical predictions. Simulation results change monotonically between these two extremes, but no analytical predictions are available for arbitrary
.
Results are presented for samples of size 10, for three different sampling configurations: a sample containing only A1 alleles (10, 0), a sample split evenly between A1 and A2 alleles (5, 5), and a sample containing only A2 alleles (0, 10). In all cases,
= 1 and
1 =
2 = 0.5, and 200,000 replicates were done to produce each result. The exception to this is figure 4C, which required a larger number of replicates. In this case, 1 million replicates were done for each combination of parameters. The average values of four quantities were computed for 17 values of
, from 10–3 to 105, evenly spaced on a log scale. The speed of these simulations is greatly improved by tabling values of equation (7) for each value of
(and
,
1,
2). The 17 million replicates that produced figure 4C took 22 h on a Macintosh 1.5 GHz PowerPC G4.
|
Simulation Results
Figure 1 shows the average total length of the gene genealogy of the sample, that is, the sum of the lengths of all the real branches in the ancestry, back to the most recent common ancestor of the sample. On the left, as
decreases, the values converge on neutral expectations. In this case, because of the dependence on mutation and consistent with equation (6), the value for the sample (5, 5) is larger than the values for the samples (10, 0) and (0, 10). On the right, the values fit the limiting
predictions well, which in this case are given by 1.82, 4.82, and 5.66 for samples (10, 0), (5, 5), and (0, 10), respectively. The most rapid change in values occurs between
= 1 and
= 100. When
< 0.1, the behavior is very close to that of the neutral model, and when
> 1000, the behavior is very close to that of the limiting
model. Interestingly, these "cutoffs" of
appear insensitive to the value of
(results not shown).
|
Figure 2 shows the average fraction of the gene genealogy made up of lineages of type A1. For each simulation replicate, the total length of real A1 lineages was computed and divided by the total length of all real (A1 + A2) lineages for that replicate. These values were averaged over all simulation replicates. On the left, when
is small, the value for the sample (5, 5) is close to one-half because mutation is symmetric and the sample configuration is also symmetric. The unbalanced samples (10, 0) and (0, 10) have values close to 1 and 0, respectively, when
is small. Note that, when
becomes large, the effect of the sample configuration disappears and the values for all three samples converge on one-half or on
1 if mutation is asymmetric (results not shown).
|
On the right in Figure 2, when
is large, the average fraction of A1 lineages decreases to zero for any sample that contains at least one A2 allele, in this case samples (5, 5) and (0, 10). This illustrates that any A1 alleles in the sample will disappear rapidly as a result of the fast process described above as they coalesce or get converted to A2 alleles by mutation. In the special case that only A1 alleles are sampled, there is a chance, equal to P(K = 1|n1) in the limit, that all n1 copies will coalesce before the first A1
A2 mutation event. If this occurs, then the entire gene genealogy will be composed of A1 lineages, and the fraction will be equal to one, even though the total length of the tree may be very small. If K > 1, the fraction of A1 lineages will be negligible for the reason just discussed. Thus, the average fraction for the sample (10, 0) converges on P(K = 1|n1 = 10)
0.28 as
increases in Figure 2.
Figure 3 shows the average ratio of virtual branches to real branches in the ancestry of the sample back to the most recent common ancestor. As in Figure 2, the ratio is taken for each replicate and then averaged across replicates. For any sample, the largest numbers of virtual branches are generated when
is slightly less than 10. The total time of virtual branches is very small when
is either small or large. The intuition behind this is clear when
is small: selection is weak and few branching events occur. On the other hand, when
is large, virtual A1 branches will be created but then will be annihilated quickly by mutation. There is also a strong effect of sample type on the total length of virtual branches, with samples containing more copies of A2 having smaller numbers of virtual branches. Figure 3 demonstrates that the conditional ancestral selection graph, with transitions given by equation (5), does not suffer from the explosion of virtual lineages that plagues the unconditional ancestral selection graph. For these samples and parameter values, virtual branches never outnumber real branches, at least on average.
|
Figure 4 shows how the average time during which there are i lineages ancestral to the sample (2
i
n) compares to the neutral expectation for a random sample, E[Ti] = 2/(i(i – 1)). Thus, the plots are similar to skyline plots (Strimmer and Pybus 2001)
, and again, there are 17 of these ranging from 10–3 to 105. The different curves for small
are difficult to distinguish, as are those for large
. Thus, figure 4 displays the same sharp transition between the behaviors for small and large
seen in figures 1–3
values.
The bundle of lines at the top in figure 4A displays the behavior under neutrality but conditional on the sample being of one allelic type (A1) and with
= 1. Thus, the bundle of lines sits below one. As might be expected, a sample of all A2 displays the same behavior when
is small. This is shown in figure 4C, but in this case, the bundle of lines is at the bottom of the plot rather than at the top. Thus, increasing
has the opposite effect on relative coalescence times for the sample (0, 10) as it does for the sample (10, 0). In the case of (0, 10), shown in figure 4C, increasing
leads to more and more neutral looking gene genealogies, for the reasons discussed above. In the case of (10, 0), shown in figure 4A, increasing
leads to very short times for first 10 – K coalescent intervals looking back (i = 10, ..., K + 1), where K is the random variable with distribution given by equation (12). The bundle of lines at the bottom of figure 4A sits right on top of the limiting predictions obtained from equation (14).
Figure 4B shows the behavior for an evenly split sample. In this case, the bundle of lines for small
tends to lie above one, especially for the more ancient coalescent intervals (small i), because the sample requires at least one mutation before the most recent common ancestor can be reached. As
increases, the average coalescence times converge on the limiting predictions of equation (14). In this case, the n1 = 5 copies of A1 in the sample coalesce (and mutate) rapidly into K lineages of type A2, then the subsequent ancestry of the remaining n2 + K lineages is neutral. As K
1, there is at least one A2 ancestor of the five A1 alleles, so the times Ti are given by the neutral model for i = 2, 3, 4, 5, and 6.
| Discussion |
|---|
|
|
|---|
The ancestral selection graph is a mathematical tool for studying gene genealogies of alleles under selection. Due to the complicated nature of ancestral processes with selection, few analytical results are available. Krone and Neuhauser (1997)
= 0 or
= 0, and Neuhauser and Krone (1997)
for a given
. Here, using the conditional ancestral selection graph, it was shown that neutral gene genealogies also dominate when
. However, gene genealogies of samples which happen to contain some number of deleterious alleles are very different than neutral genealogies. Their ancestries consist of a two-phase process, in which the deleterious (A1) alleles in the sample quickly coalesce and mutate into a random number of advantageous (A2) alleles, and then the ancestry of those alleles and the rest of the sample is given by the neutral coalescent process. Interestingly, the Ewens sampling formula describes the result of the fast process.
As mentioned above, the results presented here are fundamentally similar to those for soft selective sweeps by recurrent mutation. Pennings and Hermisson (2006a)
describe how the Ewens sampling formula gives the probability function for the number of independent ancestral alleles of a sample of size n taken at the end of the sweep. Pennings and Hermisson (2006a)
assumed that each copy of the advantageous allele arises uniquely via one-way mutation from the background allele and that the allele-frequency trajectory of the advantageous allele follows the standard prediction for strong positive selection. However, they argue that the result should not depend on the shape of the allele-frequency trajectory as long as the sweep occurs quickly.
It is not surprising that the same result should be found, as it was here, for a strongly deleterious allele that happens to reach a large enough frequency to be observed in a sample. Such an allele would have arrived at high frequency by a sweep-like process (see further discussion below). Given the starting and ending frequency of an allele, some properties of the trajectory do not depend on whether the allele is advantageous or deleterious, but only on the absolute value of selection parameter, see Sections 4.6 and 5.4 in Ewens (2004)
. The comparison of the present results to those of Pennings and Hermisson (2006a)
is of value because it shows that deleterious alleles which are observed in a sample are no more or less likely to be derived from independent mutations than advantageous alleles which have undergone a sweep.
The results presented here also have implications for the interpretation of allelic diversity at human disease loci and make a contribution to ongoing modeling efforts in that area (Di Rienzo 2006)
. Hartl and Campbell (1982)
studied a model of classical mutation–selection balance, in which the frequency of alleles that cause a simple Mendelian disorder is held constant over time by strong mutation and selection and found an identical role for the Ewens sampling formula as that discovered here. Their model exists in the limit as N
with
and
but
/
constant (Sawyer 1983)
, meaning that mutation is a strong force that keeps the deleterious allele at an appreciable frequency in the population despite strong deleterious selection.
Pritchard (2001)
considered similar ideas in the context of complex diseases, where selection on particular alleles that cause susceptibility is expected to be weaker. He discussed the importance of the mutation–selection–drift equilibrium (eq.1) in interpreting the diversity of alleles at each locus that contributes to a complex disease. Again, equation (1) holds in the limit as N
with
and
constant. Pritchard (2001)
used simulations to study patterns of allelic diversity and estimated that the scaled rate of mutation to deleterious alleles—denoted 
1 here and βS in Pritchard (2001)
—is between about 0.1 and 5 for a typical locus.
Using a completely different approach and set of assumptions, Slatkin and Rannala (1997)
also showed that the diversity of alleles at a disease locus should follow the Ewens sampling formula. In particular, Slatkin and Rannala (1997)
used a birth–death process, forward in time, in which every copy of the allele reproduced independently. Their mathematical analysis did not require the alleles to be deleterious, but they focused on this case because they were interested in applications to disease. The method and results of Slatkin and Rannala (1997)
should be valid over a fairly broad range of parameter values, provided that the overall frequency of the alleles is small.
Slatkin and Rannala (1997)
also pointed out that the standard homozygosity test (Watterson 1978
; Slatkin 1994
, 1996
) could be used to detect deviations from the model, including different rates of mutation to different alleles, differential selection, differential penetrance, and changes in population size (in particular, growth). They rejected the null model for the BRCA1 locus, which is implicated in early-onset breast cancer, and the factor VIII locus, which is associated with hemophilia A. They concluded that population growth could explain the deviations at both loci. Beginning instead with the model of Hartl and Campbell (1982)
, Reich and Lander (2001)
extended this conclusion about growth to a general observation among many loci: that diseases in higher frequencies tend to have a simpler pattern of diversity, with one most common allele. For further discussion, see Pritchard and Cox (2002)
and Di Rienzo (2006)
.
Taken together, the results of Hartl and Campbell (1982)
, Slatkin and Rannala (1997)
, Pennings and Hermisson (2006a), and those presented here support the broad applicability of the Ewens sampling formula as a model of diversity among selectively equivalent alleles at a locus without recombination. To illustrate some of the above ideas and to get a sense of the domain of application (to deleterious/disease alleles) of the present model compared with the model Hartl and Campbell (1982)
, consider figure 5. The model of Slatkin and Rannala (1997)
will not be considered in this context because its assumptions are implicit, about the age and frequencies of alleles, rather than explicit, about the parameters. Figure 5A shows three distributions of x, the frequency of the deleterious allele A1, all of which correspond to a disease whose average frequency in the population is 1/2000, but among which there are very different levels of variation around this average. Note that h(x), given in equation (1), may be interpreted as the relative amount of time the population spends with the frequency of A1 equal to x. A different but related way to think of h(x) is that it represents the relative chance that the current frequency of A1 in the population is equal to x.
|
The shape of h(x) depends on

1, 
2, and
. It is L shaped when both 
1 and 
2 are less than or equal to one and has a nonzero mode if the mutation rate to the less-frequent allele (here 
1) is greater than one. The solid curve in Figure 5A corresponds to one set of parameters used in the simulations presented above. Recall that when
= 1000, as for the solid curve, the simulation results were very close the limiting predictions for strong selection, here meaning
for a given
. It is clear that, for this solid curve, the population is not particularly likely to have an x close to its expected value of 1/2000 and that much of the time the frequency of A1 is close to zero. On the other hand, as
and
increase, but
/
remains constant, the distribution becomes more and more concentrated on the expected value. The finely dashed curve in figure 5A represents a case in which the model of Hartl and Campbell (1982)
Figure 5B displays the same solid curve that appears in figure 5A, but over a wider range of x. Again, in this case, there is only a 1/2000 chance of sampling a deleterious allele. Also included in figure 5B is the distribution of x, conditional on observing five deleterious alleles in a random sample of size 100. The general formula for this posterior distribution of allele frequencies is
|
| (15) |
| Appendix |
|---|
|
|
|---|
In the following equation, the probability of the ordered sample (n1, n2, v1, v2) is computed by conditioning on the first step in the exchangeable ancestral process which underlies the ancestral selection graph (see text). The total rate of events is (n + v) (
+
+ n + v – 1)/2. Fourteen different kinds of events are distinguished, based on whether they are mutation, coalescent, or branching events and on which lineages they affect. This is a special case of the general, multiallele processes described in Fearnhead (2002)
![]() | (A1) |
Each term on the right-hand side of equation (A1) has the form P{Event}P{Data|Event}, where Data means the ordered sample (n1, n2, v1, v2), and the sum is taken only over Events that have a nonzero probability of producing the data. In particular, coalescent events between lineages whose allelic types in the data are different are omitted from equation (A1), as are mutation events in which the descendent lineage is not of the allelic type required by the data. There are four kinds of events among the 14 terms on the right-hand side of equation (A1): mutation events in which the descendant lineage is of the correct allelic type (terms 1–4), branching events in which the descendent lineage must be of allelic type A1 (terms 5 and 7), branching events in which the descendent lineage must be of allelic type A2 (terms 6 and 8), and coalescent events in which both descendent lineages must be of the same allelic type (terms 9–14).
The probabilities of each event are given by the fractions in each term and are computed in the usual way as the rate of each particular event divided by the total rate of events in the unconditional, exchangeable process. At the time of the first event, the lineages in the graph are a random sample from the equilibrium population (Donnelly and Kurtz 1999)
. The probabilities of the data given each event are computed by considering what type of ancestral sample would yield the data. For example, the lineages not involved in the event must simply have the same allelic state that they do in the data.
For mutation events, the ancestral sample would yield the data if it were identical to the data (in which case it is an empty mutation event) or if it contained one fewer allele of the type required for the mutant lineage by the data and one more of the other allelic type (in which case the mutation converts the lineage to the correct allelic type). In the case of branching events, all four possibilities for the allelic states of the incoming and continuing branches must be considered. For branching events in which the descendent lineage must be of allelic type A1, only an ancestral sample in which the incoming (virtual) branch has allelic type A1 (and the remaining lineages have the types required by the data) would yield the data. For branching events in which the descendent lineage must be of allelic type A2, three cases of (I, C) could produce the data: (A1, A2), (A2, A1), and (A2, A2). In the first two cases, the ancestral sample possesses one additional (virtual) lineage of type A1 relative to the data; hence, these are grouped together in equation (A1), whereas in the third case the ancestral sample possesses one additional (virtual) lineage of type A2. Finally, coalescent events in which both descendent lineages are of the same allelic type will yield the data if the coalesced ancestral lineage is of the correct allelic type, so that ancestral sample contains one fewer of that allelic type than the descendent sample does.
| Acknowledgements |
|---|
|
|
|---|
I thank Joachim Hermisson, Tee Muirhead, Pleuni Pennings, and Monty Slatkin for helpful discussions of this work.
| Footnotes |
|---|
Marcy Uyenoyama, Associate Editor
| References |
|---|
|
|
|---|
Abramowitz M, Stegun IA. Handbook of mathematical functions (1964) New York: Dover Publications.
Aldous DJ. Exchangeability and related topics. In: École dÉté de Probabilités de Saint-Flour XII—1983—Dold A, Eckmann B, eds. (1985) Berlin (Germany): Springer-Verlag. 1–198. (Lecture notes in mathematics; vol. 1117).
Baake E, Bialowons R. Miekisz J, ed. Forthcoming. 2008. Ancestral processes with selection: branching and Moran models. In: Miekisz J, editor. Banach center publications. Warsaw (Poland): Institute of Mathematics, Polish Academy of Sciences.
Barton NH, Etheridge AM, Sturm AK. Coalescence in a random background. Ann Appl Probab (2004) 14:754–785.[CrossRef]
Cannings C. The latent roots of certain Markov chains arising in genetics: a new approach. I. Haploid models. Adv Appl Probab (1974) 6:260–290.[CrossRef]
Di Rienzo A. Population genetics models on common diseases. Curr Opin Genet Dev (2006) 16:630–636.[CrossRef][Web of Science][Medline]
Donnelly P, Kurtz TG. Genealogical models for Fleming-Viot models with selection and recombination. Ann Appl Probab (1999) 9:1091–1148.[CrossRef]
Ewens WJ. The sampling theory of selectively neutral alleles. Theoret Popul Biol (1972) 3:87–112.
Ewens WJ. Mathematical population genetics. Volume I: theoretical foundations (2004) Berlin (Germany): Springer-Verlag.
Fearnhead P. The common ancestor at a nonneutral locus. J Appl Probab (2002) 39:38–54.[CrossRef]
Fearnhead P. Perfect simulation from nonneutral population genetic models: variable population size and population subdivision. Genetics (2006) 174:1397–1406.
Golding GB. The effect of purifying selection on genealogies. In: Progress in population genetics and human evolution—Donnelly P, Tavaré S, eds. (1997) New York: Springer-Verlag. 271–285. (IMA volumes in mathematics and its applications; vol. 87).
Griffiths RC, Tavaré S. Simulating probability distributions in the coalescent. Theoret Popul Biol (1994a) 46:131–159.
Griffiths RC, Tavaré S. Ancestral inference in population genetics. Stat Sci (1994b) 9:307–319.[CrossRef][Web of Science]
Hartl DL, Campbell RB. Allelic multiplicity in simple Mendelian disorders. Am J Hum Genet (1982) 34:866–873.[Web of Science][Medline]
Hermisson J, Pennings PS. Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics (2005) 169:2335–2352.
Hudson RR. Testing the constant-rate neutral allele model with protein sequence data. Evolution (1983) 37:203–217.[CrossRef][Web of Science]
Kaplan NL, Darden T, Hudson RR. Coalescent process in models with selection. Genetics (1988) 120:819–829.
Kimura M. Stochastic processes and the distribution of gene frequencies under natural selection. Cold Spring Harb Symp Quant Biol (1955) 20:33–53.
Kingman JFC. The coalescent. Stochastic Process Appl (1982a) 13:235–248.[CrossRef]
Kingman JFC. On the genealogy of large populations. J Appl Probab (1982b) 19A:27–43.[CrossRef]
Kingman JFC. Exchangeability and the evolution of large populations. In: Exchangeability in probability and statistics—Koch G, Spizzichino F, eds. (1982c) Amsterdam: North-Holland. 97–112.
Krone SM, Neuhauser C. Ancestral processes with selection. Theoret Popul Biol (1997) 51:210–237.
Moran PAP. Random processes in genetics. Proc Camb Philos Soc (1958) 54:60–71.[CrossRef]
Moran PAP. Statistical processes of evolutionary theory (1962) Oxford: Clarendon Press.
Neuhauser C. The ancestral graph and gene genealogy under frequency-dependent selection. Theoret Popul Biol (1999) 56:203–214.
Neuhauser C, Krone SM. The genealogy of samples in models with selection. Genetics (1997) 145:519–534.[Abstract]
Pennings PS, Hermisson J. Soft sweeps II: molecular population genetics of adaptation from recurrent mutation or migration. Mol Biol Evol (2006a) 23:1076–1084.
Pennings PS, Hermisson J. Soft sweeps III: the signature of positive selection from recurrent mutation. PLoS Genet (2006b) 2:e186.[CrossRef][Medline]
Pritchard JK. Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet (2001) 69:124–137.[CrossRef][Web of Science][Medline]
Pritchard JK, Cox NJ. The allelic architecture of human disease genes: common-disease–common variant ... or not? Hum Mol Genet (2002) 11:2417–2423.
Przeworski M, Charlesworth B, Wall JD. Genealogies and weak purifying selection. Mol Biol Evol (1999) 16:264–1252.
Reich DE, Lander ES. On the allelic spectrum of human disease. Trends Genet (2001) 17:502–510.[CrossRef][Web of Science][Medline]
Sawyer SA. A stability property of the Ewens sampling formula. J Appl Probab (1983) 20:449–459.
Slade PF. Simulation of selected genealogies. Theoret Popul Biol (2000a) 57:35–49.
Slade PF. Most recent common ancestor distributions in genealogies under selection. Theoret Popul Biol (2000b) 58:291–305.
Slater LJ. Confluent hypergeometric functions (1960) Cambridge: Cambridge University Press.
Slatkin M. An exact test for neutrality based on the Ewens sampling distribution. Genet Res (1994) 64:71–74.[Web of Science][Medline]
Slatkin M. A correction to the exact test based on the Ewens sampling distribution. Genet Res (1996) 68:259–260.[Web of Science][Medline]
Slatkin M, Rannala B. The sampling distribution of disease-associated alleles. Genetics (1997) 147:1855–1861.[Abstract]
Stephens M, Donnelly P. Ancestral inference in population genetics models with selection. Aust N Z J Stat (2003) 45:395–430.[CrossRef]
Strimmer K, Pybus OG. Exploring the demographic history of DNA sequences using the generalized skyline plot. Mol Biol Evol (2001) 18:2298–2305.
Tajima F. Evolutionary relationship of DNA sequences in finite populations. Genetics (1983) 105:437–460.
Terwilliger JD, Weiss KM. Linkage disequilibrium mapping of complex disease: fantasy or reality? Curr Opin Biotech (1998) 9:578–594.[CrossRef][Web of Science][Medline]
Watterson GA. On the number of segregating sites in genetical models without recombination. Theoret Popul Biol (1975) 7:256–276.
Watterson GA. The homozygosity test of neutrality. Genetics (1978) 88:405–417.
Wolfram S. The mathematica book (1999) 4th edition. Cambridge (UK): Wolfram Media/Cambridge University Press.
Wright S. Evolution in Mendelian populations. Genetics (1931) 16:97–159.
Wright S. Population structure in evolution. Proc Am Philos Soc (1949) 93:471–478.[Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||











