MBE Advance Access originally published online on November 30, 2006
Molecular Biology and Evolution 2007 24(2):574-586; doi:10.1093/molbev/msl189
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Increasing Sequence Correlation Limits the Efficiency of Recombination in a Multisite Evolution Model
Department of Microbiology and Molecular Biology, Tufts University
E-mail: irouzine{at}tufts.edu.
| Abstract |
|---|
|
|
|---|
The accumulation of preexisting beneficial alleles in a haploid population, under selection and infrequent recombination, and in the absence of new mutation events is studied numerically by means of detailed Monte Carlo simulations. On the one hand, we confirm our previous work, in that the accumulation rate follows modified single-site kinetics, with a timescale set by an effective selection coefficient seff as shown in a previous work, and we confirm the qualitative features of the dependence of seff on the population size and the recombination rate reported therein. In particular, we confirm the existence of a threshold population size below which evolution stops before the emergence of best-fit individuals. On the other hand, our simulations reveal that the population dynamics is essentially shaped by the steady accumulation of pairwise sequence correlation, causing sequence congruence in excess of what one would expect from a uniformly random distribution of alleles. By sequence congruence, we understand here the opposite of genetic distance, that is, the fraction of monomorphic sites of specified allele type in a pair of genomes (individual sequences). The effective selection coefficient changes more rapidly with the recombination rate and has a higher threshold in this parameter than found in the previous work, which neglected correlation effects. We examine this phenomenon by monitoring the time dependence of sequence correlation based on a set of sequence congruence measures and verify that it is not associated with the development of linkage disequilibrium. We also discuss applications to HIV evolution in infected individuals and potential implications for drug therapy.
Key Words: recombination selection model numerical simulation correlation
| Introduction |
|---|
|
|
|---|
Genomic recombination is a pervasive biological mechanism. It occurs in living taxa of all levels of complexity, from viruses and prokaryotes to highest order mammals and gymnosperms. Yet its benefits for the survival and propagation of a species are not always clear. Whether recombination is advantageous or not for a certain population seems to depend on a complex set of conditions involving, with equal weight, the nature of intragenome interactions between various beneficial alleles, the structure and size of the population itself, and the type of environment shaping the evolution of the population.
It is frequently argued, for instance, that the exchange of genetic material via recombination (sexual reproduction) expands the range of individual variability that provides the substrate for natural selection (Burt 2000
). However, according to few-site models for very large populations in a stationary environment, such an increase in variability arises only under negative linkage disequilibrium (Otto and Lenormand 2002
) or negative epistasis (Barton 1995a
; Otto and Lenormand 2002
). Still because the effects of linkage and epistasis can change with the character of the environment, it was found that recombination may yet increase variability under rapidly changing environmental conditions, regardless of epistasis (Barton 1995a
; Burger 1999
).
Perhaps the most quoted advantage of recombination in finite populations is that the exchange of alleles between members of a population may promote the accumulation and fixation of beneficial mutations, counteracting the negative effect of chromosomal linkage on the response to selection (Fisher 1930
; Muller 1932
). Beneficial mutations survive selection and random drift only if they arise in a sufficiently large number of high-fitness individuals. If the population is not large enough and/or the selection pressure is not high enough, the population will experience an excessive buildup of deleterious mutations known as Muller's ratchet (Felsenstein 1974
), as well as a slow down in the fixation of beneficial mutations that have to compete for representation in the population (clonal interference effect). Recombination has the potential to counteract these effects by bringing together, in the same genome, distinct beneficial mutations that otherwise would likely remain segregated. This old idea (Fisher 1930
; Muller 1932
) has received quantitative support, both analytically (Crow 1965
; Hill and Robertson 1966
; Felsenstein 1974
; Barton 1995b
) and numerically (Hey 1998
), typically in models with two or few linked sites. Then again, recombination can just as well dissociate beneficial mutations in highly fit individuals, producing less-fit progeny with diminished chances of survival. In the context of an early deterministic model with multiplicative selection and no linkage disequilibrium, such events were shown to erase entirely the positive effects of recombination on the fixation of beneficial mutations (Maynard-Smith 1968
).
The reasons for such divergent conclusions can be traced eventually to various disparities in the assumptions of the different models employed. The enormous complexity of the corresponding in vivo processes calls necessarily for a number of drastic simplifications, starting with the type and size of genome (haploid vs. diploid, 2-allele vs. multiallele, 2 or 3 site vs. multisite or even infinite length), the size of the population (finite vs. infinite, constant vs. variable) and its initial configuration, low or high rate of mutation (when present), type and strength of selection, mechanism and rate of recombination, stochastic or deterministic kinetics, etc. Given the wide range of options and the enduring complexity of the problem, even under strong simplification, it is not surprising that different models lead to different conclusions. The overall impression is, however, that models that take into account the effects of finite population size tend to predict an advantage of recombination in the progressive evolution of organisms (Crow 1965
; Hill and Robertson 1966
; Felsenstein 1974
; Barton 1995b
). One always hopes that approaching the problem with a generous enough set of assumptions will eventually allow a unified understanding of now seemingly antagonistic results.
With this position in mind, we (Rouzine et al. 2003
; Rouzine and Coffin 2005
) have recently formulated a novel analytical approach to a 2 allele, multisite model of evolution in haploid populations with a simple fitness landscape. Variants of this type of model have proved useful in quantitative studies of the evolution of RNA viruses and bacteria (Tsimring et al. 1996
; Aranson et al. 1997
; Kessler et al. 1997
; Ridgway et al. 1998
). The key observation of our approach is that, in a great variety of situations, the distribution of genomes (individual sequences) according to fitness (or the number of less-fit alleles) can be approximated as a quasi-deterministic "solitary wave" with stochastic edges, propagating in fitness space at a slowly changing speed. The method allows the derivation of closed-form expressions for basic kinetic quantities, valid over a broad range of model parameters. Hence, it can provide estimates of the evolution rate for a large variety of theoretical and experimental situations. The framework was first applied to a study of Muller's ratchet and FisherMullerHillRobertson effect under weak selection and mutation (forward, backward, and compensatory), in the absence of recombination (Rouzine et al. 2003
). For a wide range of population sizes, it was found that the evolution rate has a logarithmic dependence on the selection coefficient and the population size N. In particular, the model recovered successfully the result of the single-site model in the infinite population limit, the rate of accumulation of beneficial alleles limited by FisherMullerHillRobertson effect at intermediate N, a steady state at a critical value of N, and Muller's ratchet in small populations.
The same approach was employed recently in an analysis of evolution under recombination and weak selection, in the absence of mutations (Rouzine and Coffin 2005
), using the approximation that positions of deleterious alleles do not correlate within and between genomes of given fitness. The prediction of a quasi-deterministic solitary wave was found to hold for this model as well. The magnitude of the corresponding accumulation rate followed single-site type kinetics, with the effects of recombination, selection, and random drift combined into an effective selection coefficient seff. The effective coefficient seff was smaller than the single-site selection coefficient s, but increased with the recombination rate, confirming that recombination counteracts only partially the HillRobertson effect introduced by linkage. Notably, seff, and therefore the entire dynamics, was shown to display a well-defined transition with increasing population size N or recombination rate r. Below a certain critical size Nc
(N/r)1/2, the population evolved eventually into a clone and the accumulation of beneficial alleles stopped, corresponding to a null value for seff. For population sizes larger than Nc, beneficial alleles accumulated at a steady rate, until such alleles became fixed at all sites. In this case, the effective selection coefficient acquired a logarithmic dependence on N and approached the single-site selection coefficient s at large enough N.
The current paper presents detailed numerical tests of the model and conclusions of Rouzine and Coffin (2005)
by means of Monte Carlo simulation. We perform a careful numerical analysis of the accumulation kinetics, with specific emphasis on the emergence of intergenome correlation for homologous sites throughout the population. Intergenome correlation is assessed by means of pairwise sequence congruence, defined as the fraction of monomorphic sites of specified allele type in a pair of genomes. Thus, sequence congruence, as a measure of genetic similarity, is the "opposite" of the genetic distance. We examine the interplay between the buildup of sequence congruence and the accumulation of better-fit alleles by monitoring the time dependence of a complementary set of sequence congruence functions and of linkage disequilibrium. Particular attention is also given to the suitability of the single-site form of the accumulation rate expression and to the dependence of the effective selection coefficient on the population size and the recombination rate. As before, we apply our study to the evolution of HIV populations and the probable role of recombination in the emergence of drug resistance.
| Model and Methods |
|---|
|
|
|---|
Model
We model the viral population as a collection of N haploid genomes, represented by sets of L sites (bases or base pairs) each and subject to discrete time WrightFisher evolution with recombination, random drift, and selection for better-fit genotypes. For simplicity, each site is assumed to display only one of 2 alleles: better fit or less fit. We further disregard any epistasis, as well as all mutation events. Deleterious mutation events are neglected under the assumption that the accumulation rate of better-fit sites, in the parameter range of interest, is much larger than the mutation rate. Beneficial mutation events are also neglected because all beneficial alleles are assumed to preexist. This simplification allows us to consider only significant sites that present at least one copy of a better-fit allele in the initial population. The appearance of new significant sites at later times can be ignored, and the effective genome length L is fixed accordingly for the duration of the evolution. The population size N is also assumed to be constant in time.
The initial population is subjected to a discretized evolution with a time step of one generation. Each generation, a (small) number of randomly sampled pairs of genomes undergoes recombination, and the entire population is subsequently replaced by sampled progeny. In the sampling process, individual genomes are substituted by their direct descendants in random numbers according to a Poisson distribution, under the constraint of a constant population size. The average number of progeny for any given genome is proportional to its fitness, which is determined on the assumption that each expressed nonresistant allele diminishes the chances of survival by a factor es, slightly less than unit (s << 1). In other words, log fitness is proportional to the number k of less-fit sites, by the selection coefficient s (log fitness = sk, no epistasis).
The model recombination process assumes that recombinant genome pairs are selected randomly and sequentially with a probability 1/N0, such that each genome has an equal chance of undergoing recombination, at a rate r = N/N0. Each generation, the randomly sampled genome pairs undergo a preset number of sequence crossovers to produce pairs of descendent strands. Both descendents of a recombinant pair are retained in the evolving population (even though in a real population only one descendant is produced, the difference hardly matters, as only unusually highly fit recombinants influence evolution). We note that because of this latter assumption, our parameter r should not be thought of as a probability, because it may well happen that r > 1. In such a case, there is a nonzero probability that some descendents of genomes already involved in a recombinant event become themselves involved in recombination within the same generation. In the case of HIV, the rate r is determined by the frequency of coinfection of a target cell by 2 (or more) viral particles. More precisely, r represents the average number of coinfection events experienced by a target cell during one generation. Assuming that each such coinfection event results in the recombination of the 2 viral strands, the average number of recombinant encounters per generation in the entire population amounts to
rN/2.
Model Parameters for HIV Populations In Vivo
The model and the model parameters are tailored to fit our particular interest in HIV evolution. The details depend on the timescale considered. For long observation periods, 102 to 103 generations, the number of polymorphic sites is on the order L
100. For short-term evolution, the effective genome length L (number of drug-sensitive polymorphic sites) may be much smaller. The effective size of an HIV population, N, is set by the number of infected cells on the productive pathway. According to existing estimates based on models that include selection (Chun et al. 1997
; Rouzine and Coffin 1999
; Frost et al. 2000
), in an average patient, this number is N
106 or larger. However, the time required for our Monte Carlo simulations increases at least quadratically with the population size, and simulations become rather inefficient at N > 105. We compromise on somewhat smaller population sizes, 105 or fewer. A similar predicament constrains our choice of allele composition. We consider cases when only a small fraction f0 of the alleles presented by significant sites at the onset of evolution are better fit. This initial condition corresponds to 2 experiments: 1) reversion of deleterious alleles in untreated patients soon after transmission due to the transmission bottleneck and 2) amplification of drug-resistant alleles after start of drug therapy. But a very low initial fraction of better-fit alleles requires higher N to avoid their loss, Nf0s >> 1, and leads again to inefficiently long simulation times. We prefer to balance against lengthy simulations by considering initial compositions with f0
102 randomly distributed drug-resistant alleles. In the absence of initial allele loss, simulation results are weakly sensitive to f0.
Variation of the selection coefficient s among different sites is ignored, replaced by a representative value determined by the timescale of the experiment under consideration (t
1/s). Indeed, given a timescale, beneficial alleles at sites with much larger s will be fixed rapidly and similar alleles at sites with much smaller s will not have enough time to accumulate. Short-term evolution, spanning 10100 generations, suggests s
0.1, whereas longer timescales, stretching over hundreds and thousands of generations, imply s
102 (Rouzine and Coffin 1999
). Much smaller values of s may be dismissed because the corresponding timescales exceed the average duration of HIV infections. Larger values s
1, corresponding to short durations of the order of several generations, are also irrelevant because such sites can be regarded as conserved on longer timescales. Our simulations use typical values of s = 0.10.01.
In each round of infection, 2 RNA templates carried by a virus particle are reverse transcribed into a new provirus. Recombination occurs when the reverse transcriptase switches between the 2 available templates and incorporates fragments of genetic material from both parental strands into the newly transcribed DNA genomes. It is assumed that a relatively large number,
10, of such switches (crossovers) occurs before the reverse transcription process is completed (Dang et al. 2004
; Levy et al. 2004
), with the result that the progeny provirus carries a well-mixed blend of sequences from the parental RNA strands. Two different virus genomes can recombine only if they coinfect the same cell. If the density of newly infected cells is low in the tissue, as in persistent HIV infection, the rate r of recombination for any one genome becomes proportional to the number N of infected cells, r = N/N0. The factor 1/N0 represents the probability that any randomly selected pair of genomes succeeds in coinfection; alternatively, N0 is the extrapolated population size for which the recombination rate becomes 1. It depends on the total number of cells permissive for virus replication and their distribution in tissue. Whereas an estimate, r
1, has been obtained in some untreated patients based on the sampling of double HIV DNA positive cells (Jung et al. 2002
), conclusive experimental measurements of the rate r in productively infected (RNA positive) cells for given N have yet to be performed. Unfortunately, this leaves a good deal of uncertainty concerning the relevant values for both N and r in real patients. For instance, the above estimate of r may be too high if most coinfecting virions originate in neighboring cells and therefore are genetically uniform. Similarly, the estimate for N, which is based on a 2-site model, may have to be adjusted both to account for a higher frequency of observed crossovers and for the multisite nature of evolution. In the following, we use values on the order N0
103 to 105, which may be lower than the typical value in an untreated patient but offer a reasonable compromise in terms of simulation time. We note that N0 can be varied in vitro by changing the culture conditions, specifically, the number, density, and type of permissive cells. The model is equally applicable for other partially sexual populations, such as yeast, in which the recombination rate r is regulated independently of N by external factors.
Previous Theoretical Framework, Corrections, and Estimates
Because in the initial population the position of beneficial alleles within genomes is random and any crossover events during recombination at later times also occur in random positions, it could be expected that the better-fit alleles remain randomly distributed at any time during the evolution (Rouzine and Coffin 2005
). In such an approximation, the viral population kinetics can be monitored by the time-dependent frequency of genomes with k less-fit alleles, f(k, t). It has been found previously (Rouzine et al. 2003
; Rouzine and Coffin 2005
) that, under the assumptions discussed above, the main part of the distribution f(k, t) follows a quasi-deterministic kinetics. By contrast, the edges of the distribution (beyond which all fitness classes are empty), especially the leading edge toward small k, advance at a rate determined by the stochastic formation of best-fit sequences through recombination. Away from the edges f(k, t) is expected to follow kinetics of the form
|
| (1) |
k
=
k kf (k, t) is the average number of less-fit alleles, and the kinetic equation (1) acquires the simpler form
|
| (2) |
k
:
|
| (3) |
(k
k
)2
being the variance of the number k of deleterious alleles (Haigh 1978
dkkR (k, t) = 0, and R does not contribute directly to changes in
k
. Its contribution is concealed in the explicit form of the variance Var[k].
At infinite N or small N0, the expression of Var[k] can be further inferred without direct knowledge of the recombination term R, recalling that in this case the initial random (binomial) distribution of mutations (
k
(t = 0) = L 1) remains random (binomial) throughout the evolution. Indeed, in either case, the average number of recombination events per generation, rN/2 = N2/2N0, becomes infinitely large, and such strong recombination maintains a perfectly random (binomial) redistribution of alleles throughout the population. As in the single-site case, the corresponding variance is given by
|
| (4) |
k
/L. At finite N and large N0, the binomial distribution of alleles (eq. 4), and therefore the deterministic expression for the evolution rate, equation (3), does not apply directly for 3 reasons as follows.
- (i) Because better-fit alleles at different sites are selected concurrently and because new sequences are produced at a limited rate, the number k of less-fit sites correlates between different genomes. Recombination produces best-fit individual genomes that spread gradually over the population, then the process is repeated. As a result, at a given
k
, the distribution of k among genomes is more narrow than given by equation (4), and the evolution rate, correspondingly, is smaller than that given by equation (3). In order to determine the correct variance Var[k], one has to specify the expression for the recombination term R(k, t) in equation (2). In the approximation that the positions of individual sites are completely random within a genome and, unlike their total number, do not correlate between different genomes, R(k, t) can be expressed in terms of f(k, t), as a quadratic functional (Rouzine and Coffin 2005
). Matching the rate of generation of new best-fit genomes (i.e., those with smallest k) to the evolution rate of the deterministic part of fitness distribution, we (Rouzine and Coffin 2005
) obtained
where seff is given by
(5a)
or, taking into account that r = N/N0,
(5b)
Thus, the binomial dependence of Var[k] on
k
, equation (4), is mostly preserved, but Var[k] is decreased by a factor seff/s that becomes one at N =
or
if r and N are independent, and at
if r and N are proportional. Accordingly, the deterministic evolution equation (3) formally holds, but with a smaller effective selection coefficient seff that absorbs the effects of finite population size, linkage (HillRobertson effect), and recombination.
- (ii) The approximation (Rouzine and Coffin 2005
) that the positions of beneficial alleles, for a given number per sequence, do not correlate between genomes may be inaccurate. Even if the initial distribution of alleles is completely random, correlations between genomes may develop later because pairs of recombining genomes share sequence segments inherited from common ancestors. Suppose, for example, that 2 genomes are produced by recombination between 2 pairs of parental genomes. If one parent sequence is common for both recombinants, approximately a quarter of their sequence will be identical. The frequency of common ancestral sequences is limited by the combined effect of selection and random sampling. At this time, we are not able to provide good theoretical estimates for the magnitude of this effect. However, the direct numerical simulations presented below will show that the effective selection coefficient approximation provides very good estimates for the time dependence of
k
and the variance Var[k], but equation (5b) for seff has to be modified substantially, by a factor that is a nontrivial functional of the model parameters N, L, r, s.
- (iii) equation (5a) for the variance Var[k] has to be adjusted, in addition, for the fixation and loss of better-fit alleles because monomorphic sites do not contribute to fluctuations of k. If klost(t) and kfix(t) denote, respectively, the number of lost and fixed better-fit sites at time t, these modifications bring equation (5a) for Var[k] to the form

(6)
- (ii) The approximation (Rouzine and Coffin 2005
Numerical Implementation
The Monte Carlo simulations were performed with our newly developed microPopulation software package (Gheorghiu-Svirschevski S, Rouzine I, unpublished data). The corresponding algorithm of the WrightFisher kinetics is rather simple. The model population is formally represented by an array of N strings with L binary variables each. The initial array is constructed site by site according to a binomial distribution with an average of f0 · L beneficial alleles per genome. Each Monte Carlo generation (iteration) carries the population array through one session of recombination events and one session of selection. During the recombination step, parent genome pairs, selected sequentially and randomly with a probability 1/N0, undergo a preset number of M = 10 crossovers each, at randomly selected positions along the genome length. No recombination hotspots are used in our simulations. (A random number of crossover events would be a better assumption. However, because the results are not sensitive to the value chosen for M, we can safely assume that they are not sensitive to fluctuations of M either). The resulting population is then submitted to a selection algorithm based on the well-known "broken stick" algorithm. The unit interval [0, 1] is divided into N segments, each segment corresponding to one genome in the population. The length of each segment is set proportional to the fitness of the matching genome. Then the N genomes surviving into the next generation are selected by N random drawings of a random variable uniformly distributed on [0, 1], according to the particular segment on which each drawing places the random variable.
Calculated Quantities
Our numerical calculations have targeted a 3-fold task. First, we performed a detailed numerical check on the actual validity of the inferred kinetics described by equation (6). The distribution f(k, t) of genomes with given number k of less-fit alleles was monitored throughout the run, along with the corresponding average
k
, its time derivative d
k
/dt, and the variance Var[k]. For the same purpose, we also monitored the time dependence of the number of lost and fixed better-fit alleles. Second, we paid particular attention to similarity in pairs of sequences. To give quantitative expression to this phenomenon, we defined 3 complementary measures of the degree of sequence congruence (the opposite of the genetic distance) in the following way. For any given pair of genomes i and j, we denote nlf and nbf the number of monomorphic less-fit and better-fit sites, respectively, and denote npoly the number of polymorphic sites. Observe that nlf + nbf + npoly = L. For each type of allele, we define a pair congruence measure Clf(bf)(i, j) as the ratio of monomorphic sites for that type of allele to the total number of sites expressing the type in at least one of the genomes. That is,
|
| (7) |
|
| (8) |
|
| (9) |
Clf, Cbf, Ctot
1. Value 1Ctot is the average pairwise genetic distance.
In our third numerical task, we checked the time dependence of linkage disequilibrium throughout the model dynamics. Linkage disequilibrium was monitored via the 2-site Lewontin measure D'(
, ß) (Lewontin 1964
) averaged at every given time over all pairs of sites separated by a prescribed mutual distance d, D' =
D' (
,
+ d)
. Here
and ß =
+ d, 1
, ß
L, label 2 arbitrary sites of the viral genome separated by d intermediate positions. The Lewontin disequilibrium is calculated as follows. If f11(
, ß) denotes the probability that both site
and site ß contain less-fit alleles, f1(
) denotes the probability of a less-fit allele at site
alone, and D (
, ß) = f11 (
, ß) f1(
)f1(ß), then
![]() | (10) |
)(1 f1(ß)), (1f1(
))f1(ß)} is the maximum value of D(
, ß) over all possible distributions of alleles in genomes at fixed f1(
) and f1(ß), and Dmin = max{f1(
) f1(ß), (1 f1(
))(1 f1(ß))} is the minimum of its absolute value. With the normalization shown in equation (10), the total disequilibrium is bounded, as given by |D'(d)|
1. | Results |
|---|
|
|
|---|
Kinetics of Better-Fit Allele Accumulation
The typical observed evolution of the distribution f(k, t) for a sufficiently large population size N and random initial distribution of alleles is shown in figure 1a (average over 10 runs with an identical initial population). Parameter values are defined in the figure. Figure 1b shows the corresponding time dependence (averaged over 10 runs) of
k
, s1|d
k
/dt| and Var[k], as well as the time-dependent numbers of lost and fixed better-fit alleles, klost and kfix. The 3 allele-based measures of sequence congruence obtained numerically using equations (79) are shown in figure 2a, plotted against the current
k
. Note that
k
(t) is a strictly decreasing function of time, and thus the arrow of time in figure 2a runs from larger
k
to lower
k
. Estimates of the variance Var[k] from equation (6) are compared with numerical values from simulations in figure 2b. The estimates were calculated using seff as a fitting parameter and the numerical values of klost and kfix. Also shown in figure 2b are estimates of the variance Var[k] in terms of the total sequence congruence Ctot(t) (see eq. 11) below).
|
|
It can be easily observed from figure 1 that the presence of recombination, under selection and random drift fosters a steady accumulation of better-fit alleles from one generation to the next. One can distinguish 3 significant dynamic phases:
- Phase 1. The quasi-singular distribution corresponding to the largely less-fit initial population (
k
t=0
0.99L) disperses into a bell-shaped profile that slowly shifts toward lower values of k (see fig. 1a and b). This distribution corresponds to a rapid initial accumulation of better-fit alleles and thus to a substantial diversification of the population. By the end of this transient period, the distribution f(k) develops a relatively large number of individual k-groups, most comprising a sizeable number of genomes.
- Phase 2. The bell-shaped distribution continues to shift toward lower numbers of k, this time at a considerably accelerated but slowly varying rate (fig. 1a and b). Notably, the distribution profile varies only slightly during this phase (fig. 1a, 40
t
100); that is, as pointed out in our earlier work (Rouzine et al. 2003
; Rouzine and Coffin 2005
), the propagation of f(k, t) can be approximated as a solitary wave.
- Phase 3. The remaining less-fit alleles continue to be eliminated until (almost) completely purged. In other words, the better-fit alleles undergo gradual fixation, whereas the bell-shaped profile of f(k) shrinks until it eventually reduces to a single stationary group at kfinal
0.
- Phase 2. The bell-shaped distribution continues to shift toward lower numbers of k, this time at a considerably accelerated but slowly varying rate (fig. 1a and b). Notably, the distribution profile varies only slightly during this phase (fig. 1a, 40
At lower population sizes, this general pattern is altered by the loss of better-fit alleles during Phase 2 (fig. 1c). In this case, the ongoing accumulation of better-fit alleles is limited by the diminished number of polymorphic sites. Beyond the solitary wave regime, the loss of better-fit alleles declines rapidly until it eventually halts, whereas their fixation also begins at random available sites. As in the case of larger N, the latter process continues until complete elimination of nonfixed less-fit alleles, with the result that the entire population is reduced finally to a clone, with kfinal > 0 randomly positioned less-fit sites (fig. 1a). The number kfinal of less-fit sites surviving in any particular final clone samples a bell-shaped distribution centered on an average position
kfinal
(results not shown). Figure 1d shows that, for fixed recombination and selection rates,
kfinal
depends on the population size N. It decreases toward k = 0 with increasing N, whereas the width of its corresponding distribution shrinks, such that in sufficiently large populations the final clone displays invariably the best-fit genome that can evolve (kfinal = 0).
As was checked in all simulations that we have run, the kinetics of allele accumulation proves to be in excellent agreement with the general prescription of Fisher's law, equation (3), that the speed of evolution is proportional to the width of the distribution f(k, t) (see 2 examples in fig. 1b and c). Figure 2b shows that we also find very good agreement with the modified one-site kinetics of equation (6), based on the effective selection coefficient approach. The dependence of seff on N and other parameters is further discussed in the next section.
Sequence Congruence
The emergence of congruent sequence segments throughout the population was followed by means of the 3 measures defined in equations (79). A typical output is shown in figure 2a, plotted against the current average number of less-fit sites
k
.
The overall sequence congruence Ctot and the congruence of less-fit sites Clf both display a nonmonotonous dependence, with a rather shallow minimum between maximal unit values at the beginning and at the end of the evolution (fig. 2a). Unit congruence corresponds to (almost) uniform final and initial populations. During most of the evolution, the congruence of less-fit sites Clf decreases due to the constant accumulation of better-fit alleles. The final increase in Clf is related to the shrinking of the distribution f(k, t) (fig. 1a) and the emergence of the final clone. The increase of total congruence during the second half of evolution witnesses the increasing congruence of better-fit sites.
The congruence of better-fit sites Cbf increases monotonously with
k
throughout the evolution, from an initial null value for
k
99 (f0 = 0.01, L = 100) at t = 0 to the unit value marking the establishment of the final clonal population. Interestingly, in the parameter range we tested, for example, N0 = 104, the time dependence of Cbf(t) is approximately linear throughout the evolution. Although we do not have a theoretical understanding of this feature, it is worth pointing out that the approximately linear dependence of Cbf can be exploited for early anticipation of
kfinal
and therefore of the ratio of better-fit/less-fit sites in the final clone.
Connection between the Genetic Distance and the Variance of the Fitness Distribution
The total congruence Ctot correlates closely with the variance of the distribution f(k, t). Figure 2b shows that a proportionality of the form
|
| (11) |
a constant coefficient close to unit,
= 0.850.97. We note that L(1 Ctot) represents the average Hamming distance within the population. Relation (11) implies, then, that the rate of evolution, equation (3), is proportional to the average Hamming distance, as
|
| (12) |
= 1 can be understood under the assumption that the distribution of less-fit alleles on polymorphic sites within any pair of genomes is random, as it is the case, for example, when nonfixed alleles are randomly distributed in the population. To this end consider the following expression for the variance Var[k]:
|
| (13) |
ki npoly/2)2, where
ki = ki nlf denotes the number of polymorphic mutant sites in genome i. But because mutations in pairs of polymorphic sites are distributed with equal chance (1/2) between the 2 sites, npoly/2 is just the average number of polymorphic mutations that can be contributed by genome i for fixed npoly, and one has
ki does not correlate between genomes, for example, less-fit alleles for a fixed number of polymorphic sites are distributed randomly between the 2 genomes, from the binomial distribution we have, on average,
ki of polymorphic mutations, associated with correlations between fitness values of different genomes [Rouzine and Coffin 2005
<1 in equation (11). We note that in the limit of large N, when
approaches the ratio p = seff/s, as defined in Rouzine and Coffin (2005)
p, whereas at lower N, equation (11) still holds, but
p (see fig. 2b).
Linkage Disequilibrium
The initial population has a random allele makeup, with null pairwise linkage disequilibrium as defined by equation (10). Representative tests of pairwise linkage disequilibrium at later times are shown in figure 3 for 2 typical sets of model parameters and various linkage distances between sites of a pair, d. The quantity D'(d), calculated according to equation (10), exhibits only small, statistically insignificant fluctuations from a null value, irrespective of the distance d. In particular, this result shows that the positions of alleles of a specific type (beneficial or deleterious) do not correlate along genomes, although alleles (either deleterious or beneficial) do correlate between homologous sites on different genomes (across the population), as described in the previous paragraph. We note in this regard that the experimental observation of linkage disequilibrium can be attributed to random mutation events that generate new beneficial alleles (Rouzine and Coffin 1999
), as well as to epistasis. In the present model, epistasis is absent, and all the alleles are assumed to preexist.
|
Dependence of the Evolution Rate and the Sequence Congruence on N and Other Parameters
As documented in figure 2b, the analytic approach based on the effective selection coefficient seff, equation (6), provides a good fit to the simulated results for the kinetics of Var[k] and
k
. As can be seen in figure 4, for a fixed characteristic population size N0 and raw selection coefficient s, the average over all runs of the ratio seff/s shows a logarithmic increase with the population size N and reaches an upper plateau at the deterministic value of 1 for large enough N. In this limit, recombination is so effective that sites that are statistically independent at the onset evolve independently henceforth. At lower population sizes, seff/s also shows a tendency toward a plateau. Closer examination reveals that the onset of this lower plateau correlates with the gradual disappearance of simulation runs that end in the population with maximum fitness,
k
= 0 (i.e., for a small enough population size, the final clones start to include with certainty some less-fit alleles). To emphasize this feature, we calculated, in addition, a fractional ratio s'eff/s, defined as the contribution to the average seff/s arising exclusively from the fraction of runs arriving at the best-fit population (fig. 4). Whereas at sufficiently large N the fractional ratio s'eff/s is identical to seff/s, at the lower population sizes, where runs that end in the best-fit population gradually disappear, the fraction displays an abrupt drop and reaches a null value at some critical population size Nc. This behavior may be regarded as a phase transition from all-best-fit final clones to partially better-fit final clones.
|
We observe also that the increase of seff/s with population size N is closely matched by a similar decrease in sequence congruence, as given by seff/s
1 C
(fig. 4). Furthermore, the ratio s
/2 of the evolution rate to the average Hamming distance L(1 Ctot) in equation (12) depends only weakly on the population size (result not shown).
Finally, we note that our results do not greatly depend on the number of crossovers, M, per genome per recombination event. Although the majority of our simulations used a value M = 10, we verified that the dependence of seff on N does not change much in the limit M
. For instance, for a rather high recombination rate, N0 = 103 and long genomes, L = 1000, an increase in the number of crossovers produced a shift in the critical population size from Nc
300, for M = 10, to Nc
200, for M
.
| Discussion |
|---|
|
|
|---|
Evolution Rate versus Accumulation of Sequence Correlation
A previous approach (Rouzine and Coffin 2005
N
, equation (5b), below which evolution stopped shortly after its onset (s'eff/s = 0).
The present Monte Carlo results confirm the existence of a threshold in N for s'eff/s (fig. 4). Also, for fixed N0 and s, the fractional ratio s'eff/s shows an increase with population size. There are, however, a number of differences between the previous approximation and the present simulations. First, the transition between s'eff = 0and s'eff = s occurs in a more narrow interval of ln N. Second, the value of the critical population size Nc is higher than
N
. Third, the current simulations indicate that the accumulation of pairwise congruence, in excess of what one would expect from a uniformly random distribution of alleles, plays an essential role in the evolution of a viral population (fig. 2a). At the same time, pairwise linkage disequilibrium between sites within the same genome does not emerge (fig. 3).
A qualitative understanding of correlation buildup can be gained on noticing that 2 genomes must correlate if they share segments originating in an identical ancestral clone. To quantify this effect, let us introduce the "identity by descent" C of a pair of sequences, as the fraction of sites that have a common ancestor, that is, are identical by descent. Two homologous sites in a pair of sequences descend from either 2 different genomes in the initial population or the same genome. In the second case, a site belongs to one of C·L congruent sites. Notice that, in the absence of mutation (as is the case in our model), sites identical by descent necessarily carry identical alleles, but the opposite is not true.
Each of the observable congruences defined in equations (79) can be related in a simple fashion to the average identity by descent with respect to the initial population, C, under the approximation that the positions of alleles in noncongruent segments of genomes do not correlate. Indeed, if we denote f =
k
/L, then one has, respectively,
|
| (14a) |
|
| (14b) |
|
| (14c) |
0, for the uncorrelated initial population at f
1, to some finite value C(t
)
1, for the final clone at f = 0. With this conjecture, it can be shown from equation (12) that the sequence congruences will have most of the qualitative features (monotony, minima, asymptotic trend at f = 0 and f = 1) displayed in figure 2a.
One can use equation (14) to compare values of the identity by descent C obtained in 3 independent ways. In figure 2a, we show the 3 values of C calculated in this way and numerical values for Clf, Cbf, and Ctot for a parameter value set near the transition point in N, where seff/s
0.5. We observe that 1) the 3 curves for C are close, which confirms the assumption that sites not identical by descent do not correlate and 2) the average value of C in the interval of f is not small, demonstrating the connection between sequence correlation and the kinetic properties of population. We note that, at small f, Clf can be used as a crude approximation for C, equation (14a).
The effective selection coefficient can be expressed in terms of the identity by descent C. It is sufficient to note that sites identical by descent (and therefore having identical alleles) do not contribute, on average, to the variance Var[k] of the number of deleterious mutations because
Hence, for a population size N sufficiently large to avoid any loss of better-fit alleles and neglecting correlation in k between genomes (Rouzine and Coffin 2005
), one can write
|
| (15) |
= 1. This interpretation is essentially based on the assumption that partial correlation between 2 genomes at individual sites does not occur: sites either have a common ancestor or do not correlate at all. In this regime, equation (15) and equation (5a) allow one to identify
|
| (14) |
Relevance for HIV Populations
Our choice of numerical values for the model parameters (L = 10100, N < 105, N0 = 103 to 105, s = 102 to 101, initial frequency of better-fit alleles
102) was tailored for a quantitative understanding of long-term HIV evolution, particularly for the reversion of deleterious alleles in untreated low-viremia patients. Our results are also relevant for improvement of the existing antiviral therapy targeted at delay of emergence of drug-resistant variants. In this case, the initial state corresponds to the small virus population surviving drug depletion and harboring a very low number of better-fit, drug-resistant alleles (frequency
102). The subsequent evolution at constant population size models the initial period of accumulation of drug-resistant alleles at a low viral load, before the onset of uncontrolled rebound.
Current drug cocktails that achieve > 99.99% virus depletion target a small number of sites in HIV genome. The level of resistance (fitness increase) sufficient for virus replication under drug is not evident from the level of depletion and has to be estimated from a population dynamics model. A recent work of one of us (Rouzine et al. 2006
) estimates that the decrease of the wild-type virus fitness due to drug by a factor of 1020 may be sufficient to achieve such a deep level of depletion. Because L
23 mutations enable virus to replicate effectively under drug, the selection coefficient under therapy is large for these critical mutations, s = ln(1020)/L = 0.71.5. Thus, the present simulations, using large L and small s, do not apply to evolution of the critical resistance mutations under the current therapy regiments. As shown in Rouzine and Coffin (2005)
using a separate argument, in this case, resistant mutations become fixed in the surviving population within a few generations, unless the effective population size is extremely low, Nc < 3045.
On the other hand, both the analytic results (Rouzine and Coffin 2005
) and the present simulations suggest that such drastic depletion levels are not necessary for successful control of resistance development. According to our model, if the number of drug target sites is increased to, for example, L = 10 (which somewhat exceeds the number of currently known drug targets), we predict that the rebound of resistant strains will require much larger population sizes. The selection coefficient per critical resistant site, in this case, is smaller, s = ln(1020)/L = 0.20.3, and the present theory applies approximately. Our previous estimate of the critical population size is Nc
N
(Rouzine and Coffin 2005
). The present numerical results, figure 4, show that the actual critical population size, below which completely resistant strains do not materialize with certainty, is considerably larger, as given by the empirical relationship Nc
(1530) N
for N0 = 103 to 105, L = 10100, and s = 0.1. Using the only available estimate of recombination rate r = N/N0
1 (Jung et al. 2002
) and the estimate of population size in an average patient N
106 (Rouzine and Coffin 1999
), we obtain N0
106 and Nc
104. The value of r may be overestimated because most coinfecting virions may originate from nearby cells and be genetically uniform. The estimate of N also has to be updated to account for a higher number of crossovers than thought previously and for the multisite nature of evolution (the quoted estimate was based on a 2-site model). However, even assuming that the value of N0 is 2 orders of magnitude smaller, that is, N0 = 104, we still obtain Nc
1000 (fig. 4), much larger than the critical depletion level for currently used drug cocktails, Nc
30.
In summary, our current results strongly reinforce the conclusion set forth in the previous study (Rouzine and Coffin 2005
) that successful long-term antiretroviral therapy could be achieved with significantly reduced dosage, provided the number of target sites is sufficiently large.
| Acknowledgements |
|---|
|
|
|---|
This work was supported by grants K25 AI 01811 to I.R. and R01 CA 089441 to J.M.C. S.G. was supported by training grant T32 AI007389. J.M.C. was an American Cancer Society Research Professor with support from the George Kirby Foundation.
| Footnotes |
|---|
Pekka Pamilo, Associate Editor
| References |
|---|
|
|
|---|
Aranson I, Tsimring L, Vinokur V. (1997) Evolution on a rugged landscape: pinning and aging. Phys Rev Lett 79:3298.[CrossRef]
Barton NH. (1995a) A general model for the evolution of recombination. Genet Res 65:123144.[Web of Science][Medline]
Barton NH. (1995b) Linkage and the limits to natural selection. Genetics 140:821841.[Abstract]
Burger R. (1999) Evolution of genetic variability and the advantage of sex and recombination in changing environments. Genetics 153:10551069.
Burt A. (2000) Perspective: sex, recombination, and the efficacy of selectionwas Weismann right? Evol Int J Org Evol 54:337351.
Chun TW, Carruth L, Finzi D, et al. [15 co-authors]. (1997) Quantification of latent tissue reservoirs and total body viral load in HIV-1 infection. Nature 387:183188.[CrossRef][Medline]
Crow JF and Kimura M. (1965) Evolution in sexual and asexual populations. Am Nat 9:439450.[CrossRef]
Dang Q, Chen J, Unutmaz D, Coffin JM, Pathak VK, Powell D, KewalRamani VN, Maldarelli F, Hu WS. (2004) Nonrandom HIV-1 infection and double infection via direct and cell-mediated pathways. Proc Natl Acad Sci USA 101:632637.
Felsenstein J. (1974) The evolutionary advantage of recombination. Genetics 78:737756.
Fisher RA. (1930) The genetical theory of natural selection(Oxford University Press, Oxford).
Frost SD, Nijhuis M, Schuurman R, Boucher CA, Brown AJ. (2000) Evolution of lamivudine resistance in human immunodeficiency virus type 1-infected individuals: the relative roles of drift and selection. J Virol 74:62626268.
Haigh J. (1978) The accumulation of deleterious genes in a populationMuller's ratchet. Theor Popul Biol 14:251267.[CrossRef][Web of Science][Medline]
Hey J. (1998) Selfish genes, pleiotropy and the origin of recombination. Genetics 149:20892097.
Hill WG and Robertson A. (1966) The effect of linkage on the limits to artificial selection. Genet Res 8:269294.[Web of Science][Medline]
Jung A, Maier R, Vartanian JP, Bocharov G, Jung V, Fischer U, Meese E, Wain-Hobson S, Meyerhans A. (2002) Multiply infected spleen cells in HIV patients. Nature 418:144.[CrossRef][Medline]
Kessler DA, Ridgway D, Levine H, Tsimring L. (1997) Evolution on a smooth landscape. J Stat Phys 87:519.[CrossRef]
Levy DN, Aldrovandi GM, Kutsch O, Shaw GM. (2004) Dynamics of HIV-1 recombination in its natural target cells. Proc Natl Acad Sci USA 101:42044209.
Lewontin RC. (1964) The interaction of selection and linkage I. General considerations; heterotic models. Genetics 49:4967.
Maynard-Smith J. (1968) Evolution in sexual and asexual populations. Am Nat 102:469473.[CrossRef][Web of Science]
Muller HJ. (1932) Some genetic aspects of sex. Am Nat 66:118138.[CrossRef][Web of Science]
Otto SP and Lenormand T. (2002) Resolving the paradox of sex and recombination. Nat Rev Genet 3:252261.[CrossRef][Web of Science][Medline]
Ridgway D, Levine H, Kessler DA. (1998) Evolution on a smooth landscape: the role of bias. J Stat Phys 90:191.[CrossRef]
Rouzine I, Wakeley J, Coffin J. (2003) The solitary wave of asexual evolution. Proc Natl Acad Sci USA 100:587592.
Rouzine IM and Coffin JM. (1999) Linkage disequilibrium test implies a large effective population number for HIV in vivo. Proc Natl Acad Sci USA 96:1075810763.
Rouzine IM and Coffin JM. (2005) Evolution of human immunodeficiency virus under selection and weak recombination. Genetics 170:718.
Rouzine IM, Sergeev RA, Glushtsov AI. (2006) Two types of cytotoxic lymphocyte regulation explain kinetics of immune response to human immunodeficiency virus. Proc Natl Acad Sci USA 103:666671.
Tsimring LS, Levine H, Kessler DA. (1996) RNA virus evolution via a fitness-space model. Phys Rev Lett 76:4440.[CrossRef][Web of Science][Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




