MBE Advance Access originally published online on March 5, 2003
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Mol. Biol. Evol. 20(3):315-337. 2003
DOI: 10.1093/molbev/msg039
© 2003 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
Detecting Recombination in 4-Taxa DNA Sequence Alignments with Bayesian Hidden Markov Models and Markov Chain Monte Carlo

* Biomathematics and Statistics Scotland (BioSS), JCMB, King's Buildings, Edinburgh, United Kingdom
Department of Applied Statistics, University of Reading, Reading, United Kingdom
| Abstract |
|---|
|
|
|---|
This article presents a statistical method for detecting recombination in DNA sequence alignments, which is based on combining two probabilistic graphical models: (1) a taxon graph (phylogenetic tree) representing the relationship between the taxa, and (2) a site graph (hidden Markov model) representing interactions between different sites in the DNA sequence alignments. We adopt a Bayesian approach and sample the parameters of the model from the posterior distribution with Markov chain Monte Carlo, using a Metropolis-Hastings and Gibbs-within-Gibbs scheme. The proposed method is tested on various synthetic and real-world DNA sequence alignments, and we compare its performance with the established detection methods RECPARS, PLATO, and TOPAL, as well as with two alternative parameter estimation schemes.
Key Words: phylogeny DNA sequence alignment recombination hidden Markov models Markov chain Monte Carlo
| Introduction |
|---|
|
|
|---|
The underlying assumption of most phylogenetic tree reconstruction methods is that there is one set of hierarchical relationships among the taxa. While this is a reasonable approach when applied to most DNA sequence alignments, it can be violated in certain bacteria and viruses due to sporadic recombination. The resulting transfer or exchange of DNA subsequences can lead to a change of the branching order (topology) in the affected region, which results in conflicting phylogenetic information from different regions of the alignment. If undetected, the presence of these so-called mosaic sequences can lead to systematic errors in phylogenetic tree estimation. Their detection, therefore, is a crucial prerequisite for consistently inferring the evolutionary history of a set of DNA sequences.
In the last few years, a plethora of methods for detecting recombination have been developedfollowing up on the seminal paper by Maynard Smith (1992)and it is beyond the scope of this article to present a comprehensive overview. Many detection methods for identifying the nature and the breakpoints of the resulting mosaic structure are based on moving a window along the sequence alignment and computing a phylogenetic divergence score for each window position. Two well-established methods following this approach are PLATO and TOPAL.
PLATO (Grassly and Holmes, 1997) estimates a phylogenetic tree from the whole DNA sequence alignment, and then systematically looks for subsets with a low likelihood under this model by computing the statistic
|
|
W
N/2. Parametric bootstrapping is applied to generate the null distribution of the maximized Q value under the null hypothesis of no recombination. If the reference model, from which the log likelihoods are computed, were the true tree (meaning the tree one would get if no recombination event had happened), significantly large Q values would be a reliable indication for recombinant regions. However, the true tree is not known, and is approximated by a tree estimated from the whole sequence alignment. This alignment includes the recombinant regions, which perturb the parameter estimation for the reference tree (see fig. 1, bottom). Consequently, the method becomes increasingly unreliable as the recombinant regions grow in length.
|
TOPAL (McGuire, Wright, and Prentice 1997; McGuire and Wright 2000), illustrated in figure 2, replaces the global by a local reference tree. A window of typically 200500 bases is slid along the DNA sequence alignment. The reference tree is estimated from the left half of the window, and used to computed a goodness-of-fit score for both parts of the window. The difference between these goodness-of-fit scores, the so-called DSS statistic, is likely to be small within a homogeneous part of the alignment, but large as the window is moved into a recombinant region. Parametric bootstrapping is applied to compute a distribution of DSS peaks under the null hypothesis of no recombination, and significantly large DSS peaks are indicators of putative recombinant breakpoints. While this method overcomes the principled shortcoming of PLATO, the spatial resolution for the identification of the breakpoints is typically of the order of the window size and, consequently, rather poor.
|
This article discusses a different approach, which follows up on earlier work by Hein (1993). The idea is to introduce a hidden state that represents the tree topology at a given site. A state transition from one topology into another corresponds to a recombination event. To introduce correlations between adjacent sites, a site graph is introduced, representing which nucleotides interact in determining the tree topology. Thus, the standard model of a phylogenetic tree is generalized by the combination of two graphical models: (1) a taxon graph (phylogenetic tree) representing the relationships between the taxa, and (2) a site graph representing interactions between different sites in the DNA sequence alignments. To keep the mathematical model tractable and the computational costs limited, the latter are reduced to nearest-neighbor interactions. Breakpoints of mosaic segments are predicted by state transitions in the site graph. While this method can only deal with a small number of sequences simultaneously, it has, in principle, the potential to predict the locations and breakpoints of recombinant regions more accurately than what can be achieved with most existing techniques.
The article is organized as follows: the next section, Method: Background and Earlier Approaches, introduces the mathematical method and discusses the shortcomings of existing parameter estimation techniques. Then, under Method: A Bayesian Approach, we discuss how earlier approaches can be improved with a Bayesian approach using Markov chain Monte Carlo. In the section titled Data we describe various synthetic and real-world DNA sequence alignments on which the proposed scheme was tested. We next present the simulation study itself and discuss the results. The article ends with a conclusion and recommendations for future work.
| Method: Background and Earlier Approaches |
|---|
|
|
|---|
Consider an alignment
of m DNA sequences, N nucleotides long. Let each column in the alignment be represented by yt, where the subscript t represents the site, 1
t
N. Hence yt is an m-dimensional column vector containing the nucleotides at the tth site of the alignment, and
= (y1,..., yN). Attached to each site is a hidden state variable St, which represents the tree topology at site t. For m taxa, there are K = (2m - 5)!! distinct unrooted topologies (where !! denotes double factorial), hence St
{1,..., K}. If a recombination event has occurred, then there will be a change in topology in this region, corresponding to a transition into another hidden state at the breakpoint of this region. Our objective is to predict the "optimal" sequence of hidden states
|
|
and some optimality criterion to be discussed below. Obviously, this optimization problem is, in general, intractable. First, the number of possible topologies at a given site, K, increases super-exponentially with the number of sequences m. Second, there are KN different state sequences, which prevents an exhaustive search even for small values of K. Consequently, the introduction of approximations and restrictions is inevitable.
To deal with the second source of computational complexity, interactions between sites are limited to nearest-neighbor interactions. This allows the application of a dynamic programing scheme which reduces the computational complexity to
(K2N), that is, to an expression linear in N. To deal with the first source of complexity, the scheme has to be restricted to alignments with small numbers of sequences. In the current work, we restrict our approach to alignments with only m = 4 taxa. In the Discussion we describe how this restriction can be relaxed.
RECPARS
Hein (1993) defined optimality in a parsimony sense. His algorithm, RECPARS, searches for the most parsimonious state sequence S, that is, the one that minimizes a given parsimony cost function E(S). Interactions between sites are restricted to nearest-neighbor interactions, as discussed above, and the search is carried out with dynamic programing. Although RECPARS is faster than the methods to be discussed below, it suffers from the shortcomings inherent in parsimony, as discussed by Felsenstein (1988). Moreover, E(S) depends only on the topology-defining sites; thus the algorithm discards a substantial proportion of sites in the alignment. The most serious disadvantage is that the cost function E(S) depends on certain parametersthe mutation cost Cmut, and the recombination cost Crecwhich can not be optimized within the framework of this method. Consequently, these parameters have to be chosen by the user in advance, and the predictions depend on this rather arbitrary prior selection.
Detecting Recombination with Hidden Markov Models
Adopting a statistical approach to phylogenetics, illustrated in figure 3, the probabilistic equivalent to RECPARS is a hidden Markov model (HMM), whose application to the detection of recombination was first suggested by McGuire, Wright, and Prentice (2000). Figure 4, left, shows the corresponding probabilistic graphical model. White nodes represent hidden states, St, which have direct interactions only with the states at adjacent sites, St-1 and St+1. Black nodes represent columns in the DNA sequence alignment, yt. The joint probability of the DNA sequence alignment,
, and the sequences of hidden states, S, factorizes:
|
|
|
|
The optimal state sequence
is the one most supported by the data, that is, the mode of P(S|
):
|
|
While, in general, this problem would be intractable because of the exponential increase in the number of state sequences (see above), the reduction to nearest-neighbor interactions between hidden states and the resulting factorization (3) allows the application of a dynamic programing technique, the so-called Viterbi algorithm (Rabiner 1989), to find the mode
with computational complexity
(N). The factorization (3) contains three terms: P(yt|St), P(St|St-1), and P(S1). The transition probabilities P(St|St-1) correspond to recombination events (if St
St-1). Let
denote the probability that the tree topology remains unchanged as we move from a given site in the alignment, t, to an adjacent site, t + 1 or t - 1. We then obtain for the state transition probabilities:
|
|
(St, St-1) denotes the Kronecker delta function, which is 1 when St = St-1, and 0 otherwise. It is easily checked that this satisfies the normalization constraint
St P(St|St-1) = 1. The emission probabilities P(yt|St) can easily be computed with the pruning algorithm (Felsenstein 1981) if the branch lengths corresponding to the topology St, w
, and the parameters of the nucleotide substitution model, 
, are known. So, more precisely, we have P(yt|St) = P(yt|St, w
, 
). To simplify the notation, define the accumulated vectors w = (w1, ... , wK) and
= (
1, ... ,
K) and define: P(yt|St, w
, 
) = P(yt|St, w,
). This means that St indicates which subvectors of w and
apply. We can depict the dependence of the probability distribution on the parameters w and
in a probabilistic graphical model, shown in figure 4, left, with the emission and transition probabilities illustrated in figures 3 and 5, respectively. The prediction task is to find the most likely hidden state sequence conditional on the observations (that is, the DNA sequence alignment) and the parameters w,
, and
:
|
|
|
The parameters w,
, and
need to be estimated.
Heuristic Parameter Estimation: HMM-Heuristic
McGuire, Wright, and Prentice (2000) estimated the branch lengths w for each tree topology separately with maximum likelihood. This approach is suboptimal. For a proper estimation of the branch lengths of a recombinant tree, one would have to restrict the parameter estimation to the recombinant region. The location of this region, however, is not known in advance. Estimating the branch lengths from the whole DNA sequence alignment leads to seriously distorted values, as demonstrated by Husmeier and Wright (2001), because the estimation includes regions of the alignment for which the tree topology is incorrect. A heuristic way to address this problem, suggested by McGuire, Wright, and Prentice (2000), is to estimate the branch lengths from a subregion of the alignment. The length of this region should be matched to the length of the recombinant region, which, however, is not known in advance. Also, this approach does not offer a way to estimate the recombination parameter
.
Parameter Estimation with Maximum Likelihood: HMM-ML
A solution to this problem, proposed by Husmeier and Wright (2001), is a proper maximum likelihood estimation of the parameters so as to maximize
|
|
, and the recombination parameter
. This requires a summation over all state sequences S = (S1,..., SN), that is, over KN terms, and seems to be intractable for all but very short sequence lengths N. However, Husmeier and Wright (2001) showed that by applying the expectation maximization (EM) algorithm (Dempster, Laird, and Rubin 1977), the sparseness of the connectivity in the HMM could be exploited to reduce the computational complexity to the order of K separate tree optimizations. While the application of this scheme outperformed the heuristic approach of McGuire, Wright, and Prentice (2000), it suffers from the shortcoming that the predicted state sequence does not only depend on the data, argmax P(S|
), but also on the parameters, argmax P(S|
, w,
,
).The fact that these parameters are estimated from the data itself with maximum likelihood renders the approach susceptible to over-fitting. This calls for an independent hypothesis test with parametric bootstrapping, which, however, incurs prohibitively high computational costs, as demonstrated by Larget and Simon (1999). To rephrase this problem, note that hidden Markov models and phylogenetic trees have many similarities with neural networks; in fact, all three models are instances of the more general class of graphical models (Heckermann 1999). Studies on neural networks and graphical models have shown that, for sparse data, maximum likelihood is susceptible to over-fitting, and that the generalization performance is significantly improved with the Bayesian approach. A detailed investigation of this approach can be found in Neal (1996). In a nutshell, maximum likelihood gives only a point estimate of the parameters, which ignores the more detailed information contained in the curvature and (possibly) multimodality of the likelihood landscape. By sampling rather than optimizing parameters, the Bayesian approach captures more information about this landscape, and consequently gives improved and more reliable predictions.
| Method: A Bayesian Approach |
|---|
|
|
|---|
A Bayesian approach to phylogenetics without recombination was proposed and tested by Yang and Rannala (1997), Mau, Newton, and Larget (1999), and Larget and Simon (1999). Generalizing this scheme to the presence of recombination requires replacing the single topology-indicating variable by the state sequence S, as discussed in the previous section. The prediction of this state sequence should be based on the posterior probability P(S|
), which requires integrating out the remaining parameters:
|
|
In principle this avoids the over-fitting scenario mentioned above and removes the need for a separate hypothesis test. The difficulty, however, is that the integral in (eq. 8) is analytically intractable, which calls for the application of a numerical approximation, using Markov chain Monte Carlo (MCMC). The practical viability of the Bayesian framework thus hinges on the performance of this scheme. In the subsections below, we will discuss the following issues: (1) the choice of prior probabilities; (2) the chosen Markov chain Monte Carlo method, which has the form of a Metropolis-Hastings and Gibbs-within-Gibbs sampling scheme; (3) methods for accelerating the convergence of the Markov chain; (4) the prediction resulting from this scheme; and (5) a software implementation. We will then test this approach on various DNA sequence alignments.
Prior Probabilities
Inherent in the Bayesian framework is the choice of prior probabilities for all model parameters, as illustrated in figure 4, right. We make the usual assumption of parameter independence, P(
, w,
) = P(
)P(w)P(
), and choose rather vague priors to reflect the absence of true prior knowledge. The prior probabilities will either be conjugate, where possible, or uniform, but proper (that is, restricted to a finite interval).
The recombination parameter
is a binomial random variable, for which the conjugate prior is a beta distribution,
|
|
and ß, as shown in figure 6.
|
The branch lengths w are defined in the usual way: that is, they represent the average number of nucleotide substitutions per site. A priori, they are assumed to be uniformly distributed in the interval [0, 1]. Fixing an upper bound on the branch lengths is necessary to avoid the use of an improper prior, for which the MCMC scheme might not converge. Because for real DNA sequence alignments branch lengths are unlikely to approach 1, this restriction should not cause any difficulties.
The prior on
depends on the model of nucleotide substitution. In the present study, the Felsenstein 84 model (Felsenstein and Churchill, 1996) is used, which has four free parameters: the nucleotide frequencies,
A,
C,
G, and
T, and the transition bias
. (Note that because of the normalization constraint
A +
C +
G +
T = 1, there are 4 rather than 5 free parameters.) In our approach, each tree is allowed to have a different value of
, whereas the nucleotide frequencies are assumed to be the same for all trees. This is for algorithmic efficiency: Allowing each tree to have a different set of frequencies means that some frequencies might be inferred from a small amount of data, leading to vague posterior distributions that slow down the convergence of the Markov chain. The total vector of nucleotide substitution parameters is thus of the form
|
|
A natural prior for the nucleotide frequencies
i is a Dirichlet distribution, which, as a multivariate generalization of the beta distribution (eq. 9), satisfies the normalization constraint. We here choose a Dirichlet (1,1,1,1) distribution, which is a uniform distribution subject to the normalization constraint and thus maximally non-informative. For the transition biases
k (k = 1,..., K), we choose a uniform prior over the interval [0, 2]. Again, an upper bound is needed to prevent the prior from becoming improper. Allowing
k to be as large as 2 will account for extreme cases of transition bias, which should not impose any serious restrictions in practice. Finally, we assume P(S1) =
S1
{1,..., K}, that is, a uniform prior on the tree topologies.
The joint distribution of the DNA sequence alignment, the state sequences, and the model parameters is given by
|
|
) is the probability of the tth column of nucleotides in the alignment, which is computed with the pruning algorithm (Felsenstein 1981), P(St|St-1,
) is the probability of transitions between states, given by (eq. 5), and P(S1), P(w), P(
), and P(
) are the prior probabilities, discussed above.
Markov Chain Monte Carlo (MCMC) Sampling
Ultimately, we are interested in the marginal posterior probability of the state sequences, P(S|
), which requires a marginalization over the model parameters according to (eq. 8). The numerical approximation is to sample from the joint posterior distribution
|
|
|
|
Define
. From (eq. 5) and (eq. 9) it is seen that writing the joint probability (eq. 11) as a function of
gives:
|
|
On normalization this gives
|
|
is the beta distribution (eq. 9), from which sampling is straightforward (Rubinstein 1981).
For sampling the state sequences S, we adopt the approach suggested by Robert, Celeux, and Diebolt (1993) and sample each state St separately, conditional on the othersthat is, with a Gibbs-within-Gibbs scheme:
|
|
|
|
) and P(St+1|St,
) are given by equation (5). Note that the expression after the
symbol is easily normalized to give a proper probability, from which sampling is straightforward (because St
{1,..., K} is discrete).
For sampling the remaining parameters, w and
, we apply the Metropolis-Hastings algorithm (see Hastings [1970] and Chib and Greenberg [1995]). Let z(i) denote the parameter configuration in the ith sampling step. A new parameter configuration
is sampled from a proposal distribution Q(
|z(i)), and then accepted with probability
|
|
. Otherwise, z(i+1) = z(i). The distribution P is given by equation (11).
Improving the Convergence of the Markov Chain
In theory the algorithm converges to the posterior distribution (eq. 12) irrespective of the choice of the proposal distribution (assuming ergodicity). In practice, a "good" choice of Q(.|.) is crucial to achieve convergence within a reasonable amount of time, and will be discussed next.
For the components wl of the vector of branch lengths w and for the transition biases
k, a new value is selected from a uniform interval centred around the existing value. This is a symmetric proposal distribution, so the terms Q(.|.) cancel out in equation (18). For the nucleotide frequencies
A,
C,
G,
T, new values are sampled from a Dirichlet distribution. This ensures that the normalization constraint
A+
C +
G +
T = 1 is satisfied. The parameters of the Dirichlet distribution are chosen proportional to the current values of the nucleotide frequencies, thereby proposing new values close to the current ones, which in turn makes it more likely that the proposed values will be accepted. This proposal distribution is not symmetric, so the Q(.|.) terms must be calculated in equation (18).
If too few proposed values are accepted, the corresponding proposal distributions Q(.|.) may be tuned to make acceptance more likely and thereby to accelerate convergence. For the branch lengths wl and the transition biases
k, this is done by decreasing the width of the uniform interval from which the new value is sampled. For the nucleotide frequencies
A,
C,
G,
T, the constant of proportionality in the Dirichlet distribution is increased so that the proposed frequencies are more likely to be closer to the existing values.
The algorithm is started by first initializing the chain. The sequence of topologies S is chosen randomly or from some initial estimation, e.g., using RECPARS. The branch lengths are set to some plausible value, e.g., the average branch length of the global maximum likelihood tree obtained with DNAML of the PHYLIP package (available from http://evolution.genetics.washington.edu/phylip.html). Initial values for the transition biases
k and the nucleotide frequencies
A,
C,
G,
T can be estimated from the data, as described later under Simulations. The parameter groups are then updated in order according to (eq. 13) and the details described above. An initial equilibration or burn-in period must be run to allow the Markov chain to reach stationarity. In this part of the simulation, the parameters of the proposal distributions Q(.|.) are tuned as described above. This burn-in is followed by the sampling phase of the simulation, in which the state sequences S (and, if of interest, the model parameters) are saved for further analysis. Note that during the sampling phase, the parameters of the proposal distributions must not be tuned, as this might lead to biased samples that do not represent the correct posterior probability (eq. 12).
In principle, the initialization of the hidden states is unimportant because the Markov chain will forget its initial configuration and converge toward the equilibrium distribution irrespective of its starting point. In practice, however, extreme starting values can slow down the mixing of the chain and result in a very long burn-in, in which case the MCMC sampler may fail to converge toward the main support of the posterior distribution in the available simulation time. To address this problem, we combined simulations from different initializations and explored a method akin to simulated annealing (Kirkpatrick, Gelatt, and Vecchi 1983). Note that the bottleneck of the presented Markov chain Monte Carlo scheme is the sampling of the topology sequences S. Because recombination events are quite rare, the number of topology changes along the DNA sequence alignment is usually small and, consequently, the posterior distribution of
concentrated on values close to 1. This discourages state transitions and may slow down the mixing of the Markov chain. To increase the transition rate during equilibration, we therefore modified the transition probabilities as follows: Let T denote the total length of the equilibration phase, and let i
{1,...,T} denote the ith sample of the Markov chain during equilibration. Then, during equilibration, equation (15) is replaced by
|
|
will be sampled with a higher probability than with the standard (unannealed) scheme (assuming the prior has been chosen sufficiently vague). This facilitates transitions between state sequences and can be expected to improve the mixing of the Markov chain.
Prediction
Recall that the proposed Bayesian method samples topology sequences from the joint posterior probability P(S|
) = P(S1,..., SN|
), where St (t = 1,..., N) represents the topology at site t. To display the results graphically, we marginalize, for each site in turn, over all the remaining sites so as to return the marginal posterior probabilities P(St|
). These are then plotted, for each topology P(St = 1|
), P(St = 2|
), and P(St = 3|D), along the sequence alignment. Assigning each site t to the mode of the posterior probability P(St|
) gives a list of putative recombinant regions, identical to the output of RECPARS. This is a useful reduction of information that allows the comparison of classification scores, as shown later (see figure 12). Note, however, that the posterior probabilities P(St|
) contain further, additional information, as they also indicate the uncertainty of the prediction.
|
Implementation
The method discussed above has been implemented in the C++ program package BARCE, which is freely available from http://www.bioss.sari.ac.uk/
dirk/My_software.html. | Data |
|---|
|
|
|---|
We have tested the viability of the proposed method on various DNA sequence alignments, including a simulated recombination and the sequences of maize, hepatitis B virus, and Neisseria.
Simulated Recombination
DNA sequences, 1000 bases long, were evolved along a 4-species tree, using the Kimura model of nucleotide substitution (Kimura 1980) with a transition-transversion ratio of 2. Two recombination events were simulated, as shown in figure 7. Topology 1 is the "true" topology, which applies to those parts of the alignment that are not affected by recombination. The four sequences are evolved along the interior branch and the first quarter of the exterior branches of a phylogenetic tree (top left). At this point, the subsequence between sites 201 and 400 in Strain 3 is replaced by the corresponding subsequence in Strain 1 (top right). The sequences then continue to evolve along the exterior branches until the branch length is 0.75 times the final exterior branch length (middle, left). This is followed by a second recombination event, where the subsequence between sites 601 and 800 in Strain 2 replaces the corresponding subsequence in Strain 3 (middle right). The sequences then continue to evolve along the exterior branches for the remaining length (bottom left). The resulting mosaic structure of the alignment is shown in figure 7, bottom right. In the main part of the alignment, Strain 3 is most closely related to Strain 4. However, in the region between sites 201 and 400, it is most closely related to Strain 1, and in the region between 601 and 800, it is most closely related to Strain 2 (bottom right). Thus, the first, more ancient, recombination event corresponds to a transition from Topology 1 into Topology 2. The second, more recent, recombination event corresponds to a transition from Topology 1 into Topology 3. This model simulates a realistic scenario where an ancestor of Strain 3 incorporates genetic material from ancestors of other extant strains, which in each case is followed by subsequent evolution. The simulations were repeated for a different mosaic structure, where the first recombinant region was extended by 100 nucleotides (region 201500), and the second region was shortened by 100 nucleotides (region 701800). The mosaic structures are shown in the top-left subfigure of figures 811![]()
![]()
. For each mosaic structure, we repeated the simulation with three different tree heights, where the tree height is defined as half the sum of all the branch lengths between the two strains that are farthest apart. Note that as the tree height becomes smaller, the numbers of polymorphic and topology-defining sites decrease. This reduces the information content in the alignment and makes the detection of recombinant regions more difficult.
|
|
|
|
|
The software used for this simulation can be downloaded from http://www.bioss.sari.ac.uk/
dirk/software/Sirens/INFO.html.
Maize
Gene conversion is a process equivalent to recombination, which occurs in multigene families, where a DNA subsequence of one gene can be replaced by the DNA subsequence from another. Indication of gene conversion between a pair of maize actin genes has been reported by Moniz de Sa and Drouin (1996), who showed that the Maz56 and Maz63 genes had a gene conversion covering the first 875 nucleotides of their coding regions. We applied our algorithm to a multiple alignment of the following four maize sequences (1008 nucleotides long): Maz56 (GenBank/EMBL accession number U60514), Maz63 (U60513), Maz89 (U60508), and Maz95 (U60507). The sequences were aligned with ClustalW (Thompson, Higgins, and Gibson 1994), using the default parameter settings. We define the states of the HMM as follows: Topology 1: [(Maz56,Maz63),(Maz89,Maz95)]; topology 2: [(Maz56,Maz89),(Maz63,Maz95)]; topology 3: [(Maz56,Maz95),(Maz63,Maz89)]. (See figure 15, top, for an illustration.)
|
Hepatitis B
A DNA virus with a short genome of only 3200 bases causes hepatitis B infection. Evidence for recombination was first found by Bollyky et al. (1996), and in this study we investigated a subset of four strains with the following GenBank identifiers (accession numbers in square brackets): (1) HPBADW1 [D00329], (2) HPBADW2 [D00330], (3) HPBADWZCG [M57663], (4) HPBADRC [D00630]. The sequences were aligned with ClustalW, using the default parameters. Columns with gaps were discarded, giving a total alignment length of 3049 bases. Bollyky and associates (1996) found a recombinant region of 189 bases in HPBADWZCG between t = 1865 and t = 2054 (when not removing gaps: t = 20142203), corresponding to a transition from topology St = 1 (HPBADW1 grouped with HPBADW2) into topology St = 2 (HPBADW1 grouped with HPBADWZCG).
Neisseria
One of the first indications for sporadic recombination was found in the bacterial genus Neisseria (Maynard Smith 1992). We chose a subset of the 787-nucleotide Neisseria argF DNA multiple alignment studied by Zhou and Spratt (1992), where we selected the four strains (1) N. gonorrhoeae [X64860], (2) N. meningitidis [X64866], (3) N. cinera [X64869], and (4) N. mucosa [X64873] (GenBank/EMBL accession numbers are in brackets). Zhou and Spratt (1992) found two anomalous, or more diverged regions in the DNA alignment, which occur at positions t = 1202 and t = 507538 (Note that Zhou and Spratt (1992) used a different labeling scheme, with the first nucleotide at t = 296, and the last one at t = 1082.) In the rest of the alignment, N. meningitidis clusters with N. gonorrhoeae (defined as topology St = 1 in our HMM), whereas between t = 1 and t = 202, they found that N. meningitidis is grouped with N. cinera (defined as state St = 3). Zhou and Spratt (1992) suggested that the region t = 507538 might be the result of rate variation. The situation is illustrated in figure 16.
|
Note that by restricting the alignments to m = 4 sequences we keep the dimension of the state space limited to K = 3 tree topologies: St
{1,2,3}. | Simulations |
|---|
|
|
|---|
We applied RECPARS with different ratios of the mutation and recombination costs, Crec/Cmut, using the program written by Kim Fisker (ftp://ftp.daimi.aau.dk/pub/empl/kfisker/programs/RecPars/).
TOPAL was applied with two different window lengths: W = 100 and W = 200. We used Version 2 of the program (McGuire and Wright 2000) with the default options except for the nucleotide substitution model, where we replaced the (default) Jukes-Cantor model with the Kimura model. The transition-transversion ratio was estimated with Puzzle (Strimmer and von Haeseler 1996).
For PLATO, we used Version 2.11 of the program developed by Grassly and Holmes (1997), available from http://evolve.zoo.ox.ac.uk/software/Plato/Plato2.html, and we varied the window length between five bases and half the sequence length. The reference tree was obtained with maximum likelihood from the whole DNA sequence alignment, using DNAML of the PHYLIP package (available from http://evolution.genetics.washington.edu/phylip.html) or Puzzle. On the synthetic data, we used a uniform substitution rate, and set the transition-transversion ratio to the known true value. On the real data, we used two models of rate heterogeneity: (1) a uniform rate and (2) gamma distributed rates with five rate categories. The respective PLATO commands are these: (1) plato -mHKY -tTAU and (2) plato -g5 -aALPHA -mHKY -tTAU, where TAU and ALPHA are the transition-transversion ratio and the alpha parameter of the discrete gamma distribution, respectively, both estimated with Puzzle.
The application of HMM-heuristic was similar to the study by McGuire, Wright, and Prentice (2000). We chose the Felsenstein 84 model of nucleotide substitution, estimating the transition-transversion ratio with maximum likelihood (using Puzzle), and estimating the nucleotide frequencies
A,
C,
G,
T from the data according to
X = NX/
X'NX', where NX is the number of occurrences of nucleotide X
{A, C, G, T}. For each topology in turn, we optimized the branch lengths of the corresponding phylogenetic tree with maximum likelihood on the whole alignment, using the program DNAML of the PHYLIP package. As opposed to McGuire Wright, and Prentice (2000), we did not restrict the optimization to subsets of the alignments, since the subset size is a parameter that cannot be properly optimized within the framework of this approach.
For training the HMM-ML, we followed Husmeier and Wright (2001) and optimized the recombination parameter
and all the branch lengths simultaneously in a maximum likelihood sense with the EM algorithm, using the MATLAB programs written by the authors (available from http://www.bioss.sari.ac.uk/
dirk/My_software.html).
Finally, the proposed Bayesian MCMC scheme, HMM-Bayes, was applied as follows. We used the Felsenstein 84 model of nucleotide substitution, with a prior on the parameters as described earlier under Method: A Bayesian Approach. For the prior on
[0,1], we chose the three distributions shown at the bottom of figure 6, which incorporate our knowledge that P(
) must be skewed toward the right of the interval [0,1]. This is so because
= 0.5 means that, on average, every second site is subject to recombination, and we know that recombination events are much rarer, that is, that
>> 0.5. We found that for the chosen priors, the simulations gave very similar results. Recall that the posterior probability is the product of the prior and the likelihood. Whereas the first term is constant, the second term scales like N, the number of sites in the DNA sequence alignment. Consequently, for a sufficiently long alignment and a reasonable prior, the weight of the likelihood is considerably higher than that of the prior, and variations of the latter have therefore only a marginal influence on the prediction. This was borne out in the simulations, as shown in figure 21 and discussed in the Appendix.
|
The initial nucleotide frequencies and the initial transition-transversion ratio were estimated from the data, as described above. Equilibration was carried out over 105 - 106 MCMC steps. This was followed by a sampling phase of the same length, during which the parameters
, w,
, and topology sequences S were sampled in intervals of 1000 MCMC steps. From the recorded topology sequences S, we computed the marginal posterior probabilities P(St = 1|
), P(St = 2|
), and P(St = 3|
) for all sites in the DNA sequence alignment, 1
t
N. | Results |
|---|
|
|
|---|
Comparison with RECPARS
We applied both HMM-Bayes and RECPARS to the synthetic DNA sequence alignments described earlier under Data. The objective of this simulation study was to test the performance of both methods on different (a priori known) mosaic structures and for varying levels of difficulty of the detection problem (which is related to the tree height, also discussed under Data). The results are shown in figures 811
When the tree height is decreased to 0.1, neither RECPARS nor HMM-Bayes predicts the mosaic structure of the alignment correctly. RECPARS finds only one recombinant region, which for the first alignment is even badly misplaced (fig. 8, bottom right). HMM-Bayes detects both recombinant regions and even locates them rather accurately, but it misclassifies the topology change for one of these regions (fig. 9, bottom right; fig. 11, bottom right). This is most likely a consequence of the fact that for small tree heights, the number of mutations and, consequently, the number of polymorphic sites is small. Thus, there is less information in the data, and any inference is inevitably less accurate.
For a more quantitative comparison between RECPARS and HMM-Bayes, recall that the detection of recombination is basically a classification problem: Each site in the sequence alignment is assigned to one of the three possible tree topologies. For RECPARS, this is done directly. For HMM-Bayes, it is done by assigning each site to the mode of the posterior probability. We use two criteria to rate the performance of the methods: The sensitivity, which is the percentage of correctly classified recombinant sites, and the specificity, which measures the percentage of correctly classified non-recombinant sites. Comparing the performance of RECPARS and HMM-Bayes across all simulations, shown in figure 12, we found that HMM-Bayes gives a consistent and significant improvement on RECPARS in the accuracy of locating and classifying the recombinant regions, as indicated by a systematically increased sensitivity score.
Figure 13 compares the predictions of RECPARS and HMM-Bayes on the real-world DNA sequence alignments. First, it can be seen that the predictions with RECPARS depend sensitively on the recombination-mutation cost ratio Crec/Cmut. The best predictions show a qualitative agreement with the predictions from the literature, but note that selecting the best value of Crec/Cmut in this way, with the benefit of hindsight, is not possible in real applications where the location of the recombinant regions is not known beforehand. Also note that setting Crec/Cmut = 1.5, as suggested by Wiuf Christensen, and Hein (2001), is not guaranteed to give reliable results, as can be seen for the hepatitis B sequence alignment (fig. 13, middle left), where this parameter setting leads to an over-tessellated mosaic structure with false positive predictions of spurious recombinant regions.
|
Second, even when selecting the best Crec/Cmut ratio, the agreement between the breakpoints predicted with RECPARS and those from the literature is good only for the hepatitis B sequence alignment (fig. 13, middle left). For the maize alignment (fig. 13, top left), RECPARS predicts a large uncertainty in the location of the breakpoint. This is a consequence of the lack of topology-defining sites in this part of the alignment (recall that RECPARS uses only these sites): between t = 706 and t = 968, the alignment contains only a single topology-defining site. For the Neisseria sequence alignment (fig. 13, bottom left), the recombinant region predicted with RECPARS is far too short.
The subfigures on the right of figure 13 show the predictions with HMM-Bayes. Note that these predictions do not depend on any heuristic tuning of parameters, since all the parameters are properly inferred from the data by sampling them from the posterior distribution (12) with MCMC.
The mosaic structures and the breakpoints predicted with HMM-Bayes for the maize and the hepatitisB sequence alignments are in agreement with the results from the literature (fig. 13, top and middle). Also, the location of the predicted recombinant region in Neisseria accords with the prediction by Zhou and Spratt (1992). Their second anomalous region, shown by the gap in figure 13, bottom left, is modeled by a distributed representation with HMM-Bayes; this reflects the uncertainty about the nature of this region. The only difference between the literature and the prediction with HMM-Bayes is the presence of a further peak of P(St = 3|
) at the end of the alignment (see fig. 13, bottom right). Because this region is very short (less than 20 bases long) and therefore difficult to detect with other methods, we assume that we have found a true recombinant region that has not been discovered before.
Comparison with the Heuristic HMM
Figure 14 shows the results obtained on one of the synthetic DNA sequence alignments (mosaic structure A, tree height 0.2). The subfigure on the left shows the prediction of P(St|
) with HMM-heuristic. For this method, the recombination parameter
has to be specified in advance, and we have set it to the mean of the prior distribution:
= 0.8. It can be seen that the overall pattern of the posterior probabilities is correct, showing an increase for topology St = 2 in the region 200 < t < 400, and an increase for topology St = 3 in the region 600 < t < 800. However, the signals are very noisy, and an automatic classification based on the mode of the posterior probability would incur a high proportion of erroneously predicted topology changes. The Bayesian scheme, HMM-Bayes, shown on the right of figure 14, overcomes this shortcoming. The predicted state transitions coincide with the true breakpoints. The posterior probabilities for the topologies, P(St|
), are close to zero or one, which significantly reduces the noise. By assigning each site in the alignment to the mode of P(St|
), the tree topologies and the mosaic structure of the alignment are predicted correctly. The mean and the standard deviation of the posterior distribution P(
|
) are 

posterior = 0.992 and
posterior = 0.004. With four breakpoints in an alignment of length 1000 bases, the true value for the recombination parameter is
= 0.996, which deviates from the prediction by only 0.4%.
|
Figure 15 shows the prediction of P(St|
) for the maize sequence alignment. The subfigures in the middle row show predictions obtained with HMM-heuristic, using different recombination parameters,
= 0.8 (left) and
= 0.95 (right). The overall pattern of the graphs captures the gene conversion event in that the final section shows a clear increase of the posterior probability for topology St = 3. However, the signals are very noisy and unsuitable for an automatic detection of gene conversion without human intervention. The subfigure on the bottom left of figure 15 shows the prediction with HMM-heuristic when setting
to the Bayesian posterior mean,
= 0.997, obtained with HMM-Bayes. This leads to a considerable reduction of the noise and a qualitatively correct prediction of the gene conversion event. However, the breakpoint deviates considerably from that predicted by Moinz de Sa and Drouin (1996). A clear improvement is obtained with HMM-Bayes (figure 15, bottom right), which predicts a sharp transition from topology St = 1 into topology St = 3 at the location t predicted by Moinz de Sa and Drouin (1996).
Comparison with Maximum Likelihood
On the maize and hepatitis B sequence alignments, the predictions with HMM-ML and HMM-Bayes were practically indistinguishable (graphs not included in this article). The difference between the two approaches is in the confidence that we have in the prediction. The prediction with HMM-ML, P(St|
, w,
,
), is dependent on the model parameters w,
,
, which were fitted with maximum likelihood and might therefore be subject to over-fitting. This calls for an independent statistical significance test, using, e.g., parametric bootstrapping. This approach is extremely computationally expensive, as discussed by Larget and Simon (1999). The prediction with HMM-Bayes, on the other hand, is only dependent on the data, P(St|
), because the model parameters have been integrated out. This means that the prediction is consistent within the Bayesian framework and does not require an independent significance test (assuming sufficient convergence of the MCMC simulation).
Figure 16 shows the prediction of P(St|
) on the Neisseria sequence alignment, where the subfigure in the bottom left was obtained with HMM-ML, and the subfigure in the bottom right with HMM-Bayes. Both methods agree in predicting a sharp transition from topology St = 3 to St = 1 at breakpoint t = 202, which is in agreement with the findings by Zhou and Spratt (1992). Both methods also agree in predicting a short recombinant region of the same topology change at the end of the alignment. However, while the prediction with HMM-ML could have been the result of over-fitting, HMM-Bayes, by integrating out the model parameters, is not susceptible to this fallacy. This corroborates the prediction with HMM-ML, in the same way as a frequentist hypothesis test, and thus suggests that we have discovered a new recombinant region undetected by Zhou and Spratt (1992).
Differences between the predictions of HMM-ML and HMM-Bayes are found in the middle of the alignment, where two further breakpoints occur at sites t = 506 and t = 537. This is in agreement with Zhou and Spratt (1992). However, while Zhou and Spratt (1992) suggested that the region between t = 506 and t = 537 might be the result of rate heterogeneity, HMM-ML predicts a recombination event with a clear transition from topology St = 1 into St = 2. This seems to be the result of over-fitting: Because the distribution of the nucleotide column vectors yt in the indicated region is significantly different from the rest of the alignment, modeling this region with a different hidden state can increase the likelihood, although the hidden state itself (topology St = 2) might be ill-matched to the data. This deficiency is redeemed with HMM-Bayes, whose prediction is shown in figure 16, bottom right. The critical region between sites t = 506 and t = 537 is again identified, indicated by a strong drop in the posterior probability for the dominant topology, P(St = 1|
). However, the uncertainty in the nature of this region is indicated by a distributed representation, where both alternative hidden states, St = 2 and St = 3, are assigned a significant probability mass. With the prediction of this uncertainty, HMM-Bayes also indicates a certain model misspecification inherent in the current schemethe absence of hidden states for representing different evolutionary ratesand thus avoids the over-fitting incurred when applying HMM-ML.
To test the conjecture that the Bayesian approach is more robust to over-fitting than maximum likelihood, we carried out two further simulation studies. In the first study, we simulated the effect of rate heterogeneity (fig. 17, top). We simulated the evolution process along the branches of a four-species phylogenetic tree with the Kimura model, as described earlier under Data, but reduced the rate of nucleotide substitution by a factor of 1/5 in the center region, between sites t = 301 and t = 600. The results are shown in the bottom of figure 17. The maximum likelihood approach clearly over-fits and erroneously predicts a recombinant region. The Bayesian MCMC approach is only slightly affected in its prediction of the posterior probabilities: the graph of P(St = 1|
) shows a small dent. This does not lead to classifying any site in the alignment as a topology different from St = 1, though, hence no recombination is predicted, and over-fitting is avoided.
|
The objective of the second simulation study was to test how reliable the prediction of the prediction uncertainty is. This is important for medical applications: When predicting that a certain HIV strain, for instance, is a mosaic of well-established strains, a pharmaceutical company would like to know how reliable this prediction is before launching an expensive drug or vaccine development project.
To this end, we first simulated a recombination process according to the method described in the Data section, and then simulated observational noise, corresponding to wrong base calls or typing errors in DNA sequencing, by randomly replacing 20% of the columns in the alignment by those of a second alignment that was unaffected by recombination. The process is illustrated in the top of figure 18. The consequence is that the uncertainty in the prediction of the recombinant region should increase, and the recombination event should be predicted with a probability less than 1. Figure 18 (bottom left) shows the prediction with maximum likelihood. The location of the breakpoint (in the middle of the alignment at site 200) is fairly accurate, and the nature of the mosaic structure has been identified correctly in that a recombination event corresponding to a change from topology 1 (left) into topology 3 (right) is predicted. However, this prediction is overconfident in that the topology change is predicted with probability 1, which ignores the effect of typing errors. When applying the Bayesian approach, we found that the MCMC trajectories converged to two different semiconverged states, one corresponding to a clear recombination event and the other to the absence of any recombination. While this bimodality indicates insufficient mixing of the Markov chain, it also points to the intrinsic uncertainty in the prediction problem, caused by the introduction of typing errors (we did not observe any bimodality in the absence of typing errors). When combining several MCMC trajectories started from different initializations, we obtained the prediction shown in figure 18, bottom right, which, as opposed to the maximum likelihood approach, indicates the intrinsic uncertainty by predicting a recombination event with probability less than one.
|
Comparison with the Window Methods PLATO and TOPAL
Finally, we have compared the performance of HMM-Bayes with the window methods PLATO and TOPAL. The results are shown in figure 19. For PLATO, the agreement with the "true" (meaning true for the synthetic data and predicted in the literature for the real data) locations is poor. This is most likely a consequence of the principled shortcoming of PLATO: Because the recombinant regions are rather long, they have a substantial impact on the estimation of the reference tree, causing the Q-statistic of equation (1) to lose power and rendering the detection method unreliable.
|
The last two rows of figure 19 show the DSS statistic of TOPAL for the two window sizes W = 100 (third row) and W = 200 (bottom row). It can be seen that the results depend critically on this parameter, which has to be chosen sufficiently large. For W = 100, the agreement between the predicted and the true locations of the breakpoints is poor. Doubling the window length, W = 200, gives a qualitatively correct prediction of the breakpoints, except for a spurious peak at the beginning of the hepatitis B alignment. Notice that a short window of W = 100 is not recommended. However, increasing the window size to W = 200 degrades the spatial resolution and leads to a larger uncertainty in locating the breakpoints.
On comparing these methods with HMM-Bayes (figs. 13 and 14), we find that the latter gives more accurate predictions than PLATO and more precise locations of the breakpoints than TOPAL, with the further advantage that all parameters are estimated from the data, and no arbitrary window parameter has to be chosen in advance.
| Discussion |
|---|
|
|
|---|
In this article, we have proposed a Bayesian MCMC method (HMM-Bayes) for detecting recombination with HMMs. This follows up on earlier work by Hein (1993); McGuire, Wright, and Prentice (2000); and Husmeier and Wright (2001), where the parameters were not estimated (RECPARS), were estimated heuristically (HMM-heuristic), or were estimated with maximum likelihood (HMM-ML). We have compared the methods on various synthetic and real-world DNA sequence alignments and found that HMM-Bayes leads to a considerable improvement on RECPARS and HMM-heuristic in predicting the nature and breakpoints of recombinant regions. All model parameters are properly inferred from the data, removing the need for arbitrarily setting these parameters by hand in advance. In comparison with the older maximum likelihood scheme (HMM-ML), the Bayesian approach has been found to be more robust against over-fitting, and to give a more reliable estimation of the uncertainty of the prediction.
Note that our approach focuses on topology changes rather than recombination in general. If a recombination event only changes the branch lengths of a tree, without affecting its topology, it will not be detected. However, the main motivation for our method is a prescreening of an alignment for topology changes as a crucial prerequisite for a consistent phylogenetic analysis. Most standard phylogenetic methods are based on the implicit assumption that the given alignment results from a single phylogenetic tree. In the presence of recombination they would therefore infer some "average" tree. If the recombination event leads to different trees with the same topology but different branch lengths, then the wrong assumption of having only one tree is not too dramatic: The topology will still be correct (as it has not been changed by the recombination event), and the branch lengths will show some average value. This is still reasonable, because branch lengths are continuous numbers, and the mean of continuous numbers is well defined. If the recombination event, however, leads to a change of the topology, inferring an average tree is no longer reasonable: Tree topologies are cardinal entities, for which an average value is not defined. In fact, it is well known that in this case the resulting average tree will be in a distorted "limbo" state between the dominant and recombinant trees, which renders the whole inference scheme unreliable. Thus, by focusing on recombination-induced topology changes we focus on those events that cause the main problems with standard phylogenetic analysis methods.
We should further point out that our approach is not about estimating recombination rates, but rather about inferring the mosaic structure of a particular sequence alignment. The proposed method has a parameter that might be confused with a recombination rate: the parameter
, which should actually be referred to as a recombination probability (the probability that on moving from the nth site to the (n + 1)th site in the sequence alignment no topology change occurs). This parameter corresponds to the recombination cost in RECPARS; it is not a recombination rate in population genetics terms. By estimating this parameter from the datathat is, the sequence alignmentthe prediction performance of the algorithm improves considerably, thereby overcoming the main shortcomings of RECPARS and HMM-heuristic, where this parameter has to be chosen arbitrarily in advance. We do not claim, however, that
has any meaning in itself: it is a parameter whose proper estimation from the sequence alignment improves the detection of phylogenetic topology changes, and this is what we are interested in.
A limitation of the proposed method is that the states of the HMM represent only different tree topologies but do not allow for different rates of evolution. A way to redeem this deficiency is to employ a factorial hidden Markov model (FHMM), as discussed by Ghahramani and Jordan (1997), and to introduce two different types of hidden states: one representing different topologies, the other representing different evolutionary rates. This effectively combines the method of the present paper with the approach of Felsenstein and Churchill (1996). While parameter estimation with maximum likelihood would lead to a considerable increase of the computational complexity (Ghahramani and Jordan, 1997), it seems that this increase will be less dramatic for the MCMC method, because the Gibbs sampling scheme of (eq. 17) can be applied to both types of hidden states separately. A detailed investigation of this approach is the subject of future research.
As mentioned earlier under Method: Background and Earlier Approches, the method presented here is restricted to DNA sequence alignments with small numbers of taxa, and our current software implementation can only deal with alignments of four sequences. This is because each possible tree topology constitutes a separate state of the HMM. In practical applications, our method has therefore to be combined with a fast low-resolution preprocessing method, like RECPARS or TOPAL: In a first, preliminary analysis, apply RECPARS or TOPAL to identify putative recombinant sequences and their approximate mosaic structures. In a second, subsequent step, apply HMM-Bayes to the tentative sets of four sequences that result from the previous step. This will allow a more accurate analysis of the mosaic structure than can be obtained from a window method like TOPAL with its inherently low resolution, and it will resolve contradictions that are likely to arise from different (arbitrary) settings of the parsimony cost parameters of RECPARS. In general, after identifying a small set of putative recombinant sequences with any fast low-resolution method, the exact nature of the recombination processes and the location of the breakpoints can be further investigated with the high-resolution method proposed in this article.
Note that, in principle, our method can deal with more than four sequences. All that is required is that the set of different candidate topologies, which constitute the states of the HMM, is sufficiently small (fewer than 10, say). Such a sparse set of candidate topologies can be obtained from the preliminary analysis with one of the lower-resolution methods. The proposed Bayesian HMM method can then be applied in a subsequent step, with the hidden states set to the topologies obtained from the preliminary analysis. This will, obviously, not be able to detect any topology changes not detected before, but it would still be likely to give an improvement on the low-resolution method of the previous step in that recombination breakpoints and the nature of the mosaic structure would be predicted more accurately.
A more ambitious goal would be to improve the methodology itself by introducing a more informative form of the transition probabilities between the topologies. In the current version, all changes into other topologies are equally likely, as expressed by (eq. 5). For four sequences with only three possible tree topologies, this seems to be a valid assumption. However, on increasing the number of sequences, this uniform prior on the topologies leads to a superexponential explosion of the parameter space. Choosing a more informative prior that, given a topology, is only nonzero for closely related topologiesas suggested by Hein (1993) and implemented in RECPARSmight offer a way to apply the proposed method to alignments of more than four sequences. This approach, however, has not yet been explored and will certainly offer an interesting topic for future research.
| Appendix |
|---|
|
|
|---|
To monitor the convergence of the MCMC simulations, we inspected trace plots, like those in figure 20, to check whether the MCMC trajectories had reached stationarity. We then repeated the simulations from different initializations, as shown in figure 21, and tested for consistency of the results.
|
Figure 21 shows the dependence of the predictions on both the prior and the initialization. All predicted posterior probabilities P(St |
) are similar in that they show a transition from topology 1 into topology 3 around position t = 875. The graphs differ slightly in the exact form of this transition. This implies that when deriving from the posterior probabilities a crisp classifier that flags a recombinant region when P(St = 3|
) > 0.5, the predicted breakpoints may vary slightly between the simulations. In fact, we found that the predictions for the breakpoints varied between t = 771 and t = 781, whereas Moniz de Sa and Drouin (1996) predicted a breakpoint at t = 775. This is not surprising. All simulations predict uncertainty in the immediate neighborhood of t = 775. Consequently, when estimating the breakpoint from the posterior probabilities, this estimate itself is subject to this uncertainty. Note, however, that the uncertainty inherent in our estimation is of the order of ±5 nucleotides, which is considerably more precise than that of the alternative detection methods discussed in the article.
| Acknowledgements |
|---|
|
|
|---|
This work was funded by the Biotechnology and Biological Sciences Research Council (BBSRC) Bioinformatics Initiative and the Scottish Executive Environmental and Rural Affairs Department (SEERAD). The research leading to this paper was carried out in part at the Isaac Newton Institute, Cambridge, United Kingdom. We would like to thank Frank Wright and Karen Ayres for helpful discussions. We are also grateful to the associated editor and to four anonymous referees for their detailed comments on earlier versions of this manuscript.
| Footnotes |
|---|
E-mail: dirk{at}bioss.ac.uk.
| Literature Cited |
|---|
|
|
|---|
Bollyky, P. L., A. Rambaut, P. H. Harvey, and E. C. Holmes. 1996. Recombination between sequences of hepatitis B virus from different genotypes,. J. Mol. Evol. 42:97-102.[CrossRef][Web of Science][Medline]
Casella, G., and E. I. George. 1992. Explaining the Gibbs sampler. Am. Statist. 46:167-174.[CrossRef]
Chib, S., and E. Greenberg. 1995. Understanding the Metropolis-Hastings Algorithm. Am. Statist. 49:327-335.[CrossRef]
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B39:1-38.
Felsenstein, J. 1981. Evolution trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[CrossRef][Web of Science][Medline]
Felsenstein, J.. 1988. Phylogenies from molecular sequences: inference and reliability,. Annu. Rev. Genet. 22:521-565.[CrossRef][Web of Science][Medline]
Felsenstein, J., and G. A. Churchill. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93-104.[Abstract]
Ghahramani, Z., and M. I. Jordan. 1997. Factorial hidden Markov models,. Machine Learn. 29:245-273.[CrossRef]
Grassly, N. C., and E. C. Holmes. 1997. A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol. Biol. Evol. 14:239-247.[Abstract]
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.
Heckermann, D. 1999. A tutorial on learning with Bayesian networks. Pp. 301354 in M. I. Jordan, ed. Learning in Graphical Models, Adaptive computation and machine learning. MIT Press, Cambridge, Massas.
Hein, J. 1993. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol. 36:396-405.
Husmeier, D., and F. Wright. 2001. Detection of recombination in DNA multiple alignments with hidden Markov models. J. Comput. Biol. 8:401-427.[CrossRef][Web of Science][Medline]
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.[CrossRef][Web of Science][Medline]
Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi. 1983. Optimization by simulated annealing. Science 220:671-680.
Larget, B., and D. L. Simon. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16:750-759.[Web of Science]
Mau, B., M. A. Newton, and B. Larget. 1999. Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55:1-12.[CrossRef][Web of Science][Medline]
Maynard Smith, J. 1992. Analyzing the mosaic structure of genes. J. Mol. Evol. 34:126-129.[Web of Science][Medline]
McGuire, G., and F. Wright. 2000. TOPAL 2.0: improved detection of mosaic sequences within multiple alignments. Bioinformatics 16:130-134.
McGuire, G., F. Wright, and M. J. Prentice. 1997. A graphical method for detecting recombination in phylogenetic data sets. Mol. Biol. Evol. 14:1125-1131.[Abstract]
McGuire, G., F. Wright, and M. J. Prentice.. 2000. A Bayesian method for detecting recombination in DNA multiple alignments. J. Comput. Biol. 7:159-170.[CrossRef][Web of Science][Medline]
Moniz de Sa, M., and G. Drouin. 1996. Phylogeny and substitution rates of angiosperm actin genes. Mol. Biol. Evol. 13:1198-1212.[Abstract]
Neal, R. M. 1996. Bayesian learning for neural networks, vol. 118, Lecture notes in statistics. Springer-Verlag, New York.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77:257-286.[CrossRef]
Robert, C. P., G. Celeux, and J. Diebolt. 1993. Bayesian estimation of hidden Markov chains: A stochastic implementation. Statist. Prob. Lett. 16:77-83.[CrossRef]
Rubinstein, R. Y. 1981. Simulation and the Monte Carlo method. John Wiley & Sons, New York.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.[Web of Science]
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
Wiuf, C., T. Christensen, and J. J. Hein. 2001. A simulation study of the reliability of recombination detection methods. Mol. Biol. Evol. 18:1929-1939.
Yang, Z., and B. Rannala. 1997. Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method,. Mol. Biol. Evol. 14:717-724.[Abstract]
Zhou, J., and B. G. Spratt. 1992. Sequence diversity within the argF, fbp and recA genes of natural isolates of Neisseria meningitidis: interspecies recombination within the argF gene,. Mol. Microbiol. 6:2135-2146.[CrossRef][Web of Science][Medline]
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
E. W. Bloomquist and M. A. Suchard Unifying Vertical and Nonvertical Evolution: A Stochastic ARG-based Framework Syst Biol, November 9, 2009; (2009) syp076v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Webb, J. M. Hancock, and C. C. Holmes Phylogenetic inference under recombination using Bayesian stochastic topology selection Bioinformatics, January 15, 2009; 25(2): 197 - 203. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. W. Bloomquist, K. S. Dorman, and M. A. Suchard StepBrothers: inferring partially shared ancestries among recombinant viral sequences Biostat., January 1, 2009; 10(1): 106 - 120. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. McCauley, S. de Groot, T. Mailund, and J. Hein Annotation of selection strengths in viral genomes Bioinformatics, November 15, 2007; 23(22): 2978 - 2986. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. F. Boni, D. Posada, and M. W. Feldman An Exact Nonparametric Method for Inferring Mosaic Structure in Sequence Triplets Genetics, June 1, 2007; 176(2): 1035 - 1047. [Abstract] [Full Text] [PDF] |
||||
![]() |
X. Didelot, M. Achtman, J. Parkhill, N. R. Thomson, and D. Falush A bimodal pattern of relatedness between the Salmonella Paratyphi A and Typhi genomes: Convergence or divergence by homologous recombination? Genome Res., January 1, 2007; 17(1): 61 - 68. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. D. Wilson, J. Cheung, D. W. Martindale, S. W. Scherer, and B. F. Koop Comparative analysis of the paired immunoglobulin-like receptor (PILR) locus in six mammalian genomes: duplication, conversion, and the birth of new genes Physiol Genomics, November 21, 2006; 27(3): 201 - 218. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Agerso, A. G. Pedersen, and F. M. Aarestrup Identification of Tn5397-like and Tn916-like transposons and diversity of the tetracycline resistance gene tet(M) in enterococci from humans, pigs and poultry J. Antimicrob. Chemother., May 1, 2006; 57(5): 832 - 839. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Maydt and T. Lengauer Recco: recombination analysis using cost optimization Bioinformatics, May 1, 2006; 22(9): 1064 - 1071. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. N. Lukashev Evidence for recombination in Crimean-Congo hemorrhagic fever virus J. Gen. Virol., August 1, 2005; 86(8): 2333 - 2338. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. N. Minin, K. S. Dorman, F. Fang, and M. A. Suchard Dual multiple change-point model leads to more accurate recombination detection Bioinformatics, July 1, 2005; 21(13): 3034 - 3042. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Husmeier, F. Wright, and I. Milne Detecting interspecific recombination with a pruned probabilistic divergence measure Bioinformatics, May 1, 2005; 21(9): 1797 - 1806. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. P. Rooney Mechanisms Underlying the Evolution and Maintenance of Functionally Heterogeneous 18S rRNA Genes in Apicomplexans Mol. Biol. Evol., September 1, 2004; 21(9): 1704 - 1711. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






. The parameters 





























