Skip Navigation


MBE Advance Access originally published online on March 21, 2008
Molecular Biology and Evolution 2008 25(6):1199-1208; doi:10.1093/molbev/msn066
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
25/6/1199    most recent
msn066v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Munshaw, S.
Right arrow Articles by Kepler, T. B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Munshaw, S.
Right arrow Articles by Kepler, T. B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

An Information-Theoretic Method for the Treatment of Plural Ancestry in Phylogenetics

Supriya Munshaw*,{dagger} and Thomas B. Kepler*,{ddagger}

* Department of Biostatistics and Bioinformatics, Center for Computational Immunology, Duke University Medical Center, Durham, NC
{dagger} Computational Biology and Bioinformatics PhD Program, Institute for Genome Sciences and Policy, Duke University
{ddagger} Department of Immunology, Department of Statistical Science, Duke University

E-mail: kepler{at}duke.edu


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Note Added during Proof...
 Acknowledgements
 References
 
In the presence of recombination and gene conversion, a given genomic segment may inherit information from 2 distinct immediate ancestors. The importance of this type of molecular inheritance has become increasingly clear over the years, and the potential for erroneous inference when it is not accounted for in the statistical model is well documented. Yet, the inclusion of plural ancestry (PA) in phylogenetic analysis is still not routine. This omission is due to the greater difficulty of phylogenetic inference on general acyclic graphs compared that on with trees and the accompanying computational burden. We have developed a technique for phylogenetic inference in the presence of PA based on the principle of minimum description length, which assigns a cost—the description length—to each network topology given the observed sequence data. The description length combines the cost of poor data fit and model complexity in terms of information. This device allows us to search through network topologies to minimize the total description length. By comparing the best models obtained with and without PA, one can determine whether or not recombination has played an active role in the evolution of the genes under investigation, identify those genes that appear to be mosaic, and infer the phylogenetic network that best represents the history of the alignment. We show that the method performs well on simulated data and demonstrate its application on HIV env gene sequence data from 8 human subjects. The software implementation of the method is available upon request.

Key Words: recombination • MDL • gene conversion • phylogenetic networks


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Note Added during Proof...
 Acknowledgements
 References
 
In the presence of recombination or gene conversion, a given genomic segment may inherit information from 2 distinct immediate ancestors. Such "plural ancestry (PA)" is ubiquitous in evolution but is often neglected in the phylogenetic inference of natural histories. In the absence of PA, the topology of any resulting phylogeny is tree like and consistent across the length of the gene. The representation of natural histories with PA requires the use of phylogenetic networks (Hudson and Kaplan 1985Go) or general acyclic graphs (Maynard Smith 1989Go). Statistical modeling on general acyclic graphs is substantially more difficult than on trees; inference of phylogenetic relationships in these cases is correspondingly more problematic. In spite of these difficulties, such modeling is critical. Schierup and Hein (2000)Go have shown that the neglect of PA can lead to markedly mistaken conclusions.

PA regulates the diversification of multigene families and plays a key role in the adaptive evolution of viruses within quasispecies and of alleles within populations; it is particularly important in the coevolution of higher organisms and their microbial symbionts. Gene conversion, for example, has recently been shown to have played a key role in the evolution of the interferon alpha gene family in mammals (Woelk et al. 2007Go). RNA viruses such as the human immunodeficiency virus (HIV) utilize virion-to-virion genetic exchange to evade host defenses and overcome susceptibility to multiple antiviral drugs (Robertson et al. 1995Go). An especially well-studied case of mosaicism, in which different segments of a single gene have distinct ancestry, is found in the adaptive immune system of jawed vertebrates, in which antigen receptor genes undergo somatic recombination of gene segments in lymphocytes creating a large repertoire of specificities against a wide range of antigens (Tonegawa 1983Go). Antigen receptor genes in jawless vertebrates, unrelated to genes of similar function in jawed vertebrates, have recently been shown to be generated by gene conversion (Pancer et al. 2004Go), as have host defense genes in an invertebrate, Biomphalaria glabrata (Zhang et al. 2004Go).

Several distinct classes of methods for the treatment of mosaic sequences have been developed. Each of these infers different types of information relevant to genetic mosaicism. One includes methods that identify recombination without phylogenetic inference. For instance, Hudson and Kaplan (1985)Go developed a method that provides a lower bound on the number of recombination events that occurred in the history of a collection of sequences that does not require inference of a phylogeny. Sawyer (1989)Go developed methods for the detection of genetic mosaicism by looking at the distribution of segments of synonymous polymorphisms, again, without the use or inference of phylogenies.

Methods that generate inferred phylogenies include that of Grassly and Holmes (1997)Go that compares the likelihood of an evolutionary model constrained to use a single phylogenetic tree for all sites in the gene with those of models in which the tree topology is allowed to vary along the length of the gene. Although this method does not produce a single phylogenetic network, the multiple inferred trees can in principle be superimposed to produce such a network.

Hein (1990)Go poses the problem within the context of dynamic programming and describes an algorithm to identify the phylogenies, varying across sites, that minimize the sum of mutation and recombination costs, allowing these costs to be arbitrarily fixed. Though this algorithm is computationally infeasible, Hein (1993)Go also published a much simpler, fast algorithm based on a reasonable heuristic regarding the topologies that are likely to be important. These methods represent generalizations of the method of parsimony and are thus subject to the same criticisms (Felsenstein 1988Go).

Strimmer and Moulton (2000)Go addressed these perceived shortcomings by applying maximum likelihood on directed acyclic graphs as an extension to Felsenstein's method for trees. Felsenstein's method is based on the conditional probability P(y|x,t) of observing nucleotide y given parent x and intervening time t. Strimmer and Moulton adjust this definition to allow for PA by introducing priors p1, p2 for the probabilities that there are 2 "parents," 1 or 2, respectively. The graph is then constructed using the conditionals P(y|x1,x2,t) {equiv} p1P(y|x1,t) + p2P(y|x2,t).

In a more recent method, Jin et al. (2006)Go applied maximum likelihood methods to reticulate networks, which are obtained from trees by the addition of additional edges between edges in the original tree. Their extension is based on the decomposition of the phylogenetic network into overlapping subtrees and uses a maximum likelihood criterion with a branch-and-bound heuristic to reconstruct the phylogenetic history of putative recombinants.

The use of maximum likelihood methods in this context represents a significant step forward but faces the challenge posed by the need to choose from among several models of differing sizes. As pointed out by the authors, adding a reticulation edge always increases the likelihood. Hypothesis testing can be applied, but the hierarchy of models thus formed is not necessarily transitive. For 3 models A, B, and C, it is possible that A is rejected in favor of B, B is rejected in favor of C, yet A is not rejected in favor of C. Indeed, the authors do not actually test for the presence of recombination but use a visual heuristic to choose the size of the final model.

We have developed an information-theoretic method for phylogenetic inference in the presence of PA. The method is an outgrowth of an approach we developed for the analysis of mosaicism in host defense genes (Zhang et al. 2004Go) and is based on model selection by minimization of the "description length," as we now explain.

The minimum description length (MDL) principle casts the model selection problem as that of finding the most efficient encoding of the data (Grünwald 2007Go). This approach arose from the theoretical work of Kolmogorov (1965)Go, Solomonoff (1964)Go, and Chaitin (1966)Go on data complexity, a concept developed to provide a definition of randomness applicable to individual datastreams rather than exclusively to ensembles. The idea is to define the complexity of a datastream D as the length, in bits, of the shortest computer program that produces D as output. A random datastream is one that cannot be compressed at all—the shortest program essentially is PRINT D. The underlying intuition is that regularities allow data compression; randomness is the complete absence of regularity. Rissanen (1987)Go extended these abstract ideas to the practical problem of statistical data analysis by placing restrictions on the kinds of computers and programs that can be considered.

The analogy Rissanen uses is that of communication: The datastream is a message to be sent after appropriate encoding. An efficient encoding takes advantage of the data's regularities as Morse code encodes the frequently used letter "e" as "dot" and the rarer letter "q" as "dash dash dot dash." This strategy requires the transmission of the message in 2 parts. The first part contains the code key and the second contains the message itself, encoded using the key. In more familiar terms, the code key is analogous to the statistical model and the associated parameter estimates, and the "message itself" is analogous to the data residuals. The complexity of the model and the lack of fit are thus both evaluated using information as the common currency.

Viewed from this perspective, a tree is primarily a very efficient means of encoding data, such as DNA sequences, that contain patterns of shared features. Data compression is achieved by encoding common characteristics at the root and adding more and more specific characteristics (shared by fewer individuals) as accumulating differences as the leaves are approached. It is worth pointing out in this regard a passage from Darwin (1859)Go in which he describes classification by the "Natural System" as

an artificial means for enunciating, as briefly as possible, general propositions, that is, by one sentence to give the characters common, for instance, to all mammals, by another those common to all carnivora, by another those common to the dog-genus, and then by adding a single sentence, a full description is given of each kind of dog. The ingenuity and utility of this system are indisputable.

It is precisely the "data compression" properties of trees that he is referring to and it is their extraordinary utility in this context that drove him to posit the "propinquity of descent" as the cause.

There is an intimate relationship between code length and probability captured by the Kraft inequalities (see, e.g., Grünwald 2007Go), which provides a 1-1 map between coding schemes and likelihood functions. In particular, the asymptotic expansion in N, the size of the data set, of the MDL has minus the log of the maximum likelihood as the leading coefficient (order N).

The method we have developed is realized in a stochastic minimization of the description length over the space of phylogenetic networks; each elementary transaction involves random modification of the network topology. We have implemented the method in software, validated its performance on simulated data sets, and demonstrated its application on a collection of genomic sequences of the HIV-1 envelope protein isolated longitudinally from each of 8 subjects infected with HIV (Shankarappa et al. 1999Go).


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Note Added during Proof...
 Acknowledgements
 References
 
Inference Procedure
The number of phylogenetic tree networks grows superexponentially with the size of the data (Felsenstein 1978Go), and the number of phylogenetic networks grows more rapidly still, making exhaustive consideration of all topologies infeasible.

We have therefore designed our procedure as a stochastic optimization on the space of phylogenetic networks. The procedure uses simulated annealing (Kirkpatrick et al. 1983Go) to perform the minimization of the MDL of the data plus the network as the objective function.

The process is carried out in 2 stages. During stage 1, the tree-like topology is preserved by restricting the elementary operations that are allowed. This stage is intended to find a suitable starting point for the second stage, in which an enlarged set of elementary operations is applied, including some that do not preserve the tree topology. After each operation, the description length is evaluated and used to determine whether the result of the operation will be preserved or discarded. The procedure is explained below in greater detail.

The Scoring Function—MDL
We have cast the determination of the phylogenetic history of a set of sequences explicitly as a problem of model selection in which we seek the model that minimizes the total information required to encode the data or in other words, the description length of the data given a model.

A phylogenetic tree can be viewed as a hierarchical data structure that enables an efficient encoding of a set of similar DNA sequences through elimination of redundancies. Consider the encoding of 2 similar genes, each of length L. One could use a naive model and encode the 2 genes independently, at an information cost of 4L bits (1 nt requires 2 bits to encode). Or one could encode the consensus sequence (costing 2L bits) and then encode the changes required to recover each of the 2 genes from the consensus. Roughly speaking, each mutation requires log23 bits to specify the class of mutation and log2L bits to specify the nucleotide position at which it occurs. As long as the number of mutations, m, required for this encoding is small enough to satisfy m(log23 + log2L) < 2L, the tree model is more efficient than the naive model.

For larger gene sets, the process is similar, though the models are more complex. Each gene must specify its parent and those changes to the parent that produce the gene. The set of pointers from genes to their parents is equivalent to the topology of the tree, and the encoding of the changes from parent to "child" is accomplished through the use of a model for mutations.

We extend this basic coding scheme by allowing any node in the network to have 2 ancestors and thus be treated as a mosaic sequence. In this case, we need to specify both ancestors, the break point where the mosaic switches from one to the other, and the mutations between this parental hybrid and the gene of interest. The point is to account accurately for the cost of allowing PA. For each sequence alignment we consider, we will use and compare 2 different models. One, designated M1, prohibits PA, and the other, M2, allows it.

Although the method can be used with any mutation model, we use a relatively simple model in what follows here. This model allows for different mutation rates for each of the 4 classes of pairwise relationships between nucleotides: identity, transition, transversion 1 (G,A {leftrightarrow} C,T), and transversion 2 (G,A {leftrightarrow} T,C). We further assume that these rates are uniform across positions in the gene and over all branches. Finally, we assume equal branch lengths throughout the tree.

Under these conditions, the mutations are distributed according to a multinomial model. The probability density function is defined as

Formula (1)
where x is the unknown parameter vector of length 4 representing the relationships between nucleotides mentioned above (the sum of the components of this vector equals the total number of nucleotides in the network) and p contains the proportions of each relationship. Taking a uniform prior on the parameter vector, we integrate it out and obtain the total information required to encode the mutations given this model

Formula (2)
where N is the total number of nodes in a given network (including the root) and L is the common length of the sequences. The arguments of the logarithms are the binomial and multinomial coefficients, respectively.

Equation (2) has a useful direct interpretation in terms of coding: The first term is the information required to specify the counts of the components of vector x given the total number, (N 1)L, of nucleotides in a network (the root is excluded because there are no mutations in the root). The second term accounts for the number of ways these mutations can be assigned to the NL nucleotides of the data set given the counts in vector x. We ignore a term 2L that represents the amount of information required to encode the root. Because all networks in question will have a root with the same length, this term will not make any difference in the final description length.

In addition to the information required to encode the mutations, we need to encode the ancestry as well. Under a singular ancestry model, each node must specify a parent. If NI is the number of internal nodes and NO is the number of observed or leaf nodes, the ancestor information is given by equation 3

Formula (3)
where the first term represents the amount of information required to specify the number of internal nodes and hence all possible parents. The second term represents the amount of information required to specify a parent for an observed or leaf node and because an internal node cannot be its own parent, the third term represents the information to encode an ancestor for an internal node.

Under the PA model, we must first specify the number of recombinants in a given network and the number of ways of distributing these recombinants. Each recombinant will then have a secondary parent from one of the NI internal nodes (a recombinant cannot have the same parent twice, so we have NI – 1 choices) and the location of the crossover point, where the gene switches from similarity to parent 1 to similarity to parent 2. Because this switch cannot happen at the last nucleotide, specifying the recombination point takes log2 (L – 1) bits. If there are NR nodes with dual ancestry, the corresponding information cost is

Formula (4)
Equation 4 might be a slight overestimation (of order 1) in that internal nodes are allowed to be recombinants but they cannot be their own parents. Here we do not distinguish between observed node recombinants and internal node recombinants.

The total information cost or the description length of the encoding under model M2 is simply the sum of the 3 terms, IT = Iµ + IA + IR. The third term in the equation IR is omitted when encoding under M1. The optimal model is the 1 that minimizes the total information. The trade-off of recombination is achieved by reducing the number of mutations and thus decreasing Iµ by more than IR.

Simulated Annealing
As mentioned above, we start with a random initial topology. Figure 1A demonstrates an example of a topology with 9 species. The elementary operations we use allow any finite topology to be converted to any other finite topology in a finite number of steps. We then randomly assign our data to this topology and calculate the resulting description length. Note that each node in our network represents a DNA sequence. The leaf nodes represent the input sequences, whereas the sequences at the interior nodes, that is, nodes with children, are determined by the demand for minimizing the description length and hence contain the appropriate consensuses. We then allow the repeated application of the following elementary operations and their inverses: change parent, split a node, combine 2 nodes, or add a second parent. There are 2 main steps in the program, each of which is described in detail below.


Figure 1
View larger version (6K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— (A) An example of a starting tree with 9 species. (B) Branch swapping is applied, and species 6 has moved. (C) An example of node absorption where the parent of species 1 and 2 is combining with the root. (D) An example of node splitting or fission where the parent of species 6, 7, 8, and 9 has been split. In all, 8 and 9 have been assigned to the new node. (E) An example of a phylogenetic network where species 6 is represented as a recombinant between the parent of species 3, 4, 5 and the parent of species 7, 8, and 9.

 
Stage 1
  • The operation to be applied is chosen from among the 3 possibilities with uniform probability: branch swapping, node absorption, and node fission.
  • Branch swapping. We choose 2 nodes at random: an interior node (which will become the new parent) and a child node. We break the link between the chosen child and its current parent and link the child to the new parent. An example of branch swapping is shown in figure 1B, where the branch between species 6 and its parent in 1A has been swapped giving species 6 a new parent.
  • Node absorption. We randomly choose an interior node and merge it with its parent. In other words, the chosen node is deleted from the tree, and all its children are made direct progeny of its parent. An example of node absorption is shown in figure 1C, where the parent of species 1 and 2 in 1A has been merged with the root.
  • Node fission. We randomly choose an interior node and create a new node that is assigned as a child of the chosen node. The children of the chosen node are then distributed between the new child and itself minimizing the local contribution to the description length using a simple 2-means algorithm (MacQueen 1967Go). An example of node fission is shown in figure 1D, where the parent of species 6, 7, 8, and 9 is split such that 8 and 9 have a separate parent.

The description length of the resulting tree is calculated at the end of each iteration within stage 1. The move is accepted if the description length of the resulting network is less than the existing one. If it is not, we use simulated annealing to choose whether to accept or reject the move. This is necessary in order to avoid local minima. Step 1 runs till the completion of simulated annealing.

Stage 2
Stage 2 differs from stage 1 by replacing branch swapping with "adoption." Adoption proceeds by choosing a node at random from all nonroot nodes, designating that the child and choosing new parent node from among all interior nodes.

  • If the chosen child node has 1 parent, add a branch to the new parent while maintaining the branch to the old parent. Determine the recombination point between the 2 parents that minimizes the local description length by minimizing the number of mutations between the parent hybrid and the sequence at the child node. An example of a move of this kind is shown in figure 1E, where 6 now has 2 parents.
  • If the chosen child already has 2 parents, we consider 5 operations and choose the 1 that produces the smallest description length.

  1. Represent the child as a recombinant between the current primary parent and the new parent.
  2. Represent the child as a recombinant between the current secondary parent and the new parent.
  3. Break the link with both existing parents and represent the child with having only the new parent as a parent.
  4. Break the link with the secondary parent and represent the child by the current primary parent only. This move is a check and is done to allow reversibility.
  5. Break the link with the primary parent and represent the child by the current secondary parent only.

As in step 1, the description length of the tree resulting from the move is calculated and stage 2 is also iterated till the completion of simulated annealing. At the end of both steps, we have a network that minimizes the description length of the data.

Simulated Data Sets
We generated simulated data sets using a Markov Process model to grow a phylogenetic network with point mutations, recombinations, and duplications and obtain a set of sequences related as if through this network. At any given time in the process, each node is considered independently and can acquire a single mutation, split into 2 sibling nodes, become recombination receptive, or do nothing. The rates of these processes are given by kµ, ks, and kr, respectively. Mutations are random and incorporated at random locations along the gene. When a node becomes recombination receptive, a partner node is chosen from among all extant internal nodes, and a recombination crossover point is chosen at random along the common lengths of the parents. A new node with the resulting mosaic sequence is added as a shared child to the 2 parent node.

HIV-I env Genes
Shankarappa et al. (1997) studied the divergence and diversity of the C2-V5 region of the HIV-1 env gene in 9 homosexual men with moderate or slow disease progression. They obtained gene sequences at an average of 12 time points per person, covering 6–12 years of infection. These data sets were used because they provided the opportunity to study the patterns of HIV-1 evolution within single individuals and 8 of these data sets were available on National Center for Biotechnology Information (GenBank accession numbers AF137629 [GenBank] –AF137765 [GenBank] [patient 1], AF137767 [GenBank] –AF137897 [GenBank] [patient 2], AF137898 [GenBank] –AF138003 [GenBank] [patient 3], AF138004 [GenBank] –AF138163 [GenBank] [patient 5], AF138166 [GenBank] –AF138263 [GenBank] [patient 6], AF138305 [GenBank] –AF138411 [GenBank] [patient 7], AF138531 [GenBank] –AF138643 [GenBank] [patient 9], and AF138652 [GenBank] –AF138703 [GenBank] [patient 11]). We aligned the sequences using the ClustalW application in BioEdit. Because our program is not yet meant to handle extensive insertions and deletions within alignments, we discarded the ends of alignments where such indels were present. Sequences from 1 patient had indels throughout the length of the gene and were therefore excluded from the study. The sequence lengths given in the results represent only the part of the alignment used for the analysis.


    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Note Added during Proof...
 Acknowledgements
 References
 
Validation on Simulated Data
For each of the combinations of ks/kr and kµ/kr displayed in table 1, we generated 12 replicate data sets and analyzed them using the methods described above (table 2).


View this table:
[in this window]
[in a new window]

 
Table 1 Parameter Values for Rates of Mutation (kµ), Duplication (ks), and Recombination (kr) for Simulations

 

View this table:
[in this window]
[in a new window]

 
Table 2 Analysis of the Simulated Data Sets with Parameter Values, Description Lengths (IT) of the Networks with (IT2) and without (IT1) PA

 
Gene-wise type I and type II errors, also shown in table 2, were calculated based on the number of false positives (FPs) and false negatives (FNs) in each data set. A FP was recorded in cases where IT1(the description length for the best network without PA) was larger than IT2 (allowing PA) and there was at least 1 gene identified as mosaic. Alignment-wise type II error was calculated when a data set with mosaic genes was identified as having no recombination. Each row of the table is a summary of the replicates, each with the stated rates of recombination and duplication.

In order to determine whether the procedure not only detects the presence of recombination but also accurately infers their evolutionary histories, we compared the simulated phylogenetic networks for each data set with networks generated by our program. Figure 2 provides 1 such visual comparison of the simulated network and the network constructed using our procedure, showing that the 2 are very similar. Nine out of 12 mosaic sequences were identified as such with the appropriate ancestors and crossover points. The mosaic sequence shown in the red dashed line could not be identified because the crossover point was at position 297 of 300 and there was too little information transferred from the secondary parent for reliable identification. The mosaic sequence shown in the blue dashed line was not identified because 1 of its ancestors lost all its nonrecombinant descendants from the simulation. Hence, its mosaic descendants could not be identified as such.


Figure 2
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— (A) Network generated by simulation. (B) Network generated by our method.

 
Note that the inferred network is not expected to be identical to the true network because the method generates multifurcating networks when there is insufficient information to resolve the pattern of bifurcations within any multifurcating node, a condition that is the norm for data sets of this size and complexity. For example, given 3 sequences AGA, AAC, and TAC, any combination of these sequences would make an equally feasible bifurcating tree. In this case, our program simply chooses a mutifurcating tree because there is no one best resolution.

To estimate the alignment-wise type I error rate (for a whole alignment), we generated 15 simulated data sets with no recombination for each of the following mutation to duplication rate ratios: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1,024. There was an average of 46.9 genes per data set. Alignment-wise FPs as defined above were counted in 11 of the 165 data sets, 9 of which had identified a single gene as mosaic, whereas the other 2 had 2 and 3 mosaic genes, respectively. These mosaic genes were generated in data sets with mutation to duplication rate ratio of 4–64. The higher mutation rates led to extremely diverse data sets resulting in poor resolution. The overall alignment-wise type I error rate is thus 6.67%. The average gene-wise error rate within each of the family of data sets where alignment-wise FPs were found is less than 0.4%. Note that where IT2 is smaller than IT1 and there are no putatively mosaic genes, we can say with certainty that the annealing in stage 1 was incomplete. This shortcoming can be reduced by improving the cooling schedule.

HIV—env Gene
We ran each data set 5 times with different random initial conditions and accepted as mosaic genes those that were identified as mosaic at least 3 out of 5 times. Table 3 shows the total number of sequences and mosaic genes found in each of the data sets as well as the average description lengths for each of the data sets. Comparison of the description lengths at stages 1 and 2 (with and without PA) suggests the presence of recombination in all cases but patient 11. Patient 11 had no putative recombinants at all. Each of the other patients had multiple mosaic genes and substantially smaller description length in stage 2. The mutations per split for the data sets ranged from 3.9 to 14.3. This range was covered in our simulations; the results of the simulation can be taken as relevant to the analysis of the HIV data sets. Figure 3 shows the phylogenetic network of the sequences extracted from patient 2. This particular set of sequences was shown to have approximately 5.6 mutations per split and 9 putative mosaic genes. According to our simulations, the type II error for 5.6 mutations per split could be between 0.18 and 0.36 (between 4 and 8 mutations per split). Hence, we can be confident that we have identified approximately at least 64% of the recombinants. The type I error rate in this case is between 0.0 and 0.005, which is fairly low. Figure 4 shows an example of a mosaic sequence and its inferred parents. This sequence (AF137871 [GenBank] ) was isolated 126 months after seroconversion from patient 2. Table 4 shows that putative recombinants in patient 2 were amplified starting 30 months after seroconversion.


View this table:
[in this window]
[in a new window]

 
Table 3 Phylogenetic Analysis of the C2-V5 Region of the Env Gene in HIV-1 in 8 Patients with Description Lengths (IT) of the Networks with (IT2) and without (IT1) PA

 

Figure 3
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Network generated by applying our method to the C2-V5 region of the env gene of HIV-1 from patient 2 (Shankarappa et al. 1999). Nine putative recombinants were found.

 

Figure 4
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Mosaic sequence AF137871 [GenBank] from patient 2 shown with its parents 1010 and 1346. The vertical line at position 202 shows the putative point of recombination such that the first half of the gene is believed to have descended from 1010 and the second half from 1346.

 

View this table:
[in this window]
[in a new window]

 
Table 4 Details of Putative Recombinants Identified in Patient 2

 

    Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Note Added during Proof...
 Acknowledgements
 References
 
We have introduced an approach to phylogenetic inference that allows PA and validated its performance on simulated data sets. We made the following observations: 1) The gene-wise type I error rate, which is defined as the number of genes falsely identified as mosaic divided by the total number of nonmosaic genes, is fairly low at all mutation rates. This number remains less than or equal to 10% for all parameter values used in testing. 2) The gene-wise type II error rate, defined as the number of genes falsely identified as nonmosaic divided by the total number of mosaic genes, is reasonable for a range of intermediate relative mutation rates (4–64 mutations per split) but becomes intolerably large for more extreme values at very low and very high rates. Failure of this type is inevitable: mutations both provide and destroy the information necessary to identify recombination. Recombination between identical or nearly genes cannot be detected by analysis of the resulting sequence, yet very high rates corrupt the signal. 3) Similarly, the alignment-wise type II error, the proportion of data sets judged to have no recombination (IT1 > IT2) when they in fact do, is high when the mutations per branch are extreme. 4) Finally, the overall alignment-wise type I error rate, the proportion of data sets that are falsely identified to have recombination is 0.007.

In addition to testing our method on simulations, we demonstrated its application using the C2-V5 region of the env gene of HIV-1 from 8 individuals (Shankarappa et al. 1999Go). It is widely known that diversity in HIV-1 is generated through point mutations as well as recombinations (Robertson et al. 1995Go). Because recombination is active in the HIV genome, the neglect of PA in the phylogenetic analysis of HIV-1 sequences may lead to incorrect results (Schierup and Hein 2000Go). Our analysis detected multiple mosaic genes in all but 2 of the patients.

Further evidence that these genes really are mosaic comes from an analysis of the times at which they were sampled. Shankarappa et al. (1999)Go sequenced virus samples at each of several times from each individual—typically from 3 months postseroconversion to 7 or more years (the latest sampling time was more than 11 years).

We expect to find genes arising from viral recombination to appear later in the infection because detectable recombination requires superinfection of target cells with genetically distinguishable viruses, the chances of which clearly increase over time. We therefore analyzed the sampling times for the sets of genes identified as mosaic by determining, for each patient, the earliest time giving rise to a sample containing a mosaic gene. For the ith patient, denote that sampling time ti. Take the null hypothesis to be that the genes identified as mosaic are chosen at random from among all genes sampled. Under this null, the probability pi that the earliest sampling time in the ith patient sample is ti or later is

Formula (5)
where Ni is the total number of genes sampled, ki is the size of the subsample of putatively mosaic genes, and ni(ti) is the number of genes in the complete sample with sampling time ti or later. The probabilities computed for the data sets are shown in table 3.

Mosaic genes are first detected in patients at an average of 37 months after seroconversion. On analyzing the sampling times, we found that the mosaic sequences for patients 2, 3, 7, and 9 appeared significantly late. We also found that patient 11, the slowest progressor in the study, showed no presence of recombination.

Our method depends entirely on the use of a single well-behaved cost function on phylogenetic networks given the observed sequence data so that the power of numerical minimization can be utilized. Methods based on pairwise hypothesis testing frequentist methods such as the likelihood ratio test do not provide a cost function of this kind. Our choice of cost function, the MDL, is intended to balance the cost of lack of model fit to the data against the cost of model complexity in a principled and natural way, by supplying a common currency—information—for both. Consistency is essential in this context because the complexity of the underlying network changes with the number of putatively mosaic genes. Other methods, including those of Hein (1990)Go, allow the cost of recombinations relative to mutations to be fixed arbitrarily but do not determine the appropriate value for this ratio. MDL naturally sets this ratio adaptively: The total cost of mutations is not simply proportional to the number of mutations; the information cost of a single mutation decreases as the total number of mutations increases. Similarly, the cost of a recombination event decreases as the total number of such events increases, and the cost of either one depends on the total number of nucleotides in the data set.

We ran several of our simulated data sets using Hein's Web-based recombination tool Recpars (http://www.daimi.au.dk/~compbio/recpars/recpars.html). Under its default values, we ran 10 of our simulated data sets and observed a gene-wise type I error of 2% and type II error of 43%. The same data sets with our program had a 0.2% gene-wise type I error and 28% type II error. Out of 75 negative controls (data sets with no recombinants), we found that Recpars erroneously identified 32 data sets (42%) as having recombination events, compared with the 9% observed using our method. Moreover, the Web version of Recpars failed to give us any results for data sets with mutation to duplication ratios of 32 and higher.

In addition, MDL methods, like likelihood-based methods, provide flexibility, allowing the use of more complex evolutionary models. We have here used a simple model for evolution, because doing so gives us access to a simple closed form for the description length for any network and any data and thus greatly facilitates numerical minimization, but this simplification does not come without costs of its own. For example, Crandall et al. (1999)Go examined sequence data from 8 patients and found several positions at which parallel mutations related to drug resistance became prevalent. Had they been isolated from a single individual, such homoplasies would erroneously be taken as contributing evidence to the hypothesis of PA. A model that accounts for mutation rate heterogeneity among positions and/or selection would be required to distinguish these cases. Similarly, a model that allows for varying branch lengths will also be useful. Even though our simplified model assumes constant time between lineages, it should be noted that our simulated data are generated with exponentially distributed branch lengths. MDL can, in principle, be applied using any model for mutation and any network topology, but models allowing differing branch lengths and heterogeneous mutation rates across alignment positions will require substantially greater computation; good approximation schemes will certainly prove helpful in this regard.

Furthermore, the stochastic search methods we use are not particularly sophisticated. Substantially improved performance could likely be achieved with greater attention paid to the numerical optimization process. The computation time depends almost entirely on the stochastic search method. Using our current search parameters, we calculated the computation time for 8 data sets ranging from 20 to 100 sequences of length 300 with an average of 8 mutations/split (the mean mutations per branch for the HIV data set). The largest data set of 100 sequences ran for 115 min (real user time on a 64-bit machine, 2.19 GHz processor, 4 GB RAM), whereas the smallest data set ran for 8 min. The average time was 48 min. A better optimization scheme could lower the computation time significantly.

In spite of the oversimplifications and relatively large computational effort required, we have shown that the method as described works well, with tolerable error rates on simulated data and biologically plausible results on clinical data from HIV-infected patients.

The executables and C++ source code for the software implementation are available upon request.


    Note Added during Proof Stage
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Note Added during Proof...
 Acknowledgements
 References
 
We have recently discovered that Ren et. al (1995)Go developed a method for phylogenetic inference based on the minimum description length principle over ten years ago. Their approach is very similar to the one we have described here, though ours extends the core ideas to address the inference of more general phylogenetic networks.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Note Added during Proof...
 Acknowledgements
 References
 
We are grateful for helpful discussions with Marcy Uyenoyama and for the financial support of the Bill and Melinda Gates Foundation through grant number 38643 to Dr Barton Haynes. We would also like to thank the 2 anonymous reviewers for their insightful comments.


    Footnotes
 
Naruya Saitou, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Note Added during Proof...
 Acknowledgements
 References
 

    Chaitin G. On the length of programs for computing finite binary sequences. J ACM (1966) 13:547–569.[CrossRef]

    Crandall KA, Kelsey CR, Imamichi H, Lane HC, Salzman NP. Parallel evolution of drug resistance in HIV: failure of nonsynonymous/synonymous substitution rate ratio to detect selection. Mol Biol Evol (1999) 16(3):372–382.[Abstract]

    Darwin CR. On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. In: Issue 1 (1859) 1st ed. London: John Murray. 413.

    Felsenstein J. The number of evolutionary trees. Syst Zool (1978) 27(1):27–33.[Abstract/Free Full Text]

    Felsenstein J. Phylogenies from molecular sequence: inference and reliability. Annu Rev Genet (1988) 22:521–565.[CrossRef][Web of Science][Medline]

    Grassly NC, Holmes EC. A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol Biol Evol (1997) 14:239–247.[Abstract]

    Grünwald PD. The minimum description length principle (2007) Cambridge (MA): MIT Press.

    Hein J. Reconstructing evolution of sequences subject to recombination using parsimony. Math Biosci (1990) 98(2):185–200.[CrossRef][Web of Science][Medline]

    Hein J. A heuristic method to reconstruct the history of sequences subject to recombination. J Mol Evol (1993) 36(4):396–405.[Web of Science]

    Hudson RR, Kaplan NL. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics (1985) 111(1):147–164.[Abstract/Free Full Text]

    Jin G, Nakhleh L, Snir S, Tuller T. Maximum likelihood of phylogenetic networks. Bioinformatics (2006) 22(21):2604–2611.[Abstract/Free Full Text]

    Kirkpatrick S, Gelatt CD, Vecchi MP. Optimization by simulated annealing. Science (1983) 220:671–680.[Abstract/Free Full Text]

    Kolmogorov A. Three approaches to the quantitative definition of information. Probl Inf Transm (1965) 1:1–7.

    MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability—LeCam LM, Newman J, eds. (1967) Berkeley (CA): University of California Press. 281–297.

    Maynard Smith J. Trees, bundles or nets? Trends Ecol Evol (1989) 4:302–304.[CrossRef]

    Pancer Z, Amemiya CT, Ehrhardt GR, Ceitlin J, Gartland GL, Cooper MD. Somatic diversification of variable lymphocyte receptors in the agnathan sea lamprey. Nature (2004) 430:174–180.[CrossRef][Medline]

    Ren F, Tanaka H, Gojobori T. Construction of molecular evolutionary phylogenetic trees from DNA sequences based on minimum complexity principle. Comput. Methods Programs Biomed (1995) 46:121–130.[CrossRef]

    Rissanen J. Stochastic complexity. J R Stat Soc Ser B (1987) 49:223–239.

    Robertson DL, Sharp PM, McCutchan FE, Hahn BH. Recombination in HIV-1. Nature (1995) 374:124–126.[Medline]

    Sawyer S. Statistical tests for detecting gene conversion. Mol Biol Evol (1989) 6:528–538.

    Schierup MH, Hein J. Consequences of recombination on traditional phylogenetic analysis. Genetics (2000) 156:879–891.[Abstract/Free Full Text]

    Shankarappa R, Margolick JB, Gange SJ, et al, (11 co-authors). Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type I infection. J Virol (1999) 73(12):10489–10502.[Abstract/Free Full Text]

    Solomonoff R. A formal theory of inductive inference. Inf Control (1964) 7:1–22.[CrossRef]

    Strimmer K, Moulton V. Likelihood analysis of phylogenetic networks using directed graphical models. Mol Biol Evol (2000) 17(6):875–881.[Abstract/Free Full Text]

    Tonegawa S. Somatic generation of antibody diversity. Nature (1983) 302:575–581.[CrossRef][Medline]

    Woelk CH, Frost SDW, Richman DD, Higley PE, Kosakovsky Pond SL. Evolution of the interferon alpha gene family in eutherian mammals. Gene (2007) 397:38–50.[CrossRef][Web of Science][Medline]

    Zhang SM, Adema CM, Kepler TB, Loker ES. Diversification of Ig superfamily genes in an invertebrate. Science (2004) 305:251–254.[Abstract/Free Full Text]

Accepted for publication March 16, 2008.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
25/6/1199    most recent
msn066v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Munshaw, S.
Right arrow Articles by Kepler, T. B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Munshaw, S.
Right arrow Articles by Kepler, T. B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?