Skip Navigation


MBE Advance Access originally published online on March 20, 2007
Molecular Biology and Evolution 2007 24(6):1312-1319; doi:10.1093/molbev/msm052
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Material
Right arrow An erratum has been published
Right arrow All Versions of this Article:
24/6/1312    most recent
msm052v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Linz, S.
Right arrow Articles by von Haeseler, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Linz, S.
Right arrow Articles by von Haeseler, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

A Likelihood Framework to Measure Horizontal Gene Transfer

Simone Linz*, Achim Radtke* and Arndt von Haeseler{dagger},{ddagger},§,||

* Department of Bioinformatics, Heinrich-Heine-University, Düsseldorf, Germany
{dagger} Center for Integrative Bioinformatics Vienna, Max F. Preutz Laboratories, Vienna, Austria
{ddagger} University of Vienna, Vienna, Austria
§ Medical University of Vienna, Vienna, Austria
|| University of Veterinary Medicine Vienna, Vienna, Austria

E-mail: linz{at}cs.uni-duesseldorf.de.


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
We suggest a likelihood-based approach to estimate an overall rate of horizontal gene transfer (HGT) in a simplified setting. To this end, we assume that the number of occurring HGT events within a given time interval follows a Poisson process. To obtain estimates for the rate of HGT, we simulate the distribution of tree topologies for different numbers of HGT events on a clocklike species tree. Using these simulated distributions, we estimate an HGT rate for a collection of gene trees representing a set of taxa. As an illustrative example, we use the "Clusters of Orthologous Groups of proteins" (COGs). We also perform a correction of the estimated rate taking into account the inaccuracies due to gene tree reconstructions. The results suggest a corrected HGT rate of about 0.36 per gene and unit time, in other words 11 HGT events have occurred on average among the 44 taxa of the COG species tree. A software package to estimate an HGT rate is available online (http://www.cibiv.at/software/hgt/).

Key Words: horizontal gene transfer • gene tree • species tree • nontree-like evolution


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
It is well known that gene trees reconstructed for different genetic loci for the same set of taxa do not necessarily agree. Their branching pattern may be different from each other and different from the species tree (Pamilo and Nei 1988Go). These discrepancies are not always due to the uncertainty of the phylogenetic inference method, but rather due to biological processes like hybridization, gene duplication and deletion, or horizontal gene transfer (HGT) (Syvanen 1994Go). In the following, we will focus on the latter of these processes.

The effect of one HGT event is visualized in figure 1 which shows a species tree (fig. 1A) of the 5 taxa A, B, C, D, and E. This tree indicates a close relation between A and the cluster of B and C. In many cases, the species tree also explains the phylogeny of single genes, but sometimes a gene has a different evolutionary history than the species tree (Pamilo and Nei 1988Go). For such a gene, the gene tree is displayed in figure 1B. One possible explanation for this kind of difference is HGT, in this case from A to D. During such a process, a piece of DNA (e.g., a gene) is transferred from one organism to another, which is not its offspring. The genetic material is stably incorporated in the acceptor genome, in contrast to the vertical inheritance of genes by descent from one's parents (Bushman 2002Go). In the depicted case, the arrow shows gene transfer from species A to species D. As a consequence, the gene tree for this gene shows a close relationship between A (donor) and D (acceptor).


Figure 1
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Comparison of the species tree (A) and one corresponding gene tree (B) after a single HGT event. The arrow indicates a gene transfer from species A (donor) to species D (acceptor). To check whether the gene tree is a subtree of the species tree, we compute the tree topology {tau}(S|X) (C) derived from the species tree.

 
HGT is known as an important mechanism to shape the genomes of bacteria (Ochman et al. 2000Go; Boucher et al. 2003Go), but recently there is also an accumulation of data indicating that this process also occurs in the evolution of eukaryotes (de la Cruz and Davies 2000Go; Andersson 2005Go) and archaea (Nelson et al. 1999Go; Diruggiero et al. 2000Go).

Several approaches have been published that discover single HGT events (Lerat et al. 2003Go, 2005Go), whereas another kind of approach estimates the amount of genes that are acquired through HGT for a given genome. The latter type of analysis is reviewed in Ochman et al. (2000)Go for 19 completely sequenced genomes. In these species, the amount of adopted genes varies between virtually none in organisms with small genome size, for example, Rickettsia prowazekii, Borrelia burgdorferi, and Mycoplasma genitalium, to nearly 17% in Synechocystis PCC6803 (Ochman et al. 2000Go). Another way of detecting horizontally transferred genes uses bacterial genome sequences to examine the nucleotide composition (GC content) and usage of different codons (Lawrence and Ochman 1997Go).

In contrast to these approaches, we estimated an overall rate of HGT for a given set of species based on simulating a likelihood curve for the reconstructed species tree. We constructed a clocklike species tree reflecting the actual evolutionary pathways of the investigated organisms (Pamilo and Nei 1988Go) and simulated different numbers of HGT events, implemented as series of subtree prune and regraft processes on the species tree (Hillis et al. 1996Go). Simulations with different numbers of HGT lead to a distribution of tree topologies that are comparable with the gene tree distribution to estimate an HGT rate supported by a likelihood framework.

As the number of tree topologies increases exponentially with the number of leaves, the probability to get a specific topology is very low. To overcome this, we worked with quartet subtrees instead of the complete gene tree topologies (see the Materials and Methods section).

To apply the method to real data, we used the "Cluster of Orthologous Groups of proteins" (COGs) (Tatusov et al. 2001Go).


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Notation
In the following, we introduce some mathematical notation needed for the analysis. For more detailed definitions, we refer the reader to Semple and Steel (2003)Go.

A phylogenetic tree Formula is described by a set of vertices Formula (also called nodes) and a set of edges Formula The degree of a vertex v, denoted by deg(v), is the number of edges incident with v. Any node with deg(v) = 1 is called a leaf and all other vertices are called internal. With Formula (T) we denote the set of all leaves. An unrooted phylogenetic tree T is called binary if each node v isin Formula is either a leaf or has deg(v) = 3. If the binary phylogenetic tree is rooted, this definition must be extended to the most recent common ancestor of all leaves, which is called the root r. This vertex r is the only one in a rooted binary phylogenetic tree with deg(r) = 2.

Let Formula be a set of n taxa and let X be an arbitrary subset of S. To describe all binary tree topologies with leaves, which are labeled with taxa of X, we use Formula (X), whereas {tau}(X) denotes an element of Formula (X). The tree T(S) isin Formula (S) is the species tree of Formula with a length function l, which defines the branch length of each edge e isin Formula T(S). The size L(T(S)) of a tree is the sum over all branch lengths. We assume a species tree T(S) to be binary, rooted, leaf labeled, and clocklike (each species (leaf of the tree) has the same distance to the root), and we interpret the distance between any node and the root as time that has passed since the first split at the root. A gene tree is a tree topology of a leaf-labeled tree, which evolves within a species tree and comprises at most all taxa of S.

The restriction of S to X, denoted by {tau}(S|X), is a tree topology derived from T(S) where all leaves in S\X are ignored and all vertices with deg(v) = 2 are suppressed, except the root (fig. 1C).

Modeling HGT
To model the process of HGT, we have to make some pivotal assumptions. 1) A binary, labeled, rooted, and clocklike species tree T(S) is known, as well as all splitting times along this tree. 2) Differences between a gene tree and T(S) are only caused by HGT events. 3) The transfer rate {lambda} is homogeneous per gene and unit time. 4) Genes are transferred independently. 5) One copy of the transferred gene still remains in the donor genome. 6) The transferred gene replaces any existing ortholog counterpart in the acceptor genome.

As described before, the effect of HGT can result in a branching pattern of a gene tree, which differs from the species tree (fig. 1). From a computational point of view, we model each HGT event as a subtree prune and regraft process (Hillis et al. 1996Go). This means an HGT event is modeled in the following way: As we assume a homogeneous HGT rate {lambda}, the transfer events are uniformly distributed along all branches of the tree. For each HGT event, we randomly choose a starting point in the clocklike species tree, determine the corresponding time in this tree, search for all branches that exist at that point of time, and randomly select one as acceptor branch. Consequently, single transfer events between species are only possible if they coexist in time in order to prevent gene transfer from present-day species to fossils. This biologically motivated restriction is not considered in most current research on HGT models (e.g., Suchard 2005Go). Furthermore, it is easily seen that not every single HGT event changes the branching pattern of the species tree, for example, if the process takes place between branches that share the most recent common ancestor.

For a given species tree with total length L(T(S)) and fixed {lambda}, the tree topology {tau}(S) occurs with a certain probability P({tau}(S)|T(S),{lambda},L(T(S))). In the following, we write P({tau}(S)|{lambda}) instead because {lambda} is the parameter of interest. As stated in the introductory part, the number of HGT events is Poisson distributed with parameter {Lambda} = {lambda}·L(T(S)) for a fixed species tree. Thus, the probability for {tau}(S) given {lambda} is

Formula (1)
The Poisson distribution describes the probability for h HGT events to happen on the species tree T(S) with L(T(S)) and {lambda}, whereas the 2nd factor is the probability to observe {tau}(S) as tree topology after h HGT events. Although the Poisson distribution is easy to calculate, the probability distribution of the gene trees for a fixed number of HGT events appears hard to calculate, except for trivial cases like Formula Moreover, for a fixed arbitrary subset Formula we can compute the probability for each subtree {tau}(X) as follows:

Formula (2)
The Kronecker delta {delta}({tau}(X),{tau}(S|X)) is 1 if the topology of the induced subtree {tau} (S|X) with respect to Formula is identical to {tau}(X), otherwise it is 0.

Equations (1) and (2) allow the estimation of {lambda} in a likelihood framework. Therefore, we assume that {lambda} acts on each gene independently. If m gene trees Formula are reconstructed, the likelihood of {lambda} is

Formula (3)
We maximize equation (3) with respect to {lambda}, which is interpreted as the most likely transfer rate.

This approach turns out to be computationally infeasible because a reliable estimation of P({tau}(S)|{lambda}) is only possible for a small number of taxa. Hence, we resort to an approximation of the likelihood. We consider a collection of subsets Formula together with the probability distribution induced by equation (2) and the simplified situation that the occurrences of gene trees Formula are mutually independent for different randomly chosen subsets Formula In this case, the joint probability of Formula is given by

Formula (4)
Although equation 4 is an approximation to equation 3, the simulations show that it is good enough for the practical application, and we can also apply the described estimation scheme to estimate Formula and Formula, respectively.

Estimating the Probability Distribution of Gene Trees
From the previous paragraph, it is obvious that it would be very difficult to find an analytical expression for any of the equations. However, equation (1) suggests an efficient simulation. For any fixed number h of HGT events, we can approximate the distribution P({tau}(S)|HGT = h) reasonably well. Therefore, we simulate N = 100,000 times h HGT events on the species tree with 0 ≤ h ≤ 60 and calculate how often each gene tree occurs in the simulated trees. We end up with a probability distribution P*(·) in which each column represents one gene tree and each row a fixed number of HGT events. The final likelihood estimation is based on P*(·).

Although P({tau}(S)|{lambda}) can be estimated for small taxa sets, it gets intractable for biologically interesting numbers because too many tree topologies exist and it is almost impossible to simulate enough trees for a reliable estimation within a reasonable time span. In such situations, the probability for different subsets Formula proves more successful. Thus, we reduce the calculated probability distribution P*(·) to a subset of randomly chosen quartet topologies of the given set of gene trees.

The COG Data
The whole data set, which is available via the National Center for Biotechnology Information Web site (http://www.ncbi.nlm.nih.gov/), comprises 3,167 protein families of 44 species (2 eukaryotes, 9 archaea, and 33 bacteria). As we are concentrating on single-copy genes up to now, we only extracted those families that fulfill this criterion. To obtain enough phylogenetic information (Nei 1996Go) to reconstruct the gene trees, we only used COG families with a minimum alignment length of 100 amino acids for each of the corresponding proteins. We also required at least 4 species per COG family. After applying these 3 criteria, 780 protein families still remained (see supplementary tables S1 and S2, Supplementary Material online). For each of these families a gene tree was reconstructed with Tree-Puzzle (Schmidt et al. 2002Go), using the Dayhoff substitution model (Dayhoff et al. 1978Go).

Species Tree Reconstruction
To construct the species tree of the mentioned 780 protein families, we built all 3 binary trees for all possible quartets (A, B, C, D) and computed the corresponding log-likelihood values {ell} as sum of the log likelihoods of the COG families (gi) each represented by a gene tree.

Formula (5)
All the 3 log-likelihood values Formula are set to 0 if at least one of the species A, B, C, or D does not occur in the corresponding COG family gi.

Afterwards, we used Tree-Puzzle (Schmidt et al. 2002Go) to construct the species tree topology of the log-likelihood values of all Formula quartet topologies (|gi| is the number of taxa represented in the COG family gi).

To assign branch lengths to this topology, we performed a clock test (Felsenstein 1988Go) for all 780 protein families. The result contained 443 clocklike and 337 nonclocklike COG families. Only 3 of all families occurred in all 44 species (COG0013: Alanyl-tRNA synthetase, COG0092: Ribosomal protein S3, and COG0541: Signal recognition particle GTPase Ffh), but none of them evolved clocklike. Therefore, we had to use an appropriate set of gene trees that covers all 44 species. For each clocklike evolving COG family with taxa set X, we reconstructed the corresponding subtree {tau}(SCOG|X) with a total branch length measured in numbers of substitutions per site. Furthermore, we identified a set G of subtrees fulfilling the following conditions: 1) G covers the species tree completely and 2) each branching point is determined by at least one subtree. Such a coverage was found for the 3 clocklike evolving families: COG0419 (ATPase involved in DNA repair), COG0173 (Aspartyl-tRNA synthetase) and COG1242 (uncharacterized FeS oxidoreductases). As some of the splitting times are given by 2 or 3 of the named families and each of them evolved with a different rate, we computed the ratio of these rates to estimate the splitting times relative to one protein family, in this case COG0419.

Eventually, the reconstructed species tree T(SCOG) was used to simulate distributions of tree topologies for different numbers of HGT events.

Comparing Trees
To compare the most frequent gene tree with the species tree, we extracted all quartet topologies from the 780 gene trees and added up the information in a descending sorted list representing each topology by the number of its occurrence. Afterwards, we built a quartet set that finally consisted of 35.7% of the initially extracted quartet topologies. Starting with the most frequent topology, we put each one successively in the set if the corresponding quartet tree does not contradict a quartet tree already in the current set. With the help of the final quartet set, we reconstructed a tree using Tree-Puzzle. To compare the obtained tree topology with the COG species tree, we built a consensus tree using the program Consense of the PHYLIP package (Felsenstein 1989Go).


    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Quality Check
To obtain reliable simulation and estimation results, we repeated the procedure for different parameter settings. Hence, we used a program that simulates an HGT rate {lambda} on a clocklike species tree. The corresponding number of HGT events was drawn from the Poisson distribution. This kind of simulation generates a new data set, which is comparable to the 780 gene trees of the COG data. Because we know the true HGT rate {lambda}, we can check the reliability of the estimation procedure.

First of all, we estimated the probability P*({tau}(X)|HGT = h) to get the tree topology {tau}(X) if exactly h HGT events happened on the species tree T(S) with 44 taxa for the COG data. Thus, we simulated N times h events on T(S) and assumed that P*({tau}(X)|HGT = h) is the relative occurrence of the topology {tau}(X).

To analyze the influence of the size of the quartet set, we generated 1,000 gene trees for several HGT rates {lambda}. We extracted all quartet topologies and used a randomly chosen subset of these topologies to estimate the HGT rate {lambda}. Repeating this for the quartet set sizes 100, 1,000, and 10,000, we got the results visualized in figure 2 where the true HGT rate {lambda} to generate gene trees is plotted against the estimated rate. It turned out that a set of 10,000 topologies was large enough to get reliable estimation results.


Figure 2
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Quality of the rate estimation in dependence on the number of quartet topologies: (A) 100, (B) 1,000, and (C) 10,000. N = 100,000 and hmax = 60 are fixed.

 
For a 2nd test, we used 10,000 quartet topologies and varied the value of hmax (the maximal number of simulated transfers on the species tree) while N (number of simulations for a fixed hmax) was 100,000. Figure 3 displays the estimation based on Formula One can see that for each hmax there exists a maximum rate, which can be estimated reliably, whereas rates above get underestimated.


Figure 3
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Quality of the rate estimation as a function of the maximum number of simulated HGT events with N = 100,000 and 10,000 quartet topologies. Each displayed value is based on one estimation.

 
In another analysis (data not shown), we also checked that N = 100,000 is a feasible value for the number of simulations.

The Most Frequent Gene Tree
To determine whether the most frequent gene tree is similar to the reconstructed species tree, we compared both trees. We computed a quartet set of all quartet topologies of the 780 COG gene trees which only consisted of those quartet topologies that did not contradict each other. A comparison of this quartet set with the species tree T(SCOG) comprising 44 taxa led to the consensus tree depicted in figure 4. Both trees support all bifurcations except for 2 nodes indicated by multifurcations in the consensus tree. We can conclude that T(SCOG) and the tree reconstructed of the quartet set—consisting of not contradicting quartet topologies of the gene trees—are nearly equal. As the latter quartet set represents the most frequent quartet topologies, we can also deduce that the most common gene tree is very similar to T(SCOG).


Figure 4
View larger version (24K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Consensus tree of the COG species tree and the tree reconstructed of the quartet set which represents the most frequent quartet subtrees of the 780 gene trees. Only 2 multifurcations exist which indicate incongruity between both input trees. This tree was reconstructed with the strict consensus mode of the Consense program (Felsenstein 1989Go).

 
Estimating the HGT Rate {lambda} for the COG Data
The quality tests described above have shown that an HGT rate {lambda} of 0.7 can be estimated reliably if we extract 10,000 quartet topologies of the 780 COG gene trees and set the parameters N = 100, 000 and hmax = 60.

We applied this procedure to the COG data, repeated the estimation for 50 sets of quartet topologies, and obtained results for Formula between 0.43 and 0.48 and for Formula between 12.86 and 14.35 presented in figure 5A. Because {Lambda} is the parameter of the Poisson distribution, which describes the occurrence of HGT events in time, Formula is the expected value for the number of HGT events that happened on T(S), here T(SCOG). The estimated HGT rate Formula is relative to the number of substitutions in COG0419 (ATPase involved in DNA repair), which were used to assign branch lengths to the species tree.


Figure 5
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 5.— Distribution of the estimated HGT rates Formula for the COG data for randomly chosen quartet sets (A) of the 780 gene trees and (B) over all 44 species tree taxa with N = 100,000 and hmax = 60.

 
To test the reliability of the results, we checked if the estimated HGT rates differ from those estimated of quartet sets randomly chosen from all quartet topologies of the 44 species tree organisms. Figure 5A shows the estimation results of quartet topologies, which could be found in the 780 gene trees and figure 5B represents estimations over all Formula quartet topologies. The graph indicates an estimated HGT rate Formula, which is about 10 times higher, namely between 4.66 and 4.7. As these rates must be higher because the set consists of many quartet topologies (7%), which are not part of the gene trees, so that many more HGT events are necessary to get the distribution, the estimated rates for the COG data seems quite acceptable.

Rate Correction Taking into Account the Inaccuracies of Gene Tree Reconstructions
We performed a further analysis taking into account the inaccuracies and uncertainties of gene tree reconstructions. For each protein family gi, representing a taxa set Formula, we restricted T(SCOG) to Formula denoted by Formula and assigned branch lengths to all of these tree topologies using Tree-Puzzle (Schmidt et al. 2002Go). Afterwards, we simulated protein sequences of the same size than the corresponding COG sequences with Seq-Gen (Rambaut and Grassly 1997Go) along the calculated trees, using the Dayhoff substitution model (Dayhoff et al. 1978Go). We repeated this step 5 times, then calculated the corresponding gene trees, and repeated the estimation procedure. As the newly acquired gene trees are based on trees which are subtrees of the species tree T(SCOG), we expected to estimate an HGT rate Formula of about zero.

After the estimation of 10 randomly chosen quartet sets for each of the 5 simulated data sets, we got the distribution which is shown in the stacked histogram in figure 6. Each of the 5 colors represents 1 data set. The estimation results are nearly constant, at about 0.1 (0.1 ± 0.01). This result could be interpreted as a kind of background noise due to inaccuracies in the applied gene tree reconstruction procedure (see Materials and Methods) because in the setting there should be no difference in the branching patterns of gene and species tree and therefore, it leads to the conclusion that the estimated average HGT rate Formula of about 0.46 per gene and unit time is about 22% too high. This would decrease the total amount of HGT events, which is necessary to transform the species tree topology into one gene tree from 14 to 11 events per gene.


Figure 6
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 6.— Distribution of the estimated HGT rates Formula for 5 simulated data sets. Each data set is based on the 780 protein families and their corresponding subtrees in T(SCOG). For each data set 10 randomly chosen quartet sets have been estimated.

 

    Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
In the previous paragraph, we described some results based on a new approach to estimate an overall rate of HGT with the help of a likelihood framework. We are able to estimate a rate under the assumptions that all differences between a gene tree and the corresponding species tree are caused by HGT and that the HGT rate is homogeneous over the whole tree. Note that we did not make any statement about the probability if a gene is transferred at all, but how many events have occurred within one COG family on average. Thus, we are assuming that every gene is transferred with the same probability.

The extent to which HGT has shaped the individual genomes is controversially discussed (Kurland et al. 2003Go). Although some research groups support the opinion that HGT plays an important role in evolution and appears very frequently (Doolittle 1999Go; Eisen 2000Go; Garcia-Vallve et al. 2000Go), others think that the impact of HGT is overestimated due to problems in the various inferring procedures (Brown 2003Go). A recent publication by Ge et al. (2005)Go also analyzed the COG data and detected HGT in 33 out of 297 protein families. To do so, they used a novel test statistic based on tree topology comparisons. Unfortunately, they did not say anything about how many HGT events happen in each of the 33 detected COGs, which would be interesting in order to compare their results with ours.

There are several other approaches trying to estimate an HGT rate. For example, Huelsenbeck et al. (2000)Go developed a Bayesian framework for the analysis of cospeciation, which could also be used to estimate rates of genetic transfer. Suchard (2005)Go published 2 stochastic models with the same purpose. The first model, developed by Suchard (2005)Go, is based on subtree prune and regraft operations and is applicable if the number of taxa under consideration is small, whereas the other approach is a random walk over complete graphs and offers a solution for an increasing number of taxa. In both publications, the fact that the corresponding framework can deal with gene and species tree topologies, which are not known without error is highlighted. But on the other hand, both HGT models require that all gene and species trees are based on the same set of present-day species. In contrast, the new approach, which we have introduced here, can incorporate gene families that are incomplete by using quartet subtrees and we can estimate reliable rates even if the gene tree taxa sets are only subsets of the whole species tree taxa set. As an example, the COG data set only comprises 2 gene families which represent genes for all 44 taxa (see Supplementary Material online). Furthermore, this new framework also takes into account the inaccuracies in the gene tree reconstruction method.

As genomes are not only shaped by HGT, but also by processes like hybridization, gene loss, duplication, genesis, and fusion/fission (Snel et al. 2002Go), it becomes clear that the estimated rate of about 11 events per gene and unit time is a kind of upper bound because we assume that all conflicts in the gene tree topologies are caused by HGT. However, it remains as yet unclear, how the rate estimate changes if multicopy genes were included in the analysis. Nevertheless, the estimate seems to be quite high. This can be explained by the fact that a lot of HGT events will not change the tree topology, for example, events between 2 nodes that share parents. This seems to be important because 71% of the total branch length of the COG species tree can be involved in HGT events, which do not change the branching pattern. As it is most likely that the majority of HGT events in nature take place between closely related taxa it becomes clear that the number of these events would be underestimated by just counting visible incongruences between 2 given trees. Moreover, if one gene is transferred back and forth between 2 lineages, these events will not be detected either. The importance to take unobservable HGT events into account is supported by the fact that the topologies of 264 (34%) of the 780 COG gene trees are equal to the species tree, restricted to the corresponding gene tree taxa. This means that the gene tree topology can be explained without any single HGT event. As the number of taxa of these 264 trees differs widely, and even gene trees with up to 36 taxa are equal to the corresponding species tree restriction, we can assume that HGT events happened during the evolution of the corresponding gene although we cannot see any of them. This is also supported by the fact that it is still not proven if a core of nontransferable genes exists (Nesbø et al. 2001Go). Summing up, the importance of simulating HGT events on a given species tree, instead of just counting visible differences between a species and gene tree, becomes obvious and distinguishes our approach from some previous work on estimating an HGT rate. To get an impression of the probability that an HGT event does not change the tree topology, we counted the simulated trees which are equal to the COG species tree. The result indicates that this probability is 9% (0.9%) for the simulated trees after one or two transfers. As our approach includes simulations on the species tree, which gave us a distribution of trees after different numbers of HGT events, we automatically include unobservable HGT events, and therefore, the estimated rate is higher than in other approaches. But this high rate also indicates that HGT influences the tree topologies strongly, as described by Doolittle (1999)Go.

Many other approaches (e.g., see Hao and Golding 2006Go; Dagan and Martin 2007Go) exist which also estimate an HGT rate. All those methods are quite different from one another, and it is difficult to compare their results with ours. The 2 mentioned publications are based on gene-present and -absent patterns, whereas the method that we have introduced here, uses the information of reconstructed gene trees to calculate an HGT rate. Dagan and Martin (2007)Go have presented a method in which they inferred a conservative lower bound estimate of about 1.1 HGT events per gene family and gene family lifespan considering the genome size of present-day species. As already explained above, the estimates represented here are a kind of upper bound and therefore, they are much higher. Because both methods (Hao and Golding 2006Go; Dagan and Martin 2007Go) are tested on different data sets, it would be interesting to see how much the results really differ when both are applied to the same data set.

Certainly, this newly developed method to estimate a rate of HGT is based on a number of key assumptions (as described in the Materials and Methods) and we are aware of the fact that probably some of them are not reasonable from a more biological point of view. Nevertheless, we started with such a simplified model to get a first impression about the overall rate of HGT for a set of present-day species, and we intend to consider more biologically relevant aspects of HGT in the future. For example, like Suchard (2005)Go, we would like to include heterogeneous HGT rates in the analysis because such rates are important to take into account that genes belonging to different functional categories have different transferabilities (Nakamura et al. 2004Go). Another interesting and important extension for the simulation would be to include uncertainties of the species tree branch lengths. So far, we assume that these lengths are exactly known.

Furthermore, the species tree represents the evolutionary history of all 780 examined COG families and is therefore slightly different from a tree obtained from 16/18S rDNA sequences which are often used to reconstruct a species tree for a set of given taxa (Woese 2000Go). Probably, we would even estimate a higher HGT rate if we used such a species tree because that one would not consider the information of the COG data, which would lead to more differences between species and gene trees. Therefore, the estimation of an HGT rate with the help of a different species tree would be a further interesting task.

Currently, the newly developed method only deals with trees representing a single gene copy per species. Because phylogenies often present several distantly related copies for a given organism, the HGT estimates based on orthologs only could be too low. We intend to include multicopy genes into the analysis to overcome this problem in the future.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Supplementary tables S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/). The first table S1 is a summary of all 780 COG gene trees that were used in the described analysis. For each gene tree, the corresponding COG, the maximum likelihood value of the reconstructed gene tree, the number of taxa, whether or not the gene evolved clocklike, and the species names are given. As we used abbreviations for the exact species names, table S2 provides a translation table assigning each species name such an abbreviation. A software package to estimate an HGT rate is available online (http://www.cibiv.at/software/hgt/). It consists of several C/C++ programs, Perl scripts, and a short user manual.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
We wish to thank the Bioinformatics Institute at the University of Düsseldorf and the Center for Integrative Bioinformatics in Vienna. For coding some extensions to the Tree-Puzzle program, we would like to thank Heiko Schmidt. We thank Roland Fleißner, Ingo Paulsen, Ricardo de Matos Simões, and 2 anonymous referees for useful comments on an earlier version of the manuscript and Mareike Fischer and Ramona Schmid for helpful discussions and critical proofreading. This work was supported by Deutsche Forschungsgemeinschaft grant SFB-TR1. We also thank the Vienna Science and Technology Fund for financial support.


    Footnotes
 
Martin Embley, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 

    Andersson JO. Lateral gene transfer in eukaryotes. Cell Mol Life Sci (2005) 62:1182–1197.[CrossRef][Web of Science][Medline]

    Boucher Y, Douady CJ, Papke RT, Walsh DA, Boudreau ME, Nesbø CL, Case RJ, Doolittle WF. Lateral gene transfer and the origins of prokaryotic groups. Annu Rev Genet (2003) 37:283–328.[CrossRef][Web of Science][Medline]

    Brown JR. Ancient horizontal gene transfer. Nat Rev Genet (2003) 4:121–132.[CrossRef][Web of Science][Medline]

    Bushman F. Lateral DNA transfer: mechanisms and consequences. (2002) Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press.

    Dagan T, Martin W. Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc Natl Acad Sci USA (2007) 104:870–875.[Abstract/Free Full Text]

    Dayhoff MO, Schwartz RM, Orcutt BC. A model of evolutionary change in proteins. In: Atlas of protein sequence and structure—Dayhoff MO, ed. (1978) Vol. 5. Washington (DC): National Biomedical Research Foundation. 345–352.

    de la Cruz F, Davies J. Horizontal gene transfer and the origin of species: lessons from bacteria. Trends Microbiol (2000) 8:128–133.[CrossRef][Web of Science][Medline]

    Diruggiero J, Dunn D, Maeder DL, Holley-Shanks R, Chatard J, Horlacher R, Robb FT, Boos W, Weiss RB. Evidence of recent lateral gene transfer among hyperthermophilic archaea. Mol Microbiol (2000) 38:684–693.[CrossRef][Web of Science][Medline]

    Doolittle WF. Phylogenetic classification and the universal tree. Science (1999) 284:2124–2129.[CrossRef][Web of Science][Medline]

    Eisen JA. Horizontal gene transfer among microbial genomes: new insights from complete genome analysis. Curr Opin Genet Dev (2000) 10:606–611.[CrossRef][Web of Science][Medline]

    Felsenstein J. Phylogenies from molecular sequences: inference and reliability. Annu Rev Genet (1988) 22:521–565.[CrossRef][Web of Science][Medline]

    Felsenstein J. PHYLIP—phylogeny inference package. Version 3.2. Cladistics (1989) 5:164–166.

    Garcia-Vallve S, Romeu A, Palau J. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res (2000) 10:1719–1725.[Abstract/Free Full Text]

    Ge F, Wang LS, Kim J. The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol (2005) 3:e316.[CrossRef][Medline]

    Hao W, Golding GB. The fate of laterally transferred genes: life in the fast lane to adaptation or death. Genome Res (2006) 16:636–643.[Abstract/Free Full Text]

    Hillis DM, Moritz C, Mable BK. Molecular systematics (1996) Sunderland (MA): Sinauer.

    Huelsenbeck JP, Rannala B, Larget B. A Bayesian framework for the analysis of cospeciation. Evolution (2000) 54:352–364.[CrossRef][Web of Science][Medline]

    Kurland CG, Canback B, Berg OG. Horizontal gene transfer: a critical view. Proc Natl Acad Sci USA (2003) 100:9658–9662.[Abstract/Free Full Text]

    Lawrence JG, Ochman H. Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol (1997) 44:383–397.[CrossRef][Web of Science][Medline]

    Lerat E, Daubin V, Moran NA. From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol (2003) 1:e19.[Medline]

    Lerat E, Daubin V, Ochman H, Moran NA. Evolutionary origins of genomic repertoires in bacteria. PLoS Biol (2005) 3:e130.[CrossRef][Medline]

    Nakamura Y, Itoh T, Matsuda H, Gojobori T. Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat Genet (2004) 36:760–766.[CrossRef][Web of Science][Medline]

    Nei M. Phylogenetic analysis in molecular evolutionary genetics. Annu Rev Genet (1996) 30:371–403.[CrossRef][Web of Science][Medline]

    Nelson KE, Clayton RA, Gill SR, et al, (29 co-authors). Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature (1999) 399:323–329.[CrossRef][Medline]

    Nesbø CL, Boucher Y, Doolittle WF. Defining the core of nontransferable prokaryotic genes: the euryarchaeal core. J Mol Evol (2001) 53:340–350.[CrossRef][Web of Science][Medline]

    Ochman H, Lawrence JG, Groisman EA. Lateral gene transfer and the nature of bacterial innovation. Nature (2000) 405:299–304.[CrossRef][Medline]

    Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol (1988) 5:568–583.[Abstract]

    Rambaut A, Grassly NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci (1997) 13:235–238.[Abstract/Free Full Text]

    Schmidt HA, Strimmer K, Vingron M, von Haeseler A. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics (2002) 18:502–504.[Abstract/Free Full Text]

    Semple C, Steel M. Phylogenetics (2003) Oxford (UK): Oxford University Press.

    Snel B, Bork P, Huynen MA. Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Res (2002) 12:17–25.[Abstract/Free Full Text]

    Suchard MA. Stochastic models for horizontal gene transfer: taking a random walk through tree space. Genetics (2005) 170:419–431.[Abstract/Free Full Text]

    Syvanen M. Horizontal gene transfer: evidence and possible consequences. Annu Rev Genet (1994) 28:237–261.[Web of Science][Medline]

    Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res (2001) 29:22–28.[Abstract/Free Full Text]

    Woese CR. Interpreting the universal phylogenetic tree. Proc Natl Acad Sci USA (2000) 97:8392–8396.[Abstract/Free Full Text]

Accepted for publication March 15, 2007.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Genome ResHome page
X. Didelot, A. Darling, and D. Falush
Inferring genomic flux in bacteria
Genome Res., February 1, 2009; 19(2): 306 - 317.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Material
Right arrow An erratum has been published
Right arrow All Versions of this Article:
24/6/1312    most recent
msm052v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Linz, S.
Right arrow Articles by von Haeseler, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Linz, S.
Right arrow Articles by von Haeseler, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?