MBE Advance Access originally published online on May 19, 2008
Molecular Biology and Evolution 2008 25(8):1677-1682; doi:10.1093/molbev/msn117
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Reconstructing Evolutionary Graphs: 3D Parsimony
Department of Molecular, Cellular, and Developmental Biology, University of California, Los Angeles; Molecular Biology Institute, University of California, Los Angeles; Department of Human Genetics, University of California, Los Angeles; and UCLA Astrobiology Institute, University of California, Los Angeles
E-mail: lake{at}mbi.ucla.edu.
| Abstract |
|---|
|
|
|---|
The increasing recognition that symbioses have greatly altered evolution through genome fusions is creating a need for algorithms that can reliably detect and reconstruct fusions. Here, we generalize the bootstrappers gambit algorithm (a quartet method) in order to permit it to analyze both bifurcations and fusions under a single mathematical model, and thereby detect past genomic branchings and endosymbioses. This new method, 3-dimensional parsimony, can be applied to aligned sequences, such as gene, indel, or other genomic presence/absence sequences. It also provides a statistical measure of support for each possible graph. The usefulness of this method is demonstrated by applying it to the ring of life.
Key Words: 3D parsimony ring of life eukaryotes fusion digraphs evolutionary graphs
| Introduction |
|---|
|
|
|---|
Given the enormous interest in discovering the topology of the net-, the web-, the ring-, the tree-, or the non–tree-of-life (Hilario and Gogarten 1993
Here, we provide a generalized method, 3-dimensional (3D) parsimony, to reconstruct genomic graphs. 3D parsimony analyzes graphs, including trees, under a single mathematical model using the presence and absence patterns observed for genes, indels, or other character data. It also provides an improved theoretical framework for understanding graph reconstruction. We demonstrate this method by applying it to the Ring of Life.
| Theory |
|---|
|
|
|---|
Tree Reconstruction
3D parsimony is a generalization of parsimony-based logic for detecting graphs and directed graphs from aligned genomic sequences. It combines aspects of traditional parsimony (Fitch 1971
To appreciate how parsimony can be applied to graphs, it is useful to review how the bootstrappers gambit algorithm parsimoniously analyzes trees. Consider the 5-taxon, unrooted tree shown at the top of figure 1, the ((a, b), c, (d, e)) tree in Newick notation. Gambit analyzes 5-taxon trees by systematically removing 1 taxon at a time in order to generate 5, four-taxon sequence alignments of gene presences, +s, and absences, –s. It then analyzes the 4-taxon alignments by counting which parsimoniously informative character state patterns, that is, the patterns containing 2 +s and 2 –s, occur most frequently. These 5 most parsimonious, 4-taxon subtrees are then graphically combined into a single 5-taxon tree.
|
A gambit analysis is illustrated in figure 1, assuming that the genomes evolve according to the tree shown at the top of the figure. When taxon e is eliminated from the analysis, shown by a dashed e leaf in the tree at the top of the first column, the resulting subtree has topology ((a, b), (c, d)).
Most parsimoniously this subtree produces gene counts for the supported patterns, (+, +, , , X) and ( , ,+, +, X) for taxa (a, b, c, d, e) respectively, shown at the top of the second column of figure 1. The 4 nonparsimonious patterns are not shown. (Note also that –s are shown as blanks for visual clarity.) The Xs in the patterns indicate that the genomic sequence state for taxon e may be either a plus or a minus because the character states of the missing taxa are unknown. The other 4 panels in the second column of figure 1 show the allowed sets for the remaining choices of missing taxa. Note that paired patterns for each of the 5 subtrees are symmetric, that is, (+, +, –, –) and (–, –, +, +), when the taxon containing X is not shown. This is true only for trees and not generally the case for graphs where the corresponding patterns may be (+, +, –, –) and (+, –, +, –), for example. Once the best supported character state sets are identified by counting the number of genes that support each pattern, one can infer the most parsimonious 4-taxon trees and subsequently combine them into 5-taxon trees. This logic can be iteratively applied to more taxa as needed.
Gambit-like analyses of conditioned data sets proceed in related ways. As previously described (Lake and Rivera 2004
), conditioning is used when analyzing complete genomes in order to define rigorously the genes that support each character state pattern. (Without a conditioning genome, the gene missing state, –, may be defined in multiple ways.) Conditioning is accomplished by referencing the analyses to a predefined set of genes, such as all those genes present, +, in a reference taxon. The reference taxon is then used to define the gene "universe" for the analysis but is not directly used in the phylogenetic reconstruction. By utilizing each taxon in turn as the reference genome, one can avoid deciding on a particular reference genome and simultaneously maximize the informational content of the analysis (Spencer et al. 2007
).
Thus, for a conditioned 5-taxon analysis, one analyzes all 5 four-taxon trees, as in the second column of figure 1, except that one only counts patterns in which X corresponds to character state +. Under this model, the 2 patterns that support the ((a, b), (c, d)) tree, shown at the top of the second column, are 5 (+, +, , ,+) and 10 (, , +, +, +), corresponding to taxa (a, b, c, d, e), respectively. Similarly, the remaining 4-taxon trees are analyzed by conditioning on each of the 4 remaining taxa. Thus, conditioning can be naturally integrated into the gambit algorithm.
At this point, it may be useful to step back and consider how a bootstrappers gambit analysis differs from a direct parsimony analysis of 5-taxon trees. The parsimoniously informative character sets for conditioned 4-taxon analyses consist of all sets that contain 3 +s and 2 –s (1 + for conditioning and 2 +s and 2 –s for analyzing the 4-taxon trees). This restricts the analysis to the 10 possible combinations of 3 +s and 2 –s listed in the third column of figure 1, numbered 1–10. Depending on the conditioning taxon, patterns 1, 2, 5, 8, 9, and 10 are most parsimonious with the corresponding 4-taxon trees derived from tree ((a, b), c, (d, e)). In contrast, a parsimonious analysis of the entire 5-taxon tree, that is, a nonbootstrappers analysis, permits a smaller set of patterns. Specifically, patterns 1, 10, 11, and 20 are parsimonious. Because patterns 11 and 20 contain 3 –s and 2 +s, they are not informative if conditioned. Thus, 3D parsimony utilizes 4 terms, 2, 5, 8, and 9, that do not support the 5-taxon tree in direct, traditional parsimony analyses. When the method is used to reconstruct graphs, these additional terms provide the new information that is needed to move out of 2 dimensional tree space and into 3D graph space, as explained in the next section.
Reconstructing Graphs
The procedures described in this section are appropriate for labeled, connected graphs generated under a general evolutionary model. Under this model, each internal node connects 3 edges/leaves and may be produced by several processes of evolutionary diversification: fusion, bifurcation, or directed transfer. Fusion nodes are surrounded by 1 directed edge exiting the node (the fusion organism) and 2 directed edges entering it (the fusion donors), unrooted bifurcation nodes are surrounded by 3 edges of unknown directionality, and unrooted directed transfer donors and recipients are surrounded by 1 edge of known directionality. An example of a directed transfer would be the significant transfer of genes from the chloroplast to the nucleus but not in the reverse direction. (In graphical terminology [Buckley and Harary 1990
], the indegrees and outdegrees, respectively, of a fusion node are 2 and 1, those of a rooted bifurcation node are 1 and 2, those of an unrooted bifurcation node are 3 and 3, those of an unrooted directed donor are 2 and 1, and those of an unrooted directed recipient are 1 and 2.) In evolutionarily realizable graphs, the fusion, directed, and bifurcation nodes must be arranged so that at all points are reachable from at least 1 point on the graph (the root).
A 3D parsimony analysis of a 5-taxon, unicycle (Semple and Steel 2006
) graph is illustrated in figure 2. In this graph, shown at the top of the figure, genomes from 2 donor taxa, labeled a and d, fuse to create a new taxon, c. Genome fusions have novel properties that make them initially more difficult to understand than bifurcations. To appreciate this, let us focus on the ring that is conditioned on taxon e, shown at the top of the left column in figure 2, and consider the number of character state changes necessary to obtain the pattern (+, –, +, –) for taxa (a, b, c, d). The gene present in taxon a can be transferred to taxon c without a character state change, as indicated by the arrowhead directed toward c from a. Because the gene is absent in taxon b and present in a, it must either be lost or gained between a and b as indicated by the open rectangular box marked 6. Because the gene absent in b is also absent in d, no changes along the path between b and d are required (ignore rectangular box 10 because it applies to a different pattern). Also no gene state changes are required between d and c. Even though c and d have different states, d is a "–" and c is a "+," no character state changes are required because c has already inherited the gene from a, so that it is not necessary to inherit a second copy from d. Thus, only a single gene character state change is needed to explain pattern 6.
|
The fusion property differentiates the character state patterns generated by trees from those of other graphs. Trees, conditioned on a given taxon, create 2 four-taxon patterns that are symmetrically related. The top tree in figure 1, conditioned on taxon e, generates patterns 5 (+, +, –, –) and 10 (–, –, +, +) for taxa (a, b, c, d). These patterns are symmetrically related with respect to an interchange of +s and –s. In contrast, graphs are not restricted to this symmetry. For example, the graph at the top of figure 2 produces patterns (+, –, +, –) and (–, –, +, +) for taxa (a, b, c, d). These 2 patterns are not symmetrically related because the fusion provides 2 pathways for genes to enter fusion organism, c. A gene may enter c through a, producing pattern (+, –, +, –), or a gene may enter c through d, producing pattern (–, –, +, +) for taxa (a, b, c, d). Thus, fusion events can dramatically increase the complexity of allowed phylogenetic patterns.
The patterns allowed by the tree in figure 1, and by the ring in figure 2, are compared side-by-side in figure 3. The disallowed patterns, corresponding to low gene counts, are shaded, whereas the allowed patterns, corresponding to high gene counts, are not shaded. The allowed patterns for the ((a, b), c, (d, e)) tree analyzed in figure 1 (patterns 1, 2, 5, 8, 9, and 10) are shown in the second column, and the allowed patterns for the unicycle graph analyzed in figure 2 (patterns 1, 3, 4, 6, and 10) are shown in the fourth column. By comparing the experimentally observed pattern with those predicted by various graphs, one can determine the graph that best fits the data. Determining which 5-taxon graph is supported by the data then reduces to asking how many counts support each of the 10 possible, 5-taxon character state patterns, just as computing 4-taxon trees corresponds to measuring parsimony counts for the 3 possible E, F, and G trees.
|
Intuitively, a ring is a more complex object than a tree leading to the suspicion that one would always favor a ring over a tree. But this is not the case. Because all 5-taxon graphs (including trees) are analyzed as 4-taxon subtrees and because these subtrees are evaluated using the same parsimony criteria (namely all 2 plus and 2 minus patterns can be explained by a maximum of 2 character state changes or a minimum of 1 change), this ensures that identical selection criteria are applied to both trees and graphs.
The 3D Parsimony Analyses of the Ring of Life Data
Conditioned reconstructions have previously been used to study the origin of the eukaryotes and provided evidence for a Ring of Life in which the eukaryotes are the fusion organism and the 2 fusion partners are prokaryotes (Rivera and Lake 2004
). That study concluded that 1 fusion partner branches from deep within an ancient photosynthetic clade and is "either a proteobacterium or a member of a larger photosynthetic clade that included the Cyanobacteria and the Proteobacteria," and "the other is related to the archaeal prokaryotes." Here, gene presence/absence data from the Ring of Life are reanalyzed using 3D parsimony, to demonstrate the use of this new tool.
Parsimoniously informative, conditioned character state sets, containing 3 gene presences, +s, and 2 gene absences, –s, were extracted from the aligned gene absence/presence sequences of the gamma proteobacterium Xylella fastidiosa (X), the bacillus Staphylococcus aureus MW2 (S), the eukaryotes (Y) Schizosaccharomyces pombe and Saccharomyces cerevisiae, the archaeal eocyte Sulfolobus tokodaii (E), and the halobacterium Halobacterium sp. NRC-1 (H). The extracted counts for the informative character state sets are shown in table 1.
|
The first column of table 1 shows the 10 parsimoniously informative, character state patterns with gene presences indicated by +s and gene absences indicated by blanks. The gene counts observed when either S. pombe, Yp, or S. cerevisiae, Yc, is included in the analyses are shown in the fourth and fifth columns of table 1, respectively. Note that similar count distributions are found when either S. pombe or S. cerevisiae are used as the eukaryote in the analyses, as expected for 2 closely related eukaryotes. The second column shows the number of parsimony changes needed for each pattern if the ((X, S), Y, (E, H)) tree is correct. The third column shows the number of parsimony changes needed if the graph shown in figure 4 is correct. Thus, we calculate the total number of gene changes required by multiplying the number of parsimony changes predicted by the tree or the graph times the number of genes supporting each pattern. Assuming the most parsimonious tree, ((X, S), Y, (E, H)), is correct requires 1,264 or 1,301 character state changes, when S. pombe or S. cerevisiae is used as the eukaryote, respectively. Assuming the graph in figure 4 is correct, requires 1,049 or 1,096 character state changes when S. pombe or S. cerevisiae, is used as the eukaryote, respectively. Thus, the ring of life requires about 210 fewer changes than the best tree.
|
The parsimony changes calculated from the best tree and the best graph are marked with stars when a character set pattern is most parsimonious (1 change). Six character set patterns are most parsimonious for the tree, and 5 are most parsimonious for the graph. Only 2 of the starred tree patterns match the top 6 gene counts, indicating poor agreement between the best tree and the observed counts. In contrast, the parsimony changes calculated from the 5-taxon unicycle graph in figure 4 (column 3) are in full agreement with the observed counts. Specifically, the 5 starred character set patterns computed to be most parsimonious (1 change) correspond to the top 5 highest scoring patterns for S. pombe and S. cerevisiae (shown in bold), and the 5 nonparsimonious patterns (2 changes) correspond to the bottom 5 scoring patterns. The 5 allowed patterns for the ring of life explain 601 (72.8%) of the Pfam counts or 624 (72.6%) of the Pfam counts when either S. pombe or S. cerevisiae are used as the eukaryote in the analyses.
To assess whether the gene counts provide statistically significant support for the graph in figure 4, we calculate the number of bootstrap analyses that support it and the number of bootstrap analyses that support any other graphs. The conditional probability supporting the graph in figure 3 is then the number of those bootstrap replicates that support the graph in figure 3 divided by the number of replicates that support it or any other graphs (Lake 1995
). (For discussions of the interpretations of bootstraps, see Felsenstein and Kishino (1993)
; Hillis and Bull (1993)
; Efron et al. (1996)
.) Using the fractions of the total counts supporting each of the 10 patterns to estimate the probability under a decavariate multinomial model, we calculate 10,000 bootstrap replicates of the character state pattern distributions. For the S. pombe data, 96.02% of the replicates contain the same 5 top patterns as those in table 1 and therefore support the graph in figure 4. For the S. cerevisiae data, 99.80% of the bootstraps support the graph in figure 4. We conclude that both 5-taxon graphs are statistically supported by these data, consistent with the ring reconstructed in Rivera and Lake (2004)
.
| Discussion |
|---|
|
|
|---|
Graphs are assuming increased importance in studies of genomic evolution. Today genome fusions through symbioses are a central feature in evolution and interest in fusions continues to increase (Margulis 1970
Fusions have properties that allow them to reveal details of past evolutionary events. Genome fusions carry partial rooting information, whereas bifurcations do not. Fusions can thus exclude the root from parts of the graph of life, whereas bifurcations cannot. Fusions, in contrast to tree bifurcations, bring 2 different fusion partners together at the same "time" and in the same "environment" (place). This can provide important clues about the nature of the relationship between the partners and about the environment they lived in at the time of the fusion and can help place time markers on the graph of life. Yet, few phylogenetic algorithms can directly detect fusions.
There is considerable interest and discussion about the biological bases for genome fusions and the time spans over which they occur (Martin and Embley 2004
; Bapteste and Walsh 2005
; Lester et al. 2005
; McInerney and Wilkinson 2005
). Mathematically, fusions correspond to the directed transfer of statistically significant numbers of genes from 2 donor taxa to generate a new taxon. This is a process constrained by time, space, and phylogenetic relationships. When a fusion is detected, its biological interpretation depends in part upon how accurately one can experimentally answer the following questions. What are the best phylogenetic estimates of the identity of the fusion donor taxa? Is the resolution of the phylogenetic reconstruction sufficiently high that the donors can be localized to a few taxa, or can they only be identified as a large taxon? How accurately can we identify the fusion taxon? Are multiple events involved? Gaining such knowledge depends upon obtaining the necessary phylogenetic information. In this sense, the Ring of Life is only starting to inform us about eukaryotic origins. We would like to know if the ring is the result of observing the flow of genes from mitochondrion and chloroplast to the nucleus. Or is the ring representing an earlier phylogenetic relationship? 3D parsimony now provides us with another tool to help us better answer the fascinating questions related to eukaryotic evolution.
3D parsimony is only a first step toward analyzing evolutionary graphs, and many additional modifications are needed to improve and extend it. For example, it has long been known, theoretically (Felsenstein 1978
) and practically (Lake 1991
), that parsimony is subject to long-branch attraction. Thus, the results obtained here using 3D parsimony provide a starting point for improved graphical reconstructions and leave much room for improvement and modification. There is also considerable progress being made in developing circular splits and other algorithms useful for graphical reconstructions (Baroni et al. 2006
; Grunewald et al. 2006
). In theory, the results presented here allow one to reconstruct 5-taxon graphs from character state patterns by enumerating all possible 5-taxon evolutionary graphs and calculating their corresponding character state patterns. But no algorithm is available yet that allows us to determine 5-taxon graphs directly from the character state patterns. Also unsolved are the precise details of how 3D parsimony analyses can be extended from 5-taxon graphs to N-taxon graphs. The necessary procedures, probably like those used to extend 4-taxon trees to N-taxon trees in gambit and other algorithms, remain to be developed.
3D parsimony shows great promise for understanding the evolution of life and further developing this method to extend its usefulness represents an exciting scientific challenge.
| Acknowledgements |
|---|
|
|
|---|
This paper is dedicated to Walter Fitch who taught me parsimony and to Mike Steel who showed me how to write the Bayesian-Jeffreys Bootstrap theorem. I thank Jackie Servin, Ryan Skophammer, and Craig Herbold for constant advice and help. Supported by grants from National Science Foundation and the UCLA NASA Astrobiology Institute to J.A.L.
| Footnotes |
|---|
1 Present address: 232 Boyer Hall, 611 South Young Drive, University of California, Los Angeles
Martin Embley, Associate Editor
| References |
|---|
|
|
|---|
Bapteste E, Walsh DA. Does the Ring of Life ring true? Trends Microbiol (2005) 13:256–261.[CrossRef][Web of Science][Medline]
Baroni M, Semple C, Steel M. Hybrids in real time. Syst Biol (2006) 55:46–56.[CrossRef][Web of Science][Medline]
Buckley F, Harary F. Distance in graphs (1990) Addison-Wesley. Redwood City (CA).
Dagan T, Martin W. The tree of one percent. Genome Biol (2006) 7:118.[CrossRef][Medline]
Doolittle WF. Phylogenetic classification and the universal tree. Science (1999) 284:2124–2128.
Efron B, Halloran E, Holmes E. Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci USA (1996) 93:13429–13434.
Embley TM. Anaerobic eukaryotes and their archaebacterial endosymbionts. Environ Microbiol (2002) 4:15–16.[CrossRef][Medline]
Felsenstein J. Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool (1978) 27:401–410.
Felsenstein J, Kishino H. Is there something wrong with the bootstrap on phylogenies—a reply. Syst Biol (1993) 42:193–200.
Fitch WM. Toward defining the course of evolution: minimum change for a specific tree topology. Syst Zool (1971) 20:406–416.[Abstract]
Grunewald S, Forslund K, Dress A, Moulton V. QNet: an agglomerative method for the construction of phylogenetic networks from weighted quartets. Mol Biol Evol (2006) 24:532–538.[CrossRef][Web of Science][Medline]
Hilario E, Gogarten JP. Horizontal transfer of ATPase genes—the tree of life becomes a net of life. Biosystems (1993) 31:111–119.[CrossRef][Web of Science][Medline]
Hillis DM, Bull JJ. An empirical-test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst Biol (1993) 42:182–192.
Jain R, Rivera MC, Moore JE, Lake JA. Horizontal gene transfer accelerates genome innovation and evolution. Mol Biol Evol (2003) 20:1598–1602.
Konstantinidis KT, Tiedje JM. Towards a genome-based taxonomy for prokaryotes. J Bacteriol (2005) 187:6258–6264.
Lake JA. Tracing origins with molecular sequences: metazoan and eukaryotic beginnings. Trends Biochem Sci (1991) 16:46–50.[CrossRef][Web of Science][Medline]
Lake JA. Calculating the probability of multitaxon evolutionary trees—bootstrappers Gambit. Proc Natl Acad Sci USA (1995) 92:9662–9666.
Lake JA, Herbold CW, Rivera MC, Servin JA, Skophammer RG. Rooting the tree of life using non-ubiquitous genes. Mol Biol Evol (2007) 24:130–136.
Lake JA, Rivera MC. Was the nucleus the 1st endosymbiont? Proc Natl Acad Sci USA (1994) 91:2880–2881.
Lake JA, Rivera MC. Deriving the genomic tree of life in the presence of horizontal gene transfer: conditioned reconstruction. Mol Biol Evol (2004) 21:681–690.
Lester L, Meade A, Pagel M. The slow road to the eukaryotic genome. Bioessays (2005) 28:57–64.[Web of Science]
Margulis L. Origin of the eukaryotic cells (1970) New Haven (CT): Yale University Press.
Martin W, Embley TM. Early evolution comes full circle. Nature (2004) 431:134–136.[CrossRef][Medline]
Martin W, Muller M. The hydrogen hypothesis for the first eukaryote. Nature (1998) 392:37–41.[CrossRef][Medline]
McInerney JO, Pisani D. Paradigm for life. Science (2007) 318:1390–1391.
McInerney JO, Wilkinson M. New methods ring changes for the tree of life. Trends Ecol Evol (2005) 20:105–107.[CrossRef][Medline]
Moran NA. Accelerated evolution and Muller's ratchet in endosymbiotic bacteria. Proc Natl Acad Sci USA (1996) 93:2873–2878.
Rivera MC, Lake JA. The ring of life: evidence for a genome fusion origin of eukaryotes. Nature (2004) 431:152–155.[CrossRef][Medline]
Semple C, Steel MA. Unicyclic networks: compatibility and enumeration. IEEE/ACM Trans Comput Biol Bioinform (2006) 3:84–91.[CrossRef]
Sorek R, Zhu Y, Creevery CJ, Francino MP, Bork P, Rubin EM. Genome-wide experimental determination of barriers to horizontal gene transfer. Science (2007) 318:1449–1452.
Spencer M, Bryant D, Susko E. Conditioned genome reconstruction: how to avoid choosing the conditioning genome. Syst Biol (2007) 56:25–43.[CrossRef][Web of Science][Medline]
Strimmer K, VonHaeseler A. Quartet puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol Biol Evol (1996) 13:964–969.[Web of Science]
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin Genome beginnings: rooting the tree of life Phil Trans R Soc B, August 12, 2009; 364(1527): 2177 - 2185. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




