MBE Advance Access originally published online on February 22, 2008
Molecular Biology and Evolution 2008 25(4):643-654; doi:10.1093/molbev/msn034
Research Articles
Integrating Markov Clustering and Molecular Phylogenetics to Reconstruct the Cyanobacterial Species Tree from Conserved Protein Families


* Institute of Low Temperature Science, Hokkaido University, Sapporo, Japan
Departments of Biology and Chemistry, Washington University, St Louis, MO
School of Natural Sciences, University of California, Merced
E-mail: jason.raymond{at}ucmerced.edu.
Abstract
Attempts to classify living organisms by their physical characteristics are as old as biology itself. The advent of protein and DNA sequencing—most notably the use of 16S ribosomal RNA—defined a new level of classification that now forms our basic understanding of the history of life on earth. High-throughput sequencing currently provides DNA sequences at an unprecedented rate, not only providing a wealth of information but also posing considerable analytical challenges. Here we present comparative genomics–based methods useful for automating evolutionary analysis between any number of species. As a practical example, we applied our method to the well-studied cyanobacterial lineage. The 24 cyanobacterial genomes compared here occupy a wide variety of environmental niches and play major roles in global carbon and nitrogen cycles. By integrating phylogenetic data inferred for upward of 1,000 protein-coding genes common to all or most cyanobacteria, we have reconstructed an evolutionary history of the phylum, establishing a framework for resolving key issues regarding the evolution of their metabolic and phenotypic diversity. Greater resolution on individual branches can be attained by telescoping inward to the larger set of conserved proteins between fewer taxa. The construction of all individual protein phylogenies allows for quantitative tree scoring, providing insight into the evolutionary history of each protein family as well as probing the limits of phylogenetic resolution. The tools incorporated here are fast, computationally tractable, and easily extendable to other phyla and provide a scaleable framework for contrasting and integrating the information present in thousands of protein-coding genes within related genomes.
Key Words: genomics cyanobacteria evolution Markov clustering phylogenomics
Introduction
Although the 16S ribosomal RNA (rRNA) paradigm continues to provide a strong framework for understanding evolution, it represents only one small piece of an organism's history. The exponentially increasing number of genome sequencing projects is pushing our understanding of diversity well beyond the limitations of the single-gene proxy. Integrating the enormous wealth of genetic information—hundreds to tens of thousands of genes per genome—stands as one of the central challenges to biology in the 21st century. Ultimately, an evolutionary tree will be available for every (nonnovel) gene from every sequenced genome, providing a temporal and cross-species blueprint of how Darwinian evolution has brought these genes together into an organism able to thrive in its particular niche.
The goal of phylogenomics has recently been the subject of a number of novel and provocative approaches (Eisen 1998; Lerat et al. 2003; Rivera and Lake 2004; Delsuc et al. 2005; Snel et al. 2005). Although insightful, their results are often quite controversial; for example, some strongly support the canonical tree of life as deduced by 16S rRNA analysis, whereas others suggest striking rearrangements to this orthodoxy (Wolf et al. 2002; Charlebois et al. 2003; Doolittle 2005; Ciccarelli et al. 2006). Perhaps the best developed and most rigorously tested of these methods, molecular phylogeny, has been difficult to implement due primarily to the computational challenges of constructing gene trees with very large data sets. Furthermore, single-gene phylogenies are complex by default, often reflecting nonvertical evolution due to horizontal gene transfer, gene duplication (paralogy), and loss (Gogarten and Townsend 2005). Deep phylogenies are especially prone to poor resolution due to sequence divergence. In particular, although 1 set of homologous genes or proteins may be quite useful in resolving species- or genus-level relationships, it might be quite poor at resolving phylum-level relationships due to poor conservation or short sequence length.
In this work, we take a new approach integrating clustering and sequence analysis toward resolving an integrated phylogeny spanning multiple taxonomic levels within a single phylum. Using all genomes available from a single phylum, our approach combines the rigorous (maximum likelihood) analysis of large numbers of orthologs, as well as of concatenated sets of up to several hundred proteins representing a large fraction of some genomes, and of consensus phylogenies based on single-protein trees. The ultimate goal is to determine, given the known role of horizontal gene transfer particularly in prokaryote evolution as well as the difficulty in resolving deep phylogenies, whether a plurality phylogenetic signal exists that is both consistent with, and potentially explanatory toward, systematic and taxonomic information about a group of organisms.
This phylum-first approach is well suited to the >10^3 ongoing genome projects, for several reasons. First, most phyla appear to be robustly defined based both on molecular methods, especially 16S, and on traditional systematics. Organisms within a phylum typically share unique phenotypic traits that are variable enough to be both interesting and informative of the evolutionary process. Second, by focusing first on resolving the distribution and phylogeny of single proteins, it is possible to select for subsequent analysis those that are potentially most useful in resolving relationships at different taxonomic levels. For example, many proteins are not common to all organisms within a clade and would be excluded from analyses of completely conserved, or "core," proteins, whereas they might be useful for determining relationships between subsets of organisms. Additionally, depending on factors such as length and degree of conservation, some proteins give well-resolved trees for only some taxonomic levels. Ribosomal proteins often share 100% amino acid identity—and are thereby phylogenetically uninformative—between members of the same genus or species.
Working with a single phylum (as opposed to, say, all 3 domains of life) also prevents data sets from becoming computationally intractable, especially when employing maximum likelihood–based approaches. This methodology can also be naturally extended into different taxonomic levels. Whereas some subset of proteins may be useful for resolving relationships within phyla, when needed, additional proteins can be incorporated for reconstructing family-, class-, or genus-level relationships by selecting only those proteins conserved at these taxonomic levels. Understanding which proteins are adequate at resolving different taxonomic levels enables selection of proteins that are useful in determining relationships between phyla—an ultimate goal (and persistent shortcoming) in reconstructions of the tree of life.
As an introductory example, we focus on the phylum cyanobacteria, which is notable for sequencing projects covering a wide swath of their enormous diversity as well as for their evolutionary importance and time constraints on their early evolution. The most ancient diagnostic markers for any organism come in the way of chemical biomarkers argued to have been left by cyanobacterial ancestors some 2.7 billion years ago, and the global-scale effects resulting from the oxygen produced during cyanobacterial photosynthesis are seen in rocks 2.43 billion years old and younger (Summons et al. 1999; Farquhar et al. 2000; Knoll 2003; Kopp et al. 2005). Ongoing and completed sequencing projects include cyanobacteria from marine and freshwater environments, thermophiles, nitrogen fixers, and symbionts. In addition to illustrating the robust evolutionary resolution acquired using our method, we also seek to build a growing phylogenetic framework upon which the evolution of this phenotypically diverse group of organisms is based.
The long history of cyanobacterial systematics has been confounded by morphology-based botanical classifications as well as difficulties in resolving closely related species using 16S rRNA (Rippka et al. 1979; Fox et al. 1992; Castenholz 2001; Casamatta et al. 2005). Individual genes and proteins conserved across all organisms or specifically in all cyanobacteria have been used to build phylogenies (Woese 1987; Giovannoni et al. 1988; Honda et al. 1999; Hess et al. 2001; Seo and Yokota 2003; Henson et al. 2004). Some subsets of cyanobacteria have also been compared extensively, particularly within the (genomically) well-sampled Prochlorophyte clade (Hess 2004; Dufresne et al. 2005). However, only a few studies thus far have assembled cyanobacterial phylogenies based on a larger set of proteins conserved across all cyanobacteria. Martin et al. (2002) examined several thousand genes from 3 then-available cyanobacteria to determine the evolutionary history of nuclear genes from Arabidopsis thaliana, establishing the widescale impact that imported cyanobacterial genes have had on the evolution of photosynthetic eukaryotes, as well as plausible gene complements of chloroplast/cyanobacterial ancestors. A BLAST-based comparison of 8 cyanobacterial genomes (Martin et al. 2003) revealed 181 signature genes that do not have homologs in other organisms, roughly 3/4 of which had no ascribable function yet are clearly important in some aspect of cyanobacterial lifestyle. Sanchez-Baracaldo et al. (2005) more recently developed a method based on multigene concatenation combined with morphological character analysis to construct and map traits onto a cyanobacterial species tree. Additionally, a cyanobacterial phylogeny based on 31 proteins conserved across the entire tree of life was constructed as part of a large-scale tree construction (Ciccarelli et al. 2006), but this study used only 8 cyanobacterial taxa and the ribosomal proteins used for tree construction did not resolve terminal branches. A cluster of orthologous groups (COG)–based analysis was used to determine the distribution of proteins in 15 complete cyanobacterial genomes, with a particular focus on understanding the origin of photosynthesis (Mulkidjanian et al. 2006). However, that study did not undertake phylogenetic analysis, either of individual protein families or in an attempt to resolve the evolution of the phylum as a whole. Zhaxybayeva et al. (2006) have conducted the most extensive sampling of the phylum to date, reconstructing histories of 1,128 protein-coding genes from 11 cyanobacterial genomes in order to reconstruct a plurality tree based on quartet analysis.
In addition to constructing maximum likelihood trees for a large number of orthologs from completed cyanobacterial genomes, we assembled concatenated alignments as a further test of phylogenetic robustness. Importantly, variations in the concatenated alignment used resulted in 2 distinct but very highly supported phylogenies, suggesting that even large, statistically well-supported concatenations can converge on very different trees. To further test phylogenetic robustness, we used a tree consensus method to build a single tree that best captures all single-protein phylogenies. Recent work (Gadagkar et al. 2005) has compared the effectiveness of concatenated versus consensus methods for phylogenetic inference in the face of incongruent signals (e.g., due to horizontal gene transfer, poor resolution, invalid model assumptions, or use of the same model for all data sets). They found that concatenated phylogenies outperform consensus phylogenies, though importantly both methods can converge on incorrect trees when systematic biases are present in individual trees—for example, when the evolutionary model used is a poor match to the data. However, our consensus tree agrees exactly with one of the trees inferred from concatenated alignments, compares the results of multiple evolutionary models, and also is compatible with modern cyanobacterial classification schemes that integrate both systematic and molecular information.
To further test, and potentially increase, resolution of individual nodes on our concatenated/consensus genome tree, we used a telescoping method whereby protein families that are conserved among a smaller number of very closely related taxa can be taken into account. This proved useful particularly in resolving relationships between the very closely related marine Synechococcus and Prochlorococcus clades, which were clarified with exceptional support by analyzing conserved protein families between just these 2 groups. In cases, such as these, the inverse relationship between the number of conserved protein families and the number of taxa tends to yield a uniform total number of phylogenetically informative characters.
Finally, using these methods to model cyanobacterial speciation provides a framework for understanding and explaining the distribution of cyanobacterial protein families. Generating a robust "background" tree is crucial for framing key evolutionary events, such as the origin and evolution of capabilities such as pigment biosynthesis, carbon and nitrogen fixation, and provides insight into fundamental evolutionary mechanisms such as niche adaptation, genome reduction, and horizontal gene transfer. This approach can be similarly extended to other phyla to provide a high-resolution framework, based on the totality of evolutionary information from many protein families, which can be linked together to assemble the tree of life.
Methods
All data are publicly available in the form of completed or nearly complete genome sequences (table 1). The pipeline of methods used is diagrammed in figure 2. BlastP comparisons (10^-4 cutoff, BLOSUM62, standard settings for word size, gap opening/extension, and filtering) were made between all protein sequences from the genomes of 24 cyanobacteria and 2 non-cyanobacterial outgroups (see table 1), representing all complete plus a diverse set of nearly complete cyanobacterial genomes, and outgroups from well-sampled bacterial phyla (proteobacteria and Gram-positive bacteria). To generate first-pass protein families, Markov clustering (Enright et al. 2002
Using the multiple alignments and corresponding Neighbor-Joining trees generated by ClustalW as a guide, protein families were then manually checked for poor alignments and/or long-branch lengths, with poorly aligned sequences and/or poorly assembled protein families either corrected or removed. Most frequently, these differences involved inclusion of a paralog in a protein family, which can be easily detected based on the number of homologs per organism or, often, the presence of long branches in the phylogeny. As depicted in table 2, these curated protein families were then parsed using various filters, for example, selecting protein families present in all or most cyanobacteria, any imaginable subset of organisms, or by selecting protein families that all share a common function or annotation. The full-protein family spreadsheet is available as supplementary table S2 (Supplementary Material online).
In addition to the distance-based trees generated during multiple alignment, phylogenies based on single-protein families were generated for every aligned protein family using 2 different maximum likelihood methods. The first approach used PHYLIP's ProML package with the following parameters: JTT probability model, one category of sites with constant rate, and with randomized input order (Felsenstein 1989
Concatenated multiple alignments were generated by end-to-end attachment of individual protein families, using gaps as placeholders for species missing a particular ortholog. As an additional test of robustness, variable/uninformative positions were filtered out of these concatenated alignments using progressively more stringent Shannon information entropy cutoffs (SIE 1.0–3.0) and filtering out positions with >50% gaps. The resulting concatenated alignments from all 26 genomes ranged from 28,281 (SIE 1.0) to 230,415 (full/unfiltered concatenation) aligned amino acid positions and contained up to 300,000 aligned positions in the case of the Prochlorococcus/Synechococcus-conserved protein families (fig. 4). PHYLIP ProML and Neighbor-Joining phylogenies were then constructed for each of these filtered concatenated alignments to determine the effect of removing gaps and progressively more variable sites from alignments (see, e.g., the discussion of the difference in support for the Prochlorococcus/Synechococcus clade in the main text).
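The concatenation step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the dictionary-based alignment representation, and the `max_gap_fraction` parameter are assumptions made for illustration. Each family is taken to be an already aligned set of sequences of equal length, keyed by taxon name.

```python
def concatenate(families, taxa, gap="-"):
    """Join per-family alignments end to end.

    families: list of dicts mapping taxon -> aligned sequence
              (all sequences within one family have equal length).
    taxa:     every taxon that should appear in the output.
    A taxon missing from a family receives a run of gap characters
    of that family's alignment length as a placeholder.
    """
    parts = {t: [] for t in taxa}
    for fam in families:
        length = len(next(iter(fam.values())))
        for t in taxa:
            parts[t].append(fam.get(t, gap * length))
    return {t: "".join(p) for t, p in parts.items()}


def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Remove columns in which more than max_gap_fraction of taxa have a gap."""
    taxa = list(alignment)
    n = len(taxa)
    length = len(alignment[taxa[0]])
    keep = [
        i for i in range(length)
        if sum(alignment[t][i] == gap for t in taxa) / n <= max_gap_fraction
    ]
    return {t: "".join(alignment[t][i] for i in keep) for t in taxa}


# Toy example: taxon B lacks the second family.
toy_families = [
    {"A": "MK", "B": "MR", "C": "MK"},
    {"A": "LL-", "C": "LI-"},
]
toy_concat = concatenate(toy_families, ["A", "B", "C"])
toy_filtered = drop_gappy_columns(toy_concat)
```

In the toy example the last column is a gap in every taxon and is dropped, while the columns where only B carries placeholder gaps are kept because they fall under the 50% threshold.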
Our final goal was to test the effect of correcting for site heterogeneity in concatenated alignments by incorporating a gamma parameter, rather than strictly filtering out variable regions of alignments. The size and associated memory requirements of inferring gamma corrected phylogenies for these concatenated alignments required they be analyzed using MrBayes (Huelsenbeck and Ronquist 2001
Tree comparisons used PHYLIP's consense, using the extended majority rule method and both the symmetric (Robinson–Foulds) and branch score distance metrics. Comparisons also included 50 trees composed of the same cyanobacterial taxa arranged in randomized topologies. As illustrated in figure 5, core- and pan-genome numbers are determined for a specific rooted phylogeny by 1) counting the number of protein families conserved within all descendants of a particular node in the tree (core) and 2) counting the total number of protein families present in the descendants of a particular node in the tree (pan).
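The core and pan counts defined above amount to a set intersection and a set union over the leaves below each node. The following Python sketch shows one way to compute them; the nested-tuple tree encoding, the `presence` mapping, and the function names are hypothetical choices for illustration, not part of the published pipeline.

```python
def leaves(node):
    """Return the leaf names below a node of a nested-tuple rooted tree."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def core_pan_counts(tree, presence):
    """Map each internal node to (core, pan) protein family counts.

    tree:     nested tuples with taxon names (str) as leaves.
    presence: dict mapping taxon -> set of protein family identifiers.
    Nodes are keyed by the frozenset of their descendant leaves.
    """
    counts = {}

    def visit(node):
        if isinstance(node, str):
            return
        members = leaves(node)
        family_sets = [set(presence[t]) for t in members]
        core = set.intersection(*family_sets)  # families in every descendant
        pan = set.union(*family_sets)          # families in any descendant
        counts[frozenset(members)] = (len(core), len(pan))
        for child in node:
            visit(child)

    visit(tree)
    return counts


# Toy example with three taxa and five protein families.
toy_tree = (("A", "B"), "C")
toy_presence = {"A": {1, 2, 3}, "B": {1, 2, 4}, "C": {1, 5}}
toy_counts = core_pan_counts(toy_tree, toy_presence)
```

As the text notes for the real data, these numbers bound the family content at each node from below (core) and above (pan); they are not estimates of ancestral gene content.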
Results and Discussion
The 24 genomes analyzed here represent all cyanobacteria with either complete or very nearly complete sequencing projects and encompass nearly 94,000 protein-coding genes. Homology-based Markov clustering resulted in 7,378 families of proteins present in more than 1 cyanobacterium (an additional 12,955 protein families were found only in a single cyanobacterium). Many of these families include multiple, often closely related paralogs. For example, the D1 and D2 proteins of the photosystem II reaction center complex are members of the same family, and ABC transporter and serine/threonine kinase paralogs are quite extensive even in the smallest cyanobacterial genomes. To avoid problems associated with inclusion of paralogs in phylogenies, initial analysis focused on families with few or no paralogs present in most or all cyanobacteria, which includes housekeeping proteins common to most organisms as well as cyanobacterial-specific proteins that have been important during their evolution and early diversification.
Following the initial clustering, 613 protein families fit the criterion of being absent in not more than 2 cyanobacteria and having not more than 2 paralogs in total for all organisms. Alignments and Neighbor-Joining phylogenies for all families were manually checked, and poorly aligned proteins (as well as those with disproportionately long-branch lengths; for details, see Methods) were removed from alignments or else the family was removed from the analysis. A total of 583 protein families remained after this manual curation. Here we focus on a substantial number of relatively easily obtained families of orthologs, selected by a fast clustering approach that minimizes the number of paralogs while maximizing the total number of genomes represented in a given protein family (see supplementary table S1, Supplementary Material online).
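The selection criterion above (absent in at most 2 cyanobacteria, at most 2 paralogs in total) can be expressed as a short filter. The Python sketch below is illustrative only. It assumes each family is a mapping from taxon to a list of member proteins, and it interprets "paralogs in total" as the number of extra copies summed over all organisms in the family, which is one plausible reading of the text rather than a confirmed definition.

```python
def passes_filter(family, cyano_taxa, max_absent=2, max_paralogs=2):
    """Decide whether a protein family meets the conservation criterion.

    family:     dict mapping taxon -> list of protein IDs in that taxon.
    cyano_taxa: the cyanobacterial taxa over which absence is counted.
    Assumption: every copy beyond the first in a taxon counts as a paralog.
    """
    absent = sum(1 for t in cyano_taxa if not family.get(t))
    paralogs = sum(len(members) - 1 for members in family.values() if members)
    return absent <= max_absent and paralogs <= max_paralogs


# Toy example with five hypothetical taxa.
toy_taxa = ["t1", "t2", "t3", "t4", "t5"]
fam_ok = {"t1": ["a"], "t2": ["b"], "t3": ["c"]}              # absent in 2
fam_sparse = {"t1": ["a"], "t2": ["b"]}                       # absent in 3
fam_paralogous = {t: ["p"] for t in toy_taxa}
fam_paralogous["t1"] = ["p1", "p2", "p3", "p4"]               # 3 extra copies
```

With the default thresholds, `fam_ok` passes, while `fam_sparse` fails on absence and `fam_paralogous` fails on paralog count.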
Phylogenies for each of the 583 families were constructed using 2 different implementations of the maximum likelihood method (PHYLIP and quartet-based iqpnni; see Methods). A total of 438 of these families—those composed strictly of orthologs—were then used to generate a consensus phylogeny that portrays the bifurcations that occur most frequently across all trees (fig. 2). For example, both the marine Synechococcus/Prochlorococcus (11 organisms) and the Synechococcus sp. A and B' clusters are conserved in every tree generated, and the Nostocales clade is observed in 421 of 438 trees. Importantly, only minority support is observed for several nodes on the tree, especially among the cyanobacteria often argued as among the earliest branching (Gloeobacter)—which may indeed reflect asymmetric rates of evolution—as well as for some members of the Prochlorococcus lineages, which recent studies suggest may result from horizontal gene transfer (Beiko et al. 2005). The ability to detect this phylogenetic incoherence is a crucial step in being able to segregate both protein families and organisms that are responsible. An attractive, iterative approach would take these into account by fine-tuning parameters of ascribed evolutionary models or progressively removing "difficult" protein families from tree-building methods that rely on combined data sets.
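Support values such as "421 of 438 trees" come from counting how often a given clade appears across the single-protein trees. The Python sketch below shows that counting step only. It is a simplification and not PHYLIP's extended majority rule algorithm: it assumes rooted trees encoded as nested tuples, and the function names are hypothetical.

```python
from collections import Counter


def collect_clades(node, found):
    """Add the leaf set of every internal node below `node` to `found`.

    Returns the frozenset of leaves under `node`.
    """
    if isinstance(node, str):
        return frozenset([node])
    members = frozenset().union(*(collect_clades(child, found) for child in node))
    found.add(members)
    return members


def clade_support(trees):
    """Count, for each clade, the number of trees in which it occurs."""
    support = Counter()
    for tree in trees:
        found = set()
        collect_clades(tree, found)
        support.update(found)
    return support


# Toy example: three rooted gene trees over the same four taxa.
toy_trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    ((("A", "C"), "B"), "D"),
]
toy_support = clade_support(toy_trees)
```

Clades with high counts are candidates for the consensus tree, and families whose trees contradict a well-supported clade are the ones worth inspecting for possible horizontal transfer or poor resolution.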
This consensus phylogeny gives a straightforward method for finding putative horizontal gene transfer events and indicates that gene transfer "across" the tree, that is, between Prochlorococcus/marine Synechococcus and cyanophytes, is very rare among this particular subset of proteins. Note that as these proteins are common to almost all cyanobacteria, a very specific type of horizontal gene transfer—orthologous gene replacement—must occur, whereby a newly transferred gene displaces a functional wild-type gene. Importantly, though recent evidence indeed supports an important role for horizontal gene transfer among cyanobacteria (Zhaxybayeva et al. 2006), simulations suggest that these phylogenetic signals are not self-reinforcing and, even when corrections are not made for variations in evolutionary rate or composition, convergence to the true tree is frequently observed (Gadagkar et al. 2005). Indeed, Zhaxybayeva et al. (2006) obtained a plurality tree based on quartet reconstruction with which the consensus and concatenated trees presented here are consistent.
In addition to individual and consensus phylogenies, all alignments without paralogs were concatenated into a single large alignment containing 230,415 positions encompassing 26 organisms. Smaller alignments were generated from this full alignment using a Shannon information entropy–based filter (Reche and Reinherz 2003) to remove phylogenetically uninformative (too variable or too conserved) sites from the alignment. Shannon entropy can be calculated for each position in an alignment and provides a more robust method for parsing informative positions from alignment than simply culling positions that fall below a given percentage identity or similarity. For example, a position in a protein sequence alignment might have 1 amino acid in half of the sequences and a different amino acid in the other half. If a percentage-based cutoff were used, this position would contain the same informative value as one where half the positions were 1 amino acid and the other half were all different amino acids. However, the Shannon entropy score of these 2 examples is quite different and, furthermore, is conceptually similar to maximum likelihood calculations. Phylogenies for all concatenated alignments were generated as discussed in the methods and showed overall agreement with one another, with one notable exception—differing levels of filtering (Shannon entropy cutoff values ranging from 1 to 4, where 0 is an invariant site and 4.322 is a site where all 20 amino acids are equally represented) resulted in 2 distinct trees differing by monophyly of the Prochlorococcus/Synechococcus clades. One of the trees—shown in figure 3—was converged upon from multiple MrBayes runs using the full/unfiltered data set. This tree is characterized by separate/monophyletic Prochlorales (the order containing Prochlorococcus species) and marine Synechococcus clades, with Synechococcus sp. strain WH 5701 basal to both groups, a topology supported in previous single-gene trees (Rocap et al. 2002; Scanlan 2003). Notably, this tree was in almost exact agreement with the consensus phylogeny generated from 438 trees (with the exception of the poorly supported Acaryochloris marina/Thermosynechococcus elongatus clade, resolved as 2 distinct lineages in the concatenated tree).
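The per-position Shannon entropy used for filtering can be sketched in a few lines of Python. The scale quoted above (0 for an invariant site, 4.322 for 20 equally represented amino acids) corresponds to a base-2 logarithm, since log2(20) ≈ 4.322. The function names, the choice to ignore gap characters, and the `min_entropy`/`max_entropy` parameters are assumptions for illustration; the exact filter of Reche and Reinherz (2003) is not reproduced here.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())


def filter_alignment(alignment, max_entropy, min_entropy=0.0):
    """Keep only columns whose entropy lies in [min_entropy, max_entropy].

    alignment: dict mapping taxon -> aligned sequence (equal lengths).
    """
    taxa = list(alignment)
    length = len(alignment[taxa[0]])
    keep = [
        i for i in range(length)
        if min_entropy <= column_entropy([alignment[t][i] for t in taxa]) <= max_entropy
    ]
    return {t: "".join(alignment[t][i] for i in keep) for t in taxa}


# The two cases contrasted in the text: both are 50% identical to the
# majority residue, but their entropies differ.
half_and_half = "AAAAACCCCC"          # two residues, five each
half_then_unique = "AAAAACDEFG"       # one residue in half, the rest all different
h1 = column_entropy(half_and_half)
h2 = column_entropy(half_then_unique)
```

Here `h1` is 1 bit while `h2` is larger, which is the distinction a percentage-identity cutoff cannot make.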
Although the observed convergence to a single tree from 2 different approaches lends support to this as the true tree, the fact that a different tree was inferred from some filtered concatenated alignments underscores the importance of using multiple methods of analysis to infer phylogenies. Shannon entropy presents a metric for pruning highly variable (less phylogenetically informative) positions from long alignments, making phylogenetic analysis more tractable. However, care must be taken that evolutionary models are compared each time a data set is filtered as it is feasible that the best model can change as positions are pruned from an alignment. Even character-rich data sets can be prone to error, in particular when they contain multiple phylogenetic signals or include highly divergent or deeply branching organisms (Mossel and Steel 2006
As is evident in figures 3 and 4, order Prochlorales shows anomalously long-branch lengths, evident both in individual as well as concatenated phylogenies, that may account for the alternative topology seen in some filtered concatenated phylogenies (this alternate topology is illustrated by the dashed line in fig. 4). However, one of the trees is converged to in both concatenated and consensus phylogenies, lending support to this as the true tree.
As a further test, we demonstrate one of the advantages of our approach by incorporating additional information from protein families excluded from the initial analysis because they were not present in most or all cyanobacteria. Specifically, 1,108 protein families are found in all Prochlorococcus and marine Synechococcus species (including WH 5701). A total of 848 of these families have no paralogs within either of these clades, and so individual and consensus/concatenated phylogenies can be generated for this Prochlorococcus/Synechococcus-specific subset of families. As shown in figure 4, phylogeny based on 848 concatenated protein families (287,466 aligned positions in 11 Prochlorococcus/Synechococcus genomes) supports a branching order in agreement with both the consensus and fully concatenated data sets. Moreover, the resulting phylogeny also retains the relatively long-branch lengths characteristic of several members of the Prochlorales clade, suggesting that an accelerated substitution rate across many proteins has accompanied genome reduction. Prochlorococcus genome analyses have observed this long-branch effect, which is likely due to loss of several DNA repair capabilities during genome reduction (Dufresne et al. 2005).
The single phylogeny converged upon by multiple methods used herein also provides a framework for understanding the distribution of protein families at each ancestral node on the tree (Martin et al. 2002; Eisen and Fraser 2003; Lerat et al. 2003). As shown in figure 3, the common ancestor of all cyanobacteria is inferred to have had a conserved core of 361 protein families as these are present in the full set of 26 genomes analyzed. A total of 675 proteins (within which the 361 are nested) are common to all 24 cyanobacterial genomes analyzed, though as mentioned, many of these families contain paralogs and so were excluded from this analysis. These families represent a widely conserved core of housekeeping proteins common not only across known cyanobacterial diversity but also present to some extent in non-cyanobacterial genomes. Furthermore, the total diversity of modern cyanobacterial protein families—the union of all protein families in all progeny of an ancestor—is inferred to be just over 20,000 proteins for the cyanobacterial common ancestor and 25,292 when including the non-cyanobacterial outgroups. This is referred to as the cyanobacterial pan-genome (which, it must be emphasized, never actually existed but simply captures the extent of protein family variability across the phylum), illustrated along with the core-genome concept in figure 5. These pan- and core-genome numbers provide upper and lower bounds on protein family distributions at each node in a given phylogeny and are not parsimony-based estimates of the true genetic content of ancient organisms.
The core-genome at the base of the cyanobacterial phylum encompasses most of the major proteins of the photosynthetic apparatus, suggesting that oxygenic photosynthesis evolved prior to or early in the cyanobacterial radiation. This is in stark contrast with the ability to fix nitrogen, which is found paraphyletically throughout the cyanobacterial tree (illustrated in fig. 6a—N2-fixing lineages denoted by "+"). The nodes where nitrogen fixation is inferred—that is, whose descendant lineages all fix nitrogen—occur at multiple points across the tree (gray squares on fig. 6a) so that gene loss, horizontal gene transfer, or some combination of these processes must be invoked to explain the distribution of nitrogen fixation. The strength of having both combined and individual phylogenies comes from the capability to contrast the background tree of cyanobacterial speciation (figs. 2 and 3) with the evolutionary tree for nitrogenase. For example, based on the species tree, one plausible scenario is that nitrogenase was acquired on independent occasions within cyanobacterial lineages (e.g., horizontal gene transfers would be required at the gray +'s in fig. 6a), followed by largely vertical evolution to result in the observed distribution in the phylum. Alternatively (and arguably less parsimoniously), one could posit that the ancestor of all cyanobacteria had the capability to fix nitrogen but that the nitrogenase evolutionary history has since been dominated by gene loss. This scenario begins with nitrogen fixation in the hypothetical pan-genome and is followed by multiple independent losses, shown as x's on figure 6a.
By examining phylogenies of individual protein families, for example, that of the NifD (nitrogen fixation catalytic subunit) protein family shown in figure 6b, we can explore whether one of these scenarios is indeed more parsimonious than the other or if some combination of the 2 is more likely. The NifD tree (fig. 6b) shows some congruence with the cyanobacterial species tree (fig. 6a) but provides an important example of the complex history of protein families, often overlooked or not accurately captured in species trees. As well as supporting numerous gene losses, the NifD tree shows evidence for several gene duplications and plausible horizontal gene transfer, as suggested by the position of Trichodesmium erythraeum, comprising the earliest cyanobacterial branch among NifD proteins (though note poor bootstrap support makes it difficult to resolve this from the Synechococcus sp. A/B' divergence). At face value, this indeed suggests a combination of vertical evolution and gene loss accounts for the distribution of nitrogen fixation in cyanobacteria, with evidence for horizontal gene transfer as well as duplication in several lineages.
As with this truth-is-in-between example, the cyanobacterial ancestor would have had a genome content somewhere between the core- and pan-ancestral extremes, with functions and capabilities that, as demonstrated above, can be understood through examination of individual phylogenies. In a broader sense, the range established by ancestral core- and pan-genomes gives insight into the relative importance of genome reduction versus the evolution or acquisition of new genes and helps constrain the appearance of phenotypes specific to individual organisms or clades. This approach is extended to several other pathways of key importance to cyanobacterial evolution, such as carbon fixation and pigment biosynthesis, in Swingley et al. (2007). As shown in figure 7, the size of the core-genome between any 2 organisms shows a strong inverse correlation with their phylogenetic distance, whereas the pan-genome size shows only a weak correlation. This results mainly from the presence of novel/orphan genes that distinguish even closely related genomes, such as the 2 Synechococcus elongatus strains with 2,219 shared protein families.
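A minimal sketch of the pairwise comparison behind this kind of plot, assuming each genome has already been reduced to a set of protein-family identifiers: the pairwise core-genome is the intersection of the two sets and the pairwise pan-genome is their union. The genome names and family labels below are hypothetical placeholders, not data from this study.

```python
from itertools import combinations

# Hypothetical genomes mapped to the protein families they contain.
families = {
    "genome_1": {"fam1", "fam2", "fam3", "fam4"},
    "genome_2": {"fam1", "fam2", "fam3", "fam5"},
    "genome_3": {"fam1", "fam6", "fam7"},
}

def pairwise_core_pan(families):
    """Return {(a, b): (core_size, pan_size)} for every pair of genomes."""
    sizes = {}
    for a, b in combinations(sorted(families), 2):
        core = families[a] & families[b]  # families shared by both genomes
        pan = families[a] | families[b]   # families present in either genome
        sizes[(a, b)] = (len(core), len(pan))
    return sizes

sizes = pairwise_core_pan(families)
for pair, (core_size, pan_size) in sizes.items():
    print(pair, "core:", core_size, "pan:", pan_size)
```

Each pair's core and pan sizes could then be plotted against a phylogenetic distance taken from the species tree. Orphan families unique to one genome inflate the pan-genome without affecting the core, which is consistent with the weaker correlation described above.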
As mentioned above, the major elements of the cyanobacterial species tree find strong support in other analyses, both systematic and molecular. This includes: monophyly of heterocystous diazotrophs with the nonheterocystous diazotroph Trichodesmium erythraeum as an outgroup (Sanchez-Baracaldo et al. 2005
The phylogenies presented here integrate a large amount of genomic data from all completed, as well as a few nearly complete, cyanobacterial genomes. The fact that concatenated and consensus phylogenies from as many as 583 proteins converge on nearly identical topologies that agree with earlier systematic and molecular approaches suggests that this tree represents an accurate, though averaged, history of cyanobacterial speciation. Moreover, phylogenies from individual protein families are retained and can be selected and contrasted based on overall resolution, taxonomic distribution, degree of orthology versus paralogy, or various functional or pathway-associated criteria (e.g., table 2). Though attempting to resolve organismal evolution as a single phylogenetic tree invariably ignores the rich histories of single genes, here we have emphasized how organismal history can be understood at one level by integrating the information present in diverse genes and on additional levels by contrasting that integrated tree with individual phylogenies.
This telescoping approach to phylogenetic reconstruction—incorporating data from protein sequences at multiple taxonomic levels of conservation—can be used to refine evolutionary trees at different levels of phylogenetic resolution. Furthermore, inference of robust phylogenies stands as a primary technique by which horizontal gene transfer can be detected (and then be subtracted from consensus data sets). As genome data continue to fill out the branches of the tree of life, this approach will become increasingly useful as it provides a way to incorporate, compare, and contrast entire genomes' worth of sequence data, without ignoring information from individual genes or proteins.
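One simple way to contrast an individual protein phylogeny with the integrated species tree is to count the clades that appear in one tree but not the other. The sketch below uses a rooted, clade-based variant of the Robinson-Foulds distance as a stand-in; it is not necessarily the tree-scoring measure used in this study, and the trees and taxon labels are hypothetical. A large distance flags a family for closer inspection but does not by itself distinguish horizontal transfer from paralogy or poor resolution.

```python
# Trees are nested tuples with string leaves; both trees must share the same taxa.
# Topologies and labels are hypothetical examples.

def collect_clades(node, found):
    """Add the leaf set of every internal node under `node` to `found`."""
    if isinstance(node, str):
        return frozenset([node])
    members = frozenset().union(*(collect_clades(child, found) for child in node))
    found.add(members)
    return members

def clade_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    clades_a, clades_b = set(), set()
    collect_clades(tree_a, clades_a)
    collect_clades(tree_b, clades_b)
    return len(clades_a ^ clades_b)

species_tree = ((("A", "B"), "C"), ("D", "E"))
congruent_gene_tree = (("D", "E"), ("C", ("B", "A")))   # same clades, reordered
conflicting_gene_tree = ((("A", "D"), "C"), ("B", "E"))  # A groups with D

print(clade_distance(species_tree, congruent_gene_tree))
print(clade_distance(species_tree, conflicting_gene_tree))
```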
| Accession Numbers |
|---|
Accession numbers for genomes used in this study are given in table 1.
| Supplementary Material |
|---|
Supplementary figure S1 and tables S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
The authors wish to thank Jeff Touchman and the DNA sequencing team at the Translational Genomics Institute for making available sequence data for Acaryochloris marina. The authors also acknowledge very helpful discussions and suggestions from Carrine Blank and Elbert Branscomb. The A. marina genome project is funded by grant 0412824 from the National Science Foundation Microbial Genome Sequencing Program (http://genomes.tgen.org/). R.B. acknowledges additional support from grant NNG04GK59G from the Exobiology Program at the National Aeronautics and Space Administration. J.R. acknowledges support through a Lawrence Postdoctoral Fellowship at Lawrence Livermore National Laboratory.
| Footnotes |
|---|
Takashi Gojobori, Associate Editor
| References |
|---|
Beiko RG, Harlow TJ, Ragan MA. Highways of gene sharing in prokaryotes. Proc Natl Acad Sci USA (2005) 102:14332–14337.
Casamatta DA, Johansen JR, Vis ML, Broadwater ST. Molecular and morphological characterization of ten polar and near-polar strains within the Oscillatoriales (cyanobacteria). J Phycol (2005) 41:421–438.
Castenholz RW. Phylum BX. Cyanobacteria. Oxygenic photosynthetic bacteria. In: Bergey's manual of systematic bacteriology. Volume 1: the Archaea and deeply branching and phototrophic Bacteria—Boone DR, Castenholz RW, eds. (2001) New York: Springer-Verlag. 413–439.
Charlebois RL, Beiko RG, Ragan MA. Microbial phylogenomics: branching out. Nature (2003) 421:217.
Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. Toward automatic reconstruction of a highly resolved tree of life. Science (2006) 311:1283–1287.
Delsuc F, Brinkmann H, Philippe H. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet (2005) 6:361–375.
Doolittle RF. Evolutionary aspects of whole-genome biology. Curr Opin Struct Biol (2005) 15:248–253.
Dufresne A, Garczarek L, Partensky F. Accelerated evolution associated with genome reduction in a free-living prokaryote. Genome Biol (2005) 6:R14.
Eisen JA. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res (1998) 8:163–167.
Eisen JA, Fraser CM. Phylogenomics: intersection of evolution and genomics. Science (2003) 300:1706–1707.
Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res (2002) 30:1575–1584.
Farquhar J, Bao H, Thiemens M. Atmospheric influence of earth's earliest sulfur cycle. Science (2000) 289:756–759.
Felsenstein J. PHYLIP—Phylogeny inference package (Version 3.2). Cladistics (1989) 5:164–166.
Ferris MJ, Ruff-Roberts AL, Kopczynski ED, Bateson MM, Ward DM. Enrichment culture and microscopy conceal diverse thermophilic Synechococcus populations in a single hot spring microbial mat habitat. Appl Environ Microbiol (1996) 62:1045–1050.
Fox GE, Wisotzkey JD, Jurtshuk P Jr. How close is close: 16S rRNA sequence identity may not be sufficient to guarantee species identity. Int J Syst Bacteriol (1992) 42:166–170.
Gadagkar SR, Rosenberg MS, Kumar S. Inferring species phylogenies from multiple genes: concatenated sequence tree versus consensus gene tree. J Exp Zoolog B Mol Dev Evol (2005) 304:64–74.
Giovannoni SJ, Turner S, Olsen GJ, Barns S, Lane DJ, Pace NR. Evolutionary relationships among cyanobacteria and green chloroplasts. J Bacteriol (1988) 170:3584–3592.
Gogarten JP, Townsend JP. Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol (2005) 3:679–687.
Harlow TJ, Gogarten JP, Ragan MA. A hybrid clustering approach to recognition of protein families in 114 microbial genomes. BMC Bioinformatics (2004) 5:45.
Henson BJ, Hesselbrock SM, Watson LE, Barnum SR. Molecular phylogeny of the heterocystous cyanobacteria (subsections IV and V) based on nifD. Int J Syst Evol Microbiol (2004) 54:493–497.
Hess WR. Genome analysis of marine photosynthetic microbes and their global role. Curr Opin Biotechnol (2004) 15:191–198.
Hess WR, Rocap G, Ting CS, Larimer F, Stilwagen S, Lamerdin J, Chisholm SW. The photosynthetic apparatus of Prochlorococcus: insights through comparative genomics. Photosynth Res (2001) 70:53–71.
Honda D, Yokota A, Sugiyama J. Detection of seven major evolutionary lineages in cyanobacteria based on the 16S rRNA gene sequence analysis with new sequences of five marine Synechococcus strains. J Mol Evol (1999) 48:723–739.
Huelsenbeck JP, Ronquist F. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics (2001) 17:754–755.
Knoll AH. The geological consequences of evolution. Geobiology (2003) 3–14.
Kopp RE, Kirschvink JL, Hilburn IA, Nash CZ. The paleoproterozoic snowball earth: a climate disaster triggered by the evolution of oxygenic photosynthesis. Proc Natl Acad Sci USA (2005) 102:11131–11136.
Kumar S, Tamura K, Nei M. MEGA: molecular evolutionary genetics analysis software for microcomputers. Comput Appl Biosci (1994) 10:189–191.
Lerat E, Daubin V, Moran NA. From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol (2003) 1:E19.
Martin KA, Siefert JL, Yerrapragada S, Lu Y, McNeill TZ, Moreno PA, Weinstock GM, Widger WR, Fox GE. Cyanobacterial signature genes. Photosynth Res (2003) 75:211–221.
Martin W, Rujan T, Richly E, Hansen A, Cornelsen S, Lins T, Leister D, Stoebe B, Hasegawa M, Penny D. Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus. Proc Natl Acad Sci USA (2002) 99:12246–12251.
Minh BQ, Vinh le S, von Haeseler A, Schmidt HA. pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies. Bioinformatics (2005) 21:3794–3796.
Mossel E, Steel M. How much can evolved characters tell us about the tree that generated them? In: Mathematics of evolution and phylogeny—Gascuel O, ed. (2006) Oxford: Oxford University Press. 384–412.
Mulkidjanian AY, Koonin EV, Makarova KS, et al, (12 co-authors). The cyanobacterial genome core and the origin of photosynthesis. Proc Natl Acad Sci USA (2006) 103:13126–13131.
Nelissen B, Van de Peer Y, Wilmotte A, De Wachter R. An early origin of plastids within the cyanobacterial divergence is suggested by evolutionary trees based on complete 16S rRNA sequences. Mol Biol Evol (1995) 12:1166–1173.
Reche PA, Reinherz EL. Sequence variability analysis of human class I and class II MHC molecules: functional and structural correlates of amino acid polymorphisms. J Mol Biol (2003) 331:623–641.
Rippka R, Deruelles J, Waterbury JB, Herdman M, Stanier RY. Generic assignments, strain histories and properties of pure cultures of cyanobacteria. J Gen Microbiol (1979) 111:1–61.
Rivera MC, Lake JA. The ring of life provides evidence for a genome fusion origin of eukaryotes. Nature (2004) 431:152–155.
Rocap G, Distel DL, Waterbury JB, Chisholm SW. Resolution of Prochlorococcus and Synechococcus ecotypes by using 16S-23S ribosomal DNA internal transcribed spacer sequences. Appl Environ Microbiol (2002) 68:1180–1191.
Sanchez-Baracaldo P, Hayes PK, Blank CE. Morphological and habitat evolution in the cyanobacteria using a compartmentalization approach. Geobiology (2005) 3:145–165.
Scanlan DJ. Physiological diversity and niche adaptation in marine Synechococcus. Adv Microb Physiol (2003) 47:1–64.
Seo PS, Yokota A. The phylogenetic relationships of cyanobacteria inferred from 16S rRNA, gyrB, rpoC1 and rpoD1 gene sequences. J Gen Appl Microbiol (2003) 49:191–203.
Snel B, Huynen MA, Dutilh BE. Genome trees and the nature of genome evolution. Annu Rev Microbiol (2005) 59:191–209.
Summons RE, Jahnke LL, Hope JM, Logan GA. 2-Methylhopanoids as biomarkers for cyanobacterial oxygenic photosynthesis. Nature (1999) 400:554–557.
Swingley WD, Blankenship RE, Raymond J. Insights into cyanobacterial evolution from comparative genomics. In: Genomics and molecular biology of cyanobacteria—Herrero A, Flores E, eds. (2007) Norwich (UK): Horizon Scientific Press. 22–43.
Woese CR. Bacterial evolution. Microbiol Rev (1987) 51:221–271.
Wolf YI, Rogozin IB, Grishin NV, Koonin EV. Genome trees and the tree of life. Trends Genet (2002) 18:472–479.
Zhaxybayeva O, Gogarten JP, Charlebois RL, Doolittle WF, Papke RT. Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events. Genome Res (2006) 16:1099–1108.