MBE Advance Access originally published online on September 26, 2008
Molecular Biology and Evolution 2008 25(12):2689-2698; doi:10.1093/molbev/msn213

© 2008 The Authors
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


Research Articles

The Apicomplexan Whole-Genome Phylogeny: An Analysis of Incongruence among Gene Trees

Chih-Horng Kuo*,1, John P. Wares* and Jessica C. Kissinger*,†,‡

* Department of Genetics, University of Georgia
† Center for Tropical and Emerging Global Diseases, University of Georgia
‡ Institute of Bioinformatics, University of Georgia

E-mail: chkuo@email.arizona.edu.


    Abstract
 
The protistan phylum Apicomplexa contains many important pathogens and is the subject of intense genome sequencing efforts. Based upon the genome sequences from seven apicomplexan species and a ciliate outgroup, we identified 268 single-copy genes suitable for phylogenetic inference. Both concatenation and consensus approaches inferred the same species tree topology. This topology is consistent with most prior conceptions of apicomplexan evolution based upon ultrastructural and developmental characters, that is, the piroplasm genera Theileria and Babesia form the sister group to the Plasmodium species, the coccidian genera Eimeria and Toxoplasma are monophyletic and are the sister group to the Plasmodium species and piroplasm genera, and Cryptosporidium forms the sister group to the above mentioned with the ciliate Tetrahymena as the outgroup. The level of incongruence among gene trees appears to be high at first glance; only 19% of the genes support the species tree, and a total of 48 different gene-tree topologies are observed. Detailed investigations suggest that the low signal-to-noise ratio in many genes may be the main source of incongruence. The probability of being consistent with the species tree increases as a function of the minimum bootstrap support observed at tree nodes for a given gene tree. Moreover, gene sequences that generate high bootstrap support are robust to the changes in alignment parameters or phylogenetic method used. However, caution should be taken in that some genes can infer a "wrong" tree with strong support because of paralogy, model violations, or other causes. The importance of examining multiple, unlinked genes that possess a strong phylogenetic signal cannot be overstated.

Key Words: Apicomplexa • genome scale • phylogeny • bootstrap • long-branch attraction • taxon sampling


    Introduction
 
The protistan phylum Apicomplexa contains many important pathogens (Levine 1988). The most infamous members of this phylum are the causative agents of malaria from the genus Plasmodium, which cause more than one million human deaths per year globally (WHO and UNICEF 2005). Other important lineages include Babesia, which causes babesiosis in ruminants and humans (Brayton et al. 2007); Cryptosporidium, which causes cryptosporidiosis in humans and animals (Abrahamsen et al. 2004); Theileria, which causes tropical theileriosis and East Coast fever in cattle (Gardner et al. 2005; Pain et al. 2005); and Toxoplasma, which causes toxoplasmosis in immunocompromised patients and congenitally infected fetuses (Montoya and Liesenfeld 2004). These pathogens have been subjected to intense genome sequencing efforts in the hope of facilitating biomedical research (Tarleton and Kissinger 2001; Carlton 2003). The recent availability of fully annotated genome sequences from multiple species within this phylum provides a new and exciting opportunity for us to better understand the phylogeny of these important pathogens.

The use of genome sequences for phylogenetic inference has only recently become possible. The large number of characters derived from genomic data allows robust inference of organismal phylogeny (Delsuc et al. 2005; Philippe, Delsuc, et al. 2005; Rokas 2006), even when the level of incomplete lineage sorting is high (Pollard et al. 2006). Initially, it was thought that use of genomic data would bring an end to the incongruence commonly observed in multigene molecular phylogenetic inference (Gee 2003; Rokas et al. 2003). However, further investigations suggest that the results from genome-scale phylogenetic inference should be interpreted with caution (Soltis et al. 2004; Jeffroy et al. 2006; Nishihara et al. 2007). Although genomic data can effectively suppress stochastic noise in shorter molecular sequences, the large amount of data can actually strengthen systematic biases when present (Phillips et al. 2004; Rodriguez-Ezpeleta et al. 2007).

Previous studies that examined factors such as poor taxon sampling (Soltis et al. 2004; Philippe, Lartillot, and Brinkmann 2005), inappropriate choices of phylogenetic method (Phillips et al. 2004; Jeffroy et al. 2006), nucleotide or amino acid composition bias and deviation from compositional equilibrium (Phillips et al. 2004; Collins et al. 2005), and variation of evolutionary rates among or within sites (Dopazo H and Dopazo J 2005; Nishihara et al. 2007; Rodriguez-Ezpeleta et al. 2007), all found that systematic biases can lead to incorrect trees with strong support. Several approaches that can detect and remove systematic biases in genome-scale phylogenetic inference have been proposed, including modification of taxon sampling (Rodriguez-Ezpeleta et al. 2007), examination of model violations (Rodriguez-Ezpeleta et al. 2007), recoding of molecular sequences (Phillips et al. 2004; Rodriguez-Ezpeleta et al. 2007), removal of the fast-evolving sites (Nishihara et al. 2007; Rodriguez-Ezpeleta et al. 2007), and utilizing rare genomic changes (Delsuc et al. 2005). Among the approaches that have been developed to address the systematic biases in genome-scale analyses, examination of incongruence among individual genes is directly relevant to the design and interpretation of multigene analyses that are fundamental in molecular phylogenetics (Huelsenbeck et al. 1996; Taylor and Piel 2004; Jeffroy et al. 2006). Unfortunately, investigations of incongruence among gene trees at the genome-scale have been limited to a few selected groups such as gamma-Proteobacteria (Lerat et al. 2003), yeast (Taylor and Piel 2004; Gatesy and Baker 2005; Jeffroy et al. 2006), and Drosophila (Pollard et al. 2006) due to the limitation of data availability.

In this study, we present the first genome-scale phylogenetic analysis in the phylum Apicomplexa. Because of the ancient origin of this phylum, estimated at approximately 700–900 Myr (Douzery et al. 2004), we perform our genome-scale phylogenetic inference at the protein level. The robust inference of the organismal phylogeny based on genomic data provides a solid foundation for comparative studies that improve our knowledge of apicomplexan evolution. In addition to facilitating the planning of future phylogenetic studies that involve other closely related pathogens, our systematic investigation of incongruence among gene trees can improve our understanding of multigene phylogenetic inference in general.


    Materials and Methods
 
Data Sources and Ortholog Identification
Our data set contains seven apicomplexan species that have fully annotated genome sequences available, including Babesia bovis (Brayton et al. 2007) from GenBank (GenBank accession numbers AAXT01000001–AAXT01000013), Cryptosporidium parvum (Abrahamsen et al. 2004) from CryptoDB.org (Heiges et al. 2006), Eimeria tenella from GeneDB.org (Hertz-Fowler et al. 2004), Plasmodium falciparum (Gardner et al. 2002) and Plasmodium vivax from PlasmoDB.org (Bahl et al. 2003), Theileria annulata (Pain et al. 2005) from GeneDB.org (Hertz-Fowler et al. 2004), and Toxoplasma gondii from ToxoDB.org (Gajria et al. 2008). A free-living ciliate, Tetrahymena thermophila (Eisen et al. 2006), is included as the outgroup. For each species, we obtained all annotated proteins in the genome for ortholog identification. The data sources and protein-encoding gene counts are summarized in table 1.



 
Table 1 List of Species Name Abbreviations and Data Sources

 
Orthologous genes were identified using OrthoMCL (Li et al. 2003) (version 1.3) with BLASTP (Altschul et al. 1990) and the E-value cutoff set to 1 × 10^-30. The ortholog identification process in OrthoMCL is largely based on the popular criterion of reciprocal best hits but also involves an additional step of Markov Clustering (van Dongen 2000) to improve sensitivity and specificity. A benchmarking study has found that this algorithm performed well among available methods for ortholog identification (Hulsen et al. 2006). We selected the orthologous genes that are shared by all eight species to infer the gene tree. Orthologous gene clusters that contain more than one gene from any given species were removed to avoid the complications introduced by paralogous genes in phylogenetic inference.
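The single-copy selection step described above can be illustrated with a short Python sketch. This is not the authors' pipeline: the species labels, the cluster data structure, and the function name `single_copy_clusters` are hypothetical, and the input is assumed to be ortholog clusters already produced by a tool such as OrthoMCL.

```python
from collections import Counter

# Hypothetical two-letter species labels (assumption); the paper's own
# abbreviations are given in its table 1.
SPECIES = {"Bb", "Cp", "Et", "Pf", "Pv", "Ta", "Tg", "Tt"}


def single_copy_clusters(clusters, species=SPECIES):
    """Keep clusters that contain exactly one gene from every species.

    `clusters` maps a cluster ID to a list of (species, gene_id) pairs.
    Clusters missing a species, or holding more than one gene from any
    species (possible paralogs), are dropped.
    """
    kept = {}
    for cluster_id, members in clusters.items():
        counts = Counter(sp for sp, _ in members)
        shared_by_all = set(counts) == set(species)
        single_copy = all(n == 1 for n in counts.values())
        if shared_by_all and single_copy:
            kept[cluster_id] = members
    return kept
```

Under these assumptions, a cluster with two P. falciparum genes or one lacking a T. thermophila member would both be excluded, which mirrors the two filtering criteria stated in the text.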

Phylogenetic Inference
The program ClustalW (Thompson et al. 1994) (version 1.83) was used for multiple sequence alignment. The "tossgaps" option was enabled to ignore gaps when constructing the guide tree, and all other parameters were set to the default values unless specifically stated otherwise. The alignments produced by ClustalW were filtered by GBLOCKS (Castresana 2000) (version 0.91b) using default settings to remove regions that contain gaps or are highly divergent. The resulting amino acid alignment for each gene (provided in supplementary data file 1, Supplementary Material online) was used in the main phylogenetic analysis as described below; a codon-based nucleotide alignment for each gene was generated by PAL2NAL (Suyama et al. 2006) and is provided in supplementary data file 2 (Supplementary Material online).

Three phylogenetic methods, including maximum likelihood (ML), maximum parsimony (MP), and Neighbor-Joining (NJ), were used to infer the gene tree for each individual gene. ML inferences were performed using PHYML (Guindon and Gascuel 2003). The proportion of invariant sites and the gamma-distribution parameter with eight substitution categories were estimated from the data set. The substitution model was set to JTT (Jones et al. 1992), and we enabled the optimization options for tree topology, branch lengths, and rate parameters. MP trees were constructed using PROTPARS in the PHYLIP package (Felsenstein 1989) (version 3.65) with 100 randomizations of input order. When more than one equally parsimonious tree was found for a given gene, the strict consensus tree of all equally parsimonious trees was used as the MP tree of this gene. NJ trees were constructed using NEIGHBOR in the PHYLIP package with species input order randomization enabled. The distance matrices were calculated by Tree-Puzzle (Schmidt et al. 2002) (version 5.2). The parameters used in Tree-Puzzle were set to the JTT substitution model, the mixed model of rate heterogeneity with one invariant and eight gamma rate categories, and the exact and slow parameter estimation. The level of bootstrap support for each gene was inferred by 100 resamplings of the alignment using SEQBOOT in the PHYLIP package followed by ML inference.

To investigate the sensitivity of a gene to the multiple sequence alignment parameter, we varied the gap opening penalty by 2-fold in both directions (i.e., increased the default cost from 10 to 20 or decreased it to 5) and inferred the gene tree under each setting. Individual genes are classified into three categories including robust, intermediate, and sensitive based on the ML gene-tree topologies from the three gap opening penalties examined. A gene is classified as robust if all three settings generated the same topology, intermediate if two out of the three settings generated the same topology, or sensitive if each setting generated a different topology.
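The three-way classification above reduces to counting distinct topologies. The following Python sketch shows that rule; the function name is hypothetical, and each topology is assumed to be supplied in some canonical, hashable form (for example a normalized string or a frozenset of bipartitions), so that equal topologies compare equal.

```python
def classify_sensitivity(topologies):
    """Classify a gene from its three ML topologies.

    `topologies` holds one canonical, hashable topology per gap opening
    penalty tested (5, 10, and 20 in the text). The labels follow the
    text: all identical -> robust, two identical -> intermediate,
    all different -> sensitive.
    """
    if len(topologies) != 3:
        raise ValueError("expected exactly three topologies")
    n_distinct = len(set(topologies))
    return {1: "robust", 2: "intermediate", 3: "sensitive"}[n_distinct]
```

The canonical form matters: two Newick strings that describe the same unrooted topology with a different leaf order would otherwise be counted as distinct.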

To investigate the effect of the substitution model used on the resulting gene-tree topology, we performed ML inference for each gene using two additional substitution models, including LG (Le and Gascuel 2008) and WAG (Whelan and Goldman 2001). The resulting gene trees are compared with the topology obtained using the JTT model (Jones et al. 1992).

Inference of the Species Tree
The species tree was inferred using two different approaches. The first approach was based on the consensus of individual gene trees. The consensus tree was inferred by the CONSENSE program in the PHYLIP package using extended majority rule. Gene trees inferred by different phylogenetic methods (i.e., ML, MP, and NJ) were analyzed separately. The second approach was based on the concatenated alignment of all individual genes following the phylogenetic inference procedures as described above.
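The idea of consensus support, that is, the percentage of gene trees that contain a given grouping, can be sketched in Python as below. This only tallies bipartition frequencies; it does not reproduce the extended majority rule tree construction performed by CONSENSE. The function name and the input representation (each gene tree given as a set of frozensets of taxon labels) are assumptions made for illustration.

```python
from collections import Counter


def consensus_support(gene_tree_splits):
    """Return the percentage of gene trees that contain each bipartition.

    `gene_tree_splits` is a list with one entry per gene tree; each entry
    is a set of bipartitions, and each bipartition is a frozenset of the
    taxon labels on one (consistently chosen) side of an internal branch.
    """
    n_trees = len(gene_tree_splits)
    counts = Counter(split for splits in gene_tree_splits for split in splits)
    return {split: 100.0 * n / n_trees for split, n in counts.items()}
```

Because each gene tree contributes a set, a bipartition is counted at most once per gene, so the resulting percentages are directly comparable to the per-branch consensus support values discussed in the Results.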

Characterization of Gene Trees
The topology distance between each gene tree and the species tree was calculated based on the symmetric difference (Robinson and Foulds 1981) as implemented in TREEDIST in the PHYLIP package. For genes that inferred a topology that is different from the species tree, we performed the approximately unbiased (AU) test (Shimodaira 2002) and the Shimodaira–Hasegawa (SH) test (Shimodaira and Hasegawa 1999) using the CONSEL package (Shimodaira and Hasegawa 2001) to test if the species tree topology is significantly rejected by a gene.
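For readers unfamiliar with the symmetric difference, the following Python sketch computes it for two trees over the same taxa by comparing their nontrivial bipartitions. It is a minimal illustration, not the TREEDIST implementation: trees are assumed to be given as nested tuples of leaf labels, branch lengths are ignored, and the two-letter species labels are hypothetical.

```python
def leaf_set(tree):
    """Return the frozenset of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the nontrivial bipartitions of a tree, treated as unrooted.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference leaf, so the result does not depend on where the tuple
    representation happens to be rooted.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        other = all_leaves - clade
        # Skip trivial splits (a single leaf versus everything else).
        if len(clade) >= 2 and len(other) >= 2:
            splits.add(other if reference in clade else clade)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def symmetric_difference(tree_a, tree_b):
    """Count bipartitions present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# The species tree topology described in the abstract, with hypothetical labels.
species_tree = ("Tt", ("Cp", (("Et", "Tg"), (("Pf", "Pv"), ("Bb", "Ta")))))
# An alternative in which Cryptosporidium groups with the coccidia.
alternative = ("Tt", (("Cp", ("Et", "Tg")), (("Pf", "Pv"), ("Bb", "Ta"))))
print(symmetric_difference(species_tree, alternative))
```

A distance of zero means the two topologies are identical; moving a single taxon across one internal branch, as in the alternative tree here, changes one bipartition in each tree.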

Taxon Removal Tests
To evaluate the potential influence of long-branch attraction (LBA), we removed either of the two taxa that have a long terminal branch (i.e., the outgroup T. thermophila and the ingroup C. parvum) and repeated the phylogenetic inference for each gene. Our procedure is conceptually similar to the taxon jackknife method (Siddall 1995) but contains one important distinction. The traditional taxon jackknife method removes a taxon after multiple sequence alignment and prior to tree reconstruction. However, the taxon being removed still affects the alignment and thus can influence the resulting tree. We chose to perform the taxon removal prior to multiple sequence alignment to eliminate any effect on the phylogenetic inference from the taxon being removed.


    Results and Discussion
 
Ortholog Identification
From the seven apicomplexans and the one ciliate examined, we identified 268 single-copy genes that are shared by all eight species. These genes represent less than 10% of the annotated genes from the smallest genome (table 1), indicating that these organisms are highly divergent in their gene content. The long evolutionary distance between ciliates and apicomplexans only partially explains this observation. When the outgroup is not considered, the seven apicomplexans share 508 orthologous genes (of which 433 are single copy in all species). One of our previous studies that examined a different set of apicomplexan species produced similar results and suggested that 28–45% of the genes in an apicomplexan genome are genus-specific (Kuo and Kissinger 2008). This high level of divergence in gene content is consistent with the ancient origin of the phylum. The divergence time between apicomplexans and ciliates was estimated to be in the range of 700–900 Myr based on 129 genes from 36 eukaryotes (Douzery et al. 2004).

For the purpose of phylogenetic analysis, we focus on the 268 single-copy genes shared by all eight species. Many of these genes are responsible for basic cellular processes (e.g., DNA replication, transcription, translation, etc.), as noted in our previous study (Kuo and Kissinger 2008). The sequence identity and annotation information of these genes are provided in supplementary table S1 (Supplementary Material online).

The Apicomplexan Species Tree
The species tree was inferred using two different approaches. The first approach calculated the consensus tree among the 268 individual gene trees, and the second approach utilized a concatenated alignment of 71,830 amino acid sites. Both approaches resulted in the same species tree topology (fig. 1) by all three phylogenetic methods used. Groupings of three species pairs, including P. falciparum and P. vivax, B. bovis and T. annulata, and E. tenella and T. gondii, are supported by 87% or more of the genes based on ML consensus. In contrast, the two short internal branches are supported by less than 50% of the genes. Nevertheless, all internal branches received 100% ML bootstrap support based on the analysis of the concatenated alignment.


 
FIG. 1.— The inferred apicomplexan species tree. The ML tree is generated from the concatenated alignment of 268 single-copy genes (71,830 aligned amino acid sites). One free-living ciliate, Tetrahymena thermophila, is included as the outgroup to root the tree. Bootstrap support based on 100 replicates is 100% for all internal branches. Labels above branches indicate the level of consensus support (%) based on ML, MP, and NJ.

 
This tree topology is consistent with most of our prior understanding of apicomplexan evolution based on morphology and development (Perkins et al. 2000), rDNA analyses (Escalante and Ayala 1995; Morrison and Ellis 1997), and multigene phylogenies (Douzery et al. 2004; Philippe et al. 2004; Kuo and Kissinger 2008). The piroplasmids (represented by B. bovis and T. annulata) form a sister group to the haemosporidians (represented by the Plasmodium lineage) with the cyst-forming coccidia (represented by E. tenella and T. gondii) as the next closely related group. Although the Cryptosporidium lineage was classified as a coccidian in early taxonomy work (Levine 1984), our result provides further support to the growing consensus that this lineage is basal to other apicomplexans and separate from other coccidia (Carreno et al. 1999; Zhu et al. 2000; Leander et al. 2003).

The Distribution of Gene Trees
Examination of individual genes revealed a seemingly high degree of incongruence among gene trees. Of the 268 gene trees examined, we observed a total of 48 topologies based on ML analysis (fig. 2). The most frequently observed topology (fig. 3A) is consistent with the putative species tree and is supported by 19% of the genes. Each of the next three frequent topologies (fig. 3B–D) is supported by approximately 7–10% of the genes and is different in the placement of C. parvum. Two additional topologies (fig. 3E and F) are supported by 6% of the genes and exhibit alternative placements of the Plasmodium lineage. The observation that only a relatively small number of topologies are found may be attributed to our limited taxon sampling of eight species. For example, in an analysis of 106 genes from 14 yeast species, Jeffroy et al. (2006) found that each of the genes analyzed supports a distinct topology.


 
FIG. 2.— Frequency distribution of gene-tree topologies. Based on the 268 single-copy genes examined, we observed a total of 48 gene-tree topologies. The six most frequently observed gene-tree topologies, each supported by more than 5% of the genes, are provided in figure 3.

 

 
FIG. 3.— The six most frequently observed gene-tree topologies. Each topology is supported by more than 5% of the 268 genes examined. The exact count and frequency of genes that support (or significantly reject) each topology are provided under the tree. ML: frequency of genes that infer the specific topology using ML inference; AU: frequency of genes that significantly reject the topology using AU test; SH: frequency of genes that significantly reject the topology using SH test.

 
Despite the seemingly high level of incongruence among gene trees, only 16 genes significantly reject the putative species tree topology in the AU test (Shimodaira 2002). When using the more conservative SH test (Shimodaira and Hasegawa 1999), only two genes significantly reject the putative species tree. The first gene is annotated as a hypothetical protein in P. falciparum (gene ID: PF14_0326) and exhibits a high level of length variation among the species examined (i.e., varied from 2,452 amino acids in E. tenella to 8,094 amino acids in P. falciparum). The conserved regions that can be reliably aligned only account for 3% of the alignment. The second gene is annotated as a putative RNA-binding protein in P. falciparum (gene ID: PF08_0086) and also exhibits a high level of length variation (i.e., varied from 271 amino acids in B. bovis to 1,076 amino acids in P. vivax). The protein alignment obtained after GBLOCKS filtering only contains 29 sites. Based on the pattern of sequence length variation, we suspect that the gene annotations may be problematic in some of the species. For this reason, further analysis of these two genes was not pursued.

The finding of a high level of topological incongruence among gene trees that lack statistical significance has been reported in previous genome-scale phylogenetic studies. Lerat et al. (2003) examined 205 single-copy genes shared by 13 gamma-Proteobacteria species and found only two significantly rejected the putative species tree in the SH test. In both cases, the discordance between the gene tree and the putative species tree can be explained by a single lateral gene transfer (LGT) event. Similarly, examinations of the 106 single-copy genes shared by a group of Saccharomyces spp. showed that the majority of bipartition conflicts among genes have low bootstrap support (Taylor and Piel 2004; Jeffroy et al. 2006).

One possible hypothesis to explain the rare occurrences of a gene significantly rejecting the species tree is that single-copy genes are unlikely to be involved in LGT events (Daubin et al. 2002, 2003). Under this hypothesis, these genes have been confined in the organismal phylogeny throughout their evolutionary history, so the gene-tree topology is unlikely to be radically different from the species tree. By focusing on a small subset of genes that are highly conserved across all apicomplexan lineages examined, our methodology for orthologous gene selection may have effectively excluded genes that experienced LGT since the ciliate–apicomplexan divergence. Although LGT does not appear to influence our phylogenetic inference as presented here, caution should be taken in future studies because several previous studies suggest that LGT is an important evolutionary force in apicomplexans (Huang, Mullapudi, Lancto, et al. 2004; Huang, Mullapudi, Sicheritz-Ponten, and Kissinger 2004; Striepen et al. 2004; Nagamune and Sibley 2006) and other protists (Gogarten 2003; Richards et al. 2003; Andersson 2005).

Evaluation of Phylogenetic Signal by Bootstrap Support
To test if the observed topological incongruence among gene trees can be explained by a low resolving power for certain clades in some genes, we used the minimum bootstrap value observed in a gene tree to identify genes that possess strong phylogenetic signals. The results indicate that the percentage of genes that support the putative species tree increases as a function of the bootstrap cutoff used (table 2). In the most extreme example, when only the genes with a minimum bootstrap value of 90% at any node are examined, all five genes that meet this cutoff support the putative species tree topology. Even when the selection stringency is relaxed to a 70% bootstrap support, a cutoff that is commonly used in phylogenetic inference (Hillis and Bull 1993), 47% of these genes are consistent with the putative species tree and the two short internal branches received at least 60% of the consensus support. Curiously, we did not find any significant correlation between bootstrap support and alignment length, average pairwise protein distance, or other attributes of genes (supplementary table S1, Supplementary Material online).
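The minimum-bootstrap filter described above can be sketched in Python as follows. The record layout (`node_bootstraps`, `matches_species_tree`), the function name, and the default cutoffs are assumptions made for illustration; the sketch only shows how a gene's weakest node decides whether it passes a cutoff, and it does not reproduce the values in table 2.

```python
def support_by_cutoff(genes, cutoffs=(50, 70, 90)):
    """Summarize agreement with the species tree at each bootstrap cutoff.

    `genes` is a list of dicts with two hypothetical keys:
      - "node_bootstraps": bootstrap percentages for all internal nodes
      - "matches_species_tree": True if the gene tree equals the species tree
    A gene passes a cutoff only if its weakest node meets the cutoff.
    Returns {cutoff: (number of passing genes, fraction matching)}, with
    a fraction of None when no gene passes.
    """
    summary = {}
    for cutoff in cutoffs:
        passing = [g for g in genes if min(g["node_bootstraps"]) >= cutoff]
        n_match = sum(g["matches_species_tree"] for g in passing)
        fraction = n_match / len(passing) if passing else None
        summary[cutoff] = (len(passing), fraction)
    return summary
```

Using the minimum rather than the mean means that a single poorly resolved node is enough to exclude a gene, which is what makes the criterion stringent.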



 
Table 2 Effects of Removing Genes Based on the Minimum Bootstrap Support

 
In addition to being consistent with the putative species tree, genes with strong bootstrap support are often insensitive to changes in alignment parameter (table 3), substitution model (table 4), or the phylogenetic method used (table 5). In these tests, we are interested in investigating if a gene could infer the same gene-tree topology across a range of settings used in the phylogenetic inference process; the agreement between the gene-tree topology and the putative species tree is not considered. At 70% minimum bootstrap cutoff, we found that 90% of these genes are robust to a 4-fold change in the gap opening penalty (table 3), 93% of the genes are insensitive to the choice of substitution model (table 4), and 57% of the genes behave consistently across different phylogenetic methods (table 5). Although the use of methodological concordance as a criterion for selecting genes for phylogenetic inference was criticized (Grant and Kluge 2003), our results suggest that a gene is more likely to behave consistently across different phylogenetic methods when it contains a strong phylogenetic signal.



 
Table 3 Robustness to Alignment Settings as a Function of the Minimum Bootstrap Support

 


 
Table 4 Robustness to Substitution Model as a Function of the Minimum Bootstrap Support

 


 
Table 5 Methodological Concordance as a Function of the Minimum Bootstrap Support

 
Removal of the Long Branches
In addition to the low signal-to-noise ratio in some genes, another possible source of incongruence among gene trees is the LBA problem that resulted from our nonideal taxon sampling. Several observations support this hypothesis. First, when a gene behaved inconsistently across different phylogenetic methods, ML and NJ often result in an identical gene-tree topology that is different from MP (table 5). In addition, the outgroup T. thermophila and the ingroup C. parvum both have a long evolutionary distance to the other taxa (fig. 1). The lack of additional species that can be used to break up the long branch leading to the Cryptosporidium lineage may be responsible for its unstable phylogenetic placement, as evidenced by the fact that three of the most frequently observed gene-tree topologies involve alternative placement of C. parvum (fig. 3B–D). Although the genome sequence of C. hominis is available, adding this species is not particularly helpful. The genomes of these two Cryptosporidium spp. exhibit only 3–5% divergence at the nucleotide level (Xu et al. 2004). For the 268 conserved proteins that we used for phylogenetic inference, the sequences from these two species are essentially identical (data not shown).

The issue of nonideal taxon sampling reflects a limitation that is often faced by genome-scale phylogenetic inferences (Soltis et al. 2004). To circumvent this limitation, we utilized two other commonly suggested approaches to address the LBA problem (Bergsten 2005). First, all sites that contain gaps or are highly divergent were removed from the alignment prior to phylogenetic inference by GBLOCKS (see Materials and Methods). Second, we removed either the outgroup T. thermophila or the ingroup C. parvum prior to sequence alignment and repeated the phylogenetic inference.

When the outgroup is removed from the data set, we observed a large increase in the consensus support for the Plasmodium–Babesia–Theileria clade (table 6). Two alternative bipartitions, as shown in panels E and F of figure 3, received substantially weaker consensus support regardless of the minimum bootstrap cutoff used. Removal of the ingroup C. parvum resulted in a reduction of the number of observed gene-tree topologies (table 6), but the consensus support for the Plasmodium–Babesia–Theileria clade is relatively low compared with the removal of T. thermophila.



 
Table 6 Effects of Taxon Removal

 

    Conclusion
 
The recent availability of genome sequences allowed us to infer an organismal phylogeny that includes several important apicomplexan pathogens with high confidence. This robust species tree provides a solid foundation for future comparative studies that can improve our understanding of apicomplexan evolution and parasite biology. Although the level of incongruence among gene trees appears to be high at first glance, further investigation indicates that most of the observed conflict does not have strong statistical support. Interestingly, the minimum bootstrap support observed in a gene tree appears to be a useful predictor of phylogenetic performance. Genes that produce strong bootstrap support for all internal branches are more likely to be consistent with the species tree and robust to changes in the alignment parameter or the phylogenetic method used. Nevertheless, examination of multiple unlinked genes with strong phylogenetic signals is important for accurate phylogenetic inference because any single gene can have a different evolutionary history from the organismal phylogeny. Our systematic investigation provides a list of phylogenetically informative genes in the phylum Apicomplexa. These genes are good candidates for future sequencing efforts that aim at improving taxon sampling in this group of important pathogens.


    Supplementary Material
 
Supplementary data files 1 and 2 and table S1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
 
C.-H.K. was supported by a National Institutes of Health (NIH) Training Grant (GM07103), the Kirby and Jan Alton Graduate Fellowship, and a Dissertation Completion Assistantship at the University of Georgia. Funding for this work was provided by NIH R01 AI068908 to J.C.K. P. Brunk, F. Chen, J. Felsenstein, M. Heiges, A. Oliveira, E. Robinson, and H. Wang provided valuable assistance on the use of computer hardware and software. We thank the J. Craig Venter Institute for providing prepublication access to the genome sequence data of P. vivax and T. gondii. The associate editor, Dr Hervé Philippe, and three anonymous reviewers provided constructive comments that greatly improved this manuscript.


    Footnotes
 
1 Present address: Department of Ecology and Evolutionary Biology, University of Arizona.

Hervé Philippe, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 

    Abrahamsen MS, Templeton TJ, Enomoto S, et al, (20 co-authors). Complete genome sequence of the apicomplexan, Cryptosporidium parvum. Science (2004) 304:441–445.[Abstract/Free Full Text]

    Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol (1990) 215:403–410.[CrossRef][Web of Science][Medline]

    Andersson JO. Lateral gene transfer in eukaryotes. Cell Mol Life Sci (2005) 62:1182–1197.[CrossRef][Web of Science][Medline]

    Bahl A, Brunk B, Crabtree J, et al, (18 co-authors). PlasmoDB: the Plasmodium genome resource. A database integrating experimental and computational data. Nucleic Acids Res (2003) 31:212–215.[Abstract/Free Full Text]

    Bergsten J. A review of long-branch attraction. Cladistics (2005) 21:163–193.[CrossRef][Web of Science]

    Brayton KA, Lau AOT, Herndon DR, et al, (28 co-authors). Genome sequence of Babesia bovis and comparative analysis of apicomplexan hemoprotozoa. PLoS Pathog (2007) 3:e148.

    Carlton J. Genome sequencing and comparative genomics of tropical disease pathogens. Cell Microbiol (2003) 5:861–873.[CrossRef][Web of Science][Medline]

    Carreno RA, Matrin DS, Barta JR. Cryptosporidium is more closely related to the gregarines than to coccidia as shown by phylogenetic analysis of apicomplexan parasites inferred using small-subunit ribosomal RNA gene sequences. Parasitol Res (1999) 85:899–904.[CrossRef][Web of Science][Medline]

    Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol (2000) 17:540–552.[Abstract/Free Full Text]

    Collins TM, Fedrigo O, Naylor GJP. Choosing the best genes for the job: the case for stationary genes in genome-scale phylogenetics. Syst Biol (2005) 54:493–500.[CrossRef][Web of Science][Medline]

    Daubin V, Gouy M, Perriere G. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res (2002) 12:1080–1090.[Abstract/Free Full Text]

    Daubin V, Moran NA, Ochman H. Phylogenetics and the cohesion of bacterial genomes. Science (2003) 301:829–832.[Abstract/Free Full Text]

    Delsuc F, Brinkmann H, Philippe H. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet (2005) 6:361–375.[Web of Science][Medline]

    Dopazo H, Dopazo J. Genome-scale evidence of the nematode-arthropod clade. Genome Biol (2005) 6:R41.[CrossRef][Medline]

    Douzery EJP, Snell EA, Bapteste E, Delsuc F, Philippe H. The timing of eukaryotic evolution: does a relaxed molecular clock reconcile proteins and fossils? Proc Natl Acad Sci USA (2004) 101:15386–15391.[Abstract/Free Full Text]

    Eisen JA, Coyne RS, Wu M, et al, (53 co-authors). Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote. PLoS Biol (2006) 4:1620–1642.[Web of Science]

    Escalante A, Ayala F. Evolutionary origin of Plasmodium and other Apicomplexa based on rRNA genes. Proc Natl Acad Sci USA (1995) 92:5793–5797.[Abstract/Free Full Text]

    Felsenstein J. PHYLIP—phylogeny inference package (version 3.2). Cladistics (1989) 5:164–166.

    Gajria B, Bahl A, Brestelli J, et al, (15 co-authors). ToxoDB: an integrated Toxoplasma gondii database resource. Nucleic Acids Res (2008) gkm981. 36:D553–D556.

    Gardner MJ, Bishop R, Shah T, et al, (44 co-authors). Genome sequence of Theileria parva, a bovine pathogen that transforms lymphocytes. Science (2005) 309:134–137.[Abstract/Free Full Text]

    Gardner MJ, Hall N, Fung E, et al, (45 co-authors). Genome sequence of the human malaria parasite Plasmodium falciparum. Nature (2002) 419:498–511.[CrossRef][Web of Science][Medline]

    Gatesy J, Baker RH. Hidden likelihood support in genomic data: can forty-five wrongs make a right? Syst Biol (2005) 54:483–492.[CrossRef][Web of Science][Medline]

    Gee H. Evolution: ending incongruence. Nature (2003) 425. 782–782.

    Gogarten JP. Gene transfer: gene swapping craze reaches eukaryotes. Curr Biol (2003) 13:R53–R54.[CrossRef][Web of Science][Medline]

    Grant T, Kluge AG. Data exploration in phylogenetic inference: scientific, heuristic, or neither. Cladistics (2003) 19:379–418.[CrossRef][Web of Science]

    Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol (2003) 52:696–704.[CrossRef][Web of Science][Medline]

    Heiges M, Wang HM, Robinson E, et al, (13 co-authors). CryptoDB: a Cryptosporidium bioinformatics resource update. Nucleic Acids Res (2006) 34:D419–D422.[Abstract/Free Full Text]

    Hertz-Fowler C, Peacock CS, Wood V, et al, (14 co-authors). GeneDB: a resource for prokaryotic and eukaryotic organisms. Nucleic Acids Res (2004) 32:D339–D343.[Abstract/Free Full Text]

    Hillis DM, Bull JJ. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst Biol (1993) 42:182–192.

    Huang J, Mullapudi N, Lancto CA, Scott M, Abrahamsen MS, Kissinger JC. Cryptosporidium parvum: phylogenomic evidence supports past endosymbiosis, intracellular and horizontal gene transfer. Genome Biol (2004) 5:R88.[CrossRef][Medline]

    Huang JL, Mullapudi N, Sicheritz-Ponten T, Kissinger JC. A first glimpse into the pattern and scale of gene transfer in the Apicomplexa. Int J Parasitol (2004) 34:265–274.[CrossRef][Web of Science][Medline]

    Huelsenbeck JP, Bull JJ, Cunningham CW. Combining data in phylogenetic analysis. Trends Ecol Evol (1996) 11:152–158.[CrossRef]

    Hulsen T, Huynen MA, de Vlieg J, Groenen PMA. Benchmarking ortholog identification methods using functional genomics data. Genome Biol (2006) 7:R31.[CrossRef][Medline]

    Jeffroy O, Brinkmann H, Delsuc F, Philippe H. Phylogenomics: the beginning of incongruence? Trends Genet (2006) 22:225–231.[CrossRef][Web of Science][Medline]

    Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci (1992) 8:275–282.[Abstract/Free Full Text]

    Kuo C-H, Kissinger JC. Consistent and contrasting properties of lineage-specific genes in the apicomplexan parasites Plasmodium and Theileria. BMC Evol Biol (2008) 8:108.[CrossRef][Medline]

    Le SQ, Gascuel O. An improved general amino acid replacement matrix. Mol Biol Evol (2008) 25:1307–1320.[Abstract/Free Full Text]

    Leander BS, Harper JT, Keeling PJ. Molecular phylogeny and surface morphology of marine aseptate gregarines (apicomplexa): selenidium spp. and Lecudina spp. J Parasitol (2003) 89:1191–1205.[CrossRef][Medline]

    Lerat E, Daubin V, Moran NA. From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-proteobacteria. PLoS Biol (2003) 1:101–109.[CrossRef][Web of Science]

    Levine ND. Taxonomy and review of the coccidian genus Cryptosporidium (Protozoa, Apicomplexa). J Protozool (1984) 31:94–98.[Medline]

    Levine ND. Progress in taxonomy of the Apicomplexan protozoa. J Eukaryot Microbiol (1988) 35:518–520.[CrossRef]

    Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res (2003) 13:2178–2189.[Abstract/Free Full Text]

    Montoya JG, Liesenfeld O. Toxoplasmosis. Lancet (2004) 363:1965–1976.[CrossRef][Web of Science][Medline]

    Morrison DA, Ellis JT. Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of Apicomplexa. Mol Biol Evol (1997) 14:428–441.[Abstract]

    Nagamune K, Sibley LD. Comparative genomic and phylogenetic analyses of calcium ATPases and calcium-regulated proteins in the Apicomplexa. Mol Biol Evol (2006) 23:1613–1627.[Abstract/Free Full Text]

    Nishihara H, Okada N, Hasegawa M. Rooting the eutherian tree: the power and pitfalls of phylogenomics. Genome Biol (2007) 8:R199.[CrossRef][Medline]

    Pain A, Renauld H, Berriman M, et al, (50 co-authors). Genome of the host-cell transforming parasite Theileria annulata compared with T. parva. Science (2005) 309:131–133.[Abstract/Free Full Text]

    Perkins FO, Barta JR, Clopton RE, Peirce MA, Upton SJ. Apicomplexa. In: An illustrated guide to the protozoa—Lee J, Leedale G, Bradbury P, eds. (2000) Lawrence (KS): Society of Protozoologists. 190–369.

    Philippe H, Delsuc F, Brinkmann H, Lartillot N. Phylogenomics. Annu Rev Ecol Evol Syst (2005) 36:541–562.[CrossRef]

    Philippe H, Lartillot N, Brinkmann H. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol Biol Evol (2005) 22:1246–1253.[Abstract/Free Full Text]

    Philippe H, Snell EA, Bapteste E, Lopez P, Holland PWH, Casane D. Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol Biol Evol (2004) 21:1740–1752.[Abstract/Free Full Text]

    Phillips MJ, Delsuc FD, Penny D. Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol (2004) 21:1455–1458.[Abstract/Free Full Text]

    Pollard DA, Iyer VN, Moses AM, Eisen MB. Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS Genet (2006) 2:1634–1647.[Web of Science]

    Richards TA, Hirt RP, Williams BAP, Embley TM. Horizontal gene transfer and the evolution of parasitic protozoa. Protist (2003) 154:17–32.[Medline]

    Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci (1981) 53:131–147.[CrossRef][Web of Science]

    Rodriguez-Ezpeleta N, Brinkmann H, Roure eacute atrice B, Lartillot N, Lang BF, Philippe H. Detecting and overcoming systematic errors in genome-scale phylogenies. Syst Biol (2007) 56:389–399.[CrossRef][Web of Science][Medline]

    Rokas A. Genomics and the tree of life. Science (2006) 313:1897–1899.[Abstract/Free Full Text]

    Rokas A, Williams BL, King N, Carroll SB. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature (2003) 425:798–804.[CrossRef][Web of Science][Medline]

    Schmidt HA, Strimmer K, Vingron M, von Haeseler A. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics (2002) 18:502–504.[Abstract/Free Full Text]

    Shimodaira H. An approximately unbiased test of phylogenetic tree selection. Syst Biol (2002) 51:492–508.[CrossRef][Web of Science][Medline]

    Shimodaira H, Hasegawa M. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol (1999) 16:1114–1116.[Web of Science]

    Shimodaira H, Hasegawa M. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics (2001) 17:1246–1247.[Abstract/Free Full Text]

    Siddall ME. Another monophyly index: revisiting the jackknife. Cladistics (1995) 11:33–56.[CrossRef][Web of Science]

    Soltis DE, Albert VA, Savolainen V, et al, (11 co-authors). Genome-scale data, angiosperm relationships, and ‘ending incongruence’: a cautionary tale in phylogenetics. Trends Plant Sci (2004) 9:477–483.[CrossRef][Web of Science][Medline]

    Striepen B, Pruijssers AJP, Huang JL, Li C, Gubbels MJ, Umejiego NN, Hedstrom L, Kissinger JC. Gene transfer in the evolution of parasite nucleotide biosynthesis. Proc Natl Acad Sci USA (2004) 101:3154–3159.[Abstract/Free Full Text]

    Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res (2006) 34:W609–W612.[Abstract/Free Full Text]

    Tarleton RL, Kissinger J. Parasite genomics: current status and future prospects. Curr Opin Immunol (2001) 13:395–402.[CrossRef][Web of Science][Medline]

    Taylor DJ, Piel WH. An assessment of accuracy, error, and conflict with support values from genome-scale phylogenetic data. Mol Biol Evol (2004) 21:1534–1537.[Abstract/Free Full Text]

    Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res (1994) 22:4673–4680.[Abstract/Free Full Text]

    van Dongen S. Graph clustering by flow simulation (2000) University of Utrecht.

    Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol (2001) 18:691–699.[Abstract/Free Full Text]

    WHO and UNICEF. World malaria report 2005 (2005) Geneva (Switzerland): World Health Organization.

    Xu P, Widmer G, Wang Y, et al, (18 co-authors). The genome of Cryptosporidium hominis. Nature (2004) 431:1107–1112.

    Zhu G, Keithly JS, Philippe H. What is the phylogenetic position of Cryptosporidium? Int J Syst Evol Microbiol (2000) 50:1673–1681.[Abstract]

Accepted for publication September 18, 2008.

