MBE Advance Access originally published online on August 4, 2006
Molecular Biology and Evolution 2006 23(11):2090-2100; doi:10.1093/molbev/msl080
© 2006 The Authors
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Research Articles
Exploring the Relationship between Sequence Similarity and Accurate Phylogenetic Trees

* Department of Biochemistry and Molecular Genetics, University of Virginia
Josephine Bay Paul Center for Comparative Molecular Biology and Evolution, Marine Biological Laboratory, Woods Hole, Massachusetts
E-mail: wrp{at}virginia.edu.
| Abstract |
|---|
We have characterized the relationship between accurate phylogenetic reconstruction and sequence similarity, testing whether high levels of sequence similarity can consistently produce accurate evolutionary trees. We generated protein families with known phylogenies using a modified version of the PAML/EVOLVER program that produces insertions and deletions as well as substitutions. Protein families were evolved over a range of 100–400 point accepted mutations; at these distances 63% of the families shared significant sequence similarity. Protein families were evolved using balanced and unbalanced trees, with ancient or recent radiations. In families sharing statistically significant similarity, about 60% of multiple sequence alignments were 95% identical to true alignments. To compare recovered topologies with true topologies, we used a score that reflects the fraction of clades that were correctly clustered. As expected, the accuracy of the phylogenies was greatest in the least divergent families. About 88% of phylogenies clustered over 80% of clades in families that shared significant sequence similarity, using Bayesian, parsimony, distance, and maximum likelihood methods. However, for protein families with short ancient branches (ancient radiation), only 30% of the most divergent (but statistically significant) families produced accurate phylogenies, and only about 70% of the second most highly conserved families, with median expectation values better than 10^-60, produced accurate trees. These values represent upper bounds on expected tree accuracy for sequences with a simple divergence history; proteins from 700 Giardia families, with a similar range of sequence similarities but considerably more gaps, produced much less accurate trees.
For our simulated insertions and deletions, correct multiple sequence alignments did not perform much better than those produced by T-COFFEE, and including sequences with expressed sequence tag-like sequencing errors did not significantly decrease phylogenetic accuracy. In general, although less-divergent sequence families produce more accurate trees, the likelihood of estimating an accurate tree is most dependent on whether radiation in the family was ancient or recent. Accuracy can be improved by combining genes from the same organism when creating species trees or by selecting protein families with the best bootstrap values in comprehensive studies.
Key Words: simulation; phylogenetic analysis; accuracy; sequence similarity
| Introduction |
|---|
Despite the expectation that abundant genome and expressed sequence tag (EST) sequence data would produce better evolutionary histories, particularly for organisms from deep branches in the "tree of life," phylogenies based on genome-scale data are often inconsistent (de la Cruz and Davies 2000
Compared with lateral gene transfer (LGT), computational artifacts such as long-branch attraction (LBA) provide a less provocative explanation for inconsistent phylogenies. LBA can result in the grouping of the fastest evolving taxa, irrespective of their true phylogenies (Felsenstein 1978), and is expected for sequences from deeply branching or rapidly evolving organisms. LBA can be reduced by increasing or modifying the taxonomic sample (Hillis 1996; Graybeal 1998; Hillis 1998; Zwickl and Hillis 2002), by focusing on slowly evolving positions (Felsenstein 1978; Olsen 1987; Brinkmann and Philippe 1999), and by improving the model of sequence evolution (Lockhart et al. 1996; Yang 1996; Hirt et al. 1999; Van de Peer et al. 2000). In general, LBA is thought to play a less important role than LGT when large numbers of proteins, some of which share strong sequence similarity, are considered.
It has been difficult to estimate the relative importance of LGT (biological) and LBA (methodological) as explanations for inconsistent trees, in part because we do not know how often phylogenetic methods fail for sequences that evolved vertically, and that can be identified by widely used sequence comparison methods like Blast (Altschul et al. 1990) and FASTA (Pearson and Lipman 1988). In this paper, we explore the relationship between sequence similarity and phylogenetic accuracy from various tree topologies and evolutionary models using protein sequences that share statistically significant similarity (i.e., that could be identified in a similarity search). We created simulated gene families over a broad range of evolutionary distances, using simple models of rate variation and gap insertion, and measured the accuracy of phylogenetic estimation methods. We were surprised to find that despite our simple evolutionary models, 10–40% of the reconstructed trees were less than 80% accurate, even for very highly conserved simulated protein families.
| Methods |
|---|
Generating Protein Families
Protein families were created using a modified version of EVOLVER (Yang 1997
Assessing Evolutionary Distances
To characterize the diversity of each protein family, all its members were compared using PRSS (Pearson and Lipman 1988), which estimates the statistical significance of a sequence similarity by shuffling one of the sequences. The expectation value (E(·) value) calculated by PRSS is equivalent to the E(·) value calculated by Blast; it reports the number of times a similarity score is expected to occur by chance in a database search against a sequence database of specified size. Here, we report E(10,000), the number of times an alignment score would be found by chance in a search of a 10,000 entry database. The protein families in this study have either 8 (unbalanced trees) or 16 (balanced trees) members. For each of the 8 (or 16) members of a family, 7 (or 15) pairwise expectation values were calculated. The median expectation value for the family is the lowest (most significant or most closely related) of the 8 (or 16) median expectation values of the individual family members.
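The family-level summary described above (a median per member, then the lowest of those medians) can be sketched as follows. This is only an illustration: the function name and the matrix layout are assumptions, and the E-values would come from PRSS runs that are not shown here.

```python
from statistics import median


def family_median_evalue(pairwise):
    """Return the family-level median expectation value.

    pairwise[i][j] holds the E-value for members i and j; the diagonal
    is ignored. (Hypothetical layout, chosen for this sketch.)
    """
    n = len(pairwise)
    member_medians = []
    for i in range(n):
        # 7 values for an 8-member family, 15 for a 16-member family.
        others = [pairwise[i][j] for j in range(n) if j != i]
        member_medians.append(median(others))
    # The lowest median is the most significant one.
    return min(member_medians)


# Toy 4-member family with made-up E-values.
toy = [
    [None, 1e-50, 1e-20, 1e-5],
    [1e-50, None, 1e-30, 1e-10],
    [1e-20, 1e-30, None, 1e-3],
    [1e-5, 1e-10, 1e-3, None],
]
print(family_median_evalue(toy))
```

Because each member has an odd number of partners in 8- and 16-member families, each per-member median is always one of the observed E-values rather than an average of two.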
Multiple Sequence Alignment Accuracy Analysis
Protein families were aligned using T-COFFEE (Notredame et al. 2000). True multiple sequence alignments were created using the known positions of changes at homologous sites produced by EVOLVER. Multiple alignment accuracy was evaluated with the VerAlign comparison software, which is available at www.ibivu.cs.vu.nl/programs/veralignwww/ (Simossis et al. 2005). VerAlign calculates a sum-of-pairs (SOP) score that reports the number of correctly aligned residues in each of the pairwise residue alignments implied by the multiple sequence alignment.
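A generic sum-of-pairs comparison between an estimated and a true alignment might look like the sketch below. It is not VerAlign's implementation, whose exact scoring may differ; the function names and the use of "-" as the gap character are assumptions made for illustration.

```python
from itertools import combinations


def aligned_pairs(row_a, row_b, gap="-"):
    """Residue index pairs (i, j) placed in the same column by two gapped rows."""
    pairs = set()
    i = j = 0
    for ca, cb in zip(row_a, row_b):
        if ca != gap and cb != gap:
            pairs.add((i, j))
        if ca != gap:
            i += 1
        if cb != gap:
            j += 1
    return pairs


def sum_of_pairs(test, reference):
    """Fraction of residue pairs in the reference alignment recovered by the test."""
    correct = total = 0
    for a, b in combinations(range(len(reference)), 2):
        ref_pairs = aligned_pairs(reference[a], reference[b])
        test_pairs = aligned_pairs(test[a], test[b])
        correct += len(ref_pairs & test_pairs)
        total += len(ref_pairs)
    return correct / total if total else 0.0


true_alignment = ["ACD-E", "ACDFE"]
estimated_alignment = ["ACDE-", "ACDFE"]  # E is misaligned against F
print(sum_of_pairs(estimated_alignment, true_alignment))
```

Rows must list the same sequences in the same order in both alignments; only the gap placement may differ.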
Phylogenetic Trees of Synthetic Families
Phylogenies were constructed using 4 methods: 1) distance (PHYLIP–Fitch, Fitch and Margoliash 1967; Felsenstein 1993), 2) maximum parsimony (PHYLIP–Protpars, Eck and Dayhoff 1966; Felsenstein 1993), 3) maximum likelihood (PHYML, Guindon and Gascuel 2003), and 4) Bayesian inference (MrBayes, Ronquist and Huelsenbeck 2003). For distance trees, each multiple sequence alignment of a protein family was used to make a distance matrix using PROTDIST (Felsenstein 1993), with the Jones–Taylor–Thornton method of amino acid replacement (Jones et al. 1992). Maximum likelihood (PHYML) reconstruction used Jones–Taylor–Thornton protein rate matrices and included the default gamma parameter for an "average" distribution of evolutionary rate variation over the protein sites (Guindon and Gascuel 2003). Bayesian analysis used PAM250 rate matrices (Dayhoff et al. 1978) and was done with and without including gamma-distributed rate variation across the sites. Bootstrap analysis (Felsenstein 1985) was performed using maximum likelihood.
Tree Evaluation
Tree accuracy was evaluated using 2 programs: TREEDIST (Felsenstein 1993) and TREESCORE (this work). TREEDIST calculates 2 measures of tree accuracy: symmetric difference (Robinson and Foulds 1981) and branch score difference (Kuhner and Felsenstein 1994). Symmetric difference measures the differences in the topology of 2 trees, whereas the branch score difference measures differences in topology and branch length as a root sum of squares. We normalized both distances by dividing them by the corresponding maximum distances of trees with random topologies.
Similar to CompareTree (Hall 2005), we also developed a measure of tree accuracy that directly counts the correct number of clades, TREESCORE. Each ancestor node and its descendants in the model tree represent a clade. The TREESCORE is the ratio of the number of correct clades contained in the estimated tree divided by the maximum possible number of clades. Clades containing only 2 leaf sequences were weighted half, to focus the measure on deeper clades that are more difficult to reconstruct. Thus, for a balanced unrooted tree (fig. 1A) with 16 taxa, the maximum number of correct clades is 9. For the unbalanced unrooted tree in figure 1C, the maximum number of correct clades is 4. In the analysis of the Giardia families, the "correct" clades are deep, so leaf clades were not downweighted.
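The weighting described above can be checked with a small sketch. This is a reconstruction from the text, not the TREESCORE program itself: trees are written as nested tuples, clades are treated as bipartitions of the unrooted tree, and the 8-taxon unbalanced tree is assumed to be fully pectinate. All function names are hypothetical.

```python
def splits(tree):
    """Non-trivial bipartitions of an unrooted tree given as nested tuples of leaf names."""
    leaves = set()
    clades = []

    def walk(node):
        if isinstance(node, str):
            leaves.add(node)
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        clades.append(members)
        return members

    walk(tree)
    result = set()
    for clade in clades:
        other = frozenset(leaves) - clade
        if len(clade) >= 2 and len(other) >= 2:
            # Store both sides so the two children of the root map to one split.
            result.add(frozenset([clade, other]))
    return result


def split_weight(split):
    """Clades with only 2 leaves count half; deeper clades count 1."""
    return 0.5 if min(len(side) for side in split) == 2 else 1.0


def max_clade_weight(tree):
    return sum(split_weight(s) for s in splits(tree))


def treescore(true_tree, estimated_tree):
    """Weighted fraction of true clades that also appear in the estimated tree."""
    true_splits = splits(true_tree)
    found = true_splits & splits(estimated_tree)
    return sum(split_weight(s) for s in found) / max_clade_weight(true_tree)


def balanced(names):
    """Build a balanced binary tree over a list of leaf names."""
    if len(names) == 1:
        return names[0]
    mid = len(names) // 2
    return (balanced(names[:mid]), balanced(names[mid:]))


balanced16 = balanced(["t%02d" % i for i in range(16)])
pectinate8 = ((((((("a", "b"), "c"), "d"), "e"), "f"), "g"), "h")
estimate8 = ((((((("a", "b"), "d"), "c"), "e"), "f"), "g"), "h")  # c and d swapped

print(max_clade_weight(balanced16), max_clade_weight(pectinate8))
print(treescore(pectinate8, estimate8))
```

Under these assumptions the maximum weights come out as 9 for the balanced 16-taxon tree and 4 for the 8-taxon pectinate tree, matching the values quoted in the text; swapping two adjacent taxa in the pectinate tree loses one full-weight clade and gives a score of 0.75.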
Giardia Phylogenies
For purposes of high-throughput phylogenetic tree generation for Giardia, each open reading frame (Morrison et al. 2004
1010. Multiple alignments were performed with T-COFFEE. Phylogenies were estimated using MrBayes (Ronquist and Huelsenbeck 2003
Synthetic EST Sequences
To examine the effect of EST errors in phylogeny estimation, we added additional substitution and insertion/deletion errors to the DNA sequences produced by EVOLVER and truncated the sequences to produce simulated ESTs. EST error rates were estimated by comparing Giardia lamblia EST sequences with the Giardia lamblia Genome Database (Morrison et al. 2004) using FASTA (Pearson and Lipman 1988) with match/mismatch +1/-3 and gap penalties -12/-4. Alignments were analyzed for the number of mismatches, insertions, and deletions in the EST sequences compared with the genome sequence. EST sequences had about 2.4% substitutions per read with 0.075% insertions and 0.053% deletions. Assembled contigs had about one-third as many errors (0.72%, 0.032%, and 0.018%). DNA sequences from each protein family were mutated according to an EST-like or EST-contig-like model. Because truncations of EST sequences occur at the 5' end, sequence truncation was incorporated into the model, based on the EST truncations seen in Giardia EST sequences. Only 16% of ESTs aligned with the N terminus of the protein; 36% aligned over 75% of the protein's length; 56% aligned over half the protein; and 84% of the ESTs covered at least 25% of the protein-coding region. We also considered a "limited truncation model," in which the EST started and ended randomly within the protein-coding region (based on the distribution of Giardia EST alignments), but ESTs were excluded if less than 50% of the protein was aligned. Protein translations were determined by comparing the sequence with other members in the family using FASTY (Pearson et al. 1997).
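An error model of this kind can be sketched in a few lines. The sketch below is illustrative only: it assumes the reported percentages act as independent per-base probabilities, and the function names, the insertion placement, and the simple fractional 5' truncation are assumptions rather than the procedure actually used.

```python
import random

BASES = "ACGT"


def add_est_errors(dna, sub_rate=0.024, ins_rate=0.00075, del_rate=0.00053, rng=None):
    """Return a copy of dna with random substitutions, insertions, and deletions.

    Default rates are the single-read values quoted in the text, treated
    here as per-base probabilities (an assumption of this sketch).
    """
    rng = rng or random.Random()
    out = []
    for base in dna:
        r = rng.random()
        if r < del_rate:
            continue  # deletion: drop this base
        if r < del_rate + sub_rate:
            # substitution: replace with one of the three other bases
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice(BASES))  # insertion after this base
    return "".join(out)


def truncate_5prime(dna, fraction_kept):
    """Keep only the 3' portion of the sequence, mimicking a 5'-truncated EST."""
    keep = int(round(len(dna) * fraction_kept))
    return dna[len(dna) - keep:]


rng = random.Random(1)
read = add_est_errors("ACGT" * 250, rng=rng)
print(len(read), truncate_5prime("AAACCCGGGTTT", 0.5))
```

Passing the contig rates (0.0072, 0.00032, 0.00018) instead of the defaults would give the EST-contig-like variant under the same assumptions.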
Sequence Concatenation
We generated 2 data sets of concatenated sequences. Proteins were concatenated from the same taxa evolved under the same tree topologies and model trees. In addition, we concatenated sequences that were evolved to the same evolutionary distance. These concatenated sequences were then analyzed as individual sequences.
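As a minimal sketch of the concatenation step, assuming each family is held as a mapping from taxon name to aligned sequence (a layout chosen here for illustration, with a hypothetical function name):

```python
def concatenate_families(families):
    """Join the sequences of each taxon across several aligned families.

    families is a list of dicts mapping taxon name -> aligned sequence.
    Every family must contain the same taxa.
    """
    taxa = set(families[0])
    for family in families[1:]:
        if set(family) != taxa:
            raise ValueError("all families must contain the same taxa")
    return {taxon: "".join(family[taxon] for family in families)
            for taxon in sorted(taxa)}


family_a = {"taxon1": "MKV-L", "taxon2": "MKVAL"}
family_b = {"taxon1": "GGS", "taxon2": "GAS"}
print(concatenate_families([family_a, family_b]))
```

The resulting per-taxon strings can then be treated as a single, longer alignment.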
| Results |
|---|
Because many phylogenomic studies rely on large-scale alignments of protein and DNA sequences that were identified by sequence similarity searches, we explored the relationship between sequence similarity and phylogenetic accuracy. We generated 2,400 protein sequence families from 48 model trees produced from 1) 4 tree topologies (fig. 1), 2) 3 radiation models, and 3) 4 evolutionary distances. The tree topologies consisted of two 16-taxon balanced (fig. 1A and B) and two 8-taxon unbalanced (fig. 1C and D) topologies; to preserve the distribution of internode distances, the unbalanced trees had fewer taxa. Branch lengths were specified using 3 radiation models: recent, uniform, and ancient. Recent radiation protein families have more evolutionary change before speciation events (fig. 1A), resulting in longer internal branches and shorter "leaf" branches. In recent radiation families, daughter branches were half the evolutionary distance of the parent branch, whereas in ancient radiation families (fig. 1C), daughter branches were twice as long as the parent. Ancient radiation trees are expected to be the most difficult to reconstruct accurately because of their short internal branches. The evolutionary distance of the internal branches provided to EVOLVER was constant for uniform radiation protein families, but EVOLVER introduces variation into the branch lengths of the trees it generates (fig. 1B). Even though there is a strong correlation between branch lengths provided to the EVOLVER program and evolutionary distances of the resulting trees, the rate variation used by EVOLVER can produce phylogenies with distributions of evolutionary distance that partially obscure the evolutionary model (fig. 1D).
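The three radiation models differ only in the ratio between a daughter branch and its parent, which the sketch below makes explicit for a balanced tree. Scaling the branches so that a root-to-tip path reaches a chosen target is an assumption added for illustration, as are the names used; the variation that EVOLVER introduces is not modeled.

```python
# Daughter-to-parent branch length ratio for each radiation model.
RADIATION_RATIOS = {"recent": 0.5, "uniform": 1.0, "ancient": 2.0}


def level_branch_lengths(levels, ratio, root_to_tip):
    """Branch lengths from the root level down to the leaf level.

    Each level's branch is `ratio` times its parent's, and the lengths
    are scaled so that they sum to `root_to_tip`.
    """
    weights = [ratio ** i for i in range(levels)]
    scale = root_to_tip / sum(weights)
    return [w * scale for w in weights]


# A balanced 16-taxon tree has 4 levels of branches below the root.
for model, ratio in RADIATION_RATIOS.items():
    lengths = level_branch_lengths(4, ratio, 100.0)
    print(model, [round(x, 1) for x in lengths])
```

With the ancient ratio the branches nearest the root are the shortest, which is the property that makes those trees hard to reconstruct.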
To explore the relationship between measured sequence similarity and phylogenetic accuracy, target evolutionary distances for the most distant sequences in the family ranged from 100 to 400 PAMs (Dayhoff et al. 1972, 1978). Protein families were grouped by median expectation value from the most divergent to the least divergent. About 37% of sequences did not share statistically significant sequence similarity and therefore would not have been found with a Blast search (Altschul et al. 1990). The remaining families, which shared significant median sequence similarity, were split into quintiles, from least similar (Q1) to most similar (Q5). As expected, groups with the best statistical significance (fig. 2A) had the lowest sequence divergence (fig. 2B). All 6 groups have similar distributions of lengths (fig. 2C), with the exception of the least divergent quintile (Q5), which had longer sequences, on average. Because we grouped families by median statistical significance, which depends both on sequence similarity and sequence length, quintiles overlap in their ranges of PAM distance. In addition to the sequence variation produced by the EVOLVER program codon model, we introduced insertions and deletions into the protein sequences, using parameters estimated by Benner et al. (1993), based on structural alignments (fig. 2D).
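The grouping into a not-significant (NS) class plus 5 quintiles can be sketched as below. The significance cutoff is passed as a parameter; the default of 1e-3 is an assumption for this illustration, as are the function name and the tie-handling.

```python
def assign_groups(median_evalues, threshold=1e-3, n_groups=5):
    """Label each family "NS" or "Q1".."Q5" from its median E-value.

    Q1 holds the least similar significant families (largest E-values)
    and Q5 the most similar (smallest E-values).
    """
    groups = {}
    significant = []
    for name, evalue in median_evalues.items():
        if evalue > threshold:
            groups[name] = "NS"
        else:
            significant.append(name)
    # Largest E-value first, so rank 0 is the least similar family.
    significant.sort(key=lambda name: median_evalues[name], reverse=True)
    for rank, name in enumerate(significant):
        quintile = rank * n_groups // len(significant) + 1
        groups[name] = "Q%d" % quintile
    return groups


# Made-up median E-values for 10 significant and 2 non-significant families.
toy_evalues = {"f%d" % i: 10.0 ** (-10 * (i + 1)) for i in range(10)}
toy_evalues.update({"ns1": 1.0, "ns2": 0.05})
print(assign_groups(toy_evalues))
```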
To confirm that our simulated protein families span a range of evolutionary distances similar to those likely to be encountered in a challenging eukaryotic data set, we examined 705 Giardia proteins with statistically significant homologs in 6 of 9 groups of organisms. The distributions of statistical significance, evolutionary distance, length, and ungapped alignment lengths for the Giardia protein families are also shown in figure 2. The conserved Giardia protein families show a distribution of sequence similarity and evolutionary distance that is very similar to the overall distribution of similarity and distance of our synthetic families that share statistically significant similarity (fig. 2, comparing + total with Giardia). However, the Giardia proteins have substantially shorter regions of ungapped sequence in the multiple alignments than the synthetic proteins do (fig. 2D). This presumably reflects domain insertions and deletions in Giardia, which are not part of the Benner insertion/deletion model.
The median similarity characterizes the "average" properties of each protein family; however, the ability of a family to reconstruct a phylogenetic tree accurately may depend more on the most distant members of the family. Table 1 summarizes the radiation models (ancient, uniform, or recent) in the different quintile groups and the statistical significance of the most distant members in each quintile. Thus, 72% of the proteins in the most distant quintile (Q1) were produced with either the ancient or the recent radiation model; the remaining families came from the uniform model. Surprisingly, 14% of the least divergent families produced with the recent radiation model have at least 1 pair of members that lacks statistically significant similarity (E(·) > 10^-3). In practice, these family members could be identified through transitive homology or with a profile sequence similarity program like PSI-Blast (Altschul et al. 1997). More of the ancient radiation protein families fall in the not-significant category; as a result, in all the significant quintiles, there are higher percentages of families from uniform or recent radiation models.
Multiple Sequence Alignment Accuracy
Sequence-based phylogenetic reconstruction strategies require multiple sequence alignments; inaccurate multiple alignments are often proposed to explain inconsistent phylogenies. To assess the multiple sequence alignment accuracy at different evolutionary distances, we compared our estimated T-COFFEE alignments with the true multiple sequence alignments recorded as the families were produced by EVOLVER. Sequence families that did not share statistically significant sequence similarity, and thus would not have been detected in a Blast search, were poorly aligned with T-COFFEE (fig. 3). More than half of these estimated alignments had SOP scores that were less than 60% of the scores produced by true alignments. However, over 90% of families with statistically significant sequence similarity had SOP scores greater than 80% of the true alignment (fig. 3). As expected, the estimated alignments look more like the true multiple sequence alignments as the amount of divergence decreases.
Phylogenetic Accuracy
To determine the relationship between evolutionary distance and tree accuracy, we estimated phylogenies for our simulated families using 4 different methods: 1) distance (fig. 4A), 2) parsimony (fig. 4B), 3) maximum likelihood (fig. 4C), and 4) Bayesian (fig. 4D). Our TREESCORE threshold for accuracy was 0.80; phylogenies at that score have only 1 (for 8-taxon families) or 2 (for 16-taxon families) incorrectly grouped clades within the tree. More than 95% of the recent radiation families (families with longer ancient branches) were more than 80% accurate (fig. 4) over the entire range of sequence similarities, including families whose members did not share statistically significant similarity. For ancient radiation families (families with short ancient branches), 80% accuracy was achieved in only 15–50% of the most divergent families that shared statistically significant similarity, depending on the phylogenetic reconstruction method used. Sixty to 90% of families from the uniform radiation model produced 80% accurate trees from the most divergent, but statistically significant, quintile (fig. 4C). All 4 tree-building methods produced similar trends, with each method performing better with some radiation models at some distances.
We measured phylogenetic accuracy in 3 ways: symmetric tree distance (Robinson and Foulds 1981
To evaluate the ability of our simulated families to reproduce the phylogenetic accuracy of actual protein families, we estimated phylogenies with Giardia open reading frames and their non-Giardia homologs using TREESCORE (figs. 5D and 6). Despite having a range of sequence similarities that is quite similar to the range of diversity seen in our statistically significant synthetic protein families, Giardia phylogenies are significantly less accurate in each of the similarity quartiles (fig. 5D). Giardia protein families have TREESCORE accuracies similar to those seen in the least statistically significant quintile (Q1) of our synthetic families (fig. 6). For example, 23% of phylogenetic trees built from Giardia families were at least 80% correct, compared with the 22% of ancient radiation Q1 families and 46% of Q2 families, using maximum likelihood. Thus, the Giardia protein family trees are substantially less accurate than one would expect, based on their statistical significance or expectation value. The lower consistency of the Giardia phylogenies presumably reflects the larger number of gaps in these sequences and the shorter ungapped regions (fig. 2D).
In addition to median sequence similarity, we characterized other properties of the Giardia protein homologs to see if we could identify a subset of proteins that were more likely to produce accurate evolutionary trees. We characterized the diversity of each family by estimating the statistical significance of sequence similarities of every pairwise alignment in the family and found that all the Giardia protein family members shared significant similarity, but families with the stronger sequence similarity produced considerably more accurate trees than the more divergent families (fig. 5B). We did not find a strong correlation between the length of ungapped regions and tree accuracy (data not shown).
Including Truncated Sequences Decreases Phylogenetic Accuracy
Because EST sequences can be readily and cheaply obtained, they are often included in phylogenetic analysis. But EST sequences also contain errors, so we examined the effect of including EST sequences in our phylogenetic analyses. For each of our 2,400 protein families, 1 mRNA sequence was altered using an EST-contig-like error model or an EST-like model (Methods, fig. 7). Phylogenies were estimated using maximum likelihood and evaluated with TREESCORE. We sometimes saw a small decrease in the accuracy of phylogenies produced from families with EST-contig-like sequences (fig. 7, dashed line). However, if we limited the truncation to less than 50% of the sequence length, we observed no decrease in phylogenetic accuracy for contig sequences (data not shown). Sequences evolved under a limited truncation EST-like model showed a slight decrease in accuracy for families in Q4 (fig. 7, dotted line) but resembled accurate full-length sequences otherwise. Thus, truncation can reduce accuracy; fortunately, truncated sequences are readily identified in sequence alignments and thus can be excluded.
Strategies for Improving Phylogenetic Accuracy
For many ancient radiation sequence families that share statistically significant similarity, the probability of reconstructing an accurate phylogeny was low. Therefore, we examined 3 methods to increase phylogenetic accuracy: using sequence concatenation (fig. 8A and B), correct alignments (fig. 8C), and bootstrapping (fig. 8D).
Increasing sequence size can improve phylogenetic reconstruction (Miyamoto 1985
We were concerned that tree accuracy had improved with concatenation because the more distant families were combined with less distant families. Hence, we also concatenated sequences that were evolved under the same topology to the same final evolutionary distances (fig. 8B). Because there were fewer families that met these criteria, only 2 sequences were concatenated. Again, the number of accurate phylogenies increased significantly. For sequences in quintile Q2, the number of accurate phylogenies increased 50% (fig. 8B). Protein families in the 3 least distant quintiles showed about a 10% increase in accuracy when 2 sequences were concatenated.
Phylogenetic reconstruction begins with multiple sequence alignments, so we asked whether correct multiple sequence alignments produced substantially better phylogenies than calculated alignments. We constructed true alignments based on the positions of insertions and deletions of each sequence and their ancestor sequences. Phylogenies were estimated and scored as in figure 4C. We would expect that for families with poor estimated alignments, NS and Q1, phylogenetic accuracy would increase given the true alignment. Yet, although in some instances we were able to reconstruct a more accurate phylogeny using the true alignment, the differences were minimal even for the most distantly related protein families (Q1, fig. 8C).
In addition, because protein families from our ancient radiation model often produced incorrect trees, we sought a measure of tree reliability that would help us identify the trees that are more likely to be correct. Bootstrap analysis and Bayesian posterior probabilities are used in phylogenetic analysis to determine the robustness of a phylogenetic prediction, so we used these scores to stratify the protein families examined in our analysis. Bootstrap scores were calculated by averaging the bootstrap values from 100 sample data sets of the internal nodes. Posterior probability scores were calculated by averaging the posterior probabilities of the internal nodes. Trees with average bootstrap scores greater than 0.80 across all taxa have more phylogenies that are accurate (fig. 8D, triangle), particularly for families in Q2 and Q3 (70% increase). Similarly, trees with the lowest variance among their bootstrap values at the internal nodes were accurate more often (data not shown). Overall, when families that do not have a bootstrap score greater than 0.80 were removed, accuracy increased from 56% to 62%. Similarly, trees with the highest average posterior probabilities were more likely to be accurate (data not shown).
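The filtering step can be sketched as follows, assuming each tree is summarized by the support values of its internal nodes and by its TREESCORE. The dictionary keys, function names, and toy numbers are illustrative assumptions, not data from this study.

```python
def mean_support(node_supports):
    """Average bootstrap (or posterior) support over a tree's internal nodes."""
    return sum(node_supports) / len(node_supports)


def fraction_accurate(trees, accuracy_cutoff=0.80):
    """Fraction of trees whose TREESCORE reaches the accuracy cutoff."""
    if not trees:
        return None
    return sum(t["treescore"] >= accuracy_cutoff for t in trees) / len(trees)


def accuracy_before_and_after(trees, support_cutoff=0.80):
    """Compare accuracy over all trees with accuracy over well-supported trees."""
    kept = [t for t in trees if mean_support(t["supports"]) > support_cutoff]
    return fraction_accurate(trees), fraction_accurate(kept)


# Made-up example: supports are per-node values on a 0 to 1 scale.
toy_trees = [
    {"supports": [0.90, 0.95], "treescore": 1.0},
    {"supports": [0.50, 0.60], "treescore": 0.5},
    {"supports": [0.85, 0.90], "treescore": 0.9},
    {"supports": [0.70, 0.95], "treescore": 0.6},
]
print(accuracy_before_and_after(toy_trees))
```

The same function applies unchanged to posterior probabilities, since only the per-node averages are used.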
| Discussion |
|---|
We have explored the limits of phylogenetic reconstruction by assessing the probability of estimating a correct phylogenetic tree for diverse protein families evolved using different radiation models. As expected, as sequence similarity increases, multiple sequence alignment accuracy increases. Phylogenies of protein families evolved under a uniform radiation model are estimated accurately at diverse DNA distances using various phylogenetic algorithms (Hall 2005
Including ESTs in phylogenetic tree reconstructions reduced accuracy only slightly when the EST sequence covered more than half the length of the protein family. Likewise, when we estimated trees using true multiple sequence alignments, phylogenetic accuracy improved only slightly. Our results are consistent with work by Hall (2005), which shows that topological accuracy does not increase when correct alignments are used. However, branch-length accuracy is increased with accurate alignments (Hall 2005). Although accurate alignments are important for creating accurate trees, other factors, such as sequence similarity and short ancient branch lengths, have much more impact on phylogenetic accuracy. In challenging cases such as families with long recent branches, sequence concatenation and choosing trees with high robustness scores (bootstrap values and posterior probability) can improve the probability of accurately estimating the trees.
Because simulation studies are performed under well-specified models, our results almost certainly overestimate our ability to reconstruct accurate phylogenies from biological data. Thus, our estimates of reconstruction accuracy must be seen as best case values; more complex models of rate variation across sequences, and a more comprehensive sample of tree topologies and radiation histories, might allow a more reliable estimate of phylogeny accuracy for a given level of sequence similarity. For similar levels of sequence similarity, the Giardia families produce much less accurate trees than our simulated families (fig. 5D). The large differences in ungapped alignment coverage between our synthetic families and Giardia families suggest that a more realistic model for protein families would contain a domain insertion/deletion model, perhaps with domains that evolve at different rates from the surrounding sequence. Because Giardia protein families do show a range of tree accuracies similar to our Q1 synthetic families, it may be possible to develop a more accurate synthetic evolution model that reproduces the multiple alignment ungapped coverage and has a range of TREESCORE accuracies similar to Giardia. This model might allow a more accurate prediction of phylogeny reconstruction from sequence similarity and ungapped alignment coverage.
Of course, some of the inaccurate Giardia phylogenies may reflect biological phenomena (i.e., LGT or recombination) and possible human errors, such as including paralogous genes or proteins that do not share similar domain composition. But the observation that synthetic protein families with a simple phylogenetic history can produce a similar range of reconstruction accuracies (fig. 6) suggests that a computational, rather than biological, explanation is sufficient to explain the inconsistencies.
Species trees based on molecular data are rarely estimated with individual proteins. Now that many prokaryotic and model eukaryotic genomes have been completed and several groups are producing high-throughput EST sequences, phylogenetic analysis should be performed not on individual protein families but by concatenating several sequences. To improve accuracy further, phylogenies should be estimated with organisms that have close relatives in the protein family and all sequences should share statistically significant sequence similarity (E(·) < 10^-3) with all other members of the family. And, in studies of phylogenies with many protein families, families with higher bootstrap values across the tree are more likely to produce accurate phylogenies.
| Acknowledgements |
|---|
This work was supported by National Institutes of Health grant AI1058054 to M. Sogin. We thank S. Huse for help with the analysis of the Giardia phylogenetic tree data and M. Sogin for his comments on the manuscript. C. reinhardtii, P. sojae, and P. ramorum sequence data were produced by the US Department of Energy Joint Genome Institute, http://www.jgi.doe.gov/ and are provided for use in this publication/correspondence only.
Funding to pay the Open Access publication charges for this article was provided by the NIH grant AI1058054.
| Footnotes |
|---|
William Martin, Associate Editor
| References |
|---|
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol 215:403–10.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–402.
Benner SA, Cohen MA, Gonnet GH. (1993) Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J Mol Biol 229:1065–82.
Brinkmann H and Philippe H. (1999) Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol Biol Evol 16:817–25.
Campbell AM. (2000) Lateral gene transfer in prokaryotes. Theor Popul Biol 57:71–7.
Dayhoff MO, Eck RV, Park CM. (1972) A model of evolutionary change in proteins. Atlas of protein sequence and structure. (National Biomedical Research Foundation, Silver Spring, MD) Volume 5:89–99.
Dayhoff MO, Schwartz RM, Orcutt BC. (1978) A model of evolutionary change in proteins. Atlas of protein sequence and structure. (National Biomedical Research Foundation, Silver Spring, MD) Volume 5:345–52.
de la Cruz F and Davies J. (2000) Horizontal gene transfer and the origin of species: lessons from bacteria. Trends Microbiol 8:128–33.
Doolittle WF. (1998) You are what you eat: a gene transfer ratchet could account for bacterial genes in eukaryotic nuclear genomes. Trends Genet 14:307–11.
Doolittle WF. (1999) Phylogenetic classification and the universal tree. Science 284:2124–9.
Eck RV and Dayhoff MO. (1966) Atlas of protein sequence and structure. (National Biomedical Research Foundation, Silver Spring, MD).
Eisen JA. (2000) Horizontal gene transfer among microbial genomes: new insights from complete genome analysis. Curr Opin Genet Dev 10:606–11.
Felsenstein J. (1978) Cases in which parsimony or compatibility methods will be positively misleading. Syst Biol 27:401–10.
Felsenstein J. (1985) Confidence-limits on phylogenies: an approach using the bootstrap. Evolution 34:783–91.
Felsenstein J. (1993) Phylogeny Inference Package (PHYLIP). Version 3.5. (University of Washington, Seattle, WA).
Fiala KI and Sokal RR. (1985) Factors determining the accuracy of cladogram estimation: evaluation using computer-simulation. Evolution 39:609–22.
Fitch WM and Margoliash E. (1967) Construction of phylogenetic trees. Science 155:279–84.
Graybeal A. (1998) Is it better to add taxa or characters to a difficult phylogenetic problem? Syst Biol 47:9–17.
Guindon S and Gascuel O. (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52:696–704.
Hall BG. (2005) Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol Biol Evol 22:792–802.
Hillis DM. (1996) Inferring complex phylogenies. Nature 383:130–1.
Hillis DM. (1998) Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst Biol 47:3–8.
Hirt RP, Logsdon JM Jr, Healy B, Dorey MW, Doolittle WF, Embley TM. (1999) Microsporidia are related to fungi: evidence from the largest subunit of RNA polymerase II and other proteins. Proc Natl Acad Sci USA 96:580–5.
Jain R, Rivera MC, Moore JE, Lake JA. (2003) Horizontal gene transfer accelerates genome innovation and evolution. Mol Biol Evol 20:1598602.
Jones DT, Taylor WR, Thornton JM. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:27582.
Kluge AJ. (1989) A concern for evidence and a phylogenetic hypothesis of relationships among Epicrates (Boidae, Serpentes). Syst Zool 38:725.
Kuhner MK and Felsenstein J. (1994) A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol 11:45968.[Abstract]
Lockhart PJ, Larkum AW, Steel M, Waddell PJ, Penny D. (1996) Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc Natl Acad Sci USA 93:19304.
Lopez-Garcia P and Moreira D. (1999) Metabolic symbiosis at the origin of eukaryotes. Trends Biochem Sci 24:8893.[CrossRef][ISI][Medline]
Mirkin BG, Fenner TI, Galperin MY, Koonin EV. (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol Biol 3:2.[CrossRef][Medline]
Miyamoto MM. (1985) Consensus cladograms and general classifications. Cladistics 1:1869.
Morrison HG, McArthur AG, Adam RD, Aley SB, Gillin FD, Olsen GJ, Sogin ML. (2004) The Giardia lamblia Genome Database. , Available from: www.mbl.edu/Giardia.
Nixon KC and Carpenter JM. (1996) On simultaneous analysis. Cladistics 12:22141.[CrossRef]
Notredame C, Higgins DG, Heringa J. (2000) T-COFFEE: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:20517.[CrossRef][ISI][Medline]
Ochman H, Lawrence JG, Groisman EA. (2000) Lateral gene transfer and the nature of bacterial innovation. Nature 405:299304.[CrossRef][Medline]
Olsen GJ. (1987) Earliest phylogenetic branchings: comparing rRNA-based evolutionary trees inferred with various techniques. Cold Spring Harbor Symp Quant Biol 52:82537.[ISI][Medline]
Pearson WR and Lipman DJ. (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:24448.
Pearson WR, Wood T, Zhang Z, Miller W. (1997) Comparison of DNA sequences with protein sequences. Genomics 46:2436.[CrossRef][ISI][Medline]
Robinson DF and Foulds LR. (1981) Comparison of phylogenetic trees. Math Biosci 53:13147.[CrossRef]
Ronquist F and Huelsenbeck JP. (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:15724.
Salisbury BA. (1999) Misinformative characters and phylogeny shape. Syst Biol 48:15369.[CrossRef][ISI][Medline]
Simossis VA, Kleinjung J, Heringa J. (2005) Homology-extended sequence alignment. Nucleic Acids Res 33:81624.
Van de Peer Y, Ben Ali A, Meyer A. (2000) Microsporidia: accumulating molecular evidence that a group of amitochondriate and suspectedly primitive eukaryotes are just curious fungi. Gene 246:18.[CrossRef][ISI][Medline]
Yang Z. (1996) . Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol Evol 11:36772.[CrossRef]
Yang Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:5556.
Yang Z, Nielsen R, Hasegawa M. (1998) Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol 15:160011.[Abstract]
Zwickl DJ and Hillis DM. (2002) Increased taxon sampling greatly reduces phylogenetic error. Syst Biol 51:58898.[CrossRef][ISI][Medline]
[Figure legend fragments; plot symbols were lost in extraction.]
Quintiles of sequences that have statistically significant median similarity are indicated as least similar (Q1), second least similar (Q2), third least similar (Q3), fourth least similar (Q4), and most similar (Q5).
[...] uniform and ancient. Open symbols show families with nonsignificant median expectation values.
[...] 0.80. Filled symbols represent families with a significant median expectation value, which are therefore identified by sequence similarity searches.