Molecular Biology and Evolution 17:1220-1231 (2000)
© 2000 Society for Molecular Biology and Evolution
Regular Article
Bias in Phylogenetic Reconstruction of Vertebrate Rhodopsin Sequences
Department of Organismic and Evolutionary Biology, Harvard University
Abstract
Two spurious nodes were found in phylogenetic analyses of vertebrate rhodopsin sequences in comparison with well-established vertebrate relationships. These spurious reconstructions were well supported in bootstrap analyses and occurred independently of the method of phylogenetic analysis used (parsimony, distance, or likelihood). Use of this data set of vertebrate rhodopsin sequences allowed us to exploit established vertebrate relationships, as well as the considerable amount known about the molecular evolution of this gene, in order to identify important factors contributing to the spurious reconstructions. Simulation studies using parametric bootstrapping indicate that it is unlikely that the spurious nodes in the parsimony analyses are due to long branches or other topological effects. Rather, they appear to be due to base compositional bias at third positions, codon bias, and convergent evolution at nucleotide positions encoding the hydrophobic residues isoleucine, leucine, and valine. LogDet distance methods, as well as maximum-likelihood methods which allow for nonstationary changes in base composition, reduce but do not entirely eliminate support for the spurious resolutions. Inclusion of five additional rhodopsin sequences in the phylogenetic analyses largely corrected one of the spurious reconstructions while leaving the other unaffected. The additional sequences not only were more proximal to the corrected node, but were also found to have intermediate levels of base composition and codon bias as compared with neighboring sequences on the tree. This study shows that the spurious reconstructions can be corrected either by excluding third positions, as well as those encoding the amino acids Ile, Val, and Leu (which may not be ideal, as these sites can contain useful phylogenetic signal for other parts of the tree), or by the addition of sequences that reduce problems associated with convergent evolution.
Introduction
Phylogenetic analysis is a complex problem in inference. It is therefore not surprising that all existing phylogenetic methods are known to fail under some conditions and for a variety of reasons. In recent years, several issues have emerged as particularly thorny. Use of an oversimplified model of molecular evolution or strong violation of the assumptions of a model can result in convergence to an incorrect topology with greater certainty as sequence length increases (i.e., inconsistency). This type of problem is particularly relevant to parsimony analyses, especially in cases in which some branches are much longer than others, a problem which has been dubbed "long-branch attraction" (Felsenstein 1978). Phylogenetic methods based on explicit models of evolution, such as distance and maximum likelihood, tend to be less vulnerable to this type of problem, but even these are known to display inconsistency under conditions where their assumptions are strongly violated (Hillis, Huelsenbeck, and Cunningham 1994; Gaut and Lewis 1995; Yang 1996; Huelsenbeck 1997; Sullivan and Swofford 1997; Huelsenbeck 1998). In addition, although taxon sampling has long been an important issue in phylogenetic analyses, it remains difficult to establish reasonable guidelines for sampling and to assess the effects that it may have on the accuracy of tree topologies (Hillis 1996, 1998; Poe 1998; Rannala et al. 1998).
Determining the particular conditions under which phylogenetic methods fail is critical to both understanding their limitations and developing new, improved models and algorithms better suited to the analysis of molecular data. For example, third positions have been thought to be problematic in many data sets due to the effects of base compositional bias (Saccone, Pesole, and Preparata 1989; Sidow and Wilson 1990; Sogin, Hinkle, and Leipe 1993). This has led to the development of models that incorporate nonstationary changes in base composition for both distance and likelihood phylogenetic methods (Lockhart et al. 1994; Steel 1994; Galtier and Gouy 1995, 1998). However, in practice, it is often difficult to identify concrete examples of failure of phylogenetic methods in real data sets and to pinpoint the reasons for that failure. Another common feature of molecular data sets that may cause phylogenetic methods to fail is variation in codon bias across the tree, but examples of this in real data sets have yet to be isolated, and the challenges they pose for phylogenetic reconstruction have only just begun to be addressed (Goldman and Yang 1994; Muse 1996; Yang 1997).
Rhodopsin is an ideal genetic system for exploring issues in phylogenetic reconstruction, because it has been cloned from a variety of species, and much is known about its function and molecular evolution (Chang et al. 1995, 1996; Baylor 1996; Baylor and Burns 1998; Bowmaker 1998; Sakmar 1998; Townson et al. 1998). Rhodopsin is a single-copy nuclear gene encoding a seven-transmembrane G-protein-coupled receptor which forms the first step in the visual transduction cascade in the photoreceptors of the eye (Nathans 1992). In vertebrates, it is expressed at high levels in a single cell type, rod photoreceptor cells (Khorana 1992; Chang et al. 1996; Baylor and Burns 1998; Sakmar 1998). Rhodopsin has been found to exist in more than one copy only in rare instances, for example, in polyploid animals such as the carp, Cyprinus carpio (Larhammar and Risinger 1994). Most important for this study, phylogenetic relationships among vertebrates, for which rhodopsin sequences are available, have been well-characterized using fossil, morphological, and molecular data (Carroll 1997).
This study takes advantage of well-established vertebrate relationships to examine in detail molecular evolutionary forces which result in spurious reconstructions in a data set of vertebrate rhodopsin sequences. Once these factors have been identified, methods are explored to reduce their effects and eliminate the spurious reconstructions.
Materials and Methods
Rhodopsin sequences were obtained from the GenBank database via NCBI's website (http://www.ncbi.nlm.nih.gov/genbank/). GenBank accession numbers for all the sequences used are given in table 1. Rhodopsin cDNA sequences were aligned using CLUSTAL W and modified by hand to allow only gaps between codons. This file was then translated to yield an equivalently aligned amino acid rhodopsin data set. Parsimony, distance, and maximum-likelihood phylogenetic analyses were performed using a beta-test version of PAUP*, version 4 (Swofford 1999). Trees were rooted using the lamprey sequence as an outgroup. In addition, many of the analyses also included four paralogous rodlike cone opsin genes (GenBank accession numbers: gekko blue, M92035; chick green, M92038; goldfish green1, L11865; goldfish green2, L11866) as outgroup sequences in order to confirm the position of the root (Chang et al. 1995). The results of these analyses confirmed the position of the lamprey as the most basally diverging vertebrate rhodopsin.
In order to determine the best model for distance and likelihood analyses, likelihood scores were determined for five different models: JC (Jukes and Cantor 1969
-distribution; Yang 1993
In addition to equally weighted parsimony analyses, 2:1 transversion (Tv) : transition (Ts) weighting was also used. Although other weighting schemes were explored, they produced less reliable trees (data not shown). In addition, this weighting scheme reflects the likelihood estimate of the Tv/Ts ratio (1.5). Distance bootstrap analyses were performed using the HKY85+Γ, HKY85, and K2P models and the neighbor-joining algorithm.
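As a rough illustration of what a 2:1 Tv:Ts weighting means at the level of individual substitutions, the following Python sketch classifies the differences between two aligned sequences and sums their weighted cost. It is only a pairwise toy example under stated assumptions: PAUP* applies such weights to character-state changes reconstructed on a tree, not to raw pairwise comparisons, and the function names and test sequences here are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify(a, b):
    """Return 'ts' (transition), 'tv' (transversion), or None.

    None is returned for identical bases and for anything that is not
    an unambiguous A, C, G, or T (gaps, N, and so on).
    """
    a, b = a.upper(), b.upper()
    if a == b or a not in BASES or b not in BASES:
        return None
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    return "ts" if (a in PURINES) == (b in PURINES) else "tv"


def weighted_differences(seq1, seq2, tv_weight=2.0, ts_weight=1.0):
    """Count transitions and transversions and return the weighted cost."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        kind = classify(a, b)
        if kind == "ts":
            ts += 1
        elif kind == "tv":
            tv += 1
    return ts, tv, ts * ts_weight + tv * tv_weight


if __name__ == "__main__":
    # Hypothetical aligned fragments, not taken from the rhodopsin data set.
    print(weighted_differences("ACGTAC", "GCTTAA"))
```

With the default weights each transversion costs twice as much as a transition, so the rarer and presumably less saturated class of change carries more weight.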
In order to assess phylogenetic signal in the data set, 10,000 random trees were generated in PAUP* to calculate g1 statistics (Hillis and Huelsenbeck 1992). In addition, two measures of codon bias, scaled χ² and effective number of codons (ENC) (Shields et al. 1988; Wright 1990), were calculated to assess codon usage in each taxon. Measures of nucleotide and codon bias were calculated using the program MEA (generously provided by its author, E. Moriyama).
To test for long-branch attraction (Huelsenbeck 1998), 100 data sets were simulated by parametric bootstrapping using the program SIMINATOR (Huelsenbeck, Hillis, and Jones 1996) with parameters estimated from the original rhodopsin sequence data set. The simulated data sets were subsequently analyzed using equally weighted maximum parsimony in PAUP*, version 4, with 100 replications of (nonparametric) bootstrapping, 10 random-addition replicates each.
Results
Phylogenetic Analyses
Phylogenetic analyses were performed on a data set of 20 vertebrate rhodopsin nucleotide sequences (table 1). Although this data set showed high levels of genetic variation (table 2) and generally performed well in reconstructing traditional relationships among vertebrates, phylogenetic analyses consistently showed substantial bootstrap support for two groupings which contradict established vertebrate relationships: reptiles and amphibians form a clade (fig. 1B), instead of the more traditional reptiles and mammals (fig. 1A), and alligator and anolis form a clade (fig. 2B), instead of alligator and chicken (fig. 2A). In parsimony analysis with equal weights (table 3), bootstrap support was 86% for the reconstruction of amphibians as the sister group to reptiles (this node is hereinafter referred to as rept+amph) and 72% for the grouping of alligator + anolis as the sister lineage to the chicken (this node is hereinafter referred to as gator+anol). Support for these resolutions was robust to changes in the relative weightings of transversions and transitions: 91% for rept+amph and 65% for gator+anol with 2-to-1 Tv/Ts (table 3). Less than 5% bootstrap support was seen for more accepted resolutions of these nodes. This is in contrast to the robust support for established relationships elsewhere in the tree (fig. 3). Note that this data set, like many other molecular data sets, does not recover the Glires clade (rodents + rabbits), but instead places the rodents basal to a clade containing artiodactyls and other mammals. On the other hand, most morphological data recover the Glires (de Jong 1998). The Glires controversy is beyond the scope of this paper and does not influence its major observations.
Another unusual aspect of this data set is that despite the substantial divergences among sequences (table 2 ), useful phylogenetic signal has been retained in third positions. Not only do third positions contribute the largest numbers of informative sites (out of 568 total informative sites, 315 were in third positions, 151 were in first positions, and 102 were in second positions), but they also contain enough signal that when analyzed alone (fig. 4 ), they recover a tree that is almost as well supported as the tree with all three codon positions included. In addition, analyses of the degree of skewness of a distribution of lengths of 10,000 randomly generated trees imply that some phylogenetic signal does reside in third positions (all positions: g1 = -0.69; third positions only: g1 = -0.69; first + second positions only: g1 = -0.80).
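For readers unfamiliar with the g1 statistic, the sketch below computes the moment-based skewness coefficient of a list of tree lengths, with g1 taken as the third central moment divided by the 3/2 power of the second. A strongly left-skewed (negative) distribution of random-tree lengths is the pattern usually read as evidence of phylogenetic signal. The tree lengths in the example are invented for illustration, and the exact estimator used by PAUP* is an assumption here.

```python
def g1_skewness(values):
    """Moment-based skewness: g1 = m3 / m2**1.5.

    m2 and m3 are the second and third central moments of the sample
    (divided by n, with no small-sample correction).
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical; skewness is undefined")
    return m3 / m2 ** 1.5


if __name__ == "__main__":
    # Hypothetical tree lengths: most trees are long, a few are much shorter,
    # giving a left tail and therefore a negative g1.
    lengths = [2310, 2295, 2301, 2288, 2150, 2306, 2299, 2203]
    print(round(g1_skewness(lengths), 3))
```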
Although there is useful phylogenetic signal retained at third positions with respect to many nodes in the tree, this signal also appears to be underlying some of the support for the problematic reconstructions. Bootstrap support for these incongruent resolutions almost completely disappeared when third positions were excluded from parsimony analysis (<5% for rept+amph, 10% for gator+anol), an effect that is robust to changes in transversion-transition weighting (table 3 ). Furthermore, excluding third positions had the effect of increasing support for the more established chicken + alligator grouping (hereinafter chick+gator) to 72%, in contrast to the <5% bootstrap support shown when all positions were included in the analysis. Support for the reptile + mammal grouping (hereinafter rept+mamm) also increased, but not as much (16%), when third positions were excluded.
When analyzed alone, third positions showed substantial support for the spurious resolutions (68% for rept+amph, 41% for gator+anol) and no support for the well-corroborated relationships, an effect which was robust to changes in Tv/Ts weighting (table 3 and fig. 4 ). Analyses of the amino acid sequences, which should be free of the base compositional and codon bias effects particularly problematic for third-base positions and transitions, did not show any support for the incongruent relationships (table 3 ). However, the bootstrap phylogeny based on amino acids was rather poorly resolved in general (fig. 5 ).
Distance analyses which did not incorporate nonstationary changes in base composition did not fare much better than parsimony for this data set, and also tended to recover the problematic nodes with substantial bootstrap support (71% for rept+amph and 47% for gator+anol, HKY85+Γ model; table 3). These bootstrap values remained quite stable, even when the correction for rate heterogeneity was not included in the analysis or when models with fewer parameters were used (table 3).
Given the variation in base composition in this data set, especially at third positions (see table 1), analyses using LogDet/paralinear distance methods (Lake 1994; Lockhart et al. 1994; Steel 1994) were performed. These methods allow for nonstationary changes in base composition among sequences in a phylogeny and would be expected to perform better for data sets where this is a problem. Phylogenetic bootstrap analyses using LogDet distances did show reduced support for the problematic reconstructions (56% for rept+amph and 38% for gator+anol; table 3 and fig. 6). Moreover, for one of the problematic nodes, there was also slightly increased support for the correct reconstruction (38% for chick+gator; table 3).
Maximum-likelihood methods were also explored for this data set (fig. 7). Likelihood ratio tests were used to compare nested models of evolution in order to identify models that best fit our data set. These models were tested for a phylogeny of well-established vertebrate relationships (fig. 8A). Among the models tested, among-sites rate heterogeneity was the single most important parameter resulting in significantly better likelihood scores (χ² ranged from 2235.2 to 2412.8, P < 0.001 for all comparisons; table 4). Among the models incorporating rate heterogeneity, GTR+Γ had significantly higher likelihood scores in pairwise comparisons with all other models (χ² = 123.8-619.6, P < 0.001 for all comparisons) except for the HKY85+Γ model (χ² = 6, P = 0.2). The HKY85+Γ model, when compared with nested models with fewer parameters, had significantly better likelihood scores (χ² = 117.8-306.8, P < 0.001 for all comparisons). Since the GTR+Γ model was not found to be significantly better than the HKY85+Γ model, the HKY85+Γ model was determined to be the best fit of those tested for this data set and was subsequently used in a full likelihood bootstrap analysis of the rhodopsin data set. However, maximum-likelihood phylogenetic methods under the HKY85+Γ model did not perform any better than distance or parsimony methods, showing substantial support for spurious resolutions at both nodes (79% for rept+amph and 44% for gator+anol; fig. 7 and table 3).
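The likelihood ratio comparisons above can be reproduced in outline with a few lines of Python. The sketch computes the test statistic as twice the difference in log-likelihood between nested models and refers it to a chi-square distribution, using the closed-form upper tail that exists for even degrees of freedom. The log-likelihood values in the usage example are hypothetical, and the choice of 4 degrees of freedom for GTR versus HKY85 (both with rate heterogeneity) is an assumption based on the usual parameter counts for these models; the text does not state it. Under that assumption, a statistic of 6 gives a P value of about 0.2, in line with the value reported above.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate for even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)**i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this helper only supports positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def likelihood_ratio_test(lnl_simple, lnl_complex, df):
    """Return (statistic, P) for nested models with an even df difference."""
    statistic = 2.0 * (lnl_complex - lnl_simple)
    return statistic, chi2_sf_even_df(statistic, df)


if __name__ == "__main__":
    # Hypothetical log-likelihoods chosen so that the statistic equals 6.
    stat, p = likelihood_ratio_test(-10003.0, -10000.0, df=4)
    print(stat, round(p, 3))
```

Restricting the helper to even degrees of freedom keeps the example free of external dependencies; a general implementation would use a statistics library instead.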
Since base compositional effects appeared to be important in this data set, likelihood methods which allow for nonstationary GC content were also explored (Galtier and Gouy 1998
Finally, it has been suggested that hydrophobic amino acids may be less useful for phylogenetic reconstruction than other amino acids (Naylor and Brown 1997). To explore the effects of hydrophobic amino acids in the rhodopsin data set, nucleotide positions encoding the hydrophobic amino acids Ile, Leu, and Val were excluded in a parsimony analysis (189 nucleotide positions excluded, representing 63 amino acids). This analysis showed greatly reduced bootstrap support for the spurious resolutions (<5% for rept+amph and 33% for gator+anol, 2:1 Tv/Ts; table 3), indicating that positions encoding for these amino acids may underlie the spurious signal. If the spurious signal was due mainly to functional constraints on these hydrophobic amino acids, then excluding third positions should not affect the analysis. This was not the case, as the effect remained even when only third positions of the hydrophobic amino acids Ile, Leu, and Val were excluded (table 3).
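One way such an exclusion set might be built from a codon alignment is sketched below in Python. The text does not say how a codon column was judged to "encode" Ile, Leu, or Val across taxa, so the rule used here (flag the column if any sequence carries an Ile, Leu, or Val codon in the standard genetic code) is an assumption, as are the function name and the toy alignment.

```python
# Standard-genetic-code codons for the three hydrophobic residues.
ILV_CODONS = {
    "ATT", "ATC", "ATA",                       # Ile
    "TTA", "TTG", "CTT", "CTC", "CTA", "CTG",  # Leu
    "GTT", "GTC", "GTA", "GTG",                # Val
}


def ilv_excluded_positions(alignment, third_only=False):
    """Return 0-based nucleotide positions to exclude.

    alignment maps taxon name -> aligned, in-frame coding sequence.
    A codon column is flagged if ANY taxon has an Ile/Leu/Val codon
    there (an assumed rule, not necessarily the one used in the study).
    With third_only=True only the third position of each flagged codon
    is returned.
    """
    seqs = [s.upper() for s in alignment.values()]
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs) or length % 3:
        raise ValueError("sequences must be equal length and in frame")
    excluded = []
    for start in range(0, length, 3):
        if any(s[start:start + 3] in ILV_CODONS for s in seqs):
            if third_only:
                excluded.append(start + 2)
            else:
                excluded.extend(range(start, start + 3))
    return excluded


if __name__ == "__main__":
    # Hypothetical two-taxon alignment: the second codon is Leu in taxon_a.
    toy = {"taxon_a": "ATGCTGAAA", "taxon_b": "ATGCCGAAA"}
    print(ilv_excluded_positions(toy))
    print(ilv_excluded_positions(toy, third_only=True))
```

The resulting position list could then be converted to whatever character-exclusion syntax the phylogenetics program expects.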
Statistical Tests Comparing Trees
Several statistical tests were performed using the rhodopsin nucleotide data set in order to determine if phylogenies with and without the two spurious reconstructions were significantly different. The Templeton (1983) test and the "winning sites" test (Prager and Wilson 1988) compare trees under the parsimony criterion, whereas the Kishino-Hasegawa test (Kishino and Hasegawa 1989) was formulated to compare trees under either likelihood or parsimony. Tests under the parsimony criterion are shown in table 5, and tests under the likelihood criterion are shown in table 6. Each of the two spurious reconstructions (rept+amph, gator+anol) was tested separately in pairwise tests of trees with and without each spurious reconstruction. These tests confirmed the results of the phylogenetic bootstrap analysis, pinpointing third positions and nucleotides encoding Ile, Leu, and Val as the sites supporting the spurious reconstructions. Although neither spurious reconstruction (rept+amph, gator+anol) was significantly better with all nucleotide sites included, when only third positions and sites encoding Ile, Leu, or Val were considered, trees with the spurious reconstructions became significantly better than those without. This was true under both parsimony (table 5) and likelihood (table 6). Conversely, when only first and second positions, excluding those sites encoding Ile, Leu, or Val, were considered, the tree without spurious reconstructions was found to be better than either one of the trees with the spurious reconstructions. This result was significant under parsimony, but not under likelihood (tables 5 and 6).
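The "winning sites" comparison is, in essence, a sign test: each site that requires fewer steps on one tree than on the other counts as a win for that tree, tied sites are ignored, and the split of wins is compared with a fair coin. A minimal Python sketch of a two-tailed version follows. The win counts in the example are hypothetical, and the exact form of the test as implemented in PAUP* is assumed rather than confirmed here.

```python
import math


def winning_sites_p(wins_a, wins_b):
    """Two-tailed exact binomial (sign test) P value.

    wins_a and wins_b are the numbers of sites favoring tree A and
    tree B; sites with equal length on both trees are not counted.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins for the losing tree under p = 0.5.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical counts: 9 sites favor one tree, 1 favors the other.
    print(winning_sites_p(9, 1))
```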
Simulation Studies Using Parametric Bootstrapping
Long-branch attraction has been identified as a potential reason for problematic groupings in several studies (Huelsenbeck, Hillis, and Jones 1996
model (maximum-likelihood-estimated parameters: K = 3.12,
α = 0.33, frequency of A = 0.19, frequency of C = 0.35, frequency of G = 0.24, frequency of T = 0.23; branch lengths are given in fig. 8A
). Although a few of the resolutions of taxa present in this tree do remain somewhat controversial (e.g., the placement of the rabbits as basal to artiodactyls instead of with rodents), these are unlikely to affect the simulations with respect to the nodes in question. Results from parsimony bootstrap analysis of the 100 simulated data sets are graphed in figure 8B and C, representing the expected null distribution of parsimony bootstrap values for each spurious reconstruction (rept+amph, gator+anol). Note that support for these spurious clades was being examined under conditions where the data were simulated from topologies reflecting the more established relationships (rept+mamm, gator+chick). The median level of bootstrap support for the incorrect rept+amph clade was 10.5% and that for the gator+anol clade was 19% in the simulated data sets. In the real rhodopsin data set, bootstrap support for both spurious resolutions was significantly higher than expected from the null distribution of simulated data sets generated by parametric bootstrapping (86% for rept+amph and 72% for gator+anol; P < 0.05 in both cases). This indicates that the level of support seen for the problematic reconstructions is higher than would be expected given the conditions of the simulations, and therefore unlikely to be due to long-branch attraction.
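The comparison of an observed bootstrap value with a simulated null distribution can be sketched as follows. The Python function returns the proportion of simulated data sets whose support for the spurious clade is at least as high as the observed value, using the conservative (k + 1)/(n + 1) estimator so that the estimate is never exactly zero. The null values in the example are invented for illustration, and the specific estimator is an assumption; the text reports only that P < 0.05.

```python
import statistics


def empirical_p_value(observed, null_values):
    """One-tailed empirical P value for an observed support value.

    Counts simulated values >= observed and applies the (k + 1)/(n + 1)
    correction, which treats the observed value as one more draw.
    """
    n = len(null_values)
    if n == 0:
        raise ValueError("null distribution is empty")
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (n + 1)


if __name__ == "__main__":
    # Hypothetical null distribution of bootstrap support (percent) from
    # 100 simulated data sets; these are not the values behind figure 8.
    null_support = [5.0] * 40 + [10.0] * 30 + [20.0] * 25 + [60.0] * 5
    print(statistics.median(null_support))
    print(empirical_p_value(86.0, null_support))
```

With 100 simulated data sets the smallest attainable value under this estimator is 1/101, so an observed support that exceeds every simulated value yields P just under 0.01.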
Base Composition and Codon Bias Measures
Since the results of the phylogenetic analyses and statistical tests comparing phylogenies implied that third positions, as well as transitions, underlie the bootstrap support of the spurious reconstructions, base composition and codon bias measures were examined for evidence of convergent evolution. First- and second-position nucleotide compositions were fairly homogeneous across all sequences. However, at third positions, reptile and amphibian rhodopsins tended to have lower %GC than other sequences (table 1). This pattern of convergent evolution may confound phylogenetic analyses and result in the spurious grouping, as shown by mapping the GC content on the phylogeny (fig. 9). Furthermore, amphibian and reptile rhodopsins are less biased in their codon usage, as shown by scaled χ² and ENC codon bias measures, than are the rhodopsins of other vertebrate groups (table 1). Not only are there convergences in the overall degree of codon bias, but there are also convergences in the usage frequencies of specific codons that reflect the spurious groupings. This convergent pattern was evident when the codon usage frequencies were mapped on a tree. For example, convergences in the frequency of GGC, one of four codons coding for glycine, are shown mapped on the tree in figure 9.
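The third-position %GC values discussed here are simple to compute from an in-frame coding sequence. The Python sketch below is an illustration only; the values in table 1 were obtained with the program MEA, and the function name, the handling of gaps and ambiguous bases, and the example sequence are assumptions.

```python
def gc3_percent(cds):
    """Percent G+C at third codon positions of an in-frame sequence.

    Gaps and ambiguous characters at third positions are skipped.
    """
    thirds = [b for b in cds.upper()[2::3] if b in "ACGT"]
    if not thirds:
        raise ValueError("no unambiguous third positions found")
    gc = sum(1 for b in thirds if b in "GC")
    return 100.0 * gc / len(thirds)


if __name__ == "__main__":
    # Hypothetical four-codon fragment with third positions G, C, A, T.
    print(gc3_percent("ATGGCCAAATTT"))
```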
The results of the phylogenetic analyses and statistical tests comparing alternative phylogenies also implicated positions encoding hydrophobic residues Ile, Leu, and Val as contributing to the high bootstrap support of the spurious reconstructions. In order to further explore this effect, base composition was examined at these sites for evidence of convergent evolution. Third positions in general had already been shown to be convergent in this data set (see above, fig. 9 ); therefore, for these amino acids, attention was focused on first and second positions. Second positions did not vary, as the amino acids Ile, Leu, and Val are all encoded by the same nucleotide, T. However, at first positions, at these sites, it was found that reptile and amphibian rhodopsins tended to have more A's (32.01%) than all other sequences (28.86%). This effect was not as marked in first positions that encoded amino acids other than Ile, Leu, and Val (27.28% for amph+rept, 26.43% for all others). This pattern of convergent evolution resulting in increased numbers of A's in first positions also results in an increased proportion of Ile's in reptile and amphibian rhodopsins relative to the total numbers of Ile, Leu, and Val present (28.1% for amph+rept, 26.5% for all others).
Effect of Increased Sampling
If the spurious reconstructions seen in this data set were due to convergent evolution, perhaps better sampling across the tree could ameliorate this effect. Rhodopsin sequences from five basally diverging taxa that were recent additions to GenBank were added to the data set: sea lamprey, Conger eel, Anguilla eel, skate, and Myripristis berndti, a holocentrid marine fish (table 1). It is important to note that not only do these sequences represent basal species poorly sampled in the original data set, but several of them also display values of base composition at third positions and/or codon bias quite different from their closest neighbors on the tree, and are thus more likely to "break up" convergent effects.
Myripristis berndti and Anguilla rhodopsins have only 67.81% and 73.65% GC content at third positions, as compared with other fish rhodopsins, which average 80.16% (table 1). Similarly, skate rhodopsin has much lower %GC at third positions (70.70%) than the nearest basal lineage, lamprey rhodopsin (87.57%). The two measures of codon bias, scaled χ² and ENC, also showed the M. berndti, skate, and Conger rhodopsins to be atypically low in codon bias compared with neighboring fish and lamprey sequences (table 1).
For this expanded data set, equally weighted parsimony analysis of all positions showed reduced bootstrap support for the spurious rept+amph clade (48%) as compared with the original data set (86% without the additional sequences) and increased support for the correct rept+mamm clade, which rose from <5% in the original data set (table 3 ) to 25% in the expanded data set (table 7 ). Unlike analyses of the original data set, in which there was virtually no difference between equal weights versus 2-to-1 Tv/Ts weighting schemes, analysis of the expanded data set was highly sensitive to differences in weighting, particularly in the resolution of the reptile-mammal-amphibian node. When Tv/Ts weighting was used, bootstrap support for the correct rept+mamm clade jumped from 25% (equal weights) to 70% (2:1 Tv/Ts; fig. 10 ). In contrast, bootstrap support for the spurious gator+anol clade remains substantial in the analysis of the expanded data set (73%), and the high degree of sensitivity to differences in Tv/Ts weighting was not seen here (table 7 ).
In both cases, bootstrap support for the spurious resolutions disappeared entirely when third positions were excluded from parsimony analysis, regardless of Tv/Ts weighting (table 7 ). These results are similar to those of the analysis of the original data set (table 3 ). However, in contrast to the original data set, when third positions were excluded in the expanded data set, bootstrap support for the more established resolutions was increased (44% for rept+mamm and 78% for chick+gator, equal weights). When analyzed alone, third positions showed substantial support for the spurious resolutions and no support for the well-corroborated resolutions of these nodes, regardless of Tv/Ts weighting (table 7 ).
The patterns of bootstrap support in distance analyses of the expanded data set (table 7) remained very similar to those of the original data set (table 3), with very little difference in support between the models used, showing neither decreased support for spurious resolutions nor increased support for correct resolutions. Maximum-likelihood reconstructions under HKY85+Γ in the expanded data set also showed results similar to those found for the original data set and did not show reduced support for the spurious nodes nor heightened support for the correct nodes in the expanded data set (table 7).
Statistical comparisons of trees with and without the spurious reconstructions (rept+amph, gator+anol) were consistent with the phylogenetic bootstrap analyses. A tree with the gator+anol clade was still better than one without this spurious reconstruction when only third positions and sites encoding Ile, Leu, and Val were considered. This result was significant under the parsimony criterion (table 5 ) and not quite significant under the likelihood criterion (P = 0.07; table 6 ). However, the sites which clearly supported the spurious gator+anol reconstruction in both the original and extended data sets and also supported the spurious rept+amph reconstruction in the original data set were no longer capable of distinguishing between a tree with the spurious rept+amph reconstruction and one without in the extended data set (tables 5 and 6 ). This result is again consistent with the phylogenetic bootstrap analyses, which suggest that the additional sequences aid in breaking up convergences among the sequences, but only for the spurious rept+amph reconstruction, which is more proximal to the additional sequences, leaving the spurious gator+anol reconstruction largely unaffected.
Discussion
Our results indicate that the two problematic reconstructions in the original rhodopsin data set were probably not the result of topological effects such as long-branch attraction. This is demonstrated by the persistence of these spurious nodes when maximum-likelihood methods were used and by the fact that the bootstrap support for these spurious nodes was well outside of the distribution of support obtained for each node from simulated data sets generated by parametric bootstrapping. Rather, these spurious reconstructions were most likely due to convergences in base compositional bias at third positions, in codon bias, and in positions encoding for the hydrophobic amino acids Ile, Val, and Leu, which tend to group unrelated sequences. This represents a strong violation of phylogenetic model assumptions of stationary base composition and codon frequencies across the tree, which would cause methods not directly addressing these problems to fail under these conditions.
Base compositional bias at third positions has often been found to be problematic for phylogenetic reconstruction, and several methods have been developed in an attempt to address this problem (Lockhart et al. 1994; Galtier and Gouy 1995, 1998). Although these methods did reduce support for the spurious reconstructions in the rhodopsin data set, they were not completely effective in eliminating the problematic nodes, and it seems clear that base compositional bias is not the only reason for the spurious nodes. In fact, simulation studies on a data set of bat sequences have shown that levels of base compositional bias must be extremely high (>90% AT) in order to show any evidence of spurious reconstructions (Van Den Bussche et al. 1998). Although fairly high, levels of base compositional bias are not so extreme in the rhodopsin data set.
In addition to base compositional bias, convergent effects in codon bias and in positions encoding hydrophobic amino acids also appear to be supporting the spurious reconstructions in the rhodopsin data set. Other phylogenetic studies that have also found problematic reconstructions have attributed these to various problems such as not incorporating rate heterogeneity across sites into the phylogenetic model (Takezaki and Gojobori 1999), which is clearly not the case here. However, there is growing evidence that convergent or parallel evolution at the level of nucleotides (or amino acids) is a common feature of many molecular data sets and may pose a significant challenge in attempting to reconstruct unbiased phylogenies (Naylor and Brown 1997, 1998; Cao et al. 1998; Foster and Hickey 1999; Lee 1999). In particular, nucleotide sites encoding the hydrophobic amino acids Ile, Leu, and Val have been shown in other studies to display lower retention indices than other sites (Naylor and Brown 1997), and the analyses of the rhodopsin data set presented here provide more evidence of the importance of this effect. The reasons for it still remain unclear but may be related to relaxed constraints on hydrophobic amino acids contained within transmembrane domains.
There are several ways to address these problems of bias in base composition, codon frequencies, and sites encoding hydrophobic amino acids. All of these positions could be excluded from a parsimony phylogenetic analysis. This method can be effective in principle, but in fact may not be ideal, as these positions often contain useful phylogenetic signal in addition to the spurious signal, and excluding them can result in loss of resolution in the phylogenetic reconstructions (e.g., see Campbell, Brower, and Pierce 2000). Another way of addressing this problem would be to develop more complex models of evolution which incorporate these assumptions about base composition, codon bias, and amino acid composition. However, this may require the addition of many more parameters to the model, which may become problematic.
In addition to advances in phylogenetic methodology, this problem may be effectively addressed, albeit indirectly, via better sampling of species. Note that here "better sampling" means the addition of sequences that are not only proximal to problematic nodes, but also intermediate in base composition and codon bias. In other words, when considering sampling issues, it is important not only to "break up" long branches that can lead to the failure of methods such as parsimony, but even more important to "break up" convergences in base composition and codon bias that can cause all types of phylogenetic methods, not just parsimony, to fail. In fact, of all the phylogenetic methods used here, only weighted parsimony methods were able to recover the correct topology once appropriately sampled sequences were included in the analysis, and thus these methods outperformed both distance and maximum-likelihood methods in this regard. This may reflect greater sensitivity of maximum-likelihood and distance methods to incorrect assumptions in the underlying models (with respect to nonstationary nucleotide and codon bias and hydrophobic sites) in comparison with parsimony methods, which may sometimes prove more robust to violations of these assumptions, despite the fact that maximum-likelihood methods are known to be consistent over a larger set of conditions than are parsimony methods (Hillis, Huelsenbeck, and Cunningham 1994; Huelsenbeck 1997; Sullivan and Swofford 1997).
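As a rough illustration of screening for sequences that are "intermediate" in base composition, the sketch below computes GC content at third codon positions (GC3) and lists candidate sequences whose GC3 falls strictly between two reference sequences flanking a problematic node. GC3 is used here only as a simple proxy; the function names, the input format, and the strict-between criterion are assumptions made for this example, and codon bias would need a separate measure.

```python
def gc3(seq):
    """Fraction of G or C among unambiguous third codon positions."""
    thirds = [base for base in seq.upper()[2::3] if base in "ACGT"]
    if not thirds:
        raise ValueError("no unambiguous third positions")
    return sum(base in "GC" for base in thirds) / len(thirds)


def intermediate_candidates(flanking_a, flanking_b, candidates):
    """Names of candidates whose GC3 lies strictly between two references.

    `candidates` maps taxon name -> in-frame coding sequence
    (a hypothetical input format chosen for this sketch).
    """
    low, high = sorted((gc3(flanking_a), gc3(flanking_b)))
    return [name for name, seq in candidates.items()
            if low < gc3(seq) < high]
```

A screen like this only flags candidates by composition; whether a sequence is also proximal to the node in question still has to be judged from the established relationships among the taxa.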
Acknowledgements
We thank Z. Yang, R. Honeycutt, and two anonymous reviewers for many helpful comments on the manuscript, and N. Pierce and M. Donoghue for discussion and advice. B.S.W.C. is an NSF/Alfred P. Sloan Fellow in Molecular Evolution.
Footnotes
Rodney Honeycutt, Reviewing Editor
1 Present address: Department of Molecular Biology and Biochemistry, Rockefeller University.
2 Present address: Department of Biology, University of Maryland, College Park.
3 Keywords: molecular evolution, hydrophobic amino acids, base compositional bias, codon bias, parametric bootstrapping
4 Address for correspondence and reprints: Belinda S. W. Chang, Rockefeller University, 1230 York Ave., Box 284, New York, New York 10021. E-mail: changb{at}rockvax.rockefeller.edu
Literature Cited
Baylor, D. 1996. How photons start vision. Proc. Natl. Acad. Sci. USA 93:560-565.
Baylor, D. A., and M. E. Burns. 1998. Control of rhodopsin activity in vision. Eye 12:521-525.
Bowmaker, J. 1998. Evolution of colour vision in vertebrates. Eye 12:541-547.
Campbell, D. L., A. V. Z. Brower, and N. E. Pierce. 2000. Molecular evolution of the Wingless gene and its implications for the phylogenetic placement of the butterfly family Riodinidae (Lepidoptera: Papilionoidea). Mol. Biol. Evol. 17:684-696.
Cao, Y., A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Paabo, and M. Hasegawa. 1998. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J. Mol. Evol. 47:307-322.
Carroll, R. L. 1997. Patterns and processes of vertebrate evolution. Cambridge University Press, Cambridge, England.
Chang, B. S. W., D. Ayers, W. C. Smith, and N. E. Pierce. 1996. Cloning of the gene encoding honeybee long-wavelength rhodopsin: a new class of insect visual pigments. Gene 173:215-219.
Chang, B. S. W., K. S. Crandall, J. P. Carulli, and D. L. Hartl. 1995. Opsin phylogeny and evolution: a model for blue shifts in wavelength regulation. Mol. Phylogenet. Evol. 4:31-43.
de Jong, W. W. 1998. Molecules remodel the mammalian tree. Trends Ecol. Evol. 13:270-275.
Felsenstein, J. 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27:401-410.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.
Felsenstein, J. 1991. PHYLIP: phylogeny inference package. Version 3.4. University of Washington, Seattle.
Foster, P. G., and D. A. Hickey. 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48:284-290.
Galtier, N., and M. Gouy. 1995. Inferring phylogenies from sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA 92:11317-11321.
Galtier, N., and M. Gouy. 1998. Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871-879.
Gaut, B. S., and P. O. Lewis. 1995. Success of maximum likelihood phylogeny inference in the four-taxon case. Mol. Biol. Evol. 12:152-162.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736.
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:672-677.
Hillis, D. M. 1996. Inferring complex phylogenies. Nature 383:130-131.
Hillis, D. M. 1998. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. 47:3-8.
Hillis, D. M., and J. P. Huelsenbeck. 1992. Signal, noise, and reliability in molecular phylogenetic analyses. J. Hered. 83:189-195.
Hillis, D. M., J. P. Huelsenbeck, and C. W. Cunningham. 1994. Application and accuracy of molecular phylogenies. Science 164:671-677.
Huelsenbeck, J. P. 1997. Is the Felsenstein zone a fly trap? Syst. Biol. 46:69-74.
Huelsenbeck, J. P. 1998. Systematic bias in phylogenetic analysis: is the Strepsiptera problem solved? Syst. Biol. 47:519-537.
Huelsenbeck, J. P., D. M. Hillis, and R. Jones. 1996. Parametric bootstrapping in molecular phylogenetics: applications and performance. Pp. 19-45 in J. D. Ferraris and S. R. Palumbi, eds. Molecular zoology. Wiley and Sons, New York.
Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21-132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
Khorana, H. G. 1992. Rhodopsin, photoreceptor of the rod cell. J. Biol. Chem. 267:1-4.
Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.
Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29:170-179.
Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc. Natl. Acad. Sci. USA 91:1455-1459.
Larhammar, D., and C. Risinger. 1994. Molecular genetic aspects of tetraploidy in the common carp, Cyprinus carpio. Mol. Phylogenet. Evol. 1:59-68.
Lee, M. S. Y. 1999. Molecular phylogenies become functional. Trends Ecol. Evol. 14:177-178.
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605-612.
Muse, S. V. 1996. Estimating synonymous and nonsynonymous substitution rates. Mol. Biol. Evol. 13:105-114.
Nathans, J. 1992. Rhodopsin: structure, function, and genetics. Biochemistry 31:4923-4931.
Naylor, G. J. P., and W. M. Brown. 1997. Structural biology and phylogenetic estimation. Nature 388:527-528.
Naylor, G. J. P., and W. M. Brown. 1998. Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences. Syst. Biol. 47:61-76.
Poe, S. 1998. The effect of taxonomic sampling on accuracy of phylogeny estimation: test case of a known phylogeny. Mol. Biol. Evol. 15:1086-1090.
Prager, E. M., and A. C. Wilson. 1988. Ancient origin of lactalbumin from lysozyme: analysis of DNA and amino acid sequences. J. Mol. Evol. 27:326-335.
Rannala, B., J. P. Huelsenbeck, Z. Yang, and R. Nielsen. 1998. Taxon sampling and the accuracy of large phylogenies. Syst. Biol. 47:702-710.
Saccone, C., G. Pesole, and G. Preparata. 1989. DNA microenvironments and the molecular clock. J. Mol. Evol. 29:407-411.
Sakmar, T. P. 1998. Rhodopsin: a prototypical G protein-coupled receptor. Prog. Nucleic Acid Res. Mol. Biol. 59:1-34.
Shields, D. C., P. M. Sharp, D. G. Higgins, and F. Wright. 1988. "Silent" sites in Drosophila genes are not neutral: evidence of selection among synonymous codons. Mol. Biol. Evol. 5:704-716.
Sidow, A., and A. C. Wilson. 1990. Compositional statistics: an improvement of evolutionary parsimony and its deep branches in the tree of life. J. Mol. Evol. 31:51-68.
Sogin, M. L., G. Hinkle, and D. D. Lelpe. 1993. Universal tree of life. Nature 362:795.
Steel, M. 1994. Recovering a tree from the Markov leaf colourations it generates under a Markov model. Appl. Math. Lett. 7:19-23.
Sullivan, J., and D. L. Swofford. 1997. Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J. Mamm. Evol. 4:77-86.
Swofford, D. L. 1999. PAUP*, phylogenetic analysis using parsimony (*and other methods). Version 4.0. Sinauer, Sunderland, Mass.
Takezaki, N., and T. Gojobori. 1999. Correct and incorrect vertebrate phylogenies obtained by the entire mitochondrial DNA sequences. Mol. Biol. Evol. 16:590-601.
Templeton, A. R. 1983. Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the humans and apes. Evolution 37:221-244.
Townson, S. M., B. S. W. Chang, E. Salcedo, L. Chadwell, N. E. Pierce, and S. G. Britt. 1998. Isolation and physiological characterization of the genes encoding the blue and ultraviolet sensitive opsins of the honeybee, Apis mellifera. J. Neurosci. 18:2412-2422.
Van Den Bussche, R. A., R. J. Baker, J. P. Huelsenbeck, and D. M. Hillis. 1998. Base compositional bias and phylogenetic analyses: a test of the "flying DNA" hypothesis. Mol. Phylogenet. Evol. 10:408-416.
Wright, F. 1990. The 'effective number of codons' used in a gene. Gene 87:23-29.
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol. 10:1396-1401.
Yang, Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105-111.
Yang, Z. 1996. Phylogenetic analysis using parsimony and likelihood methods. J. Mol. Evol. 42:294-307.
Yang, Z. 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555-556.
Yang, Z., N. Goldman, and A. Friday. 1994. Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316-324.