MBE Advance Access originally published online on October 19, 2005
Molecular Biology and Evolution 2006 23(2):268-278; doi:10.1093/molbev/msj041
Research Article
Dense Taxonomic EST Sampling and Its Applications for Molecular Systematics of the Coleoptera (Beetles)
,1,2
,2




* Department of Entomology, The Natural History Museum, London, United Kingdom;
Department of Biological Sciences, Imperial College London, Silwood Park Campus, Ascot, United Kingdom; and
Department of Zoology, The Natural History Museum, London, United Kingdom
E-mail: j.hughes{at}bio.gla.ac.uk.
| Abstract |
|---|
Expressed sequence tag (EST) sequences can provide a wealth of data for phylogenetic and genomic studies, but the utility of these resources is restricted by poor taxonomic sampling. Here, we use small EST libraries (<1,000 clones) to generate phylogenetic markers across a broad sample of insects, focusing on the species-rich Coleoptera (beetles). We sequenced over 23,000 ESTs from 34 taxa, which produced 8,728 unique sequences after clustering nonredundant sequences. Between taxa, the sequences could be grouped into 731 gene clusters, with the largest corresponding to mitochondrial DNA transcripts and gene families chymotrypsin, actin, troponin, and tubulin. While levels of paralogy were high in most gene clusters, several midsized clusters including many ribosomal protein (RP) genes appeared to be free of expressed paralogs. To evaluate the utility of EST data for molecular systematics, we curated available transcripts for 66 RP genes from representatives of the major groups of Coleoptera. Using supertree and supermatrix approaches for phylogenetic analysis, the results were consistent with the emerging phylogenetic conclusions about basal relationships in Coleoptera. Numerous small EST libraries from a taxonomically densely sampled lineage can provide a core set of genes that together act as a scaffold in phylogenetic reconstruction, comparative genomics, and studies of gene evolution.
Key Words: phylogenomics gene ontology shallow genomics expressed sequence tag
| Introduction |
|---|
Current molecular systematics depends on polymerase chain reaction (PCR) amplification of a few "universal" genes to provide phylogenetic data. However, as the need for sequencing further genes is increasingly evident (Murphy et al. 2001
Most publicly available ESTs have been generated for gene discovery or to complement genome sequencing efforts. Some ESTs have been compiled into sets of nonredundant clusters in public databases such as TIGR (http://www.tigr.org/), PartiGeneDB (http://www.partigenedb.org/), and UniGene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene). However, species for EST analyses have rarely been selected based on taxonomic criteria, which limits their use for phylogenetic analyses and comparative genomics (but see the recent study of Parkinson et al. 2004b). A concerted effort to enlarge EST databases to encompass disparate taxa should alleviate these problems (Bapteste et al. 2002; Theodorides et al. 2002), and recent compilations of large multigene data sets combined from genome sequences and EST data have demonstrated the power of molecular sequences for resolving deep relationships in eukaryotes (Philippe, Lartillot, and Brinkmann 2005; Rodriguez-Ezpeleta et al. 2005). Here we explore the possibility of generating small EST databases for taxa specifically selected to obtain comprehensive coverage of target groups. We apply this approach to the Coleoptera (beetles), a group that includes nearly one-third of all known species of animals (Erwin 1982; Hammond 1992; Beutel and Haas 2000; Caterino et al. 2002) but where existing EST data are limited.
A critical problem for comparative studies is that ESTs from different taxa may not contain overlapping sets of genes. For example, given a conserved core of 6,089 orthologous genes in the genomes of Drosophila melanogaster and Anopheles gambiae (Zdobnov et al. 2002), the probability that 250 ESTs from each species retrieve a matching ortholog is only 1.68 × 10⁻³ (250/6,089 × 250/6,089) if all genes are equally represented. The challenge of matching orthologous genes between taxa is amplified by the low expression of many transcripts; sequencing of tens of thousands of ESTs in D. melanogaster (Rubin et al. 2000) or Bombyx mori (Mita et al. 2003) fell short of reaching a full complement of predicted genes. However, even relatively small EST data sets consistently recover a subset of genes with conserved roles in core biological processes such as DNA replication, transcription, and cell metabolism (Hsiao et al. 2001). These genes should be suitable for phylogenetic analysis across a broad sample of taxa.
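The overlap estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it follows the simplifying assumption stated in the text (all genes equally represented), and the function name is hypothetical.

```python
def p_matching_ortholog(ests_a: int, ests_b: int, core_genes: int) -> float:
    """Product of the two sampling fractions used in the text.

    Assumes every core gene is equally likely to be sampled, which the
    text itself flags as a simplification.
    """
    return (ests_a / core_genes) * (ests_b / core_genes)


# Values quoted in the text: 250 ESTs per species, 6,089 core orthologs.
p = p_matching_ortholog(250, 250, 6089)
print(f"{p:.4f}")  # about 0.0017
```

Because real EST libraries are dominated by highly expressed transcripts, the uniform assumption is pessimistic for genes such as RPs and optimistic for rare transcripts.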
The use of nuclear genes as a source of phylogenetic data requires an appreciation of the complex nature of genome evolution, involving gene loss, duplications, expansion of gene families, and functional diversification. Assignment of gene orthology is difficult even between fairly closely related groups such as the dipteran A. gambiae and D. melanogaster, where genes diversified independently in each lineage (Zdobnov et al. 2002). Increased taxon sampling can improve the confidence of orthology assignments by identifying the origin of gene copies, facilitating inferences on gene duplications, and clarifying the relationship between gene content and the diversity of lineages (Parkinson et al. 2004b).
Here, we test the utility of dense taxonomic EST sampling, generating relatively small numbers of ESTs (<1,000 clones) for each major group in the focal Coleoptera and several related groups of insects. Existing studies of basal relationships in the Coleoptera to date were based on the mitochondrial cox1 (Howland and Hewitt 1995) and the nuclear small subunit rRNA genes (Caterino et al. 2002), but the use of a single locus in these cases was insufficient to resolve the main phylogenetic questions. Novel sources of phylogenetic information are highly desirable and should preferentially rely on multiple single-copy nuclear genes. Using EST-based approaches that do not rely on degenerate PCR would be a great advantage in this diverse group of insects. We therefore used the Coleoptera to test critical questions about the feasibility of dense EST sampling for molecular systematics. Specifically, we investigated the minimum size of EST libraries necessary to produce sufficient overlap in gene representation between libraries and assessed what kind of genes show the widest representation across small EST libraries. Further, the degree of paralogy in EST data remains insufficiently known but is a critical issue if genes from different species libraries are used for phylogenetic reconstruction. The utility of the approach is shown here by producing phylogenetic trees for the basal groups of Coleoptera from 66 genes coding for ribosomal proteins (RP).
| Materials and Methods |
|---|
Insect Specimens, RNA Extraction, and cDNA Library Construction
Twenty-five species of insects, of which 14 were Coleoptera, and two outgroups were used for library construction (table 1). RNA was obtained from entire adult specimens, except for the use of larval wing discs in the butterfly Papilio dardanus (A. Cieslak and A. P. Vogler, unpublished data) and testes in the tiger beetles Cicindela litorea and Cicindela littoralis (J. Galian and A. P. Vogler, unpublished data). Seven published coleopteran EST libraries (Theodorides et al. 2002
For most libraries, ESTs were sequenced in both directions to provide longer and more accurate sequences, which is critical for phylogenetic analysis. Sequencher 4.1 (Gene Codes Corp., Ann Arbor, Mich.) was used for sequence editing, including the automated removal of vector sequences and poor-quality data. Sequences were further edited manually to recall ambiguities and resolve conflicting base calls in forward and reverse reads where multiple clones were available. Edited sequences were clustered into contigs in Sequencher at high stringency to obtain "tentative unique genes" (TUGs) for each library and exported for further analysis. We also used a fully automated method for sequence editing with the Trace2dbest perl script (Parkinson and Blaxter 2004
Sequence Clustering and Phylogenetic Analysis
EST sequences were subjected to Blast comparisons against GenBank using BlastN (nucleotide-nucleotide searches) and TBlastX (conceptual protein translations) (Altschul et al. 1990). Where significant matches were found (E value < 10⁻⁵) and putative gene identity was established by these sequence comparisons, TUGs were assigned gene ontology (GO) classifications by comparing deduced amino acids with the Uniprot database and parsing of the Uniprot GO table (http://www.ebi.ac.uk/uniprot/index.html). Gene classifications were accepted if our data had at least 30% similarity over >100 amino acids with curated data and a significant E value (< 10⁻⁵) in a TBlastX search. When parsed for GO classification, we accepted identity from lower TBlastX matches if top matches did not contain GO classifications. TBlastX searches were used to calculate the proportion of TUGs which matched sets of proteins from D. melanogaster, Homo sapiens, and Caenorhabditis elegans with E values < 10⁻⁵.
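The acceptance criteria for gene classification can be expressed as a simple filter over Blast hits. The following is a minimal sketch, not the pipeline used in the study: the `Hit` record and its field names are hypothetical, and the thresholds are read as E value below 10⁻⁵, at least 30% similarity, and an alignment longer than 100 amino acids.

```python
from dataclasses import dataclass

E_CUTOFF = 1e-5        # assumed reading of the significance threshold
MIN_SIMILARITY = 30.0  # percent similarity (assumed to be a lower bound)
MIN_ALIGN_LEN = 100    # alignment must exceed this many amino acids


@dataclass
class Hit:
    """Hypothetical container for one TBlastX hit."""
    query: str
    subject: str
    percent_similarity: float
    align_len: int
    evalue: float


def is_significant(hit: Hit) -> bool:
    """A hit counts as significant when its E value is below the cutoff."""
    return hit.evalue < E_CUTOFF


def accept_classification(hit: Hit) -> bool:
    """Apply all three criteria used to accept a gene classification."""
    return (
        is_significant(hit)
        and hit.percent_similarity >= MIN_SIMILARITY
        and hit.align_len > MIN_ALIGN_LEN
    )


# Invented example hits, for illustration only.
hits = [
    Hit("tug1", "protA", 62.0, 150, 1e-40),  # passes all criteria
    Hit("tug2", "protB", 45.0, 80, 1e-12),   # alignment too short
    Hit("tug3", "protC", 28.0, 200, 1e-3),   # not significant
]
accepted = [h.query for h in hits if accept_classification(h)]
```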
For clustering, similarity between TUGs within and between libraries was determined using TBlastX searches. For each TUG, its TBlastX hits were examined, and if the similarity was above a specified threshold, then a cluster was made. These first-pass clusters contained many TUGs in more than one cluster, so these clusters were themselves iteratively merged and redundant sequences removed, until there were no sequences contained in more than one cluster. The Python scripts used for clustering are available from PGF on request. TUGs clustered in searches were translated in Sequencher and aligned with ClustalX (Thompson et al. 1997).
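The clustering procedure described above (first-pass clusters per TUG, then iterative merging until no sequence sits in more than one cluster) can be sketched as follows. This is an illustration under stated assumptions, not the authors' scripts: hits are represented as hypothetical `(query, subject, evalue)` tuples, and similarity is thresholded on the E value.

```python
def first_pass_clusters(hits, max_evalue):
    """Build one cluster per query TUG from its hits below the E-value cutoff."""
    clusters = {}
    for query, subject, evalue in hits:
        clusters.setdefault(query, {query})
        if evalue < max_evalue:
            clusters[query].add(subject)
    return list(clusters.values())


def merge_overlapping(clusters):
    """Merge clusters that share any member until all clusters are disjoint."""
    merged = [set(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        result = []
        for cluster in merged:
            for existing in result:
                if existing & cluster:
                    existing |= cluster  # absorb the overlapping cluster
                    changed = True
                    break
            else:
                result.append(set(cluster))
        merged = result
    return merged


# Invented TUG identifiers and E values, for illustration only.
example_hits = [
    ("tugA", "tugB", 1e-20),
    ("tugB", "tugC", 1e-12),
    ("tugD", "tugE", 1e-30),
    ("tugF", "tugA", 1e-2),   # too weak, so tugF stays a singleton
]
final = merge_overlapping(first_pass_clusters(example_hits, max_evalue=1e-10))
final_sorted = sorted(sorted(c) for c in final)
```

Merging on any shared member is single-linkage behaviour, which helps explain why lenient thresholds produce large clusters that mix paralogs.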
For phylogenetic analysis from these clusters, we focused specifically on the RP genes. After minor sequence editing and verification of transcript fidelity, the most complete amino acid sequences were used for conceptual translations using ClustalX and submitted to European Molecular Biology Laboratory nr databases (Supplementary Material A, see Supplementary Material online). Three further Coleoptera, Tribolium castaneum (J. Savard and D. Tautz, personal communication), Callosobruchus maculatus (J. H. F. Pedra, A. Brandt, R. Westerman, H.-M. Li, J. Romero-Severson, L. L. Murdock, and B. R. Pittendrigh, personal communication), and Ips pini (Eigenheer et al. 2003) with public ESTs in GenBank were also searched for RPs and used in the phylogenetic analysis. After excluding the smallest EST libraries (Silpha atrata and Tribolium confusum), we concatenated data from 66 RPs found in four or more species of Coleoptera, which correspond to minimal phylogenetic clusters (sensu Driskell et al. 2004). Regions of uncertain amino acid alignment homology were removed using Gblocks 0.91b (Castresana 2000). Phylogenetic analysis was conducted with parsimony, with a heuristic search strategy (random taxon addition, 100 replicates; Tree Bisection-Reconnection branch swapping). We used PAUP* to calculate nonparametric bootstrap scores (1,000 replicates) and Bremer support, facilitated by TreeRot 2.0 (Sorenson 1999). Phyml v2.4.4 (Guindon and Gascuel 2003) was used for maximum likelihood (ML) analyses with 100 bootstraps, using both the WAG substitution model, suitable for soluble proteins such as RPs, and the Dayhoff model selected with ModelGenerator (http://bioinf.nuim.ie/software/modelgenerator). With both models, we accounted for the among-site rate variation using a gamma distribution and a proportion of invariant sites (pInvar). Bayesian analyses were also conducted using the latter model on the concatenated multigene data set with MrBayes v3.1.1 (Huelsenbeck and Ronquist 2001). Nodal support was assessed as posterior probability from two independent runs each with four chains of 1,000,000 generations in the Markov chain Monte Carlo procedure (the first 500,000 generations were discarded as "burn-in"). In an alternative supertree approach, the same amino acid alignments from each RP gene were first used individually for parsimony analysis using branch and bound searches. For each RP, the strict consensus tree was saved to the file, and resolved nodes were recoded as binary state using matrix representation with parsimony coding (Baum 1992; Ragan 1992) with Clann 2.0.1 (Creevey and McInerney 2005).
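Matrix representation with parsimony (MRP) coding turns each resolved node of a source tree into one binary character. The sketch below illustrates the general idea only and is not the Clann implementation: trees are given as hypothetical nested tuples of taxon labels, taxa inside a clade are coded `1`, other taxa in the same tree `0`, and taxa absent from that tree `?`. The all-zero outgroup row often added in MRP analyses is omitted.

```python
def leaves(tree):
    """Return the set of taxon labels under a node (a label or nested tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """List the leaf sets of all internal nodes below the root, in preorder."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaves(node))
        for child in node:
            walk(child, False)

    walk(tree, True)
    return found


def mrp_matrix(trees):
    """Code every non-root clade of every source tree as one binary character."""
    all_taxa = sorted(set().union(*(leaves(t) for t in trees)))
    rows = {taxon: [] for taxon in all_taxa}
    for tree in trees:
        present = leaves(tree)
        for clade in clades(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon].append("?")   # taxon missing from this gene tree
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(states) for taxon, states in rows.items()}


# Two invented gene trees with partially overlapping taxon sets.
gene_tree_1 = (("A", "B"), ("C", "D"))
gene_tree_2 = ("A", ("B", "E"))
matrix = mrp_matrix([gene_tree_1, gene_tree_2])
```

The `?` entries show how gene trees with different taxon sets, as is typical for small EST libraries, can still be combined into a single matrix for a supertree search.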
| Results |
|---|
Characteristics of the Libraries
Among the EST libraries for 32 insect species, plus two arthropod outgroups (a spider and millipede), we sampled 20 species of Coleoptera, with representatives from each of the four suborders, and a selection of all major groups (Series) in the large suborder Polyphaga. Together, the libraries contained 23,026 EST sequences with high-quality base calls, ranging from 29 to 1,341 ESTs per taxon (table 1). In total, 8,728 TUGs were obtained after semimanual editing (Materials and Methods). Automated editing of the same data produced 8,910 unique sequences, with 7% fewer sequences in redundant groups and 12% more singletons (table 1). Overall sequence similarity and statistical analysis (below) produced results similar to those from the manually edited sequences, and hence, the automated EST clustering appears sufficiently reliable for the initial compilation of large data sets, in particular as sequence quality increases with greater number of ESTs in a TUG. According to the GO categorization of the 34 EST libraries (table 2), the nuclear genes most frequently detected were "housekeeping" genes, including RPs and enzymes. Transcripts from mitochondrial genes were also prevalent, with an average of six mitochondrial transcripts per taxon. Although mitochondrial sequences present in EST libraries are an artifact of the reverse transcriptase-PCR procedure, they provide valuable phylogenetic markers. In contrast, relatively few developmental proteins, transcription factors, and elongation factors (EFs) were detected among ESTs. A large number of ESTs showed significant similarity to genes of unknown function in the Uniprot database (5%–37% depending on the library), and in each taxon, a large proportion of the sequences (35%–80%) did not have any significant public database matches within the search parameters.
When our ESTs were compared against the genes of D. melanogaster, 50% of sequences had significant matches with E values < 10⁻⁵ (ranging from 21% to 67% depending on probe species; table 2). Overall, our insect ESTs had significantly more matches with D. melanogaster sequences than with H. sapiens or C. elegans (df = 33, t = 6.5 and 5.6, respectively, P < 0.001). The insect ESTs showed fewer matches with C. elegans (df = 33, t = 8.7, P < 0.001) than with H. sapiens, despite the presumed closer relationships of nematodes with insects based on rRNA (Aguinaldo et al. 1997
jhughes/SimiTri/), as expected with decreased phylogenetic proximity.
Clustering Between Libraries
The presence of putative orthologs across libraries is critical for EST data to be useful in molecular systematics. Using the BlastN algorithm, we found that between 10% and 53% of unique sequences in a given library had matches (E value < 10⁻⁵) with the data set containing all the other libraries (table 2). After conceptual translation, pairwise sequence matches (TBlastX E < 10⁻⁵) ranged from 1% to 29% of unique sequences shared between any two libraries, with an average of 12% (Supplementary Material B, see Supplementary Material online). The number of intralibrary matches was slightly lower, with 0% to 23% of sequences showing significant matches within the same library in a protein-level search (table 2), but still indicated a high proportion of paralogy in each library. Manual editing of primary sequences increased the between-library matches and the size of clusters at stringent cutoff values when compared to the automated approach (10⁻⁸⁰: t = 2.2, df = 34, P < 0.05; 10⁻¹⁰⁰: t = 2.4, df = 34, P < 0.05; 10⁻¹⁵⁰: t = 2.1, df = 34, P < 0.05; Supplementary Material C, see Supplementary Material online).
When sequences with significant similarity were clustered across all libraries, up to 731 clusters included TUGs from two or more taxa, although no TUG had representatives in more than 28 of the 34 libraries. A total of 154 TUGs showed significant Blast matches within a single taxon only (Supplementary Material C, see Supplementary Material online). Most of the largest clusters, with TUGs in more than eight species at an E value < 10⁻¹⁰ (table 3; Supplementary Material D, see Supplementary Material online), included genes for which exceptional levels of mRNA expression have been established (Hsiao et al. 2001). The largest clusters included RPs and mitochondrial genes. Sequences from known protein families were also present in the clusters, such as tubulins, myosins, and troponin I. Three clusters contained EF genes (EF-1 alpha homologs, EF-1 beta, and EF-2). Interestingly, there were four clusters of genes that did not have any Blast matches, and six clusters that only showed matches to D. melanogaster and A. gambiae genes of unknown function (Supplementary Material D, see Supplementary Material online). Along with mitochondrial genes, several nuclear genes detected in multiple EST libraries have been used widely in insect molecular systematic studies (Caterino, Cho, and Sperling 2000). These included 28S rRNA (represented in EST libraries of 14 species), EF-1 alpha (13 species), H3 histone (7 species), and Cu,Zn-superoxide dismutase (7 species).
The number of clusters and their size were strongly dependent on the significance level of the Blast search partly due to the separation of paralogs at higher stringency. This was evident in tubulins (breaking up into alpha and beta superfamilies at higher stringency), myosins (separating to light chain I and regulatory light chain II), and troponin I (separating to troponin I a1 and troponin I b1). Table 4 presents those clusters with the number of unique sequences equal to the number of taxa (libraries), i.e., where each taxon contributes only one ortholog. Such potentially paralogy-free clusters included a maximum of 14 taxa. Many of these were identified as coding RP genes and were used to test the phylogenetic utility of the EST database.
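The criterion for a potentially paralogy-free cluster (as many unique sequences as taxa) is easy to state in code. The sketch below is an illustration with invented identifiers; it assumes each cluster is a collection of hypothetical `(taxon, tug_id)` pairs.

```python
def is_potentially_paralogy_free(cluster) -> bool:
    """True when the number of unique sequences equals the number of taxa,
    i.e., every taxon contributes exactly one sequence to the cluster."""
    unique_sequences = set(cluster)
    taxa = {taxon for taxon, _ in unique_sequences}
    return len(unique_sequences) == len(taxa)


def candidate_ortholog_clusters(clusters, min_taxa=2):
    """Keep clusters that span at least min_taxa taxa with one sequence each."""
    kept = []
    for name, members in clusters.items():
        n_taxa = len({taxon for taxon, _ in members})
        if n_taxa >= min_taxa and is_potentially_paralogy_free(members):
            kept.append(name)
    return kept


# Invented clusters: taxon labels and TUG identifiers are placeholders.
clusters = {
    "RP_cluster": [("taxon1", "t1_007"), ("taxon2", "t2_113"), ("taxon3", "t3_020")],
    "tubulin_cluster": [("taxon1", "t1_050"), ("taxon1", "t1_051"), ("taxon2", "t2_090")],
}
candidates = candidate_ortholog_clusters(clusters)
```

A cluster passing this test is only a candidate: an unsampled paralog in a small library would go undetected, which is why the text describes such clusters as potentially paralogy free.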
The Higher Coleopteran (Beetle) Phylogeny from RPs
Out of a complete set of 76 nonacidic RPs found in insects (Landais et al. 2003
Overall, these data suggest that the number of detected RP genes increases linearly with greater numbers of ESTs (fig. 1, R² = 0.7006; y = 0.0278x) and further predict that libraries of approximately 2,000 ESTs obtained from whole adult specimens can yield complete sets of RPs. The linear increase is consistent with the fact that different RP genes were recovered in different organisms (fig. 2), even if a similar total number of RPs was detected. This might indicate that most RP genes have a similar chance to be cloned in our relatively small libraries, but the total ESTs sequenced needs to be higher than a few hundred ESTs to obtain the complete set of 76 RPs.
Phylogenetic analysis was conducted to establish basal relationships in the Coleoptera with 66 RP genes using both a "supertree" (derived from topology of individual gene trees) and a "supermatrix" (derived by simultaneous phylogenetic analysis of all sequence information). Ten additional RP genes were detected in less than 4 out of the 20 Coleoptera species and could not be used for phylogenetic analysis. After removing these sequences and the alignment-sensitive regions from all other genes (Materials and Methods), the final data matrix included a total of 10,403 amino acid residues, with individual taxa represented by 447 (Platystomos) to 9,151 (Tribolium) residues with an average 2,976 ± 1,892 residues and an overall degree of matrix completion of 28.6%. Individual genes were represented in between 4 and 10 taxa. When all 76 RPs are considered, the mean number of taxa per gene was 5.68. All methods of tree construction (maximum parsimony, ML, Bayesian, and supertree) produced similar tree topologies (fig. 3). At the deepest nodes, when rooted with the suborder Archostemata, the remaining coleopteran suborders resolved as (Adephaga (Myxophaga, Polyphaga)), although the supertree analyses placed Myxophaga (Sphaerius sp.) within the Polyphaga. In all analyses, the Elateriformia (one of the five Series of families of Polyphaga) was a paraphyletic assemblage of basal Polyphaga, with the Eucinetidae (Eucinetus sp.) sister to the remaining Series, Staphyliniformia, Scarabaeiformia, and Cucujiformia. The close relationship of Scarabaeus laticollis (Scarabaeiformia), Georissus sp., and Hister sp. (Staphyliniformia) supported the Haplogastra uniting both Series (Crowson 1955
| Discussion |
|---|
EST databases are rapidly growing, with approximately 27.6 million entries in GenBank as of June 2005 (http://www.ncbi.nlm.nih.gov/dbEST/). Yet, until recently, the taxonomic coverage of the Class Insecta has been limited to 8 of the 25 or so insect orders. Within the largest order, Coleoptera, three libraries have become available recently, but taxonomically, these represent only a very limited group within one of the Series of Polyphaga. (Two further libraries were added to dbEST since our analysis was conducted.) EST representation in the insects has been severely biased toward Diptera, comprising 15 of 47 holometabolan insects as of June 2005 and 628,300 out of 919,200 EST sequences (excluding our data). Although the EST data sets presented here are small in comparison with other arthropod EST projects, we have almost doubled the taxonomic coverage of arthropod orders, including the first EST libraries for Strepsiptera, Rhaphidiodea, Trichoptera, Mecoptera, and Thysanura, and added over 11,000 ESTs from the Coleoptera, arguably the most diverse insect order, sampled from the broadest possible taxonomic diversity. Our main aim was to test whether generating a small number of ESTs from a broad sample of taxa would be a suitable approach to phylogeny reconstruction. The findings confirm that even small libraries (<1,000 clones) show high levels of matching TUGs. Even with an average library size of 257 unique sequences, we recovered a conserved core of genes represented consistently across libraries. Many of these genes had not previously been used for phylogeny reconstruction, increasing the spectrum of molecular markers available to insect systematics. The most widely detected clusters contained mitochondrial DNA transcripts, enzymes, and RPs. However, tree construction was impeded by the great proportion of missing data entries, in particular due to several of the smaller libraries in our data set. Based on the completeness of RP representation in the libraries (fig. 1), we extrapolate that approximately 2,000 ESTs are needed to recover these highly expressed genes consistently when extracting total RNA from a whole adult specimen. Using embryonic tissues, for example, with a high rate of biosynthesis may increase the proportion of RPs in the libraries and lower the number of ESTs needed to generate the complete set of RPs in each taxon.
Such a large number of sequences may appear to be a costly way to establish phylogenetic relationships between taxa. However, the success of sequencing multiple single-copy loci to resolve the deeper nodes within the Tree of Life (e.g., in mammals: Murphy et al. 2001; Teeling et al. 2005) cannot easily be extended to most groups via traditional PCR methods using degenerate primers. Our efforts to amplify even a few nonstandard single-copy genes consistently within or across different superfamilies of the Coleoptera have largely failed (unpublished data), and the best results to date were obtained when the primers have been based on the EST sequences obtained here (Pons et al. 2004). As automation advances and the cost of sequencing decreases, dense EST sampling is likely to become a more cost-effective approach for acquiring single-copy nuclear markers for the deep-level molecular systematics of many groups.
A perhaps unexpected finding was the high degree of paralogy in most clusters evident from the large number of within-library similarity hits. Paralogs can prohibit the determination of species relationships and mislead phylogenetic inferences if they are not detected. However, tentative orthologous clusters (i.e., with only a single member per taxon) were readily detected and included up to 14 of the 34 taxa (some of which were present in very small libraries). In future, some of these genes may prove not to be paralogy free, but it is reassuring that they include a number of housekeeping genes, such as RPs, which are already known to be largely paralogy free across Metazoa (Landais et al. 2003; Philippe et al. 2004). Other large clusters that were paralogy free under high clustering stringency only (table 3) will require further analyses to separate different paralogy groups.
For molecular systematics, EST sequencing exposes us to hundreds of loci for which we have no existing information about the pattern of molecular variation and phylogenetic information content. At this early stage of comparative EST sequencing, it already seems obvious that only a minority of the available genes will emerge as useful for reconstructing phylogenetic relationships at the deeper hierarchical levels, whereas most gene sequences will be shown to suffer from shallow paralogy possibly linked to functional diversity. As EST sequences tend to be short, well-supported phylogenetic trees will only emerge when several genes of overlapping resolution are combined, together enhancing the phylogenetic signal (Olmstead and Sweere 1994; Gatesy et al. 1999). However, simultaneous analysis is only justified once orthology has been established.
Clearly, the RP genes provide such a resource and were used here to provide valuable insights into the phylogeny of Coleoptera (fig. 3). The relationships among the four suborders of Coleoptera have long been controversial (Hennig 1981; Lawrence and Newton 1995; Beutel and Haas 2000), with each of the three possible arrangements supported by reputable studies (Kukalova-Peck and Lawrence 1993; Beutel and Haas 2000; Caterino et al. 2002). The supermatrix analysis based on 66 RPs suggests the placement of Myxophaga as sister to Polyphaga which is consistent with the traditional view, going back to Crowson (1955, 1960), and several later studies based on various morphological character systems. These results conflict with those from 18S rRNA, which place Polyphaga with Adephaga as the sister, not Myxophaga (Caterino et al. 2002), but phylogenetic conclusions from this gene are affected by length variation and the rate heterogeneity, and hence, independent evidence from RPs is very valuable. Within the Polyphaga, the EST data supported the general ideas about basal relationships of the Series (the five traditional family groups of Polyphaga), including the paraphyly of Staphyliniformia with respect to Scarabaeiformia (Korte et al. 2004; Caterino et al. 2005), the paraphyly of Elateriformia and their basal position within Polyphaga (Caterino et al. 2002), and the monophyly of the large Cucujiformia and the large phytophagous Chrysomeloidea and Curculionoidea ("Phytophaga").
In conclusion, we used dense EST sampling for molecular systematics, to avoid difficult PCR-based methods and extend the range of gene markers for multigene phylogenetics. Comparable studies in nematodes (Parkinson et al. 2004b) and Apicomplexa (Li et al. 2003) focused on gene discovery and comparative genomics, and it will be interesting to use these much larger EST data for phylogenetic analysis in the way proposed here. It is evident from our analysis that phylogenetic inferences will suffer from the unexpectedly high level of paralogy affecting most of the highly expressed loci, unless paralogy groups whose origin precedes the separation of the focal taxa can be separated a priori (Philippe, Lartillot, and Brinkmann 2005; Rodriguez-Ezpeleta et al. 2005).
Many questions remain for the use of the broad EST approach, for example, which molecular techniques are most suitable for enriching the desired loci prior to sequencing or the utility of tissue-specific libraries to reduce the recovery of paralogous sequences. For example, libraries of P. dardanus were obtained from wing discs and included a much higher proportion of RPs than most of the other libraries which were obtained from total adult tissue (table 2). Furthermore, for comparative studies, bidirectional sequencing of ESTs and careful curation of redundant sequences is important to mitigate problems otherwise introduced by sequencing and partial gene sequences. However, the best strategy might be to sequence the majority of ESTs in a single direction and only sequence the reverse direction when the full length of specific genes is missing.
RP genes apparently were little affected by recent paralogy and provide a formidable resource for deep-level phylogenetics. With 66 genes included here in an analysis of Coleoptera, this represents a great advance over the existing trees from single genes (Howland and Hewitt 1995; Caterino et al. 2002). However, as the matrix includes some 71.4% of missing data, support levels inevitably will be low (Wiens 2003; Hughes and Vogler 2004; Philippe et al. 2004) even if the effect may be less pronounced with a greater number of genes (Driskell et al. 2004). Yet, presenting just under 9,000 unique nuclear sequences, the current study provides a foundation for multilocus phylogenetics of Coleoptera and other insect groups. Dense taxonomic EST sampling will offer us new opportunities for phylogenetic analysis while also providing a less myopic glimpse of the functional and evolutionary diversity in the most species-rich lineage on Earth.
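The missing-data figure quoted above follows from the matrix dimensions reported in the Results (10,403 aligned residues; an average of 2,976 residues per taxon). The short sketch below reproduces that arithmetic; the function names are hypothetical, and the per-taxon list in the second function is a generic input rather than data from the study.

```python
def completeness_from_mean(mean_residues: float, alignment_length: int) -> float:
    """Fraction of matrix cells filled, given the mean residues per taxon."""
    return mean_residues / alignment_length


def completeness_from_counts(residues_per_taxon, alignment_length: int) -> float:
    """Same quantity computed from per-taxon residue counts."""
    filled = sum(residues_per_taxon)
    total_cells = len(residues_per_taxon) * alignment_length
    return filled / total_cells


# Values reported in the Results: mean 2,976 residues, 10,403 columns.
complete = completeness_from_mean(2976, 10403)
percent_complete = round(100 * complete, 1)
percent_missing = round(100 * (1 - complete), 1)
print(percent_complete, percent_missing)  # 28.6 71.4
```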
| Supplementary Material |
|---|
Supplementary Materials A–D are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
We are grateful to Sue Lomas and Francis Wright at the sequencing facilities at Silwood Park and Derek Huntley, James Abbott, and Gail Bartlett from the Bioinformatics Support Service at Imperial College. We thank Hans Pohl, Ignacio Ribera, Michael Balke, and Peter Hammond for contributing insect specimens. We greatly thank Herve Philippe and anonymous reviewers for useful comments, and Miquel Arnedo, Alexandra Cieslak, Jose Galián, Jesus Gómez-Zurita, Fatos Kopliku, and Nathalie Tristem for contributing additional library construction and sequencing. This project was funded by Biotechnology and Biological Sciences Research Council grant 49/G14548 to Michael Caterino, A.P.V. and P.G.F. and a Ph.D. studentship to S.J.L. Additional funding was from the Department of Trade and Industry, United Kingdom and an Alexander S. Onassis foundation scholarship to A.P.
| Footnotes |
|---|
1 Present address: Division of Environmental and Evolutionary Biology, Institute of Biomedical and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow, United Kingdom
2 These authors contributed equally to the work.
Herve Philippe, Associate Editor
| References |
|---|
Aguinaldo, A. M., J. M. Turbeville, L. S. Linford, M. C. Rivera, J. R. Garey, R. A. Raff, and J. A. Lake. 1997. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 387:489–493.
Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–410.
Bapteste, E., H. Brinkmann, J. A. Lee et al. (11 co-authors). 2002. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Natl. Acad. Sci. USA 99:1414–1419.
Baum, B. R. 1992. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon 41:3–10.
Beutel, R. G., and F. Haas. 2000. Phylogenetic relationships of the suborders of Coleoptera (Insecta). Cladistics 16:103–141.
Blair, J. E., K. Ikeo, T. Gojobori, and S. B. Hedges. 2002. The evolutionary position of nematodes. BMC Evol. Biol. 2:7.
Castresana, J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540–552.
Caterino, M. S., S. Cho, and F. A. Sperling. 2000. The current state of insect molecular systematics: a thriving Tower of Babel. Annu. Rev. Entomol. 45:1–54.
Caterino, M. S., T. Hunt, and A. P. Vogler. 2005. On the constitution and phylogeny of Staphyliniformia (Insecta: Coleoptera). Mol. Phylogenet. Evol. 34:655–672.
Caterino, M. S., V. L. Shull, P. M. Hammond, and A. P. Vogler. 2002. Basal relationships of Coleoptera inferred from 18S rDNA sequences. Zool. Scr. 31:41–49.
Creevey, C. J., and J. O. McInerney. 2005. Clann: investigating phylogenetic information through supertree analyses. Bioinformatics 21:390–392.
Crowson, R. A. 1955. The natural classification of the families of the Coleoptera. Nathaniel Lloyd, London.
Crowson, R. A. 1960. The phylogeny of Coleoptera. Annu. Rev. Entomol. 5:111–134.
Dopazo, H., and J. Dopazo. 2005. Genome-scale evidence of the nematode-arthropod clade. Genome Biol. 6:R41.
Driskell, A. C., C. Ane, J. G. Burleigh, M. M. McMahon, B. C. O'Meara, and M. J. Sanderson. 2004. Prospects for building the tree of life from large sequence databases. Science 306:1172–1174.
Eigenheer, A. L., C. I. Keeling, S. Young, and C. Tittiger. 2003. Comparison of gene representation in midguts from two phytophagous insects, Bombyx mori and Ips pini, using expressed sequence tags. Gene 316:127–136.
Erwin, T. L. 1982. Tropical forests: their richness in Coleoptera and other arthropod species. Coleopt. Bull. 36:74–75.
Ewing, B., and P. Green. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186–194.
Ewing, B., L. Hillier, M. C. Wendl, and P. Green. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175–185.
Gatesy, J., M. Milinkovitch, V. Waddell, and M. Stanhope. 1999. Stability of cladistic relationships between Cetacea and higher-level artiodactyl taxa. Syst. Biol. 48:6–20.
Guindon, S., and O. Gascuel. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696–704.
Hammond, P. M. 1992. Species inventory. Pp. 17–39 in B. Groombridge, ed. Global biodiversity, status of the Earth's living resources. Chapman and Hall, London.
Hedges, S. B., J. E. Blair, M. L. Venturi, and J. L. Shoe. 2004. A molecular timescale of eukaryote evolution and the rise of complex multicellular life. BMC Evol. Biol. 4:2.
Hennig, W. 1981. Insect phylogeny. Academic Press, New York.
Howland, D. E., and G. M. Hewitt. 1995. Phylogeny of the Coleoptera based on mitochondrial cytochrome oxidase I sequence data. Insect Mol. Biol. 4:203–215.
Hsiao, L. L., F. Dangond, T. Yoshida et al. (22 co-authors). 2001. A compendium of gene expression in normal human tissues. Physiol. Genomics 7:97–104.
Huelsenbeck, J. P., and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–755.
Hughes, J., and A. P. Vogler. 2004. The phylogeny of acorn weevils (genus Curculio) from mitochondrial and nuclear DNA sequences: the problem of incomplete data. Mol. Phylogenet. Evol. 32:601–615.
Korte, A., I. Ribera, R. G. Beutel, and D. Bernhard. 2004. Interrelationships of Staphyliniform groups inferred from 18S and 28S rDNA sequences, with special emphasis on Hydrophiloidea (Coleoptera, Staphyliniformia). J. Zool. Syst. Evol. Res. 42:281–288.
Kukalova-Peck, J., and J. F. Lawrence. 1993. Evolution of the hind wing in Coleoptera. Can. Entomol. 125:181–258.
Landais, I., M. Ogliastro, K. Mita, J. Nohata, M. Lopez-Ferber, M. Duonor-Cerutti, T. Shimada, P. Fournier, and G. Devauchelle. 2003. Annotation pattern of ESTs from Spodoptera frugiperda Sf9 cells and analysis of the ribosomal protein genes reveal insect-specific features and unexpectedly low codon usage bias. Bioinformatics 19:2343–2350.
Lawrence, J. F., and A. F. Newton Jr. 1995. Families and subfamilies of Coleoptera (with selected genera, notes, references and data on family-group names). Pp. 779–913 in J. Pakaluk and S. A. Slipinski, eds. Biology, phylogeny and classification of Coleoptera. Papers celebrating the 80th birthday of Roy A. Crowson. Muzeum I Instytut Zoologii PAN, Warsaw, Poland.
Li, L., B. P. Brunk, J. C. Kissinger et al. (20 co-authors). 2003. Gene discovery in the apicomplexa as revealed by EST sequencing and assembly of a comparative gene database. Genome Res. 13:443–454.
Mita, K., M. Morimyo, K. Okano et al. (12 co-authors). 2003. The construction of an EST database for Bombyx mori and its application. Proc. Natl. Acad. Sci. USA 100:14121–14126.
Murphy, W. J., E. Eizirik, S. J. O'Brien et al. (11 co-authors). 2001. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294:2348–2351.
Olmstead, R. G., and J. A. Sweere. 1994. Combining data in phylogenetic systematics: an empirical approach using 3 molecular data sets in the Solanaceae. Syst. Biol. 43:467–481.
Parkinson, J., A. Anthony, J. Wasmuth, R. Schmid, A. Hedley, and M. Blaxter. 2004a. PartiGene: constructing partial genomes. Bioinformatics 20:1398–1404.
Parkinson, J., and M. Blaxter. 2003. SimiTri: visualizing similarity relationships for groups of sequences. Bioinformatics 19:390–395.
Parkinson, J., and M. Blaxter. 2004. Expressed sequence tags: analysis and annotation. Methods Mol. Biol. 270:93–126.
Parkinson, J., D. B. Guiliano, and M. Blaxter. 2002. Making sense of EST sequences by CLOBBing them. BMC Bioinformatics 3:31.
Parkinson, J., M. Mitreva, C. Whitton et al. (12 co-authors). 2004b. A transcriptomic analysis of the phylum Nematoda. Nat. Genet. 36:1259–1267.
Philip, G. K., C. J. Creevey, and J. O. McInerney. 2005. The Opisthokonta and the Ecdysozoa may not be clades: stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol. Biol. Evol. 22:1175–1184.
Philippe, H., N. Lartillot, and H. Brinkmann. 2005. Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol. Biol. Evol. 22:1246–1253.
Philippe, H., E. A. Snell, E. Bapteste, P. Lopez, P. W. Holland, and D. Casane. 2004. Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol. Biol. Evol. 21:1740–1752.
Pons, J., T. Barraclough, K. Theodorides, A. Cardoso, and A. Vogler. 2004. Using exon and intron sequences of the gene Mp20 to resolve basal relationships in Cicindela (Coleoptera: Cicindelidae). Syst. Biol. 53:554–570.
Ragan, M. A. 1992. Matrix representation in reconstructing phylogenetic relationships among the eukaryotes. Biosystems 28:47–55.
Rodriguez-Ezpeleta, N., H. Brinkmann, S. C. Burey, B. Roure, G. Burger, W. Loffelhardt, H. J. Bohnert, H. Philippe, and B. F. Lang. 2005. Monophyly of primary photosynthetic eukaryotes: green plants, red algae, and glaucophytes. Curr. Biol. 15:1325–1330.
Rubin, G. M., L. Hong, P. Brokstein, M. Evans-Holm, E. Frise, M. Stapleton, and D. A. Harvey. 2000. A Drosophila complementary DNA resource. Science 287:2222–2224.
Rudd, S. 2003. Expressed sequence tags: alternative or complement to whole genome sequences? Trends Plant Sci. 8:321–329.
Sorenson, M. D. 1999. TREEROT. Version 2c. Boston University, Boston.
Teeling, E. C., M. S. Springer, O. Madsen, P. Bates, S. J. O'Brien, and W. J. Murphy. 2005. A molecular phylogeny for bats illuminates biogeography and the fossil record. Science 307:580–584.
Theodorides, K., A. De Riva, J. Gomez-Zurita, P. G. Foster, and A. P. Vogler. 2002. Comparison of EST libraries from seven beetle species: towards a framework for phylogenomics of the Coleoptera. Insect Mol. Biol. 11:467–475.
Thompson, J. D., T. J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25:4876–4882.
Wheeler, W. C., M. Whiting, Q. D. Wheeler, and J. M. Carpenter. 2001. The phylogeny of the extant hexapod orders. Cladistics 17:113–169.
Wiens, J. J. 2003. Missing data, incomplete taxa, and phylogenetic accuracy. Syst. Biol. 52:528–538.
Zdobnov, E. M., C. von Mering, I. Letunic et al. (36 co-authors). 2002. Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science 298:149–159.