MBE Advance Access originally published online on August 17, 2006
Molecular Biology and Evolution 2006 23(11):2191-2202; doi:10.1093/molbev/msl090
Research Articles
Fast Protein Evolution and Germ Line Expression of a Drosophila Parental Gene and Its Young Retroposed Paralog
Department of Biology, University of Texas at Arlington
E-mail: betran{at}uta.edu.
Abstract
This is the first detailed study of the evolution, phylogenetic distribution, and transcription of one young retroposed gene, CG13732, and its parental gene CG15645, whose functions are unknown. CG13732 is a recognizable retroposed copy of CG15645 retaining the signals of this process. We name the parental gene Cervantes and the retrogene Quijote. To determine when this duplication occurred and the phylogenetic distribution of Quijote, we employed polymerase chain reaction, Southern blotting, and the available information on sequenced Drosophila genomes. Interestingly, these analyses revealed that Quijote is present only in 4 species of Drosophila (Drosophila melanogaster, Drosophila simulans, Drosophila sechellia, and Drosophila mauritiana) and that retroposed copies of Cervantes have also originated in the lineages leading to Drosophila yakuba and Drosophila erecta independently in the 3 instances. We name the new retrogene in the D. yakuba lineage Rocinante and the new retrogene in the D. erecta lineage Sancho. In this work, we present data on Quijote and its parental gene Cervantes. Polymorphism analysis of the derived gene and divergence data for both parental and derived genes were used to determine that both genes likely produce functional proteins and that they are changing at a fast rate (KA/KS ≈ 0.38). The negative value of H of Fay and Wu in the non-African sample reveals an excess of derived variants at high frequency. This could be explained either by positive selection in the region or by demographic effects. The comparative expression pattern shows that both genes are expressed in the same adult tissues (male and female germ line) in D. melanogaster. Quijote is also expressed in males and females of D. simulans, D. sechellia, and D. mauritiana. We argue that the fast rate of evolution of these genes could be related to their putative germ line function and are further studying the independent recruitment of Cervantes-derived retrogenes in multiple lineages.
Key Words: new gene, retroposition, germ line, Drosophila, Cervantes, Quijote
Puesto nombre, y tan a su gusto, a su caballo, quiso ponérsele a sí mismo, y en este pensamiento duró otros ocho días, y al cabo se vino a llamar Don Quijote; de donde -como queda dicho- tomaron ocasión los autores...
El ingenioso hidalgo Don Quijote de la Mancha, Miguel de Cervantes Saavedra (1547-1616)
Having got a name for his horse so much to his taste, he was anxious to get one for himself, and he was eight days more pondering over this point, till at last he made up his mind to call himself "Don Quixote," whence, as has been already said, the authors... (Translated by John Ormsby 1997).
Introduction
New genes can originate in the genome by different mechanisms (Betrán and Long 2002).
Processed copies of protein-coding genes were first described in mammals because of their abundance. Retroposed gene copies often are believed to be pseudogenes (Vanin 1985; Dunham et al. 1999; Mighell et al. 2000) because they lack regulatory regions and, as a consequence, they will often degenerate (Mighell et al. 2000). However, many of them are known to be functional (Brosius 1991; Betrán and Long 2002; Marques et al. 2005).
Several retroposed genes have been described in Drosophila (Betrán et al. 2002 and references below). Some of them have been studied in detail. One of them is the chimeric gene jingwei (jgw; Long and Langley 1993). This gene is located on chromosome 3 in the sister species Drosophila teissieri and Drosophila yakuba and is not present in the closest relatives. The age of the gene has been estimated to be less than 2.5 Myr (Long et al. 1999). Jgw has a 3' Adh-like end and a 5' end (3 exons) recruited from the gene yellow emperor (ymp; Long and Langley 1993; Wang et al. 2000). Another functional processed protein-coding gene has been described in Drosophila melanogaster (Yuan et al. 1996): Pros28.1A. Pros28.1A is an intronless copy of Pros28.1. The new copy is 74% identical at the amino acid level to the parental copy but is located on a different chromosome. Pros28.1 maps to polytene band 14B4 on chromosome X, whereas Pros28.1A is located within polytene band 92F on chromosome 3. Pros28.1 transcription is detected during all Drosophila life stages from embryo to adult in both males and females. Unlike Pros28.1, Pros28.1A is expressed only in males and very specifically in the germ line during spermatogenesis. Another young retroposed gene, nuclear transport factor-2-related of Drosophila (Dntf-2r; Betrán and Long 2003), and its parental gene Dntf-2 also have been studied. Sequence analyses revealed that Dntf-2r is a new functional gene. Dntf-2r is present only in 4 species of Drosophila (D. melanogaster, D. simulans, D. sechellia, and D. mauritiana) and is estimated to be 3-12 Myr old. Dntf-2r evolved faster after its creation than the parental copy and under the action of positive selection at the amino acid level. Comparative expression analysis shows that Dntf-2 is expressed widely (Bhattacharya and Steward 2002; Betrán and Long 2003) but Dntf-2r can be considered male specific (Betrán and Long 2003), although we have recently seen slight expression in females (E. Betrán and M. Motiwale, unpublished results). Another chimeric gene derived from an Adh retrogene, Adh-Twain, has just been described, and it is evolving under positive selection (Jones et al. 2005). In all of these cases, the parental and derived genes are located on different chromosomes (Powell 1997), a pattern that supports retroposition as the mobilizing mechanism (Betrán et al. 2002).
Other examples of retroposed genes in Drosophila that are located on the same chromosome include the glycolytic enzyme phosphoglycerate mutase 87 (Pglym87). Pglym87 (Currie and Sullivan 1994) has no introns, unlike the parental gene Pglym78 that has two. Both these genes map to chromosome arm 3R (see FlyBase; unlike that described by Currie and Sullivan 1994). There are also reports of retroposed genes in tandem organized gene families in D. melanogaster. For instance, Charles et al. (1997) found that in the cluster of third instar larval cuticular proteins, 4 of the genes are intronless and appear to have arisen by retroposition.
Here, we study the evolution, phylogenetic distribution, and transcription of one young gene, CG13732, generated by retroposition, and its parental gene, CG15645, in Drosophila. This pair of genes was recently described after inspecting the genome for young retroposed genes (Betrán et al. 2002; fig. 1). The 2 genes map to different chromosomes, 3 and X, respectively, in D. melanogaster. Our phylogenetic analyses reveal that CG13732 is only present in 4 species of Drosophila: D. melanogaster, D. simulans, D. sechellia, and D. mauritiana. We name the parental gene Cervantes (cerv) and the retrogene Quijote (qjt) after the book "El ingenioso hidalgo Don Quijote de la Mancha" by Miguel de Cervantes Saavedra. In this book, the main character names himself Quijote. Interestingly, our phylogenetic analyses additionally revealed that retroposed copies of CG15645 have also originated in the lineages leading to D. yakuba and D. erecta independently from the event that originated CG13732. We name the retrogene in the D. yakuba lineage Rocinante (rocin) and the retrogene in the D. erecta lineage Sancho (san). Cervantes's main character names his horse Rocinante, and Sancho is Quijote's squire. Although Quijote and the parental gene are predicted genes with unknown function (Adams et al. 2000), our sequence analyses suggest that functional proteins are being produced and that these proteins are changing at a very fast rate (KA/KS ≈ 0.38). In addition, our transcription analyses reveal that both of these genes are expressed in the same adult tissues (male and female germ line) and that Cervantes is alternatively spliced. Their fast rate of evolution could be related to their putative germ line-specific function.
Materials and Methods
Southern Blot Analysis
The phylogenetic distribution of Quijote was studied by Southern blot. Approximately 2 µg of genomic DNA from every species in the D. melanogaster subgroup except D. santomea was digested with EcoRI and blotted to nylon membrane (Sambrook et al. 1989).
Amplification, Cloning, and Sequencing of Cervantes and Quijote
Sequences from Cervantes and Quijote were amplified by PCR from single male genomic DNA obtained using the Puregene kit (Gentra Systems, Minneapolis, MN). Primers 5'-AAGCGTCTGCATAGAATCTG-3' and 5'-AGCGATCCGGATAATGACAAG-3' were used to amplify the whole Cervantes gene in D. melanogaster. Primers 5'-TTACGCAATTCAATGGCAACCT-3' and 5'-GAGAAGCAGCAGCGGGAGAT-3' were used to amplify the whole Quijote gene in D. melanogaster, D. simulans, D. sechellia, and D. mauritiana. These primers were specifically designed not to amplify any of the other paralogs present (see below) in the D. melanogaster genome. They lie in the regions flanking the coding regions of the respective genes, and each has a unique Blast hit in the D. melanogaster genome.
Alleles of D. melanogaster Quijote were amplified from fly strains from around the world: OK17 and HG84 from Africa; cal4, cof3, yep3, yep18, yep25, y10, y2, and y7 from Australia; 253.27 and 253.38 from Taiwan; north7_13 and north34 from Israel; Besançon from France; and EC171 and EC175 from Ecuador. We also used Zimbabwe lines Z(s)2, Z(s)6, Z(s)8, Z(s)11, Z(s)28, Z(s)29, Z(s)30, Z(s)40, Z(s)53, and Z(s)56. Alleles of Quijote were sequenced from 1 stock of D. simulans (Florida; provided by J. Coyne), 3 stocks of D. sechellia (number 4, 24, and 34 from Cousin Island 1985, provided by J. Coyne), and 3 stocks of D. mauritiana (163.1 [Lemeunier and Ashburner 1976], and G102 and W145G74 from S.C. Tsaur). Cervantes was sequenced for D. melanogaster 253.35 from Taiwan. PCR products were sequenced directly after PCR purification (Qiagen, Valencia, CA). Sequences were determined by sequencing both strands. Haplotypes were inferred directly for Cervantes because males were sequenced and the gene is located on the X chromosome. For Quijote, which is located on the third chromosome, PCR products from individuals heterozygous at more than one nucleotide position were cloned into a TOPO cloning vector (Invitrogen, CA, USA), and 1 clone was sequenced to infer the 2 haplotypes. Only a randomly chosen allele for those individuals was considered for the population genetics analyses. DNA sequencing was done on an ABI automated DNA sequencer (Applied Biosystems, Carlsbad, CA) with fluorescent DyeDeoxy terminator reagents.
Genome Confirmation of the Absence of Quijote in D. yakuba and D. erecta
The DNA sequence of Quijote, Quijote + 2 kb of flanking region and the sequence of 2 genes flanking Quijote on each side in D. melanogaster, were used as query in a Blast search against D. yakuba and D. erecta genomes in the FlyBase Blast Service (http://flybase.bio.indiana.edu/blast/). The Blast results revealed the absence of Quijote at the homologous site but the presence of the flanking genes (see Results). The presence of the empty site in both species was confirmed by PCR with primers flanking the empty site (D. yakuba_Quijote_empty3L_F: 5'-AAACACACTGCTTGTCTAGTG-3', D. yakuba_Quijote_empty3L_R: 5'-TTCAATGGGTTTTCTGGTCAG-3', D. erecta_Quijote_empty3L_F: 5'-CACCTGCACACGTGATTACT-3', and D. erecta_Quijote_empty3L_R: 5'-CCAACTGCTGATCCAGTATC-3') and sequencing. DNA sequencing was done using standard protocols (see above).
Strikingly, the Blast analyses revealed the presence of other new retrogenes independently derived from Cervantes in both D. yakuba and D. erecta. These results were confirmed by PCR with primers flanking the new retrogenes (Dyakuba_Quijote_like1_3R_F: 5'-GATGATAGAGCTGATAGAGCTC-3', Dyakuba_ Quijote_like1_3R_R: 5'-TGCTTGAAGCTTTCTAGAACTCT-3', Derecta_ Quijote_like2_3R_F: 5'-CATCGAGCTATCGATGTCAAC-3', and Derecta_ Quijote_like2_3R_R: 5'-GTGAAACAGGTCGTAGTACTG-3') and sequencing. More details are provided below.
Sequences Obtained from Sequenced Genomes
Cervantes was identified from the sequenced genomes for D. simulans (April 2005 Release 1 SLAGAN), D. yakuba (April 2004 Release 1 SLAGAN), and D. erecta (October 2004 SLAGAN). These sequences were used in the divergence analyses. We are confident we are using orthologous genes because we identified Blast hits for the flanking genes. D. simulans Cervantes is flanked by at least 3 of the 4 orthologous genes (2 at each side) that flank Cervantes in D. melanogaster: CG32584, CG7872, and CG9215. D. yakuba and D. erecta Cervantes are flanked by 2 of the 4 orthologous genes that flank Cervantes in D. melanogaster: CG7872 and CG9215.
Expression Analysis
Tissues were homogenized directly using a glass homogenizer, and total RNA was prepared according to manufacturer's recommendations (Qiagen). Total RNA was prepared from D. melanogaster adults from the Besançon strain: 15 virgin females, 15 gonadectomized males, 100 testes + accessory glands, 100 testes, 10 gonadectomized females, and 30 ovaries. Total RNA also was prepared from D. simulans from a Florida strain provided by J. Coyne. The samples were from 35 gonadectomized males, 60 testes, 60 accessory glands, 55 gonadectomized females, and 80 ovaries. Gonadectomized males and females, testes + accessory glands, testes, and ovaries were obtained by dissecting mature (>24 hours) males and females in saline solution. These tissues were soaked at 4 °C overnight in RNA-later solution (Ambion, Austin, TX) and preserved at -20 °C until they were processed. Total RNA also was prepared from D. mauritiana (strain 72) and D. sechellia (Cousin Island; provided by J. Coyne) from 15 males and 15 virgin females.
Transcript information for the D. melanogaster Quijote and Cervantes genes in adults was obtained by sequencing the products of 5' and 3' rapid amplification of cDNA ends (RLM-RACE; Ambion) and reverse transcriptase-PCR (RT-PCR) experiments.
The 5'-RLM-RACE protocol from Ambion ensures amplification of the 5' end of a capped mRNA. Random decamers were used to synthesize the 5' end of the Quijote and Cervantes from RNA of testes. Outer and inner 5' Race primers from Ambion and the specific primers 5'-GTTGTATCTTTTGCTGGACA-3' and 5'-CTTGCAACTTCTGCTGTAGG-3' and nested primers 5'-CGCCTGACCAGCTGCTGA-3' and 5'-ACCCGCCTGGGGAGCTGCC-3' were used to PCR amplify the 5' end of the testes cDNA of Quijote and Cervantes, respectively.
Oligo-(dT) with an adapter (3'-RLM-RACE from Ambion) was used to prime the synthesis of the 3' end of the cDNA. Single stranded cDNA was synthesized from total RNA from testes. Inner 3' Race primers from Ambion and the specific primer 5'-AGAAGTTGCTCGAGCAGAGC-3' and nested primer 5'-ATCCCTGGTCCAAGGCCC-3' of Quijote and 5'-AATGGACTTCAATTACCATGTG-3' of Cervantes were used to PCR amplify the 3' end of both genes.
RT-PCR experiments were performed after synthesizing single strand cDNA from mRNA using Superscript (Invitrogen). Oligo-(dT) was used to prime the synthesis of the cDNA. RT-PCR was conducted in all tissues from total RNA for Quijote and Cervantes in D. melanogaster. Quijote is an intronless gene. Analysis of expression of intronless genes is complicated by the fact that genomic contamination can produce a band of the same size as that expected from the cDNA. Therefore, we DNAse treated all RNA samples prior to cDNA synthesis and included RT negative controls of RNA without RT in the PCR to ensure that the DNAse treatment worked. Specific primers 5'-AGAAGTTGCTCGAGCAGAGC-3' and 5'-CTCCGAGGCAGTTACATCCA-3' for Quijote and 5'-GATCCCTGGTCCAAAGCCT-3' and 5'-CTCCGAGGAAGTTTCTTTCT-3' for Cervantes were used to amplify these genes from cDNA. Primers 5'-AATGGACTTCAATTACCATGTG-3' and 5'-CTTGCAACTTCTGCTGTAGG-3' were used with ovary cDNA to obtain products of Cervantes from which to check the splicing site of the intron of this gene. Products of the RT-PCR were sequenced. DNA sequencing was done using standard protocols (see above).
RT-PCR for Quijote and Cervantes was conducted from total RNA from various tissues for D. simulans (see above) and from whole males and females of D. mauritiana and D. sechellia. We used primers 5'-ATTTCGTGTCCGATTTCAGC-3' and 5'-TCCGAGGCACTTCCATCAAG-3' for Quijote and 5'-TGCTGGACCAGAACGTGGAG-3' and 5'-TCCTCGATCGCGTCGGACATG-3' for Cervantes.
Sequence Analysis
Sequences were aligned using ClustalW (Thompson et al. 1994) and manually adjusted. Synonymous and nonsynonymous substitutions per site (KS and KA) were computed following the Goldman and Yang method (Goldman and Yang 1994; Yang 1998) using PAML 3.1 software (Yang 1997). A single rate model for all sites was first specified (α = 0; Yang 1997), and a tree was provided considering all the gene sequences. The tree was deduced considering gene age (see Results) and species phylogenetic information (Ting et al. 2000; Tamura et al. 2004; Akashi et al. 2006): (((((Quijote D. mauritiana, Quijote D. simulans), Quijote D. sechellia), Quijote D. melanogaster), (Cervantes D. melanogaster, Cervantes D. simulans)), (Cervantes D. yakuba, Cervantes D. erecta)). Cervantes of D. yakuba and D. erecta were used as outgroups. KA/KS ratio differences in different lineages were tested using maximum likelihood. Twice the log likelihood difference between any 2 nested models differing in the number of parameters was compared with a χ² distribution with as many degrees of freedom as the difference in number of parameters of the compared models (Yang 1998). Maximum likelihood estimates of parameters for each branch (branch length and ω = KA/KS), together with the estimate of κ, the transition/transversion ratio, can be used to calculate KA and KS per branch and construct a tree based on nonsynonymous and synonymous substitutions.

π, the average number of nucleotide differences per site between 2 random sequences (Tajima 1989), and θW, Watterson's estimate of θ from the number of segregating sites (Watterson 1975), were calculated. Both values estimate the neutral parameter θ = 4Neµ for autosomal loci, where Ne is the effective population size and µ is the neutral mutation rate, under equilibrium conditions. Differences between π and θW (Tajima's D) reveal nonequilibrium conditions in the history of the gene. Tajima's D (Tajima 1989) was calculated and tested by 10,000 simulations using DNAsp 3.53 (Rozas J and Rozas R 1999). Fu and Li tests that compare estimates of θ from internal and external branches also were employed (Fu and Li 1993) and tested by 10,000 simulations using DNAsp 3.53 (Rozas J and Rozas R 1999).
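The estimators described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the DNAsp implementation: the function name `diversity_stats` and the toy alignment are hypothetical, the alignment is assumed to be gap-free with at most two states per site, and significance testing by coalescent simulation is not included.

```python
from itertools import combinations
from math import sqrt

def diversity_stats(seqs):
    """Return S, pi and Watterson's theta (both per site), and Tajima's D."""
    n, length = len(seqs), len(seqs[0])
    # Segregating sites: alignment columns with more than one state.
    s = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    # k: mean number of pairwise differences over all sequence pairs.
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    if s > 0:
        d = (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
    else:
        d = float("nan")  # D is undefined without segregating sites
    return {"S": s, "pi": k / length, "theta_w": s / a1 / length, "D": d}

# Hypothetical toy alignment: 4 sequences, 4 sites, 3 of them segregating.
toy = ["AAAA", "AAAT", "AATT", "ATTT"]
stats = diversity_stats(toy)
```

A negative D indicates an excess of rare variants relative to the equilibrium expectation, and a positive D an excess of intermediate-frequency variants.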
θH, the estimator of θ = 4Neµ weighted by the homozygosity of the derived variants (Fay and Wu 2000), also estimates the neutral parameter θ = 4Neµ for autosomal loci under equilibrium conditions. The H statistic, the difference between the π and θH estimates (Fay and Wu 2000), measures the excess of derived variants at high frequency. The H value can be tested using neutral coalescence simulations to address the action of positive selection (Fay and Wu 2000; Otto 2000). H of Fay and Wu was computed and tested by 10,000 simulations using the program available at http://www.genetics.wustl.edu/jflab/htest.html. Recombination rates for all those simulations were considered zero in order to be conservative.
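As a minimal sketch of how H is obtained from an unfolded site frequency spectrum, the following Python function computes π, θH, and H per locus from the count of derived alleles at each segregating site. The function name `fay_wu_h` and the example counts are hypothetical; polarization of each site with an outgroup is assumed to have been done beforehand, and the coalescent simulations used for testing are not reproduced here.

```python
def fay_wu_h(derived_counts, n):
    """Per-locus pi, theta_H and H = pi - theta_H (Fay and Wu 2000).

    derived_counts: for each segregating site, the number of sampled
    chromosomes (out of n) that carry the derived allele (1 <= i <= n - 1).
    """
    denom = n * (n - 1)
    # pi weights each site by its heterozygosity, i * (n - i).
    pi = sum(2 * i * (n - i) for i in derived_counts) / denom
    # theta_H weights each site by the homozygosity of the derived variant.
    theta_h = sum(2 * i * i for i in derived_counts) / denom
    return pi, theta_h, pi - theta_h

# Hypothetical sample of 4 chromosomes with derived counts 1, 2 and 3.
pi_toy, theta_h_toy, h_toy = fay_wu_h([1, 2, 3], n=4)
```

Sites at which the derived allele is at high frequency contribute more to θH than to π, so an excess of such sites makes H negative.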
Under the model of neutrality, intraspecific variation is positively correlated with interspecific divergence (Kimura 1983). Deviations from this expectation can result from a number of causes including positive Darwinian selection (McDonald and Kreitman 1991; Nielsen 2001). We compared the pattern of intraspecific variation with that of interspecific divergence at synonymous and replacement sites (McDonald and Kreitman 1991) using DNAsp 3.53 (Rozas J and Rozas R 1999).
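The comparison described above amounts to a test of independence on a 2 × 2 table of fixed versus polymorphic and synonymous versus replacement changes. The sketch below uses a two-sided Fisher's exact test as one common choice; the text does not state which test statistic was applied, and the function names and counts here are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

def mk_test(fixed_syn, fixed_rep, poly_syn, poly_rep):
    """McDonald-Kreitman style comparison of divergence and polymorphism."""
    return fisher_exact_two_sided(fixed_syn, fixed_rep, poly_syn, poly_rep)

# Hypothetical extreme table: all fixed changes synonymous,
# all polymorphisms replacement.
p_toy = mk_test(3, 0, 0, 3)
```

An excess of fixed replacement changes relative to replacement polymorphisms is the pattern usually read as evidence of positive selection.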
Results
Phylogenetic Distribution of Quijote
We inferred that Quijote (CG13732) is only present in D. melanogaster, D. simulans, D. mauritiana, and D. sechellia, after using Southern analysis (fig. 2), genome data, and PCR. Southern analysis reveals multiple bands in all species and extra bands in D. melanogaster, D. simulans, D. mauritiana, and D. sechellia but not in D. yakuba, D. teissieri, D. erecta, and D. orena. The probe used is approximately 1 kb long and includes 200 bp of the 3' flanking region of Cervantes. It should hybridize with Cervantes, Quijote, and any additional closely related member of this gene family. The presence of multiple bands in D. melanogaster, D. simulans, D. mauritiana, and D. sechellia is interpreted as hybridization to several members of the same gene family: Quijote (the derived gene), Cervantes (the parental gene), CG15644, and CG18620. The last 2 genes are closely related to Cervantes (Thornton and Long 2005); their sequence divergence from Cervantes is about 20% in the coding region (D. melanogaster genome Release 2 [R2]). CG15644 and CG18620 are very similar to each other. Their sequence divergence is 1.2% (D. melanogaster genome R2). Cervantes, CG15644, and CG18620 map to region X13D-E. CG15644 and CG18620 were collapsed into 1 gene (CG32584) in recent releases of the genome, but they are clearly different genes in D. melanogaster (Thornton and Long 2005).
In contrast, we observed 2 bands in D. yakuba, D. teissieri, D. erecta, and D. orena and infer 2 fragments to which the probe hybridizes. After examining the sequenced genomes of D. yakuba and D. erecta by Blast and conducting PCR analysis, we conclude that these 2 fragments contain 2 genes of this gene family: Cervantes and another different gene that we found to be a retrogene newly originated from Cervantes but unrelated to Quijote in both species (see details below).
Several different approaches confirmed the absence of Quijote in D. yakuba and D. erecta. Our Blast results using Quijote + 2 kb of flanking region of D. melanogaster against D. yakuba and D. erecta assembled sequences (http://flybase.bio.indiana.edu/blast/) revealed an empty site in the 3L. Results were confirmed by PCR using genomic DNA with primers flanking the empty site (fig. 2B) and sequencing. We also confirmed this by examining the genome sequences for the presence of the syntenic region (2 genes on each side of Quijote) but the absence of Quijote.
Interestingly, our Blast results using only Quijote of D. melanogaster against assembled D. yakuba and D. erecta sequences (http://flybase.bio.indiana.edu/blast/) revealed the presence of another Cervantes retrogene in the 3R arm of both species. Results were confirmed by PCR using genomic DNA with primers flanking the new insertions (fig. 2B) and sequencing. These analyses revealed the presence of other related retrogenes very likely in different sites in both species given that the flanking sequences appear to be different. The sequences flanking the new D. yakuba retrogene show a high degree of similarity by Blast to region 89B of D. melanogaster, and the sequences flanking the new D. erecta retrogene yield a strong Blast hit in the region 100B. In these copies, the intron present in Cervantes is absent, suggesting that like Quijote they originated through retrotransposition. We name the new retrogene in the D. yakuba lineage Rocinante and the new retrogene in the D. erecta lineage Sancho (see Introduction for etymology). These new retrogenes are currently being investigated for transcriptional activity and evolutionary history. In agreement with our inferences, these analyses are revealing that Rocinante and Sancho are more similar to Cervantes of D. yakuba and Cervantes of D. erecta, respectively, than to Quijote (data not shown).
Cervantes and Quijote Transcripts and Patterns of Expression
D. melanogaster RT-PCR results are shown in figure 3. Both genes, parental (Cervantes) and derived (Quijote), are highly expressed in testes and ovaries but seem to be absent or minimally expressed in other tissues (the rest of the male and female bodies) in this species. Quijote and Cervantes are expressed in both males and females of D. simulans, D. mauritiana, and D. sechellia (data not shown). Several RT-PCR products were sequenced to confirm the specificity of the primers. In D. simulans, adult tissue expression was consistent with the observations for D. melanogaster. Both genes, parental (Cervantes) and derived (Quijote), are highly expressed in testes and ovaries. Cervantes is expressed in other tissues, but products of Quijote are absent or expressed at a very low level (data not shown).
Transcripts of Quijote and Cervantes from testes from D. melanogaster are shown in figure 1. The testis Quijote mRNA includes 220 bp 5' of the predicted start codon and 78 bp 3' of the stop codon. This is the first mRNA reported for this predicted gene. Sequencing of Cervantes 5'-RLM-RACE testis product revealed the existence of 2 whole cDNAs: one that uses splicing site 1 of figure 1 and another that uses alternative splicing site 2 of figure 1. It is evident from figure 1 that Quijote was derived from splice variant 1. The alternative splicing site observed in our sequencing (site 2) is 16 bp downstream of the first splicing site, changing the open reading frame. An in-frame stop codon occurs 65 bp after the second splicing site if the same start codon is used; an alternative potential start codon occurs 30 codons downstream. In both transcripts, the 5' end is 171 bp 5' of the predicted start codon (fig. 1) and the 3' end is 120 bp 3' to the stop codon. We do not yet know the relative degree of expression of these 2 transcripts. We used ovarian cDNA and different primers (see Material and Methods) to obtain products of Cervantes from which to check the splicing site of the intron of this gene in female tissues. Only the product of the splicing site 1 of Cervantes (fig. 1) was found in ovary.
Recently, a complete cDNA has been described for Cervantes in D. melanogaster: RE46906.5prime. Splicing site 1 is spliced in this expressed sequence tag product. R3 and subsequent releases of D. melanogaster genome predict 2 different mRNAs for Cervantes gene (RA and RB), neither of which begins or ends exactly where the mRNA that we describe ends. CG15645-RA 5' end is 8 nt shorter than the one we obtained. CG15645-RA uses the splicing site 1 of figure 1. CG15645-RB does not contain introns and is shorter at the 5' end.
The length of the Cervantes cDNA in D. melanogaster and the presence of an in-frame deletion of codons 21-24 in this species (fig. 1 and table 1) are consistent with the length of the open reading frame predicted for Quijote in R2 and subsequent releases (fig. 1). Recently, in R3 and subsequent releases of the D. melanogaster genome, an even longer open reading frame for both genes Cervantes and Quijote is predicted. Both open reading frames are outlined in figure 1.
Sequence Analysis
We sequenced the parental gene (Cervantes; CG15645) from a lab strain of D. melanogaster. We obtained the sequence of Cervantes in D. simulans, D. yakuba, and D. erecta from the genome sequences (see Materials and Methods). Table 1 summarizes these data. The Cervantes intron present in these species is not shown in table 1. We sequenced the newly acquired gene, Quijote (CG13732), for 3 individuals of D. mauritiana, 3 of D. sechellia, 1 of D. simulans, and 27 of D. melanogaster (tables 1 and 2). The consensus sequence of Quijote for D. melanogaster is shown in table 1. The analysis of Cervantes from D. simulans, D. yakuba, and D. erecta confirms that the open reading frame should be at least as long as that predicted for Quijote in R2 (fig. 1 and table 2) because it reveals that the putative start codon for this gene in R1 (ATG; Met position 103; Adams et al. 2000) is not conserved in these species. On the other hand, the ATG corresponding to the predicted start codon for Quijote (Met position 1) is conserved in all the species in which the gene is present. As mentioned above, R3 and the subsequent releases of the D. melanogaster genome predict an even longer open reading frame for both genes Cervantes and Quijote. Our polymorphism and divergence data and analyses did not include these 69 additional nucleotides.
The putative proteins for these different genes show many length changes (insertions and deletions of whole codons). Two deletions of complete codons (12 and 9 bp long) differentiate the D. melanogaster Cervantes from Quijote (fig. 1 and table 1). Based on phylogenetic evidence, one of these two deletions occurred in the Cervantes lineage after Quijote duplication. It is shared by Cervantes in D. melanogaster and D. simulans, but it is not present in Cervantes in D. erecta and D. yakuba (table 1). The other deletion occurred in the D. melanogaster Cervantes lineage alone. In addition, Cervantes of D. erecta and D. yakuba have an indel (9 bp) compared with the sequences of other species of Drosophila examined, and Quijote of D. simulans exhibits an insertion of 3 bp (table 1). In addition, the region surrounding the stop codon of Cervantes and Quijote has changed particularly rapidly (table 1).
Apart from rapid changes in the length of the proteins in this gene family, we found other evidence of rapid protein evolution (see below). Figure 4A shows the gene phylogeny used to compute KA and KS for every branch using maximum likelihood. Log likelihood values and maximum likelihood estimates of ω (KA/KS) along every branch of the tree under different models were estimated between sequences from table 1 (table 3). Consensus sequences of Quijote for D. melanogaster (worldwide samples), D. mauritiana, and D. sechellia were used for the comparisons.
After the formation of a new gene (e.g., Quijote), the function of the protein can change. The new gene can evolve faster or slower than the parental one because of different levels of constraint. Positive selection also may allow the gene to gain a new function. This may happen soon after the gene is formed (Jones and Begun 2005).
A "free ratio" model was first applied to the data. This model computes a different ω (KA/KS) for every branch (Yang 1998); lnL = -2,302.44. This model (29 parameters) does not differ significantly from the one ratio model with 16 parameters (lnL = -2,311.60; χ² = 18.3197; P(13) = 0.1457; see table 3), revealing that there are no major differences in KA/KS between the different branches of the tree. An additional model (model C; table 3) was compared with model B (table 3) to test if, on average, the parental gene is evolving differently from the derived gene. These 2 models do not differ (χ² = 2.5971; P(1) = 0.1071), leading to the conclusion that parental and derived genes are evolving at a similar rate. We also tested if there was an acceleration right after the new gene formed by comparing models B and D (χ² = 5.5065; P(2) = 0.0637) and concluded that there has been no acceleration after the duplication. Interestingly, the observation that model B differs from model E (χ² = 48.7818; P(1) < 10⁻⁶) reveals the action of purifying selection in those genes; the KA/KS ratio is very significantly smaller than 1 (ω ≈ 0.38). All significant comparisons remain significant after correcting for multiple tests (Bonferroni correction P < 0.0125; Sokal and Rohlf 1995).
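The likelihood ratio tests reported above can be checked with a short, dependency-free Python sketch. The helper names `chi2_sf` and `lrt` are hypothetical; the log likelihoods, parameter counts, and χ² values are those given in the text, and the Bonferroni threshold assumes the 4 comparisons described (0.05/4).

```python
from math import erfc, exp, pi, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return exp(-x / 2) * total
    # Odd df: erfc term plus a finite series over odd double factorials.
    chi = sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return erfc(chi / sqrt(2)) + sqrt(2 / pi) * exp(-x / 2) * total

def lrt(lnl_null, lnl_alt, df):
    """Twice the log likelihood difference and its chi-square P value."""
    stat = 2 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, df)

# One ratio model (16 parameters) versus free ratio model (29 parameters).
stat, p_free = lrt(-2311.60, -2302.44, 29 - 16)
p_b_vs_c = chi2_sf(2.5971, 1)   # parental versus derived gene
p_b_vs_d = chi2_sf(5.5065, 2)   # acceleration after duplication
p_b_vs_e = chi2_sf(48.7818, 1)  # omega fixed at 1
bonferroni_alpha = 0.05 / 4     # threshold for 4 comparisons
```

With the rounded log likelihoods, the first statistic is 18.32 and the resulting P values agree with those reported to within rounding; only the comparison of models B and E falls below the Bonferroni threshold.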
The KA/KS ratio of approximately 0.38 is high relative to the average value in Drosophila for non-fast-evolving genes estimated in the same way as done here (0.2182; Swanson et al. 2001). This value increases to 0.4727 for accessory gland-specific ESTs (Swanson et al. 2001). The fact that the KA/KS ratio is high for this gene family could be due to either relaxation of selection or positive selection (Wyckoff et al. 2000; Yang 2001). Under the relaxation hypothesis, we expect KA/KS to be high if the protein does not have a strong amino acid constraint. Under the positive selection hypothesis, KA/KS can be high if favored replacement mutations are often fixed in the population. This is likely to happen after duplication if the new gene is acquiring a new function. The action of positive selection is almost certain if the KA/KS ratio is significantly greater than 1, but this criterion is very strict (Wyckoff et al. 2000; Yang 2001). The KA/KS ratio in a lineage is an average over all sites, and even under the action of positive selection, it can be smaller than 1 because some sites might be under positive selection, whereas others are under purifying selection.
Polymorphism information for Quijote in D. melanogaster and polymorphism and divergence data of D. melanogaster, D. sechellia, and D. mauritiana were analyzed to further explore these 2 possibilities. Seventeen and 10 alleles were analyzed for the coding region of Quijote (630 bp) in worldwide and Zimbabwe samples of D. melanogaster, respectively. We observed 5 segregating sites in the D. melanogaster worldwide population (3 synonymous and 2 nonsynonymous) and 11 segregating sites and 12 mutations (6 synonymous and 6 nonsynonymous; θ = 0.00617; table 2) in Zimbabwe D. melanogaster. Only 4 segregating sites remain in the D. melanogaster worldwide population if we remove 2 African alleles, HG84 and OK17 (2 synonymous and 2 nonsynonymous; θ = 0.00195). This means that θ is 4 times higher in Africa for Quijote. Higher variation in African samples of D. melanogaster has been previously observed for the X chromosome (Begun and Aquadro 1993; Andolfatto 2001). However, loci on the autosomes were observed to be as variable outside as within Africa (Andolfatto 2001). Although the pattern of variation for the X chromosome could be due to an "out of Africa" bottleneck, the inconsistency with the autosomal data disfavored this hypothesis (Andolfatto 2001). Our observation for Quijote (located on chromosome 3) would be consistent with either a bottleneck or an adaptive hypothesis (selection in new environments; Andolfatto 2001; Harr et al. 2002).
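The two variability values reported above (0.00617 and 0.00195) can be reproduced from the stated counts with Watterson's per-site estimator, θ = S / (a_n L), where S is the number of segregating sites, L the number of sites, and a_n the sum of 1/i for i from 1 to n − 1. The sketch below is a consistency check under stated assumptions, not the authors' computation: it assumes the estimator is Watterson's θ (this section does not name it) and that the non-African sample has 15 alleles (17 worldwide alleles minus the 2 African alleles). The function name is illustrative.

```python
def watterson_theta_per_site(segregating_sites, n_alleles, n_sites):
    """Watterson's estimator of theta per site: S / (a_n * L)."""
    # a_n is the (n - 1)th harmonic number.
    a_n = sum(1.0 / i for i in range(1, n_alleles))
    return segregating_sites / (a_n * n_sites)


# Zimbabwe sample: 11 segregating sites, 10 alleles, 630 bp (from the text).
zimbabwe = watterson_theta_per_site(11, 10, 630)

# Non-African sample: 4 segregating sites, 630 bp (from the text);
# 15 alleles is an ASSUMPTION (17 worldwide alleles minus 2 African ones).
non_african = watterson_theta_per_site(4, 15, 630)

print(f"Zimbabwe: {zimbabwe:.5f}")
print(f"Non-African: {non_african:.5f}")
```

Both results round to the values given in the text, which supports reading the reported quantity as a segregating-site-based estimate rather than a pairwise-difference estimate.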
We tested for bias in the frequency spectrum of polymorphisms. Tajima's D test (Tajima 1989), the Fu and Li (1993) tests, and the H test of Fay and Wu (2000) were applied separately to the non-African and Zimbabwe samples for Quijote. None of the tests was significant except the H test of Fay and Wu in the non-African sample, which was marginally significant (see below). Tajima's D was negative in both the non-African (−0.2276; P = 0.4321) and Zimbabwe (−0.4705; P = 0.3450) samples.
The H statistic measures the excess of derived variants at high frequency. This occurs, for example, after an episode of selection. An outgroup sequence is used to infer the derived polymorphic state (Fay and Wu 2000). Given the relationships among sequences (fig. 4), the Quijote consensus of the D. simulans complex was used as outgroup in the H test (table 1). There are 4 and 11 segregating sites in the D. melanogaster Quijote polymorphism data from non-African and Zimbabwe samples, respectively. To be conservative, the recombination rate in the region (4Nc) was considered to be zero. The divergence at synonymous sites between D. simulans and D. melanogaster may range from 0.05 to 0.10, and back mutation for the analyses (Fay and Wu 2000) can take values from 0.017 to 0.033. We carried out the test with these 2 extreme back mutation values. For the non-African sample (H = −1.7143), the probabilities are 0.0537 and 0.0542 for back mutation values of 0.017 and 0.033, respectively; for the Zimbabwe sample (H = −1.3333), they are 0.1906 and 0.1931. These results point to deviations from neutral expectations in the non-African population. The marginally significant negative value of H in the non-African alleles could be explained by an adaptive hypothesis (selection in new environments; Fay and Wu 2000; Harr et al. 2002), but see Discussion.
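To make the H statistic concrete, the sketch below computes H = π − θ_H from the number of derived copies at each segregating site, following the definitions of Fay and Wu (2000): π sums 2i(n − i) / (n(n − 1)) over sites and θ_H sums 2i² / (n(n − 1)), where i is the derived-allele count and n the sample size. The site frequency spectrum used here is hypothetical; the actual spectrum is not given in this section. It was chosen only because 4 segregating sites in an assumed sample of 15 alleles can yield a value equal to the magnitude reported for the non-African sample, taken as negative. Assessing significance requires coalescent simulations, which are not shown.

```python
def fay_wu_h(derived_counts, n):
    """Return (H, pi, theta_H) from per-site derived-allele counts.

    derived_counts: number of sampled alleles carrying the derived state
    at each segregating site; n: number of sampled alleles.
    """
    denom = n * (n - 1)
    pi = sum(2 * i * (n - i) for i in derived_counts) / denom
    theta_h = sum(2 * i * i for i in derived_counts) / denom
    return pi - theta_h, pi, theta_h


# HYPOTHETICAL spectrum (not the authors' data): 4 segregating sites in
# an assumed sample of 15 alleles, 2 of them with derived variants at
# high frequency.
hypothetical_counts = [13, 11, 3, 1]
h, pi, theta_h = fay_wu_h(hypothetical_counts, n=15)
print(round(h, 4))  # -1.7143
```

High-frequency derived variants inflate θ_H much more than π because they enter θ_H as i², which is why an excess of such variants drives H below zero.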
The McDonald–Kreitman test (McDonald and Kreitman 1991) was performed for Quijote polymorphism and divergence data of D. melanogaster, D. sechellia, and D. mauritiana. Non-African and Zimbabwe polymorphism data of D. melanogaster were analyzed separately. None of the comparisons was significant (P > 0.05), indicating no deviations from the neutral model. McDonald–Kreitman tests were also performed removing singletons (i.e., nucleotide variants that appear only once in the data for a given species) to correct for weak purifying selection and increase the power of detecting positive selection. None of the comparisons was significant (P > 0.05).
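The McDonald–Kreitman test compares the ratio of nonsynonymous to synonymous changes among polymorphisms with the same ratio among fixed differences, using a 2 × 2 contingency table. A minimal sketch follows. It uses a two-sided Fisher's exact test as a common choice; this section does not state which test statistic the authors applied. The polymorphism counts are the 6 synonymous and 6 nonsynonymous Zimbabwe mutations given earlier, whereas the fixed-difference counts are hypothetical placeholders, because the divergence counts are not given in this section. Function names are illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def mk_test(poly_syn, poly_nonsyn, fixed_syn, fixed_nonsyn):
    """McDonald-Kreitman comparison via Fisher's exact test."""
    return fisher_exact_two_sided(poly_nonsyn, poly_syn,
                                  fixed_nonsyn, fixed_syn)


# Polymorphism counts from the text (Zimbabwe); the fixed-difference
# counts are HYPOTHETICAL placeholders, not the authors' data.
p_value = mk_test(poly_syn=6, poly_nonsyn=6, fixed_syn=10, fixed_nonsyn=10)
print(f"P = {p_value:.4f}")
```

Removing singletons before the test, as described above, amounts to subtracting variants that appear only once from the polymorphism counts before building the table.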
| Discussion |
|---|
Here we present data on Cervantes (CG15645) and Quijote (CG13732). These genes were predicted (Adams et al. 2000
We also explored the annotation of these genes. Given the observed conservation of the start codon across species, we infer that the Cervantes open reading frame is as long as the one predicted for Quijote at least for one of its transcripts.
Expression and sequence analyses reveal that these are 2 functional genes: their proteins show constraint, and both genes are transcribed in the male and female germ line. Surprisingly, in D. melanogaster and D. simulans, Quijote is expressed in the same adult tissues as Cervantes despite being a retroposed copy that shows no similarity to the parental gene in the 5' region (fig. 1). Quijote is also expressed in males and females of D. mauritiana and D. sechellia. However, we do not yet know whether it is expressed mainly in the germ line in these species. The roles of these genes remain unknown despite our efforts to explore their function through protein homology searches.
The putative proteins for both genes are evolving quickly, and many deletions and/or insertions of codons and changes in the stop codon position have occurred in this family of proteins. D. melanogaster Cervantes is 207 aa long, D. simulans Quijote is 211 aa long, and Quijote from D. mauritiana, D. sechellia, and D. melanogaster is 210 aa. The KA/KS ratio is 0.38 for both genes, Cervantes and Quijote, indicating that they have fast-evolving proteins in comparison to the average KA/KS ratio in Drosophila for nonfast-evolving genes (0.2182; Swanson et al. 2001). Importantly, the KA/KS ratio is significantly smaller than 1, revealing the action of purifying selection in those genes and providing additional support for the production of functional proteins in both cases. It is believed that after the formation of a new gene (e.g., Quijote), its rate of evolution should be faster compared with the parental gene. The new gene is thought to experience a lower level of constraint (Ohno 1970) or, alternatively, to evolve under positive selection to gain a new function (Long and Langley 1993). However, the comparison of the rates of protein evolution for Cervantes and Quijote leads to the conclusion that parental and derived genes are evolving at a similar rate. The similarity in the rates of evolution of parental and derived proteins in recent duplicates is unexpected. Examples of young genes in Drosophila (jgw, Sdic, and Dntf-2r; Long and Langley 1993; Nurminsky et al. 1998; Betrán and Long 2003) have revealed a different pattern: rates of protein evolution are faster for the derived copy. However, this is not the case for Quijote. This may be explained by the observation that both genes are expressed in the same tissues in adult flies and, therefore, experience similar selective environments.
Interestingly, the H test revealed a deviation from neutral expectations in the non-African sample for Quijote. The marginally significant negative value of H in the non-African sample reveals an excess of derived variants at high frequency in Quijote that could be explained by recent positive selection in the D. melanogaster lineage in that genomic region (Fay and Wu 2000). However, it has recently been shown that population bottlenecks, such as the one that may have occurred in D. melanogaster when it colonized new areas from an African origin, or sampling from differentiated subpopulations could also lead to this result (Przeworski 2002; Haddrill et al. 2005). If selection in a novel environment has occurred at the protein level, there should be fixed amino acid differences between African and non-African samples. However, we found no fixed amino acid differences between African and non-African D. melanogaster data for Quijote, although the coding region could be slightly longer at its 5' end than the region we analyzed. Alternatively, selection could have occurred in a noncoding region of this gene or a nearby gene. It could also be that we are just seeing a demographic effect. In that case, the fast protein evolution could be due to the protein being under reduced functional constraint.
We observe that Quijote is evolving at the same fast rate in different lineages of Drosophila and at a rate similar to that of Cervantes. If the fast evolution of the gene is due to positive selection, it should also be evident in other lineages and in Cervantes. Polymorphism data are available for Cervantes from Thornton and Long (2005). Ten partial alleles of D. melanogaster were sequenced from Zimbabwe lines. We used these data to perform Tajima's D test and Fu and Li tests. Neither Tajima's D test nor Fu and Li tests were significant (P > 0.05). We also used D. simulans Cervantes sequence to perform a McDonald–Kreitman test. This comparison was not significant (P > 0.05). So, we do not have evidence for recent selection on Cervantes in D. melanogaster.
We would like to further explore whether the fast evolution of Quijote or Cervantes is due to positive selection, using additional polymorphism data for Quijote and Cervantes in D. mauritiana. In D. melanogaster, the longer Quijote region in African and non-African local populations remains to be studied for possible adaptive fixed differences between populations, which would help to infer the factors underlying the evolution of the gene. At this point, the fast evolution of both Quijote and Cervantes is only suggestive of selection acting on both genes due to their germ line expression and possible reproduction-related functions (Gavrilets 2000; Wyckoff et al. 2000; Swanson et al. 2001; Swanson and Vacquier 2002).
This work provides an exciting basis for further study of the roles and evolution of this newly recognized gene family. From our searches of the recently sequenced genomes of D. yakuba and D. erecta, we infer that retrogenes analogous to Quijote originated from Cervantes, likely independently in both species. We named these new genes Rocinante and Sancho. This reveals Cervantes as a very prolific source for retrogenes. If, as earlier proposed (Betrán et al. 2002), selection is driving the export of functions from the X chromosome to autosomes, those retrogenes from Cervantes could be recurrent fixation events explained by selection. We are now studying their evolution and pattern of transcription to explore these hypotheses.
| Supplementary Materials |
|---|
Accession numbers for the sequences used are AY150701–AY150706, AY150708, AY150709, AY150711, and DQ888176–DQ888200, of which the last set corresponds to newly described sequences.
| Acknowledgements |
|---|
We thank J. Coyne, P. Gibert, F. Lemeunier, S.-C. Tsaur, and M.-L. Wu for providing Drosophila strains used in this work. We thank Agencourt, Inc. (D. erecta), and Washington University Genome Center (D. simulans and D. yakuba) for prepublication access to their genome data. Two reviewers, Paul Chippindale, Tina Harr, Manyuan Long, Ellen Pritham, Janice Spofford, and Kevin Thornton critically read and/or improved the manuscript in some way. This work was supported by start-up funds and research enhancement program from the University of Texas at Arlington and GM 071813-01 grant from National Institutes of Health to E.B.
| Footnotes |
|---|
Diethard Tautz, Associate Editor
| References |
|---|
Adams MD, Celniker SE, Holt RA, et al. (295 co-authors). (2000) The genome sequence of Drosophila melanogaster. Science 287:2185–95.
Akashi H, Ko WY, Piao S, John A, Goel P, Lin CF, Vitins AP. (2006) Molecular evolution in the Drosophila melanogaster species subgroup: frequent parameter fluctuations on the timescale of molecular divergence. Genetics 172:1711–26.
Andolfatto P. (2001) Contrasting patterns of X-linked and autosomal nucleotide variation in Drosophila melanogaster and Drosophila simulans. Mol Biol Evol 18:279–90.
Begun DJ and Aquadro CF. (1993) African and North American populations of Drosophila melanogaster are very different at the DNA level. Nature 365:548–50.
Betrán E and Long M. (2002) Expansion of genome coding regions by acquisition of new genes. Genetica 115:65–80.
Betrán E and Long M. (2003) Dntf-2r: a young Drosophila retroposed gene with specific male expression under positive Darwinian selection. Genetics 164:977–88.
Betrán E, Thornton K, Long M. (2002) Retroposed new genes out of the X in Drosophila. Genome Res 12:1854–9.
Bhattacharya A and Steward R. (2002) The Drosophila homolog of NTF-2, the nuclear transport factor-2, is essential for immune response. EMBO Rep 3:378–83.
Brosius J. (1991) Retroposons--seeds of evolution. Science 251:753.
Charles JP, Chihara C, Nejad S, Riddiford LM. (1997) A cluster of cuticle protein genes of Drosophila melanogaster at 65A: sequence, structure and evolution. Genetics 147:1213–24.
Currie PD and Sullivan DT. (1994) Structure, expression and duplication of genes which encode phosphoglyceromutase of Drosophila melanogaster. Genetics 138:352–63.
Dunham I, Shimizu N, Roe BA, et al. (213 co-authors). (1999) The DNA sequence of human chromosome 22. Nature 402:489–95 [published erratum appears in Nature 404(6780):904].
Esnault C, Maestre J, Heidmann T. (2000) Human LINE retrotransposons generate processed pseudogenes. Nat Genet 24:363–7.
Fay JC and Wu CI. (2000) Hitchhiking under positive Darwinian selection. Genetics 155:1405–13.
Fu Y-X and Li W-H. (1993) Statistical tests of neutrality of mutations. Genetics 133:693–709.
Gavrilets S. (2000) Rapid evolution of reproductive barriers driven by sexual conflict. Nature 403:886–9.
Goldman N and Yang Z. (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11:725–36.
Haddrill PR, Thornton KR, Charlesworth B, Andolfatto P. (2005) Multilocus patterns of nucleotide variability and the demographic and selection history of Drosophila melanogaster populations. Genome Res 15:790–9.
Harr B, Kauer M, Schlotterer C. (2002) Hitchhiking mapping: a population-based fine-mapping strategy for adaptive mutations in Drosophila melanogaster. Proc Natl Acad Sci USA 99:12949–54.
Jones CD and Begun DJ. (2005) Parallel evolution of chimeric fusion genes. Proc Natl Acad Sci USA 102:11373–8.
Jones CD, Custer AW, Begun DJ. (2005) Origin and evolution of a chimeric fusion gene in Drosophila subobscura, D. madeirensis and D. guanche. Genetics 170:207–19.
Kimura M. (1983) The neutral theory of molecular evolution. (Cambridge University Press, Cambridge).
Lemeunier F and Ashburner MA. (1976) Relationships within the melanogaster species subgroup of the genus Drosophila (Sophophora). II. Phylogenetic relationships between six species based upon polytene chromosome banding sequences. Proc R Soc Lond B Biol Sci 193:275–94.
Long M and Langley CH. (1993) Natural selection and the origin of jingwei, a chimeric processed functional gene in Drosophila. Science 260:91–5.
Long M, Wang W, Zhang J. (1999) Origin of new genes and source for N-terminal domain of the chimerical gene, jingwei, in Drosophila. Gene 238:135–41.
Marques AC, Dupanloup I, Vinckenbosch N, Reymond A, Kaessmann H. (2005) Emergence of young human genes after a burst of retroposition in primates. PLoS Biol 3:e357.
McDonald JH and Kreitman M. (1991) Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652–4.
Mighell AJ, Smith NR, Robinson PA, Markham AF. (2000) Vertebrate pseudogenes. FEBS Lett 468:109–14.
Nielsen R. (2001) Statistical tests of selective neutrality in the age of genomics. Heredity 86:641–7.
Nurminsky DI, Nurminskaya MV, De Aguiar D, Hartl DL. (1998) Selective sweep of a newly evolved sperm-specific gene in Drosophila. Nature 396:572–5.
Ohno S. (1970) Evolution by gene duplication. (Springer, Berlin, Germany).
Otto SP. (2000) Detecting the form of selection from DNA sequence data. Trends Genet 16:526–9.
Powell JR. (1997) Progress and prospects in evolutionary biology: The Drosophila model. (Oxford University Press, New York).
Przeworski M. (2002) The signature of positive selection at randomly chosen loci. Genetics 160:1179



