Molecular Biology and Evolution 19:521-525 (2002)
© 2002 Society for Molecular Biology and Evolution
Do Introns Favor or Avoid Regions of Amino Acid Conservation?
*The Biological Laboratories, Harvard University;
Department of Bioinformatics, Medical Research Institute, Tokyo Medical and Dental University;
Laboratory of Computational Biology, Ludwig Institute for Cancer Research, São Paulo branch
| Abstract |
|---|
Are intron positions correlated with regions of high amino acid conservation? For a set of ancient conserved proteins, with intronless prokaryotic but intron-containing eukaryotic homologs, multiple sequence alignments identified residues invariant throughout evolution. Intron positions between codons show no preferences. However, introns lying after the first base of a codon prefer conserved regions, markedly in glycines. Because glycines are in excess in conserved regions, this behavior could reflect phase-one introns entering glycine residues randomly in the ancestral sequences. Examination of intron positions within codons of evolutionarily invariable amino acids showed that roughly 50% of these introns are bordered by guanines at both 5'- and 3'-ends, 25% have a G only before the intron, and 5% have a G only after the intron, whereas about 20% are bordered by nonguanine bases.
| Introduction |
|---|
The genes of complex eukaryotes are interrupted by introns. Explanations for the origins of these introns are that either they were added (Logsdon Jr. 1998
Recent arguments (de Souza et al. 1998; Roy et al. 1999) support this mixed model. In a set of ancient conserved proteins (ACPs) of known three-dimensional structure, phase-zero intron positions, which lie between the codons, were correlated with the boundaries of certain sizes of modules, compact regions of peptide chain, whereas phase-one and -two introns, which lie after the first and second base of a codon, were not. Such a correlation with three-dimensional structure would not be expected if introns had been added to previously existing genes but would follow if the original genes had been assembled through exon shuffling. This suggests that some of the phase-zero introns were ancient but that most introns of phases one and two and many of phase zero were added to the ACP genes.
One explanation for the intron-structure correlation in an introns-added-late model is that the process by which introns are added to genes, presumably as transposable elements inserting into a DNA or RNA sequence, might be mutagenic and hence change the amino acid sequence of the protein product. Thus, one might expect to see a preference for introns in regions where the amino acid sequence is not critical and a dearth of introns in regions of high amino acid conservation. Such a propensity might, in turn, lead to a correlation between introns and aspects of the three-dimensional structure of the protein, if the boundaries of modules corresponded to regions of low amino acid conservation and the cores of modules to regions of high amino acid conservation.
To test this conjecture, we examined the correlation between intron positions and conserved residues. We use essentially the set of ACPs that de Souza et al. (1998) explored. For each member of that set, we took as a reference sequence the sequence corresponding to the three-dimensional structure. We identified a large number of homologs, both prokaryotic and eukaryotic, for each sequence and, by making multiple alignments, identified amino acids that were identical across all the homologs. These invariant sites represent the regions of highest conservation: they consist of amino acids that are identical in both the prokaryotic and eukaryotic homologs. We then assigned intron positions in the eukaryotic homologs to the reference sequences by pairwise alignment of the relevant sequences, using a computer program that searches the database and identifies intron positions. We find that phase-zero intron positions show no preference to be flanked by codons of evolutionarily invariable amino acids. However, phase-one intron positions show a significant preference to be found within codons of evolutionarily invariable amino acids.
| Materials and Methods |
|---|
Ancient Genes Used in the Study
We used a previously published sample of 44 ancient genes (de Souza et al. 1998
Obtaining Homologous Sequences
To obtain homologous sequences, an amino acid sequence database was created from GenBank release 110. Sequences homologous to the reference sequences were identified by FASTA, version 3.1t13 (Pearson and Lipman 1988
Multiple Alignment and Identification of Conserved Sites at Three Levels
Multiple alignments of the homologous sequences were performed using Clustal W version 1.74 (Thompson, Higgins, and Gibson 1994). We marked, on the reference sequences, the sites where all the sequences have the same amino acid residue. Table 1 also gives the number of homologs that were aligned by Clustal W. Sequences that produced very large gaps or otherwise poor alignments were discarded after visual inspection. All studied multiple alignments are available from our website: http://mcb.harvard.edu/gilbert/invariant_sites.
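The marking of invariant sites can be illustrated with a minimal Python sketch. It assumes the multiple alignment is already available as equal-length strings with "-" for gaps; the function name `invariant_sites` and the toy alignment are hypothetical and are not part of the original analysis pipeline.

```python
def invariant_sites(alignment, ref_index=0, gap="-"):
    """Return (ref_position, residue) pairs for alignment columns in which
    every sequence carries the same residue as the reference.

    ref_position is 1-based and counts only non-gap reference residues.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    sites = []
    ref_pos = 0
    for col in range(length):
        ref_residue = alignment[ref_index][col]
        if ref_residue == gap:
            continue  # column has no counterpart on the reference sequence
        ref_pos += 1
        if all(seq[col] == ref_residue for seq in alignment):
            sites.append((ref_pos, ref_residue))
    return sites


# Toy alignment (invented for illustration); the first row is the reference.
toy_alignment = [
    "MG-KHW",
    "MGAKHF",
    "MG-RHW",
]
print(invariant_sites(toy_alignment))  # [(1, 'M'), (2, 'G'), (4, 'H')]
```

A column counts as invariant only if no sequence has a gap or a substitution there, which matches the strict "same residue in all sequences" criterion described above.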
Mapping of Intron Positions
An intron-exon database (EID; Saxonov et al. 2000) derived from GenBank release 105 was the source of introns. This relatively old version of GenBank was chosen specifically to be very close to the previously published set of intron positions (de Souza et al. 1998; Roy et al. 1999). To map introns from homologous genes onto a reference sequence, we used our INTRONMAP program (Fedorov et al. 2001). This program precisely maps intron positions, taking into account intron phases.
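The idea of carrying an intron position from a homolog onto the reference through a pairwise alignment can be sketched as follows. This is not the INTRONMAP program, whose internals are not described here; it is a simplified illustration that assumes a gapped pairwise protein alignment and an intron given as a 1-based homolog codon index plus a phase. The function name and the example sequences are hypothetical.

```python
def map_intron_to_reference(ref_aln, hom_aln, hom_codon, phase, gap="-"):
    """Map an intron located at (hom_codon, phase) in the homolog onto the
    reference sequence through a gapped pairwise alignment.

    hom_codon is the 1-based residue index in the homolog whose codon the
    intron precedes (phase 0) or interrupts (phase 1 or 2).
    Returns (ref_codon, phase), or None if that residue aligns to a gap
    or lies outside the alignment.
    """
    if len(ref_aln) != len(hom_aln):
        raise ValueError("aligned sequences must have the same length")

    ref_pos = hom_pos = 0
    for ref_res, hom_res in zip(ref_aln, hom_aln):
        if ref_res != gap:
            ref_pos += 1
        if hom_res != gap:
            hom_pos += 1
            if hom_pos == hom_codon:
                # The phase is preserved; only the codon index is translated.
                return (ref_pos, phase) if ref_res != gap else None
    return None


# Invented example: the homolog has one extra residue (A) relative to the reference.
reference = "MKT-GHL"
homolog = "MRTAGHL"
print(map_intron_to_reference(reference, homolog, hom_codon=5, phase=1))  # (4, 1)
print(map_intron_to_reference(reference, homolog, hom_codon=4, phase=0))  # None
```

Keeping the phase alongside the codon index is what allows an intron inside a glycine codon of a homolog to be counted as phase one within the matching glycine of the reference.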
Intron Phases
Intron phases are defined as the position of an intron within a codon. Phases zero, one, and two are, according to the normal definitions, introns lying before a codon or after the first or second base, respectively, but we have also used the term phase three to describe an intron that lies after a codon.
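The phase convention, including the extra "phase three" label, can be written out in a short Python sketch. The function names are hypothetical, and the input is assumed to be the number of coding bases that precede the intron in the spliced coding sequence.

```python
def intron_phase(cds_bases_before):
    """Standard intron phase: 0 between codons, 1 after the first base,
    2 after the second base of a codon."""
    if cds_bases_before < 0:
        raise ValueError("number of coding bases must be non-negative")
    return cds_bases_before % 3


def phase_relative_to_codon(cds_bases_before, codon_index):
    """Position of an intron relative to one particular codon (1-based).

    Returns 0 (just before the codon), 1 or 2 (inside it), 3 (just after it),
    or None if the intron does not touch this codon.
    """
    codon_start = 3 * (codon_index - 1)  # coding bases upstream of the codon
    offset = cds_bases_before - codon_start
    return offset if 0 <= offset <= 3 else None


# Codon 5 spans coding bases 13-15.
for n_bases in (12, 13, 14, 15):
    print(n_bases, intron_phase(n_bases), phase_relative_to_codon(n_bases, 5))
```

Under this convention an intron labeled phase three for codon 5 is the same intron that is phase zero for codon 6, so the label only matters when positions are tabulated relative to a specific invariant residue.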
| Results and Discussion |
|---|
Table 2 shows the number of introns found in the invariant residues in phases zero, one, and two and also lists introns in phase three, after a residue, to display any boundary effects on intron positions before or after an evolutionarily invariable site. Because the invariable amino acid has always been the same, if an intron was there from the beginning of evolution, it would always have been either before, after, or within this particular amino acid. Only 15% of all the residues are invariantly conserved. Table 3 compares the observed and expected totals. There is a large excess of introns in phase one with an extremely high χ², but there is not much deviation from randomness for introns that lie next to codons or in phase two.
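One plausible way to set up such an observed-versus-expected comparison is a two-category chi-square statistic, with the expected count taken as the total number of introns multiplied by the fraction of invariant residues. The sketch below assumes that null model; the function name and the counts in the example are hypothetical and are not taken from Table 3.

```python
def chi_square_invariant(observed_in, total_introns, invariant_fraction):
    """Chi-square statistic (1 degree of freedom) for introns falling in
    invariant residues versus all other residues.

    The expectation assumes introns are distributed in proportion to the
    number of residues in each class.
    """
    if not 0 < invariant_fraction < 1:
        raise ValueError("invariant_fraction must lie strictly between 0 and 1")
    expected_in = total_introns * invariant_fraction
    expected_out = total_introns - expected_in
    observed_out = total_introns - observed_in
    chi2 = ((observed_in - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)
    return expected_in, chi2


# Hypothetical counts: 200 introns of one phase, 45 of them in invariant residues,
# with 15% of residues invariant as stated in the text.
expected, chi2 = chi_square_invariant(observed_in=45, total_introns=200,
                                      invariant_fraction=0.15)
print(f"expected in invariant residues: {expected:.1f}, chi-square: {chi2:.2f}")
```

With one degree of freedom, a statistic above 3.84 corresponds to P < 0.05, so a large excess in a single phase produces a value far beyond that threshold.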
Figure 1 shows that the phase-one introns lie primarily in glycine codons. However, glycine itself is in striking excess among the invariant residues. Figure 2 compares the relative excesses and depletions of each of the conserved amino acids. Highly conserved residues, most essential for the protein conformation and functioning, show a preference for glycine, histidine, proline, and tryptophan, amino acids that are likely to play an important role in maintaining structure or function. Glycine is in the greatest excess, reflecting its position as the smallest amino acid playing an essential role in residue packing. For these ACPs, phase-one and phase-two intron positions are likely to be the result of the addition of introns (de Souza et al. 1998
Are all introns added into GG sequences? In these conserved regions, the invariant residues have always been the same since the progenote, so one knows something about the relevant codons. To account for the excesses and depletions shown in figure 2, we normalized the frequency of introns in invariant residues to the frequency of that residue in the reference sequences. Table 4 lists those normalized percentages. Thus, one can estimate the probabilities of the introns entering into a sequence whose codon frequencies correspond to the amino acid frequencies in the reference sequences. Consider phase one. Only the glycine codon has a GG sequence around the phase-one position: 49% of all the phase-one introns could enter glycine codons. How many phase-one introns would have entered such that they had a guanine before them or a guanine after them? Table 4 shows 78% with a guanine before and 55% with a guanine after, both figures including the 49% flanked by guanines on both sides. Lastly, 16% of such introns would have entered dinucleotide sequences with no guanines.
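The percentages quoted for phase-one introns can be checked for internal consistency with a few lines of arithmetic. The sketch below uses only the figures stated in the text (49% GG, 78% with G before, 55% with G after, 16% with no G); the function name is hypothetical.

```python
def partition_guanine_context(both_pct, g_before_pct, g_after_pct, none_pct):
    """Split inclusive 'G before' and 'G after' percentages into four
    mutually exclusive classes and return them with their sum."""
    only_before = g_before_pct - both_pct  # G before the intron but not after
    only_after = g_after_pct - both_pct    # G after the intron but not before
    parts = {
        "G on both sides": both_pct,
        "G only before": only_before,
        "G only after": only_after,
        "no G": none_pct,
    }
    return parts, sum(parts.values())


parts, total = partition_guanine_context(49, 78, 55, 16)
print(parts)  # {'G on both sides': 49, 'G only before': 29, 'G only after': 6, 'no G': 16}
print(total)  # 100
```

The four exclusive classes sum to 100% and round to the approximate 50%, 25%, 5%, and 20% split given in the abstract.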
The frequencies for the phase-zero and the phase-two introns provide some confirmation of these estimates. The base immediately following a phase-zero intron is determined; so the table of frequencies implies that about 47% of phase-zero introns have a guanine after them. Phase-two introns follow a guanine about 47% to 72% of the time (because the arginine or serine codons are undetermined). Overall, these guanine compositions after the phase-zero or before the phase-two introns are consistent with the numbers deduced from the phase-one introns, about 75% with G before and about 50% with G after.
We conclude that introns do not avoid conserved regions, but, on the contrary, there is an excess of phase-one introns in the conserved regions, and there is no particular bias for phase-zero or phase-two introns. The excess of phase-one introns in the invariant residues is related to the excess of glycines in the conserved regions. If the introns had integrated randomly into glycines in the overall sequence, there would be an excess of introns almost equal to this excess of introns in the invariant residues. There is no support for the argument that introns favoring unconserved regions are the basis for the correlation with protein structure.
The data demonstrate that added introns show a strong bias toward guanine residues but not an absolute requirement for them. Although about 75% enter after a guanine residue and 50% lie between two guanines, a small fraction, about 20%, does not enter at guanines. Because, in general, the sequences in the exons around the introns do not show such an extreme bias for guanines (Long et al. 1998), the exon sequences must be able to mutate away without affecting the splicing.
| Acknowledgements |
|---|
T.E. was supported by the Japan Society for the Promotion of Science. S.J.d.S. was supported by Fundação de Amparo à Pesquisa do Estado de São Paulo (São Paulo, Brazil) and the PEW-Latin American Fellows Program.
| Footnotes |
|---|
Naruya Saitou, Reviewing Editor
Keywords: intron position, exon, splicing, evolution, conserved protein, amino acid, insertion
Address for correspondence and reprints: Walter Gilbert, The Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, Massachusetts 02138. gilbert{at}nucleus.harvard.edu
| References |
|---|
de Souza S. J., M. Long, R. J. Klein, S. Roy, S. Lin, W. Gilbert, 1998 Toward a resolution of the introns early/late debate: only phase zero introns are correlated with the structure of ancient proteins Proc. Natl. Acad. Sci. USA 95:5094-5099
Doolittle R., 1978 Genes in pieces: were they ever together? Nature 272:581-582
Fedorov A., X. Cao, S. Saxonov, S. J. de Souza, S. W. Roy, W. Gilbert, 2001 Intron distribution difference for 276 ancient and 131 modern genes suggests the existence of ancient introns Proc. Natl. Acad. Sci. USA 98:13177-13182
Gilbert W., 1978 Why genes in pieces? Nature 271:501
Logsdon J. M. Jr., 1998 The recent origins of spliceosomal introns revisited Curr. Opin. Genet. Dev. 8:637-648
Long M., S. J. de Souza, C. Rosenberg, W. Gilbert, 1998 Relationship between "proto-splice sites" and intron phases: evidence from dicodon analysis Proc. Natl. Acad. Sci. USA 95:219-223
Pearson W. R., D. J. Lipman, 1988 Improved tools for biological sequence comparison Proc. Natl. Acad. Sci. USA 85:2444-2448
Roy S. W., M. Nosaka, S. J. de Souza, W. Gilbert, 1999 Centripetal modules and ancient introns Gene 238:85-91
Saxonov S., I. Daizadeh, A. Fedorov, W. Gilbert, 2000 EID: the exon-intron database - an exhaustive database of protein coding intron-containing genes Nucleic Acids Res 28:185-190
Thompson J., D. Higgins, T. Gibson, 1994 CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice Nucleic Acids Res 22:4673-4680
Venkatesh B., Y. Ning, S. Brenner, 1999 Late changes in spliceosomal introns define clades in vertebrate evolution Proc. Natl. Acad. Sci. USA 96:10267-10271