MBE Advance Access originally published online on July 23, 2007
Molecular Biology and Evolution 2007 24(10):2158-2168; doi:10.1093/molbev/msm151
Published by Oxford University Press 2007.

Research Articles

A Sequence-Based Model Accounts Largely for the Relationship of Intron Positions to Protein Structural Features

Danny W. De Kee, Vivek Gopalan and Arlin Stoltzfus*

Center for Advanced Research in Biotechnology, Rockville, MD
* Biochemical Science Division, National Institute of Standards and Technology, Gaithersburg, MD

E-mail: stoltzfu{at}umbi.umd.edu.


    Abstract
Claims of intron-structure correlations have played a major role in debates surrounding split gene origins. In the formative (as opposed to disruptive or "insertional") model of split gene origins, introns represent the scars of chimaeric gene assembly. When analyzed retrospectively, formative introns should tend to fall between modular units, if such units exist, or at least to exhibit a preference for sites favorable to chimaera formation. However, there is another possible source of preferences: under a disruptive model of split gene origins, fortuitous intron-structure correlations may arise because the gain of introns is biased with respect to flanking nucleotide sequences. To investigate the extent to which a sequence-biased intron gain model may account for the present-day distribution of introns, data on over 10,000 introns in eukaryotic protein-coding genes were integrated with structural data from a set of 1,851 nonredundant protein chains. The positions of introns with respect to secondary structures, solvent accessibility, and so-called "modules" were evaluated relative to the expectations of a null model, a disruptive model based on amino acid frequencies at splice junctions, and a formative model defined relative to these. The null model can be excluded for most structural features and is highly improbable when intron sites are grouped by reading frame phase. Phase-dependent correlations with secondary structure and side-chain surface accessibility are particularly strong. However, these phase-dependent correlations are explained largely by the sequence-based disruptive model.

Key Words: intron evolution • secondary structure • sequence preferences • splice site


    Introduction
The question of what factors determine the positions of introns has been given much attention over the last 25 years (Blake 1978; Go 1981; Tittiger et al. 1993; Stoltzfus et al. 1994; Logsdon et al. 1995; de Souza et al. 1996; Rzhetsky et al. 1997; Liu et al. 2005). Though a variety of factors might be invoked to account for the positions of introns, explanations often fall into 2 categories (Jellie et al. 1996; Qiu et al. 2004). In the "formative" category of explanation, the positions of introns reflect events of gene formation, as in the "exon theory of genes" (Gilbert 1987). In the "disruptive" or "insertional" interpretation, the positions of introns reflect evolutionary addition of introns to a gene region that was not split previously (Cavalier-Smith 1991; Palmer and Logsdon 1991).

A key implication of the formative view, as first recognized by Blake (1978), is that exons in protein-coding genes will tend to correspond to structural or functional units of proteins. A correspondence of exons to globular domains was anticipated originally (Blake 1978), but it soon became clear (Campbell and Porter 1983; Go 1983) that exons in animal genes are generally too short for such a correspondence. Later studies focused on spatially compact subdomain regions (Go 1981) referred to as "modules," surface-accessible regions (Craik et al. 1983), and secondary structure motifs (Duester et al. 1986; Gilbert et al. 1986). These various claims led to a consensus of opinion that intron-structure correlations were the strongest evidence for a formative view of introns (Doolittle 1987), until a statistical analysis revealed that no single claim could be justified when applied consistently to multiple cases (Stoltzfus et al. 1994).

Subsequent work on this issue has not resolved clearly the question of how introns correlate with structural features or what causes this relationship. On the one hand, analysis of cases of exon shuffling (Patthy 1991) provides a clear proof of a formative role for introns and reveals a highly nonrandom tendency for introns to be located in interdomain regions (Liu and Altman 2003). On the other hand, with respect to estimating the extent to which the formative model accounts for the totality of intron positions in extant genes, the situation is much more confusing. Initially, advocates of the formative view (Gilbert 1987) tended to assume that each and every intron had a special structural meaning, its position (plus or minus a few nucleotides due to putative "sliding") reflecting some formative event of primordial assembly or subsequent exon shuffling. In the late 1990s, a weaker formative theory emerged in a series of papers that continued to claim structural evidence for primordial formation of genes from exons, but no longer cited the totality of evidence from intron positions, instead focusing only on a subset of phase 0 introns that were claimed to fall in the boundaries between compact modules (de Souza et al. 1996, 1998; Roy et al. 1999; Fedorov et al. 2001). Others continue to argue the formative view from a proposed correlation with secondary structure (Contreras-Moreira et al. 2003; Barik 2004).

Meanwhile, it has become clear that the positions of introns could exhibit nonrandomness for a completely unrelated reason: the evolutionary process of intron gain exhibits nucleotide preferences, favoring the pattern MAG∧GT, where "∧" is the site of gain (Qiu et al. 2004; Sverdlov et al. 2004). This preference may be responsible for much of the nonuniformity in intron phases because the "target site" is nonuniformly distributed among phases due to biased codon usage (Ruvinsky et al. 2005; Nguyen et al. 2006). This same preference is also relevant to understanding intron–protein correlations. Long before Qiu et al. showed that the MAG∧GT pattern flanking introns is largely due to intron gain, Fichant (1992) had shown that the MAG∧GT pattern flanking introns largely explains biased amino acid composition at intron junctions, for example, a tendency for phase 1 introns to fall in a glycine codon (i.e., G∧GN). Obviously, because protein features correlate with amino acid sequence features, biased amino acid composition at a set of sites may result in biased structural properties for that same set of sites. Thus, nucleotide sequence preferences for intron gain are expected to generate intron-structure correlations, and it is possible that such preferences are largely or wholly responsible for any observed intron-structure correlations, as suggested by Stoltzfus et al. (1994).
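The link between the nucleotide target site and amino acid bias can be illustrated with a short sketch. It takes the idealized target MAG∧GT (M = A or C, ∧ marking the site of gain) and reads off, for each intron phase, the codon or codons fixed by the motif under the standard genetic code. The function name `residues_at_target` and the treatment of the motif as a strict consensus are illustrative assumptions; real gain preferences are statistical, not absolute.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def residues_at_target(phase):
    """Residues implied by an idealized MAG^GT target for a given phase.

    Phase 0: intron lies between two codons; returns (upstream, downstream)
    residue pairs. Phase 1 or 2: intron interrupts a codon after its first
    or second nucleotide; returns the residue encoded by that codon.
    """
    results = set()
    for m, n in product("AC", BASES):      # M = A or C; N = any base
        site = m + "AG" + "GT" + n         # intron lies after index 3
        if phase == 0:
            results.add((CODON_TABLE[site[0:3]], CODON_TABLE[site[3:6]]))
        else:
            start = 3 - phase              # first base of interrupted codon
            results.add(CODON_TABLE[site[start:start + 3]])
    return results


for phase in (0, 1, 2):
    print(phase, sorted(residues_at_target(phase)))
```

Under this idealization, phase 1 sites fall in a glycine codon (G∧GT), and phase 0 sites are flanked by glutamine or lysine upstream and valine downstream, in line with the amino acid biases discussed in the text.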

In this study, we combine data on intron locations and protein structural features in order to assess the evidence for, and the causes of, any relation that might exist between intron positions and structural features of proteins. In principle, these data could be used to assess 3 models for such a relationship: a null model in which intron sites are a random sample of all sites, a disruptive model in which intron sites are biased due only to sequence preferences of intron gain, and a formative model in which intron sites are biased according to chimaerogenic potential. In practice, the formative model is poorly specified due to the lack of reliable measures of chimaerogenic potential, whereas the sequence-biased gain model can yield precise predictions due to the clear relationship between nucleotide and protein sequences (i.e., the genetic code) and the availability of data on the structural propensities of amino acids.

This analysis yields 2 conclusions. First, the null model can be rejected: there are systematic correlations between positions of introns and structural features, particularly when the introns are separated according to phase. Second, the model of sequence-biased intron gain largely explains the deviations from randomness. For instance, phase 1 intron positions tend to avoid helices and to favor coils, but these preferences are explained quantitatively by a glycine preference consistent with the MAG∧GT nucleotide preference for intron gain. These conclusions apply across vertebrate, invertebrate, plant, and fungal sets of genes, and they do not depend on excluding animal-specific (AS) gene families, which show a similar pattern. In general, the formative model does not have an important role in accounting for the overall pattern observed in present-day genes, though it might become important to account for some subset of introns defined by phylogenetic or other restrictions, an issue that is not resolved by the results reported here.


    Methods
Approach
The main aim of this work is to determine the extent to which a sequence-biased gain model accounts for any observed correspondence between intron positions and protein structural features. To do so, we define a null model, H0, and a disruptive model, HD; in some cases, it is possible to define a formative model, HF, relative to HD or H0. In the approach used here, these models generate quantitative knowledge-based predictions by incorporating observed frequencies and structural propensities of amino acids, as described below.

Null Model
In the null model, H0, intron locations are chosen without respect to sequence or structure so that all possible intron sites are equally probable. For instance, under the null model, if 30% of residues are assigned to α-helices, then 30% of intron sites are expected to fall in α-helices, or more generally:

$$P_{s} = k_{s} \qquad (1)$$

where $P_{s}$ is the probability of an intron position that, relative to the encoded protein structure, maps to a site assigned to secondary structural element $s$ (helix, strand, or coil), and $k_{s}$ is the fraction of that same type of secondary structural element in a large nonredundant set of proteins. To distinguish the null hypothesis from HD, we may specify further that the null model is not sensitive to phase:

$$P_{\phi,s} = k_{s} \qquad (2)$$

where $P_{\phi,s}$ is the probability of a phase $\phi$ intron position that maps to a secondary structural element of type $s$.

Disruptive Model
In the disruptive model, HD, any structural correlations with intron sites are due solely to biases in amino acid composition that reflect the nucleotide preferences for intron gain (Qiu et al. 2004; Sverdlov et al. 2004). Qiu et al. conclude specifically that these preferences are taxon specific (at least at the level of kingdoms) and that, even within a taxon, the intron gain preferences cannot represent a single target sequence such as AG∧G but must reflect a mixture or statistical profile. Therefore, it is not possible to compute precise quantitative expectations of the disruptive model without an equally precise statistical profile of kingdom-specific preferences for intron gain. Unfortunately, no such precise profile is available.

To circumvent this difficulty, we assume that amino acid frequencies at splice junctions can be used to reflect the effects of nucleotide preferences independent of structural effects. Thus, we evaluate HD in terms of an amino acid frequency model, in which the correlations of intron positions with structural features are predictable entirely based on the observed frequencies of amino acids near splice junctions. For instance, in the case of secondary structure:

$$P_{\phi,s} = \sum_{i} f_{\phi,i}\, q_{i,s} \qquad (3)$$

where $P_{\phi,s}$ is defined as before (the probability that a phase $\phi$ intron position maps to secondary structural element $s$), $f_{\phi,i}$ is the frequency with which amino acid $i$ is associated with a phase $\phi$ intron position, and $q_{i,s}$ is the propensity of any amino acid $i$ in a protein to map to secondary structural element $s$.
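As a minimal sketch of how the amino acid frequency model turns composition into a structural expectation, the following code computes a frequency-weighted sum of propensities for one class of intron-associated sites. The function name and all numerical values are hypothetical and chosen only for illustration; they are not taken from the data analyzed here.

```python
def expected_structure_frequencies(site_freqs, propensities):
    """Expected frequency of each structural state at a class of sites.

    site_freqs:   {amino acid: frequency at the intron-associated sites}
    propensities: {amino acid: {state: propensity of that residue}}
    Returns {state: sum over amino acids of frequency * propensity}.
    """
    states = sorted({s for q in propensities.values() for s in q})
    return {
        s: sum(f * propensities[aa].get(s, 0.0)
               for aa, f in site_freqs.items())
        for s in states
    }


# Hypothetical propensities for three residues (each row sums to 1).
q = {
    "G": {"helix": 0.20, "strand": 0.20, "coil": 0.60},
    "L": {"helix": 0.50, "strand": 0.25, "coil": 0.25},
    "V": {"helix": 0.30, "strand": 0.45, "coil": 0.25},
}
# Hypothetical glycine-rich composition, as might arise at phase 1 sites.
f_phase1 = {"G": 0.50, "L": 0.25, "V": 0.25}

expected = expected_structure_frequencies(f_phase1, q)
print(expected)
```

With these toy numbers, enrichment for glycine alone shifts the expectation toward coil, without any direct structural preference of introns.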

Note 2 complications of this model. First, given that prior work (Qiu et al. 2004) revealed taxonomic differences in nucleotide preferences for intron gain, amino acid frequencies near splice junctions must be computed taxon specifically (however, we do not define the structural propensities of amino acids taxon specifically). Second, given that significant nucleotide preferences for intron gain extend from nucleotide sites –3 to +2 relative to the intron site (Qiu et al. 2004), multiple amino acid sites may be affected even when considering a single intron phase, and each site would exhibit a different biased composition. However, because the preferences are weak except from –2 to +1, we assume that such effects can be ignored for phase 1 and 2 sites. For phase 0 introns, we consider 2 kinds of amino acid sites, the upstream or "phase 3" site and the downstream or "phase 0" site.

Refined (di–amino acid) Disruptive Model
The amino acid frequency model assumes that amino acids are independent. However, phase 0 introns lie between 2 codons that are both simultaneously affected by the MAG∧GT preference for intron gain. Thus, under the disruptive model, the CAG-encoded glutamines upstream of a phase 0 intron are not a random sample of glutamines, but are a biased sample in that they tend to be followed by valine residues encoded by GTN. The di–amino acid disruptive model is similar to the disruptive model except that 2 adjacent sites are treated together and are weighted by the frequencies of pairs of amino acids.

$$P_{ij,q} = f_{ij}\, q_{ij} \qquad (4)$$

where $P_{ij,q}$ is the probability of a pair of amino acids (one upstream and one downstream of the intron) in a secondary structural element, $f_{ij}$ is the frequency of a pair of amino acids, and $q_{ij}$ is the propensity of a pair of adjacent amino acids to be in a particular secondary structural assignment, where there are 9 such assignments (αα, α-, αβ, βα, β-, ββ, -α, --, -β).
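A minimal sketch of the di–amino acid bookkeeping follows. It estimates pair propensities by counting adjacent residue pairs against a three-state structure string, then weights them by pair frequencies at intron sites. The three-state alphabet (`H`, `E`, `-`), the function names, and the final summation over pairs (by analogy with the single-residue model) are assumptions made for illustration, as are the toy sequences.

```python
from collections import Counter, defaultdict


def pair_propensities(sequences, structures):
    """For each adjacent residue pair, the fraction of its occurrences
    found in each two-letter structural assignment (e.g. 'HH', 'H-')."""
    counts = defaultdict(Counter)
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure lengths differ")
        for k in range(len(seq) - 1):
            counts[seq[k:k + 2]][ss[k:k + 2]] += 1
    return {
        pair: {a: n / sum(c.values()) for a, n in c.items()}
        for pair, c in counts.items()
    }


def expected_pair_frequencies(pair_freqs, propensities):
    """Sum frequency * propensity over pairs for each assignment."""
    expected = Counter()
    for pair, f in pair_freqs.items():
        for assignment, q in propensities.get(pair, {}).items():
            expected[assignment] += f * q
    return dict(expected)


# Toy data: one short chain with a 3-state structure string.
q_pairs = pair_propensities(["QVQVAQV"], ["HH--EHH"])
# Hypothetical pair composition at phase 0 introns: all Gln-Val.
expected_pairs = expected_pair_frequencies({"QV": 1.0}, q_pairs)
print(expected_pairs)
```

With 3 structural states per residue there are at most 9 two-letter assignments, matching the 9 categories listed above.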

Formative Model
The formative model, HF, is a correlation in excess of the null model, or in excess of HD, and in a direction consistent with increased chimaerogenic potential. In the case of secondary structure and surface accessibility, HF is poorly defined. Though it is often assumed that secondary structures are modules to some extent, the available evidence that might support this assumption is not strong (e.g., DuBose and Hartl 1989); thus, one does not know for certain how chimaerogenic potential is distributed relative to the secondary structure map. If modularity is dominated by secondary structure, then intersecondary structure sites will tend to be intermodule sites, but it also may be that the main chimaerogenic units are tertiary modules with boundaries that tend to interrupt secondary structures. With respect to surface accessibility, there is some evidence that it is easier to add to a protein by inserting fragments in surface-accessible loops (Benner et al. 1997); therefore, HF implies increased surface accessibility.

In the case of the "modules" invoked by Gilbert and colleagues, it might seem obvious that the formative model predicts an excess of between-module introns relative to the expectations of H0 and, if applicable, HD. Indeed, other things being equal, this seems a reasonable expectation, and we follow it here. However, one should not assume this uncritically because it has never been shown that boundaries between these alleged modules are more likely sites for chimaera formation, though such a demonstration might be possible using available data (Voigt et al. 2002).

Structure Analysis
The representative sets of proteins are derived from the Protein Data Bank (PDB) database (Berman et al. 2000). The data set was filtered to remove redundancy at the 30% sequence identity level using the sequence clustering program Blastclust (Altschul et al. 1990).

Secondary structural definitions are based on the DSSP program (Kabsch and Sander 1983). The resulting frequencies of helix, sheet, and coil are similar to those found by other researchers in large sets of data (Martin et al. 2005). Residues are classified as being associated with a helix, strand, or coil, using the entire set of 4,659 prokaryotic and eukaryotic genes.

The surface accessibility was determined using NACCESS (Hubbard and Thornton 1993) based on the method of Lee and Richards (1971). This program computes surface accessibility in either absolute (Å²) or relative (% relative to standard condition Ala–Xxx–Ala) terms, summing over all atoms, side-chain only or backbone only. In the absence of a clear prior rationale for preferring one of the resulting 6 measures, we chose the most residue-sensitive measure (the most likely to be influenced under hypothesis HD), which is absolute side-chain accessibility, and the least residue-sensitive measure (the most likely to be influenced under hypothesis HF), which is relative backbone accessibility.

The procedure for defining "modules" used by de Souza and colleagues was implemented in Perl, using a diameter value of 27.6 Å, chosen based on their previous work reporting correlations between introns and "modules" of 27.6 Å (de Souza et al. 1996). The Perl implementation was validated by a direct side-by-side comparison of the output with that of the original INTERMODULE program (de Souza et al. 1996).

Eukaryotic Gene Database
The eukaryotic gene information is obtained by a protocol modified from that used to construct the Intron Database (IDB) (Schisler and Palmer 2000). This new version of IDB (Hladish T, Schisler NJ, and Stoltzfus A, unpublished data) is a GenBank-based (version 142) eukaryotic protein-encoding gene database and is cross-referenced to SwissProt in order to remove redundancy. Entries with incomplete sequences or with noncanonical splice sites are removed for the analysis. The National Center for Biotechnology Information taxonomy database was used for classification of the IDB entries into 4 taxonomic groups: vertebrates, invertebrates, plants, and fungi. The current IDB contains 165,451 full-length entries.

Mapping of Intron Positions on Protein Structures
The representative protein chains are aligned against the intron-containing subset of the IDB using BlastP. A cutoff of 30% identity is used for identifying homologs. The top hit in each taxonomic group is used for this study; for example, the plant data set consists of the top plant hits to 648 PDB chains and includes 4,050 introns. Of the representative chains, 1,354 had at least one intron mapped to a protein structure. The results of these alignments are summarized in table 1.
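The selection step described here can be sketched as a filter over parsed alignment results. The record fields (`chain`, `group`, `identity`, `bitscore`), the function name, and the use of bit score to rank hits are assumptions for illustration; the text specifies only the 30% identity cutoff and the retention of one top hit per taxonomic group.

```python
def top_hits_by_group(hits, min_identity=30.0):
    """Keep, for each (PDB chain, taxonomic group), the best-scoring hit
    whose percent identity meets the cutoff."""
    best = {}
    for hit in hits:
        if hit["identity"] < min_identity:
            continue  # below the homology cutoff
        key = (hit["chain"], hit["group"])
        if key not in best or hit["bitscore"] > best[key]["bitscore"]:
            best[key] = hit
    return best


# Hypothetical parsed hits for a single chain.
hits = [
    {"chain": "1abcA", "group": "plants", "identity": 45.0, "bitscore": 210.0},
    {"chain": "1abcA", "group": "plants", "identity": 62.0, "bitscore": 305.0},
    {"chain": "1abcA", "group": "fungi", "identity": 28.0, "bitscore": 90.0},
]
selected = top_hits_by_group(hits)
print(selected)
```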


Table 1 The Number of PDB Chains and Introns for the Four Taxonomic Groups

 
Removal of AS Gene Families
Patthy (1999) has shown clear evidence that exon shuffling events in mosaic proteins are specific to a metazoan origin. The intron positions in these proteins are present in protein domain boundaries and may have biased structural properties. Hence, the nonredundant protein structures were classified into 2 categories: animal-specific (AS) genes and without-animal–specific (WAS) genes. These classifications are based on the Blast homologues; the AS data set contains genes whose homologs are animal genes only and the WAS data set contains all other families.
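The AS/WAS rule amounts to a set comparison over the taxonomic groups in which homologs are found. The sketch below assumes that "animal" corresponds to the vertebrate and invertebrate groups used elsewhere in this study; the function and constant names are hypothetical.

```python
ANIMAL_GROUPS = {"vertebrates", "invertebrates"}  # assumed meaning of "animal"


def classify_family(homolog_groups):
    """Return 'AS' if every homolog is from an animal group, else 'WAS'."""
    groups = set(homolog_groups)
    if not groups:
        raise ValueError("no homologs to classify")
    return "AS" if groups <= ANIMAL_GROUPS else "WAS"


print(classify_family(["vertebrates", "invertebrates"]))  # animal only
print(classify_family(["vertebrates", "plants"]))         # mixed family
```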


    Results
Taxonomically Defined Groups of Intron-Containing Homologs
Sequences for an initial set of 6,444 nonredundant structures of protein chains were pruned by removing sequences shorter than 90 residues and then aligned against the intron-containing subset of the IDB to identify homologs with at least 30% identity, yielding 4,659 protein chains (1,240,720 residues). For each of 4 taxonomic groups (vertebrates, invertebrates, plants, and fungi), the top hit with at least 30% identity was retained, resulting in the 4 sets of data described in table 1.

Each of these taxonomically defined sets was analyzed using the same methods, with results that exhibited an overall similarity in spite of some differences. To shorten the presentation of results, here we present complete results only for the vertebrates. Complete results for the other taxonomic groups are presented as Supplementary Material (see Supplementary Material online) and are cited in the text where there are important similarities or significant differences.

Null Model
The vertebrate set of observed intron positions mapped onto protein structures has a secondary structure distribution of 46.0% coil, 38.4% helix, and 15.6% sheet (table 2; see Supplementary Material table S2 online for other data sets). As indicated in table 2, the expected distribution under the null model, which is just the distribution of secondary structure assignments for all amino acid sites (see Methods), is very similar: 45.0%, 35.6%, and 19.4% for coil, helix, and sheet, respectively (similar to the values observed in other large data sets: see Martin et al. 2005). Although the observed distribution deviates only slightly from the null model, the observed frequency of intron sites that map to β-strands, 15.6 ± 1.4%, is significantly lower than the expected value of 19.4%, that is, introns tend to avoid β-strands.


Table 2 Observed and Expected Secondary Structure and Module Distributions for All Intron Positions According to the Null Model (95% CI)

 
Table 2 also displays the frequency with which intron sites map to linker regions between "modules" as defined by the algorithm of de Souza et al. (1996). The observed value of 33.1% corresponds closely with the value expected under the null hypothesis (table 2).

None of the deviations are significant for the other taxonomic groups (table S2, Supplementary Material online).

With respect to surface accessibility, various measures are used commonly and none is definitive. Here we use absolute side-chain accessibility (in Å²) as a residue-sensitive measure suitable for detecting the distinctive implications of the disruptive model, and relative backbone accessibility (relative to the accessibility in an extended Ala–Xxx–Ala tripeptide) as a measure that is more indicative of protein secondary and tertiary structure, and thus more suitable for detecting the distinctive implications of the formative hypothesis. The observed intron positions mapped on the protein structures have a mean relative backbone accessibility of 27.0 ± 1.5% and a side-chain absolute accessibility of 32.6 ± 1.6 Å², as compared with expected values of 25.9% (P value = 0.06, based on Z scores) and 35.0 Å² (P value = 0.001). The backbone and side-chain measures are predicted very well for invertebrates (P values are 0.14 and 0.48, respectively), whereas there are significant differences among the plants and fungi (data not shown).

A more detailed implication of the null hypothesis H0, as distinct from the disruptive model, is that the structural properties associated with intron sites should be insensitive to phase. For this purpose, although there are only 3 possible intron phases, we may consider 4 types of sites: "phase 1" sites where the amino acid is encoded by a codon interrupted by a phase 1 intron, "phase 2" sites corresponding to a codon interrupted by a phase 2 intron, and "phase 3" and "phase 0" sites representing the upstream and downstream (respectively) codons flanking an intron between 2 codons. It is important to distinguish phase 3 and phase 0 sites because, although they are equivalent under the null model and the formative model, they are not equivalent under the disruptive model (as explained in Methods).
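The four site classes can be derived mechanically from the position of an intron in the coding sequence. The sketch below assumes the conventional definition of phase (the number of nucleotides of a codon that lie upstream of the intron) and 0-based residue indices; the function name is hypothetical.

```python
def intron_sites(cds_offset):
    """Map an intron to its associated protein site(s).

    cds_offset is the number of coding nucleotides upstream of the intron.
    Returns a list of (site class, 0-based residue index) tuples.
    """
    codon_index, phase = divmod(cds_offset, 3)
    if phase == 0:
        if codon_index == 0:
            raise ValueError("intron precedes the first codon")
        # Intron between 2 codons: upstream and downstream flanking sites.
        return [("phase 3", codon_index - 1), ("phase 0", codon_index)]
    # Intron interrupts a single codon.
    return [(f"phase {phase}", codon_index)]


print(intron_sites(9))   # between codons 2 and 3
print(intron_sites(10))  # after the 1st nucleotide of codon 3
print(intron_sites(11))  # after the 2nd nucleotide of codon 3
```

Treating the two flanks of a between-codon intron as separate classes is what allows the upstream and downstream compositions to be compared independently.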

Figure 1 shows the results of dividing the data into these 4 phases of sites. Overall, these results reveal large and significant deviations from the null expectation of no effect. For instance, for phase 1 and 2 sites in the vertebrate data set (fig. 1A), the observed frequencies with which sites map to helices are 7.4% lower and 5.1% higher, respectively, than the null expectation of 35.7%. In general, among diverse taxa (fig. S1A, Supplementary Material online), phase 0, 2, and 3 sites are underpredicted in helical regions and overpredicted in coils, whereas this trend is reversed for phase 1 sites, which are observed to be much more common in coils than expected. Thus, the null model is excluded clearly for the case of secondary structure.


FIG. 1.— Intron-associated sites in proteins have phase-specific structural features. Observed values (error bars, 95% CI) for structural correspondences are shown in comparison to expected values (black bars) for 4 categories of intron-associated sites from the vertebrate data set (for other data sets, see supplementary fig. S1, Supplementary Material online): phase 0 (open bars), phase 1 (gray bars), phase 2 (striped bars), and phase 3 (stippled bars). The structural features are (A) frequency with which a protein site is assigned to 1 of 3 secondary structure categories, (B) frequency with which the site falls in a linker region between modules, (C) the main-chain relative surface accessibility at the site, and (D) the side-chain absolute surface accessibility for the residue at that site.

 
The locations relative to "modules" shown for vertebrate data in figure 1B do not exhibit significant deviations; however, there are small but significant deviations for the plant data at phase 3 sites and for invertebrate data at phase 2 sites (supplementary fig. S1B, see Supplementary Material online), with the observed intermodule frequencies being slightly higher than expected. For surface accessibility, shown for the vertebrate data in figure 1C and D, the 2 different measures reveal somewhat different effects. For backbone relative accessibility (the measure that should be more sensitive to HF), the main effect shown for vertebrates in figure 1C is an excess accessibility of phase 1 sites; the same effect is seen for invertebrates and plants (fig. S1C, Supplementary Material online; the effect in fungi is in the same direction but is not statistically significant). For side-chain absolute accessibility (the measure that should be more sensitive to HD) shown in figure 1D, accessibility is overpredicted for phase 0 and 1 sites and underpredicted for phase 2 and 3 sites; the same pattern of significant over- and underprediction is seen in invertebrates and plants (fig. S1D, Supplementary Material online; for fungi, phase 0 and 1 sites are underpredicted, but phase 2 and 3 sites do not differ significantly from expected values).

Thus, the intron-structure relationship is radically phase-dependent, as predicted under the disruptive model but not under the null or formative model. This phase dependence is strongest for secondary structure and solvent accessibility and is relatively minor for modules.

Disruptive Model
In the disruptive model (see Methods), structural preferences arise solely from amino acid preferences that are interpreted to reflect nucleotide preferences of MAG∧GT at sites of intron gain (Qiu et al. 2004; Sverdlov et al. 2004). Prior to this work, it had been established clearly that the distribution of amino acids at intron sites is nonuniform and phase dependent in a manner consistent with a nucleotide signal (Fichant 1992; Whamond and Thornton 2006). Table 3 summarizes the amino acid preferences at intron sites for the vertebrate data, revealing a pattern qualitatively similar to those reported earlier by others. From the Supplementary Material (fig. S2), it may be noted that the pattern of preferences for intron sites in invertebrates, plants, and fungi is qualitatively similar to that seen for vertebrate data, but the pattern is much less pronounced for fungi: the entropy of amino acid composition at intron-associated sites is 3.9 for fungi, compared with 3.3 for plants, 3.4 for vertebrates, and 3.7 for invertebrates. This difference is not due to a difference in the background level of amino acid composition bias, which has an entropy of 4.2 regardless of taxonomic source.
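Entropy as a summary of compositional bias can be sketched as follows. The logarithm base is not stated in the text; base 2 is assumed here because the reported background value of 4.2 exceeds the natural-log maximum for 20 amino acids (ln 20 ≈ 3.0) but lies just below log2 20 ≈ 4.32. The function name is hypothetical.

```python
import math


def shannon_entropy(freqs):
    """Shannon entropy (bits) of a composition given as counts or
    frequencies; values are normalized to sum to 1 first."""
    total = sum(freqs.values())
    return -sum((f / total) * math.log2(f / total)
                for f in freqs.values() if f > 0)


# A uniform composition over 20 amino acids gives the maximum entropy.
uniform = {aa: 1 for aa in "ACDEFGHIKLMNPQRSTVWY"}
print(round(shannon_entropy(uniform), 2))
```

Lower entropy indicates a more skewed composition, so the higher value for fungal intron-associated sites corresponds to weaker amino acid preferences.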


Table 3 Amino Acid Frequency Distribution at Intron Positions

 
Figure 2 shows the observed distribution of the structural attributes for vertebrate data relative to the expectations of the disruptive model, that is, assuming that the observed distribution of structural attributes is entirely a side effect, not a cause, of amino acid composition biases that themselves reflect sequence-biased intron gain. From figure 2A–D, showing secondary structure data for 4 classes of sites, it is clear that the disruptive model largely accounts for the phase-dependent nonuniformity seen earlier (fig. 1A). For instance, while the distribution of phase 1 sites in secondary structure deviates radically from the null expectation (fig. 1A), this same distribution does not deviate significantly from the expectation of the disruptive model for vertebrates (fig. 2B; nor for the invertebrate, plant, and fungal data sets: fig. S3B, Supplementary Material online).


FIG. 2.— The amino acid model accounts largely for phase-associated nonuniformity in structural properties of intron-associated sites. Observed values (open bars; error bars, 95% CI) for structural correspondences are shown in comparison to expected values (black bars) from the disruptive model based on amino acid frequencies. Results shown here are for the vertebrate data set (for other data sets, see supplementary fig. S2, Supplementary Material online). Panels (A) through (D) show results for secondary structure for phase categories 0, 1, 2, and 3, respectively; panel (E) shows results for all 4 phase categories relative to intermodule linkers; and panels (F) and (G) show main-chain relative surface accessibility and side-chain absolute accessibility, respectively, for all 4 phase categories.

 
Although the disruptive model is an improvement for the vertebrate data set and for the other taxonomic groups (see Supplementary Material online), it does not fully account for secondary structure, as shown by the small but significant deviations in figure 2A, C and D. Helices are underpredicted at phase 0 and phase 3 sites, whereas coils are overpredicted at phase 3 sites.

Likewise, the disruptive model does not fully account for the distribution of intron sites relative to "modules". Although there are no significant deviations for the vertebrate data shown in figure 2E, the small but significant deviations noted earlier for the null model at phase 3 sites in plants and at phase 2 sites in invertebrates still remain and are not explained by the disruptive model (fig. S3E, Supplementary Material online).

Finally, the disruptive model accounts largely for the dramatic deviations from null expectations in regard to surface accessibility. For absolute side-chain accessibility, the disruptive model accounts for the nonuniformity among sites observed in the vertebrate data shown in figure 2G and that observed in the other taxonomically defined data sets (fig. S3G, Supplementary Material online), with the exception that accessibility is overpredicted for phase 0 and phase 3 sites in plants. For relative backbone accessibility, the disruptive model accounts for the vertebrate (fig. 2F) and invertebrate results (fig. S3F, Supplementary Material online), whereas for plants and fungi, accessibility is consistently overpredicted, although the disruptive model correctly predicts the order of site classes (phase 1 > phase 2 > phase 3 > phase 0 sites for plants and phase 1 > phase 3 > phase 2 > phase 0 sites for fungi: fig. S3F, Supplementary Material online).

How much better is the disruptive model than the null model? One way to assess the difference is simply to count significant deviations. For the vertebrate data, out of 24 comparisons with the null expectation there are 16 deviations (11 for secondary structure, 0 for "modules," 1 for backbone relative accessibility, and 4 deviations for side-chain absolute accessibility). By contrast, for the vertebrate data, the disruptive model leaves only 5 deviations (2 for phase 0 secondary structure, 1 each for phases 2 and 3 secondary structure, and 1 for the backbone relative accessibility). For all data (not just vertebrates), there are 48 deviations out of 96 for the null model and 30 deviations for the disruptive model.
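The tally above can be restated as simple arithmetic. The following sketch only re-adds the counts quoted in the text and converts them to fractions of comparisons; the variable and function names are illustrative and are not taken from the authors' analysis.

```python
# Counts of significant deviations quoted in the text (vertebrate data).
vertebrate_null = {
    "secondary structure": 11,
    "modules": 0,
    "backbone relative accessibility": 1,
    "side-chain absolute accessibility": 4,
}
vertebrate_disruptive = {
    "phase 0 secondary structure": 2,
    "phase 2 secondary structure": 1,
    "phase 3 secondary structure": 1,
    "backbone relative accessibility": 1,
}


def fraction_deviating(n_deviations, n_comparisons):
    """Share of comparisons that deviate significantly from expectation."""
    return n_deviations / n_comparisons


null_total = sum(vertebrate_null.values())
disruptive_total = sum(vertebrate_disruptive.values())

# 24 comparisons for vertebrates; 96 comparisons for all four taxonomic groups.
print(f"vertebrates, null:       {null_total}/24 = {fraction_deviating(null_total, 24):.1%}")
print(f"vertebrates, disruptive: {disruptive_total}/24 = {fraction_deviating(disruptive_total, 24):.1%}")
print(f"all data, null:          48/96 = {fraction_deviating(48, 96):.1%}")
print(f"all data, disruptive:    30/96 = {fraction_deviating(30, 96):.1%}")
```

Counting deviations treats every comparison equally regardless of effect size, which is why the combined deviation measures of figure 5 give a complementary view.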

Revised (di–amino acid) Disruptive Model
An obvious means to improve the analysis above would be to take into account the nonindependence of phase 3 and phase 0 sites expected under the disruptive model. Nonindependence is expected because the nucleotide preferences for intron gain affect both sites at once. Thus, the valine residues encoded (by GTN codons) just downstream of introns are not a random sample of all valines in proteins because they tend to be preceded by the upstream residue glutamine (CAG) or lysine (AAG).
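The codon argument can be illustrated with the standard genetic code. The sketch below assumes, as the codons named in the text imply, that the favored flanking nucleotides are AG at the end of the upstream codon and GT at the start of the downstream codon; it simply lists which codons satisfy each condition. The names `CODON_TABLE`, `upstream`, and `downstream` are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Codons that end in AG (candidate residues just upstream of a phase 0 intron)
# and codons that start with GT (candidate residues just downstream).
upstream = {c: aa for c, aa in CODON_TABLE.items() if c.endswith("AG")}
downstream = {c: aa for c, aa in CODON_TABLE.items() if c.startswith("GT")}

print("upstream NAG codons:  ", upstream)
print("downstream GTN codons:", downstream)
```

All four GTN codons encode valine, and the NAG codons include CAG (glutamine) and AAG (lysine), the two residues named in the text. GAG (glutamate) and the stop codon TAG also end in AG; the text does not discuss them, and which upstream residues are actually enriched is an empirical matter.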

To account for this anticipated effect of nonindependence, the structural properties of intron positions were predicted based on the structural propensities of pairs of amino acids, weighted by the paired frequencies observed at intron sites (see Methods), with results shown in figure 3. As can be seen, this di–amino acid model improves the fit with observed values for phase 0 sites but not for phase 3 sites; for example, the observed value of 45.6 ± 2.6 for helix at phase 3 sites is still significantly higher than the expected value of 40.1 (P value < 0.001). For all taxonomic groups, the tendency at phase 3 sites is for helices to be underpredicted and coils to be overpredicted, and these deviations are significant for the plant and vertebrate data sets. In general, the di–amino acid model represents a significant but only modest improvement over the original model.
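A minimal sketch of a pair-weighted prediction of this kind follows. It assumes the expected frequency of each structural class is the mean of the pair-specific propensities, weighted by how often each residue pair occurs at intron sites; the exact procedure is given in Methods and may differ in detail. All numbers and names below are hypothetical and serve only to show the calculation.

```python
def predict_distribution(pair_counts, pair_propensities,
                         classes=("helix", "strand", "coil")):
    """Expected class frequencies as a pair-frequency-weighted mean.

    pair_counts: {(upstream_residue, downstream_residue): count at intron sites}
    pair_propensities: {pair: {class: fraction of that pair found in the class}}
    """
    total = sum(pair_counts.values())
    expected = dict.fromkeys(classes, 0.0)
    for pair, count in pair_counts.items():
        weight = count / total
        for cls in classes:
            expected[cls] += weight * pair_propensities[pair][cls]
    return expected


# Hypothetical counts and propensities, not values from the study.
pair_counts = {("Q", "V"): 30, ("K", "V"): 50, ("E", "G"): 20}
pair_propensities = {
    ("Q", "V"): {"helix": 0.40, "strand": 0.30, "coil": 0.30},
    ("K", "V"): {"helix": 0.50, "strand": 0.20, "coil": 0.30},
    ("E", "G"): {"helix": 0.30, "strand": 0.10, "coil": 0.60},
}

expected = predict_distribution(pair_counts, pair_propensities)
for cls, value in expected.items():
    print(f"{cls}: {value:.2f}")
```

The single-residue model has the same form, with individual amino acids in place of pairs.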


FIG. 3.— The di–amino acid model accounts for structural correspondences of intron sites slightly better than the amino acid model. Graphic conventions are as in figure 2, except that, because the di–amino acid model only differs from the amino acid model for phase 0 and phase 3 intron-associated sites, results are shown only for these sites. Panels (A) and (B) show results for secondary structure, panel (C) shows results for intermodule linkers, and panels (D) and (E) show main-chain relative surface accessibility and side-chain absolute accessibility, respectively. Results are for the vertebrate data set (for other data sets, see supplementary fig. S3, Supplementary Material online).

 
Formative Model
The formative model suggests that, other things being equal, intron sites should avoid α-helices and β-strands and instead should fall in coil regions (Lonberg and Gilbert 1985). In fact, when intron sites are not divided by phase, there is no such significant tendency in vertebrates, as shown in table 2 (nor in the other taxonomic groups: table S2, Supplementary Material online). When sites are divided into phases 0, 1, 2, and 3, only phase 1 sites correlate with coils (in vertebrates, as well as for invertebrate and plant data sets), whereas other sites show a deficit; this deficit is significant for phase 0 and 3 sites (see fig. 1A and fig. S1A, Supplementary Material online). However, as noted above, the pattern at phase 1 sites is explained by the disruptive model (fig. 2B and fig. S2B, Supplementary Material online). Thus, there is nothing for the formative model to explain.

For the case of "modules," again, there is not a significant overall tendency for vertebrate introns to fall in intermodule regions (table 2; likewise for other taxonomic groups: table S2, Supplementary Material online). When sites are divided by phase, there is no significant correlation for the vertebrate data, but for the invertebrate data there is a phase 2-specific excess of "intermodule" introns, and for the plant data an excess of intermodule introns at phase 0 and 3 sites (fig. S2E, Supplementary Material online). As noted above, these idiosyncratic deviations are not explained by the disruptive model. Because both of these deviations are excesses rather than deficits, they suggest the formative model.

The formative model suggests that, other things being equal, intron sites should be more surface accessible. However, the opposite effect is seen when introns are not divided by phase. Side-chain absolute accessibility at intron sites is significantly lower than expected for vertebrate data (also for plant and fungal data; table S2, Supplementary Material online). When sites are divided by phase, correlations occur in both directions. For side-chain absolute accessibility, phase 0 and 1 sites typically are significantly lower than expected in accessibility, whereas phase 2 and 3 sites are higher; for main-chain relative accessibility, phase 1 sites are significantly more accessible than expected, whereas the other sites are slightly (typically insignificantly) less accessible (table S2.5, Supplementary Material online). As noted earlier, the disruptive model accounts for these results; thus, there is nothing for the formative model to explain.

AS Genes Data Set
For the WAS genes data set described above, the secondary structural distribution at intron positions can be predicted with considerable accuracy using a disruptive model based only on observed amino acid frequencies. This same approach can be extended to an AS data set. As with the WAS data set, the null model is strongly excluded when phase is taken into account (data not shown; fig. 4 from the Supplementary Material online illustrates 12 deviations out of 24 comparisons). Figure 4 summarizes the results of applying the disruptive model to the AS data set, using the di–amino acid version of the model for phases 3 and 0. As with the WAS data set, the disruptive model accounts for most of the pattern of nonrandomness seen among phases in the AS data set, but is not perfect. In particular, the model underpredicts coil at phase 1 sites, which at 64.4% have a frequency 6.4% higher in the AS data set than in the WAS data set.


FIG. 4.— Correspondences for the AS data set are slightly different but are explained largely by the di–amino acid model. Observed values (open bars; error bars, 95% CI) for structural correspondences are shown in comparison to expected values (black bars) from the di–amino acid disruptive model. Results are for vertebrate data from AS gene families (for the invertebrate data from AS genes, see supplementary fig. S4, Supplementary Material online). As in figure 2, panels (A) through (D) show results for secondary structure, panel (E) for intermodule regions, and panels (F) and (G) for surface accessibility.

 

    Discussion
Two main results follow from a quantitative analysis of data on 11,334 intron positions in relation to a null model of no preferences, a disruptive model based on amino acid preferences, and a formative model. First, it is possible to exclude the null model for a variety of types of possible correlation; for example, the observed percentage distribution of β-strand is 15.6 ± 1.4 (95% confidence interval), whereas the expected percentage distribution is 19.4. Second, the disruptive model accounts largely for the most distinctive patterns of nonrandomness in the distribution of intron sites in relation to structure, which are in regard to secondary structure and solvent accessibility, whereas the formative model has little to explain.
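As a rough check on the scale of such deviations, the quoted confidence intervals can be converted to approximate Z scores. The sketch below assumes a symmetric, normal-approximation 95% interval (half-width = 1.96 standard errors); this is an assumption made for illustration, and the authors' actual test may differ. It is applied to the β-strand example above and to the phase 3 helix example from Results (45.6 ± 2.6 observed versus 40.1 expected).

```python
import math


def z_from_ci(observed, half_width_95, expected):
    """Approximate Z score, assuming half_width_95 = 1.96 * standard error."""
    standard_error = half_width_95 / 1.96
    return (observed - expected) / standard_error


def two_sided_p(z):
    """Two-sided P value for a standard normal Z score."""
    return math.erfc(abs(z) / math.sqrt(2))


z_strand = z_from_ci(15.6, 1.4, 19.4)   # beta-strand example (Discussion)
z_helix = z_from_ci(45.6, 2.6, 40.1)    # phase 3 helix example (Results)

print(f"beta-strand: Z = {z_strand:.2f}, P = {two_sided_p(z_strand):.1e}")
print(f"helix:       Z = {z_helix:.2f}, P = {two_sided_p(z_helix):.1e}")
```

Under this approximation both deviations lie well beyond the 95% interval, and the helix example gives P below 0.001, in line with the value quoted in Results.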

To demonstrate this point clearly and to show that it applies across taxonomic groups (not merely to the vertebrate data presented in Results), a summary for all taxonomic groups is shown in figure 5. The strongest correlations with structural features are in regard to secondary structure (A) and side-chain absolute accessibility (D), while any tendency to fall between "modules" (B) is weak or insignificant. The greatest deviations are explained largely by the disruptive model. That is, although some significant deviations remain, the large deviations from expected values under the null model fade to being insignificant or only marginally significant under the disruptive model. The di–amino acid model offers a slight improvement. The overall trend of improvement in models is similar for the AS data sets, but the improvement is not as dramatic (fig. S5, Supplementary Material online).


FIG. 5.— The disruptive model accounts largely for phase-specific correspondences observed in diverse taxonomically defined data sets. Each histogram shows, for each taxonomically defined data set, the combined deviation from expected values for 3 models (null, disruptive, di–amino acid) for the case of secondary structure (A), intermodule location (B), main-chain relative solvent accessibility (C), and side-chain absolute solvent accessibility (D). Each cluster of 4 bars represents results for vertebrates (black), invertebrates (gray), plants (light gray), and fungi (white), with the height of each bar being a measure of deviation combining 4 position-specific deviations (results for AS genes data sets are given in supplementary fig. S5, Supplementary Material online). For (A) and (B), these deviations are χ² values (note that the degrees of freedom are 8 in A and 4 in B); for (C) and (D), the deviations are sums of absolute Z scores. To aid in comparisons, each panel has a dashed line representing P = 0.01. For the strongest correlations, shown in panels A and D, the correlation is explained largely by the disruptive model; in panel C, the correlation is weak and taxon specific, and the disruptive model offers only a modest improvement; in panel B, the correlation is very weak and taxon specific, and the disruptive model offers no overall improvement.
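The two combined deviation measures named in the caption can be sketched as follows. The χ² statistic is the usual Pearson sum over categories, and its tail probability has a closed form when the degrees of freedom are even, which covers the 8 and 4 degrees of freedom used in panels A and B. How the P = 0.01 line was derived for the summed absolute Z scores is not stated here, so the sketch only computes that sum. Function names and the example inputs are illustrative.

```python
import math


def chi_square(observed_counts, expected_counts):
    """Pearson chi-square statistic summed over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed_counts, expected_counts))


def chi_square_sf_even_df(x, df):
    """P(X >= x) for a chi-square variable with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def sum_abs_z(z_scores):
    """Sum of absolute Z scores over position-specific comparisons."""
    return sum(abs(z) for z in z_scores)


# Statistic values at which P is about 0.01 for the two panels.
print(f"df = 8, chi2 = 20.09: P = {chi_square_sf_even_df(20.09, 8):.4f}")
print(f"df = 4, chi2 = 13.28: P = {chi_square_sf_even_df(13.28, 4):.4f}")

# Hypothetical inputs, only to show the call signatures.
print(chi_square([10, 20], [15, 15]))
print(sum_abs_z([-1.5, 2.0, 0.5, -1.0]))
```

Because the threshold depends on the degrees of freedom, bar heights in panels A and B are comparable only relative to each panel's own dashed line.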

 
This conclusion is stronger than that allowed from the results of a recent analysis by Whamond and Thornton (2006), who also addressed the relationship of intron positions to secondary structure within a disruptive model. In that study, the mean Euclidean distance between observed and expected frequency distributions (for locations of introns with respect to helix, sheet, and coil) decreased from 0.12 (average for 3 phase-specific distances) for the null model to 0.10 for the model of nucleotide-biased intron sites, whereas in the present study, the comparable values are 0.11 (null), 0.064 (disruptive), and 0.047 (di–amino acid disruptive) for vertebrate data (values for plant and invertebrate data are comparable; for the fungal data, the improvement is trivial, as is apparent from fig. 5; table S4, Supplementary Material online). Thus, the disruptive model used by Whamond and Thornton improves only slightly on the null model, in contrast to the major improvement found here (fig. 5). Possible reasons for this difference are that Whamond and Thornton began with nucleotide biases rather than amino acid biases (i.e., a more theoretical, less empirical prediction model, thus a more demanding prediction model), they combined data from diverse taxonomic groups (whereas we treat the groups separately to allow for heterogeneity), and they used unusual secondary structure assignments that yield background frequencies outside the range found here or in other studies (e.g., Martin et al. 2005).
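The distance measure used in this comparison can be sketched as below. The frequency vectors are hypothetical (helix, sheet, coil) fractions chosen only to show the computation; the relative reductions at the end use the mean distances quoted in the text. Function names are illustrative.

```python
import math


def euclidean_distance(observed, expected):
    """Euclidean distance between two frequency distributions."""
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, expected)))


def mean_distance(pairs):
    """Mean of phase-specific distances; pairs = [(observed, expected), ...]."""
    return sum(euclidean_distance(o, e) for o, e in pairs) / len(pairs)


def relative_reduction(null_distance, model_distance):
    """Fraction by which a model shrinks the distance left by the null model."""
    return (null_distance - model_distance) / null_distance


# Hypothetical (helix, sheet, coil) frequencies for a single phase.
example = euclidean_distance((0.40, 0.20, 0.40), (0.35, 0.25, 0.40))
print(f"example distance: {example:.4f}")

# Mean distances quoted in the text.
print(f"Whamond and Thornton: {relative_reduction(0.12, 0.10):.1%}")
print(f"disruptive:           {relative_reduction(0.11, 0.064):.1%}")
print(f"di-amino acid:        {relative_reduction(0.11, 0.047):.1%}")
```

Expressed this way, the nucleotide-based model removes roughly one sixth of the null-model distance, whereas the amino acid models remove roughly two fifths to three fifths of it for the vertebrate data.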

The conclusions of this analysis may appear to contradict some statements of Gilbert and colleagues arguing for a formative "introns-early" view (in which primordial protein-coding genes emerge from fusing separate exon minigenes) based on the claim that phase 0 introns shared between different taxonomic kingdoms show an unusual tendency to fall between modules (de Souza et al. 1996, 1998; Roy et al. 1999; Fedorov et al. 2001, 2003).

However, the appearance of a contradiction mainly reflects divergent aims. This study aims to assess how well one can account for the distribution of introns in present-day genes using a quantitative predictive model that has some theoretical content (a theory of intron evolution) as well as some ad hoc empirical content (amino acid frequencies, residue-averaged structure propensities); we focus on the big picture and do not attempt to track down all possible sources of minor deviations. By comparison, Gilbert and colleagues are focused on detecting a specific subtle signal that they interpret as a formative signal of introns-early; they do not attempt to assess the extent to which this signal accounts for nonrandomness observed in present-day gene structures. Such an assessment would be impossible given that the formative theory is not sufficiently well specified to make quantitative predictions. That is, there is no theoretical basis under the formative model for specifying precisely where introns are to be found relative to protein structural features or to phases. Instead, any predictions related to these quantities are relative (i.e., a value higher or lower than some null expectation) and conditional: under the formative theory, introns, or a subset of introns that one might discover (e.g., phase 0 introns or ancient introns), will tend to be found in positions that have a higher chimaerogenic potential than expected, which might correspond (or might not) to sites between secondary structures (e.g., Lonberg and Gilbert 1985) or sites between the modules of de Souza et al. (1998).

As a result of this difference in aims, the scope and methods of this analysis are different in several important ways. First, when we report that the disruptive model accounts largely for intron-structure correlations, and particularly for the strongest correlations, we are including aspects of structure (namely secondary structure and surface accessibility) that 1) are recognized universally to be important aspects of protein structure but 2) are not addressed by de Souza et al. and Fedorov et al. Because various past claims for a formative role of introns (e.g., Lonberg and Gilbert 1985) relied on arguments from secondary structure subsequently found to be weak (Stoltzfus et al. 1994), the decision by de Souza et al. and Fedorov et al. not to address secondary structure arises not because secondary structure is irrelevant to the formative model but because an analysis of secondary structure is a posteriori unlikely to satisfy their aim of finding a signal for introns-early. Second, for similar reasons, we do not focus our interpretation of results on phase 0 introns but consider a sample representing the totality of introns in present-day genes. Third, because our main focus is on accounting for present-day genes, we do not consider whether patterns might be different for a small set of introns that might be old.

With these differences in mind, finally, one may ask whether results reported here in regard to modules are consistent with prior claims, a comparison made difficult by the fact that prior claims are not consistent with each other. Although previous arguments had emphasized 28 Å modules, de Souza et al. (1996, 1998) argued that the relevant "module" correlation involved 3 different sizes (21, 28, and 33 Å) and that the signal was limited to phase 0 introns in ancient conserved regions (ACRs) and for unknown reasons was not found in vertebrate genes. In subsequent more extensive analyses (Fedorov et al. 2001, 2003), the results for phase 0 introns in ACRs revealed a single peak of significance involving 25 Å "modules," while the tendency in regard to 21, 28, and 33 Å "modules" was insignificant or marginal, and comparable in scale to other patterns that did not figure in the formative interpretation offered by the authors, for example, a tendency for phase 2 introns to fall between 30 Å "modules" (fig. 2 of Fedorov et al.).

Because the present study treats taxonomic groups separately and combines data from ACRs and non-ACRs, there is no direct comparison with these earlier studies. However, for the sake of making some comparison, one may hope that results reported here for 27.6 Å modules (the size chosen for this study, following the initial claims for the modules algorithm by de Souza et al. [1996]) would fall somewhere between the values for ACRs and non-ACRs. For ACRs, the relative excess (the deviation divided by the expected value) of phase 0 introns between 27.6 Å "modules" is roughly 4%, with χ² ≈ 3 (fig. 2A of Fedorov et al., top row), whereas for non-ACRs, the excess is about 2% with χ² < 1 (fig. 2B of Fedorov et al., top row). That is, the correspondence is small and insignificant. By comparison, the phase 0–ACR relation with 25 Å "modules" that is the basis of the authors' formative interpretation represents a relative excess of roughly 5–8%, corresponding to χ² values typically from 4 to 7 (fig. 2A of Fedorov et al. 2001, top row, or fig. 1 of Fedorov et al. 2003), that is, significant for the α = 0.05 critical level but not for α = 0.005. Thus, from the work of Fedorov et al., we expect that any tendency for phase 0 introns to fall between 27.6 Å "modules" will be small, with a relative deviation of less than a few percent. However, given the low apparent repeatability of this type of analysis, reflected in the contradiction between de Souza et al. and Fedorov et al., one should allow for some volatility.
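The quantities in this comparison can be checked with a short calculation. The relative excess follows the definition given in the text. The P values assume a χ² test with 1 degree of freedom, which is an assumption made here for illustration; the degrees of freedom used by Fedorov et al. are not stated in this article. The counts passed to `relative_excess` are hypothetical.

```python
import math


def relative_excess(observed, expected):
    """Deviation divided by the expected value, as defined in the text."""
    return (observed - expected) / expected


def chi_square_sf_1df(x):
    """P(X >= x) for a chi-square variable with 1 df (assumed here)."""
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical counts giving a 4% relative excess.
print(f"relative excess: {relative_excess(104, 100):.0%}")

# Chi-square values mentioned in the text, evaluated under the 1-df assumption.
for x in (3.0, 4.0, 7.0):
    print(f"chi2 = {x}: P = {chi_square_sf_1df(x):.4f}")
```

Under this assumption, χ² values between 4 and 7 give P values between 0.05 and 0.005, which agrees with the statement that the 25 Å result is significant at α = 0.05 but not at α = 0.005, while χ² ≈ 3 falls short of the 0.05 level.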

The results presented here are consistent with this very modest expectation. When all phases are combined, we find no correspondence with "modules" for the vertebrate data (consistent with de Souza et al. 1998) and a slight but insignificant excess of "intermodule introns" for the other taxonomic groups, where the relative excess is about 3%. When sites are divided by phase, there are 2 significant excesses of "intermodule" sites relative to the null model (or relative to the disruptive model, which is nearly the same): an excess of phase 2 sites in invertebrate data and of phase 3 sites in plant data. The latter correspondence implicates phase 0 sites and represents a relative excess of about 8%, consistent with prior claims. The correspondence involving phase 2 invertebrate sites is greater in degree (a 21% excess) but less statistically significant. When data are combined across taxonomic groups, the overall excess of phase 0 sites between 27.6 Å "modules" is about 3.6% and is not statistically significant, also consistent with prior claims. If the pattern uncovered by Fedorov et al. (2003) holds, then the statistical excess is due to intron sites shared between taxa and presumed, on this basis, to be old. However, this issue is not addressed here.


    Supplementary Material
Supplementary tables and figures are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
This work was supported by National Institutes of Health grant R01-LM007218 to A.S. The identification of specific commercial software products is for the purpose of specifying a protocol and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.


    Footnotes
 
William Martin, Associate Editor


    References

    Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol (1990) 215:403–410.

    Barik S. When proteome meets genome: the alpha helix and the beta strand of proteins are eschewed by mRNA splice junctions and may define the minimal indivisible modules of protein architecture. J Biosci (2004) 29:261–273.

    Benner SA, Cannarozzi G, Gerloff D, Turcotte M, Chelvanayagam G. Bonafide predictions of protein secondary structure using transparent analyses of multiple sequence alignments. Chem Rev (1997) 97:2725–2844.

    Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res (2000) 28:235–242.

    Blake CCF. Do genes-in-pieces imply proteins-in-pieces? Nature (1978) 273:267.

    Campbell RD, Porter RR. Molecular cloning and characterization of the gene coding for human complement protein factor B. Proc Natl Acad Sci USA (1983) 80:4464–4468.

    Cavalier-Smith T. Intron phylogeny: a new hypothesis. Trends Genet (1991) 7:145–148.

    Contreras-Moreira B, Jonsson PF, Bates PA. Structural context of exons in protein domains: implications for protein modelling and design. J Mol Biol (2003) 333:1045–1059.

    Craik CS, Rutter WJ, Fletterick R. Splice junctions: association with variation in protein structure. Science (1983) 220:1125–1129.

    de Souza SJ, Long M, Klein RJ, Roy S, Lin S, Gilbert W. Toward a resolution of the introns early/late debate: only phase zero introns are correlated with the structure of ancient proteins. Proc Natl Acad Sci USA (1998) 95:5094–5099.

    de Souza SJ, Long M, Schoenbach L, Roy SW, Gilbert W. Intron positions correlate with module boundaries in ancient proteins. Proc Natl Acad Sci USA (1996) 93:14632–14636.

    Doolittle WF. What introns have to tell us: hierarchy in genome evolution. Cold Spring Harb Symp Quant Biol (1987) 52:907–913.

    DuBose RF, Hartl DL. An experimental approach to testing modular evolution: directed replacement of alpha-helices in a bacterial protein. Proc Natl Acad Sci USA (1989) 86:9966–9970.

    Duester G, Jornvall H, Hatfield GW. Intron-dependent evolution of the nucleotide-binding domains within alcohol dehydrogenase and related enzymes. Nucleic Acids Res (1986) 14:1931–1941.

    Fedorov A, Cao X, Saxonov S, de Souza SJ, Roy SW, Gilbert W. Intron distribution difference for 276 ancient and 131 modern genes suggests the existence of ancient introns. Proc Natl Acad Sci USA (2001) 98:13177–13182.

    Fedorov A, Roy S, Cao X, Gilbert W. Phylogenetically older introns strongly correlate with module boundaries in ancient proteins. Genome Res (2003) 13:1155–1157.

    Fichant GA. Constraints acting on the exon positions of the splice site sequences and local amino acid composition of the protein. Hum Mol Genet (1992) 1:259–267.

    Gilbert W. The exon theory of genes. Cold Spring Harb Symp Quant Biol (1987) 52:901–905.

    Gilbert W, Marchionni M, McKnight G. On the antiquity of introns. Cell (1986) 46:151–153.

    Go M. Correlation of DNA exonic regions with protein structural units in haemoglobin. Nature (1981) 291:90–92.

    Go M. Modular structural units, exons, and function in chicken lysozyme. Proc Natl Acad Sci USA (1983) 80:1964–1968.

    Hubbard SJ, Thornton JM. NACCESS (1993) Department of Biochemistry and Molecular Biology, University College London.

    Jellie AM, Tate WP, Trotman CN. Evolutionary history of introns in a multidomain globin gene. J Mol Evol (1996) 42:641–647.

    Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers (1983) 22:2577–2637.

    Lee B, Richards FM. The interpretation of protein structures: estimation of static accessibility. J Mol Biol (1971) 55:379–400.

    Liu M, Walch H, Wu S, Grigoriev A. Significant expansion of exon-bordering protein domains during animal proteome evolution. Nucleic Acids Res (2005) 33:95–105.

    Liu S, Altman RB. Large scale study of protein domain distribution in the context of alternative splicing. Nucleic Acids Res (2003) 31:4828–4835.

    Logsdon JM Jr, Tyshenko MG, Dixon C, Jafari JD, Walker VK, Palmer JD. Seven newly discovered intron positions in the triose-phosphate isomerase gene: evidence for the introns-late theory. Proc Natl Acad Sci USA (1995) 92:8507–8511.

    Lonberg N, Gilbert W. Intron/exon structure of the chicken pyruvate kinase gene. Cell (1985) 40:81–90.

    Martin J, Letellier G, Marin A, Taly JF, de Brevern AG, Gibrat JF. Protein secondary structure assignment revisited: a detailed analysis of different assignment methods. BMC Struct Biol (2005) 5:17.

    Nguyen HD, Yoshihama M, Kenmochi N. Phase distribution of spliceosomal introns: implications for intron origin. BMC Evol Biol (2006) 6:69.

    Palmer JD, Logsdon JM Jr. The recent origins of introns. Curr Opin Genet Dev (1991) 1:470–477.

    Patthy L. Modular exchange principles in proteins. Curr Opin Struc Biol (1991) 1:351–361.

    Patthy L. Genome evolution and the evolution of exon-shuffling—a review. Gene (1999) 238:103–114.

    Qiu WG, Schisler N, Stoltzfus A. The evolutionary gain of spliceosomal introns: sequence and phase preferences. Mol Biol Evol (2004) 21:1252–1263.

    Roy SW, Nosaka M, de Souza SJ, Gilbert W. Centripetal modules and ancient introns. Gene (1999) 238:85–91.

    Ruvinsky A, Eskesen ST, Eskesen FN, Hurst LD. Can codon usage bias explain intron phase distributions and exon symmetry? J Mol Evol (2005) 60:99–104.

    Rzhetsky A, Ayala FJ, Hsu LC, Chang C, Yoshida A. Exon/intron structure of aldehyde dehydrogenase genes supports the "introns-late" theory. Proc Natl Acad Sci USA (1997) 94:6820–6825.

    Schisler NJ, Palmer JD. The IDB and IEDB: intron sequence and evolution databases. Nucleic Acids Res (2000) 28:181–184.

    Stoltzfus A, Spencer DF, Zuker M, Logsdon JM Jr, Doolittle WF. Testing the exon theory of genes: the evidence from protein structure. Science (1994) 265:202–207.

    Sverdlov AV, Rogozin IB, Babenko VN, Koonin EV. Reconstruction of ancestral protosplice sites. Curr Biol (2004) 14:1505–1508.

    Tittiger C, Whyard S, Walker VK. A novel intron site in the triosephosphate isomerase gene from the mosquito Culex tarsalis. Nature (1993) 361:470–472.

    Voigt CA, Martinez C, Wang ZG, Mayo SL, Arnold FH. Protein building blocks preserved by recombination. Nat Struct Biol (2002) 9:553–558.

    Whamond GS, Thornton JM. An analysis of intron positions in relation to nucleotides, amino acids, and protein secondary structure. J Mol Biol (2006) 359:238–247.

Accepted for publication June 25, 2007.

