Skip Navigation


MBE Advance Access originally published online on January 30, 2007
Molecular Biology and Evolution 2007 24(4):1012-1024; doi:10.1093/molbev/msm020
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/4/1012    most recent
msm020v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sassi, S. O.
Right arrow Articles by Benner, S. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sassi, S. O.
Right arrow Articles by Benner, S. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

The Evolution of Seminal Ribonuclease: Pseudogene Reactivation or Multiple Gene Inactivation Events?

Slim O. Sassi*, Edward L. Braun{dagger} and Steven A. Benner*

* Foundation for Applied Molecular Evolution, Gainesville, Florida
{dagger} Department of Zoology, University of Florida, Gainesville

E-mail: ssassi{at}ffame.org.


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
Two approaches, one novel, are applied to analyze the divergent evolution of ruminant seminal ribonucleases (RNases), paralogs of the well-known pancreatic RNases of mammals. Here, the goal was to identify periods of divergence of seminal RNase under functional constraints, periods of divergence as a pseudogene, and periods of divergence driven by positive selection pressures. The classical approach involves the analysis of nonsynonymous to synonymous replacements ratios ({omega}) for the branches of the seminal RNase evolutionary tree. The novel approach coupled these analyses with the mapping of substitutions on the folded structure of the protein. These analyses suggest that seminal RNase diverged during much of its history after divergence from pancreatic RNase as a functioning protein, followed by homoplastic inactivations to create pseudogenes in multiple ruminant lineages. Further, they are consistent with adaptive evolution only in the most recent episode leading to the gene in modern oxen. These conclusions contrast sharply with the view, cited widely in the literature, that seminal RNase decayed after its formation by gene duplication into an inactive pseudogene, whose lesions were repaired in a reactivation event. Further, the 2 approaches, {omega} estimation and mapping of replacements on the protein structure, were compared by examining their utility for establishing the functional status of the seminal RNase genes in 2 deer species. Hog and roe deer share common lesions, which strongly suggests that the gene was inactive in their last common ancestor. In this specific example, the crystallographic approach made the correct implication more strongly than the {omega} approach. Studies of this type should contribute to an integrated framework of tools to assign functional and nonfunctional episodes to recently created gene duplicates and to understand more broadly how gene duplication leads to the emergence of proteins with novel functions.

Key Words: seminal ribonuclease • pseudogene • ruminant • gene duplication • novel function


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
According to a standard model (Ohno 1970Go), the origin of proteins with novel functions begins with a duplication of a gene for a protein performing an ancestral function. Under this model, one duplicate continues to perform the ancestral function, relaxing selective constraints on the second paralog. This allows the second paralog to "explore" regions of "sequence space" without being constrained by natural selection (Zhang 2003Go; Hurles 2004Go). Kimura and Ota (1974)Go proposed that this model represents one of the fundamental "principles governing molecular evolution."

The origin of novel functions is, in some cases, undoubtedly more complicated than suggested by the standard model. For example, gene duplication may be preceded by a period of "gene sharing," where the unduplicated ancestor of the paralogs performs both an ancestral role and a novel role, the latter having arisen whereas the ancestral gene was subject to selective constraints (Hughes 1994Go). This gene-sharing model postulates that the 2 functions are then partitioned among the paralogs after duplication. Other novel functions may arise from "overprinting," a term used to describe the expression of open reading frames that had not previously encoded a protein (Ohno 1984Go; Keese and Gibbs 1992Go; Braun et al. 2000Go). In fact, examples of exons that resulted from overprinting appear to be more numerous in genes with alternative splicing than exons that arose by duplication (Kondrashov and Koonin 2003Go). Additional variant models include partial duplications, with or without the formation of a chimera with another gene (Katju and Lynch 2006Go), and the adaptive change model (Yang and Bielawski 2000Go; Liberles and Wayne 2002Go; Bielawski and Yang 2003Go; Zhang 2003Go), which invokes positive Darwinian selection on one paralog after duplication. Lastly, the pseudogene reactivation model deserves attention (Balakirev and Ayala 2003Go); we will explore this model in detail using the example of seminal RNase.

In the standard model, the majority of duplication events are believed to end with the irreversible inactivation of one duplicate (Walsh 1995Go; Lynch et al. 2001Go). This belief is consistent with the notion that genes evolving free of constraint have a higher probability of acquiring a mutation that renders the encoded protein pathologically defective (e.g., a nonsense mutation or frameshift) than acquiring a change that results in a novel function. An empirical approach to estimating this probability requires that we examine duplication events in natural history. In the postgenomic age, this has become easier to do, pace the fact that information preserved in the modern genomes is adequate to infer the functional status of duplicates for only recent duplication events. Even so, an integrated framework of tools is needed to help us decide, for reconstructed historical events, whether an ancestral paralogous protein was functional or not.

The tempo of sequence change immediately following duplication has frequently been proposed as a metric to make this decision. Genes that are free of constraint diverge at the rapid rate characteristic of neutral drift and are expected to have a normalized ratio of nonsynonymous to synonymous mutations (dN/dS = {omega}) of unity. Thus, a rapid rate of nonsynonymous sequence divergence ({omega} = 1) in a duplicate is taken to indicate the absence of constraints (although Balakirev and Ayala (2003)Go emphasize that assuming {omega} = 1 is not necessarily warranted for all pseudogenes).

Unfortunately, a pathway giving new function via an episode of positive Darwinian selection will also be characterized by rapid sequence change and a high value of {omega} that may not significantly differ from unity (although {omega} values significantly greater than unity are generally accepted as evidence for positive adaptation). Thus, the observation of rapid sequence evolution in one of the duplicates may also be consistent with this "adaptive model" for the origin of novel functions (Zhang et al. 1998Go). This leads to the unfortunate possibility that a high {omega} that does not significantly differ from unity could imply 2 very different conclusions, neutral evolution after the loss of purifying constraint or positive Darwinian selection. Despite this issue, a number of productive efforts have used an elevated {omega} as a criterion to search for proteins subject to positive Darwinian selection both on a large scale (Endo et al. 1996Go; Roth et al. 2005Go) and with specific proteins (Zhang et al. 2002Go).

In principle, one might distinguish between the adaptive and standard models by determining the period over which rapid sequence evolution took place. If the number of amino acid replacements necessary to shift a protein from one function to another is small (Perutz 1983Go; Asenjo et al. 1994Go; Newcomb et al. 1997Go; Zhang et al. 2002Go), one can imagine a short period of drift into a fruitful area of sequence space ({omega} {approx} 1), an episode of rapid adaptation ({omega} > 1), followed by the return to evolution under new functional constraints ({omega} < 1). Long periods of drift followed by the acquisition of a new function are not expected because genes diverging without constraint for long periods of time are expected to become irretrievably damaged (perhaps with a half life of 5 Myr) (Marshall et al. 1994Go; Lynch and Conery 2000Go).

Conversion to a pseudogene is not necessarily synonymous with "gene death." In some cases, pseudogenes can play a regulatory role (Balakirev and Ayala 2003Go) and are also known to contribute to specific functions, such as generating antibody diversity (Ota and Nei 1995Go). In other cases, partial reactivation of a pseudogene by exon shuffling produces new functions (Doxiadis et al. 2006Go). Lastly, pseudogenes could act as donors in interlocus gene conversion that could result in a large number of simultaneous changes to a functional gene (which may or may not be advantageous). An even more extreme example for "life" after conversion to a pseudogene, however, is provided by pseudogene reactivation, which might provide another model for the origin of novel functions. In the pseudogene reactivation model, unconstrained exploration of sequence space continues after mutations render the gene unable to encode a functional protein. Then, lesions incompatible with expression of the pseudogene are repaired. In principle, pseudogene reactivation might allow a gene on one adaptive "peak" to shift to another, even when a required intermediate is toxic. The potential contributions of the reactivation model, whether it is partial or complete reactivation, to the origin of novel functions has led Balakirev and Ayala (2003)Go to relabel pseudogenes as potogenes, for potential genes (using nomenclature Brosius and Gould [1992]Go originally suggested).

Pseudogene reactivation makes available a longer time to search sequence space, but is believed to generate proteins with new functions only infrequently. The repair of lesions might include the reinsertion of a deleted segment, the removal (in frame) of an inserted segment, or other events that are likely to be improbable. Partial gene conversion with a functional gene as a donor might improve the probability of pseudogene reactivation, but enough of the pseudogene sequence must be preserved to maintain the benefits of expanding the sequence space explored after duplication.

Bovine seminal RNase has been proposed to be an example of a protein encoded by a gene that arose from a reactivated pseudogene (Trabesinger-Ruef et al. 1996Go). Seminal RNase diverged from the pancreatic RNase family approximately 40 MYA. In the modern ox, seminal RNase is expressed in seminal plasma at a high level (ca. 2% of the soluble protein). The primary function of the RNase in seminal plasma is unclear, but it displays immunosuppressive and other cell-based activities that are highly distinct from the closely related pancreatic ribonucleases (RNases) (Vescia et al. 1980Go; Laccetti et al. 1992Go; Kim et al. 1995Go; Soucek et al. 1996Go; Sinatra et al. 2000Go; Lee and Raines 2005Go).

Orthologs of the gene encoding–bovine seminal RNase in closely related ruminants (e.g., deer, kudu, okapi, and giraffe) have lesions (deletions, insertions, and changes in key residues) expected to be incompatible with production of an active protein (Trabesinger-Ruef et al. 1996Go; Breukelman et al. 1998Go; Kleineidam et al. 1999Go) (fig. 2). Therefore, any role that seminal RNase might play in oxen is not played in other modern ruminants. Further, as seminal RNase pseudogenes are present in multiple lineages branching from the lineage leading to oxen, the gene encoding–seminal RNase was either inactivated multiple times in lineages leading to other modern ruminants (the "multiple inactivation" narrative; see fig. 1B) or inactivated only once and was reactivated very recently in an immediate ancestor of oxen (the "pseudogene reactivation" narrative; see fig. 1A).


Figure 2
View larger version (70K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Multiple sequence alignment and 3-dimensional crystal (pdb:1R3M [Berisio et al. 2003Go]) for seminal RNase in its dimeric form. (A) DNA sequence alignment. Taxa in blue have an insertion (in blue), deletion (in blue), or a premature stop (in red) relative to the active bovine seminal RNase gene. The indels create frame shifts in the affected genes. Taxa in red have amino acid replacements at the active site likely to render the protein unable to act as a catalyst for the hydrolysis of RNA. (B) Protein sequence alignment using the same colors as the DNA sequence alignment. # Represents premature stops. (CF) Different projections of the crystal structure. Blue and green distinguish the subunits of the seminal RNase dimer. Red indicates sites that have amino acids in the last common ancestor of all seminal RNases different from the active bovine seminal RNase. Yellow indicates active site residues. (C) Active site is toward the viewer. (D) Active site is away from the viewer. (E) Cross section showing that none of the sites suffering replacement are inside the folded core of the protein, other than a single site (32) at the dimer interface. (F) Active site at the top of the image.

 

Figure 1
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Trees showing 2 alternative narratives for the history of ruminant seminal RNase. Episodes during which each narrative postulates that seminal RNase diverged free of functional constraints are indicated by red lines. Periods when each narrative postulates that seminal RNase was under functional constraint are indicated by black lines. Red arrows indicate events where the coding region is proposed by the narrative to have suffered a lesion giving a pseudogene. The black arrow indicates the proposed pseudogene reactivation. * Represents an active and expressed protein in an extant species; ? Indicates a suspected pseudogene (no lesion is present, but examination of available tissues has not indicated the presence of the protein). The appearance of deletions (D) and insertion (I) are shown on the corresponding branches.

 
Although the distribution of functional genes and pseudogenes requires invoking 1 of these 2 narratives if the phylogeny shown is correct, a different phylogenetic tree would remove the need for either of these narratives. The topology shown is congruent with other estimates of ruminant phylogeny (Mahon 2004Go; Hernandez Fernandez and Vrba 2005Go), but it does remain possible that the seminal RNase phylogeny differs from the ruminant species tree due to incomplete lineage sorting.

Although 3 different narratives can explain the phylogenetic distribution of seminal RNase pseudogenes, initial studies concluded that the pseudogene reactivation narrative was plausible (Trabesinger-Ruef et al. 1996Go). Consequently, seminal RNase has been viewed as an important example of pseudogene reactivation producing novel function (Harrison and Gerstein 2002Go; Zhang 2003Go; Harrison et al. 2005Go; Katju and Lynch 2006Go). Here, we examine the value of classical tools, including the estimation of {omega} by maximum likelihood (ML) methods, to discuss the alternative narratives outlined above. We then introduce additional tools that utilize the 3-dimensional folded structure of the protein as a way to distinguish between these narratives. We found that the evidence strongly supports the multiple inactivation narrative and conclude that bovine seminal RNase should no longer be viewed as an unambiguous example of pseudogene reactivation.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
Alignment and Phylogenetic Methods
Sequences were obtained from GenBank or sequenced from tissue derived from the Center for Reproduction of Endangered Species (Trabesinger-Ruef et al. 1996Go). The alignment was produced with ClustalX (Thompson et al. 1994Go). The DNA sequence alignment was guided by the protein sequence alignment. To capture a comprehensive representation of the phylogeny and its corresponding ambiguity, nucleotide evolution models were selected using Modeltest (Posada and Crandall 1998Go) using both the likelihood ratio test and Akaike information criterion (AIC). The trees were then calculated using the selected model in PAUP* (Swofford 2001Go) under the ML optimality criterion. These models were also applied in a Bayesian Markov chain Monte Carlo (MCMC) framework to reconstruct the phylogeny using MrBayes (Ronquist and Huelsenbeck 2003Go).

Ancestral Reconstruction
The PAML program package was used to reconstruct the ancestral sequences for the seminal RNase genes following an empirical Bayes method (Yang et al. 1995Go). Three different evolutionary model frameworks were implemented in the reconstruction, a codon model using 2 different procedures to estimate the codon frequencies and an amino acid model. The first codon model (known as 1 x 4), estimates the frequencies of different codons frequencies by examining the average nucleotide frequencies in the input sequence data as a whole. The second method, 3 x 4, estimates the frequencies of different codons by examining separately the nucleotide frequencies in the first, second, and third positions in the input data; the frequency of a specific codon is the product of 3 estimated nucleotide frequencies. The third model (the amino acid model) uses an empirical rate matrix (Jones et al. 1992Go). In addition to using different models to infer ancestral sequences, different tree topologies were considered to reflect uncertainties in the underlying topology. Although the trees shown in figure 1 reflect the estimate based upon Bayesian MCMC analysis using nucleotide data, we also used trees estimated by other methods (ML analysis of nucleotide data).

Structural Mapping and Solvent Accessibility
The solvent accessibility of all residues of seminal RNase was determined using the definition of secondary structure of proteins (DSSP) program (Kabsch and Sander 1983Go) applied to crystallographic structures of the RNase monomer (pdb:1N3Z [Sica et al. 2003Go]) and dimer (pdb:1BSR [Capasso et al. 1983Go], pdb:1R5C [Merlino et al. 2004Go], and pdb:1R3M [Berisio et al. 2003Go]). Residues with 10% or greater solvent accessibility were considered solvent exposed. The solvent accessibility of the ancestrally replaced amino acids was also determined in the same way. The statistical significance of the observed surface distribution of the ancestrally replaced amino acids was determined using a {chi}2 test.

Calculations of {omega} = dN/dS
Codeml in the PAML program package was used to calculate all {omega} (dN/dS) values (Yang 1997Go) under the ML codon model. When different values for {omega} were calculated for different branches or different branch groupings, the branch model as implemented in PAML was used (Yang 1998Go; Yang and Nielsen 1998Go).

Simulations
Simulated data sets under different branch models (Yang 1998Go; Yang and Nielsen 1998Go) were produced using the evolver NS branch sites version of Evolver in the PAML (version 3.15) package (Yang 1997Go). At least 1,000 data sets were simulated for each set of conditions. The seminal RNase gene and species tree was used in the simulations. Branch lengths, {kappa} (transitions–transversions ratio), and codon frequencies values from the codeml ML analysis of the seminal RNase data set were used in the simulations. Different {omega} values were also applied to generate the simulated data sets and were varied depending on the simulation conditions as discussed in the Results and Discussion section. The simulated data sets were then analyzed using codeml in the PAML program package to estimate {omega} and other parameters. The programs Excel and Prism were used to examine the distributions of {omega} and conduct the statistical analyses.

Incomplete Lineage Sorting
The species/gene tree was compared with a tree that would follow an incomplete lineage-sorting narrative (see Supplementary Material online). The Shimodaira–Hasegawa test as implemented in PAUP* (Swofford 2001Go) was used to evaluate the topologies.

Gene Conversion
Four different trees representing 4 possible narratives of gene conversion were compared (Supplementary Material online) with each other and to the species/gene tree in a parsimony framework as implemented in PAUP*. Tree lengths for each nonconstant character that vary among the 5 trees were compared and characters that supported one of the trees associated with possible gene conversion events were examined.


    Results and Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
Seminal RNase Phylogeny is Congruent with Ruminant Phylogeny
The estimate of the RNase gene tree obtained by a Bayesian MCMC analysis here (fig. 1) includes a seminal RNase clade and a pancreatic RNase clade as expected. The seminal RNase clade has a topology that is almost completely congruent with the likely ruminant species tree (Mahon 2004Go; Hernandez Fernandez and Vrba 2005Go). This suggests that the history of the seminal RNase gene matches the evolutionary history of ruminants inferred using multiple lines of evidence (morphology as well as mitochondrial and nuclear sequence data) with at most modest topological differences that can be explained by population genetic processes (Pamilo and Nei 1988Go; Maddison 1997Go) along with uncertainty in the gene tree and/or species tree.

Despite the general congruence, we wanted to rigorously test the possibility that incomplete lineage sorting might be able to explain the distribution of pseudogenes and functional genes in the extant ruminant species. This narrative postulates that the ancestrally inactivated allele of seminal RNase did not become fixed in the population. Instead, the nonfunctional pseudogene allele was maintained along with a functional allele. In this narrative, the distribution of pseudogenes and functional genes, as observed in the gene tree, is explained by recent losses of polymorphism fixing either the pseudogene or the functional gene in specific lineages. Although instances of deep coalescence that cause modest differences between the gene tree and the species tree are possible, this narrative would require the maintenance of an ancestral polymorphism for an unusually long period of time (through multiple speciation events). The gene tree topology consistent with the deep incomplete lineage-sorting narrative can be excluded (P value = 0.03) using the Shimodaira–Hasegawa topology test (Shimodaira and Hasegawa 1999Go). On these grounds, we excluded this possibility of deep incomplete lineage sorting and focused on the 2 remaining narratives: pseudogene reactivation and multiple recent inactivations.

Seminal RNase Lesions Support Independent Gene Inactivation Events
The fact that the distribution of functional seminal RNase genes is explained most parsimoniously by the pseudogene reactivation narrative (assuming equal costs for conversion between pseudogenes and functional genes) when combined with the absence of evidence for function of this gene outside of the bovine lineage (suggesting that the seminal RNase gene had no function for ~35 Myr) the best corroborated hypothesis is pseudogene reactivation (Trabesinger-Ruef et al. 1996Go).

However, the independent inactivation narrative is more consistent with the sequences of the seminal RNase pseudogenes because the specific inactivating lesions differ in each of the ruminant lineages. It is more parsimonious to conclude that none of the lesions in various ruminants were present in the internal nodes of the seminal RNase tree (table 1). These data do not exclude the pseudogene reactivation model because it remains possible that an initially inactivating lesion was lost or occurred outside of the sequenced regions (e.g., the promoter or untranslated regulatory regions). In such a historical narrative, the mutations with the potential to inactivate the gene do not represent events that initially inactivated seminal RNase; instead, they simply reflect the spectrum of mutations expected for pseudogenes after expression was lost. In this version of the narrative, reactivation of an unexpressed seminal RNase gene occurred just before the ox and buffalo diverged through mutation in a regulatory region (fig. 1A; also see Trabesinger-Ruef et al. 1996Go), which was possible because the coding region avoided a lesion (by chance) in the time since divergence, despite being a pseudogene. In fact, as emphasized previously, the pseudogene reactivation narrative is the most parsimonious narrative if one considers only the functional or pseudogene status of the seminal RNase genes (fig. 1A).


View this table:
[in this window]
[in a new window]

 
Table 1 Expression of Seminal RNase in the Studied Species and the Lesion Status of the Corresponding Seminal RNase Gene

 
Nonsynonymous Evolutionary Rates Varied during Seminal RNase History
Estimates of {omega} were initially obtained for each branch of the tree using a ML method, using a parameter-rich model that allowed each of the 28 branches to be associated with an independent {omega} value. Many of the estimated {omega} values were much lower than unity, suggesting that seminal RNase has been subject to purifying selection during most episodes represented by branches in the seminal RNase tree. Some {omega} values were extremely high ({omega} >> 1), however, reflecting either positive Darwinian selection or short branch lengths having few synonymous substitutions (a "division by zero" problem; see Supplementary Material online).

Accordingly, adjacent branches in the tree were grouped to generate a set of {omega} estimates that were fewer in number than the number of branches in the tree. This grouping decreased the number of free parameters, increased the number of sites useful to estimate individual {omega}, and consequently decreased the variance of the {omega} estimates. In the first clustering, all branches with low {omega} (<1) from the initial analysis (which estimated a separate {omega} for each branch) were collected into a single group assumed to be described by a single ratio. The branches with {omega} higher than unity were allowed to have individual {omega} values unless the branches were adjacent, in which case the adjacent branches were constrained to have a single {omega} parameter. This resulted in 4 groups of branches, 1 containing the majority of branches and having {omega} < 1 (the "background {omega} value") and 3 groups that are candidates for {omega} ≥ 1 (the "high {omega}" groups). The high {omega} groups were combinatorially merged into the group with background (low) {omega} value, ultimately generating a set of 7 models (1 with 3 high {omega} groups, 3 with 2 high {omega} groups, and 3 with 1 high {omega}). This process was designed to cover all possible combinations for calculating {omega} values from the most complex (each branch with a distinct {omega} value), to intermediate models (e.g., 3 different high {omega} groups and the background {omega}), to the simplest model with more than one {omega} value (2 group models with 1 high {omega} and the background {omega}). These models were also compared with an even simpler model that assumes a single {omega} value for the entire tree.

These models were then evaluated using the AIC (Burnham et al. 2002Go; Posada and Buckley 2004Go), which provides an estimate of the Kullback–Leibler distance between an approximating model under consideration and the unknown "true" model. The AIC provides a way to assess whether the fit of models (based upon likelihood scores) sufficiently improves when parameters are added to justify the increased model complexity.

The best model proved to have 2 {omega} values, a high {omega} value ({omega} {approx} 1.6) for 2 recent branches leading to the bovine lineage and a lower one ({omega} {approx} 0.3) for the remainder of the tree (Supplementary Material online). All other groupings increased the complexity of the model and its fit to the data, but that increase was not sufficient to compensate for the cost of having an increased number of parameters. This suggested that the seminal RNase gene was subject to purifying selection during most of its evolutionary history, with the exception of a brief and recent period of positive selection in the immediate ancestry of oxen. Given that {omega} {approx} 0.3 throughout the majority of the tree, these results are more consistent with the multiple gene inactivation narrative than other narratives.

Estimates of {omega} Suggest That Seminal RNase Genes Were Subject to Selective Constraint
The pseudogene reactivation model predicts that estimates of {omega} for most branches within the seminal RNase tree will be near unity because the sequences would have undergone neutral evolution after selective constraints were lost. In contrast, independent gene inactivation predicts that the evolution after gene inactivation will be free of constraint ({omega} {approx} 1 for external branches leading to the modern pseudogenes in various ruminants), whereas internal branches will show evidence for selective constraint ({omega} < 1).

If we assume the independent gene inactivation narrative, the data are inadequate to say where along the external branches the events that created the pseudogenes occurred. Thus, the functional constraints along those branches are unknown. If the event occurred early, then the sequence underwent neutral drift during most of the time represented by that branch, and the branch should have {omega} {approx} 1. If the event occurred late, then the sequence was diverging under functional constraint along most of the branch in question, and {omega} is expected to be less than unity.

The model with separate {omega} values for each branch (considered to be overparameterized by the AIC) is consistent with the independent inactivation model as well. In this model, the estimates of {omega} for the majority of internal branches were substantially less than unity. Likewise, the estimates of {omega} for terminal branches leading to pseudogenes (hog deer and roe deer, okapi, saiga, kudu, Cape buffalo, and forest buffalo) and suspected pseudogenes (duiker and water buffalo) were less than unity.

We then asked how robust these inferences were. To examine rigorously the ability of ML estimates of {omega} to detect neutral evolution over the tree as a whole, we simulated seminal RNase sequence evolution assuming either constraint ({omega} = 0.3) or neutral evolution ({omega} = 1). Although the {omega} estimates obtained in analyses using the simulated data showed substantial variance (fig. 3), probably reflecting the small number of sites available to calculate {omega} given the short length of the seminal RNase sequences, the distributions did not overlap. This provides strong support for the contention that the seminal RNase sequences were subject to constraint over much of their history, an observation consistent with the independent inactivation narrative. It is important to note the {omega} variance increased in the neutral evolution simulations (fig. 3).


Figure 3
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Results of simulations assuming constrained and neutral evolution. Black curve shows the distribution of {omega} estimates if the true {omega} value for the internal branches is 0.3. Red curve shows the distribution {omega} estimates if the true {omega} value for internal branches is unity.

 
Information from RNase Structure Supports Purifying Selection
We then considered a different approach to addressing these questions, one that did not solely rely on linear sequence data. If a protein is divergently evolving without functional constraint, the sites holding amino acid replacements should be randomly distributed with respect to its surface, interior, active site, and other functionally relevant features of its folded structure. In contrast, if selective pressures constrain amino acid replacement, then the distribution of amino acid replacements in the 3-dimensional structure should not be random.

As the structure of seminal RNase is known, this approach could be directly applied to the distribution of amino acid replacements that accumulated along the internal branches in the seminal RNase tree. The sequences of the ancestral proteins were inferred for all ancestral nodes in the RNase tree (Supplementary Material online) using the empirical Bayes method (Yang et al. 1995Go) implemented in the program PAML (Yang 1997Go). Ambiguity was incorporated into the reconstruction by examining multiple evolutionary models (2 codon models and an amino acid model) and by varying the tree topology (using other plausible topologies). All of the amino acids that changed along the internal branches were then mapped on the dimeric bovine seminal RNase structure and visualized in 3 dimensions (fig. 2).

This procedure revealed a distribution of ancestral amino acid replacements that was apparently nonrandom. By eye, amino acid replacements appeared to occur almost entirely on the surface of the biomolecule; sites within the folded core of the protein were largely free of replacements. Further, the RNA-binding site and -active site were unchanged (fig. 2). This pattern of replacements indicated that purifying selection did not permit the accumulation of amino acid replacements that would be most likely disrupt the folding (i.e., those in the core of the protein, recognizing that replacements in the core need not disrupt folding, especially if they are associated with compensatory changes) or enzymatic activity of the protein. In contrast, the replacements that have been permitted appear to be those that are least likely to have an impact on folding or activity (i.e., surface residues, although some surface residues may be exceptions to this general rule). Indeed, one surface residue that changed (cysteine-31) is important for seminal RNase quaternary structure (Mazzarella et al. 1993)Go. However, the change at this site is unlikely to indicate the loss of function because it reflects a change to a cysteine residue at this position (table 2). The inference that the ancestral proteins were active was confirmed in a separate study by "resurrecting" those proteins (Sassi S, unpublished data). These same observations regarding the distribution of amino acid changes were true, leading us to the same conclusions, regardless of whether crystal structures based upon the monomer or the dimer were used (pdb:1N3Z [Sica et al. 2003Go], pdb:1BSR [Capasso et al. 1983Go], pdb:1R5C [Merlino et al. 2004Go], and pdb:1R3M [Berisio et al. 2003Go]).


View this table:
[in this window]
[in a new window]

 
Table 2 Summary of Amino Acid Replacements along Internal Branches of the Seminal RNase Phylogeny

 
To place these observations in a quantitative framework, the solvent accessibility of individual amino acid side chains was estimated from the seminal RNase structures using DSSP (Kabsch and Sander 1983Go). The residues were placed into 2 bins depending upon their solvent accessibility. Residues with solvent accessibility <10% were considered to be in the core. These core residues represented 36.7% of the 124 amino acids in seminal RNase; the remaining 63.3% of the residues were considered to be surface residues because they are solvent accessible sites. Slightly more than 94% of the sites that accumulated amino acid replacements were solvent accessible, which is significantly more than expected if the residues that accumulated replacements were randomly distributed ({chi}2 = 7.5130; degree of freedom (df) = 1; P = 0.008).

This structural mapping analysis supports the conclusion that seminal RNase was subject to purifying selection because amino acid replacements appear to have followed the expected pattern for an active enzyme. Therefore, these results support the model with multiple gene inactivation events that took place in the recent evolution of the clade, as represented by the external branches of the tree. This independent line of reasoning is entirely consistent with that obtained from ML estimates of {omega}.

Use of {omega} and Crystallographic Analysis to Understand Deer Seminal RNases
The lineage leading to the seminal RNase pseudogenes in the deer presents a unique opportunity for further understanding of both the {omega} estimation and crystallographic tools to determine whether functional constraints were present or absent during an episode of evolutionary history. The seminal RNases pseudogenes from the 2 cervid taxa (hog and roe deer) share common lesions that almost certainly make both unable to encode an active protein; 5 amino acid replacements in the active site (fig. 2). This suggests that those lesions were present in their last common ancestor and that the ancestral protein was also inactive. These 2 deer species are estimated to have diverged ~19 MYA (Hernandez Fernandez and Vrba 2005Go), a significant period of time for a lineage to have been free of functional constraints.

Indeed, the crystallographic analysis developed here supports that conclusion. Of the 13 amino acid replacements estimated for these branches within the cervid evolutionary history, 5 are buried and 8 are not, nearly exactly the same ratio of buried and surface residues found in the protein as a whole (38.5% buried and 61.5% exposed; {chi}2 = 0.017; df = 1; P = 0.896). The estimate of {omega} offers this inference less persuasively, with {omega} = 0.4 and 0.7 estimated initially for the hog and roe deer branches, respectively. A pairwise {omega} estimate using a standard dN and dS estimator (Nei and Gojobori 1986Go) for the hog and roe deer alone is 0.73 because dN is 0.1025 and dS is 0.140. The best model based upon the AIC did not include a separate {omega} estimate for the cervid lineage, however, and a model with a separate {omega} for the deer did not suggest unity as a value ({omega} = 0.5 for the deer branches and {omega} = 0.3 for the remaining internal branches). Standard theory would not interpret these values as evidence for an absence of functional constraints.

If the last common ancestor of hog and roe deer indeed contained a seminal RNase pseudogene, then the gene inactivation must have predated the last common ancestor of the deer included in this study. To see how the timing of gene inactivation influenced estimates of {omega}, seminal RNase evolution was simulated using the gene tree topology and the ML empirical parameters obtained from the RNase sequence alignment (the values estimated by PAML for branch length, transition/transversions ratio ({kappa}), {omega} = 0.3 for internal branches, and the codon frequencies). The terminal branches leading to the hog deer and roe deer were assumed to have {omega} = 1, as was a portion of the internal branch leading to their common ancestor. Because the timing of gene inactivation along that internal branch is not constrained by the data, simulations placed the loss of constraint at various points along the internal branch, with a corresponding increase of {omega} from 0.3 to unity. All simulated data sets were analyzed using codeml from the PAML package (Yang 1997Go), one {omega} value calculated for the 3 branches of the deer clade (fig. 4).


Figure 4
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Subtree including seminal RNase sequences from hog deer and roe deer and the point where these sequences diverge from the okapi–giraffe clade (point P1). Distributions show the frequency of {omega} estimates depending upon the point on branch C where the true {omega} value changes from 0.3 to unity. One thousand data sets have been simulated under each of the shown conditions. The P0 curve shows a distribution of {omega} values for simulations assuming the true {omega} value is 0.3 for all branches. Median {omega} values for P0, P1, P2, P3, and P4 were 0.298, 1.06, 0.76, 0.62, and 0.56, respectively.

 
As expected, the median {omega} estimate increased as the position of the transition along the internal branch (C) was made more ancient (fig. 4). Strikingly, the distributions for {omega} estimates progressively widened as the length of the neutral evolution period was increased; this would be expected to negatively affect the confidence interval. The distribution is widest when {omega} = 1 is assumed for the full length of all 3 branches, the internal branch (C) and the external branches leading to the hog and roe deer. The distribution is much narrower when all the 3 branches are modeled as having evolved under purifying selection ({omega} = 0.3).

Comparing the median values of {omega} from the simulations with the empirical estimate for the deer clade ({omega} = 0.5) indicated that the empirical value is closest to the simulations that assumed gene inactivation just before the speciation of hog deer and roe deer ({omega}med = 0.56 for P4; see fig. 4). However, the empirically measured value is within the confidence interval of all simulations, indicating that we cannot constrain well the timing of the inactivation using this approach. In fact, the 5 active site replacements shared by both deer sequences (fig. 2) can be viewed as stronger evidence that the gene inactivation took place long before the hog and roe deer diverged than the estimates of {omega}.

We then asked whether the small number of substitutions in the seminal RNase lineage was responsible for the inaccurate estimates of {omega}. To this end, we asked if increasing the expected number of mutations in the deer lineage (by lengthening the branches to mimic a more ancient divergence or a faster rate of evolution than estimated from the data) that occur after loss of purifying selection would alter the width of the {omega} distribution. To do this, we repeated the simulations but increased the branch lengths for the deer clade by factors of 10, 20, 50, and 100. The {omega} distribution narrows when the branch is 10-fold longer, but starts to widen again with further increases in the branch length (fig. 5). The distribution is substantially worse when it is 100 times the true lengths, suggesting that such a large number of mutations have resulted in saturation. It is important to note that confidence in {omega} estimates improves with time up to a factor of 20 compared with the empirically measured branch lengths. Thus, the use of ML estimates of {omega} alone is unlikely to provide convincing support for a hypothesis of gene inactivation unless the inactivation is more ancient than the ~19 Myr divergence of the deer examined here. Obviously, increased taxon sampling would increase the power of this approach, but there are likely to be fundamental limits like the length of the sequence.


Figure 5
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 5.— Distributions of {omega} estimates depending upon the length of the internal branch (C) of the deer clade. Branch length of internal branch C has been increased by factors of 10, 20, 50, and 100 when simulating the evolution of seminal RNase. A thousand data sets have been simulated under each of the shown conditions. P0, as before, shows a distribution of {omega} values for simulations assuming {omega} to be 0.3 for all branches.

 
Gene Conversion Did Not Have a Major Impact on Seminal RNase Evolution
Interlocus gene conversion is the most plausible mechanism for pseudogene reactivation when multiple lesions are present. Presumably, a functional paralog remaining in the genome would be the source of the information needed for reactivation (in this case, both pancreatic RNase A and brain RNase are available). Only a single narrative involving a gene conversion in the coding region has the potential to repair a defective seminal RNase pseudogene that also can be reconciled with the RNase gene tree. This gene conversion narrative would involve shifting the occurrence of one or more of the probable inactivating mutations in the lesser kudu to a time prior to the divergence of the kudu and bovine lineage, followed by repair due to gene conversion in the bovine lineage. Clearly, this narrative is less parsimonious than a narrative that postulates that the inactivating mutations in the kudu occurred after the kudu diverged from the bovine lineage. However, evidence for gene conversion involving the functional paralogs, either with the potential to repair a mutation or with no obvious functional consequence, would have general implications for the plausibility of pseudogene reactivation as a mechanism for the origin of genes with novel functions.

The spectrum of gene conversion tract lengths in humans is highly variable and has a relatively short mean tract length, with estimates of ~50 bp (Bosch et al. 2004Go; Jeffreys and May 2004Go). Although there is likely to be both among-species and among-locus variation in the rate of gene conversion and the mean tract length, the human data are likely to provide information about the plausible tract length spectrum. Although the relatively high degree of divergence between seminal RNase and the other RNase genes may have an impact on this tract length spectrum, we felt that the available data on gene conversion tracts indicated that we should expect short tracts where the functional bovine seminal RNase would appear more similar to a functional paralog (RNase A or brain RNase genes) than to the seminal RNase pseudogenes.

To identify sites of this type, we took advantage of the fact that all sites are independent in a maximum parsimony analysis (each site can support trees that differ from the optimal tree). Briefly, the parsimony tree lengths for all characters were measured on the probable "true" tree (fig. 1) and alternative trees that place the bovine lineage within the paralogs ("gene conversion tree"; see Supplementary Material online). Five sites have a shorter parsimony tree length supporting one of the gene conversion trees over the probable true tree. This provided a limited set of candidate sites for gene conversion. The candidate sites, however, were spread throughout the sequence and clearly do not represent a single gene conversion tract (see Supplementary Material online). Further, they are distinct from the position of the kudu lesions, and in fact, the sites are in positions distinct from all of the observed lesions in the cited species.

It is also important to note that the candidate sites are not active site amino acids. Because the candidate sites can also be explained by homoplastic single-base substitutions, it seems reasonable to conclude that interparalog gene conversion has had little or no impact on the evolution of bovine seminal RNase.


    Conclusion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
This paper introduces a new approach based on crystallographic data for assessing whether purifying selection acted on ancestral genes. Applied to the seminal RNase gene family, the approach suggests inferences consistent with those made by classical analyses based upon estimates of {omega}. Both inferences are contrary to the inference, frequently mentioned in the literature, that bovine seminal RNase arose in modern ox through the reactivation of an ancestral pseudogene. Instead, it appears that seminal RNase evolved under selective constraints through much of its history, independently suffering lesions that rendered it a pseudogene in the external leaves of multiple ruminants, and underwent an episode of rapid sequence evolution, presumably adaptive, in the lineage connecting modern oxen with their immediate ancestor.

One virtue of combining these 2 types of analysis is that they draw on quite different data sets. Further, they speak to the status of a gene as a potential pseudogene in the event that a chance lesion, like a frameshift or removal of a residue known to be essential for catalysis, does not. Common to both is the need to infer ancestral sequences and the changes along individual branches from extant sequences. Such inferences include uncertainties that are well understood in the community, as well as those that are less well understood (Zhang and Nei 1997Go; Williams et al. 2006Go). The crystallographic analysis relies, however, on independent concepts of the physicochemical properties necessary for protein folding and function. Models of amino acid replacement (Jones et al. 1992Go; Koshi and Goldstein 1998Go; Whelan and Goldman 2001Go) also incorporate physicochemical information, but only in an implicit manner reflecting empirical information. Whereas efforts to incorporate 3-dimensional structural information in evolutionary models have been productive for some time (Benner 1989Go; Thorne et al. 1996Go; Goldman et al. 1998Go; Pollock et al. 1999Go; Fornasari et al. 2002Go; Robinson et al. 2003Go; Rodrigue et al. 2005Go; Yu and Thorne 2006Go), progress in both sequence genomics and structural genomics is required for these to be universally applicable.

Obviously, sufficient time and consequently a sufficient number of evolutionary events (e.g., nucleotide substitutions and/or amino acid replaces) are required to draw inferences for any segment in a tree. For example, the 19 Myr (for a divergence in time of 38 Myr) separating hog and roe deer was sufficient to be associated with 13 inferred amino acid replacements. This is both sufficient to allow the crystallographic approach to identify the pseudogene status of the hog deer and roe deer seminal RNase genes and is consistent with the number of replacements expected during drift in a typical mammalian lineage. This time period was not, however, sufficiently long to allow estimates of {omega} to persuasively allow inference of the same status. Indeed, our data suggest that sequence length and taxon sample may limit these inferences as well, as these create large variances for {omega}.

These examples, combined with the high variance in the estimate of {omega} found by our simulations, illustrate the need for multiple approaches to complement classical approaches based on the estimation of {omega} when evaluating the possibilities of constrained evolution, neutral evolution, and adaptive evolution in ancestral lineages. Such approaches can even include experiments, through the resurrection of ancestral proteins within a lineage for study in the laboratory, a field now coming to be called "paleogenetics" (Jermann et al. 1995Go; Chandrasekharan et al. 1996Go; Messier and Stewart 1997Go; Golding and Dean 1998Go; Zhang et al. 2000Go, 2002Go; Thornton 2004Go; Thomson et al. 2005Go; Benner et al. 2006Go; Sassi et al. in press). Once resurrected, ancestral proteins can be studied in the laboratory, allowing the evaluation of their biochemical properties via functional assays. These results may have clear implications for whether changes were neutral or adaptive. Paleogenetic studies become especially important when the confidence in {omega} values is low.

The combination of analyses focused on the estimation of {omega}, analyses based upon protein structure and simulations can be interpreted as strong support for a multiple inactivation model for seminal RNase. A corollary of this conclusion is the inference that the ancestral seminal RNase genes had a function and that the function involved the specification of a protein. Although there are examples of pseudogenes that have a regulatory role (Hirotsune et al. 2003Go), it is difficult to construct a model in which a regulatory pseudogene (i.e., one acting at the RNA level) would show the patterns of conservation evident in seminal RNase (specifically constraints on nonsynonymous sites, especially those that alter residues buried in the protein structure). This finding is suggestive of an environmental change that rendered seminal RNase either irrelevant or disadvantageous in a variety of ruminants, with the exception of the oxen where the gene shifted to novel function. Further understanding of the ancestral function for seminal RNase in ruminants may require the combination of biochemical experiments from reconstructed ancestors with any information about the ancestral patterns of gene expression that can be obtained. Regardless of the basis for the multiple losses, it is clear that seminal RNase should no longer be viewed as an example of pseudogene inactivation.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
We are indebted to the National Aeronautics and Space Administration Exobiology program for support of this research. We are also indebted to Ross P. Davis for invaluable computer support.


    Footnotes
 
Michele Vendruscolo, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 

    Asenjo AB, Rim J, Oprian DD. (1994) Molecular determinants of human red/green color discrimination. Neuron 12:1131–1138.[CrossRef][ISI][Medline]

    Balakirev ES and Ayala FJ. (2003) Pseudogenes: are they "junk" or functional DNA? Annu Rev Genet 37:123–151.[CrossRef][ISI][Medline]

    Benner SA. (1989) Patterns of divergence in homologous proteins as indicators of tertiary and quaternary structure. Adv Enzyme Regul 28:219–236.[CrossRef][ISI][Medline]

    Benner SA, Sassi SO, Gaucher EA. (2006) Molecular paleoscience: systems biology from the past. Adv Enzymol Relat Areas Mol Biol 75:1–132.

    Berisio R, Sica F, De Lorenzo C, Di Fiore A, Piccoli R, Zagari A, Mazzarella L. (2003) Crystal structure of the dimeric unswapped form of bovine seminal ribonuclease. FEBS Lett 554:105–110.[CrossRef][ISI][Medline]

    Bielawski JP and Yang Z. (2003) Maximum likelihood methods for detecting adaptive evolution after gene duplication. J Struct Funct Genomics 3:201–212.[CrossRef][Medline]

    Bosch E, Hurles ME, Navarro A, Jobling MA. (2004) Dynamics of a human interparalog gene conversion hotspot. Genome Res 14:835–844.[Abstract/Free Full Text]

    Braun EL, Halpern AL, Nelson MA, Natvig DO. (2000) Large-scale comparison of fungal sequence information: mechanisms of innovation in Neurospora crassa and gene loss in Saccharomyces cerevisiae. Genome Res 10:416–430.[Abstract/Free Full Text]

    Breukelman HJ, van der Munnik N, Kleineidam RG, Furia A, Beintema JJ. (1998) Secretory ribonuclease genes and pseudogenes in true ruminants. Gene 212:259–268.[CrossRef][ISI][Medline]

    Brosius J and Gould SJ. (1992) On "genomenclature": a comprehensive (and respectful) taxonomy for pseudogenes and other "junk DNA". Proc Natl Acad Sci USA 89:10706–10710.[Abstract/Free Full Text]

    Burnham KP, Anderson DR, Burnham KP. (2002) Model selection and multimodel inference: a practical information-theoretic approach. (Springer, New York).

    Capasso S, Giordano F, Mattia CA, Mazzarella L, Zagari A. (1983) Refinement of the structure of bovine seminal ribonuclease. Biopolymers 22:327–332.[CrossRef][ISI][Medline]

    Chandrasekharan UM, Sanker S, Glynias MJ, Karnik SS, Husain A. (1996) Angiotensin II-forming activity in a reconstructed ancestral chymase. Science 271:502–505.[Abstract]

    Doxiadis GG, van der Wiel MK, Brok HP, de Groot NG, Otting N, 't Hart BA, van Rood JJ, Bontrop RE. (2006) Reactivation by exon shuffling of a conserved HLA-DR3-like pseudogene segment in a New World primate species. Proc Natl Acad Sci USA 103:5864–5868.[Abstract/Free Full Text]

    Endo T, Ikeo K, Gojobori T. (1996) Large-scale search for genes on which positive selection may operate. Mol Biol Evol 13:685–690.[Abstract]

    Fornasari MS, Parisi G, Echave J. (2002) Site-specific amino acid replacement matrices from structurally constrained protein evolution simulations. Mol Biol Evol 19:352–356.[Abstract/Free Full Text]

    Golding GB and Dean AM. (1998) The structural basis of molecular adaptation. Mol Biol Evol 15:355–369.[Abstract]

    Goldman N, Thorne JL, Jones DT. (1998) Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics 149:445–458.[Abstract/Free Full Text]

    Harrison PM and Gerstein M. (2002) Studying genomes through the aeons: protein families, pseudogenes and proteome evolution. J Mol Biol 318:1155–1174.[CrossRef][ISI][Medline]

    Harrison PM, Zheng D, Zhang Z, Carriero N, Gerstein M. (2005) Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability. Nucleic Acids Res 33:2374–2383.[Abstract/Free Full Text]

    Hernandez Fernandez M and Vrba ES. (2005) A complete estimate of the phylogenetic relationships in Ruminantia: a dated species-level supertree of the extant ruminants. Biol Rev Camb Philos Soc 80:269–302.[Medline]

    Hirotsune S, Yoshida N, Chen A, Garrett L, Sugiyama F, Takahashi S, Yagami K, Wynshaw-Boris A, Yoshiki A. (2003) An expressed pseudogene regulates the messenger-RNA stability of its homologous coding gene. Nature 423:91–96.[CrossRef][Medline]

    Hughes AL. (1994) The evolution of functionally novel proteins aft