MBE Advance Access originally published online on April 17, 2006
Molecular Biology and Evolution 2006 23(7):1357-1369; doi:10.1093/molbev/msk022
Research Article
Genomic and Evolutionary Insights into Genes Encoding Proteins with Single Amino Acid Repeats
Department of Biological Sciences and Bioengineering, Indian Institute of Technology, Kanpur, India
E-mail: sganesh{at}iitk.ac.in.
| Abstract |
|---|
Mutations causing expansion of amino acid repeats are responsible for 19 hereditary disorders. Repeats in several other proteins also show length variations. These observations prompted us to identify single amino acid repeat-containing proteins (SARPs) in humans and to understand their functional and evolutionary significance. We identified 8812 SARPs containing 17 146 repeat domains, each harboring 4 or more residues. In all, 5% of SARPs (471) showed repeat length variations, and nearly 84% of them (394) have repeats of 10 residues or less. We find that SARPs are involved in functions that require formation of multiprotein complexes. Nearly 78% (6859) of the SARPs did not find a paralogue in the human proteome, and such proteins are considered orphan SARPs. Orphan SARPs show longer repeat stretches, longer peptide length, and lower expression levels than SARPs belonging to protein families. Because the intensity of gene expression is known to relate inversely to the rate of protein sequence evolution, our results suggest that the orphan SARPs evolve faster than the familial forms and are therefore under weaker selection pressure. We also find that while GC-rich codons are favored for coding the repeat tracts of SARPs, specific codons and not nucleotide motifs per se are selected, suggesting functional constraints on codon usage. One of the constraints could be mRNA stability, because clustering of rare codons is known to destabilize transcripts and rare codons are not favored for coding repeat tracts. Genes encoding polymorphic SARPs show preferential localization toward the telomeric segments. Further, the sex-specific recombination rates of the chromosomal locus strongly correlate with the parental gender that influences repeat instability in disorders caused by dynamic mutation. Therefore, instability associated with repeats might be driven by processes that are specific to sperm or oocyte development, and the recombination frequency might play a positive role in this process.
Key Words: trinucleotide repeats; dynamic mutation; repeat instability; orphan proteins; sequence evolution; sex-specific recombination
| Introduction |
|---|
Repeat instability is a unique dynamic mutation mechanism that is linked to more than 40 neurological, muscular, and developmental disorders (reviewed in Pearson et al. 2005).
A variety of models have been proposed to explain the repeat expansions, and it is widely believed that the secondary structure that the repeat tracts might form could be a critical step in the expansion process (McMurray 1999; Cleary and Pearson 2005; Gatchel and Zoghbi 2005; Pearson et al. 2005). While all possible combinations of nucleotides are known to exist as triplet repeats, questions such as why some are more common than others, why repeat lengths vary among genes, and why certain repeat loci are more unstable when transmitted through one sex are important from an evolutionary and genetic point of view. Though there are reports on the amino acid repeats in the human proteome (Karlin and Burge 1996; Karlin et al. 2001; Alba and Guigo 2004), a majority of such studies have considered proteins with repeat domains longer than 10 amino acid residues. However, with the discovery that instability of 5 consecutive aspartic acid residues within the cartilage oligomeric matrix protein (COMP) is associated with 2 distinct types of dysplasia (Delot et al. 1999; Song et al. 2003), it is imperative that proteins with shorter repeat domains also be catalogued and analyzed. Realizing the importance of amino acid repeats in the proteome and in human disorders, we undertook a study to analyze in detail the amino acid repeat distribution in proteins and the nucleotide repeats associated with them. We show here that nearly 77% of the polymorphic repeat-containing proteins have repeat domains that are less than 10 residues long and are enriched in the Morbid and Online Mendelian Inheritance in Man (OMIM) databases. We also show that genes encoding repeat-containing proteins belonging to gene families are expressed at higher levels and evolve at a slower rate than genes encoding orphan proteins with repeats.
| Materials and Methods |
|---|
Identification of Genes Encoding Single Amino Acid Repeat-Containing Proteins, Chromosomal Mapping, and Codon Analysis
We developed an in-house program to search the Human Reference Sequence database (version 13, released on 16 September 2005) and to retrieve single amino acid repeat-containing proteins (SARPs) containing one or more repeat domains. Three files, namely, human.rna.fna.gz (containing fasta-formatted human RNA sequences), human.protein.faa.gz (human protein sequences in fasta format), and human.rna.gbff.gz (containing annotation information), were downloaded from the National Center for Biotechnology Information (NCBI) ftp server (http://www.ncbi.nlm.nih.gov/Ftp/). The amino acid repeat size threshold of 4 (i.e., 4 or greater) was based on the observation that the smallest repeat in a SARP previously implicated in a human disorder (COMP protein) has 5 aspartic acid residues and that loss or gain of one residue in this repeat motif is pathogenic (Delot et al. 1999).
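The in-house program itself is not described in detail, so the following is only a minimal sketch of how such a scan could look, assuming a plain FASTA protein file and a regular-expression search for uninterrupted runs. The names `read_fasta`, `find_repeats`, and `Repeat` are hypothetical and are not taken from the authors' software; only the threshold of 4 comes from the text.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Tuple


class Repeat(NamedTuple):
    residue: str  # the repeated amino acid (one-letter code)
    start: int    # 0-based offset of the run within the protein
    length: int   # number of consecutive identical residues


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = (line[1:].split() or [""])[0], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_repeats(seq: str, min_len: int = 4) -> Iterator[Repeat]:
    """Yield every uninterrupted run of one residue with length >= min_len.

    min_len=4 mirrors the "4 or greater" threshold stated in the text.
    """
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    for m in pattern.finditer(seq.upper()):
        yield Repeat(m.group(1), m.start(), m.end() - m.start())
```

Applied to each record of a protein FASTA file, `find_repeats` would report one entry per repeat domain, so a protein with several runs contributes several domains.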
Detection of Polymorphism in the Repeat Domains of SARPs
Expressed sequence tag (EST) clusters from the UniGene database that correspond to repeat-containing cDNA sequences were used for the in silico detection of repeat length variations. The repeat region, along with 10 flanking amino acid residues on each side, was used as the query and aligned with UniGene clusters using a stand-alone TBlastN program. Length differences within the repeat domain were detected by comparing the number of amino acids present within the repeat block of the query with the translated sequence of the EST (subject). Positive hits were manually checked to ensure authenticity. Details of the EST hits and length variants observed for the polymorphic SARPs are available at the Web link http://home.iitk.ac.in/
sganesh/sarp/.
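The comparison of repeat block lengths between query and subject can be sketched as follows. This is an illustrative simplification that assumes the aligned query and translated EST segments are already available as strings; it does not reproduce the TBlastN step, and the function names are hypothetical.

```python
from itertools import groupby


def longest_run(peptide: str, residue: str) -> int:
    """Length of the longest uninterrupted run of `residue` in `peptide`."""
    return max(
        (sum(1 for _ in group) for key, group in groupby(peptide) if key == residue),
        default=0,
    )


def repeat_length_delta(query: str, subject: str, residue: str) -> int:
    """Subject repeat length minus query repeat length.

    A nonzero value flags a candidate length variant that, as in the text,
    would still need manual checking.
    """
    return longest_run(subject, residue) - longest_run(query, residue)
```

For example, a query segment carrying five consecutive D residues compared with a translated EST carrying four gives a delta of -1.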
Calculation of Recombination Rates for Repeat Loci and Subchromosomal Localization of Genes
The sex-specific recombination rates for individual microsatellite markers from a 5-Mb region spanning the selected gene locus (around 2.5 Mb on either side) were added together to get the recombination index (sum value) for the male and female sex. For each gene locus, the markers were identified using the Ensembl genome browser (http://www.ensembl.org), and the recombination data for each marker were obtained from the Decode High Resolution Genetic Map genotype data, Release 1.0 (Kong et al. 2002). For subchromosomal localization of genes, each of the 2 arms of a chromosome was divided into 2 equal halves (using the cytoband information retrieved from the MapView database of NCBI), and the genes were grouped as those located in the centromeric or the telomeric segment (see Supplementary figure 3, Supplementary Material online).
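Both calculations in this subsection reduce to simple arithmetic, sketched below under stated assumptions: markers are given as hypothetical `(position_mb, male_rate, female_rate)` tuples, and the centromere position and chromosome length are supplied in Mb. The function names and data layout are illustrative and do not reflect the format of the genetic map or MapView files.

```python
from typing import Iterable, Tuple

Marker = Tuple[float, float, float]  # (position in Mb, male rate, female rate)


def recombination_index(
    markers: Iterable[Marker], gene_pos_mb: float, flank_mb: float = 2.5
) -> Tuple[float, float]:
    """Sum male and female rates for markers within flank_mb of the gene."""
    male = female = 0.0
    for pos, male_rate, female_rate in markers:
        if abs(pos - gene_pos_mb) <= flank_mb:
            male += male_rate
            female += female_rate
    return male, female


def arm_segment(gene_pos_mb: float, centromere_mb: float, chrom_len_mb: float) -> str:
    """Classify a gene as lying in the telomeric or centromeric half of its arm."""
    if gene_pos_mb < centromere_mb:
        # p arm runs from 0 (telomere) to the centromere.
        midpoint = centromere_mb / 2
        return "telomeric" if gene_pos_mb < midpoint else "centromeric"
    # q arm runs from the centromere to the chromosome end (telomere).
    midpoint = centromere_mb + (chrom_len_mb - centromere_mb) / 2
    return "telomeric" if gene_pos_mb > midpoint else "centromeric"
```

The default `flank_mb=2.5` corresponds to the 5-Mb window described above; how a gene lying exactly on an arm midpoint is assigned is an arbitrary choice in this sketch.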
Functional Annotation of SARPs and Phylogenetic Analyses
We used the Gene Ontology tool, Onto-Express (Draghici and Sharp 1988), to functionally classify the whole proteome and SARPs using molecular function annotations. Gene Ontology may link a single protein with more than one annotation term. The difference in the distribution of molecular functions between the whole proteome and SARPs was tested for significance using the chi-square test. For phylogenetic analyses, protein orthologues were identified using the HomoloGene database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene). Paralogous proteins were identified by blasting a query file containing all SARPs against a database of all available human proteins and selecting hits as paralogues if the alignment length was >80% of the query and alignment identity was >60%. Sequences of orthologous and paralogous proteins were aligned with the ClustalX program (Higgins and Sharp 1988), and the multiple alignments were manually edited and checked for the conservation of amino acid repeats. For phylogenetic tree construction, the multiple sequence alignments were used, and a neighbor-joining phylogram was generated using the ClustalX program (Higgins and Sharp 1988).
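The paralogue selection rule (alignment length >80% of the query and identity >60%) can be expressed as a small filter over BLAST-style hits. The sketch below is an assumption-laden illustration: the `Hit` record and `select_paralogues` function are hypothetical, and the exclusion of self-hits is an assumption not stated in the text.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str       # query protein identifier
    subject: str     # subject protein identifier
    identity: float  # percent identity of the alignment
    align_len: int   # alignment length in residues


def select_paralogues(
    hits: Iterable[Hit],
    query_lengths: Dict[str, int],
    min_coverage: float = 0.80,
    min_identity: float = 60.0,
) -> Dict[str, List[str]]:
    """Map each query to subjects passing both thresholds (strictly greater)."""
    paralogues: Dict[str, List[str]] = {}
    for hit in hits:
        if hit.query == hit.subject:
            continue  # assumption: a protein is not its own paralogue
        coverage = hit.align_len / query_lengths[hit.query]
        if coverage > min_coverage and hit.identity > min_identity:
            paralogues.setdefault(hit.query, []).append(hit.subject)
    return paralogues
```

Queries absent from the returned mapping would correspond to what the text calls orphan SARPs.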
Linking SARPs with the OMIM and Morbid Map
We used the NCBI MapView database (BUILD 35.1; files cyto_gene.md and cyto_morbid.md) to link the OMIM and Morbid data sets with genes encoding SARPs. The gene ID and cytoband position of each gene, for the whole transcriptome and for the subset encoding SARPs, were extracted from the data set and correlated with the disease names by keyword matching.
In Silico Analysis of Transcript Abundance for Genes Encoding SARPs
To estimate the gene expression intensity of SARPs (global as well as tissue specific), we used the EST sequences present within the UniGene clusters (BUILD #18; 6 375 729 EST sequences). The number of ESTs in a gene's cluster was taken to represent its level of expression. In addition to containing sequences that represent a unique gene, a UniGene cluster also provides related information such as the tissue types in which the gene is expressed and its map location. For estimating the breadth of expression, we therefore used data from EST libraries belonging to 32 different tissues, combined into 10 broader groups (representing excretory system, circulatory system, skeletal system, respiratory system, reproductive system, immune system, digestive system, sensory organs, neuronal tissues, and developmental stages). Genes that had
5 EST hits in a given system were considered as not expressed. The number of EST hits observed for each gene was calculated, and the value was averaged for genes encoding familial and orphan SARPs. The difference in expression levels was calculated after one round of normalization, taking the abundance of ESTs for the actin (ACTN1) gene as an internal control, and the relative expression values for genes encoding orphan and familial SARPs were calculated. Serial analysis of gene expression (SAGE) analyses show almost the same level of expression for the ACTN1 gene in the tissue systems analyzed.
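The normalization step can be illustrated with a short sketch, assuming that the group value is the mean EST count per gene and that normalization means dividing by the EST count of the control gene. Both the function name and this exact form of normalization are assumptions; the text does not give the formula.

```python
from statistics import mean
from typing import Sequence


def relative_expression(est_counts: Sequence[int], control_count: int) -> float:
    """Mean EST count of a gene group divided by the control gene's EST count."""
    if control_count <= 0:
        raise ValueError("control gene must have at least one EST")
    return mean(est_counts) / control_count
```

With hypothetical counts, `relative_expression([40, 60], 100)` gives 0.5 for one group and `relative_expression([20, 30], 100)` gives 0.25 for another, a 2-fold difference between the groups.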
| Results |
|---|
Frequent Occurrence of SARPs
A bioinformatics-based approach, applied to the Human Reference Sequence database (version 13, released on 16 September 2005), identified 8812 genes encoding SARPs (table 1; Supplementary figure 1 and Supplementary table 1, Supplementary Material online). In our data set, SARPs constitute around 30% of the total proteome. Curiously, the 8812 SARPs harbor 17 146 single amino acid repeat domains, suggesting that on average each SARP might have more than one repeat domain. Among these, 130 SARPs (1.5%) had isoforms with unique repeat domains, suggesting that the presence or absence of repeat domains in the proteome could be further modulated by alternative mRNA splicing. We found that all 20 amino acids are repeated but with varying degrees of repeatedness (table 1; Supplementary figure 1, Supplementary Material online). Poly-L repeats are significantly overrepresented (15%), whereas poly-W repeats are extremely rare (only 2). Homopolymeric repeats of L, S, E, P, A, and G residues are the most prominent ones. On the other hand, repeats of amino acids I, N, C, Y, M, F, and W were relatively rare. With the exception of E, amino acids that were repeated in the majority of the SARPs are of low molecular weight and aliphatic in nature. Smaller repeats (4–9 residues) are very frequent in the SARPs, as nearly 94% of SARPs in our data set harbor such repeats (Supplementary tables 1 and 2, Supplementary Material online). We also compared the usage of amino acids in repeats with their occurrence in the proteome or the nonrepeat region. By and large, the usage of amino acids in the repeat tracts was found to be representative of the global amino acid composition (Supplementary figure 2, Supplementary Material online). However, there was no strict one-to-one correlation between their usage in the proteome and in the repeat tracts of SARPs.
A Significant Number of SARPs Show Repeat Length Variations
Using bioinformatics approaches, we identified 417 SARPs (about 5%) with amino acid repeat length variations (polymorphic) (table 1; Supplementary figure 1 and Supplementary table 1, Supplementary Material online). When the number of domains was considered, 471 out of 17 146 repeat domains (about 3%) identified in the SARPs were predicted to be polymorphic. A great majority of polymorphic SARPs had shorter repeat lengths (394 polymorphic repeats harbor 4–10 amino acid residues) (Supplementary tables 1 and 2, Supplementary Material online). However, polymorphism is far more frequent in SARPs with longer repeat lengths because nearly 25% of repeats having
20 amino acids exhibit repeat length variations (fig. 1B). We have also analyzed the average peptide length of SARPs. Strikingly, SARPs with polymorphic repeats were significantly longer in length when compared with SARPs with nonpolymorphic repeats (fig. 2A). Moreover, the average length of peptides that lacked amino acid repeats was found to be significantly smaller when compared with the SARPs and the whole proteome (fig. 2A).
Details of length variants observed for the polymorphic SARPs are provided in the Web link http://home.iitk.ac.in/
sganesh/sarp/.
Spatial Distribution and Co-occurrence of Repeat Domains in SARPs
With regard to the length of the repeats in SARPs, amino acids Q, S, E, P, and G, in general, show longer domains (
20 residues). However, we note that there is a sharp decrease in the number of repeat domains with increasing length of the repeat (fig. 1A). Aromatic amino acids (Y and W), in particular, show smaller repeat domains (repeat size 4–8). We also looked at the spatial distribution of the repeat domains in SARPs. Repeats of amino acids L, A, G, N, C, Q, H, and V showed a bias for localization to the amino terminus of the peptide, whereas repeats of F, I, K, and S showed a bias toward the carboxyl terminus. The terminal bias could be of biological importance because in a great majority of disease-associated peptides, the expanding repeats are located at either of the 2 termini (fig. 2B). In our data set, nearly 48% of the SARPs show multiple repeat domains. We, therefore, investigated the frequency with which a repeat of one amino acid occurs with another in the same SARP, and the results of this analysis are shown in table 1. Proline (56%), followed by glutamic acid (46%), showed the strongest correlation for co-occurrence (excluding the self–self pair) in SARPs with multiple repeat domains. Using our default parameters (i.e., uninterrupted repeats), we found that proline (48%), followed by glycine (36%) and glutamine repeats (35%), shows a high frequency of co-occurrence (self–self pair) in SARPs with multiple repeat domains (table 1). By allowing interruptions of up to 5 residues, we also checked whether the predominance of the self–self pair is due to interruptions in homopolymeric repeat tracts. We find that only 28% of the self–self pairs are interrupted by 5 or fewer residues. Among these, Q repeats are more often interrupted (41%) than R repeats (8%) (Supplementary table 3, Supplementary Material online).
GC-Rich Codons Encode Repeat Domains in SARPs
There was a significant overrepresentation of GC-rich codons (81%) in regions that code for repeat domains in SARPs compared with their average occurrence in the total transcriptome (56%) or in transcripts that encode peptides lacking repeats (55%; fig. 3A). Curiously, amino acids that are exclusively coded by GC-rich codons (A, G, and P) are abundant in repeat domains of SARPs. For amino acids that are coded by both GC- and AT-rich codons, a significant increase in the usage of GC-rich codons was found in the corresponding coding regions.
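The text does not define "GC-rich codon" explicitly. The sketch below assumes a codon counts as GC-rich when at least 2 of its 3 bases are G or C, and computes the fraction of such codons in an in-frame coding segment. The definition and the function names are assumptions made for illustration only.

```python
def is_gc_rich(codon: str) -> bool:
    """Assumed definition: at least 2 of the 3 bases are G or C."""
    return sum(base in "GC" for base in codon.upper()) >= 2


def gc_rich_fraction(cds: str) -> float:
    """Fraction of in-frame codons in `cds` that are GC-rich.

    Trailing bases that do not complete a codon are ignored.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons:
        return 0.0
    return sum(is_gc_rich(codon) for codon in codons) / len(codons)
```

Running the same function over the repeat-coding segments and over whole transcripts would give the two kinds of percentages being compared in the paragraph above.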
One of the interesting observations in terms of codon usage is the iteration of repeat motifs. For example, a CAG repeat motif in the coding region can be read as CAG, AGC, or GCA (reading frames 1, 2, and 3, respectively), which encode amino acids Q, S, and A, respectively, when used as codons. We therefore calculated the abundance of the 3 possible codons generated by trinucleotide repeat motifs in the transcriptome. We then calculated the frequency of reiteration (4 or more uninterrupted repeats) of these motifs in the coding sequence. We found that 10 triplet-repeat motifs (encoding 14 amino acids) are overrepresented in the coding region of the transcriptome. We also compared the usage frequency of the respective codons of these 10 motifs in the nonrepeat-coding regions (fig. 3B and C). Our results clearly demonstrate that specific codons, and not motifs per se, are selected for the iteration of amino acids. When the CAG motif is considered as an example, codon CAG (coding for Q) has the highest frequency in repeats, followed by AGC (coding for S) and GCA (coding for A). In contrast, amino acid residue A was predominantly coded by the GCC codon when iterated, although codon GCA (CAG motif) is used when A is not iterated (fig. 3B and C). Thus, elements other than repeat structure (repeat motif) seem to have an impact on repeat generation and instability. It has been shown that the usage of synonymous codons in mRNA is not random, as codon usage is constrained by a combination of tRNA availability and the nature of its codon recognition (Duan and Antezana 2003).
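The relationship between a trinucleotide motif and its three possible codons, and the counting of in-frame codon reiteration, can be sketched as follows. The standard genetic code is built from the conventional TCAG ordering; the helper names and the default of 4 repeats are illustrative assumptions rather than the authors' implementation.

```python
from itertools import groupby, product
from typing import List

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def frames(motif: str) -> List[str]:
    """The three codons obtained by reading a repeated trinucleotide motif
    in reading frames 1, 2, and 3."""
    return [motif[i:] + motif[:i] for i in range(3)]


def count_codon_runs(cds: str, codon: str, min_repeats: int = 4) -> int:
    """Count in-frame, uninterrupted runs of `codon` with >= min_repeats copies."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return sum(
        1
        for key, group in groupby(codons)
        if key == codon and sum(1 for _ in group) >= min_repeats
    )
```

Here `frames("CAG")` yields CAG, AGC, and GCA, which the table translates to Q, S, and A, matching the example in the text. Counting runs separately for each of the three codons across coding sequences is one way to compare how often each frame of the same motif is iterated.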
Genes Encoding SARPs Are Located in Recombination Hot Spots
To explore whether genes encoding amino acid repeats show preferential distribution in the human genome, we checked their chromosomal and subchromosomal localization. For this, each of the 2 arms of a chromosome was divided into 2 equal halves, and the genes were grouped as those located in the centromeric or the telomeric segment. On the whole, SARPs did not show any chromosomal bias; about 33% of genes in each chromosome encode SARPs (data not shown). However, a slight overrepresentation of genes encoding SARPs in the subtelomeric segment was observed (fig. 5). This difference was striking and highly significant when genes encoding polymorphic SARPs were considered separately (fig. 5). To confirm that the differential distribution observed for the genes encoding polymorphic SARPs is not due to sampling error, we generated random data sets of genes and evaluated their localization. The random data sets did not show any preference in subchromosomal localization, suggesting that the overrepresentation observed for genes encoding polymorphic SARPs in the telomeric segment is unlikely to be a random event and could perhaps imply a selection process (fig. 5). This suggestion was strengthened by the observation that 19 out of 24 disease genes associated with repeat instability are located in the telomeric segment (Supplementary figure 3, Supplementary Material online).
Gender is known to influence the transmission of trinucleotide repeats in human disease. For example, the transmission of the repeat through males was less stable than that through females for genes involved in dentatorubral-pallidoluysian atrophy (DRPLA) (Ikeuchi et al. 1996).
Evolution of SARPs
In order to investigate an evolutionary context for repeat-containing proteins, we searched for paralogous and orthologous proteins for SARPs. Out of 8812 SARPs identified in the present study, 1953 (22%) constitute 899 paralogous clusters having 2 or more members. SARPs that do not find a paralogue in the human proteome are considered orphan SARPs. Nearly 78% of the SARPs remained as orphan proteins. The representation of orphan forms in SARPs did not differ significantly from that in the total proteome (81%). However, proteins having larger repeat lengths are more frequent in orphan SARPs than in familial SARPs, and the difference was more significant among polymorphic SARPs (fig. 7). Moreover, the average peptide length of orphan SARPs (702 residues) was greater than that of familial SARPs (610 residues). Intriguingly, a majority of repeat expansion disorders are caused by orphan SARPs (Supplementary figure 3, Supplementary Material online), suggesting that repeats present in the orphan forms are more likely to expand.
We also checked whether the repeat motif in SARPs is evolutionarily conserved. For this analysis, clusters having at least 5 paralogous proteins from the human proteome were considered. Out of 91 such groups, the amino acid repeat motif was conserved in the majority of the members (>80%) in only 26 clusters. Vertebrate orthologues were found for 16 of them and were included in further analyses. Among these, 7 clusters had L repeats, 5 had E repeats, and 4 had K repeats (Supplementary figure 4A–D, Supplementary Material online). These include heat shock proteins (3 clusters), guanine-binding proteins (1 cluster), and structural proteins (2 clusters). The remaining clusters represent uncharacterized hypothetical proteins. We also checked the functional context of these amino acid repeats by analyzing whether the repeat tract falls into any known functional domain. No obvious pattern, however, could be detected.
Functional Groups in SARPs
To investigate the significance of amino acid repeats in SARPs, functional annotation was done using Gene Ontology terms (fig. 8). This analysis reveals that a majority of SARPs are enzymatic in function (fig. 8). Intriguingly, the cellular functions of a majority of SARPs are known, as the "unknown" category is significantly underrepresented for SARPs when compared with the total proteome (fig. 8A and B). Further, detailed analysis of all 20 amino acid repeats shows that most of the SARPs having smaller repeats (repeat size <10 residues), irrespective of the repeating amino acid, are significantly enriched in the functional group "enzyme activity" followed by binding, transporter, receptor, structural, and other functions (fig. 8B). Smaller repeats of A, D, E, G, H, K, P, and Q are overrepresented in enzymatic activity (such as polymerase and transcription factors) followed by binding activity (such as nucleic acid binding and protein binding) (fig. 8B).
We used the OMIM and Morbid databases to assess the potential association of genes encoding SARPs with human genetic disorders (Supplementary table 4, Supplementary Material online). In all, 51% of genes encoding SARPs are identified in the Morbid and/or OMIM database as against 26% for the whole transcriptome. The representation of genes encoding polymorphic SARPs in the 2 databases was far greater (>40%). Thus, genes encoding polymorphic SARPs that are enriched in chromosomal loci known to be associated with disorders (141 genes) are ideal candidates for screening for repeat instability.
Expression Patterns of SARPs
Our in silico expression analysis reveals similar tissue distributions for SARP and non-SARP genes (Supplementary figure 5A and B, Supplementary Material online). We did not find any significant difference with regard to the number of genes expressed in each of the organ systems analyzed (Supplementary figure 5C, Supplementary Material online). However, a significant overrepresentation of ESTs for the genes representing familial SARPs was found, suggesting that the expression level of familial SARPs is higher than that of orphan forms. This difference was consistent for each of the physiological systems analyzed (fig. 9A and B).
| Discussion |
|---|
We show here that SARPs are abundant in the human proteome and that the level of repeatedness and the length of the repeat tracts vary. One of the interesting observations of our study is that SARPs are relatively longer peptides (697 residues), suggesting that the length gain could be due to amino acid repeat tracts. This suggestion was strengthened by the fact that the average length of SARPs excluding the repeat motifs (354 residues) was less than that of non-SARPs (375 residues). It is equally likely that amino acid repeats are not tolerated in smaller proteins because decreasing protein length is known to correlate directly with increased cellular toxicity (Hackam et al. 1998).
In order to understand the functional significance of amino acid repeat tracts, we classified SARPs into 7 functional categories and compared them with the total proteome. Our analysis shows that a majority of SARPs are involved in enzymatic activity, followed by processes related to gene expression. This could perhaps mean that SARPs in general are involved in functions that require formation of multiprotein complexes and that amino acid repeats might facilitate such protein–protein interactions (Lavoie et al. 2003; Faux et al. 2005). It has been suggested that simple repeat sequences offer greater flexibility in protein structure by serving as spacers between other motifs in the protein (Huntley and Golding 2002). This suggestion also implies that the length of the "spacer," at least in certain cases, is likely to tolerate length variations because a significant number of SARPs in the human proteome are polymorphic for repeat lengths and because repeat stretches are shorter in orthologous proteins. Our data reveal that repeats of certain amino acids appear to be preferentially located at the amino or the carboxyl terminal regions. The terminal bias was far more striking for disease-associated repeats because a majority of them (86%) are located at either of the 2 termini. It has been suggested that most repeat regions do not adopt well-ordered structures but instead are disordered (Huntley and Golding 2002). Notwithstanding the distinctive function that the amino acid repeats may offer, it could be suggested that the structural property of repeat tracts might restrict their presence toward the termini.
Although many of the genes that encode SARPs are conserved across vertebrates, most of the time the repeat motif itself is not conserved. The diversity of repeat tracts found in the orthologous proteins of different species suggests that repeat sequences are differentially acquired and lost during evolution at a rate faster than the genes encoding them. A very similar trend was observed when paralogous clusters of SARPs were analyzed. Therefore, it could be suggested that amino acid repeats evolved or were retained in a given SARP to perform a specific function that is unique to the protein and the organism. An alternative explanation would be that repeat motifs are functionally less important and therefore are less conserved during evolution. Intriguingly, proteins having larger repeat lengths (
20 residues) are far more frequent in orphan SARPs than in familial SARPs, suggesting that the 2 forms of SARPs are subjected to differential selection constraints. A great majority of orphan SARPs identified in the present study are vertebrate specific, whereas familial SARPs show orthologous proteins in invertebrates. Moreover, the percent identity observed for the human–mouse orthologous SARP pairs reveals that orphan SARPs are less conserved (76%) than familial SARPs (95%; data not shown). It has been shown that vertebrate-specific genes evolve faster than older genes (Subramanian and Kumar 2004; Alba and Castresana 2005). A possible explanation for the evolutionary origin of orphan genes is that they evolve so fast that the sequence similarity is lost even within a relatively short evolutionary time span (Schmid and Aquadro 2001; Domazet-Loso and Tautz 2003). This, in other words, suggests that the relatively dispensable proteins are subjected to weaker selection constraints and should therefore evolve rapidly and may even accumulate mildly deleterious changes (Hirsh and Fraser 2001). Extending this analogy, it may be proposed that the orphan SARPs, because of the weaker constraints placed on them, may acquire repeat tracts that are longer than those present in the familial SARPs. However, this expansion might not cross a "threshold" because very long repeats could become pathogenic (gain-of-toxic effect) and might get eliminated from the population. Intriguingly, a majority of the repeat tracts that are known to be involved in disorders are longer and are coded by orphan SARPs. Moreover, expansions of the repeats in SARPs more often result in a gain-of-function effect, whereas a complete loss of the genes shows minimal effect on survival. For example, the murine knockouts for the genes involved in SCA1, SCA2, and spinal and bulbar muscular atrophy (Matilla et al. 1998; Yeh et al. 2002; Marrades et al. 2006) exhibit a wild-type or less severe phenotype, although the overexpression of expanded polyglutamine repeats is pathogenic (Mangiarini et al. 1997; Lorenzetti et al. 2000; McManamny et al. 2002; Aguiar et al. 2006). The weaker functional constraints for orphan genes, however, would not be static as there would be a gradual increase in the selection pressure with time, leading to fewer changes in older genes when compared with the novel ones (Alba and Castresana 2005). This could perhaps explain why null mutations for a few orphan SARPs show a severe phenotype when knocked out, for example, HD (Zeitlin et al. 1995). It has been shown that genes that exhibit a slower rate of evolution encode shorter peptides and are expressed ubiquitously and at higher levels (Pal et al. 2001; Subramanian and Kumar 2004). We find that the average peptide length of orphan SARPs is greater than that of familial SARPs. Furthermore, familial SARPs show an almost 2-fold increase in expression levels when compared with orphan SARPs, suggesting that the latter are evolving at a faster rate.
Based on results from various model systems, several molecular mechanisms have been proposed to explain the repeat expansions associated with human disorders (Pearson and Sinden 1998; Sinden 1999; Cleary and Pearson 2005; Pearson et al. 2005; Wells et al. 2005). These include meiotic recombination, DNA replication slippage, and DNA damage repair. Although it is not known whether one or all of these processes contribute to repeat expansion, it is widely believed that the formation of secondary structure by the repeat tract could be a critical step in the expansion process (Pearson and Sinden 1998; Sinden 1999; Cleary and Pearson 2005; Pearson et al. 2005; Wells et al. 2005). Among the various repeats tested, the CTG/CAG repeat was shown to have a higher potential to form secondary DNA structure and may thus enhance repeat instability (Petruska et al. 1996; Pearson and Sinden 1998). Our analyses of the codons encoding the repeat domains of SARPs reveal that a majority of the repeat tracts are coded by mixed codons and are GC rich. However, this pattern did not differ between polymorphic and nonpolymorphic repeat tracts, suggesting that other factors, such as cis-acting elements (Cleary and Pearson 2003; Pearson et al. 2005), might regulate the instability associated with repeats. In coding regions, trinucleotide repeats also represent codons, and therefore the orientation (reading frame) of the repeating unit is also important. For example, CAG, AGC, and GCA repeats represent the same repeating unit (CAG), but they are distinct in coding regions because they code for the amino acids Q, S, and A, respectively. We found that the CAG codon encoding Q has a higher tendency to iterate than AGC or GCA as codons. Our observation that specific codons, and not the nucleotide motifs per se, are selected for the repeat tracts reveals functional constraints placed on the usage of codons in the regions encoding repeat tracts. For example, codon usage reflects selection for translational efficiency, as highly expressed genes tend to use codons that are decoded by abundant cognate tRNAs (Ikemura 1985; Moriyama and Powell 1997; Duret 2000). Moreover, clustering of several rare codons within a narrow region has been shown to destabilize the transcript (Hoekema et al. 1987; Caponigro et al. 1993; Carlini 2005). We show here that rare codons are not favored for coding the repeat tracts, suggesting that mRNA stability could be one of the factors that minimize the usage of rare codons in the repeat tracts, despite their potential to form stable secondary structure and contribute to repeat expansions.
Recombination-based processes have been suggested to be major contributors to the evolution of tandem repeat sequences. Studies have demonstrated that repeats can act as recombination hot spots by enhancing the rate of recombination relative to the genome average (Jeffreys et al. 1998; Richard and Paques 2000). The frequent occurrence of tandem repeats near the chromosomal ends suggests that repeats may flourish near telomeres simply because of higher rates of recombination, or vice versa (Wintle et al. 1997; Kong et al. 2002; Linardopoulou et al. 2005). This view is strengthened by our observation that genes encoding SARPs show preferential localization toward the telomeric segment. The recombination rate is known to differ between males and females in humans; the frequency of recombination in the autosomes of females is about one and a half times that in the autosomes of males (Broman et al. 1998; Kong et al. 2002). However, this difference is not homogeneous, because there are regions in the genome where the recombination rate is particularly high in women and particularly low in men, and vice versa (Kong et al. 2002). Here we show that the sex-specific recombination rates for the genomic regions spanning the 6 genes associated with repeat expansion disorders strongly correlate with the parental gender that positively influences the repeat instability. This led us to hypothesize that regional and sex-specific differences in the recombination rate, in combination with processes that are specific to sperm or oocyte development, might influence the instability of repeats.
We have created a catalogue of all SARPs that have repeat domains of 4 or more residues. The rationale for choosing this small cutoff value was that the 4 consecutive aspartic acid residues within the COMP protein are by far the shortest disease-causing repeat expansion mutation described (Delot et al. 1999). A unique feature of this protein is that both expansion and shortening of the repeat cause the same disease (Delot et al. 1999; Song et al. 2003). Intriguingly, 112 polymorphic SARPs (27%) identified in the present study harbor repeats of 4 residues, and nearly 77% of the polymorphic SARPs show repeats having <10 residues. Thus, by lowering the cutoff value, we were able to identify and annotate a large number of polymorphic and potentially disease-causing SARPs from the RefSeq data set. This suggestion is further strengthened by the observation that the representation of genes encoding polymorphic SARPs in the Morbid and OMIM databases was significantly greater than that of the total proteome or of the nonpolymorphic SARPs. The number of SARPs that exhibit repeat length variation is likely to be higher than reported here, because our approach relied only on screening ESTs, which may represent a small population sample. We therefore hope that this catalogue will be useful for studying various aspects of SARPs and will help in identifying their probable disease associations and evolutionary significance. The details of the 8812 SARPs identified in the present study and the predicted polymorphisms in SARPs are available for download.
| Supplementary Material |
|---|
Supplementary figures 1–5 and tables 1–4 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
This study was supported by a research grant from the Ministry of Human Resource Development, Government of India, to S.G. P.S. was supported by a research fellowship from the Council of Scientific and Industrial Research, Government of India, and S.D.P. received a fellowship from the Indian Institute of Technology, Kanpur.
| Footnotes |
|---|
Jianzhi Zhang, Associate Editor
| References |
|---|
Aguiar J, Fernandez J, Aguilar A et al. (13 co-authors). 2006. Ubiquitous expression of human SCA2 gene under the regulation of the SCA2 self promoter cause specific Purkinje cell degeneration in transgenic mice. Neurosci Lett 392:202–6.
Alba MM, Castresana J. 2005. Inverse relationship between evolutionary rate and age of mammalian genes. Mol Biol Evol 22:598–606.
Alba MM, Guigo R. 2004. Comparative analysis of amino acid repeats in rodents and humans. Genome Res 14:549–54.
Albrecht A, Mundlos S. 2005. The other trinucleotide repeat: polyalanine expansion disorders. Curr Opin Genet Dev 15:285–93.
Amiel J, Trochet D, Clement-Ziza M, Munnich A, Lyonnet S. 2004. Polyalanine expansions in human. Hum Mol Genet 1:R235–43.
Bell MV, Hirst MC, Nakahori Y et al. (17 co-authors). 1991. Physical mapping across the fragile X: hypermethylation and clinical expression of the fragile X syndrome. Cell 64:861–6.
Berger Z, Davies JE, Luo S, Pasco MY, Majoul I, O'Kane CJ, Rubinsztein DC. 2005. Deleterious and protective properties of an aggregate-prone protein with a polyalanine expansion. Hum Mol Genet 15:433–42.
Broman KW, Murray JC, Sheffield VC, White RL, Weber JL. 1998. Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet 63:861–9.
Caponigro G, Muhlrad D, Parker R. 1993. A small segment of the MAT alpha 1 transcript promotes mRNA decay in Saccharomyces cerevisiae: a stimulatory role for rare codons. Mol Cell Biol 13:5141–8.
Carlini DB. 2005. Context-dependent codon bias and messenger RNA longevity in the yeast transcriptome. Mol Biol Evol 22:1403–11.
Choudhry S, Mukerji M, Srivastava AK, Jain S, Brahmachari SK. 2001. CAG repeat instability at SCA2 locus: anchoring CAA interruptions and linked single nucleotide polymorphisms. Hum Mol Genet 10:2437–46.
Cleary JD, Pearson CE. 2003. The contribution of cis-elements to disease-associated repeat instability: clinical and experimental evidence. Cytogenet Genome Res 100:25–55.
Cleary JD, Pearson CE. 2005. Replication fork dynamics and dynamic mutations: the fork-shift model of repeat instability. Trends Genet 21:272–80.
Delot E, King LM, Briggs MD, Wilcox WR, Cohn DH. 1999. Trinucleotide expansion mutations in the cartilage oligomeric matrix protein (COMP) gene. Hum Mol Genet 8:123–8.
De Michele G, Cavalcanti F, Criscuolo C, Pianese L, Monticelli A, Filla A, Cocozza S. 1998. Parental gender, age at birth and expansion length influence GAA repeat intergenerational instability in the X25 gene: pedigree studies and analysis of sperm from patients with Friedreich's ataxia. Hum Mol Genet 7:1901–6.
Domazet-Loso T, Tautz D. 2003. An evolutionary analysis of orphan genes in Drosophila. Genome Res 13:2213–9.
Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA. 2003. Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res 31:3775–81.
Duan J, Antezana MA. 2003. Mammalian mutation pressure, synonymous codon choice and mRNA degradation. J Mol Evol 57:694–701.
Duret L. 2000. tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal translation of highly expressed genes. Trends Genet 16:287–9.
Faux NG, Bottomley SP, Lesk AM, Irving JA, Morrison JR, de la Banda MG, Whisstock JC. 2005. Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Res 15:537–51.
Gatchel JR, Zoghbi HY. 2005. Diseases of unstable repeat expansion: mechanisms and common principles. Nat Rev Genet 6:743–55.
Hackam AS, Singaraja R, Wellington CL, Metzler M, McCutcheon K, Zhang T, Kalchman M, Hayden MR. 1998. The influence of huntingtin protein size on nuclear localization and cellular toxicity. J Cell Biol 141:1097–105.
Higgins DG, Sharp PM. 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237–44.
Hirsh AE, Fraser HB. 2001. Protein dispensability and rate of evolution. Nature 411:1046–9.
Hoekema A, Kastelein RA, Vasser M, de Boer HA. 1987. Codon replacement in the PGK1 gene of Saccharomyces cerevisiae: experimental approach to study the role of biased codon usage in gene expression. Mol Cell Biol 7:2914–24.
Holmes SE, O'Hearn E, Callahan C et al. (12 co-authors). 2001. A CTG trinucleotide repeat expansion in junctophilin 3 is associated with Huntington's disease-like 2 (HDL2). Nat Genet 29:377–8.
Huntley M, Golding GB. 2002. Simple sequences are rare in the Protein Data Bank. Proteins 48:134–40.
Ikemura T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol 2:13–34.
Ikeuchi T, Igarashi S, Takiyama Y, Onodera O, Oyake M, Takano H, Koide R, Tanaka H, Tsuji S. 1996. Non-Mendelian transmission in dentatorubral-pallidoluysian atrophy and Machado-Joseph disease: the mutant allele is preferentially transmitted in male meiosis. Am J Hum Genet 58:730–3.
Jeffreys AJ, Neil DL, Neumann R. 1998. Repeat instability at human minisatellites arising from meiotic recombination. EMBO J 17:4147–57.
Karlin S, Brocchieri L, Bergman A, Mrazek J, Gentles AJ. 2001. Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci USA 99:333–8.
Karlin S, Burge C. 1996. Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc Natl Acad Sci USA 93:1560–5.
Kizawa H, Kou I, Iida A, Sudo A et al. (15 co-authors). 2005. An aspartic acid repeat polymorphism in asporin inhibits chondrogenesis and increases susceptibility to osteoarthritis. Nat Genet 37:138–44.
Kong A, Gudbjartsson DF, Sainz J et al. (16 co-authors). 2002. A high-resolution recombination map of the human genome. Nat Genet 31:241–7.
Lavoie H, Debeane F, Trinh QD, Turcotte JF, Corbeil-Girard LP, Dicaire MJ, Saint-Denis A, Page M, Rouleau GA, Brais B. 2003. Polymorphism, shared functions and convergent evolution of genes with sequences coding for polyalanine domains. Hum Mol Genet 12:2967–79.
Linardopoulou EV, Williams EM, Fan Y, Friedman C, Young JM, Trask BJ. 2005. Human subtelomeres are hot spots of interchromosomal recombination and segmental duplication. Nature 437:94–100.
Lorenzetti D, Watase K, Xu B, Matzuk MM, Orr HT, Zoghbi HY. 2000. Repeat instability and motor incoordination in mice with a targeted expanded CAG repeat in the Sca1 locus. Hum Mol Genet 9:779–85.
Mangiarini L, Sathasivam K, Mahal A, Mott R, Seller M, Bates GP. 1997. Instability of highly expanded CAG repeats in mice transgenic for the Huntington's disease mutation. Nat Genet 15:197–200.
Marrades MP, Milagro FI, Martinez JA, Moreno-Aliaga MJ. 2006. Generation and characterization of Sca2 (ataxin-2) knockout mice. Biochem Biophys Res Commun 339:17–24.
Martindale D, Hackam A, Wieczorek A et al. (13 co-authors). 1998. Length of the protein and polyglutamine tract influence localization and frequency of intracellular aggregates of huntingtin. Nat Genet 18:150–4.
Matera T, Bachetti F, Puppo M et al. (13 co-authors). 2004. PHOX2B mutations and polyalanine expansions correlate with the severity of the respiratory phenotype and associated symptoms in both congenital and late onset central hypoventilation syndrome. J Med Genet 41:373–80.
Matilla A, Roberson ED, Banfi S, Morales J, Armstrong DL, Burright EN, Orr HT, Sweatt JD, Zoghbi HY, Matzuk MM. 1998. Mice lacking ataxin-1 display learning deficits and decreased hippocampal paired-pulse facilitation. J Neurosci 18:5508–16.
McManamny P, Chy HS, Finkelstein DI et al. (12 co-authors). 2002. A mouse model of spinal and bulbar muscular atrophy. Hum Mol Genet 11:2103–11.
McMurray CT. 1999. DNA secondary structure: a common and causative factor for expansion in human disease. Proc Natl Acad Sci USA 96:1823–5.
Moriyama EN, Powell JR. 1997. Codon usage bias and tRNA abundance in Drosophila. J Mol Evol 45:514–23.
Oma Y, Kino Y, Sasagawa N, Ishiura S. 2004. Intracellular localization of homopolymeric amino acid containing proteins expressed in mammalian cells. J Biol Chem 279:21217–22.
Pal C, Papp B, Hurst LD. 2001. Highly expressed genes in yeast evolve slowly. Genetics 158:927–31.
Pearson CE, Edamura KN, Cleary JD. 2005. Repeat instability: mechanisms of dynamic mutations. Nat Rev Genet 6:729–42.
Pearson CE, Sinden RR. 1998. Trinucleotide repeat DNA structures: dynamic mutations from dynamic DNA. Curr Opin Struct Biol 8:321–30.
Petruska J, Arnheim N, Goodman MF. 1996. Stability of intrastrand hairpin structures formed by the CAG/CTG class of DNA triplet repeats associated with neurological diseases. Nucleic Acids Res 24:1992–8.
Pujana MA, Corral J, Gratacos M, Combarros O, Berciano J, Genis D, Banchs I, Estivill X, Volpini V. 1999. Spinocerebellar ataxias in Spanish patients: genetic analysis of familial and sporadic cases. The Ataxia Study Group. Hum Genet 104:516–22.
Ranum LP, Day JW. 2004. Myotonic dystrophy: RNA pathogenesis comes into focus. Am J Hum Genet 74:793–804.
Richard G-F, Paques E. 2000. Mini-and microsatellite expansions: the recombination connection. EMBO Rep 1:122–6.
Schmid KJ, Aquadro CE. 2001. The evolutionary analysis of "orphans" from the Drosophila genome identifies rapidly diverging and incorrectly annotated genes. Genetics 159:589–98.
Sinden RR. 1999. Biological implications of the DNA structures associated with disease-causing triplet repeats. Am J Hum Genet 64:346–53.
Song HR, Lee KS, Li QW, Koo SK, Jung SC. 2003. Identification of cartilage oligomeric matrix protein (COMP) gene mutations in patients with pseudoachondroplasia and multiple epiphyseal dysplasia. J Hum Genet 48:222–5.
Subramanian S, Kumar S. 2004. Gene expression intensity shapes evolutionary rates of the proteins encoded by the vertebrate genome. Genetics 168:373–81.
Sullivan AK, Crawford DC, Scott EH, Leslie ML, Sherman SL. 2002. Paternally transmitted FMR1 alleles are less stable than maternally transmitted alleles in the common and intermediate size range. Am J Hum Genet 70:1532–44.
Sutherland GR, Richards RI. 1995. Simple tandem DNA repeats and human genetic disease. Proc Natl Acad Sci USA 92:3636–41.
Suzuki H, Ueda T, Ichikawa T, Ito H. 2003. Androgen receptor involvement in the progression of prostate cancer. Endocr Relat Cancer 10:209–16.
Toyota T, Yoshitsugu K, Ebihara M et al. (19 co-authors). 2004. Association between schizophrenia with ocular misalignment and polyalanine length variation in PMX2B. Hum Mol Genet 13:551–61.
Trottier Y, Biancalana V, Mandel JL. 1994. Instability of the CAG repeat in Huntington's disease: relation to paternal transmission and age at onset. J Med Genet 31:377–82.
Vijai J, Kapoor A, Ravishankar HM et al. (14 co-authors). 2005. Protective and susceptibility effects of hSKCa3 allelic variants on juvenile myoclonic epilepsy. J Med Genet 42:439–42.
Wells RD, Dere R, Hebert ML, Napierala M, Son LS. 2005. Advances in mechanisms of genetic instability related to hereditary neurological diseases. Nucleic Acids Res 33:3785–98.
Wintle RF, Nygaard TG, Herbrick JA, Kvaloy K, Cox DW. 1997. Genetic polymorphism and recombination in the subtelomeric region of chromosome 14q. Genomics 40:409–14.
Yeh S, Tsai MY, Xu Q et al. (16 co-authors). 2002. Generation and characterization of androgen receptor knockout (ARKO) mice: an in vivo model for the study of androgen functions in selective tissues. Proc Natl Acad Sci USA 99:13498–503.
Zeitlin S, Liu JP, Chapman DL, Papaioannou VE, Efstratiadis A. 1995. Increased apoptosis and early embryonic lethality in mice nullizygous for the Huntington's disease gene homologue. Nat Genet 11:155–63.
Zoghbi HY, Orr HT. 2000. Glutamine repeats and neurodegeneration. Annu Rev Neurosci 23:217–47.
[Figure: Male and female sex-specific recombination rates of individual markers (shown against the name of each marker) and the sum of them (sum). The markers are from the 5-Mb region spanning each of the 6 loci (see Materials and Methods). For DRPLA, SCA7, and FRDA, the value (sum*) represents the sum of the recombination rates of more than 6 markers. FMR1 is located on the X chromosome, and hence the recombination rate in males is shown as zero. The maternal or paternal transmission bias of each locus is given within parentheses.]



