MBE Advance Access originally published online on April 17, 2006
Molecular Biology and Evolution 2006 23(7):1357-1369; doi:10.1093/molbev/msk022

© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Article

Genomic and Evolutionary Insights into Genes Encoding Proteins with Single Amino Acid Repeats

Pratibha Siwach, Saurabh Dilip Pophaly and Subramaniam Ganesh

Department of Biological Sciences and Bioengineering, Indian Institute of Technology, Kanpur, India

E-mail: sganesh{at}iitk.ac.in.


    Abstract
Mutations causing expansion of amino acid repeats are responsible for 19 hereditary disorders. Repeats in several other proteins also show length variations. These observations prompted us to identify single amino acid repeat–containing proteins (SARPs) in humans and to understand their functional and evolutionary significance. We identified 8812 SARPs containing 17 146 repeat domains, each harboring 4 or more residues. In all, 5% of SARPs (471) showed repeat length variations, and nearly 84% of them (394) have repeats of 10 residues or less. We find that SARPs are involved in functions that require formation of multiprotein complexes. Nearly 78% (6859) of the SARPs did not find a paralogue in the human proteome, and such proteins are considered orphan SARPs. Orphan SARPs show longer repeat stretches, longer peptide length, and lower expression levels compared with SARPs belonging to protein families. Because the intensity of gene expression is known to relate inversely to the rate of protein sequence evolution, our results suggest that the orphan SARPs evolve faster than the familial forms and therefore are under weaker selection pressure. We also find that while GC-rich codons are favored for coding the repeat tracts of SARPs, specific codons and not nucleotide motifs per se are selected, suggesting functional constraints on codon usage. One of the constraints could be mRNA stability, because clustering of rare codons is known to destabilize transcripts and rare codons are not favored for coding repeat tracts. Genes encoding polymorphic SARPs show preferential localization toward the telomeric segments. Further, the sex-specific recombination rates of the chromosomal locus strongly correlate with the parental gender that influences repeat instability in disorders caused by dynamic mutation. Therefore, instability associated with repeats might be driven by processes that are specific to sperm or oocyte development, and the recombination frequency might play a positive role in this process.

Key Words: trinucleotide repeats • dynamic mutation • repeat instability • orphan proteins • sequence evolution • sex-specific recombination


    Introduction
Repeat instability is a unique dynamic mutation mechanism that is linked to more than 40 neurological, muscular, and developmental disorders (reviewed in Pearson et al. 2005). Among these, the pathological role of the repeat varies with its position in the causative gene. For example, methylation of expanded GC-rich repeats located in the regulatory region leads to silencing of the gene, whereas those in the 3'-untranslated region might lead to aberrant mRNA processing (Bell et al. 1991; Ranum and Day 2004; Gatchel and Zoghbi 2005; Pearson et al. 2005). On the other hand, disorders involving genes containing trinucleotide repeats in the coding region form a distinct group because such repeats encode amino acid tracts within the peptide (Gatchel and Zoghbi 2005). For example, glutamine repeats are associated with 9 neurological disorders (Pearson et al. 2005), alanine repeats with 9 developmental disorders (Albrecht and Mundlos 2005), aspartate repeats with 2 types of dysplasia and osteoarthritis (Delot et al. 1999; Song et al. 2003; Kizawa et al. 2005), and leucine repeats with Huntington's disease–like 2 (Holmes et al. 2001). Amino acid repeat polymorphisms are also known to be associated with complex disorders such as schizophrenia, epilepsy, prostate cancer, and central hypoventilation syndrome, to name a few (Suzuki et al. 2003; Matera et al. 2004; Toyota et al. 2004; Vijai et al. 2005). Although repeat dynamism is known for all these loci, the molecular etiology of amino acid repeat disorders is not identical. For example, whereas a gain of toxic function has been attributed to polyglutamine repeat tracts, it is a loss-of-function effect that underlies disorders associated with unstable repeats of alanine (Zoghbi and Orr 2000; Amiel et al. 2004; Gatchel and Zoghbi 2005). Intriguingly, proteins with longer repeats of alanine have also been shown to have both toxic and protective functions in model systems (Berger et al. 2005). Beyond pathology, amino acid repeats in several other proteins are suggested to be involved in normal cellular functions (Lavoie et al. 2003; Faux et al. 2005).

A variety of models have been proposed to explain repeat expansions, and it is widely believed that the secondary structure that the repeat tracts might form could play a critical role in the expansion process (McMurray 1999; Cleary and Pearson 2005; Gatchel and Zoghbi 2005; Pearson et al. 2005). While all possible combinations of nucleotides are known to exist as triplet repeats, questions such as why some are more common than others, why repeat lengths vary among genes, and why certain repeat loci are more unstable when transmitted through one sex are important from an evolutionary and genetic point of view. Though there are reports on amino acid repeats in the human proteome (Karlin and Burge 1996; Karlin et al. 2001; Alba and Guigo 2004), a majority of such studies have considered proteins with repeat domains longer than 10 amino acid residues. However, with the discovery that instability of 5 consecutive aspartic acid residues within the cartilage oligomeric matrix protein (COMP) is associated with 2 distinct types of dysplasia (Delot et al. 1999; Song et al. 2003), it is imperative that proteins with shorter repeat domains should also be catalogued and analyzed. Realizing the importance of amino acid repeats in the proteome and in human disorders, we undertook a study to analyze in detail the amino acid repeat distribution in proteins and the nucleotide repeats associated with them. We show here that nearly 77% of the proteins containing polymorphic repeats have repeat domains that are less than 10 residues and are enriched in the Morbid and Online Mendelian Inheritance in Man (OMIM) databases. We also show that genes encoding repeat-containing proteins belonging to gene families are highly expressed and evolve at a slower rate when compared with genes encoding orphan proteins with repeats.


    Materials and Methods
Identification of Genes Encoding Single Amino Acid Repeat–Containing Proteins, Chromosomal Mapping, and Codon Analysis
We developed an in-house program to search the Human Reference Sequence database (version 13, released on 16 September 2005) and to retrieve single amino acid repeat–containing proteins (SARPs) containing one or more repeat domains. Three files, namely, human.rna.fna.gz (containing fasta-formatted human RNA sequences), human.protein.faa.gz (human protein sequences in fasta format), and human.rna.gbff.gz (containing annotation information), were downloaded from the National Center for Biotechnology Information (NCBI) ftp server (http://www.ncbi.nlm.nih.gov/Ftp/). The amino acid repeat size threshold of 4 (i.e., 4 or greater) was based on the observation that the smallest repeat in a SARP previously implicated in a human disorder (COMP protein) has 5 aspartic acid residues and loss or gain of one residue in this repeat motif is pathogenic (Delot et al. 1999). We considered only uninterrupted repeats as a repeat motif because a majority of triplet-repeat disease proteins contain runs of homopolymeric repeats and because uninterrupted repeat stretches are more likely to be unstable and show allelic variations (Sutherland and Richards 1995; McMurray 1999; Choudhry et al. 2001). Sequence redundancy in the SARPs was removed using unique identifiers, and the data set was manually curated. When a gene was represented by more than one transcript in the RefSeq database, the repeat domain coded by each isoform was predicted and nonredundant repeat domains were included. Our program also calculated the peptide length, repeat length, co-occurring repeats, spatial location of the repeat in the respective peptide, codon usage in the repeat and nonrepeat regions, reiteration of codons, and other related information. The resulting data were exported to Excel format for tabulation and further analysis, and some of them are provided as supplementary material online. The frequency of rare and common codons was calculated as described by Hoekema et al. (1987). Briefly, codons that are used <13 times in 1000 codons in the RefSeq data set are considered as "rare codons." Chromosome and cytoband locations of genes were extracted from the annotation file.
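The repeat search described above can be illustrated with a minimal sketch. This is not the authors' in-house program; the function name, the regular-expression approach, and the toy sequence are assumptions made for illustration only. It reports every uninterrupted run of a single residue that meets the threshold of 4.

```python
import re


def find_single_aa_repeats(protein, min_len=4):
    """Return (residue, start, length) for each uninterrupted run of one
    residue that is at least min_len long (hypothetical helper)."""
    # One residue followed by at least (min_len - 1) further copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start(), m.end() - m.start())
            for m in pattern.finditer(protein)]


# Toy sequence invented for illustration; it is not a real protein.
toy = "MKDDDDDLLQQQQQQAAAGS"
repeats = find_single_aa_repeats(toy)
print(repeats)
```

In this toy input the run of 3 alanines and the pair of leucines fall below the threshold and are not reported, whereas the 5-residue D run and the 6-residue Q run are.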

Detection of Polymorphism in the Repeat Domains of SARPs
Expressed sequence tag (EST) clusters, derived from the UniGene database and corresponding to repeat-containing cDNA sequences, were used for the in silico detection of repeat length variations. The repeat region, along with 10 flanking amino acid residues on each side, was used as the query and aligned with UniGene clusters using a stand-alone TBlastN program. Length differences within the repeat domain were detected by comparing the number of amino acids present within the repeat block of the query with that in the translated sequence of the EST (subject). Positive hits were manually checked to ensure authenticity. Details of the EST hits and length variants observed for the polymorphic SARPs are available at http://home.iitk.ac.in/~sganesh/sarp/.

Calculation of Recombination Rates for Repeat Loci and Subchromosomal Localization of Genes
The sex-specific recombination rates for individual microsatellite markers from a 5-Mb region spanning the selected gene locus (around 2.5 Mb on either side) were added together to get the recombination index (sum value) for the male and female sex. For each gene locus, the markers were identified using the Ensembl genome browser (http://www.ensembl.org), and the recombination data for each marker were obtained from the Decode High Resolution Genetic Map genotype data, Release 1.0 (Kong et al. 2002). For subchromosomal localization of genes, each of the 2 arms of a chromosome was divided into 2 equal halves (using the cytoband information retrieved from the MapView database of NCBI), and the genes were grouped as those located in the centromeric or the telomeric segment (see Supplementary figure 3, Supplementary Material online).
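The two calculations in this subsection can be sketched as follows. This is a hedged illustration only: the function names and the marker tuples are hypothetical, and physical coordinates in Mb are assumed here in place of the cytoband information used in the study.

```python
def recombination_index(markers, gene_pos_mb, half_window_mb=2.5):
    """Sum male and female recombination rates over markers lying within
    half_window_mb of the gene (a 5-Mb window in total).

    markers: iterable of (position_mb, male_rate, female_rate) tuples.
    """
    male = female = 0.0
    for pos_mb, male_rate, female_rate in markers:
        if abs(pos_mb - gene_pos_mb) <= half_window_mb:
            male += male_rate
            female += female_rate
    return male, female


def chromosome_segment(pos_mb, centromere_mb, length_mb):
    """Split each arm into 2 equal halves and label the half that a
    position falls in (coordinate-based simplification, an assumption)."""
    if pos_mb < centromere_mb:  # p arm spans 0..centromere
        return "telomeric" if pos_mb < centromere_mb / 2 else "centromeric"
    q_mid = (centromere_mb + length_mb) / 2  # q arm spans centromere..end
    return "telomeric" if pos_mb > q_mid else "centromeric"


# Invented marker values for illustration; not taken from the genetic map.
toy_markers = [(10.0, 0.5, 1.5), (11.0, 0.25, 0.75), (20.0, 9.0, 9.0)]
male_sum, female_sum = recombination_index(toy_markers, gene_pos_mb=10.5)
print(male_sum, female_sum, chromosome_segment(10.5, 50.0, 150.0))
```

The third toy marker lies outside the window and so does not contribute to either sum.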

Functional Annotation of SARPs and Phylogenetic Analyses
We used the Gene Ontology tool, Onto-Express (Draghici and Sharp 1988), to functionally classify the whole proteome and SARPs using molecular function annotations. Gene Ontology may link a single protein with more than one annotation term. The difference in the distribution of molecular functions between the whole proteome and SARPs was tested for significance using the chi-square test. For phylogenetic analyses, protein orthologues were identified using the HomoloGene database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene). Paralogous proteins were identified by blasting a query file containing all SARPs against a database of all available human proteins and selecting hits as paralogues if the alignment length was >80% of the query and alignment identity was >60%. Sequences of orthologous and paralogous proteins were aligned with the ClustalX program (Higgins and Sharp 1988), and the multiple alignments were manually edited and checked for the conservation of amino acid repeats. For phylogenetic tree construction, the multiple sequence alignments were used, and a neighbor-joining phylogram was generated using the ClustalX program (Higgins and Sharp 1988).
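The paralogue criteria stated above (alignment length >80% of the query and identity >60%) can be expressed as a small filter. This is a sketch under assumptions: the tuple layout, the function names, and the exclusion of self-hits are illustrative choices and are not described in the text.

```python
def is_paralogue_hit(query_id, subject_id, query_len, align_len, pct_identity):
    """Apply the two thresholds from the text to one BLAST-style hit."""
    if query_id == subject_id:
        return False  # assumption: a protein matching itself is not a paralogue
    return align_len > 0.80 * query_len and pct_identity > 60.0


def split_orphan_familial(hits):
    """Partition query IDs into orphan and familial sets.

    hits: iterable of (query_id, subject_id, query_len, align_len, pct_identity).
    """
    hits = list(hits)
    all_queries = {h[0] for h in hits}
    familial = {h[0] for h in hits if is_paralogue_hit(*h)}
    return all_queries - familial, familial


# Invented hits for illustration only.
toy_hits = [
    ("P1", "P1", 500, 500, 100.0),  # self-hit, ignored
    ("P1", "P2", 500, 450, 72.5),   # passes both thresholds
    ("P3", "P3", 300, 300, 100.0),  # self-hit, ignored
    ("P3", "P4", 300, 120, 90.0),   # alignment too short
]
orphans, familial = split_orphan_familial(toy_hits)
print(orphans, familial)
```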

Linking SARPs with the OMIM and Morbid Map
We used the NCBI MapView database (BUILD 35.1; files cyto_gene.md and cyto_morbid.md) to link the OMIM and Morbid data sets with genes encoding SARPs. The gene IDs and cytoband positions for the whole transcriptome and for genes that encode SARPs were extracted from the data set and correlated with disease names using a keyword-matching method.

In Silico Analysis of Transcript Abundance for Genes Encoding SARPs
To estimate the gene expression intensity of SARPs (global as well as tissue specific), we used the EST sequences present within the UniGene clusters (BUILD #18; 6 375 729 EST sequences). The number of ESTs in a cluster was taken to represent the level of expression. In addition to containing sequences that represent a unique gene, a UniGene cluster also provides related information such as the tissue types in which the gene is expressed and its map location. For estimating the breadth of expression, we therefore used data from EST libraries belonging to 32 different tissues, combined into 10 broader groups (representing excretory system, circulatory system, skeletal system, respiratory system, reproductive system, immune system, digestive system, sensory organs, neuronal tissues, and developmental stages). Genes that had ≤5 EST hits in a given system were considered as not expressed. The number of EST hits observed for each gene was calculated, and the value was averaged for genes encoding familial and orphan SARPs. Differences in expression levels were calculated after one round of normalization, taking the abundance of ESTs for the actin (ACTN1) gene as an internal control, and the relative expression values for genes encoding orphan and familial SARPs were calculated. Serial analysis of gene expression (SAGE) analyses show almost the same level of expression for the ACTN1 gene in the tissue systems analyzed.
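The expression measures described above reduce to simple counting, as in the following sketch. The function names and the EST counts are hypothetical; only the ≤5 cutoff and the normalization against a control gene come from the text.

```python
def breadth_of_expression(est_hits_by_system, max_unexpressed=5):
    """Count tissue systems in which a gene has more than max_unexpressed
    EST hits (genes with <=5 hits are treated as not expressed)."""
    return sum(1 for hits in est_hits_by_system.values()
               if hits > max_unexpressed)


def relative_expression(est_count, control_est_count):
    """Normalize a gene's EST count to that of an internal control gene."""
    return est_count / control_est_count


# Invented counts for illustration only.
toy_gene = {"neuronal tissues": 12, "immune system": 5, "digestive system": 6}
breadth = breadth_of_expression(toy_gene)
relative = relative_expression(est_count=50, control_est_count=200)
print(breadth, relative)
```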


    Results
Frequent Occurrence of SARPs
A bioinformatics-based approach, applied to the Human Reference Sequence database (version 13, released on 16 September 2005), identified 8812 genes encoding SARPs (table 1; Supplementary figure 1 and Supplementary table 1, Supplementary Material online). In our data set, SARPs constitute around 30% of the total proteome. Curiously, the 8812 SARPs harbor 17 146 single amino acid repeat domains, suggesting that on average each SARP might have more than one repeat domain. Among these, 130 SARPs (1.5%) had isoforms with unique repeat domains, suggesting that the presence or absence of repeat domains in the proteome could be further modulated by alternative mRNA splicing. We found that all 20 amino acids are repeated but to varying degrees (table 1; Supplementary figure 1, Supplementary Material online). Poly-L repeats are significantly overrepresented (15%), whereas poly-W repeats are extremely rare (only 2). Homopolymeric repeats of L, S, E, P, A, and G residues are the most prominent ones. On the other hand, repeats of amino acids I, N, C, Y, M, F, and W were relatively rare. With the exception of E, amino acids that were repeated in the majority of the SARPs are of low molecular weight and aliphatic in nature. Smaller repeats (4–9 residues) are very frequent in the SARPs, as nearly 94% of SARPs in our data set harbor such repeats (Supplementary tables 1 and 2, Supplementary Material online). We also compared the usage of amino acids in repeats with their occurrence in the proteome or the nonrepeat region. By and large, the usage of amino acids in the repeat tracts was found to be representative of the global amino acid composition (Supplementary figure 2, Supplementary Material online). However, there was no strict one-to-one correlation between their usage in the proteome and in the repeat tracts of SARPs.


Table 1 Frequency of Amino Acid Usage in the Repeat Domains of SARPs

 
A Significant Number of SARPs Show Repeat Length Variations
Using bioinformatics approaches, we identified 417 SARPs (~5%) with amino acid repeat length variations (polymorphic) (table 1; Supplementary figure 1 and Supplementary table 1, Supplementary Material online). When the number of domains was considered, 471 out of 17 146 repeat domains (~3%) identified in the SARPs were predicted to be polymorphic. A great majority of polymorphic SARPs had shorter repeat lengths (394 polymorphic repeats harbor 4–10 amino acid residues) (Supplementary tables 1 and 2, Supplementary Material online). However, polymorphism is far more frequent in SARPs with longer repeat lengths because nearly 25% of repeats having ≥20 amino acids exhibit repeat length variations (fig. 1B). We have also analyzed the average peptide length of SARPs. Strikingly, SARPs with polymorphic repeats were significantly longer than SARPs with nonpolymorphic repeats (fig. 2A). Moreover, the average length of peptides that lacked amino acid repeats was found to be significantly smaller than that of the SARPs and the whole proteome (fig. 2A).


FIG. 1.— Longer repeat domains, though less frequent in SARPs, are more likely to be polymorphic. (A) The actual number of SARPs having domains of small (≤10 residues), medium (11–20 residues), and large (>20 residues) repeat lengths is shown. (B) The frequency (percentage) of repeat length variations observed (polymorphism) in the 3 groups of repeat domains (small, medium, and large) is shown.

 

FIG. 2.— SARPs are longer peptides. (A) The bar diagram shows the average peptide length in the human proteome (total proteome), proteins lacking amino acid repeats (non-SARPs), all SARPs, nonpolymorphic SARPs (non-poly-SARPs), and polymorphic SARPs (poly-SARPs) (The asterisk denotes P < 0.001; chi-square calculation). (B) Spatial distribution of repeat domains in SARPs associated with 20 human disorders. The location of the repeat domain is identified as amino terminal (N), middle (M), or carboxyl terminal (C) of the peptide. The numbers on the top indicate the number of repeat domains present in that segment (N, M, or C) followed by the number of domains that are polymorphic for the repeat length. The length of the peptide is shown in percentage scale, and the polymorphic repeat domains (filled circles) and repeat domains that are not polymorphic (unfilled circles) are identified. The letters D, A, and Q on the left side indicate the amino acid repeat domains that are implicated in the disorder. The names of the genes/loci encoding these peptides are shown on the right side. Details of the individual repeat domains are given in Supplementary table 1, Supplementary Material online.

 
Details of length variants observed for the polymorphic SARPs are provided in the Web link http://home.iitk.ac.in/~sganesh/sarp/.

Spatial Distribution and Co-occurrence of Repeat Domains in SARPs
With regard to the length of the repeats in SARPs, amino acids Q, S, E, P, and G, in general, show longer domains (≥20 residues). However, we note that there is a sharp decrease in the number of repeat domains with increasing length of the repeat (fig. 1A). Aromatic amino acids (Y and W), in particular, show smaller repeat domains (repeat size 4–8). We also looked at the spatial distribution of the repeat domains in SARPs. Repeats of amino acids L, A, G, N, C, Q, H, and V showed a bias for localization to the amino terminus of the peptide, whereas the amino acid repeats of F, I, K, and S showed a bias toward the carboxyl terminus. The terminal bias could be of biological importance because in a great majority of disease-associated peptides, the expanding repeats are located at either of the 2 termini (fig. 2B). In our data set, nearly 48% of the SARPs show multiple repeat domains. We, therefore, investigated the frequency with which a repeat of one amino acid occurs with another in the same SARP, and the results of this analysis are shown in table 1. Proline (56%), followed by glutamic acid (46%), showed the strongest correlation for co-occurrence (excluding the self–self pair) in SARPs with multiple repeat domains. Using our default parameters (i.e., uninterrupted repeats), we found that proline (48%), followed by glycine (36%) and glutamine repeats (35%), shows a high frequency of co-occurrence (self–self pair) in SARPs with multiple repeat domains (table 1). By allowing interruptions of up to 5 residues, we also checked whether the predominance of self–self pairs is because of interruptions in homopolymeric repeat tracts. We find that only 28% of the self–self pairs are interrupted by 5 or fewer residues. Among these, Q repeats are more often interrupted (41%) as compared with R repeats (8%) (Supplementary table 3, Supplementary Material online).

GC-Rich Codons Encode Repeat Domains in SARPs
There was a significant overrepresentation of GC-rich codons (81%) in regions that code for repeat domains in SARPs as against their average occurrence in the total transcriptome (56%) or in transcripts that encode peptides lacking repeats (55%; fig. 3A). Curiously, amino acids that are exclusively coded by GC-rich codons (A, G, and P) are abundant in repeat domains of SARPs. For amino acids that are coded by both GC- and AT-rich codons, a significant increase in the usage of GC-rich codons was found in the corresponding coding regions.
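A comparison of this kind can be sketched as below. The text does not define "GC-rich codon", so this sketch assumes a codon is GC-rich when at least 2 of its 3 bases are G or C, which is consistent with A (GCN), G (GGN), and P (CCN) being coded exclusively by GC-rich codons. The function names and the toy sequence are hypothetical.

```python
def is_gc_rich(codon):
    """Assumed definition: at least 2 of the 3 bases are G or C."""
    return sum(base in "GC" for base in codon.upper()) >= 2


def gc_rich_fraction(coding_seq):
    """Fraction of complete in-frame codons that are GC-rich."""
    usable = len(coding_seq) - len(coding_seq) % 3  # drop a trailing partial codon
    codons = [coding_seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons:
        return 0.0
    return sum(is_gc_rich(c) for c in codons) / len(codons)


# Toy sequence: GCC and GCA are GC-rich under the assumed rule, AAA and TTT are not.
fraction = gc_rich_fraction("GCCGCAAAATTT")
print(fraction)
```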


FIG. 3.— Codon usage in regions encoding the repeat domains of SARPs. (A) The GC-rich codons are overrepresented in regions that encode amino acid repeat tracts of SARPs compared with their normal occurrence in the total transcriptome (total) or in transcripts that encode proteins lacking amino acid repeats (non-SARPs). Poly-SARPs represent SARPs in which the repeat domain is polymorphic (The asterisk denotes P < 0.001; chi-square calculation). (B and C) Frequency of various trinucleotide repeat motifs in the coding region representing various codons. The usage of various motifs in the transcriptome excluding the region that encodes amino acid repeats is depicted in (B), whereas (C) shows the motif usage of the region that encodes repeats in SARPs. Each of these motifs when repeated would encode amino acids (shown in single-letter code) in the first, second, and the third reading frame, respectively. The relative frequency of these reading frames (codons) is shown within the bar. For example, motif "CAG" would encode amino acids Q, S, and A on reading frames 1 (codon CAG), 2 (codon AGC), and 3 (codon GCA), respectively.

 
One of the interesting observations in terms of codon usage is the iteration of repeat motifs. For example, a CAG repeat motif in the coding region can be read as CAG, AGC, or GCA (reading frames 1, 2, and 3, respectively), and these encode amino acids Q, S, and A, if used as codons in that order. We therefore calculated the abundance of the 3 possible codons generated by trinucleotide repeat motifs in the transcriptome. We then calculated the frequency of reiteration (uninterrupted ≥4 repeats) of these motifs in the coding sequence. We found that 10 triplet-repeat motifs (encoding 14 amino acids) are overrepresented in the coding region of the transcriptome. We also compared the usage frequency of the respective codons of these 10 motifs in the non–repeat-coding regions (fig. 3B and C). Our results clearly demonstrate that specific codons, and not motifs per se, are selected for the iteration of amino acids. When the CAG motif was considered as an example, codon CAG coding for the Q residue has the highest frequency in repeats, followed by AGC (coding S) and GCA (coding A). On the contrary, amino acid residue A was predominantly coded by the GCC codon when iterated, although codon GCA (CAG motif) is used when A is not iterated (fig. 3B and C). Thus, elements other than repeat structure (repeat motif) seem to have an impact on repeat generation and instability. It has been shown that the usage of synonymous codons in mRNA is not random, as codon usage is constrained by a combination of tRNA availability and the nature of its codon recognition (Duan and Antezana 2003). We therefore calculated the codon usage in the transcriptome (global) and in the repeat-encoding region. Our analysis reveals that rare codons are not favored to code for amino acid repeats (fig. 4).
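The relationship between a trinucleotide motif and its 3 reading frames can be checked with a short sketch. The helper name is hypothetical; the codon table is the standard genetic code, built here in the conventional TCAG order.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ("*" marks a stop).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def motif_reading_frames(motif):
    """Return (codon, amino acid) for reading frames 1, 2, and 3 of a
    repeated trinucleotide motif, obtained by rotating the motif."""
    motif = motif.upper()
    codons = [motif[i:] + motif[:i] for i in range(3)]
    return [(codon, CODON_TABLE[codon]) for codon in codons]


frames = motif_reading_frames("CAG")
print(frames)
```

For the CAG motif this reproduces the example in the text: codons CAG, AGC, and GCA encode Q, S, and A, respectively.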


FIG. 4.— Usage of codons for amino acids L, S, P, A, G, Q, V, and T. The codons of each amino acid were grouped as "rare" and "nonrare" forms (see Materials and Methods), and their relative frequency was compared between regions that encode repeats in SARPs and that of the total proteome. Specific codons that were grouped as rare and nonrare for each amino acid are also indicated.

 
Genes Encoding SARPs Are Located in Recombination Hot Spots
To explore whether genes encoding amino acid repeats show preferential distribution in the human genome, we checked their chromosomal and subchromosomal localization. For this, each of the 2 arms of a chromosome was divided into 2 equal halves, and the genes were grouped as those located in the centromeric or the telomeric segment. On the whole, SARPs did not show any chromosomal bias; about 33% of genes in each chromosome encode SARPs (data not shown). However, a slight overrepresentation of genes encoding SARPs in the subtelomeric segment was observed (fig. 5). This difference was striking and highly significant when genes encoding polymorphic SARPs were considered separately (fig. 5). In order to confirm that the differential distribution observed for the genes encoding polymorphic SARPs is not due to sampling error, we generated random data sets of genes and evaluated their localization. The random data sets did not show any preference in subchromosomal localization, suggesting that the overrepresentation observed for genes encoding polymorphic SARPs in the telomeric segment is not likely to be a random event and could perhaps imply a selection process (fig. 5). This suggestion was strengthened by the observation that 19 out of 24 disease genes associated with repeat instability are located in the telomeric segment (Supplementary figure 3, Supplementary Material online).


FIG. 5.— Genes encoding SARPs are more often located in the telomeric segment of chromosomes. The subchromosomal localization of all genes present in the RefSeq database was gathered and analyzed individually for the genes encoding the total human proteome, the SARPs, or the polymorphic SARPs. The relative frequency of genes falling into the centromeric and telomeric segments of the chromosome in each group was calculated and plotted. Because the number of genes grouped in the polymorphic SARPs is relatively small (546 genes), 10 random data sets, each having 546 genes, were created from the RefSeq data and analyzed. The bar represents the mean of the genes from the random data sets falling into the 2 subchromosomal segments (The asterisk denotes P < 0.001; chi-square calculation).

 
Gender is known to influence the transmission of trinucleotide repeats in human disease. For example, the transmission of the repeat through males was less stable than that through females for genes involved in dentatorubral-pallidoluysian atrophy (DRPLA) (Ikeuchi et al. 1996), Huntington disease (HD) (Trottier et al. 1994), and spinocerebellar ataxia 1 (SCA1) (Pujana et al. 1999). However, it is transmission through the female sex that is less stable in the case of Friedreich's ataxia (FRDA) (De Michele et al. 1998) and spinocerebellar ataxia 7 (SCA7) (Pujana et al. 1999). For fragile X mental retardation 1 (FMR1), premutation-size alleles are far more unstable when transmitted through females (Sullivan et al. 2002). One of the reasons for the repeat instability could be the recombination process. FMR1 is located on Xq27.3; therefore, during female meiosis, homologous X chromosomes would pair and may facilitate unequal recombination at the FMR1 locus, leading to expansion of trinucleotide repeats. We therefore looked at the sex-specific recombination rates for the chromosomal loci spanning the genes involved in DRPLA, HD, SCA1, SCA7, and FRDA (fig. 6). Strikingly, the parental gender that influences repeat instability on transmission showed an increased recombination rate for the respective chromosomal locus.


FIG. 6.— Correlations between sex-specific recombination rates and maternal/paternal transmission bias of pathogenic trinucleotide repeat loci. The values represent the male (male) and female (female) sex-specific recombination rates of individual markers (shown against the name of each marker) and the sum of them (sum). The markers are from the 5-Mb region spanning each of the 6 loci (see Materials and Methods). For DRPLA, SCA7, and FRDA, the value (sum*) represents the sum of the recombination rates of more than 6 markers. FMR1 is located in the X chromosome, and hence the recombination rate in male is shown as zero. The maternal or paternal transmission bias of each locus is given within parentheses.

 
Evolution of SARPs
In order to investigate an evolutionary context for repeat-containing proteins, we searched for paralogous and orthologous proteins for SARPs. Out of 8812 SARPs identified in the present study, 1953 (22%) constitute 899 paralogous clusters having 2 or more members. SARPs that do not find a paralogue in the human proteome are considered orphan SARPs. Nearly 78% of the SARPs remained as orphan proteins. The representation of orphan forms in SARPs did not differ significantly from that in the total proteome (81%). However, proteins having larger repeat lengths are more frequent in orphan SARPs than in familial SARPs, and the difference was more significant among polymorphic SARPs (fig. 7). Moreover, the average peptide length of orphan SARPs (702 residues) was greater than that of familial SARPs (610 residues). Intriguingly, a majority of repeat expansion disorders are caused by orphan SARPs (Supplementary figure 3, Supplementary Material online), suggesting that repeats present in the orphan forms are more likely to expand.


 
FIG. 7.— Larger repeat domains are frequent in orphan SARPs. Values of orphan and familial SARPs were plotted for the amino acid repeat lengths and their frequency. The rectangle (drawn in dotted line) delimits a region out of which 80% orphan SARPs and 20% familial SARPs are present, suggesting an overrepresentation for longer repeat domains in orphan SARPs (P < 0.001; 2-way ANOVA). The y axis is shown in log scale.

 
We also checked whether or not the repeat motif in SARPs is evolutionarily conserved. For this analysis, clusters having at least 5 paralogous proteins from the human proteome were considered. Out of 91 such groups, the amino acid repeat motif was conserved in the majority of the members (>80%) in only 26 clusters. Vertebrate orthologues were found for 16 of them and were included for further analyses. Among these, 7 clusters had L repeats, 5 had E repeats, and 4 had K repeats (Supplementary figure 4A–D, Supplementary Material online). These include heat shock proteins (3 clusters), guanine-binding proteins (1 cluster), and structural proteins (2 clusters). The remaining clusters represent uncharacterized hypothetical proteins. We also checked the functional context of these amino acid repeats by analyzing whether the repeat tract falls into any known functional domain. No obvious pattern, however, could be detected.

Functional Groups in SARPs
To investigate the significance of amino acid repeats in SARPs, functional annotation was done using Gene Ontology terms (fig. 8). This analysis reveals that a majority of SARPs are enzymatic in function (fig. 8). Intriguingly, the cellular functions for a majority of SARPs are known, as the "unknown" category is significantly underrepresented for SARPs when compared with the total proteome (fig. 8A and B). Further, detailed analysis of all 20 amino acid repeats shows that most of the SARPs having smaller repeats (repeat size <10 residues), irrespective of the repeating amino acid, are significantly enriched in the functional group "enzyme activity" followed by binding, transporter, receptor, structural, and other functions (fig. 8B). Smaller repeats of A, D, E, G, H, K, P, and Q are overrepresented in enzymatic activity (such as polymerase and transcription factors) followed by binding activity (such as nucleic acid binding and protein binding) (fig. 8B).


 
FIG. 8.— SARPs are enriched in some Gene Ontology molecular functions. (A) Bar diagram illustrating molecular functions of whole proteome, all SARPs, and polymorphic SARPs, as per the Gene Ontology consortium annotations. Significant overrepresentation of all SARPs or polymorphic SARPs in functional categories compared with whole proteome is denoted by an asterisk. (B) Bar diagram illustrating molecular functions of whole proteome and SARPs having repeat domains of specific amino acids (A, D, E, G, H, K, P, and Q). Significant overrepresentation of SARPs having a specific amino acid repeat in functional categories compared with whole proteome is denoted by an asterisk (P < 0.001; chi-square calculation).

 
We used the OMIM and Morbid databases to examine the potential association of genes encoding SARPs with human genetic disorders (Supplementary table 4, Supplementary Material online). In all, 51% of the genes encoding SARPs are listed in the Morbid and/or OMIM database, as against 26% for the whole transcriptome. The representation of genes encoding polymorphic SARPs in the 2 databases was far greater (>40%). Thus, genes encoding polymorphic SARPs that are enriched in chromosomal loci known to be associated with disorders (141 genes) are ideal candidates for screening for repeat instability.
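As an illustration of how a difference between two such proportions might be tested, the sketch below computes a Pearson chi-square statistic for a 2 x 2 table. The text does not say which test was applied to the database proportions (chi-square is named only in the legends of figs. 8 and 9), and the gene counts used here are hypothetical, chosen only to mirror the reported 51% and 26%; treating the two groups as independent samples is also a simplifying assumption.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts (NOT from the paper): listed / not listed in OMIM or Morbid.
sarp_listed, sarp_total = 510, 1000        # 51% of SARP genes
other_listed, other_total = 2600, 10000    # 26% of the comparison set

chi2, p = chi_square_2x2(
    sarp_listed, sarp_total - sarp_listed,
    other_listed, other_total - other_listed,
)
print(f"chi2 = {chi2:.1f}, significant at P < 0.001: {p < 0.001}")
```

With real data, the four cells would be the observed gene counts rather than values back-calculated from percentages.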

Expression Patterns of SARPs
Our in silico expression analysis reveals similar tissue distributions for SARP and non-SARP genes (Supplementary figure 5A and B, Supplementary Material online). We did not find any significant difference with regard to the number of genes expressed in each of the organ systems analyzed (Supplementary figure 5C, Supplementary Material online). However, a significant overrepresentation of ESTs for the genes representing familial SARPs was found, suggesting that the expression level for familial SARPs is higher when compared with orphan forms. This difference was consistent for each of the physiological systems analyzed (fig. 9A and B).


 
FIG. 9.— Familial SARPs show higher expression level compared with orphan SARPs. Difference in the expression levels for genes encoding orphan and familial SARPs was calculated by one round of normalization by taking the abundance of ESTs for the actin gene (ACTN1) as an internal control (see Materials and Methods). In (A), the bar diagram shows the relative expression of genes in various physiological systems. Compared with orphan SARPs, genes encoding familial SARPs show a nearly 2-fold increase in the expression level in each of the systems analyzed. The y axis on the right side is specific to the bar shown for the developmental stages because the expression level of SARPs was nearly 9-fold higher in this group. In (B), the bar diagram depicts the difference in the global expression levels of genes encoding orphan and familial SARPs (all tissues put together). In both figures, the y axis indicates values that are relative to the level of ACTN1 gene expression. Significant difference in the expression level is denoted by an asterisk (P < 0.001; chi-square calculation).

 

    Discussion
 
We show here that SARPs are abundant in the human proteome and that the level of repeatedness and the length of the repeat tracts could vary. One of the interesting observations of our study is that SARPs are relatively longer peptides (697 residues), suggesting that the length gain could be due to amino acid repeat tracts. This suggestion was strengthened by the fact that the average length of SARPs excluding the repeat motifs (354 residues) was less than that of non-SARPs (375 residues). It is equally likely that amino acid repeats are not tolerated in smaller proteins because decreasing protein length is known to have a direct correlation with increased cellular toxicity (Hackam et al. 1998; Martindale et al. 1998). With regard to the biochemical properties of the repeats, our results support an earlier finding that hydrophobic residues are overrepresented in SARPs (Oma et al. 2004). The frequency of individual amino acids used in the repeat tracts was largely found to be representative of global amino acid composition. For example, L is the most commonly used amino acid in the human proteome and is overrepresented in SARPs as well. Thus, the initial seeding of amino acid repeat tracts is likely to be a random event. Once seeded, the continuation as well as the length of the repeat tract appears to be dependent on the amino acid that is iterated, the location of the repeat within the protein, or the protein itself. For example, proline by far is the most abundant amino acid in the repeat region, glutamine shows longer repeat tracts but is absent in proteins representing structural and transporter activity, and leucine repeats are more often present in the amino terminal of proteins. An additional factor could be the co-occurring repeats, as nearly half of the SARPs harbor multiple repeat domains.
This raises an interesting question as to why amino acid repeats in general, and repeat combinations (co-occurrence) in particular, have evolved in the proteome and what functions they might impart to the proteins harboring them.

In order to understand the functional significance of amino acid repeat tracts, we classified SARPs into 7 functional categories and compared them with the total proteome. Our analysis shows that a majority of SARPs are involved in enzymatic activity, followed by processes related to gene expression. This could perhaps mean that SARPs in general are involved in functions that require formation of multiprotein complexes and that amino acid repeats might facilitate such protein–protein interactions (Lavoie et al. 2003; Faux et al. 2005). It has been suggested that simple repeat sequences offer greater flexibility in protein structure by serving as spacers between other motifs in the protein (Huntley and Golding 2002). This suggestion also implies that the length of the "spacer," at least in certain cases, is likely to tolerate length variations because a significant number of SARPs in the human proteome are polymorphic for repeat lengths and because repeat stretches are shorter in orthologous proteins. Our data reveal that repeats of certain amino acids appear to be preferentially located at the amino or the carboxyl terminal regions. The terminal bias was far more striking for disease-associated repeats because a majority of them (86%) are located at either of the 2 terminals. It has been suggested that most repeat regions do not adopt well-ordered structures but instead are disordered (Huntley and Golding 2002). Notwithstanding the distinctive function that the amino acid repeats may offer, it could be suggested that the structural property of repeat tracts might restrict their presence toward the terminals.

Although many of the genes that encode SARPs are conserved across vertebrates, most of the time the repeat motif itself is not conserved. The diversity of repeat tracts found in the orthologous proteins of different species suggests that repeat sequences are differentially acquired and lost during evolution at a rate faster than the genes encoding them. A very similar trend was observed when paralogous clusters of SARPs were analyzed. Therefore, it could be suggested that amino acid repeats evolved or were retained in a given SARP to perform a specific function that is unique to the protein and the organism. An alternative explanation would be that repeat motifs are functionally less important and therefore are less conserved during evolution. Intriguingly, proteins having larger repeat lengths (≥20 residues) are far more frequent in orphan SARPs as against familial SARPs, suggesting that the 2 forms of SARPs are subjected to differential selection constraints. A great majority of orphan SARPs identified in the present study are vertebrate specific, whereas familial SARPs show orthologous proteins in invertebrates. Moreover, the percent identity observed for the human–mouse orthologous SARP pairs reveals that orphan SARPs are less conserved (76%) when compared with familial SARPs (95%; data not shown). It has been shown that vertebrate-specific genes evolve faster than older genes (Subramanian and Kumar 2004; Alba and Castresana 2005). A possible explanation for the evolutionary origin of orphan genes is that they evolve so fast that the sequence similarity is lost even within a relatively short evolutionary time span (Schmid and Aquadro 2001; Domazet-Loso and Tautz 2003). This, in other words, suggests that the relatively dispensable proteins are subjected to weaker selection constraints and should therefore evolve rapidly and may even accumulate mildly deleterious changes (Hirsh and Fraser 2001).
Extending this analogy, it may be proposed that the orphan SARPs, because of the weaker constraints placed on them, may acquire repeat tracts that are longer than those present in the familial SARPs. However, this expansion might not cross a "threshold" because very long repeats could become pathogenic (toxic gain-of-function effect) and might get eliminated from the population. Intriguingly, a majority of the repeat tracts that are known to be involved in disorders are larger in length and are coded by orphan SARPs. Moreover, expansions of the repeats in SARPs more often result in a gain-of-function effect, whereas a complete loss of the genes shows minimal effect on survival. For example, the murine knockouts for the genes involved in SCA1, SCA2, and spinal and bulbar muscular atrophy (Matilla et al. 1998; Yeh et al. 2002; Marrades et al. 2006) exhibit a wild-type or less severe phenotype, although the overexpression of expanded polyglutamine repeats is pathogenic (Mangiarini et al. 1997; Lorenzetti et al. 2000; McManamny et al. 2002; Aguiar et al. 2006). The weaker functional constraints for orphan genes, however, would not be static as there would be a gradual increase in the selection pressure with time, leading to fewer changes in older genes when compared with the novel ones (Alba and Castresana 2005). This could perhaps explain why null mutations for a few orphan SARPs show a severe phenotype when knocked out, for example, HD (Zeitlin et al. 1995). It has been shown that genes that exhibit a slower rate of evolution encode shorter peptides and are expressed ubiquitously and at higher levels (Pal et al. 2001; Subramanian and Kumar 2004). We find that the average peptide length of orphan SARPs is greater than that of familial SARPs. Furthermore, familial SARPs show an almost 2-fold increase in expression levels when compared with orphan SARPs, suggesting that the latter are evolving at a faster rate.

Based on the results obtained from various model systems, a variety of molecular mechanisms have been proposed to explain the repeat expansions associated with human disorders (Pearson and Sinden 1998; Sinden 1999; Cleary and Pearson 2005; Pearson et al. 2005; Wells et al. 2005). These include meiotic recombination, DNA replication slippage, and DNA damage repair. Although it is unknown whether one or all of these processes contribute to repeat expansion, it is widely believed that the secondary structure that the repeat tracts might form could be a critical step in the expansion process (Pearson and Sinden 1998; Sinden 1999; Cleary and Pearson 2005; Pearson et al. 2005; Wells et al. 2005). Among the various repeats tested, the CTG/CAG repeat was shown to have a higher potential to form secondary DNA structure and may thus enhance repeat instability (Petruska et al. 1996; Pearson and Sinden 1998). Our analyses on the usage of codons encoding the repeat domains of SARPs reveal that a majority of the repeat tracts are coded by mixed codons and are GC rich. However, this pattern did not differ between the repeat tracts that are polymorphic and nonpolymorphic, suggesting that other factors, such as cis-acting elements (Cleary and Pearson 2003; Pearson et al. 2005), might regulate the instability associated with repeats. In coding regions, trinucleotide repeats also represent codons, and therefore the orientation of the repeating unit is also important. For example, CAG, AGC, and GCA repeats represent the same repeating unit (CAG), but they are distinct in coding regions because they code for amino acids Q, S, and A, respectively. We found that the CAG codon encoding Q has a higher tendency to iterate as compared with AGC or GCA as codons. Our observation that specific codons and not the nucleotide motifs per se are selected for the repeat tracts reveals functional constraints placed on the usage of codons in the regions encoding repeat tracts.
For example, codon usage reflects selection for translational efficiency as highly expressed genes tend to use codons that are decoded by abundant cognate tRNAs (Ikemura 1985; Moriyama and Powell 1997; Duret 2000). Moreover, clustering of several rare codons within a narrow region has been shown to cause destabilization of the transcript (Hoekema et al. 1987; Caponigro et al. 1993; Carlini 2005). We show here that rare codons are not favored for coding the repeat tracts, suggesting that mRNA stability could be one of the factors that minimize the usage of rare codons in the repeat tracts, despite their potential to form stable secondary structures and contribute to repeat expansions.
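The point that CAG, AGC, and GCA are the same repeating unit read in different frames can be made concrete with a minimal sketch. The function name `frames` and the toy tract are illustrative assumptions, not part of the study's pipeline; only the three codon assignments (CAG = Q, AGC = S, GCA = A) are taken from the standard genetic code as cited in the text.

```python
# Standard genetic code entries for the three codons discussed in the text.
CODON_TO_AA = {"CAG": "Q", "AGC": "S", "GCA": "A"}

def frames(unit, copies):
    """Translate a pure trinucleotide tract in each of its 3 reading frames.

    Returns a dict mapping frame offset (0, 1, 2) to the encoded peptide.
    Only complete codons are translated; trailing bases are ignored.
    """
    tract = unit * copies
    peptides = {}
    for offset in range(3):
        codons = [tract[i:i + 3] for i in range(offset, len(tract) - 2, 3)]
        peptides[offset] = "".join(CODON_TO_AA[c] for c in codons)
    return peptides

# The same (CAG)5 DNA tract encodes polyQ, polyS, or polyA depending on frame.
for offset, peptide in frames("CAG", 5).items():
    print(offset, peptide)
```

The nucleotide sequence is identical in all three cases; only the frame, and hence the codon and the encoded homopeptide, differs.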

Recombination-based processes have been suggested to be major contributors to the evolution of tandem repeat sequences. Studies have demonstrated that the repeats can act as a recombination hot spot by enhancing the rate of recombination relative to the genome average (Jeffreys et al. 1998; Richard and Paques 2000). The frequent association of tandem repeats near the chromosomal ends suggests that repeats may flourish near telomeres simply because of higher rates of recombination or vice versa (Wintle et al. 1997; Kong et al. 2002; Linardopoulou et al. 2005). This has been strengthened by our observation that genes encoding SARPs show preferential localization toward the telomeric segment. The recombination rate is known to differ between males and females in humans; the frequency of recombination in the autosomes of females is about one and a half times that in the autosomes of males (Broman et al. 1998; Kong et al. 2002). However, this difference is not homogeneous because there are regions in the genome where the recombination rate is particularly high in women and particularly low in men and vice versa (Kong et al. 2002). Here we show that the sex-specific recombination rates for the genomic region spanning the 6 genes associated with repeat expansion disorders strongly correlate with the parental gender that positively influences the repeat instability. This led us to hypothesize that regional and sex-specific differences in the recombination rate, in combination with processes that are specific to sperm or oocyte development, might influence the instability of repeats.

We have created a catalogue of all SARPs that have repeat domains of 4 or more residues. The rationale for choosing the small cutoff value was that the 4 consecutive aspartic acid residues within COMP protein are by far the shortest disease-causing repeat expansion mutations described (Delot et al. 1999). A unique feature of this protein is that both expansion and shortening of the repeat cause the same disease (Delot et al. 1999; Song et al. 2003). Intriguingly, 112 polymorphic SARPs (27%) identified in the present study harbor repeats of 4 residues, and nearly 77% of the polymorphic SARPs show repeats having <10 residues. Thus, by lowering the cutoff value, we were able to identify and annotate a large number of polymorphic and potentially disease-causing SARPs from the RefSeq data set. This suggestion is further strengthened by the observation that the representation of genes encoding polymorphic SARPs in the Morbid and OMIM databases was significantly greater when compared with the total proteome or the nonpolymorphic SARPs. The number of SARPs that exhibit repeat length variation is likely to be higher as our approach relied only on screening the ESTs that could perhaps represent a smaller population size. Therefore, we hope that this catalogue will be of much use for studying various aspects of SARPs and that it will be helpful in identifying their probable disease association and evolutionary significance. The details of the 8812 SARPs identified in the present study and the predicted polymorphisms in SARPs are available for download.
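A cutoff of this kind is easy to express as a pattern search. The following is a minimal sketch, not the procedure used in the study (which is described in Materials and Methods): it scans a protein sequence for runs of 4 or more identical residues using a regular expression. The names `MIN_REPEAT` and `find_repeats` and the toy sequence are assumptions made for illustration.

```python
import re

# Minimum run length, following the 4-residue cutoff discussed in the text.
MIN_REPEAT = 4

# One residue followed by at least (MIN_REPEAT - 1) further copies of itself.
RUN = re.compile(r"([A-Z])\1{%d,}" % (MIN_REPEAT - 1))

def find_repeats(seq):
    """Return (residue, 0-based start, length) for each single amino acid run."""
    return [(m.group(1), m.start(), len(m.group(0))) for m in RUN.finditer(seq)]

# Toy sequence (not a real protein): runs of 5 A, 8 Q, and 4 D; "LL" is too short.
toy = "MKT" + "A" * 5 + "LL" + "Q" * 8 + "D" * 4 + "GS"
for residue, start, length in find_repeats(toy):
    print(residue, start, length)
```

A protein would count as a SARP under this sketch if `find_repeats` returns at least one run; proteins with several runs correspond to the multiple repeat domains mentioned earlier.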


    Supplementary Material
 
Supplementary figures 1–5 and tables 1–4 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
 
This study was supported by a research grant from the Ministry of Human Resource Development, Government of India, to S.G. P.S. was supported by a research fellowship from the Council of Scientific and Industrial Research, Government of India, and S.D.P. received a fellowship from the Indian Institute of Technology, Kanpur.


    Footnotes
 
Jianzhi Zhang, Associate Editor


    References
 

    Aguiar J, Fernandez J, Aguilar A et al. (13 co-authors). 2006. Ubiquitous expression of human SCA2 gene under the regulation of the SCA2 self promoter cause specific Purkinje cell degeneration in transgenic mice. Neurosci Lett 392:202–6.

    Alba MM, Castresana J. 2005. Inverse relationship between evolutionary rate and age of mammalian genes. Mol Biol Evol 22:598–606.

    Alba MM, Guigo R. 2004. Comparative analysis of amino acid repeats in rodents and humans. Genome Res 14:549–54.

    Albrecht A, Mundlos S. 2005. The other trinucleotide repeat: polyalanine expansion disorders. Curr Opin Genet Dev 15:285–93.

    Amiel J, Trochet D, Clement-Ziza M, Munnich A, Lyonnet S. 2004. Polyalanine expansions in human. Hum Mol Genet 1:R235–43.

    Bell MV, Hirst MC, Nakahori Y et al. (17 co-authors). 1991. Physical mapping across the fragile X: hypermethylation and clinical expression of the fragile X syndrome. Cell 64:861–6.

    Berger Z, Davies JE, Luo S, Pasco MY, Majoul I, O'kane CJ, Rubinsztein DC. 2005. Deleterious and protective properties of an aggregate-prone protein with a polyalanine expansion. Hum Mol Genet 15:433–42.

    Broman KW, Murray JC, Sheffield VC, White RL, Weber JL. 1998. Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet 63:861–9.

    Caponigro G, Muhlrad D, Parker R. 1993. A small segment of the MAT alpha 1 transcript promotes mRNA decay in Saccharomyces cerevisiae: a stimulatory role for rare codons. Mol Cell Biol 13:5141–8.

    Carlini DB. 2005. Context-dependent codon bias and messenger RNA longevity in the yeast transcriptome. Mol Biol Evol 22:1403–11.

    Choudhry S, Mukerji M, Srivastava AK, Jain S, Brahmachari SK. 2001. CAG repeat instability at SCA2 locus: anchoring CAA interruptions and linked single nucleotide polymorphisms. Hum Mol Genet 10:2437–46.

    Cleary JD, Pearson CE. 2003. The contribution of cis-elements to disease-associated repeat instability: clinical and experimental evidence. Cytogenet Genome Res 100:25–55.

    Cleary JD, Pearson CE. 2005. Replication fork dynamics and dynamic mutations: the fork-shift model of repeat instability. Trends Genet 21:272–80.

    Delot E, King LM, Briggs MD, Wilcox WR, Cohn DH. 1999. Trinucleotide expansion mutations in the cartilage oligomeric matrix protein (COMP) gene. Hum Mol Genet 8:123–8.

    De Michele G, Cavalcanti F, Criscuolo C, Pianese L, Monticelli A, Filla A, Cocozza S. 1998. Parental gender, age at birth and expansion length influence GAA repeat intergenerational instability in the X25 gene: pedigree studies and analysis of sperm from patients with Friedreich's ataxia. Hum Mol Genet 7:1901–6.

    Domazet-Loso T, Tautz D. 2003. An evolutionary analysis of orphan genes in Drosophila. Genome Res 13:2213–9.

    Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA. 2003. Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res 31:3775–81.

    Duan J, Antezana MA. 2003. Mammalian mutation pressure, synonymous codon choice and mRNA degradation. J Mol Evol 57:694–701.

    Duret L. 2000. tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal translation of highly expressed genes. Trends Genet 16:287–9.

    Faux NG, Bottomley SP, Lesk AM, Irving JA, Morrison JR, de la Banda MG, Whisstock JC. 2005. Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Res 15:537–51.

    Gatchel JR, Zoghbi HY. 2005. Diseases of unstable repeat expansion: mechanisms and common principles. Nat Rev Genet 6:743–55.

    Hackam AS, Singaraja R, Wellington CL, Metzler M, McCutcheon K, Zhang T, Kalchman M, Hayden MR. 1998. The influence of huntingtin protein size on nuclear localization and cellular toxicity. J Cell Biol 141:1097–105.

    Higgins DG, Sharp PM. 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73:237–44.

    Hirsh AE, Fraser HB. 2001. Protein dispensability and rate of evolution. Nature 411:1046–9.

    Hoekema A, Kastelein RA, Vasser M, de Boer HA. 1987. Codon replacement in the PGK1 gene of Saccharomyces cerevisiae: experimental approach to study the role of biased codon usage in gene expression. Mol Cell Biol 7:2914–24.

    Holmes SE, O'Hearn E, Callahan C et al. (12 co-authors). 2001. A CTG trinucleotide repeat expansion in junctophilin 3 is associated with Huntington's disease-like 2 (HDL2). Nat Genet 29:377–8.

    Huntley M, Golding GB. 2002. Simple sequences are rare in the Protein Data Bank. Proteins 48:134–40.

    Ikemura T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol Biol Evol 2:13–34.

    Ikeuchi T, Igarashi S, Takiyama Y, Onodera O, Oyake M, Takano H, Koide R, Tanaka H, Tsuji S. 1996. Non-Mendelian transmission in dentatorubral-pallidoluysian atrophy and Machado-Joseph disease: the mutant allele is preferentially transmitted in male meiosis. Am J Hum Genet 58:730–3.

    Jeffreys AJ, Neil DL, Neumann R. 1998. Repeat instability at human minisatellites arising from meiotic recombination. EMBO J 17:4147–57.

    Karlin S, Brocchieri L, Bergman A, Mrazek J, Gentles AJ. 2001. Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci USA 99:333–8.

    Karlin S, Burge C. 1996. Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc Natl Acad Sci USA 93:1560–5.

    Kizawa H, Kou I, Iida A, Sudo A et al. (15 co-authors). 2005. An aspartic acid repeat polymorphism in asporin inhibits chondrogenesis and increases susceptibility to osteoarthritis. Nat Genet 37:138–44.

    Kong A, Gudbjartsson DF, Sainz J et al. (16 co-authors). 2002. A high-resolution recombination map of the human genome. Nat Genet 31:241–7.

    Lavoie H, Debeane F, Trinh QD, Turcotte JF, Corbeil-Girard LP, Dicaire MJ, Saint-Denis A, Page M, Rouleau GA, Brais B. 2003. Polymorphism, shared functions and convergent evolution of genes with sequences coding for polyalanine domains. Hum Mol Genet 12:2967–79.

    Linardopoulou EV, Williams EM, Fan Y, Friedman C, Young JM, Trask BJ. 2005. Human subtelomeres are hot spots of interchromosomal recombination and segmental duplication. Nature 437:94–100.

    Lorenzetti D, Watase K, Xu B, Matzuk MM, Orr HT, Zoghbi HY. 2000. Repeat instability and motor incoordination in mice with a targeted expanded CAG repeat in the Sca1 locus. Hum Mol Genet 9:779–85.

    Mangiarini L, Sathasivam K, Mahal A, Mott R, Seller M, Bates GP. 1997. Instability of highly expanded CAG repeats in mice transgenic for the Huntington's disease mutation. Nat Genet 15:197–200.