MBE Advance Access originally published online on June 29, 2007
Molecular Biology and Evolution 2007 24(12):2598-2609; doi:10.1093/molbev/msm129
Research Articles |
Evolutionary Analysis of Amino Acid Repeats across the Genomes of 12 Drosophila Species
Department of Molecular Biology and Genetics Cornell University
E-mail: mah223{at}cornell.edu.
| Abstract |
|---|
Repeated motifs of amino acids within proteins are an abundant feature of eukaryotic sequences and may catalyze the rapid production of genetic and even phenotypic variation among organisms. The completion of the genome sequencing projects of 12 distinct Drosophila species provides a unique dataset to study these intriguing sequence features on a phylogeny with a variety of timescales. We show that there is a higher percentage of proteins containing repeats within the Drosophila genus than most other eukaryotes, including non-Drosophila insects, which makes this collection of species particularly useful for the study of protein repeats. We also find that proteins containing repeats are overrepresented in functional categories involving developmental processes, signaling, and gene regulation. Using the set of 1-to-1 ortholog alignments for the 12 Drosophila species, we test the ability of repeats to act as reliable phylogenetic signals and find that they resolve the generally accepted phylogeny despite the noise caused by their accelerated rate of evolution. We also determine that in general the position of repeats within a protein sequence is non-random, with repeats more often being absent from the middle regions of sequences. Finally, we find evidence to suggest that the presence of repeats is associated with an increase in evolutionary rate upon the entire sequence in which they are embedded. With additional evidence to suggest a corresponding elevation in positive selection, we propose that some repeats may be inducing compensatory substitutions in their surrounding sequence.
Key Words: protein repeats, simple sequence, homopeptides
| Introduction |
|---|
Evolutionary innovations are often a result of internal duplication events within a genome followed by subsequent modification. In fact, simple sequence repeats are now being recognized as influential sequences with the ability to produce rapid variation, and act as evolutionary tuning knobs (Kashi and King 2006).
The rapid expansion and contraction of simple sequence repeats is generally thought to be facilitated through replicative slippage (Levinson and Gutman 1987). However, we have previously found evidence that selection also plays a role in the evolution and maintenance of some homopolymers within proteins (Huntley and Golding 2006).
The abundance of repeats within eukaryotic protein sequences, and the observation that selection has acted to preserve certain repeats by preventing further slippage, suggest that repeats themselves may have inherent functional attributes. Studies aimed at elucidating these attributes using structural data have demonstrated that consistent structures for repeats are largely absent (Wootton 1994; Saqi 1995; Huntley and Golding 2002). This result has led to the suggestion that repeats do not form stable globular structures, and are in fact structurally disordered (Dunker et al. 2002).
Proteins containing disordered regions are generally involved in molecular recognition and signaling (Alba et al. 1999; Dunker et al. 2002). This function might arise from the intrinsic flexibility and mobility of disordered regions, as this could allow for increased association and dissociation rates along with binding promiscuity.
In this study we utilize the newly available genome sequences from 12 Drosophila species in order to further investigate the properties of protein repeats. We determine the relative abundance and patterns of repeats within each of the Drosophila proteomes and the specific functional categories associated with protein repeats. We then examine the relative positions of repeats within protein sequences, as this may provide more insight into their function and evolution. We also test the ability of repeats to act as phylogenetic signals. Finally, we investigate the effect of repeats on the evolutionary rate of their surrounding sequence.
| Materials and methods |
|---|
The protein sequences based on the Comparative Analysis Freeze 1 (CAF1) dataset for the 12 Drosophila species (D. melanogaster, D. simulans, D. sechellia, D. erecta, D. yakuba, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. virilis, D. mojavensis, and D. grimshawi) were retrieved from NCBI.
To investigate the patterns of repetitive sequence within the proteins of the 12 Drosophila species, we searched for homopolymer tracts and extended segments of low complexity sequence. We required that homopolymers be at least five tandem, identical amino acids in length and implemented the program SEG (Wootton and Federhen 1993) to detect regions of low complexity within the protein sequences. The SEG algorithm is often used to filter out simple sequence within proteins before performing homology and similarity based searches like BLAST (Altschul et al. 1997), while a separate program is used to filter simple sequence from nucleotide sequences. In our implementation of the SEG algorithm we used a window length of 15, and a complexity cut-off K2(1) of 1.9 instead of the default values (12 and 2.2 respectively). The parameter K2(1) is an initial cut-off complexity value: when SEG first calculates the complexity of a subsequence, that complexity must not exceed the cut-off for the subsequence to be flagged. These values were previously shown to preferentially detect the longer and more repetitive sequence regions common among eukaryotic proteins (Huntley and Golding 2002).
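The homopolymer criterion stated above (at least five identical amino acids in tandem) can be illustrated with a short regular-expression scan. This is only a minimal sketch: the function name `find_homopolymers` and the example sequence are hypothetical and are not taken from the study, and the SEG low complexity search is not reproduced here.

```python
import re

def find_homopolymers(protein, min_len=5):
    """Return (residue, start, length) for each homopolymer tract.

    start is a 0-based offset; min_len=5 follows the threshold in the text
    and must be at least 2 for the pattern to make sense.
    """
    # One residue followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start(), m.end() - m.start())
            for m in pattern.finditer(protein)]

# Hypothetical sequence: a run of six Q, two A (too short), and a run of five S.
tracts = find_homopolymers("MKQQQQQQAASSSSSLV")
print(tracts)
```

For each tract the sketch records the residue, its position, and its length, which are the same attributes the text describes recording for homopolymers.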
Repeat Enrichment and Functional Associations
For each protein, we recorded the number of homopolymers and low complexity sequences detected, along with their lengths, relative position within the protein, and gene ontology (GO) associations (from Drosophila 12 Genomes Consortium 2007). For homopolymers, we also recorded the amino acid comprising the tract.
The CAF1 nucleotide sequences for the 12 Drosophila species were scanned for triplet repeat tracts. All tracts of five triplet repeats or more were recorded, and subsequently analyzed to demonstrate differences in the triplet repeat composition of coding and non-coding sequences.
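A scan for triplet repeat tracts of five or more copies could look like the sketch below. The function name `find_triplet_repeats` and the test sequence are hypothetical, and the exact scanning procedure used in the study is not described in the text, so this should be read as one plausible approach rather than the authors' method.

```python
import re

def find_triplet_repeats(dna, min_copies=5):
    """Return (unit, start, copies) for tandem arrays of one trinucleotide.

    The reported unit depends on where the array begins, so a CAG array
    preceded by a G would be reported as GCA (the frame ambiguity noted
    in the Results).
    """
    # A trinucleotide followed by at least (min_copies - 1) exact copies.
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_copies - 1))
    return [(m.group(1), m.start(), (m.end() - m.start()) // 3)
            for m in pattern.finditer(dna.upper())]

# Hypothetical sequence containing six tandem CAG units.
hits = find_triplet_repeats("TTCAGCAGCAGCAGCAGCAGTT")
print(hits)
```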
Repeat Positions
To test the hypothesis that repeats are randomly dispersed throughout the length of a protein sequence, we divided each protein into three segments of equal length: an N-terminal segment, a mid segment, and a C-terminal segment. Each detected repeat was assigned to one of the three segments, based upon where the midpoint of the repeat was located. For instance, a repeat whose midpoint fell within the N-terminal third of the full protein was considered to have the majority of the repeat present in that segment of the protein. Any repeat whose midpoint fell on the boundary between two segments was randomly assigned to one of them. This gave us an observed distribution of repeat positions within the protein sequences.
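The midpoint rule can be sketched as follows. The function name `assign_segment`, the 0-based coordinate convention, and the segment labels are assumptions made for illustration; integer arithmetic is used so that the boundary cases are detected exactly.

```python
import random

def assign_segment(protein_len, repeat_start, repeat_len, rng=random):
    """Assign a repeat to 'N', 'mid' or 'C' by the location of its midpoint.

    repeat_start is a 0-based offset (an assumed convention).
    """
    # Twice the midpoint, kept as an integer to avoid float comparisons.
    twice_mid = 2 * repeat_start + repeat_len
    # midpoint < L/3  is equivalent to  3 * twice_mid < 2 * L
    # midpoint < 2L/3 is equivalent to  3 * twice_mid < 4 * L
    scaled = 3 * twice_mid
    if scaled == 2 * protein_len:
        return rng.choice(["N", "mid"])   # on the first boundary: random pick
    if scaled == 4 * protein_len:
        return rng.choice(["mid", "C"])   # on the second boundary: random pick
    if scaled < 2 * protein_len:
        return "N"
    if scaled < 4 * protein_len:
        return "mid"
    return "C"

print(assign_segment(300, 10, 8), assign_segment(300, 140, 20),
      assign_segment(300, 280, 10))
```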
To perform a goodness of fit χ² test, we needed an expected distribution for the positions of repeats. However, since the length of each individual repeat and of the protein within which the repeat is embedded influences that expectation, we had to generate the expected position of each repeat independently. Therefore, for each detected repeat, assuming random dispersal, we calculated the probability of the midpoint falling within each segment. If the length of the entire protein was L and the length of the repeat was l, then there would be L − l possible positions within the protein where the midpoint could fall. The mid segment contains L/3 of those possibilities, while the N-terminal and C-terminal segments each contain L/3 − l/2 possible midpoint positions. Therefore the probability that the repeat midpoint falls within the mid segment is

$$P_{\text{mid}} = \frac{L/3}{L - l},$$

while the probability for each terminal segment is

$$P_{\text{term}} = \frac{L/3 - l/2}{L - l}.$$

A random number was then generated and the expected position of the repeat was assigned based on these probabilities.
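As a rough illustration of this procedure, the sketch below computes the segment probabilities from the counts given above, (L/3)/(L − l) for the mid segment and (L/3 − l/2)/(L − l) for each terminal segment, draws an expected position, and evaluates a goodness of fit statistic. All function names are hypothetical, the observed and expected counts in the example are made-up numbers, and the sketch assumes l is smaller than 2L/3 so that no probability is negative.

```python
import random

def segment_probabilities(L, l):
    """Probability of the repeat midpoint falling in each segment."""
    positions = L - l                      # possible midpoint positions
    p_mid = (L / 3) / positions
    p_term = (L / 3 - l / 2) / positions   # same for N- and C-terminal
    return {"N": p_term, "mid": p_mid, "C": p_term}

def expected_segment(L, l, rng=random):
    """Draw one expected position for a repeat under random dispersal."""
    probs = segment_probabilities(L, l)
    return rng.choices(list(probs), weights=list(probs.values()))[0]

def chi_square(observed, expected):
    """Goodness of fit statistic summed over the three segments."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)

probs = segment_probabilities(300, 30)
stat = chi_square({"N": 12, "mid": 8, "C": 10},   # hypothetical counts
                  {"N": 10, "mid": 10, "C": 10})
print(probs, stat)
```

The three probabilities sum to one because L/3 + 2(L/3 − l/2) = L − l, which is a useful sanity check on the counts.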
To determine whether the trends we observed within the Drosophila species were unique to this clade, we also performed the above analysis on a range of other species, including two additional insects (Anopheles gambiae and Apis mellifera), five other eukaryotes (the yeast Saccharomyces cerevisiae, the worm Caenorhabditis elegans, the plant Arabidopsis thaliana, the fish Danio rerio, and the mammal Homo sapiens), two archaebacteria (Methanococcus jannaschii and Pyrococcus horikoshii) and two eubacteria (the gram negative Escherichia coli and the gram positive Bacillus subtilis).
Repeats as Phylogenetic Signals
In order to evaluate trends within each Drosophila species, without confounding the results with proteins unique to a given lineage, we also analyzed the 1-to-1 Tcoffee ortholog alignments (obtained from http://rana.1b1.gov/drosophila/wiki/). In this way, we also had a collection of proteins for which there was a single ortholog in each of the 12 Drosophila genomes.
Due to the unique mechanism producing most protein repeats (replicative slippage and expansion, rather than typical point mutations), we were curious whether repeats alone could perform well as phylogenetic signals. Using the 1-to-1 ortholog alignments of the 12 Drosophila species we began to test this hypothesis by taking each ortholog alignment, and scanning for homopolymer repeats at least five residues in length. If such a homopolymer was found in any of the species, we attempted to expand the repeat boundaries in both directions by examining the sequence in the other species at those particular positions in the alignment. If any of the sequences contained at least two tandem amino acid residues identical to the residues in the homopolymer, starting at the boundary position and extending beyond the homopolymer repeat, the boundary was then extended (see figure 1).
After isolating the repeat alignment blocks within the ortholog alignments we then constructed a distance matrix for each repeat alignment using two separate methods. In the first method, termed exact, we determined the length of the longest homopolymer tract (of the predominant amino acid) in each species for each repeat alignment block. The distance between any given species pair was then calculated as the magnitude of the difference between the lengths of their longest homopolymers. In the second, fuzzy method, we counted for each species in a repeat alignment block the number of times the predominant amino acid was found in that stretch of sequence, regardless of whether it was part of a tandem homopolymer tract. The distance between species was calculated as the magnitude of the difference between their respective sums of predominant amino acid residues. We included the fuzzy analysis because often a longer repeat tract becomes subsequently interrupted by amino acid substitutions which could significantly shorten the observed length of the longest homopolymer tract within the sequence. This could then effectively bias the distances to be greater than the single amino acid substitution would warrant. The fuzzy analysis should alleviate this concern.
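The two distance definitions can be sketched on a toy repeat alignment block. Several details are assumptions rather than statements from the text: the predominant amino acid is taken to be the most common residue across the whole block, gap characters are removed before runs are measured, and the species labels and sequences are invented for the example.

```python
import re
from collections import Counter

def predominant_residue(block):
    """Most common residue across all sequences in the block (assumption)."""
    counts = Counter(c for seq in block.values() for c in seq if c != "-")
    return counts.most_common(1)[0][0]

def longest_run(seq, residue):
    """Length of the longest tandem run of residue, ignoring gap characters."""
    runs = re.findall(re.escape(residue) + "+", seq.replace("-", ""))
    return max((len(r) for r in runs), default=0)

def distance_matrix(block, method="exact"):
    """Pairwise |score difference|; scores are longest runs or total counts."""
    aa = predominant_residue(block)
    if method == "exact":
        score = {sp: longest_run(seq, aa) for sp, seq in block.items()}
    else:  # fuzzy: count every occurrence, tandem or not
        score = {sp: seq.count(aa) for sp, seq in block.items()}
    return {(a, b): abs(score[a] - score[b]) for a in block for b in block}

# Hypothetical block: the second sequence has a substitution interrupting the tract.
block = {"sp1": "QQQQQQ", "sp2": "QQQHQQ", "sp3": "QQQQ--"}
exact = distance_matrix(block, "exact")
fuzzy = distance_matrix(block, "fuzzy")
print(exact[("sp1", "sp2")], fuzzy[("sp1", "sp2")])
```

In this toy block a single substitution gives an exact distance of 3 but a fuzzy distance of 1, which is the bias the fuzzy method is meant to reduce.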
The exact and fuzzy distance matrices were then used separately as input for the NEIGHBOR and CONSENSE programs in the PHYLIP package (Felsenstein 1989) in order to produce individual trees for each repeat alignment block, and then finally a consensus tree. These consensus trees were then examined to determine the robustness of repeat sequences as phylogenetic signals.
Rates of Evolution
Finally, we wanted to use the unique dataset provided by the 1-to-1 orthologs of these 12 Drosophila species in order to test a hypothesis regarding the influence of repeats on the evolutionary rate of the surrounding protein sequence in which they are embedded. It has been previously established that repeats themselves evolve more rapidly than their flanking sequence (Huntley and Golding 2000); however, it has not yet been determined whether the sequence surrounding a repeat evolves at a different rate than proteins in which no repeat occurs.
To test this hypothesis we took the entire set of 1-to-1 ortholog alignments, containing 12 species per alignment, and scanned for homopolymer repeats as described above and in figure 1. We then removed any detected repeat blocks from the alignments and denoted the remaining aligned sequences Repeats removed. Any alignments in which no repeat was detected (and thus, no repeat was removed) were put into a group denoted No repeats. The resulting sets of alignments were then used to create phylogenetic trees using the PROTDIST and FITCH programs in the PHYLIP package. Tree lengths were then calculated by summing the branch lengths within each tree. Tree lengths were used as a proxy for evolutionary rate in each alignment of an ortholog between the 12 species. In this way we could compare the evolutionary rates of proteins containing no detectable repeats, to those containing repeats, excluding the repeated regions themselves.
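Tree length as defined here is simply the sum of all branch lengths. Assuming the trees are available as Newick strings (the format PHYLIP programs write), a minimal sketch is given below; the function name `tree_length` and the example tree are hypothetical, and the pattern assumes taxon labels contain no colons.

```python
import re

def tree_length(newick):
    """Sum every branch length (the number after each ':') in a Newick string."""
    lengths = re.findall(r":\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)", newick)
    return sum(float(x) for x in lengths)

# Hypothetical three-taxon tree with four branches.
total = tree_length("((A:0.1,B:0.2):0.05,C:0.3);")
print(total)
```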
A subset of the species from these alignments was then used in a PAML (Yang 1997) analysis to test for differences in selective constraint (using branch specific models and model 7 versus model 8) between the coding sequence alignments with repeats removed, and those containing no repeats to begin with. Due to saturation of sites, in these analyses we limited the taxa to D. melanogaster, D. simulans, D. sechellia, D. erecta, D. yakuba, and D. ananassae. We converted the p-values from the likelihood ratio test for PAML models 7 and 8 to q-values using a false discovery rate method (Storey and Tibshirani 2003). The distributions of q-values for alignments with repeats removed and alignments without repeats were then compared. Finally, we recorded the location of sites relative to the closest repeat boundary where a repeat had been removed, and the probability of ω > 1 for each site where that probability was at least 0.5. This provided a spatial account of potential sites with evidence for positive selection throughout the alignments.
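The conversion from likelihood ratio tests to false-discovery-rate adjusted values can be sketched as follows. Two substitutions should be noted plainly. First, the likelihood ratio statistic is compared to a chi-square distribution with 2 degrees of freedom, which is the conventional choice for model 7 versus model 8 but is not stated in the text. Second, the adjustment shown is the Benjamini-Hochberg procedure, a simpler stand-in for the Storey and Tibshirani q-value method that the study actually used. Function names and input values are hypothetical.

```python
import math

def lrt_pvalue(lnl_null, lnl_alt):
    """P-value for 2*(lnL_alt - lnL_null) against chi-square with 2 df.

    For 2 df the chi-square survival function is exactly exp(-x / 2).
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return math.exp(-stat / 2.0)

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (not Storey-Tibshirani q-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p = lrt_pvalue(-1000.0, -997.0)              # hypothetical log-likelihoods
adj = bh_adjust([0.01, 0.04, 0.03, 0.20])    # hypothetical p-values
print(p, adj)
```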
| Results |
|---|
Repeat Enrichment and Functional Associations
As expected from previous surveys of repeat abundance across the three domains of life (Marcotte et al. 1999
Figure 2 demonstrates that even among the set of 1-to-1 orthologs across all 12 Drosophila species there are differences in the level of repeat enrichment. There is significantly less low complexity per protein within the melanogaster and obscura groups compared to the remaining four species (p < 0.00001).
Even the underlying codon structures of homopolymer repeats appear to differ between these species (see figure 2). While less than 13% of homopolymer tracts within Drosophila proteins are encoded by an uninterrupted homogeneous tandem array of codons, on average only 53.5% of a homopolymer tract will be encoded by such a homogeneous codon tract, and the amount of codon homogeneity varies, with D. sechellia having the least, and D. virilis having the most. Since codon homogeneity can further promote replicative slippage, thus increasing expansion and contraction events within homopolymer sequence, these results may indicate subtle differences in the evolutionary mechanisms creating and maintaining repeat sequences within these species.
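One way to quantify the portion of a homopolymer tract encoded by a homogeneous codon tract is the length of the longest run of identical codons divided by the number of codons in the tract. The text does not give the exact formula, so this definition, the function name `homogeneous_fraction`, and the example codons are all assumptions made for illustration.

```python
def homogeneous_fraction(codons):
    """Longest run of identical codons as a fraction of the tract length.

    codons is a non-empty list of the codons encoding one homopolymer tract.
    """
    longest = run = 1
    for prev, cur in zip(codons, codons[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest / len(codons)

# Hypothetical poly-Q tract: a CAA codon interrupts a CAG array.
frac = homogeneous_fraction(["CAG", "CAG", "CAG", "CAA", "CAG", "CAG"])
print(frac)
```

Under this definition a tract encoded by an uninterrupted array of a single codon scores 1.0, and interruptions by synonymous codons lower the score.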
The patterns of nucleotide triplet repeats differ markedly between coding and non-coding sequences (see figure 3). Triplet repeat tracts in non-coding regions are significantly longer than those within coding regions, as expected due to relaxed constraint (p < 0.00001). However, the most notable difference between coding and non-coding triplet repeats is the frequency of CAG triplets (a CAG repeat may also be an AGC, GCA, GTC, TCG, and CGT repeat if all 6 reading frames are considered). CAG repeats are the most frequent of all repeats within coding sequences. This result is not surprising, since CAG encodes the amino acid glutamine, and poly-glutamine repeats are the most common homopeptides within Drosophila proteins (as described below). What is intriguing about this observation is that the alternate codon for glutamine, CAA, is more common among the triplet repeats within non-coding sequence. Therefore CAA repeats are preferred within non-coding sequence, while CAG codon repeats are preferred within coding sequences. Repeats detected by Tandem Repeats Finder (Benson 1999) and mreps (Kolpakov et al. 2003) are consistent with these results (data provided by Hadi Quesneville). The only exception to this preference is found in D. willistoni, whose protein coding sequences contain more CAA repeats than CAG. This deviation is likely caused by the lower G+C content of the CAA triplet, as D. willistoni sequences appear to have uniquely lowered G+C content among the 12 Drosophila species (discussed further below).
Of the 6,689 1-to-1 ortholog alignments, 3,607 contained at least one detectable repeat in at least one species. This resulted in 20,916 repeat alignment blocks being identified. The large number of repeat alignment blocks is a result of there being a mean value of 5.80 repeat blocks per alignment in the 3,607 alignments that contained repeats. The average length of a repeat block was 7.88 residues. Figure 4 demonstrates that the poly-Q repeats dominate the amino acid composition of the repeat alignment blocks, followed by poly-A, poly-S, poly-G, poly-T, and poly-N. Examining the lengths of repeats also reveals poly-Q to generally have the longest repeat tracts, with the exception of two intriguingly extended tyrosine tracts (see figure 5). Like the human diseases caused by poly-Q repeat tract expansions, the poly-Q tracts within Drosophila also tend to be encoded by CAG triplet repeats, and less often by CAA codons.
Drosophila 12 Genomes Consortium (2007) and Powell et al. (in review) have found that D. willistoni has peculiar G+C content and codon usage. Upon inspecting the types of homopolymers formed in this species, even those proteins from the 1-to-1 ortholog sets, we find they appear to be influenced by low G+C content (see Supplemental figure S2). The amino acids that more frequently form homopolymers in D. willistoni (poly-D, G, H, I, N, S, and T) than in the other 11 species collectively have a lower G+C content than those that are more frequent in the remaining 11 species. In fact, even when only the codons with the lowest G+C content for each amino acid are considered, this trend holds true.
Using the gene ontology (GO) associations and methodology from Drosophila 12 Genomes Consortium (2007), we find there are significantly larger proportions of proteins with repeats than expected by chance in GO terms associated with developmental processes, signaling and gene regulation (see table 1). Additionally, GO categories for housekeeping and metabolic processes have significantly smaller proportions of proteins containing repeats than expected. These results illustrate underlying functional differences between proteins containing and lacking repeats.
Repeat Positions
Table 2 displays the χ² results from comparing the observed patterns of repeat positions to the expected patterns based on the null hypothesis of random dispersal throughout the length of a protein. The two eubacterial species did not deviate significantly from the expected distributions for low complexity sequence or homopolymer repeats. The archaebacterium P. horikoshii, however, did not show a random distribution of low complexity sequence. In this archaeon, and in all the eukaryotes, low complexity sequence tends to occur predominantly towards the ends of the protein (N-terminal and C-terminal).
Figure 6 depicts the distribution pattern of low complexity sequence within D. melanogaster proteins. This pattern where the mid segment of the protein is least enriched for repeats is common to all the other eukaryotes and P. horikoshii, except for the mosquito (An. gambiae) whose distribution is skewed towards the C-terminus (see figure 6).
When we examine the distribution of homopolymer repeats, we notice that all χ² values but those for A. thaliana and S. cerevisiae decrease compared to the values obtained from the low complexity sequence. For A. thaliana this is likely a result of the increase in the number of sequences containing homopolymers, as it was the only species to show more homopolymers than low complexity sequence (see Supplemental figure S1). We also note that the archaeon, P. horikoshii, does not show a significant deviation from random dispersal of homopolymers within its proteins. More intriguing, however, is the result that the two non-Drosophila insects, An. gambiae and Ap. mellifera, also fail to reject random dispersal for homopolymers, despite all other eukaryotes rejecting the hypothesis. Figure 7 illustrates an interesting observation regarding differences in the type of repeats that occupy the three different segments of a protein. Repeats in the N-terminal segment of a protein comprise a significantly larger percentage of the protein length than do those in the mid segment (p < 0.00001), and likewise those in the mid segment take up a larger percentage than those in the C-terminal segment (p < 0.0001).
Repeats as Phylogenetic Signals
The consensus trees for both the exact and fuzzy repeat analysis showed the generally accepted topology (as seen in figure 2). However the number of repeat alignments supporting any particular node in the consensus tree was consistently low (see figure 8). This lack of support likely arises from a limited number of phylogenetically informative repeat alignments, as a result of the mechanism of repeat evolution itself. The rapid rate of repeat evolution, thought to be facilitated by replicative slippage, would likely produce many repeats as autapomorphic traits. However because of the finer scale within clades of the Drosophila phylogeny one could still expect to see a large enough number of synapomorphic repeats to resolve the phylogeny, and the ability of the repeat analysis to produce the accepted topology suggests this to be the case.
We also performed the above analysis with the additional criterion that the repeat be conserved in at least six, and then in all 12, species in an alignment. Our definition of conserved was broad in that it only required at least five of the predominant amino acids to be present (in tandem for the exact analysis). We hoped that this would reduce the number of phylogenetically uninformative repeat alignments by removing autapomorphies. These results were nearly identical to those above, both for exact and fuzzy analyses. Consensus tree topologies were consistent with the accepted topology, and nodes were supported by roughly 3% to 18% of the repeat alignments.
Rates of Evolution
As described above, 20,916 repeat alignment blocks were detected within the 6,689 1-to-1 ortholog alignments. These repeats were detected in only 3,607 of the ortholog alignments, and were then removed producing the repeats removed group. The no repeats group contained the remaining 3,082 ortholog alignments in which no repeats were detected. Figure 9A depicts the different distributions of tree lengths for the no repeats and repeats removed groups. The mean tree length for the no repeats group (0.8986489) is significantly smaller than that for the repeats removed group (1.027367) as determined by a t-test (P < 0.00001).
We confirmed this result using a collection of mammalian multiz alignments obtained from the UCSC genome browser (Karolchik et al. 2003).
To determine if the length of the alignments might be influencing the above result, we also tested for significant differences in alignment lengths between the two groups (see figure 9B). A significant difference (p < 0.00001) was found, with the repeats removed group having longer mean alignment lengths (738.35 residues) than the no repeats group (404.73 residues).
However a regression analysis of the tree lengths and the alignment lengths demonstrated that only 0.1% of the variation in the data could be explained by a relationship between tree length and alignment length. Therefore the increase in evolutionary rate among the repeats removed group is not simply due to an increase in sequence alignment length.
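The proportion of variation explained in a simple linear regression is the squared correlation between the two variables. The sketch below shows that calculation with the standard library only; the function name `r_squared` and the toy values are hypothetical and are not data from the study.

```python
def r_squared(x, y):
    """Fraction of variance in y explained by a simple linear fit on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Toy values: one perfectly linear relationship and one with no linear trend.
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
none = r_squared([1, 2, 3, 4], [1, -1, -1, 1])
print(perfect, none)
```

A value near zero, like the 0.1% reported above, means that alignment length carries almost no linear information about tree length.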
Since repeat-containing proteins are non-randomly distributed among functional categories (see table 1), which are known to evolve at different rates, we tested whether the elevated rate of evolution seen among proteins with repeats was simply an artifact of the evolutionary rate of a functional category. We again used alignments for proteins whose repeats had been removed, and alignments without repeats, for each comparison within a functional category. The categories for developmental processes, cell cycle, defense response (immunity), and stress response were each examined. Defense and stress responses are functional categories associated with higher rates of evolution, while cell cycle and developmental processes are generally more conserved. In each category the mean tree length for the set of proteins with repeats removed was always higher than for the proteins without repeats. However, the difference was only significant for developmental processes (P < 0.0001) and stress response (P < 0.00001).
We then tested the hypothesis that the increased rate of evolution in the sequence surrounding the repeats was due to an increase in compensatory substitutions. We reasoned that protein sequence surrounding repeats with evidence of ongoing replicative slippage should display a higher rate of evolution than sequence surrounding more stable repeats whose underlying codons had mutated to inhibit further slippage. To test this hypothesis we used proteins containing serine homopolymers from a previous study that had determined the influence of stabilizing selection or replicative slippage mechanisms on each repeat (Huntley and Golding 2006).
We took the 31 human proteins from this previous study and collected homologous sequences from mouse, rat, and cow (the only taxa with sequences resulting in BLAST expect values less than or equal to 10⁻²⁰ for all 31 human sequences) and aligned the sequences using CLUSTALW. The repeats were then excised from the 31 alignments as described above for the Drosophila sequences and trees built from the remaining sequence.
Although the sample size is small, we found a significant difference (p = 0.0354) between the mean tree length for the sequence surrounding repeats with evidence for selection (0.065505) and that for repeats with evidence for slippage (0.310135). Therefore, proteins containing repeats with slippage-resistant codon structures appear to have an overall lower rate of evolution in the flanking sequence than proteins with repeats undergoing slippage.
The PAML analysis using the subset of six Drosophila species found no significant difference between the ratio of non-synonymous and synonymous substitutions along lineages of the phylogeny with repeats excised and those that lacked repeats altogether. However, a larger fraction of the alignments in which at least one species contained a repeat (subsequently excised) showed evidence of positive selection compared to the set of alignments in which none of the species contained repeats. A Wilcoxon rank sum test of the q-value distributions revealed this difference to be significant (p < 0.00001).
We observe the position of sites with evidence for positive selection to be strongly clustered near the boundaries of repeats (see figure 10). The spatial distribution of these sites is skewed towards the N-terminal side of a repeat, and 50% of the sites fall within the first 26 residues closest to a repeat boundary, while 95% fall within 201 residues. Examining only those sites with Pr(ω > 1) ≥ 0.95 we find a nearly identical spread (50% of sites within the first 25 residues, and 95% within 229). In all cases the spatial distribution of sites with evidence for positive selection tends to be larger on the N-terminal side than the C-terminal side of a repeat boundary. Reasons for this asymmetry remain unclear at present, and may even be caused by the unidirectionality of transcription rather than protein structure.
We next investigated the distribution of q-values within functional categories between alignments with repeats removed and alignments without repeats. This allowed us to determine if there was a consistent association of increased evidence for positive selection among proteins containing repeats, or whether the association was with the functional category instead. Of the four categories examined (developmental processes, cell cycle, defense response, and stress response) only cell cycle and defense response showed significantly lower mean q-values for alignments with repeats (P < 0.0002 and P < 0.0310 respectively). The sequences involved in stress response displayed a different pattern, though not significant, having a lower mean q-value for the alignments without repeats.
| Discussion |
|---|
Repeat Enrichment and Functional Associations
Repeated motifs of amino acids appear to be more abundant in Drosophila than most other species examined. A notable exception is the human malarial parasite, Plasmodium falciparum, whose proteins have previously been demonstrated to have abnormally high amounts of low complexity sequence (Pizzi and Frontali 2001).
DePristo et al. (2006) suggest that the abundance of low complexity sequence within several species, including P. falciparum, is partly attributed to increased genomic A+T content. However, we find that despite the abnormally high A+T content of D. willistoni among the 12 Drosophila species, there is no corresponding increase in low complexity sequence within this species. It may be that the A+T content of D. willistoni is not different enough from the other Drosophila species to cause a relative increase in the amount of low complexity sequence within its proteome. We do, however, find evidence for the especially low G+C content of D. willistoni affecting the types of amino acid repeats being formed within its proteins. This is particularly notable since the data set being used is composed of 1-to-1 orthologs between the 12 species, and presumably repeated motifs within this collection of proteins would be less susceptible to species specific attributes.
However, substantial variation in repeat characteristics does exist between the taxa, even in the 1-to-1 orthologs. In particular, the amount of low complexity sequence per protein varies between species, in an increasing fashion as the phylogenetic distance from the melanogaster subgroup increases. The differences between closely related taxa might be explained by differences in effective population sizes, where longer and more abundant repeats could go to fixation more often in smaller populations, assuming that repeats are somewhat deleterious.
Likewise, the underlying nucleotide structures of homopolymer repeats vary significantly between species. D. virilis, for instance has the highest average, 58.1%, for the portion of a homopolymer tract encoded by a homogeneous codon tract. Differences in the lengths of homopolymer tracts between species cannot account for these underlying codon compositional differences, as the relationship between homopolymer tract length and the portion encoded by a homogeneous codon tract explains only 6.1% of the variation.
Due to the inherent replication dynamics of uninterrupted homogeneous codon tracts, with the likelihood of replicative slippage increasing with repeat copy number, the differences in codon tract homogeneity underlying repeats between species may indicate differences in how these repeats are arising and being maintained in each lineage. Overall, fewer than 13% of all homopolymer tracts within Drosophila are encoded completely by uninterrupted codon tracts. This demonstrates that the majority of repeats are persisting long enough in the genome to be modified by substitution processes, and perhaps eventually stabilized by selection to prevent further replicative slippage. The creation of slippage resistant codon structures underlying repeat regions may indicate either selection against unstable slippage prone variants, selection to maintain some function arising from the presence of a repeat region or both.
Consistent with previous results demonstrating that developmental proteins are enriched for amino acid repeats (Karlin and Burge 1996; Huntley and Golding 2004), we also find here that proteins containing repeats are significantly associated with developmental processes. It has been a curious observation that these amino acid repeat sequences, which can wreak havoc with their propensity to expand, as seen in the many human neurodegenerative diseases caused by homopolymer tract expansions, would be so abundant in a class of proteins so influential as the developmental proteins.
One key to this puzzle comes from the observation that amino acid repeats tend to form structurally disordered regions (Wootton 1994; Saqi 1995; Dunker et al. 2002; Huntley and Golding 2002). These regions may form flexible linkers between globular domains, and become structured upon binding with a substrate. The inherent mobility of such unstructured regions could facilitate faster association and dissociation rates and binding promiscuity that may be advantageous to proteins involved in gene regulation and signaling associated with development. This structural trait of repeats could explain the persistence of such high numbers of repeats within eukaryotic proteins.
Repeat Positions
In line with this argument, our finding that repeats are not randomly distributed throughout the length of a protein may also indicate a functional explanation for their survival in the terminal segments of peptide sequences. We are not aware of any mechanisms creating amino acid repeats that would bias their generation to the amino and carboxyl ends of the peptide. Rather, if repeats do arise randomly throughout a protein, then a relatively larger fraction of those variants that arise towards the center of the protein must eventually be eliminated from the population.
Fujimori et al. (2003) reported a position bias in the density of nucleotide microsatellites towards the transcriptional start site within plant gene sequences. However, they found no such bias in D. melanogaster. Interestingly, they observed a trend in mammals where microsatellite density increased towards both ends of the gene. A follow-up study on amino acid repeats in plants also demonstrated a gradient of repeats with increased frequency in the N-terminus, decreasing in frequency along the direction of transcription (Zhang et al. 2006). However, they did not account for varying lengths of proteins, or repeats, and performed only a sliding window analysis across the first 400 (or fewer) amino acids of each protein. This would inevitably bias the results towards repeats appearing more frequently in the N-terminus, as shorter peptides containing repeats in their mid segment or C-terminus would contribute to the results appearing towards the N-terminal portion of the protein.
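The length bias described above can be illustrated with a small sketch that expresses each repeat's position as a fraction of its protein's length rather than as an absolute coordinate. This is only an illustration of the general idea of length normalization, not the procedure used in any of the cited studies. The function name and the coordinates are hypothetical.

```python
def relative_position(start, repeat_len, protein_len):
    """Midpoint of a repeat as a fraction of protein length.

    0.0 corresponds to the N-terminus and 1.0 to the C-terminus.
    `start` is the 0-based index of the first residue of the repeat.
    """
    if repeat_len <= 0 or start < 0 or start + repeat_len > protein_len:
        raise ValueError("repeat does not fit inside the protein")
    midpoint = start + repeat_len / 2
    return midpoint / protein_len


# Two hypothetical repeats with identical absolute coordinates
# (start 85, length 10) in proteins of very different lengths.
short_protein = relative_position(85, 10, 100)   # near the C-terminus
long_protein = relative_position(85, 10, 1000)   # near the N-terminus
```

Both repeats fall within the first 400 residues and would be counted in the same absolute window, yet one lies at the C-terminal end of its protein and the other at the N-terminal end.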
However, studies by Alba and Guigo (2004) and Siwach et al. (2006) have shown particular amino acid repeats overrepresented in each terminal end of proteins (poly-L, A, G, N, C, Q, H, and V in the N-terminus, and poly-F, I, K, and S in the C-terminus), suggesting that repeat persistence in the ends of protein sequences is not simply a result of repeats being tolerated more at those positions. In conjunction with our finding that the relative lengths of repeats vary depending on their location within the protein (see figure 7), the above appearance of a consistent pattern among several taxa likely indicates an underlying non-random process maintaining repeats in these positions.
Repeats as Phylogenetic Signals
Rapid expansion and yet-to-be-resolved mechanisms of maintenance are unique traits of repeat sequence evolution, which we tested as phylogenetic signals using the Drosophila phylogeny. Although many repeat sequences are unique to single lineages and are therefore phylogenetically uninformative, those that are shared among several lineages are informative and can still resolve the generally accepted phylogeny.
We hope this result can be used to further develop a framework for using indels within sequence alignments as informative sites. A previous analysis demonstrated their utility as a specific character state for detecting selection, despite indels usually being excluded from such analyses (Huntley and Golding 2006).
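The distinction drawn in this section between lineage-specific and shared repeats can be sketched by treating each repeat as a presence/absence character. The check below uses the standard parsimony criterion for a binary character, that each state must occur in at least two taxa. This is an assumption made for illustration and is not necessarily the criterion applied in our analysis. The function name and the presence/absence values are hypothetical.

```python
def is_parsimony_informative(states):
    """Return True if a binary presence/absence character is informative.

    `states` maps species name -> 1 (repeat present) or 0 (repeat absent).
    Assumption: the standard parsimony criterion, under which both states
    must each occur in at least two taxa.
    """
    present = sum(1 for s in states.values() if s == 1)
    absent = sum(1 for s in states.values() if s == 0)
    return present >= 2 and absent >= 2


# Hypothetical characters scored in four of the species.
lineage_specific = {"D. melanogaster": 0, "D. simulans": 0,
                    "D. virilis": 1, "D. willistoni": 0}
shared = {"D. melanogaster": 1, "D. simulans": 1,
          "D. virilis": 0, "D. willistoni": 0}

lineage_specific_ok = is_parsimony_informative(lineage_specific)
shared_ok = is_parsimony_informative(shared)
```

Under this criterion a repeat confined to one lineage, or one present in every species, contributes no grouping information, whereas a repeat shared by a subset of lineages does.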
Rates of Evolution
It has been well established now that repeats themselves tend to evolve more rapidly than the remaining peptide sequence in which they are embedded (Huntley and Golding 2000; Romov et al. 2006). However, our finding in this study that the sequence surrounding a repeat evolves faster, and with an increased signal for positive selection, than sequence containing no repeats is intriguing. Our hypothesis that this increase in evolutionary rate might result from compensatory substitutions in the flanking sequence to accommodate the rapid length perturbations in the repeat sequence is supported by a preliminary data set indicating that repeats stabilized by selection to prevent further expansion have flanking sequence with lower evolutionary rates than repeats that have ongoing slippage. However, an equally supported hypothesis to explain these results is that proteins undergoing rapid evolution may benefit from the acquisition of repeat domains, which can then rapidly expand and contract until stabilization is preferred. In this way, repeats could act as evolutionary "tuning knobs" (Kashi and King 2006) and be selected for on the basis of the increase in variability afforded by their unique mechanism of mutation.
It is still curious, however, that some repeats appear advantageous or neutral, while others are incredibly deleterious. This apparent discordance can be somewhat reconciled by the findings from an experiment by Brignull et al. (2006), who used C. elegans mutants to demonstrate that the threshold for pathogenic length in poly-Q type diseases could be manipulated by perturbing the function of various housekeeping proteins. By using mutants with extended lifespans, they demonstrated that the onset of poly-Q pathogenesis can be further delayed, in agreement with observations that in general the age of onset is related to homopolymer tract length and lifespan of the organism. They then reasoned that a cellular buffering system exists to prevent proteotoxicity until a certain age when the buffering system begins to fail. They found additionally that they could induce transition from soluble protein to aggregate states in homopolymer lengths just under the pathogenic threshold by disrupting genes involved in the clearance of misfolded proteins and protein turnover. It then seems apparent that repeats themselves only become problematic to the cell when other housekeeping networks begin to fail. Otherwise, repeat expansions may induce rapid compensatory mutations, presumably to stabilize the protein structure and preserve function, or, by virtue of their propensity to rapidly expand and contract, repeats may enable the exploration of novel protein conformations and functions.
| Supplementary Material |
|---|
Supplementary material figures S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
The authors wish to thank Hadi Quesneville for providing microsatellite predictions within the genomic sequences, Dara Torgerson for collecting the mammalian multiz alignments, Amanda Laracuente for assistance with the gene ontology associations, and Tim Sackton for providing the PAML data. We also thank David King and two anonymous reviewers for their insightful comments on this manuscript. This work was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) fellowship to M.A.H.
| Footnotes |
|---|
David Erwin, Associate Editor
| References |
|---|
Alba MM, Guigo R. Comparative analysis of amino acid repeats in rodents and humans. Genome Res (2004) 14:549–554.
Alba MM, Santibanez-Koref MF, Hancock JM. Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process. J Mol Evol (1999) 49:789–797.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.
Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res (1999) 27:573–580.
Blanchette M, Kent WJ, Riemer C. (12 co-authors). Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res (2004) 14:708–715.
Brignull HR, Morley JF, Garcia SM, Morimoto RI. Modeling polyglutamine pathogenesis in C. elegans. Methods Enzymol (2006) 412:256–282.
DePristo MA, Zilversmit MM, Hartl DL. On the abundance, amino acid composition, and evolutionary dynamics of low-complexity regions in proteins. Gene (2006) 378:19–30.
Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature (2007) doi: 10.1038/nature06341.
Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z. Intrinsic disorder and protein function. Biochemistry (2002) 41:6573–6582.
Felsenstein J. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics (1989) 5:164–166.
Fujimori S, Washio T, Higo K. (11 co-authors). A novel feature of microsatellites in plants: a distribution gradient along the direction of transcription. FEBS Lett (2003) 554:17–22.
Gatchel JR, Zoghbi HY. Diseases of unstable repeat expansion: mechanisms and common principles. Nat Rev Genet (2005) 6:743–755.
Golding GB. Simple sequence is abundant in eukaryotic proteins. Protein Sci (1999) 8:1358–1361.
Huntley M, Golding GB. Evolution of simple sequence in proteins. J Mol Evol (2000) 51:131–140.
Huntley MA, Golding GB. Simple sequences are rare in the Protein Data Bank. Proteins (2002) 48:134–140.
Huntley MA, Golding GB. Neurological proteins are not enriched for repetitive sequences. Genetics (2004) 166:1141–1154.
Huntley MA, Golding GB. Selection and slippage creating serine homopolymers. Mol Biol Evol (2006) 23:2017–2025.
Karlin S, Burge C. Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc Natl Acad Sci USA (1996) 93:1560–1565.
Karolchik D, Baertsch R, Diekhans M. (13 co-authors). The UCSC Genome Browser Database. Nucleic Acids Res (2003) 31:51–54.
Kashi Y, King DG. Simple sequence repeats as advantageous mutators in evolution. Trends Genet (2006) 22:253–259.
Kolpakov R, Bana G, Kucherov G. Mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res (2003) 31:3672–3678.
Levinson G, Gutman GA. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol (1987) 4:203–221.
Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D. A census of protein repeats. J Mol Biol (1999) 293:151–160.
Pizzi E, Frontali C. Low-complexity regions in Plasmodium falciparum proteins. Genome Res (2001) 11:218–229.
Romov PA, Li F, Lipke PN, Epstein SL, Qiu WG. Comparative genomics reveals long, evolutionarily conserved, low-complexity islands in yeast proteins. J Mol Evol (2006) 63:415–425.
Saqi M. An analysis of structural instances of low complexity sequence segments. Protein Eng (1995) 8:1069–1073.
Siwach P, Pophaly SD, Ganesh S. Genomic and evolutionary insights into genes encoding proteins with single amino acid repeats. Mol Biol Evol (2006) 23:1357–1369.
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci USA (2003) 100:9440–9445.
Wootton J. Sequences with unusual amino acid compositions. Curr Opin Struct Biol (1994) 4:413–421.
Wootton J, Federhen S. Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem (1993) 17:149–163.
Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci (1997) 13:555–556.
Zhang L, Yu S, Cao Y, Wang J, Zuo K, Qin J, Tang K. Distributional gradient of amino acid repeats in plant proteins. Genome (2006) 49:900–905.