MBE Advance Access originally published online on June 29, 2007
Molecular Biology and Evolution 2007 24(12):2598-2609; doi:10.1093/molbev/msm129

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Evolutionary Analysis of Amino Acid Repeats across the Genomes of 12 Drosophila Species

Melanie A. Huntley and Andrew G. Clark

Department of Molecular Biology and Genetics Cornell University

E-mail: mah223{at}cornell.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Materials and methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Repeated motifs of amino acids within proteins are an abundant feature of eukaryotic sequences and may catalyze the rapid production of genetic and even phenotypic variation among organisms. The completion of the genome sequencing projects of 12 distinct Drosophila species provides a unique dataset to study these intriguing sequence features on a phylogeny with a variety of timescales. We show that there is a higher percentage of proteins containing repeats within the Drosophila genus than in most other eukaryotes, including non-Drosophila insects, which makes this collection of species particularly useful for the study of protein repeats. We also find that proteins containing repeats are overrepresented in functional categories involving developmental processes, signaling, and gene regulation. Using the set of 1-to-1 ortholog alignments for the 12 Drosophila species, we test the ability of repeats to act as reliable phylogenetic signals and find that they resolve the generally accepted phylogeny despite the noise caused by their accelerated rate of evolution. We also determine that in general the position of repeats within a protein sequence is non-random, with repeats more often being absent from the middle regions of sequences. Finally, we find evidence to suggest that the presence of repeats is associated with an increase in evolutionary rate upon the entire sequence in which they are embedded. With additional evidence to suggest a corresponding elevation in positive selection, we propose that some repeats may be inducing compensatory substitutions in their surrounding sequence.

Key Words: protein repeats • simple sequence • homopeptides


    Introduction
Evolutionary innovations are often a result of internal duplication events within a genome followed by subsequent modification. In fact, simple sequence repeats are now being recognized as influential sequences with the ability to produce rapid variation and to act as ‘evolutionary tuning knobs’ (Kashi and King 2006). However, this increase in mutability can come at a devastating price; numerous diseases involving neurodegeneration are associated with homopolymer repeat expansions within protein sequences (Gatchel and Zoghbi 2005). Yet despite the propensity of many homopeptides to form insoluble toxic aggregates within the cell, repetitive sequence is the most commonly shared feature among eukaryotic proteins (Golding 1999; Huntley and Golding 2000).

The rapid expansion and contraction of simple sequence repeats is generally thought to be facilitated through replicative slippage (Levinson and Gutman 1987). However, we have previously found evidence that selection also plays a role in the evolution and maintenance of some homopolymers within proteins (Huntley and Golding 2006).

The abundance of repeats within eukaryotic protein sequences, and the observation that selection has acted to preserve certain repeats by preventing further slippage, suggest that repeats themselves may have inherent functional attributes. Studies aimed at elucidating these attributes using structural data have demonstrated that consistent structures for repeats are largely absent (Wootton 1994; Saqi 1995; Huntley and Golding 2002). This result has led to the suggestion that repeats do not form stable globular structures, and are in fact structurally disordered (Dunker et al. 2002).

Proteins containing disordered regions are generally involved in molecular recognition and signaling (Alba et al. 1999; Dunker et al. 2002). This function might arise from the intrinsic flexibility and mobility of disordered regions, as this could allow for increased association and dissociation rates along with binding promiscuity.

In this study we utilize the newly available genome sequences from 12 Drosophila species in order to further investigate the properties of protein repeats. We determine the relative abundance and patterns of repeats within each of the Drosophila proteomes and the specific functional categories associated with protein repeats. We then examine the relative positions of repeats within protein sequences, as this may provide more insight into their function and evolution. We also test the ability of repeats to act as phylogenetic signals. Finally, we investigate the effect of repeats on the evolutionary rate of their surrounding sequence.


    Materials and methods
The protein sequences based on the Comparative Analysis Freeze 1 (CAF1) dataset for the 12 Drosophila species (D. melanogaster, D. simulans, D. sechellia, D. erecta, D. yakuba, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. virilis, D. mojavensis, and D. grimshawi) were retrieved from NCBI.

To investigate the patterns of repetitive sequence within the proteins of the 12 Drosophila species, we searched for homopolymer tracts and extended segments of low complexity sequence. We required that homopolymers be at least five tandem, identical amino acids in length and implemented the program SEG (Wootton and Federhen 1993) to detect regions of low complexity within the protein sequences. The SEG algorithm is often used to filter out simple sequence within proteins before performing homology and similarity based searches like BLAST (Altschul et al. 1997), while a separate program is used to filter simple sequence from nucleotide sequences. In our implementation of the SEG algorithm we used a window length of 15 and a complexity cut-off K2(1) of 1.9 instead of the default values (12 and 2.2 respectively). The parameter K2(1) is an initial cut-off complexity value such that when SEG initially calculates the complexity of a subsequence, it must not exceed the cut-off complexity value. These values were previously shown to preferentially detect the longer and more repetitive sequence regions common among eukaryotic proteins (Huntley and Golding 2002).
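The homopolymer criterion described above (at least five tandem, identical residues) can be illustrated with a short sketch. This is not the authors' code; the function name `find_homopolymers` and the example sequence are hypothetical, and the low complexity detection performed by SEG is not reproduced here.

```python
import re

def find_homopolymers(protein, min_len=5):
    """Return (residue, start, end) for each run of >= min_len identical residues.

    start is 0-based and inclusive; end is exclusive.
    Assumes an upper-case one-letter amino acid string (hypothetical input format).
    """
    # One captured residue followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start(), m.end()) for m in pattern.finditer(protein)]

# Made-up sequence: a 7-residue poly-Q tract and a 5-residue poly-G tract.
runs = find_homopolymers("MSTQQQQQQQAAKLGGGGGH")
```

In this example the two-residue "AA" stretch is ignored because it falls below the five-residue threshold.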

Repeat Enrichment and Functional Associations
For each protein, we recorded the number of homopolymers and low complexity sequences detected, along with their lengths, relative position within the protein, and gene ontology (GO) associations (from Drosophila 12 Genomes Consortium 2007). For homopolymers, we also recorded the amino acid comprising the tract.

The CAF1 nucleotide sequences for the 12 Drosophila species were scanned for triplet repeat tracts. All tracts of five triplet repeats or more were recorded, and subsequently analyzed to demonstrate differences in the triplet repeat composition of coding and non-coding sequences.
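A scan for tracts of five or more tandem triplet repeats might look like the sketch below. It is an illustration under simplifying assumptions, not the procedure actually used: the function name `find_triplet_tracts` is hypothetical, only the characters A, C, G and T are considered, and the different reading frames of a repeat are reported as found rather than grouped together.

```python
import re

def find_triplet_tracts(dna, min_units=5):
    """Return (unit, start, n_units) for each tandem triplet tract of >= min_units copies.

    start is a 0-based offset into the sequence.
    """
    # One captured triplet followed by at least (min_units - 1) further copies.
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_units - 1))
    return [
        (m.group(1), m.start(), (m.end() - m.start()) // 3)
        for m in pattern.finditer(dna.upper())
    ]

# Made-up sequence containing five tandem CAG units.
tracts = find_triplet_tracts("ttCAGCAGCAGCAGCAGtt")
```

Note that a mononucleotide run such as fifteen consecutive A residues is also reported as an AAA tract by this pattern.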

Repeat Positions
To test the hypothesis that repeats are randomly dispersed throughout the length of a protein sequence, we divided each protein into three segments of equal length: an N-terminal segment, a mid segment, and a C-terminal segment. Each detected repeat was assigned to one of the three segments, based upon where the midpoint of the repeat was located. For instance, a repeat whose midpoint fell within the N-terminal third of the full protein was considered to have the majority of the repeat present in that segment of the protein. Any repeat whose midpoint fell on the boundary between two segments was randomly assigned to one of them. This gave us an observed distribution of repeat positions within the protein sequences.

To perform a goodness-of-fit χ² test, we needed an expected distribution for the positions of repeats. However, since the length of each individual repeat and of the protein within which the repeat is embedded influences that expectation, we had to generate the expected position of each repeat independently. Therefore, for each detected repeat, assuming random dispersal, we calculated the probability of the midpoint falling within each segment. If the length of the entire protein was L and the length of the repeat was l, then there would be L – l possible positions within the protein where the midpoint could fall. The mid segment contains L/3 of those possibilities, while the N-terminal and C-terminal segments each contain L/3 – l/2 possible midpoint positions. Therefore the probability that the repeat midpoint falls within the mid segment is $\frac{L/3}{L - l}$, while the probability for each terminal segment is $\frac{L/3 - l/2}{L - l}$. A random number was then generated and the expected position of the repeat was assigned based on these probabilities.
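As a rough illustration of this calculation, the sketch below computes the three segment probabilities from L and l as defined above, draws an expected segment at random, and evaluates a goodness-of-fit statistic. The function names are hypothetical, and the guard restricting l to at most 2L/3 is an assumption added here so that the terminal probability stays non-negative; the text does not say how longer repeats were handled.

```python
import random

def segment_probabilities(protein_len, repeat_len):
    """Return (P_N, P_mid, P_C) for the repeat midpoint under random placement."""
    L, l = protein_len, repeat_len
    # Assumption: the formula is only meaningful while L/3 - l/2 >= 0.
    if not 0 < l <= 2 * L / 3:
        raise ValueError("repeat length must be in (0, 2L/3]")
    total = L - l                       # possible midpoint positions
    p_mid = (L / 3) / total             # mid segment
    p_term = (L / 3 - l / 2) / total    # each terminal segment
    return p_term, p_mid, p_term

def draw_expected_segment(protein_len, repeat_len, rng=random):
    """Assign one expected segment label at random using the probabilities above."""
    probs = segment_probabilities(protein_len, repeat_len)
    return rng.choices(["N", "mid", "C"], weights=probs)[0]

def chi_square(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

p_n, p_mid, p_c = segment_probabilities(300, 30)
```

For a 300-residue protein with a 30-residue repeat, the three probabilities sum to one and the mid segment is slightly favored, since a repeat midpoint cannot lie within l/2 residues of either end of the protein.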

To determine whether the trends we observed within the Drosophila species were unique to this clade, we also performed the above analysis on a range of other species, including two additional insects (Anopheles gambiae and Apis mellifera), five other eukaryotes (the yeast Saccharomyces cerevisiae, the worm Caenorhabditis elegans, the plant Arabidopsis thaliana, the fish Danio rerio, and the mammal Homo sapiens), two archaebacteria (Methanococcus jannaschii and Pyrococcus horikoshii) and two eubacteria (the gram negative Escherichia coli and the gram positive Bacillus subtilis).

Repeats as Phylogenetic Signals
In order to evaluate trends within each Drosophila species, without confounding the results with proteins unique to a given lineage, we also analyzed the 1-to-1 Tcoffee ortholog alignments (obtained from http://rana.1b1.gov/drosophila/wiki/). In this way, we also had a collection of proteins for which there was a single ortholog in each of the 12 Drosophila genomes.

Due to the unique mechanism producing most protein repeats (replicative slippage and expansion, rather than typical point mutations), we were curious whether repeats alone could perform well as phylogenetic signals. Using the 1-to-1 ortholog alignments of the 12 Drosophila species we began to test this hypothesis by taking each ortholog alignment and scanning for homopolymer repeats at least five residues in length. If such a homopolymer was found in any of the species, we attempted to expand the repeat boundaries in both directions by examining the sequence in the other species at those particular positions in the alignment. If any of the sequences contained at least two tandem amino acid residues identical to the residues in the homopolymer, starting at the boundary position and extending beyond the homopolymer repeat, the boundary was then extended (see figure 1).


FIG. 1.— The large box depicts the extended boundaries of a poly-Q repeat region in an alignment. The alignment shown is a fragment of the FBgn0000438 sequence alignment. The longest homopolymer detected, 23 residues in length, is found in the D. grimshawi sequence (shaded gray). The boundary of the repeat is extended to the right by two additional residues because of the three gray shaded glutamines in the D. pseudoobscura and D. persimilis sequences (the left most shaded glutamines overlap with the rightmost shaded glutamine from D. grimshawi which previously delimited the right boundary).

 
After isolating the repeat alignment blocks within the ortholog alignments we then constructed a distance matrix for each repeat alignment using two separate methods. In the first method, termed ‘exact,’ we determined the length of the longest homopolymer tract (of the predominant amino acid) in each species for each repeat alignment block. The distance between any given species pair was then calculated as the magnitude of the difference between the lengths of their longest homopolymers. In the second, ‘fuzzy’ method, we counted for each species in a repeat alignment block the number of times the predominant amino acid was found in that stretch of sequence, regardless of whether it was part of a tandem homopolymer tract. The distance between species was calculated as the magnitude of the difference between their respective sums of predominant amino acid residues. We included the ‘fuzzy’ analysis because a longer repeat tract often becomes interrupted by subsequent amino acid substitutions, which could significantly shorten the observed length of the longest homopolymer tract within the sequence. This could then effectively bias the distances to be greater than the single amino acid substitution would warrant. The ‘fuzzy’ analysis should alleviate this concern.
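The two distance definitions can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary input format, and the species labels are hypothetical, and stripping alignment gaps before measuring the longest tract is an assumption that the text does not specify.

```python
from itertools import groupby

def longest_run(seq, residue):
    """Length of the longest uninterrupted run of `residue` in `seq`."""
    return max((len(list(g)) for r, g in groupby(seq) if r == residue), default=0)

def repeat_distances(block, residue, method="exact"):
    """Pairwise distances for one repeat alignment block.

    block: dict mapping species name -> aligned block sequence ('-' for gaps).
    residue: the predominant amino acid of the block.
    """
    if method == "exact":
        # Assumption: gaps are removed before measuring the longest tract.
        score = {sp: longest_run(s.replace("-", ""), residue) for sp, s in block.items()}
    elif method == "fuzzy":
        # Count every occurrence of the residue, tandem or not.
        score = {sp: s.count(residue) for sp, s in block.items()}
    else:
        raise ValueError("method must be 'exact' or 'fuzzy'")
    species = sorted(block)
    return {(a, b): abs(score[a] - score[b]) for a in species for b in species}

# Made-up block: dmel carries a single Q -> H substitution inside the tract.
block = {"dmel": "QQQQQHQQ", "dsim": "QQQQQQQQ", "dgri": "QQQ--QQQ"}
exact = repeat_distances(block, "Q", "exact")
fuzzy = repeat_distances(block, "Q", "fuzzy")
```

In this toy block a single substitution in one sequence yields an ‘exact’ distance of 3 between the first two sequences but a ‘fuzzy’ distance of only 1, which is the inflation the ‘fuzzy’ analysis is meant to dampen. A matrix of this kind would still need to be written in PHYLIP's distance matrix format before use.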

The ‘exact’ and ‘fuzzy’ distance matrices were then used separately as input for the NEIGHBOR and CONSENSE programs in the PHYLIP package (Felsenstein 1989) in order to produce individual trees for each repeat alignment block, and then finally a consensus tree. These consensus trees were then examined to determine the robustness of repeat sequences as phylogenetic signals.

Rates of Evolution
Finally, we wanted to use the unique dataset provided by the 1-to-1 orthologs of these 12 Drosophila species in order to test a hypothesis regarding the influence of repeats on the evolutionary rate of the surrounding protein sequence in which they are embedded. It has been previously established that repeats themselves evolve more rapidly than their flanking sequence (Huntley and Golding 2000); however, it has not yet been determined whether the sequence surrounding a repeat evolves at a different rate than proteins in which no repeat occurs.

To test this hypothesis we took the entire set of 1-to-1 ortholog alignments, containing 12 species per alignment, and scanned for homopolymer repeats as described above and in figure 1. We then removed any detected repeat blocks from the alignments and denoted the remaining aligned sequences ‘Repeats removed.’ Any alignments in which no repeat was detected (and thus, no repeat was removed) were put into a group denoted ‘No repeats.’ The resulting sets of alignments were then used to create phylogenetic trees using the PROTDIST and FITCH programs in the PHYLIP package. Tree lengths were then calculated by summing the branch lengths within each tree. Tree lengths were used as a proxy for evolutionary rate in each alignment of an ortholog between the 12 species. In this way we could compare the evolutionary rates of proteins containing no detectable repeats, to those containing repeats, excluding the repeated regions themselves.
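The tree length used here as a rate proxy is simply the sum of all branch lengths. Assuming the trees are available as Newick strings (the format PHYLIP writes to its tree output file), a minimal sketch of that summation is shown below; the function name `tree_length` and the example tree are hypothetical.

```python
import re

# A branch length is the number that follows a ':' in a Newick string.
_BRANCH = re.compile(r":\s*(-?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")

def tree_length(newick):
    """Sum every branch length found in a Newick tree string."""
    return sum(float(x) for x in _BRANCH.findall(newick))

# Made-up three-taxon tree: 0.1 + 0.2 + 0.05 + 0.3
length = tree_length("((A:0.1,B:0.2):0.05,C:0.3);")
```

Node labels without a preceding colon (for example support values) are ignored by this pattern, so only branch lengths contribute to the total.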

A subset of the species from these alignments was then used in a PAML (Yang 1997) analysis to test (using branch-specific models and model 7 versus model 8) for differences in selective constraint between the coding sequence alignments with repeats removed and those containing no repeats to begin with. Due to saturation of sites, in these analyses we limited the taxa to D. melanogaster, D. simulans, D. sechellia, D. erecta, D. yakuba, and D. ananassae. We converted the p-values from the likelihood ratio test for PAML models 7 and 8 to q-values using a false discovery rate method (Storey and Tibshirani 2003). The distributions of q-values for alignments with repeats removed and for alignments without repeats were then compared. Finally, we recorded the location of sites relative to the closest repeat boundary where a repeat had been removed, and the probability of ω > 1 for each site where that probability was at least 0.5. This provided a spatial account of potential sites with evidence for positive selection throughout the alignments.
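To make the p-value to q-value conversion concrete, the sketch below uses a simplified variant of the false discovery rate approach: the proportion of true nulls is estimated at a single fixed threshold (the hypothetical parameter `lam`) rather than with the smoothed estimate of the published method, so the numbers it produces are not expected to match those of the original analysis. The function name `q_values` is also hypothetical.

```python
def q_values(p_values, lam=0.5):
    """Simplified q-value estimate from a list of p-values.

    pi0 (proportion of true nulls) is estimated from the p-values above `lam`;
    this fixed-threshold estimate is a simplification chosen for illustration.
    """
    m = len(p_values)
    if m == 0:
        return []
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q

# Made-up p-values for eight likelihood ratio tests.
q = q_values([0.01, 0.02, 0.03, 0.6, 0.8, 0.9, 0.7, 0.5])
```

When the estimate of pi0 equals 1, as it does for these example values, the result coincides with the Benjamini-Hochberg adjusted p-values.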


    Results
Repeat Enrichment and Functional Associations
As expected from previous surveys of repeat abundance across the three domains of life (Marcotte et al. 1999; Huntley and Golding 2000), repeats are much more abundant within eukaryotic proteomes (see Supplemental figure S1). The percentage of proteins containing at least one repeat within the 12 Drosophila genomes ranges from 16% in D. sechellia to 30% in D. grimshawi. Interestingly, the Drosophila proteins appear to have more repeat enrichment than the other eukaryotes, including the two other insect species (An. gambiae and Ap. mellifera). In every species, with the exception of the plant A. thaliana, low complexity sequence is more common than homopolymer sequence. Typically low complexity regions encompass all homopolymer sequences. However, in an attempt to prevent questionable sequences from being detected as low complexity we set the minimum threshold length for homopolymers to be five residues, while requiring low complexity regions to be at least 15 residues in length. Since A. thaliana proteins have numerous homopolymer sequences that are too short to be detected as low complexity regions by these thresholds, they have the unique pattern of containing more homopolymers than low complexity sequence.

Figure 2 demonstrates that even among the set of 1-to-1 orthologs across all 12 Drosophila species there are differences in the level of repeat enrichment. There is significantly less low complexity per protein within the melanogaster and obscura groups compared to the remaining four species (p < 0.00001).


FIG. 2.— Mean values for the percentage of low complexity sequence per protein across 1-to-1 orthologs in the 12 Drosophila genomes and the percent of homopolymer tract lengths composed of homogeneous codon tracts.

 
Even the underlying codon structures of homopolymer repeats appear to differ between these species (see figure 2). While less than 13% of homopolymer tracts within Drosophila proteins are encoded by an uninterrupted homogeneous tandem array of codons, on average only 53.5% of a homopolymer tract will be encoded by such a homogeneous codon tract, and the amount of codon homogeneity varies, with D. sechellia having the least, and D. virilis having the most. Since codon homogeneity can further promote replicative slippage, thus increasing expansion and contraction events within homopolymer sequence, these results may indicate subtle differences in the evolutionary mechanisms creating and maintaining repeat sequences within these species.

The patterns of nucleotide triplet repeats differ markedly between coding and non-coding sequences (see figure 3). The length of triplet repeat tracts in non-coding regions is significantly longer than within coding regions, as expected due to relaxed constraint (p < 0.00001). However, the most notable difference between coding and non-coding triplet repeats is the frequency of CAG triplets (a CAG repeat may also be an AGC, GCA, GTC, TCG, and CGT repeat if all 6 reading frames are considered). CAG repeats are the most frequent of all repeats within coding sequences. This result is not surprising, since CAG encodes the amino acid glutamine, and poly-glutamine repeats are the most common homopeptides within Drosophila proteins (as described below). What is intriguing about this observation is that the alternate codon for glutamine, CAA, is more common among the triplet repeats within non-coding sequence. Therefore CAA repeats are preferred within non-coding sequence, while CAG codon repeats are preferred within coding sequences. Repeats detected by tandem repeats finder (Benson 1999) and mreps (Kolpakov et al. 2003) are consistent with these results (data provided by Hadi Quesneville). The only exception to this preference is found in D. willistoni, whose protein coding sequences contain more CAA repeats than CAG. This deviance is likely caused by the lower G+C content of the CAA triplet, as D. willistoni sequences appear to have uniquely lowered G+C content among the 12 Drosophila species (discussed further below).


FIG. 3.— The frequency of triplet repeats within coding and non-coding sequences of the 12 Drosophila species. Each triplet repeat, except AAA and GGG, represents all six possible frames. For instance, a CAA repeat (CAACAACAACAA) is also an AAC and ACA repeat, in addition to GTT, TTG, and TGT repeats on the complementary strand. The AAA and GGG repeats also represent TTT and CCC repeats respectively.

 
Of the 6,689 1-to-1 ortholog alignments, 3,607 contained at least one detectable repeat in at least one species. This resulted in 20,916 repeat alignment blocks being identified. The large number of repeat alignment blocks is a result of there being a mean value of 5.80 repeat blocks per alignment in the 3,607 alignments that contained repeats. The average length of a repeat block was 7.88 residues. Figure 4 demonstrates that the poly-Q repeats dominate the amino acid composition of the repeat alignment blocks, followed by poly-A, poly-S, poly-G, poly-T, and poly-N. Examining the lengths of repeats also reveals poly-Q to generally have the longest repeat tracts, with the exception of two intriguingly extended tyrosine tracts (see figure 5). Like the human diseases caused by poly-Q repeat tract expansions, the poly-Q tracts within Drosophila also tend to be encoded by CAG triplet repeats, and less often by CAA codons.


FIG. 4.— The abundance of amino acid homopeptide repeats within the 1-to-1 ortholog alignment sets for the 12 Drosophila species. Poly-Q homopolymers are by far the most frequent, being found in 6,108 alignment regions.

 

FIG. 5.— The distribution of homopolymer repeat alignment lengths within the 1-to-1 ortholog alignment set of the 12 Drosophila species, by amino acid. The data for tyrosine (Y) is not shown, as there were 3 poly-Y repeats (of lengths 7, 97 and 122), the longest of which affects the visible spread for the other amino acids. The center line of a box represents the median value, while the top and bottom lines show the 75th and 25th percentile values, forming the interquartile range. The whiskers extend from the median out until 3/2 times the interquartile range, with data points beyond plotted as individual circles.

 
Drosophila 12 Genomes Consortium (2007) and Powell et al. (in review) have found that D. willistoni has peculiar G+C content and codon usage. Upon inspecting the types of homopolymers formed in this species, even those proteins from the 1-to-1 ortholog sets, we find they appear to be influenced by low G+C content (see Supplemental figure S2). The amino acids that more frequently form homopolymers in D. willistoni (poly-D, G, H, I, N, S, and T) than in the other 11 species collectively have a lower G+C content than those that are more frequent in the remaining 11 species. In fact, even when only the codons with the lowest G+C content for each amino acid are considered, this trend holds true.

Using the gene ontology (GO) associations and methodology from Drosophila 12 Genomes Consortium (2007), we find there are significantly larger proportions of proteins with repeats than expected by chance in GO terms associated with developmental processes, signaling and gene regulation (see table 1). Additionally, GO categories for housekeeping and metabolic processes have significantly smaller proportions of proteins containing repeats than expected. These results illustrate underlying functional differences between proteins containing and lacking repeats.


Table 1 Gene Ontology Associations with Protein Repeats

 
Repeat Positions
Table 2 displays the χ² results from comparing the observed patterns of repeat positions to the expected patterns based on the null hypothesis of random dispersal throughout the length of a protein. The two eubacterial species did not deviate significantly from the expected distributions for low complexity sequence or homopolymer repeats. The archaebacterium P. horikoshii, however, did not show a random distribution of low complexity sequence. In this archaeon, and in all the eukaryotes, low complexity sequence tends to occur predominantly towards the ends of the protein (N-terminal and C-terminal).


Table 2 χ² Results for Homogeneity of the Distribution of Repeats along the Lengths of Proteins

 
Figure 6 depicts the distribution pattern of low complexity sequence within D. melanogaster proteins. This pattern where the mid segment of the protein is least enriched for repeats is common to all the other eukaryotes and P. horikoshii, except for the mosquito (An. gambiae) whose distribution is skewed towards the C-terminus (see figure 6).


FIG. 6.— The frequency distribution of repeat positions within proteins. Observed and expected position distributions for low complexity sequence among D. melanogaster and An. gambiae proteins (χ² tests produce p-values of 0.00001 and 0.0003 respectively). The pattern observed in D. melanogaster was typical of all the other eukaryotes examined, with the exception of An. gambiae.

 
When we examine the distribution of homopolymer repeats, we notice that all χ² values but those for A. thaliana and S. cerevisiae decrease compared to the values obtained from the low complexity sequence. For A. thaliana this is likely a result of the increase in the number of sequences containing homopolymers, as it was the only species to show more homopolymers than low complexity sequence (see Supplemental figure S1). We also note that the archaeon P. horikoshii does not show a significant deviation from random dispersal of homopolymers within its proteins. More intriguing, however, is the result that the two non-Drosophila insects, An. gambiae and Ap. mellifera, also fail to reject random dispersal for homopolymers, despite all other eukaryotes rejecting the hypothesis.

Figure 7 illustrates an interesting observation regarding differences in the type of repeats that occupy the three different segments of a protein. Repeats in the N-terminal segment of a protein comprise a significantly larger percentage of the protein length than do those in the mid segment (p < 0.00001), and likewise those in the mid segment take up a larger percentage than those in the C-terminal (p < 0.0001).


FIG. 7.— The distribution of repeats within the set of 1-to-1 orthologs in the 12 Drosophila species, measured as the percentage of the protein length comprising the repeat segment, separated by the positional segment within the protein where the repeat appears (A) the N-terminal segment of the protein; (B) the mid segment of the protein; (C) the C-terminal segment of the protein). Repeats within the N-terminal segment comprise a significantly larger percentage of the protein length than do those within the mid segment (p < 0.0001). Likewise, repeats in the mid segment take up a significantly larger percentage of the protein length than repeats found in the C-terminal segment (p < 0.00001).

 
Repeats as Phylogenetic Signals
The consensus trees for both the ‘exact’ and ‘fuzzy’ repeat analysis showed the generally accepted topology (as seen in figure 2). However the number of repeat alignments supporting any particular node in the consensus tree was consistently low (see figure 8). This lack of support likely arises from a limited number of phylogenetically informative repeat alignments, as a result of the mechanism of repeat evolution itself. The rapid rate of repeat evolution, thought to be facilitated by replicative slippage, would likely produce many repeats as autapomorphic traits. However because of the finer scale within clades of the Drosophila phylogeny one could still expect to see a large enough number of synapomorphic repeats to resolve the phylogeny, and the ability of the repeat analysis to produce the accepted topology suggests this to be the case.


FIG. 8.— The consensus tree branching topology for the ‘exact’ repeat analysis (A), and ‘fuzzy’ repeat analysis (B). The numeric values at each node indicate the frequency of repeat alignments in support of the node.

 
We also performed the above analysis with the additional criterion that the repeat be conserved in at least six, and then in all 12, species in an alignment. Our definition of conserved was broad in that it only required at least five of the predominant amino acids to be present (in tandem for the ‘exact’ analysis). We hoped that this would reduce the number of phylogenetically uninformative repeat alignments by removing autapomorphies. These results were nearly identical to those above, both for ‘exact’ and ‘fuzzy’ analyses. Consensus tree topologies were consistent with the accepted topology, and nodes were supported by roughly 3% to 18% of the repeat alignments.

Rates of Evolution
As described above, 20,916 repeat alignment blocks were detected within the 6,689 1-to-1 ortholog alignments. These repeats were detected in only 3,607 of the ortholog alignments, and were then removed producing the ‘repeats removed’ group. The ‘no repeats’ group contained the remaining 3,082 ortholog alignments in which no repeats were detected. Figure 9A depicts the different distributions of tree lengths for the ‘no repeats’ and ‘repeats removed’ groups. The mean tree length for the ‘no repeats’ group (0.8986489) is significantly smaller than that for the ‘repeats removed’ group (1.027367) as determined by a t-test (P < 0.00001).


FIG. 9.— The distribution of tree lengths (A) and alignment lengths (B) in 1-to-1 ortholog sets for the 12 Drosophila species. The ‘no repeats’ set is comprised of sequences in which no repeats were detected in any of the 12 species and had a mean tree length of 0.8986489 and alignment length of 404.73 residues. The ‘repeats removed’ set contained sequences in which at least one repeat in one species was detected, and that portion of the alignment was removed from each species. The mean tree length (1.027367) and alignment length (738.35) in this group were significantly larger than those of the ‘no repeats’ group (p < 0.0001 each).

 
We confirmed this result using a collection of mammalian multiz alignments obtained from the UCSC genome browser (Karolchik et al. 2003; Blanchette et al. 2004). The seven mammals examined (Homo sapiens, Pan troglodytes, Macaca mulatta, Mus musculus, Rattus norvegicus, Canis familiaris, and Bos taurus) had a significantly longer mean tree length for their ‘repeats removed’ group of proteins than for the ‘no repeats’ group (p = 0.0031).

To determine if the length of the alignments might be influencing the above result, we also tested for significant differences in alignment lengths between the two groups (see figure 9B). A significant difference (p < 0.00001) was found, with the ‘repeats removed’ group having longer mean alignment lengths (738.35 residues) than the ‘no repeats’ group (404.73 residues).

However, a regression analysis of the tree lengths and the alignment lengths demonstrated that only 0.1% of the variation in the data could be explained by a relationship between tree length and alignment length. Therefore, the increase in evolutionary rate among the ‘repeats removed’ group is not simply due to an increase in sequence alignment length.
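The "variation explained" figure quoted above is the coefficient of determination of the regression. As a minimal sketch, assuming an ordinary least-squares fit with a single predictor (the source does not specify the regression model), R² equals the squared Pearson correlation between the two variables. The function name `r_squared` and the toy values are hypothetical.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Hypothetical alignment lengths (residues) and tree lengths.
alignment_lengths = [200, 400, 600, 800]
tree_lengths = [1.0, 0.8, 1.1, 0.9]
print(r_squared(alignment_lengths, tree_lengths))
```

A value near zero, such as the 0.001 reported in the text, indicates that alignment length carries almost no information about tree length.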

Since repeat-containing proteins are non-randomly distributed among functional categories (see table 1), which are known to evolve at different rates, we tested whether the elevated rate of evolution seen among proteins with repeats was simply an artifact of the evolutionary rate of a functional category. We again used alignments for proteins whose repeats had been removed, and alignments without repeats, for each comparison within a functional category. The categories for developmental processes, cell cycle, defense response (immunity), and stress response were each examined. Defense and stress responses are functional categories associated with higher rates of evolution, while cell cycle and developmental processes are generally more conserved. In each category the mean tree length for the set of proteins with repeats removed was always higher than for the proteins without repeats. However, the difference was only significant for developmental processes (P < 0.0001) and stress response (P < 0.00001).

We then tested the hypothesis that the increased rate of evolution in the sequence surrounding the repeats was due to an increase in compensatory substitutions. We reasoned that protein sequence surrounding repeats with evidence of ongoing replicative slippage should display a higher rate of evolution than sequence surrounding more stable repeats whose underlying codons had mutated to inhibit further slippage. To test this hypothesis we used proteins containing serine homopolymers from a previous study that had determined the influence of stabilizing selection or replicative slippage mechanisms on each repeat (Huntley and Golding 2006).

We took the 31 human proteins from this previous study and collected homologous sequences from mouse, rat, and cow (the only taxa with sequences resulting in BLAST expect values less than or equal to 10^-20 for all 31 human sequences) and aligned the sequences using CLUSTALW. The repeats were then excised from the 31 alignments as described above for the Drosophila sequences, and trees were built from the remaining sequence.

Although the sample size is small, we found a significant difference (p = 0.0354) between the mean tree length for the sequence surrounding repeats with evidence for selection (0.065505) and that for repeats with evidence for slippage (0.310135). Therefore, proteins containing repeats with slippage-resistant codon structures appear to have an overall lower rate of evolution in the flanking sequence than proteins with repeats undergoing slippage.

The PAML analysis using the subset of six Drosophila species found no significant difference in the ratio of non-synonymous to synonymous substitutions between lineages of the phylogeny with repeats excised and those that lacked repeats altogether. However, a larger fraction of the alignments in which at least one species contained a repeat, subsequently excised, showed evidence of positive selection compared to the set of alignments in which none of the species contained repeats. A Wilcoxon rank sum test of the q-value distributions revealed this difference to be significant (p < 0.00001).
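For readers unfamiliar with the rank sum test used on the q-value distributions, the following is a rough sketch of the statistic. It assigns mid-ranks to ties and uses the large-sample normal approximation without a tie or continuity correction, which is a simplification and not necessarily what the authors' software did. The function name `rank_sum_test` and the toy q-values are hypothetical.

```python
import math
from statistics import NormalDist


def rank_sum_test(sample_a, sample_b):
    """Return (U, z, two-sided p) for sample_a versus sample_b."""
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the block of tied values starting at i and give it the mid-rank.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    # Normal approximation to the null distribution of U.
    z = (u_a - n_a * n_b / 2) / math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_a, z, p


# Hypothetical q-values for the two alignment sets.
u, z, p = rank_sum_test([0.01, 0.02, 0.03], [0.5, 0.6, 0.7])
print(u, round(z, 3), round(p, 4))
```

A negative z indicates that the first sample tends to hold the smaller values, which for q-values corresponds to stronger evidence of positive selection.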

We observe the position of sites with evidence for positive selection to be strongly clustered near the boundaries of repeats (see figure 10). The spatial distribution of these sites is skewed towards the N-terminal side of a repeat, and 50% of the sites fall within the first 26 residues closest to a repeat boundary, while 95% fall within 201 residues. Examining only those sites with Pr(ω > 1) ≥ 0.95, we find a nearly identical spread (50% of sites within the first 25 residues, and 95% within 229). In all cases the spatial distribution of sites with evidence for positive selection tends to be wider on the N-terminal side than on the C-terminal side of a repeat boundary. The reasons for this asymmetry remain unclear at present; it may even be caused by the unidirectionality of transcription rather than by protein structure.
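The distance summaries above (the residue distance within which 50% or 95% of sites fall) can be sketched as follows. The sketch assumes each excised repeat is represented by a single boundary coordinate in the alignment and that negative distances denote the N-terminal side, matching the convention in figure 10. The function names and coordinates are hypothetical and are not taken from the authors' pipeline.

```python
import bisect
import math


def signed_distance(site, boundaries):
    """Signed distance from a site to the nearest boundary (boundaries sorted).

    Negative values mean the site lies on the N-terminal side of the boundary.
    """
    idx = bisect.bisect_left(boundaries, site)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = boundaries[max(idx - 1, 0): idx + 1]
    nearest = min(candidates, key=lambda b: abs(site - b))
    return site - nearest


def distance_quantile(distances, fraction):
    """Smallest absolute distance within which `fraction` of the sites fall."""
    ordered = sorted(abs(d) for d in distances)
    index = max(math.ceil(fraction * len(ordered)) - 1, 0)
    return ordered[index]


# Hypothetical repeat boundaries and positively selected sites.
boundaries = [100, 300]
sites = [90, 95, 110, 150, 290, 320, 500]
distances = [signed_distance(s, boundaries) for s in sites]
print(distances)
print(distance_quantile(distances, 0.5), distance_quantile(distances, 0.95))
```

Keeping the sign until the final summary makes it straightforward to compare the N-terminal and C-terminal sides separately.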


FIG. 10. The positional distribution (A) and mean probability of positive selection (B) for sites with a probability of positive selection greater than or equal to 0.5 in 1-to-1 ortholog alignments for six Drosophila species (D. melanogaster, D. simulans, D. sechellia, D. erecta, D. yakuba and D. ananassae). The probability of positive selection (Pr(ω > 1)) was calculated using the PAML model 7 and 8 test on alignments where repeats had been removed. Sites to the left of a repeat boundary are indicated as negative distance.

 
We next investigated the distribution of q-values within functional categories between alignments with repeats removed and alignments without repeats. This allowed us to determine if there was a consistent association of increased evidence for positive selection among proteins containing repeats, or whether the association was with the functional category instead. Of the four categories examined (developmental processes, cell cycle, defense response, and stress response) only cell cycle and defense response showed significantly lower mean q-values for alignments with repeats (P < 0.0002 and P < 0.0310 respectively). The sequences involved in stress response displayed a different pattern, though not significant, having a lower mean q-value for the alignments without repeats.


    Discussion
 
Repeat Enrichment and Functional Associations
Repeated motifs of amino acids appear to be more abundant in Drosophila than most other species examined. A notable exception is the human malarial parasite, Plasmodium falciparum, whose proteins have previously been demonstrated to have abnormally high amounts of low complexity sequence (Pizzi and Frontali 2001).

DePristo et al. (2006) suggest that the abundance of low complexity sequence within several species, including P. falciparum, is partly attributable to increased genomic A+T content. However, we find that despite the abnormally high A+T content of D. willistoni among the 12 Drosophila species, there is no corresponding increase in low complexity sequence within this species. It may be that the A+T content of D. willistoni is not different enough from the other Drosophila species to cause a relative increase in the amount of low complexity sequence within its proteome. We do, however, find evidence for the especially low G+C content of D. willistoni affecting the types of amino acid repeats being formed within its proteins. This is particularly notable since the data set being used is composed of 1-to-1 orthologs between the 12 species, and presumably repeated motifs within this collection of proteins would be less susceptible to species-specific attributes.

However, substantial variation in repeat characteristics does exist between the taxa, even in the 1-to-1 orthologs. In particular, the amount of low complexity sequence per protein varies between species, increasing with phylogenetic distance from the melanogaster subgroup. The differences between closely related taxa might be explained by differences in effective population sizes, where longer and more abundant repeats could go to fixation more often in smaller populations, assuming that repeats are somewhat deleterious.

Likewise, the underlying nucleotide structures of homopolymer repeats vary significantly between species. D. virilis, for instance, has the highest average (58.1%) for the portion of a homopolymer tract encoded by a homogeneous codon tract. Differences in the lengths of homopolymer tracts between species cannot account for these underlying codon compositional differences, as the relationship between homopolymer tract length and the portion encoded by a homogeneous codon tract explains only 6.1% of the variation.
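To make the codon homogeneity measure concrete, the sketch below computes the portion of a homopolymer tract that is encoded by its longest run of identical codons. This reading of "portion encoded by a homogeneous codon tract" is an assumption, as are the function name `homogeneous_codon_fraction` and the example poly-glutamine tract.

```python
from itertools import groupby


def homogeneous_codon_fraction(codons):
    """Fraction of a homopolymer tract covered by its longest identical-codon run."""
    if not codons:
        raise ValueError("codon list must not be empty")
    # groupby collapses consecutive identical codons into runs.
    longest_run = max(len(list(run)) for _, run in groupby(codons))
    return longest_run / len(codons)


# Hypothetical poly-Q tract: six glutamine codons, interrupted once by CAA.
poly_q = ["CAG", "CAG", "CAG", "CAA", "CAG", "CAG"]
print(homogeneous_codon_fraction(poly_q))
```

A tract with a fraction of 1.0 is encoded by a single uninterrupted codon repeat, the configuration most prone to replicative slippage.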

Due to the inherent replication dynamics of uninterrupted homogeneous codon tracts, with the likelihood of replicative slippage increasing with repeat copy number, the differences in codon tract homogeneity underlying repeats between species may indicate differences in how these repeats are arising and being maintained in each lineage. Overall, fewer than 13% of all homopolymer tracts within Drosophila are encoded completely by uninterrupted codon tracts. This demonstrates that the majority of repeats are persisting long enough in the genome to be modified by substitution processes, and perhaps eventually stabilized by selection to prevent further replicative slippage. The creation of slippage-resistant codon structures underlying repeat regions may indicate selection against unstable slippage-prone variants, selection to maintain some function arising from the presence of a repeat region, or both.

Consistent with previous results demonstrating that developmental proteins are enriched for amino acid repeats (Karlin and Burge 1996; Huntley and Golding 2004), we also find here that proteins containing repeats are significantly associated with developmental processes. It has been a curious observation that these amino acid repeat sequences, which can wreak havoc with their propensity to expand, as seen in the many human neurodegenerative diseases caused by homopolymer tract expansions, would be so abundant in a class of proteins so influential as the developmental proteins.

One key to this puzzle comes from the observation that amino acid repeats tend to form structurally disordered regions (Wootton 1994; Saqi 1995; Dunker et al. 2002; Huntley and Golding 2002). These regions may form flexible linkers between globular domains, and become structured upon binding with a substrate. The inherent mobility of such unstructured regions could facilitate faster association and dissociation rates and binding promiscuity that may be advantageous to proteins involved in gene regulation and signaling associated with development. This structural trait of repeats could explain the persistence of such high numbers of repeats within eukaryotic proteins.

Repeat Positions
In line with this argument, our finding that repeats are not randomly distributed throughout the length of a protein may also indicate a functional explanation for their survival in the terminal segments of peptide sequences. We are not aware of any mechanisms creating amino acid repeats that would bias their generation to the amino and carboxyl ends of the peptide. Rather, if repeats do arise randomly throughout a protein, then a relatively larger fraction of those variants that arise towards the center of the protein must eventually be eliminated from the population.

Fujimori et al. (2003) reported a position bias in the density of nucleotide microsatellites towards the transcriptional start site within plant gene sequences. However, they found no such bias in D. melanogaster. Interestingly, they observed a trend in mammals where microsatellite density increased towards both ends of the gene. A follow-up study on amino acid repeats in plants also demonstrated a gradient of repeats with increased frequency in the N-terminus, decreasing in frequency along the direction of transcription (Zhang et al. 2006). However, they did not account for varying lengths of proteins, or repeats, and performed only a sliding window analysis across the first 400 (or fewer) amino acids of each protein. This would inevitably bias the results towards repeats appearing more frequently in the N-terminus, as shorter peptides containing repeats in their mid segment or C-terminus would contribute to the results appearing towards the N-terminal portion of the protein.
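The length bias described above can be illustrated by expressing each repeat's position as a fraction of its protein's length before comparing proteins. This is only a sketch of one possible normalization, not the procedure used in this study or in the studies cited. The half-open coordinate convention, the three equal-width bins, and the function names are assumptions.

```python
def relative_midpoint(start, end, protein_length):
    """Midpoint of the half-open residue interval [start, end) as a fraction of length."""
    if not 0 <= start < end <= protein_length:
        raise ValueError("repeat must lie within the protein")
    return (start + end) / 2 / protein_length


def region(relative_position, n_bins=3):
    """Index of the equal-width bin (0 = most N-terminal) holding the position."""
    return min(int(relative_position * n_bins), n_bins - 1)


# The same absolute coordinates fall in different relative regions
# depending on protein length (hypothetical coordinates).
short_protein = region(relative_midpoint(350, 360, 400))   # C-terminal third
long_protein = region(relative_midpoint(350, 360, 1200))   # N-terminal third
print(short_protein, long_protein)
```

A fixed window over the first 400 residues would treat both repeats as occupying the same position, even though one sits near the C-terminus of its protein and the other near the N-terminus.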

However, studies by Alba and Guigo (2004) and Siwach et al. (2006) have shown particular amino acid repeats overrepresented in each terminal end of proteins (poly-L, A, G, N, C, Q, H, and V in the N-terminus, and poly-F, I, K, and S in the C-terminus), suggesting that repeat persistence in the ends of protein sequences is not simply a result of repeats being tolerated more at those positions. In conjunction with our finding that the relative lengths of repeats vary depending on their location within the protein (see figure 7), the above appearance of a consistent pattern among several taxa likely indicates an underlying non-random process maintaining repeats in these positions.

Repeats as Phylogenetic Signals
The rapid expansion of repeat sequences, together with their yet-to-be-resolved mechanisms of maintenance, makes them unusual traits, which we tested as phylogenetic signals using the Drosophila phylogeny. Although many repeat sequences are unique to single lineages and are therefore phylogenetically uninformative, those that are shared among several lineages are informative and can still resolve the generally accepted phylogeny.

We hope this result can be used to further develop a framework for using indels within sequence alignments as informative sites. A previous analysis demonstrated their utility as a specific character state for detecting selection, despite indels usually being excluded from such analyses (Huntley and Golding 2006).

Rates of Evolution
It has been well established now that repeats themselves tend to evolve more rapidly than the remaining peptide sequence in which they are embedded (Huntley and Golding 2000; Romov et al. 2006). However, our finding in this study that the sequence surrounding a repeat evolves faster, and with a stronger signal for positive selection, than sequence containing no repeats is intriguing. Our hypothesis that this increase in evolutionary rate might result from compensatory substitutions in the flanking sequence to accommodate the rapid length perturbations in the repeat sequence is supported by a preliminary data set indicating that repeats stabilized by selection to prevent further expansion have flanking sequence with lower evolutionary rates than repeats that have ongoing slippage. However, an equally supported hypothesis to explain these results is that proteins undergoing rapid evolution may benefit from the acquisition of repeat domains which can then rapidly expand and contract until stabilization is preferred. In this way, repeats could act as evolutionary "tuning knobs" (Kashi and King 2006) and be selected for on the basis of the increase in variability afforded by their unique mechanism of mutation.

It is still curious, however, that some repeats appear advantageous or neutral, while others are incredibly deleterious. This apparent discordance can be somewhat reconciled by the findings from an experiment by Brignull et al. (2006), who used C. elegans mutants to demonstrate that the threshold for pathogenic length in poly-Q type diseases could be manipulated by perturbing the function of various housekeeping proteins. By using mutants with extended lifespans they demonstrated that the onset of poly-Q pathogenesis can be further delayed, in agreement with observations that in general the age of onset is related to homopolymer tract length and lifespan of the organism. They then reasoned that a cellular buffering system exists to prevent proteotoxicity until a certain age when the buffering system begins to fail. They found additionally that they could induce transition from soluble protein to aggregate states in homopolymer lengths just under the pathogenic threshold by disrupting genes involved in the clearance of misfolded proteins and protein turnover. It then seems apparent that repeats themselves only become problematic to the cell when other housekeeping networks begin to fail. Otherwise, repeat expansions may induce rapid compensatory mutations, presumably to stabilize the protein structure and preserve function; alternatively, by virtue of their propensity to rapidly expand and contract, repeats may enable the exploration of novel protein conformations and functions.


    Supplementary Material
 
Supplementary material figures S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
 
The authors wish to thank Hadi Quesneville for providing microsatellite predictions within the genomic sequences, Dara Torgerson for collecting the mammalian multiz alignments, Amanda Laracuente for assistance with the gene ontology associations, and Tim Sackton for providing the PAML data. We also thank David King and two anonymous reviewers for their insightful comments on this manuscript. This work was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) fellowship to M.A.H.


    Footnotes
 
David Erwin, Associate Editor


    References
 

    Alba MM, Guigo R. Comparative analysis of amino acid repeats in rodents and humans. Genome Res (2004) 14:549–554.

    Alba MM, Santibanez-Koref MF, Hancock JM. Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process. J Mol Evol (1999) 49:789–797.

    Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.

    Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res (1999) 27:573–580.

    Blanchette M, Kent WJ, Riemer C. (12 co-authors). Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res (2004) 14:708–715.

    Brignull HR, Morley JF, Garcia SM, Morimoto RI. Modeling polyglutamine pathogenesis in C. elegans. Methods Enzymol (2006) 412:256–282.

    DePristo MA, Zilversmit MM, Hartl DL. On the abundance, amino acid composition, and evolutionary dynamics of low-complexity regions in proteins. Gene (2006) 378:19–30.

    Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature (2007) doi: 10.1038/nature06341.

    Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z. Intrinsic disorder and protein function. Biochemistry (2002) 41:6573–6582.

    Felsenstein J. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics (1989) 5:164–166.

    Fujimori S, Washio T, Higo K. (11 co-authors). A novel feature of microsatellites in plants: a distribution gradient along the direction of transcription. FEBS Lett (2003) 554:17–22.

    Gatchel JR, Zoghbi HY. Diseases of unstable repeat expansion: mechanisms and common principles. Nat Rev Genet (2005) 6:743–755.

    Golding GB. Simple sequence is abundant in eukaryotic proteins. Protein Sci (1999) 8:1358–1361.

    Huntley M, Golding GB. Evolution of simple sequence in proteins. J Mol Evol (2000) 51:131–140.

    Huntley MA, Golding GB. Simple sequences are rare in the Protein Data Bank. Proteins (2002) 48:134–140.

    Huntley MA, Golding GB. Neurological proteins are not enriched for repetitive sequences. Genetics (2004) 166:1141–1154.

    Huntley MA, Golding GB. Selection and slippage creating serine homopolymers. Mol Biol Evol (2006) 23:2017–2025.

    Karlin S, Burge C. Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc Natl Acad Sci USA (1996) 93:1560–1565.

    Karolchik D, Baertsch R, Diekhans M. (13 co-authors). The UCSC Genome Browser Database. Nucleic Acids Res (2003) 31:51–54.

    Kashi Y, King DG. Simple sequence repeats as advantageous mutators in evolution. Trends Genet (2006) 22:253–259.

    Kolpakov R, Bana G, Kucherov G. Mreps: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res (2003) 31:3672–3678.

    Levinson G, Gutman GA. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol (1987) 4:203–221.

    Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D. A census of protein repeats. J Mol Biol (1999) 293:151–160.

    Pizzi E, Frontali C. Low-complexity regions in Plasmodium falciparum proteins. Genome Res (2001) 11:218–229.

    Romov PA, Li F, Lipke PN, Epstein SL, Qiu WG. Comparative genomics reveals long, evolutionarily conserved, low-complexity islands in yeast proteins. J Mol Evol (2006) 63:415–425.

    Saqi M. An analysis of structural instances of low complexity sequence segments. Protein Eng (1995) 8:1069–1073.

    Siwach P, Pophaly SD, Ganesh S. Genomic and evolutionary insights into genes encoding proteins with single amino acid repeats. Mol Biol Evol (2006) 23:1357–1369.

    Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci USA (2003) 100:9440–9445.

    Wootton J. Sequences with ‘unusual’ amino acid compositions. Curr Opin Struct Biol (1994) 4:413–421.

    Wootton J, Federhen S. Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem (1993) 17:149–163.

    Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci (1997) 13:555–556.

    Zhang L, Yu S, Cao Y, Wang J, Zuo K, Qin J, Tang K. Distributional gradient of amino acid repeats in plant proteins. Genome (2006) 49:900–905.

Accepted for publication June 18, 2007.




