MBE Advance Access originally published online on November 21, 2005
Molecular Biology and Evolution 2006 23(3):598-607; doi:10.1093/molbev/msj065
© The Author 2005. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Article

Quantifying Ascertainment Bias and Species-Specific Length Differences in Human and Chimpanzee Microsatellites Using Genome Sequences

Edward J. Vowles and William Amos

Department of Zoology, University of Cambridge, Cambridge, United Kingdom

E-mail: ejv22{at}cam.ac.uk.


    Abstract
Surveys of variability of homologous microsatellite loci among species reveal an ascertainment bias for microsatellite length where microsatellite loci isolated in one species tend to be longer than homologous loci in related species. Here, we take advantage of the availability of aligned human and chimpanzee genome sequences to compare length difference of homologous microsatellites for loci identified in humans to length difference for loci identified in chimpanzees. We are able to quantify ascertainment bias for a range of motifs and microsatellite lengths. Because ascertainment bias should not exist if a microsatellite selected in one species is as likely to be longer as it is to be shorter than its homologue, we propose that the nature of ascertainment bias can provide evidence for understanding how microsatellites evolve. We show that bias is greater for longer microsatellites but also that many long microsatellites have short homologues. These results are consistent with the notion that growth of long microsatellites is constrained by an upper length boundary that, when reached, sometimes results in large deletions. By evaluating ascertainment bias separately for interrupted and uninterrupted repeats we also show that long microsatellites tend to become interrupted, thereby contributing a second component of ascertainment bias. Having accounted for ascertainment bias, in agreement with results published elsewhere, we find that microsatellites in humans are longer on average than those in chimpanzees. This length difference is similar among repeat motifs but surprisingly comprises two roughly equal components, one associated with the repeats themselves and one with the flanking sequences. The differences we find can only be explained if microsatellites are both evolving directionally under a biased mutation process and are doing so at different rates in different closely related species.

Key Words: microsatellite • short tandem repeat • ascertainment bias • human • chimpanzee


    Introduction
Microsatellites are important molecular markers, comprising sequences of tandemly repeated 1–5 bp motifs. They are thought to mutate primarily by molecular slippage (Levinson and Gutman 1987; Schlötterer and Tautz 1992), thereby gaining and losing repeat units, but occasionally single nucleotide substitutions occur within the repeat tract. Such interruption mutations appear to reduce slippage and hence the rate at which length changes occur (Brinkmann et al. 1998; Primmer and Ellegren 1998). Microsatellites are visualized by means of polymerase chain reaction (PCR) primers designed against sequences flanking the repeat region, yielding variable length products that can be resolved electrophoretically and detected by fluorescence.

Microsatellites have proved particularly useful in conservation genetic studies because, although the microsatellites themselves mutate rapidly, the primer sites tend to be relatively stable. Consequently, primers designed for one species often amplify homologous products in related species, termed cross-species amplification (e.g., see Kayser et al. 1995; Coote and Bruford 1996; Davis et al. 2002). In extreme cases, primer conservation over more than 100 Myr has been documented (C. Rico, I. Rico, and Hewitt 1996; Zardoya et al. 1996). Thus, the time-consuming process of developing a new set of loci for every new species can be circumvented.

When new microsatellite markers are designed, there is often strong selection so as to achieve maximum levels of polymorphism. This involves choosing sequences with long tracts of pure repeats in preference to shorter stretches and interrupted sequences. Once primers have been designed, they are usually tested on a panel of samples and those showing strongest evidence of polymorphism are taken forward. Unfortunately, these selection criteria give rise to a phenomenon termed ascertainment bias, whereby a microsatellite chosen to be maximally long in one species is then deemed likely to be shorter in a second species. Ascertainment biases have been reported for length (Ellegren et al. 1997; Webster, Smith, and Ellegren 2002), repeat tract purity (Hutter, Schug, and Aquadro 1998; Zhu, Queller, and Strassmann 2000), and heterozygosity (Hutter, Schug, and Aquadro 1998).

The notion of ascertainment bias was first used in the context of microsatellites in a study comparing the length of human microsatellites amplified in chimpanzees (Rubinsztein et al. 1995). This paper noted that human microsatellites were consistently longer than their chimpanzee homologues and interpreted this as evidence of an upwardly biased mutation process coupled with differences in the genome-wide rate of slippage. However, it was subsequently pointed out that these length differences could arise simply through ascertainment bias (Ellegren, Primmer, and Sheldon 1995), and this stimulated a number of studies looking specifically for the existence of ascertainment bias. The overall conclusion is that ascertainment biases are common and operate with regard to length, purity, and polymorphism (Ellegren et al. 1997; Zhu, Queller, and Strassmann 2000; Webster, Smith, and Ellegren 2002; Amos et al. 2003).

Despite widespread agreement that ascertainment bias operates, the exact mechanism is unclear. If all microsatellites evolve at a uniform rate and without a limit on repeat number, no bias should accrue. For a bias to operate, there must be some form of restriction on length such that long loci are in some sense unlikely. One possibility is that long microsatellites tend to accumulate internal point mutations such that they are no longer recognized as long. Then, a long, pure locus would be unusual because its homologues in related species will often appear interrupted and hence shorter. A second possibility is that long microsatellites either begin shrinking in length through a reversal of the upward bias that seems to operate on most marker-length loci (Amos et al. 1996; Primmer et al. 1998; Cooper et al. 1999; Ellegren 2000; Kayser et al. 2000), or suffer large deletions. Again, a long locus will be unusual because it has reached a state where many homologues will have mutated to become shorter. Modeling supports the plausibility of an interruption-based model (Kruglyak et al. 1998), and there is some direct evidence for it (Brinkmann et al. 1998; Primmer and Ellegren 1998).

The presence of ascertainment bias does not rule out the possibility that microsatellites in one species are consistently longer than those in another. To discover whether such trends exist, it is necessary to eliminate ascertainment bias by conducting a reciprocal test, with species A-derived markers tested on species B and vice versa. When this is done, both ascertainment bias and consistent length differences have been found. Thus, human microsatellites appear longer than their homologues in chimpanzees (Cooper, Rubinsztein, and Amos 1998; Webster, Smith, and Ellegren 2002), loci in sheep are longer than those in cattle (Crawford et al. 1998), and there are also differences between the sibling species Drosophila melanogaster and Drosophila simulans (Amos et al. 2003).

To date, reciprocal studies have been conducted largely using relatively small numbers of cloned loci coupled with direct amplification. However, a more recent study has used microsatellite loci identified in human and chimpanzee sequence alignments from GenBank, finding consistent length difference for dinucleotide repeats but failing to show any difference for mono-, tri-, tetra-, and pentanucleotide repeats (Webster, Smith, and Ellegren 2002). With the advent of the human genome sequence (Lander et al. 2001) and, more recently, the draft chimpanzee genome sequence (Chimpanzee Genome Sequencing Consortium, unpublished data), along with powerful software tools to provide comparison between the two (Karolchik et al. 2003), a more comprehensive study is possible. In particular, with large numbers of loci, ascertainment bias and inherent length differences can be estimated in each available length class rather than as a single overall property. Here we quantify ascertainment bias and the extent of mean microsatellite length differences between humans and chimpanzees using a large number of microsatellite loci collated from the available human and chimpanzee genomes.


    Methods
Microsatellites were selected from the human genome sequence (Homo sapiens), July 2003 assembly, and the chimpanzee genome sequence (Pan troglodytes), November 2003 assembly, using the University of California Santa Cruz (UCSC) Genome Browser web interface (http://genome.ucsc.edu). Locations of all microsatellites at least eight repeat units in length and with a repeating motif of 2–5 bp inclusive were identified using the Tandem Repeat Finder Track on the Table Browser of the web interface (Benson 1999). A minimum cutoff of eight repeats was chosen to mimic the selection process used when generating new markers, although the Tandem Repeat Finder Track includes only repeating patterns that occupy at least 25 bases of sequence. We define the "focal" species as the species in which the microsatellite was first identified/developed.
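The selection criteria above (pure tracts of at least eight repeat units with a 2–5 bp motif) can be illustrated with a minimal Python sketch. This is not the Tandem Repeat Finder algorithm, which also tolerates degraded repeats; the function name and the regular-expression approach are assumptions made purely for illustration.

```python
import re

def find_pure_microsatellites(seq, min_repeats=8, min_motif=2, max_motif=5):
    """Return sorted (start, motif, repeat_count) tuples for pure tandem repeats.

    Illustrative only: exact-match repeats, no tolerance for interruptions.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_motif, max_motif + 1):
        # A k-base motif followed by at least (min_repeats - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit,
            # e.g. ACAC (really a dinucleotide) or AAA (a mononucleotide run).
            if any(k % d == 0 and motif == motif[:d] * (k // d)
                   for d in range(1, k)):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

# Hypothetical sequence: ten AC repeats between short flanks.
print(find_pure_microsatellites("GGTT" + "AC" * 10 + "TTGG"))
```

A tract of only seven repeats falls below the cutoff and is not reported, mirroring the eight-repeat minimum used here.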

We next sought to identify all those sequences for which homologous regions could be identified in both species. Overall homology between the two genomes is determined by the UCSC Genome Browser by first masking repeat sequences and then linking the best alignments for short stretches of genomic sequence to form a chain (Kent et al. 2003). A file containing the chain is available to download and was used as the basis for identifying homologous pairs of loci using a Visual Basic program that we wrote. Our program searches through the large number of microsatellite loci identified by the Tandem Repeat Finder and returns the subset of loci for which homologues are present in both species, along with their database coordinates.

Using the coordinates generated by our program, homologous pairs of microsatellite-containing sequences were downloaded from the UCSC Genome Browser. Sequence in the focal species included 200 bases of flanking sequence on either side. In the nonfocal species, 300 bases of flanking sequence were downloaded, the larger window allowing for possible insertion/deletion events. All loci for the focal species were on the positive strand. Sequences longer than 500 bases were removed from the analysis because such sequences typically contained large tracts of "N"s. In addition, sequences mapping to nonhomologous chromosomes were removed. To confirm homology, loci were "electronically amplified" by selecting theoretical primer sequences in the focal species and then searching for matching sequences in the flanking sequence of the homologue. Primer sequences were 15 bases in length, yielding products around 250 bases. The exact length depended on both a random component, to simulate variation seen in laboratory primer location selection, and on the repeat copy number of the microsatellite itself. Amplification was deemed successful if no more than two mutations were found in the primer sequence and if repetitive motifs in the nonfocal flanking sequences did not lead to the prediction of multiple products.
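The electronic amplification step can be sketched as follows. This is a simplified illustration rather than the program used in the study: the function names are hypothetical, mismatches are counted by Hamming distance only (no indels within primer sites), and the two-mismatch tolerance is assumed to apply to each primer separately.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def find_sites(primer, target, max_mismatches=2):
    """Start positions in target where primer matches within the tolerance."""
    n = len(primer)
    return [i for i in range(len(target) - n + 1)
            if hamming(primer, target[i:i + n]) <= max_mismatches]

def electronic_pcr(forward, reverse, target, max_mismatches=2):
    """Predicted product length, or None if amplification 'fails'.

    forward is written 5'->3' on the plus strand; reverse is written 5'->3'
    on the minus strand, so its plus-strand footprint is its reverse complement.
    """
    fwd_sites = find_sites(forward, target, max_mismatches)
    rev_sites = find_sites(revcomp(reverse), target, max_mismatches)
    # Exactly one site per primer is required, so repetitive flanks that
    # would predict multiple products are treated as failures.
    if len(fwd_sites) != 1 or len(rev_sites) != 1:
        return None
    start, end = fwd_sites[0], rev_sites[0] + len(reverse)
    return end - start if end > start else None

# Toy example with 5-base primers and exact matching (hypothetical sequences).
target = "TTGCA" + "AC" * 8 + "GGATC"
print(electronic_pcr("TTGCA", "GATCC", target, max_mismatches=0))
```

With 15-base primers, as used here, the default tolerance of two mismatches would apply; the toy example uses exact matching only because 5-base primers are too short to tolerate mismatches unambiguously.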

For any given reciprocal set of microsatellites derived using the same criteria in both species, the observed length differences can be used to calculate both the average extent of the ascertainment bias (Ab) and the average underlying length difference between the two species (Amos et al. 2003). First, we derive two equations for average microsatellite length in the nonfocal species:

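$$L_{HC} = L_{HH} - A_b - D_{(h-c)} \qquad (1)$$

$$L_{CH} = L_{CC} - A_b + D_{(h-c)} \qquad (2)$$

where $L_{XY}$ is the mean length of species X-derived loci measured in species Y (H, human; C, chimpanzee), and D(h–c) is the average size of any inherent difference in length between the two species expressed as length in humans minus the length in chimpanzees.

By rearranging equations (1) and (2), we derive expressions for both Ab and D(h–c):

$$A_b = \frac{(L_{HH} - L_{HC}) + (L_{CC} - L_{CH})}{2} \qquad (3)$$

$$D_{(h-c)} = \frac{(L_{HH} - L_{HC}) - (L_{CC} - L_{CH})}{2} \qquad (4)$$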
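As a worked illustration, the short Python sketch below computes Ab and D(h–c) from the four class means. It assumes the additive model in which the mean length in the nonfocal species equals the focal mean minus Ab, further reduced by D(h–c) for human-derived loci and increased by D(h–c) for chimpanzee-derived loci. The function name and the input values are invented for the example and are not data from this study.

```python
def bias_and_difference(l_hh, l_hc, l_cc, l_ch):
    """Return (Ab, D) from four mean lengths.

    l_xy is the mean length of species-x-derived loci measured in species y
    (h = human, c = chimpanzee). Assumes the additive model described above.
    """
    drop_h = l_hh - l_hc  # equals Ab + D under the assumed model
    drop_c = l_cc - l_ch  # equals Ab - D under the assumed model
    ab = (drop_h + drop_c) / 2
    d = (drop_h - drop_c) / 2
    return ab, d

# Hypothetical means for one length class of 20 repeats in the focal species.
ab, d = bias_and_difference(l_hh=20.0, l_hc=16.5, l_cc=20.0, l_ch=17.5)
print(ab, d)  # 3.0 0.5
```

In this invented case the human-derived loci lose 3.5 repeats on average in chimpanzees while chimpanzee-derived loci lose only 2.5 in humans; half the sum is attributed to ascertainment bias and half the difference to humans being inherently longer.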

Microsatellite mutation rates vary with repeat number, repeat type, and repeat tract purity (Weber and Wong 1993; Brinkmann et al. 1998; Primmer et al. 1998; Schug et al. 1998). Consequently, we divided our data set into a number of classes according to the number of repeats they contained in the focal species and the repeat motif: (AC)n, non-AC di-, tri-, tetra-, and pentanucleotides. Given that interrupted repeats appear to evolve more slowly than uninterrupted repeats, loci were also divided according to interruption state. We defined an interruption as any motif of equal length or shorter than the repeat unit itself that broke the repeating pattern of the microsatellite. Thus (AC)9(TT)1(AC)2 was defined as an interrupted microsatellite 12 repeat units in length with a longest pure repeat stretch of nine repeat units. To calculate D(h–c) and Ab, length was defined as the length of the total repeat tract, ignoring any interruptions. Extreme length classes contain too few loci to make meaningful estimates. Therefore, subsequent to calculation of Ab and D(h–c), wherever necessary, adjacent length classes were pooled to create classes containing a minimum of 10 loci, with the length of each amalgamated class defined as the average length of the loci it contains. Where Tandem Repeat Finder identified loci with a pure repeat tract shorter than eight, such loci were excluded.
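The classification rules can be made concrete with a small Python sketch. The block-list representation of a tract and the function name are assumptions for illustration; following the (AC)9(TT)1(AC)2 example, an interrupting unit is counted toward total tract length.

```python
def describe_tract(blocks, motif, min_pure=8):
    """Summarize a repeat tract given as a list of (unit, count) blocks.

    Hypothetical representation: [("AC", 9), ("TT", 1), ("AC", 2)] stands
    for (AC)9(TT)1(AC)2.
    """
    total_units = sum(count for _, count in blocks)
    longest_pure = max((count for unit, count in blocks if unit == motif),
                       default=0)
    return {
        "total": total_units,               # whole tract, interruptions included
        "longest_pure": longest_pure,       # longest uninterrupted stretch
        "interrupted": any(unit != motif for unit, _ in blocks),
        "included": longest_pure >= min_pure,  # pure tracts under 8 are dropped
    }

print(describe_tract([("AC", 9), ("TT", 1), ("AC", 2)], "AC"))
```

For the example in the text this yields a total of 12 units, a longest pure stretch of nine, and an interrupted status, so the locus would be retained in the interrupted class.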

Among the repeat types we consider, only trinucleotide repeats can exist within coding regions without disrupting the reading frame every time a new slippage event occurs. Because this possibility may constrain the way in which some trinucleotide repeats evolve, we further used the UCSC Genome Browser to identify those loci located in known coding regions. To do this we used the "RefSeq Genes" track that shows known genes taken from curated mRNA sequences compiled in LocusLink (National Center for Biotechnology Information).


    Results
Around 70% of human microsatellites we identified had identifiable homologues in chimpanzees, while the converse figure was 77%. Presumably the higher chimpanzee to human rate is due to the greater completeness of the human genome. After discarding loci where no homologue was found, where base ambiguities were present, and where one or both primer sites were absent or deemed not functional, we were left with a total of 67,120 human-derived and 63,781 chimpanzee-derived loci for the main analyses.

The proportion of loci discarded for a given reason may itself be informative about microsatellite evolution. For example, if long microsatellites tend to degrade through large deletions or rearrangements, the proportion of sequences discarded because homologous sequences could not be found should increase with focal microsatellite length. To test if any of the possible reasons for discarding a sequence correlated with length, we fitted a series of generalized linear models with binomial error structure and length and interruption status as explanatory factors. Models were fitted for the following discard reasons: homologous sequence not found, one or both primer sites not found, multiple good matches for primer sites found, and sequence contained many N bases. The majority (72%) of discards were due either to no homologous sequence being found on the Genome Browser or because both primers mismatched. In neither of these two models did we find a significant relationship with either length or interruption status. In other models some significant relationships were found, but these accounted only for a small fraction of all loci discarded.

Figure 1 depicts how ascertainment bias varies with pure repeat length in the focal species for all repeat classes and for both pure and interrupted microsatellites. For reference, on each graph the data for non-AC dinucleotides are also included. In all cases, ascertainment bias increases more or less monotonically with microsatellite length. Several trends are apparent. First, although the slopes of bias against length are similar, for any given length of microsatellite, the strength of bias tends to be higher where the repeat unit is longer. Second, the slope of bias against length is shallower for interrupted compared with pure repeats, suggesting that the size of the ascertainment bias is determined more by the number of repeats than by the length of the repeat tract. Third, maximum repeat number for longer repeat unit lengths is less than for dinucleotides, though whether this is a genuine pattern or merely reflects the lower total number of tri-, tetra-, and pentanucleotides is unclear.


FIG. 1. Dependence of ascertainment bias, Ab, on repeat length for a range of repeat types. Panels A–E show results for uninterrupted repeats. Panels F–J depict equivalent results for repeat tracts with at least one interruption. To allow comparison, in all cases average Ab is plotted for the length of longest uninterrupted repeat tract in the microsatellite. Where there are fewer than 10 loci for a given repeat length, data are grouped into classes. Gray shading indicates Ab for non-(AC)n dinucleotide repeats and is shown on all plots with a horizontal dashed line at Ab = 0, also for comparison.

 
Having calculated ascertainment bias, it is then possible to estimate the average length difference corrected for bias, D(h–c), for any given microsatellite length in the focal species (fig. 2). Again, the values for pure, non-AC dinucleotides are included on each panel for comparison. The most general trend is that, wherever sample size is large, human microsatellites are longer than their chimpanzee homologues. Given that microsatellite mutation rates are thought to increase with microsatellite length, it is perhaps surprising to note that the relationship between D(h–c) and microsatellite length is more or less flat over much of its range. Only among the higher repeat number classes does the value of D(h–c) show much variation, sometimes rising significantly above the mean value for loci 10–20 repeats in length and sometimes dipping significantly below zero. Among these longer microsatellites, the patterns appear complicated and poorly defined due to the decreasing sample sizes of available loci, but there is a suggestion of an oscillation of increasing amplitude. It is also interesting to compare equivalent plots for pure and interrupted repeats. With pure repeats, in all but one case the longest length class has the smallest D(h–c) value, the exception being trinucleotides. In contrast, for interrupted repeats, the longest length class tends to have the highest D(h–c) value.


FIG. 2. Dependence of absolute differences, D(h–c), on focal species microsatellite length for a range of repeat types. D(h–c) is based on difference in number of repeats for uninterrupted repeats (A–E) and for repeats with at least one interruption (F–J). To allow comparison, in each case average D(h–c) is plotted for the length of longest uninterrupted repeat tract in the microsatellite. Where there are fewer than 10 loci for a given repeat length, data are grouped into classes. Error bars show standard error of mean D(h–c). Gray shading indicates D(h–c) for uninterrupted non-(AC)n dinucleotide repeats and is shown on all plots with horizontal dashed lines at D(h–c) = 0 and D(h–c) = 1, also for comparison.

 
The exceptionally low values of D(h–c) for trinucleotide repeats up to around 20 units in length suggest that length changes may be constrained. This constraint might arise because trinucleotide repeats are likely to be more common than other repeat motifs in coding regions where they can mutate without disrupting the reading frame. To test this prediction, we compared plots of both Ab and D(h–c) for trinucleotide repeats in coding and noncoding regions separately (data not shown). Unfortunately, although trinucleotide repeats occur six times more frequently in coding regions than other repeat motifs, the total number of repeats found in coding regions remained small. Consequently, Ab and D(h–c) values could be obtained only for a few of the shorter length classes, and the resulting values appear similar to those for equivalent repeats in noncoding regions.

By counting repeat units in sequence data, we ignore any changes in length that occur in the flanking sequences, and hence, our results are not directly comparable with empirical studies. To address this limitation, we used electronic amplification to mimic PCR. The resulting plots (fig. 3) do not generate any obvious or consistent change in pattern, suggesting that any possible instability near the upper length threshold does not in turn make the flanking sequences prone to frequent, large insertions or deletions. However, electronic amplification does tend to increase the average value of D(h–c), and for several repeat types this increase is by a factor of around two. This surprising result indicates that slippage within the repeat tract is not the only mechanism operating to generate consistent length differences but that insertions and/or deletions in the flanking regions contribute almost as much. One explanation could be that the flanking regions contain repetitive sequences that also mutate in length through slippage.


FIG. 3. Dependence of absolute differences, D(h–c), on electronically amplified focal species microsatellite length for a range of repeat types. D(h–c) is estimated from the difference in electronic PCR product size for uninterrupted repeats (A–E) and for repeats with at least one interruption (F–J). For comparison, gray shading indicates D(h–c) for non-(AC)n dinucleotide repeats with D(h–c) measured using actual repeat length as in figure 2. Plotting conventions also as figure 2.

 
If microsatellites nearing their upper length threshold tend to become unstable and suffer deletions of many or all of their repeat units, we might expect some long and short microsatellites to have unexpectedly short or long homologues, respectively. To examine this possibility, we plotted 2-dimensional frequency plots of length in humans against length in chimpanzees, expressed as percentages of the number of loci of a given length in the focal species (fig. 4). As expected, most data lie on or close to the diagonals where the sequenced alleles are of similar length in both species. In addition, two other areas of the plot show nonzero counts. First, there are a few loci that are currently short in the focal species and much longer in the nonfocal species (e.g., panel C). Second, there is a much larger number of data points representing cases where the locus is long in the focal species and much shorter in the nonfocal species. Noticeably, the proportion of loci on the diagonal as opposed to elsewhere decreases with increasing motif length, such that for dinucleotides most data lie on the diagonal, while for tetranucleotides (and pentanucleotides; data not shown) the commonest class involves loci that are long in the focal species and very short in the nonfocal species. Overall, these patterns seem most consistent with a process whereby long microsatellites become unstable and delete.
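The construction of such a frequency plot can be sketched in Python: for each focal length, counts of nonfocal lengths are converted to percentages of the loci at that focal length, and each cell is then assigned a shade. The function names and the example pairs are hypothetical; the shading thresholds follow the figure 4 legend.

```python
from collections import Counter, defaultdict

def row_percentages(pairs):
    """Map focal length -> {nonfocal length: percent of loci at that focal length}.

    pairs is an iterable of (focal_len, nonfocal_len) tuples, one per locus.
    """
    counts = defaultdict(Counter)
    for focal, nonfocal in pairs:
        counts[focal][nonfocal] += 1
    table = {}
    for focal, row in counts.items():
        total = sum(row.values())
        table[focal] = {nonfocal: 100.0 * n / total for nonfocal, n in row.items()}
    return table

def shade(pct):
    """Shading class for a cell, using the thresholds of the figure 4 legend."""
    if pct >= 15:
        return "black"
    if pct >= 10:
        return "dark gray"
    if pct >= 5:
        return "light gray"
    return "white"

# Invented data: ten loci of focal length 20, one with a very short homologue.
pairs = [(20, 20)] * 6 + [(20, 19)] * 3 + [(20, 3)]
table = row_percentages(pairs)
print({k: (v, shade(v)) for k, v in table[20].items()})
```

Normalizing within each focal length, rather than over the whole data set, keeps rare long length classes visible alongside the far more numerous short classes.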


FIG. 4. Similarity in length among human and chimpanzee homologous microsatellites. Microsatellites identified in humans are shown separately from those identified in chimpanzees. The distribution of lengths in the nonfocal species is plotted for a given allele length in the focal species, expressed as a percentage of the total number of loci for that focal length. Percentage is indicated by color: white, 0 to <5; light gray, 5 to <10; dark gray, 10 to <15; black, ≥15. A and B show frequencies of dinucleotide and tetranucleotide repeats, respectively, for uninterrupted loci identified in humans. C and D show the same plots for uninterrupted loci identified in chimpanzees.

 
The alternative explanation for the absence of very long pure repeat tracts in the nonfocal species is that long tracts become increasingly prone to interruption mutations. Under this model, long microsatellites exist, but are classified as being much shorter based on their longest tract of pure repeats. In the focal species, such long, interrupted tracts are excluded during selection because selection looks only for pure tracts. In this way, interruptions will on average cause an ascertainment bias in which nonfocal microsatellites tend to appear shorter than their homologues. To explore how interrupted loci in the nonfocal species affect ascertainment bias, we recalculated both ascertainment bias and D(h–c) for four subsets of data partitioned according to whether focal and nonfocal loci were interrupted or pure (fig. 5). Ascertainment bias is strongest when interrupted loci are excluded in the nonfocal species and weakest when pure loci are excluded in the nonfocal species. Such a pattern suggests that interruption mutations do contribute to ascertainment bias and is consistent with the idea that interrupted loci tend to be long: when nonfocal loci are only included if they are pure, this excludes some of the longer loci, decreasing mean length and increasing ascertainment bias.
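The effect of excluding pure or interrupted nonfocal loci can be illustrated with a short Python sketch. The record fields, function name, and lengths below are invented for the example, and Ab is computed as the average of the two mean focal-minus-nonfocal length differences, which assumes the additive model used for the reciprocal comparison.

```python
from statistics import mean

def ascertainment_bias(loci, keep=lambda locus: True):
    """Ab as the average, over both focal species, of the mean length drop.

    Each locus is a dict with hypothetical keys: 'focal' ('human' or 'chimp'),
    'focal_len', 'nonfocal_len', and 'nonfocal_pure' (bool). Raises
    statistics.StatisticsError if a species has no loci left after filtering.
    """
    drops = []
    for species in ("human", "chimp"):
        diffs = [l["focal_len"] - l["nonfocal_len"]
                 for l in loci if l["focal"] == species and keep(l)]
        drops.append(mean(diffs))
    return (drops[0] + drops[1]) / 2

# Invented loci in which interrupted nonfocal homologues are the longer ones.
loci = [
    {"focal": "human", "focal_len": 20, "nonfocal_len": 16, "nonfocal_pure": True},
    {"focal": "human", "focal_len": 20, "nonfocal_len": 19, "nonfocal_pure": False},
    {"focal": "chimp", "focal_len": 20, "nonfocal_len": 17, "nonfocal_pure": True},
    {"focal": "chimp", "focal_len": 20, "nonfocal_len": 20, "nonfocal_pure": False},
]
print(ascertainment_bias(loci))                                     # all loci
print(ascertainment_bias(loci, lambda l: l["nonfocal_pure"]))       # pure only
print(ascertainment_bias(loci, lambda l: not l["nonfocal_pure"]))   # interrupted only
```

With these made-up values the bias is largest when only pure nonfocal loci are kept and smallest when only interrupted ones are kept, which is the qualitative pattern described above.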


FIG. 5. Dependence of D(h–c) and ascertainment bias, Ab, on exclusion of either the pure or interrupted subset of nonfocal loci. A and B, D(h–c) and Ab for pure repeats. C and D, D(h–c) and Ab for interrupted repeats. In each plot, nonfocal loci are excluded if pure (crosses), excluded if interrupted (open circles), or all included (filled squares). Data are for dinucleotide repeats only. Other plotting conventions as figures 2 and 3.

 

    Discussion
To learn more about the nature of microsatellite length differences between species, we have used published genome sequences for humans and chimpanzees to analyze a large number of homologous pairs of loci. The wealth of available data has allowed us to calculate both ascertainment bias and average length difference partitioned by both repeat type and repeat number. We find that ascertainment bias increases with microsatellite length and is the dominant factor among long microsatellites. In contrast, after statistically removing the effect of ascertainment bias, there remains a small but highly consistent residual length difference that varies little with repeat number and almost invariably involves human microsatellites being longer than their chimpanzee homologues. Only among the longest focal microsatellites does this residual length difference show much variation. When we examine possible causes of ascertainment bias, we find evidence that both large deletions and interruptions play a role.

Previous studies have raised the question of whether ascertainment bias for microsatellite length exists and have then shown that it does by means of reciprocal amplification between species. However, few have attempted to measure the size of the bias and, where this has been done, data have been too sparse to explore how the bias might vary with variables such as repeat number or repeat type. Here we have used complete genome sequences to measure ascertainment bias across a wide range of microsatellite motifs and lengths. Our approach lacks some power because we do not have data on mean length or polymorphism and instead have to use each single allele to represent the locus from which it was drawn. However, this appears more than compensated for by the very large number of loci we are able to study.

Overall, the pattern is quite striking. Below around 15 repeats, ascertainment bias is small or in some cases negative, but as repeat number increases so does the strength of the ascertainment bias, reaching a peak of around 15 repeats for the longest loci we found. The negative ascertainment bias seen for short dinucleotide microsatellites probably results from the manner in which repeat tract length is measured. The minimum length identified by Tandem Repeat Finder is 25 bases, including interruptions and allowing some repeat degradation (Benson 1999). This corresponds to a dinucleotide repeat tract length of 13 units. However, repeat tract length was measured using a custom Visual Basic program that followed a different set of rules and found many repeat tracts shorter than 13 units. Because Tandem Repeat Finder was only able to identify these shorter repeats as part of longer, degraded repeat tracts, they have undergone selection for impurity. This could result in a reversal of the ascertainment bias expected from selection for length or purity.

Averaged across all dinucleotide loci, ascertainment bias is estimated to be 2.31, similar to but a little larger than the value of 1.97 obtained empirically by Cooper, Rubinsztein, and Amos (1998). One reason for the higher value in our study may be that we jointly estimate two forms of ascertainment bias, one due to selection of loci with greater mean length in the focal species and a second that may arise due to selecting relatively long, pure alleles from the database to represent each locus. The latter can be illustrated by considering a short locus with mean length of, say, seven repeats. Among all such loci, only those where the sequenced allele is longer than average (i.e., carries eight or more repeats) will be included in our study.
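This second form of bias can be illustrated numerically. The Python sketch below uses an invented allele-frequency distribution for a locus with a mean of seven repeats and computes the mean length of sequenced alleles that pass the eight-repeat cutoff; the function names and frequencies are assumptions for illustration only.

```python
from fractions import Fraction

def true_mean(allele_freqs):
    """Mean repeat number over all alleles at the locus."""
    return sum(n * f for n, f in allele_freqs.items())

def mean_if_selected(allele_freqs, min_repeats=8):
    """Mean repeat number of sampled alleles that pass the length cutoff.

    Returns None when no allele reaches the cutoff.
    """
    kept = {n: f for n, f in allele_freqs.items() if n >= min_repeats}
    total = sum(kept.values())
    if total == 0:
        return None
    return sum(n * f for n, f in kept.items()) / total

# Hypothetical locus: repeat number -> allele frequency, mean of seven repeats.
freqs = {5: Fraction(1, 10), 6: Fraction(2, 10), 7: Fraction(4, 10),
         8: Fraction(2, 10), 9: Fraction(1, 10)}
print(true_mean(freqs))         # 7
print(mean_if_selected(freqs))  # 25/3
```

In this made-up case the alleles that enter the study average 25/3 (about 8.3) repeats although the locus mean is seven, so a homologue of identical mean length, sampled without any cutoff, would appear shorter by more than one repeat.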

Over and above the ascertainment bias, we have also confirmed previous studies suggesting that human microsatellites tend to be longer than their chimpanzee homologues. Having statistically removed ascertainment bias, the residual length difference is small but highly consistent, being around one to two repeat units across most repeat numbers and most motif types. The lack of dependence on repeat number is particularly surprising because any small difference would be expected to be magnified among longer microsatellites where the mutation rate is presumed to be higher. At present we are unable to find a convincing explanation for why the length difference should appear unlinked to mutation rate, though the fact that our electronic amplification analysis reveals a further component of D(h–c) of similar magnitude over and above differences in repeat number suggests that nonslippage-based mutations may be involved. This aspect clearly requires further analysis.

An ascertainment bias can arise in several different ways. Most obviously, there may be an upper length boundary at which microsatellites become unstable and either delete or accumulate internal point mutations that reduce their apparent length. A locus sampled near this boundary will on average tend to be longer than its homologues in other species because in some cases alleles in the nonfocal species will have been shortened either by deletion or by interruption. An alternative model operates through repeat tract purity. If biased mutation causes continual expansion in length over the range of lengths used as markers and if interruption mutations reduce the rate of slippage, long pure microsatellites will again on average be longer than their homologues, some of which will have been expanding more slowly after having suffered an interruption mutation. These propositions raise two key questions. First, what is the nature of the upper length boundary? Second, how do interruption mutations affect the size of the ascertainment bias?

Our analysis suggests that ascertainment bias is caused at least in part by deletion of long microsatellites at some upper length threshold. Two-dimensional frequency plots of allele length in focal versus nonfocal species reveal that most loci are of similar length in both species, as expected of a stepwise mutation process in which large changes of length are rare. The main exceptions occur among very long and very short microsatellites. Here, there appears to be a significant incidence of loci that are long in one species and short in the other, suggesting that large length changes do occur but mainly involve very long or very short microsatellites. This is consistent with long microsatellites suffering large deletions (Ellegren 2000; Harr and Schlötterer 2000; Xu et al. 2000; Huang et al. 2002). Finally, long focal microsatellites exhibit a wide range of lengths in the nonfocal species, supporting a process of progressive deletion in medium-large jumps rather than single events that remove most or all of the repeat units. The actual distribution of mutation sizes will only be revealed with detailed modeling.

Similarly, the mechanism through which multiple repeat units are deleted is unclear. Presumably, the deletion events arise due either to an increase in the rate of large slipped-strand mispairing deletion or to a decrease in the rate at which such mutations are repaired. An alternative mechanism could involve unequal crossing-over during recombination. Although it is widely accepted that, unlike minisatellites, microsatellites mutate primarily through slipped-strand mispairing (Levinson and Gutman 1987; Schlötterer 2000), unequal crossing-over mutations have been observed in alanine-encoding repeats involved in disease loci (Warren 1997). However, while such mutation events could explain the deletion events that we see, they should be accompanied by corresponding medium-large expansion mutations, and these are generally not observed.

Despite the evidence that deletion events affect long microsatellites, providing support for one mechanism of ascertainment bias, we looked for evidence that interruptions also play a role. We find that ascertainment bias does indeed weaken when only loci that are pure in both species are considered and strengthens when the loci are pure in the focal species and interrupted in the nonfocal species. We argue that this supports a model where ascertainment bias accrues through interruption mutations causing otherwise long microsatellites to be classified as short, where length is defined by the longest tract of pure repeats. Thus, consider pure microsatellites of length X identified in species A. Under an unbounded stepwise mutation model, half of their homologues will be longer and half shorter in a second species B. However, if mutations that interrupt the repeat tract have also occurred in the lineage leading to species B, such loci will have shorter tracts of pure repeats and create a length ascertainment bias. Our data show that this is indeed an important component of the overall ascertainment bias.
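
The effect of a single interruption on measured length can be shown with a short example. This is a minimal sketch, assuming length is scored as the longest run of perfect, consecutive motif copies; the function and the example sequences are hypothetical and do not reproduce the Visual Basic program used in the study.

```python
def longest_pure_tract(sequence, motif):
    """Return the largest number of consecutive, uninterrupted motif copies."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    k = len(motif)
    best = 0
    for start in range(len(sequence)):
        count = 0
        pos = start
        # Extend the run while perfect copies of the motif continue.
        while sequence[pos:pos + k] == motif:
            count += 1
            pos += k
        best = max(best, count)
    return best


pure = "CA" * 20
# Hypothetical point mutation: the A of the 10th unit (0-based index 19) becomes G.
interrupted = pure[:19] + "G" + pure[20:]

print(longest_pure_tract(pure, "CA"))         # 20
print(longest_pure_tract(interrupted, "CA"))  # 10
```

Here the two sequences have the same total length, yet the interrupted homologue is scored as half as long, which is the kind of apparent shortening described above.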

In conclusion, we have shown that microsatellite length is influenced by both an ascertainment bias and a consistent length difference between species. The ascertainment bias increases greatly as focal length increases, peaking around 15 repeats, and appears to operate through long microsatellites being lost by one of two mechanisms: either becoming unstable and deleting or by point mutations reducing the length of their longest pure stretch of repeats. By and large, similar processes operate in di-, tri-, tetra-, and pentanucleotide repeats, though the upper length threshold is shorter for longer repeat motifs. In addition to ascertainment bias, almost all repeat types and repeat numbers examined exhibit a greater true length in humans, the average difference in length being small, of the order of one to two repeat units, and varying little with repeat number. This lack of length dependency, coupled with a similar-sized difference due to mutations in the flanking sequence, suggests the existence of a more general, nonslippage-based mechanism by which the human genome is becoming larger relative to that of chimpanzees. Our analysis highlights the wealth of possibilities for more detailed modeling of microsatellite evolution using the vast numbers of loci that can be found in genome sequences.


    Acknowledgements
This work was supported by a Natural Environment Research Council studentship and a Cambridge Philosophical Society research studentship. We are grateful for helpful comments on the manuscript from Manfred Kayser.


    Footnotes
 
Naruya Saitou, Associate Editor


    References

    Amos, W., C. M. Hutter, M. D. Schug, and C. F. Aquadro. 2003. Directional evolution of size coupled with ascertainment bias for variation in Drosophila microsatellites. Mol. Biol. Evol. 20:660–662.

    Amos, W., S. J. Sawcer, R. W. Feakes, and D. C. Rubinsztein. 1996. Microsatellites show mutational bias and heterozygote instability. Nat. Genet. 13:390–391.

    Benson, G. 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27:573–580.

    Brinkmann, B., M. Klintschar, F. Neuhuber, J. Huhne, and B. Rolf. 1998. Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. Am. J. Hum. Genet. 62:1408–1415.

    Cooper, G., N. J. Burroughs, D. A. Rand, D. C. Rubinsztein, and W. Amos. 1999. Markov chain Monte Carlo analysis of human Y-chromosome microsatellites provides evidence of biased mutation. Proc. Natl. Acad. Sci. USA 96:11916–11921.

    Cooper, G., D. C. Rubinsztein, and W. Amos. 1998. Ascertainment bias cannot entirely account for human microsatellites being longer than their chimpanzee homologues. Hum. Mol. Genet. 7:1425–1429.

    Coote, T., and M. W. Bruford. 1996. Human microsatellites applicable for analysis of genetic variation in apes and old world monkeys. J. Hered. 87:406–410.

    Crawford, A. M., S. M. Kappes, K. A. Paterson, M. J. deGotari, K. G. Dodds, B. A. Freking, R. T. Stone, and C. W. Beattie. 1998. Microsatellite evolution: testing the ascertainment bias hypothesis. J. Mol. Evol. 46:256–260.

    Davis, C. S., T. S. Gelatt, D. Siniff, and A. Strobeck. 2002. Dinucleotide microsatellite markers from the Antarctic seals and their use in other Pinnipeds. Mol. Ecol. Notes 2:203–208.

    Ellegren, H. 2000. Heterogeneous mutation processes in human microsatellite DNA sequences. Nat. Genet. 24:400–402.

    Ellegren, H., S. Moore, N. Robinson, K. Byrne, W. Ward, and B. C. Sheldon. 1997. Microsatellite evolution—a reciprocal study of repeat lengths at homologous loci in cattle and sheep. Mol. Biol. Evol. 14:854–860.

    Ellegren, H., C. R. Primmer, and B. C. Sheldon. 1995. Microsatellite evolution—directionality or bias. Nat. Genet. 11:360–362.

    Harr, B., and C. Schlötterer. 2000. Long microsatellite alleles in Drosophila melanogaster have a downward mutation bias and short persistence times, which cause their genome-wide underrepresentation. Genetics 155:1213–1220.

    Huang, Q. Y., F. H. Xu, H. Shen, H. Y. Deng, Y. J. Liu, Y. Z. Liu, J. L. Li, R. R. Recker, and H. W. Deng. 2002. Mutation patterns at dinucleotide microsatellite loci in humans. Am. J. Hum. Genet. 70:625–634.

    Hutter, C. M., M. D. Schug, and C. F. Aquadro. 1998. Microsatellite variation in Drosophila melanogaster and Drosophila simulans: a reciprocal test of the ascertainment bias hypothesis. Mol. Biol. Evol. 15:1620–1636.

    Karolchik, D., R. Baertsch, M. Diekhans et al. (13 co-authors). 2003. The UCSC genome browser database. Nucleic Acids Res. 31:51–54.

    Kayser, M., P. Nurnberg, F. Bercovitch, M. Nagy, and L. Roewer. 1995. Increased microsatellite variability in Macaca-Mulatta compared to humans due to a large-scale deletion insertion event during primate evolution. Electrophoresis 16:1607–1611.

    Kayser, M., L. Roewer, M. Hedman et al. (14 co-authors). 2000. Characteristics and frequency of germline mutations at microsatellite loci from the human Y chromosome, as revealed by direct observation in father/son pairs. Am. J. Hum. Genet. 66:1580–1588.

    Kent, W. J., R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler. 2003. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. USA 100:11484–11489.

    Kruglyak, S., R. T. Durrett, M. D. Schug, and C. F. Aquadro. 1998. Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc. Natl. Acad. Sci. USA 95:10774–10778.

    Lander, E. S., L. M. Linton, B. Birren et al. (255 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.

    Levinson, G., and G. A. Gutman. 1987. Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol. 4:203–221.

    Primmer, C. R., and H. Ellegren. 1998. Patterns of molecular evolution in avian microsatellites. Mol. Biol. Evol. 15:997–1008.

    Primmer, C. R., N. Saino, A. P. Moller, and H. Ellegren. 1998. Unraveling the processes of microsatellite evolution through analysis of germ line mutations in barn swallows Hirundo rustica. Mol. Biol. Evol. 15:1047–1054.

    Rico, C., I. Rico, and G. Hewitt. 1996. 470 million years of conservation of microsatellite loci among fish species. Proc. R. Soc. B 263:549–557.

    Rubinsztein, D. C., W. Amos, J. Leggo, S. Goodburn, S. Jain, S. H. Li, R. L. Margolis, C. A. Ross, and M. A. Fergusonsmith. 1995. Microsatellite evolution—evidence for directionality and variation in rate between species. Nat. Genet. 10:337–343.

    Schlötterer, C. 2000. Evolutionary dynamics of microsatellite DNA. Chromosoma 109:365–371.

    Schlötterer, C., and D. Tautz. 1992. Slippage synthesis of simple sequence DNA. Nucleic Acids Res. 20:211–215.

    Schug, M. D., C. M. Hutter, K. A. Wetterstrand, M. S. Gaudette, T. F. C. Mackay, and C. F. Aquadro. 1998. The mutation rates of di-, tri- and tetranucleotide repeats in Drosophila melanogaster. Mol. Biol. Evol. 15:1751–1760.

    Warren, S. T. 1997. Polyalanine expansion in synpolydactyly might result from unequal crossing-over of HOXD13. Science 275:408–409.

    Weber, J. L., and C. Wong. 1993. Mutation of human short tandem repeats. Hum. Mol. Genet. 2:1123–1128.

    Webster, M. T., N. G. C. Smith, and H. Ellegren. 2002. Microsatellite evolution inferred from human-chimpanzee genomic sequence alignments. Proc. Natl. Acad. Sci. USA 99:8748–8753.

    Xu, X., M. Peng, Z. Fang, and X. P. Xu. 2000. The direction of microsatellite mutations is dependent upon allele length. Nat. Genet. 24:396–399.

    Zardoya, R., D. M. Vollmer, C. Craddock, J. T. Streelman, S. Karl, and A. Meyer. 1996. Evolutionary conservation of microsatellite flanking regions and their use in resolving the phylogeny of cichlid fishes (Pisces: Perciformes). Proc. R. Soc. Lond. B Biol. Sci. 263:1589–1598.

    Zhu, Y., D. C. Queller, and J. E. Strassmann. 2000. A phylogenetic perspective on sequence evolution in microsatellite loci. J. Mol. Evol. 50:324–338.

Accepted for publication November 14, 2005.

