MBE Advance Access originally published online on November 9, 2005
Molecular Biology and Evolution 2006 23(2):450-468; doi:10.1093/molbev/msj050
Research Article
The Origins of Eukaryotic Gene Structure
Department of Biology, Indiana University, Bloomington
E-mail: milynch{at}indiana.edu.
Abstract
Most of the phenotypic diversity that we perceive in the natural world is directly attributable to the peculiar structure of the eukaryotic gene, which harbors numerous embellishments relative to the situation in prokaryotes. The most profound changes include introns that must be spliced out of precursor mRNAs, transcribed but untranslated leader and trailer sequences (untranslated regions), modular regulatory elements that drive patterns of gene expression, and expansive intergenic regions that harbor additional diffuse control mechanisms. Explaining the origins of these features is difficult because they each impose an intrinsic disadvantage by increasing the genic mutation rate to defective alleles. To address these issues, a general hypothesis for the emergence of eukaryotic gene structure is provided here. Extensive information on absolute population sizes, recombination rates, and mutation rates strongly supports the view that eukaryotes have reduced genetic effective population sizes relative to prokaryotes, with especially extreme reductions being the rule in multicellular lineages. The resultant increase in the power of random genetic drift appears to be sufficient to overwhelm the weak mutational disadvantages associated with most novel aspects of the eukaryotic gene, supporting the idea that most such changes are simple outcomes of semi-neutral processes rather than direct products of natural selection. However, by establishing an essentially permanent change in the population-genetic environment permissive to the genome-wide repatterning of gene structure, the eukaryotic condition also promoted a reliable resource from which natural selection could secondarily build novel forms of organismal complexity. Under this hypothesis, arguments based on molecular, cellular, and/or physiological constraints are insufficient to explain the disparities in gene, genomic, and phenotypic complexity between prokaryotes and eukaryotes.
Key Words: complexity; gene networks; gene regulation; gene structure; genome evolution; genetic draft; introns; modularity; mutation; natural selection; population size; pleiotropy; random genetic drift; recombination; subfunctionalization; transcription factors; UTR
Introduction
Although full-genome sequencing has revealed numerous patterns of variation in genomic architecture among major taxonomic groups, a formidable, remaining challenge is to transform the descriptive field of comparative genomics into a more mechanistic theory of evolutionary genomics. Such an enterprise does not have to start from scratch. Nearly a century of mathematical derivation has resulted in a formal theory for evolution based on the expected dynamics of gene-frequency changes. Initially dubbed the Modern Synthesis by Huxley in 1942 and having experienced further enhancements since then, this theory has survived so much empirical scrutiny that the credibility of any proposed scenario for genome evolution must remain in doubt until shown to be consistent with basic population-genetic principles. In turn, if a mechanistic understanding of genome evolution is to be achieved, population-genetic theory will need to go beyond its reliance on algebraic formulations involving selection, mutation, recombination, and random genetic drift to incorporate the DNA-level constraints that are now known to define the evolutionary playing field.
Ever since Darwin, the vast majority of biologists have invoked natural selection as the primary, and in many cases the only, explanation for observed patterns of variation at most levels of organization. This greatly oversimplifies the evolutionary process. For example, Kimura, Ohta, and several contemporaries showed why numerous aspects of DNA sequence evolution cannot be explained entirely in terms of adaptive processes (reviewed in Kimura 1983; Ohta 1997). The neutral (or nearly neutral) theory that emerged from this work still enjoys a central place in the field of molecular evolution and has been applied to some aspects of evolutionary genomics (Force et al. 1999, 2005; Lynch et al. 2001, Lynch 2002; Lynch and Conery 2003; Lynch, Scofield, and Hong 2005b). The goal of this paper is to expand on these previous results to demonstrate the plausibility of the hypothesis that many of the unique complexities of the eukaryotic gene arose by semi-neutral processes with little, if any, direct involvement of positive selection.
Although eukaryotes share many basic aspects of transcription, translation, and replication with their prokaryotic ancestors, there are profound differences at the level of gene architecture. Prokaryotic genes are often organized into operons that are transcribed into polycistronic units, whereas with few exceptions, eukaryotic genes are transcribed as single-gene units. Unlike prokaryotic genes, eukaryotic genes often have complex regulatory regions, and in multicellular species such regions often have a modular structure that helps facilitate tissue-specific expression. Eukaryotic protein-coding genes also often contain introns, whereas prokaryotic genes do not, and eukaryotic transcripts generally contain longer untranslated leader and terminal sequences (untranslated regions [UTRs]) than do those of prokaryotes.
Three general observations have encouraged the view that these kinds of increases in gene complexity were necessary prerequisites to the origin of organisms with multiple cell types: (i) most aspects of gene architecture are much more elaborate in multicellular than unicellular eukaryotes; (ii) similar forms of genomic architecture are found in the two major and independently evolved multicellular lineages, animals and land plants; and (iii) more complex genes can often carry out more complex sets of tasks (Raff 1996; Gerhart and Kirschner 1997; Davidson 2001; Carroll, Grenier, and Weatherbee 2001). However, despite these clear associations, the direction of causality in the link between genome and organismal complexity is far from certain. There is no direct evidence that multicellularity itself was promoted by adaptive processes, and the fact that many prokaryotes are capable of cell differentiation reminds us that the evolution of multicellularity need not have awaited the emergence of eukaryotes.
The key point to be made below is that the types of genomic evolution that can occur within a species are not so much dependent on aspects of cell biology as on the constraints imposed by population-level processes, most notably by population size itself. The arguments underlying this hypothesis will be laid out in three sections. First, I will review the role that chance plays in evolution and why this depends on population size. Second, I will summarize several sets of empirical data that show that the efficiency of natural selection declines dramatically between prokaryotes, unicellular eukaryotes, and multicellular eukaryotes. Third, I will demonstrate that theory and empirical observations are mutually consistent in pointing to a central role for nonadaptive processes in the origins of many aspects of eukaryotic gene structure.
The Role of Chance in Evolution
Evolution is an inherently stochastic process, starting from the chance events that produce single mutations and proceeding through a series of fortuitous steps that gradually lead to the spread of some mutations to every member of the descendant population. The usual conceptual point of departure here is the classic Wright-Fisher model, which assumes a population of diploid individuals, each contributing equally and synchronously to an effectively infinite gamete pool. The idealized mating system consists of randomly mating hermaphrodites and ignores complexities associated with overlapping generations, separate sexes, spatial structure, nonrandom variation in family sizes, and so on. However, most of these complications can be dealt with by equating the genetic "effective" size of a population (Ne) to the size of an ideal Wright-Fisher population yielding equivalent gene-frequency dynamics. The effective size of a population is a fundamental determinant of nearly all aspects of evolution as it determines the probability of (and times to) fixation or removal of mutant alleles.
Most deviations from the assumptions of the Wright-Fisher model cause Ne to be less than the total number of adult individuals (N) (Caballero 1994; Whitlock and Barton 1997; Rousset 2003). For example, if adults differ in the number of gametes produced, either because of selection or chance ecological events, Ne will be reduced simply because some individuals contribute little or nothing to the following generation. A sex ratio that deviates from 1:1 reduces Ne because the rarer sex (which necessarily contributes half of the genes to the next generation) acts as a population bottleneck. Population-size fluctuations reduce long-term Ne relative to the arithmetic average because the losses of variation during population bottlenecks exceed the preservational effects of equal population-size expansions. A number of procedures have been developed to estimate Ne from pedigree data or by relating temporal fluctuations in allele frequencies to the expectations from sampling 2Ne gametes. Studies of this sort, mostly confined to vertebrates, suggest an average Ne/N of ≈0.1 (Frankham 1995).
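Two of the departures from the Wright-Fisher ideal described above have standard textbook approximations, which the following minimal Python sketch illustrates. The formulas (Ne = 4NmNf/(Nm + Nf) for an unequal sex ratio, and the harmonic mean of per-generation sizes for a fluctuating population) are not given in the text and are assumed here, as are the function names and the example numbers.

```python
def ne_unequal_sex_ratio(n_males, n_females):
    """Textbook approximation (assumed, not from the text): Ne = 4*Nm*Nf / (Nm + Nf)."""
    return 4.0 * n_males * n_females / (n_males + n_females)


def ne_fluctuating(sizes):
    """Long-term Ne approximated as the harmonic mean of per-generation sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)


if __name__ == "__main__":
    # 100 adults with an even sex ratio versus 10 males and 90 females.
    print(ne_unequal_sex_ratio(50, 50))
    print(ne_unequal_sex_ratio(10, 90))
    # Two generations at 1,000 and one bottleneck at 10 (arithmetic mean = 670).
    print(round(ne_fluctuating([1000, 1000, 10]), 1))
```

With these illustrative numbers, the skewed sex ratio gives Ne = 36 rather than 100, and the single bottleneck pulls the long-term value to about 29, far below the arithmetic mean of 670.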
To see how population size influences the long-term rate of evolution, consider a newly arisen mutation in a diploid population containing 2N gene copies at each locus. If the mutation is neutral, with no external forces favoring one allele over another, the probability of eventual fixation is always equal to the initial frequency, p0 = 1/(2N). Because this fraction is inversely proportional to N, one might expect neutral changes to accumulate more slowly in larger populations. However, the expected number of new mutations arising at a locus per generation is 2Nµ, where µ is the rate of origin of neutral mutations per gene, and the long-term rate of evolution is equal to the product of the rate of origin of mutations and their probability of fixation. Thus, the rate of neutral evolution reduces to the genic mutation rate µ, which is entirely independent of the effective and absolute population size (Kimura 1983).
Intuition suggests that the fixation probability of a beneficial allele must exceed the neutral expectation 1/(2N), whereas that of a detrimental allele must be <1/(2N), but how much so? To see that a beneficial mutation is never guaranteed to go to fixation, no matter how favorable, consider a new mutation that improves the fitness of its initial carrier by a fraction s (the selection coefficient). Such an allele has expected frequency p0 = (1 + s)/(2N) in the gamete pool leading to the next generation, and the probability that it is not successfully inherited by at least one offspring, $(1 - p_0)^{2N}$, is closely approximated by $(1 - s)e^{-1}$ for small s. Thus, relative to the situation for a neutral allele (s = 0), selection only reduces the probability of a rapid initial exit for a beneficial allele by a fraction s. This shows that until a favorable mutation has avoided chance elimination in the first few generations and increased its frequency in doing so, there is little assurance that it will successfully go to fixation. For mutations with additive effects on fitness (increasing homozygote fitness by 2s), the probability of fixation is

$$p_f \approx \frac{1 - e^{-2N_e s/N}}{1 - e^{-4N_e s}}$$

(Kimura 1962). Thus, even in very large populations ($N_e \to \infty$), the high degree of stochasticity in the early phase of mutation establishment still restricts the probability of fixation to an upper limit of 2sNe/N. Given the arguments presented above, this means that the probability of fixation of a favorable mutation is almost always <2s.
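As a rough numerical illustration of the argument above, the sketch below evaluates Kimura's (1962) diffusion approximation for a new additive mutation starting at frequency 1/(2N), pf = (1 - e^(-2Ne s/N))/(1 - e^(-4Ne s)). The function name, the cutoff used to avoid floating-point overflow, and the example parameter values are assumptions made for illustration only.

```python
import math


def fixation_probability(s, n_e, n):
    """Kimura (1962) diffusion approximation for a new additive mutation
    starting at frequency 1/(2N): (1 - exp(-2*Ne*s/N)) / (1 - exp(-4*Ne*s))."""
    x = 4.0 * n_e * s
    if abs(x) < 1e-12:
        return 1.0 / (2.0 * n)  # neutral limit
    if x < -700.0:
        return 0.0  # strongly deleterious; also avoids overflow in expm1
    # expm1(y) = exp(y) - 1 keeps precision when y is close to zero.
    return math.expm1(-2.0 * n_e * s / n) / math.expm1(-x)


if __name__ == "__main__":
    n = n_e = 1_000_000  # hypothetical population with Ne = N
    for s in (0.0, 0.001, 0.01, -0.001):
        print(s, fixation_probability(s, n_e, n))
```

Even with Ne = N, the value returned for a beneficial mutation stays just below 2s, which is the upper limit described in the text.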
Letting µb be the beneficial-mutation rate per gene and 2Nµb be the rate at the population level, the above results imply that the upper limit to the rate of incorporation of beneficial mutations is 4Neµbs, which unlike the situation for neutral mutations increases with Ne. In contrast, because detrimental mutations with −0.3 < Nes < 0.0 have fixation probabilities at least half as great as the neutral expectation, if the rate of origin of mutations in this range of effects is sufficiently high, a considerable load of mildly deleterious mutations can accumulate in populations of sufficiently small size (Ohta 1973, 1974). Modifications to this theory for situations in which populations are subdivided or changing in size do not change these basic scaling properties (Otto and Whitlock 1997; Whitlock 2003).
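The contrast between the two scaling results can be made concrete with a few lines of Python. This is only a sketch: the function names and the values chosen for the mutation rates and selection coefficient are hypothetical, and the beneficial-mutation function returns the upper limit stated in the text rather than an exact rate.

```python
def neutral_substitution_rate(n, mu):
    """(2*N*mu new mutations per generation) x (fixation probability 1/(2N))."""
    return (2.0 * n * mu) * (1.0 / (2.0 * n))


def beneficial_rate_upper_limit(n_e, mu_b, s):
    """(2*N*mu_b) x (2*s*Ne/N); N cancels, leaving 4*Ne*mu_b*s."""
    return 4.0 * n_e * mu_b * s


if __name__ == "__main__":
    mu, mu_b, s = 1e-8, 1e-9, 0.01  # hypothetical illustrative values
    for size in (1e4, 1e6, 1e8):
        print(size, neutral_substitution_rate(size, mu),
              beneficial_rate_upper_limit(size, mu_b, s))
```

The neutral rate stays at µ regardless of population size, whereas the beneficial upper limit grows in direct proportion to Ne.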
These results yield the robust prediction that the ability of a population to incorporate beneficial mutations and to purge deleterious mutations should scale positively with population size, assuming that Ne scales positively with N. However, something beyond the demographic features of a population, the physical structure of the genome itself, will generally limit the growth of Ne with N in the largest of populations. Because tightly linked nucleotide sites are transmitted across generations as a unit, to a degree that depends on the rate of recombination, the fate of any new mutation depends on the selective forces operating on all linked loci. On average, this causes the fixation rates of beneficial mutations to be lower and detrimental mutations to be higher than the single-locus predictions suggested above (Hill and Robertson 1966). For example, a beneficial mutation that rapidly sweeps through a population will necessarily drag along any deleterious alleles at tightly linked loci with which it is associated at the time of origin, whereas the selective removal of deleterious alleles can impede adaptive evolution at linked loci. Even mutually advantageous alleles will interfere with each other's fixation when linked. Consider a beneficial allele A segregating at one locus, with a second beneficial mutation B arising at a tightly linked locus on an a-bearing chromosome. If the advantages of each mutation were the same, then the Ab and aB linkage groups would compete with each other in the fixation process, with one eventually excluding the other. Linkage need not be absolute for these effects to be important, but the stronger the degree of linkage the greater the degree of selective interference.
Gillespie (2000) presented an elegant argument relating the influence of linkage to the effective population size of a chromosomal region. The key issue is that the effective size of a population defines the variance of allele-frequency change from generation to generation, which for a neutral locus is $p(1 - p)/(2N_e)$, where p is the current allele frequency, and 2Ne is the effective number of genes sampled per locus. Now imagine a neutrally evolving site completely linked to another site that is experiencing selective sweeps at rate $\delta$. On average, selective sweeps do not influence which alleles go to fixation at linked neutral sites because the probability that a beneficial mutation destined to fixation will arise in association with a particular allele is simply equal to that allele's frequency. However, selective sweeps do magnify the fluctuations of allele frequencies at linked neutral sites. Assuming as a first approximation that sweeps cleanse a population of linked variation essentially instantaneously, then conditional on a sweep occurring, the variance in allele-frequency change at the neutral locus is $p(1 - p)$. Thus, for a neutral locus in an ideal randomly mating population, the average variance of allele-frequency change is approximately $p(1 - p)\{[(1 - \delta)/(2N_e)] + \delta\}$. Equating the right-hand quantity to 1/(2Nl), the long-term effective population size is $N_l \approx N_e/(1 + 2N_e\delta)$, where Ne is now defined to be the short-term effective size during phases free of selective sweeps. (Maruyama and Birky [1991] obtained essentially the same result by a different method.) Recombination reduces the likelihood that a selective sweep will completely purge the variation at a linked locus, but Gillespie (2000) showed that this simply modifies the preceding expression to $N_l \approx N_e/(1 + 2N_e C\delta)$, where C is the average squared frequency of the neutral hitchhiking allele after the completion of a sweep (previously assumed to be equal to one).
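The hitchhiking argument can be checked numerically. In the sketch below, `delta` stands for the per-generation rate of selective sweeps and `c` for the average squared frequency of the hitchhiking allele after a sweep; both parameter names, the example values, and the way C is folded into the variance expression for the exact solution are assumptions made for illustration, not the author's code.

```python
def long_term_ne_exact(n_e, delta, c=1.0):
    """Solve (1 - delta)/(2*Ne) + c*delta = 1/(2*Nl) for Nl."""
    return n_e / (1.0 - delta + 2.0 * n_e * c * delta)


def long_term_ne_approx(n_e, delta, c=1.0):
    """Approximation for small delta: Nl ~ Ne / (1 + 2*Ne*c*delta)."""
    return n_e / (1.0 + 2.0 * n_e * c * delta)


if __name__ == "__main__":
    delta, c = 1e-6, 0.5  # hypothetical sweep rate and sweep breadth
    for n_e in (1e3, 1e6, 1e9, 1e12):
        print(n_e, long_term_ne_exact(n_e, delta, c), long_term_ne_approx(n_e, delta, c))
    # If c*delta does not change with Ne, Nl levels off near this value.
    print("upper limit:", 1.0 / (2.0 * c * delta))
```

For small sweep rates the exact and approximate forms are nearly indistinguishable, and both saturate as Ne grows.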
An unresolved issue is the way in which the rate of selective sweeps in a tightly linked region, $C\delta$, scales with population size. If $C\delta$ were completely independent of population size, then Nl would increase with Ne at a decreasing rate, eventually reaching an upper limit of $1/(2C\delta)$. Because larger populations contain more targets for rare beneficial mutations and also experience more recombination events, the rate of sweeps ($\delta$) is expected to increase and the breadth of sweeps (C) to decrease with increasing Ne, so in principle, these opposite patterns of scaling might fortuitously balance such that the product $C\delta$ is indeed independent of Ne. However, in the absence of any direct observations on this matter, a more general approximation is to treat $C\delta$ as a power function of Ne, yielding a relationship of the form $N_l = N_e/(1 + \alpha N_e^{\beta})$. Here, β = 1 describes the case in which $C\delta$ is independent of Ne, whereas with β = 2 the rate of selective sweeps is proportional to Ne, as in the single-locus result described above. The latter condition is clearly too high to be biologically realistic as the rate of selective sweeps per locus eventually exceeds one per generation at large Ne. Thus, β is likely to fall in the range of 1 to 2, although this remains to be formally demonstrated. Assuming that short-term Ne scales linearly with absolute population size (N), these qualitative arguments suggest that long-term Nl will also scale linearly with N for small to moderate N where random genetic drift is the predominant stochastic force. However, the degree of scaling is expected to be progressively reduced at larger N, with an asymptotic limit to Nl possibly being reached at very large N where stochastic fluctuations in allele frequencies are primarily a function of the chromosomal nature of the genome (Gillespie's genetic draft).
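The power-function form of the relationship between long-term and short-term effective size can be explored with a short sketch. Here `alpha` (the prefactor) and `beta` (the exponent) are simply parameter names for the two constants of the power function, and the numerical values are arbitrary choices for illustration; nothing in the text fixes them.

```python
def long_term_ne(n_e, alpha, beta):
    """Nl = Ne / (1 + alpha * Ne**beta)."""
    return n_e / (1.0 + alpha * n_e ** beta)


if __name__ == "__main__":
    alpha = 1e-8  # arbitrary illustrative prefactor
    for n_e in (1e2, 1e4, 1e6, 1e8, 1e10, 1e12):
        # beta = 1: the sweep term per unit Ne is independent of Ne.
        print(f"Ne={n_e:.0e}  Nl(beta=1)={long_term_ne(n_e, alpha, 1.0):.3g}")
```

With beta = 1, Nl tracks Ne closely while alpha*Ne is much smaller than one (the drift-dominated regime) and then flattens toward 1/alpha (the draft-dominated regime).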
To gain an appreciation of the power of random genetic drift and draft to compromise the efficiency of natural selection, it is useful to consider the ratio of the single-locus fixation probability, pf, and the neutral expectation, 1/(2N). For large N, this is a simple function of $4N_l s$,

$$\frac{p_f}{1/(2N)} \approx \frac{4N_l s}{1 - e^{-4N_l s}},$$

where Nl is defined by the function in the preceding paragraph (fig. 1). If the strength of selection is sufficiently large relative to the power of random genetic drift (4Nls > 1), the fixation probability for an advantageous allele is inflated by a factor of 4Nls relative to the neutral expectation, whereas the fixation probability of a deleterious allele asymptotically approaches zero as $4N_l s \to -\infty$. However, if |4Nls| < 0.2, the probability of fixation is within 10% of the neutral expectation, and if |4Nls| < 0.02, the deviation from neutrality is no more than 1%. Thus, for any long-term effective population size, there exists a range of deleterious mutations whose selective disadvantages are overwhelmed by stochastic forces. Such alleles are said to be effectively neutral.
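The thresholds quoted above can be reproduced approximately with a small sketch. It assumes the large-N form of Kimura's fixation probability with Ne replaced by Nl, so that the ratio of pf to 1/(2N) is x/(1 - e^(-x)) with x = 4Nl s; the function name and the overflow cutoff are assumptions made for illustration.

```python
import math


def fixation_ratio(x):
    """pf / (1/(2N)) ~ x / (1 - exp(-x)), with x = 4*Nl*s, assuming N is large."""
    if abs(x) < 1e-12:
        return 1.0  # neutral limit
    if x < -700.0:
        return 0.0  # strongly deleterious; also avoids overflow in expm1
    return x / -math.expm1(-x)


if __name__ == "__main__":
    for x in (-10.0, -0.2, -0.02, 0.02, 0.2, 10.0):
        print(f"4*Nl*s = {x:6.2f}   ratio to neutral = {fixation_ratio(x):.4f}")
```

Under this approximation the ratio departs from one by roughly 10% at |4Nl s| = 0.2 and by roughly 1% at |4Nl s| = 0.02, and it approaches 4Nl s itself once selection is strong and positive.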
The Three Genomic Perils of Increased Organism Size
A central premise of this paper is that there is a general reduction in the efficiency of selection between prokaryotes, unicellular eukaryotes, and multicellular species. We now take a more empirical look at this issue, showing that all three major factors responsible for reductions in Nl (small population size, tight linkage, and high background mutational activity) are jointly exacerbated as organisms increase in size, producing a synergism that causes substantial reductions in the efficiency of natural selection.
Body Size and Population Size
A typical prokaryote is five to seven orders of magnitude smaller than the average single-celled eukaryote, with a similar disparity existing between unicellular and multicellular eukaryotes (Bonner 1988). Such massive differences in size impose numerous ecological and physiological constraints and opportunities, but the implications for the population-genetic environment are equally pronounced. All other things being equal, the genetic effective size of a population should generally increase with the actual number of breeding adults (N), and one of the few well-established laws in ecology is that a primary determinant of N is the average size of members of the population. Eukaryotes generally show an inverse relationship between population density per unit area and average individual body mass within a species, with the extreme values ranging from ≈$10^{-7}$ individuals/m² for the largest vertebrates to ≈$10^{11}$ individuals/m² for the smallest unicellular eukaryotes (Damuth 1981; Schmid, Tokeshi, and Schmid-Araya 2000; Enquist and Niklas 2001; Carbone and Gittleman 2002; Finlay 2002).
Ecological factors unique to individual species can cause local deviations around this pattern, and an inverse scaling between population density and organism size need not reflect the pattern for total population size as it does not account for total species ranges. However, the geographic area occupied by vertebrate species is negligibly to weakly positively correlated with average body size (Gaston and Blackburn 1996; Diniz and Torres 2002; Housworth, Martins, and Lynch 2003), and the geographic ranges of unicellular species appear to be substantially greater than those for multicellular taxa (Finlay et al. 2001; Finlay, Monaghan, and Maberly 2002; Green et al. 2004; Horner-Devine et al. 2004). Thus, in a broad phylogenetic sense, there is little doubt that the total number of individuals within a species declines with increasing organism size, and the total range in N over all species certainly exceeds 20 orders of magnitude. Assuming ≈$10^{30}$ prokaryotic cells inhabiting the earth (Whitman, Coleman, and Wiebe 1998) and $10^{7}$ being an upper-bound estimate for the number of prokaryotic species (Hammond 1995), N for an average prokaryote would be ≈$10^{23}$.
Reduced Recombination in Large Genomes
High-density genetic maps allow the estimation of the average amount of meiotic crossing over for numerous eukaryotes. The magnitude of recombination per physical distance scales negatively with genome size, ranging from $3 \times 10^{-10}$/bp/generation in Pinus sylvestris to $3 \times 10^{-6}$/bp/generation in Saccharomyces cerevisiae (fig. 2). Such scaling is due mostly to the simple fact that most species experience between one and two meiotic crossover events per chromosome. Because chromosome number is uncorrelated with genome size, the intensity of recombination per nucleotide position naturally increases in smaller genomes (with smaller average chromosome lengths). Less clear is why the recombination rate declines with increasing genome size twice as rapidly in unicellular as in multicellular species (fig. 2). In any event, because genome size increases with organism size, these results imply that increases in organism size are accompanied by decreases in the intensity of recombination. Not only can a selective sweep in a multicellular eukaryote drag along up to 10,000-fold more linked nucleotide sites than is likely in a unicellular species, but species with small genomes also experience increased levels of recombination on a per-gene basis. For example, the rate of recombination over the entire physical distance associated with an average gene (including intergenic DNA) is ≈0.007 in S. cerevisiae versus ≈0.001 in Homo sapiens, and the discrepancy is greater if one considers just coding exons and introns, 0.005 versus 0.0005. The consequences of reduced recombination rates are particularly clear in the human population, which harbors numerous haplotype blocks, tens to hundreds of kilobases in length, with little evidence of internal recombination (Daly et al. 2001; Reich et al. 2001; Dawson et al. 2002; Gabriel et al. 2002; Greenwood, Rana, and Schork 2004; McVean et al. 2004).
The Rate of Mutation
Because of the rarity of mutations at individual nucleotide sites, there are enormous challenges to estimating the rate at which mutations arise at the molecular level. Most estimates are derived either from surveys of visible mutations at reporter loci or of dominant genetic disorders, followed by sequence analysis of individuals exhibiting a phenotype. These approaches are not without problems, as corrections must be made for the incidence of undetectable mutations. More indirect attempts to estimate the mutation rate are based on comparisons of distantly related species, using DNA sequences thought to be free of natural selection and making assumptions about times of interspecific divergence and species-specific generation times (e.g., Keightley and Eyre-Walker 2000
Across a phylogenetically diverse set of species, there is a strong correlation between the mutation rate per generation and genome size (fig. 3). The range for the base-substitution mutation rate is approximately two orders of magnitude, and again exhibits a gradient with organism size, the extremes being $5.0 \times 10^{-10}$ and $5.4 \times 10^{-8}$/bp/generation for prokaryotes and vertebrates, respectively. Despite the uncertainties in each estimate contributing to this pattern, the validity of the overall relationship is supported by two observations. First, the estimate for the nematode Caenorhabditis elegans obtained by direct sequence analysis (highest red point in the plot) is consistent with the remaining data obtained via reporter constructs. Second, estimates of the human mutation rate obtained from observations on dominant genetic disorders (Kondrashov 2003) are very similar to those obtained from comparisons of pseudogene sequences in humans and chimpanzees (Nachman and Crowell 2000), $2.6 \times 10^{-8}$ and $2.2 \times 10^{-8}$/bp/generation, respectively. Based on more limited data, Drake (1991) concluded that the mutation rate per nucleotide per generation is inversely related to genome size in microbial species, but the results in figure 3 suggest the opposite pattern, even within the subset of unicellular species.
The Global Effective Population Sizes of Species
The preceding results show that three factors (low population sizes, low recombination rates, and high mutation rates) conspire to reduce the efficiency of natural selection with increasing organism size, although it is difficult to predict the magnitude of decline in Nl from these three factors alone. For example, fluctuations in population size can result in a substantial depression of Nl below average N, and it is unclear whether the magnitude of such fluctuations varies with organism size. In addition, as discussed above, hitchhiking effects should depress the Nl/N ratio much more in large populations, but because the recombination rate (per meiosis) is substantially greater and the mutation rate is substantially lower in species with large N, this decline could be weaker than otherwise expected. Given the many additional factors that can influence Nl, the degree to which Nl varies with organism size is best resolved by direct empirical observation.
One way to accomplish this task is to consider the amount of nucleotide-sequence variation at silent sites in protein-coding genes within natural populations. Under the assumption that mutations at such sites escape the eyes of natural selection, the amount of silent-site variation has a simple interpretation. The rate of introduction of new variation per nucleotide site in two randomly compared alleles is 2u (twice the base-substitution mutation rate per nucleotide), while the expected rate of loss of variation by genetic drift is 1/(2Nl). At equilibrium, the average number of nucleotide substitutions separating individual neutral sites in two randomly sampled alleles is the ratio of these two rates, 4Nlu. For a haploid species, the rate of random genetic drift is 1/Nl, and the equilibrium divergence among neutral nucleotide sites becomes 2Nlu. Both results have the same meaning: at mutation-drift equilibrium, the amount of within-species nucleotide variation at silent sites is equal to twice the effective number of gene copies at the locus times the per-nucleotide mutation rate. An estimate of this composite parameter is provided by the observed level of silent-site variation within a species (hereafter πs). Although the complex nature of the definition of Nl introduces some interpretive issues with πs (Laporte and Charlesworth 2002), a fully general definition can be described in terms of allelic ancestry. πs is equal to the average age of random pairs of sequences times twice the base-substitutional mutation rate per nucleotide site. As will be seen below, the fact that πs is a function of the product of Nl and u is very useful because many aspects of genome evolution depend directly on this product.
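The equilibrium argument above reduces to a ratio of two rates, which the following sketch spells out. The function name, the `ploidy` argument, and the example values of Nl and u are hypothetical and serve only to show how the diploid (4Nl u) and haploid (2Nl u) expectations arise from the same calculation.

```python
def expected_silent_site_diversity(n_l, u, ploidy=2):
    """Equilibrium diversity = (input rate 2u) / (loss rate 1/(ploidy * Nl))."""
    if ploidy not in (1, 2):
        raise ValueError("ploidy must be 1 (haploid) or 2 (diploid)")
    input_rate = 2.0 * u                # new variation between two sampled alleles
    loss_rate = 1.0 / (ploidy * n_l)    # 1/(2Nl) for diploids, 1/Nl for haploids
    return input_rate / loss_rate


if __name__ == "__main__":
    # Hypothetical haploid and diploid populations.
    print(expected_silent_site_diversity(1e8, 5.0e-10, ploidy=1))
    print(expected_silent_site_diversity(1e4, 2.2e-8, ploidy=2))
```

In both cases the result equals twice the effective number of gene copies at the locus times the per-nucleotide mutation rate.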
Information on πs now exists for a wide enough phylogenetic range of species that some general statements can be made. Drawing from a substantially larger database than presented in the earlier survey of Lynch and Conery (2003), there is a striking inverse relationship between organism size and silent-site variation (fig. 4). For prokaryotes, πs lies in the broad range of 0.0071 to 0.3881, with an average value of 0.1044. This is nearly twice the average value for unicellular eukaryotes (0.0573), although the range of values among the latter taxa is again very high (0.0103 to 0.2522). For the still larger invertebrates, there is a further reduction in average πs to 0.0265, with a range of 0.0090 to 0.0473. The average value of πs for plants (0.0152) is still lower and that for vertebrates (0.0038) is even lower.
A significant caveat with respect to these data is that the bulk of existing surveys on nuclear variation in unicellular species have focused on pathogens, which because of the demographic dependence on their host species, probably have lower Ne than free-living species (Hartl et al. 2002
πs < 0.01 are Serratia (a human pathogen) and Buchnera (an obligate endosymbiont of aphids). The six lowest estimates of πs for unicellular eukaryotes are all derived from pathogens (Candida, Coccidioides, Encephalitozoon, Fusarium, Phytophthora, and Plasmodium), all other taxa (including some pathogens) having πs > 0.02.
From these estimates of πs, Nl can be disentangled from u by applying the mutation-rate estimates described above. For example, using the average observed value of u ≈ $5.0 \times 10^{-10}$ for prokaryotes to factor u out of 2Nlu, the estimated average Nl for prokaryotes is ≈$10^{8}$. After removal of the six lowest values of πs for eukaryotic parasites, application of the average mutation rate for unicellular eukaryotes ($1.6 \times 10^{-9}$) yields an average Nl of ≈$10^{7}$ for this group. Similar analyses for invertebrates and vertebrates yield average Nl estimates of $10^{6}$ and $10^{4}$, respectively. Finally, a phylogenetically based mutation-rate estimate for plants of $7.3 \times 10^{-9}$/bp/year (Lynch 1997) yields an average Nl estimate of ≈$10^{6}$ for annual species and, assuming a generation time of 20 years, ≈$10^{4}$ for trees.
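The order-of-magnitude estimates above can be retraced with a short calculation that divides the observed silent-site variation by the appropriate multiple of the mutation rate. This is a sketch under stated assumptions: prokaryotes are treated as haploid and the other groups as diploid, the vertebrate rate is taken to be the 5.4 x 10^-8 value quoted earlier for that group, and the function and variable names are hypothetical.

```python
import math


def estimate_long_term_ne(pi_s, u, ploidy=2):
    """Invert pi_s = 2 * ploidy * Nl * u for Nl."""
    return pi_s / (2.0 * ploidy * u)


if __name__ == "__main__":
    # Prokaryotes (haploid): average silent-site variation and mutation rate from the text.
    prok = estimate_long_term_ne(0.1044, 5.0e-10, ploidy=1)
    # Vertebrates (diploid): using 5.4e-8 as the group-wide rate is an assumption.
    vert = estimate_long_term_ne(0.0038, 5.4e-8, ploidy=2)
    # Plants: per-year rate times generation time (1 year for annuals, 20 for trees).
    annual = estimate_long_term_ne(0.0152, 7.3e-9 * 1, ploidy=2)
    tree = estimate_long_term_ne(0.0152, 7.3e-9 * 20, ploidy=2)
    groups = [("prokaryotes", prok), ("vertebrates", vert),
              ("annual plants", annual), ("trees", tree)]
    for name, value in groups:
        print(f"{name}: Nl ~ 10^{round(math.log10(value))}")
```

Rounded to the nearest power of ten, these inputs give exponents of 8, 4, 6, and 4, in line with the values reported in the text.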
Given the rough nature of the preceding calculations, they are intentionally reported to just an order of magnitude. Nevertheless, it is likely that the range in Nl from prokaryotes to multicellular eukaryotes exceeds the four orders of magnitude just noted. Any selection on silent sites associated with codon-usage bias and/or mRNA processing features will bias πs below the neutral expectation, and the magnitude of bias will be greatest in large populations where selection is most efficient. Several observations suggest that this issue is of significance (Bustamante, Nielsen, and Hartl 2002; Hellmann et al. 2003; Chamary and Hurst 2004; Desai et al. 2004; Halligan et al. 2004; Sharp et al. 2005), and because the divergence rate of silent sites in prokaryotes may be at least ten times lower than the mutation rate (Ochman 2003), the prokaryotic Nl estimates given above could be underestimated by at least tenfold. Despite these uncertainties, it is clear that the disparity in Nl across all domains of life is nearly 20 orders of magnitude less than the disparity in absolute numbers, a pattern that is consistent with a significant stochastic role of genetic draft in large populations.
Because of their potential for considerable clonal structure, prokaryotic species may be particularly vulnerable to selective sweeps, but the breadth of such sweeps remains unclear. Moreover, because prokaryotes have a number of mechanisms for uptake and exchange of exogenous DNA, the absence of meiosis need not imply exceptionally low levels of recombination. Some insight into this matter can be acquired by considering the statistical associations between locus-specific allelic variants that develop stochastically as a consequence of random genetic drift. At drift-recombination balance, the amount of linkage disequilibrium in a population is a function of the product Nlc, where c is the rate of recombination per nucleotide site (Ohta and Kimura 1971
; Hill 1975
). This quantity can be estimated by evaluating the rate at which the level of disequilibrium declines with the physical distance between nucleotide sites in samples of gene sequences. When joint estimates of Nlu and Nlc are available, their ratio eliminates Nl, providing an estimate of the relative rates of recombination and mutation (c/u).
For eukaryotes, the results from figures 2 and 3 can be used to reveal c/u more directly, yielding average relationships of c/u
0.321G2.1 for unicellular species and 0.014G1.5 for multicellular species, which implies expected values of 17 and 0.4 for genomes 100 and 1,000 Mb in size, respectively. The few available estimates for c/u for eukaryotes from polymorphism studies are in rough accord with these predictions, the average being 2.3 (0.9) for animals and 1.2 (0.5) for land plants (table 1), whose genomes are generally in the vicinity of a few hundred to several thousand megabases. The few available estimates of c/u for prokaryotes are of the same order of magnitude as those for multicellular eukaryotes, although not as high as expected for unicellular eukaryotes, averaging 4.3 (1.6). Thus, relative to the background rate of mutation, recombination at the nucleotide level is not exceptionally low in prokaryotes. Horizontal transfer across species boundaries can also expand the genomic resources available to prokaryotes (Ochman, Lawrence, and Groisman 2000
), and hence the efficiency of selection, although this source of diversity has been avoided in the preceding analyses.
In summary, all lines of evidence point to the fact that the efficiency of selection is greatly reduced in eukaryotes to a degree that depends on organism size. However, as suggested by Gillespie (2000)
2 × 10^9 for Helicobacter pyogenes, a highly recombining member of the eubacteria. After accounting for the probable downward bias of this estimate, the upper limit to Nl for all species dictated by the unavoidable constraints of linkage and selective sweeps may be on the order of 10^10 to 10^11. These numbers are relevant because, as will be shown below, they are just a few orders of magnitude above the point at which the population-genetic environment for gene-structure evolution becomes significantly altered.
One caveat with respect to estimates of Nl derived from polymorphism data is that they apply only over the time span necessary for the fixation of an average neutral mutation, 4Nl and 2Nl generations for diploids and haploids, respectively, which necessarily increases in species with larger Nl. Because many of the gross features of genomes may require tens to hundreds of millions of years to emerge, the short-term estimates of Nl for any particular species are likely to frequently misrepresent longer term conditions relevant to genome evolution. For example, newly emergent pathogenic bacteria, which often harbor almost no genetic variation (Daubin and Moran 2004
), are not expected to exhibit a signature of random genetic drift at the level of genomic architecture. On the other hand, averages of Nl over the members of broad taxonomic/functional groups (fig. 4) eliminate outliers resulting from sampling error and stochastic temporal fluctuations in population size, thereby providing more meaningful estimates of long-term conditions.
Drift, Mutation Pressure, and the Emergence of Eukaryotic Gene Complexity
Associated with reductions in Nl in eukaryotes are dramatic expansions in genome size, most of which reflect changes in noncoding regions: introns, mobile elements and their remnants, and other forms of intergenic DNA (fig. 5). A notable feature of these scalings is their continuity over all forms of life, even across the prokaryote-eukaryote boundary. This strongly suggests that neither cellular nor physiological changes associated with phylogenetic transitions are major determinants of genome size. We have previously suggested that the types of genomic evolution that are possible in various lineages are instead largely defined by the population-genetic environment, in particular by the effective number of individuals within a species (Lynch and Conery 2003
). In the remainder of this paper, these ideas will be extended to show how many aspects of eukaryotic gene structure may have arisen by nonadaptive processes.
Prokaryotic genes generally have remarkably simple structures: a single continuous coding region with one or two transcription-factor binding (TFB) sites residing just a few nucleotides upstream. Often, a single transcription-initiation site services several downstream prokaryotic genes, which are jointly transcribed into a single polycistronic mRNA (an operon). In contrast, the coding regions of eukaryotic genes are often dissected by introns, which are transcribed into precursor mRNAs and then subsequently eliminated by splicing (fig. 6). In multicellular species, dozens of introns may occupy a single gene, and each intron can be many times longer than its surrounding exons. Eukaryotic genes also often have complex sets of regulatory elements distributed over large distances upstream (and sometimes internally or downstream) of the coding region, and with few exceptions, eukaryotic genes are transcribed as single monocistronic units. Finally, eukaryotic gene transcripts are generally flanked by extensive UTRs, which may harbor additional introns. Understanding how such modifications of gene structure emerged is a major challenge for evolutionary genomics because each additional layer of gene complexity entails a cost in terms of mutational vulnerability.
Population geneticists have historically treated selection and mutation as separate forces in the dynamics of evolutionary change, with mutation producing the variation upon which natural selection acts but having no further influence on the fates of alleles. However, in the context of gene architectural features, there are numerous ways in which mutation can act indirectly as a selective agent. Consider a pair of alleles with different forms of gene architecture but otherwise identical functions. Aside from any energetic burden associated with the maintenance of larger numbers of nucleotides, as a larger mutational target, the more complex allele will experience a greater rate of transformation to defective copies. For example, intergenic DNA has the potential to incur mutations that produce spurious TFB sites that cause inappropriate patterns of gene expression; introns necessitate the maintenance of localization signals at the nucleotide level to insure proper mRNA splicing; and 5' UTRs can acquire premature translation-initiation codons that cause downstream frameshifts. In this sense, most aspects of gene-architectural complexity impose an intrinsic mutational burden. The selective disadvantage associated with any single aspect of gene complexity need not be very large, as it is roughly equivalent to the product of the mutation rate per nucleotide per generation (u) and the excess number of nucleotide sites in the more complex allele critical to gene function (n) (Lynch 2002).
As noted above, if a costly modification of gene architecture is to evolve in an effectively neutral manner, 4Nls must be smaller than ≈1.0. Because s = nu and πs is a function of Nlu, this criterion is equivalent to nπs < 1.0. Thus, recalling the average estimate of πs for prokaryotes (fig. 4) and its likely downward bias, the population-genetic environment of prokaryotic species may only rarely be conducive to expansions in gene-architectural complexity. In contrast, the extremely low levels of πs for multicellular eukaryotes create situations that are highly permissive to the accumulation of gene architectural changes with weak mutational disadvantages, which are easily overwhelmed by the power of random genetic drift. With their wide ranges of Nl, the various lineages of unicellular eukaryotes are expected to fall between these two extremes. It should be noted, however, that although organism size appears to be the primary determinant of Nl, it is the latter that ultimately governs the genetic properties of populations. It is conceivable that some unicellular species may reside at a sufficiently low Nl for long enough periods to promote genomic expansion but exceedingly unlikely that any multicellular species ever achieves prokaryote-like levels of Nl.
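The neutrality criterion can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not a reproduction of the analysis in the text: the function names are hypothetical, n = 30 excess sites is an arbitrary illustrative value, and the multicellular entry (Nl = 10^4 with u = 10^-8) pairs the vertebrate Nl quoted earlier with an assumed mutation rate that is not given in this section. The prokaryotic and unicellular entries use the order-of-magnitude Nl and u values quoted earlier.

```python
def mutational_cost(n_sites, u):
    """Selective disadvantage s ~ n * u of the more complex allele."""
    return n_sites * u


def is_effectively_neutral(nl, n_sites, u):
    """Return True when 4 * Nl * s < 1, i.e. when drift dominates."""
    return 4 * nl * mutational_cost(n_sites, u) < 1.0


# (Nl, u) pairs; the multicellular u is an assumed placeholder.
groups = {
    "prokaryotes": (1e8, 5.0e-10),
    "unicellular eukaryotes": (1e7, 1.6e-9),
    "multicellular (assumed u)": (1e4, 1.0e-8),
}

N_SITES = 30  # illustrative number of excess sites, not from the text

for name, (nl, u) in groups.items():
    scaled = 4 * nl * mutational_cost(N_SITES, u)
    verdict = "drift-dominated" if is_effectively_neutral(nl, N_SITES, u) else "selection-dominated"
    print(f"{name}: 4*Nl*s = {scaled:.3g} ({verdict})")
```

Because the inputs are order-of-magnitude averages, the output is best read as a qualitative ranking of groups rather than as a precise classification of any species.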
It is clear that the emergence of the complexities of eukaryotic gene structure offered novel opportunities for the evolution of organismal diversity, many of which have been exploited by multicellular eukaryotes, for example, increased regulatory-region complexity and alternative splicing associated with introns. Less certain is whether multiple cell types and mechanisms of cell signaling are advantageous in a formal fitness sense. In any event, because any such adaptive modifications are highly unlikely to have arisen de novo, alternative explanations are needed for the first steps in the retailoring of the eukaryotic genome. Specific examples are now given on how three aspects of eukaryotic gene complexity may have emerged despite their initial intrinsic disadvantages. Each scenario discussed is quantitatively consistent with the theory presented above and shows how a reduction in Nl can passively promote the evolution of gene architectural changes that ultimately facilitate the evolution of organismal complexity by descent with modification.
Introns
As noted above, introns impose a burden on their host genes, in that specific nucleotide signatures must be reserved to insure precise recognition of each exon-intron junction by the spliceosome. The most conserved nucleotide sites are located at the ends of introns and at internal intronic branch points (Burge, Tuschl, and Sharp 1999; Lorković et al. 2000; Bon et al. 2003). However, this information is often insufficient for proper spliceosomal recognition, particularly in the case of large introns containing numerous spurious recognition sites (Mount et al. 1992; Burge, Tuschl, and Sharp 1999; Long and Deutsch 1999). Supplemental information often resides within the surrounding exons in the form of exon splicing enhancers and exon splicing silencers (ESSs), each typically four to ten nucleotides in length (Liu, Zhang, and Krainer 1998; Schaal and Maniatis 1999; Blencowe 2000). In mammals, ≈2% to 4% of exonic sequences match the signatures of known ESSs, with ≈5 such clumps per exon (Fairbrother et al. 2002), and the maintenance of such motifs by transcription-related processes is supported by the significantly different frequencies of various oligomers in intron-containing versus intron-free genes (Federov et al. 2001). The most direct evidence for the increased mutational vulnerability associated with introns derives from the observation that about a third of human genetic disorders is attributable to mutations causing defective splice-site recognition (Culbertson 1999; Frischmeyer and Dietz 1999; Philips and Cooper 2000), many of which are located in exons (including substitutions at synonymous sites) (Cooper and Mattox 1997; Nissim-Rafinia and Kerem 2002).
Based on the known molecular requirements for spliceosomal recognition, one may surmise that the equivalent of n = 20 to 40 nucleotide sites are required for the precise removal of each intron (Lynch 2002), and indirect estimates based on the incidence of splicing-defective alleles among new mutations are consistent with this prediction (Lynch, Hong, and Scofield 2005a). Recalling the theory presented above, a permissive environment for intron colonization requires that Nls = Nlnu be smaller than ≈0.25, or equivalently πs < 1/n. Thus, as a first-order approximation, populations with silent-site nucleotide diversities greater than ≈0.05 are expected to be nearly immune to intron colonization. Essentially, the full range of variation in observed πs for animals and land plants is well below this threshold value (fig. 4), and all members of these groups have an average of four to seven introns per protein-coding gene (Lynch and Conery 2003). In contrast, the average value of πs for unicellular species (0.06) slightly exceeds the expected threshold, yielding the prediction that the demographic features of such species often place them in close proximity to the barrier to intron colonization (and maintenance). Consistent with this prediction is the broad range of variation in intron numbers in unicellular eukaryotes, ranging from a few dozen or less in the entire genomes of some species (e.g., trypanosomes, the diplomonad Giardia, the red alga Cyanidioschyzon, and some fungi) to numbers approaching those in animals and land plants in other fungi (fig. 5). For prokaryotes, which are devoid of spliceosomal introns, average πs is more than twice the threshold for intron colonization.
The phylogenetic distribution of introns and the components of the spliceosome make it quite clear that the stem eukaryote harbored introns (Lynch and Richardson 2002; Collins and Penny 2005), and perhaps a substantial number of them (an average of up to three per protein-coding gene being plausible) (Rogozin et al. 2003; Roy and Gilbert 2005a; see Qiu, Schisler, and Stoltzfus 2004 for an alternative view). Thus, because there is no evidence of the prior existence of a spliceosome in any prokaryote, the stem eukaryote must have provided a highly permissive environment for intron colonization, with some subsequent lineages then experiencing conditions that favored intron loss and others favoring further intron gain. Could the stem eukaryote have had a sufficiently small population size to allow the accumulation of a substantial intron population via effectively neutral processes alone? Two different approaches have led to the conclusion that the birth rate of introns within the past ≈100 Myr is ≈0.001/nucleotide site/Byr in invertebrates (Lynch and Richardson 2002; Roy and Gilbert 2005b). At this rate, ≈1.0 Byr would be required since the origin of the spliceosome for the protein-coding genes of the stem eukaryote to acquire an average of ≈1.0 introns, assuming an average coding length of ≈1.0 kb, as is common in today's eukaryotes. Thus, because the time span between the origin of life and the origin of eukaryotes is ≈1.0 Byr (Knoll 1992; Furnes et al. 2004), the passive acquisition of more than one intron per protein-coding gene in the stem eukaryote is just barely plausible, unless the physical rate of intron birth was substantially higher than in today's species.
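The arithmetic behind this plausibility argument is a simple product of rate, target size, and time. The following Python sketch reproduces it under the simplifying assumptions that intron births accumulate linearly and that intron loss is ignored; the function names are hypothetical, and the inputs are the approximate values quoted in the paragraph above.

```python
def expected_introns_per_gene(birth_rate, coding_length_bp, time_byr):
    """Expected introns gained per gene, ignoring intron loss.

    birth_rate       : intron births per nucleotide site per Byr
    coding_length_bp : coding length of the gene in nucleotides
    time_byr         : elapsed time in billions of years
    """
    return birth_rate * coding_length_bp * time_byr


def time_to_reach(target_introns, birth_rate, coding_length_bp):
    """Time (in Byr) needed to accumulate a target number of introns per gene."""
    return target_introns / (birth_rate * coding_length_bp)


BIRTH_RATE = 0.001    # per nucleotide site per Byr (invertebrate estimate quoted above)
CODING_LENGTH = 1000  # roughly 1.0 kb of coding sequence

gained = expected_introns_per_gene(BIRTH_RATE, CODING_LENGTH, time_byr=1.0)
needed = time_to_reach(3, BIRTH_RATE, CODING_LENGTH)
print(f"Introns gained per gene in 1.0 Byr: {gained:.2f}")
print(f"Time needed for three introns per gene: {needed:.2f} Byr")
```

Under these assumptions, reaching the upper estimate of three introns per gene would take about three times the available interval, which is why a higher early birth rate or some form of positive selection is considered next.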
A more rapid early proliferation of introns could have occurred if some form of positive selection offset the intrinsic disadvantages associated with elevated mutational vulnerability as this would increase the rate of fixation beyond the neutral expectation. One possibility involves the nonsense-mediated decay (NMD) pathway, an mRNA surveillance mechanism for detecting and eradicating transcripts harboring premature termination codons (PTCs). The details of this process have been worked out in only a few organisms, but at least in mammals NMD often uses a protein complex laid down at splicing junctions (the exon junction complex [EJC]) to discriminate PTCs from proper termination codons. If a termination codon is detected upstream of an EJC, the transcript is generally targeted for destruction, a process that works so long as the true termination codon generally resides in the final exon, as is usually the case in mammals (Maquat 2004
). Aside from the transcription of mutant alleles, there are many different routes to the stochastic production of PTC-containing transcripts, including base misincorporation, sloppy points of transcription initiation, and erroneous splicing (reviewed in Lynch, Hong, and Scofield 2005a
). Thus, most cells are regularly confronted with the need to eliminate transcripts that could lead to harmful truncated proteins, and the benefits of doing so via NMD are well documented (Hodgkin et al. 1989
; Leeds et al. 1992
; Dahlseid et al. 1998
; Mendell et al. 2000
; Medghalchi et al. 2001
).
Phylogenetic analysis suggests that both NMD and the EJC were present in the stem eukaryote (Lynch, Hong, and Scofield 2005a
). Thus, an early functional association of NMD with introns could have elevated the rate of intron proliferation beyond the neutral expectation (Lynch and Kewalramani 2003
). Under this hypothesis, the first intron to colonize a gene would provide a basis for eliminating transcripts with the subset of upstream PTCs. However, because the spatial locations of initially colonizing introns must be largely random, because introns themselves encourage the production of erroneous transcripts via splicing errors, and because some PTCs may be unable to elicit NMD if the nearest EJC is too far downstream, once this coevolutionary process initiated, further colonization of introns would be encouraged. In this manner, some of the earliest colonizing introns (those in locations that allowed sufficient PTC detection) may have had a selective advantage that offset the cost of increased mutational susceptibility. The overdispersed distributions of introns in the genes of multicellular species support the hypothesis that selection favors a uniform coverage of coding regions with introns (Lynch and Kewalramani 2003
). Other factors may encourage the colonization of introns (Lynch and Richardson 2002
), but the central point here is that once introns became a reliable aspect of a substantial fraction of eukaryotic genes, they served as a natural substrate for secondary adaptive evolution.
Despite the obvious benefits of an mRNA surveillance system, a few species appear to have lost the NMD pathway (Lynch, Hong, and Scofield 2005a
). These include the kinetoplastids Trypanosoma and Leishmania, the unicellular red alga Cyanidioschyzon, the microsporidian Encephalitozoon, and the diplomonad Giardia. Remarkably, each of these lineages is almost entirely devoid of introns, and with the exception of Leishmania, they all appear to have lost the EJC apparatus. It is tempting to conclude that the loss of introns and NMD must go hand in hand simply because of the latter's functional requirement for an EJC. However, because NMD operates on some genes in an intron-independent manner in a phylogenetically broad group of species (Ruiz-Echevarria, González, and Peltz 1998
; Hilleren and Parker 1999
; Gatfield et al. 2003
; Amrani et al. 2004
), it is not clear that introns are an absolute requirement for the maintenance of NMD. An alternative explanation for NMD losses is that the species involved have had very large historical effective sizes, which facilitated the elimination of all forms of extraneous DNA, including mobile elements and most intergenic DNA. As the degree of genomic streamlining increases, the production of erroneous transcripts may eventually decline to the point at which the selective advantage of an intron-based NMD system is no longer sufficient to insure its evolutionary stability (Lynch, Hong, and Scofield 2005a
).
A central unresolved issue with respect to introns is whether intron numbers have reached a steady-state equilibrium, and if so, what prevents runaway intron colonization. The equilibrium occupancy of introns is a function of the ratio of birth (b) to death (d) rates per coding nucleotide site (Lynch 2002
), but no equilibrium is possible if b > d for all levels of occupancy. The NMD hypothesis provides a potential density-dependent mechanism that could stabilize intron numbers. As a sufficiently well-distributed population of introns is established, the NMD-associated advantages of additional introns will progressively decline until a point is eventually reached at which further intron colonization imposes a net disadvantage. Such a scenario would provide a natural barrier to runaway intron colonization only if the net selective disadvantage of each additional intron exceeded the power of random genetic drift or if the physical rate of intron removal somehow increased with intron number. The average number of introns per protein-coding gene in vertebrates ranges from 5.2 (Fugu) to 7.9 (Gallus), whereas the range for invertebrates is nearly nonoverlapping, 3.1 (Drosophila) to 5.5 (Bombyx). Thus, it is clear that the animal lineage experienced a basal increase in intron number, although it is an open question as to whether the numbers in vertebrates continued to expand (perhaps even today) as a consequence of the reduced efficiency of selection associated with low Nl. One analysis is consistent with the latter interpretation (Rogozin et al. 2003
).
Finally, it is worth considering the origin of the complex molecular machine that makes introns possible, the spliceosome. The most credible hypothesis involves descent from a group II intron (Sharp 1985
; Cech 1986
; Lambowitz and Zimmerly 2004
). Although these "self-splicing" introns have never been found in nuclear genes, their presence in eubacteria, archaea, and the organelles of plants, fungi, and numerous protists (Bonen and Vogel 2001
; Dai and Zimmerly 2002
, 2003
; Rest and Mindell 2003
) makes plausible the idea that they were present in the stem eukaryote. Deriving further support from the numerous structural and functional similarities between the excision mechanisms for group II introns and spliceosome-dependent introns (Michel and Ferat 1995
; Hetzer et al. 1997
; Burge, Tuschl, and Sharp 1999
; Sontheimer, Gordon, and Piccirilli 1999
; Shukla and Padgett 2002
; Valadkhan and Manley 2002
), the group II seed hypothesis postulates that the five small RNAs at the heart of the spliceosome are direct descendants of the major subunits of the catalytic core of a group II intron. However, the transition from a self-splicing group II intron to a large population of eukaryotic spliceosome-dependent introns would have involved a number of evolutionary challenges (Stoltzfus 1999
; Lynch and Richardson 2002
), not the least of which is the reassignment of functional fragments from group II introns associated with specific genes to a more generalized splicing mechanism servicing hundreds to perhaps thousands of genes. The proposed evolutionary pathway to group II intron fragmentation involves a series of effectively neutral steps (Cavalier-Smith 1991
; Stoltzfus 1999
) that are intrinsic to the subfunctionalization process (Force et al. 1999
). However, subfunctionalization is exceedingly unlikely in large populations, which impose difficulties for the establishment of functional fragments without mutational deterioration during the long time required for fixation (Lynch and Richardson 2002
). Thus, if the group II seed hypothesis is correct, it reinforces the view that the emergence of eukaryotes was accompanied by a reduction in Nl.
5' UTRs
Like introns, the 5'-untranslated leader sequences of mRNAs are liabilities for genes because they increase the size of the mutational target. Most notably, the 5' UTR serves as substrate for the mutational appearance of premature translation start codons (PSCs), which, because of the scanning mechanism for translation initiation in eukaryotes (Kozak 1994), can lead to N-terminal expansion of the protein product (in ≈1/3 of cases) or a shift in the reading frame and protein truncation (in ≈2/3 of cases). A deficit of ATG triplets in the exons of 5' UTRs but much less so in their introns provides compelling evidence of the negative translation-associated consequences of such mutations (Rogozin et al. 2001; Lynch, Scofield, and Hong 2005b), as does the incidence of human genetic disorders associated with the appearance of PSCs (Kozak 2002). This raises questions not only as to why 5' UTRs are present, but why they are so long. Contrary to the situation for introns, which vary in average size by over two orders of magnitude in different phylogenetic groups, the average 5'-UTR lengths of most eukaryotic lineages are remarkably constant, falling in the narrow range of 100 to 200 bp (Lynch, Scofield, and Hong 2005b).
Messenger RNAs are unlikely to require leader sequences as physical landing pads for the ribosome. For example, although archaebacteria employ transcriptional mechanisms similar to those of eukaryotes (Bell and Jackson 2001
), their 5' UTRs are generally no more than a dozen nucleotides in length and in some cases are completely absent (Slupska et al. 2001
). Similar situations are observed in eubacteria (Weiner, Herrmann, and Browning 2000
; Moll et al. 2002
) and in mitochondria (Gillham 1994
; Taanman 1999
). Moreover, 5' UTRs in the diplomonad Giardia often consist of just a single nucleotide (Iwabe and Miyata 2001
), and a substantial fraction of those in a variety of other unicellular eukaryotes are <25 bp, including those in the ciliate Euplotes crassus (Ghosh et al. 1994
), the amoeba Entamoeba histolytica (Singh et al. 1997
), and the trichomonad Trichomonas vaginalis (Liston and Johnson 1999
). Experimental evidence suggests that such diminutive leader sequences are sufficient to support translation in mammals and yeast, although the efficiency of translation can be reduced with UTRs shorter than ≈30 bp (van den Heuvel et al. 1989; Maicas, Shago, and Friesen 1990; Hughes and Andrews 1997).
To evaluate whether the expansion of eukaryotic 5'-UTR lengths might be a simple consequence of the reduced efficiency of selection associated with small Nl, a simple null model for the stochastic growth and contraction of UTRs based on mutational gains and losses of PSCs and transcription-initiation signals (TISs, e.g., the TATA box) has been developed (Lynch, Scofield, and Hong 2005b
). Under this model, all alleles are assumed to be effectively neutral with respect to each other, with the exception of two defective classes: alleles containing a harmful PSC between the TIS and the true translation start codon, and alleles for which the TIS has moved so close to the coding region that transcription initiates beyond the translation-start point. As ATG triplets are free to accumulate upstream of currently utilized transcription-initiation sites, this process results in a natural barrier to excessive growth of 5' UTRs, which can only expand in a 5' direction if the extension is devoid of harmful PSCs. Over time, the stochastic winking on and off of PSCs and TISs upstream of the true translation-start site results in an equilibrium L-shaped distribution of 5'-UTR lengths with a mean and variance that are quite similar to those observed within a wide variety of species, largely independent of the assumed length of the TIS (Lynch, Scofield, and Hong 2005b





