MBE Advance Access originally published online on March 17, 2006
Molecular Biology and Evolution 2006 23(5):997-1010; doi:10.1093/molbev/msk004
Proceedings of the SMBE Tri-National Young Investigators' Workshop 2005
Positional Conservation of Clusters of Overlapping Promoter-Like Sequences in Enterobacterial Genomes


* Evolutionary Genomics Department, Department of Energy Joint Genome Institute and Genomics Division, Lawrence Berkeley National Laboratory, Walnut Creek, CA; and
Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
E-mail: amhuerta{at}ccg.unam.mx.
Abstract
The selective mechanisms operating in regulatory regions of bacterial genomes are poorly understood. We have previously shown that, in most bacterial genomes, regulatory regions contain high densities of σ70 promoter-like signals that are significantly above the densities detected in nonregulatory genomic regions. In order to investigate the molecular evolutionary forces that operate in bacterial regulatory regions and how they affect the observed redundancy of promoter-like signals, we have undertaken a comparative analysis across the completely sequenced genomes of enteric γ-proteobacteria. This analysis detects significant positional conservation of promoter-like signal clusters across enterics, sometimes in spite of strong primary sequence divergence. This suggests that the conservation of the nature and exact position of specific nucleotides is not necessarily the priority of selection for maintaining the transcriptional function in these bacteria. We have further characterized the structural conservation of the regulatory regions of dnaQ and crp across all enterics. These two regions differ in essentiality and mode of regulation, the regulation of crp being more complex and involving interactions with several transcription factors. This results in substantially different modes of evolution, with the dnaQ region appearing to evolve under stronger purifying selection and the crp region showing the likely effects of stabilizing selection for a complex pattern of gene expression. The higher flexibility of the crp region is consistent with the observed lower conservation of global regulators in evolution. Patterns of regulatory evolution are also found to be markedly different in endosymbiotic bacteria, in a manner consistent with regulatory regions suffering some level of degradation, as has been observed for many other characters in these genomes. Therefore, the mode of evolution of bacterial regulatory regions appears to be highly dependent on both the lifestyle of the bacterium and the specific regulatory requirements of different genes. In fact, in many bacteria, the mode of evolution of genes requiring significant physiological adaptability in expression levels may follow patterns similar to those operating in the more complex regulatory regions of eukaryotic genomes.
Key Words: regulatory evolution; bacterial promoters; signal redundancy; comparative genomics
Introduction
Transcription initiation requires that RNA polymerase (RNAP) identifies and binds specific DNA sequences called promoters. In bacteria, RNAP has to associate with a small protein, known as σ factor, in order for this recognition to occur. The primary or housekeeping σ factor in Escherichia coli is encoded by the rpoD gene and is known as σ70 (Gross et al. 1998). The σ70 DNA promoter is characterized by two hexamers centered around positions −35 and −10 from the +1 site and separated by 15–21 bp, with consensus sequences TTGACA and TATAAT, respectively (Hawley and McClure 1983).
Computationally, promoter signals in genomic sequences can be predicted through the use of weight matrices, which describe base probabilities at each position of the promoter motifs, and a score can be assigned that measures the similarity of different promoter predictions to the canonical promoter sequence. Although sequences closer to the consensus generally increase promoter strength (Hawley and McClure 1983), we have estimated that sequence motifs with the highest scores in their regulatory region correctly identify no more than 40% of the known functional promoters in E. coli (Huerta and Collado-Vides 2003). In that study, many known functional promoters scored more than 2.5 SD below the mean score of computationally detected promoter signals. A score cutoff for signal recognition that recovered all known functional promoters also detected 4,093 additional signals, giving an average of 38 signals per 250-bp region upstream of transcribed genes. The density of promoter-like signals was shown to be much lower in coding and convergent noncoding regions (where promoters are not expected), suggesting that the abundance of detected signals in upstream regulatory regions is not simply due to methodological limitations. We have recently shown that differential densities between regulatory and nonregulatory regions are also detectable in most eubacterial genomes, with the exception of those that have experienced severe size reduction (P < 0.001) (A. M. Huerta, M. P. Francino, E. Morett, and J. Collado-Vides, unpublished data).
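The weight-matrix scoring described above can be illustrated with a minimal Python sketch. The toy alignment of −10 hexamers, the uniform background probabilities, the cutoff, and all function names are assumptions made for illustration; they are not the matrices, the cutoffs, or the PATSER implementation used in the study. The weights use a common pseudocount-corrected log-odds form.

```python
import math

# Hypothetical toy alignment of -10 hexamers (illustrative, not real E. coli data).
ALIGNED_SITES = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT"]
# Assumed background base probabilities of the genome being searched.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_weight_matrix(sites, background):
    """Return one {base: log-odds weight} dict per motif position."""
    n = len(sites)
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        weights = {}
        for base, p in background.items():
            # Pseudocount-corrected frequency, so unseen bases get a finite weight.
            freq = (column.count(base) + p) / (n + 1)
            weights[base] = math.log(freq / p)
        matrix.append(weights)
    return matrix


def score(motif, matrix):
    """Sum the position-specific weights of a candidate motif."""
    return sum(column[base] for base, column in zip(motif, matrix))


def scan(sequence, matrix, cutoff):
    """Return (offset, motif, score) for every window scoring at or above the cutoff."""
    width = len(matrix)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        s = score(window, matrix)
        if s >= cutoff:
            hits.append((i, window, s))
    return hits


MATRIX = build_weight_matrix(ALIGNED_SITES, BACKGROUND)
HITS = scan("GGGCTATAATGGG", MATRIX, cutoff=4.0)  # cutoff chosen arbitrarily
```

Lowering the cutoff in such a scan recovers weaker matches at the price of many more hits, which is the trade-off behind the large number of additional signals reported above.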
Given that promoters can accommodate many sequence variations in the −10 and −35 motifs and variable spacer lengths between them, promoter-like sequences can probably arise in a given genomic region by neutral mutation and random genetic drift. For eukaryotes, computer simulations have shown that this is indeed the case for transcription factor–binding sites, which can appear neutrally within microevolutionary timescales (Stone and Wray 2001). However, random mutations are more likely to destroy existing promoter sequences and erode promoter-like signals than to create new ones. In addition, it has been shown that selection acts to remove −10 and −35 consensus sequences in both coding and noncoding regions, implying that it is disadvantageous to maintain misplaced sites which can strongly bind σ70 and interfere with proper gene expression (Hahn, Stajich, and Wray 2003). Furthermore, the numbers of promoter-like signals in noncoding regions deviate significantly from the random expectations based on base composition, di- and trinucleotide contents in a majority of eubacteria (J. L. Froula and M. P. Francino, in preparation). Therefore, the high density of promoter-like signals in regulatory regions cannot be explained as a mere random accretion of this type of site through extended periods of time. Rather, we suggest that such density might be generated by some form of natural selection and that the difference between regulatory and nonregulatory regions is due to the joint effects of selection for promoter-like signals in regulatory regions and selection against them in the other areas of the genome. However, it is important to note that such a scenario does not imply that all promoter-like signals in regulatory regions are being maintained by selection because a specific function is required from each one of them. Several different types of selective regimes could act on regulatory regions and influence the abundance and distribution of promoter-like signals.
Models Implying Turnover of Functional Promoter Sequences
Because promoter-like sites are easy to create and destroy, the accumulation of these signals might be a consequence of selection for alternative functionally redundant promoter sequences at different times as large numbers of similarly effective promoter sites appear and disappear in the population. Numerous sequence comparison analyses have shown that there is a high turnover of transcription factor–binding sites in animal genomes, with alternative sites appearing and disappearing among closely related species, and even within populations (Ludwig et al. 2000; Stone and Wray 2001; Carter and Wagner 2002; Ludwig 2002).
- Compensatory selection
Promoter turnover should be facilitated by the fact that the level of sequence and structural degeneracy tolerated by functional promoter sites provides ample opportunity for compensatory evolution. Compensatory evolution occurs when a pair of mutations at different sites that would be singly deleterious produces normal fitness in combination and can therefore easily become fixed in a population (Kimura 1985). Most remarkably, Ludwig et al. (2000) experimentally demonstrated that, in Drosophila enhancer elements, sequence differences between species that have functional consequences when occurring singly can be compensated by coevolved differences at other sites in the enhancer.
Carter and Wagner (2002) modeled compensatory evolution for transcription factor–binding sites in populations evolving under different evolutionary parameters and concluded that compensatory evolution is especially likely when the population size is large (Carter and Wagner 2002). Given that bacteria have large populations, short generation times and high mutation rates, the neutral turnover of redundant sites should be even faster than in animal genomes. Potentially, the target of purifying selection for promoter function could shift often enough to preserve a large number of promoter-like sites at a given point in time. In other words, purifying selection could shift target among different neighboring or overlapping sequence motifs at such a high pace that redundant promoter-like signals would not have time to decay by mutation accumulation.
Furthermore, once a dynamic of compensatory selection is in place, the preservation of multiple redundant promoter-like sites could be further promoted by selection for genetic robustness because the existence of multiple potential promoter sites capable of initiating transcription at relatively similar rates would minimize the deleterious effect of genetic mutations on gene expression.
- Stabilizing selection
On the basis of their chimeric enhancer experiments, Ludwig et al. (2000) have proposed an alternative model of regulatory evolution in eukaryotes. They suggest that the evolution of enhancers could be modeled by treating the structure/function of these regions as a quantitative character. Because enhancers contain multiple binding sites and can accommodate sequence and structural variability, many independent mutations could actually contribute to variation in gene expression (Ludwig et al. 2000; Ludwig 2002). Stabilizing selection would then have to act on eukaryotic regulatory regions to maintain stable levels of gene expression. When the number of sites affecting a quantitative trait is large, the average selection coefficient per mutant site under stabilizing selection is small, and the rate of substitution can be high (Kimura 1981). Therefore, stabilizing selection on eukaryotic enhancers could maintain phenotypic constancy while allowing mutational turnover of functionally important sites.
In an analogous manner, the structure/function of bacterial regulatory regions could also be considered as a quantitative character. Given the observed redundancy of promoter-like sites and given that functional promoters can accommodate extensive variability in their motif sequences, spacer length, and position relative to the transcription start site, bacterial gene expression could also be affected by independent mutations at many sites. As proposed for eukaryotes, stabilizing selection would be required to maintain the level of gene expression within boundaries not affecting fitness. At the same time, stabilizing selection could preserve a substantial level of hidden genetic variation capable of affecting gene expression in bacterial populations. Such variation would permit rapid evolution of subtle changes in gene expression and could contribute substantially to the impressive ability of bacteria to adapt swiftly and precisely to environmental change.
Purifying Selection on Specific Functional Sites
Several factors could potentially further facilitate the observed maintenance of at least some of the promoter-like signals in bacterial regulatory regions by purifying selection. Different promoter-like signals could be involved in distinct functions affecting the regulation of gene expression. Overlapping promoter-like signals could play a regulatory role through functional interaction with the true transcription-initiation site; their effect on regulation could be negative if the interaction were competitive (Goodrich and McClure 1991) or positive if they helped channel RNAP into its required position for initiating transcription (Reznikoff et al. 1987). Promoter-like signals distinct from the functional promoter could also contribute to the regulation of gene expression by providing sequence traps that pause or arrest the RNAP complex during the early phase of transcription elongation, when σ70 may often be still attached to RNAP. There is biochemical evidence that σ70-dependent pausing occurs during the early elongation of some genes in E. coli and bacteriophage λ (Roberts et al. 1998; Brodolin et al. 2004; Nickels et al. 2004).
It is also possible that some promoter-like signals serve as functional promoters in alternate conditions encountered by the bacterial cell. Although generally the site of transcription initiation appears to be rather precise, 25% of the reported regulatory regions in E. coli are known to harbor multiple functional promoters, three on average (Huerta and Collado-Vides 2003). The availability of alternate promoters for a given gene could provide plasticity of gene expression in response to different environments, and some promoter-like signals could be relics of ancient promoters, abandoned in response to changes in regulatory requirements (A. M. Huerta, M. P. Francino, E. Morett, and J. Collado-Vides, unpublished data).
In order to investigate the factors that maintain a high density of promoter-like signals in bacterial regulatory regions, we are undertaking detailed structural analyses of homologous regulatory regions across the Enterobacteria. The availability of numerous complete genomes for enteric species and strains with different degrees of relatedness enables the use of comparative analyses to assess the conservation of the regulatory features defined in E. coli and to reveal the mode of evolution of regulatory sequences within this group. In turn, by understanding their mode of evolution, we should be able to appraise the functional relevance of the different features within regulatory regions and, for some of these features, to eventually formulate hypotheses about their specific functions.
Materials and Methods
Given that the σ70 regions that bind the −10 and −35 boxes of the promoter are identical across the enteric bacteria (fig. 1), we assumed that they recognize −10 and −35 motifs similar to those found in E. coli. Resting on this assumption, we used consensus frequency matrices derived from E. coli functional σ70 promoters to search for σ70 promoter-like signals in other bacteria, after calibration with the a priori base probabilities of the strictly noncoding regions of the target genome.
The strategy applied to select the best matrices describing E. coli functional σ70 promoters is reported in Huerta and Collado-Vides (2003). The strategy to find potential promoters in other bacterial genomes involved several steps:
- The base composition of the strictly noncoding regions of the genome to be analyzed was obtained and used to define the probabilities of each base in that genome.
- The frequency matrices obtained from the E. coli genome for the −10 and −35 boxes were calibrated with the base probabilities of the analyzed genome using PATSER (Hertz and Stormo 1999). The base probabilities of the noncoding regions were used as the a priori probabilities in the calculation of each element of the weight matrix using the formula:
$$\ln \frac{f(b, l)}{p(b)}$$
where f(b, l) is the relative probability of the base b at the position l of the input E. coli matrix and p(b) is the probability of the base b in the analyzed genome.
- In order to define which motifs would be considered significant promoter-like signals in the different target genomes, we determined minimal cutoff scores that would retain 98% of the original motifs from the functional E. coli σ70 promoters from which the frequency matrices were generated. With this aim, the E. coli motifs were rescored according to the base probabilities of each target genome by means of PATSER, and the mean and standard deviation of their scores were obtained. For each of the genomes, the score, Iseq, of a motif of size L, according to the base probabilities of that genome, was obtained as:
where n is the number of sequences in the input E. coli matrix alignment.
- Each analyzed genome was divided into three different regions according to National Center for Biotechnology Information (NCBI) genome annotations: noncoding regions between convergent genes, coding, and strictly noncoding regions (excluding convergent). Again using PATSER, we searched each region for −10 and −35 motifs with the corresponding calibrated matrices and retained the motifs that scored above the respective cutoff.
- We then identified the subsets of most likely functional promoters in every genome. With this aim, we took the predicted transcription units (single genes and operons) within each genome according to a method that relies on distributions of intergenic distances (Moreno-Hagelsieb and Collado-Vides 2002). The collection of 250-bp sequences upstream of the first gene for every transcription unit, named MURs for minimal upstream regions (Huerta et al. 2002), is likely to constitute the smallest set of regulatory regions required for the expression of all genes in a genome. The subset of promoter-like signals most likely to be functional was defined as the signals retained by the COVER function (Huerta and Collado-Vides 2003) within the MURs of a given genome.
- The COVER program is used to choose the most likely functional promoters in a regulatory region from the conglomerate of promoter-like signals identified by PATSER. COVER employs a "divide and conquer" strategy: first, each promoter-like signal is assigned to a class based on the length of its spacer region. Then, selection of the best promoter from each class is based on a well-known partial order relation (Tremblay and Manohar 1987), the "inclusion relation." Each promoter-like signal is qualified by two different scores, (1) the sum of the −10 and −35 scores obtained through PATSER and (2) a score based on the position of the −10 box relative to the gene start. A promoter A is said to be included by another promoter B if, and only if, the scores of promoter B are both better than those of promoter A. This inclusion relation defines a "cover set," or collection of separate subsets, for each spacer class. Within each class, the predicted functional promoter is defined as the upper border of the subset with the highest cardinality, that is, the one that includes the highest number of promoter signals.
- The signals detected by COVER are mostly found grouped into clusters. In order to characterize and compare signal clusters in a quantitative manner, we follow Huerta and Collado-Vides (2003) and define a "total cluster score" and an "average cluster score" for each cluster of signals detected by COVER as follows:
$$\text{total cluster score} = \sum_{i=1}^{n} \text{promoter score}_i, \qquad \text{average cluster score} = \frac{1}{n} \sum_{i=1}^{n} \text{promoter score}_i$$
where i denotes the ith promoter signal within a cluster of n signals. The promoter score sums up the PATSER scores of the −10 and −35 boxes and a spacer score, corresponding to the log of the relative frequency of such spacer length among known promoters.
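The selection and scoring steps listed above can be sketched in Python as follows. This is a simplified illustration rather than the COVER program itself: the `Signal` record, its field names, and the example values are hypothetical, and picking the signal that includes the largest number of other signals in its spacer class is used here as a stand-in for the "upper border of the subset with the highest cardinality." The cluster scores are taken to be the sum and the mean of the promoter scores in a cluster.

```python
from collections import defaultdict
from typing import NamedTuple


class Signal(NamedTuple):
    name: str
    spacer: int            # spacer length between the -35 and -10 boxes (bp)
    motif_score: float     # sum of the -10 and -35 weight-matrix scores
    position_score: float  # score for the -10 box position relative to the gene start


def includes(b, a):
    """True if signal b includes signal a, i.e. both scores of b are better."""
    return b.motif_score > a.motif_score and b.position_score > a.position_score


def pick_per_spacer_class(signals):
    """Simplified selection: one signal per spacer class, the one including most others."""
    classes = defaultdict(list)
    for s in signals:
        classes[s.spacer].append(s)
    chosen = {}
    for spacer, members in classes.items():
        # Ties are resolved in favor of the first signal encountered.
        chosen[spacer] = max(
            members, key=lambda b: sum(includes(b, a) for a in members)
        )
    return chosen


def cluster_scores(promoter_scores):
    """Return (total, average) score for a non-empty cluster of promoter scores."""
    total = sum(promoter_scores)
    return total, total / len(promoter_scores)


# Hypothetical signals from one regulatory region.
SIGNALS = [
    Signal("s1", 17, 5.0, 2.0),
    Signal("s2", 17, 4.0, 1.0),
    Signal("s3", 17, 6.0, 0.5),
    Signal("s4", 18, 3.0, 3.0),
]
CHOSEN = pick_per_spacer_class(SIGNALS)
```

In this toy region, s3 has the best motif score of its class but includes no other signal because its position score is poor, so s1 is selected for the 17-bp spacer class. This shows why a partial order on two scores can give a different answer from ranking by motif score alone.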
Results and Discussion
Table 1 presents the complete enterobacterial genomes analyzed in this study, sorted by their overall similarity to E. coli K12. This measure of similarity is based on the similarities of all orthologs between two genomes (Moreno-Hagelsieb et al. 2001).
Patterns of Individual Promoter-Like Signals in Enteric Genomes
We started our study by performing for the enteric species analyses of promoter-like signal density in regulatory, coding, and convergent noncoding regions, as described for E. coli in Huerta and Collado-Vides (2003).
In order to gain some insight into the possible mechanistic and evolutionary causes of this differential signal pattern, we have undertaken a more detailed analysis of the promoter-like sequences detected in the regulatory regions of enteric bacteria. The total conservation of the σ70 regions that bind the −10 and −35 motifs across these species (fig. 1) suggests that the genomes of other enterics should contain collections of promoter-like sequences which are, on average, similar to those of E. coli K12. Table 2 presents a general characterization of the promoter-like signals that we estimate to most likely represent functional promoters in the MURs (Huerta et al. 2002).
Because the specifically calibrated PATSER searches, the definition of the statistical cutoff for score significance, and the application of COVER were carried out independently in each of the genomes, the collections of recovered motifs represent the sequences most likely to be recognized by σ70 in the specific context of the target genome. Potentially, these collections could be substantially different from those of E. coli K12, although representing the same basic frequency matrices for the −10 and −35 motifs. However, the collections of motifs recovered from the different enteric genomes are similar to those of E. coli K12 in several respects. First, the consensus sequences for −10 and −35 motifs are completely identical to those of E. coli K12 in all enteric species, with the exception of insect endosymbionts, which present a slight AT enrichment. Second, the average scores for both motifs also display comparable magnitudes in the different enteric species. In fact, average motif scores are above those of E. coli K12 for all enteric genomes having a GC content more elevated than that of the K12 strain (all other E. coli, Shigella, Salmonella, and Erwinia). This is probably due to the slightly higher GC content conferring greater compositional difference between the AT-rich promoter motifs and the background genomic sequence. As we expected, the GC-poor genomes of insect endosymbionts have the lowest motif scores.
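The dependence of motif scores on background composition can be illustrated with a small Python sketch. It assumes a simple log-odds score in which every consensus base of the motif has the same within-motif probability; the probability of 0.8, the GC contents, and the function names are hypothetical values chosen for illustration, not parameters from the study.

```python
import math

MOTIF = "TATAAT"              # -10 consensus hexamer
MOTIF_BASE_PROBABILITY = 0.8  # assumed probability of each consensus base in the motif


def background(gc_content):
    """Base probabilities for a genome with the given GC fraction (strand-symmetric)."""
    return {
        "G": gc_content / 2,
        "C": gc_content / 2,
        "A": (1 - gc_content) / 2,
        "T": (1 - gc_content) / 2,
    }


def consensus_score(motif, gc_content):
    """Log-odds score of the motif against a background of the given GC content."""
    bg = background(gc_content)
    return sum(math.log(MOTIF_BASE_PROBABILITY / bg[base]) for base in motif)


# Hypothetical GC contents: a GC-richer genome, a reference genome, a GC-poor genome.
SCORES = {gc: consensus_score(MOTIF, gc) for gc in (0.52, 0.50, 0.26)}
```

Because A and T are rarer in a GC-rich background, the same AT-rich hexamer stands out more and receives a higher log-odds score there than in a GC-poor background, which matches the direction of the trend described above.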
Conserved Anatomy of σ70 Promoters in Enteric Genomes: Overlapping Promoter-Like Clusters
Regarding the organization of promoter-like signals into clusters, more than 80% of the signals identified by COVER as the most likely to be functional are found within clusters in every enteric species. This indicates that the strong connection between promoter functionality and signal clustering first identified in E. coli (Huerta and Collado-Vides 2003) is maintained throughout the enteric family. Almost all genes have at least one signal cluster, comprising a minimal average of 4.61 signals.
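As a rough illustration of how such clustering statistics could be computed, the following Python sketch groups signals into clusters and reports the fraction of signals that fall within clusters. It assumes that a cluster is a maximal run of signals whose coordinate intervals overlap and that a cluster needs at least two signals; both are working assumptions for this example, and the coordinates are hypothetical.

```python
def group_into_clusters(signals):
    """Group (start, end) intervals into maximal runs of overlapping intervals."""
    clusters = []
    current_end = None
    for start, end in sorted(signals):
        if clusters and start <= current_end:
            # Overlaps the running cluster: extend it.
            clusters[-1].append((start, end))
            current_end = max(current_end, end)
        else:
            # No overlap: open a new cluster.
            clusters.append([(start, end)])
            current_end = end
    return clusters


def fraction_clustered(signals, min_size=2):
    """Fraction of signals that belong to a cluster of at least min_size signals."""
    if not signals:
        return 0.0
    clusters = group_into_clusters(signals)
    clustered = sum(len(c) for c in clusters if len(c) >= min_size)
    return clustered / len(signals)


# Hypothetical signal coordinates (bp upstream of a gene start, as positive numbers).
SIGNALS = [(10, 40), (30, 60), (55, 85), (150, 180), (200, 230)]
CLUSTERS = group_into_clusters(SIGNALS)
```

Here the first three signals chain into one cluster even though the first and third do not overlap directly, while the last two remain isolated.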
In comparisons against every target genome, we have determined the percentage of E. coli K12 clusters for which a cluster is also detected in the second genome at an overlapping position in relation to the gene start. We call this relative positional conservation "phylogenetic positional overlap," in contrast to phylogenetic footprinting and shadowing in vertebrates, which refer to nucleotide sequence conservation (Tagle et al. 1988; Gumucio et al. 1993; Aparicio et al. 1995; Sumiyama, Kim, and Ruddle 2001). Table 3 shows the comparison of E. coli K12 to each enteric genome, taking into account only the subset of E. coli K12 genes with orthologs in the target genome. For 90% or more of the shared genes between E. coli K12 and each target genome, there is at least one cluster with a positional overlap. When all clusters for shared genes are considered, the percentage of clusters with positional overlap generally decays with genetic distance but remains above 55% across the enteric family.
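The two percentages reported above can be computed along the lines of the following Python sketch. The data layout (a dictionary mapping each gene to a list of cluster intervals, with coordinates given relative to the gene start), the function names, and the example coordinates are assumptions made for illustration; only genes present in both genomes are compared, as in the text.

```python
def intervals_overlap(a, b):
    """True if two (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def percent(part, whole):
    """Percentage helper that tolerates an empty denominator."""
    return 100.0 * part / whole if whole else 0.0


def positional_overlap(reference, target):
    """Compare reference clusters with target clusters, gene by gene.

    Both arguments map gene names to lists of (start, end) cluster
    coordinates relative to the gene start.
    """
    shared = sorted(set(reference) & set(target))
    total = conserved = genes_with_overlap = 0
    for gene in shared:
        hits = [
            any(intervals_overlap(c, t) for t in target[gene])
            for c in reference[gene]
        ]
        total += len(hits)
        conserved += sum(hits)
        genes_with_overlap += int(any(hits))
    return {
        "shared_genes": len(shared),
        "pct_clusters_overlapping": percent(conserved, total),
        "pct_genes_with_overlap": percent(genes_with_overlap, len(shared)),
    }


# Hypothetical cluster coordinates (negative values are upstream of the gene start).
REFERENCE = {
    "dnaQ": [(-80, -40), (-160, -120)],
    "crp": [(-200, -160)],
    "lacZ": [(-50, -10)],  # no ortholog in the target, so it is ignored
}
TARGET = {"dnaQ": [(-75, -35)], "crp": [(-100, -60)]}
RESULT = positional_overlap(REFERENCE, TARGET)
```

Note that the comparison uses positions only, so a reference cluster counts as conserved even when the overlapping target cluster has a different nucleotide sequence.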
We wanted to investigate whether the E. coli K12 clusters having positional overlaps in other genomes have properties that distinguish them from those that do not. In order to characterize and compare signal clusters in a quantitative manner, we have computed a total cluster score and an average cluster score for each cluster of signals detected by COVER. The total cluster score (ScoreTOT) defined in Huerta and Collado-Vides (2003)
Conservation and Flexibility in Upstream Regulatory Regions: Two Modes of Evolution?
We would like to further characterize the structural conservation of specific regulatory regions across all enterics. We have started our analyses with the regulatory regions of the genes dnaQ and crp. These two genes were chosen because both perform important cellular functions and are therefore present in a majority of bacterial species. However, crp has no ortholog in the enteric endosymbionts of insects, which are also known to lack numerous other regulatory proteins (Wilcox et al. 2003). This is not surprising, given that cyclic adenosine 3',5'-monophosphate (cAMP) receptor protein (CRP) is a transcription factor that provides versatility in mediating the effects of environmental and metabolic signals on transcriptional regulation. In fact, we know that global regulators are poorly conserved in bacteria (I. Lozada-Chavez, S. C. Janga, and J. Collado-Vides, unpublished data). This mediation is accomplished via the CRP-cAMP complex, which positively regulates numerous catabolic and noncatabolic functions (Botsford and Harman 1992). Because endosymbionts have a highly confined lifestyle, the metabolic versatility they require must be low. In contrast, dnaQ encodes the epsilon subunit of DNA polymerase III, which performs 3'→5' exonucleolytic proofreading activity. This proofreading function is essential, and dnaQ is present across eubacteria.
In addition, the regulation of dnaQ and crp has been experimentally characterized in E. coli, which will help guide the comparison of regulatory regions across enteric species. dnaQ has been shown to have at least two functional promoters with different physiological roles, and the balance of transcription from the two promoters depends mostly on the cellular concentration of RNAP. The regulation of crp expression is more complex as it is repressed by its own gene product and a second transcription factor. The control of crp transcription in response to the physiological state of the cell appears to be achieved by oscillations in the composition of regulatory nucleoprotein complexes, resulting from competition between proteins for overlapping binding sites (Gonzalez-Gil, Kahmann, and Muskhelishvili 1998).
The differences in essentiality and mode of regulation of these two genes make it interesting to compare the evolutionary patterns undergone by their regulatory regions. Finally, for both of these genes, the upstream open reading frames (ORFs) are conserved across enteric genomes, and hence, the intervening noncoding regions are very likely to be orthologous, which warrants their comparison for studying the mode of sequence evolution in regulatory regions.
- The regulatory region of dnaQ
dnaQ has been experimentally shown to be regulated by two promoters, with +1 sites positioned at 50 bp (P1) and 133 bp (P2) from the gene start. At low RNAP concentrations, the downstream promoter (P1) is utilized preferentially, but the upstream promoter (P2) is utilized as well when RNAP concentration increases (Nomura, Aiba, and Ishihama 1985; Nomura, Fujita, and Ishihama 1985).
The feature map in figure 3 shows the positions and relative scores of the promoters selected by the COVER function in the 250 bp immediately upstream of the dnaQ gene start. COVER identifies promoter signals corresponding to P1 in sequence and relative position in all enteric genomes, with the exception of the insect endosymbionts. Nevertheless, all the reduced genomes still contain promoter-like signals which extensively overlap the position of P1. Inspection of the sequence alignments for this region (Supplementary Material online) indicates that the bases overlapping E. coli K12 P1 in the endosymbionts are poorly conserved, implying that promoter-signal positional overlap has been maintained in the face of high sequence divergence. This suggests that the identification of phylogenetic positional overlap can be employed to detect probable functional conservation among regions undergoing high rates of evolution.
In addition, figure 3 shows that P1 is overlapped or closely neighbored by a second promoter-like signal. This signal is less conserved across enterics in position, score and nucleotide sequence but is still detectable in most species. Thus, both P1 itself and its immediate vicinity seem to be fairly conserved in large enteric genomes.
The start codon of dnaQ is located 65 bp downstream of the start codon of the reversely transcribed ribonuclease gene rnhA. The second experimentally mapped promoter P2 is completely located within rnhA. P2 also presents phylogenetic overlaps in most enterics, with the notable exception of the Salmonella strains, which lack COVER promoter-like signals in the corresponding position. Wherever P2 is present, its score is below that of P1. In E. coli K12, P2 is overlapped by one additional promoter-like signal, but this specific arrangement is only preserved in strain CFT073. Therefore, the promoter P2, which functions under more restricted conditions, is less conserved in sequence, score, and immediate context than the main promoter, P1.
- The regulatory region of crp
crp expression is repressed by its gene product, CRP, and by a small and abundant DNA-binding protein called factor for inversion stimulation (FIS). The best-characterized promoter of crp, crp1, is located 167 bp away from the gene start. Experimental evidence indicates that an additional promoter, crp2, is located downstream of crp1 and can promote transcription initiation at three closely spaced sites, located 73, 79, and 80 bp downstream from the crp1 initiation site. A fairly constant level of mRNA is transcribed from crp1 during all phases of growth, but, in addition, a shorter mRNA transcribed from crp2 appears during stationary phase. Several CRP- and FIS-binding sites with different affinities are located in the crp regulatory region, some of them overlapping the crp2 transcription-initiation sites. Transcription from crp2 is prevented during exponential growth by the binding of FIS, which blocks transcription by steric hindrance (Gonzalez-Gil, Kahmann, and Muskhelishvili 1998).
In E. coli K12, the crp gene is separated by a large intergenic region of 302 noncoding bases from the start of the upstream predicted ORF yhfA. Figure 4 shows the phylogenetic overlap of promoter-like signal clusters across enteric genomes for the first 250 bp of this intergenic regulatory region. The experimentally mapped locations of −10 and −35 motifs for crp1 in E. coli K12 do not correspond exactly with the motif positions selected by COVER, but the general position of crp1 overlaps that of two COVER promoter signals (the furthest upstream in fig. 4). This signal pair is present across E. coli, Shigella, and Salmonella, although in Salmonella the exact positions and relative scores of the signals differ, and in E. coli CFT073 two more signals are present in the same cluster, without disturbing the position or strength of the conserved signal pair. There is also a single COVER-predicted strong promoter overlapping the crp1 position in Yersinia and Photorhabdus. This positional overlap occurs in spite of strong sequence divergence between the E. coli clade and these species, including numerous indels which obstruct nucleotide sequence alignment. In fact, the nucleotide sequences of the −10 and −35 COVER promoters corresponding to crp1 do not seem to be orthologous to the E. coli motifs according to gapped primary sequence alignments (Supplementary Material online). Notably, there is no COVER-predicted promoter near the crp1 position in Erwinia.
Similarly to the situation for crp1, COVER identifies a 2- or 3-signal cluster in E. coli, Shigella, and Salmonella which positionally overlaps crp2, although the COVER signals do not share the exact −10 and −35 boxes determined for crp2 through in vitro experiments. The corresponding regions in the other enteric species (Erwinia, Yersinia, and Photorhabdus) contain a large cluster or a pair of closely spaced clusters that positionally overlap crp2, but extend over 60 bp downstream, including three to five additional signals. Some of these downstream signals display the highest scores for these species over the whole 250 bp region, above those of the signals that directly overlap crp2. Again, the positional overlap of crp2 in these species occurs in spite of strong nucleotide divergence, presence of indels, and very poor primary sequence alignment.
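The notion of positional overlap used above can be illustrated with a minimal sketch. It is not the COVER procedure or the authors' pipeline: the function names, the fixed signal length, and all coordinates are hypothetical, and positions are assumed to be measured in bp upstream of the gene start so that regions can be compared without a nucleotide alignment.

```python
from typing import Dict, List, Tuple

# A cluster is a closed interval (start, end), in bp upstream of the gene start.
Interval = Tuple[int, int]


def cluster_signals(starts: List[int], signal_len: int) -> List[Interval]:
    """Merge promoter-like signals whose spans overlap into clusters."""
    clusters: List[Interval] = []
    for start in sorted(starts):
        end = start + signal_len
        if clusters and start <= clusters[-1][1]:
            # Signal overlaps the current cluster: extend it.
            clusters[-1] = (clusters[-1][0], max(clusters[-1][1], end))
        else:
            clusters.append((start, end))
    return clusters


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def positional_conservation(
    reference: List[Interval], others: Dict[str, List[Interval]]
) -> Dict[Interval, List[str]]:
    """Map each reference cluster to the species with an overlapping cluster."""
    return {
        ref: [
            species
            for species, clusters in others.items()
            if any(intervals_overlap(ref, c) for c in clusters)
        ]
        for ref in reference
    }


# Hypothetical signal start positions (bp upstream of the gene start).
SIGNAL_LEN = 30  # assumed span of one promoter-like signal
reference = cluster_signals([90, 160, 165], SIGNAL_LEN)
others = {
    "species_A": cluster_signals([170], SIGNAL_LEN),
    "species_B": cluster_signals([40], SIGNAL_LEN),
    "species_C": cluster_signals([100, 185], SIGNAL_LEN),
}
for cluster, species in positional_conservation(reference, others).items():
    print(cluster, species)
```

Because the comparison uses only coordinates relative to the gene start, a cluster can be scored as positionally conserved even when the underlying sequences cannot be aligned, which is the situation described for Yersinia and Photorhabdus. Indels between the signal and the gene start would shift these coordinates, so a real analysis would need some tolerance for that.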
In addition, there is a COVER promoter located at −16 from the start of crp for which no experimental evidence is available, whose nucleotide sequence is conserved in E. coli, Shigella, and Salmonella. However, in E. coli CFT073, this signal is not retained by COVER, in spite of complete sequence conservation. The 250-bp upstream regulatory region in this strain differs from other E. coli by only two nucleotide transitions, at positions −99 and −129 relative to the gene start. The −99 T to C transition occurs within the experimentally detected −35 box of crp2 and removes one COVER signal in the crp2 region cluster. The −129 C to T change occurs four bases downstream of crp1 and creates two new COVER promoter signals, without affecting the presence and strength of the two signals detected in other E. coli. Remarkably, these two changes are sufficient to significantly alter the profile of signals retained by COVER throughout the region because they also alter the relative strengths of signals within their respective spacer length classes. As a consequence, the −16 promoter disappears from the CFT073 COVER set, and a new signal is retained at −67 from the gene start. This illustrates how a few nucleotide changes between closely related strains can dramatically alter the regulatory feature map of a region.
- Differences between the dnaQ and crp regulatory regions
First, it should be noted that in the crp regulatory region there is less correspondence between the signals detected by COVER in E. coli K12 and the functional promoter sites detected experimentally. This may be due to the increased complexity of this region, which contains binding sites for at least two repressors. In terms of rate and mode of evolution, the crp region appears to be more variable among enteric species, both in terms of the overall feature landscape and in the conservation of the experimentally mapped promoters. Most notably, the main promoter of E. coli, crp1, is not detected in the regulatory region of Erwinia. In addition, when present, crp1 and crp2 show substantially more variation among species than the P1 and P2 promoters of dnaQ, in terms of sequence, precise localization, and surrounding promoter-like signals.
These differences suggest that the upstream regions of dnaQ and crp may be affected to different extents by the types of natural selection that potentially operate in regulatory regions. The generally higher degree of conservation of regulatory features in the dnaQ regulatory region seems to indicate the operation of stronger levels of purifying selection. This would be consistent with the essential nature of this gene, which is likely to be subject to similar expression requirements in all bacteria. In contrast, the crp regulatory region, which has a more complex structure and organization, as required to regulate a gene whose expression has to be highly tuned to environmental conditions, seems to present lower overall conservation of regulatory features. This suggests a less important role for purifying selection. In fact, the mode of evolution of the crp regulatory region could be more consistent with the operation of stabilizing selection. Indeed, the high complexity of this region could signify that many mutations throughout the region would be likely to have functional effects. In that case, the crp regulatory region could behave as a quantitative character, with the average effect of mutating a particular nucleotide being relatively small, which could cause high rates of sequence evolution and motif turnover in this region.
The Special Case of Intracellular Symbionts
The promoter-like signal patterns and mode of evolution of endosymbiotic genomes differ substantially from those of the other enterics. This can be reconciled with many other features and patterns of evolution already described for the genomes of animal parasites and symbionts with an intracellular or predominantly host-restricted lifestyle (Moran and Plague 2004
factors, and regulatory proteins (Madan Babu 2003). The analyses of regulatory regions in insect endosymbionts presented here support the notion that a certain level of general degradation in regulatory functions is occurring in these species. First, the insect endosymbionts present slight deviations from the −10 and −35 consensi obtained for E. coli K12 and all other enteric species (one G to T change in each consensus). Average scores for both motifs are also much lower in the endosymbionts than in all other enterics. Also, whereas the experimentally identified dnaQ P1 promoter is conserved in both sequence and relative position in all other enteric genomes, the bases forming E. coli K12 P1 are very poorly conserved in the endosymbionts. Although all the reduced genomes present promoter-like signals which extensively overlap the position of P1, the differences in promoter position, score, and arrangement in these species may have altered the level of dnaQ expression. Given the fundamental role of this gene, this might have significant consequences on fitness.
With regard to signal density and clustering, it is interesting to note that the regulatory regions of endosymbionts present promoter-like signal clusters for 100% of the genes, with average signal numbers (5.49–6.00) above those of the other enterics (4.56–5.28; table 2). Furthermore, high promoter-like signal densities in endosymbionts are not limited to regulatory regions but are pervasive across all areas of the genome (fig. 2). We have recently shown that significant differences between promoter-like signal densities in regulatory versus nonregulatory regions are also absent in other small genomes from nonenteric parasites of animals with an intracellular or predominantly host-restricted lifestyle, including rickettsias, chlamydias, and mycoplasmas (A. M. Huerta, M. P. Francino, E. Morett, and J. Collado-Vides, unpublished data).
All these patterns can be explained by a general effect of increased AT mutation pressure in these reduced genomes. Indeed, high A/T mutation pressure will increase the likelihood of appearance of AT-rich motifs, explaining the high promoter-like signal densities across all areas of the genome. A/T pressure is also the likely explanation of the G to T changes in the −10 and −35 consensus sequences for these species. Low average scores for the −10 and −35 motifs would also be expected due to the lack of compositional contrast between these AT-rich motifs and the also AT-rich genomic background. In light of the total conservation of the σ70 regions that bind the −10 and −35 motifs in endosymbionts (fig. 1), it is improbable that the observed changes in consensus sequences and average scores of the −10 and −35 motifs result from coevolution of these sites to track changes in the binding specificity of σ70.
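The effect of base composition on the chance appearance of AT-rich motifs can be made concrete with a back-of-the-envelope sketch. It assumes a deliberately simple background model (independent positions, equal A and T frequencies, equal G and C frequencies) and uses the textbook −10 hexamer TATAAT as the example motif; neither the model nor the AT fractions below are taken from the analyses in this paper.

```python
def chance_match_probability(motif: str, at_fraction: float) -> float:
    """Probability that a random site matches `motif` exactly.

    Assumed background model (not from the paper): positions are independent,
    P(A) = P(T) = at_fraction / 2 and P(G) = P(C) = (1 - at_fraction) / 2.
    """
    base_prob = {
        "A": at_fraction / 2,
        "T": at_fraction / 2,
        "G": (1 - at_fraction) / 2,
        "C": (1 - at_fraction) / 2,
    }
    prob = 1.0
    for base in motif.upper():
        prob *= base_prob[base]
    return prob


# Hypothetical AT fractions: a balanced genome versus an AT-rich reduced genome.
balanced = chance_match_probability("TATAAT", 0.50)
at_rich = chance_match_probability("TATAAT", 0.75)
print(f"balanced genome: {balanced:.6f}")
print(f"AT-rich genome:  {at_rich:.6f}")
print(f"fold increase:   {at_rich / balanced:.2f}")
```

Under these assumptions the fold increase is (0.375 / 0.25)^6 = 1.5^6, roughly 11-fold, for an all-A/T hexamer. The same reasoning suggests why motif scores that measure contrast against the genomic background would drop: an AT-rich motif is less distinctive when the background is itself AT-rich.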
Therefore, the evolution of regulatory regions and the distribution of promoter-like sequences in these genomes are totally consistent with the general scenario of genome degradation due to a diminished effectiveness of natural selection causing high mutational pressure in endosymbiotic bacteria (Moran 1996). Although the phylogenetic positions of the enteric insect endosymbionts have been hard to ascribe, recent analyses suggest that they form a monophyletic clade (Francino, Santos, and Ochman 2003; Herbeck, Degnan, and Wernegreen 2005). The observed changes in regulatory regions and overall patterns of promoter-like signal densities probably reflect an increased A/T mutation pressure that initiated at the base of this clade.
Finally, it is interesting to contrast the evolution of the endosymbiont regulatory regions with that of the invertebrate-associated bacterium P. luminescens. The genome of this species behaves in the same manner as other large enteric genomes, presenting no signs of reduction or degradation. Remarkably, this species is not endosymbiotic, and, in fact, has a highly complex lifestyle in that it needs to switch between a symbiotic relationship with a nematode and a parasitic relationship with an insect during the course of a single cellular generation (Joyce, Watson, and Clarke 2005).
| Conclusions, Insights, and Speculations |
|---|
Our preliminary analyses of structural and organizational evolution in bacterial regulatory regions suggest that the conservation of the nature and exact position of specific nucleotides is not necessarily the priority of selection for maintaining regulatory function. Although our analyses so far are strictly computational and therefore cannot establish functional conservation, the positional conservation of similar promoter-like signal clusters with regulatory potential, in the face of nucleotide sequence divergence, suggests that function could be selectively maintained without requiring the presence of specific nucleotide sequences. This implies that phylogenetic positional overlap should be a good technique for detecting probable functional conservation among regions undergoing high rates of evolution and even for identifying homoplastic functional similarity among regulatory regions having independent evolutionary origins. Similar strategies that measure conservation of functional sequence features have also been proposed for analyzing the highly complex and modular regulatory regions of eukaryotes (Berman et al. 2002).
In bacteria, the mode of evolution of a regulatory region may be highly dependent on the properties of the regulated gene. For genes requiring a stable level of expression throughout the cell cycle and under different environmental conditions, the predominant force acting on the evolution of the regulatory regions may be purifying selection on a few features with highly specific functionality. In contrast, the regulatory evolution of genes whose expression needs more precise fine-tuning may be dominated by stabilizing selection, allowing for higher turnover of nucleotide sequence and feature motifs. In spite of the fundamental differences in gene regulation between bacteria and eukaryotes (Struhl 1999), this scenario is highly reminiscent of the mode of evolution operating in the more complex regulatory regions of eukaryotic cells.
In spite of the general tendency toward conservation of cluster positions in enteric genomes, comparison between closely related E. coli strains also suggests that, at shallow divergences, a few nucleotide changes can cause profound alterations in the functional anatomy of a regulatory region. Although these two observations may seem contradictory, they might actually represent alternative sides of the same coin: the relatively loose rapport between primary nucleotide sequence and regulatory information. In sharp contrast to protein and RNA coding regions, regulatory function seems to be encoded at higher levels of complexity, involving the physical properties of a given sequence stretch and its competitiveness for protein binding and other interactions with their neighboring sites. Because DNA regulatory sequences act in cis, they represent at the same time the genotype and the phenotype of their regulatory function. Although this complicates the analysis of their structure and evolution, comparative approaches that look beyond simple sequence conservation and prioritize positional feature overlap should clarify the mode of evolution, function, and information encoding of regulatory regions.
| Supplementary Material |
|---|
NCBI reference numbers of the analyzed species and the ClustalW sequence alignments of the regulatory regions of dnaQ and crp and their neighboring genes can be found at http://www.ccg.unam.mx/Computational_Genomics/PromoterTools/MBE_2006.
| Acknowledgements |
|---|
We greatly thank the organizers for selecting this work for the 2005 Society for Molecular Biology and Evolution Tri-National Young Investigators' Workshop. Marta L. Wayne and two anonymous reviewers helped us through thoughtful and constructive criticism that improved enormously the quality of this paper. We are also grateful to Jeff L. Froula for sharing unpublished results, editorial help with the manuscript, and fruitful discussions. Victor del Moral, Cesar Bonavides, and the Center for Genomic Sciences at the National University of Mexico (UNAM) provided computational support. This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program and by the University of California, Lawrence Berkeley National Laboratory under contract no. DE-AC03-76SF00098.
| Footnotes |
|---|
Marta Wayne, Associate Editor
| References |
|---|
Aparicio, S., A. Morrison, A. Gould, J. Gilthorpe, C. Chaudhuri, P. Rigby, R. Krumlauf, and S. Brenner. 1995. Detecting conserved regulatory elements with the model genome of the Japanese puffer fish, Fugu rubripes. Proc. Natl. Acad. Sci. USA 92:1684–1688.
Berman, B. P., Y. Nibu, B. D. Pfeiffer, P. Tomancak, S. E. Celniker, M. Levine, G. M. Rubin, and M. B. Eisen. 2002. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl. Acad. Sci. USA 99:757–762.
Berman, B. P., B. D. Pfeiffer, T. R. Laverty, S. L. Salzberg, G. M. Rubin, M. B. Eisen, and S. E. Celniker. 2004. Computational identification of developmental enhancers: conservation and function of transcription factor binding site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol. 5:R61.
Botsford, J. L., and J. G. Harman. 1992. Cyclic AMP in prokaryotes. Microbiol. Rev. 56:100–122.
Brodolin, K., N. Zenkin, A. Mustaev, D. Mamaeva, and H. Heumann. 2004. The sigma-70 subunit of RNAP induces lac UV5 promoter proximal pausing of transcription. Nat. Struct. Mol. Biol. 11:551–557.
Carter, A. J., and G. P. Wagner. 2002. Evolution of functionally conserved enhancers can be accelerated in large populations: a population-genetic model. Proc. Biol. Sci. 269:953–960.
Clark, M. A., N. A. Moran, and P. Baumann. 1999. Sequence evolution in bacterial endosymbionts having extreme base compositions. Mol. Biol. Evol. 16:1586–1598.
deHaseth, P. L., and T. W. Nilsen. 2004. Molecular biology. When a part is as good as the whole. Science 303:1307–1308.
Francino, M. P., S. R. Santos, and H. Ochman. 2003. Phylogenetic relationships of bacteria with special reference to enteric species. in M. Dworkin, ed. The prokaryotes, 3rd edition. A Handbook on the Biology of Bacteria: Ecophysiology, Isolation, Identification, Applications, Springer-Verlag, New York. ([WWW document] URL http://141.150.157.117:8080/prokPUB/chaphtm/398/COMPLETE.htm).
Galas, D. J., M. Eggert, and M. S. Waterman. 1985. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J. Mol. Biol. 186:117–128.
Gasch, A. P., A. M. Moses, D. Y. Chiang, H. P. Fraser, M. Berardini, and M. B. Eisen. 2004. Conservation and evolution of cis-regulatory systems in Ascomycete fungi. PLoS Biol. 2:e398.
Gonzalez-Gil, G., R. Kahmann, and G. Muskhelishvili. 1998. Regulation of crp transcription by oscillation between distinct nucleoprotein complexes. EMBO J. 17:2877–2885.
Goodrich, J. A., and W. R. McClure. 1991. Competing promoters in prokaryotic transcription. Trends Biochem. Sci. 16:394–397.
Gralla, J., and J. Collado-Vides. 1996. Organization and function of transcription regulatory elements. Pp. 1232–1246 in F. C. Neidhart, R. Curtiss, J. Ingraham, E. C. C. Lin, K. B. Low, B. Magasanik, W. S. Reznikoff, M. Riley, M. Schaechter, and H. E. Umbarger, eds. Escherichia coli and Salmonella, cellular and molecular biology. American Society for Microbiology, Washington, D.C.
Gross, C. A., C. Chan, A. Dombroski, T. Gruber, M. Sharp, J. Tupy, and B. Young. 1998. The functional and regulatory roles of sigma factors in transcription. Cold Spring Harb. Symp. Quant. Biol. 63:141–155.
Gumucio, D. L., D. A. Shelton, W. J. Bailey, J. L. Slightom, and M. Goodman. 1993. Phylogenetic footprinting reveals unexpected complexity in trans-factor binding upstream from the epsilon globin gene. Proc. Natl. Acad. Sci. USA 90:6018–6022.
Hahn, M. W., J. E. Stajich, and G. A. Wray. 2003. The effects of selection against spurious transcription factor binding sites. Mol. Biol. Evol. 20:901–906.
Harley, C. B., and R. P. Reynolds. 1987. Analysis of E. coli promoter sequences. Nucleic Acids Res. 15:2343–2361.
Hawley, D. K., and W. R. McClure. 1983. Compilation and analysis of Escherichia coli promoter DNA sequences. Nucleic Acids Res. 11:2237–2255.
Herbeck, J. T., P. H. Degnan, and J. J. Wernegreen. 2005. Nonhomogeneous model of sequence evolution indicates independent origins of primary endosymbionts within the enterobacteriales (gamma-Proteobacteria). Mol. Biol. Evol. 22:520–532.
Herbeck, J. T., D. J. Funk, P. H. Degnan, and J. J. Wernegreen. 2003. A conservative test of genetic drift in the endosymbiotic bacterium Buchnera: slightly deleterious mutations in the chaperonin groEL. Genetics 165:1651–1660.
Herbeck, J. T., D. P. Wall, and J. J. Wernegreen. 2003. Gene expression level influences amino acid usage, but not codon usage, in the tsetse fly endosymbiont Wigglesworthia. Microbiology 149:2585–2596.
Hertz, G. Z., and G. D. Stormo. 1999. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15:563–577.
Huerta, A. M., and J. Collado-Vides. 2003. Sigma70 promoters in Escherichia coli: specific transcription in dense regions of overlapping promoter-like signals. J. Mol. Biol. 333:261–278.
Huerta, A. M., J. D. Glasner, H. Jin, F. D. Blattner, R. M. Gutierrez-Rios, and J. Collado-Vides. 2002. GETools: gene expression tool for analysis of transcriptome experiments in Escherichia coli. Trends Genet. 8:217–218.
Joyce, S. A., R. J. Watson, and D. H. Clarke. 2005. The regulation of pathogenicity and mutualism in Photorhabdus. Curr. Opin. Microbiol. 9:1–6.
Kimura, M. 1981. Possibility of extensive neutral evolution under stabilizing selection with special reference to non-random usage of synonymous codons. Proc. Natl. Acad. Sci. USA 78:5773–5777.
Kimura, M. 1985. The role of compensatory neutral mutations in molecular evolution. J. Genet. 64:7–19.
Lisser, S., and H. Margalit. 1993. Compilation of E. coli mRNA promoter sequences. Nucleic Acids Res. 21:1507–1516.
Ludwig, M. Z. 2002. Functional evolution of non-coding DNA. Curr. Opin. Genet. Dev. 12:634–639.
Ludwig, M. Z., C. Bergman, N. H. Patel, and M. Kreitman. 2000. Evidence for stabilizing selection in a eukaryotic enhancer element. Nature 403:564–567.
Madan Babu, M. 2003. Did the loss of sigma factors initiate pseudogene accumulation in M. leprae? Trends Microbiol. 11:59–61.
Moran, N. A. 1996. Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proc. Natl. Acad. Sci. USA 93:2873–2878.
Moran, N. A., H. E. Dunbar, and J. L. Wilcox. 2005. Regulation of transcription in a reduced bacterial genome: nutrient-provisioning genes of the obligate symbiont Buchnera aphidicola. J. Bacteriol. 187:4229–4237.
Moran, N. A., and G. R. Plague. 2004. Genomic changes following host restriction in bacteria. Curr. Opin. Genet. Dev. 14:627–633.
Moreno-Hagelsieb, G., and J. Collado-Vides. 2002. A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 18(Suppl. 1):S329–S336.
Moreno-Hagelsieb, G., V. Trevino, E. Perez-Rueda, T. F. Smith, and J. Collado-Vides. 2001. Transcription unit conservation in the three domains of life: a perspective from Escherichia coli. Trends Genet. 17:175–177.
Moyle, H., C. Waldburger, and M. M. Susskind. 1991. Hierarchies of base pair preferences in P22 ant promoter. J. Bacteriol. 173:1944–1950.
Nickels, B. E., J. Mukhopadhyay, S. J. Garrity, R. H. Ebright, and A. Hochschild. 2004. The sigma-70 subunit of RNAP mediates a promoter proximal pause at the lac promoter. Nat. Struct. Mol. Biol. 11:544–550.
Nomura, T., H. Aiba, and A. Ishihama. 1985. Transcriptional organization of the convergent overlapping dnaQ-rnh genes of Escherichia coli. J. Biol. Chem. 260:7122–7125.
Nomura, T., N. Fujita, and A. Ishihama. 1985. Promoter selectivity of E. coli RNA polymerase: analysis of the promoter system of convergently-transcribed dnaQ-rnh genes. Nucleic Acids Res. 13:7647–7661.
Reznikoff, W. S., K. Bertrand, C. Donnelly, M. Krebs, L. E. Maquat, M. Peterson, L. Wray, J. Yin, and X. M. Yu. 1987. Complex promoters. Pp. 105–113 in W. S. Reznikoff, R. R. Burgess, J. E. Dahlberg, C. A. Gross, M. Thomas Record, and M. P. Wickens, eds. RNA polymerase and the regulation of transcription. Elsevier, New York.
Roberts, J. W., W. Yarnell, E. Bartlett, J. Guo, M. Marr, D. C. Ko, H. Sun, and C. W. Roberts. 1998. Antitermination by bacteriophage lambda Q protein. Cold Spring Harb. Symp. Quant. Biol. 63:319–25.
Salgado, H., S. Gama-Castro, M. Peralta-Gil et al. (14 co-authors). 2006. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res. 34(Database issue):D394–D397.
Stone, J. R., and G. A. Wray. 2001. Rapid evolution of cis-regulatory sequences via local point mutations. Mol. Biol. Evol. 18:1764–1770.
Struhl, K. 1999. Fundamentally different logic of gene regulation in eukaryotes and prokaryotes. Cell 98:1–4.
Sumiyama, K., C. B. Kim, and F. Ruddle. 2001. An efficient cis-element discovery method using multiple sequence comparisons based on evolutionary relationships. Genomics 71:260–262.
Tagle, D. A., B. F. Koop, M. Goodman, J. L. Slightom, D. L. Hess, and R. T. Jones. 1988. Embryonic epsilon and gamma globin genes of a prosimian primate (Galago crassicaudatus). Nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J. Mol. Biol. 203:439–455.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.
Tremblay, J. P., and R. Manohar. 1987. Discrete mathematical structures with application to computer science. McGraw Hill, New York.
van Helden, J. 2003. Regulatory sequence analysis tools. Nucleic Acids Res. 31:3593–3596.
Wilcox, J. L., H. E. Dunbar, R. D. Wolfinger, and N. A. Moran. 2003. Consequences of reductive evolution for gene expression in an obligate endosymbiont. Mol. Microbiol. 48:1491–1500.
Wosten, M. M. 1998. Eubacterial sigma-factors. FEMS Microbiol. Rev. 22:127–150.