MBE Advance Access originally published online on April 6, 2006
Molecular Biology and Evolution 2006 23(6):1269-1285; doi:10.1093/molbev/msk013
Research Article
Evolution of Paralogous Genes: Reconstruction of Genome Rearrangements Through Comparison of Multiple Genomes Within Staphylococcus aureus
* Department of Medical Genome Sciences, Graduate School of Frontier Science, University of Tokyo, Tokyo, Japan;
Graduate Program in Biophysics and Biochemistry, Graduate School of Science, University of Tokyo, Tokyo, Japan;
Division of Molecular Biology, Institute of Medical Science, University of Tokyo, Tokyo, Japan;
Division of Pathology, Immunology and Microbiology, Graduate School of Medicine, University of Tokyo, Tokyo, Japan; and || Laboratory of Genome Informatics, National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Japan
E-mail: ikobaya{at}ims.u-tokyo.ac.jp.
Abstract
Analysis of the evolution of paralogous genes in a genome is central to our understanding of genome evolution. Comparison of closely related bacterial genomes, which has provided clues as to how genome sequences evolve under natural conditions, would help in such an analysis. For the species Staphylococcus aureus, whole-genome sequences have been decoded for seven strains. We compared their DNA sequences to detect large genome polymorphisms and to deduce the mechanisms of genome rearrangement that formed each of them. We first compared strains N315 and Mu50, which make one of the most closely related strain pairs, at single-nucleotide resolution to catalogue all the middle-sized (more than 10 bp) to large genome polymorphisms such as indels and substitutions. These polymorphisms include two paralogous gene sets, one in a tandem paralogue gene cluster for toxins in a genomic island and the other in a ribosomal RNA operon. We also focused on two other tandem paralogue gene clusters and type I restriction-modification (RM) genes on the genomic islands. Then we reconstructed the rearrangement events responsible for these polymorphisms, in the paralogous genes and the others, with reference to the other five genomes. For the tandem paralogue gene clusters, we were able to infer the sequences for homologous recombination generating the change in the repeat number. These sequences were conserved among the repeated paralogous units, likely because of their functional importance. The sequence specificity (S) subunit of type I RM systems showed recombination, likely at the homology of a conserved region, between the two variable regions for sequence specificity. We also noticed novel alleles in the ribosomal RNA operons and suggested a role for illegitimate recombination in their formation. These results revealed the importance of recombination involving long conserved sequences in the evolution of paralogous genes in the genome.
Key Words: genome comparison; genome rearrangements; deletion; recombination; restriction-modification
Introduction
Various forms of genome rearrangements are important in the plasticity of genomes and contribute greatly to genome evolution. One of the specific issues relevant to genome evolution through rearrangement is evolution of multiple homologous genes present in a genome, often called paralogous genes. Earlier, it was addressed through comparison within a genome and between distantly related genomes. For several years, comparative analyses of closely related bacterial genome sequences have been providing numerous novel insights about genome evolution in general and paralogue evolution in particular (Alm et al. 1999
Close genome comparison revealed various genome rearrangements involving repeated genes within a genome and prevalence of horizontal gene transfer throughout bacterial and archaeal worlds. For example, our laboratory has compared two genome sequences within species Helicobacter pylori and found a characteristic mode of insertion of restriction-modification (RM) gene complexes (Nobusato et al. 2000). We have been developing a tool, Comparative Genome Analysis Tool (CGAT), to compare two bacterial genome sequences to identify middle-sized to large genome polymorphisms and to help inference about their formation (Uchiyama, Higuchi, and Kobayashi 2000). Availability of as many as seven complete genome sequences within the species Staphylococcus aureus (Kuroda et al. 2001; Baba et al. 2002; Holden et al. 2004; Ohta et al. 2004; Gill et al. 2005) (http://www.genome.ou.edu.) provides a unique opportunity for such an analysis.
Staphylococcus aureus represents a gram-positive eubacterium with low GC content (Kuroda et al. 2001). Staphylococci are normal inhabitants on skin and mucous membranes of warm-blooded animals and can become pathogens. In addition, this bacterium has developed resistance to practically all types of antibiotics (Hiramatsu et al. 2001). The conserved synteny in the seven genomes helped in whole-genome comparison with each other, which revealed that a major diversity of S. aureus strains was associated with a variety of mobile elements: prophages, transposons, insertion sequences, and other genomic islands (Lindsay and Holden 2004; Gill et al. 2005). The intraspecific variety with respect to pathogenicity and antibiotic resistance was, at least partly, found associated with these mobile elements (Kuroda et al. 2001; Baba et al. 2002; Holden et al. 2004; Gill et al. 2005). A recent comparison between N315 and Mu50 genomes exhaustively inspected all the putative open reading frames (ORFs) for difference (Ohta et al. 2004).
In S. aureus, little is known about the molecular mechanisms of gene transfer and genome rearrangements. Conjugation was hypothesized for apparent chromosomal replacements inferred from multilocus sequence typing (Robinson and Enright 2004). A class of genomic islands can move from one cell to another with the aid of a helper phage (Ruzin, Lindsay, and Novick 2001). Staphylococcal cassette chromosomes and several types of genomic islands are believed to integrate into the chromosome by their own site-specific recombinase (Katayama, Ito, and Hiramatsu 2000; Ruzin, Lindsay, and Novick 2001).
The genome sequence analysis of N315 and Mu50 strains (Kuroda et al. 2001), in which this laboratory was involved, prompted us to study rearrangement mechanisms that have formed their differences. We focused on middle-sized (more than 10 bp) to large polymorphisms, which were located at the toxin paralogue gene cluster in genomic island νSaα, the ribosomal RNA operon, and other short repeats with variable configurations.
Two genomic islands, νSaα and νSaβ, identified in all the strains of S. aureus examined so far (Baba et al. 2002), carry three tandem paralogous gene clusters: staphylococcal superantigen-like (ssl or set) (Lina et al. 2004) gene cluster and lipoprotein (lpl) gene cluster in νSaα and serine protease (spl) gene cluster in νSaβ. For the ssl cluster in the seven strains, Fitzgerald et al. (2003) suggested that an ancestral strain with a full complement of ssl genes underwent multiple gene losses by some recombination mechanisms. Here we have considered molecular mechanisms of genome rearrangements in more detail for the ssl cluster as well as for the lpl cluster and the spl cluster.
Another feature of these genomic islands is the presence of type I RM genes linked with those tandem gene clusters. A type I RM system comprises three genes, hsdR, hsdM, and hsdS, which are tightly linked (Murray 2000) with interesting exceptions (Schouler et al. 1998; Rocha and Blanchard 2002). In the seven S. aureus genomes, only hsdM and hsdS are found both in νSaα and νSaβ. A homologue of hsdR (SA0189 for N315) has been identified at a locus distant from these two islands in all these genomes (Kuroda et al. 2001). Though the deduced amino acid sequences for hsdM are almost identical among all the 14 alleles, those for hsdS show divergence (Baba et al. 2002). We have characterized the variation found in the hsdS genes.
The ribosomal RNA (rrn) operon is involved in various genome rearrangements. Homologous recombination via extensive homology of rDNAs can cause genome-wide reorganization (Hill 1999) and homogenization of rDNAs and their flanking regions (Liao 2000). The 16S-23S rDNA intergenic spacer region is known to be polymorphic (Gurtler and Stanisich 1996). Several types of rearrangements there were reported and explained also in terms of homologous recombination (Harvey et al. 1988; Lan and Reeves 1998; Privitera et al. 1998; Gurtler 1999). For S. aureus, 10 different types of sequences of this region were reported (Gurtler and Barrie 1995; Gurtler and Mayall 2001). These analyses included two complete genomes and three unfinished genomes available at the time. We examined whole sets of rrn operons of the seven strains and considered how their differences were generated.
Materials and Methods
Sequence Sources
Annotated whole-genome sequences of the following S. aureus strains have been released: N315 and Mu50 (Kuroda et al. 2001
Software and Programs
BlastN or BlastP (http://www.ncbi.nlm.nih.gov/Blast/) for homology search and ClustalW (http://www.ebi.ac.uk/clustalw/) for sequence alignment were used with default parameters. The resulting alignments were refined manually in Se-Al (http://evolve.zoo.ox.ac.uk/software.html?id=seal) when necessary. The dot plot analyses for pairwise and multiple comparison were performed with DOTTUP and POLYDOT (http://emboss.sourceforge.net), respectively. Each of these programs was downloaded from the ftp site and run on Mac OS X (version 10.3.7).
Screening of Polymorphisms Between Strains N315 and Mu50
The genome-wide polymorphisms between N315 and Mu50 were screened by pairwise alignment on the CGAT (Uchiyama, Higuchi, and Kobayashi 2000). CGAT organizes the genome alignment by bidirectional best-hit analysis based on all-against-all BlastN comparison of pairs of 2-kb segments with 200-bp overlaps. All the polymorphic sites were identified by visual inspection as breaks within the whole-genome alignment in 500-bp windows of CGAT. This manual screening covered all the macroscopic differences of more than 10 bp.
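The segmentation step described above can be illustrated with a minimal sketch. It is not the CGAT implementation: the function name `overlapping_segments` is hypothetical, and the sketch assumes that "2-kb segments with 200 bp overlapping" means each window shares its last 200 bp with the next one (a step of 1,800 bp).

```python
def overlapping_segments(genome_length, size=2000, overlap=200):
    """Yield half-open (start, end) coordinates of overlapping windows.

    Assumption: consecutive windows share `overlap` bases, so the step
    between window starts is size - overlap. The final window is
    truncated at the end of the sequence.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    start = 0
    while start < genome_length:
        end = min(start + size, genome_length)
        yield start, end
        if end == genome_length:
            break  # the last window already reaches the end
        start += step


# Example on a toy 5-kb sequence length.
print(list(overlapping_segments(5000)))
```

Each window would then be used as a BlastN query against the windows of the other genome, keeping only bidirectional best hits; that search step is outside this sketch.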
Criteria for Classification of Putative Rearrangement Events
We tentatively classified recombination into four types: site-specific recombination, transposase-mediated recombination, homologous recombination, and illegitimate recombination.
Occurrence of homologous recombination and some sort of illegitimate recombination between similar DNA sequences was inferred from the presence of similar sequences at the sites of rearrangement in genome comparison. Dependence of homologous recombination frequency on the homology length varies among organisms and processes (Fujitani, Yamamoto, and Kobayashi 1995). In Bacillus subtilis, the closest relative of S. aureus so far reported with respect to this process, an apparent lower limit is reported to be 70 bp (Khasanov et al. 1992). Frequency of homologous recombination decreases very rapidly as the two sequences diverge (Datta et al. 1997; Vulic et al. 1997; Fujitani and Kobayashi 1999). From these data, we regarded repeats more than 80 bp long that share at least 90% nucleotide sequence identity in the present data as strong candidates for remnants of homologous recombination.
Illegitimate recombination events can be classified into two: those between short repeats and those between sequences with no or very little similarity (Michel 1999). The former class can take place between sequences with much shorter homology than homologous recombination through such molecular events as simple slipped misalignment, sister chromosome slipped misalignment, or single-strand annealing (Lovett 2004). The latter one was reported to be caused by errors of DNA gyrase and topoisomerase I in Escherichia coli (Michel 1999). We expediently set the threshold between the two types of illegitimate recombination at 3 bp: 3 bp or more versus less than 3 bp. In these mechanisms, there is strong negative dependence of the recombination frequency on the distance between the repeats (Chedin et al. 1994; Lovett et al. 1994), and it is estimated that less than hundreds of base pairs should be required for efficient processes due to their involvement with replication machinery (Michel 1999; Lovett 2004). Therefore, we assumed that the distance between the short recombining sequences has to be less than 1 kb in order to support an event of illegitimate recombination.
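The thresholds stated in this section can be collected into a small decision rule. The sketch below is only one reading of the criteria: the function name, the labels, and the choice to apply the 1-kb distance limit to both classes of illegitimate recombination are assumptions, and site-specific and transposase-mediated recombination are not covered.

```python
def classify_event(repeat_len, identity_percent, distance):
    """Classify a putative rearrangement from its flanking direct repeats.

    repeat_len       -- length of the flanking repeat in bp (0 if none)
    identity_percent -- sequence identity between the repeats (0 to 100)
    distance         -- distance between the two repeats in bp

    Thresholds follow the text: more than 80 bp at 90% identity or more
    for homologous recombination, a 3-bp boundary between the two
    illegitimate classes, and less than 1 kb between the repeats.
    """
    if repeat_len > 80 and identity_percent >= 90:
        return "homologous"
    if distance >= 1000:
        return "unsupported"  # too far apart for illegitimate recombination
    if repeat_len >= 3:
        return "illegitimate_short_repeats"
    return "illegitimate_no_similarity"


# Hypothetical inputs: a 188-bp perfect repeat, and a 12-bp repeat 300 bp apart.
print(classify_event(188, 100, 5000))
print(classify_event(12, 100, 300))
```

The labels are candidates only; as the text notes for long repeats, a recA-independent replication error cannot be excluded from sequence comparison alone.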
Larger Polymorphisms and Smaller Polymorphisms
The 19 polymorphic sites of interest, which exclude indels of mobile elements, were examined in CGAT for possible linkage with other mobile elements or genome-wide interspersed repeats (see Results and Discussion, Macroscopic Polymorphisms Between N315 and Mu50). The polymorphisms that are supposedly related to other mobile elements or genome-wide interspersed repeats were grouped as "larger polymorphisms" (see table 1, Feature). The others, the sequences of which were accordingly not homologous to the rest of the chromosome or other mobile elements, were designated as "smaller polymorphisms" (see table 2 below).
Analysis of Smaller Polymorphisms
We employed DOTTUP to arrange a dot plot matrix between N315 and Mu50 with appropriate parameters for each site of smaller polymorphisms. In cases of multiple repeats estimated as more than three copies by eye, Tandem Repeats Finder (http://tandem.bu.edu/trf/trf.html) was applied with default parameters in order to define each repeat unit and its copy number. The sequence identities of direct repeats neighboring indels were extracted manually from their optimized multiple alignments by ClustalW. Lengths of these direct repeat sequences were decided as the maximal lengths keeping 90% base pair identity or more.
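The rule for repeat length (the maximal length that keeps 90% identity or more) can be sketched as a search for the longest window of an aligned repeat pair that still meets the identity cut-off. This is an illustration, not the manual procedure used here; the function name is hypothetical, the two sequences are assumed to be already aligned to equal length with `-` for gaps, and gapped columns are counted as mismatches.

```python
def longest_window_with_identity(a, b, min_percent=90):
    """Return (start, end) of the longest aligned window whose identity
    is at least min_percent. Coordinates are half-open alignment columns.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # prefix[i] = number of matching columns among the first i columns
    prefix = [0]
    for x, y in zip(a, b):
        prefix.append(prefix[-1] + int(x == y and x != "-"))
    n = len(a)
    best_start, best_end = 0, 0
    for start in range(n):
        # Only windows longer than the current best are worth testing.
        for end in range(n, start + (best_end - best_start), -1):
            matches = prefix[end] - prefix[start]
            # Integer comparison avoids floating-point rounding at the cut-off.
            if 100 * matches >= min_percent * (end - start):
                best_start, best_end = start, end
                break
    return best_start, best_end


# Toy alignment: the first 10 columns share 9 of 10 bases.
print(longest_window_with_identity("AAAAAAAAAATTTT", "AAAAACAAAAGGGG"))
```

The quadratic scan is adequate for repeats of a few hundred base pairs, which is the scale discussed in this study.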
The corresponding sites on the other five genomes were identified by BlastN search with N315 and Mu50 sequences as queries. Then the multiple dot plot comparison between the seven genomes was performed on POLYDOT.
Analysis of Paralogous Genes
For three tandem paralogue clusters within genomic islands, νSaα and νSaβ, the nucleotide sequences of the entire cluster and adjacent 200-bp regions were subjected to the dot plot analysis by POLYDOT with a 15-bp window.
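The idea behind a word-based dot plot with a 15-bp window can be shown with a minimal sketch. It is a simplified stand-in for the EMBOSS programs, not their implementation: the function name `dot_plot` is hypothetical, only exact forward-strand word matches are reported, and no plotting is done.

```python
def dot_plot(seq_x, seq_y, window=15):
    """Return (i, j) pairs where the word of length `window` starting at
    position i in seq_x is identical to the word starting at j in seq_y.
    """
    # Index every word of seq_y by its start positions.
    index = {}
    for j in range(len(seq_y) - window + 1):
        index.setdefault(seq_y[j:j + window], []).append(j)
    dots = []
    for i in range(len(seq_x) - window + 1):
        for j in index.get(seq_x[i:i + window], []):
            dots.append((i, j))
    return dots


# Toy example with a 4-bp word: the repeat in the second sequence
# produces a second diagonal, offset from the main one.
print(dot_plot("ACGTAC", "ACGTACGTAC", window=4))
```

Runs of consecutive dots along a diagonal correspond to the lines described for the figures, and dots off the main diagonal in a self-to-self comparison correspond to repeats within one sequence.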
For the ssl and spl clusters, repeat sequences flanking each indel were sought by eye on the optimized multiple alignment (ClustalW) of the three recombination joints in the relevant genome pairs, and their length and sequence identity were calculated. The N315-Mu50 strain pair gave the best alignment for the red indel in the ssl cluster, the N315-MW2 pair gave the best for the orange indel in the ssl cluster, and the N315-COL pair gave the best for the spl cluster.
For the lpl cluster and the spl cluster, we constructed Neighbor-Joining phylogenetic trees on the nucleotide p-distances (Nei and Kumar 2001) and then defined each clustering group so that its members share 85% or more sequence identity. This phylogenetic grouping of the homologous ORFs was displayed in 12 and 7 distinct colors in figure 3A and B (see below) and figure 4A and B (see below), respectively.
For the hsdS genes of type I RM systems, the predicted amino acid sequences of those on νSaα and νSaβ in the sequenced strains and on their putative homologue on etd pathogenicity island in strain TY114 (Yamaguchi et al. 2002
The overall structure of rrn operon was set by arranging 16S-23S-5S rDNAs, and then the positions of these rDNAs were mapped on the circular genomes (see fig. 6A below). The locus names were assigned as illustrated in figure 6A (see below). For alleles of the 16S-23S rDNA intergenic spacer region in rrn operon, sequences of this region from the seven strains were aligned together with sets of reported sequences of other S. aureus strains, H11, ATCC33925, D46, and SAU39769 (Gurtler and Barrie 1995
Results and Discussion
Macroscopic Polymorphisms Between N315 and Mu50 Strains
We identified, in total, 27 macroscopic differences (more than 10 bp) between the genome sequences of two strains, N315 and Mu50. In S. aureus, indels of mobile genetic elements and their relatives had been identified, and in many cases, flanking direct repeats were identified at each end of the apparent insert (Kuroda et al. 2001
Sa1mu, and a genomic island
Sa3, which were reported earlier (Kuroda et al. 2001
The remaining 19 sites were examined further. Through examination by CGAT (see Materials and Methods), we categorized those polymorphisms into two: six of larger polymorphisms (table 1) and 13 of smaller polymorphisms (table 2). Among the larger polymorphisms, for an indel within serine-aspartate dipeptides repeat region of clfA (ID #7) and for many localized polymorphisms in homologous prophages φSa3 (ID #13), mechanisms for rearrangements were discussed in the literature (McDevitt and Foster 1995; Foster and Hook 1998; Brussow, Canchaya, and Hardt 2004). Below (Polymorphisms in Three Tandem Paralogue Clusters Within Genomic Islands, νSaα and νSaβ and the following sections), we consider detailed mechanisms for the following two, an indel within ssl tandem gene cluster in genomic island νSaα (ID #2) and a substitution within rrn operon (ID #4), referring to the five genomes other than N315 and Mu50 as well.
Smaller Polymorphisms Between N315 and Mu50 Strains
These smaller polymorphisms were considered as generated through local events because sequences homologous to these regions were not found elsewhere in the genome in the seven sequenced S. aureus strains as analyzed in CGAT (Materials and Methods). Based on a dot plot matrix for the local sequences around them (Materials and Methods), the 13 smaller polymorphisms were classified into four types (fig. 1 and table 2). Type (A) (fig. 1A) consists of indels with simple configuration with flanking direct repeats; a single pair of dispersed repeats in one strain (horizontal axis) has apparently deleted one of the repeats together with the intervening sequence in the other strain (vertical axis). The lengths of the flanking repeats are equal or more than 10 bp. Type (B) (fig. 1B) also represents indels, yet, unlike type (A), the length of the flanking direct repeats, if any, is less than 3 bp. The threshold between types (A) and (B) is set considering their possible mechanisms (see Criteria for Classification of Putative Rearrangement Events in Materials and Methods). Type (C) (fig. 1C) can also be considered as indels with flanking direct repeats. However, in this type, a sequence is repeated multiple times (copy number >2), and the copy number varies between the two genomes. The unit of the repeats apparently consists of two parts. Type (D) (fig. 1D), simple substitution, is seen in the polymorphism ID #8.
What are the likely mechanisms leading to these polymorphisms? The indels of types (A) and (C) can be explained by recombination events between the flanking direct repeats. In order to distinguish between homologous recombination and illegitimate recombination, we examined the lengths and sequence identities of the flanking direct repeats (table 2). For the polymorphisms ID #3 and ID #19, with 188-bp and 384-bp homology, respectively, homologous recombination appears likely, although recA-independent replication error is not excluded. The remainder are likely to have been generated by various types of illegitimate recombination requiring sequence similarity. For type (B), the lengths of the flanking repeats, if any, are insufficient for either homologous recombination or illegitimate recombination involving short repeats. Illegitimate recombination between sequences with no or very little similarity can explain these polymorphisms.
Table 2 also refers to the corresponding regions of these polymorphic loci in the five strains other than N315 and Mu50. Notably, in all the cases of types (A) and (B), all the other five strains matched either N315 or Mu50; that is, either N315 or Mu50 showed an exceptional pattern among the seven sequenced strains. Furthermore, the exception, whether it be N315 or Mu50, is the shorter type in all the cases. This directionality indicates that these polymorphisms were generated by deletion, as opposed to insertion, in either the N315 or the Mu50 strain lineage.
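The reasoning in this paragraph can be written as a small decision rule: if the five other strains all agree with one of the two strains, the disagreeing strain is the exception, and a shorter exception points to a deletion in that lineage. The sketch below is an illustration only; the function name and the locus lengths in the example are hypothetical, and comparing lengths is a simplification of comparing the actual sequences.

```python
def infer_indel_direction(locus_len):
    """locus_len maps strain name -> length of the locus in bp and must
    include 'N315' and 'Mu50'. Returns (exceptional_strain, event), or
    None when no simple pattern exists (as for type (C) polymorphisms)."""
    n315, mu50 = locus_len["N315"], locus_len["Mu50"]
    others = [v for k, v in locus_len.items() if k not in ("N315", "Mu50")]
    if n315 == mu50 or not others:
        return None
    if all(v == n315 for v in others):
        odd, odd_len, common_len = "Mu50", mu50, n315
    elif all(v == mu50 for v in others):
        odd, odd_len, common_len = "N315", n315, mu50
    else:
        return None
    event = "deletion" if odd_len < common_len else "insertion"
    return odd, event


# Hypothetical lengths: Mu50 is shorter than all six other strains.
example = {"N315": 500, "Mu50": 380, "MW2": 500, "MSSA476": 500,
           "MRSA252": 500, "COL": 500, "NCTC8325": 500}
print(infer_indel_direction(example))
```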
On the other hand, with type (C), there is no such simple tendency as seen in the types (A) and (B). Instead, in these polymorphisms, several of the other five strains possess the same repeat unit although they differ in the repeat number (see footnotes of table 2). These observations can be interpreted in terms of generation of a composite unit of the type s-t-s (see fig. 1C) and its expansion and contraction through recombination involving the repeats.
Polymorphisms in Three Tandem Paralogue Clusters Within Genomic Islands, νSaα and νSaβ
ssl Cluster
Our initial pairwise comparison between N315 and Mu50 genomes confirmed an indel in ssl gene cluster (table 1, ID #2), which results in loss of one ORF, ssl4 in figure 2B, in Mu50 (Kuroda et al. 2001; Fitzgerald et al. 2003) (see also Introduction).
Figure 2A represents multiple dot plots with these regions from the seven sequenced genomes. The successive lines with some breaks in the diagonal part of each rectangle for a pairwise comparison indicate conserved gene order with occasional indels. The presence of only a few black lines or dots other than the diagonal one in each rectangle for self-to-self plot is consistent with the divergence of the tandem genes within a strain. In all, four groups of indels were identified in the dot plots for pairwise comparison (Materials and Methods) among the seven genomes and are labeled Id in red, orange, blue, and green, respectively, in figure 2A.
Through each of the two break points of the red indel, a horizontal line and a vertical line are drawn across the entire plots in red in figure 2A. In the self-to-self plots there, dots are visible at the cross section of two red lines. These indicate direct repeats flanking the red indel. The direct repeats flanking the orange indel were identified similarly with the help of orange lines (fig. 2A). These indels likely represent deletions that have been generated by recombination between these flanking direct repeats.
The positions of these repeats are shown on the maps of this gene cluster in figure 2B as red arrows and as orange arrows, which reveals that the relative position of the repeat to the closest ORF is nearly identical for the two repeats. The map also shows conservation of the repeats among the strains.
Through a multiple alignment of the relevant recombination joints (with ClustalW, Materials and Methods), we defined repeats flanking each indel in terms of length and sequence identity. Figure 2C shows the repeats flanking the red indel (fig. 2A and B) between N315 and Mu50. The 45-bp perfect match is seen at the putative recombination point (underlined in yellow in fig. 2C), and the length of homology could be elongated to 108 bp (444388–444495 [N315], 445630–445737 [N315], and 470146–470253 [Mu50]) if 10% sequence divergence was allowed. This 45-bp sequence codes for the initiating ATG, the putative ribosome-binding site, and a part of the putative signal peptides for secretion (Williams et al. 2000).
For the orange indel in figure 2A and B, the length of the perfect match was 61 bp (5'-GTGACATGAAACAATGTGGAAAACATAATTAAATTGAGGGAAAGTGTGAATAGTTAAAAAA-3') between N315 and MW2, and the repeats' length could be elongated to 84 bp if 10% divergence was allowed (434809–434892 [MW2], 435929–436012 [MW2], and 446612–446695 [N315]). There was a 184-bp sequence between this repeated sequence and the nearest ORF to its right in the case of N315.
A similar analysis of the remaining two indels (in blue and green in fig. 2A) revealed only very short flanking direct repeats, which are insufficient for homologous recombination. Even if 10% sequence divergence was allowed, only 16-bp-long repeats were detected between NCTC8325 and COL for the apparent deletion (in blue) in COL (392547–392564 [NCTC8325], 396915–396931 [NCTC8325], and 474767–474783 [COL]). Only 13-bp-long repeats were detected between MSSA476 and MRSA252 for the apparent deletion (in green) in MRSA252 (435859–435871 [MSSA476], 436936–436948 [MSSA476], and 459325–459337 [MRSA252]).
lpl Cluster
Similar tandem gene clusters, lpl cluster on νSaα and spl cluster on νSaβ (see Introduction) were examined for mechanisms responsible for interstrain difference. Contrary to the ssl cluster, the same multiple dot plot analysis of the lpl cluster (fig. 3A) yielded multiple discrete lines in parallel to a diagonal line in most of the rectangles for pairwise comparison, which suggests extensive genome rearrangements. We noticed that many of their endpoints mapped at a specific site in all the ORFs. This became apparent when a vertical line and a horizontal line were drawn through this site for all the ORFs. This feature indicates that the highly identical repeats exist at the boundaries of the homology and nonhomology. Furthermore, many of the cross sections between these red lines coincided with a (black) dot in figure 3A. This lattice pattern of the dots implies that this sequence is repeated within a genome as well as between genomes.
These lpl ORFs were classified into 12 phylogenetic groups (Materials and Methods) displayed in distinct colors in figure 3A and B. Their maps in figure 3B revealed variation in the gene order, which suggests extensive genome rearrangements. Individual rearrangements such as indels, duplications, or translocations were hard to identify in the pairwise comparison.
Figure 3B also shows the above repeats on ORF maps. All the copies of this repeated sequence are located at a similar position near the 5' terminus of ORF. Notably, in N315 and Mu50, the dots indicating these patterned repeats are also seen upstream of the ORFs (colored in white; SA0399 for N315) not annotated as a lipoprotein (displayed as R7 for N315 in fig. 3B). These two ORFs may represent remnants of the 3' part of an lpl gene. In support of this view, the deduced amino acid sequence for these ORFs aligned well with the C-terminal region of other lpl genes (data not shown).
The repeated sequences in N315 are aligned in figure 3C. In the 27 pairs among all the possible pairs (45), the repeated length is 80 bp or more if 10% divergence is allowed. Judging from the lengths and sequence identities between them, homologous recombination likely has occurred between these repeats to generate the observed gene order changes.
spl Cluster
Similar dot plots for the spl cluster are shown in figure 4A together with ORF maps in figure 4B. The spl ORFs were classified into seven phylogenetic groups (Materials and Methods) displayed in distinct colors in figure 4A and B. The right part of this cluster seems to be conserved between all the examined strains except for MRSA252, in which the yellow ORF (splB in N315) is truncated, presumably by another rearrangement event, leaving its 3' terminus (SAR1907). In the left half, some rearrangements are seen among the seven strains. Their simplest description would be as follows. An ORF in violet (splF in N315) in MW2, MSSA476, and MRSA252 appears duplicated in N315, Mu50, NCTC8325, and COL. An ORF in blue is present in NCTC8325, COL, and MRSA252 but not in the other four strains.
The dot plot patterns corresponding to these two types of polymorphisms are labeled as Dp and Id, respectively, in figure 4A. As in the lpl cluster, many of the endpoints of the discrete lines corresponding to these two groups of rearrangements mapped at 5' terminus of several ORFs (fig. 4A). This became apparent when a vertical line and a horizontal line were drawn through these sites. Furthermore, some of the cross sections between these red lines coincided with a (black) dot in figure 4A. This pattern implies that this sequence is repeated within a genome. These results suggest that these repeats are associated with those rearrangements via some homology-dependent recombination events. These very similar repeats present at the recombination joints are drawn as arrows in figure 4B.
In search of the sequences involved in the recombination event, a multiple alignment was computed for the indel seen between N315 and COL as an example (fig. 4C). The length of the perfect match is 12 bp (underlined in blue in fig. 4C), and the repeat length could be elongated to 223 bp covering the initiating ATG and the 108 nucleotides encoding the signal peptides (Reed et al. 2001) (underlined in gray in fig. 4C) if 10% divergence was allowed (1918257–1918036 [COL], 1917391–1917169 [COL], and 1861882–1861661 [N315]). It was therefore suggested that this indel represents a deletion caused by homologous recombination between the repeats.
Polymorphisms in Type I RM System Genes Within Genomic Islands, νSaα and νSaβ
An hsdM and hsdS gene pair is found in both νSaα and νSaβ genomic islands of the seven strains. A similar hsdM and hsdS gene pair was identified in another genomic island, etd, of S. aureus strain TY114 (Yamaguchi et al. 2002). Earlier work pointed out that the deduced amino acid sequences for hsdM are almost identical among all the 15 alleles, but those for hsdS show divergence into seven different types with most of the differences localized in the target sequence recognition domains (Baba et al. 2002). We carried out more detailed sequence comparison to characterize the variation of these hsdS genes.
Type I RM genes have been grouped into five families, type IA through type IE (Murray 2000; Chin et al. 2004), and some of the above genes were already grouped into type IC family (Kuroda et al. 2001). To ascertain their domain structure and to characterize their variation, we compared their sequences with those of other type I RM genes (see Materials and Methods). Earlier works revealed that all the members of one family show strong homology in the conserved regions of HsdS (Murray 2000), but recent sequence data have revealed that those similarities are lower than estimated for the prototype members (Titheradge et al. 2001; Adamczyk-Poplawska et al. 2003).
Figure 5A shows a multiple alignment of the seven types of HsdS proteins of S. aureus and other three types of HsdS proteins of type IC members of Lactococcus lactis. After manual refinement, three conserved regions (designated as N-terminal, central, and C-terminal conserved regions) and two intervening variable regions that are likely responsible for target sequence recognition (designated as N-terminal and C-terminal target recognition domains) were identified, as was reported for various HsdSs (Titheradge et al. 2001). The sequences of the conserved regions aligned very well within the HsdSs of S. aureus and within those of L. lactis, respectively. There is also some interspecific similarity (fig. 5A).
Figure 5B shows schematic diagrams for the organization of the seven types of HsdS proteins of S. aureus. Reassortment of the two target recognition domains can be identified in intrastrain and interstrain comparison: type 1 and type 3 share the sequence of the C-terminal target recognition domain, while they do not share that of the N-terminal target recognition domain. Likewise type 2, type 4, and type 5 share the sequence of the N-terminal target recognition domain but not that of the C-terminal one.
The reassortment of the two target recognition domains was shown to create novel target specificity (Fuller-Pace et al. 1984; Gann et al. 1987; Gubler et al. 1992). The present result suggests that such reassortment occurred also under natural conditions. Furthermore, it is inferred that this reassortment was caused by some mode of homologous recombination involving the central conserved region. Indeed, the nucleotide sequence alignment of the central conserved region of the seven types of hsdS in S. aureus showed that there is a common sequence as long as 140 bp sharing more than 90% identity in 18 out of all (21) the possible combinations (data not shown).
Polymorphisms in rrn Operons
Figure 6A and B show copy number, chromosomal position, gene organization, and orientation of ribosomal RNA operons in the seven genomes. There are two types of strains, the strains with five rrn copies (N315, Mu50, NCTC8325, and MRSA252) and those with six rrn copies (COL, MW2, and MSSA476), as reported (Kuroda et al. 2001; Baba et al. 2002; Holden et al. 2004; Gill et al. 2005). The gene organization is 16S-23S-5S in all the cases except for an additional 5S rDNA upstream of locus 2 and locus 2-1. In the strains with six copies, the two rrn operons at locus 2-1 and locus 2-2 lie next to each other, separated only by a 211-bp sequence (between 5S rDNA of locus 2-1 and 16S rDNA of locus 2-2). This corresponds to the single copy on locus 2 of the strains with five copies, although the 107-bp sequence (black box in fig. 5B) between the 3' end of the 5S rDNA of locus 2-1 and the point 104-bp upstream of the 16S rDNA of locus 2-2 was not found in locus 2.
Our initial comparison between N315 and Mu50 revealed one substitution in a 16S-23S rDNA intergenic spacer region (table 2, ID #3). Figure 6C illustrates the sequence patterns of 16S-23S intergenic spacer region found in the seven strains, which show a mosaic structure that consists of three conserved sequence blocks, CS1 through CS3, and different sets of variable sequence blocks, VS1 through VS13 (see legend for fig. 6C). The upper nine alleles (rrnA-rrnH) were already reported (Gurtler and Barrie 1995; Forsman, Tilsala-Timisjarvi, and Alatossava 1997). The last two were discovered in this study, one in MW2 and MSSA476 and the other in MRSA252, and were designated as rrnY and rrnZ, respectively. While rrnY consists of the known variable sequence blocks in a novel combination, rrnZ bears a novel 16-bp sequence, 5'-GTGATAATAAAGCAGT-3', designated as VS13, apparently inserted within VS4. This 16-bp sequence was also found at locus 4 in MRSA252, where it arose by base substitutions from the consensus sequence of rrnH, 5'-aTGAaAAATAAAGCAGT-3', comprising 3 bp of VS4 and 13 bp of VS5.
Table 3 shows which of the above types of rrn operon is present at each locus of each strain. Between N315 and Mu50, the only difference is at locus 1, which is apparently a simple substitution of VS1-VS2-VS3 by VS4-VS6-VS7. However, rrnC is found at both locus 1 and locus 3 in N315. This could be a result of gene conversion in the N315 lineage involving rrnH at locus 1 in the Mu50-type ancestor as the recipient and rrnC at locus 3 as the donor. A similar situation is seen between NCTC8325 and COL for locus 4 and locus 5, from rrnF to A48073. Another interesting relationship is found between MW2 and MSSA476; locus 2-1 and locus 2-2 are apparently reciprocally exchanged between rrnK and rrnJ.
What kind of mechanisms can be inferred for them? The indel involving whole rrn operons observed at loci 2-1 and 2-2 and locus 2 can be explained by an unequal crossing-over leading to deletion or tandem duplication. The former deletion can occur by a single homologous recombination event via extensive homology of duplication. For the latter duplication process, unequal crossing-over between the flanking 5S rDNA genes at locus 2 is not sufficient. Such a homologous recombination event should be followed by another illegitimate recombination event in order to explain the presence of the 107-bp sequence between locus 2-1 and locus 2-2, which is absent from locus 2.
The various rrn alleles in the 16S-23S intergenic spacer region can be explained as combinations of a left variable region and a right variable region (fig. 6C). This reassortment likely occurs through homologous recombination involving the long homology of CS1 and CS2.
In addition, these two variable regions are full of indels and substitutions that can be explained by illegitimate recombination. To examine this possibility, the boundaries of indels or substitutions were explored. In the following three cases, short repeats ranging from 3 to 9 bp were found, which indicates illegitimate recombination between these short repeats, as detailed below. First, in the apparent VS2 deletion, between rrnA and rrnF, for example, 9-bp flanking direct repeats were found. Second, the apparent VS5-VS6 deletion, between A48073 and rrnL, for example, had 4-bp flanking direct repeats. Third, in the substitution between VS9 and VS10, VS10 is apparently inserted with 3-bp flanking direct repeats instead of VS9. In addition, illegitimate recombination between 10-bp repeats can be inferred exclusively in MRSA252 for its rrnH and rrnZ due to the base substitutions and characteristic insertion of VS13. Their left and right variable regions thus generated by illegitimate recombination may recombine through homologous recombination at the conserved regions (CS1 and CS2).
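The boundary inspection described above can be sketched in code. This is a minimal illustration, not the procedure used in this study: it assumes the longer allele is available as a string and that the deleted segment is given as 0-based half-open coordinates. The function name and the flanking toy sequence are hypothetical; only the 9-bp repeat CCACCATTA is taken from the text.

```python
def flanking_direct_repeat(seq: str, start: int, end: int) -> str:
    """Return the direct repeat flanking the deleted segment seq[start:end].

    A deletion between two copies of a direct repeat removes one copy plus
    the intervening sequence, so the breakpoint can be placed anywhere inside
    the repeat. The repeat is recovered by extending the match between the
    two junctions leftward and then rightward.
    """
    if not 0 <= start < end <= len(seq):
        raise ValueError("invalid deletion coordinates")
    span = end - start
    # Leftward: bases just before the deletion vs. bases at the end of it.
    left = 0
    while (start - left > 0 and left < span
           and seq[start - left - 1] == seq[end - left - 1]):
        left += 1
    # Rightward: bases at the start of the deletion vs. bases just after it.
    right = 0
    while (end + right < len(seq) and left + right < span
           and seq[start + right] == seq[end + right]):
        right += 1
    return seq[start - left:start + right]


# Hypothetical toy allele: two copies of a 9-bp repeat around a spacer.
repeat = "CCACCATTA"
seq = "GGG" + repeat + "TTTTGGGGAAAC" + repeat + "CCC"

# Two different breakpoint placements describe the same deletion product.
print(flanking_direct_repeat(seq, 3, 24))
print(flanking_direct_repeat(seq, 7, 28))
```

Because the breakpoint is ambiguous within the repeat, both coordinate choices yield the same deletion product, and the function reports the same repeat for each.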
Further Discussion
Most of the smaller polymorphisms were indels. They were deduced to be caused locally by illegitimate recombination between flanking direct repeats or between sequences with no or very little similarity, though, in two cases, involvement of homologous recombination cannot be ruled out. Deletion appeared more likely than insertion between a pair of dispersed repeats (fig. 1A and B and table 2A and B), but once sequences were repeated multiple times, they might become easy to expand and contract (fig. 1C and table 2C). Some of these sites may be used as markers for strain typing and may provide a tool to trace the route of infection (Read et al. 2002).
In all the three tandem paralogue clusters on νSaα and νSaβ, highly similar repeats, sufficiently long for homologous recombination, were identified at the boundaries of genome rearrangements. The relative positions of these repeats to the ORFs are apparently the same, so these repeats were conserved parts among these divergent tandem gene clusters. Homologous recombination between these repeats may have caused the interstrain copy number variation of the tandem paralogous genes. We cannot exclude the possibility of site-specific recombination for these and the other indels (blue indel and green indel in fig. 2A and B).
The extent of rearrangements appears to vary from one cluster to another, as seen in the variation of gene order; the gene order appears to be conserved in the ssl cluster but far from conserved in the lpl cluster (figs. 2 and 3). This difference may be explained by the extent of divergence of tandem genes in each cluster and the strong dependency of the frequency of homologous recombination on the homology length and the sequence identity. In the lpl cluster, conserved repetitive regions with sufficient length and sequence identity for homologous recombination exist in many pairs of the overall repetitive structure corresponding to the tandem genes, which produces a characteristic lattice pattern of dots in the dot plot (fig. 3A). This abundance of repeats that potentially cause homologous recombination is well correlated with the extensive rearrangements of the lpl cluster, which possibly include insertions, deletions, duplications, conversions, and translocations. A similar situation is seen in the left-hand side of the spl cluster as shown in figure 4. On the other hand, there seems to be no such abundance of significant repeats in the ssl cluster, which is consistent with the presence of only a few indels in this cluster. If the sequence conserved within the lpl genes served as the recombination site, one might expect that the phylogeny of the gene could differ to its left and to its right. This is indeed the case (T. Tsuru and I. Kobayashi, unpublished data).
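Why a conserved segment repeated in every tandem unit produces a lattice of dots can be seen in a minimal word-match dot plot. The sketch below is a generic illustration, not the software used to draw fig. 3A; the function name, the word size, and the toy cluster are all hypothetical.

```python
from collections import defaultdict


def dot_plot(seq_x: str, seq_y: str, word: int = 8) -> list[tuple[int, int]]:
    """Return (i, j) positions where seq_x[i:i+word] equals seq_y[j:j+word]."""
    # Index every word of seq_y by its start positions.
    index = defaultdict(list)
    for j in range(len(seq_y) - word + 1):
        index[seq_y[j:j + word]].append(j)
    # Look up every word of seq_x in the index.
    dots = []
    for i in range(len(seq_x) - word + 1):
        for j in index.get(seq_x[i:i + word], ()):
            dots.append((i, j))
    return dots


# Hypothetical toy cluster: one conserved 12-bp segment in each of three
# tandem units, separated by unrelated spacers.
unit = "ATGAAAGGTCTA"
cluster = "ACACACAC" + unit + "GGTTGGTT" + unit + "CTCTCTCT" + unit

dots = dot_plot(cluster, cluster)
off_diagonal = [(i, j) for i, j in dots if i != j]
print(len(off_diagonal), "off-diagonal word matches")
```

In a self-comparison, every pair of conserved copies contributes a short diagonal run away from the main diagonal, so n copies give a grid of n x n runs. Each off-diagonal run marks a pair of sites that could, in principle, serve as substrates for homologous recombination.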
Can the maintenance of these conserved repeats be interpreted in terms of their function? One pair in the ssl cluster (red in fig. 2) and those in the spl cluster (fig. 4) encode the initiating ATG, the putative ribosome-binding site, and a part of the putative signal peptides (Williams et al. 2000) and could be important in gene expression and secretion. The ssl transcripts were detected in an S. aureus strain (Williams et al. 2000), and the spl transcripts were detected in a derivative of NCTC8325 (Reed et al. 2001). The repeats in spl genes include the 108-bp sequence for the signal peptides, which is cleaved during secretion (fig. 4C) (Reed et al. 2001). Meanwhile, the repeats in the lpl cluster (fig. 3) correspond to conserved amino acid sequences, yet these conserved residues were outside the recognized sequence motif for lipoproteins (Prosite accession number, PS00013) (http://www.expasy.org/prosite/). Lastly, another sequence repeated in the ssl cluster (orange in fig. 2) is located in the center of an intergenic region. We found no evidence for spread of these repetitive regions outside these genomic islands.
Is there any meaning in this variability among strains? These three tandem paralogues (ssl, lpl, and spl) have been considered as virulence genes as they encode exotoxins, lipoproteins, and secreted proteases, respectively (Williams et al. 2000; Kuroda et al. 2001; Reed et al. 2001). That all three kinds of proteins are likely involved in interaction with environments suggests that the intraspecies variability of these clusters may confer the ability to adapt to diverse environmental conditions. Many bacteria adapt to variable environments by altering a phenotype through changes in expression of multiple genes based on DNA rearrangements and other means, a process called "phase variation" (van der Woude and Baumler 2004). Variability of these clusters may be explained in the same context. We do not know whether these repeats are maintained because of their function as recombination sites that control the variability of a paralogue cluster.
Genetic exchanges via horizontal gene transfer in this region have been proposed because this region is on a putative mobile genetic element (Fitzgerald et al. 2003). The apparently distorted gene order in the lpl cluster and the spl cluster supports this view. The polymorphic pattern of hsdS genes that shows significant sequence similarity between different strains is far more indicative; an interstrain reassortment of target recognition domains might have occurred.
There is evidence that RM genes have undergone extensive horizontal gene transfer (Kobayashi 2004). Many RM genes of type I and type II are present on various mobile genetic elements (Kobayashi 2004). In the case of type I RM genes, aberrant GC contents and presence of different alleles of the same subfamily support this view (Murray 2000). The significant sequence similarity between conserved domains of hsdS in S. aureus and those in L. lactis also indicates that the exchange of the hsdS genes was possible even across their interspecies barriers. Similarity was also present between their M subunits. The presence of type I RM genes on the genomic islands has led us to hypothesize that these type I RM genes could play a role in stabilizing the maintenance of these islands (Kuroda et al. 2001) because of postsegregational host killing as reported for type II RM systems (Naito, Kusano, and Kobayashi 1995).
The type I RM genes on the two genomic islands, νSaα and νSaβ, encode only HsdS and HsdM subunits sufficient for the methylation activity but not HsdR subunit necessary for the endonuclease activity (Kuroda et al. 2001). This led us to hypothesize that the HsdR subunit encoded in a locus outside the two genomic islands can interact with two sets of HsdM and HsdS to form two different RM systems (Kuroda et al. 2001). Combination of this HsdR and HsdM/HsdS of νSaα may form a restriction enzyme of one specificity, while combination of this HsdR and HsdM/HsdS of νSaβ may form another restriction enzyme of another specificity. This hypothesis is worth testing for the following reasons. First, the sequences of conserved domains of HsdS on the two different islands are very similar to each other. These domains are responsible for interacting with HsdR (Abadjieva et al. 1993; MacWilliams and Bickle 1996; Kim et al. 2005). Second, their homologous systems in L. lactis, Lla7I, Lla103I, and Lla1403I, comprise chromosomally encoded HsdM, HsdR, and HsdS genes and plasmid-encoded HsdS genes to confer combinatorial variation of restriction specificity by switching the HsdS subunit (Schouler et al. 1998).
With respect to the rrn operons, occurrence of homologous recombination involving long homology of rDNAs was inferred between two tandem loci, locus 2-1 and locus 2-2. This process was inferred to be deletion rather than duplication from detailed comparison of intergenic sequences. We do not know why the duplicated version (locus 2-1 and locus 2-2) does not segregate into a monomer version (locus 2) by homologous recombination. A selective advantage of the duplicated version in ribosomal function could provide an explanation. Evidence of homologous recombination between distant loci is also provided by the typing of the 16S-23S rDNA spacer.
The presence of short repeats in the 16S-23S rDNA spacer rearrangements indicated involvement of illegitimate recombination between short repeats in the formation of its novel allele. Among these examples, the 9-bp repeats, which are involved in the VS2 deletion and consist of the 3'-end sequence of tRNA (Ala), CCACCA, and the following TTA, are noteworthy because the DNA secondary structure of the VS2 region, which encodes tRNA (Ala), may have induced the illegitimate recombination. Indeed, computer analysis suggested that a stable secondary structure could be formed at tRNA (Ala) in Haemophilus parainfluenzae (Giannino et al. 2001), and presence of such secondary structure stimulates illegitimate recombination (Michel 1999).
We have focused on polymorphisms that exist between N315 and Mu50 because we were involved in their genome analysis. Their genome sequences were close enough to allow reconstruction of their formation. A comparison of all the seven genomes would very likely identify additional and significant polymorphisms. However, how these polymorphisms were formed would be quite difficult to analyze with diverged genomes. There are more than six additional pathogenicity islands (excluding bacteriophages) in the N315/Mu50 genome and the remaining five genomes that also likely contribute to virulence. A truly global comparison of S. aureus genomes would include an analysis of the entire complement of pathogenicity islands.
After the manuscript was prepared, two more genome sequences of S. aureus strains, RF122 (RefSeq: NC_007622) and USA300 (RefSeq: NC_007793), were released.
| Conclusion |
|---|
In the present work, we compared multiple genome sequences of S. aureus to deduce mechanisms of genome rearrangements, in paralogous genes and others, that resulted in large genome polymorphisms. Most of them were inferred to have resulted from deletion through illegitimate recombination. For the tandem paralogue gene clusters on genomic islands, νSaα and νSaβ, we were able to identify sequences for homologous recombination. The evolution through homologous recombination was also found for the type I RM hsdS genes on these islands. Homologous recombination likely has caused the rearrangements in the rRNA operons. We also found novel alleles in rRNA operons and suggested involvement of illegitimate recombination in their formation. Taken together, these results demonstrate the importance of homologous recombination in the evolution of paralogous genes and illustrate the power of comparative genomics in the analysis of genome evolution through genome rearrangements.
| Acknowledgements |
|---|
We are grateful to Toshiko Ohta and Hideki Hirakawa for communication of unpublished results and to Makoto Kuroda and Harumi Yuzawa for helpful comments on the manuscript. This work was supported by grants from Ministry of Education, Culture, Sports, Science, and Technology of the Japanese government to I.K. (Genome Biology, Genome Homeostasis, DNA Repair, Kiban-evolution, Kiban-genome, 21COE: genome language).
| Footnotes |
|---|
Takashi Gojobori, Associate Editor
| References |
|---|
Abadjieva, A., J. Patel, M. Webb, V. Zinkevich, and K. Firman. 1993. A deletion mutant of the type IC restriction endonuclease EcoR1241 expressing a novel DNA specificity. Nucleic Acids Res. 21:4435-4443.
Adamczyk-Poplawska, M., A. Kondrzycka, K. Ur





