Molecular Biology and Evolution 18:1231-1245 (2001)
© 2001 Society for Molecular Biology and Evolution
HIV-1 and HIV-2 LTR Nucleotide Sequences: Assessment of the Alignment by N-block Presentation, "Retroviral Signatures" of Overrepeated Oligonucleotides, and a Probable Important Role of Scrambled Stepwise Duplications/Deletions in Molecular Evolution
*Laboratoire Génome et Informatique, Université de Versailles Saint Quentin-en-Yvelines, Versailles, France
Laboratoire d'Informatique Fondamentale de Lille, Equipe Bioinformatique, Université des Sciences et Technologie de Lille, Villeneuve d'Ascq, France
Deutsches Krebsforschungszentrum Theoretische Bioinformatik (H0300) Im Neuenheimer Feld 280, Heidelberg, Germany
Institut de Génétique Humaine, Montpellier, France
| Abstract |
|---|
Previous analyses of retroviral nucleotide sequences suggest a so-called "scrambled duplicative stepwise molecular evolution" (many sectors with successive duplications/deletions of short and longer motifs) that could have stemmed from one or several starter tandemly repeated short sequence(s). In the present report, we tested this hypothesis by focusing on the long terminal repeats (LTRs) (and flanking sequences) of 24 human and 3 simian immunodeficiency viruses. By using a calculation strategy applicable to short sequences, we found consensus overrepresented motifs (often containing CTG or CAG) that were congruent with the previously defined "retroviral signature." We also show many local repetition patterns that are significant when compared with simply shuffled sequences. First- and second-order Markov chain analyses demonstrate that a major portion of the overrepresented oligonucleotides can be predicted from the dinucleotide compositions of the sequences, but by no means can biological mechanisms be deduced from these results: some of the listed local repetitions remain significant against dinucleotide-conserving shuffled sequences; together with previous results, this suggests that interspersed and/or local mononucleotide and oligonucleotide repetitions could have biased the dinucleotide compositions of the sequences. We searched for suggestive evolutionary patterns by scrutinizing a reliable multiple alignment of the 27 sequences. A manually constructed alignment based on homology blocks was in good agreement with the polypeptide alignment in the coding sectors and has been exhaustively assessed by using a multiplied alphabet obtained by the promising mathematical strategy called the N-block presentation (taking into account the environment of each nucleotide in a sequence). Sector by sector, we hypothesize many successive duplication/deletion scenarios that fit our previous evolutionary hypotheses. This suggests an important duplication/deletion role for the reverse transcriptase, particularly in inducing stuttering cryptic simplicity patterns.
| Introduction |
|---|
Previously, computer-aided analyses of retroviral nucleotide sequences aimed to unravel putative molecular evolution models from sequence comparisons and oligonucleotide distributions (Laprevotte et al. 1984, 1997
In the present study, we tested these evolutionary hypotheses by focusing on the long terminal repeats (LTRs) (and flanking sequences) that bound proviral DNA sequences from two groups of human immunodeficiency viruses (HIV): 15 HIV-1's together with a chimpanzee simian immunodeficiency virus (SIV), and 9 HIV-2's together with a macaque SIV and a sooty mangabey SIV. It is known that following retrovirus integration into the host-cell genome, the double-stranded proviral DNA is flanked by two identical LTRs, with the 5' LTR element serving as the binding site for transcription factors (reviewed in Ou and Gaynor 1995; Pereira et al. 2000). The HIV nef gene open reading frame partially overlaps the 3' LTR (fig. 1). The HIV LTRs are short sequences that can be visually compared and have been subjected to exhaustive sequencing and biological studies because of an important pathological concern (as a result of the worldwide AIDS crisis) and the presence of transcription control sites on them. It is already known that retrovirus LTRs have multiplied motifs that may correspond to experimentally determined regulatory elements (Frech, Brack-Werner, and Werner 1996). Several studies have shown that the LTR structures and their regulation are of particular interest for HIV expression (Gaynor 1992). Here, we use the control sites that are conserved during evolution as starter homology blocks for a reliable multiple alignment of the 27 LTR sequences. We also list overrepresented words by using a new calculation strategy (Klaerr-Blanchard, Chiapello, and Coward 2000) applicable to short sequences such as the LTRs and their coding and noncoding sectors. CTG is often found in the overlapping multiplied motifs described in HIV-1 LTRs (Seto, Brunck, and Bernstein 1989). Moreover, the sequences of HIV (1 and 2) are the most biased in favor of the overrepresented trinucleotides in the LTRs (Laprevotte et al. 1997). We search for putative short- and longer-range duplications/deletions by comparisons with shuffled sequences and by scrutinizing the thoroughly assessed alignment of the 27 sequences sector by sector. The results are in accordance with the previous hypotheses for the retrovirus nucleotide sequences of molecular evolution by scrambled stepwise short- and longer-range duplications/deletions.
| Materials and Methods |
|---|
The 27 Studied Nucleotide Sequences
The sequences represented a portion of the plus strand of the proviral DNA (this strand corresponds to the viral RNA). These are the LTRs together with flanking sectors (fig. 1 and the alignment on the web page). The 27 5' nucleotides were those located upstream of the 3' LTR. They included the polypurine tract (PPT) that is the binding site for the primer of DNA plus-strand reverse transcription. The 40 3' nucleotides were those located downstream of the 5' LTR. They included the primer-binding site (PBS) for minus-strand reverse transcription. These two flanking sectors were highly conserved 5' and 3' landmarks that bound the alignment (see below). The three regions of the LTR were 5'-U3-R-U5-3' (reviewed in Peterlin 1995
The Method for Finding Exceptional Words in a Sequence
The calculation strategy (Klaerr-Blanchard, Chiapello, and Coward 2000)
Methods for Finding Probable Local Repetition Sectors
We set a priori patterns that could correspond to local repetitions. A numerical value was used, that is, the percentage of the bases in the observed sequence that were included in at least one of the repetitions so defined. For each of the 27 sequences, the actual value was compared with those of 100 simply shuffled sequences (Bernoulli model) or 100 shuffled sequences additionally conserving the exact starter dinucleotide count (first-order Markov chain model). The result was considered significant (table 6) when every random value was lower than that of the observed sequence. The result was considered somewhat significant when no more than 5 of the 100 random values were above the observed one. A computer program that shuffled the letters of a sequence while accurately conserving its dinucleotide, or even trinucleotide, composition, was implemented (Kandel et al. 1996
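The comparison with simply shuffled sequences described above can be sketched as follows. This is a minimal illustration, not the program used in the study: the a priori pattern chosen here (tandem repeats of a fixed-length unit), the function names `repeat_coverage` and `shuffle_test`, and the treatment of ties (a shuffled value equal to the observed one counts against significance) are all assumptions made for the example, and only the Bernoulli model is covered.

```python
import random
import re


def repeat_coverage(seq, unit_len=2, min_copies=3):
    """Percentage of bases included in at least one tandem repeat.

    Hypothetical pattern: a unit of `unit_len` letters repeated at least
    `min_copies` times in a row (the unit may itself be a homopolymer).
    """
    if not seq:
        return 0.0
    # Lookahead so that overlapping repeats are all reported.
    pattern = re.compile(r"(?=((.{%d})\2{%d,}))" % (unit_len, min_copies - 1))
    covered = [False] * len(seq)
    for match in pattern.finditer(seq):
        for i in range(match.start(1), match.end(1)):
            covered[i] = True
    return 100.0 * sum(covered) / len(seq)


def shuffle_test(seq, n_shuffles=100, seed=0, **pattern_args):
    """Compare the observed coverage with that of simply shuffled copies."""
    rng = random.Random(seed)
    observed = repeat_coverage(seq, **pattern_args)
    letters = list(seq)
    reached = 0  # shuffled values that reach or exceed the observed value
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # conserves the nucleotide composition only
        if repeat_coverage("".join(letters), **pattern_args) >= observed:
            reached += 1
    if reached == 0:
        verdict = "significant"
    elif reached <= 5:
        verdict = "somewhat significant"
    else:
        verdict = "not significant"
    return observed, verdict


print(repeat_coverage("ACACACGT"))  # ACACAC covers 6 of 8 bases
print(shuffle_test("ACGT"))
```

Counting ties against the observed sequence is the more conservative reading of the thresholds; a strict "above" comparison would be a one-line change.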
The N-Presentation Algorithm
The N-presentation strategy (Didier 1999
The Alignment Strategy
The alignment can be found on the web page http://genome.genetique.uvsq.fr/laprevotte/. Within each HIV-1 or HIV-2 group, the sequences were closely related and the alignments were easily constructed, such that the two consensuses were easily deduced. The difficulty was to align the HIV-1 and HIV-2 sequences together (these sequences are supposed to have a common evolutionary progenitor). The alignment was constructed manually. To begin with, it was based on eight consensus elements (or groups of elements), that is, 18 positions highlighted by Frech, Brack-Werner, and Werner (1996), who studied common modular structures in primate lentiviral LTR sequences. Most of these core blocks enable one to propose a reliable alignment of the corresponding and neighboring HIV-1 and HIV-2 sectors, provided that some local corrections are made. In addition, PPT and PBS were very significant core blocks, together with the 5' and 3' ends of the LTRs, respectively. The rest of the alignment was built by recursively searching the intercalary sectors for perfectly matched segments of at least three bases in length. For each step, a new intercalary alignment was then based on the longest perfect match between any paired HIV-1 and HIV-2 sequences that was consistent with the prealigned bordering sectors. This match was a priori assumed to be the closest to the putative original sequence. In addition, probable duplication/deletion events were taken into account. Especially in the case of an unequal number of repeated motifs between HIV-1's and HIV-2's, gaps were inserted (gaps were not treated explicitly but remain as those parts of the sequences that did not belong to any of the aligned segments; Morgenstern, Dress, and Werner 1996). Eventually, the alignment was based on the nucleotides printed on the line labeled "common sectors." These nucleotides covered 643 positions (58%) of the alignment. In the coding reading frame, the polypeptide alignment was constructed in the same way based on the conserved amino acids (and the corresponding codons). In order to assess and to locally correct the nucleotide alignment while increasing the signal-to-noise ratio, alphabets of more than four letters were additionally used: that of the polypeptide alignment in the coding reading frame (as just mentioned), and that obtained by the 8- and 12-ranked N-block presentation for the whole of the sequences. Obviously, there was good agreement between the polypeptide and the nucleotide alignments except for a few locations (see the web page). All of the aligned sequences were coded using a 12-ranked and an 8-ranked N-block presentation (the latter being less stringent). Obviously (see the web page), the N-block presentation corroborated the homology blocks (in addition, local corrections of the alignment were made possible).
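The recursive step of this strategy (take the longest perfect match, then search the sectors on either side of it) can be sketched for a single pair of sequences. This is only an illustration under stated assumptions: the alignment in the study was built manually across 27 sequences, starting from known core blocks and taking duplication/deletion events into account, none of which is modeled here. The names `longest_perfect_match` and `anchor_blocks` are hypothetical.

```python
def longest_perfect_match(a, b):
    """Return (start_a, start_b, length) of the longest common substring.

    Ties are resolved in favor of the first match found.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def anchor_blocks(a, b, off_a=0, off_b=0, min_len=3):
    """Recursively collect collinear perfect matches of at least min_len bases.

    Each block is (position in a, position in b, length); bases outside
    every block are left unaligned, which is where gaps end up.
    """
    start_a, start_b, length = longest_perfect_match(a, b)
    if length < min_len:
        return []
    left = anchor_blocks(a[:start_a], b[:start_b], off_a, off_b, min_len)
    right = anchor_blocks(
        a[start_a + length:],
        b[start_b + length:],
        off_a + start_a + length,
        off_b + start_b + length,
        min_len,
    )
    return left + [(off_a + start_a, off_b + start_b, length)] + right


# Toy sequences (not taken from the alignment).
print(anchor_blocks("ACGTTTGCA", "ACGAAATGCA"))  # [(0, 0, 3), (5, 6, 4)]
```

Searching only to the left and right of each accepted block is what keeps every new match consistent with the bordering sectors that are already aligned.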
| Results |
|---|
Overrepresented Oligonucleotides Appear to Be Congruent with the So-Called "Retroviral Signature"
We used a new calculation strategy (Materials and Methods) to perform on a short sector of the retroviral genome an investigation similar to that performed on complete sequences (Laprevotte et al. 1997
As a whole, the overrepeated words selected from the entire lengths of the sequences (table 2) appeared to be congruent with the retroviral signature previously found, particularly the core consensuses CCTGG and CAGR (Laprevotte et al. 1997
In the coding sector (table 3), only CIVCG and RESIVSMM showed overrepresented oligonucleotides including CTG. In addition, two HIV-1's and six HIV-2's showed an overrepresented CCAG. The noncoding sector (table 4) appeared to be much more congruent with the retroviral signature: the sequences studied showed at least one overrepresented oligonucleotide including CTG (except for K03456, M26727, RESIVSMM, and RESIMM251); six HIV-2's out of nine showed at least one overrepresented word including CAG. For K03455, X01762, and M17449, only the entire lengths of the sequences were studied because of a premature stop codon (tables 3-5).
Table 5 displays the tri- and tetranucleotides that were found to be overrepresented when a first-order Markov chain model was used. Only the sequence RESIVSMM had an overrepresented oligonucleotide (CTGG, in the entire sequence and in the coding sector) that was congruent with the so-called retroviral signature. GGGA remained significantly overrepresented in the noncoding sectors of all of the HIV-1 sequences that were tested (except for CIVCG) and three HIV-2 sequences (L07625, M30895, and X61240). Obviously, the major portion of these overrepresented GGGA's was clustered in the sectors (aligned with CIVCG 397-458), where the repeated sites NF-KB and SP.1 were located (see the alignment on the web page). Actually, for these sequences with overrepresented GGGA's, the noncoding sectors were 341-547 bases in length and included between 7 and 10 occurrences of this word. In these actual sectors, the zone in which NF-KB and SP.1 sites were clustered (being only 59-73 bases in length) included as many as four or five occurrences of GGGA (boxed by a thick line in the alignment).
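To make the difference between the two null models concrete, the sketch below computes the expected count of a word under a Bernoulli model and under a first-order Markov chain fitted to the same sequence. It uses the textbook estimator (product of the dinucleotide counts along the word divided by the product of the inner mononucleotide counts); this is an assumption made for illustration and is not the calculation strategy of Klaerr-Blanchard, Chiapello, and Coward (2000), and no significance test is included. The function names are hypothetical.

```python
from collections import Counter


def word_counts(seq, k):
    """Counts of all overlapping words of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_bernoulli(seq, word):
    """Expected count when letters are drawn independently."""
    freq = Counter(seq)
    expected = float(len(seq) - len(word) + 1)
    for letter in word:
        expected *= freq[letter] / len(seq)
    return expected


def expected_markov1(seq, word):
    """Expected count given the dinucleotide composition of seq."""
    mono = word_counts(seq, 1)
    di = word_counts(seq, 2)
    numerator = 1.0
    for i in range(len(word) - 1):
        numerator *= di[word[i:i + 2]]
    denominator = 1.0
    for letter in word[1:-1]:
        denominator *= mono[letter]
    return numerator / denominator if denominator else 0.0


# Toy sequence made of tandemly repeated CTG (not one of the 27 sequences).
seq = "CTGCTGCTG"
print(word_counts(seq, 3)["CTG"])       # observed count
print(expected_bernoulli(seq, "CTG"))   # well below the observed count
print(expected_markov1(seq, "CTG"))     # matches the observed count
```

In this toy case the word is strongly overrepresented against the Bernoulli model yet fully predicted by the dinucleotide counts, which is the situation discussed in the text: the repeat itself shapes the dinucleotide composition against which it is tested.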
Sectors Suggesting Local Repetition Scenarios
The simulation procedures (Materials and Methods) were aimed at finding local repetition processes such as those hypothesized previously (Laprevotte et al. 1997
In figure 2, the numerical values defined for table 6 (column A) are displayed for three sets of 2,700 (27 × 100) shuffled sequences. For each distribution graph, each of the 27 starter sequences was shuffled 100 times (Materials and Methods). For the left graph, the sequences only rigorously conserved the starter nucleotide compositions. For the middle graph, the dinucleotide counts were additionally exactly conserved, as were the trinucleotide compositions for the right graph. The middle graph was more shifted from the left than was the right from the middle, such that the major part of the increase of the random values was accounted for by the first-order Markov chain model. Hence, the repeated sequences investigated in table 6 (column A) appear to be accounted for, to a large extent, by the dinucleotide compositions of the sequences.
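One simple way to obtain shuffled sequences that conserve the exact dinucleotide counts is to repeatedly exchange two segments that are flanked by the same letters. The sketch below illustrates that idea only; it is not the program implemented for the study, it makes no claim that the resulting sequences are sampled uniformly, and the names `swap_step` and `dinucleotide_shuffle`, the number of steps, and the example sequence are assumptions.

```python
import random
from collections import Counter


def dinucleotide_counts(seq):
    """Counts of all overlapping dinucleotides."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def swap_step(seq, rng):
    """Try one count-preserving exchange.

    Positions i < j < k < l are drawn at random. If seq[i] == seq[k] and
    seq[j] == seq[l], the segments lying between i and j and between k and l
    are exchanged; every dinucleotide that spans a cut point is then
    recreated elsewhere, so all counts are unchanged.
    """
    if len(seq) < 4:
        return seq
    i, j, k, l = sorted(rng.sample(range(len(seq)), 4))
    if seq[i] == seq[k] and seq[j] == seq[l]:
        return seq[:i + 1] + seq[k + 1:l] + seq[j:k + 1] + seq[i + 1:j] + seq[l:]
    return seq


def dinucleotide_shuffle(seq, n_steps=10000, seed=0):
    """Apply many random exchanges to the sequence."""
    rng = random.Random(seed)
    for _ in range(n_steps):
        seq = swap_step(seq, rng)
    return seq


# Arbitrary example sequence (not one of the 27 sequences).
original = "GGGACTTTCCGCTGGGGACTTTCCAGGGA"
shuffled = dinucleotide_shuffle(original, n_steps=2000, seed=1)
print(dinucleotide_counts(shuffled) == dinucleotide_counts(original))  # True
```

Because the first and last letters never move, the mononucleotide composition is conserved as well, so such sequences are directly comparable with the simply shuffled ones.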
Columns B, C, D, and E of table 6 focus on particular repetitions that were part of those recapitulated in column A. Column B focuses on the tandemly repeated dinucleotides and displays but a few significant results. For columns C, D, and E, significant results are defined using a first-order Markov chain model. Column C displays sequences with overrepresented motifs of at least six bases in length, made up of only two distinct and alternate letters; such are all of the HIV-1 sequences (CIVCG excepted), and a single HIV-2 sequence; on average, this overrepresentation accounts for an overestimation of about 5% of the numerical value that is computed for column A. Column D shows that all of the HIV-2 sequences have overrepresented repetitions as ABCDABCDABCD; actually, this only highlights the sector GCTTGCTTGCTT (boxed by a thick line in the alignment on the web page), which extends from position RESIVSMM-671 to position RESIVSMM-682. Column E accounts for nonoverlapping repetitions of words at least 10 bases in length with no more than five letters intercalated between two successive identical motifs. For each starter sequence, no more than one random sequence with such a repetition (and with only two copies) occurs, such that in an observed sequence, this repetition (boxed by a thick line in the alignment) is thus considered significant (it accounts for an overestimation of about 3% of the numerical value calculated for column A). Together with the overrepetitions displayed in column C, these overrepresented patterns could account for at least some of the significant results presented in column A. Only a single HIV-2 sequence (L07625) shows such a pattern (two copies of a decanucleotide), located in the sector of the SP-1 sites (see below). On the contrary, all the HIV-1 sequences but L20571 show such a repetition, that is, 2 copies of the NF-KB site (see below). 
In the same sector of the alignment, L20571 (which is usually distinct from the other sequences of its group; Gurtler et al. 1994
Column J of table 6 accounts for significant overrepresentations of sectors of more than 15 bases in length that are made up of no more than two letters. According to the Bernoulli model, the HIV-1 sequences (except for K02013, K02007, and K03456) were somewhat significant; except for M17451, they were not significant when compared with a first-order Markov chain model. For the HIV-2 group, when a first-order Markov chain model was used, RESIVSMM, J04542, L07625, M30502, and M31113 remained significant, while J03654, J04498, and M15390 did not; the three other sequences in the group were not significant anyway. For the sequences that remained significant when compared to shuffled sequences conserving the exact starter dinucleotide count (first-order Markov chain model), the numerical value was 3.6% or 3.7%, except for M17451 (2.3%) conserving only a borderline significance; for those sequences which did not conserve their significance or remain nonsignificant, the parameter was 0%-2.3%.
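The numerical value used for such "monotonous" sectors can be illustrated with a sliding window that tracks how many distinct letters it contains. This is a sketch under assumptions: the function name `low_complexity_coverage` is hypothetical, "more than 15 bases" is read as a minimum length of 16, and the one-base exception allowed for the three-letter sectors is not modeled.

```python
def low_complexity_coverage(seq, max_letters=2, min_len=16):
    """Percentage of bases lying in a sector of at least min_len bases
    that is made up of no more than max_letters distinct letters."""
    if not seq:
        return 0.0
    covered = [False] * len(seq)
    counts = {}
    left = 0
    for right, letter in enumerate(seq):
        counts[letter] = counts.get(letter, 0) + 1
        # Shrink from the left until the window is valid again.
        while len(counts) > max_letters:
            dropped = seq[left]
            counts[dropped] -= 1
            if counts[dropped] == 0:
                del counts[dropped]
            left += 1
        # seq[left:right + 1] is the longest valid window ending at `right`.
        if right - left + 1 >= min_len:
            for i in range(left, right + 1):
                covered[i] = True
    return 100.0 * sum(covered) / len(seq)


# Toy sequences (not taken from the alignment).
print(low_complexity_coverage("AG" * 8 + "CT"))  # 16 of 18 bases covered
print(low_complexity_coverage("ACGT" * 5))       # 0.0
```

The same function, called with `max_letters=3` and `min_len=30`, gives a stricter variant of the three-letter criterion (stricter because no exception base is tolerated).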
Column K of table 6 accounts for significant overrepresentations of sectors at least 30 bases in length made up of no more than three letters (with at most one base excepted, this latter not being included in the computation of the numerical value defined above). HIV-1 sequences (except for M17449 and L20571) were significant even against a first-order Markov chain model. For significant sequences, the numerical values range from 18.7% to 32% (14% and 10.1% for M17449 and L20571, respectively). As a whole, HIV-2 sequences (value from 0% to 17%) were not significant.
The Reliability of the Alignment of the 27 Nucleotide Sequences
A reliable alignment is an essential tool in the present work. The accurate alignment of previously identified benchmarks and its congruency with the polypeptide alignment and with the N-block presentation coding of the sequences (Materials and Methods and the alignment on the web page) allowed us to consider this alignment reliable for testing local molecular evolution hypotheses. Three available multiple-sequence alignment programs were tested (table 7) against the benchmarks found in both the HIV-1 and the HIV-2 groups to select the most suitable algorithm for aligning the actual sequences studied in this work. Clustal-X (Thompson, Plewniak, and Poch 1999) is a progressive alignment method comparing individual residues by using a Needleman-Wunsch-based algorithm (Needleman and Wunsch 1970) and employing gap penalties; Mabios (Abdeddaïm 1997) and Dialign (Morgenstern, Dress, and Werner 1996) calculate homology blocks of which the best combinations are chosen in order to select the benchmarks on which the rest of the alignment is constructed. At first, it appeared that no two programs constructed the same alignment. The program Clustal-X produced a total misalignment downstream of the HIV-1 deletion zone following "TAR Common Sector" (CIVCG-519); moreover, the deleted sequence J03654 was oddly aligned (data not shown). Mabios and/or Dialign aligned all of the benchmarks (the R-U5 junction excepted); the alignment of five of these benchmarks was more accurate when Mabios was used (with Dialign constructing an alignment that was less accurate or only partial). However, the Dialign program was the only one aligning all but one of the indicated benchmarks, particularly the much-conserved polyadenylation site (Poly(A)) and highlighting the duplication of the NF-KB site in the HIV-1 group (by inserting a gap in HIV-2 sequences in front of one of the two NF-KB copies). Hence, as regards the present alignment, Dialign appeared to be the most reliable program of those tested; it was further tested for two sectors where the alignment was difficult to construct even manually: the set of sequences aligned with CIVCG 328-463 and that aligned with CIVCG 558-609 (see the web page). The nucleotides of the alignment constructed with Dialign in these sectors were coded by the 8-ranked N-block presentation, HIV-1 and HIV-2 matching letters being highlighted in red as on the web page (data not shown). Actually, these highlighted letters are less numerous than in the manually aligned corresponding sectors, which suggests that the alignment constructed with such a program has at least to be visually refined.
| Discussion |
|---|
Previous analyses of retroviral nucleotide sequences have suggested a scrambled stepwise duplicative molecular evolution. Genetic diversity in these sequences is usually presumed to arise as a consequence of reverse transcriptase infidelity (Katz and Skalka 1990
By no means can biological mechanisms be deduced from the correlation of overrepresented words and significant local repetitions with the dinucleotide compositions of the corresponding sequences. It is impossible to decide between two hypotheses: either the doublet frequencies, due to any event, account for these words found to be overrepresented or clustered when compared with a Bernoulli model, or a large number of duplications of oligonucleotides (such as AG and CT; tables 2-4) bias the dinucleotide composition of the sequence and account for the nonsignificance of many repetitions when tested against a first-order Markov chain model. Such duplications could favor particular nucleotide motifs for biochemical reasons or because of starter tandem repeated sequences. Previous studies of complete retroviral sequences (Laprevotte 1992; Laprevotte et al. 1997) strengthen the second hypothesis by demonstrating that for most of the overrepeated oligonucleotides, the observed frequency is not merely a consequence of dinucleotide distribution (many overrepresented oligonucleotides remained significant versus a first-order Markov chain model; moreover, the correlation between the dinucleotide distribution in the subset of the overrepresented oligonucleotides and that of the whole sequence was variable, high, weak, or even null). The fact that for RESIVSMM (which is supposed to be the evolutionary progenitor of the HIV-2's; Gao et al. 1999) CTGG is overrepresented even when a first-order Markov chain model is used (table 5) fits the same hypothesis. Moreover, many of the putative locally repeated sectors remain significant even against a first-order Markov chain model (table 6), giving examples of probable duplications that are obviously not accounted for by the dinucleotide composition of the sequence; these are tandem repeats, local repetitions, clusters of oligonucleotides, and "monotonous" sectors made up of no more than two or three letters. Columns F and K of table 6 show many repetitions and "monotonous" sectors that may cover up to 30% of the sequence and that are significant even against "Markov-1" random sequences (in these cases, the percentage is overestimated by about 10%-15%). Moreover, about one third of the alignment includes sectors boxed by a thick line for at least one sequence or one HIV-1/HIV-2 consensus (see the web page). As seen below, these sectors suggest local repetition events. Hence, it appears that in any case the dinucleotide compositions cannot account for all of the listed repetitive patterns and that these patterns cover a large portion of the sequences. The discrepancies between the results (tables 2-6) for the HIV-1 and the HIV-2 groups, respectively, suggest distinct mono- or oligonucleotide duplication/deletion scenarios that could have occurred since the evolutionary divergence between the two groups; this led us to search the reliable alignment of the sequences for patterns both statistically significant and evolutionarily suggestive.
Differentiated sectors can be delineated in the alignment (see the web page) in terms of the degrees of homology between HIV-1 and HIV-2 aligned sectors. The 5' landmark that is the PPT, together with the 5' end of the LTR (CIVCG 10-37) and the 3' landmark that is the PBS (CIVCG 682-705), are highly conserved and highlighted by the 12-ranked N-block presentation, as are six other homology blocks; out of these six blocks, the NF-KB site (CIVCG 397-407 and 409-418) and the polyadenylation signal (CIVCG 570-581) are in the noncoding part of the sequence; the other four (CIVCG [105-136], [168-181], [215-229], and [284-296]) align with conserved sectors in the polypeptide sequence nef.
The major portion of the coding sectors (up to and including position CIVCG-346) is to be distinguished from the rest of the alignment: the aligned sectors (except for two) measure about the same length (337-346 bases). Except for M17449 (which exhibits a premature stop codon), the lengths are equal or differ, as expected, by multiples of three. The length is longer for the L20571 sequence (349 bases); a scan of the colored alignment obviously corroborates the fact that L20571 is a divergent isolate among the HIV-1 group (Gurtler et al. 1994). In the HIV-2 group, J03654 (Zagury et al. 1988) is deleted between positions CIVCG-88 and CIVCG-317 (excluded). This could be accounted for by two successive deletion events. Let us write the HIV-2 consensus between the positions CIVCG-79 and CIVCG-94 while supposing a jump of the reverse transcriptase (Katz and Skalka 1990; Zhang and Temin 1994) from the first aga to the second; then, the sequence becomes TATACTTAGAAGG. Eleven out of the 13 letters of this motif match the HIV-2 consensus between positions CIVCG-309 and CIVCG-321 (TATARYTACAAGG), suggesting a second jump between the two motifs. Furthermore, in spite of the conservation of the lengths of the major part of the coding sectors, gaps have to be inserted in the sequences in order to align the homology blocks, suggesting any number of duplications/deletions. For instance, for the sectors aligned with CIVCG from position 39 to position 58, four demonstrative sequences lead to the proposal of a suggestive alignment:
|
|
AG), as discussed elsewhere (Averof et al. 2000)|
|
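The 11-out-of-13 match quoted above for the J03654 deletion scenario (TATACTTAGAAGG against the consensus TATARYTACAAGG) can be checked by reading R and Y as the standard ambiguity codes for purine (A or G) and pyrimidine (C or T). The short sketch below does so; the function name `count_matches` is hypothetical and only the codes needed here are listed.

```python
# Subset of the standard nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "N": "ACGT",  # any base
}


def count_matches(seq, consensus):
    """Number of positions at which seq is compatible with the consensus."""
    if len(seq) != len(consensus):
        raise ValueError("sequences must have the same length")
    return sum(base in IUPAC[code] for base, code in zip(seq, consensus))


motif = "TATACTTAGAAGG"      # sequence obtained after the supposed first jump
consensus = "TATARYTACAAGG"  # HIV-2 consensus, CIVCG-309 to CIVCG-321
print(count_matches(motif, consensus), "of", len(motif))  # 11 of 13
```

The two mismatches are the C facing R at the fifth position and the G facing C at the ninth.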
From position CIVCG-347 downward, the major portion of the aligned sequences is noncoding. Consequently, their lengths do not necessarily differ by multiples of three; they are much more divergent between the HIV-1 and the HIV-2 groups and even, within the HIV-2 group, between the two SIV-2 and the HIV-2 sequences. The duplication/deletion events appear to have been much less constrained during evolution than they have been in the coding parts. In this respect, several sectors deserve scrutiny.
The alignment between positions CIVCG-348 and CIVCG-386 can be accounted for by stepwise duplications/deletions (see the web page). HIV-1 clones have been described (Estable et al. 1996) where the HIV-1 empty sectors are occupied by the so-called "most frequent naturally occurring length polymorphism" (MFLNP on the web page), which shows varying lengths and appears more or less clearly to contain repeated sectors. Here, the aligned sectors in the HIV-2 group do not appear to be deleted.
Between CIVCG-394 and CIVCG-413 (excluded), HIV-2 sequences (SIVs excluded) exhibit two imperfectly repeated sectors that could be the remnant of a duplication event. L07625 and X61240 HIV-2 sequences are to be distinguished from the other seven (Kreutz et al. 1992; Barnett et al. 1993), as they differ in numerous locations all along the alignment. In each of them, the two homologous sectors (boxed by a thick line in the alignment) extend from position L07625-436 to position L07625-460 and from position L07625-461 to position L07625-484, respectively, and do not coincide with those of the other HIV-2 sequences:
Column K of table 6 shows that the HIV-1 sequences (except for M17449 and L20571) include overrepresented sectors at least 30 bases in length made up of no more than three letters (with at most one base excepted). Such a sector is found in these sequences between positions K02013-462 and K02013-494 (the corresponding sector is boxed by a thick line at the HIV-1 consensus; see the web page). In this sector, as well as upstream and downstream, the HIV-1 group shows clusters of CT's, CTG's, and CTGG's (table 6, columns F, G, H, and I). These words are boxed by a thick line in the alignment when the corresponding pattern is overrepresented against a first-order Markov chain model in the corresponding sequence taken as a whole (table 6). All of these words are scattered in a region that could be accounted for, at least partly, by stepwise duplications/deletions of mono- or oligonucleotides taken from tandemly repeated CTG's.
The aligned sectors extending from the 5' end of the R region (CIVCG-501, the beginning of viral RNA; see above) to the positions aligned with CIVCG-565 correspond to the TAR region, which has been extensively studied concerning its biological meaning and the stable stem-loop structure that forms TAR RNA (reviewed in Ou and Gaynor 1995; Rabson and Graves 1997). The HIV-1 TAR RNA contains both a loop and a bulge structure that are critical for Tat-mediated activation. The HIV-2 TAR RNA is capable of forming a complex structure that consists of two discrete stem-loop regions. Possible evolution routes from simple one-hairpin to complex branched TAR structures have been discussed in the literature. The extended portion of the HIV-2 TAR, relative to the HIV-1 TAR, has the greatest similarity to a human immunoglobulin pseudogene sequence, suggesting (see above) that this sub-sequence is a captured element (reviewed in Myers 1997). In the alignment, the sector referred to as "TAR Common Sector" is conserved between HIV-1 and HIV-2. It corresponds to the upper portion of the HIV-1 stem-loop (the bulge-and-loop zone) and to the 5' HIV-2 stem-loop region. The two successive sectors of the HIV-2 consensus that are boxed by a thick line in the alignment on the web page (the first including the TAR Common Sector) correspond (apart from a few bases) to the two HIV-2 TAR RNA discrete stem-loop regions:
As a whole, the results discussed above fit the molecular-evolution model hypothesized previously (Laprevotte 1989, 1992; Laprevotte et al. 1997): overrepresented oligonucleotides are scattered throughout the entire range of the retroviral sequences; they share complementary core consensuses that fit the rule of a trend to TG/CT excess (Ohno and Yomo 1990) and suggest starter tandemly repeated oligonucleotides (short tandem repeats giving rise to longer oligonucleotide repeats, as hypothesized previously [Southern 1972; Ohno 1988]); they are mixed with scrambled short-scale repetitions, deletions/duplications, tandem repeats, and cryptic simplicity patterns, suggesting a molecular evolution by scrambled stepwise short- and longer-range duplications/deletions (in addition to nucleotide miscopying).
Even though this model gives a good account of the repetitive aspects of retroviral nucleotide sequences, other evolutionary processes may be considered, such as gene conversion (leading to homogeneity throughout DNA sequences; see discussion in Laprevotte 1989) and a converging evolution toward repeated motifs serving useful functions (Laprevotte et al. 1997). This also leads to consideration of possible selective pressures maintaining the repeats.
| Conclusions |
|---|
The listed overrepresented oligonucleotides (selected here by using a calculation strategy applicable to short sequences and often containing CTG or CAG) are congruent with the retroviral signature (previously defined for the entire sequences) when focusing on the noncoding part of the HIV LTRs (this retroviral signature was not found among yeast, plant, and invertebrate retrotransposons; Terzian et al. 1997
Sector by sector, we hypothesize a large number of local duplication/deletion scenarios that span a great portion of the alignment and could account for length divergences between the HIV-1 and HIV-2 groups. Consequently, base substitutions are by no means the unique evolutionary process to take into account for comparisons of such sequences and their phylogenetic analyses. Altogether, our results support our previous hypotheses on the molecular evolution of retroviral nucleotide sequences: a large portion of the sequences can be accounted for by scrambled stepwise short- and longer-range duplications/deletions. There is an emerging hypothesis of an important duplication/deletion role for the reverse transcriptase that could (in addition to already-proposed scenarios) generate perfect or stuttering tandem repeats and then a cryptic simplicity of the sequence. The consensus overrepresented motifs and the numerous cryptic simplicity sectors observed suggest one or several starter tandemly repeated short motif(s). Additional comparisons of decreasingly homologous sequences using a fast and reliable method for the alignments could further unravel these evolutionary patterns.
A reliable and accurate alignment of the compared sequences is an essential tool for performing a high-resolution molecular evolution study. The accurate assessment of the nucleotide alignment with already-identified benchmarks, with the polypeptide alignment, and with the N-block presentation coding of the sequences allows us to consider the alignment reliable. The multiplied alphabet obtained by the mathematical strategy called N-block presentation appears to be a promising method to increase the signal-to-noise ratio in nucleotide alignment studies.
It is well known that in eukaryotic cells, reverse transcription processes are not restricted to parasitic retroviruses, and that a diverse set of genes, referred to as retrotranscripts, is derived from their normal progenitor genes via an mRNA intermediate (Boeke and Stoye 1997). These elements, as well as retroviruses and retrotransposons, are a source of genomic variation, as may be the increasing number of human endogenous retrovirus sequences that have been described (Kjellman, Sjogren, and Widegren 1999). The endogenous IAP particles of mice may also contribute to the generation of genetic diversity in this host population. Furthermore, it has been hypothesized that if the prebiotic genetic material was RNA, reverse transcription might have been required to formulate DNA-based genetic information (Katz and Skalka 1990). All of these data and others, taken together, suggest that further investigation of reverse transcription could shed light on some aspects of eukaryotic genome evolution and should consequently not be restricted to the biology of retroviruses.
| Supplementary Material |
|---|
The multiple alignment of the 27 HIV-1 and HIV-2 LTR nucleotide sequences is available from the website http://genome.genetique.uvsq.fr/laprevotte. In addition, this full sequence alignment is directly available from I.L.
| Footnotes |
|---|
Pierre Capy, Reviewing Editor
1 Keywords: HIV-1 and HIV-2 LTR nucleotide sequences, multiple alignment, N-block presentation, "retroviral signatures" of overrepeated oligonucleotides, scrambled stepwise duplications/deletions, cryptic simplicity
2 Address for correspondence and reprints: Ivan Laprevotte, Laboratoire Génome et Informatique, Université de Versailles Saint Quentin-en-Yvelines, 45 avenue des Etats-Unis, 78035 Versailles cedex, France. E-mail: laprevotte{at}genetique.uvsq.fr
| References |
|---|
Abdeddaïm S., 1997 Incremental computation of transitive closure and greedy alignment Lect. Notes Comput. Sci 1264:167-179
Antezana M. A., M. Kreitman, 1999 The nonrandom location of synonymous codons suggests that reading frame-independent forces have patterned codon preferences J. Mol. Evol 49:36-43
Averof M., A. Rokas, K. H. Wolfe, P. M. Sharp, 2000 Evidence for a high frequency of simultaneous double-nucleotide substitutions Science 287:1283-1286
Barnett S. W., M. Quiroga, A. Werner, D. Dina, J. A. Levy, 1993 Distinguishing features of an infectious molecular clone of the highly divergent and noncytopathic human immunodeficiency virus type 2 UC1 strain J. Virol 67:1006-1014
Beasty A. M., M. J. Behe, 1988 An oligopurine sequence bias occurs in eukaryotic viruses Nucleic Acids Res 16:1517-1528
Berkhout B., 1996 Structure and function of the human immunodeficiency virus Prog. Nucleic Acid Res. Mol. Biol 54:1-34
Boeke J. D., J. P. Stoye, 1997 Retrotransposons, endogenous retroviruses, and the evolution of the retroelements Pp. 343-435 in J. M. Coffin, S. H. Hughes, and H. E. Varmus, eds. Retroviruses. Cold Spring Harbor Laboratory Press, New York
Chaboissier M. C., D. Finnegan, A. Bucheton, 2000 Retrotransposition of the I factor, a non-long terminal repeat retrotransposon of Drosophila, generates tandem repeats at the 3' end Nucleic Acids Res 28:2467-2472
Coward E., 1998 Mathematical methods for repeated patterns in biological sequences Dr.Ing. thesis, Norwegian University of Science and Technology, Trondheim, Norway
Coward E., 1999 Shufflet: shuffling sequences while conserving the k-let counts Bioinformatics 15:1058-1059
Devereux J., 1989 The GCG sequence analysis software package Version 6.0. Genetics Computer Group, Madison, Wis
Didier G., 1999 Caractérisation des N-écritures et application à l'étude des suites de complexité ultimement n + cste Theor. Comput. Sci 215:31-49
Estable M. C., B. Bell, A. Merzouki, J. S. G. Montaner, M. V. O'Shaughnessy, I. J. Sadowski, 1996 Human immunodeficiency virus type 1 long terminal repeat variants from 42 patients representing all stages of infection display a wide range of sequence polymorphism and transcription activity J. Virol 70:4053-4062
Frech K., R. Brack-Werner, T. Werner, 1996 Common modular structure of lentivirus LTRs Virology 224:256-267
Gao F., E. Bailes, L. Robertson, et al. (12 co-authors) 1999 Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature 397:436-441
Gaynor R., 1992 Cellular transcription factors involved in the regulation of HIV-1 gene expression AIDS 6:347-363
Gurtler L. G., P. H. Hauser, J. Eberli, A. von Brunn, S. Knapp, L. Zekeng, J. M. Tsague, L. Kaptue, 1994 A new subtype of human immunodeficiency virus type 1 (MVP-5180) from Cameroon J. Virol 68:1581-1585
Kandel D., Y. Matias, R. Unger, P. Winkler, 1996 Shuffling biological sequences Discrete Appl. Math 71:171-185
Katz R. A., A. M. Skalka, 1990 Generation of diversity in retroviruses Annu. Rev. Genet 24:409-445
Kjellman C., H. O. Sjogren, B. Widegren, 1999 HERV-F, a new group of human endogenous retrovirus sequences J. Gen. Virol 80:2383-2392
Klaerr-Blanchard M., H. Chiapello, E. Coward, 2000 Detecting localized repeats in genomic sequences: a new strategy and its application to B. subtilis and A. thaliana sequences Comput. Chem 24:57-70
Kreutz R., U. Dietrich, H. Kühnel, K. Nieselt-Struwe, M. Eigen, H. Rübsamen-Waigmann, 1992 Analysis of the envelope region of the highly divergent HIV-2 ALT isolate extends the known range of variability within the primate immunodeficiency viruses AIDS Res. Hum. Retroviruses 8:1619-1629
Laprevotte I., 1989 Scrambled duplications in the feline leukemia virus gag gene: a putative pattern for molecular evolution J. Mol. Evol 29:135-148
Laprevotte I., 1992 Mo-MuLV nucleotide sequence exhibits three levels of oligomeric repetitions, suggesting a stepwise molecular evolution J. Mol. Evol 35:420-428
Laprevotte I., S. Brouillet, C. Terzian, A. Hénaut, 1997 Retroviral oligonucleotide distributions correlate with biased nucleotide compositions of retrovirus sequences, suggesting a duplicative stepwise molecular evolution J. Mol. Evol 44:214-225
Laprevotte I., A. Hampe, C. J. Sherr, F. Galibert, 1984 Nucleotide sequence of the gag gene and gag-pol junction of feline leukemia virus J. Virol 50:884-894
Malik H. S., W. D. Burke, T. H. Eickbush, 2000 Putative telomerase catalytic subunits from Giardia lamblia and Caenorhabditis elegans. Gene 251:101-108
Morgenstern B., A. Dress, T. Werner, 1996 Multiple DNA and protein sequence alignment based on segment-to-segment comparison Proc. Natl. Acad. Sci. USA 93:12098-12103






