Skip Navigation


MBE Advance Access originally published online on December 8, 2006
Molecular Biology and Evolution 2007 24(3):640-649; doi:10.1093/molbev/msl195
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Material
Right arrow All Versions of this Article:
24/3/640    most recent
msl195v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Strope, C. L.
Right arrow Articles by Moriyama, E. N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Strope, C. L.
Right arrow Articles by Moriyama, E. N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

indel-Seq-Gen: A New Protein Family Simulator Incorporating Domains, Motifs, and Indels

Cory L. Strope*, Stephen D. Scott* and Etsuko N. Moriyama{dagger}

* Department of Computer Science and Engineering, University of Nebraska–Lincoln
{dagger} School of Biological Sciences and Plant Science Initiative, University of Nebraska–Lincoln

E-mail: emoriyama2{at}unl.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
Reconstructing the evolutionary history of protein sequences will provide a better understanding of divergence mechanisms of protein superfamilies and their functions. Long-term protein evolution often includes dynamic changes such as insertion, deletion, and domain shuffling. Such dynamic changes make reconstructing protein sequence evolution difficult and affect the accuracy of molecular evolutionary methods, such as multiple alignments and phylogenetic methods. Unfortunately, currently available simulation methods are not sufficiently flexible and do not allow biologically realistic dynamic protein sequence evolution. We introduce a new method, indel-Seq-Gen (iSG), that can simulate realistic evolutionary processes of protein sequences with insertions and deletions (indels). Unlike other simulation methods, iSG allows the user to simulate multiple subsequences according to different evolutionary parameters, which is necessary for generating realistic protein families with multiple domains. iSG tracks all evolutionary events including indels and outputs the "true" multiple alignment of the simulated sequences. iSG can also generate a larger sequence space by allowing the use of multiple related root sequences. With all these functions, iSG can be used to test the accuracy of, for example, multiple alignment methods, phylogenetic methods, evolutionary hypotheses, ancestral protein reconstruction methods, and protein family classification methods. We empirically evaluated the performance of iSG against currently available methods by simulating the evolution of the G protein–coupled receptor and lipocalin protein families. We examined their true multiple alignments, reconstruction of the transmembrane regions and beta-strands, and the results of similarity search against a protein database using the simulated sequences. We also presented an example of using iSG for examining how phylogenetic reconstruction is affected by high indel rates.

Key Words: protein superfamily • sequence simulation • domains • motifs • indels


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
Generating multiple alignments and reconstructing phylogenetic relationships from protein sequences are frequently the most important first steps for various bioinformatics and molecular evolutionary analyses. Reconstructing reliable phylogenetic relationships, for example, depends on the quality of multiple alignments. Generating reliable multiple alignments, however, becomes extremely difficult when one has to deal with distantly related, remotely similar sequences. During such long sequence evolution, dynamic evolutionary events such as duplication, recombination, insertion, and deletion are frequently found. Such dynamic changes are important when we consider the evolution of protein functions.

Rigorous evaluation of multiple alignment and phylogenetic methods, especially for highly diverged sequences with large numbers of amino acid insertions or deletions (indels), is possible if we have a tool that can simulate realistic sequence evolution incorporating indel events based on a valid biological model derived from empirical data. However, current simulation methods are mostly focused on reconstructing "substitution" events (for both nucleotides and amino acids) over entire protein sequences, for example, evolver (Yang 1997Go) and the simulator in Molphy (Adachi and Hasegawa 1996). They cannot accommodate the numerous evolutionary forces that act upon functional domains found in families of proteins. These domains frequently, but not always, coincide with structural domains and are connected together by amino acid sequences that have no rigid structure (e.g., random coils). Conservation of these domain regions often imposes different evolutionary constraints, including different evolutionary models (e.g., amino acid frequencies and substitution rates) and indel parameterizations (e.g., maximum indel size, length distribution, and probability of opening an indel). Smaller functional units, or motifs, embedded within domain regions may require other specific models, such as invariability of amino acid positions and prevention of indel events within the motif. Seq-Gen (Rambaut and Grassly 1997Go) introduced substructures that are allowed to have different lengths and evolve along different evolutionary trees. However, one cannot change evolutionary pressures acting upon the substructures, such as differing substitution models and motif conservation. Simulating sequence evolution with site-specific interactions (SISSI) (Gesell and von Haeseler 2006) accounts site-specific interactions in its nucleotide sequence simulation, enabling a process such as RNA evolution with secondary-structure constraints. To apply SISSI for protein sequence evolution, however, we need information on residue interactions in greater detail than we currently have.

Although indels are a necessary component for portraying sequence evolution, their addition in sequence simulation poses a number of problems. One such set of problems relates to the indel creation itself, that is, where to and where not to place indels, the distribution of indel lengths, and how to generate inserted sequences. The pioneering application for the addition of indel events during simulation is ROSE (Stoye et al. 1998Go). ROSE implements indel events using a simple, strict model: Indels occur linearly with respect to evolutionary distance and with a user-defined length distribution. The sequence inserted is a random sequence based on the given amino acid frequencies. Empirical indel models whose indel probabilities are nonlinear with respect to evolutionary distance, such as Benner et al. (1993)Go and Chang and Benner (2004)Go, cannot be utilized in ROSE. User input parameters are also global across the entire sequence, not allowing for domain-specific interactions.

Recently, several other sequence simulators including indels have been developed: SIMPROT (Pang et al. 2005Go), MySSP (Rosenberg 2005Go), EvolveAGene (Hall 2004Go), and DAWG (Cartwright 2005Go). SIMPROT can generate protein sequences and, like Seq-Gen, allows the user to create subsequences evolving using different parameters, in the end concatenating these subsequences into a single sequence. However, SIMPROT does not include the options for the input of a root sequence, the preservation of sequence motifs, or changing amino acid frequencies between subsequences.

indel-Seq-Gen (iSG) is our new method of generating realistic protein families, accomplished through the introduction of multiple models of indel evolution and the ability to parameterize and simulate heterogeneous domains. It provides a simple and biologically realistic method of modeling a protein family. It allows the use of multiple related root sequences, instead of a single static root sequence or multiple random sequences, to generate a realistic large sequence space. This is useful for evaluating, for example, protein classification methods. We showed that when compared with other methods, our method could generate divergent protein sequences without destroying important functional properties, whereas providing a clean and concise method of model creation.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
Protein Sequence Evolution Engine without Indels
As the name implies, we use Seq-Gen (Version 1.3.2) from Rambaut and Grassly (1997)Go as the substitution evolution engine. The current Seq-Gen incorporates both nucleotide and amino acid sequence simulation and succeeded the original protein sequence generator PSeq-Gen (Grassly et al. 1997Go). The reason for using Seq-Gen as the core engine is threefold: 1) We found no reason to reinvent a method that evolves sequences with substitutions as many methods already exist (Adachi and Hasegawa 1996Go; Rambaut and Grassly 1997Go; Yang 1997Go; Stoye et al. 1998Go; Pang et al. 2005Go), 2) Seq-Gen has been widely used in simulation studies, and 3) Seq-Gen's setup is a good fit to iSG's family-modeling system. Our approach adds new capabilities to Seq-Gen as described below.

Protein Family Creation
Protein families are often characterized by their structural and functional components, domains, and motifs. The heterogeneity in evolutionary rates caused by domains and motifs is often modeled with continuous or discrete gamma rate distributions (e.g., Yang 1993Go, 1997Go; Adachi and Hasegawa 1996Go). However, if the functional regions are well known and the user wishes to simulate a sequence family that reflects such characteristics, it is better to provide region- or site-specific constraints and evolutionary models, rather than to allow the simulator to randomly select conserved sites. For example, a transmembrane region should not be guided by the same substitution model as a hydrophilic coil region. Neither should a sequence motif occurring in a loop region be changed at the same rate as the surrounding region. Highly diverged families such as the G protein–coupled receptors (GPCRs) also accumulate indel events throughout their evolution without destroying their functional units. Our strategy for simulating protein family evolution is through 1) the introduction of "domain" units and 2) the introduction of invariable sites, "motifs."

Motif Conservation
The conservation of essential motifs in sequence generation can be accomplished by disallowing substitution events within the motifs. Invariable sites are introduced for this purpose. However, when indels are incorporated, the conservation of length-sensitive motifs becomes a problem. For example, ROSE allows for the preservation of motifs by joining the gamma-distribution rates and invariable sites into a mutation probability vector, which we call the I + {gamma} array. Each site in the I + {gamma} array specifies the substitution rate for the corresponding amino acid site. If the rate is 0.0, then the site is invariable. If the rate is less than 1.0, the site is not allowed to have any indel. Tying indels to the relative substitution frequencies is a drawback in this representation because it cannot represent low-frequency indels in regions that have high substitution rates, such as transmembrane regions. Neither can it represent highly variable sites in length-based motifs that depend on the distance between particular amino acids, such as the "CXXC" motif. When shorter or longer, this motif loses its propensity for creating the disulfide bridges that are essential for thioredoxin-fold proteins (Chivers et al. 1996Go).

To rectify these issues, we created a new representation of invariable arrays with 4 classes of invariable sites: 0, no constraint; 1, invariable, 2, no indel; and 3, invariable and no indel. "Invariable" sites can have no substitution, but insertions are allowed between consecutive invariable sites. "No-indel" sites refer to the position in which indels are not allowed but substitutions are. "Invariable and no-indel" sites allow neither substitution nor indel to occur. Neither an insertion nor a deletion can exist between 2 consecutive "no-indel" positions (positions represented as "22," "23," "32," and "33" in the array). For example, the thioredoxin-fold CXXC motif (where X is any amino acid) can be represented as "3223" in the invariable array. This will hold the 2 C's invariable, allow the 2 X's to be substituted independently, and disallow indels from occurring within the motif, preserving the length dependence.

Heterogeneous Evolution among Domains
Seq-Gen (Rambaut and Grassly 1997Go) introduced the idea of "partitions" to allow variations in evolutionary rates and patterns among domains. Each partition represents a subsequence of a protein, and each partition evolves independently as specified by the evolutionary tree. However, more variations in evolutionary patterns need to be incorporated to represent, for example, different indel rates in secondary-structure regions compared with nonstructured coil regions (Thorne 2000Go). In iSG, the percentage of invariable sites, branch lengths (representing substitution rates), amino acid frequencies, substitution models, and indel rates can all be varied between partitions. Such flexible options allow us to generate realistically complex protein families.

Indel Event Handling
Indel events are governed by 4 parameters: the probability of an indel occurring, the placement of indels, the length distribution, and the maximum indel length. For insertion events, 1 more parameter is required for generating the amino acids for an inserted sequence.

Probability of Indels
In iSG, the following 2 indel models are used:

  • Linear model, which assumes that the number of indel events is linearly related to the substitution rate: P(indel) = kd, where k is a user-defined constant and d is the substitution rate specified by the branch length between the ancestral and descendent sequences.
  • The Chang and Benner (2004)Go model, which specifies the probability of an indel based on the following exponential equation:

Formula (1)

Placement of Indels
The placement of indels is chosen randomly. For insertions, invariable and no-indel sites are excluded from consideration. Deletion can happen only at "no constraint" positions ("0" in the invariable array). For a deletion of size n, any position that is within n – 1 positions from a nonzero position in the invariable array is excluded.

Length Distribution of Indels
The distribution of gap lengths in protein sequences has been studied through the use of pairwise alignments (Chang and Benner 2004Go) as well as by examining structural databases (Qian and Goldstein 2001Go; Gooneskere and Lee 2004Go). By default, iSG will create the normalized Zipfian distribution for indel lengths (Chang and Benner 2004Go):

Formula (2)
where N is the number of proteins of length X. Other models can also be input by the users as precalculated indel distributions (e.g., following a structural indel distribution developed by Qian and Goldstein [2001]Go or Gooneskere and Lee [2004]Go).

Insertion Sequence Options
iSG incorporates 2 methods of amino acid sequence generation. The random model randomly chooses amino acids based on the given frequencies. The second method is unique because it chooses amino acids in the insertion sequence by incorporating the "neighbor preference" (Xia and Xie 2002Go) along with the amino acid frequencies. Neighbor preference was derived by studying neighboring amino acid positions in functional proteins in order to empirically determine the effect of the side chains in the acceptance of functionally beneficial amino acid changes. We use their 20 x 20 matrix to generate insertion sequences in a Bayesian fashion:

Formula (3)
where P(j|i) is the probability that amino acid j follows amino acid i and P(j) and P(i) are the frequencies of amino acids j and i in the sequence, respectively. Given an amino acid, we choose the next amino acid based on the probability P(j|i). When the first amino acid is inserted, we use the amino acid preceding the insertion point as amino acid i. If the insertions are longer than 1 amino acid, we use the newly generated amino acid as the predecessor to find the next amino acid.

Originally, neighbor preferences were built on the protein sequences derived from the Escherichia coli K-12 genome. We also provide neighbor preferences calculated over all of the protein sequences contained in the Swiss-Prot database (Bairoch et al. 2005Go). The neighbor preferences are global over the entire simulation run of a family of sequences.

Root Sequence Options
iSG provides 2 options to incorporate a root sequence provided by the user: a single root sequence or a set of sequences. For the single root sequence option, a sequence and its invariable array are read in and used verbatim for all simulation runs. However, for protein families that may contain highly variable regions, varying the root sequence in order to explore a larger sequence space may be advantageous. To facilitate this, iSG incorporates the option of inputting a set S of sequences in the form of a multiple alignment. For each simulation run, a single root sequence, which may vary between simulation runs, is constructed from S in the following manner. For each position (column) in the multiple alignment, where half or more of the sequences are nongap, we choose a representative amino acid using either of the following 2 methods: 1) the consensus method, which chooses the representative using majority rule or 2) the random method, which probabilistically chooses the representative based on the amino acids that exist in the column of the multiple alignment. If any alignment position has a gap in more than half of the sequences, the position is considered to be an insertion position and is ignored in the construction of the root sequence. The following 3 parameters can be set for building the root sequence from a set of sequences:

  • The range of columns in the multiple alignments to use.
  • The number of sequences used. All (the default) or a given number of sequences can be selected from the multiple alignment using the bootstrap-sampling method (selecting randomly with replacement). The bootstrap-sampling option is useful, for example, when using highly divergent sequences with a large number of gap sites. The simple consensus from such divergent sequences may include mainly gaps and very few amino acids. On the other hand, when there are many equally probable root sequences in S, the user may opt to randomly select 1 sequence as the root for each simulation run.
  • The root sequence generation method. The default consensus method uses the majority rule to choose the amino acid for the column, using a coin toss to break ties. The random method randomly chooses the amino acid representing the column based on the position-specific amino acid frequencies, with the exception of invariable columns. For invariable columns, the representative amino acid is chosen using the consensus method.

Implementation
iSG is freely available at http://bioinfolab.unl.edu/~cstrope/iSG/. iSG has been tested on the RedHat Linux, SuSE Linux, IRIX, and Macintosh OSX operating systems with the PERL and gcc compilers.

Comparison of Simulation Methods
Transmembrane Region Prediction
For the GPCR family simulation, we used HMMTOP 2.0 (Tusnady and Simon 2001Go) to predict the number of transmembrane regions as well as the position of the N-terminal region (intra- or extracellular). Realistically simulated sequences should have 7 transmembrane regions and extracellular N-terminal regions.

Beta-Strand Prediction
For the lipocalin family simulation, we used PSIPRED (Jones 1999Go) to predict the secondary structures. Realistically simulated sequences should have 8 beta-strands that correspond to the beta-barrel structure found in the lipocalin family proteins.

Blast Similarity Search
In order to see how the simulated sequences preserved the similarities against the template proteins, we performed BlastP (Altschul et al. 1997Go) protein similarity searches against the protein database UniProt (Bairoch et al. 2005Go), using the simulated sequences as the queries. We did not use the option to filter out the low-complexity sequences. The default E-value threshold (10) was used to cut off the search results.

Pfam Search
Simulated sequences were used to search the profile hidden Markov model database, Pfam (Bateman et al. 2004Go), using the program hmmpfam of the HMMER package (Eddy 1998Go) with the Pfam_ls database to find the glocal alignments. The scores against the models "7tm_1" (PF00001; for GPCRs) and "Lipocalin" (PF00061; for lipocalins) were recorded. The default E-value threshold (10) was used to cut off the search results.

Parametric Bootstrap and Phylogenetic Analysis
Following the phylogenetic tree reconstructed from the template alignment (see Results and Discussion), 1,000 simulated data sets were generated, each including the same number of protein sequences as in the template set. Phylogenetic trees were reconstructed from these data sets. The maximum parsimony and Neighbor-Joining (Saitou and Nei 1987Go) with JTT distance (Jones et al. 1992Go) estimation methods, consensus trees, and bootstrap values were calculated using PHYLIP programs (version 3.65; Felsenstein 2005Go).


    Results and Discussion
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
Simulation Setup
Template Multiple Alignments
We chose to model protein sequences based on 2 protein family data sets: 1) the vertebrate olfactory receptor family and 2) the lipocalin protein family.

G Protein–Coupled Receptors.
The vertebrate olfactory receptor family belongs to the GPCR superfamily, Class A (rhodopsin-like). GPCR proteins have 7 transmembrane regions, and their sequences are known to be highly diverged including many indels. Among them, vertebrate olfactory receptors are relatively conserved and generating a template multiple alignment is not too difficult. We chose 2 sequences from each of the 14 subfamilies, yielding 28 protein sequences. We also included 1 outgroup sequence (OPSD_BOVIN) from the GPCR rhodopsin class (Class A), as was done previously (Gilad et al. 2005Go). Thus, the total number of sequences in our template alignment is 29.

The process of building the template multiple alignment is illustrated in figure 1, steps 1–3. Based on the sequence characterization in UniProt, we first split each sequence into 15 parts: the 7 transmembrane regions (TM1–TM7), 4 extracellular regions (EC1–EC4), and 4 cytosolic regions (CY1–CY4). We manually adjusted region boundaries where the UniProt characterizations were conflicting.


Figure 1
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— A flow chart showing the process of iSG protein sequence simulation. In this example, parameterization of evolutionary information is done based on the vertebrate olfactory receptor family. Steps 1–5 illustrate the process of obtaining the template multiple alignment, phylogeny, and various parameters. Step 6 shows the sample input for iSG with each line showing parameters used for a different subsequence (domain). See supplementary figure 2 (Supplementary Material online) for the example input file. Two evaluation methods, phylogenetic analysis and alignment comparison using profile HMMs, are shown in dashed boxes.

 
Multiple alignments of each region were done using T-Coffee (Notredame et al. 2000Go) and adjusted manually. These regional alignments were concatenated together obtaining the template multiple alignment for the entire protein sequences (see fig. 2a; the actual alignment is shown in supplementary fig. 1, Supplementary Material online). In Step 4, we reconstructed the maximum parsimony phylogeny using PAUP* version 4.0b10 (Swofford 2003Go). Using the topology obtained from the template multiple alignment, the branch lengths were calculated for each region. The average number of changes per site for each segment was obtained by dividing each tree length by the number of sites in the segment. Subsequence parameters are set as shown in step 6 of figure 1 (full specifications are found in supplementary fig. 2a, Supplementary Material online). They include amino acid frequencies, relative evolutionary rates, root sequence templates, indel parameters, and phylogenies. We obtained the amino acid frequencies specific to the 3 regions (EC, CY, and TM) using 1594 GPCR Class A sequences found in UniProt.


Figure 2
View larger version (70K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Multiple alignments of the template and simulated GPCR sequences. Each pixel represents one position in the multiple alignment. The color of the pixel represents: a gap (white), an amino acid from a transmembrane region (gray), and an amino acid not from a transmembrane region (black). The template alignment (a) includes 29 GPCR sequences, and the 15 subsequence regions are indicated above. For 5 simulated data sets produced by iSG (b) and ROSE (c), the true multiple alignments and multiple alignments reconstructed by T-Coffee are shown.

 
Lipocalins.
Lipocalin proteins are a family of small globular proteins belonging to the calycin superfamily. They are often implicated in allergic reactions. Members of the lipocalin family are highly diverged, but all share a conserved beta-barrel conformation consisting of 8 beta-strands (Flower et al. 1993).

The template multiple alignment and phylogeny for the lipocalin family including 23 sequences were obtained from an evolutionary study done by Sánchez et al. (2003)Go. The sequences were split into 12 regions including 5 beta-strand regions as shown in figure 3c. Note that the region including 4 short beta-strand regions (B5678 in fig. 3c) is treated as a single evolution unit. Branch lengths were estimated from each segment as described before. Four coil regions (indicated by "C" in fig. 3c) were very short (each with 4–6 amino acids in length). Therefore, their branch lengths were obtained based on their concatenated sequences. The amino acid frequencies specific to the beta-strand, coil, or alpha-helix regions were calculated using those obtained from the template multiple alignment (75% weighting) combined with pseudocounts (25% weighting). The pseudocounts were obtained either from the conformational parameters by Chou and Fasman (1974)Go for beta-strands and alpha-helices or from Jones et al. (1992)Go for the N terminus, C terminus, and coil regions.


Figure 3
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Distribution of predicted beta-strands along the lipocalin sequences. (a) Proportions of predicted beta-strands are plotted for each amino acid position of the alignments. For the reference alignment, the proportions are based on 23 lipocalin sequences. For simulated sequences, average proportions were plotted based on 5 simulated data sets using their true alignments obtained from each simulation method. Seven regions are mainly gaps with amino acids inserted into a few sequences (less than half the sequences have an amino acid). The proportions of predicted beta-strands in these 7 regions are represented by a single dot. These regions correspond to those in which indels are allowed (3, 7, 9, B and F as well as the start and end regions in the reference alignment, see the region labels in b). For each method, the average numbers of positions predicted as beta-strands are calculated from subsequence regions simulated as beta-strands (regions 2, 4, 6, 8, A, C, E, G in b) and coils (regions 1, 3, 5, 7, 9, B, D, F, H in b). These values are shown as mean-beta and mean-coil, respectively, in each plot. (b) The reference lipocalin alignment obtained from Sánchez et al. (2003)Go is schematically shown with a single pixel representing 1 amino acid. The gray pixels represent beta-strands. The alignment is 213 positions long and consists of 23 lipocalins. Seventeen regions are labeled 1 through H. (c) Twelve subsequence regions used for simulation. The region B5678 concatenated short consecutive regions from 9 to G.

 
Setting Parameters
The setting of the parameters was done so that the simulated sequences were close to the template alignment, but still general enough to allow for variations. For example, in the GPCR template alignment (fig. 2a and supplementary fig. 1a, Supplementary Material online):
  1. TM1, 2, 4, and 6 contain more gaps than TM3, 5, and 7 and
  2. The loop regions between TM4 and TM7 (EC4, CY3, and EC5) are more diverged than other regions.

Parameters such as indel rates for each region are set to closely follow these features. Supplementary figure 1 (Supplementary Material online) shows the invariable array used with iSG that holds the conserved motifs found in the vertebrate olfactory receptors by Fuchs et al. (2001)Go. For the lipocalin template alignment, we allowed no indel in the beta-strand or alpha-helix regions. For the other lipocalin regions, we set the deletion rate to be 1 per 100 substitutions and the insertion rate to be twice higher than the deletion rate. Supplementary table 1 (Supplementary Material online) shows a list of the parameters used in iSG, ROSE, and SIMPROT. In this study, a single global substitution matrix (JTT for iSG, SIMPROT, and Seq-Gen and PAM for ROSE) was used. However, with iSG we can specify different substitution matrices for different subsequences (e.g., a TM-specific substitution matrix). Using these parameters a data set of 29 GPCR-like sequences or 23 lipocalin-like sequences were simulated following their template phylogenies.

Comparison of Simulated Sequences between iSG and Other Methods
We compared iSG with ROSE, Seq-Gen, and SIMPROT for their protein family simulation performance. As described before, ROSE uses the I + {gamma} array for simulating protein family sequences, and SIMPROT and Seq-Gen both allow subsequence generation. The major disadvantage of using ROSE was that we could not allow low-frequency indels although allowing amino acid substitutions as observed in TM regions in GPCRs. Because indel rates are linearly correlated with substitution rates in ROSE, in low-frequency indel regions, we were required to set low substitution rates. Because parameters (e.g., indel rates and amino acid frequencies) could not be varied among subsequences in ROSE, we were forced to set parameters with the "average case" values across the entire sequence. Note also that setting these parameters in ROSE was very tedious and time consuming. For example, over 300 real-valued numbers had to be assigned to the I + {gamma} array in the case of GPCRs, and the array does not visually align with the sequence because the former is a real-valued array and the latter is a character array. iSG, on the other hand, uses integer values in the invariable array, which can be easily aligned with a character array (see supplementary fig. 1, Supplementary Material online).

Setting the parameters for SIMPROT and Seq-Gen were more straightforward. The Seq-Gen root sequence was constructed using iSG's root sequence construction method, and all parameters available to Seq-Gen were set the same as iSG. For SIMPROT, we partitioned the sequence as we did in iSG, and many parameters could be set as done with iSG. The root sequence is randomly generated in SIMPROT. Additionally, SIMPROT uses an equal rate for insertions and deletions.

One thousand data sets were simulated using each method. The "true" multiple alignments showing the actual indel positions were collected. We compared the true alignments generated by iSG and ROSE against the profile HMM built from the template GPCR alignment (supplementary fig. 3, Supplementary Material online; Edgar and Sjolander 2004). No significant differences were shown between the normal score distributions obtained from iSG (mean = 75.03 and variance = 17.20) and ROSE (mean = 72.30 and variance = 16.72), indicating that the sequences generated by the 2 methods using the set of the parameters we chose were approximately equivalent with respect to the template multiple alignment.

Simulated Sequences
In general, functional domains are under strong selective constraints and very few indels are tolerated. TM regions are one such example. On the other hand, changes (both substitutions and indels) within loop regions that simply connect functional or structural domains often have much smaller negative consequences and thus are under very weak constraints. In figure 2, TM regions are illustrated with a gray color and indels are shown as white gaps. As figure 2a shows, vertebrate olfactory receptor proteins are more conserved in the regions between TM2 and TM4 than other regions (as found in Fuchs et al. 2001Go). Note that some TM regions (TM1 and TM4) can have a few small indels. In the simulated sequences, we attempted to recreate these heterogeneous sequence features: fewer indels between TM2 and TM4 and, conversely, more indels in N- and C-terminal regions and those between TM4 and TM7.

As shown in figure 2b, the simulated sequences by iSG reproduced the intended sequence features well. Because the equivalent parameters could be set, SIMPROT produced the true alignments similar to those with iSG (data not shown). ROSE, however, spread the indels throughout the entire length of the proteins (fig. 2c). When the simulated sequences were aligned using T-Coffee multiple alignment method, reconstructed alignments based on iSG's data sets placed indels in a manner that is much closer to the template multiple alignment than those based on ROSE's data sets.

iSG allowed low-frequency indels within TM regions. From figure 2, it appears as if iSG allows far too many indels in the TM regions compared with the template alignment. However, in this experiment we intentionally chose parameters to introduce a large number of indels to our simulated data sets. In ROSE, on the other hand, it was not possible to allow indels within TM regions. In order to introduce any indels within TM regions, the mutation array values used in ROSE need to be set at 1.0 or higher. Because indel rates are linearly correlated to substitution rates, using such high mutation array values would perforate each region with so many indels that the resulting loss of protein function would be inevitable. To avoid such unreasonable simulation conditions, we used conservative substitution rates with ROSE (0.7 in the I + {gamma} array). Consequently, in the true alignments produced by ROSE (fig. 2c) there are gray "islands" (TM regions) where no indel was found. The side effect of this was an easier multiple alignment reconstruction for T-Coffee, which was nearly perfect in the TM region reconstructions. Supplementary figure 4 (Supplementary Material online) shows examples of the actual true alignments given by different simulation methods.

Conservation of Transmembrane Regions
To check if the TM regions were retained in the reconstructed data sets, we predicted TM regions from simulated GPCR sequences. The numbers of predicted TM regions were 7.03 ± 0.30 by iSG, 5.94 ± 1.25 by ROSE, 0.20 ± 0.37 by SIMPROT, and 6.84 ± 0.91 by Seq-Gen. Clearly, iSG has the greatest accuracy in reproducing the 7 TM regions. The N terminus (EC1) was correctly predicted to be extracellular 94% of the time in simulated data sets of iSG. Conversely, with ROSE and Seq-Gen, only 44% and 55% of simulated sequences, respectively, were predicted to have the extracellular N terminus correctly. These results clearly show that iSG's ability to allow different amino acid frequencies between domains is effective in preserving important features of transmembrane proteins more accurately than other methods. This is important because results from Panchenko and Madej (2005) suggested patterns found in nondomain regions, such as loop or N/C-terminal regions, were more important in recognizing superfamily members than considering only the conserved core regions.

Beta-Region Prediction
To examine the conservation of beta-strand regions, secondary structures were predicted from simulated lipocalin sequences. As shown in figure 3, comparing with the reference sequences, all simulated sequences have lower proportions of beta-strands indicated by lower "mean-beta" values and lower peaks of beta-strand plots corresponding to the 8 beta-strand regions (gray regions in fig. 3c). However, iSG performed better than the other methods with just over 5% better accuracies. Among the simulation methods compared, SIMPROT performed most poorly. Eight beta-strand regions were indistinguishable in its simulated sequences, indicated also by the lowest mean-beta and highest "mean-coil" values. The poor performance by SIMPROT is explained by its inability to use a specific root sequence.

In this study, beta-strand regions were characterized only by the amino acid composition derived from small samples of lipocalin sequences. Therefore, the performance by iSG to reconstruct beta-strand structures was surprisingly good considering the use of such simple parameters. Better performance can be expected by optimizing pseudocount methods and using secondary structure–specific amino acid composition and substitution matrices derived from larger samples.

Blast and PFAM Search Results
In order to further examine how simulated sequences preserved functionally important sequence properties, we used the simulated sequences as queries for BlastP protein similarity search. Table 1 summarizes the search results. The top 250 hits obtained by all methods except SIMPROT were correctly from various members of vertebrate olfactory receptors. Seq-Gen performed the best of the 3 methods indicated by the highest average scores. The absence of indels in Seq-Gen generated sequences must have caused a low percentage of gaps in the alignments. iSG performed nearly as well as Seq-Gen and outperformed ROSE with higher scores and lower percentages of gaps. The lower percentage of gaps found in Blast alignments with iSG sequences than those with ROSE sequences can be explained by the use of amino acid frequencies specific to subsequences (e.g., TM, extracellular, and cytosolic regions) in iSG. It must have produced amino acids that are most likely to appear in each region of GPCRs. As a result, fewer gaps were required in the iSG alignment making better hit scores than in ROSE. Contrary to iSG, ROSE, and Seq-Gen, simulated sequences by SIMPROT did not find any vertebrate olfactory receptor protein under the E-value threshold used with the BlastP search. Within the threshold, SIMPROT sequences averaged between 3 and 5 unrelated hits, with the average highest score = 30 (E value = 1.5).


View this table:
[in this window]
[in a new window]

 
Table 1 The Results of BlastP and PFAM Search When Using Simulated GPCR Sequences as the Query

 
PFAM search results showed similar conclusions as BlastP. Except for SIMPROT sequences, the search against PFAM profile HMM database using simulated sequences returned 7tm_1, the profile HMM entry for GPCR Class A, as the most common hit. Table 1 (in the bottom half) summarizes the scores against the 7tm_1 profile HMM by the 4 simulation methods. iSG clearly outperformed ROSE. The difference between iSG and Seq-Gen was again very small, with slight advantage to iSG-generated sequences. No significant hit by GPCR-derived profile HMMs were found using SIMPROT-simulated sequences as queries.

For simulated lipocalin proteins, BlastP and PFAM search did not produce as many significant hits as we observed with GPCR sequences, which implies the difficulty in simulating beta-strand proteins. As shown in table 2, the BlastP results were comparable among the simulation methods except for SIMPROT. Only about 2 lipocalins were found per simulated sequence query. Interestingly, slightly more different lipocalin sequences were found among simulation sets produced by iSG than among those produced by ROSE and Seq-Gen. This is probably a result of the multiple alignment root sequence option in iSG changing the root sequences between data sets. Against the PFAM Lipocalin profile HMM entry, iSG had the lowest scores. However, iSG produced the highest number of sequences (33.62%) that showed a hit to the lipocalin profile, compared with ROSE and Seq-Gen (25.68% and 22.41%, respectively). Again, simulated sequences by SIMPROT did not find any lipocalin profile HMM.


View this table:
[in this window]
[in a new window]

 
Table 2 The Results of BlastP and PFAM Search When Using Simulated Lipocalin Sequences as the Query

 
Note that, as described earlier, the root sequences used for ROSE and Seq-Gen were constructed by the "consensus method" we incorporated in iSG. In this method, a consensus sequence was constructed based on the reference multiple alignment and this sequence was used as the root sequence for ROSE and Seq-Gen. This may explain the similar performances observed by BlastP and PFAM among iSG, ROSE, and Seq-Gen. Neither ROSE nor Seq-Gen in their original methods included this capability.

Phylogenetic Analysis with Parametric Bootstrap
Neighbor-Joining phylogenies were reconstructed based on the T-Coffee as well as the true multiple alignments of simulated sequences. For both GPCR and lipocalin simulations using the parameters described earlier, the topologies of the consensus trees obtained from all 4 methods were identical to the original phylogenies reconstructed from the template multiple alignments, and all nodes were supported with 99% or better bootstrap values. To explore the effect of indels on phylogenetic reconstruction, we simulated another 100 data sets of GPCRs and lipocalins using iSG, with twice higher substitution rates and with and without indels as shown in table 3. When substitution rates were not changed ("Substitution rate x 1"), indels did not significantly affect the phylogenetic reconstructions. However, when the substitution rates were doubled ("Substitution rate x 2"), for both GPCR and lipocalin simulations, the branch supports became lower when indels were incorporated in the sequence evolution. For example, the number of internal branches supported by 90% or higher was 9 for the lipocalin phylogeny based on T-Coffee alignments when indel was not incorporated. However, the same number decreased to only 1 when indels were incorporated. Similar results were obtained for GPCR simulations although the effect was less drastic. The average branch support value (80.2%) for simulated lipocalin phylogenies using T-Coffee multiple alignments with indels was 10 percent lower than when no indel was incorporated. Note also that when indels were incorporated, phylogenies based on T-Coffee alignments were less supported than those based on the true multiple alignments. Phylogenies based on simulated GPCR sequences were consistently more supported than those based on simulated lipocalin sequences, and incorporating indels showed only small decreases in branch supports. This is because the lipocalin template sequences are more diverged (the average pairwise sequence identity = 13.7%) than GPCR template sequences (the average pairwise sequence identity = 35.4%). Thus, our simulations generated more diverged lipocalin-like sequences than GPCR-like sequences.


View this table:
[in this window]
[in a new window]

 
Table 3 Phylogenetic Analysis Using Simulated GPCR and Lipocalin Sequences with Various Substitution and Indel Rates

 

    Conclusion
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
iSG is a new method for simulating protein sequence evolution and generating realistic protein families. It has a unique ability to simulate heterogeneous evolution of multidomain protein families in addition to the incorporation of indel evolution. iSG is highly flexible and various sequence features can be easily included in the simulation. For example, iSG allows highly variable regions in the sequence although still maintaining low indel rates. This is useful in simulating motifs such as CXXC in thioredoxin-fold proteins or zinc fingers. This is not possible in other simulation methods. The use of a template multiple alignment is also a unique strength of iSG. With these new functions, iSG can be used effectively for testing the performance of various molecular evolution and bioinformatics methods when they are applied to extremely divergent heterogeneous protein analysis. Because each subsequence can be simulated with a different evolutionary tree, lateral gene transfer and recombination detection methodologies can also be examined. For protein families with well-known features, iSG can be used to generate synthetic model sequences. Such model sequences can be used to search their candidate members from databases.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
Supplementary figures 1–4 and table 1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 
We thank Drs Nick Grassly (Imperial College) and Andrew Rambaut (Department of Zoology, Oxford University) whose program, Seq-Gen, is the simulation engine on which iSG is built. We also thank Dr Dave Posada and 3 anonymous reviewers for their helpful comments.


    Footnotes
 
Dan Graur, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Methods
 Results and Discussion
 Conclusion
 Supplementary Material
 Acknowledgements
 References
 

    Adachi J and Hasegawa M. (1996) MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Computer Science Monographs 28 (March).

    Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402.[Abstract/Free Full Text]

    Bairoch A, Apweiler R, Wu CH, et al. (15 co-authors). (2005) The universal protein resource (UniProt). Nucleic Acids Res 33:D154–D159.[Abstract/Free Full Text]

    Bateman A, Coin L, Durbin R, et al. (13 co-authors). (2004) The Pfam protein families database. Nucleic Acids Res 32:D138–D141.[Abstract/Free Full Text]

    Benner SA, Cohen M, Gonnet G. (1993) Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J Mol Biol 229:1065–1082.[CrossRef][ISI][Medline]

    Cartwright RA. (2005) DNA assembly with gaps (DAWG): simulating sequence evolution. Bioinformatics 21:iii31–iii38.[Abstract/Free Full Text]

    Chang MSS and Benner SA. (2004) Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments. J Mol Biol 341:617–631.[CrossRef][ISI][Medline]

    Chivers PT, Laboissière MC, Raines RT. (1996) The CXXC motif: imperatives for the formation of native disulfide bonds in the cell. EMBO J 15:2659–2667.[ISI][Medline]

    Chou PY and Fasman GD. (1974) Prediction of protein conformation. Biochemistry 13:222–245.[CrossRef][Medline]

    Eddy SR. (1998) Profile hidden Markov models. Bioinformatics 14:755–763.[Abstract/Free Full Text]

    Edgar RC and Sjolander K. (2004) COACH: profile-profile alignment of protein families using hidden Markov models. Bioinformatics 20:1309–1318.[Abstract/Free Full Text]

    Felsenstein J. (2005) PHYLIP phylogeny inference package. Version 3.65. Distributed by the author(Department of Genome Sciences, University of Washington, Seattle (WA)).

    Flower DR, North ACT, Attwood TK. (1993) Structure and sequence relationships in the lipocalins and related proteins. Protein Sci 2:753–761.[Abstract]

    Fuchs T, Glusman G, Horn-Saban S, Lancet D, Pilpel Y. (2001) The human olfactory subgenome: from sequence to structure and evolution. Hum Genet 108:1–13.[CrossRef][ISI][Medline]

    Gesell T and Haeseler A. (2006) In silico sequence evolution with site-specific interactions along phylogenetic trees. Bioinformatics 22:716–722.[Abstract/Free Full Text]

    Gilad Y, Man O, Glusman G. (2005) A comparison of human and chimpanzee olfactory receptor gene repertoires. Genome Res 15:224–230.[Abstract/Free Full Text]

    Gooneskere NCW and Lee B. (2004) Frequency of gaps observed in a structurally aligned protein pair database suggests a simple gap penalty function. Nucleic Acids Res 32:2838–2843.[Abstract/Free Full Text]

    Grassly NC, Adachi J, Rambaut A. (1997) PSeq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Comput Appl Biosci 13:559–560.[Free Full Text]

    Hall BG. (2004) Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol Biol Evol 22:792–802.

    Jones DT, Taylor WR, Thornton JM. (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8:275–282.[Abstract/Free Full Text]

    Jones DT. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202.[CrossRef][ISI][Medline]

    Notredame C, Higgins DG, Heringa J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217.[CrossRef][ISI][Medline]

    Panchenko AR and Madej T. (2005) Structural similarity of loops in protein families: toward the understanding of protein evolution. BMC Evol Biol 5:10.[CrossRef][Medline]

    Pang A, Smith AD, Nuin PAS, Tillier ERM. (2005) SIMPROT: using an empirically determined indel distribution in simulations of protein evolution. BMC Bioinformatics 6:236.[Medline]

    Qian B and Goldstein RA. (2001) Distribution of indel lengths. Proteins Struct Funct Genet 45:102–104.[CrossRef][ISI][Medline]

    Rambaut A and Grassly NC. (1997) Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci 13:235–238.[Abstract/Free Full Text]

    Rosenberg MS. (2005) MySSP: non-stationary evolutionary sequence simulation, including indels. Evol Bioinform Online 1:81–83.

    Saitou N and Nei M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425.[Abstract]

    Sánchez D, Ganfornina MD, Gutiérrez G, Marín A. (2003) Exon-intron structure and evolution of the lipocalin gene family. Mol Biol Evol 20:775–783.[Abstract/Free Full Text]

    Stoye J, Evers D, Meyer F. (1998) Rose: generating sequence families. Bioinformatics 14:157–163.[Abstract/Free Full Text]

    Swofford D. (2003) PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. (Sinauer Associates, Sunderland (MA)).

    Thorne J. (2000) Models of protein sequence evolution and their applications. Curr Opin Genet Dev 10:602–605.[CrossRef][ISI][Medline]

    Tusnady GE and Simon I. (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics 17:849–850.[Abstract/Free Full Text]

    Xia X and Xie Z. (2002) Protein structure, neighbor effect, and a new index of amino acid dissimilarities. Mol Biol Evol 19:58–67.[Abstract/Free Full Text]

    Yang Z. (1993) Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol 10:1396–1401.[Abstract]

    Yang Z. (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–556.[Free Full Text]

Accepted for publication November 29, 2006.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Material
Right arrow All Versions of this Article:
24/3/640    most recent
msl195v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Strope, C. L.
Right arrow Articles by Moriyama, E. N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Strope, C. L.
Right arrow Articles by Moriyama, E. N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?