Skip Navigation


MBE Advance Access originally published online on June 26, 2008
Molecular Biology and Evolution 2008 25(9):1995-2007; doi:10.1093/molbev/msn145
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
25/9/1995    most recent
msn145v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Bao, L.
Right arrow Articles by Bielawski, J. P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bao, L.
Right arrow Articles by Bielawski, J. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Likelihood-Based Clustering (LiBaC) for Codon Models, a Method for Grouping Sites according to Similarities in the Underlying Process of Evolution

Le Bao*, Hong Gu*, Katherine A. Dunn{dagger} and Joseph P. Bielawski*,{dagger}

* Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
{dagger} Department of Biology, Dalhousie University, Halifax, Nova Scotia, Canada

E-mail: j.bielawski{at}dal.ca.


    Abstract
 TOP
 Abstract
 Introduction
 Theory and Methods
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Models of codon evolution are useful for investigating the strength and direction of natural selection via a parameter for the nonsynonymous/synonymous rate ratio ({omega} = dN/dS). Different codon models are available to account for diversity of the evolutionary patterns among sites. Codon models that specify data partitions as fixed effects allow the most evolutionary diversity among sites but require that site partitions are a priori identifiable. Models that use a parametric distribution to express the variability in the {omega} ratio across site do not require a priori partitioning of sites, but they permit less among-site diversity in the evolutionary process. Simulation studies presented in this paper indicate that differences among sites in estimates of {omega} under an overly simplistic analytical model can reflect more than just natural selection pressure. We also find that the classic likelihood ratio tests for positive selection have a high false-positive rate in some situations. In this paper, we developed a new method for assigning codon sites into groups where each group has a different model, and the likelihood over all sites is maximized. The method, called likelihood-based clustering (LiBaC), can be viewed as a generalization of the family of model-based clustering approaches to models of codon evolution. We report the performance of several LiBaC-based methods, and selected alternative methods, over a wide variety of scenarios. We find that LiBaC, under an appropriate model, can provide reliable parameter estimates when the process of evolution is very heterogeneous among groups of sites. Certain types of proteins, such as transmembrane proteins, are expected to exhibit such heterogeneity. A survey of genes encoding transmembrane proteins suggests that overly simplistic models could be leading to false signal for positive selection among such genes. In these cases, LiBaC-based methods offer an important addition to a "toolbox" of methods thereby helping to uncover robust evidence for the action of positive selection.

Key Words: codon model • likelihood-based clustering • Bayes error rate • nonsynonymous/synonymous rate ratio • positive Darwinian selection


    Introduction
 TOP
 Abstract
 Introduction
 Theory and Methods
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
The ratio of nonsynonymous (amino acid altering) and synonymous (silent) substitution (commonly denoted as the dN/dS ratio or {omega}) has proven to be a valuable index of the strength and direction of natural selection pressure (for a review, see Bielawski and Yang 2004Go). A value of {omega} < 1 indicates that, on average, amino acid altering changes to a protein have had negative fitness consequences for the organism, and a value of {omega} > 1 indicates that amino acid altering changes have increased fitness. However, because the strength and direction of selection on a given amino acid is a function of the 3-dimensional structure of the protein, most proteins are expected to be subject to variable selection pressures among codon sites. Furthermore, the combination of structural and functional constraints means that adaptive changes are expected to occur at only a small subset of sites, with most sites subject to strong negative selection (e.g., Gillespie 1991Go). For this reason, considerable effort has been devoted to developing methods to infer if individual amino acid sites are subject to positive or negative selection pressure (e.g., Nielsen and Yang 1998Go; Suzuki and Gojobori 1999Go; Yang and Swanson 2002Go).

The different methods have unique advantages and limitations, and it is not our purpose to review all previous work; thorough reviews and comparisons are available from several sources (e.g., Wong et al. 2004Go; Kosakovsky Pond and Frost 2005Go; Massingham and Goldman 2005Go; Yang et al. 2005Go). However, several studies indicate that inadequate modeling of the underlying substitution process can negatively impact estimates of substitution rates (e.g., Yang and Nielsen 2000Go; Dunn et al. 2001Go; Aris-Brosou and Bielawski 2006Go) and classification of sites according to selection pressure (e.g., Anisimova et al. 2002Go; Wong et al. 2004Go; Kosakovsky Pond and Muse 2005Go). Regardless of the method of inference, biological interpretation of any differences among sites in {omega} requires that such differences are due to selection pressure alone. If several aspects of the substitution process are not constant across sites in real data, estimated differences in {omega} might not be solely due to differences in selection pressure. Recent work on fixed-effect codon models provides increased flexibility to model heterogeneity among sites in the transition to transversion rate ratio, codon frequencies, relative rates, and selection pressure (Bao et al. 2007Go). However, these models are not useful when there is no prior information by which to partition the data, and they cannot be fit to partitions comprised of a single site due to the large number of parameters.

The field of statistical clustering offers a possible solution to the problem of resolving groups of sites under similar selection pressure when several aspects of the substitution process vary among sites. We proceed under the assumption that codon sites in a multiple sequence alignment could be comprised of "clusters" that share a common generating model. In this paper, we describe a new clustering method, called likelihood-based clustering (LiBaC), which allows us to maximize the likelihood of the data when different subsets of codon sites have different evolutionary models. Hence, LiBaC can be used to identify sites subject to positive selection when several aspects of the substitution process vary among codon sites. We use computer simulation and real data analyses to evaluate the performance of LiBaC and several alternative approaches. Our simulation studies reveal that estimates of {omega} can be negatively impacted when other aspects of the substitution process, such as transition to transversion ratio and codon frequencies, are not constant across sites. We introduce the use of the "Bayes error rate" as an objective standard for the performance of methods that classify sites according to selection pressure. Using this framework, we show that performance is expected to depend on the data, with some evolutionary scenarios representing more difficult classification problems than others. We find that LiBaC can provide improved estimates of model parameters under a variety of scenarios and can approach the theoretical upper boundary on classification performance.


    Theory and Methods
 TOP
 Abstract
 Introduction
 Theory and Methods
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Fixed-Effect Codon Models
We employ the basic codon model of Goldman and Yang (1994)Go, which assumes that the process of substitutions from one codon to another is a Markov process. The ijth element Pij(t) in transition matrix P(t) gives the probability going from codon i to codon j during time t. There are 64 codons, and the 3 stop codons (UAA, UAG, and UGA) are excluded from the state space of the model because they do not occur within a functional protein-coding gene. The transition matrix P(t) can be calculated by P(t) = eQt, where Q = {qij} is a 61 x 61 rate matrix; the element qij denotes the instantaneous substitution rate from codon i to codon j, and only single-nucleotide substitutions are permitted to occur instantaneously. The elements of Q are parameterized as follows:

Formula
where {pi}j is the frequency of the jth codon, {kappa} is the transition to transversion rate ratio, and {omega} is the nonsynonymous to synonymous rate ratio.

In the fixed-effect approach, codon sites are partitioned into G different groups, and each group may be permitted to have a different evolutionary model (Yang and Swanson 2002Go; Kosakovsky Pond and Muse 2005Go; Bao et al. 2007Go). Thus, these models can accommodate considerable among-site variability in the substitution process; any combination of heterogeneity or homogeneity among site groups for the parameters {kappa}, {omega}, and {pi} may be specified. Typically, site groupings are derived from a priori knowledge of a protein's structural and functional domains; for example, buried versus exposed sites in the 3-dimensional structure of the protein. Two problems have hindered the application of fixed-effect models: 1) for many genes, there is no obvious a priori criterion by which sites can be partitioned into different groups and 2) there is often uncertainty about how to partition sites even when some a priori information is available. In the latter case, it is often possible to divide sites in many different ways according to different criteria (e.g., active, internal, surface, highly variable, or conserved sites), all of which might serve as a suboptimal basis to partition sites.

Likelihood-Based Clustering
We develop a LiBaC method to partition codon sites into groups where each group has a different model, and the likelihood over all sites is maximized (LiBaC). Given a tree topology, the likelihood of the observed data at the ith codon site is f(xi|{theta}). Let fk(xi|{theta}k) be the probability of observing codons at site i under the hypothesis that this codon site belongs to the kth cluster (i.e., group of sites), where {theta}k is the collection of parameters corresponding to the kth cluster (k = 1, ..., G). Suppose the mixing probabilities are {tau}1, {tau}2, ...{tau}G such that P(xi belongs to the kth cluster) = {tau}k. The purpose is then to maximize the mixture log likelihood

Formula

A difficulty results from the fact that a summation appears inside the logarithm. The typical algorithm to solve this problem is the expectation–maximization (EM) algorithm (Dempster et al. 1977Go; McLachlan and Krishnan 1997Go). The EM algorithm augments the observed data xi by a G-dimensional binary random variable (latent variable) Formula with a particular element zik = 1 if xi belongs to the kth cluster; zik = 0 otherwise. The marginal distribution over zi is specified according to the mixing coefficients {tau}1, {tau}2, ...{tau}G. The density of an observation xi given zi is given by Formula thus, the joint probability of the so-called "complete" data {xi,zi} can be written as Formula The EM algorithm circumvents the difficulty of maximizing the mixture log likelihood by maximizing the joint log likelihood of the complete data. If the value of zi is given, then the joint probability of the complete data {xi,zi} simply takes the form of Formula. However, as zi is unknown in practice, the joint probability of the complete data is estimated by its expected value under the posterior distribution of the latent variable. The resulting expected complete data log likelihood is

Formula

Note that independence among the codon sites of a gene is assumed; hence, the log likelihood of the whole sequence is simply the sum of the log likelihood of each site as above (Goldman and Yang 1994Go). The E-step of the EM algorithm evaluates the posterior distribution of the latent variable for given parameter values Formula. The M-step estimates the Formula by maximizing the above complete data log likelihood. Each pair of successive E- and M-steps gives rise to a revised Formulaand increases the log-likelihood value. The convergence can be checked by either the log likelihood or the parameter values.

For the Markov models of codon evolution, the M-step proves to be very time consuming; an alternative procedure which improves this point by compromising the values in E-step is also developed. More specifically, the posterior probabilities in E-step Formula will be replaced by 1 if site i has the highest posterior probability coming from cluster k and Formula for Formula. This corresponds to setting a classification step between the E-step and M-step, which was termed as classification EM by Celeux and Govaert (1992)Go. This can also be viewed as a generalization of the popular K-means clustering algorithm (MacQueen 1967Go) using likelihood as criterion in the codon evolutionary models. This method is hereafter referred as "hard-LiBaC," and the former method, based on an exact EM algorithm, is referred to as "soft-LiBaC." The speed of hard-LiBaC, soft-LiBaC, and several alternatives is provided in the Supplementary Material online for a variety of real data sets.

We summarize both soft-LiBaC and hard-LiBaC procedures as below:

Initial step: Use M0 (Goldman and Yang 1994Go) to separately estimate the parameters for sites initially placed in a user-defined site partition, or use M3 (Yang et al. 2000Go) to estimate the parameters for a user-defined number of site classes G.
E-step: Based on the current parameter estimates, Formula, compute the posterior probabilities by Bayes’ rule:

Formula

C-step (only for hard-LiBaC, skip this step for soft-LiBaC): Set Formula and all others Formula.
M-step: Reestimate parameters Formula by maximizing Formulausing the current posterior probabilities:

Formula

Convergence: Check for convergence of the log likelihood Formulaevaluated at Formula. If the convergence criterion is not satisfied return to E-step.

Because EM is a hill-climbing algorithm, the starting point for LiBaC (i.e., the initial cluster allocations and parameter values of the model) could influence the outcome; depending on the staring point, LiBaC could converge at a suboptimal peak in likelihood.

We implement LiBaC under 2 levels of among-class heterogeneity in the substitution process, referred to as LiBaC1 and LiBaC2. LiBaC1 permits heterogeneity among classes of sites in codon frequencies ({pi}’s), natural selection pressure ({omega}), transition to transversion ratio ({kappa}), and branch length scale parameter (c). LiBaC2 differs from LiBaC1 in that codon frequencies ({pi}’s) are assumed to be homogenous among classes of sites. Class-specific parameter values are estimated via maximum likelihood (ML) except that {pi}’s are estimated empirically. LiBaC1 and LiBaC2 make the same assumptions as fixed-effect codon models FE1 and FE2 (Yang and Swanson 2002Go; Bao et al. 2007Go). However, FE1 and FE2 treat each branch length in a reference data partition as free model parameters and estimate them by ML. As the computation cost of such an approach is prohibitive under LiBaC, we obtain an initial set of branch lengths via ML under M0, which is assumed to be proportionally correct for each class of sites, and a scale parameter, ci, is then estimated for each class. This way we avoid estimating a full set of branch length parameters. Note that the scale parameters are rescaled such that c1 = 1 (i.e., c1 = c1/c1, c2 = c2/c1, etc.). We found that this treatment provides very similar results to those obtained under FE1 and FE2 (Yang and Swanson 2002Go; Bao et al. 2007Go) in both simulated and real data analyses while providing considerable savings in computational time. LiBaC was implemented by modifying the codeml program of the PAML package (Yang 1997Go); the modified program is available at http://www.bielawski.info or http://www.mathstat.dal.ca/~hgu/publications.html.

Simulations and Analyses
Simulation is used to evaluate LiBaC (both soft and hard) according to 1) the classification of codon sites into different classes and 2) the reliability of parameter estimates for different groups. Our simulations include a balanced 2-class scenario, 2 unbalanced scenarios based on continuously variable selection pressure among sites, and a benchmark scenario adapted from Anisimova et al. (2001)Go.

To better understand the simulation results, we first need to describe the criteria that we use to evaluate the classification. Three measures of performance are considered; precision (accuracy), recall (power), and misclassification rate. These terminologies are closely associated with the statistics of classification. Precision is the percentage of correctly predicted positive selection sites among the total predicted to be under positive selection. Note that precision is also known as "accuracy" (Anisimova et al. 2002Go) or the "positive predictive value." Recall is measured as the percentage of correctly predicted positive selection sites among the number of all sites truly subject to positive selection. This measure is also known as "power" (Anisimova et al. 2002Go) or "sensitivity." Misclassification is the total percentage of sites misclassified from its own group to any other group. These 3 measures are not independent; given any 2 measures, and the sizes of the site groups, the third can be calculated. For example, for 2 groups, the misclassification rate is equal to Formula (1 + recall(1 –2 recall)/precision), where n1 and n2 are the number of sites in each group. It is not hard to see that with increasing precision, the recall decreases and vice versa.

The concepts of precision (i.e., accuracy) and recall (i.e., power) are often used in the context of measuring the performance of a method at detecting positive selection (Anisimova et al. 2002Go). As the problem is now formulated as a classification problem, the misclassification rate has its own importance. For any given model, the theoretical lower bound for misclassification rate, referred to as the Bayes error rate (Fukunaga 1985Go; Tumer and Ghosh 2003Go), can be found as an objective standard for comparison of classification methods. The Bayes error rate is defined as the expected misclassification rate, with the expectation taken on the true model. Hence, it is the theoretical lower bound on the misclassification rate. A high Bayes error rate indicates a hard classification problem, and a method that tends to give an average misclassification rate close to the Bayes error rate is regarded as a good method.

Bayes error rates for the models involved here are not analytically available. We thus estimate this theoretical lower bound by simulation. We simulate 100 data sets under 2 generating codon models with sequence lengths of 500 sites. For each site pattern, we calculate the posterior probabilities that a site was generated under either model according to the true model parameters and assign the site to a group according to the higher posterior probability. The average misclassification rate over the 100 data sets is the estimated Bayes error rate.

Similar to the above, we can also calculate the average precision and recall over 100 data sets by assigning the sites to groups using different posterior probability cutoff values calculated under the true generating models. A precision versus recall curve from these data shows the theoretical upper bound of precision for each possible value of the estimated recall. Given a set of candidate methods, those methods that yield values of precision and recall close to this "idealized precision–recall curve" are regarded as optimized methods. Note that a cutoff posterior probability can be chosen that always yields 100% precision (and 0% recall). As this is not an optimal solution to a classification problem, most investigators seek to achieve some compromise between precision and recall. In this study, the average misclassification rate is calculated for a cutoff posterior probability of 50%.

It is not possible to use standard approaches such as likelihood ratio test (LRT) and Akaike information criterion under LiBaC to select the optimal number of site classes (G). Although a simulation-based approach can be employed on a case-by-case basis, this would be too time consuming under LiBaC to be of practical value. Our solution to this problem is motivated by a class of techniques called "indirect inference" where an auxiliary criterion is employed when complex likelihood functions are difficult to work with Gouriéroux et al. (1993)Go and Genton and Ronchetti (2003)Go. In the spirit of indirect inference, we employ the likelihood score obtained under model M3 as an auxiliary criterion from which we test the goodness of fit of different numbers of site classes by using the LRT. Several simulation scenarios were used to verify the performance of this auxiliary criterion; acceptable performance was achieved over a variety of scenarios under {alpha} = 0.01 (Supplementary Material online). We note that mixture models allowing greater complexity than M3 could serve as better auxiliary criteria (Kosakovsky Pond SL, personal communication), and such models can be implemented by using the program HyPhy (Kosakovsky Pond et al. 2005Go). A test of the number of site classes by using M3 as the auxiliary criterion is hereafter referred to as the "surrogate test" for number of site classes.

Simulation Study 1: Balanced, 2-Class Scenarios
The purpose of this simulation is to evaluate LiBaC under a wide variety of scenarios of among-site heterogeneity. We adopt a balanced design comprised of 2 classes of sites having equal sizes; hence, the prior probability of each site class is equal. Four aspects of codon evolution ({omega}, c, {kappa}, and {pi}) are permitted to be heterogeneous among site classes. As we are interested in detecting sites subject to positive selection, site classes were always heterogeneous for {omega} ({omega}1 = 0.3 and {omega}2 = 1.5). The study (table 1) is designed to cover all 23 = 8 possible combinations of heterogeneity for {kappa}’s (scenarios A and C: {kappa}1 = 1, {kappa}2 = 5), branch lengths (scenarios A and B: c1 = 1, c2 = 5), and codon frequencies (in A1, B1, C1, and D1: {pi}1i = 1/64 and {pi}2i = empirical estimates from abalone sperm lysin). Note that the generating models A1 and A2 match the models assumed under LiBaC1 and LiBaC2, respectively. Data sets of 1,000 codons are created by independently simulating 2 groups of 500 codons. Data are simulated on a 16-taxon tree (Supplementary Material online). For each scenario, 100 data sets are simulated by using the program "evolver" of the PAML package (Yang 1997Go). All data sets are analyzed under soft-LiBaC1, soft-LiBaC2, and hard-LiBaC2.


View this table:
[in this window]
[in a new window]

 
Table 1 Parameter Values for the 8 Scenarios of Simulation Study 1

 
In addition, each scenario is analyzed under the most commonly used random effect models (M2a, M3, and M8). Here, branch lengths and parameters of the substitution model (except {pi}’s) are obtained by ML. Placement of sites into groups is carried out by using the naive empirical Bayes (Nielsen and Yang 1998Go) and Bayes empirical Bayes methods (Yang et al. 2005Go), hereafter referred to as NEB and BEB, respectively. The generating model D2 matches the random effect model M3 (G = 2).

Bayes error rates (table 2) provide a measure of difficulty of each scenario. For example, scenario A2 should be the easiest and D2 the hardest. Moreover, there is a clear relationship between the level of difficulty and the specification of the c parameter; scenarios where parameter c differed among groups (A and B: c1 = 1 and c2 = 5) represent substantially easier cases as compared with scenarios where parameter c is the same among partitions (C and D: c1 = c2 = 1). The difference in {omega} and c in scenarios A and B yields a difference in the absolute rate of dS and dN between groups of sites, which likely contributes to a greater signal for differential evolution between the 2 classes of sites. For any given classifier in table 2, the observed misclassification rate corroborates the relative difficulty of a scenario inferred from the Bayes error rate.


View this table:
[in this window]
[in a new window]

 
Table 2 The Bayes Error Rate and Method-Specific Classification Error Rates under the 8 Scenarios with the 500–500 Site Partition of Simulation Study 1

 
When comparing among classifiers, LiBaC nearly achieves the theoretical lower bound on misclassification error under cases A1 and A2. This is expected, as in these scenarios, the generating model matches that employed under LiBaC1 and LiBaC2, respectively. In scenarios A2, B1, and B2, the misclassification error of some classifiers (M2a, M3, and M8 under NEB) is comparable with LiBaC, although not quite as close to the lower bound. This is noteworthy because the assumption that c1 = c2 = 1 under M2a, M3, and M8 is incorrect for A2, B1, and B2, respectively. The reason for the low misclassification rates for those models is that differences among sites in both dS and dN can be "absorbed" by the among-site variability in the {omega} parameter alone. We note that among-site heterogeneity in both dS and dN is a realistic aspect of gene sequence evolution as several studies have uncovered evidence of such variability in a wide variety of genes and genomes (e.g., Kosakovsky Pond and Muse 2005Go; Bao et al. 2007Go). We believe this explains, in part, why models M2a, M3, and M8 have been so successful in providing biologically valuable results for a wide variety of genes (e.g., Yang 2005Go). However, our simulations indicate that estimates of {omega} under models M2a, M3, and M8 can be biased in such cases due to misspecification for the c parameter (Supplementary Material online).

In all the scenarios, the LiBaC methods are closer to the theoretical lower bound than the other classifiers (table 2). Under scenarios C and D, the observed error rates for all the classifiers are higher and further from the theoretical lower bound as compared with scenarios A and B. Interestingly, the 3 LiBaC methods outperformed M3 in case D2, where the generating model matches that of M3. The results suggest that LiBaC-based methods can be robust to model misspecification.

The misclassification error rates in table 2 are based on a posterior probability cutoff of 50%. However, the usual practice with NEB and BEB methods is to choose a higher cutoff value for sites having {omega} > 1 (typically 90% or 95%) thereby increasing the precision of identifying sites subject to positive selection. The cost of this practice is reduced recall. In order to evaluate the classifiers in this context, we plotted the precision and recall of each classifier in relation to an idealized precision–recall curve (fig. 1). The precision–recall curve is obtained from the Bayes error rate and represents the theoretical upper bound on performance for a given scenario. In general, curves that closely follow the upper and right side borders of the parameter space (scenarios A and B) are theoretically easier classification problems than those with curves that approach the 45° diagonal (scenarios C and D). As expected, the observed relationship between precision and recall for individual classifiers falls at or below the curve (fig. 1). The spread of points for a particular cutoff posterior probability (i.e., taking results for only 0.5, 0.9, or 0.95) illustrates that using common posterior probability cutoff does not yield comparable levels of performance among different classifiers. To directly compare precision among different classifiers, we would need to fix their recall at a common level, and this is not practical. However, we can assess each classifier by its distance from the precision–recall curve.


Figure 1
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Performance of LiBaC, and selected alternative methods, relative to the theoretical upper bound on precision for the 8 simulation scenarios described in table 1. The curve illustrates the relationship between the upper bound on precision relative to recall. Note that precision and recall are equivalent to the measures called "accuracy" and "power" by Anisimova et al. (2002)Go.

 
Consistent with the notion that scenarios A and B represent the easier cases; all classifiers performed reasonably well, being close to the upper bound on performance (fig. 1). Taken across the different cutoff values, the different classifiers achieve a remarkably wide range of trade-offs between precision and recall. Consistent with the misclassification rates, all 3 LiBaC methods are closer to the precision–recall curve than the other classifiers (fig. 1 and table 3).


View this table:
[in this window]
[in a new window]

 
Table 3 Precision and Recall of Positively Selected Sites in Scenarios A1, A2, and D2 of Simulation Study 1

 
Scenarios C and D present a bigger challenge to all classifiers. The NEB and BEB methods under models M2 and M8 achieve acceptable levels of precision but only by a nearly complete loss of recall (fig. 1 and table 3). These classifiers cannot be improved by adjusting the posterior probability cutoff as recall remains close to zero when the cutoff is set to 50%. The trade-off between recall and precision for NEB under model M3 can be adjusted via the posterior probability cutoff; but in these scenarios, this classifier falls well below the idealized precision–recall curve (fig. 1). As LiBaC methods yield performance close to the idealized precision–recall curve in scenarios C and D, its performance can be further tuned by adjusting the posterior probability cutoff to achieve different trade-off between precision and recall.

Measuring performance relative to the recall and precision curves in figure 1 reflects the primary interest of biologists to minimize the number of falsely detected positive selection sites among the number of sites they have inferred from the data; this is why the curves focus on precision rather than the classical false-positive rate (1—the specificity of the classifier). In cases A and B, there is a clear "elbow" (i.e., where the curvature is the greatest); this is the point where there is an optimal trade-off between precision and recall. Although the cutoff could be adjusted under LiBaC, simply using a 50% cutoff yields precision in excess of 90% while maintaining recall at 88–89% (table 3).

Simulation Study 2: Realistic Purifying Selection Scenario
Although the evolutionary models permitted by LiBaC1 and LiBaC2 are quite complex, they are nonetheless highly idealized as compared with the complexity of real gene evolution. In order to evaluate performance closer to a real-world scenario, we assume that sites were drawn from an unbalanced mixture of site classes characterized by continuous distributions for {omega} (fig. 2A). Site class 1 is simulated under a beta distribution for {omega} having a mean = 0.1, with p = 0.5 and q = 4.5 (L shaped), and site class 2 sites are simulated under a beta distribution for {omega} having a mean = 0.9, with p = 4.5 and q = 0.5 (J shaped). These groups are also heterogeneous for ci and {pi}{iota} ({kappa}1 = 2, c1 = 1, and {pi}1 = biased; {kappa}2 = 2, c2 = 5, and {pi}2 = 1/64). Site class 1 is comprised of 900 sites, and site class 2 is comprised of 100 sites. Note that this scenario represents a gene that is subject only to purifying selection pressure.


Figure 2
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Distribution of {omega} specified in simulation studies 2 and 3. (A) Distribution of {omega} in a realistic scenario of purifying selection pressure (study 2). The scenario is unbalanced, being comprised of a group of 900 sites and a group of 100 sites that differ in {omega} and {pi}. Group 1 (900 sites) is characterized by an L-shaped distribution for {omega} (beta distribution having shape parameters p = 0.5 and q = 4.5). To simulate these data, we discretize the beta distribution into 10 classes of sites having equal probability; the values of {omega} in each class are 0.0009, 0.0047, 0.0124, 0.0245, 0.0419, 0.0659, 0.0991, 0.1469, 0.2236, and 0.6359. Group 2 (100 sites) is characterized by a J-shaped distribution for {omega} (beta distribution having shape parameters p = 4.5 and q = 0.5). The area under the group 2 curve is shaded in gray. Again we use 10 discrete classes of sites; the values of {omega} in each class are 0.3641, 0.7764, 0.8531, 0.9009, 0.9341, 0.9581, 0.9755, 0.9876, 0.9953, and 0.9991. Discrete site classes are indicated by the vertical lines. (B) Distribution of {omega} in simulation study 3 extends simulation study 2 by adding a third group of sites having {omega} > 1. The evolutionary models of groups 1 and 2 are the same as that in study 2, but the size is adjusted; groups 1 and 2 are comprised of 700 and 200 sites, respectively. Group 3, having {omega} > 1, is comprised of 100 sites.

 
In a real-world scenario, we also lack knowledge of the optimal number of site classes. Hence, prior to analysis under LiBaC, we employed the surrogate test ({alpha} = 0.01) for number of site classes; G = 2 was selected for 47% of the data sets and G = 3 for the remaining 53% (Supplementary Material online). Given these data are comprised by a mixture of 2 continuous distributions, these results indicate that the surrogate test is likely to be conservative for real data sets. In this simulation study, we employ LiBaC under both G = 2 and G = 3.

All the LiBaC (G = 2) methods resolved the unbalanced structure of the data, revealing a large group of sites with low {omega} and a small group of sites with higher {omega} (table 4). For all 3 LiBaC-based methods, the parameter estimates are biased, with LiBaC having underestimated {omega} for both groups of sites (table 4). The bias in mean {omega} arises from the type of site misclassification errors; some sites from site class 1, which have generally lower {omega}, are incorrectly classified into site class 2 thereby causing an overestimate of its size and reducing its average {omega} (table 4). In this scenario, we observed a substantially larger misclassification error for hard-LiBaC, as compared with soft-LiBaC (table 4). This is presumably a result of compromising the values in the E-step under hard-LiBaC in order to limit the computational costs. Nonetheless, errors are conservative under all the LiBaC-based methods with respect to the most common analytical use of codon models; that is, in no cases was the presence of positively selected sites falsely suggested under LiBaC.


View this table:
[in this window]
[in a new window]

 
Table 4 Error Rate and Parameter Estimates for LiBaC and Selected Alternative Methods under the Realistic Purifying Selection Scenario (simulation study 2: 2-class beta distribution scenario with 900–100 site partition)

 
Although models M3 and M8 had the lowest overall misclassification error, this result is offset by the biases in parameter estimates that lead to false signal for sites subject to positive selection (table 4). Interestingly, M2a also produced false signal for positive selection, despite restrictions placed on its parameters intended to achieve conservative performance in detecting positive selection sites. Indeed, M2a suggests a lower average fraction of sites with {omega} > 1, making it more conservative than M3 and M8; however, the estimated value of {omega} for site class 2 is largest under M2a. These findings illustrate that differences among sites in the estimated value of {omega} can reflect more than differences in selection pressure when other aspects of the substitution process are not constant across sites.

With the exception of hard-LiBaC, results under G = 3 were very similar to those obtained under G = 2 (Supplementary Material online). Because hard-LiBaC yielded false signal for positive selection under G = 3, we do not recommend it for real data analysis.

LiBaC is implemented to address the tasks of parameter estimation and site classification. For the random effect models M2a and M8, the task of classification (alternatively referred to as a "prediction" problem) has been intrinsically linked to the process of hypothesis testing. Indeed, interpretation of NEB- and BEB-based classification of positively selected sites is recommended only if an LRT for positive selection indicates that such sites exist in a given data set (e.g., Wong et al. 2004Go; Yang et al. 2005Go). We did not carry out such LRTs in simulation study 1 as those simulated data always contained sites subject to positive selection. In simulation study 2, we must also consider the false-positive rate of the involved LRTs; that is, the percentage of false rejections of the null hypothesis as formulated under models M1a and M7. Hence, we carried out the required LRTs (M1a against M2a and M7 against M8) and found that the null hypothesis of no sites having evolved under positive selection was strongly rejected in all cases (Supplementary Material online). Furthermore, in all data sets, there are cases of high posterior probabilities for sites subject to positive selection under NEB and BEB, with some sites having a posterior probability >90% (Supplementary Material online). We note that several authors have predicted that model violations could lead to increases in the false-positive rate over the nominal level of the LRT and encouraged the report of such findings (Wong et al. 2004Go; Yang et al. 2005Go). Simulation study 2 clearly represents such a scenario.

Simulation Study 3: Realistic Positive Selection Scenario
Simulation study 2 was based on a realistic model of a gene subject only to purifying selection pressure. In simulation study 3, we extend the model to include a small fraction of sites subject to positive Darwinian selection pressure. Here the model of evolution assumes 3 classes of sites (fig. 2B). Site classes 1 and 2 follow the same evolutionary model as in simulation study 2 above but are comprised of just 700 and 200 sites, respectively. Site class 3 is simulated under {omega}3 = 5, {kappa}3 = 2, and c3 = 10 and biased codon frequencies. The surrogate test ({alpha} = 0.01) for number of site classes indicated G = 3 for 99% of the replicates (Supplementary Material online).

Misclassification error rates are generally similar among all the methods (table 5). All methods resolved a small fraction of sites (estimates ranging from 9% to 18%) having {omega} > 1 (table 5). However, M3 also suggested a second, moderately sized group of sites (22%) having {omega} slightly above 1, giving a false signal for a second group of sites potentially subject to positive selection. Estimates of {omega} under hard-LiBaC tend to be unrealistically high; again, computational savings achieved under hard-LiBaC appear to come with a cost in performance. Under models M2a, M3, and M8, heterogeneity in several aspects of the substitution process leads to amplification of the signal for differences between groups of sites, thereby allowing reasonably good classification based on {omega} alone. Whereas this yields favorable results when positively selected sites truly exist in the data, the effect appears to be a potential liability when such sites are not present in the data (as in simulation study 2).


View this table:
[in this window]
[in a new window]

 
Table 5 Error Rate and Parameter Estimates for LiBaC and Selected Alternative Methods under the Realistic Positive Selection Scenario (simulation study 3: 3-class beta distribution scenario with 700–200–100 site partition)

 
Simulation Study 4: Strictly Neutral Model
We simulate 50 replications under a scenario that is generally considered among the most difficult of cases; the single class of strictly neutral evolution ({omega} = 1) scenario (Anisimova et al. 2001Go; Kosakovsky Pond and Frost 2005Go). Because LiBaC is a clustering algorithm, we add a second class of sites under perfect purifying selection ({omega} = 0) to the scenario. This 2-class scenario was balanced (500 sites with {omega}1 = 0 and 500 sites with {omega}2 = 1). The site classes also differ according to parameters c and {pi} ({kappa}1 = 2, c1 = 1, and {pi}1 = 1/64; {kappa}2 = 2, c2 = 5, and {pi}2 = biased). Although such an extreme distribution of {omega} is unlikely to be observed in a real biological data, a very similar scenario was shown to constitute a challenging case (Anisimova et al. 2001Go); it thereby serves as a useful benchmark.

Prior to application of LiBaC, we carried out the surrogate test ({alpha} = 0.01) for number of site classes and selected G = 2 in 100% of the replications (Supplementary Material online). Although this scenario did not represent a difficult classification problem (Supplementary Material online), having {omega} on the boundary for positive selection ({omega} = 1) represents a challenge for parameter estimation. The danger is that an estimate of {omega} > 1 can easily be obtained for a group of sites, thereby yielding false signal for positive selection. Comparison of parameter estimates under LiBaC and M2a (the most conservative of the M-series models; Anisimova et al. 2001Go; Wong et al. 2004Go) reveals similar performance, with both approaches having a small propensity for estimating {omega} > 1 (Supplementary Material online). Indeed, the maximum estimate of {omega} over all the replications was just 1.18 under soft-LiBaC1 and 1.84 under M2a. In accordance with previous suggestions (Anisimova et al. 2001Go; Wong et al. 2004Go; Kosakovsky Pond and Frost 2005Go), we recommend caution in attributing the evolution of sites to positive selection when the estimates of {omega} are only marginally larger than 1, even when an LRT such as in M1a versus M2a is significant (Supplementary Material online).

Real Data Analysis: Transmembrane Proteins
Based on the results of our simulation studies, we recommend the following procedure for real data analysis.

Step1: Use a surrogate test to determine the number of site classes. Here we use the LRT under M3 ({alpha} = 0.01) as the auxiliary criterion.
Step 2: Use empirical Bayes under M3, with G determined from step 1, to provisionally place sites into groups for the purpose of model selection. We note that if structural information is available for the protein product of the gene, such information could be used in place of steps 1 and 2 as the basis a provisional assignment of sites to groups.
Step 3: Carry out model selection by using FE models following the suggestions of Bao et al. (2007)Go.
Step 4: If a model is identified in step 3 that includes variable {pi}’s among groups, then use soft-LiBaC1 to group sites according to similarities in the process of evolution. For other models, use an appropriate mixture model as implemented in either PAML (Yang 1997Go) or HyPhy (Kosakovsky Pond et al. 2005Go). Hard-LiBaC is not recommended for real data analysis.

Here, we analyze a set of 8 genes encoding transmembrane proteins according to the above steps. These genes were previously identified in a genomic scan of Rickettsia as having one or more significant LRTs for positive selection (KA Dunn, unpublished data). We restrict our sample of genes to transmembrane proteins because they are composed of hydrophobic membrane-spanning helices and hydrophilic loops that extend either into the cytoplasm or outside of the cell. Hence, these genes are expected to be composed of groups of sites having different equilibrium codon frequencies and evolutionary pressures. Indeed, step 3 of the analysis indicates 7 of the genes are best described by a model that permits different codon frequencies among groups (FE1 or FE9: Supplementary Material online). In the one case where we select a model that is not heterogeneous for codon frequencies (FE10), the gene (pgpA: 198 codons) is much smaller than the others (351–938 codons), and model selection could have been limited by the power of the involved LRTs. For the other 7 genes, either FE1 or FE9 was the best-fit model. FE1 and FE9 differ only in their treatment of the {kappa} parameter; FE1 permits heterogeneous {kappa} among sites, whereas FE9 specifies homogenous {kappa}.

Because FE1 and FE9 specify variable {pi}’s among groups of site, soft-LiBaC1 is most appropriate for step 4. For comparison, models M2a, M8, and soft-LiBaC2 are also employed in step 4. Selected results are presented in table 6 (see Supplementary Material online for full results). There is a clear discrepancy between the M-series models and LiBaC in the estimates of {omega} (table 6). Parameter estimates under M2a and M8 suggest the presence of positively selected sites in all 8 genes. Parameter estimates under soft-LiBaC1 and soft-LiBaC2 were generally consistent with each other but, unlike M2a and M8, suggest the presence of positively selected sites in only 3 genes (RfaL, ccmF, and nuoL3). In ccmF, the estimated value of {omega} is only marginally larger than 1 under soft-LiBaC-1, suggesting that the evidence for positive selection is weak for this gene. Based on the results of simulation study 2, a possible explanation for the discrepancy within these real data sets is the estimation bias that can arise when among-site heterogeneity is not accommodated (see simulation study 2).


View this table:
[in this window]
[in a new window]

 
Table 6 Selected Estimates of the Strength and Direction of Natural Selection in 8 Transmembrane Proteins

 
The discrepancy between M2a and LiBaC is less dramatic if we require a significant LRT prior to interpreting the parameter estimates. Here, model M1a is employed for the purpose of conducting LRTs for the presence of positively selected sites. The LRT of M1a versus M2a is not significant in 4 of the 5 genes where LiBaC does not indicate positive selection (table 6); this is despite the estimate of {omega} > 1 for a group of sites under M2a. In contrast, the LRT of M7 against M8 is significant for every gene in the data set (table 6). This result reinforces earlier findings (Anisimova et al. 2001Go; Wong et al. 2004Go) that the M1a–M2a LRT tends to be more conservative than other LRTs. For this reason, we suggest that it should be preferred over the M7–M8 LRT for real data analyses.

Taking together the results of our simulation studies and real data analyses, we recommend that data sets be tested for heterogeneity in aspects other than {omega} before testing for positive selection. Given evidence for such heterogeneity, we suggest claims for positive selection should be limited to those genes where results are consistent over multiple approaches. Although this idea of robustness is not a new one (e.g., Anisimova et al. 2001Go; Wong et al. 2004Go; Kosakovsky Pond and Frost 2005Go), our results highlight that the LRT of M7–M8, and parameter estimates under M2a and M8, should be viewed with additional caution when several aspects of the substitution process are known to vary among sites. For certain classes of protein-coding genes, such as the transmembrane proteins examined here, their biology suggests that such heterogeneity should be expected. In these cases, soft-LiBaC1 offers a new framework, when applied collectively with other methods, to help avoid false positives and identify robust evidence for the action of positive selection.


    Discussion
 TOP
 Abstract
 Introduction
 Theory and Methods
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
A family of methods, called model-based clustering (MBC: Banfield and Raftery 1993Go; Fraley and Raftery 1998Go), were developed for a purpose similar to those pursued here. In MBC, the data are viewed as coming from a mixture of underlying multivariate normal (Gaussian) probability distributions, each representing a different cluster of data. Clusters are resolved by maximizing the likelihood function:

Formula
where pk is the probability that an observation belongs to the kth cluster, pk > 0; Formula. Our task differed from the one addressed by MBC; we wished to maximize the likelihood that a set of codon sites belong to a cluster under a given model of evolution; in this case, one of several codon models described in Bao et al. (2007)Go. To achieve this we developed a new method, called LiBaC, which is an extension of the basic idea of MBC. The LiBaC algorithm can be viewed as a generalization of the MBC approach (Fraley and Raftery 1998Go) to Markov models of codon evolution (Goldman and Yang 1994Go; Muse and Gaut 1994Go). As compared with previous methods based on fixed-effect models (e.g., Yang and Swanson 2002Go), random effect models (e.g., Nielsen and Yang 1998Go), and counting methods (e.g., Suzuki and Gojobori 1999Go), LiBaC represents a novel approach to the problem of inferring amino acid sites subject to positive or negative selection pressure.

Site classifications under the LiBaC methods presented in this study were based on a 50% posterior probability cutoff value. Although we compared LiBaC with a selected set of commonly used methods, each using a typical posterior probability cutoff value from the literature, such comparisons are not straightforward. Our simulations clearly show that fixing different methods to a common posterior probability cutoff value does not guarantee comparable performance as recall can vary widely despite a common cutoff value. Furthermore, different cases of molecular evolution represent different levels of classification difficulty, with performance depending on the data at hand. As with the more commonly used methods, LiBaC performance could be fine-tuned by adjusting the posterior probability cutoff value; the question is how to adjust for optimal performance. We suggest that a priori data partitions, such as ones derived from the structure of the protein product, might be used for model selection and parameter estimation (e.g., Bao et al. 2007Go). The results of such an analysis could be used to simulate data from which an optimal precision–recall curve can be estimated and series of posterior probability cutoff values evaluated. Selecting a cutoff value closest to the point in the curve where the curvature is greatest, that is, the elbow should achieve a better trade-off between precision and recall for the data at hand than simply adopting the most commonly used cutoff value in the literature.

Recent work has revealed that among-site variability in synonymous substitution rates can occur in a variety of different genes and organisms, and can impact estimation of among-site variability in the dN/dS ratio (Kosakovsky Pond and Muse 2005Go). Moreover, Kosakovsky Pond and Muse (2005)Go showed classification of positively selected sites under methods that assume a constant rate of synonymous substitution among sites can be negatively impacted. By using bivariate distributions to model among-site variability in both dN and dS, Kosakovsky Pond and Muse (2005)Go achieved significant gains in the fit of a codon model to most of the data sets tested, although other aspects of the substitution process were treated as homogenous among sites. By using codon models with fixed effects determined from protein structure, Bao et al. (2007)Go showed that in addition to dN and dS, codon frequencies, and {kappa} can differ significantly among groups of codons within a gene. Building on those findings, we carried out a series of simulation studies where groups of sites differed in several aspects of the substitution process. Our results clearly show that differences among sites in the estimated values of {omega} can reflect more than just natural selection pressure when several aspects of the substitution process differ. Simulation study 2 shows that it is possible for such estimation biases to lead to false signal for positively selected sites, and our analysis of a set of transmembrane proteins suggests such problems could be frequent among certain types of genes. Nonetheless, a large-scale survey of real gene sequences will be required to better gauge the risk of such errors to real data analysis.

We proposed a procedure for real data analysis that recommends using soft-LiBaC1 if model selection in step 3 yields a model having variable codon frequencies among groups. In step 3, models can be tested very rapidly as the group membership is treated as a fixed effect. An alternative approach would be to evaluate a wide variety of random effect mixture models that can be implemented in the program HyPhy (Kosakovsky Pond et al. 2005Go). This approach offers the advantage that likelihood scores can be directly compared with the commonly used M-series models (M1, M2a, etc.) as well as possibly providing better starting conditions from which to run the EM in LiBaC. Possible drawbacks of this alternative include a somewhat higher computational cost and the inability to permit codon frequencies to vary among sites. Regardless of how LiBaC is ultimately integrated into a real data analysis, we expect LiBaC will be most useful when implemented in conjunction with other approaches. In this way, it contributes to a "toolbox" of methods best applied to real data with the goal of identifying robust evidence for positive selection.

LiBaC can easily be extended to nucleotide or amino acid models; however, it appears most promising for codon and amino acid models. Amino acids sites in the folded protein are subject to different microenvironments and perform different functions; hence, a better understanding of the connections between genotype and protein phenotype may be served by improving methods for grouping sites according to similarities in the underlying evolutionary process. Indeed, protein models are substantially improved by employing context-dependent substitution matrices in cases where the protein secondary structure can be treated as a fixed effect (e.g., Koshi and Goldstein 1995Go; Goldman et al. 1998Go; Robinson et al. 2003Go). For both codon and amino acid models, LiBaC could permit greater complexity in the amino acid replacement process between groups of sites by coupling it to models permitting adjustable exchangeabilities between amino acids (e.g., Dimmic et al. 2000Go; Yang 2000Go). Such an approach might improve both parameter estimates (e.g., {omega}) and classification of sites according to similarity of the evolutionary process. Lastly, it should also be possible to develop LiBaC for use on multigene data. Here the problem is that different genes in a genome might be best treated as having evolved under different models, but one does not want to specify an independent model for each gene in a very large data set. LiBaC-based methods could be used to cluster genes (rather than individual sites) into groups according to similarities in the underlying process of evolution.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Theory and Methods
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
 TOP
 Abstract
 Introduction
 Theory and Methods
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
We thank Edward Susko for helpful discussions and S. L. Kosakovsky Pond for comments that substantially improved this manuscript. This research was supported by Discovery Grants from the Natural Sciences and Engineering Research Council of Canada (DG298394 to J.P.B. and DG40156 to H.G.) and a grant from Genome Canada.


    Footnotes
 
Jianzhi Zhang, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Theory and Methods
 Discussion
 Supplementary Material
 Acknowledgements
 References
 

    Anisimova M, Bielawski JP, Yang Z. The accuracy and power of likelihood ratio tests to detect positive selection at amino acid sites. Mol Biol Evol (2001) 18:1585–1592.[Abstract/Free Full Text]

    Anisimova M, Bielawski JP, Yang Z. Accuracy and power of bayes prediction of amino acid sites under positive selection. Mol Biol Evol (2002) 19:950–958.[Abstract/Free Full Text]

    Aris-Brosou S, Bielawski JP. Large-scale analyses of synonymous substitution rates can be sensitive to assumptions about the process of mutation. Gene (2006) 378:58–64.[CrossRef][Web of Science][Medline]

    Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics (1993) 49:803–821.[CrossRef][Web of Science]

    Bao L, Gu H, Dunn KA, Bielawski JP. Methods for selecting fixed-effect models for heterogeneous codon evolution, with comments on their application to gene and genome data. BMC Evol Biol (2007) 7(Suppl 1):S5.

    Bielawski JP, Yang Z. Likelihood methods for detecting adaptive evolution. Statistical methods in molecular evolution (2004) New York: Springer.

    Celeux G, Govaert G. A classification EM algorithm for clustering and two stochastic versions. Comput Stat Data Anal (1992) 14:315–332.[CrossRef]

    Dempster AP, Laird NM, Rubin DB. Maximum likelihood for incomplete data via the EM algorithm (with discussion). J R Stat Soc B (1977) 39:1–38.

    Dimmic MW, Mindell DP, Goldstein RA. Modeling evolution at the protein level using an adjustable amino acid fitness model. Pac Symp Biocomput (2000) 2000:18–29.

    Dunn KA, Bielawski JP, Yang Z. Substitution rates in Drosophila nuclear genes: implications for translational selection. Genetics (2001) 157:295–305.[Abstract/Free Full Text]

    Fraley C, Raftery AE. How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J (1998) 41:578–588.[CrossRef]

    Fukunaga K. The estimation of the Bayes error by the k-nearest neighbor approach. In: Progress in pattern recognition.—Kanal LN, Rosenfeld A, eds. (1985) Vol. 2. Amsterdam (the Netherlands): North-Holland. 169–187.

    Genton GG, Ronchetti E. Robust indirect inference. J Am Stat Assoc (2003) 98:67–76.[CrossRef][Web of Science]

    Gillespie JH. The causes of molecular evolution (1991) Oxford: Oxford Univesity Press.

    Goldman N, Thorne JL, Jones DT. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics (1998) 149:445–458.[Abstract/Free Full Text]

    Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol (1994) 11:725–736.[Abstract]

    Gouriéroux C, Monfort A, Renault AE. Indirect inference. J Appl Econom (1993) 8:S85–S118.[CrossRef]

    Kosakovsky Pond SL, Frost SDW. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol (2005) 22:1208–1222.[Abstract/Free Full Text]

    Kosakovsky Pond SL, Frost SDW, Muse SV. HyPhy: hypothesis testing using phylogenies. Bioinformatics (2005) 21:676–679.[Abstract/Free Full Text]

    Kosakovsky Pond SL, Muse SV. Site-to-site variation in synonymous substitution rates. Mol Biol Evol (2005) 22:2375–2385.[Abstract/Free Full Text]

    Koshi JM, Goldstein RA. Context-dependent optimal substitution matrices derived using Bayesian statistics and phylogenetic trees. Protein Eng (1995) 8:641–645.[Abstract/Free Full Text]

    MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on mathematical statistics and probability—LeCam LM, Neyman J, eds. (1967) Berkeley (CA): University of California Press. 281–297.

    Massingham T, Goldman N. Detecting amino acid sites under positive selection and purifying selection. Genetics (2005) 169:1753–1762.[Abstract/Free Full Text]

    McLachlan GJ, Krishnan T. The EM algorithm and extensions (1997) New York: Wiley.

    Muse SV, Gaut BS. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol (1994) 11:715–724.[Abstract]

    Nielsen R, Yang Z. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics (1998) 148:929–936.[Abstract/Free Full Text]

    Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol (2003) 20:1692–1704.[Abstract/Free Full Text]

    Suzuki Y, Gojobori T. A method for detecting positive selection at single amino acid sites. Mol Biol Evol (1999) 16:1315–1328.[Abstract]

    Tumer K, Ghosh J. Bayes error rate estimation using classifier ensembles. Int J Smart Eng Syst Des (2003) 5:95–110.[CrossRef]

    Wong WS, Yang Z, Goldman N, Nielsen R. Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics (2004) 168:1041–1051.[Abstract/Free Full Text]

    Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci (1997) 13:555–556.[Free Full Text]

    Yang Z. Relating physicochemical properties of amino acids to variable nucleotide substitution patterns among sites. Pac Symp Comput Biol (2000) 2000:81–92.

    Yang Z. The power of phylogenetic comparison in revealing protein function. Proc Natl Acad Sci USA (2005) 102:3179–3180.[Free Full Text]

    Yang Z, Nielsen R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol (2000) 17:32–43.[Abstract/Free Full Text]

    Yang Z, Nielsen R, Goldman N, Pedersen A-MK. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics (2000) 155:431–449.[Abstract/Free Full Text]

    Yang Z, Swanson WJ. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol Biol Evol (2002) 19:49–57.[Abstract/Free Full Text]

    Yang Z, Wong SW, Nielsen R. Bayes empirical bayes inference of amino acid sites under positive selection. Mol Biol Evol (2005) 22:1107–1118.[Abstract/Free Full Text]

Accepted for publication May 30, 2008.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
W. Delport, K. Scheffler, and C. Seoighe
Models of coding sequence evolution
Brief Bioinform, January 1, 2009; 10(1): 97 - 109.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
25/9/1995    most recent
msn145v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Bao, L.
Right arrow Articles by Bielawski, J. P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bao, L.
Right arrow Articles by Bielawski, J. P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?