Skip Navigation


MBE Advance Access originally published online on August 16, 2007
Molecular Biology and Evolution 2007 24(10):2277-2285; doi:10.1093/molbev/msm160
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
24/10/2277    most recent
msm160v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Massingham, T.
Right arrow Articles by Goldman, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Massingham, T.
Right arrow Articles by Goldman, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

Research Articles

Statistics of the Log-Det Estimator

T. Massingham and N. Goldman

European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK

E-mail: tim.massingham{at}ebi.ac.uk.


    Abstract
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion and Conclusions
 Appendix A Influence Function...
 Appendix B Asymptotic Efficiency...
 Acknowledgements
 References
 
The log-det estimator is a measure of divergence (evolutionary distance) between sequences of biological characters, DNA or amino acids, for example, and has been shown to be robust to biases in composition that can cause problems for other estimators. We provide a statistical framework to construct high-accuracy confidence intervals for log-det estimates and compare the efficiency of the estimator to that of maximum likelihood using time-reversible Markov models. The log-det estimator is found to have good statistical properties under such general models.

Key Words: log-det distance • pairwise distances • statistical properties


    Introduction
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion and Conclusions
 Appendix A Influence Function...
 Appendix B Asymptotic Efficiency...
 Acknowledgements
 References
 
Many forms of the "log-det" divergence estimator have been given in the literature, including the log-det distances (Lockhart et al. 1994Go; Steel 1994Go), "paralinear distances" (Lake 1994Go), and related "asynchronous distance" introduced by Barry and Hartigan (1987)Go. The properties of the log-det estimator as a measure of evolutionary distance have been carefully studied (Baake and von Haeseler 1999Go), and the publication of stand-alone software implementing log-det estimation (Thollesson 2004Go) suggests that interest in its use, for both nucleotide and protein sequences, is increasing (Lockhart and Cameron 2001Go; Hoffmeister et al. 2004Go; Goremykin and Hellwig 2005Go; Martin et al. 2005Go; Fitzpatrick et al. 2006Go; Turmel et al. 2006Go). The advantages of the log-det estimator include resilience to both the process of evolution changing within the tree and nonstationarity of base frequencies, in addition to it being easy to calculate (Lockhart et al. 1994Go).

It is important to assess the accuracy of any estimate of divergence, and previous papers have derived approximations of the variance of the log-det estimator, implicitly suggesting that confidence in log-det estimates could be assessed using approximations based on the normal distribution. For example, the log-det–based test for molecular clock under nonstationarity described by Gu and Li (1996Go, p. 1378) is of this form: "When the sequence is long, [the statistic] follows approximately the standard normal distribution". Lockhart et al. (1994)Go derived an approximation to the variance of the log-det estimator using the delta method (see also Barry and Hartigan 1987Go). Gu and Li (1996)Go analyzed the bias and variance of the paralinear distance, proving that log-det estimation has a small-sample bias approximately equal to twice its variance.

This paper applies the empirical likelihood framework (Owen 2001Go) to the analysis of the log-det divergence estimator and shows how to construct more accurate confidence intervals (CIs), even in cases when the sampling distribution is highly nonsymmetric, and so normal-based intervals are poor approximations. The CIs constructed are accurate in the sense that their coverage error, the difference between their nominal and actual size, decreases rapidly as the number of sites increases. (Note that throughout this paper, the size of a CI is taken to mean its statistical size, i.e., the probability that the true value of a parameter lies within it, rather than its width.) Both the small- and large-sample performance of the log-det estimator are assessed, the small-sample performance by simulation and the large-sample performance by comparison of mathematical limits of the variance as the number of sites increases.


    Methods
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion and Conclusions
 Appendix A Influence Function...
 Appendix B Asymptotic Efficiency...
 Acknowledgements
 References
 
For the purposes of this paper, the log-det estimator analyzed is the paralinear distance (Lake 1994Go), with an additional scaling factor for the number of possible characters, due to its consistency under stationary time-reversible Markov models. This form of the log-det estimator is also described by Lockhart et al. (1994)Go. The theory and methods described are applicable to all variants of the log-det divergence estimator, as well as the related asynchronous distance (Barry and Hartigan 1987Go). The log-det divergence between 2 sequences consisting of n aligned sites, each with one of N possible characters observed at each site, is

Formula (1)
where D is the divergence matrix between the 2 sequences, with Dij the number of times that a site in the first sequence with character i is aligned with a site in the second with character j (referred to henceforth as the pair (i, j)), and the Ck(D) and Rk(D) are the sums of the kth column or row of D, respectively.

It is also useful to define the joint probability matrix as that which specifies the probability of pair (i, j) being observed at a given site. If the true joint probability matrix is J0, then the observed divergence matrix is a multinomial sample of size n with probabilities J0. The true underlying divergence, t0, is defined to be LD(J0).

Parametric and Empirical Likelihood
In a parametric likelihood context, the data are assumed to come from a parametric model and the joint probability of observing the pair (i, j) is fij(t, {theta}), where t is the divergence and {theta} represents additional parameters describing the substitution process. The log-likelihood for a given t and {theta} is

Formula (2)
and the values Formula , Formula that maximize ln L are estimates of the true parameter values. There is an extensive literature on the use of parametric likelihood for statistical inference, both in the field of phylogenetics and elsewhere (see, e.g., Pawitan 2001Go; Felsenstein 2004Go).

If estimating {theta} is not of interest (nuisance parameters), then they can be "removed" from the problem by treating its maximum likelihood (ML) estimate as a function of t:

Formula (3)
Removing nuisance parameters in this manner produces a profile likelihood, which can be used for estimation and CI construction in a similar manner to standard likelihood (Pawitan 2001Go). The profiling procedure ensures that the uncertainty in t is adjusted for the additional variance caused by {theta} being estimated from the data rather than fixed at a known value.

Rather than the parametric approach of restricting estimation to a single, albeit potentially general, family of distributions, profiling can be taken to the extreme, and we can treat the entire joint probability matrix as a nuisance parameter. Profiling in this manner is the basis of empirical likelihood (Owen 2001Go). Analogously to parametric likelihood, we can define the empirical log-likelihood of any joint probability matrix J, given the observed data as

Formula

Just like a parametric likelihood, the empirical likelihood measures how probable the data are, given a specific joint probability matrix (specified implicitly by t and {theta} in the case of the parametric likelihood). For each choice of J, there is a corresponding divergence given by LD(J) (eq. 1) and the method of empirical likelihood seeks to find the joint probability distribution for which the data are most probable, given that the underlying divergence is t. For each possible value of t, the value of the empirical likelihood function (ELF; analogous to the parametric likelihood function of eqs. 2 and 3) can be expressed as the following constrained optimization problem:

Formula (4)
where the first and second constraints ensure that the Jij form a valid joint probability matrix and the third ensures that the divergence is what is required.

The solution to equation 4, as a function of t, is the ELF for t. Owen (1988)Go shows that this has many favorable properties similar to those of parametric likelihood. In particular, there is an empirical likelihood analog of Wilk's theorem, that is, twice the difference in empirical log-likelihood between the maximum of the ELF (attained when J equals the empirical joint probability matrix Formula = D/n, and so t = Formula ) and the value when t = t0 is approximately distributed as a {chi}Formula distribution. The coverage error of this approximation is proportional to n–1 (DiCiccio et al. 1991Go), that is, inversely proportional to the number of aligned sites considered. Unlike parametric likelihood, this distribution holds even when the model is incorrect, although in such circumstances the meaning of estimates and CIs is open to interpretation. Other properties of empirical likelihood include good statistical power, equaling that of parametric likelihood for large-sample sizes and sometimes exceeding it for small sample sizes (Lazar and Mykland 1998Go).

A Least Favorable Family Approximation
Directly solving equation 4 to determine the empirical likelihood for a given divergence requires optimization over N2 1 parameters subject to the stated constraints, and the structure of the log-det estimator means that usual techniques for simplifying the optimization of the empirical likelihood (Owen 2001Go) are not applicable to this problem. Exact calculation of the ELF for the log-det estimator is computationally prohibitive, but DiCiccio and Romano (1990)Go considered several approaches to constructing nonparametric CIs and Lee and Young (1999)Go studied the properties of a particularly amenable approximation to empirical likelihood based on "Stein's Least Favorable Family" (LFF), which shares many of empirical likelihood's favorable properties. Rather than considering all possible joint probability matrices, only those that belong to a carefully chosen family are considered—this family is least favorable in the sense that estimation of parameters under the LFF is as hard (same variance) as the original problem. The empirical likelihood curves described by the LFF and by the ELF both have maxima at the empirical joint probability matrix, and their curvatures are equal about this point; hence, the 2 curves closely agree in the region important for assessing confidence in estimates.

The LFF is a family of multinomial distributions described by a single parameter, {phi}. For a given value of {phi}, the joint probability matrix has entries

Formula (5)
and it is easy to see that this agrees with the empirical joint probability matrix when {phi} = 0. The uij are the influence function (Hampel et al. 1986Go) of the estimator, given by

Formula (6)
for the log-det estimator (see Appendix A).

Although finding the empirical likelihood for divergence t (or its LFF approximation) involves solving t = LD(J({phi})) for {phi}, plotting the empirical likelihood curve for the least favorable family, the LFF curve, requires no optimization or root finding as it is expressed as an implicit function of {phi}. The coordinates of the curve are given by

Formula (7)

Large-Sample Variance
In addition to its use in constructing the LFF, the influence function has close connections to other statistical quantities of interest (Hampel et al. 1986Go). In particular, the asymptotic variance of the log-det estimator is

Formula (8)
where J0 is the true underlying joint probability matrix with its (i, j) element written as [J0]ij. This expression for the variance differs from that stated by Gu and Li (1996)Go because the additional variance due to estimating the character frequencies from the observed data are now included. Because the variance of an estimator converges to its asymptotic value as the number of sites in the sequences analyzed increases, the asymptotic variance is a useful quantity to compare performance between different ways of estimating the same quantity. In practice, J0 is unknown but could be approximated by the empirical joint probability matrix.

The Cramér-Rao Lower Bound (CRLB) places a limit on how small the variance of any unbiased estimator can be, relating it to the Fisher information for the parameters of a specified family of distributions (e.g., Pawitan 2001Go). This bound applies to all estimators, irrespective of whether they estimate all parameters or just a single parameter of interest (e.g., the log-det estimator only estimates the divergence). Estimators that achieve this lower bound are said to be efficient. The ratio between the minimum possible variance and that of the estimator is its efficiency, a measure of the how well the estimator uses the observed data: an estimator with an efficiency of x requires sequences that are 1/x times longer to achieve the same variance as an efficient estimator; that is, twice as long when the efficiency is 0.5, 11% longer when the efficiency is 0.9, and so on. The large-sample performance of different estimators can be compared by considering their respective asymptotic efficiencies, which is equivalent to comparing their asymptotic variances to the CRLB.

The obvious comparison for the log-det estimator is to study its performance relative to ML using stationary Markov models of evolution, the most general of which is the "General Time-Reversible" (GTR) model (Tavaré 1986Go), as time-reversible Markov models are often used for pairwise distance estimation for tree building (Gascuel 1997Go; Bruno et al. 2000Go) and large-data sets tend to support more complex models (Negrisolo et al. 2004Go; Goremykin and Hellwig 2005Go; Nikolaev et al. 2007Go). The asymptotic variance of time-reversible Markov models under ML estimation is equal to the inverse Fisher information divided by the number of observations, so they have an asymptotic efficiency of 1 (Pawitan 2001Go). The Fisher information for time-reversible Markov models can be calculated following Goldman (1998)Go, although in that paper the Fisher information for parameters other than divergences was approximated numerically. Here it is calculated analytically using the formulas given by Schadt and Lange (2002)Go.

Markov processes of sequence evolution have an arbitrary scaling factor, reflecting that rate and time are confounded: increasing the rate of substitution by a factor µ and decreasing the divergence by the same factor gives a joint probability distribution identical to that of the original process. For ML estimation under time-reversible models, it is usual to set µ so that one substitution is expected in unit time (when the process is stationary); that is, µ{sum}i{pi}iqi= 1, where {pi}i is the stationary frequency of character i and qi is the rate at which substitutions occur, given the character is currently i (qi = –qii, where qii is the (i, i) entry of the instantaneous rate matrix for the Markov process). The log-det estimator does not assume that the underlying processes is stationary (Steel 1994Go) and uses a different scaling that does not explicitly involve the stationary distribution: µ {sum}iqi/N = 1. For all calculations in this paper, the scaling factor for the rate matrix of GTR was chosen so that divergences estimated by log-det and GTR are directly comparable. Estimation of divergence by (parameteric) ML using the GTR model will be referred to as the ML-GTR method.

Unlike the case of large samples, there are no formulas available that describe the small-sample performance of these estimators and it does not necessarily hold that an estimator with good large-sample properties is also good for small samples. The actual performance for small samples can be investigated by simulation: many sets of data are randomly generated according to a known model of evolution and the divergence calculated for each one. The mean squared error (MSE) of these divergences is an estimate of the expected squared error of each estimator (the variance, if the estimator is unbiased). The small-sample bias of each estimator can also be investigated using this approach.


    Results
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion and Conclusions
 Appendix A Influence Function...
 Appendix B Asymptotic Efficiency...
 Acknowledgements
 References
 
We present the results in 3 subsections. We consider the performance of the LLF approximation to empirical likelihood and show how to construct CIs using this framework. We then analyze the large-sample performance, and finally the small-sample performance, of the log-det estimator.

Performance of the LFF Approximation
The LFF-approximated empirical likelihood curve for the divergence between human and orangutan mitochondrial sequences (Brown et al. 1982Go) is shown in figure 1, the curve being defined by equations 5, 6, and 7. As it must, the maximum empirical log-likelihood occurs when the divergence is equal to the log-det estimate of the divergence between the 2 sequences.


Figure 1
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 1.— Empirical log-likelihood curve for divergence between human and orangutan. The continuous curve is the maximum empirical log-likelihood for the divergence between human and orangutan mitochondrial DNA sequences, as approximated by the LFF. The maximum occurs when the divergence is 0.201 (vertical dashed line), and the intersections between the dotted lines and the curve give 95%, 99%, and 99.9% CIs for the divergence (shown at the foot of the graph). The points represent the empirical log-likelihood of families chosen randomly from around the LFF curve. Inset is the divergence matrix used in the calculations.

 
Curves calculated using equation 7, as illustrated in figure 1, are only an approximation to the curves for the ELF (i.e., those curves defined by eq. 4), necessarily lying below the true curve. There may be joint probability matrices neighboring those in the LFF that have the same log-det divergence but greater empirical likelihood. To assess the accuracy of the LFF approximation, random joint probability matrices were simulated from a Dirichlet distribution with mean equal to a member of the LFF and with various different scaling factors (5,000; 10,000; 20,000; ...; 50,000) to control how near the sampling was to the mean. If the approximation is inaccurate, then many of these points will lie above the LFF curve; the farther above, the worse the approximation. The empirical log-likelihoods and log-det divergences of the data under these random families for the human–orangutan sequence comparison are shown as points in figure 1. Few points lie above the LFF curve, and then only by a very small amount; it is clear that the LFF is an extremely good approximation to the ELF for these data. In addition, a further 1,000,000 points were generated uniformly over the space of all joint probability matrices. The maximum empirical likelihood for these matrices was –1767, far short of the values appearing in figure 1 and so far from the ELF.

CIs for the log-det estimator can be constructed in the usual manner using the LFF curve. Because twice the difference between the maximum of the ELF and its value when it is evaluated at the underlying divergence t0 is approximately {chi}Formula distributed (Owen 1988Go), an appropriate interval can be determined by placing thresholds on the LLF curve. The values of {phi} corresponding to the end points of a CI of size x, ({phi}L, {phi}U), are the solution of Formula where q(x) is quantile x of a {chi}Formula distribution (e.g., x = 0.95, q(x) = 3.84 for a 95% CI; x = 0.99, q(x) = 6.63 for a 99% interval). The corresponding divergences LD(J({phi}L)) and LD(J({phi}U)) provide the lower and upper bounds of the CI. Such points can be found using standard root-finding techniques (e.g., Press et al. 1992Go), the task made slightly easier because the roots must lie on either side of {phi} = 0.

CIs of size 95%, 99%, and 99.9% for the human–orangutan divergence are shown in figure 1. Empirical likelihood-derived 95% and 99% CIs for all pairwise comparisons in the Brown et al. (1982)Go mitochondrial data are reported in table 1; these agree closely with intervals derived analogously using ML-GTR (not shown). The CIs shown for the normal approximation are generally smaller than those derived using empirical likelihood. Because the coverage error for the empirical likelihood intervals, even after taking the LFF approximation into account, is an order smaller than that for normal intervals, this result suggests that the normal approximation may be systematically returning intervals with an actual size smaller than the nominal size. A possible explanation for this would be the failure to take the skew of the distribution into account.


View this table:
[in this window]
[in a new window]

 
Table 1 Log-Det Divergences and CIs for the Mitochondrial Sequence Data of Brown et al. (1982)Go, Derived Using Empirical Likelihood (CI) and Normal Approximation Using Estimated Variance (Normal Confidence Interval [NCI])

 
As an alternative to empirical likelihood, confidence in log-det estimates of divergence can also be assessed using bootstrap techniques. CIs for the human–orangutan divergence produced using Efron's nonparametric and Rubin's Bayesian bootstraps (Efron 1979Go; Rubin 1981Go) are compared with those derived using empirical likelihood in figure 2. Although CIs produced by the Bayesian bootstrap appear to agree more closely with those from empirical likelihood than do those from the nonparametric bootstrap, the 3 methods are broadly comparable until the 2 bootstrap methods become unstable in the extreme tails due to the number of replicates used (here, 100,000) being insufficient. The advantages of the empirical likelihood approach over bootstrap techniques are explored in the Discussion.


Figure 2
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 2.— Comparison of CIs for human–orangutan divergence produced using empirical likelihood (solid line), nonparametric bootstrap (dashed), and Bayesian bootstrap (dotted). Bootstrap results are based on 100,000 replicate data sets. A CI of given statistical size can be read from the graph by finding the divergence of those points where the curves intersect the size wanted. For example, the construction of an empirical likelihood CI of size 0.99 (hence log10 (1 – 0.99) = –2) is shown; the CI is (0.156, 0.256).

 
Large-Sample Performance
Pairwise sequence divergence is often measured by assuming sequence evolution proceeded according to a more restricted model than GTR. For example, the Kimura 2-parameter model (Kimura 1980Go) only allows the relative rate of transitions and transversions to vary, and such divergence estimators should have lower variance than GTR because there are fewer parameters to estimate (although their estimates will be biased if the model of evolution is not good). Taken to the extreme, the entire model could be specified, leaving only the divergence to be estimated. Figure 3 shows how the asymptotic variance of the log-det estimator compares with such ideal estimators of divergence, for the case that the true values are equal to their ML-GTR estimates, for the human–orangutan comparison (the rate matrix and stationary distribution are given inset in fig. 3), and for several other models (JC69: Jukes and Cantor 1969Go; K80: Kimura 1980Go; F81: Felsenstein 1981Go; HKY85: Hasegawa et al. 1985Go; GTR: Tavaré 1986Go). For all models, the log-det estimator does well for low divergences, but there are noticeable differences in its asymptotic performance depending on which model represents the underlying truth. For the equal input models (JC69 and F81), the log-det estimator tends to asymptotic efficiency. For more complex models, the efficiency decreases rapidly as the underlying divergence increases.


Figure 3
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 3.— Asymptotic efficiency of the log-det estimator of divergence compared with idealized estimators of divergence for which all parameters of the substitution process are fixed at their true values and only the divergence is to be estimated. Several models are shown (see text for details); when it is a model parameter, the stationary distribution of nucleotides is set equal to that for the GTR model (the true scaled rate matrix and stationary distribution of which are shown inset). For all models, the efficiencies of the ML-GTR and log-det estimators are indistinguishable.

 
Small-Sample Performance
When an estimator is unbiased, the variance and the MSE derived from simulations agree, so we estimate efficiency as the ratio of the minimum possible variance and the estimated MSE. For the results presented in this paper, all estimated MSE's were based on 100,000 simulated pairs of sequences. Estimating the MSE in this manner has a minor problem: there is some probability that the estimated divergence is not finite. Such unusual results are due to large deviations from the mean, so their probability of occurrence decreases exponentially as the number of sites in the sequences increases. Reflecting likely use in practice, simulated pairs of sequence with infinite divergence were discarded. The model used for simulation was GTR with parameters equal to the ML estimates for the human–orangutan sequences.

Figure 4 shows that there is little difference between the log-det and ML-GTR divergence estimators when the divergence is equal to that of the human–orangutan sequences. Both their efficiencies and biases are almost indistinguishable compared with the stochastic error of the simulation, the log-det estimator marginally outperforming ML-GTR. As the sample size increases, the efficiency tends to 1 and the bias tends to 0, suggesting that both estimators are asymptotically unbiased and efficient. Very few of the simulated sequences were discarded due to having infinite estimated divergence, with 1.5% of generated sequences being discarded when the length is 50 characters and none discarded for sequences longer than 200 characters.


Figure 4
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 4.— Small-sample performance of divergence estimators. The small-sample efficiencies of the log-det (solid line) and ML-GTR (dashed) estimators of sequence divergence are given as the ratio between the variance observed from 100,000 pairs of simulated sequences and the minimum possible variance. The underlying sequence divergence (t = 0.201) and GTR model parameters were derived from ML-GTR analysis of the human and orangutan sequences of Brown et al. (1982)Go. Inset is the bias, the difference between the mean estimated divergence in the simulation and the true divergence; the bias decreases approximately as n–1, as expected (Gu and Li 1996Go).

 
The efficiency of estimators will also vary with the underlying divergence being estimated. Figure 5 illustrates how the efficiency of the log-det estimator varies as both the number of sites in the sequences and the underlying divergence changes. For low divergence and many sites, the efficiency of the log-det estimator rapidly approaches 1, although the number of sites needed for the estimator to become efficient increases quite dramatically as the underlying divergence approaches and exceeds 1. For high divergence and few sites, many generated sequences are discarded due to having infinite estimated divergence, and this results in artifactual high efficiencies that eventually exceed 1. All points where more than 5% of sequences were discarded have been excluded from the surface in figure 5.


Figure 5
View larger version (51K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIG. 5.— Efficiency of the log-det estimator of divergence as sequence length and true divergence vary. Each point on the surface was estimated using 100,000 pairs of simulated sequences. See text for details.

 

    Discussion and Conclusions
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion and Conclusions
 Appendix A Influence Function...
 Appendix B Asymptotic Efficiency...
 Acknowledgements
 References
 
We have placed the log-det estimator of divergence inside a powerful statistical framework. This framework has many benefits, including more accurate assessment of performance and, due the close connection with CIs, a hypothesis testing framework with good asymptotic properties. In order to study the log-det estimator using empirical likelihood, we have made a LFF approximation that is very successful and has good statistical properties in its own right (Lee and Young 1999Go).

Empirical likelihood CIs using the LFF approximation are computationally easier than bootstrap techniques, relying on a root-finding algorithm rather than simulating many replicates. Neither technique is particularly burdensome unless CIs of a particularly large size or approximate P values of extreme points are required; in these cases, empirical likelihood has computational benefits. Another advantage of empirical likelihood over bootstrap methods is that it can readily be extended to the problem of finding simultaneous CIs on the divergences between all pairs in an alignment of sequences, allowing confidence in trees to be assessed and permitting individual topologies to be rejected as incompatible with observed pairwise divergences.

We have shown that the LFF can be a good approximation to empirical likelihood and can be used to construct CIs whose shape is determined by the distribution of the data, rather than an assumption of normality. These CIs agree closely with those produced using bootstrap methods. The small-sample variance and bias of the log-det estimator are comparable with those of ML using the GTR model and, for low to moderate divergences, it is close to efficient. Its performance rapidly declines for large divergences, however, especially when the data could have been explained using a simpler model of evolution.

Instead of using the method outlined here, CIs can also be calculated using a normal distribution approximation about the estimated divergence with variance equal to the asymptotic variance of the estimator (estimated by evaluating at the empirical joint probability matrix). These CIs are symmetric and typically have coverage error (the difference between the actual and nominal probability that the true divergence lies in the interval) that decreases as n–1/2 as the number of observations (aligned sites, n) increases. In contrast, coverage error decreases as n–1 for intervals based on empirical likelihood (Hall and Scala 1990Go). Because the shape of empirical likelihood CIs is driven by the data, and is not restricted to being symmetric, the intervals tend to be narrower and more accurate than those based on the normal approximation for an equivalent statistical size. Unlike those derived from approximation by the normal distribution, empirical likelihood CIs cannot include impossible negative divergences.

For simplicity, CIs of a known size have been derived from the empirical likelihood surface by assuming that the asymptotic {chi}2 calibration is correct. Although the error using this method decreases rapidly (doubling the number of observations halves the expected error), there are other techniques that may prove more effective: a Bartlett correction (DiCiccio et al. 1991Go) of the statistic increases the speed at which the error decreases, so doubling the number of observations quarters the expected error, but Owen (2001)Go recommends calibrating the likelihood surface using the nonparametric bootstrap. Although this procedure may seem redundant—if we have to generate bootstraps, why not use them directly?—it tends to produce useful estimates with few replicates because the contribution of each replicate is not restricted to the left-hand or right-hand tail of the distribution. As noted above, the empirical likelihood approach can also be generalized to several sequences, and bootstrap calibration can also be used in this context even though bootstrap confidence regions cannot be easily defined.

Our results suggest that both the log-det and ML-GTR estimators are asymptotically efficient. This is not surprising for the ML-GTR divergence estimator because ML estimators are known to be asymptotically efficient and unbiased. The apparent asymptotic efficiency of the log-det estimator is surprising as it produces unbiased divergence estimates under more general conditions than ML-GTR, but we provide a proof in Appendix B that this result holds for almost all members of the GTR family of models. As shown in the comparison of asymptotic efficiency for divergence estimation when the model is fixed (fig. 3), being efficient for all time-reversible models does not mean that the estimator is necessarily good when the assumed model is too general and information is wasted fitting unnecessary parameters. This consideration will be especially important for calculation of divergence between amino acid sequences, where it is usual to assume a fixed empirical model like JTT (Jones et al. 1992Go) or WAG (Whelan and Goldman 2001Go) and the appropriate GTR model has 189 additional parameters, resulting in a higher variance for estimates. The use of a compressed alphabet (Wang J and Wang W 1999Go; Coghlan et al. 2001Go; Kosiol et al. 2004Go) might help both by reducing the divergence to be estimated and the number of additional parameters in the corresponding GTR model.

Our results give guidance as to situations in which the log-det divergence estimator will be efficient or inefficient. The methods introduced mean that the efficiency of the estimator can be calculated in particular cases, and the accuracy of individual estimates can be assessed. This second point is extremely important if the estimates are to be compared or used in phylogenetic reconstructions.


    Appendix A Influence Function of the Log-Det Estimator
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion and Conclusions
 Appendix A Influence Function...
 Appendix B Asymptotic Efficiency...
 Acknowledgements
 References
 
The quantity uij defined in equation 6 is the influence function (Hampel et al. 1986Go) of the log-det estimator evaluated at the empirical joint probability matrix. The influence function is a gradient that describes how estimates would change if the distribution was perturbed slightly, by a small amount of contamination, for example, and so reflects the sensitivity of the estimates to additional observations. We may write Aij({epsilon}) for the distribution formed by contaminating a distribution X with a fraction {epsilon} of i -> j substitutions, so Aij({epsilon}) = (1–{epsilon})X + {epsilon}Eij, where Eij is the N x N matrix whose (i, j) element is 1 and all other elements are 0. Formally, the influence function of the log-det estimator is defined as the limit

Formula

By the linearity of the log-det formula (eq. 1), this derivative can be broken into 2 parts: the derivative of the summation of the logarithms of the column and row sums and the derivative of the logarithm of the determinant of Aij({epsilon}). The derivatives of the logarithms of the column and row sums are:

Formula (9)

Formula (10)
where Iij is an indicator function, equaling 1 when i = j and 0 otherwise, and Ck(X) and Rk(X) are as defined following equation 1.

The corresponding derivative for the logarithm of the determinant of Aij({epsilon}) follows from Jacobi's formula for the derivative of a determinant:

Formula
where tr(X) is trace of X and Adj(X) is the adjugate of matrix X. If X is invertible, which is always the case when X is joint probability matrix having finite log-det divergence, the adjugate is related to the inverse matrix by Adj(X) = det(X)X–1 and Jacobi's formula becomes

Formula

The derivative of Aij({epsilon}) with respect to {epsilon} is –X + Eij, giving the following results:

Formula (11)

Formula (12)
Combining equations 9, 10, and 12 and evaluating at {epsilon} = 0 gives the influence function for the log-det estimator:

Formula


    Appendix B Asymptotic Efficiency of Log-Det Estimation
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion and Conclusions
 Appendix A Influence Function...
 Appendix B Asymptotic Efficiency...
 Acknowledgements
 References
 
It has long been appreciated that there is a connection between the log-det estimator and the GTR model of evolution (Rodríguez et al. 1990Go). The 2 give identical estimates whenever there exists a time-reversible (instantaneous) rate matrix Q such that

Formula (13)
where {Pi} is the diagonal matrix whose elements form the stationary distribution of Q and Formula is the empirical joint probability matrix (equal to D/n and, by the strong law of large numbers, an unbiased estimate of the true underlying joint probability matrix J0: Billingsley 1995Go).

For a given set of data, there is an absolute upper bound for the log-likelihood of any distribution Jij, attained when Jij = Formula . If there exist {Pi} and Q that satisfy equation 13, then the upper bound is achieved and {Pi} and Q are ML estimates. The existence of {Pi} and Q can easily be checked using standard properties of time-reversible Markov chains and the matrix logarithm function. This is not sufficient to establish that the 2 estimators have the same asymptotic distribution because there is some probability that a valid solution to equation 13 does not exist and the estimates will differ. We provide a proof that their sampling distributions of the ML-GTR and log-det estimators do indeed converge, establishing that the log-det estimator is asymptotically efficient for the GTR model.

Previously, we stated that the asymptotic variance of the log-det estimator can be found by evaluating its influence function at the true joint probability function. This statement ignores the need to check that several regularity conditions hold (see Hampel et al. 1986Go, and references therein). Asymptotic normality of the log-det estimator is more easily verified by means of the delta method (e.g., Cox 2005Go). Because the central limit theorem guarantees that Formula (Formula J0) converges in distribution to a multivariate normal of known variance, the delta method establishes that

Formula
where t0 is the true underlying divergence (given by LD(J0)), N(0,V) is the normal distribution with zero mean and variance V given by equation 8, and Formula signifies convergence in distribution. This convergence might fail if all the partial derivatives of LD at J0 are zero; this cannot happen when J0 is a joint probability matrix.

Our proof proceeds by considering the symmetric form of the log-det estimator, where Formula is replaced by Formula . This has the same asymptotic distribution as the log-det estimator, again established using the delta method, but it is much more likely to be a solution to equation 13 because both Formula and {Pi}eQ are guaranteed to be symmetric. We show that, for large samples, a solution to equation 13 almost always exists, and so the asymptotic distribution of the symmetric form of the log-det is identical to that of the ML-GTR estimator; hence, so too is the asymptotic distribution for the log-det estimator. Asymptotic efficiency of the log-det estimator then follows from the asymptotic efficiency of ML-GTR.

Consider an appropriately scaled instantaneous rate matrix of a time-reversible continuous-time Markov chain, Q0, such that every off-diagonal element is strictly greater than zero (a generator for a uniform semigroup), and let the true divergence be t0 > 0 (the case for t0 = 0 being trivial). By definition, the matrix Q0 t0 belongs to an open region in the space of generators for time-reversible continuous-time Markov chains that contain a ball of nonzero radius with center Q0 t0. The inverse function theorem (e.g., Rudin 1976Go) guarantees that there is an open region in the space of symmetric joint probability matrices, containing J0 = {Pi} eQ0t0, for which a solution to equation 13 always exists. If the symmetric log-det and ML-GTR estimators disagree, the empirical joint probability matrix must lie outside of this region. The strong law of large numbers (e.g., Billingsley 1995Go) establishes convergence into this region, and hence, convergence of the sampling distribution of the symmetric log-det estimator to that of ML-GTR, with probability 1.

The above proof shows that the log-det and symmetric log-det estimators are asymptotically efficient, providing that the arbitrary scaling factor for the underlying Markov process is chosen to be consistent with the log-det estimator, rather than the more natural scaling normally used (see text for details). We now show that, providing the correct scaling factor is estimated efficiently, the rescaled form of the log-det estimator is still asymptotically efficient.

For 2 functionals, f and g, of random variable X, the asymptotic variance of their product can be obtained by the delta method:

Formula (14)
where a prime is used to denote the first derivative with respect to the appropriate component of X, (fg)'(x) = f'(x)g(x) + f(x)g'(x), and aT denotes the transpose of vector a. We now take f to be a function that estimates the rescaling factor from Formula in a Fisher consistent manner and g to be any of the log-det, symmetric log-det, or ML-GTR estimators. Because these estimators are Fisher consistent and are shown above to have asymptotically equal variance, we see that they must have equal values of g(Formula Formula ) and g'(Formula Formula ). Hence, they also have equal values of (fg)'(Formula Formula ) and, from equation 14, equal values of the rescaled variance Var(f(Formula )g(Formula )). Taking f to be the ML estimate of the rescaling factor under the GTR model, f(Formula )g(Formula ) for the ML-GTR estimator is equal to the estimate of the divergence by ML using the conventionally scaled GTR model, and so must be asymptotically efficient. For any f that estimates the rescaling factor efficiently, the variance of any of the rescaled estimators is equal to that for the ML divergence under the conventional GTR model, and so the rescaled estimators are asymptotically efficient.


    Acknowledgements
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion and Conclusions
 Appendix A Influence Function...
 Appendix B Asymptotic Efficiency...
 Acknowledgements
 References
 
This work was funded by the European Molecular Biology Laboratory.


    Footnotes
 
Peter Lockhart, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Methods
 Results
 Discussion and Conclusions
 Appendix A Influence Function...
 Appendix B Asymptotic Efficiency...
 Acknowledgements
 References
 

    Baake E, von Haeseler A. Distance measures in terms of substitution processes. Theor Popul Biol. (1999) 55:166–175.[CrossRef][Web of Science][Medline]

    Barry D, Hartigan JA. Asynchronous distances between homolgous DNA sequences. Biometrics (1987) 43:261–276.[CrossRef][Web of Science][Medline]

    Billingsley P. Probability and measure (1995) 3rd ed. New York: Wiley.

    Brown WM, Prager EM, Wang A, Wilson AC. Mitochondrial DNA sequences of primates: tempo and mode of evolution. J Mol Evol. (1982) 18:225–239.[CrossRef][Web of Science][Medline]

    Bruno WJ, Socci ND, Halpern AL. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol Biol Evol. (2000) 17:189–197.[Abstract/Free Full Text]

    Coghlan A, Dónaill DA, Buttimore NH. Representation of amino acids as five-bit or three-bit patterns for filtering protein databases. Bioinformatics (2001) 17:676–685.[Abstract/Free Full Text]

    Cox C. Delta method. In: Encyclopedia of Biostatistics—Armitage P, Colton T, eds. (2005) 2nd ed. New York: Wiley. 1409–1411.

    DiCiccio JT, Romano JP. Nonparametric confidence limits by resampling methods and least favourable families. Int Stat Rev. (1990) 58:59–76.

    DiCiccio TJ, Hall P, Romano J. Empirical likelihood is Bartlett-correctable. Ann Stat. (1991) 19:1053–1061.

    Efron B. Bootstrap methods: another look at the jackknife. Ann Stat. (1979) 7:1–26.

    Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. (1981) 17:368–376.[CrossRef][Web of Science][Medline]

    Felsenstein J. Inferring phylogenies (2004) Sunderland (MA): Sinauer associates.

    Fitzpatrick DA, Creevey CJ, McInerney JO. Genome phylogenies indicate a meaningful {alpha}-proteobacterial phylogeny and support a grouping of the mitochondria with the Rickettsiales. Mol Biol Evol. (2006) 23:74–85.[Abstract/Free Full Text]

    Gascuel O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol. (1997) 14:685–695.[Abstract]

    Goldman N. Phylogenetic information and experimental design in molecular systematics. Proc R Soc Lond B Biol Sci. (1998) 265:1779–1786.[Medline]

    Goremykin VV, Hellwig FH. Evidence for the most basal split in land plants dividing bryophyte and tracheophyte lineages. Plant Syst Evol. (2005) 254:93–103.[CrossRef]

    Gu X, Li W-H. Bias-corrected paralinear and LogDet distances and tests of molecular clocks and phylogenies under nonstationary nucleotide frequencies. Mol Biol Evol. (1996) 13:1375–1383.[Abstract]

    Hall P, Scala BL. Methodolgy and algorithms of empirical likelihood. Int Stat Rev. (1990) 58:109–127.

    Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust statistics: the approach based on influence functions (1986) New York: Wiley.

    Hasegawa M, Kishino H, Yano T. Man's place in Hominoidea as inferred from molecular clocks of DNA. J Mol Evol. (1985) 21:160–174.

    Hoffmeister M, van der Klei A, Rotte C, van Grinsven KWA, van Hellemond JJ, Henze K, Tielens AGM, Martin W. Euglena gracilis rhodoquinone:ubiquinone ratio and mitochondrial proteome differ under aerobic and anaerobic conditions. J Biol Chem. (2004) 279:22422–22429.[Abstract/Free Full Text]

    Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. (1992) 8:275–282.[Abstract/Free Full Text]

    Jukes TH, Cantor CR. Evolution of protein molecules. In: Mammalian protein metabolism—Munro HN, ed. (1969) New York: Academic Press. 21–132.

    Kimura M. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J Mol Evol. (1980) 16:111–120.[CrossRef][Web of Science][Medline]

    Kosiol C, Goldman N, Buttimore NH. A new criterion and method for amino acid classification. J Theor Biol. (2004) 228:97–106.[CrossRef][Web of Science][Medline]

    Lake JA. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc Natl Acad Sci USA (1994) 91:1455–1459.[Abstract/Free Full Text]

    Lazar NA, Mykland PA. An evaluation of the power and conditional properties of empirical likelihood. Biometrika (1998) 85:523–534.[Abstract/Free Full Text]

    Lee SMS, Young GA. Nonparametric likelihood ratio confidence intervals. Biometrika (1999) 86:107–118.[Abstract/Free Full Text]

    Lockhart PJ, Cameron SA. Trees for bees. Trends Ecol Evol. (2001) 16:84–88.[CrossRef][Medline]

    Lockhart PJ, Steel MA, Hendy MD, Penny D. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol Biol Evol. (1994) 11:605–612.[Web of Science]

    Martin W, Deusch O, Stawski N, Grünheit N, Goremykin V. Chloroplast genome phylogenetics: why we need independent approaches to plant molecular evolution. Trends Plant Sci. (2005) 10:203–209.[CrossRef][Web of Science][Medline]

    Negrisolo E, Minelli A, Valle G. The mitochondrial genome of the house centipede Scutigera and the monophyly versus paraphyly of myriapods. Mol Biol Evol. (2004) 21:770–780.[Abstract/Free Full Text]

    Nikolaev S, Montoya-Burgos JI, Margulies EH, Program NCS, Rougemont J, Nyffeler B, Antonarakis SE. Early history of mammals is elucidated with the ENCODE multiple species sequencing data. PLoS Genetics (2007) 3:e2.[CrossRef]

    Owen AB. Empirical likelihood ratio confidence intervals for a single functional. Biometrika (1988) 75:237–249.[Abstract/Free Full Text]

    Owen AB. Empirical likelihood (2001) London: Chapman and Hall.

    Pawitan Y. In all likelihood: statistical modelling and inference using likelihood (2001) Oxford: Oxford University Press.

    Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in C (1992) 2nd ed. Cambridge: Cambridge University Press.

    Rodríguez F, Oliver JL, Martin A, Medina JR. The general stochastic model of nucleotide substitution. J Theor Biol. (1990) 142:485–501.[Web of Science][Medline]

    Rubin DB. The Bayesian boostrap. Ann Stat. (1981) 9:130–134.

    Rudin W. Principles of mathematical analysis (1976) 3rd ed. New York: McGraw-Hill.

    Schadt E, Lange K. Codon and rate variation models in molecular phylogeny. Mol Biol Evol. (2002) 19:1534–1549.[Abstract/Free Full Text]

    Steel MA. Recovering a tree from the leaf colourations it generates under a Markov model. Appl Math Lett. (1994) 7:19–23.

    Tavaré S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect Math Life Sci. (1986) 17:57–86.

    Thollesson M. LDDIST: a Perl module for calculating LogDet pair-wise distances for protein and nucleotide sequences. Bioinformatics (2004) 20:416–418.[Abstract/Free Full Text]

    Turmel M, Otis C, Lemieux C. The chloroplast genome sequence of Chara vulgaris sheds new light into the closest green algal relatives of land plants. Mol Biol Evol. (2006) 23:1324–1338.[Abstract/Free Full Text]

    Wang J, Wang W. A computational approach to simplifying the protein folding alphabet. Nat Struct Biol. (1999) 6:1033–1038.[CrossRef][Web of Science][Medline]

    Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. (2001) 18:691–699.[Abstract/Free Full Text]

Accepted for publication July 21, 2007.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
24/10/2277    most recent
msm160v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Massingham, T.
Right arrow Articles by Goldman, N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Massingham, T.
Right arrow Articles by Goldman, N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?