MBE Advance Access originally published online on December 8, 2006
Molecular Biology and Evolution 2007 24(3):650-659; doi:10.1093/molbev/msl193
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Phylogenetic Methodology for Detecting Protein Interactions


* Department of Statistics and Department of Biological Sciences, University of South Carolina, Columbia
Laboratory of Biometry and Bioinformatics, Graduate School of Agriculture and Life Sciences, University of Tokyo, Yayoi Bunkyo-ku, Tokyo, Japan
Allan Wilson Centre, Institute of Molecular BioSciences, Massey University, Palmerston North, New Zealand
E-mail: waddell{at}med.sc.edu.
| Abstract |
|---|
|
|
|---|
Detecting proteinprotein interactions and assigning proteins to functional complexes are key challenges of modern biology. The rise of genomics has lead to evidence that correlated patterns of presence/absence and/or fusing of proteins in any organism suggest these proteins interact. Unfortunately, methods based on such data work best with divergent genomes, whereas major sequencing efforts in vertebrates, for example, are yielding alignments of the same set of proteins sampled from the same set of taxa (species). Using vertebrate mitochondrial genomes to illustrate a novel method, we associate proteins based on vectors of their evolutionary tree edge (branch or internode) lengths. This approach is based on the expectation that molecular coevolution is greatest between proteins that interact in someway. Mitochondrial DNAencoded proteins are associated into groups largely consistent with the complexes they come from. This association is apparently not due to the tree structure or mutation processes, leaving coevolution as the best explanation.
We show that it is important that the tree used to derive the edge-length vector is estimated accurately in terms of both topology and edge lengths. Although more complex substitution models reduce systematic error, they also inflate stochastic error. This makes the use of less complex substitution models preferable in some circumstances. We describe a method to estimate correlations of pairwise evolutionary distances, which adjusts for nonindependent correlations due to shared evolutionary history. Associations of proteins based on their edge-length vectors are visualized and assessed using a variety of hierarchical clustering and multidimensional scaling methods. New formula for estimating the fit of data to model, including the average percent standard deviation of distances on least squares trees, are presented. Use of edge-length vectors is compared and contrasted with correlated distance methods, correlated rates methods, and site-specific evidence of coevolution.
Key Words: detecting correlated evolution mitochondrial genome metabolic complexes molecular coevolution proteinprotein interaction
| Introduction |
|---|
|
|
|---|
A major challenge in bioinformatics is how to use large amounts of comparative genomic data to reveal proteinprotein interactions. This can involve either identifying the specific regions where proteins contact each other or identifying when groups of 2 or more proteins form a complex. To date a number of methods have been suggested. Firstly, pairs of proteins that are fused into a single protein in any known genome are much more likely to interact than random pairs of proteins (Enright et al. 1999
Increasingly genomes that are more closely related are being sequenced. For example, major genomic sequencing efforts are underway in vertebrates and particularly mammals (e.g., Waddell and Mclachlan 2004
). A recent count of mammals planned or underway for 2x or greater genome coverage was >20. Although these are large genomes (
3 x 109 bp), they contain very similar sets of genes, so that there is relatively little information available via joint presence/absence or gene fusion. Rather, what are becoming available in larger amounts are readily aligned coding sequences of all the different vertebrate proteins. Thus, the challenge is how might this sequence information be used to infer possible associations between proteins.
Crucial to defining associations between proteins is the concept of coevolution (Harvey and Pagel 1991
). Seminal work here includes the study of complementary evolution for another macromolecule, rRNA (Gutell et al. 1985
). An expectation for proteins that interact is that these coevolve to adjust to each other's structure. This has been seen in specific residues of proteinproteinbinding regions (Moyle et al. 1994
; Pazos et al. 1997
). More generally, substitutions that subtly alter the shape of one of a pair of proteins may require compensatory substitutions in the other protein, not just at the points they touch (Robinson et al. 2003
). Further, if molecular evolution is episodic, then such coevolving proteins should show evidence of change at the same time in the same lineage.
The episodic nature of protein evolution is confirmed in a quantitative way using model selection statistics such as Akaike Information Criterion (AIC) (Adachi et al. 2000
). It is expected that for selective reasons, there are periods of stasis followed by periods of sequence adjustment, followed by periods of stasis (Gillespie 1991
). Thus, periods of higher than average substitution rates should be most correlated in the most closely interacting proteins. This will be a combination of both contact-site interactions and coordinated adjustment of overall shape. Methods to detect correlated evolution based on distance matrices include those described in Waddell (1990)
, Goh et al. (2000)
, and Pazos et al. (2000)
. These approaches may suffer from individual distance estimates being highly correlated due to the underlying evolutionary tree. Note, also the terminology used: correlated evolution is what is detected, whereas coevolution is the hypothesized cause when others (such as correlation due to the phylogeny) are discounted.
Proteins encoded by the mitochondrial genome offer a useful and interesting context in which to study coevolution. A feature of vertebrate mitochondrial DNA is that recombination is unexpected (e.g., Alberts et al. 1994
; Lewin 1997
), and therefore, all genes should share exactly the same chronological tree (or very similar even if there is occasional parental leakage and recombination, due to the reduced coalescence time). There are 12 proteins involved in the generation of cellular energy encoded on the mitochondrial heavy strand DNA. In genomic order they are nd1 (length 323 residues), nd2 (308), co1 (512), co2 (226), at8 (32), at6 (201), co3 (260), nd3 (106), nd4l (95), nd4 (440), nd5 (538), and cytb (337). They belong to 4 complexes involved in the transfer of electrons in oxidative phosphorylation. There are 6 reduced form of nicotinamide adenine dinucleotide (NADH) dehydrogenase (nd) subunits, which are key parts of respiratory Complex I, transferring electrons from reduced nicotinamide adenine dinucleotide phosphate to ubiquinone (Coenzyme Q10). Cytochome b (cytb) is part of Complex III, which is the next step and passes electrons from ubiquinone to cytochrome c using the energy of this step to "pump" protons out of the inner mitochondrial compartment. The cytochrome oxidase proteins (co Complex IV) pick up the electrons from cytochrome c and pass them to oxygen in the formation of water. The energy released here is again used to pump protons out of the inner compartment. The 2 ATPase proteins (at) are part of Complex V, which use the proton motive force to synthesize ATP.
These extremely important protein complexes underpin the energy metabolism of most living organisms. Unfortunately, they are also membrane bound so that although the proteins may be placed in complexes, their 3D structures remain incompletely known despite major efforts (Lodish et al. 2000
; Mather et al. 2004
). It is unclear, for example, if parts of Complex III may interact with parts of Complex IV. It is also apparent that the interactions between domains of these complexes undergo large changes associated with the movement of electrons and protons (Mather et al. 2004
).
Below, we describe a method for detecting correlated evolution based on vectors of evolutionary tree edge lengths, which circumvents potential problems in using distances. We also derive a distance method that takes into account the correlations of pairs of distances. Association with prior work in phylogenetics helps clarify methodology in this area of bioinformatics.
This type of molecular evolution is heavily dependent upon visual multivariate statistics. Two alternatives are clustering (e.g., tree building) and multidimensional scaling (MDS). We consider issues of fit as they relate to each and show links between Sammon's (1969)
criterion and least squares on trees. We also derive useable average percent standard deviation (SD) measures that may be used to compare least squares trees and least squares weights.
| Materials and Methods |
|---|
|
|
|---|
Our method uses matched ordered vectors of 2s 3 edge (internode or branch) lengths from a predefined binary tree as the basic data upon which to infer proteinprotein interaction (here s is the number of sequences in each alignment). Edge lengths are defined as the expected number of substitutions, including multiple hits, per site. The absolute ordering of edge lengths in the vectors is not important, but they must be in the same relative order for all genes. The information in these vectors may be visualized using multivariate techniques. One approach is hierarchical clustering or tree inference (e.g., Swofford et al. 1996
Both hierarchical cluster analyses and MDS begin with a measure of association (distance) between all pairs of edge-length vectors. As with microarray analysis (e.g., Waddell and Kishino 2000
), the distance
ab = 1
ab, where
ab is the Pearson correlation (Agresti 1990
), between edge-length vectors was found useful. We also considered Euclidean and Manhattan distances between edge-length vectors (Agresti 1990
). These distances have the possible disadvantage that they are more sensitive to scale effects, although edge lengths are reported on a per-site basis, which goes someway to equilibrating them. Further, transforming the edge lengths of each gene so that they sum to 1 may also help. Hierarchical clustering used the unweighted pair group method with arithmetic mean (UPGMA), Neighbor-Joining (NJ), and least squares options of PAUP* (Swofford 2002
). MDS of distance matrices used the programs SPSS (1999)
and R (Ihaka and Gentleman 1996
; see also, R Development Core Team 2005
).
Amino acid sequences used to reconstruct phylogenetic trees and derive edge-length vectors were retrieved from GenBank (see Supplementary Material online). They cover the major craniate groups. These were aligned using Se-AL (Rambaut and Grassly 1997
) and then edited to remove regions of alignment ambiguity. Care was taken to include only well-aligned regions and to exclude any portions of overlapping genes (e.g., Waddell et al. 1999
). For searching the maximum likelihood (ML) tree, we used ProtML, part of MOLPHY 2.3 (Adachi and Hasegawa 1996
). The options f plus m were used, and to avoid being trapped in local optima when doing tree searches, the searches were seeded with the best trees found with more efficient (faster) tree-building methods. Edge lengths and amino acid distance matrices were estimated using PAML (Yang 1999
). Either a Poisson model and identical site rates or the empirical mtmam matrix and a discretized gamma distribution with 8 rate classes were used.
| Results |
|---|
|
|
|---|
Importance of Getting the Tree Right
In exploring how correlations of edge lengths may associate proteins, it is important to consider how the tree topology or branching order might affect the result. We illustrate this point with analyses based on 2 different trees. The first tree (T1) is a ML tree inferred from the mitochondrial protein sequences. It is well recognized, due particularly to a very few highly aberrant sequences, including those of murid rodents, that there is a misplacement of the root of the placental subtree in most analyses of mitochondrial proteins, including using the ML method (Waddell et al. 1999
Heuristic Hierarchical Clustering
Figure 1 shows heuristic hierarchical clustering of the edge-length correlation distance (1
) between the 12 genes using a Poisson model and identical site rates. The correlations themselves ranged from
0.9 to
0.4. The absolute size of the correlations is not the essential information, for these correlations tend to be largely a consequence of the phylogenetic tree structure and the highly unequal times separating nodes. Rather, what is important is visualizing and assessing the clustering suggested by the correlations. First, comparing just the UPGMA trees with the NJ trees, it is apparent that the NJ trees show better grouping of genes involved in the same complexes. This can be attributed to NJ relaxing the restrictive assumption that distances are ultrametric, violation of which may mislead UPGMA (e.g., Swofford et al. 1996
). For example, even though at6 and at8 are in the same complex and show correlated rates of evolution, at8 is much less correlated in its rates with other genes. This in turn places at8 a long way from all other genes, and in the NJ or ordinary or unweighted least squares (OLS) tree, on a long external edge. A possible reason for at8 showing such rate variation is that it is a very short gene and therefore stochastic forces play a significant role in its estimated edge lengths. Indeed, for the NJ clustering, the genes showing the longest terminal edge lengths are also the shortest 4 genes (at8, co2, nd3, and nd4l).
|
There are also differences arising from using the edge lengths on T1 versus those on T2, and this is seen amongst the NJ trees. The clustering of genes based on T2 appears more coherent than those based on T1. For example, the cytochrome genes (co1, co2, co3, and cytb) are scattered in the NJ tree based on T1, but form a coherent group when based on the more realistic T2 (albeit with nd5 appearing in their midst). Using T2 and the more complex
model, the groupings become less clear.
Criterion-Based Hierarchical Clustering
The use of criterion-based hierarchical clustering or tree-building methods are important as they allow fit of data to model to be gauged (Swofford et al. 1996
; Waddell and Kishino 2000
). They may lead to different and valuable insights into the data that are not accessible with heuristic methods, such as UPGMA or NJ. Least squares fitting of distances to trees has been well studied and is available in PAUP* (Swofford 2002
) using fast exact algorithms to evaluate trees (Bryant and Waddell 1998
). We consider three least squares criteria; OLS plus two forms of weighted least squares fitting. The first weighted method assumes the variance increases linearly with the distance. This is the power equals to 1 or P = 1 criterion (Swofford et al. 1996
), where the least squares weights are equal to 1/d
. The second assumes that the SD of a distance increases linearly with the magnitude of a distance, the FitchMargoliash or FM criterion, with weights equal to 1/d
. It is also important to consider whether edges in the tree or clustering diagram can be negative or not. Constraining edge lengths to be nonnegative results in trees that tend to better group like objects together (Swofford et al. 1996
; Waddell and Kishino 2000
). Herein, "+" indicates the nonnegativity constraint.
An important aspect of using criterion-based clustering methods is to have a criterion that measures the quality with which the data fit the model and is hopefully: 1) invariant to the scale of the distance being used and 2) allows comparison between criteria (e.g., P = 0, P = 1, and P = 2). The FM formula for average percent SDs as used in PAUP*, for example, should not be used for comparison between different powers of P. Appropriate formula are presented in Appendix 1. Table 1 shows the results of using these formulas in comparison to currently reported numbers.
|
With the correlation distances based on T1, OLS+ clustering (not shown) gives a result close to that of figure 1a. However, there are no strict clusters of proteins in the same complex, except the at6/at8 pair, which appears to be the most robust feature of all these diagrams. The average percent SD of distances is close to 11%. Allowing negative edge lengths improves the raw fit of data to model (table 1), but does not result in improved clustering of proteins in the same complex. With the "+" methods and higher powers the fit improves slightly. With P = 1 the cluster of co2&co3 appears and with P = 2 so does the cluster of nd1&nd4. Switching back to allowing negative edge lengths sees a slight improvement in fit, but these clusters are lost. The shortcomings of allowing negative edge lengths for evolutionary trees are reported (e.g., Kuhner and Felsenstein 1994
distribution of site rate variability (Swofford et al. 1996
The results in table 1 suggest that it is important to get the phylogenetic tree upon which edge lengths are estimated correct. However, there is also evidence of a trade-off in estimating edge lengths between systematic and stochastic error. Increased stochastic error for edge -length estimates under the
model is typically observed, as the model extrapolates further from the observed data than the standard identical rates model (Waddell et al. 1997
). Consistent with this are the increased edge lengths in figure 1f compared with figure 1d. Using 1
distances derived from T2 edge lengths (estimated assuming the simpler model) results in the best clustering and also the best fit. With an incorrect tree, components of the edge-length vector that should be far from zero may be pushed close to zero and occasionally the opposite may also occur.
Visualization with MDS
There are alternatives to hierarchical clustering to visualize the relationships suggested by correlated evolutionary rates. One of the most common is MDS, of which PCA is a well-known example (Edwards and Oman 2003
). PCA is, however, not criterion based, so we also used the least squares MDS methods available in ALSCAL, part of SPSS, and those available in R. The SPSS criterion-based MDS in SPSS aims to minimize Young's S-stress 1 value, which is itself similar to the Kruskal's stress value 1 found in R (Edwards and Oman 2003
). R also offers a routine to minimize the Sammon's (1969)
stress value, which as explained later, is coincident with the P = 1 least squares criterion for evaluating trees. (A typo in Edwards and Oman [2003]
misreports the normalization term of Sammon's stress function as involving a sum of squared distances, when it is a sum of distances, yet the R function "Sammon" reports the correct value).
The result of applying criterion-based MDS to the 1
distances used in figure 1c and d is shown in figure 2. In figure 2a, using the Sammon stress, the grouping of proteins by complex is apparent. The resultant stress was 0.03736. Sammon (1969)
indicates that a stress of less than 0.1 tends to be reasonable for visualization purposes. As with hierarchical clustering, the gene nd5 appears close to some of the co genes. It should be noted that there is no closed form solution for minimizing Sammon stress, and in this case, there were local optima. One of these, which was not globally optimal was near the starting coordinates obtained using a standard principal coordinates analysis (PCoA in R). To overcome this, we repeated the analysis 1,000 times using random starting points (script used is in Supplementary Material online) and hit what seemed to be the global optimum in less than 20% of cases, suggesting a significant local optima problem.
|
Hierarchical clustering using least squares criteria faces an non deterministic polynomial time complete problem to check all trees (Swofford et al. 1996
Using Young's stress value and a Manhattan or absolute distance between the vectors of tree edge lengths, transformed so they sum to 1, results in figure 2b. As with Sammon MDS, the complex groups are also apparent. The NADH genes form a particularly tight cluster with the exception of nd4l. This again may be due to nd4l being a short gene that allows for a lot of stochastic error in edge-length estimation and that this, in turn, distances it from all other genes. The ATPase genes at6 and at8 occupy their own distinct part of the data space, and cytb (Complex III) appears in figure 2b intermediate between proteins of Complexes I and IV, as the biology may suggest. Again, the 3 most outlying genes in this reduced dimension representation of the data are the 3 shortest genes, namely at8, co1, and nd4l. The gene nd3 is not substantially longer than these three genes, but in this case it remains tightly within the nd cluster.
Other MDS/distance combinations were tried. Using the T2, 1
distance combination tended to place nd5 close to the cogenes (co1, in particular) and apart from the other nd genes. Using the vectors of edge lengths from T1, the MDS diagrams, like the hierarchical clustering, did not do quite as well in placing genes from the same complex together (results not shown). However, the general feature of the shortest genes being outliers on the graph remained.
A Refined Distance Method
In the above analyses, we have used evolutionary tree information (edge-length vectors) directly. However, it is also possible to consider correlated evolution using only the matrix of pairwise evolutionary distances between aligned sequences of matched sets of species (taxa). This is the basis of some earlier methods. The very high correlations observed between such matrices are expected due to the tree structure. We show how to evaluate correlated evolution without shared evolutionary history distorting the estimated distance or inflating the estimate of precision. Let d
= (dx1,...,dxK)' be a vector of the K = s(s 1)/2 unique evolutionary distances for gene x, where s are the number of different species and prime (') indicates transpose. As with vectors of edge lengths, the next step is to measure a distance between the genes, except this time the characteristics are the evolutionary distances not the edge lengths. Options here include Euclidean, Manhattan, and correlation-based distances (the method of Goh et al. 2000
is similar to the last of these alternatives). However, this type of approach immediately runs into the problem that the distances, unlike edge lengths, tend to be highly positively correlated with each other (e.g., Swofford et al. 1996
). Therefore, let the vectors representing the distance matrices of genes x and y be dx = (dx1,...,dxK)' and dy = (dy1,...,dyK)', respectively. We assume that (dx, dy) follows the multivariate normal distribution with mean
and variance
, where 1 is a K dimensional unit vector and I is the K dimensional identity matrix.
is the Kronecker product. Further, µ = (µx, µy)' is the mean vector and
is the variance matrix of (dxi, dyi), where i = 1,...,K. The estimated pairwise distances,
and
are conditionally independent given (dx, dy), and their sampling variance matrices Vx and Vy are obtained by repeated resampling of aligned sites with replacement, from which we obtain the conditional variance matrix,
, of (
,
) by reordering the elements. The unconditional likelihood of (
,
) is a multivariate normal distribution with mean
and variance
The revised correlation of evolutionary distances between genes x and y, allowing for phylogenetic structure and sampling variance, is obtained as
. If evolutionary distances involve very few substitutions, the lognormal distribution may fit the data better than the normal distribution used above.
Note that when working with observed estimated evolutionary distances, even though there is no explicit tree, the implied tree may be close to a distance-based tree derived from that data. When this occurs, it is possible that the results will not be as robust as when working with edge-length vectors produced for a biologically well-defined tree. However, in this case, even using pairwise distance matrices as the starting point, we obtained a reasonable hierarchical clustering and MDS representation. The Sammon stress value was 0.0297, but the visible clustering did not seem as good as figure 2a. This observation suggests the need for detailed exploration of the robustness and sensitivity of these different approaches using realistic simulations. Unfortunately, this is beyond the scope of the present paper.
Comparative Fit of Trees and MDS
It would be very desirable to put MDS and hierarchical clustering into a common framework. They are 2 of the most commonly used visualization techniques in the whole of molecular biology, with extensive use with microarray data, for example. To enable/facilitate comparison, we explore a common fit criterion for each. As mentioned in Appendix 1, the Sammon stress value is basically the square of the SD for the P = 1 tree criterion. Therefore, the square root of the Sammon stress multiplied by 100% is a measure of the average percent SD between the observed distances and those on the n-dimensional plot. For figure 2a, the average percent SD is 0.037360.5 x 100% = 19.48% (using N 1). This value is considerably larger than it was for trees, yet the visualization of the groups intuitively seems at least as good as the best trees (fig. 1). This contradiction may be due to the quite different restrictions that trees and MDS place upon fitting distances and the fact that humans are good at recognizing clusters visually, which may favor MDS.
To consider this issue further, note that the fit of a tree to the data can be decomposed as follows:
![]() | (1) |
These terms are analogous to those of a standard regression, the first being the sum of squares (SS) total, the second sum that of the model, and the third the residual or error, and u is a weighted mean (for OLS, wi = 1 and for P = 1, wi = 1/dobsi), which is represented as a star tree with all external edges equal in length to µ/2 and all internal edges of length zero (we call this a "point" tree to distinguish it from a more usual star tree with external edges unequal). The proportion of the variance explained by a tree can therefore be written as SSmodel/SStotal = (SStotal SSerror)/SStotal. For all the trees in table 1, this fit was between 91.2% and 96.7%. Ironically, the worst fit was on the T2 P = 1+ tree(which shows the best clustering) and best for a tree with non-negative edges on T1 OLS+ (which showed poor clustering). On the tree T1, P = 1+, SSerror = 0.17740, SStotal = 2.0079, and SSerror(star tree) = 0.52059. Thus, in this example, even the point tree explains about 75% of the total variance (for an average SD of 16.70%), which may be tree like but is not clustering! Similarly, a star tree with zero length internal edges and unequal length external edges may fit one data set better than a highly resolved tree may fit another data set. Ad hoc measures, such as the ratio of the sum of squares of the internal edges versus the residual, may be a better gauge of "good clustering." Unfortunately, MDS stress values do not fit into such a framework easily because the predicted distances on the 2-dimensional (2D) plot can show a larger range than the original distances (and therefore a larger sum of squares about the mean).
Thus, at present, it seems that the best measure of common fit for trees and MDS are the percent SDs shown in Appendix 1. These establish a direct link between the Sammon (1969)
stress value used in MDS and the percent SD introduced by Fitch and Margoliash (1967)
. However, their interpretation in comparing trees with MDS diagrams needs to be made with caution as they are presenting the information in quite different and often complimentary ways.
| Discussion |
|---|
|
|
|---|
The results show that based on just the property of amounts of evolution along edges in the tree, it is possible to detect correlated evolution between genes that form functional complexes. This is an encouraging finding, not least because the signal grouping genes of the same complex is more dominant than the signal arising from compositional heterogeneity around the mt-genome. It should be noted that the mutation bias seen around the mitochondrial genome and between mammalian species, despite its small size, is of the same general size as that seen across and between nuclear genomes (Swofford et al. 1996
It is worth emphasizing that genome heterogeneities (such as position effects, mutation biases, etc) do not explain the clustering of proteins from the same complex observed in our studies. Although some of the closest groups involve genes that are adjacent in the genome (e.g., n1 and n2 or nd4 and nd4l), genes from the major groups are widely scattered amongst each other (the order is nd1, nd2, co1, co2, at8, at6, co3, nd3, nd4l, nd4, nd5, and cytb). Thus, there are nd genes, then co genes, then at genes, then co genes, then nd genes, and finally cytb. Further, the protein-coding genes are spaced apart by the 22 tRNA and 2 large rRNA genes. Indeed, associations of adjacent genes only crop up occasionally in the hierarchical trees and in the MDS graphs. If there is a signal due to mutation biases moving around the genome, it is neither powerful enough to swamp the grouping of proteins of the same complex nor strong enough to be detectable in the majority of these analyses.
Varying methodologies for comparing weighted trees, that is, trees with edge lengths, are developed in the phylogenetics literature. Comparing 2 trees based on vectors of their edge lengths is explored in a mathematical sense by Robinson and Folds (1979)
using a Euclidean distance. However, as described above, this distance is not necessarily ideal for detecting correlated evolution. One approach to detecting correlated evolution involves using a Mantel test to compare matrices of distances derived from weighted phylogenetic trees (Waddell 1990
). Though not for use in detecting correlated evolution, unweighted trees have been compared using a distance (path) metric (Steel and Penny 1993
), whereas unweighted trees of different proteins have been associated using a partition metric (Penny et al. 1987
; Waddell and Kishino 2000
).
The tree itself may exert a considerable effect on the detection of correlated evolution using edge lengths, especially when using clustering to visualize associations. To take account of uncertainty in tree estimation, Bayesian methods (Huelsenbeck et al. 2001
), direct distance matrix methods (Goh et al. 2000
; Pazos and Valencia 2000
; Valencia and Pazos 2002
), and Hadamard conjugations (Hendy and Penny 1993
; Swofford et al. 1996
) may be used. Bayesian and Hadamard conjugation methods can both be used in a similar way. With a Hadamard conjugation, a vector of edge lengths in all possible trees is built without selecting any tree. An extension of this idea is to look at the mean length of an edge on all trees, with edge lengths estimated using ML (Waddell 1995
). A Bayesian method would look at the marginal length of an edge over all trees having it. However, such an approach is typically impractical with an Markov Chain Monte Carlo program and large data sets as a particular edge may be sampled very rarely or never. An alternative is to use the length of all edges on all trees approach, but give each ML tree a weight based on its Bayesian Information Criterion value (Waddell et al. 2001
). This weight often falls away rapidly for different trees, so effectively only the tree of optimal likelihood containing that edge need be considered.
Using matrices of pairwise distances on the same sets of species to infer correlated evolution (Goh et al. 2000
; Pazos and Valencia 2000
) is useful. However, there are some issues. Distances are usually highly correlated due to the structure of the tree, in contrast to edge lengths that tend to be more independent of each other. Further, estimation of pairwise distances tends to be less statistically efficient and more prone to systematic error than estimation of edge lengths on a tree using ML (Swofford et al. 1996
; Waddell and Steel 1997
). Above, we present a method correcting for correlated distances, when the correlations are due to shared ancestry. This approach may be desirable when calculating an ML tree is difficult, including models that take into account the 3D structure of proteins (Robinson et al. 2003
).
The methods described in this paper can be extended and refined in a number of different ways. We see our method as being complementary to site-specific methods of detecting coevolution (e.g., Pazos et al. 1997
; Pollock and Taylor 1997
; Pollock et al. 1999
; Gallet et al. 2000
). Our method works on a gross level by using ML to estimate the total number of substitutions in a series of up to 2s 3 time intervals defined by the tree (where s is the number of sequences). Site-specific methods work by identifying patterns of substitution at single sites and finding those that are highly correlated. One way of integrating these approaches is by building a matrix of 2n 3 rows (edges in a trees) and c columns (sites in the proteins) and place a 1 at each entry ij, where parsimony (for example) determines there was a change of that jth site on the ith edge and 0 everywhere else. The raw concordance of changes for a pair of sites may be measured using mutual information (e.g., Atchley et al. 2000
) or a kappa statistic for the frequencies of 00, 01, 10, and 11 matches. Given the many comparisons being made ((c2 c)/2 in total) significance may be estimated using a permutation test. Each permutation should keep the sums of the rows (the total amount of change for the jth edge) and of the columns (the amount of change for the ith site) constant. The row and column sums of the original data are calculated stored in vectors R and C. For the first permutation, a row is randomly selected with probability proportional to its row sum in R. Similarly, randomly select a column based on column sums in C and place a 1 into this ijth cell of a new matrix of otherwise all zeros. Reduce the chosen entry in R by 1 and do the same for the chosen column. This procedure continues until all entries in R and C are zero, obeying the rule that you cannot put a 1 into an already occupied cell. Repeat n times, and from the order statistics of these replicates, calculate the P values for the original measures on pairs of sites. This approach seems a preferable way to identifying coevolving sites to that based on aligned columns (e.g., Atchley et al. 2000
). It offers flexibility because the correlation of change is in time only, not necessarily in respect of the type of amino acid used. By summing mutual information for sites in pairs of genes or by looking for statistically unlikely associations of coevolving sites in different proteins, their coevolution may be inferred.
Weighting by chemical class can be important in this context. Detection of coevolving sites may then be used in multiple ways to cluster regions of proteins together. If sites in coevolving genes only vary in pairs, then site-specific methods should have an advantage. If there are more randomly distributed compensatory substitutions occurring over the structure, then our method will use this information. Note that because the role of these substitutions is preservation of overall structure (Robinson et al. 2003
), they can be far more random in terms of location in the sequence and nature of the substitution than those at directly interacting sites. However, both categories are expected to be coincident in time.
Finally, it is also possible to use estimates of the rate of evolution along edges to associate proteins. Understanding how to best estimate rates is still developing, but the methods of Sanderson (1997)
, Thorne et al. (1998)
, and Thorne and Kishino (2002)
are popular. Such intrinsic rates on particular edges, if known exactly, would remove a component of correlation not due to proteinprotein interaction but rather due to edges of trees measuring essentially the same time interval for different genes. On the other hand, the further analysis and assumptions required to estimate rates of evolution on edges may considerably increase the errors in rate estimates over edge-length estimates. Our preliminary calculations of the relative size of these two competing problems indicate that using edge lengths should work better when the coefficient of variation of edge lengths measured in real time is less than that of the estimated rates of evolution. Further investigation of this issue will be important, as will be the development and evaluation of methods that seek to improve edge-length estimation (e.g., Kitazoe et al. 2001
).
| Supplementary Material |
|---|
|
|
|---|
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Appendix 1 |
|---|
|
|
|---|
Formulas for the Average Percent Standard Deviation on Trees
At present PAUP* uses the formula:
|
| (A1) |
|
| (A2) |
2.
Consider first P = 0 or OLS. This states that the SD should be independent of the scale of the distance, so we may define an average SD as follows:
![]() | (A3) |
![]() | (A4) |
The term N 1 is from Fitch and Margoliash and it is the standard way to obtain an unbiased estimate of the SD when the mean is estimated by averaging. Interestingly, for both trees and 2D MDS, distances between 3 objects can generally be fitted perfectly, so one can argue for using N 3. Alternatively, using N alone gives the ML estimate. If N is used in place of N 1, then removing the times 100% and squaring equation (A4) gives a form identical to the Sammon (1969) stress value. This gives one way of comparing the fit of a tree and MDS to the same data set. It is a reasonable way to compare criteria (P = 0, 1, and 2) because the percent SD will reduce when the weights most appropriately match the squared residuals.
The above formula belong to the family:
![]() | (A5) |
Alternatively, a weighted root meansquared coefficient of variation (CV) of the distances dexpimay be written as follows:
![]() | (A6) |
For wi
1 (i.e., wi = 1/N or P = 2) and
the formula (eq. A6) returns the average percent SD of FM equation (A1) and (A4), respectively. However, for
we do not return to equation (A3) but the slightly different
![]() | (A7) |
We may also use a likelihood framework of N independent normal variables with heteroscedasticity of variance proportional to dPobsi. The p should be over obsi. Using the approximation of the population variance by the residuals at the ML point, we obtain the maximum log likelihood normalized by the number of distances as a criterion of fit. That is,
![]() | (A8) |
Use of this criterion gives results very similar to that of the mean percent SD. It too indicates that the highest per distance likelihood is for T2 (OLS), whereas the best with nonnegative edge lengths is T2 (FM+). This is no coincidence, equation (A8) is like equation (A5). To see this, exponentiate equation (A8),
![]() | (A9) |
| Acknowledgements |
|---|
|
|
|---|
This work was begun at Chugai, Japan. P.J.W. gratefully acknowledges the Japanese Society for the Promotion of Science senior fellowship program and National Institutes of Health grant R01 LM008626-01A1.
| Footnotes |
|---|
1 Present address: Laboratory of Biometry and Bioinformatics, Graduate School of Agriculture and Life Sciences, University of Tokyo, 1-1-1 Yayoi Bunkyo-ku, Tokyo 113-8657, Japan
2 South Carolina Cancer Center, University of South Carolina, Columbia, SC 29203 ![]()
Peter Lockhart, Associate Editor
| References |
|---|
|
|
|---|
Adachi J and Hasegawa M. (1996) MOLPHY version 2.3: Programs for molecular phylogenetics based on maximum likelihood. Computer Science. Monograph 28(Institute of Statistical Mathematics, Minato-Ku, Tokyo, Japan) pp. 1150.
Adachi J, Waddell PJ, Martin W, Hasegawa M. (2000) Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol 50:348358.[Web of Science][Medline]
Agresti A. (1990) Categorical data analysis. (John Wiley and sons, New York).
Alberts B, Bray D, Lewis J, Raff M, Roberts K, Watson JD. (1994) Molecular biology of the cell. 3rd ed (Taylor & Francis, New York).
Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW. (2000) Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 17:164178.
Bryant D and Waddell PJ. (1998) Rapid evaluation of least squares and minimum evolution criteria on phylogenetic trees. Mol Biol Evol 15:13461359.[Abstract]
Edwards J, Oman P. 2003. Dimensional reduction for data mapping [Internet]. R News 3(3):27. [cited 2006 Dec 31]. Available from: http://www.r-project.org/.
Efron B. (1982) The jackknife, the bootstrap, and other resampling plans. (Society for Industrial and Applied Mathematics, Philadelphia (NY)) (CBMS-NSF regional conference series in applied mathematics; Monograph 38).
Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature 402:8690.[CrossRef][Medline]
Fitch WM and Margoliash E. (1967) Construction of phylogenetic trees. Science 155:279284.
Gallet X, Charloteaux B, Thomas A, Brasseur R. (2000) A fast method to predict protein interaction sites from sequences. J Mol Biol 302:917926.[CrossRef][Web of Science][Medline]
Gillespie JH. (1991) The causes of molecular evolution. (Oxford University Press, New York).
Goh CS, Bogan AA, Joachimiak M, Walther D, Cohen FE. (2000) Co-evolution of proteins with their interaction partners. J Mol Biol 299:283293.[CrossRef][Web of Science][Medline]
Gutell RR, Weiser B, Woese CR, Noller HF. (1985) Comparative anatomy of 16S-like ribosomal RNA. Prog Nucleic Acid Res Mol Biol 32:155216.[Web of Science][Medline]
Harvey PH and Pagel MD. (1991) The comparative method in evolutionary biology. (Oxford University Press, Oxford).
Hedges SB and Poling LL. (1999) A molecular phylogeny of reptiles. Science 283:9981001.
Hendy MD and Penny D. (1993) Spectral analysis of phylogenetic data. J Classif 10:524.[CrossRef]
Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP. (2001) Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294:23102314.[CrossRef][Web of Science][Medline]
Ihaka R and Gentleman R. (1996) R: a language for data analysis and graphics. J Comput Graph Stat 5:299314.[CrossRef]
Kitazoe Y, Kurihara Y, Narita Y, Okuhara Y, Tominaga A, Suzuki T. (2001) A new theory of phylogeny inference through construction of multidimensional vector space. Mol Biol Evol 18:812828.
Kuhner MK and Felsenstein J. (1994) A simulation study of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol 11:459468.[Abstract]
Lewin B. (1997) Genes VI. (Oxford University Press, Oxford).
Lin Y, Waddell PJ, Penny D. (2002) Pika and vole mitochondrial genomes increase support for both rodent monophyly and Glires. Gene 294:119129.[CrossRef][Web of Science][Medline]
Lodish H, Berk A, Zipursky SL, Matsudaira P. (2000) Molecular cell biology. 4th ed (W.H. Freeman & Co, New York).
Marcotte EM, Pellegrini M, Ng H-Y, Rice DW, Yeates TO, Eisenberg D. (1999) Detecting protein function and protein-protein interactions from genome sequences. Science 285:751753.
Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D. (1999) A combined algorithm for genome-wide prediction of protein function. Nature 402:8386.[CrossRef][Medline]
Mather OC, Singh A, van Boxel GI, White SA, Jackson JB. (2004) Active-site conformational changes associated with hydride transfer in proton-translocating transhydrogenase. Biochemistry 43:1095210964.[CrossRef][Medline]
Moyle WR, Campbell RK, Myers RV, Bernard MP, Han Y, Wang X. (1994) Co-evolution of ligand-receptor pairs. Nature 368:251255.[CrossRef][Medline]
Murphy WJ, Eizirik ED, O'Brien SJ, et al. (2001) Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science 294:23482351.[CrossRef][Web of Science][Medline]
Pazos F, Helmer-Citterich M, Ausiello G, Valencia A. (1997) Correlated mutations contain information about protein-protein interaction. J Mol Biol 271:511523.[CrossRef][Web of Science][Medline]
Pazos F and Valencia A. (2000) Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng 14:609614.
Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. (1999) Assigning protein functions by comparative genome analysis protein phylogenetic profiles. Proc Natl Acad Sci USA 428:54288.
Penny D, Hendy MD, Henderson IM. (1987) The reliability of evolutionary trees. Cold Spring Harbor Symp Quant Biol 52:857862.
Pollock DD and Taylor WR. (1997) Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. Protein Eng 10:647657.
Pollock DD, Taylor WR, Goldman N. (1999) Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 287:187198.[CrossRef][Web of Science][Medline]
RDevelopment Core Team. 2005. A language and environment for statistical computing [Internet]. Vienna (Austria): R Foundation for Statistical Computing. [cited 2006 Dec 31]. Available from: http://www.R-project.org.
Rambaut A and Grassly NC. (1997) Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci 13:235238.
Robinson DF and Folds LR. (1979) Comparison of weighted labeled trees. Lect Notes Math 748:119126.
Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. (2003) Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol 20:16921704.
Sammon JW. (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput C-18:401409.
Sanderson MJ. (1997) A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol Biol Evol 14:12181232.[Web of Science]
Scally M, Madsen O, Douady CJ, de Jong WW, Stanhope MJ, Springer MS. (2001) Molecular evidence for the major clades of placental mammals. J Mamm Evol 8:239277.[CrossRef]
Schmitz J, Ohme M, Zischler H. (2002) The complete mitochondrial sequence of Tarsius bancanus: evidence for an extensive nucleotide compositional plasticity of primate mitochondrial DNA. Mol Biol Evol 19:544553.
SPSS. (1999) SPSS Base 10.0 for Windows user's guide. (SPSS Inc, Chicago (IL)).
Steel MA and Penny D. (1993) Distributions of tree comparison metricssome new results. Syst Biol 42:126141.[CrossRef]
Swofford DL. (2002) PAUP* phylogenetic analysis using parsimony (*and other methods). Version 4.0b10. (Sinauer Associates, Sunderland (MA)).
Swofford DL, Olsen GJ, Waddell PJ, Hillis DM. (1996) Phylogenetic inference. In Hillis DM, Moritz C, Mable BK (Eds.). Molecular systematics 2nd ed (Sinauer Associates, Sunderland (MA)) pp. 407514.
Thorne JL and Kishino H. (2002) Divergence time and evolutionary rate estimation with multilocus data. Syst Biol 51:689702.[CrossRef][Web of Science][Medline]
Thorne JL, Kishino H, Painter IS. (1998) Estimating the rate of evolution of the rate of molecular evolution. Mol Biol Evol 15:16471657.[Abstract]
Valencia A and Pazos F. (2002) Computational methods for the prediction of protein interactions. Curr Opin Struct Biol 12:368373.[CrossRef][Web of Science][Medline]
Waddell PJ. (1990) The mating behaviour and phylogeny of the nasuta subgroup of the genus Drosophila [dissertation]. (University of Auckland, Auckland (New Zealand)).
Waddell PJ. (1995) Statistical methods of phylogenetic analysis: including Hadamard conjugations, LogDet transforms, and maximum likelihood [dissertation]. (Massey University, Palmerston North (New Zealand)).
Waddell PJ, Cao Y, Hauf J, Hasegawa M. (1999) Using novel phylogenetic methods to evaluate mammalian mtDNA, including AA invariant sites-LogDet plus site stripping, to detect internal conflicts in the data, with special reference to the position of hedgehog, armadillo, and elephant. Syst Biol 48:3153.[CrossRef][Web of Science][Medline]
Waddell PJ and Kishino H. (2000) Cluster inference methods and graphical models evaluated on NCI60 microarray gene expression data. Genome Inform Ser Workshop Genome Inform 11:129141.[Medline]
Waddell PJ, Kishino H, Ota R. (2001) A phylogenetic foundation for comparative mammalian genomics. Genome Inform 12:141154.
Waddell PJ and Mclachlan M. (2004) Genomics. In Bryan Ness (Ed.). Encyclopaedia of genetics 2nd ed (Salem Press, Pasadena (CA)) pp. 384388.
Waddell PJ, Okada N, Hasegawa M. (1999) Progress in resolving the interordinal relationships of placental mammals. Syst Biol 48:15.[CrossRef][Web of Science][Medline]
Waddell PJ, Penny D, Moore T. (1997) Extending Hadamard conjugations to model sequence evolution with variable rates across sites. Mol Phylogenet Evol 8:3350.[CrossRef][Web of Science][Medline]
Waddell PJ and Steel MA. (1997) General time reversible distances with unequal rates across sites: mixing
and inverse Gaussian distributions with invariant sites. Mol Phylogenet Evol 8:398414.[CrossRef][Web of Science][Medline]
Yang Z. (1999) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 5:555556.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
E. R.M. Tillier and R. L. Charlebois The human protein coevolution network Genome Res., October 1, 2009; 19(10): 1861 - 1871. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Nabholz, S. Glemin, and N. Galtier Strong Variations of Mitochondrial Mutation Rate across Mammals--the Longevity Hypothesis Mol. Biol. Evol., January 1, 2008; 25(1): 120 - 130. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||











