Molecular Biology and Evolution 18:812-828 (2001)
© 2001 Society for Molecular Biology and Evolution
A New Theory of Phylogeny Inference Through Construction of Multidimensional Vector Space
Yasuhiro Kitazoe,
Yukio Kurihara,
Yuichi Narita,
Yoshiyasu Okuhara,
Akira Tominaga and
Tomohiko Suzuki
*Center of Medical Information Science,
Department of Information Science,
Department of Medical Biology,
Medical Research Center, Kochi Medical School, Nankoku, Kochi, Japan;
||Department of Biology, Faculty of Science, Kochi University, Akebono, Kochi, Japan
 |
Abstract
|
|---|
Here, a new theory of molecular phylogeny is developed in a
multidimensional vector space (MVS). The molecular evolution
is represented as a successive splitting of branch vectors in
the MVS. The end points of these vectors are the extant species
and indicate the specific directions reflected by their individual
histories of evolution in the past. This representation makes
it possible to infer the phylogeny (evolutionary histories)
from the spatial positions of the end points. Search vectors
are introduced to draw out the groups of species distributed
around them. These groups are classified according to the nearby
order of branches with them. A law of physics is applied to
determine the species positions in the MVS. The species are
regarded as the particles moving in time according to the equation
of motion, finally falling into the lowest-energy state in spite
of their randomly distributed initial condition. This falling
into the ground state results in the construction of an MVS
in which the relative distances between two particles are equal
to the substitution distances. The species positions are obtained
prior to the phylogeny inference. Therefore, as the number of
species increases, the species vectors can be more specific
in an MVS of a larger size, such that the vector analysis gives
a more stable and reliable topology. The efficacy of the present
method was examined by using computer simulations of molecular
evolution in which all the branch- and end-point sequences of
the trees are known in advance. In the phylogeny inference from
the end points with 100 multiple data sets, the present method
consistently reconstructed the correct topologies, in contrast
to standard methods. In applications to 185 vertebrates in the

-hemoglobin, the vector analysis drew out the two lineage groups
of birds and mammals. A core member of the mammalian radiation
appeared at the base of the mammalian lineage. Squamates were
isolated from the bird lineage to compose the outgroup, while
the other living reptilians were directly coupled with birds
without forming any sister groups. This result is in contrast
to the morphological phylogeny and is also different from those
of recent molecular analyses.
 |
Introduction
|
|---|
Molecular studies have provided new findings on evolutionary
phylogeny (Goodman 1963

; Sarich and Wilson 1967

; Gemmell and
Westerman 1994

; Janke et al. 1996

; Springer et al. 1997

;
Kumar and Hedges 1998

; Hedges and Poling 1999

). Phylogeny
inference has been done by methods directly using the substitution
distances between the species (Sokal and Michener 1958

; Eck
and Dayhoff 1966

; Fitch and Margoliash 1967

; Sokal and Rohlf
1981

; Saitou and Nei 1987

) and by those taking explicit account
of nucleotide or amino acid substitutions (Felsenstein 1993

;
Adachi and Hasegawa 1996

; Strimmer and von Haeseler 1996

;
Swofford 1998

). The topologies obtained by these methods are
often changed on the basis of the numbers and kinds of selected
species, including the outgroup. There appear to be strange
topologies in spite of high branch resolutions (Cao et al. 1998
a, 1998
b
; also shown in this paper). Furthermore, the branch resolutions
tend to be lower with an increasing number of species, since
the number of topological variations increases rapidly. Hence,
the whole structure of the vertebrate phylogeny remains unclear
(Novacek 1992

; Hedges 1994

; Cao et al. 1998
a
). In this respect,
the development of a new methodology is expected to provide
an understanding of molecular evolution.
There exists an original space of vectors whose components are the amino acid sequences of the respective species. However, this space is too abstract for a quantitative analysis of the tree structure, since these components are not digital. If this space can be transformed into the multidimensional vector space (MVS) of Euclidean space without decreasing the degree of freedom of the variables, the tree structure can be directly analyzed by using geometrical quantities such as the distances and angles between the vectors.
In this paper, we first show how the evolutionary process is represented in the MVS. To make lucid our basic idea for this representation, we consider the simplest model of substitutions (hereinafter called the single-substitution model [SSM]), in which the amino acids of any sites can be changed at most once over all the evolutionary processes, and the substitution distances between two species are approximated by the number of different amino acid sites between two sequences. Then, the processes are represented as the developments of branch vectors along the orthogonal coordinate axes in the MVS, and the end points of these vectors correspond to the extant species. The species vectors indicate the specific directions in the MVS, respectively.
Next, to infer the tree structure from the species vectors of the end points, we introduce search vectors to draw out the groups of species, which are distributed around these vectors. A search vector may be a one-species vector or a center-of-gravity vector (CGV) of many species. The drawn-out species stand in line in the nearby order of the topological relationship with the search vectors if the branch vectors are orthogonal. The branching patterns of the tree are analyzed by a scatter diagram which expresses the relationship among the groups of species drawn out by two search vectors. It is shown that the diagram consists of the three parts: the left branch, the right branch, and the central line. The first two include the drawn-out species which are distributed parallel to the x- and y-axes, respectively, while the third represents the other species (outgroup) of the outside branches, which are distributed around the origin of coordinates or along the central line of y = x.
The present theory is meaningless if the spatial positions of the respective species are not obtained. In this paper, their positions in the MVS are determined so that their relative distances may reproduce the substitution distances between species. This can be easily done in simple cases of three and four species. As the number I of species and the dimensional size N increase, however, it is difficult to determine the positions by statistical methods of minimizing or maximizing a quantity, since the number of variables is huge (if I = 150 and N = 100, the number is I x N = 15,000). To remove this difficulty, we introduce the dynamic method of time-dependently solving the equation of motion for the many-body system in physics (David and James 1988
). Here, the species are regarded as particles moving under their mutual interactions. By utilizing the law of physics that the system finally falls into the ground state, irrespective of the initial random distributions of the particles, we can precisely determine the positions of the species in the MVS. Here, the interactions are defined as the sums of two-body potentials, which are expressed so as to be minimum at the points equal to the substitution distances between the two related species. The solution of the equation of motion always converges on the ground state by the (I - 1)-dimensional size. However, when I is larger than the number L of variable sites in amino acid sequences of these species, it is found that the equation of motion can be solved with good accuracy by the L-dimensional size in place of (I - 1). Then, the number of variables in the MVS becomes substantially equal to that in the original abstract space. This means a substantial transformation of the original space into the MVS.
The validity of the present method was examined by using computer simulations of molecular evolution in which any branch patterns could be created by using the Monte Carlo method (Hammersley and Handscomb 1964
) with the weight of the transition probability matrix. Then, it was possible to directly compare the present method with standard ones such as neighbor joining (NJ), maximum likelihood (ML), and parsimony (PA), since all the branch- and end-point sequences were known in advance. The predominance of the present method was demonstrated by using two examples of typical branch patterns in which 100 multiple data sets with a variety of branch lengths were prepared by repeating the simulations.
The present method was applied to the vector analysis of 185 vertebrate species in the
-hemoglobin, and the results were compared with those of the standard methods. The standard methods gave erroneous topologies in the relationship among Artiodactyla, Perissodactyla, and Chiroptera in spite of very high branch resolutions which became lower as the number of species increased. On the other hand, the present approach gave the correct topology, and a larger number of species brought a more stable solution. Encouraged by this success, we challenged recent analyses of topics regarding the reptilian phylogeny. Our results showed that squamates were clearly isolated from the lineage of birds, while the other extant reptilians belonged to the lineage of birds without forming any sister groups. This is in contrast to the traditional phylogeny (Romer 1966
; Benton 1997
) and is also different from recent reports (Gorr, Mable, and Kleinschmidt 1998
; Hedges and Poling 1999
).
 |
Theory
|
|---|
Representation of the Evolutionary Process in the MVS
The minimum unit of evolution is the elementary process of
figure 1a,
in which point 1 is split into points 3 and 4 via point
2. Here, we define
di,j as the substitution distance between
the
ith and
jth points. The SSM gives the relations
d1,3 =
d1,2 +
d2,3,
d1,4 =
d1,2 +
d2,4, and
d3,4 =
d2,3 +
d2,4. In this
paper, we also define the spatial distance
Ri,j = |
Ri -
Rj|
in the MVS by the square root of
di,j. Then, these relations
are expressed as follows:
R1,32 =
R1,22 +
R2,32,
R1,42 =
R1,22 +
R2,42, and
R3,42 =
R2,32 +
R2,42, which represent Euclidean
distances. Thus, the tree structure of
figure 1a
is represented
by three vectors,
R1,2,
R2,3, and
R2,4, along the
x-,
y-, and
z-axes in the three-dimensional space (
fig. 1b
). When point
3 is further split into two branches, the tree is composed of
five vectors along the orthogonal coordinate axes in the five-dimensional
space. In this way, the evolution in the SSM is represented
as the development of branch vectors along the orthogonal coordinate
axes of the MVS, and the extant species correspond to the end
points of these developments.

View larger version (13K):
[in this window]
[in a new window]
|
Fig. 1.Three-dimensional description of the elementary process of evolution. An elementary process (a) of splitting a branch into two is expressed in the single-substitution model (SSM) by the three branch vectors (b) along the x-, y-, and z-axes of the three-dimensional space. A succession of such elementary processes forms an evolutionary tree, which can be represented in the multidimensional vector space (MVS)
|
|
The orthogonal relation between the branch vectors is very useful
in the structural analysis of the tree. Without loss of generality,
we consider the topology of
figure 2
, which consists of the
branch points (points 115) and the endpoints (points
1631). Here, point 1 is the root of the tree. The SSM
gives a number of orthogonal relations between the branch vectors
Ri,j =
Ri -
Rj. For example, we have
Ri,1
Rj,1 for the two
groups of points (
i = 2, 4, 5, 811, 1623) and
(
j = 3, 6, 7, 1215, 2431), and also
Ri,2
Rj,2 for the two groups of points (
i = 4, 8, 9, 1619) and
(
j = 5, 10, 11, 2023). However, for example, the vectors
Ri,1 (
i = 6, 12, 13, 2427) are not orthogonal to the
vectors
Rj,1 (
j = 7, 14, 15, 2831), since they include
the common vector
R1,3.

View larger version (17K):
[in this window]
[in a new window]
|
Fig. 2.Evolutionary tree for the presentation of the present theory. The present theory is presented using this figure without loss of generality. The figure consists of the upper and lower wings, which have the seven branch points and eight end points, respectively
|
|
When we specify a point
s as one of the search vectors, the
angle
i,s between the vectors
Rs and
Ri is given by the relation
Ui,s =
Ri·
Rs =
Nn=1 Xi,nXs,n =
RiRs cos
i,s. Here, the
vectors
Ri are measured from the position of the root (point
1), and
Xi,n and
Xs,n denote the
nth dimensional components
of
Ri and
Rs, respectively. Then, we have the quantity
 | (1) |
which represents the square of the projection
component of the
ith point vector
Ri to
Rs. By specifying another
point,
t, we obtain the scatter diagram for
Wi,s and
Wi,t, which
give the
x and
y values, respectively.
A variety of pairs may be selected for the search points (s, t). For the first pair, s = 31 and t = 16, the SSM gives the following orthogonal relations: R15,1
R15,s, R15,1
R15,30, and R15,s
R15,30, since points 1, 15, 30, and 31 correspond to points 1, 2, 3, and 4, respectively, in figure 1b.
We have Zs,s = Rs,12 and Z15,s = Z30,s. Hence, we obtain Zi,s = C1 for the point (i = 31), Zi,s = C2 for the points (i = 15, 30), Zi,s = C3 for the points (i = 7, 14, 28, 29), Zi,s = C4 for the points (i = 3, 6, 12, 13, 2427), and Zi,t = 0 for the points (i = 3, 6, 7, 1215, 2431), where Cj (C1 > C2 > C3 > C4) are constants. In this way, the four groups of points stand in line on the x-axis according to the nearby order of branches from the search point (s = 31). Similarly, in the upper branches of figure 2
, the point (i = 16) gives the largest value of Zi,t on the y-axis, the points (i = 8, 17) form the second group, the points (i = 4, 9, 18, 19) form the third group, and the points (i = 2, 5, 10, 11, 2023) form the fourth group.
For the second example, the pair (s = 31, t = 24), we have the orthogonal relations Zi,s = Zi,t = 0 for the points (i = 2, 4, 5, 811, 1623), Zi,t = C5 for the point (i = 24), Zi,t = C6 for the points (i = 12, 25), Zi,t = C7 for the points (i = 6, 13, 26, 27), and Zi,s = C8 for the points (i = 3, 7, 14, 15, 2831), since Ri,3
Rj,3 for the two groups of points (i = 6, 12, 13, 2427) and (j = 7, 14, 15, 2831). Here, C5, C6, C7, and C8 are constants (C5 > C6 > C7 > C8). Here, in the x-direction, the point (i = 31) gives the largest value of Zi,s, the points (i = 15, 30) form the second group, the points (i = 7, 14, 28, 29) form the third group, and the points (i = 3, 6, 12, 13, 2427) form the fourth group. In the y-direction, point 24 gives the largest value of Zi,t, the points (i = 12, 25) form the second group, the points (i = 6, 13, 26, 27) form the third group, and the points (i = 3, 7, 14, 15, 2831) form the fourth group. In this way, the scatter diagram for Zi,s and Zi,t gives the groupings of the points which correspond to the nearby orders of branches viewed from the search points (s and t). Also, the regression lines along these plots are orthogonal to each other. However, actual evolutionary processes may include more or less substitutions of higher orders than those in the SSM. Then, the two lines cannot be completely parallel to the x- and y-axes, respectively.
A computer simulation of evolution is a useful tool to check the validity of the vector analysis in terms of Zi,s and Zi,t, since all the sequences of the 31 points of figure 2
are known in advance. The simulation created these sequences on the basis of the Markov process without applying the SSM. An amino acid sequence (141 sites) of the first point (the root) was given by generating white random numbers for the 20 amino acids. We tried to substitute an amino acid at each site of the root sequence by generating a random number with the weight of the transition probability matrix between amino acids. Then, most of the sites remained unchanged, since the diagonal parts of the matrix were much larger than the nondiagonal ones. The sequences of the next two points (2, 3) were fixed after amino acid substitutions of 10 sites in the root sequence, respectively. This procedure was continued to the end points (points 1631). In this simulation, the Dayhoff model was used for both the transition probability matrix and the substitution distances between the sequences of the respective points (Dayhoff, Schwartz, and Orcutt 1978
). The substitution distances between points 131 were calculated in a way equivalent to that of the software package Protdist (Felsenstein 1993
), and used as the input data for the equation of motion.
Positions Ri of the 31 points of figure 2
were precisely determined by solving the equation of motion in the 30 dimensions and used for the vector analysis. Figure 3a
shows the scatter diagram for Zi,s and Zi,t (s = 31, t = 16). The scatter plots were clearly classified into the eight groups according to the branching orders from the specific points (s = 31, t = 16), and the regression lines along the plots were almost orthogonal to each other on the x- and y-axes, although the SSM was not applied and the end points were rather far from the root (point 1) (d
35). Figure 3b
shows the scatter diagram for Zi,s and Zi,t (s = 31, t = 24). Here, the points of the upper branches of figure 2 were distributed around the origin of coordinates to compose the outgroup, since their vectors were almost orthogonal to the vectors Rs and Rt. Those of the lower branches were distributed in a pattern quite similar to that shown in figure 3a
and were classified into the seven groups. In this way, a successive change of the search vectors s and t completely resolved the branch pattern of figure 2
.

View larger version (27K):
[in this window]
[in a new window]
|
Fig. 3.Vector analysis of the evolutionary tree of figure 2
. The positions Ri of the 31 points of figure 2
were precisely determined in the MVS of 30 dimensions. The vectors R31, R16, and R24 of the points (i = 31, 16, 24) were first chosen as the search vectors to analyze the tree structure. a and b, scatter diagrams for Zi,31 and Zi,16 and for Zi,31 and Zi,24 (defined by eq. 1
), respectively. In a, with the search points (i = 31, 16), the points of the upper and lower wings of figure 2
were almost distributed on the y- and x-axes, respectively. The groups of points were clearly classified in the nearby orders of branches from the search points (i = 31, 16). In b, with the search points (i = 31, 24), the branch points (points 3, 6, 12) were further resolved, and the regression lines along the plots were also orthogonal to each other. Here, the points of the upper wing were distributed around the origin of coordinates as the outgroup, since their vectors were almost orthogonal to the search vectors R24 and R31. The successive changes of the search vectors could resolve the branches of the tree. Next, to infer the phylogeny, we determined the vectors Ri of the end points (points 1631) by solving the equation of motion of only these points. We also determined position Rc of the spherical center by using equation (2)
. The obtained grouping patterns of the end points in the vector analysis were consistent with those of this figure
|
|
Vector Analysis for Phylogeny Inference
Unfortunately, the sequences of branch points 115 in
figure 2
are not known in actual evolution. To confirm the
validity of the vector analysis, we have to reproduce the topology
of this figure by blinding the branch points. Using the distance
matrix for the end points (points 1631) as the input
data, we determine their positions
Ri by solving the equation
of motion in 15 dimensions.
Prior to the inference, we determine the origin of coordinates using the positions of the end points. The molecular clock is assumed as the first approximation for this position. We look for the position Rc of the spherical center which is equidistant from all the end points. Let us consider the quantity S defined by the equation

| (2) |
The central
position is determined by minimizing the
S value on the basis
of the least-squares method;
S/
Xc,n = 0 (
n = 1, 2, ... ,
N)
with linear coupled equations for
Xc,n. Here,
Xc,n denotes the
nth dimensional component of
Rc. The values
Xc,n can be obtained
by solving the matrix equation. As a result, we can check how
the end points are spherically distributed around the center,
since the distance between the center and the
ith point is given
by
Ri,c = {
Nn=1 (
Xi,n -
Xc,n)
2}
1/2.
Once the root position is determined, the tree structure can be analyzed in terms of the scattering diagram for Zi,s and Zi,t by using only the end points. By adopting R16 and R31 as the search vectors, we obtain figure 3a
without points 215. Then, on the basis of the orthogonal theory presented in the previous section, we know that points 1623 and points 2431 groups that are independent from each other. In the group of points 1623, points 2023 are first separated from the evolution process toward the point 16, points 18 and 19 are separated next, and point 17 is finally separated. On the other hand, in the group of points 2431, points 2427 are first separated from the evolution process toward point 31, points 28 and 29 are separated next, and the point 30 is finally separated. As a result, we obtain the tree inserted in figure 3a.
Next, we further resolve points 2427 by taking R24 as the search vector instead of R16. We get figure 3b
without points 215. Then, we know that the points 26 and 27 are first separated from the evolution process toward point 24, and point 25 is separated next. As a result, we get a more detailed tree, presented in the insert in figure 3b.
By continuing such a procedure, we can obtain the full tree of 16 points.
The vector analysis can be performed without any mathematical approximations. However, the branch lengths of the tree are obliged to be evaluated model-dependently. This can be easily done by using the SSM as follows: In figure 2
, for example, we have the orthogonal relations Ri,4
Rj,4 for the points [i = (16, 17), j = (18, 19)], [i = (1), j = (16, 17)], and [i = (1), j = (18, 19)]. We have R16,42 + R18,42 = R16,182, R1,42 + R16,42 = R16,12, and R1,42 + R18,42 = R18,12. As a result, the branch lengths are given by R1,42 = (R16,12 + R18,12 - R16,182)/2, R16,42 = (R16,12 + R16,182 - R18,12)/2, and R18,42 = (R18,12 + R16,182 - R16,12)/2. We finally note that the scalar product Ui,s = Ri·Rs =
Nn=1 Xi,nXs,n = RiRscos
i,s of the search vector Rs and vector Ri gives the distance from the root to the branch point between the two points (s and i). The scatter diagram may be expressed in terms of Ui,s and Ui,t in place of Zi,s and Zi,t.
Determination of Species Positions Ri in the MVS
To obtain species positions Ri in the MVS, we have to determine a large number of parameters of Xi,n (i = 1, 2, ... , I; n = 1, 2, ... , N), which are the vector components of Ri. We introduce a new method of time-dependently solving the equation of motion for the many-body system. For this purpose, the species are regarded as the particles moving according to the equation of motion. They are initially deposited at random within a volume of the MVS. With time, they continue to move toward lower energy states under the influences of both a viscosity effect and their mutual interactions. They gradually lose their kinetic energies due to the viscosity effect and finally stop falling into the ground state of the system, irrespective of the initial conditions given using random numbers. This falling is one of the most fundamental laws of physics. Therefore, it is quite easy to confirm whether the equation of motion has been correctly solved or not, since the ground state is explicitly known in advance. The interactions are defined as the sum of two-body attractive potentials, which are minimized at the points equal to the square roots of the substitution distances between the two species. The ground state corresponds to the situation in which any pairs of the particles are positioned at the points equal to the square roots of the substitution distances.
The initial positions Ri and velocities Vi of the particles are randomly distributed within the volumes of R0 and V0 by the equations

| (3) |
for
i = (1,
2, ... ,
I) and
n = (1, 2, ... ,
N). Here,
V0 and
R0 are constants,
F is a random number between 0 and 1, and
Vi,n and
Xi,n are
the
nth dimensional components of
Vi and
Ri, respectively. The
particles are allowed to move in time according to the kinematics
of damped equations of motion (David and James 1988

):

| (4) |
which are the differential equations about time
T with the
N dimensions (
n = 1, 2, ... ,
N) and the
I species (
i = 1, 2,
... ,
I). Here, µ is the viscosity to damp the motion
of the particles. The total energy
E of the system is given
by

| (7) |
which consists of the
sum of the kinetic energy of the first term and the potential
energy of the second term. We introduce the two-body potential
between the
ith and
jth particles, defined by
 | (8) |
which minimizes at the point of
Ri,j =
Di,j. Here,
P0 and
A are constants,
Rij = {
Nn=1 (
Xi,n -
Xj,n)
2}
1/2, and the quantity
Di,j is defined as the square root of the substitution distance
di,j between the two sequences of the
ith and
jth species.
Equations (4)(8)
are expressed in no dimensions without physical units such as meters, grams, or seconds. If we set µ = 0, these equations are reduced to the well-known "canonical equation," where the total energy E of the system is conserved time-independently. The Runge-Kutta method is applied for the numerical solution, where we set the constant values as follows: V0 = 0.3, M = 900, P0 = -0.5, µ = 8, and A = 20. R0 denotes the range of the initial distribution of the particles and is defined as half of the maximum value of Di,j. P0 is the depth of the potential which confines the particles within the interaction range. The value of µ may be adjusted to be small such that the particles will gradually lose their kinetic energies in time, and µ = 8 was found to be suitable. A time mesh to integrate the differential equations about time is taken as
T = 0.3, which is sufficiently small to keep the calculations accurate. The size of dimensions, N, is set equal to (I - 1) if I is smaller than the number of variable sites within the given species, L. If I > L, we may put N = L, since the average error of Ri,j versus Di,j is then negligible. The numerical calculations start with a high-energy state given by a set of the initial distribution of the particles, and they stop at the ground state, whose appearance verifies the correct solution of equations (4)(6)
. This can be easily confirmed in practice, since the ground state is given by E0 = P0I(I - 1)/2, as known from equations (7) and (8)
. In numerical calculations using the interaction potential of equation (8), we do not have any local minimum of the total energy. The calculations were performed by a Fortran program with double precision.
Comparison with Other Methods by Computer Simulations of Molecular Evolution
The efficacy of the present theory was demonstrated by using computer simulations of molecular evolutions in which all of the branch- and end-point sequences were known in advance. Once an initial sequence of amino acids is given, a series of sequences of the following evolutionary process are determined by using Monte Carlo method (Hammersley and Handscomb 1964
) with the weight of the transition probability matrix t(p, q) for the Dayhoff model. The amino acid transition at each site is determined by generating a random number F between 0 and 1: if a given F satisfies the condition T(p, q - 1) < F
T(p, q), the transition from the pth amino acid to the qth one occurs. Here, T(p, q) =
qr=1 T(p, r)/
20r=1 T(p, r), T(p, 0) = 0, and T(p, p) means no transition. An operation of the transitions is performed from the first site to the last, so that the first sequence is created. The second sequence is created by the next operation. The (i + 1)-th sequence is given by using the ith sequence. Two sequences are created at each branch point. In this way, we can simulate any branch patterns of molecular evolution based on the stochastic (Markov) process. The number of operations is proportional to evolutionary time.
To compare the present method with the standard ones, we formed the branch patterns of figure 4a and b.
Here, the end points (points 59) correspond to the 100 operations. Branch points 2, 3, and 4 in figure 4a
are the positions of the 50, 70, and 70 operations, respectively. Those in figure 4b
are the positions of the 40, 60, and 80 operations, respectively. To make the simulation realistic, we assumed that the initial point (point 1) was the consensus sequence of eutherians. We also assumed that the 64 sites could be substituted, since the other 77 sites were known to almost freeze in eutherians.

View larger version (11K):
[in this window]
[in a new window]
|
Fig. 4.Computer simulations of molecular evolution. The present method was compared with the standard ones using the two branch patterns of a and b. The initial point (point 1) is given by the consensus sequence of eutherians. A series of sequences of the following evolutionary process are determined by using the Monte Carlo method with the weight of the transition probability matrix. The branch points 2, 3, and 4 in a are the positions of the 50, 70, and 70 operations for the amino acid transitions, respectively. Those in b are the positions of the 40, 60, and 80 operations, respectively. The end points (points 59) correspond to the 100 operations. By repeating the simulations, 100 multiple data sets with a variety of branch lengths were prepared for a statistical comparison with the other methods, and the sequences of the end points were used for the phylogeny inference
|
|
By repeating the simulation, we produced the 100 multiple data
sets (100 evolutionary events) in both topologies of
figure 4a and b
and tried to reconstruct the original branch patterns
by using only the sequences of the end points, where point 5
was taken as the outgroup.
Table 1
shows the percentages of
the correct solutions for ML, NJ, and the present method. The
ML and NJ analyses were performed using the software packages
PUZZLE (Strimmer and von Haeseler 1996

) and PHYLIP (Felsenstein
1993

), respectively. The NJ method gave better results than
the ML method. A clear correlation of branch resolution between
the two methods did not exist due to their different algorithms
for phylogeny inference.
Figures 5 and 6
show typical examples
in which the two methods gave erroneous topologies in same data
sets, corresponding to
figure 4a and b,
respectively. The ML
method could not reconstruct the original topology in spite
of the 100% confidence (
fig. 5a
). This suggests that the ML
value cannot necessarily be the criterion for the correct topology.
The results of
table 1
depend on the numbers of operations
in
figure 4
. For example, if the numbers 20 and 30 in
figure 4a
are replaced by 30 and 20, respectively, the results of
the standard methods are improved.

View larger version (16K):
[in this window]
[in a new window]
|
Fig. 5.Results of the standard methods for the branch pattern of figure 4a.
The figure gives a typical example (table 2
) of data sets in which the maximum-likelihood (a) and neighbor-joining (b) methods could not reconstruct the original topology of figure 4a
in spite of high branch resolutions. Here, point 5 was taken as the outgroup
|
|
Next, we follow the procedure of the present method. First,
we obtained the MVS of the four dimensions by using the distance
matrix for the end points (points 59). Then, we obtained
the spherical center by using equation (2). However, the obtained
radius was too short due to the small number of end points.
According to the SSM, the minimum value of this radius is half
of the average distance between the outgroup (5) and the other
end points. This value was basically used as the spherical center
(origin of coordinates) but multiplied by the factor 1.3 for
a deviation from the SSM. We again solved the equation of motion
of the five dimensions which took into account this center and
the end points (points 59) and constructed the MVSs for
all of the data sets. As seen in
table 1
, we obtained an excellent
result of the vector analysis over all the data sets. Here,
we give full details of the two examples in which the standard
methods gave erroneous topologies (
figs. 5 and 6
). By examining
every pair of end points (points 69) as the search vectors,
we selected the best pair of them which fulfilled the criterion
of orthogonal relations such as those in
figure 3b.
Figure 7a
gives the vector analysis using the data set in the case
of
figure 5
. Here, the search vectors of points 6 and 8 revealed
the existence of the two sister groups of (6, 7) and (8, 9).
These groupings were confirmed by using points 8 and 9 as search
vectors (see
fig. 7b
), since the other points (6 and 7) were
positioned outside of the orthogonal relation (solid lines in
Fig. 7b
) formed by points 8 and 9. Next,
figure 8
gives the
vector analysis using the data set in the case of
figure 6
.
Using points 7 and 8 as search vectors, we found that point
9 was coupled to point 8 while point 6 was positioned outside
of the orthogonal relation (solid lines) formed by points 79
(
fig. 8a
). This branch pattern was confirmed using points 8
and 9 as search vectors, since then points 6 and 7 were positioned
outside of the orthogonal relation formed by points 8 and 9,
and point 7 had larger vector components to points 8 and 9 than
point 6 (
fig. 8b
).

View larger version (12K):
[in this window]
[in a new window]
|
Fig. 7.Scatter diagram for the branch pattern of figure 4a.
The figure gives the vector analysis of the data set used in figure 5
. The analysis was done using the MVS of the five dimensions which include both the end points (points 59) and a spherical center defined as the origin of coordinates. The search vectors (points 6 and 8) drew out points 7 and 9, respectively (a). By changing the search vectors to the points 8 and 9, the points 6 and 7 were positioned outside of the orthogonal relation (solid lines of b) formed by these vectors. In this way, the original topology of figure 4a
was confirmed
|
|

View larger version (11K):
[in this window]
[in a new window]
|
Fig. 8.Scatter diagram for the branch pattern of figure 4b.
The figure gives the vector analysis of the data set used in figure 6
. The analysis was done using the MVS of the five dimensions which include both the end points (points 59) and a spherical center defined as the origin of coordinates. The search vectors (points 7 and 8) drew out point 9 (a). By changing the search vectors to points 8 and 9, the points 6 and 7 were positioned outside of the orthogonal relation (solid lines of b) formed by these vectors, and point 7 had larger vector components to points 8 and 9 than point 6. In this way, the original topology of figure 4b
was confirmed
|
|

View larger version (14K):
[in this window]
[in a new window]
|
Fig. 6.Results of the standard methods for the branch pattern of figure 4b.
The figure gives a typical example of data sets in which the maximum-likelihood (a) and neighbor-joining (b) methods could not reconstruct the original topology of figure 4b.
Here, point 5 was taken as the outgroup
|
|
Finally, we evaluated the branch lengths by using the scalar
product
Ui,j of the two vectors of the
ith and
jth end points.
Here, we note the relation
Ui,j = (
R1,i2 +
R1,j2 -
Ri,j2)/2.
In the topology of
figure 4a,
length (1-2) was given by the
average value of
Ui,j (
i = 6, 7;
j = 8, 9). Lengths (2-3) and
(2-4) were given by subtracting length (1-2) from lengths (1-3)
and (1-4), respectively. Lengths (3-
i) (
i = 6, 7) were given
by subtracting length (1-3) from lengths (1-
i), respectively.
Lengths (4-
i) (
i = 8, 9) were given in the same way.
Figure 9a
expresses the branch lengths in the case of
figure 7
, which
are compared with the true ones (
fig. 9b
) given by using the
branch- and end-point sequences.
Figure 9c
gives the result
of the ML method, and a similar result was also obtained with
the NJ method. In
figure 9c,
the branch point between points
5 and 6 was taken as the origin of coordinates, since the position
of the root was unknown. The obtained branch lengths were very
different from both the present ones and the true ones. However,
such a comparison for the different topologies may be meaningless.
The values along the branches in
figure 9
denote the point-to-point
distances. In this paper, the distances between point 1 and
the end points were assumed to be constant for simplicity. The
branch lengths in
figure 9a
can be revised by readjusting these
distances so as to be proportional to those between the outgroup
(point 5) and the other points (points 69).
Table 2
gives the original sequences of point 1 and the end points (points
69).

View larger version (17K):
[in this window]
[in a new window]
|
Fig. 9.Branch lengths for the branch pattern of figure 4a.
The branch lengths were evaluated using the data set analyzed in figures 5 and 7
. A distance between two branch (or end) points was assumed to be the sum of the branch lengths connecting these two points. The result (a) is compared with the true tree (b) given by using the branch- and end-point sequences. The values along the branches denote the point-to-point distances. For the maximum-likelihood method (c), the branch point between points 5 and 6 was taken as the origin of coordinates, since the position of the root was unknown. The obtained branch lengths were very different from both the present ones and the true ones due to the erroneous topology (fig. 5a
)
|
|
 |
Applications to Vertebrates in the -Hemoglobin
|
|---|
The present theory was applied to a few examples of vertebrates
in the

-hemoglobin, in which the largest number of species have
been registered in the database. The sequences of 255 vertebrate
species were obtained from the Protein Identification Resources
(PIR) database. First, we selected only species with 141 and
142 sites, whose sequences were readjusted so as to be 141 for
the alignment analysis. Second, we excluded species with the
same sequences. As a result, we analyzed 185 species from fish
to primates. The substitution distances were calculated in a
way equivalent to that of the package Protdist (Felsenstein
1991

) with the Dayhoff model, since calculation errors were
found in this package. The present results were insensitive
to the use of other models, such as the Dayhoff-F, the JTT (Jones,
Taylor, and Thornton 1992), and the JTT-F models (Cao et al.
1994).
For the search vectors in the vector analysis, we introduced the CGV Rs to express the representative direction of a group (s) of species in the MVS. It was defined by Rs =
'i Ri/Is, where the prime sign of
'i denotes the summation about the species of the group s and Is is their total number. Consequently, the scatter diagram for the two groups s and t was expressed in terms of the quantities Zi,s and Zi,t, which are the projection components of the species vectors Ri to the two CGVs of Rs and Rt, respectively.
Relationship Among Artiodactyla, Perissodactyla, and Chiroptera
First, the present theory was applied to a typical problem in which it is difficult to get the correct branch patterns by using standard methods such as the ML, NJ, and PA methods. We picked up eight species of Artiodactyla and Perissodactyla and used a rat as the outgroup. It is well known that the consensus topology is [rat, {(Tylopoda, Ruminantia), (Ceratomorpha, Hippomorpha)}] (Novacek 1992
; Benton 1997
). The standard methods gave an erroneous pattern, [Tylopoda, {Ruminantia, (Ceratomorpha, Hippomorpha)}], with high branch resolutions (see fig. 10
). Especially, the ML method gave this strange pattern with 100% confidence (fig. 10a
), although this method gave the correct pattern with 100% confidence in the case of the ß-hemoglobin (data not shown). Additional inclusions of other orders resulted in another problem with the topology. As shown in figure 11
, by adding Chiroptera, Tylopoda still made the first clade of the tree in a way similar to that in figure 10
, and furthermore, Chiroptera were included inside of the ungulate branches. Here, the branch resolutions were rather high for the NJ method (fig. 11b
), while they were very low for the ML method (fig. 11a
). It is well known that Chiroptera are located outside the ungulate branches, since it is rather close to Primates (Novacek 1992
; Benton 1997
).

View larger version (22K):
[in this window]
[in a new window]
|
Fig. 10.Relationship among the suborders of Artiodactyla and Perissodactyla by standard methods. The standard methods of maximum likelihood (a) and neighbor joining (b) gave the strange pattern [Tylopoda, {Ruminantia, (Ceratomorpha, Hippomorpha)}] in spite of high branch resolutions, especially with the 100% confidence in a. Here, eight species of Artiodactyla and Perrisodactyla in the -hemoglobin sequences were selected, and a rat was used as the outgroup
|
|

View larger version (33K):
[in this window]
[in a new window]
|
Fig. 11.Relationship among Artiodactyla, Perissodactyla, and Chiroptera by standard methods. The maximum-likelihood and neighbor-joining methods were applied to the analysis of the 19 species of Ungulates and the 7 species of Chiroptera, which were full inclusions from the PIR database. The relationship among the suborders of Artiodactyla and Perissodactyla remained erroneous in a pattern similar to that in figure 10
. Furthermore, Chiroptera were included inside of the ungulate branches. When the number of species was increased, the branch resolutions became lower
|
|
In the present approach, the positions
Ri of 27 species (their
accession numbers are listed in the
Supplementary Material section)
of
figure 11
were precisely determined by solving the equation
of motion in the 26 dimensions. The position
Rc of the spherical
center was also determined by minimizing the
S value of
equation (2) and defined as the origin of coordinates. We looked at
the CGVs of Artiodactyla and Perrisodactyla as the search vectors
and investigated how the species were distributed around these
two vectors. As a result, the scatter diagram for
Zi,a and
Zi,p showed a clear pattern of orthogonal relation between the two
groups of Artiodactyla and Perissodactyla (
fig. 12a
). This
is analogous to the simulation of
figure 3b
and results in
the branch pattern [Chiroptera, {(Tylopoda, Ruminantia), (Ceratomorpha,
Hippomorpha)}]. Here, the rat and Chiroptera were distributed
as the outgroup with small values of
Zi,a and
Zi,p, since their
vectors were directed at large angles from the CGVs of Artiodactyla
and Perissodactyla.
Figure 12b
gives the result of a full inclusion
of 113 eutherians stored in the PIR database. This figure shows
that the additional inclusion of other species produces a more
stable pattern of the scatter diagram.

View larger version (24K):
[in this window]
[in a new window]
|
Fig. 12.Relationship among Artiodactyla, Perissodactyla, and Chiroptera by the present method. a shows the result of the vector analysis using the same species as in figure 11
. Looking at the two directions of the center-of-gravity vectors (CGVs) of Artiodactyla and Perisodactyla as the search vectors, we obtained the scatter diagram for Zi,a and Zi,p. The plots represented a clear orthogonal relation between these two orders: they were similar to those of the simulation of figure 3b
and resulted in the topology [Chiroptera, {(Tylopoda, Ruminantia), (Ceratomorpha, Hippomorpha)}], since other groupings of the search vectors did not give the orthogonal relation (c). Here, the rat and Chiroptera gave small x and y values, since their vectors were directed at large angles from the CGVs of Artiodactyla and Perissodactyla. It is here noted that the rat and Chiroptera are also directed at both large angles and long distances from each other, although they were located close together in this figure. The stability of the topology was examined by a full inclusion of 113 eutherians stored in the PIR database. As shown in b, the branch pattern of the suborders of Artiodactyla and Perissodactyla was consistent with that of a
|
|
It is here noted that the orthogonal relation of
figure 12a and b
is unique for the branch pattern [Chiroptera, {(Tylopoda,
Ruminantia), (Ceratomorpha, Hippomorpha)}]. The results (
fig. 10 ) of the standard methods supported a grouping of Ceratomorpha,
Hippomorpha, and Ruminantia. However, if the topology of
figure 10 were true, the angles between Tylopada and Perissodactyla
would become larger than those between Ruminantia and Perissodactyla.
As a result, an orthogonal relation such as that in
figure 12a and b
could not be satisfied. The uniqueness was also confirmed
by showing the nonorthogonal relation in the other groupings
(Ceratomorpha + Ruminantia and Hippomorpha + Tylopoda) for the
CGVs (
fig. 12c
).
Drawing Out of Lineage of Mammals and Birds
The positions Ri of 185 vertebrates were determined in the 110-dimensional size in which the average error of Ri,j versus Di,j was sufficiently small, i.e., 0.08%. The spherical center Rc of vertebrates was obtained by using equation (2)
and defined as the origin of coordinates, from which the species vectors Ri were measured. The CGVs Rm and Rb of mammals and birds were calculated and regarded as their representative directions in the MVS, respectively. Figure 13
shows the angular distribution of the vectors Ri around the CGV Rm. Here, the x-axis denotes the projection components Wi,m = (Zi,m)1/2 of the species vectors Ri to the vector Rm. The angles show roughly a chronological order of evolution of the species, since the branching events of the tree at earlier stages of evolution yield larger angles. The distribution was well fitted to the dotted curve (given by W = R cos(
) with a constant radius R = 8.5, which is the average value of |Ri|.
To obtain the scatter diagram for
Zi,m and
Zi,b, we shifted
the center position toward the directions of the CGVs of mammals
and birds by using the position
Qc given by a proportional relation,

| (9) |
with a constant
C. Here,
the species vectors
Ri were redefined as measured from the new
position
Qc. Here, the
C value may be adjusted for the first
few species of fishes so as to be placed near the new origin
of coordinates, since the vectors of these fishes become orthogonal
to the CGVs of
Rm and
Rb. However, the scatter diagram patterns
of the drawn-out species were insensitive to the
C values (
C = 0 - 0.35).
Figure 14
gives the scatter diagram (with C = 0.2) consisting of the three parts: the central line, the left branch, and the right branch. This figure showed a clear splitting into three groups of the central line and the left and right branches. The central line included fishes, amphibians, and squamates to form the outgroup. Fishes began with a coelacanth, a carp, and a dragonfish and ended with a goldfish. The central line included fishes, newts, frogs, and squamates. The left branch represented the lineage of birds. Crocodilians, turtles, and tuataras belonged to this lineage. In this way, the extant reptilians were clearly classified into the two groups of squamates and others. On the other hand, the right branch represented the mammalian lineage, which began with Monotremata (a platypus and an echidna) and Marsupialia (an opossum) at the base of the branch. They were followed by a guinea pig, a hedgehog, a manatee, and a whale, which are considered to be the core member species of the mammalian radiation. Primates were located at the uppermost part of the right branch.

View larger version (34K):
[in this window]
[in a new window]
|
Fig. 14.Scatter diagram for mammals and birds in 185 vertebrates. The x and y values denote the projection components Zi,m and Zi,b of the species vectors Ri to the center-of-gravity vectors (CGVs) of mammals and birds, respectively. The scatter plots were clearly decomposed into the three parts, i.e., the left branch of birds, the right branch of mammals, and the central part of the others. The extant reptilians were split into the two branches: squamates belonged to the central part, while the others belonged to the lineage of birds. The central part included fishes, amphibians, and squamates. On the other hand, the right branch began with Monotremata (a platypus and an echidna) and Marsupialia (an opossum) at the base of the mammalian lineage. They were followed by a guinea pig, a hedgehog, a manatee, and a whale, which are considered to be the core member species of the mammalian radiation. Primates were found on the uppermost part of the right branch. There appeared a vacuum at the base of the two branches of birds and mammals. This is regarded as the result of the extinction of reptilians. The species of the central part took small values of Zi,m and Zi,b, since their vectors were directed at large angles from the CGVs of birds and mammals. The species vectors of different classes such as fishes and amphibians were directed at large angles from each other, although these species were closely located in the plots of this figure, since we looked only at the two directions of mammals and birds
|
|
From a morphological viewpoint, the origin of Monotremata has
been regarded as much older than that of the opossum. In the
present analysis, however, they had similar values of
Zi,m in
spite of their mutual long distances. This result is consistent
with the proposal of Gemmell and Westerman (1994)

that they
diversified simultaneously (Cao et al. 1994; Janke et al. 1994).
It is interesting to see that Monotremata shifted to the direction
(upward) of birds compared with the opossum.
There appeared to be a vacuum at the base of the two branches of birds and mammals. This is regarded as the result of the extinction of reptilians. Finally, it is noted that the line along the plots of the bird lineage is not orthogonal to that of the mammalian lineage (see the solid lines of fig. 14
). This suggests that substitutions of higher orders than the SSM may exist in the sequences between birds and mammals. The effect of these substitutions may be related to the well known underestimation of the molecular clock between mammals and birds (Zuckerkandl and Pauling 1965
). This problem will be discussed in detail elsewhere.
Relationship Among Birds, Squamates, and Other Extant Reptilians
The phylogeny of reptilians has been one of the most important problems in vertebrate evolution. From a morphological point of view, the number of temporal openings in the skull has been used as an important clue to the classification of species (Romer 1966
; Benton 1997
): mammals have a single opening, the turtle lacks a temporal opening, and the other extant reptilians have two openings. In this way, birds and crocodilians formed one group, tuataras and squamates formed another group, and the turtle was placed at the base of the tree (fig. 15a
). On the other hand, the concept of a sister group of birds and mammals was proposed (Gardiner 1982
; Løvtrup 1985
) and was supported by molecular analyses using the sequences of the myoglobin and ß-hemoglobin (Goodman, Miyamoto, and Czelusniak 1987
; Bishop and Friday 1988
). In recent molecular analyses (Gorr, Mable, and Kleinschmidt 1998
; Hedges and Poling 1999
), however, the turtle was joined with crocodilians, and squamates were placed at the base of the reptilian tree. If these results are accurate, the traditional classification based on the number of temporal openings becomes meaningless for phylogeny, and the morphological and paleontological evidence then remains unclear.

View larger version (14K):
[in this window]
[in a new window]
|
Fig. 15.Relationship among birds, squamates, and other extant reptilians. a, The morphological and paleontological view. b and c, Recent results (Gorr, Mable, and Kleinschmidt 1998
; Hedges and Poling 1999
) with use of the maximum-likelihood method, respectively. b and c show a Turtle-crocodilian sister group, different from a. The present result (d) is basically similar to them, but did not form the subgroup within reptilians, since all of tuataras, turtles, and crocodilians were strongly coupled with birds
|
|
In the present approach, squamates were clearly isolated from
the other extant reptilians, whose vectors were directed at
small angles from the CGV of birds. On the other hand, the vectors
of squamates, amphibians, and fishes were directed at large
angles from the CGVs of birds and mammals so that they may have
had small values of
Zi,m and
Zi,b, although they were also directed
at large angles from each other.
Figure 14
shows that the first
coupling to birds is the tuatara, the second is the turtle,
and the third is crocodilians (their accession numbers are listed
in the
Supplementary Material section). Then, we have the topology
of
figure 15d,
in contrast to the morphological view (
fig. 15a
). This figure was also compared with the recent results
of
figure 15b
(Gorr, Mable, and Kleinschmidt 1998

) and
c (Hedges
and Poling 1999

) by the ML method, which shows a sister group
between crocodilians and turtles. The present analysis did not
give this subgroup (
fig. 15d
), since birds were always and
strongly drawn out by taking tuataras, turtles, or crocodilians
in place of birds as the search vector. This situation was made
clearer by the scatter diagram for the CGVs of crocodilians
and birds (
fig. 16
): the turtle was coupled more strongly with
birds than crocodilians, in a way quite similar to that shown
in
figure 3b.
Here, a large value (
C = 0.35) was used in order
to enlarge the relationship among crocodilians, turtles, tuataras,
and birds.
 |
Discussion
|
|---|
Molecular evolution has traditionally been studied with cluster
models (Sokal and Michener 1958

; Eck and Dayhoff 1966

; Fitch
and Margoliash 1967

; Sokal and Rohlf 1981
![Citation]()