MBE Advance Access originally published online on May 26, 2006
Molecular Biology and Evolution 2006 23(8):1558-1573; doi:10.1093/molbev/msl019
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Article |
Quantitative Trait Associated Microarray Gene Expression Data Analysis
Department of Botany and Plant Sciences, University of California, Riverside
E-mail: xu{at}genetics.ucr.edu.
| Abstract |
|---|
|
|
|---|
Selection on phenotypes may cause genetic change. To understand the relationship between phenotype and gene expression from an evolutionary viewpoint, it is important to study the concordance between gene expression and profiles of phenotypes. In this study, we use a novel method of clustering to identify genes whose expression profiles are related to a quantitative phenotype. Cluster analysis of gene expression data aims at classifying genes into several different groups based on the similarity of their expression profiles across multiple conditions. The hope is that genes that are classified into the same clusters may share underlying regulatory elements or may be a part of the same metabolic pathways. Current methods for examining the association between phenotype and gene expression are limited to linear association measured by the correlation between individual gene expression values and phenotype. Genes may be associated with the phenotype in a nonlinear fashion. In addition, groups of genes that share a particular pattern in their relationship to phenotype may be of evolutionary interest. In this study, we develop a method to group genes based on orthogonal polynomials under a multivariate Gaussian mixture model. The effect of each expressed gene on the phenotype is partitioned into a cluster mean and a random deviation from the mean. Genes can also be clustered based on a time series. Parameters are estimated using the expectationmaximization algorithm and implemented in SAS. The method is verified with simulated data and demonstrated with experimental data from 2 studies, one clusters with respect to severity of disease in Alzheimer's patients and another clusters data for a rat fracture healing study over time. We find significant evidence of nonlinear associations in both studies and successfully describe these patterns with our method. We give detailed instructions and provide a working program that allows others to directly implement this method in their own analyses.
Key Words: cluster analysis EM algorithm mixture model orthogonal polynomial phenotype
| Introduction |
|---|
|
|
|---|
Selection may act on phenotypes and thus on the underlying genetic components in an indirect manner. One class of genes that are important in evolution are the genes involved in developmental pathways because developmental processes shape the morphologies of species. In recent years, with the newly emerged microarray technology, great effort has been made to identify genes that are expressed in particular developmental stages according to specific patterns (Arbeitman et al. 2002
Microarray technology was developed only recently (Schena et al. 1995
). However, it already has a tremendous impact (i.e., Lander 1999
; West et al. 2001
). A single chip can measure the expression of thousands of genes simultaneously. Microarray data can and should be analyzed according to the purposes of the experiments and the data structures. Differential expression analysis aims at detecting genes differently expressed between groups, that is, control and treatment (i.e., Schena et al. 1995
; Baldi and Long 2001
; Tusher et al. 2001
; Dudoit et al. 2002
), whereas cluster analysis classifies genes into different groups (Eisen et al. 1998
; Quackenbush 2001
; Yeung et al. 2001
; Qu and Xu 2004
). In this paper, we focus on grouping or clustering profiles. Previous clustering is based on the similarity of expression profiles across discrete conditions, for example, tissues collected from different types of organs (Eisen et al. 1998
), time points after a particular drug treatment (Saban et al. 2001
; Luan and Li 2003
), and doses of a drug injection (Peddada et al. 2003
). The most commonly used methods include hierarchical clustering (Carr et al. 1997
), K-means (Tavazoie et al. 1999
), and self-organizing maps (Tamayo et al. 1999
). These algorithms are "heuristic" algorithms that classify genes based on the "distance matrix" and do not require any underlying statistical models. As a possible alternative to the heuristic algorithms, model-based clustering methods have also been proposed and applied to microarray data (Fraley and Raftery 1998
; Yeung et al. 2001
; Fraley and Raftery 2002
; Qu and Xu 2004
). An advantage of using the model-based algorithms is that the uncertainty of cluster assignment can be evaluated. In addition, alternative models can be exploited and compared.
Gene expression profiles have also been used for classification of tissues for disease diagnosis (West et al. 2001
; Bicciato et al. 2003
; Liu et al. 2003
; Meireles et al. 2004
; Tadesse et al. 2005
). Classification methods are conceptually the same as gene clustering in the sense that the data analyses are conducted with rows and columns being transposed. Special techniques have been developed for tissue classification, for example, dimension reduction (Liu et al. 2003
) and variable selection (Tadesse et al. 2005
).
Clustering based on discrete groups or classification prediction based on discrete categories is valuable. However, there are many evolutionary interesting phenotypes that are quantitative in nature. The association between quantitative phenotype and gene expression can be examined pairwise using a Pearson correlation coefficient between the expression of a single gene and a continuous phenotype (Quackenbush 2001
; Kraft et al. 2003
). Taking this one step further, Blalock et al. (2004)
ranked genes according to the correlation coefficients of gene expression with Mini Mental Status Examination (MMSE), a quantitative measurement of severity of Alzheimer's disease (AD), and detected many genes associated with AD. Kraft et al. (2003)
used the within-family correlation analysis to remove the effect of family stratification. Although the Pearson correlation coefficient is a straightforward way to assess a linear relationship between a single gene and a quantitative phenotype, it is limited to linear relationships with a single gene/phenotype combination. Potokina et al. (2004)
investigated the association of gene expression with 6 malting quality phenotypes (quantitative traits) of 10 barley cultivars. They compared the distance matrix for each gene expression profile among the 10 cultivar with each of the distance matrix calculated from the phenotypes and applied a g-test statistic. This approach allows the examination of multiple phenotypes but still focuses on linear associations with a single gene. The linear association between groups of genes and a quantitative phenotype has been described in a regression setting (Jia and Xu 2005
) and for groups of genes using factor analysis (Coffman et. al. 2005
). However, the relationship between the phenotype of interest and the gene expression profile may not be linear. For example, in the case of developmentally regulated genes, we may expect to see exponential or Poisson distribution.
In the AD microarray experiment of Blalock et al. (2004)
, we have seen a concrete example of nonlinear association between gene expression and MMSE. Figure 1 plots expression profiles (on the y-axis) against MMSE phenotype (on the x-axis) for 6 genes. Gene (ABCF1) shows no obvious association with the phenotype. Gene (37870) and (MYO1F) show positive and negative associations, respectively, with the phenotype. The remaining 3 genes (CALB1, GDF8, and ANKH) show nonlinear associations with the phenotype. Existing methods developed solely for detecting the linear association (Blalock et al. 2004
; Jia and Xu 2005
) will not identify an association between phenotype and gene expression in these cases (CALB1, GDF8, and ANKH). However, a method able to identify nonlinear association should identify the association between these genes and the phenotype MMSE.
|
In this study, we develop a novel algorithm to cluster genes based on nonlinear association between gene expression and a quantitative phenotype using orthogonal polynomials under the Gaussian mixture model. In our model, the whole data set is assumed to be a mixture of K clusters. The expression of each gene is considered as the sum of a cluster mean (fixed effect), a normally distributed random effect specific for the gene (random effect), and a random measurement noise. We use a linear contrast to remove the mean expression of each gene, which is not of interest, from the model, so that we can cluster gene based solely on the pattern of association of their expression with the phenotypic values. Parameters in the model are estimated with the expectationmaximization (EM) algorithm (Dempster et al. 1977
| Materials and Methods |
|---|
|
|
|---|
Orthogonal Polynomials of Gene Expression Profile
Let Z be the phenotypic value of a quantitative trait (a continuous variable) and Yi(Z) be the expression level of gene i = 1, ..., M written as a function of the phenotype, where M is the total number of genes studied. The following model may be used to describe the relationship between Yi(Z) and Z
![]() | (1) |
i(Z) is an arbitrary function chosen to describe the relationship between the expressed gene and the phenotype and
i is a random error reflecting the uncertainty of the fit of the model with an assumed N(0,
2) distribution. When describing gene expression as a function of phenotype, we have borrowed the idea of Luan and Li (2003)
Note that function
i(Z) has a subscript i, meaning that this function varies from gene to gene. In the model-based cluster analysis, Luan and Li (2003)
further partitioned
i(Z) into a fixed effect function and a random effect function. The fixed effect function is a mixture of several functional clusters, depending on which cluster the ith gene belongs to. For example, if the ith gene belongs to the kth cluster (k = 1, ..., K), the mixed-effects model is
|
| (2) |
i(Z) is a gene-specific random effect function. As a result, model (2) is a functional mixed-effects model.
All the model parameters, except
2, are a function of Z. The functional relationships between the parameters and Z may be described with an orthogonal polynomial. Define
as a basis of the polynomial with order d. A basis is also called a submodel. A method to construct the basis of orthogonal polynomial is given in Supplemental Material A (Supplementary Material online). Let us define
as a vector of model effects representing the mean values of cluster k. The phenotype associated model effect ßk(Z) can be described as a linear combination of ßk weighted by the coefficients of the polynomials, that is, ßk(Z) =
(Z)ßk. Similarly, we can describe
i(Z) by
i(Z) =
(Z)
i, where
Because
i is a vector of random effects, we assume that
i is i.i.d. N(0,
), where
is a (d + 1) x (d + 1) positive definite covariance matrix. Note that ßk(Z) and
i(Z) are scalars, but ßk and
i are vectors each with d + 1 elements.
Assume that the ith gene belongs to the kth cluster, with the polynomial reparameterization, model (2) can be now rewritten as a linear function of the phenotype-independent parameters,
![]() | (3) |
![]() | (4) |
![]() | (5) |
(Z, S) = 1 for Z = S and
(Z, S) = 0 otherwise.
Model (3) is linear with the model parameters, which can be estimated using the mixed model framework. We first estimate
, and
2 and then use the basis of the polynomial to convert ß and
into their phenotype-dependent functional parameters.
Likelihood Function of Normalized Gene Expression
Assume that the expression profiles for M genes are measured in N microarrays, where each microarray represents a single subject. The subjects are sorted by the phenotypic values in either ascending or descending order and denoted by Z1, ..., ZN. We use an N x 1 vector to denote the expression levels of gene i collected from N subjects,
for i = 1, ..., M, where M is the total number of genes. Define
![]() | (6) |
0(Zj) = 1 for all j = 1, ..., N (see Supplemental Material A for details on the method of constructing matrix
, Supplementary Material online). Let Ci be a cluster indicator variable for gene i, for example, Ci = k if i belongs to cluster k. In matrix notation, the linear model for vector Yi is
![]() | (7) |
is used to indicate that the model is presented conditional on gene i belonging to cluster k and
is now an N x 1 vector for the residual effects with a multivariate normal distribution, that is,
i
N(0, I
2), where I is an N x N identity matrix.
There are 2 points that need to be examined with care. One is that the phenotypic value Z needs to be rescaled before it enters the polynomial analysis. This is to ensure that the shape of the curve rather than the mean value is clustered. The rescaled variable, zj, runs between 1 and +1, that is, 1
zj
+ 1 (Hayes 1974
). The other point is that we assume all preprocessing for the gene expression profiles has already occurred (e.g., transformation). Let Yij = Yi(zj) be the log-transformed expression for gene i measured from subject j; then the normalized expression is
![]() | (8) |
i0, have been removed from the model (see Supplemental Material B for the proof, Supplementary Material online). In addition, we do not have N independent observations of the normalized expression for each gene. Therefore, the last element of vector yi should be excluded and yi becomes an n x 1 vector for n = N1. Accordingly, ßk and
i each becomes a d x 1 vector,
is a d x d matrix, and
becomes a d x n matrix. Finally, the vector of residual errors for the normalized data is redefined as L
i, where L (an n x N matrix) is obtained by deleting the last row of matrix
(see Supplemental Material B for the definition and derivation of this matrix, Supplementary Material online). Therefore, the covariance matrix of the normalized residuals is
![]() | (9) |
Given the parameters
= {ß,
,
2,
) and the data D = {y, z}, the likelihood function can be constructed. For a gene in cluster k, the model is
![]() | (10) |
![]() | (11) |
![]() | (12) |
Let
be the prior probability that a randomly selected gene belongs to cluster k. Vector
represents the mixing proportions. The probability density of the ith gene from the kth cluster is
![]() | (13) |
![]() | (14) |
EM Algorithm for Parameter Estimation
With the EM algorithm, we classify variables into data denoted by D = {y, z}, missing values denoted by U = {C,
}, and parameters
. The likelihood function given in equation (14) with all the missing values integrated out is the observed likelihood function. Instead of maximizing the observed likelihood function, the EM algorithm deals with the following complete data likelihood function,
![]() | (15) |
![]() | (16) |
![]() | (17) |
![]() | (18) |
(Ci, k) = 1 for Ci = k and
(Ci, k) = 0 for Ci
k. Because the complete data likelihood function involves missing values, integration (or expectation) is taken with respect to the missing values, not for the complete data likelihood function (15) but for the log-transformed complete likelihood function given below,
![]() | (19) |
![]() | (20) |
![]() | (21a) |
![]() | (21b) |
![]() | (22) |
![]() | (23) |
The EM algorithm requires taking the partial derivatives of l(
) with respect to the parameters, setting these partial derivatives equal to zero and finding explicit solutions for the parameters as a function of the missing values. This completes the maximization step. The expectation step requires taking the expectations of all terms involving the missing values. Because the complete data likelihood function is just a likelihood function for a typical mixed model, we can simply utilize existing mixed model EM algorithm to find the MLE of parameters (Dempster et al. 1977
). Followings are the EM steps for the mixed model analysis.
- Step 0: Let t = 0 and initialize all parameters with values in their legal domain, denoted by
(t).
- Step 1 (E1): Compute the posterior probabilities of cluster assignment for each gene

(24)
- Step 2 (E2): Find the posterior distribution of the random effect. This posterior distribution turns out to be a mixture of K normal distributions. If gene i belongs to cluster k, then the conditional expectation is
and the covariance matrix
(25) 
(26)
- Step 3 (M1): Update the regression coefficients for each cluster

(27)
- Step 4 (M2): Update the covariance matrix of the random effect

(28)
- Step 5 (M3): Update the residual variance

(29)
- Step 6 (M4): Update the mixing proportions

(30)
- Step 7: Set t = t + 1 and repeat from step 1 to step 6 (2 E-steps and 4 M-steps) many times until ||
(t+1)
(t)||
108 holds.
- Step 1 (E1): Compute the posterior probabilities of cluster assignment for each gene
The Optimal Number of Clusters
The above EM algorithm applies to situations where the number of clusters K is fixed. In reality, K must be estimated from the data. The commonly used method for inferring K is to calculate the BIC and choose K that maximizes the BIC value (Schwarz 1978
). BIC is considered as an approximation to the Bayesian factor and is defined as
![]() | (31) |
is the MLE of the parameter vector
under K clusters and
is the dimension of
, that is, the number of parameters in the model.
After the EM algorithm converges and the number of clusters is determined, we obtain all the estimated parameters. We also have the posterior probability that the ith gene belongs to the kth cluster,
for i = 1, ..., M and k = 1, ..., K. Based on these probabilities, we can assign the ith gene into the kth cluster if
The
ik also provides a measurement of the uncertainty about the resulting clustering. We can also try only cluster those genes with the maximum
ik greater than a predetermined cutoff value and declare other genes as unresolved (Luan and Li 2003
).
After gene clustering, we obtain estimates of the gene expression profiles. For the ith gene in the kth cluster, the BLUP of the random regression coefficients
i is given in equation (25) and is now denoted by
Let
be the MLE of ßk after the EM algorithm converges. The corresponding estimate of the individual gene expression profile for the ith gene in the kth cluster is
![]() | (32) |
is a vector of the BLUP estimates of the ith gene expressions on n subjects. The actual gene expression profile is
![]() | (33) |
against z. | Results |
|---|
|
|
|---|
Simulation Study
In order to validate the method and the computer program that we developed to implement the method, we undertook 2 simulation studies.
Data Set I
In this data set, we simulated M = 1200 genes from K = 5 clusters. The first cluster contained 400 genes, whereas each of the other 4 clusters contained 200 genes. The expression levels of each gene were measured from N = 50 subjects. The phenotypic value for each subject Zj was simulated from U(0.75, 1.25), a uniform distribution from 0.75 to 1.25. The phenotypic values were then rescaled to zj so that 1
zj
1. The order of the polynomial regression was d = 4. To simulate the expression values of genes in the kth cluster for all the 50 subjects, we used the following model to generate the data,
![]() | (34) |
![]() |
was sampled from a N(0,
) distribution, where
![]() |
ij were sampled from a N(0,
2) distribution, where
2 = 0.16. The true proportions of the genes in the 5 clusters were
=[0.333 0.167 0.167 0.167 0.167]. The simulated gene expression data were then normalized using yi = LYi, where
![]() |
i0} were automatically deleted so that the parameters subject to estimation were
![]() |
![]() |
We have implemented the algorithm in SAS, which is available as supplemental material online. The cluster analysis was performed using a variable number of clusters, K = 3, ..., 9. The maximum BIC value occurred when K = 5, the true number of clusters simulated. Under K = 5, we obtained the following estimated parameters:
![]() |
![]() |
![]() |
The overall correct clustering rate for the whole data set was 98%, which was distributed as 390/400, 194/200, 199/200, 200/200, and 194/200, respectively, for the 5 clusters.
The true and estimated curves for the 5 clusters are depicted in figure 2. The true curve for each cluster was generated using
![]() | (35) |
![]() | (36) |
is the estimate of the mean of cluster k.
|
We used
![]() | (37) |
|
Data Set II
In this simulation experiment, we generated M = 500 genes from K = 6 clusters. The first cluster contained 4000 genes, and each of the other 5 clusters contained 200 genes. Variables Z and Y were simulated in the same way as described for data set I. But this time, we let
![]() |
= I4x4 x 0.005, where I is the identity matrix and
2 = 0.01.
Using this data set, we compared our method with K-means. We implemented the K-means with SAS procedure "FASTCLUS" (SAS Institute 1999
). The performance of our method was comparable to or slightly inferior to the K-means method for the small clusters (Clusters 26). However, it did much better in recognizing genes in the larger cluster (Cluster 1) and had a much better overall correct classification rate (see table 1). With our method, the estimated parameters were
![]() |
![]() |
which were quite close to the true parameter values.
|
We conducted additional simulations with a wide range of parameters and sample sizes. Results of all simulation experiments showed the following expected trends (data not shown): 1) errors of cluster assignments decreased as the differences among the cluster means increased and 2) estimation errors of the parameters decreased as the number of genes within the cluster increased.
Application to AD
Blalock et al. (2004)
measured hippocampal gene expressions from 9 normal persons (controls) and 22 affected persons (cases), N = 9 + 22 = 31. Each subject had a disease severity measurement, MMSE, whose value varied from 2 (most severe) to 30 (normal). The MMSE is a continuous measure that has been reported as a reliable index of AD-related cognitive status at a given point in time (Miller et al. 2001
). All the subjects were assigned to 1 of the 4 groups (control, incipient, less severe, and most severe), reflecting the different levels of AD severity based primarily on the MMSE score (Blalock et al. 2004
). The original scores and the orthogonal polynomials of the MMSE for the 31 subjects are shown in table 2.
|
The expression levels of 22 286 genes were measured from 31 subjects (microarray chips). Here we choose a subset of 9754 genes for analysis based on 2 criteria used by Blalock et al. (2004)
We analyzed the data using a SAS program (available from supplemental material online) where d = 4 was chosen as the order of the polynomial regression. We tested the number of clusters from K = 2 to 20 using the BIC score and found that the BIC scores reached its maximum at K = 14. Therefore, the following paragraphs only report the results when K = 14.
The estimates of the cluster mean
and mixing proportion
for each cluster are presented in table 3. The estimated covariance matrix was
![]() |
We assigned each gene to the cluster with the highest posterior probability
Table 3 also shows the actual numbers of genes assigned to each cluster.
|
In the simulated data, the expression profile of each cluster was predicted using equation (36), and the gene-specific expression profile was predicted using equation (37) for each of the 14 clusters. The profiles of all genes for the 14 clusters are depicted in figure 4.
|
The first cluster contained 93.5% of the genes whose expression do not change significantly with the phenotypic value. Clusters 8, 10, 11, 12, and 13 accounted for 0.65%, 0.15%, 0.63%, 0.03%, and 3.01% of all genes, respectively. These genes were mainly up regulated with the MMSE score. Clusters 6, 7, and 14 accounted for 0.16%, 0.25%, and 0.49% of all genes, respectively. These genes were mainly down regulated with the MMSE score. Clusters 2, 3, 4, 5, and 9, which accounted for 0.33%, 0.06%, 0.43%, 0.11%, and 0.16%, respectively, represented the genes whose expression changes with the phenotypic values in a nonlinear fashion.
To compare the predicted with the observed gene expressions, we randomly selected one gene from each cluster and presented the comparisons in figure 5. It shows that the predicted profiles do reflect the general trends indicated by the observed data.
|
Blalock et al. (2004) studied this data set using Pearson correlation analysis. Among the 9754 genes, 2852 of them (28.6%) were detected as differentially expressed (nominal P < 0.05) based on the correlation coefficients. In the AD data set, some of the genes classified into clusters 214 are not detected as differentially expressed by either correlation analysis or the linear association method.
Application to the Rat Fracture Healing Data
To demonstrate that our method can be applied to time-course gene expression data, we examined an experiment that studied rat fracture healing. The original microarray experiment was published by Li et al. (2005)
. In all, 21 rats were used in the study. A standard closed fracture was created in the right femur of each rat. Tissues were collected at different postfracture time points (0d, 1d, 4d, 7d, 14d, 21d, and 28d), each with 3 replicates. The expression levels of 12 434 genes were measured using Affymetrix Genechips. We picked a subset of 12 124 genes that expressed on at least 1 of the 21 arrays. The raw data were log transformed, and we took the average of the 3 replicates as the measurement of gene expression for each time point. The mean expression of each gene across the time course was removed by the linear contrast strategy described earlier.
Luan and Li (2003)
used a mixed-effect model with B-splines to analyze gene expression in yeast cell cycle and identified 5 major clusters. We used the method of Luan and Li (2003) to fit cubic B-splines with 2 internal knots to classify gene expression profiles in the rat fracture healing experiment. The algorithm for generating the B-spline basis was initially described by De Boor (1978)
. Unlike the yeast cell cycle data set, in which most genes were differentially expressed, there were only a small proportion of genes in the rat fracture healing experiment whose expression were significantly altered in the time course.
We varied the number of clusters from K = 5 to 70 incremented by 5 and found that K = 35 gave the maximum BIC score. Among the 35 clusters, one large cluster contained 10 235 genes (84.4%) whose expression did not change during the time course, whereas the expression of genes in the other clusters evolved with time in different patterns. See Supplemental Material C (Supplementary Material online) for the estimated cluster mean and mixing proportion of each of the 35 clusters.
In order to compare the results of B-splines and polynomials, we reclassified the data with orthogonal polynomials of order d = 3. Again, we varied the number of clusters from K = 5 to 70 incremented by 5 and found that K = 30 gave the maximum BIC score. Similar to the B-splines approach, the majority of the genes (10 017 out 12 124 genes) were assigned to a large cluster with no apparent change in expression (neutral cluster), whereas a relatively small number of genes (2107 out of 12 124 genes) were assigned to other clusters (see Supplemental Material C for more information, Supplementary Material online). Among the genes placed in clusters other than the neutral one using either the B-splines or the polynomial approach, 1663 of them (79%) were identified as differing from a flat profile by both methods. Therefore, the degree of consistency between the 2 methods is quite high. When the number of time points of measurement is small (7 time points in this experiment) or the curves to be estimated are simple, the method of orthogonal polynomials is perhaps more preferable for biologists because it is simple and easy to understand. If the curves are arbitrarily complex, B-splines should be more preferable.
| Discussion |
|---|
|
|
|---|
In this paper, "linear association" refers to the first order (or straight line) association between 2 variables. Nonlinear kinetics of gene regulation has been investigated by many authors (e.g., Kim et al. 2000
We also demonstrate that our method can be applied to time-course experiments. Luan and Li (2003)
used a mixed-effects model to analyze the time-course gene expression data for yeast cell cycle (Spellman et al. 1998
). They used B-splines basis of the time points as the design matrix. We used orthogonal polynomials with the time points as the "phenotype" and then compared our result with that of Luan and Li (2003)
for the time-course expression data of fracture healing in rats (Li et al. 2005
). We classified the data with both B-splines basis and orthogonal polynomials. In both cases, we found that the majority of the genes were expressed at relatively constant levels, whereas small subsets of the genes showed patterns that changed over the time course. Among the genes classified as different over time with the B-splines, 88% (1663/1889) were also classified as different over time with the polynomials, and 79% (1663/2107) of the genes classified as different over time with polynomials were also classified as different over time with B-splines. This result indicates a high degree of consistency between the 2 methods.
A B-spline can be described as a piecewise polynomial spline. Instead of fitting one polynomial curve across the data point, it fits different curves in different regions. B-splines require specification of series of internal knots (breaking points), for example, time points or phenotypic values. The polynomial curves are, roughly speaking, equivalent to B-splines with no internal knots. There is no best way to choose the number of locations of the knots, and the choice may have an impact on the results (De Boor 1978
). The flexibility of B-splines makes them particularly suitable to fit complex or periodical curves, for example, the curve of a sine function. To do so effectively, however, a relatively large number of measurements must be made on each subject. Current gene expression data sets do not have the sampling density to effectively estimate complex curves.
Generally speaking, observations with n treatments can be used to fit polynomial curves up to (n 1)th order, and a higher order always results in better fit. For real data sets, because of the underlying complexity of the data structures, the optimal number of clusters could be large, which results in a great number of parameters to fit. The optimal number of clusters tends to increase with the order of the polynomials used in the analysis. So, increasing the order even by one may result in a much more complex model without much additional information. We selected d = 4 for the AD data set and d = 3 for the fracture healing data. We examined other values of d around 4 for the AD data and other d values around 3 for the fracture healing data and found that the results were virtually the same, indicating that these were reasonable choices for the order of the polynomial in these data sets.
Theoretically, the optimal order (d) can be found using the BIC score because it determines the dimension of the model. In practice, however, it is tedious because the number of clusters also determines the dimension of the model. Consider the combination of the order of polynomials and the number of clusters, the potential number of models for evaluation will quickly reach the limit of our capability. Therefore, we suggest that choosing d = 3 or d = 4 will be sufficient to describe the complexity of microarray data currently available. If the association is linear or quadratic, choosing d > 2 will overfit the model. This does not present a serious problem because the higher order regression coefficients will most likely have estimates close to zero, and the fitted curve will be effectively linear or quadratic. Recall that the shape of a curve is determined by both the order and the regression coefficients. Two drastically different curves may be described by polynomials of the same order but with quite different regression coefficients.
In analysis of the AD microarray data, we found that the EM algorithm was sensitive to the initial values of the parameters. This is a well-known phenomenon that the EM algorithm is a "hill-climbing" approach. The algorithm can only guarantee the reach of a local maximum. In attempt to find the global maximum, we explored many different initial values and chose the solution with the highest converged likelihood value.
In summary, we proposed a new method to identify genes associated in a nonlinear fashion with a quantitative trait using microarray gene expression data. In order to make inferences about the evolutionary processes, the nonlinear relationship between gene expression and phenotype may be of interest in exploring functional and molecular evolution.
| Supplementary Material |
|---|
|
|
|---|
Supplementary Materials A, B, and C are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
| Acknowledgements |
|---|
|
|
|---|
We are grateful to Drs Blalock and Landfield at the University of Kentucky for providing the microarray data and the AD phenotypes. We also thank Dr Xinmin Li at the University of Chicago for providing us with the complete microarray data of the rat fracture healing experiment. We are particularly grateful to the Associate Editor Laurent McIntyre and 3 anonymous reviewers for their constructive comments and suggestions that significantly improved the presentation of the manuscript. The research was supported by the National Institutes of Health grant R01-GM55321 and National Science Foundation grant DBI-0345205 to S.X.
Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health grant R01-GM55321.
| Footnotes |
|---|
Lauren McIntyre, Associate Editor
| References |
|---|
|
|
|---|
Arbeitman MN, Furlong EEM, Imam F, Johnson E, Null BH, Baker BS, Krasnow ME, Scott MP, Davis RW, White KP. 2002. Gene expression during the life cycle of Drosophila melanogaster. Science 297:22705.
Baldi P, Long AD. 2001. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17:50919.
Baugh LR, Hill AA, Slonim DK, Brown EL, Hunter CP. 2003. Composition and dynamics of the Caenorhabditis elegans early embryonic transcriptome. Development 130:889900.
Bicciato S, Luchini A, Di Bello C. 2003. PCA disjoint models for multiclass cancer analysis using gene expression data. Bioinformatics 19:5718.
Blalock EM, Geddes JW, Chen KC, Porter NM, Markesbery WR, Landfield PW. 2004. Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci USA 101:21738.
Carr DB, Somogyi R, Michaels G. 1997. Templates for looking at the gene expression clustering. Comput Stat Graph Newsl 8:209.
Coffman C, Wayne ML, Nuzhdin SV, Higgins LA, McIntyre LM. 2005. Identification of co-regulated transcripts affecting male body size in Drosophila. Genome Biol 6:R53.[CrossRef][Medline]
De Boor C. 1978. A practical guide to splines. New York: Springer.
Dempster AP, Laird NM, Rubin DB. 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc B 39:138.
Dudoit S, Yang YH, Callow MJ, Speed TB. 2002. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sin 12:11139.
Eisen MB, Spielman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95:148638.
Fraley C, Raftery AE. 1998. MCLUST: software for model-based cluster analysis. J Classif 16:297306.
Fraley C, Raftery AE. 2002. Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 97:61131.[CrossRef][Web of Science]
Hayes JG. 1974. Numerical methods for curve and surface fitting. J Inst Math Appl 10:14452.
Jia Z, Xu S. 2005. Clustering expressed genes based on their association with a quantitative phenotype. Genet Res 86:193207.[CrossRef][Web of Science][Medline]
Kim S, Dougherty ER, Bittner ML, Chen Y, Sivakumar K, Meltzer P, Trent JM. 2000. General nonlinear framework for the analysis of gene interaction via multivariate expression arrays. J Biomed Opt 5:41124.[CrossRef][Web of Science][Medline]
Kirkpatrick S, Gelatt CD, Vecchi MP. 1983. Optimization by simulated annealing. Science 220:67180.
Kraft P, Schadt E, Aten J, Horvath S. 2003. A family-based test for correlation between gene expression and trait values. Am J Hum Genet 72:132330.[CrossRef][Web of Science][Medline]
Lander ES. 1999. Array of hope. Nat Genet 21:34.[CrossRef][Web of Science][Medline]
Li X, Quigg RJ, Zhou J, Ryaby JT, Wang H. 2005. Early signals for fracture healing. J Cell Biol 95:189205.
Liu JS, Zhang JL, Palumbo MJ, Lawrence CE. 2003. Bayesian clustering with variable and transformation selections. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M, editors. Bayesian Statistics 7. Oxford: Oxford University Press. p 24975.
Luan Y, Li H. 2003. Clustering of time-course gene expression data using a mixed-effects model with splines. Bioinformatics 19:47284.
Meireles SI, Cristo SB, Carvalho AF et al. (13 co-authors). 2004. Molecular classifiers for gastric cancer and nonmalignant diseases of the gastric mucosa. Cancer Res 64:125565.
Miller RA, Galecki A, Shmookler-reis RJ. 2001. Interpretation, design, and analysis of gene array expression experiments. J Gerontol A Biol Sci Med Sci 56:B527.
Nelder JA, Mead R. 1965. A simplex method for function minimization. Comput J 7:30813.
Peddada SD, Lobenhofer EK, Li LP, Afshari CA, Weinberg CR, Umbach DM. 2003. Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics 19:83441.
Potokina E, Caspers M, Prasad M, Kota R, Zhang H, Sreenivasulu N, Wang M, Graner A. 2004. Functional association between malting quality trait components and cDNA array based expression patterns in barley (Hordeum vulgare L.). Mol Breed 14:15370.[CrossRef]
Qu Y, Xu S. 2004. Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics 20:190513.
Quackenbush J. 2001. Computational analysis of microarray data. Nat Rev Genet 2:41827.[CrossRef][Web of Science][Medline]
Reinke V, Gil IS, Ward S, Kazmer K. 2004. Genome-wide germline-enriched and sex-biased expression profiles in Caenorhabditis elegans. Development 131:31123.
Robinson G. 1991. That BLUP is a good thing: the estimation of random effects. Stat Sci 6:1532.
Saban MR, Hellmich H, Nguyen NB, Winston J, Hammond TG, Saban R. 2001. Time course of LPS-induced gene expression in a mouse model of genitourinary inflammation. Physiol Genomics 5:14760.
SAS Institute. 1999. SAS/STAT user's guide, Version 8. Volume 1.Cary, NC: SAS Institute.
Schena M, Shalon D, Davis RW, Brown PO. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:46770.
Schwarz G. 1978. Estimating the dimension of a model. Ann Stat 6:290712.
Spellman P, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisae by microarray hybridization. Mol Biol Cell 9:327397.
Spieth C, Streichert F, Speer N, Zell A. 2005. Multi-objective model optimization for inferring gene regulatory networks. Conference on Evolutionary Multi-Criterion Optimization. Lect Notes Comput Sci 3410:60720.
Tadesse MG, Sha N, Vannucci M. 2005. Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc 100:60217.[CrossRef]
Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. 1999. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96:290712.
Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. 1999. Systematic determination of genetic network architecture. Nat Genet 22:21885.
Tusher VG, Tibshirani R, Chu G. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98:511621.
West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA, Marks JR, Nevins JR. 2001. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 98:114627.
Yeung KL, Fraley C, Murua A, Raftery AE, Ruzzo WL. 2001. Model-based clustering and data transformations for gene expression data. Bioinformatics 17:97787.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
R. W.M. Fung, M. Gonzalo, C. Fekete, L. G. Kovacs, Y. He, E. Marsh, L. M. McIntyre, D. P. Schachtman, and W. Qiu Powdery Mildew Induces Defense-Oriented Reprogramming of the Transcriptome in a Susceptible But Not in a Resistant Grapevine Plant Physiology, January 1, 2008; 146(1): 236 - 249. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Jia and S. Xu Mapping Quantitative Trait Loci for Expression Abundance Genetics, May 1, 2007; 176(1): 611 - 623. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||













































the Mixing Proportion
, and the Number of Genes Assigned to the kth Cluster
for the Data set of AD




