Skip Navigation


MBE Advance Access originally published online on May 26, 2006
Molecular Biology and Evolution 2006 23(8):1558-1573; doi:10.1093/molbev/msl019
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Material
Right arrowOA All Versions of this Article:
23/8/1558    most recent
msl019v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Google Scholar
Right arrow Articles by Qu, Y.
Right arrow Articles by Xu, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Qu, Y.
Right arrow Articles by Xu, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2006 The Authors This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Research Article

Quantitative Trait Associated Microarray Gene Expression Data Analysis

Yi Qu and Shizhong Xu

Department of Botany and Plant Sciences, University of California, Riverside

E-mail: xu{at}genetics.ucr.edu.


    Abstract
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Selection on phenotypes may cause genetic change. To understand the relationship between phenotype and gene expression from an evolutionary viewpoint, it is important to study the concordance between gene expression and profiles of phenotypes. In this study, we use a novel method of clustering to identify genes whose expression profiles are related to a quantitative phenotype. Cluster analysis of gene expression data aims at classifying genes into several different groups based on the similarity of their expression profiles across multiple conditions. The hope is that genes that are classified into the same clusters may share underlying regulatory elements or may be a part of the same metabolic pathways. Current methods for examining the association between phenotype and gene expression are limited to linear association measured by the correlation between individual gene expression values and phenotype. Genes may be associated with the phenotype in a nonlinear fashion. In addition, groups of genes that share a particular pattern in their relationship to phenotype may be of evolutionary interest. In this study, we develop a method to group genes based on orthogonal polynomials under a multivariate Gaussian mixture model. The effect of each expressed gene on the phenotype is partitioned into a cluster mean and a random deviation from the mean. Genes can also be clustered based on a time series. Parameters are estimated using the expectation–maximization algorithm and implemented in SAS. The method is verified with simulated data and demonstrated with experimental data from 2 studies, one clusters with respect to severity of disease in Alzheimer's patients and another clusters data for a rat fracture healing study over time. We find significant evidence of nonlinear associations in both studies and successfully describe these patterns with our method. We give detailed instructions and provide a working program that allows others to directly implement this method in their own analyses.

Key Words: cluster analysis • EM algorithm • mixture model • orthogonal polynomial • phenotype


    Introduction
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Selection may act on phenotypes and thus on the underlying genetic components in an indirect manner. One class of genes that are important in evolution are the genes involved in developmental pathways because developmental processes shape the morphologies of species. In recent years, with the newly emerged microarray technology, great effort has been made to identify genes that are expressed in particular developmental stages according to specific patterns (Arbeitman et al. 2002Go; Baugh et al. 2003Go; Reinke et al. 2004Go). The study of the expression of these genes over time is an important component in understanding how the gene expression patterns covary, and regulation of these genes is then responsible for many down stream phenotypes. Quantitative trait locus mapping represents the traditional method for identifying genes associated with quantitative phenotypes. Here, in this study, we develop a novel statistical method that uses microarray gene expression data to describe the expression profiles for genes that exhibit similar profiles with regard to a quantitative phenotype. This method can also be applied to grouping expression profiles along a developmental time course because time is on a continuous scale.

Microarray technology was developed only recently (Schena et al. 1995Go). However, it already has a tremendous impact (i.e., Lander 1999Go; West et al. 2001Go). A single chip can measure the expression of thousands of genes simultaneously. Microarray data can and should be analyzed according to the purposes of the experiments and the data structures. Differential expression analysis aims at detecting genes differently expressed between groups, that is, control and treatment (i.e., Schena et al. 1995Go; Baldi and Long 2001Go; Tusher et al. 2001Go; Dudoit et al. 2002Go), whereas cluster analysis classifies genes into different groups (Eisen et al. 1998Go; Quackenbush 2001Go; Yeung et al. 2001Go; Qu and Xu 2004Go). In this paper, we focus on grouping or clustering profiles. Previous clustering is based on the similarity of expression profiles across discrete conditions, for example, tissues collected from different types of organs (Eisen et al. 1998Go), time points after a particular drug treatment (Saban et al. 2001Go; Luan and Li 2003Go), and doses of a drug injection (Peddada et al. 2003Go). The most commonly used methods include hierarchical clustering (Carr et al. 1997Go), K-means (Tavazoie et al. 1999Go), and self-organizing maps (Tamayo et al. 1999Go). These algorithms are "heuristic" algorithms that classify genes based on the "distance matrix" and do not require any underlying statistical models. As a possible alternative to the heuristic algorithms, model-based clustering methods have also been proposed and applied to microarray data (Fraley and Raftery 1998Go; Yeung et al. 2001Go; Fraley and Raftery 2002Go; Qu and Xu 2004Go). An advantage of using the model-based algorithms is that the uncertainty of cluster assignment can be evaluated. In addition, alternative models can be exploited and compared.

Gene expression profiles have also been used for classification of tissues for disease diagnosis (West et al. 2001Go; Bicciato et al. 2003Go; Liu et al. 2003Go; Meireles et al. 2004Go; Tadesse et al. 2005Go). Classification methods are conceptually the same as gene clustering in the sense that the data analyses are conducted with rows and columns being transposed. Special techniques have been developed for tissue classification, for example, dimension reduction (Liu et al. 2003Go) and variable selection (Tadesse et al. 2005Go).

Clustering based on discrete groups or classification prediction based on discrete categories is valuable. However, there are many evolutionary interesting phenotypes that are quantitative in nature. The association between quantitative phenotype and gene expression can be examined pairwise using a Pearson correlation coefficient between the expression of a single gene and a continuous phenotype (Quackenbush 2001Go; Kraft et al. 2003Go). Taking this one step further, Blalock et al. (2004)Go ranked genes according to the correlation coefficients of gene expression with Mini Mental Status Examination (MMSE), a quantitative measurement of severity of Alzheimer's disease (AD), and detected many genes associated with AD. Kraft et al. (2003)Go used the within-family correlation analysis to remove the effect of family stratification. Although the Pearson correlation coefficient is a straightforward way to assess a linear relationship between a single gene and a quantitative phenotype, it is limited to linear relationships with a single gene/phenotype combination. Potokina et al. (2004)Go investigated the association of gene expression with 6 malting quality phenotypes (quantitative traits) of 10 barley cultivars. They compared the distance matrix for each gene expression profile among the 10 cultivar with each of the distance matrix calculated from the phenotypes and applied a g-test statistic. This approach allows the examination of multiple phenotypes but still focuses on linear associations with a single gene. The linear association between groups of genes and a quantitative phenotype has been described in a regression setting (Jia and Xu 2005Go) and for groups of genes using factor analysis (Coffman et. al. 2005Go). However, the relationship between the phenotype of interest and the gene expression profile may not be linear. For example, in the case of developmentally regulated genes, we may expect to see exponential or Poisson distribution.

In the AD microarray experiment of Blalock et al. (2004)Go, we have seen a concrete example of nonlinear association between gene expression and MMSE. Figure 1 plots expression profiles (on the y-axis) against MMSE phenotype (on the x-axis) for 6 genes. Gene (ABCF1) shows no obvious association with the phenotype. Gene (37870) and (MYO1F) show positive and negative associations, respectively, with the phenotype. The remaining 3 genes (CALB1, GDF8, and ANKH) show nonlinear associations with the phenotype. Existing methods developed solely for detecting the linear association (Blalock et al. 2004Go; Jia and Xu 2005Go) will not identify an association between phenotype and gene expression in these cases (CALB1, GDF8, and ANKH). However, a method able to identify nonlinear association should identify the association between these genes and the phenotype MMSE.


Figure 1
View larger version (15K):
[in this window]
[in a new window]
 
FIG. 1.— Expression profiles of a few genes as a function of the AD phenotype (MMSE): (ABCF1), gene expression does not change with the phenotypic value; (37870), gene expression increases as the phenotype value increases; (MYO1F), gene expression decreases as the phenotypic value increases; (CALB1, GDF8, and ANKH), gene expression is associated with the phenotypic value in a nonlinear fashion.

 
In this study, we develop a novel algorithm to cluster genes based on nonlinear association between gene expression and a quantitative phenotype using orthogonal polynomials under the Gaussian mixture model. In our model, the whole data set is assumed to be a mixture of K clusters. The expression of each gene is considered as the sum of a cluster mean (fixed effect), a normally distributed random effect specific for the gene (random effect), and a random measurement noise. We use a linear contrast to remove the mean expression of each gene, which is not of interest, from the model, so that we can cluster gene based solely on the pattern of association of their expression with the phenotypic values. Parameters in the model are estimated with the expectation–maximization (EM) algorithm (Dempster et al. 1977Go). The random effect for each gene is predicted with best linear unbiased predictor (BLUP)(Robinson 1991Go). Bayesian information criterion (BIC) is applied to choose the optimal number of clusters.


    Materials and Methods
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Orthogonal Polynomials of Gene Expression Profile
Let Z be the phenotypic value of a quantitative trait (a continuous variable) and Yi(Z) be the expression level of gene i = 1, ..., M written as a function of the phenotype, where M is the total number of genes studied. The following model may be used to describe the relationship between Yi(Z) and Z

Formula 1(1)
where {alpha}i(Z) is an arbitrary function chosen to describe the relationship between the expressed gene and the phenotype and {varepsilon}i is a random error reflecting the uncertainty of the fit of the model with an assumed N(0, {sigma}2) distribution. When describing gene expression as a function of phenotype, we have borrowed the idea of Luan and Li (2003)Go in time-course microarray data analysis where gene expression is described as a function of time, that is, Formula 1

Note that function {alpha}i(Z) has a subscript i, meaning that this function varies from gene to gene. In the model-based cluster analysis, Luan and Li (2003)Go further partitioned {alpha}i(Z) into a fixed effect function and a random effect function. The fixed effect function is a mixture of several functional clusters, depending on which cluster the ith gene belongs to. For example, if the ith gene belongs to the kth cluster (k = 1, ..., K), the mixed-effects model is


Formula 2

(2)
where ßk(Z) is a function shared by all genes in cluster k and {gamma}i(Z) is a gene-specific random effect function. As a result, model (2) is a functional mixed-effects model.

All the model parameters, except {sigma}2, are a function of Z. The functional relationships between the parameters and Z may be described with an orthogonal polynomial. Define Formula 1 as a basis of the polynomial with order d. A basis is also called a submodel. A method to construct the basis of orthogonal polynomial is given in Supplemental Material A (Supplementary Material online). Let us define Formula 1 as a vector of model effects representing the mean values of cluster k. The phenotype associated model effect ßk(Z) can be described as a linear combination of ßk weighted by the coefficients of the polynomials, that is, ßk(Z) = {psi}(Zk. Similarly, we can describe {gamma}i(Z) by {gamma}i(Z) = {psi}(Z){gamma}i, where Formula 1 Because {gamma}i is a vector of random effects, we assume that {gamma}i is i.i.d. N(0, {Sigma}), where {Sigma} is a (d + 1) x (d + 1) positive definite covariance matrix. Note that ßk(Z) and {gamma}i(Z) are scalars, but ßk and {gamma}i are vectors each with d + 1 elements.

Assume that the ith gene belongs to the kth cluster, with the polynomial reparameterization, model (2) can be now rewritten as a linear function of the phenotype-independent parameters,

Formula 3(3)
The expectation of Yi(Z) conditional on the fixed effects is

Formula 4(4)
The covariance between Yi(Z) and Yi(S) conditional on the fixed effects is

Formula 5(5)
where {eta}(Z, S) = 1 for Z = S and {eta}(Z, S) = 0 otherwise.

Model (3) is linear with the model parameters, which can be estimated using the mixed model framework. We first estimate Formula 5 {Sigma}, and {sigma}2 and then use the basis of the polynomial to convert ß and {gamma} into their phenotype-dependent functional parameters.

Likelihood Function of Normalized Gene Expression
Assume that the expression profiles for M genes are measured in N microarrays, where each microarray represents a single subject. The subjects are sorted by the phenotypic values in either ascending or descending order and denoted by Z1, ..., ZN. We use an N x 1 vector to denote the expression levels of gene i collected from N subjects, Formula 5 for i = 1, ..., M, where M is the total number of genes. Define

Formula 6(6)
as a (d + 1) x N matrix, where {psi}0(Zj) = 1 for all j = 1, ..., N (see Supplemental Material A for details on the method of constructing matrix {psi}, Supplementary Material online). Let Ci be a cluster indicator variable for gene i, for example, Ci = k if i belongs to cluster k. In matrix notation, the linear model for vector Yi is

Formula 7(7)
where Formula 7 is used to indicate that the model is presented conditional on gene i belonging to cluster k and Formula 7 is now an N x 1 vector for the residual effects with a multivariate normal distribution, that is, {varepsilon}i ~ N(0, I{sigma}2), where I is an N x N identity matrix.

There are 2 points that need to be examined with care. One is that the phenotypic value Z needs to be rescaled before it enters the polynomial analysis. This is to ensure that the shape of the curve rather than the mean value is clustered. The rescaled variable, zj, runs between –1 and +1, that is, –1 ≤ zj ≤ + 1 (Hayes 1974Go). The other point is that we assume all preprocessing for the gene expression profiles has already occurred (e.g., transformation). Let Yij = Yi(zj) be the log-transformed expression for gene i measured from subject j; then the normalized expression is

Formula 8(8)
Note that after normalization, the intercepts, ßk0 and {gamma}i0, have been removed from the model (see Supplemental Material B for the proof, Supplementary Material online). In addition, we do not have N independent observations of the normalized expression for each gene. Therefore, the last element of vector yi should be excluded and yi becomes an n x 1 vector for n = N–1. Accordingly, ßk and {gamma}i each becomes a d x 1 vector, {Sigma} is a d x d matrix, and {psi} becomes a d x n matrix. Finally, the vector of residual errors for the normalized data is redefined as L{varepsilon}i, where L (an n x N matrix) is obtained by deleting the last row of matrix Formula 8 (see Supplemental Material B for the definition and derivation of this matrix, Supplementary Material online). Therefore, the covariance matrix of the normalized residuals is

Formula 9(9)
where R = LLT is a known n x n positive definite matrix.

Given the parameters {theta} = {ß, {Sigma}, {sigma}2, {pi}) and the data D = {y, z}, the likelihood function can be constructed. For a gene in cluster k, the model is

Formula 10(10)
The expectation of the model is

Formula 11(11)
and the variance matrix is

Formula 12(12)
which applies to all i = 1, ..., M.

Let Formula 12 be the prior probability that a randomly selected gene belongs to cluster k. Vector Formula 12 represents the mixing proportions. The probability density of the ith gene from the kth cluster is

Formula 13(13)
The likelihood function is

Formula 14(14)
Several numerical methods can be used to solve for the maximum likelihood estimates (MLE) of the parameters (Nelder and Mead 1965Go; Dempster et al. 1977Go; Kirkpatrick et al. 1983Go). We used the EM algorithm (Dempster et al. 1977Go) because explicit equations for the maximization step are available in the complete data situation.

EM Algorithm for Parameter Estimation
With the EM algorithm, we classify variables into data denoted by D = {y, z}, missing values denoted by U = {C, {gamma}}, and parameters {theta}. The likelihood function given in equation (14) with all the missing values integrated out is the observed likelihood function. Instead of maximizing the observed likelihood function, the EM algorithm deals with the following complete data likelihood function,

Formula 15(15)
where

Formula 16(16)

Formula 17(17)
and

Formula 18(18)
For convenience of presentation, we defined variable {eta}(Ci, k) = 1 for Ci = k and {eta}(Ci, k) = 0 for Ci != k. Because the complete data likelihood function involves missing values, integration (or expectation) is taken with respect to the missing values, not for the complete data likelihood function (15) but for the log-transformed complete likelihood function given below,

Formula 19(19)
The expectation of equation (19) with respect to the missing values is

Formula 20(20)
where

Formula 21(21a)

Formula 22(21b)

Formula 23(22)
and

Formula 24(23)

The EM algorithm requires taking the partial derivatives of l({theta}) with respect to the parameters, setting these partial derivatives equal to zero and finding explicit solutions for the parameters as a function of the missing values. This completes the maximization step. The expectation step requires taking the expectations of all terms involving the missing values. Because the complete data likelihood function is just a likelihood function for a typical mixed model, we can simply utilize existing mixed model EM algorithm to find the MLE of parameters (Dempster et al. 1977Go). Followings are the EM steps for the mixed model analysis.

Step 0: Let t = 0 and initialize all parameters with values in their legal domain, denoted by {theta}(t).
Step 1 (E1): Compute the posterior probabilities of cluster assignment for each gene

Formula 25(24)

Step 2 (E2): Find the posterior distribution of the random effect. This posterior distribution turns out to be a mixture of K normal distributions. If gene i belongs to cluster k, then the conditional expectation is

Formula 26(25)
and the covariance matrix

Formula 27(26)

Step 3 (M1): Update the regression coefficients for each cluster

Formula 28(27)

Step 4 (M2): Update the covariance matrix of the random effect

Formula 29(28)

Step 5 (M3): Update the residual variance

Formula 30(29)

Step 6 (M4): Update the mixing proportions

Formula 31(30)

Step 7: Set t = t + 1 and repeat from step 1 to step 6 (2 E-steps and 4 M-steps) many times until ||{theta}(t+1){theta}(t)|| ≤ 10–8 holds.

The Optimal Number of Clusters
The above EM algorithm applies to situations where the number of clusters K is fixed. In reality, K must be estimated from the data. The commonly used method for inferring K is to calculate the BIC and choose K that maximizes the BIC value (Schwarz 1978Go). BIC is considered as an approximation to the Bayesian factor and is defined as

Formula 32(31)
for K clusters, where Formula 32 is the MLE of the parameter vector {theta} under K clusters and Formula 32 is the dimension of {theta}, that is, the number of parameters in the model.

After the EM algorithm converges and the number of clusters is determined, we obtain all the estimated parameters. We also have the posterior probability that the ith gene belongs to the kth cluster, Formula 32 for i = 1, ..., M and k = 1, ..., K. Based on these probabilities, we can assign the ith gene into the kth cluster if Formula 32 The {pi}ik also provides a measurement of the uncertainty about the resulting clustering. We can also try only cluster those genes with the maximum {pi}ik greater than a predetermined cutoff value and declare other genes as unresolved (Luan and Li 2003Go).

After gene clustering, we obtain estimates of the gene expression profiles. For the ith gene in the kth cluster, the BLUP of the random regression coefficients {gamma}i is given in equation (25) and is now denoted by Formula 32 Let Formula 32 be the MLE of ßk after the EM algorithm converges. The corresponding estimate of the individual gene expression profile for the ith gene in the kth cluster is

Formula 33(32)
Note that Formula 33 is a vector of the BLUP estimates of the ith gene expressions on n subjects. The actual gene expression profile is

Formula 34(33)
which can be examined visually by plotting Formula 34 against z.


    Results
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Simulation Study
In order to validate the method and the computer program that we developed to implement the method, we undertook 2 simulation studies.

Data Set I
In this data set, we simulated M = 1200 genes from K = 5 clusters. The first cluster contained 400 genes, whereas each of the other 4 clusters contained 200 genes. The expression levels of each gene were measured from N = 50 subjects. The phenotypic value for each subject Zj was simulated from U(0.75, 1.25), a uniform distribution from 0.75 to 1.25. The phenotypic values were then rescaled to zj so that –1 ≤ zj ≤ 1. The order of the polynomial regression was d = 4. To simulate the expression values of genes in the kth cluster for all the 50 subjects, we used the following model to generate the data,

Formula 35(34)
where

Formula 35
The random vector Formula 35 was sampled from a N(0, {Sigma}) distribution, where

Formula 35
The {varepsilon}ij were sampled from a N(0, {sigma}2) distribution, where {sigma}2 = 0.16. The true proportions of the genes in the 5 clusters were {pi}=[0.333 0.167 0.167 0.167 0.167]. The simulated gene expression data were then normalized using yi = LYi, where

Formula 35
is a 49 x 50 matrix. When we used the normalized data for cluster analysis, the intercepts {ßk0, {gamma}i0} were automatically deleted so that the parameters subject to estimation were

Formula 35
and

Formula 35
both of which had reduced dimensions.

We have implemented the algorithm in SAS, which is available as supplemental material online. The cluster analysis was performed using a variable number of clusters, K = 3, ..., 9. The maximum BIC value occurred when K = 5, the true number of clusters simulated. Under K = 5, we obtained the following estimated parameters:

Formula 35

Formula 35

Formula 35
These estimated values were close to the simulated parameters. We next examined the rate of correct classification of genes for each cluster. We assigned the ith gene to the kth cluster if Formula 35 The overall correct clustering rate for the whole data set was 98%, which was distributed as 390/400, 194/200, 199/200, 200/200, and 194/200, respectively, for the 5 clusters.

The true and estimated curves for the 5 clusters are depicted in figure 2. The true curve for each cluster was generated using

Formula 36(35)
whereas estimated curve for each cluster was generated using

Formula 37(36)
for k =1, ..., 5, where ßk is the true value and Formula 37 is the estimate of the mean of cluster k.


Figure 2
View larger version (12K):
[in this window]
[in a new window]
 
FIG. 2.— True and estimated mean gene expression profiles for each of the 5 simulated clusters of genes. Left panel, true profiles for the 5 clusters; Right panel, estimated profiles for the 5 clusters. In each plot, the 5 clusters (1–5) are color coded with black, blue, red, purple, and green, respectively.

 
We used

Formula 38(37)
to predict the profiles of individual gene expression for the 5 clusters. These profiles are depicted in figure 3.


Figure 3
View larger version (21K):
[in this window]
[in a new window]
 
FIG. 3.— Estimated gene expression profiles for all simulated genes in the 5 clusters.

 
Data Set II
In this simulation experiment, we generated M = 500 genes from K = 6 clusters. The first cluster contained 4000 genes, and each of the other 5 clusters contained 200 genes. Variables Z and Y were simulated in the same way as described for data set I. But this time, we let

Formula 38
{Sigma} = I4x4 x 0.005, where I is the identity matrix and {sigma}2 = 0.01.

Using this data set, we compared our method with K-means. We implemented the K-means with SAS procedure "FASTCLUS" (SAS Institute 1999Go). The performance of our method was comparable to or slightly inferior to the K-means method for the small clusters (Clusters 2–6). However, it did much better in recognizing genes in the larger cluster (Cluster 1) and had a much better overall correct classification rate (see table 1). With our method, the estimated parameters were

Formula 38

Formula 38
Formula 38 which were quite close to the true parameter values.


View this table:
[in this window]
[in a new window]
 
Table 1 Comparison of the Proposed New Method (denoted as "mixed-effect model") and K-Means Using Simulate Data set II

 
We conducted additional simulations with a wide range of parameters and sample sizes. Results of all simulation experiments showed the following expected trends (data not shown): 1) errors of cluster assignments decreased as the differences among the cluster means increased and 2) estimation errors of the parameters decreased as the number of genes within the cluster increased.

Application to AD
Blalock et al. (2004)Go measured hippocampal gene expressions from 9 normal persons (controls) and 22 affected persons (cases), N = 9 + 22 = 31. Each subject had a disease severity measurement, MMSE, whose value varied from 2 (most severe) to 30 (normal). The MMSE is a continuous measure that has been reported as a reliable index of AD-related cognitive status at a given point in time (Miller et al. 2001Go). All the subjects were assigned to 1 of the 4 groups (control, incipient, less severe, and most severe), reflecting the different levels of AD severity based primarily on the MMSE score (Blalock et al. 2004Go). The original scores and the orthogonal polynomials of the MMSE for the 31 subjects are shown in table 2.


View this table:
[in this window]
[in a new window]
 
Table 2 AD Phenotypes (MMSE scores) and the Coefficients of Orthogonal Polynomials

 
The expression levels of 22 286 genes were measured from 31 subjects (microarray chips). Here we choose a subset of 9754 genes for analysis based on 2 criteria used by Blalock et al. (2004)Go: 1) a gene was included only if it is a gene with known function (other than a hypothetical gene or an expressed sequence tag) and 2) a gene was included only if its probe set is detected on at least from 4 chips of this study. We clustered the 9754 genes based on the dependency of each gene's expression levels on the MMSE scores of the 31 subjects, as described in the method section. The gene expression data were log transformed before the analysis.

We analyzed the data using a SAS program (available from supplemental material online) where d = 4 was chosen as the order of the polynomial regression. We tested the number of clusters from K = 2 to 20 using the BIC score and found that the BIC scores reached its maximum at K = 14. Therefore, the following paragraphs only report the results when K = 14.

The estimates of the cluster mean Formula 38 and mixing proportion Formula 38 for each cluster are presented in table 3. The estimated covariance matrix was

Formula 38
The estimated residual variance was Formula 38 We assigned each gene to the cluster with the highest posterior probability Formula 38 Table 3 also shows the actual numbers of genes assigned to each cluster.


View this table:
[in this window]
[in a new window]
 
Table 3 The Estimated Cluster Mean Formula 38 the Mixing Proportion Formula 38, and the Number of Genes Assigned to the kth Cluster Formula 38 for the Data set of AD

 
In the simulated data, the expression profile of each cluster was predicted using equation (36), and the gene-specific expression profile was predicted using equation (37) for each of the 14 clusters. The profiles of all genes for the 14 clusters are depicted in figure 4.


Figure 4
Figure 4
View larger version (37K):
[in this window]
[in a new window]
 
FIG. 4.— Gene expression profiles for all genes in the 14 clusters estimated from the AD microarray data.

 
The first cluster contained 93.5% of the genes whose expression do not change significantly with the phenotypic value. Clusters 8, 10, 11, 12, and 13 accounted for 0.65%, 0.15%, 0.63%, 0.03%, and 3.01% of all genes, respectively. These genes were mainly up regulated with the MMSE score. Clusters 6, 7, and 14 accounted for 0.16%, 0.25%, and 0.49% of all genes, respectively. These genes were mainly down regulated with the MMSE score. Clusters 2, 3, 4, 5, and 9, which accounted for 0.33%, 0.06%, 0.43%, 0.11%, and 0.16%, respectively, represented the genes whose expression changes with the phenotypic values in a nonlinear fashion.

To compare the predicted with the observed gene expressions, we randomly selected one gene from each cluster and presented the comparisons in figure 5. It shows that the predicted profiles do reflect the general trends indicated by the observed data.


Figure 5
Figure 5
View larger version (28K):
[in this window]
[in a new window]
 
FIG. 5.— Observed (dots) and predicted (curve) gene expressions for a randomly selected gene from each of the 14 clusters plotted against the AD phenotype (MMSE score).

 
Blalock et al. (2004) studied this data set using Pearson correlation analysis. Among the 9754 genes, 2852 of them (28.6%) were detected as differentially expressed (nominal P < 0.05) based on the correlation coefficients. In the AD data set, some of the genes classified into clusters 2–14 are not detected as differentially expressed by either correlation analysis or the linear association method.

Application to the Rat Fracture Healing Data
To demonstrate that our method can be applied to time-course gene expression data, we examined an experiment that studied rat fracture healing. The original microarray experiment was published by Li et al. (2005)Go. In all, 21 rats were used in the study. A standard closed fracture was created in the right femur of each rat. Tissues were collected at different postfracture time points (0d, 1d, 4d, 7d, 14d, 21d, and 28d), each with 3 replicates. The expression levels of 12 434 genes were measured using Affymetrix Genechips. We picked a subset of 12 124 genes that expressed on at least 1 of the 21 arrays. The raw data were log transformed, and we took the average of the 3 replicates as the measurement of gene expression for each time point. The mean expression of each gene across the time course was removed by the linear contrast strategy described earlier.

Luan and Li (2003)Go used a mixed-effect model with B-splines to analyze gene expression in yeast cell cycle and identified 5 major clusters. We used the method of Luan and Li (2003) to fit cubic B-splines with 2 internal knots to classify gene expression profiles in the rat fracture healing experiment. The algorithm for generating the B-spline basis was initially described by De Boor (1978)Go. Unlike the yeast cell cycle data set, in which most genes were differentially expressed, there were only a small proportion of genes in the rat fracture healing experiment whose expression were significantly altered in the time course.

We varied the number of clusters from K = 5 to 70 incremented by 5 and found that K = 35 gave the maximum BIC score. Among the 35 clusters, one large cluster contained 10 235 genes (84.4%) whose expression did not change during the time course, whereas the expression of genes in the other clusters evolved with time in different patterns. See Supplemental Material C (Supplementary Material online) for the estimated cluster mean and mixing proportion of each of the 35 clusters.

In order to compare the results of B-splines and polynomials, we reclassified the data with orthogonal polynomials of order d = 3. Again, we varied the number of clusters from K = 5 to 70 incremented by 5 and found that K = 30 gave the maximum BIC score. Similar to the B-splines approach, the majority of the genes (10 017 out 12 124 genes) were assigned to a large cluster with no apparent change in expression (neutral cluster), whereas a relatively small number of genes (2107 out of 12 124 genes) were assigned to other clusters (see Supplemental Material C for more information, Supplementary Material online). Among the genes placed in clusters other than the neutral one using either the B-splines or the polynomial approach, 1663 of them (79%) were identified as differing from a flat profile by both methods. Therefore, the degree of consistency between the 2 methods is quite high. When the number of time points of measurement is small (7 time points in this experiment) or the curves to be estimated are simple, the method of orthogonal polynomials is perhaps more preferable for biologists because it is simple and easy to understand. If the curves are arbitrarily complex, B-splines should be more preferable.


    Discussion
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
In this paper, "linear association" refers to the first order (or straight line) association between 2 variables. Nonlinear kinetics of gene regulation has been investigated by many authors (e.g., Kim et al. 2000Go; Spieth et al. 2005Go). The gene regulatory network is known to be a complex system. Multiple effects, such as the autoregulation (positive or negative self-feedbacks) of a gene, the cooperativity, and competition between genes, may result in nonlinear associations between the expression levels of multiple genes or between the gene expression levels and the phenotypic traits. Motivated by these observations, we have developed a new method, which is capable of describing nonlinear associations between gene expression and phenotype that are not detected by methods using linear associations. Our novel approach to clustering microarray data is based on a nonlinear association of gene expression with a continuous phenotype. This new method is a useful tool for detecting genes associated with continuous phenotypes and to explore the patterns of gene-trait association. We verified our algorithm and implementation with simulated data and then applied the methods to 2 microarray data sets. For the analysis of AD, we found that the majority of the genes had expression profiles that did not show association with MMSE score. Some genes had expression profiles that showed interesting nonlinear association with MMSE score.

We also demonstrate that our method can be applied to time-course experiments. Luan and Li (2003)Go used a mixed-effects model to analyze the time-course gene expression data for yeast cell cycle (Spellman et al. 1998Go). They used B-splines basis of the time points as the design matrix. We used orthogonal polynomials with the time points as the "phenotype" and then compared our result with that of Luan and Li (2003)Go for the time-course expression data of fracture healing in rats (Li et al. 2005Go). We classified the data with both B-splines basis and orthogonal polynomials. In both cases, we found that the majority of the genes were expressed at relatively constant levels, whereas small subsets of the genes showed patterns that changed over the time course. Among the genes classified as different over time with the B-splines, 88% (1663/1889) were also classified as different over time with the polynomials, and 79% (1663/2107) of the genes classified as different over time with polynomials were also classified as different over time with B-splines. This result indicates a high degree of consistency between the 2 methods.

A B-spline can be described as a piecewise polynomial spline. Instead of fitting one polynomial curve across the data point, it fits different curves in different regions. B-splines require specification of series of internal knots (breaking points), for example, time points or phenotypic values. The polynomial curves are, roughly speaking, equivalent to B-splines with no internal knots. There is no best way to choose the number of locations of the knots, and the choice may have an impact on the results (De Boor 1978Go). The flexibility of B-splines makes them particularly suitable to fit complex or periodical curves, for example, the curve of a sine function. To do so effectively, however, a relatively large number of measurements must be made on each subject. Current gene expression data sets do not have the sampling density to effectively estimate complex curves.

Generally speaking, observations with n treatments can be used to fit polynomial curves up to (n – 1)th order, and a higher order always results in better fit. For real data sets, because of the underlying complexity of the data structures, the optimal number of clusters could be large, which results in a great number of parameters to fit. The optimal number of clusters tends to increase with the order of the polynomials used in the analysis. So, increasing the order even by one may result in a much more complex model without much additional information. We selected d = 4 for the AD data set and d = 3 for the fracture healing data. We examined other values of d around 4 for the AD data and other d values around 3 for the fracture healing data and found that the results were virtually the same, indicating that these were reasonable choices for the order of the polynomial in these data sets.

Theoretically, the optimal order (d) can be found using the BIC score because it determines the dimension of the model. In practice, however, it is tedious because the number of clusters also determines the dimension of the model. Consider the combination of the order of polynomials and the number of clusters, the potential number of models for evaluation will quickly reach the limit of our capability. Therefore, we suggest that choosing d = 3 or d = 4 will be sufficient to describe the complexity of microarray data currently available. If the association is linear or quadratic, choosing d > 2 will overfit the model. This does not present a serious problem because the higher order regression coefficients will most likely have estimates close to zero, and the fitted curve will be effectively linear or quadratic. Recall that the shape of a curve is determined by both the order and the regression coefficients. Two drastically different curves may be described by polynomials of the same order but with quite different regression coefficients.

In analysis of the AD microarray data, we found that the EM algorithm was sensitive to the initial values of the parameters. This is a well-known phenomenon that the EM algorithm is a "hill-climbing" approach. The algorithm can only guarantee the reach of a local maximum. In attempt to find the global maximum, we explored many different initial values and chose the solution with the highest converged likelihood value.

In summary, we proposed a new method to identify genes associated in a nonlinear fashion with a quantitative trait using microarray gene expression data. In order to make inferences about the evolutionary processes, the nonlinear relationship between gene expression and phenotype may be of interest in exploring functional and molecular evolution.


    Supplementary Material
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
Supplementary Materials A, B, and C are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


    Acknowledgements
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 
We are grateful to Drs Blalock and Landfield at the University of Kentucky for providing the microarray data and the AD phenotypes. We also thank Dr Xinmin Li at the University of Chicago for providing us with the complete microarray data of the rat fracture healing experiment. We are particularly grateful to the Associate Editor Laurent McIntyre and 3 anonymous reviewers for their constructive comments and suggestions that significantly improved the presentation of the manuscript. The research was supported by the National Institutes of Health grant R01-GM55321 and National Science Foundation grant DBI-0345205 to S.X.

Funding to pay the Open Access publication charges for this article was provided by the National Institutes of Health grant R01-GM55321.


    Footnotes
 
Lauren McIntyre, Associate Editor


    References
 TOP
 Abstract
 Introduction
 Materials and Methods
 Results
 Discussion
 Supplementary Material
 Acknowledgements
 References
 

    Arbeitman MN, Furlong EEM, Imam F, Johnson E, Null BH, Baker BS, Krasnow ME, Scott MP, Davis RW, White KP. 2002. Gene expression during the life cycle of Drosophila melanogaster. Science 297:2270–5.[Abstract/Free Full Text]

    Baldi P, Long AD. 2001. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17:509–19.[Abstract/Free Full Text]

    Baugh LR, Hill AA, Slonim DK, Brown EL, Hunter CP. 2003. Composition and dynamics of the Caenorhabditis elegans early embryonic transcriptome. Development 130:889–900.[Abstract/Free Full Text]

    Bicciato S, Luchini A, Di Bello C. 2003. PCA disjoint models for multiclass cancer analysis using gene expression data. Bioinformatics 19:571–8.[Abstract/Free Full Text]

    Blalock EM, Geddes JW, Chen KC, Porter NM, Markesbery WR, Landfield PW. 2004. Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci USA 101:2173–8.[Abstract/Free Full Text]

    Carr DB, Somogyi R, Michaels G. 1997. Templates for looking at the gene expression clustering. Comput Stat Graph Newsl 8:20–9.

    Coffman C, Wayne ML, Nuzhdin SV, Higgins LA, McIntyre LM. 2005. Identification of co-regulated transcripts affecting male body size in Drosophila. Genome Biol 6:R53.[CrossRef][Medline]

    De Boor C. 1978. A practical guide to splines. New York: Springer.

    Dempster AP, Laird NM, Rubin DB. 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc B 39:1–38.

    Dudoit S, Yang YH, Callow MJ, Speed TB. 2002. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sin 12:111–39.

    Eisen MB, Spielman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95:14863–8.[Abstract/Free Full Text]

    Fraley C, Raftery AE. 1998. MCLUST: software for model-based cluster analysis. J Classif 16:297–306.

    Fraley C, Raftery AE. 2002. Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 97:611–31.[CrossRef][Web of Science]

    Hayes JG. 1974. Numerical methods for curve and surface fitting. J Inst Math Appl 10:144–52.

    Jia Z, Xu S. 2005. Clustering expressed genes based on their association with a quantitative phenotype. Genet Res 86:193–207.[CrossRef][Web of Science][Medline]

    Kim S, Dougherty ER, Bittner ML, Chen Y, Sivakumar K, Meltzer P, Trent JM. 2000. General nonlinear framework for the analysis of gene interaction via multivariate expression arrays. J Biomed Opt 5:411–24.[CrossRef][Web of Science][Medline]

    Kirkpatrick S, Gelatt CD, Vecchi MP. 1983. Optimization by simulated annealing. Science 220:671–80.[Abstract/Free Full Text]

    Kraft P, Schadt E, Aten J, Horvath S. 2003. A family-based test for correlation between gene expression and trait values. Am J Hum Genet 72:1323–30.[CrossRef][Web of Science][Medline]

    Lander ES. 1999. Array of hope. Nat Genet 21:3–4.[CrossRef][Web of Science][Medline]

    Li X, Quigg RJ, Zhou J, Ryaby JT, Wang H. 2005. Early signals for fracture healing. J Cell Biol 95:189–205.

    Liu JS, Zhang JL, Palumbo MJ, Lawrence CE. 2003. Bayesian clustering with variable and transformation selections. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M, editors. Bayesian Statistics 7. Oxford: Oxford University Press. p 249–75.

    Luan Y, Li H. 2003. Clustering of time-course gene expression data using a mixed-effects model with splines. Bioinformatics 19:472–84.

    Meireles SI, Cristo SB, Carvalho AF et al. (13 co-authors). 2004. Molecular classifiers for gastric cancer and nonmalignant diseases of the gastric mucosa. Cancer Res 64:1255–65.[Abstract/Free Full Text]

    Miller RA, Galecki A, Shmookler-reis RJ. 2001. Interpretation, design, and analysis of gene array expression experiments. J Gerontol A Biol Sci Med Sci 56:B52–7.[Abstract/Free Full Text]

    Nelder JA, Mead R. 1965. A simplex method for function minimization. Comput J 7:308–13.

    Peddada SD, Lobenhofer EK, Li LP, Afshari CA, Weinberg CR, Umbach DM. 2003. Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics 19:834–41.[Abstract/Free Full Text]

    Potokina E, Caspers M, Prasad M, Kota R, Zhang H, Sreenivasulu N, Wang M, Graner A. 2004. Functional association between malting quality trait components and cDNA array based expression patterns in barley (Hordeum vulgare L.). Mol Breed 14:153–70.[CrossRef]

    Qu Y, Xu S. 2004. Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics 20:1905–13.[Abstract/Free Full Text]

    Quackenbush J. 2001. Computational analysis of microarray data. Nat Rev Genet 2:418–27.[CrossRef][Web of Science][Medline]

    Reinke V, Gil IS, Ward S, Kazmer K. 2004. Genome-wide germline-enriched and sex-biased expression profiles in Caenorhabditis elegans. Development 131:311–23.[Abstract/Free Full Text]

    Robinson G. 1991. That BLUP is a good thing: the estimation of random effects. Stat Sci 6:15–32.

    Saban MR, Hellmich H, Nguyen NB, Winston J, Hammond TG, Saban R. 2001. Time course of LPS-induced gene expression in a mouse model of genitourinary inflammation. Physiol Genomics 5:147–60.[Abstract/Free Full Text]

    SAS Institute. 1999. SAS/STAT user's guide, Version 8. Volume 1.Cary, NC: SAS Institute.

    Schena M, Shalon D, Davis RW, Brown PO. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–70.[Abstract/Free Full Text]

    Schwarz G. 1978. Estimating the dimension of a model. Ann Stat 6:2907–12.

    Spellman P, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisae by microarray hybridization. Mol Biol Cell 9:3273–97.[Abstract/Free Full Text]

    Spieth C, Streichert F, Speer N, Zell A. 2005. Multi-objective model optimization for inferring gene regulatory networks. Conference on Evolutionary Multi-Criterion Optimization. Lect Notes Comput Sci 3410:607–20.

    Tadesse MG, Sha N, Vannucci M. 2005. Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc 100:602–17.[CrossRef]

    Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. 1999. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96:2907–12.[Abstract/Free Full Text]

    Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. 1999. Systematic determination of genetic network architecture. Nat Genet 22:218–85.

    Tusher VG, Tibshirani R, Chu G. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98:5116–21.[Abstract/Free Full Text]

    West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA, Marks JR, Nevins JR. 2001. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 98:11462–7.[Abstract/Free Full Text]

    Yeung KL, Fraley C, Murua A, Raftery AE, Ruzzo WL. 2001. Model-based clustering and data transformations for gene expression data. Bioinformatics 17:977–87.[Abstract/Free Full Text]

Accepted for publication May 16, 2006.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Plant Physiol.Home page
R. W.M. Fung, M. Gonzalo, C. Fekete, L. G. Kovacs, Y. He, E. Marsh, L. M. McIntyre, D. P. Schachtman, and W. Qiu
Powdery Mildew Induces Defense-Oriented Reprogramming of the Transcriptome in a Susceptible But Not in a Resistant Grapevine
Plant Physiology, January 1, 2008; 146(1): 236 - 249.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
Z. Jia and S. Xu
Mapping Quantitative Trait Loci for Expression Abundance
Genetics, May 1, 2007; 176(1): 611 - 623.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Material
Right arrowOA All Versions of this Article:
23/8/1558    most recent
msl019v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (6)
Google Scholar
Right arrow Articles by Qu, Y.
Right arrow Articles by Xu, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Qu, Y.
Right arrow Articles by Xu, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?