MBE Advance Access originally published online on November 1, 2006
Molecular Biology and Evolution 2007 24(2):352-362; doi:10.1093/molbev/msl165
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Research Articles |
Time Squared: Repeated Measures on Phylogenies
,
,||
* Department of Biostatistics, School of Public Health, University of California, Los Angeles
Department of Genetics, Development & Cell BiologyGenetics, Iowa State University
Center for Bioinformatics and Biological Statistics, Iowa State University
Department of Biomathematics, David Geffen School of Medicine at University of California, Los Angeles
|| Department of Human Genetics, David Geffen School of Medicine at University of California, Los Angeles
E-mail: msuchard{at}ucla.edu.
| Abstract |
|---|
|
|
|---|
Studies of gene expression profiles in response to external perturbation generate repeated measures data that generally follow nonlinear curves. To explore the evolution of such profiles across a gene family, we introduce phylogenetic repeated measures (PR) models. These models draw strength from 2 forms of correlation in the data. Through gene duplication, the family's evolutionary relatedness induces the first form. The second is the correlation across time points within taxonic units, individual genes in this example. We borrow a Brownian diffusion process along a given phylogenetic tree to account for the relatedness and co-opt a repeated measures framework to model the latter. Through simulation studies, we demonstrate that repeated measures models outperform the previously available approaches that consider the longitudinal observations or their differences as independent and identically distributed by using deviance information criteria as Bayesian model selection tools; PR models that borrow phylogenetic information also perform better than nonphylogenetic repeated measures models when appropriate. We then analyze the evolution of gene expression in the yeast kinase family using splines to estimate nonlinear behavior across 3 perturbation experiments. Again, the PR models outperform previous approaches and afford the prediction of ancestral expression profiles. To demonstrate PR model applicability more generally, we conclude with a short examination of variation in brain development across 4 primate species.
Key Words: Bayesian evolution development gene expression MCMC
| Introduction |
|---|
|
|
|---|
The success of high-throughput experimentation and databasing are opening the door on comparative studies of biological function. With expanding data sets, these studies now involve repeated measurements recorded within an operational taxonic unit (OTU) and multiple OTUs that range from representing species throughout the tree of Life down to individual genes within gene families. These multiple levels of hierarchy insinuate an evolutionary perspective to these data. However, a dearth of quantitative methods for repeated measures exists to analyze important aspects of this functional evolution.
Microarray techniques provide the opportunity to observe expression levels of many genes simultaneously, allowing one to study how multiple genes respond to common stimuli. Microarray data afford the ability to study the evolution of gene expression for a particular gene family. There is currently growing interest in understanding the evolution of gene function. Gene expression levels can be viewed as a proxy measure for an arbitrary phenotype in more general evolutionary studies.
When environmental conditions change abruptly, living cells must coordinate adjustments in their genome expression to accommodate for the changing environment. Spatial and temporal gene expression patterns are important aspects of gene regulation. Numerous studies have examined genomic expression change in response to environmental shift. Examples include the transcriptional program of sporulation in budding yeast and yeast genome expression response to hyperosmotic (HO) shock and dithiothreitol (DTT) exposure (DeRisi et al. 1997
; Chu et al. 1998
; Gasch et al. 2000
; Causton et al. 2001
). The data here take the form of temporal gene expression profiles from the model organism Saccharomyces cerevisiae.
Gu (2004)
and Oakley et al. (2005)
develop statistical frameworks for the phylogenetic analysis of gene expression evolution. These models focus on the correlation across genes as separate OTUs from a gene family but ignore the correlation between observations within an OTU. Given the repeated measures nature of gene expression profile data, such within-OTU correlation is likely nonnegligible. Properly addressing this correlation is a long-standing statistical problem in phylogenetics (Felsenstein 1973
, 1988
; Oakley et al. 2005
). We incorporate repeated measures analysis methods into a phylogenetic framework to more accurately model both forms of correlation simultaneously.
Building upon the comparative method (Felsenstein 1985
; Rohlf 2001
), Gu (2004)
proposes a Brownian diffusion process (Cavalli-Sforza and Edwards 1967
; Felsenstein 1973
) across a known phylogenetic tree to account for the correlation among genes within a gene family, but he assumes independence between repeated observations within an OTU. Alternatively, he also estimates the correlation between data within genes empirically and then fixes the correlation matrix within genes to calculate likelihood estimates. This fails to correctly incorporate uncertainty in the covariance matrix parameters among the observations within a sampling unit. Oakley et al. (2005)
attempt to reduce experimental correlation by using values of the change in expression from one time point to the next. This approach reduces but does not eliminate the correlation, ignores absolute expression levels entirely, and still falls short in describing the time trend properly in gene expression profiles.
We describe a statistical framework to study the evolution of gene expression using repeated measures random effects models to model the covariance among observations within a gene. Because closely related genes within a gene family tend to have stronger correlation than more distantly related genes, we extend the Brownian process to include these 2 timescales of correlation, the first over evolution and the second between repeated measures. Given such a model, we aim to investigate whether the evolution of gene expression is related to the molecular sequencederived phylogeny. Gu et al. (2002)
show that gene expression patterns remain similar shortly after gene duplication, but the evolution of expression occurs so quickly that the patterns become distinct from each other in a short-period time. This rapid change may mask the effects of neutral evolution of gene expression. In addition, we know that promoters and enhancers located upstream of coding sequence regulate gene expression levels. These phenomena suggest the necessity of testing whether the evolution of gene function is indeed related to the inferred phylogenetic tree. We specify 2 types of models: a phylogenetic model based fully or partially on a Brownian process and a nonphylogenetic model without evolutionary correlation to address this question.
In this paper, we begin with a statistical methods section, which introduces our repeated measures models. Then we proceed to discuss some simulation results and analyze 3 yeast data sets to show that our proposed models perform much better than previously available approaches. We conclude with a short discussion of the limitations and possible extensions to this work.
| Methods |
|---|
|
|
|---|
Reconstructed from molecular sequence data, a phylogenetic tree (
, T) can describe the evolutionary history of a given gene family. Symbolically, a tree for a gene family of size K consists of a rooted, bifurcating topology
and a set of branch lengths T = (t1, ..., t2K 2). The topology
describes the parent relationships among the K OTU (external) and K 2 internal and root nodes. Branch lengths T measure evolutionary distances. Depending on modeling restrictions, the lengths can be proportional to real time or expected number of sequence changes along the branches in
. Here, we take this latter view. Following Mooers et al. (1999)
, T) reconstructed from molecular sequence data, we aim to model the evolution of phenotypic data measured from the OTUs. In this study, we primarily focus on models for gene expression profiles as our phenotype of interest.
A typical gene expression data set for a given gene family of size K can be written in matrix form:
![]() | (1) |
1, ..., K indexes the gene and j
1, ..., n indexes the time points past a perturbation to the system in an experiment. Each datum {yij} records an mRNA expression level as a log2 ratio of this level to a baseline measurement. By using a log transformation, the usual normality assumption on expression ratios becomes more appropriate. Taking Y by its rows and columns, the jth column represents expression levels during a single experiment across the gene family, and the ith row reports the expression profile across experiments for 1 gene and is denoted as yi.
Gene expression data within a gene family contains potentially 2 forms of correlation. One form is the correlation induced by the gene family relationship. These genes are evolutionarily related, created by successive gene duplication. The other form is the correlation among the repeated measurements across time points within a given gene. We borrow a Brownian diffusion process along a reconstructed phylogenetic tree to account for correlation among genes as OTUs (Gu 2004
) and co-opt a repeated measures approach to model the correlation between observations within an OTU. To reduce possible confusion between these 2 correlation types when specifying our proposed models, we will adopt a 2-letter naming system. The first letter denotes how the model handles the evolutionary correlation across the gene family. Possible values are "P" for phylogenetic models using a Brownian diffusion process and "N" for nonphylogenetic models that assume independence across genes. The second letter signifies how the model deals with the correlation among repeated measures, taking values "I" for independence across measures models and "R" for repeated measures models.
PR Model
We begin with a fully phylogenetic repeated measures PfullR model. Here, "fully" indicates that the Brownian diffusion process along the reconstructed tree completely explains the correlation across a gene family. Later, we consider mixture approaches.
The PfullR model assumes that the gene-specific data yi at tip i are sampled from a common distribution fy(yi|ßi), parameterized in terms of an underlying, continuous parameter
The model takes the form
|
| (2) |
i
Nn(0,
2In). Here, Nn(·,·) specifies a n-dimensional multivariate normal distrbution,
2 is the residual error variance, and In is an n x n identity matrix. The function f(ßi,Xi) provides a family of mean response curves for the genes across measurements. This curve family can be parametric or nonparametric. For linear choices in ßi, we can construct Xi from polynomial, fourier, or spline bases, as seen appropriate for the data. Our model then reduces to a linear random effects model
|
| (3) |
The key point in this construction is to connect the random effects ßi to the phylogenetic tree (
, T), such that more closely related genes have stronger correlation in expression than more distantly related genes.
Brownian Diffusion Process for Random Effects
Taking the lead from Gu (2004)
, we extend the basic Brownian model (Cavalli-Sforza and Edwards 1967
; Felsenstein 1973
) to specify the correlation between the random effects ßi's at the tree tips. The basic Brownian model derives from a Brownian motion process along the branches in (
, T). This process generates random, normally distributed increments in the underlying random effects along each branch with mean 0 and a variance scaled by the branch length. The process further assumes independent evolution between lineages. To begin the process, we assume an unknown random effect ß2K1 at the root node in the topology
drawn from a normal distribution,
|
| (4) |
root is the prior variance matrix. Conditional on the realized value of ß2K1, parameter values at the 2 descendant nodes are generated by mutually independent Brownian motion processes along the branches connecting the nodes to their parent. This process then repeats down the rest of the tree. Let ßpa(i) specify the parent node of node i, then the Brownian process between node i and its parent specifies that
|
| (5) |
![]() | (6) |
Matrix
is an unknown positive definite q x q matrix. We posit a Wishart prior over the inverse of
. Let ßtip = (ß'1,ß'2,...,ß'K)'. Based on the independent increments of the Brownian process and independent evolution between lineages, the marginal distribution of ßtip is normally distributed and the joint distribution of ßtip is multivariate normal. For instance, when q = 1, we return a random intercept model with
|
| (7) |
This case represents a 1-dimensionalBrownian motion process along the given tree. When q = 2, we formulate a bivariate Brownian process along the given tree with
|
| (8) |
As q increases, the number of parameters increases. By forming orthogonal bases to construct Xi, we can reduce the number of parameters in matrix
that need to be estimated. For an orthogonal basis, we reach a componentwise independent model with
![]() | (9) |
The joint distribution of the proposed PfullR model for a given tree in a Bayesian framework can be written as
![]() | (10) |
Imputation of Missing Data and Prediction of Ancestral Gene Expression
Typically, missing values proliferate in microarray data. In a fully Bayesian framework, missing data are treated as random parameters. We then estimate all model parameters and impute missing data simultaneously. This avoids losing information as happens when researchers delete all gene expression observations at one time point when only 1 or a few missing values occur.
Ancestral state reconstruction along a phylogenetic tree is of paramount importance in evolutionary biology. The Markov property of the Brownian diffusion process allows us to reconstruct ancestral expression patterns and to trace the evolutionary changes in gene expression levels. Let k
(K + 1, ..., 2K 1) index 1 of the internal nodes, then the distribution P(Xkßk|Y) provides a prediction of gene expression levels at that internal or root node. Such posterior predictive distributions can be simulated without much additional computational cost at the same time the observed data are fit. Motivated readers can find a complete description of how to simulate from these posterior predictive distributions in the model source code (see Acknowledgments).
Nonphylogenetic Repeated Measures Model
The nonphylogenetic repeated measures (NR) model assumes that closely related genes are no more likely than more distantly related genes to share similar expression patterns. We illustrate the difference between the NR and PfullR models using a simple example. Suppose ßi is one-dimensional. Given ßi,
|
| (11) |
Under a nonphylogenetic model,
|
| (12) |
non is a conditional variance. Comparing equations (5) and (12) helps distinguish these models. First, observations between taxa are conditional independent in the NR model given the root value but correlated according to the reconstructed phylogenetic tree in the PfullR model. The second difference is that variance
non is common to ßi for all i in the NR model. This independent and identically distributed (iid) construction can be best understood by examining the induced tree structures for these 2 models depicted in figure 1. The NR model is equivalent to a star tree with equal branch lengths. The star tree is a special case of a phylogenetic tree under the assumption that all internal branch lengths take value zero. To see this, in figure 1, when the branch length connecting nodes 4 and 5 collapse to zero, the phylogenetic tree reduces to a star. As a consequence, the nonphylogenetic model represented here is a special case of the more general phylogenetic model.
|
With no further information regarding
non and µroot, we use vague but proper priors, such as |
|
|
| (13) |
When ßi has higher dimensionality, q > 1, we can use an inverseWishart distribution to replace the inversegamma distribution to model the correlation between components of ßi.
Partially Phylogenetic Repeated Measures Model
To increase flexibility, we propose a partially phylogenetic repeated measures, PpartialR, model. This model allows us to investigate which aspects of gene expression profiles are related to the underlying phylogenetic tree. In addition to strictly phylogenetically controlled factors, there may exist other potential factors that also affect gene expression but vary independent of the tree. To handle these situations, we evoke the idea of an additive model. Let the data yij take the following form:
|
| (14) |
ij
N(0,
2). This model is additive in terms of the factors. Based on the background knowledge or the research hypothesis at hand, we can model phylogenetic factors using the Brownian model proposed in previous section and model the remaining factors using general repeated measures methods. We illustrate 2 types of partial models in this section.
Random Effects Model
Suppose we have only 2 factors, factor 1 and factor 2. We know a Brownian process along the phylogenetic tree controls factor 2, whereas factor 1 varies independent of the tree. The model then takes the form:
|
| (15) |
Random effects ßi are controlled by a phylogenetic model. Random effects
are the coefficient vector of the nonphylogenetic factor. We assume a flexible hierarchical prior
|
| (16) |

1
Wishart(
,(
)1), where
counts the degrees of freedom (df) for the prior and
is a prior point estimate for the covariance matrix. An uninformative assumption for the prior mean is µ
Nq1(0,100Iq1). This model separates the phylogenetic factor from other factors via an additive construction. Because this model may include nonphylogenetic factors, we cannot predict ancestral node expression levels solely based on the current model. However, we can still achieve the missing value imputation. Although we illustrate only a 2-factor random effects model, this approach can be easily extended to incorporate a larger number of factors.
Mixed Effects Model
It is not always possible to control all experimental conditions consistently. To control for varying conditions, suppose we identify potential factors between experiments that affect overall gene expression levels. We add these factors to our model as fixed effects. For instance, these fixed effects may be an intercept that affect the common population mean or a linear function of covariates to control for systematic differences between microarray experiments and laboratories. For such models, we write
|
| (17) |
are the fixed effect with design matrix Zi and ßi are the random effect under the random effects model proposed in the previous section. The advantage of this model is that we can predict the effect of covariates for the entire gene family and achieve a better estimation of individual gene expression levels.
Implementation
Covariate Model Selection
We perform data exploration to find which forms of linear models are appropriate candidates for the data. For a given gene family, we draw profile plots for each cluster of gene expression levels. By viewing the profile plots, we can discern if polynomial linear or quadratic terms are sufficient to explain the time trends of the data. When the plots show that we have irregular trends for some gene clusters and enough observations within clusters, we use a spline model to capture the nonlinear dynamics.
Bayesian Inference Method
We employ Bayesian inference to fit the data and select among models. The advantage of Bayesian inference is that it takes account of the uncertainty of all parameters. We use WinBUGS (Spiegelhalter et al. 2002
; WinBUGS source code to implement PR measures models are available at http://www.biomath.edu/msuchard/software.html) to construct Markov chain Monte Carlo (MCMC) chains that generate random samples from model posterior distributions. All full conditional distributions for the model parameters are available. As a consequence, each MCMC chain reduces to a Gibbs sampler (Robert and Casella 2004
). Based on the posterior distribution of the model parameters, we produce parameter estimates and perform hypothesis testing. For computing convenience, the deviance information criterion (DIC) is used for model selection in our Bayesian inference setting (Spiegelhalter et al. 2002
). DIC tackles both the goodness of fit and model complexity simultaneously using an approximate decisiontheoretic justification. Let the model deviance D = 2log(likelihood). Then the DIC takes the form
|
| (18) |
|
| (19) |
is the posterior mean of the deviance and provides a Bayesian measurement of goodness of fit and pD is an estimate of the effective number of model parameters, a measure of model complexity. Effective df pD is the posterior mean of the deviance minus the deviance of the posterior means. The DIC works as a model selection criterion in a posterior framework as the counterpart to the Akaike information criterion in a likelihood framework. | Results |
|---|
|
|
|---|
We assess the performance of our PR models on both simulated data and cDNA expression array results for the kinase gene family in yeast.
Simulations
To investigate model performance, we simulate sets of artificial gene expression data and employ the DIC as a criterion to compare models. We base our simulations on the 8- and 22-taxon trees displayed in figure 2. Specifically, we simulate
|
|
|
|
|
| (20) |
2 = 1, and
2 = 0.1. These conditions reflect 10 repeated observations at the gene tree tips with a 10:1 variance ratio
2/
2. Branch lengths tk are given by the fixed trees. The total tree lengths equal 3.5 in the 8-taxon case and 1 in the 22-taxon case. In total, we simulate 100 data sets and fit the PfullR model with q = 1 (the true model in this case), the NR model, and the phylogenetic independent (PI) model. For each model and data set, we estimate DICs and compute differences among models.
|
Table 1 summarizes the DIC differences between the 3 models. To interpret the results, a lower DIC corresponds to a better model, so a
DIC < 0 supports the first model in the pair and a
DIC > 0 supports the second model. For the q = 1, 8-taxon case in table 1, the two repeated measure models, PfullR and NR, perform much better than the PI model that assumes independent replicates within an OTU. DIC differences favor the repeated measure models in 100% of data sets. The median DIC for the PI model is approximately 240 units larger. Comparing the PfullR and NR models, the PfullR model has a lower DIC in 90% of data sets. Examining the q = 1, 22-taxon case in table 1, the median DIC difference between the PfullR and NR models hovers around 3, noticeably higher than the difference for the 8-taxon case. This suggests that the magnitude of the difference might increase as the number of taxa grows.
|
To evaluate repeated measures model performance as the dimension of random effects increases, we extend the simulation to the q = 2 situation. We simulate 100 sets of gene expression data assuming the 8-taxon tree in figure 2 according to
|
| (21) |
2 = 0.1,
|
| (22) |
To these data, we apply the PfullR, NR, and PpartialR models. For the q = 2, 8-taxon case in table 1, the PfullR model has a lower DIC value than the NR model with a DIC difference greater than 0 in 95% of data sets, indicating that the PfullR model fits appreciably better than the NR model. As the dimensionality q increases, the randomness across the tree induced by the random effects grows. As a consequence, it may be more difficult to differentiate a PfullR model from a PpartialR model. When
2 increases to 0.5, as seen in the condition 4 of table 1, the PR models and NR model become difficult to distinguish, with a
DIC > 0 in around 60% of data sets.
To gain intuition into these results, the inference of ßtip derives from 2 sources,
|
| (23) |
The first source P(Y|ßtip,
2) is the data likelihood given the random effects ßi. The second source P(ßtip|µroot,
) models the correlations among ßi. We call P(ßtip|µroot,
) the process prior on ßtip. Different repeated measures models induce different forms on the process prior, but the data likelihood given the random effects keeps the same form. Although the PfullR and NR models are both repeated measures models, their process priors are still different. The former model assumes Brownian diffusion, and the latter assumes independence given the root. We consider the estimation of ßtip and ßtip's effects on the likelihood value under 3 conditions. These conditions vary the relative sizes of the variance attributable to the diffusion process and the variance within tip replicates, approximately trace(
)/
2.
When
2 Is Relatively Smaller
We begin by considering the simulated 8-taxon case where q = 1. Let
2 be small compared with trace(
) following our previous simulation. Here, the inference of ßtip is comparatively precise directly from the data, and the information from the process prior contributes little to this inference. The data likelihoods are almost equal for the various repeated measures models. This leads to the model comparison results in condition 2 of table 1. The DIC difference is small between the NR model and the PfullR model.
When
2 Is Comparable to Trace(
)
Here, information about ßtip flows from both directions. The underlying true model across the tree contributes more to the inference than in the previous case. The data likelihoods are strongly affected by the modeling assumptions. Increasing q to 2 generates increased flexibility in the process prior. With this flexibility, it becomes easier to distinguish between phylogenetic and nonphylogenetic models. We see this phenomenon by comparing the NRPfullR differences under condition 2 and condition 3 in table 1. The price to be paid is that fewer phylogenetic factors are required to describe the data, as long as at least some remain phylogenetic, demonstrated as the difference PpartialR PfullR becomes negligible in condition 3.
When
2 Is Relatively Larger
In this final case, the different repeated measures models become indistinguishable because of the large variance in the data. The simpler model becomes the winner because the DIC penalizes model complexity. Condition 4 of table 1 reflects this situation.
As a rule of thumb, when the data possess strong correlation across time, the PR models perform much better than the PI model. When the variance within tip replicates is comparable to the process prior variance
, phylogenetic models perform better than nonphylogenetic model.
Kinase Gene Family in Yeast
We analyze 3 data sets focusing on the kinase gene family in yeast. Kinases are enzymes that activate target molecules by phosphorylation. The largest group of kinases is protein kinases, which act on and modulate the activity of specific proteins. Because cells use kinases extensively to transmit signals and control complex processes, kinases play important roles in homeostatus and should have substantial expression level changes following environmental shifts.
We use the kinase gene family partitioned from the yeast proteome by Oakley et al. (2005)
. They compare all yeast genes with each other using FASTA and adopt 2 strict criteria for partitioning genes into families: 1) FASTA-alignable region should be at least 80% of the longer gene and 2) the similarity of two genes must be greater than a specified threshold. These 2 stringent criteria give rise to a 7-member kinase gene family. For this family, Oakley et al. (2005)
provide a reconstructed phylogenetic tree assuming a molecular clock. We use this tree, presented in figure 3 (left panel), for our study.
|
The kinase gene family tree roughly divides its members into 3 separate clades. In clade A, we find genes YBR160W, YPL031C, and YDL108W. Clade B genes consist of YPR054W, YBL016W, and YGR040W. Gene YMR139W appears to be the most distantly related among the family, constituting its own clade C. All 7 coding sequences in this family function as protein kinases. In clade A, YBR160W codes for a catalytic subunit of a main cell cycle cyclin-dependent kinase. YPL031C is also a cyclin-dependent kinase, involved in the environmental stress response. YDL108W is the serine/threonine protein kinase subunit of transcription factor TFIIH. This factor is involved in transcription initiation at RNA polymerase II promoters. The genes in clade B code for mitogen-activated protein kinases. YBL139W codes for a protein kinase required for signal transduction during entry into meiosis.
We examine the evolution of gene expression through 3 cDNA microarray experiments. The first 2 studies come from Gasch et al. (2000)
, who measured changes in transcript levels over time as cells responded to HO shock and disulfide-reducing agent DTT exposure. Gasch et al. (2000)
induced HO shock by adding 2M sorbitol to cultured cells and collected and measured mRNA levels at 5, 15, 30, 60, 90, and 120 min following the shock. These 7 measurements define the repeated measures for each gene. In the second experiment, Gasch et al. (2000)
added DTT at a concentration of 2.5 mM to cultured cells. The researchers then took measurements at 15, 30, 60, 120, 240, and 480 min, yielding six repeated measures. In the final experiment, Chu et al. (1998)
studied the transcriptional program of sporulation in budding yeast. They first exposed yeast cells to a nitrogen-deficient medium to induce sporulation. Then they measured mRNA transcripts levels at 0, 0.5, 2.0, 5.0, 7.0, 9.0, and 11.0 h, generating seven repeated measures for each gene.
In addition to the gene family tree, figure 3 depicts the expression profiles for the gene family members across the 3 experiments. In the HO shock experiment, profiles from clade A and clade B genes appear similar within clades but different across clades, suggesting higher correlation within a clade than across clades. One may expect that the PfullR model fit these data well. In the second column, the DTT experiments yield expression profiles for genes in clade A that are highly correlated and profiles in clade B that vary among themselves considerably but still demonstrate an increasing trend. Expectations are uncertain here. The expression profiles for sporulation experiments vary both within clades and across clades. The genes in clade B are mitogen related, so we suspect that their functional similarity may correlate 1 or more components in the profiles, supporting the PpartialR model.
Under the 3 experimental conditions, the time trends of expression profiles do not appear to follow a simple parametric form, such as a linear- or quadratic-in-time pattern. This suggests that a nonparametric approaches, such as splines, are better suited to model the repeated measures trends. Spline models accommodate complicated time trends by breaking the trends into piecewise continuous polynomials (Hastie et al. 2001
). The cut points between these polynomials are called knots. We employ natural cubic splines to generate a smooth function over time, which have continuous first and second derivatives at the knots. Beyond the boundary knots, natural cubic splines are constrained to be linear. For the three data sets, we have six or seven repeated measures. We assign two internal knots at the 33.3 and 66.7 percentiles of the time points, yielding spline bases Xi, which have four df parameterized by ßi.
We fit the PfullR, PpartialR, and NR repeated measures models and the PI model to these data sets. Under the Ppartial R model, we separate the intercept as the phylogenetic random effects and time effects as nonphylogenetic random effects. Table 2 summaries our model comparisons by reporting the DIC for each data set under each model. DIC estimates for the PI model applied to all 3 data sets are much larger than those for the repeated measure models. These differences hover around 20. This finding is consistent with our simulation studies where the true model contained repeated measures. As expected for the HO shock data, the PfullR model represents the best candidate with the lowest DIC of 28.6. For the DTT exposure data set, the PfullR model remains the best choice with a DIC of 25.6. However, for sporulation data, the PpartialR model claims the top spot with a DIC of 51.9. Returning to the DTT data set, a quadratic-in-time polynomial basis may provide a more parsimonious description of the data than splines. We refit the DTT data set using a quadratic polynomial basis. A DIC difference of 7.7 between bases continues to support the nonparametric pattern.
|
In figure 4, we plot the posterior predictive expression profiles for each gene and ancestral nodes for the HO data set. The model successively captures the time trends of the expression data. We also provide pointwise 95% Bayesian credible intervals for all time points. Using these intervals as model diagnosis, figure 4 hints that the model does not provide a fair prediction for the second time points. We suspect this results because of rapid changes between time points 1 and 2, whereas the splines are constrained linear beyond the boundary points. Most importantly, the figure provides the inferred ancestral expression profiles. The predicted ancestral expression profiles well capture the overall pattern of descendant expression profiles. In general, this inference is paramount in the study of gene expression evolution. From the expression estimated at ancestral nodes, researchers can predict the unknown expression pattern of unobserved external nodes and explore points of substantial divergence in the evolutionary history of the gene family.
|
| Discussion |
|---|
|
|
|---|
We demonstrate that repeated measures models significantly improve model performance compared with the independent PI model suggested by previous articles (Gu 2004
Although our primary focus in this paper centers on expression profile evolution, we believe PR models will also find success more generally across evolutionary biology. We provide a brief illustration examining ontogenetic trajectories and heterochrony (Rice 1997
). Rice (2002)
compiles brain weights as a function of time after conception for 4 primate species. Considering these data as repeated measures, we fit a spline-based PR model. Figure 5 provides posterior predictive estimates of brain growth curve changes along 2 important branches in the phylogeny (Schrago and Russo 2003
; Raaum et al. 2005
). Rice (2002)
compares human, chimpanzee, and macaque growth curves, and concludes that the human and chimp curves are commensurate, whereas the human and macaque are not. From these findings, Rice (2002)
infers that a substantial change in brain development timing occurred after the divergence of Old World monkeys and apes but before the most recent common ancestor of humans and chimps. The posterior predictive estimates offer a direct visualization of inferred changes along these branches. Indeed, the human and humanchimp ancestor curves appear commensurate with 2 phases of growth, whereas the Old World monkey/ape ancestor has a single growth phase with a more rapid decrease in growth rate around birth. For these types of studies, formal statistical tests like that for sequential hypermorphosis (Rice 2002
) are straightforward to develop in this Bayesian framework. Finally, these growth curve data are unbalanced, having different numbers of measurements and observation times across OTUs. Although easily incorporated into a PR model, unbalanced data violate assumptions of earlier models by Gu (2004)
and Oakley et al. (2005)
and many analysis of variancebased approaches.
|
Many repeated measures phenotypes, such as the expression profiles and growth curves we analyze, do not follow simple linear trends. We circumvent this limitation through less parametric forms. Here we use natural cubic splines. However, the lack of fit of second time points in figure 4 demonstrates that linear constraints beyond the boundary points should be relaxed or even more flexible bases may be necessary. The specific phenotype of interest determines the appropriate degree of flexibility that is necessary. For example, periodic repeated measures may follow the rich family of basis functions generated by Fourier or wavelet transformations.
The complexity of repeated phenotypic data provides opportunities for a growing class of models. This growth necessitates the use and inherent challenges in model selection. Here, we rely on the DIC as a model selection criterion. The DIC remains valid under the assumption that the posterior mean of all model parameters provides a good estimate; hence, the DIC is not appropriate for multimodal or skewed posterior distributions. Multimodality is likely to become a problem when the evolutionary history is poorly known. Furthermore, the DIC is not consistent as sample size increases (Spiegelhalter et al. 2002
). In a Bayesian framework, a better approach is to use Bayes factors (BFs) to select models because the BF integrates out all unknown parameters and puts emphasis on models themselves (Kass and Raftery 1995
). Methods to calculate BFs for PR measures modes are warranted.
The models we outline describe continuous outcomes under a normality assumption. Some gene function measurements are discrete, for instance, resistance to a particular drug or not. The proposed models can be extended to generalized linear mixed models for discrete longitudinal data, although still incorporating the correlation across taxa. Further, we assume that we have a known evolutionary history tree. This ignores any uncertainty in this tree. Extensions to jointly infer the evolutionary history given sequence data and model the phenotype data circumvent these limitations and may provide a gain in efficiency to topological inference and parametric inference for phenotype model.
| Acknowledgements |
|---|
|
|
|---|
We thank Todd Oakley for encouragement and helpful discussions.
| Footnotes |
|---|
Jody Hey, Associate Editor
| References |
|---|
|
|
|---|
Causton HC, Ren B, Koh SS, Harbison CT, Kanin E, Jennings EG, Lee TI, True HL, Lander ES, Young RA. (2001) Remodeling of yeast genome expression in response to environmental changes. Mol Biol Cell 12:323337.
Cavalli-Sforza L and Edwards A. (1967) Phylogenetic analysis: models and estimation procedures. Am J Hum Genet 19:233257.[Web of Science][Medline]
Chu S, DeRisi JL, Mulholland J, Brown PO, Herskowitz I. (1998) The transcriptional program of sporulation in budding yeast. Science 282:699705.
DeRisi JL, Lyer VR, Brown PO. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278:680686.
Felsenstein J. (1973) Maximum likelihood estimation of evolutionary trees from continuous characters. Am J Hum Genet 25:471492.[Web of Science][Medline]
Felsenstein J. (1985) Phylogenies and the comparative method. Am Nat 125:115.
Felsenstein J. (1988) Phylogenies and quantitative characters. Annu Rev Ecol Syst 19:445471.[CrossRef][Web of Science]
Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botsein D, Brown PO. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11:42414257.
Gu X. (2004) Statistical framework for phylogenomic analysis of gene family expression profiles. Genetics 167:531542.
Gu Z, Cavalcanti FC, Chen P. (2002) Extend of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol Biol Evol 19:256262.
Hastie T, Tibshirani R, Friedman J. (2001) The elements of statistical learning. (Springer, New York).
Kass RE and Raftery AE. (1995) Bayes factors. J Am Stat Assoc 90:773795.[CrossRef][Web of Science]
Mooers AO, Vamosi SM, Schluter D. (1999) Using phylogenies to test macroevolutionary hypotheses of trait evolution in cranes (Gruinae). Am Nat 154:249259.[CrossRef]
Oakley TH, Z Gu EA, Patel NH, Li W. (2005) Comparative methods for the analysis of gene-expression evolution: an example using yeast functional genomic data. Mol Biol Evol 22:4050.
Raaum R, Sterner K, Noviello C, Stewart CB, Disotell T. (2005) Catarrhine primate divergence dates estimated from complete mitochondrial genomes: concordance with fossil and nuclear DNA evidence. J Hum Evol 48:237257.[CrossRef][Web of Science][Medline]
Rice SH. (1997) The analysis of ontogenetic trajectories: when a change in size or shape is not heterochrony. Proc Natl Acad Sci USA 94:907912.
Rice SH. (2002) The role of heterochrony in primate brain evolution(Johns Hopkins University Press, Baltimore (MD)) pp. 154170.
Robert CP and Casella G. (2004) Monte Carlo statistical methods(Springer, New York).
Rohlf FJ. (2001) Comparative methods for the analysis of continuous variables: geometric interpretations. Evolution 55:21432160.[CrossRef][Web of Science][Medline]
Schrago C and Russo C. (2003) Timing of the origin of new world monkeys. Mol Biol Evol 20:16201625.
Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. (2002) Bayesian measures of model complexity and fit. J R Stat Soc B 64:583639.[CrossRef]
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








