MBE Advance Access published online on October 6, 2004
Molecular Biology and Evolution, doi:10.1093/molbev/msi002
Molecular Biology and Evolution © Society for Molecular Biology and Evolution 2004; all rights reserved
1 Department of Statistics, Department of Biological Sciences, University of South Carolina, Columbia, SC 29208, USA
* To whom correspondence should be addressed. E-mail: waddell{at}stat.sc.edu.
Assessing the fit of data to model is fundamentally important in phylogenetic and evolutionary studies. Phylogenetic methods using molecular sequences typically start with a multiple alignment. If all sites in all sequences have an unambiguous residue, the fit of the data to the model's expectations can be measured with, for example, the likelihood ratio (G) and/or X² tests. However, nearly all alignments of interest contain sites (columns of the alignment) with missing data, e.g., ambiguous nucleotides, gaps, or unsequenced regions, and these sites must presently be removed before the above tests can be used. Unfortunately, this is often either undesirable or impractical, as it discards much of the data. Here we show how iterative ML estimators may directly estimate the site pattern probabilities for columns with missing data given only standard i.i.d. assumptions. The optimization may use an EM or Newton algorithm, or any other hill-climbing approach. The resulting optimal likelihood under the unconstrained or multinomial model may be compared directly with the likelihood of the data under the model (a G statistic). Alternatively, the modified observed and the expected frequencies of site patterns may be compared using an X² test. The distribution of such statistics is best assessed using appropriate simulations. The new method is applicable to models using codons or paired sites. The methods are also useful with Hadamard conjugations (spectral analysis) and are illustrated with these and with ML evolutionary models that allow site-rate variability.
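As an illustration of the EM route mentioned in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes that missing residues are coded as `?`, that missingness is ignorable (so an incomplete column contributes the summed probability of every complete pattern compatible with it), and that the number of taxa is small enough to enumerate all complete patterns. The names `em_pattern_probs`, `log_likelihood`, and `g_statistic` are hypothetical and introduced only for this example.

```python
import math
from collections import Counter
from itertools import product


def compatible(observed, complete, missing="?"):
    """True if the complete pattern matches the observed one at every non-missing position."""
    return all(o == missing or o == c for o, c in zip(observed, complete))


def log_likelihood(counts, compat, probs):
    """Multinomial log-likelihood (up to a constant) of possibly incomplete columns."""
    return sum(n * math.log(sum(probs[c] for c in compat[obs]))
               for obs, n in counts.items())


def em_pattern_probs(columns, alphabet="ACGT", missing="?", n_iter=1000, tol=1e-12):
    """Estimate complete site pattern probabilities by EM under the unconstrained model."""
    counts = Counter(columns)
    n_taxa = len(columns[0])
    # Toy enumeration: len(alphabet) ** n_taxa patterns, feasible only for few taxa.
    complete = ["".join(p) for p in product(alphabet, repeat=n_taxa)]
    compat = {obs: [c for c in complete if compatible(obs, c, missing)]
              for obs in counts}
    total = sum(counts.values())
    probs = {c: 1.0 / len(complete) for c in complete}
    prev = log_likelihood(counts, compat, probs)
    for _ in range(n_iter):
        # E-step: share each observed count among its compatible complete patterns.
        expected = dict.fromkeys(complete, 0.0)
        for obs, n in counts.items():
            mass = sum(probs[c] for c in compat[obs])
            for c in compat[obs]:
                expected[c] += n * probs[c] / mass
        # M-step: renormalise the expected counts into probabilities.
        probs = {c: expected[c] / total for c in complete}
        current = log_likelihood(counts, compat, probs)
        if current - prev < tol:
            prev = current
            break
        prev = current
    return probs, prev


def g_statistic(loglik_unconstrained, loglik_model):
    """Likelihood ratio statistic G = 2 (lnL_unconstrained - lnL_model)."""
    return 2.0 * (loglik_unconstrained - loglik_model)


# Two taxa, two-letter alphabet: the two "A?" columns are absorbed by pattern "AA".
cols = ["AA"] * 6 + ["BB"] * 2 + ["A?"] * 2
probs, ll = em_pattern_probs(cols, alphabet="AB")
```

In this toy case the estimates approach 0.8 for `AA` and 0.2 for `BB`. The returned log-likelihood could then be passed to `g_statistic` together with the log-likelihood of the same columns under a fitted phylogenetic model, with the null distribution of G assessed by simulation as the abstract recommends.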
Research Article
Measuring the Fit of Sequence Data to Phylogenetic Model: Allowing for Missing Data