MBE Advance Access originally published online on March 20, 2006
Molecular Biology and Evolution 2006 23(6):1217-1231; doi:10.1093/molbev/msk006
© The Author 2006. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org
A Unique Recent Origin of the Allotetraploid Species Arabidopsis suecica: Evidence from Nuclear DNA Markers
Mattias Jakobsson*,
,
Jenny Hagenblad
,
Simon Tavaré
,||,
Torbjörn Säll
,
Christer Halldén¶,
Christina Lind-Halldén# and
Magnus Nordborg
* Bioinformatics Program, Department of Human Genetics, University of Michigan;
Department of Cell and Organism Biology, Genetics, Lund University, Lund, Sweden;
Department of Biology, Linköping University, Linköping, Sweden;
Molecular and Computational Biology, University of Southern California; || Department of Oncology, University of Cambridge, Cambridge, United Kingdom; ¶ Department of Clinical Chemistry, Malmö University Hospital; # Department of Mathematics and Natural Sciences, Kristianstad University, Kristianstad, Sweden
E-mail: mjakob{at}umich.edu.
 |
Abstract
|
|---|
A coalescent-based method was used to investigate the origins
of the allotetraploid
Arabidopsis suecica, using 52 nuclear
microsatellite loci typed in eight individuals of
A. suecica and 14 individuals of its maternal parent
Arabidopsis thaliana, and four short fragments of genomic DNA sequenced in a sample
of four individuals of
A. suecica and in both its parental species
A. thaliana and
Arabidopsis arenosa. All loci were variable
in
A. thaliana but only 24 of the 52 microsatellite loci and
none of the four sequence fragments were variable in
A. suecica.
We explore a number of possible evolutionary scenarios for
A. suecica and conclude that it is likely that
A. suecica has a
recent, unique origin between 12,000 and 300,000 years ago.
The time estimates depend strongly on what is assumed about
population growth and rates of mutation. When combined with
what is known about the history of glaciations, our results
suggest that
A. suecica originated south of its present distribution
in Sweden and Finland and then migrated north, perhaps in the
wake of the retreating ice.
Key Words: Arabidopsis suecica Arabidopsis thaliana polyploidy polymorphism speciation founders
 |
Introduction
|
|---|
It has long been recognized that polyploidy has played an important
role in speciation, especially in plants (Stebbins 1971

). In
contrast with other forms of speciation, polyploidization is
instantaneous: a bona fide "speciation event." new polyploid
is isolated from its parental species because backcrossing to
either parent is likely to produce sterile offspring. The new
species will therefore evolve independently of its parental
species, and if the polyploidization occurs only once there
will be no additional gene flow into the species once it is
formed. Indeed, even if it occurs several times, different polyploid
lineages may well be isolated because of the rapid chromosome
rearrangements (Song
et al. 1995

; Pontes
et al. 2004

) and changes
in gene expression (Chen, Comai, and Pikaard 1998

; Comai
et al. 2000

; Osborn
et al. 2003

) that may accompany polyploidization.
The fraction of plant species that are of polyploid origin has
been debated; however, it increasingly seems as if this is not
a meaningful question. If polyploid species become "diploidized"
over time, then it may well be that all plant species have polyploid
ancestors (Ku
et al. 2000

; Wendel 2000

). Although comparative
genomics can detect more ancient events than classical cytology,
we will never be able to detect all events.
Arabidopsis suecica (Fr.) was first described as an allotetraploid (2n = 26) by Hylander (1957)
, who suggested Arabidpsis thaliana and Arabidopsis (Cardaminopsis) arenosa as its parental species, a suggestion since supported by molecular phylogenetics (Kamm et al. 1995
; O'kane, Schaal, and Al-Shehbaz 1996
). A. suecica is found in Sweden and southern Finland (Hultén 1971
) and it is highly, but not completely, selfing (Säll et al. 2004
). A. thaliana is a highly selfing, cosmopolitan weed, familiar as the leading model plant. It is diploid (2n = 10), although tetraploid A. thaliana individuals are occasionally found (L. Comai and M. Koornneef, personal communication). A. arenosa is a European species which is tetraploid (2n = 4x = 32) in most of its range and reportedly diploid (2n = 2x = 16) in a small area in eastern Europe (Mesicek 1970
). Our observations in the greenhouse indicate that A. arenosa is self-incompatible and thus an obligate outbreeder (Säll et al. 2004
). Several lines of evidence, including most recently cpDNA sequencing, have identified A. thaliana as the maternal parent of A. suecica (Mummenhoff and Hurka 1994
; Price, Al-Shebaz, and Palmer 1994;
Comai et al. 2000
; Säll et al. 2003
). Furthermore, when producing "artificial" A. suecica through crosses, Comai et al. (2000)
only succeeded in producing viable offspring when A. thaliana was the maternal parent. Given this, there are still several possible routes through which A. suecica could have originated (table 1). The most likely scenario would appear to involve a normal male gamete from a tetraploid A. arenosa fertilizing either a normal female gamete from a tetraploid A. thaliana (which are rare, but do exist), or an unreduced female gamete from a diploid A. thaliana.
View this table:
[in this window]
[in a new window]
|
Table 1 Possible Scenarios for the Origin of Arabidopsis suecica (the number of chromosomes is indicated in parenthesis)
|
|
An important aspect of the formation of a polyploid species
is the number of independent origins. Many allopolyploid species
are believed to have multiple origins (D. Soltis and P. Soltis
1993

). The traditional way to investigate this question is to
produce a phylogenetic tree based on data from different accessions
of the allopolyploid and its potential parental species. If
all the polyploids form a single branch within the tree, there
is evidence of a single origin, whereas if the allopolyploids
appear at different positions relative to the parent, there
is evidence of multiple origins. This method was used to analyze
cpDNA from
A. suecica, and the results indicated a single origin
(Säll
et al. 2003

). Important drawbacks of this approach
include that it is limited to non-recombining data, such as
cpDNA in plants and mtDNA in animals, and that it provides no
measure of statistical confidence in the conclusions. The genealogy
of a single locus does not necessarily reflect the history of
the species (Rosenberg and Nordborg 2002

).
In this paper we introduce a Bayesian multi-locus method based on coalescent theory that can be used to infer the number of origins and the time of the origin of a polyploid. The method is based on a rejection algorithm where sets of demographic parameters are simulated under a demographic model of two species (one "parent" and one allopolyploid "offspring" species). We utilize our method to analyze two different data sets: the first is 52 nuclear microsatellite markers from 8 A. suecica and 14 A. thaliana individuals; the second is a smaller data set of four (
0.5 kb each) sequenced fragments of nuclear DNA from each of four individuals of A. suecica, as well as from the two parental species A. thaliana and A. arenosa. We conclude that there has been a single origin of A. suecica between 12,000 and 300,000 years ago, although the time-of-origin estimates are sensitive to modeling assumptions.
 |
Materials and Methods
|
|---|
Sequencing
DNA was extracted from fresh leaves using Qiagen miniprep kits.
Four
A. suecica individuals from four geographically distinct
locations covering much of the species range were sampled (
table 2).
Four short fragments located on chromosome 4 in
A. thaliana were polymerase chain reaction (PCR)-amplified in each of the
four
A. suecica individuals, using primers designed from the
A. thaliana genome sequence (Table S1, Supplementary Material
online). The primers had previously been used to amplify and
sequence four fragments in 20 accessions of
A. thaliana (see
Nordborg and Bergelson (1999)

and Hagenblad and Nordborg (2002)
for details on these accessions) and in one
A. arenosa accession
from Piteå, northern Sweden (6518N, 2131E). Attempts were
made to amplify the fragments from one individual of each of
the outgroup species
Capsella bursa-pastoris and
Cardamine hirsuta,
both of which were sampled in Lund, southern Sweden (5542N,
1311E). PCRs were run with standard Taq supplied by Roche Diagnostics
(Basel, Switzerland) and according to the supplier's recommendations.
Sequencing was done on a Beckman Coulter CEQ 2000 XL sequencer,
using reagents supplied by the manufacturer. A fourfold dilution
of the reagents was used for sequencing; apart from this, standard
protocols were used. All fragments were sequenced in both directions
and sequences were aligned and evaluated in Sequencher 4.1 (GeneCode,
Ann Arbor, Mich.). Errors in basecalling were corrected manually
after the chromatograms were examined.
For the allotetraploid
A. suecica individuals, the PCR products
were cloned in order to allow individual alleles to be sequenced.
Cloning was done with the pGEMT Easy Vector System using standard
protocols. Clones were then selected randomly and used for an
additional round of PCR amplification to provide template for
direct sequencing of ten clones from each individual and fragment.
Assuming no bias in cloning or PCR amplification, this gives
a 99.9% chance of amplifying at least one copy of each of the
two ancestral allelic types (
A. arenosa and
A. thaliana).
Once sequences from the parental genomes in A. suecica had been obtained by sequencing cloned fragments, we designed allele-specific primers to directly amplify and sequence each sequence type. This allowed us to distinguish variation introduced by cloning and PCR from natural variation.
Sequencing of cloned PCR fragments from A. suecica genomic DNA resulted in sequences that could easily be designated as thaliana-like or arenosa-like by comparison with sequences from the ancestral species. For each primer pair, there were always more clones with thaliana-like than with arenosa-like sequences, and for one fragment, no arenosa-like sequences at all were obtained even though an additional 10 clones were sequenced. The bias toward thaliana-like sequences was not surprising, as the primers used for PCR amplification were designed using the A. thaliana genome sequence.
Recombinant fragments were found in 6 out of the original 40 clones sequenced. Recombination between chromosomes of different parental origin is possible but seems unlikely, and allele-specific primers revealed that the recombinant fragments were indeed cloning artifacts.
In summary we successfully sequenced both the arenosa-like and the thaliana-like alleles for three out of the four fragments. For the fourth fragment, only a thaliana-like sequence was obtained (table 3). There was no variation within allelic classes, either within or between individuals.
Multiple alignments were done in Sequencher, with additional
adjustments by hand. Genealogical trees were drawn using the
phylogeny program PAUP 4.0.0b10 (Sinauer Associates, Sunderland,
USA) or by hand (as the sequences were on the whole very similar).
The trees were rooted using outgroup sequences when these were
available. NCBI annotation for
A. thaliana BAC clone F6N23 (accession
number
AF058919) was used to determine the positions of exons
and introns in the fragments.
For the outgroup species, we used whichever species that was successfully amplified and sequenced. This turned out to be Cardamine hirsuta for two fragments and Capsella bursa-pastoris for a third. No outgroup was obtained for the fourth fragment (see table 3, below). The outgroup sequence was used to assign mutations to either the branch of A. thaliana or A. arenosa. Mutations that could not be assigned to any branch were divided evenly between the two branches. The mutation rates were then calculated by dividing the number of mutations on each branch by the product of the length of the sequence and the time since divergence of the two species.
Microsatellite Analysis
A total of eight A. suecica individuals from geographically distinct locations covering most of the species range were sampled in addition to 14 A. thaliana individuals (table 4). DNA was isolated from young greenhouse-grown offspring of the sampled plants using a Plant DNeasy kit from Qiagen according to the manufacturer's instructions. DNA integrity was checked by agarose gel electrophoresis and DNA concentration was determined using fluorometry with Pico Green (Molecular Probes, Carlsbad, Calif.). A total of 52 microsatellite loci were identified from the complete sequence of A. thaliana (version 2.0) using Tandem Repeat Finder (Benson 1999
) and appropriate primers (C. Lind-Halldén, C. Halldén, M. Jakobsson, and T. Säll, unpublished data) were designed with Oligo (v. 6.3). Primer sequences were subsequently analyzed for specificity using Blast. One primer for each pair was labeled with either HEX or 6-FAM. PCR assays were optimized with regard to annealing temperature and primer concentrations. The exact PCR conditions for each primer pair are available from the authors on request. PCR reactions contained 5 ng template DNA, 2 mM MgCl2, 100 µM dNTP (Amersham/Pharmacia, Uppsala, Sweden), 0.75 units AmpliTaq Gold polymerase (Applied Biosystems, Foster City, Calif.), 1 x PCR reaction buffer (Applied Biosystems), and 50900 nM of each primer in a final volume of 25 µl. PCR products were resolved using capillary electrophoresis run on ABI 310 or 3730 sequencers employing GeneScan and GeneMapper v.3.0 software (Applied Biosystems). The allele sizes of the microsatellite markers were determined in relation to a size marker, GeneScan-500 LIZ (Applied Biosystems). The 52 microsatellite loci were recruited from the whole nuclear A. thaliana genome with an overrepresentation of loci from chromosome 2.
Models and Rejection Algorithms
Because the sequence data set was much smaller, did not include
a good sample of Swedish
A. thaliana accessions, and no variation
in
A. suecica, we used slightly different models for the two
data sets. For the sequence data we only consider polymorphism
within
A. suecica, and restrict ourselves to inferring the time-of-origin
under the assumption of a single origin. Because no variation
was detected in
A. suecica the maximum likelihood estimate of
the number of origins would be a single origin under any model.
For the microsatellite data we model polymorphism within both
species, and infer the number of founders as well.
Before we describe the explicit models used in this article let us introduce some notation. For a population with constant size N, the coalescent model states that the time Wj during which a sample of size n has j ancestors (
) is exponentially distributed with parameter j(j 1)/2. The times for different j are independent and we assume in the following that time is counted in units of N generations. The time to the most recent common ancestor T is given by
 | (1) |
and the
total length of the tree
L is
 | (2) |
When the population size is variable, the times Wn, Wn1, ..., W2 are no longer independent. In the case when the population size grows exponentially (forward in time), the population size at time t (backwards in time) is
for some constant value of
. The conditional distribution of the time Wj for which there are exactly j ancestors, given that the time in which there are more than j ancestors is s, is (Tavaré et al. 1997
; eq. 21):
 | (3) |
where

in the case of exponential growth.
For a set of parameters
, the conditional probability density function of
given the data D is denoted f(
|D),
 | (4) |
where

(

) is the product
of the probability density functions (pdf) of the prior distributions
and
P(
G) is the probability of the genealogy that is given by
the joint pdf (3). We will also assume that mutations occur
independently in each branch with constant rate µ. Then,
if a branch is of length
w, the number of mutations,
k, will
be Poisson distributed with mean
wNµ, Po(
k,
wNµ).
A Model of a Species Founded by One Individual, Model 1
For the sequence data we considered a model (Model 1) where A. suecica grew exponentially from a single individual
years (or generations: A. suecica appears to be annual) ago to a present effective size of N individuals (we tried different scenarios for growth, but they all gave similar results). Gene genealogies follow the standard coalescent distribution for a growing population (eq. 3). We assume that A. suecica is sufficiently highly selfing that the rate of coalescence is twice that of an equivalent outbreeding population (Nordborg and Donnelly 1997
). Neutral mutations were assumed to occur with probability µs per bp per generation. For
= {
, N, µs} and by substituting the data D with the number of segregating sites S, the conditional pdf (4) can be expressed as (Tavaré et al. 1997
):
 | (5) |
where
f (
l*) denotes
the pdf under the coalescent model of the total length of the
genealogy up to min(

,
T). The pdf
N(
u) and


(
x) are the prior distribution of
N, µ
s, and

.
Because the sequenced fragments were located within a 70-kb region, the degree of selfing influences how we should treat the seven fragments (four thaliana-like and three arenosa-like). If A. suecica was completely selfing, then all seven fragments must have the same genealogy. If A. suecica was completely outcrossing, then the seven fragments would probably have almost independent genealogical histories depending on the recombination fraction between the fragments. We considered both extremes, complete selfing and complete outcrossing, but the results turned out to be almost identical so we only present the results from the former model here.
Estimating the Time-of-origin from Model 1
We assumed uniform prior distributions for the parameters. The posterior distribution given the data (eq. 5) was evaluated using a simple rejection algorithm (cf. Tavaré et al. 1997
):
Algorithm 1. Rejection algorithm for f(
, N, µs|S = k).- Simulate a parameter set
= {
, N, µs} from the prior distributions.
- Simulate a standard haploid (to reflect selfing) coalescent for a population that has undergone
generations of exponential growth from a single individual to reach a current size of N (eq. 3), and record the total length L* (eq. 2) of the branches up to the minimum of
and T (eq. 1) of the sample.
- Accept the parameter set with probability exp(NL*µsb), where b is the number of sequenced bases (thus only parameter sets without mutations are accepted), otherwise reject it.
The parameter values that result from accepted observations in this algorithm are then samples from the posterior distribution.
A Model of a Parent Species and an Offspring Species, Model 2
For the microsatellite data we considered a model (Model 2; fig. 1) of two species, t and s (A. thaliana and A. suecica), which have been completely separated from each other for
years (or generations). Both species have grown exponentially from 
years ago (0 <
1) to the present. The population size of species t was Nt(
) at time
(as well as at time 
) and has grown to a population size of Nt(0) at present. We define Ns(
) as the population size of species s a negligibly short time
s before (backwards in time) the actual founding event. This means that the n founders are allowed to increase to Ns(
) during
s, and the effect of drift during
s will be considered below. Before (backwards in time) this short initial phase drift is modeled using the coalescent. The population size of species s grows exponentially from Ns(
) to a population size of Ns(0) at present. After
years (backwards in time) the two populations were considered to be one population with a constant population size of Nt(
)+Ns(
). Gene genealogies from 0 to 
years ago follow the standard coalescent distribution for a growing population (eq. 3) and the gene genealogies after 
years follow the standard coalescent distribution with constant population size (see, e.g., Nordborg 2001
). We again assume that both species are sufficiently highly selfing that the rate of coalescence is twice that of an equivalent outbreeding population (Nordborg and Donnelly 1997
). Microsatellite loci were modeled according to a stepwise mutation model (Ohta and Kimura 1973
), that is, mutations were assumed to be Poisson distributed on each branch and a mutation is an addition or a subtraction of one repeat unit at a particular microsatellite locus. Mutations were assumed to occur with probability µm per generation and locus. Here we use another summary statistic, the average pairwise difference (S'), to simplify the data D. S' was computed using the following expression:
 | (6) |
where
I is the number of loci,
li,j and
li,k are the lengths of locus
i, in individual
j and
k in repeat units, and
ni is the number
of sampled individuals at locus
i. The statistic used to summarize
the data was then
 | (7) |
where
the statistics
S'
t,sim and
S'
s,sim are computed from
eq. (6) for simulated data sets for species
t and
s, and the statistics
S'
t,obs and
S'
s,obs are computed from
eq. (6) for observed data
sets for species
t and
s. The posterior distributions given
the data
 | (8) |
under
Model 2 was evaluated using a rejection algorithm similar to
Algorithm 1. The value of

was fixed to a predefined value between 0 and
1. For each of the parameters a uniform prior distribution was
assumed.
Algorithm 2. Rejection algorithm for f(
, Nt(0), Ns(0), Nt(
), Ns(
), µm|DS <
).- Simulate a parameter set
= {
, Nt(0), Ns(0), Nt(
), Ns(
), µm} from the prior distributions.
- Repeat steps a and b for the total number of modeled loci, I:
- a. Simulate two standard haploid (to reflect selfing) coalescent for the two populations (species t and species s) until
years ago, where the population sizes were Nt(
) and Ns(
) (note that the population sizes were constant from 
years to
years) that have undergone 
generations of exponential growth (eq. 3) to reach a current sizes of Nt(0) and Ns(0).
- b. Simulate a standard haploid coalescent from
years with constant population size of Nt(
)+Ns(
) for all remaining lines at
.
- Accept the parameter set if DS <
, where
is a value > 0, otherwise reject it.
We chose
to equal ß(S't,obs+S's,obs) where ß was set to some small value (ß was chosen to be as small as possible and still accept reasonable number of samples from the algorithm in a reasonable amount of time. We used ß = 0.05, unless otherwise indicated). The parameter values that result from accepted observations in Algorithm 2 are then samples from the posterior distributions (8) of the parameters. Our model of an ancestral population that splits into two populations is relatively similar to the model of Hey (2005)
. The method described above is derived from Tavaré et al. (1997)
and Pritchard et al. (1999)
. Our method uses multi-locus data and infers the number of origins of a particular species, which are the major differences between our method and the method of Prichard et al. (1999). For additional similar approaches see Approximate Bayesian Computation methods (Beaumont, Zhang, and Balding 2002
).
A Model of n Founders, Model 2a
To estimate the number of founders we modified Model 2 to allow modeling a fixed number of founders, n slightly. The modified model, Model 2a, includes an extra parameter n and founders denoted j = 1, ..., n. All lines from species s that remain at time
are randomly assigned to the n founders with equal probability (1/n) for each locus i. We thus do not model drift during the establishment of the species during the first
s generations since the appearance of A. suecica (or, rather, the first A. suecica that could have been the ancestor of modern A. suecica: it is possible that earlier polyploidization events occurred that subsequently went extinct) explicitly. Our results can be viewed as being conditional on the composition of the species after it has become established. The maximum number of founders of species s is then n, but fewer founders are possible in this model. This is the only difference between Model 2 and Model 2a. The accepted parameter sets are then samples from the posterior distribution of n. We also used Model 2a to compare scenarios of different number of founders n, given the data, for fixed values of
, Nt(0), Ns(0), Nt(
), Ns(
), and
. Likelihoods were approximated by the acceptance rate of Algorithm 2.
An Unequal Contribution Model, Model 2b
Model 2a assumes that each founder j contributed equally to the founding of the species. A slightly more realistic model incorporates the possibility of unequal contributions of the n founders. We therefore made a modification of Model 2a that allows unequal contributions of different founders to the whole founding population. This was done by assigning the lineages of species s that remain at
to either of the j founders with probability pj, and
The probability pj can be considered to be the proportion that founder j contributed to the whole founding population. The founding population is then a common gene pool [of population size Ns(
)] at time
to which each founder j contributed pj. This founding population is then modeled in the same way as in Model 2. This model is henceforth known as Model 2b. We approximated the posterior distribution of pj given the data in the same way as for Model 2a.
 |
Results
|
|---|
Sequence and Microsatellite Variation in A. thaliana, A. arenosa, and A. suecica
Sequence divergence between
A. thaliana and
A. arenosa is summarized
in
table 5. Fragment 1 contained a number of indels and could
not be aligned unambiguously, so the alignment requiring the
least number of indels was used. The exact number of indels
and substitutions will depend on the alignment, but an alternative
alignment seems unlikely to affect the general conclusions drawn.
We also note that the substitution rates for this fragment are
broadly in agreement with those for other fragments, suggesting
that the alignment is reasonable. As expected, non-coding regions
differed more than exons, in particular with respect to indels,
which were completely absent from the coding regions. In the
coding regions, the rate of synonymous substitution was higher
than the non-synonymous rate (K
S = 0.18 > K
A = 0.04).
The divergence time between
A. thaliana and
A. lyrata has been
estimated to be between 3.8 and 5.8 MYA by Kuittinen and Aguadé
(2000)

and to be between 5.1 and 5.4 MYA by Koch, Haubold, and
Mitchell-Olds (2000)

. As
A. arenosa diverged from
A. lyrata only after the two species diverged from
A. thaliana (Yang
et al. 1999

; Heenan, Mitchell, and Koch 2002

), 5 Myr is a reasonable
estimate of the divergence time between
A. thaliana and
A. arenosa.
Using this divergence time (and outgroup information), the synonymous
substitution rate was found to be 6.0
x 10
9 substitutions
per bp per year in the
A. thaliana lineage and 6.2
x 10
9 in the
A. arenosa lineage. This is about half of the estimate
of 1.5
x 10
8 for several
Arabidopsis and
Arabis species
obtained by Koch, Haubold, and Mitchell-Olds (2000)

.
The ancestral state of indels could be determined by comparison with the outgroup sequence, with the exception of a single bp indel on fragment 4 for which no outgroup was available, and several complex rearrangements on fragment 1. A. arenosa had only experienced a slight reduction in size, but the A. thaliana sequences appeared to be 4% shorter than the ancestral state, which is similar to the 5% reduction in the A. thaliana intron size compared to A. lyrata previously found by Wright, Lauga, and Charlesworth (2002)
.
A. suecica contains two nuclear genomes, one from A. thaliana and one from A. arenosa. Thus, when sequencing, we expected to find two quite different fragments. When we sequenced the four fragments in A. suecica, four thaliana-like sequences and three arenosa-like sequences were obtained (in total seven sequences; table 3). Figure 2 summarizes the observed pattern of variation as gene genealogies. The relationship between the A. suecica sequences and the ancestral species is evident. For each fragment, the four identical thaliana-like sequences fall within the cluster of actual A. thaliana sequences, and there are always A. thaliana sequences identical to the thaliana-like sequences. The arenosa-like sequences always cluster with A. arenosa, but they are different from the single A. arenosa allele we sequenced. It seems likely, however, that sequences identical to the four arenosa-like sequences would have been encountered, had we sequenced a larger sample of A. arenosa. Overall, A. suecica appears to have diverged little from its ancestral species, at least where homologous sequences exist. The fact that an arenosa-like allele was not successfully amplified from one of the four fragments could imply either that this allele does not amplify with our primers (although at least one A. arenosa allele does) or that the allele inherited from A. arenosa has been lost entirely for this fragment.

View larger version (14K):
[in this window]
[in a new window]
|
FIG. 2. Inferred gene genealogies for each of the four sequenced fragments. Asia refers to the arenosa-like allele in Arabidopsis suecica individual i, Asit analogously for the thaliana-like allele (cf. table 3). Individuals not identified by scientific names are Arabidopsis thaliana accessions. The accessions Mt-0, Dem-4, NC-6, and Köln are identical to Col in all fragments; Tamm-46 and Kondara are identical to Shakhdara in all fragments. These six accessions have therefore been left out of the trees above. The number of mutations separating each allele is indicated on the branches.
|
|
A total of 52 microsatellite loci were also typed for 14
A. thaliana and 8
A. suecica accessions. All 52 loci were variable
within
A. thaliana and 24 loci were variable in
A. suecica. Because the microsatellites were chosen without any prior knowledge
of variability in either of the two species, we can compare
the variability of these two species using all microsatellite
loci.
A. thaliana is thus more variable than
A. suecica considering
all 52 loci (
table 6).
A. thaliana was still more variable,
even if we only considered the 24 microsatellite loci that are
variable within
A. suecica. There were a total of 384 alleles
among the 14
A. thaliana accessions and 106 among the eight
A. suecica accessions. The comparable number of alleles for
eight
A. thaliana accessions would be 259.3 using a rarefaction
correction (Kalinowski 2004

) of the 14
A. thaliana accessions.
Number of Founders of A. suecica
We computed the likelihood of the
Model 2a for 110 founders,
using the microsatellite data. We also tested a wide range of
values of the remaining parameters (µ
m,
Ns(0),
Ns(

),
Nt(0),
Nt(

),

, and

). However, because only two parameters,

and
Ns(

),
affected the likelihood for different number of founders significantly
(results not shown), we limit our discussion to these parameters.
The parameter

affected the likelihoods moderately. A small
value of

(

< 1) resulted in a more evenly distributed likelihood
as a function of

(a more detailed analysis of the effect of

follows below). However, the effect on the likelihood distributions
was almost identical for any tested number of founders.
Figure 3 shows the likelihood surfaces for 1, 2, and 10 founders using
Model 2a with remaining parameters held constant. From
fig. 3 it is clear that if
Ns(

) is larger than about 500, we can be
fairly certain that the species was founded by one event. However,
there are clearly some parts of the [

,
Ns(

)] parameter space
in which the likelihoods are very similar for 1, 2, and 10 founders.
If
Ns(

) is less than about 500, the likelihoods are similar,
except for small values of

. The number of founders cannot be
determined in this part of the parameter space. The reason that
we cannot infer the number of founders for small values of
Ns (

) is that all
A. suecica lines have an MRCA before (going backwards
in time)

, regardless of the number of founders.
Figure 4 shows
the posterior distribution of the number of founders (
n = 1,
..., 10) for fixed parameter values of the population sizes
and a wide range of values of the time-of-origin

. Unless
Ns(

)
< 1000 there is strong support for one origin of
A. suecica.
There also appears to be little support for

> 40,000 for
either value of
n (see below for a more detailed estimation
of the time-of-origin).
If there were more than one founder, these founders may have
contributed unequal fractions to the founding population of
A. suecica. We evaluated this scenario using
Model 2b and
Algorithm 2.
Figure 5a shows the posterior distribution of the contributing
fraction
p2 from a second founder, given the model, for four
different values of
Ns(0) and
Ns(

). The second founding event
was arbitrarily chosen to be the founder that contributed less
than 0.5 to the founding population. Note that this model is
identical to
Model 2a with
n = 2 when
p1 =
p2 = 0.5, and it
is identical to
Model 2a with
n = 1 when
p2 = 0 (or
p1 = 0).
From
fig. 5a we see that if
Ns(

)

10
4, most of the weight is
on small values of
p2. It is thus unlikely that a second founder
contributed more than 0.050.1 to the founding population
of
A. suecica given that
Ns(

)

10
4. If
Ns(

) = 10
3 we see the
same tendency of smaller values of
p2 being more likely than
larger values of
p2, whereas if
Ns(

) = 10
2 we cannot distinguish
small contributions to the founding population from a second
founder.
Figure 5b shows the posterior distribution of the contributing
fraction
p2 from a second founder given the data, and when the
mutation rate is assumed to be between 10
4 and 10
3.
From
fig. 5b it is clear that the general conclusions about
a second founding event are robust with regard to the mutation
rate.
The complete lack of variation in the
A. suecica sequence data
(four thaliana-like fragments and three arenosa-like fragments)
also suggests that our geographically diverse sample of
A. suecica stem from a single polyploidization event. Given the level of
variation in
A. thaliana (the average pairwise difference per
bp was 0.00719, 0.0736, 0.00157 and 0.00301 in the four fragments
respectively; see also
fig. 2), it is unlikely that two or more
polyploidization events would have involved
A. thaliana parents
identical at all four fragments (although we note that some
A. thaliana accessions studied are indeed identical in all four
fragments).
Time-of-origin
The time-of-origin of the nuclear genome of A. suecica was estimated using Model 2, Algorithm 2, and the microsatellite data. We used uniform distributions for the six priors [µm,
, Nt(0), Ns(0), Nt(
), Ns(
)] and tested different types of growth scenarios (
= 1.0, 0.5, 0.25, or 0.1). The prior distribution of µm was uniform in the interval 105 to 103. These bounds were based on microsatellite mutation rate estimates from a variety of organisms (Schug, MacKay, and Aquadro 1997
; Vazquez et al. 2000
; Xu et al. 2000
; Vigouroux et al. 2002
). The origin of A. suecica is unlikely to be more recent than a thousand years ago, given its present geographical distribution and level of variation. It is also unlikely that the origin is more than a million years ago given a divergence between its parents of about 5 Myr. Thus, we used a uniform distribution between 103 and 106 as a prior for
. Strong evidence of population growth in A. thaliana has recently been reported (Nordborg et al. 2005
) and we used uniform priors for the past and present population size of A. thaliana to reflect this: Nt(0)
uniform(107, 109) and Nt(
)
uniform(105, 107). For the population sizes of A. suecica we used uniform priors, Ns(0)
uniform(105, 106) and Ns(
)
uniform(1, 104). In addition to these "reasonable" priors (runs 14 in table 7), several more or less unrealistic priors were tested (runs 511, table 7) to check whether the priors chosen above were justified or not as control.
View this table:
[in this window]
[in a new window]
|
Table 7 Prior Distributions of the Parameters That were Tested for Inferring the Time-of-origin Using Algorithm 2
|
|
Figure 6a shows the posterior distribution of

for four different
growth scenarios and with a uniform prior distribution of µ
m between 10
5 and 10
3 using "reasonable" priors
for all parameters (runs 14 in
table 7). It is obvious
from
fig. 6a that the growth scenario affects the conclusions
for the time-of-origin of
A. suecica. If growth started close
to the origin (

= 1) the posterior probability distribution
of

is relatively narrow with a median of 45,000 years. If the
growth started long after the origin (

= 0.25 or 0.1) the posterior
distributions of

are quite wide with medians of 100,000 and
240,000. The probability that

< 12,000 is about 5% for
= 1 (where

is larger at the 5% limit for all

< 1,
table 8).
Figure 6b shows one minus the cumulative probability distribution
for the same four growth scenarios. This figure illustrates
that the upper limit of

also depends heavily on

. When

0.25
we find that the probability that

> 290,000 is about 5%.
Figure 7 shows the heavy negative correlation of

and µ
m.
For example, if µ
m > 5
x 10
5, then

is <100,000.
Clearly, inference of

is heavily dependent on what is assumed
about µ
m and the population growth, whereas the other
parameter had much smaller effects on

.
View this table:
[in this window]
[in a new window]
|
Table 8 The 90% Credibility Regions of µm and of the Posterior Distributions Obtained from Algorithm 2 for Different Growth Scenarios, , and Mutation Rates, µm
|
|
Estimates of microsatellite mutation rates from plants are in
the range of 10
4 to 10
3 mutations per generation
(Vigouroux
et al. 2002

). If we assume that µ
m is between
10
4 and 10
3, the upper bound on

would be much
smaller;

> 44,000 is about 5% for

0.25 (
fig. 8). Under
this assumption on µ
m the probability distributions of

are shifted toward the lower end of the prior of
. For example,
the median of

when

= 1, 0.5, and 0.25 is about 9,000, 13,000,
and 22,000, respectively.
Given
Model 1 described above, what does the fact that we observed
no variation in the four sequenced fragments tell us about

,
N, and µ
s? It is clear that the information in these data
is limited: the maximum likelihood estimate of all three parameters
is zero, and the best we can hope for is to rule out certain
parameter combinations. A similar problem has been studied previously
in the context of a human Y chromosome data set without variation
(Tavaré
et al. 1997

).
Figure 9 shows the posterior distributions
using
Algorithm 1 for each of the three parameters

,
N, and
µ
s. To emphasize the fact that we can really only hope
to rule out large values, one minus the cumulative distribution
is shown. For example, the probability that

> 200,000 years
is about 20% and the probability that

> 650,000 years is
about 5%. It is clear that a wide range of values are plausible
for all three parameters. It is also clear that there is almost
no information about
N: the cumulative posterior distribution
is a straight line, which is the same as the prior. The only
exception is for very small values of
N. The reason for this
is that, with the exception of very small
N, observing no variation
is extremely unlikely unless either

or µ
s is small enough.
A small effective population size is not a good explanation
for the data.

View larger version (5K):
[in this window]
[in a new window]
|
FIG. 9. Cumulative posterior distributions of µs, , and N, estimated under Model 1 (see text for details).
|
|
It is much more likely that the lack of variation reflects the
relatively recent origin of
A. suecica. It may seem disappointing
that we cannot pin down this origin more narrowly than indicated
in
fig. 9. The reason for this is the uncertainty in µ
s.
Similar to what was observed for the microsatellite model (
fig. 7),
the joint posterior density of

and µ
s show a very strong
negative correlation between the two parameters (
fig. 10). We
originally assumed a uniform prior between 10
9 and 10
7 for µ
s based on mutation rates estimated above and earlier
reports of mutation rates of
Arabidopsis and
Arabis species
(e.g., Koch, Haubold, and Mitchell-Olds 2000

). If we were willing
to assume that µ
s > 10
8, the marginal posterior
distribution for

would change dramatically. As shown in
fig. 11,
the probability that

> 200,000 years is now about 7% instead
of the 20% we obtained earlier.
Figure 11 illustrates how strongly
these kinds of evolutionary inferences depend on what is assumed.
It is clear that with a high µ
s the estimate of

will
be low and visa versa.
 |
Discussion
|
|---|
Number of Origins
Unlike other forms of speciation, speciation by polyploidization
is instantaneous. It is the result of a well-defined, discrete
event (or events), and in contrast to, say, humans it is clearly
meaningful to ask when and where the species arose. The new
polyploid species is expected to be isolated from its parental
species because crosses between a polyploid and its parental
species are most likely infertile. However, repeated polyploidization
events may result in a polyploid species with multiple origins.
If alleles at a given locus of all existing individuals of a
polyploid species have an MRCA before (going backwards in time)
or at the time-of-origin, at least two scenarios are possible.
These alleles either have an MRCA before the time-of-origin
simply because of drift, or the species was founded by one polyploidization
event. In addition to these two scenarios, the polyploid species
may be the result of more that one polyploidization event of
genetically very closely related parental individuals. This
scenario would be very difficult if not impossible to distinguish
genetically from a scenario of only one founding event. On the
other hand, this case would lead to a polyploid species which,
effectively, can be considered to be of a single origin because
the founding individuals were genetically very similar. Distinguishing
between single and multiple origins using polymorphism data
is not trivial. In particular, the observation of monophyly
at one or a small number of loci with respect to the parental
species does not imply a single origin: alleles in the polyploid
species could be monophyletic with respect to the parental species
simply because they happen to coalesce more recently than the
origins of the species (Rosenberg and Nordborg 2002

).
We investigated the number of origins of A. suecica using an explicit coalescent model. We found that the likelihood of different founder numbers was determined by two parameters: the time-of-origin,
, and the population size of A. suecica immediately after the founding, Ns(
). We concluded that there is power to infer a single founder if Ns(
)
500. It is worth clarifying what this means (cf. fig. 3). First, it is obvious that we cannot say anything about the number of founders if
is sufficiently large because all lineages will have coalesced with A. suecica. However, the low level of polymorphism in A. suecica argues against a very large
(unless the population size is tiny). Second, we are not interested in founder lines that went extinct in a small number of generations. To model multiple origins, we consider the composition of A. suecica immediately after the most recent founder event that resulted in an established population. We define the population size of A. suecica at this time Ns(
). Thus, the statement that there is power to infer a single founder if Ns(
)
500 means that we can reject multiple founders unless the total (effective) population size when multiple lineages last existed was less than 500. This seems unlikely given the high reproductive rate of Arabidopsis, and the fact polyploidization events are rare so that each founding event is likely to have been isolated (geographically) from other founding events for some time. An initial population size of 500 is quickly reached given the mode of reproduction of A. suecica (Säll et al. 2004
).
For simplicity, we do not explicitly model multiple origins at different points in time (our method tests for multiple origins at one point in time). Nonetheless, our results bear on this (rather more plausible scenario) as well. In particular, our "unequal contribution" model (Model 2b) can be thought of as modeling a situation where two founding populations were established at different points in time, and have reached different sizes. We then consider the future of the merged population. We believe that the general conclusions we reach about what can, and cannot be inferred about multiple founding events would be valid under more detailed models as well. However, an explicit model of multiple founding events at different points in time would require many assumptions about initial demography, and is beyond the scope of this paper.
In summary, then, we believe that the current genome of A. suecica stems from a single founding event. While it is possible that some variation was inherited from A. arenosa (and survived the original bottleneck that must have taken place), it seems likely that the genome inherited from A. thaliana was originally that of an homozygous inbred parent.
Time-of-origin
The coalescent method was also used to estimate the time-of-origin of A. suecica. This was done separately using microsatellite data and sequence data. The results were more sensitive to assumptions than those for the number of founders. The microsatellite data show that the origin probably is >12,000 years ago, unless the microsatellite mutation rate is high (if µm > 104, then we can only conclude that the origin most likely is >3,000 years ago). The u