Coalescent theory is a model of how gene variants sampled from a
population may have originated from a common ancestor. In the simplest
case, coalescent theory assumes no recombination, no natural
selection, and no gene flow or population structure, meaning that each
variant is equally likely to have been passed from one generation to
the next. The model looks backward in time, merging alleles into a
single ancestral copy according to a random process in coalescence
events. Under this model, the expected time between successive
coalescence events increases almost exponentially back in time (with
wide variance). Variance in the model comes from both the random
passing of alleles from one generation to the next, and the random
occurrence of mutations in these alleles.
The mathematical theory of the coalescent was developed independently
by several groups in the early 1980s as a natural extension of
classical population genetics theory and models, but can
be primarily attributed to John Kingman. Advances in coalescent
theory have extended the model to include recombination, selection,
overlapping generations and virtually any arbitrarily complex
evolutionary or demographic model in population genetic analysis.
The model can be used to produce many theoretical genealogies, and
then compare observed data to these simulations to test assumptions
about the demographic history of a population.
Coalescent theory can
be used to make inferences about population genetic parameters, such
as migration, population size and recombination.
Time to coalescence
Consider a single gene locus sampled from two haploid individuals in a
population. The ancestry of this sample is traced backwards in time to
the point where these two lineages coalesce in their most recent
common ancestor (MRCA).
Coalescent theory seeks to estimate the
expectation of this time period and its variance.
The probability that two lineages coalesce in the immediately
preceding generation is the probability that they share a parental DNA
sequence. In a population with a constant effective population size
with 2Ne copies of each locus, there are 2Ne "potential parents" in
the previous generation. Under a random mating model, the probability
that two alleles originate from the same parental copy is thus 1/(2Ne)
and, correspondingly, the probability that they do not coalesce is
1 − 1/(2Ne).
At each successive preceding generation, the probability of
coalescence is geometrically distributed—that is, it is the
probability of noncoalescence at the t − 1 preceding
generations multiplied by the probability of coalescence at the
generation of interest:
$$P_c(t) = \left(1 - \frac{1}{2N_e}\right)^{t-1} \left(\frac{1}{2N_e}\right).$$
For sufficiently large values of Ne, this distribution is well
approximated by the continuously defined exponential distribution
$$P_c(t) = \frac{1}{2N_e} e^{-\frac{t-1}{2N_e}}.$$
This is mathematically convenient, as this exponential distribution
has both its expected value and its standard deviation equal to 2Ne.
Therefore, although the expected time to coalescence is 2Ne, actual
coalescence times vary widely. Note that coalescent time is the number
of preceding generations at which the coalescence took place, not
calendar time, though an estimate of the latter can be made by
multiplying 2Ne by the average time between generations. The above
calculations apply equally to a diploid
population of effective size Ne (in other words, for a non-recombining
segment of DNA, each chromosome can be treated as equivalent to an
independent haploid individual; in the absence of inbreeding, sister
chromosomes in a single individual are no more closely related than
two chromosomes randomly sampled from the population). Some
DNA elements, such as mitochondrial DNA, are passed on by only one
sex, however, and therefore have one quarter the effective size of the
equivalent diploid population (Ne/2).
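The geometric distribution above and its exponential approximation can be checked numerically. The following is a minimal sketch, assuming an idealised population in which each lineage picks its parental copy uniformly at random among 2Ne copies; the function names and the value 2Ne = 200 are illustrative choices, not part of the theory itself.

```python
import math
import random


def geometric_pmf(t, n_copies):
    """P(two lineages coalesce exactly t generations back); n_copies is 2Ne."""
    p = 1.0 / n_copies
    return (1.0 - p) ** (t - 1) * p


def exponential_approx(t, n_copies):
    """Continuous approximation (1/2Ne) * exp(-(t-1)/2Ne) given in the text."""
    return math.exp(-(t - 1) / n_copies) / n_copies


def simulate_pair_time(n_copies, rng):
    """Step back one generation at a time until both lineages pick the same parent."""
    t = 1
    while rng.randrange(n_copies) != rng.randrange(n_copies):
        t += 1
    return t


def mean_and_sd(values):
    """Sample mean and sample standard deviation."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var)


if __name__ == "__main__":
    rng = random.Random(1)
    n_copies = 200  # assumed 2Ne, i.e. Ne = 100 diploid individuals
    times = [simulate_pair_time(n_copies, rng) for _ in range(2000)]
    mean, sd = mean_and_sd(times)
    print(f"simulated mean {mean:.1f}, sd {sd:.1f}; both should be near {n_copies}")
    print(geometric_pmf(50, n_copies), exponential_approx(50, n_copies))
```

With enough replicates, the simulated mean and standard deviation should both land close to 2Ne, which illustrates how widely individual coalescence times scatter around their expectation.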
Neutral variation
Coalescent theory can also be used to model the amount of variation in
DNA sequences expected from genetic drift and mutation. This value is
termed the mean heterozygosity, represented as $\bar{H}$. Mean
heterozygosity is calculated as the probability of a mutation
occurring at a given generation divided by the probability of any
"event" at that generation (either a mutation or a coalescence). The
probability that the event is a mutation is the probability of a
mutation in either of the two lineages: $2\mu$, where $\mu$ is the
per-generation mutation probability of a single lineage. Thus the mean
heterozygosity is equal to
$$\begin{aligned}\bar{H} &= \frac{2\mu}{2\mu + \frac{1}{2N_e}} \\[3pt] &= \frac{4N_e\mu}{1 + 4N_e\mu} \\[3pt] &= \frac{\theta}{1+\theta},\end{aligned}$$
where $\theta = 4N_e\mu$. For $4N_e\mu \gg 1$, the vast majority of
allele pairs have at least one difference in nucleotide sequence.
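The ratio of "mutation" to "any event" described above can be illustrated with a short calculation and a matching simulation. This is a sketch under stated assumptions: each generation is treated as having at most one event (mutation with probability 2μ, coalescence with probability 1/(2Ne)), and the function names and parameter values are hypothetical.

```python
import random


def expected_heterozygosity(ne, mu):
    """Mean heterozygosity theta / (1 + theta), with theta = 4 * Ne * mu."""
    theta = 4.0 * ne * mu
    return theta / (1.0 + theta)


def simulate_heterozygosity(ne, mu, n_pairs, rng):
    """Fraction of allele pairs whose first event, looking back, is a mutation."""
    p_mut = 2.0 * mu            # mutation on either of the two lineages
    p_coal = 1.0 / (2.0 * ne)   # the two lineages share a parental copy
    different = 0
    for _ in range(n_pairs):
        while True:
            u = rng.random()
            if u < p_mut:               # mutation first: the pair differs
                different += 1
                break
            if u < p_mut + p_coal:      # coalescence first: the pair is identical
                break
    return different / n_pairs


if __name__ == "__main__":
    rng = random.Random(2)
    ne, mu = 50, 0.005  # illustrative values giving theta = 1
    print(expected_heterozygosity(ne, mu))
    print(simulate_heterozygosity(ne, mu, 5000, rng))
```

Treating mutation and coalescence as mutually exclusive within a generation is the same small-probability approximation that the formula itself relies on.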
Graphical representation
Coalescents can be visualised using dendrograms which show the
relationship of branches of the population to each other. The point
where two branches meet indicates a coalescent event.
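A genealogy of this kind can be generated at random and written out in a form that tree-drawing tools accept. The sketch below makes assumptions the text does not spell out: it uses the standard continuous-time result that, with k lineages remaining, any of the k(k−1)/2 pairs may coalesce, so the waiting time is drawn as exponential with mean 2Ne divided by the number of pairs. The sample labels (s1, s2, ...), the Newick output format and the function name are illustrative choices.

```python
import random


def simulate_genealogy(n_samples, n_copies, rng):
    """Return (newick, tmrca) for a random genealogy of n_samples gene copies.

    n_copies is 2Ne. Times are measured in generations before the present.
    """
    # Each active lineage is (newick_label, time_at_which_its_node_formed).
    lineages = [(f"s{i + 1}", 0.0) for i in range(n_samples)]
    now = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        pairs = k * (k - 1) / 2
        # Exponential waiting time with mean n_copies / pairs.
        now += rng.expovariate(pairs / n_copies)
        i, j = rng.sample(range(k), 2)  # the two lineages that coalesce
        (a, ta), (b, tb) = lineages[i], lineages[j]
        merged = (f"({a}:{now - ta:.1f},{b}:{now - tb:.1f})", now)
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return lineages[0][0] + ";", now


if __name__ == "__main__":
    rng = random.Random(4)
    newick, tmrca = simulate_genealogy(5, 200, rng)
    print(newick)
    print(f"time to most recent common ancestor: {tmrca:.1f} generations")
```

Because the number of candidate pairs shrinks as lineages merge, the last few coalescence events tend to take far longer than the first ones, which is why the deepest branches of such dendrograms are usually the longest.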
Disease gene mapping
The utility of coalescent theory in the mapping of disease is slowly
gaining more appreciation; although the application of the theory is
still in its infancy, there are a number of researchers who are
actively developing algorithms for the analysis of human genetic data
that utilise coalescent theory.
A considerable number of human diseases can be attributed to genetics,
from simple Mendelian diseases like sickle-cell anemia and cystic
fibrosis, to more complicated maladies like cancers and mental
illnesses. The latter are polygenic diseases, controlled by multiple
genes that may occur on different chromosomes, but diseases that are
precipitated by a single abnormality are relatively simple to pinpoint
and trace – although not so simple that this has been achieved for
all diseases. To understand these diseases and their processes, it is
immensely useful to know where the responsible genes are located on
chromosomes and how they have been inherited through generations of a
family, as can be accomplished through coalescent analysis.
Genetic diseases are passed from one generation to another just like
other genes. While any gene may be shuffled from one chromosome to
another during homologous recombination, it is unlikely that one gene
alone will be shifted. Thus, other genes that are close enough to the
disease gene to be linked to it can be used to trace it.
Polygenic diseases have a genetic basis even though they do not follow
Mendelian inheritance models, and they may occur at relatively high
frequency in populations and have severe health effects. Such
diseases may also have incomplete penetrance, which complicates
their study. These traits may arise due to many small
mutations, which together have a severe and deleterious effect on the
health of the individual.
Linkage mapping methods, including coalescent theory, can be put to
work on these diseases, since they use family pedigrees to figure out
which markers accompany a disease, and how it is inherited. At the
very least, this method helps narrow down the portion, or portions, of
the genome on which the deleterious mutations may occur. Complications
in these approaches include epistatic effects, the polygenic nature of
the mutations, and environmental factors. That said, genes whose
effects are additive carry a fixed risk of developing the disease, and
when they exist in a disease genotype, they can be used to predict
risk and map the gene. Both the regular coalescent and the
shattered coalescent (which allows that multiple mutations may have
occurred in the founding event, and that the disease may occasionally
be triggered by environmental factors) have been put to work in
understanding disease genes.
Studies have been carried out correlating disease occurrence in
fraternal and identical twins, and the results of these studies can be
used to inform coalescent modeling. Since identical twins share all of
their genome, but fraternal twins share on average only half, the
difference in correlation between the identical and fraternal twins
can be used to work out if a disease is heritable, and if so how
strongly.
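One common way to turn the twin comparison into a number is Falconer's formula, which estimates heritability as twice the difference between the identical-twin and fraternal-twin correlations. The formula is not named in the text above and is brought in here as an assumption about how the comparison might be made; the correlation values are hypothetical.

```python
def falconer_heritability(r_identical, r_fraternal):
    """Falconer's estimate: twice the gap between the two twin correlations."""
    return 2.0 * (r_identical - r_fraternal)


if __name__ == "__main__":
    # Hypothetical correlations, chosen only to show the arithmetic.
    print(round(falconer_heritability(0.8, 0.5), 2))
```

Equal correlations in the two kinds of twins give an estimate of zero, which is the case where the trait shows no evidence of being heritable.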
The genomic distribution of heterozygosity
The human single-nucleotide polymorphism (SNP) map has revealed large
regional variations in heterozygosity, more so than can be explained
on the basis of (Poisson-distributed) random chance. In part, these
variations could be explained on the basis of assessment methods, the
availability of genomic sequences, and possibly the standard
coalescent population genetic model. Population genetic influences
could have a major influence on this variation: some loci presumably
would have comparatively recent common ancestors, others might have
much older genealogies, and so the regional accumulation of SNPs over
time could be quite different. The local density of SNPs along
chromosomes appears to cluster in accordance with a variance to mean
power law and to obey the Tweedie compound Poisson distribution.
In this model the regional variations in the SNP map would be
explained by the accumulation of multiple small genomic segments
through recombination, where the mean number of SNPs per segment would
be gamma distributed in proportion to a gamma distributed time to the
most recent common ancestor for each segment.
Coalescent theory is a natural extension of the more classical
population genetics concept of neutral evolution and is an
approximation to the Fisher–Wright (or Wright–Fisher) model for
large populations. It was discovered independently by several
researchers in the 1980s.
A large body of software exists both for simulating data sets under
the coalescent process and for inferring parameters such as
population size and migration rates from genetic data.
BEAST – Bayesian inference package via MCMC with a wide range of
coalescent models including the use of temporally sampled sequences.
BPP - software package for inferring phylogeny and divergence times
among populations under a multispecies coalescent process.
CoaSim – software for simulating genetic data under the coalescent
model.
DIYABC – A user-friendly approach to ABC for inference on population
history using molecular markers.
DendroPy – A Python library for phylogenetic computing, with classes
and methods for simulating pure (unconstrained) coalescent trees as
well as constrained coalescent trees under the multispecies coalescent
model (i.e., "gene trees in species trees").
GeneRecon – software for fine-scale linkage disequilibrium mapping of
disease genes using coalescent theory, based on a Bayesian MCMC
framework.
genetree – software for estimation of population genetics parameters
using coalescent theory and simulation (the R package popgen). See
also Oxford Mathematical Genetics and Bioinformatics Group.
GENOME – rapid coalescent-based whole-genome simulation
IBDSim – A computer package for the simulation of genotypic data
under general isolation by distance models.
IMa – IMa implements the same Isolation with Migration model, but
does so using a new method that provides estimates of the joint
posterior probability density of the model parameters. IMa also allows
log likelihood ratio tests of nested demographic models. IMa is based
on a method described in Hey and Nielsen (2007 PNAS 104:2785–2790).
IMa is faster and better than IM (i.e. by virtue of providing access
to the joint posterior density function), and it can be used for most
(but not all) of the situations and options that IM can be used for.
Lamarc – software for estimation of rates of population growth,
migration, and recombination.
Migraine – A program which implements coalescent algorithms for a
maximum likelihood analysis (using Importance Sampling algorithms) of
genetic data with a focus on spatially structured populations.
Migrate – Maximum likelihood and Bayesian inference of migration
rates under the n-coalescent. The inference is implemented using MCMC.
MaCS – Markovian Coalescent Simulator – Simulates genealogies
spatially across chromosomes as a Markovian process. Similar to the
SMC algorithm of McVean and Cardin, and supports all demographic
scenarios found in Hudson's ms.
ms & msHOT – Richard Hudson's original program for generating
samples under neutral models and an extension which allows
crossover and gene conversion hotspots.
msms – An extended version of ms that includes selective sweeps.
Recodon and NetRecodon – software to simulate coding sequences with
inter/intracodon recombination, migration, growth rate and
CoalEvol and SGWE – software to simulate nucleotide, coding and
amino acid sequences under the coalescent with demographics,
recombination, population structure with migration and longitudinal
SARG – Structure Ancestral Recombination Graph by Magnus Nordborg
simcoal2 – software to simulate genetic data under the coalescent
model with complex demography and recombination
TreesimJ – Forward simulation software allowing
sampling of genealogies and data sets under diverse selective and