Tajima's D is a population genetic test statistic created by and named after the Japanese researcher Fumio Tajima.
Tajima's D is computed as the difference between two measures of genetic diversity: the mean number of pairwise differences and the number of segregating sites, each scaled so that they are expected to be the same in a neutrally evolving population of constant size.
The purpose of Tajima's D test is to distinguish between a DNA sequence evolving randomly ("neutrally") and one evolving under a non-random process, including directional selection or balancing selection, demographic expansion or contraction, genetic hitchhiking, or introgression. A randomly evolving DNA sequence contains mutations with no effect on the fitness and survival of an organism. The randomly evolving mutations are called "neutral", while mutations under selection are "non-neutral". For example, a mutation that causes prenatal death or severe disease would be expected to be under selection. In the population as a whole, the frequency of a neutral mutation fluctuates randomly (i.e. the percentage of individuals in the population with the mutation changes from one generation to the next, and this percentage is equally likely to go up or down) through genetic drift.
The strength of genetic drift depends on population size. If a population is at a constant size with constant mutation rate, the population will reach an equilibrium of gene frequencies. This equilibrium has important properties, including the number of segregating sites ''S'', and the number of nucleotide differences between pairs sampled (these are called pairwise differences). To standardize the pairwise differences, the mean or 'average' number of pairwise differences is used. This is simply the sum of the pairwise differences divided by the number of pairs, and is often symbolized by ''π''.
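The two summary quantities described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sequences are already aligned and of equal length; the function names and the toy alignment are hypothetical and not part of any particular package.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns in which at least two sequences differ."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Sum of differences over all pairs, divided by the number of pairs."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)


# Hypothetical toy alignment of three sequences (illustration only).
toy = ["ACGTA", "ACGTT", "ATGTA"]
S = segregating_sites(toy)           # columns 2 and 5 vary
pi = mean_pairwise_differences(toy)  # (1 + 1 + 2) / 3
```

Both functions treat every column equally; handling of gaps or missing data is deliberately left out of this sketch.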
The purpose of Tajima's test is to identify sequences which do not fit the neutral theory model at equilibrium between mutation and genetic drift. In order to perform the test on a DNA sequence or gene, you need to sequence homologous DNA for at least 3 individuals. Tajima's statistic computes a standardized measure of the total number of segregating sites (these are DNA sites that are polymorphic) in the sampled DNA and the average number of mutations between pairs in the sample. The two quantities whose values are compared are both method of moments estimates of the population genetic parameter theta, and so are expected to be equal. If these two numbers only differ by as much as one could reasonably expect by chance, then the null hypothesis of neutrality cannot be rejected. Otherwise, the null hypothesis of neutrality is rejected.
Scientific explanation
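Under the neutral theory model, for a population at constant size at equilibrium:
:<math>E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 4N\mu</math>
for diploid DNA, and
:<math>E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 2N\mu</math>
for haploid.
In the above formulas, ''S'' is the number of segregating sites, ''n'' is the number of samples, ''N'' is the effective population size, ''μ'' is the mutation rate at the examined genomic locus, and ''i'' is the index of summation.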
But selection, demographic fluctuations and other violations of the neutral model (including rate heterogeneity and introgression) will change the expected values of ''S'' and ''π'', so that the two estimates of ''θ'' based on them are no longer expected to be equal. The difference in the expectations for these two variables (which can be positive or negative) is the crux of Tajima's ''D'' test statistic.
''D'' is calculated by taking the difference between the two estimates of the population genetics parameter ''θ''. This difference is called ''d'', and ''D'' is calculated by dividing ''d'' by the square root of its variance, <math>\sqrt{\hat{V}(d)}</math> (its standard deviation, by definition).
:<math>D = \frac{d}{\sqrt{\hat{V}(d)}}</math>
Fumio Tajima demonstrated by computer simulation that the ''D'' statistic described above could be modeled using a beta distribution. If the ''D'' value for a sample of sequences is outside the confidence interval then one can reject the null hypothesis of neutral mutation for the sequence in question. However, in real world uses, one must be careful as past population changes (for instance, a population bottleneck) can bias the value of the ''D'' statistic.
Mathematical details
:<math>D = \frac{d}{\sqrt{\hat{V}(d)}} = \frac{\hat{k} - \frac{S}{a_1}}{\sqrt{e_1 S + e_2 S(S-1)}}</math>
where <math>\hat{k}</math> and <math>\frac{S}{a_1}</math> are two estimates of the expected number of single nucleotide polymorphisms (SNPs) between two DNA sequences under the neutral mutation model in a sample size ''n'' from an effective population size ''N'', <math>a_1 = \sum_{i=1}^{n-1} \frac{1}{i}</math>, and <math>e_1</math> and <math>e_2</math> are constants that depend only on ''n''.
The first estimate is the average number of SNPs found in (n choose 2) pairwise comparisons of sequences <math>(i,j)</math> in the sample,
:<math>\hat{k} = \frac{\sum\sum_{i<j} k_{ij}}{\binom{n}{2}}.</math>
The second estimate is derived from the expected value of ''S'', the total number of polymorphisms in the sample
:<math>E(S) = a_1 M.</math>
Tajima defines <math>M = 4N\mu</math>, whereas Hartl & Clark use a different symbol to define the same parameter, <math>\theta = 4N\mu</math>.
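The constants that enter the variance of ''d'' are not spelled out in this section. As given in Tajima (1989), they depend only on the sample size ''n'' and can be written as follows (the layout below is a restatement for convenience, not a derivation):

```latex
\begin{aligned}
a_1 &= \sum_{i=1}^{n-1} \frac{1}{i}, &
a_2 &= \sum_{i=1}^{n-1} \frac{1}{i^2}, \\
b_1 &= \frac{n+1}{3(n-1)}, &
b_2 &= \frac{2(n^2+n+3)}{9n(n-1)}, \\
c_1 &= b_1 - \frac{1}{a_1}, &
c_2 &= b_2 - \frac{n+2}{a_1 n} + \frac{a_2}{a_1^2}, \\
e_1 &= \frac{c_1}{a_1}, &
e_2 &= \frac{c_2}{a_1^2 + a_2}, \\
\hat{V}(d) &= e_1 S + e_2 S(S-1).
\end{aligned}
```

Because every constant is a function of ''n'' alone, they need to be computed only once per sample size, after which the variance follows directly from the observed ''S''.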
Example
Suppose you are a geneticist studying an unknown gene. As part of your research you get DNA samples from four random people (plus yourself). For simplicity, you label your sequence as a string of zeroes, and for the other four people you put a zero when their DNA is the same as yours and a one when it is different. (For this example, the specific type of difference is not important.)
                    1           2
 Position 12345 67890 12345 67890
 Person Y 00000 00000 00000 00000
 Person A 00100 00000 00100 00010
 Person B 00000 00000 00100 00010
 Person C 00000 01000 00000 00010
 Person D 00000 01000 00100 00010
Notice the four polymorphic sites (positions where someone differs from you, at 3, 7, 13 and 19 above). Now compare each pair of sequences and get the average number of polymorphisms between two sequences. There are "five choose two" (ten) comparisons that need to be done.
Person Y is you!
You vs A: 3 polymorphisms
Person Y 00000 00000 00000 00000
Person A 00100 00000 00100 00010
You vs B: 2 polymorphisms
Person Y 00000 00000 00000 00000
Person B 00000 00000 00100 00010
You vs C: 2 polymorphisms
Person Y 00000 00000 00000 00000
Person C 00000 01000 00000 00010
You vs D: 3 polymorphisms
Person Y 00000 00000 00000 00000
Person D 00000 01000 00100 00010
A vs B: 1 polymorphism
Person A 00100 00000 00100 00010
Person B 00000 00000 00100 00010
A vs C: 3 polymorphisms
Person A 00100 00000 00100 00010
Person C 00000 01000 00000 00010
A vs D: 2 polymorphisms
Person A 00100 00000 00100 00010
Person D 00000 01000 00100 00010
B vs C: 2 polymorphisms
Person B 00000 00000 00100 00010
Person C 00000 01000 00000 00010
B vs D: 1 polymorphism
Person B 00000 00000 00100 00010
Person D 00000 01000 00100 00010
C vs D: 1 polymorphism
Person C 00000 01000 00000 00010
Person D 00000 01000 00100 00010
The average number of polymorphisms is <math>\frac{3+2+2+3+1+3+2+2+1+1}{10} = 2</math>.
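The second estimate of the equilibrium is <math>M = S/a_1</math>.
Since there were ''n'' = 5 individuals and ''S'' = 4 segregating sites,
:<math>a_1 = \frac{1}{1} + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} \approx 2.08</math>
:<math>M = \frac{4}{2.08} \approx 1.92</math>
The lower-case ''d'' described above is the difference between these two numbers: the average number of polymorphisms found in pairwise comparison (2) and ''M''. Thus <math>d = 2 - 1.92 = 0.08</math>.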
Since this is a statistical test, you need to assess the significance of this value. A discussion of how to do this is provided below.
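The worked example stops at ''d''. The remaining step, dividing by the estimated standard deviation, can be sketched in Python as follows. This is a minimal illustration that assumes the variance constants published in Tajima (1989); the function name `tajimas_d` and the dictionary `people` are hypothetical names chosen for this sketch.

```python
import math
from itertools import combinations


def tajimas_d(seqs):
    """Return (k_hat, M, d, D) for a list of equal-length aligned sequences."""
    n = len(seqs)
    # Segregating sites: alignment columns with more than one state.
    S = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if n < 3 or S == 0:
        raise ValueError("D needs at least 3 sequences and 1 segregating site")
    # Mean number of differences over all n-choose-2 pairs.
    pairs = list(combinations(seqs, 2))
    k_hat = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants from Tajima (1989); they depend only on n.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    M = S / a1
    d = k_hat - M
    D = d / math.sqrt(e1 * S + e2 * S * (S - 1))
    return k_hat, M, d, D


# The five sequences from the example, with the spaces removed.
people = {
    "Y": "00000000000000000000",
    "A": "00100000000010000010",
    "B": "00000000000010000010",
    "C": "00000010000000000010",
    "D": "00000010000010000010",
}
k_hat, M, d, D = tajimas_d(list(people.values()))
print(f"k_hat={k_hat:.2f}  M={M:.2f}  d={d:.2f}  D={D:.2f}")
```

For these five sequences the sketch reproduces the values above (an average of 2 pairwise differences, ''M'' of about 1.92 and ''d'' of about 0.08) and gives a ''D'' of roughly 0.27.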
Interpreting Tajima's D
A negative Tajima's D signifies an excess of low frequency polymorphisms relative to expectation, indicating population size expansion (e.g., after a bottleneck or a selective sweep). A positive Tajima's D signifies low levels of both low and high frequency polymorphisms (that is, an excess of intermediate frequency polymorphisms), indicating a decrease in population size and/or balancing selection. However, this interpretation should be made only if the D-value is deemed statistically significant.
Calculating a conventional "p-value" associated with any Tajima's D value that is obtained from a sample is impossible. Briefly, this is because there is no way to describe the distribution of the statistic that is independent of the true, and unknown, theta parameter (no pivotal quantity exists). To circumvent this issue, several options have been proposed.
Determining significance
When performing a statistical test such as Tajima's D, the critical question is whether the value calculated for the statistic is unexpected under a null process. For Tajima's ''D'', the magnitude of the statistic is expected to increase the more the data deviates from a pattern expected under a population evolving according to the standard coalescent model.
Tajima (1989) found an empirical similarity between the distribution of the test statistic and a beta distribution with mean zero and variance one. He estimated theta by taking Watterson's estimator and dividing it by the number of samples. Simulations have shown this distribution to be conservative, and now that computing power is more readily available this approximation is not frequently used.
A more nuanced approach was presented in a paper by Simonsen et al.
These authors advocated constructing a confidence interval for the true theta value, and then performing a grid search over this interval to obtain the critical values at which the statistic is significant below a particular alpha value. An alternative approach is for the investigator to perform the grid search over the values of theta which they believe to be plausible based on their knowledge of the organism under study. Bayesian approaches are a natural extension of this method.
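One way to read the grid-search idea is sketched below in Python. It is not the authors' exact procedure: it assumes a standard neutral coalescent without recombination, infinite-sites mutations, and a hypothetical grid of plausible theta values, and it simply keeps the widest critical values found across the grid. All function names are illustrative.

```python
import math
import random


def d_from_counts(n, counts):
    """Tajima's D from derived-allele counts, one count per segregating site."""
    S = len(counts)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1, e2 = c1 / a1, c2 / (a1**2 + a2)
    # A site carried by c of n sequences differs in c * (n - c) of the pairs.
    k_hat = sum(c * (n - c) for c in counts) / (n * (n - 1) / 2)
    return (k_hat - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))


def simulate_counts(n, theta, rng):
    """Simulate one neutral genealogy and return the allele count of each site."""
    lineages = [1] * n  # number of sampled sequences below each lineage
    counts = []
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence, in coalescent time units.
        t = rng.expovariate(k * (k - 1) / 2)
        for size in lineages:
            # Mutations arrive on each branch as a Poisson process, rate theta/2.
            elapsed = rng.expovariate(theta / 2)
            while elapsed <= t:
                counts.append(size)
                elapsed += rng.expovariate(theta / 2)
        i, j = rng.sample(range(k), 2)  # pick two lineages to merge
        merged = lineages[i] + lineages[j]
        lineages = [s for idx, s in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    return counts


def critical_values(n, theta, reps, alpha, rng):
    """Empirical two-sided critical values of D for one value of theta."""
    ds = []
    while len(ds) < reps:
        counts = simulate_counts(n, theta, rng)
        if counts:  # D is undefined when there are no segregating sites
            ds.append(d_from_counts(n, counts))
    ds.sort()
    return ds[int(reps * alpha / 2)], ds[int(reps * (1 - alpha / 2)) - 1]


rng = random.Random(1)
n = 10
theta_grid = [2.0, 5.0, 10.0]  # hypothetical range of plausible theta values
bounds = [critical_values(n, th, 1000, 0.05, rng) for th in theta_grid]
lower = min(lo for lo, _ in bounds)
upper = max(hi for _, hi in bounds)
print(f"reject neutrality if D < {lower:.2f} or D > {upper:.2f}")
```

Taking the minimum lower bound and maximum upper bound across the grid is a deliberately cautious design choice: an observed ''D'' is called significant only if it is extreme for every theta considered.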
A very rough rule of thumb for significance is that values greater than +2 or less than -2 are likely to be significant. This rule is based on an appeal to asymptotic properties of some statistics, and thus ±2 does not actually represent a critical value for a significance test.
Finally, genome-wide scans of Tajima's D in sliding windows along a chromosomal segment are often performed. With this approach, those regions that have a value of D that greatly deviates from the bulk of the empirical distribution of all such windows are reported as significant. This method does not assess significance in the traditional statistical sense, but is quite powerful given a large genomic region, and is unlikely to falsely identify interesting regions of a chromosome if only the greatest outliers are reported.
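The outlier step of such a scan can be sketched as follows, assuming a ''D'' value has already been computed for each window. The function name, the `tail` parameter and the example values are hypothetical; real scans use many more windows and choose the tail fraction to suit the study.

```python
def outlier_windows(window_d, tail=0.01):
    """Return indices of windows in the lowest and highest `tail` fraction."""
    ranked = sorted(range(len(window_d)), key=lambda i: window_d[i])
    m = max(1, int(len(ranked) * tail))  # always report at least one window
    return sorted(ranked[:m]), sorted(ranked[-m:])


# Hypothetical per-window D values along a chromosomal segment.
values = [0.1, -0.3, 0.4, -2.6, 0.0, 0.2, 2.9, -0.1, 0.3, -0.2]
low, high = outlier_windows(values, tail=0.1)
print("most negative windows:", low)
print("most positive windows:", high)
```

Because the cut-off is a rank within the empirical distribution rather than a model-based critical value, the result is a list of candidate regions, not a formal test.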
See also
* Fay and Wu's H
External links
Computational tools:
* DNAsp (Windows)
* Variscan (Mac OS X, Linux, Windows)
* Arlequin (Windows)
* Online view of Tajima's D values in human genome
* Python3 package for computation of Tajima's D
* MEGA4 or MEGA5
* Bio::PopGen::Statistics in BioPerl
A video explanation of Tajima's D and its application to DNA sequences is available online.