In genetics, haplotype estimation (also known as "phasing") refers to the process of statistical estimation of haplotypes from genotype data. The most common situation arises when genotypes are collected at a set of polymorphic sites from a group of individuals. For example, in human genetics, genome-wide association studies (GWAS) collect genotypes in thousands of individuals at between 200,000 and 5,000,000 SNPs using microarrays. Haplotype estimation methods are used in the analysis of these datasets and allow genotype imputation of alleles from reference databases such as the HapMap Project and the 1000 Genomes Project.
Genotypes and haplotypes
Genotypes measure the unordered combination of alleles at each locus, whereas haplotypes represent the genetic information at multiple loci that has been inherited together from one of an individual's parents. In theory, the number of possible haplotypes equals the product of the numbers of alleles at each locus under consideration. In particular, most SNPs are bi-allelic; therefore, when considering $N$ heterozygous bi-allelic loci, there are $2^N$ possible haplotypes, which combine into $2^{N-1}$ distinct pairs of haplotypes that could underlie the genotypes. For example, when considering two bi-allelic loci A and B ($N = 2$), with alleles a1 and a2, and b1 and b2, respectively, the possible haplotypes are a1_b1, a1_b2, a2_b1, and a2_b2 ("_" indicates that the alleles are on the same chromosome).
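The two-locus example can be checked by brute force. The following is a minimal sketch, not taken from any phasing package; the function name `consistent_haplotype_pairs` and the representation of a genotype as a list of allele pairs are assumptions made for illustration.

```python
from itertools import product


def consistent_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    `genotype` is a list with one (allele, allele) tuple per locus,
    e.g. [("a1", "a2"), ("b1", "b2")]. (Hypothetical representation.)
    """
    pairs = set()
    # At each locus, either the first allele goes on haplotype 1 (orientation 0)
    # or the second allele does (orientation 1).
    for orientation in product((0, 1), repeat=len(genotype)):
        hap1 = tuple(locus[o] for locus, o in zip(genotype, orientation))
        hap2 = tuple(locus[1 - o] for locus, o in zip(genotype, orientation))
        # A frozenset ignores which haplotype is listed first.
        pairs.add(frozenset((hap1, hap2)))
    return pairs


if __name__ == "__main__":
    for pair in consistent_haplotype_pairs([("a1", "a2"), ("b1", "b2")]):
        print(sorted("_".join(h) for h in pair))
```

For two heterozygous loci the sketch yields two unordered pairs (a1_b1 with a2_b2, and a1_b2 with a2_b1), which between them contain all four haplotypes. Homozygous loci add no ambiguity, so they do not increase the count.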
Haplotype estimation methods
Many statistical methods have been proposed for the estimation of haplotypes. Some of the earliest approaches used a simple multinomial model in which each possible haplotype consistent with the sample was given an unknown frequency parameter, and these parameters were estimated with an expectation–maximization (EM) algorithm. These approaches could only handle small numbers of sites at once, although sequential versions were later developed, notably the SNPHAP method.
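As an illustration of the multinomial EM idea, the sketch below estimates haplotype frequencies for bi-allelic sites. It is a toy version under stated assumptions, not the algorithm of any particular published tool: genotypes are coded as counts (0, 1, 2) of one allele per site, random mating is assumed so that a haplotype pair has probability equal to the product of its two frequencies, and the names `compatible_pairs` and `em_haplotype_frequencies` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All ordered haplotype pairs (h1, h2) whose site-wise sum equals `genotype`."""
    options = []
    for g in genotype:
        if g == 0:
            options.append([(0, 0)])
        elif g == 2:
            options.append([(1, 1)])
        else:  # heterozygous site: phase is unknown, keep both orientations
            options.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*options):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies with a basic EM loop (assumes random mating)."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = {h for pairs in pair_lists for pair in pairs for h in pair}
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: spread each individual over its compatible pairs,
        # weighted by the current frequency estimates.
        for pairs in pair_lists:
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: new frequencies are the expected haplotype counts per chromosome.
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq


if __name__ == "__main__":
    sample = [(0, 0), (0, 0), (2, 2), (2, 2), (1, 1)]
    for hap, f in sorted(em_haplotype_frequencies(sample).items()):
        print(hap, round(f, 4))
```

The number of compatible pairs doubles with every heterozygous site, which is why this direct enumeration is only practical for small numbers of sites.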
The most accurate and widely used methods for haplotype estimation use some form of hidden Markov model (HMM) to carry out inference. For a long time PHASE was the most accurate method. PHASE was the first method to use ideas from coalescent theory concerning the joint distribution of haplotypes. It used a Gibbs sampling approach in which each individual's haplotypes were updated conditional upon the current estimates of the haplotypes of all other samples. Approximations to the distribution of a haplotype conditional upon a set of other haplotypes were used for the conditional distributions of the Gibbs sampler. PHASE was used to estimate the haplotypes from the HapMap Project. However, PHASE was limited by its speed and was not applicable to datasets from genome-wide association studies.
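One common way to approximate the distribution of a haplotype conditional upon a set of other haplotypes is a "copying" HMM, in which the target haplotype is modelled as a mosaic of the others. The sketch below is a deliberately simplified toy model and is not the conditional distribution used by PHASE: the constant `switch_prob` and `error_prob` parameters, and the function name `copying_model_likelihood`, are assumptions made for illustration.

```python
def copying_model_likelihood(target, references, switch_prob=0.01, error_prob=0.001):
    """Forward algorithm for a toy haplotype-copying HMM.

    target:     list of 0/1 alleles for the haplotype being evaluated
    references: list of K equal-length lists of 0/1 alleles
    Hidden state at each site = index of the reference haplotype being copied.
    Returns P(target | references) under this toy model.
    """
    k = len(references)

    def emit(state, site):
        # Copy faithfully with probability 1 - error_prob, otherwise mismatch.
        same = references[state][site] == target[site]
        return 1.0 - error_prob if same else error_prob

    # Uniform prior over which reference is copied at the first site.
    forward = [emit(s, 0) / k for s in range(k)]
    for site in range(1, len(target)):
        total = sum(forward)
        # With probability switch_prob the copied reference is redrawn uniformly
        # (possibly landing on the same one); otherwise it stays unchanged.
        forward = [
            emit(s, site) * ((1.0 - switch_prob) * forward[s] + switch_prob * total / k)
            for s in range(k)
        ]
    return sum(forward)


if __name__ == "__main__":
    refs = [[0, 0, 0, 0], [1, 1, 1, 1]]
    print(copying_model_likelihood([0, 0, 0, 0], refs))  # matches a reference
    print(copying_model_likelihood([0, 1, 0, 1], refs))  # needs several switches
```

In a Gibbs sampler of the kind described above, a likelihood of this sort would be evaluated for candidate haplotypes of one individual against the current haplotype estimates of everyone else. The cost of this forward pass is proportional to the number of sites times K.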
The fastPHASE and BEAGLE methods introduced haplotype cluster models applicable to GWAS-sized datasets. Subsequently, the IMPUTE2 and MaCH methods were introduced; these were similar to the PHASE approach but much faster. These methods iteratively update the haplotype estimates of each sample conditional upon a subset of K haplotype estimates from other samples. IMPUTE2 introduced the idea of carefully choosing which subset of haplotypes to condition on in order to improve accuracy. Accuracy increases with K, but the computational cost grows quadratically with K.
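The text does not specify how the conditioning subset is chosen. One plausible criterion, used here purely as an assumption, is to keep the K haplotypes that differ from the sample's current haplotype estimate at the fewest sites (smallest Hamming distance). The function name `select_conditioning_haplotypes` is hypothetical.

```python
def select_conditioning_haplotypes(current, candidates, k):
    """Return the k candidate haplotypes closest to `current` by Hamming distance.

    current:    list of 0/1 alleles (current estimate for one haplotype)
    candidates: list of equal-length lists of 0/1 alleles from other samples
    Ties are broken by the original order of `candidates` (sorted() is stable).
    """

    def hamming(a, b):
        # Number of sites at which the two haplotypes carry different alleles.
        return sum(x != y for x, y in zip(a, b))

    return sorted(candidates, key=lambda h: hamming(current, h))[:k]


if __name__ == "__main__":
    current = [0, 0, 0, 0]
    others = [[1, 1, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0], [1, 1, 0, 0]]
    print(select_conditioning_haplotypes(current, others, k=2))
```

Restricting the model to a small, well-chosen subset keeps K low, which matters because the cost of each update rises steeply with K.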
The SHAPEIT1 method made a major advance by introducing a method whose complexity is linear in K and that operates only on the space of haplotypes consistent with an individual's genotypes. The HAPI-UR method subsequently proposed a very similar approach. SHAPEIT2 combines the best features of SHAPEIT1 and IMPUTE2 to improve efficiency and accuracy.
See also
* List of haplotype estimation and genotype imputation software
* Imputation: predict missing genotypes using known haplotypes