Imputation in genetics refers to the statistical inference of unobserved genotypes. It is achieved by using known haplotypes in a population, for instance from the HapMap or the 1000 Genomes Project in humans. This makes it possible to test for association between a trait of interest (e.g. a disease) and genetic variants that were not typed experimentally but whose genotypes have been statistically inferred ("imputed"). Genotype imputation is usually performed on SNPs, the most common kind of genetic variation. Genotype imputation hence helps in narrowing down the location of probably causal variants in genome-wide association studies, because it increases the SNP density (the genome size remains constant, but the number of genetic variants increases) and thus reduces the distance between two adjacent SNPs.
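The effect on SNP density can be illustrated with a small arithmetic sketch. The positions below are invented for illustration and do not come from any real array or reference panel; the helper name `mean_spacing` is likewise hypothetical.

```python
def mean_spacing(positions):
    """Mean distance in base pairs between adjacent variant positions."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical positions (base pairs) on one chromosome segment.
typed = [1000, 5000, 9000]                       # SNPs measured on the array
imputed = [2000, 3000, 4000, 6000, 7000, 8000]   # SNPs inferred by imputation

spacing_before = mean_spacing(typed)             # array SNPs only
spacing_after = mean_spacing(typed + imputed)    # array plus imputed SNPs

print(spacing_before, spacing_after)
```

The segment length is unchanged (1000 to 9000), but the average gap between adjacent variants shrinks once the imputed sites are included.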


Context

In genetic epidemiology and quantitative genetics, researchers aim at identifying genomic locations where variation between individuals is associated with variation in traits of interest between individuals. Such studies hence require access to the genetic makeup of a set of individuals. Sequencing the whole genome of each individual in the study is often too costly, so only a subset of the genome can be measured. This often means, first, only considering single-nucleotide polymorphisms (SNPs) and neglecting copy number variants, and second, only measuring SNPs known to be variable enough in the population that they are likely to be also variable in the set of individuals under consideration. The most informative subset of SNPs is chosen based on the distribution of common genetic variation along the genome, for instance as produced by the HapMap or the 1000 Genomes Project in humans. These SNPs are then used to build a micro-array, thereby allowing each individual in the study to be genotyped at all these SNPs simultaneously.


Motivation

Genotyping arrays used for genome-wide association studies (GWAS) are based on tagging SNPs and therefore do not directly genotype all variation in the genome. Imputation of the genotypes to a reference panel that has been genotyped for a greater number of variants boosts the coverage of genomic variation beyond the original genotypes. As a consequence, one can assess the effect of more SNPs than those on the original micro-array. Importantly, imputation has facilitated meta-analysis of datasets that have been genotyped on different arrays, by increasing the overlap of variants available for analysis between arrays.
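The gain in overlap for meta-analysis can be sketched with plain set operations. The variant identifiers and array contents below are made up, and the sketch makes the idealised assumption that every variant in the reference panel can be imputed in both studies; in practice poorly imputed variants are usually filtered out.

```python
# Hypothetical variant sets; identifiers are placeholders, not real array content.
array_a = {"rs1", "rs2", "rs3"}
array_b = {"rs3", "rs4", "rs5"}
reference_panel = {"rs1", "rs2", "rs3", "rs4", "rs5", "rs6"}

# Variants that can be analysed jointly without imputation.
shared_before = array_a & array_b

# Idealised assumption: after imputation each study has a genotype
# (typed or imputed) at every reference-panel variant.
imputed_a = array_a | reference_panel
imputed_b = array_b | reference_panel
shared_after = imputed_a & imputed_b

print(len(shared_before), len(shared_after))
```

Here the two studies share a single directly genotyped variant, while after imputation to the same panel all six panel variants are available in both.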


Tools

There are several software packages available to impute genotypes from a genotyping array to reference panels, such as 1000 Genomes Project haplotypes. These tools include MaCH, Minimac, IMPUTE2 and Beagle. Each tool has its own trade-offs in speed and accuracy. Additional phasing tools such as SHAPEIT2 allow prephasing of input haplotypes for improved imputation accuracy and computational performance. In early imputation usage, haplotypes from HapMap populations were used as a reference panel, but these have been superseded by haplotypes from the 1000 Genomes Project, which provide reference panels with more samples, across more diverse populations, and with greater genetic marker density. As of mid-2014, whole-genome sequence data is publicly available from the 1000 Genomes Project website for 2535 individuals from 26 different populations around the world.


Statistical models

Designing accurate statistical models for genotype imputation is closely related to the problem of haplotype estimation ("phasing") and is an active area of research.
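To make the idea of inferring untyped genotypes from reference haplotypes concrete, here is a deliberately simplified sketch. It is not the method of any package named in this article: it copies missing alleles from the single reference haplotype that agrees best with the study haplotype at the typed sites, whereas real tools rely on more elaborate probabilistic models. The function name `impute_haplotype`, the 0/1 allele coding and the data are all assumptions made for illustration.

```python
def impute_haplotype(study, reference_panel):
    """Fill None entries in `study` from the best-matching reference haplotype.

    `study` is a list of alleles (0/1) with None at untyped sites;
    `reference_panel` is a list of fully typed haplotypes of the same length.
    """
    typed = [i for i, allele in enumerate(study) if allele is not None]

    def mismatches(ref):
        # Number of typed sites where the reference disagrees with the study.
        return sum(ref[i] != study[i] for i in typed)

    # Ties are resolved in favour of the first haplotype in the panel.
    best = min(reference_panel, key=mismatches)
    return [best[i] if allele is None else allele
            for i, allele in enumerate(study)]


# Hypothetical reference haplotypes over five SNPs.
reference_panel = [
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1],
]
# Study haplotype typed at sites 0, 2 and 4 only.
study = [1, None, 0, None, 0]

imputed = impute_haplotype(study, reference_panel)
print(imputed)
```

The sketch takes a phased haplotype as input, which is one reason imputation and phasing are so closely linked: unphased array genotypes must first be resolved into haplotypes, or both problems must be handled jointly.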


See also

* List of haplotype estimation and genotype imputation software
* Haplotype estimation

