A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium that represents a group of SNPs called a haplotype. Using tag SNPs, it is possible to identify genetic variation and association with phenotypes without genotyping every SNP in a chromosomal region. This reduces the expense and time of mapping genome areas associated with disease, since it eliminates the need to study every individual SNP. Tag SNPs are useful in whole-genome SNP association studies in which hundreds of thousands of SNPs across the entire genome are genotyped.

Introduction

Linkage Disequilibrium

Two loci are said to be in linkage equilibrium (LE) if their alleles are inherited independently. If the alleles at those loci are non-randomly associated, the loci are said to be in linkage disequilibrium (LD). LD is most commonly caused by physical linkage of genes. When two genes are inherited on the same chromosome, they can be in high LD, depending on the distance between them and the likelihood of recombination between the loci. However, LD can also be observed due to functional interactions, where even genes from different chromosomes can jointly confer an evolutionarily selected phenotype or can affect the viability of potential offspring. Within families LD is highest, because the number of recombination events is lowest (fewest meiosis events). This is especially true between inbred lines. In populations, LD exists because of selection, physical closeness of genes that causes low recombination rates, or recent crossing or migration. On a population level, processes that influence linkage disequilibrium include genetic linkage, epistatic natural selection, rate of recombination, mutation, genetic drift, random mating, genetic hitchhiking and gene flow.

When a group of SNPs is inherited together because of high LD, the SNPs tend to carry redundant information. Selecting a tag SNP as a representative of such a group reduces redundancy when analyzing parts of the genome associated with traits or diseases. The regions of the genome in high LD that harbor a specific set of SNPs that are inherited together are also known as haplotypes. Therefore, tag SNPs are representative of all SNPs within a haplotype.
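The strength of LD between two biallelic SNPs is commonly summarized by the coefficient D = p_AB - p_A * p_B and the squared correlation r² = D² / (p_A (1 - p_A) p_B (1 - p_B)). The following is a minimal sketch of that calculation, assuming phased haplotypes encoded as sequences of 0/1 alleles; the function name `ld_stats` and the toy data are illustrative only.

```python
def ld_stats(haplotypes, a, b):
    """Return (D, r2) for the SNPs at column indices a and b.

    haplotypes: list of equal-length sequences of 0/1 alleles (phased data).
    """
    n = len(haplotypes)
    p_a = sum(h[a] for h in haplotypes) / n          # frequency of allele 1 at SNP a
    p_b = sum(h[b] for h in haplotypes) / n          # frequency of allele 1 at SNP b
    p_ab = sum(h[a] * h[b] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0         # monomorphic SNPs: define r2 as 0
    return d, r2


# Two SNPs that always travel together (complete LD) ...
linked = [(0, 0), (0, 0), (1, 1), (1, 1)]
# ... and two SNPs whose alleles combine freely (linkage equilibrium).
unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
```

With these toy inputs, `ld_stats(linked, 0, 1)` gives r² = 1 (either SNP could serve as a tag for the other), while `ld_stats(unlinked, 0, 1)` gives r² = 0.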

Haplotypes

The selection of tag SNPs depends on the haplotypes present in the genome. Most sequencing technologies provide genotypic information but not haplotypes, i.e. they provide information on the specific bases that are present but do not provide phase information (on which specific chromosome each of the bases appears). Haplotypes can be determined through molecular methods (allele-specific PCR, somatic cell hybrids). These methods distinguish which allele is present on which chromosome by separating the chromosomes before genotyping. They can be very time-consuming and expensive, so statistical inference methods have been developed as a less expensive and automated option. These statistical-inference software packages use parsimony, maximum likelihood, and Bayesian algorithms to determine haplotypes. A disadvantage of statistical inference is that a proportion of the inferred haplotypes could be wrong.

Population differences

When haplotypes are used for genome-wide association studies, it is important to note the population being studied. Different populations often have different patterns of LD. One example is the difference between African-descended populations and European- and Asian-descended populations. Since humans originated in Africa and spread into Europe and then the Asian and American continents, African populations are the most genetically diverse and have smaller regions of LD, while European- and Asian-descended populations have larger regions of LD due to the founder effect. When LD patterns differ between populations, SNPs can become dissociated from each other due to the changes in haplotype blocks. This means that tag SNPs, as representatives of the haplotype blocks, are population-specific, and population differences should be taken into account when performing association studies.

Application

LD plot of SNPs with top-ranked BFs in CHB of 1000 Genome Phase I

GWAS

Almost every trait has both genetic and environmental influences. Heritability is the proportion of phenotypic variance that is inherited from our ancestors. Association studies are used to determine the genetic influence on phenotypic presentation. Although mostly used for mapping diseases to genomic areas, they can also be used to map the heritability of any phenotype, such as height or eye color.

Genome-wide association studies (GWAS) use single-nucleotide polymorphisms (SNPs) to identify genetic associations with clinical conditions and phenotypic traits. They are hypothesis-free and use a whole-genome approach to investigate traits by comparing a large group of individuals who express a phenotype with a large group of individuals who do not. The ultimate goal of GWAS is to determine genetic risk factors that can be used to predict who is at risk for a disease, to understand the biological underpinnings of disease susceptibility, and to create new prevention and treatment strategies. The National Human Genome Research Institute and the European Bioinformatics Institute publish the GWAS Catalog, a catalog of published genome-wide association studies that highlights statistically significant associations between hundreds of SNPs and a broad range of phenotypes.

Due to the large number of possible SNP variants (more than 149 million as of June 2015), it is still very expensive to sequence all SNPs. That is why GWAS use customizable arrays (SNP chips) to genotype only a subset of the variants, identified as tag SNPs. Most GWAS use products from the two primary genotyping platforms. The Affymetrix platform prints DNA probes on a glass or silicon chip that hybridize to specific alleles in the sample DNA. The Illumina platform uses bead-based technology with longer DNA sequences and produces better specificity. Both platforms are able to genotype more than a million tag SNPs using either pre-made or custom DNA oligos.

Genome-wide studies are predicated on the common disease-common variant (CD/CV) hypothesis, which states that common disorders are influenced by common genetic variation. The effect size (penetrance) of common variants needs to be smaller than that of variants found in rare disorders. This means that a common SNP can explain only a small portion of the variance due to genetic factors, and that common diseases are influenced by multiple common alleles of small effect size. Another hypothesis is that common diseases are caused by rare variants that are synthetically linked to common variants. In that case the signal produced by GWAS is an indirect (synthetic) association between a genotyped common variant and one or more rare causal variants in linkage disequilibrium with it. It is important to recognize that this phenomenon is possible when selecting a group of tag SNPs. When a disease is found to be associated with a haplotype, some SNPs in that haplotype will have a synthetic association with the disease. To pinpoint the causal SNPs, greater resolution is needed in the selection of haplotype blocks. Since whole-genome sequencing technologies are rapidly changing and becoming less expensive, it is likely that they will replace the current genotyping technologies and provide the resolution needed to pinpoint causal variants.

HapMap

Because whole-genome sequencing of individuals is still cost-prohibitive, the International HapMap Project was set up with the goal of mapping the human genome to haplotype groupings (haplotype blocks) that can describe common patterns of human genetic variation. By mapping the entire genome to haplotypes, tag SNPs can be identified to represent the haplotype blocks examined by genetic studies. An important factor to consider when planning a genetic study is the frequency of specific alleles and the risk they confer. These factors can vary between populations, so the HapMap Project used a variety of sequencing techniques to discover and catalog SNPs from different sets of populations. Initially the project sequenced individuals from the Yoruba population of African origin (YRI), residents of Utah with western European ancestry (CEU), unrelated individuals from Tokyo, Japan (JPT) and unrelated Han Chinese individuals from Beijing, China (CHB). Later, the datasets were expanded to include other populations (11 groups).

Selection and evaluation

Steps for tag SNP selection

Selection of maximally informative tag SNPs is an NP-complete problem. However, algorithms can be devised to provide an approximate solution within a margin of error. Each tag SNP selection algorithm is defined by the following steps:

# Define the area to search: the algorithm will attempt to locate tag SNPs in the neighborhood N(t) of a target SNP t.
# Define a metric to assess the quality of tagging: the metric needs to measure how well a target SNP t can be predicted using a set of its neighbors N(t), i.e. how well a tag SNP, as a representative of the SNPs in a neighborhood N(t), can predict a target SNP t. It can be defined as the probability that the target SNP t has different values for any pair of haplotypes i and j where the value of the SNP s is also different for the same haplotypes. The informativeness of the metric can be represented in graph-theoretic terms, where every SNP s is represented as a graph Gs whose nodes are haplotypes. Gs has an edge between the nodes (i, j) if and only if the values of s are different for the haplotypes Hi and Hj.
# Derive the algorithm to find representative SNPs: the goal of the algorithm is to find the minimal subset of tag SNPs with maximum informativeness between each tag SNP and every other target SNP.
# Validate the algorithm.
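The graph-based metric in step 2 can be sketched in a few lines. This is an illustrative reading of the definition, not a reference implementation: it builds the edge set of Gs for each SNP and scores a candidate tag s against a target t as the fraction of haplotype pairs distinguished by t that are also distinguished by s. The normalization by the edges of Gt and the names `edges` and `informativeness` are assumptions made for this example.

```python
from itertools import combinations


def edges(haplotypes, snp):
    """Edge set of G_snp: pairs (i, j) of haplotypes whose alleles differ at `snp`."""
    return {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if haplotypes[i][snp] != haplotypes[j][snp]
    }


def informativeness(haplotypes, s, t):
    """Fraction of haplotype pairs separated by target t that tag s also separates."""
    e_t = edges(haplotypes, t)
    if not e_t:
        return 1.0  # assumed convention: a monomorphic target is trivially predicted
    return len(edges(haplotypes, s) & e_t) / len(e_t)


# Toy data: SNP 0 and SNP 1 are identical, SNP 2 varies independently of them.
H = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
```

Here `informativeness(H, 0, 1)` is 1.0, because SNP 0 separates every pair that SNP 1 separates, whereas `informativeness(H, 2, 1)` is 0.5, because SNP 2 separates only two of the four pairs that SNP 1 separates.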

Feature selection

Methods for selecting features fall into two categories: filter methods and wrapper methods. Filter algorithms are general preprocessing algorithms that do not assume the use of a specific classification method. Wrapper algorithms, in contrast, "wrap" the feature selection around a specific classifier and select a subset of features based on the classifier's accuracy using cross-validation. A feature selection method suitable for selecting tag SNPs must have the following characteristics:

* scale well for a large number of SNPs;
* not require explicit class labeling, and not assume the use of a specific classifier, because classification is not the goal of tag SNP selection;
* allow the user to select different numbers of tag SNPs for different amounts of tolerated information loss;
* have comparable performance with other methods satisfying the first three conditions.

Selection algorithms

Several algorithms have been proposed for selecting tag SNPs. The first approach was based on a measure of goodness of SNP sets and searched for SNP subsets that are small but attain a high value of the defined measure. Examining every SNP subset to find good ones is computationally feasible only for small data sets.

Another approach uses principal component analysis (PCA) to find subsets of SNPs that capture the majority of the data variance. A sliding-window method is employed to repeatedly apply PCA to short chromosomal regions. This reduces the data produced and does not require exponential search time. Yet it is not feasible to apply the PCA method to large chromosomal data sets, as it is computationally complex.

The most commonly used approach, the block-based method, exploits the principle of linkage disequilibrium observed within haplotype blocks. Several algorithms have been devised to partition chromosomal regions into haplotype blocks based on haplotype diversity, LD, the four-gamete test and information complexity, and tag SNPs are selected from all SNPs that belong to a block. The main presumption in this approach is that the SNPs are biallelic. The main drawback is that the definition of blocks is not always straightforward. Even though there is a list of criteria for forming haplotype blocks, there is no consensus on them. Also, selecting tag SNPs based on local correlations ignores inter-block correlations.

Unlike the block-based approach, a block-free approach does not rely on the block structure. SNP frequency and recombination rates are known to vary across the genome, and some studies have reported LD distances much longer than the reported maximum block sizes. Setting a strict border for the neighborhood is not desirable, so the block-free approach looks for tag SNPs globally. There are several algorithms to perform this. In one algorithm, the non-tagging SNPs are represented as boolean functions of tag SNPs, and set theory techniques are used to reduce the search space. Another algorithm searches for subsets of markers that can come from non-consecutive blocks. Due to the marker neighborhood, the search space is reduced.
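As an illustration of one block-partitioning criterion mentioned above, the following sketch applies the four-gamete test to biallelic SNPs: if all four allele combinations (00, 01, 10, 11) are observed for a pair of SNPs, a historical recombination between them is inferred. The greedy left-to-right partitioning rule and the function names are assumptions chosen for brevity; published block-finding methods differ in their details.

```python
def four_gametes(haplotypes, a, b):
    """True if all four allele combinations occur at SNPs a and b."""
    return len({(h[a], h[b]) for h in haplotypes}) == 4


def partition_blocks(haplotypes):
    """Greedily split SNP indices into blocks (start, end), inclusive.

    A new block starts at SNP k as soon as k shows all four gametes
    with any SNP already in the current block.
    """
    n_snps = len(haplotypes[0])
    blocks, start = [], 0
    for k in range(1, n_snps):
        if any(four_gametes(haplotypes, j, k) for j in range(start, k)):
            blocks.append((start, k - 1))
            start = k
    blocks.append((start, n_snps - 1))
    return blocks


# Toy data: SNPs 0-1 show three gametes, SNPs 2-3 show two,
# but SNP 2 shows all four gametes with SNP 0.
H = [
    (0, 0, 0, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 1),
    (1, 1, 1, 1),
    (1, 0, 0, 0),
]
```

On this toy input, `partition_blocks(H)` splits the four SNPs into two blocks, `(0, 1)` and `(2, 3)`; tag SNPs would then be chosen within each block.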

Optimizations

As the number of genotyped individuals and the number of SNPs in databases grow, tag SNP selection takes too much time to compute. To improve the efficiency of tag SNP selection, one algorithm first sets aside the fact that SNPs are biallelic, and then compresses the length (number of SNPs) of the haplotype matrix by grouping the SNP sites that carry the same information. SNP sites that partition the haplotypes into the same groups are called redundant sites. SNP sites that contain distinct information within a block are called non-redundant sites (NRS). To further compress the haplotype matrix, the algorithm needs to find tag SNPs such that all haplotypes of the matrix can be distinguished. Using the idea of joint partition, an efficient tag SNP selection algorithm is obtained.
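The compression step can be sketched as follows, under the assumption that haplotypes are given as equal-length strings (or tuples) of alleles. Sites are grouped by the partition of haplotypes they induce, regardless of which allele symbols are used, and a helper checks whether a chosen set of sites jointly distinguishes every haplotype. The function names are hypothetical and this is not the published algorithm itself.

```python
def partition_key(column):
    """Canonical form of the partition induced by one SNP site.

    Alleles are relabeled in order of first appearance, so two sites that
    split the haplotypes into the same groups receive the same key.
    """
    labels = {}
    return tuple(labels.setdefault(allele, len(labels)) for allele in column)


def group_redundant_sites(haplotypes):
    """Group site indices that partition the haplotypes identically."""
    groups = {}
    for site in range(len(haplotypes[0])):
        key = partition_key([h[site] for h in haplotypes])
        groups.setdefault(key, []).append(site)
    return list(groups.values())


def distinguishes_all(haplotypes, sites):
    """True if the joint partition over `sites` separates all distinct haplotypes."""
    projected = {tuple(h[s] for s in sites) for h in haplotypes}
    return len(projected) == len({tuple(h) for h in haplotypes})


H = ["AGT", "AGC", "CTT", "CTC"]
```

For this toy matrix, sites 0 and 1 are redundant with each other, so `group_redundant_sites(H)` returns `[[0, 1], [2]]`. Keeping one site per group, sites 0 and 2 are enough to tell all four haplotypes apart, while sites 0 and 1 are not.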

Validation of the accuracy of the algorithm

Depending on how the tag SNPs are selected, different prediction methods have been used during the cross-validation process. In one approach, a machine learning method was employed to predict the left-out haplotype. Another approach predicted the alleles of a non-tagging SNP n from the tag SNPs that had the highest correlation coefficient with n. If a single highly correlated tag SNP t is found, the alleles are assigned so that their frequencies agree with the allele frequencies of t. When multiple tagging SNPs have the same (high) correlation coefficient with n, the common allele of n has the advantage. In this case the prediction method agrees well with the selection method, which uses PCA on the matrix of correlation coefficients between SNPs.

There are other ways to assess the accuracy of a tag SNP selection method. The accuracy can be evaluated by the quality measure R², which is the measure of association between the true number of haplotype copies, defined over the full set of SNPs, and the predicted number of haplotype copies, where the prediction is based on the subset of tagging SNPs. This measure assumes diploid data and explicit inference of haplotypes from genotypes.

Another assessment method, due to Clayton, is based on a measure of the diversity of haplotypes. The diversity is defined as the total number of differences in all pairwise comparisons between haplotypes. The difference between a pair of haplotypes is the sum of differences over all the SNPs. Clayton's diversity measure can be used to define how well a set of tag SNPs differentiates different haplotypes. This measure is suitable only for haplotype blocks with limited haplotype diversity, and it is not clear how to use it for large data sets consisting of multiple haplotype blocks.

Some more recent works evaluate tag SNP selection algorithms based on how well the tagging SNPs can be used to predict non-tagging SNPs. The prediction accuracy is determined using cross-validation, such as leave-one-out or hold-out. In leave-one-out cross-validation, for each sequence in the data set, the algorithm is run on the rest of the data set to select a minimum set of tagging SNPs, which are then used to predict the left-out sequence.
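A leave-one-out evaluation loop of this kind can be sketched as below. The predictor used here (majority vote among training haplotypes that match the held-out haplotype at the tag SNPs) is a deliberately simple stand-in chosen for this example, not one of the published prediction methods, and `select_tags` is a hypothetical callback standing for whichever selection algorithm is being evaluated.

```python
from collections import Counter


def predict_allele(train, tags, target, held_out):
    """Predict the held-out allele at `target` from its alleles at the tag SNPs."""
    key = tuple(held_out[s] for s in tags)
    matches = [h[target] for h in train if tuple(h[s] for s in tags) == key]
    # Fall back to the overall most common allele if no training haplotype matches.
    pool = matches or [h[target] for h in train]
    return Counter(pool).most_common(1)[0][0]


def leave_one_out_accuracy(haplotypes, select_tags):
    """Fraction of non-tag alleles predicted correctly over all left-out haplotypes."""
    correct = total = 0
    for i, held_out in enumerate(haplotypes):
        train = haplotypes[:i] + haplotypes[i + 1:]
        tags = select_tags(train)  # selection sees only the training data
        for target in range(len(held_out)):
            if target in tags:
                continue
            total += 1
            correct += predict_allele(train, tags, target, held_out) == held_out[target]
    return correct / total if total else 1.0


# Toy data in which SNP 0 fully determines SNPs 1 and 2.
H = [(0, 0, 1), (0, 0, 1), (1, 1, 0), (1, 1, 0)]
accuracy = leave_one_out_accuracy(H, lambda train: [0])
```

Because SNP 0 determines the other two SNPs in this toy data set, always tagging with SNP 0 yields an accuracy of 1.0. On real data the callback would run the selection method on each training fold.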

Tools

Tagger

Tagger is a web tool for evaluating and selecting tag SNPs from genotypic data such as that of the International HapMap Project. It uses pairwise methods and multimarker haplotype approaches. Users can upload HapMap genotype data or data in pedigree format, and the linkage disequilibrium patterns will be calculated. Tagger options allow the user to specify chromosomal landmarks, which indicate regions of interest in the genome for picking tag SNPs. The program then produces a list of tag SNPs and their statistical test values, as well as a coverage report. It was developed by Paul de Bakker in the labs of David Altshuler and Mark Daly at the Center for Human Genetic Research of Massachusetts General Hospital and Harvard Medical School, at the Broad Institute.
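To give a sense of what a pairwise method does, the following is a generic greedy sketch and not Tagger's actual implementation. It repeatedly picks the SNP that captures the most still-untagged SNPs at or above an r² threshold. The threshold of 0.8, the 0/1 haplotype encoding and all function names are assumptions made for this example.

```python
def r_squared(haplotypes, a, b):
    """Squared correlation between biallelic SNPs a and b (alleles coded 0/1)."""
    n = len(haplotypes)
    p_a = sum(h[a] for h in haplotypes) / n
    p_b = sum(h[b] for h in haplotypes) / n
    p_ab = sum(h[a] * h[b] for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return (p_ab - p_a * p_b) ** 2 / denom if denom > 0 else 0.0


def greedy_pairwise_tags(haplotypes, threshold=0.8):
    """Pick tag SNPs so that every SNP has r2 >= threshold with some tag."""
    n_snps = len(haplotypes[0])
    # covers[s]: SNPs that s would capture if chosen as a tag (always includes s).
    covers = {
        s: {t for t in range(n_snps)
            if t == s or r_squared(haplotypes, s, t) >= threshold}
        for s in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Ties are broken in favor of the lowest SNP index.
        best = max(range(n_snps), key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Toy data: SNPs 0 and 1 are perfectly correlated, as are SNPs 2 and 3.
H = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 1)]
```

On this toy input, `greedy_pairwise_tags(H)` returns `[0, 2]`: one tag for each pair of perfectly correlated SNPs.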

CLUSTAG and WCLUSTAG

The freeware programs CLUSTAG and WCLUSTAG contain clustering and set-cover algorithms for obtaining a set of tag SNPs that can represent all the known SNPs in a chromosomal region. The programs are implemented in Java and can run on Windows as well as in Unix environments. They were developed by Sio-Iong Ao et al. at the University of Hong Kong.
