SNV calling from NGS data is any of a range of methods for identifying the existence of
single nucleotide variant
In genetics, a single-nucleotide polymorphism (SNP ; plural SNPs ) is a germline substitution of a single nucleotide at a specific position in the genome. Although certain definitions require the substitution to be present in a sufficiently larg ...
s (SNVs) from the results of
next generation sequencing
DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The ...
(NGS) experiments. These are computational techniques, and are in contrast to special experimental methods based on known population-wide single nucleotide polymorphisms (see
SNP genotyping
SNP genotyping is the measurement of genetic variations of single nucleotide polymorphisms (SNPs) between members of a species. It is a form of genotyping, which is the measurement of more general genetic variation. SNPs are one of the most common ...
). Due to the increasing abundance of NGS data, these techniques are becoming increasingly popular for performing SNP genotyping, with a wide variety of algorithms designed for specific experimental designs and applications.
In addition to the usual application domain of SNP genotyping, these techniques have been successfully adapted to identify rare SNPs within a population,
as well as detecting
somatic
Somatic may refer to:
* Somatic (biology), referring to the cells of the body in contrast to the germ line cells
** Somatic cell, a non-gametic cell in a multicellular organism
* Somatic nervous system, the portion of the vertebrate nervous sys ...
SNVs within an individual using multiple tissue samples.
Methods for detecting germline variants
Most NGS based methods for SNV detection are designed to detect
germline
In biology and genetics, the germline is the population of a multicellular organism's cells that pass on their genetic material to the progeny (offspring). In other words, they are the cells that form the egg, sperm and the fertilised egg. They ...
variations in the individual's genome. These are the mutations that an individual biologically inherits from their parents, and are the usual type of variants searched for when performing such analysis (except for certain specific applications where
somatic mutations are sought). Very often, the searched for variants occur with some (possibly rare) frequency, throughout the population, in which case they may be referred to as
single nucleotide polymorphisms
In genetics, a single-nucleotide polymorphism (SNP ; plural SNPs ) is a germline substitution of a single nucleotide at a specific position in the genome. Although certain definitions require the substitution to be present in a sufficiently larg ...
(SNPs). Technically the term SNP only refers to these kinds of variations, however in practice they are often used synonymously with SNV in the literature on variant calling. In addition, since the detection of germline SNVs requires determining the individual's genotype at each locus, the phrase "SNP genotyping" may also be used to refer to this process. However this phrase may also refer to wet-lab experimental procedures for classifying genotypes at a set of known SNP locations.
The usual process of such techniques are based around:
# Filtering the set of NGS reads to remove sources of error/bias
# Aligning the reads to a reference genome
# Using an algorithm, either based on a statistical model or some heuristics, to predict the likelihood of variation at each locus, based on the quality scores and allele counts of the aligned reads at that locus
# Filtering the predicted results, often based on metrics relevant to the application
#
SNP annotation
Single nucleotide polymorphism annotation ( SNP annotation) is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in ...
to predict the functional effect of each variation.
The usual output of these procedures is a
VCF file.
Probabilistic methods
In an ideal error free world with high read
coverage
Coverage may refer to:
Filmmaking
* Coverage (lens), the size of the image a lens can produce
* Camera coverage, the amount of footage shot and different camera setups used in filming a scene
* Script coverage, a short summary of a script, wri ...
, the task of variant calling from the results of a NGS data alignment would be simple; at each
locus
Locus (plural loci) is Latin for "place". It may refer to:
Entertainment
* Locus (comics), a Marvel Comics mutant villainess, a member of the Mutant Liberation Front
* ''Locus'' (magazine), science fiction and fantasy magazine
** ''Locus Award' ...
(position on the genome) the number of occurrences of each distinct nucleotide among the reads aligned at that position can be counted, and the true genotype would be obvious; either AA if all nucleotides match allele A, BB if they match allele B, or AB if there is a mixture. However, when working with real NGS data this sort of naive approach is not used, as it cannot account for the noise in the input data.
The nucleotide counts used for base calling contain errors and bias, both due do the sequenced reads themselves, and the alignment process. This issue can be mitigated to some extent by sequencing to a greater depth of read coverage, however this is often expensive, and many practical studies require making inferences on low coverage data.
Probabilistic methods aim to overcome the above issue, by producing robust estimates of the probabilities of each of the possible genotypes, taking into account noise, as well as other available prior information that can be used to improve estimates. A genotype can then be predicted based on these probabilities, often according to the
MAP
A map is a symbolic depiction emphasizing relationships between elements of some space, such as objects, regions, or themes.
Many maps are static, fixed to paper or some other durable medium, while others are dynamic or interactive. Although ...
estimate.
Probabilistic methods for variant calling are based on
Bayes' theorem
In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule), named after Thomas Bayes, describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For examp ...
. In the context of variant calling, Bayes' Theorem defines the probability of each genotype being the true genotype given the observed data, in terms of the prior probabilities of each possible genotype, and the probability distribution of the data given each possible genotype. The formula is:
:
In the above equation:
*
refers to the observed data; that is, the aligned reads
*
is the genotype whose probability is being calculated
*
refers to the ''i''th possible genotype, out of ''n'' possibilities
Given the above framework, different software solutions for detecting SNVs vary based on how they calculate the prior probabilities
, the error model used to model the probabilities
, and the partitioning of the overall genotypes into separate sub-genotypes, whose probabilities can be individually estimated in this framework.
Prior genotype probability estimation
The calculation of prior probabilities depends on available data from the genome being studied, and the type of analysis being performed. For studies where good reference data containing frequencies of known mutations is available (for example, in studying human genome data), these known frequencies of genotypes in the population can be used to estimate priors. Given population wide allele frequencies, prior genotype probabilities can be calculated at each locus according to the
Hardy–Weinberg equilibrium.
In the absence of such data, constant priors can be used, independent of the locus. These can be set using heuristically chosen values, possibly informed by the kind of variations being sought by the study. Alternatively, supervised machine-learning procedures have been investigated that seek to learn optimal prior values for individuals in a sample, using supplied NGS data from these individuals.
Error models for data observations
The error model used in creating a probabilistic method for variant calling is the basis for calculating the
term used in Bayes' theorem. If the data was assumed to be error free, then the distribution of observed nucleotide counts at each locus would follow a
Binomial Distribution
In probability theory and statistics, the binomial distribution with parameters ''n'' and ''p'' is the discrete probability distribution of the number of successes in a sequence of ''n'' independent experiments, each asking a yes–no quest ...
, with 100% of nucleotides matching the A or B allele respectively in the AA and BB cases, and a 50% chance of each nucleotide matching either A or B in the AB case. However, in presence of noise in the read data this assumption is violated, and the
values need to account for the possibility that erroneous nucleotides are present in the aligned reads at each locus.
A simple error model is to introduce a small error to the data probability term in the homozygous cases, allowing a small constant probability that nucleotides which don't match the A allele are observed in the AA case, and respectively a small constant probability that nucleotides not matching the B allele are observed in the BB case. However more sophisticated procedures are available which attempt to more realistically replicate the actual error patterns observed in real data in calculating the conditional data probabilities. For instance, estimations of read quality (measured as
Phred quality scores) have been incorporated in these calculations, taking into account the expected error rate in each individual read at a locus.
Another technique that has successfully been incorporated into error models is base quality recalibration, where separate error rates are calculated – based on prior known information about error patterns – for each possible nucleotide substitution. Research shows that each possible nucleotide substitution is not equally likely to show up as an error in sequencing data, and so base quality recalibration has been applied to improve error probability estimates.
Partitioning of the genotype
In the above discussion, it has been assumed that the genotype probabilities at each locus are calculated independently; that is, the entire genotype is partitioned into independent genotypes at each locus, whose probabilities are calculated independently. However, due to
linkage disequilibrium
In population genetics, linkage disequilibrium (LD) is the non-random association of alleles at different loci in a given population. Loci are said to be in linkage disequilibrium when the frequency of association of their different alleles is h ...
the genotypes of nearby loci are in general not independent. As a result, partitioning the overall genotype instead into a sequence of overlapping
haplotypes
A haplotype (haploid genotype) is a group of alleles in an organism that are inherited together from a single parent.
Many organisms contain genetic material ( DNA) which is inherited from two parents. Normally these organisms have their DNA org ...
allows these correlations to be modelled, resulting in more precise probability estimates through the incorporation of population-wide haplotype frequencies in the prior. The use of haplotypes to improve variant detection accuracy has been applied successfully, for instance in the
1000 Genomes Project
The 1000 Genomes Project (abbreviated as 1KGP), launched in January 2008, was an international research effort to establish by far the most detailed catalogue of human genetic variation. Scientists planned to sequence the genomes of at least one th ...
.
Heuristic based algorithms
As an alternative to probabilistic methods,
heuristic
A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...
methods exist for performing variant calling on NGS data. Instead of modelling the distribution of the observed data and using Bayesian statistics to calculate genotype probabilities, variant calls are made based on a variety of heuristic factors, such as minimum allele counts, read quality cut-offs, bounds on read depth, etc. Although they have been relatively unpopular in practice in comparison to probabilistic methods, in practice due to their use of bounds and cut-offs they can be robust to outlying data that violate the assumptions of probabilistic models.
Reference genome used for alignment
An important part of the design of variant calling methods using NGS data is the DNA sequence used as a reference to which the NGS reads are aligned. In human genetics studies, high quality references are available, from sources such as the
HapMap project
The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap is used to find genetic variants affecting health, disease a ...
,
which can substantially improve the accuracy of the variant calls made by variant calling algorithms. As a bonus, such references can be a source of prior genotype probabilities for Bayesian-based analysis. However, in the absence of such a high quality reference, experimentally obtained reads can first be
assembled
Assemble or Assembled may refer to:
* ''Assemble'' (album), 2005 album by Grown at Home
*Assemble (collective), collective of people based in London
*'' Marvel Studios: Assembled'', American television series of documentary specials
See also
*
*
...
in order to create a reference sequence for alignment.
Pre-processing and filtering of results
Various methods exist for filtering data in variant calling experiments, in order to remove sources of error/bias. This can involve the removal of suspicious reads before performing alignment and/or filtering of the list of variants returned by the variant calling algorithm.
Depending on the sequencing platform used, various biases may exist within the set of sequenced reads. For instance, strand bias can occur, where there is a highly unequal distribution of forward vs reverse directions in the reads aligned in some neighborhood. Additionally, there may occur an unusually high duplication of some reads (for instance due to bias in
PCR). Such biases can result in dubious variant calls – for instance if a fragment containing a PCR error at some locus is over amplified due to PCR bias, that locus will have a high count of the false allele, and may be called as a SNV – and so analysis pipelines frequently filter calls based on these biases.
Methods for detecting somatic variants
In addition to methods that align reads from individual sample(s) to a reference genome in order to detect
germline
In biology and genetics, the germline is the population of a multicellular organism's cells that pass on their genetic material to the progeny (offspring). In other words, they are the cells that form the egg, sperm and the fertilised egg. They ...
genetic variants, reads from multiple tissue samples within a single individual can be aligned and compared in order to detect somatic variants. These variants correspond to
mutations
In biology, a mutation is an alteration in the nucleic acid sequence of the genome of an organism, virus, or extrachromosomal DNA. Viral genomes contain either DNA or RNA. Mutations result from errors during DNA or viral replication, mi ...
that have occurred
de novo within groups of
somatic cell
A somatic cell (from Ancient Greek σῶμα ''sôma'', meaning "body"), or vegetal cell, is any biological cell forming the body of a multicellular organism other than a gamete, germ cell, gametocyte or undifferentiated stem cell. Such cells compo ...
s within an individual (that is, they are not present within the individual's germline cells). This form of analysis has been frequently applied to the study of
cancer
Cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. These contrast with benign tumors, which do not spread. Possible signs and symptoms include a lump, abnormal b ...
, where many studies are designed around investigating the profile of somatic mutations within cancerous tissues. Such investigations have resulted in diagnostic tools that have seen clinical application, and are used to improve scientific understanding of the disease, for instance by the discovery of new cancer-related genes, identification of involved
gene regulatory networks
A gene (or genetic) regulatory network (GRN) is a collection of molecular regulators that interact with each other and with other substances in the cell to govern the gene expression levels of mRNA and proteins which, in turn, determine the fun ...
and
metabolic pathways
In biochemistry, a metabolic pathway is a linked series of chemical reactions occurring within a cell. The reactants, products, and intermediates of an enzymatic reaction are known as metabolites, which are modified by a sequence of chemical reac ...
, and by informing models of how tumors grow and evolve.
Recent developments
Until recently, software tools for carrying out this form of analysis have been heavily underdeveloped, and were based on the same algorithms used to detect germline variations. Such procedures are not optimized for this task, because they do not adequately model the statistical correlation between the genotypes present in multiple tissue samples from the same individual.
More recent investigations have resulted in the development of software tools especially optimized for the detection of somatic mutations from multiple tissue samples. Probabilistic techniques have been developed that pool allele counts from all tissue samples at each locus, and using statistical models for the likelihoods of joint-genotypes for all the tissues, and the distribution of allele counts given the genotype, are able to calculate relatively robust probabilities of somatic mutations at each locus using all available data.
In addition there has recently been some investigation in
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
based techniques for performing this analysis.
In 2021, the Sequencing Quality Control Phase 2 Consortium
has published a number of studies that investigated the effects of sample preparations, sequencing library kits, sequencing platforms, and bioinformatics workflows on the accuracy of somatic SNV detection
based on a pair of tumor-normal cell lines that the Consortium has established as the reference samples, data, and call sets.
List of available software
VarNetBig Data Genomics: AvocadoDeepVariantFreebayesGATK (including MuTect)JointSNVMixMaCHMagnolia DCNNNeuSomaticNGSEPPiscesPlatypusrealSFSReveelSAMtoolsSNVmixSomaticSeqSomaticSniperStrelkaVarDictVarScan
References
{{reflist, 2
DNA sequencing
Genetics techniques