The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only (i.e., single nucleotide polymorphisms (SNPs)), it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants.
dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences.
In 2017, NCBI stopped support for all non-human organisms in dbSNP. As of build 153 (released in August 2019), dbSNP had amassed nearly 2 billion submissions representing more than 675 million distinct variants for ''Homo sapiens''.
Purpose
dbSNP is an online resource implemented to aid biology researchers. Its goal is to act as a single database that contains all identified genetic variation, which can be used to investigate a wide variety of genetically based natural phenomena. Specifically, access to the molecular variation cataloged within dbSNP aids basic research such as physical mapping, population genetics, and investigations into evolutionary relationships, and makes it possible to quickly and easily quantify the amount of variation at a given site of interest. In addition, dbSNP guides applied research in pharmacogenomics and the association of genetic variation with phenotypic traits.
According to the NCBI website, “The long-term investment in such novel and exciting research [dbSNP] promises not only to advance human biology but to revolutionise the practice of modern medicine.”
Submission
1. Source
Originally, dbSNP accepted submissions for any organism from a wide variety of sources including individual research laboratories, collaborative polymorphism discovery efforts, large-scale genome sequencing centers, other SNP databases (e.g. the SNP Consortium, HapMap, etc.), and private businesses.
On September 1, 2017, dbSNP stopped accepting non-human variant data submissions and two months later, its interactive websites and related NCBI services stopped presenting non-human variant data. Now dbSNP only accepts and presents human variant data.
2. Types of records
Every submitted variation receives a submitted SNP ID number (“ss#”).
This accession number is a stable and unique identifier for that submission. Unique submitted SNP records also receive a reference SNP ID number (“rs#”; "refSNP cluster"). However, more than one record of a variation will likely be submitted to dbSNP, especially for clinically relevant variations. To accommodate this, dbSNP routinely assembles identical submitted SNP records into a single reference SNP record, which is also a unique and stable identifier (see below).
3. How to submit
To submit variations to dbSNP, one must first acquire a submitter handle, which identifies the laboratory responsible for the submission.
Next, the author is required to complete a submission file containing the relevant information and data. Submitted records must contain the ten essential pieces of information listed in the following table.
Other information required for submissions includes contact information, publication information (title, journal, authors, year), molecule type (genomic DNA, cDNA, mitochondrial DNA, chloroplast DNA), and organism.
Release
New information obtained by dbSNP becomes available to the public periodically in a series of “builds” (i.e. revisions and releases of data).
There is no schedule for releasing new builds; instead, builds are usually released when a new genome build becomes available, assuming that the genome has some cataloged variation associated with it.
This occurs approximately every 3–4 months. Genome sequences may be improved over time, so reference SNPs (“refSNPs”) from previous builds, as well as newly submitted SNPs, are re-mapped to the newly available genome sequence. Multiple submitted SNPs that map to the same location are clustered into one refSNP cluster and assigned a reference SNP ID number. If two refSNP cluster records are found to map to the same location (i.e. are identical), dbSNP also merges those records. In this case, the smaller refSNP number ID (i.e. the earliest record) represents both records, and the larger refSNP number ID becomes obsolete. Obsolete refSNP number IDs are not used again for new records. When a merger of two refSNP records occurs, the change is tracked, and the former refSNP number IDs can still be used as a search query. This process of merging identical records reduces redundancy within dbSNP.
There are two exceptions to the above merging criteria. First, variations of different classes (e.g. a SNP and a DIP) are not merged. Second, clinically important refSNPs that have been cited in the literature are termed “precious”; a merger that would eliminate such a refSNP is never performed, since it could later cause confusion.
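The merging rules described in this section can be illustrated with a minimal Python sketch. It is not dbSNP's actual implementation: the `RefSnp` record, its field names, and the `merge_refsnps` function are hypothetical, and the handling of "precious" records (keeping them as separate survivors) is one possible reading of the rule that such records are never eliminated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefSnp:
    """Hypothetical, simplified refSNP record used only for illustration."""
    rs_id: int              # numeric part of the rs# identifier
    location: tuple         # (chromosome, position) on the current genome build
    var_class: str          # variation class, e.g. "snp" or "dip"
    precious: bool = False  # clinically important and cited in the literature


def merge_refsnps(records):
    """Return (surviving records, {obsolete rs_id: surviving rs_id})."""
    groups = {}
    for rec in records:
        # Records of different variation classes are never merged,
        # so the class is part of the grouping key.
        groups.setdefault((rec.location, rec.var_class), []).append(rec)

    survivors, merged_into = [], {}
    for group in groups.values():
        group = sorted(group, key=lambda r: r.rs_id)
        keeper = group[0]  # smallest rs number, i.e. the earliest record
        survivors.append(keeper)
        for rec in group[1:]:
            if rec.precious:
                # Assumption: a "precious" record is kept rather than merged away.
                survivors.append(rec)
            else:
                # Track the merger so the old ID can still be resolved.
                merged_into[rec.rs_id] = keeper.rs_id
    return sorted(survivors, key=lambda r: r.rs_id), merged_into


# Made-up records: three SNPs and one DIP at the same location.
records = [
    RefSnp(300, ("1", 1000), "snp"),
    RefSnp(100, ("1", 1000), "snp"),
    RefSnp(200, ("1", 1000), "dip"),
    RefSnp(400, ("1", 1000), "snp", precious=True),
]
survivors, merged_into = merge_refsnps(records)
print([r.rs_id for r in survivors], merged_into)
```

In this example the record with ID 300 is folded into ID 100, the DIP is left alone because its class differs, and the precious record survives. The `merged_into` mapping stands in for the tracking that lets a former refSNP number ID still work as a search query.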
Retrieval
1. How to
The dbSNP can be searched using the Entrez SNP search tool. A variety of queries can be used for searching: an ss number ID, a refSNP number ID, a gene name, an experimental method, a population class, a population detail, a publication, a marker, an allele, a chromosome, a base position, a heterozygosity range, or a build number.
In addition, many results can be retrieved simultaneously using batch queries.
Searches return refSNP number IDs that match the query term and a summary of the available information for that refSNP cluster.
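Entrez databases can also be queried programmatically through NCBI's E-utilities. The following Python sketch only builds an `esearch` request URL for the `snp` database and performs no network access. The helper name `build_snp_search_url` and the default `retmax` value are illustrative assumptions, and the query shown is a plain identifier without field tags; the exact search fields supported for dbSNP should be checked against NCBI's documentation.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_snp_search_url(term, retmax=20):
    """Build an esearch URL that queries the Entrez SNP database for `term`."""
    params = {
        "db": "snp",        # Entrez database name for dbSNP
        "term": term,       # query text, e.g. a refSNP number ID
        "retmax": retmax,   # maximum number of IDs to return
        "retmode": "json",  # ask for a JSON response
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Example: search for a refSNP cluster ID.
print(build_snp_search_url("rs206437"))
```

Fetching the resulting URL with any HTTP client would return the matching record IDs. That step is left out here so the example stays self-contained.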
2. Tools/Data
The information available for a refSNP cluster includes the basic information from each of the individual submissions (see “Submission”) as well as information available from combining the data from multiple submissions (e.g. heterozygosity, genotype frequencies). Many tools are available to examine a refSNP cluster in greater depth. Map view shows the position of the variation in the genome and other nearby variations. Another tool, gene view, reports the location of the variation within a gene (if it is in a gene), the old and new codon, the amino acids encoded by both, and whether the change is synonymous or non-synonymous. Sequence viewer shows the position of the variant in relation to introns, exons, and other distant and close variants. 3D structure mapping, which shows 3D images of the encoded protein, is also available.
dbSNP is also linked to many other NCBI resources including the nucleotide, protein, gene, taxonomy and structure databases, as well as PubMed, UniSTS, PMC, OMIM, and UniGene.
3. Validation status
The validation status lists the categories of evidence that support a variant. These include: (1) multiple independent submissions; (2) frequency or genotype data; (3) submitter confirmation; (4) observation of all alleles in at least two chromosomes; (5) genotyped by HapMap; and (6) sequenced in the 1000 Genomes Project.
Problems With Data Quality
The quality of the data found on dbSNP has been questioned by many research groups, which suspect high false positive rates due to genotyping and base-calling errors. These mistakes can easily be entered into dbSNP if the submitter uses (1) uncritical bioinformatic alignments of highly similar but distinct DNA sequences, and/or (2) PCRs with primers that cannot discriminate between similar but distinct DNA sequences.
Mitchell ''et al.'' (2004) reviewed four studies and concluded that dbSNP has a false positive rate of 15–17% for SNPs, and also that the minor allele frequency is greater than 10% for approximately 80% of the SNPs that are not false positives. Similarly, Musumeci ''et al.'' (2010) state that as many as 8.32% of the biallelic coding SNPs in dbSNP are artifacts of highly similar DNA sequences (i.e. paralogous genes) and refer to these entries as single nucleotide differences (SNDs). The high error rates in dbSNP may not be surprising: of the 23.7 million refSNP entries for humans, only 14.5 million have been validated, leaving the remaining 9.2 million as candidate SNPs. However, according to Musumeci ''et al.'' (2010), even the validation code provided in the refSNP record is only partially useful: only HapMap validation reduced the number of SNDs (3% vs 8%), but relying on this method alone removes more than half of the real SNPs in dbSNP. These authors also note that submissions from one source, the Lee group, are plagued with errors: 20% of these submissions are SNDs (vs. 8% for submissions overall). However, as the authors note, ignoring all of these submissions would remove many real SNPs.
Errors in dbSNP can hamper candidate gene association studies and haplotype-based investigations. Errors may also increase false conclusions in association studies: testing false SNPs alongside true ones increases the number of hypothesis tests. Because these false SNPs cannot actually be associated with traits, the alpha level is lowered more than would be necessary for a rigorous test of the true SNPs alone, and the false negative rate increases. Musumeci ''et al.'' (2010) suggested that authors of negative association studies inspect their previous studies for false SNPs (SNDs), which could be removed from analysis.
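The effect of false SNPs on the alpha level can be illustrated with a small calculation. The sketch below assumes a Bonferroni correction, which the cited studies do not necessarily use, and the counts are made up: 1,000 tested SNPs of which 160 are false, roughly in line with the 15–17% false positive rate quoted above.

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return family_alpha / n_tests


# Hypothetical study: 1,000 SNPs tested, 160 of which are false SNPs.
n_tested = 1000
n_false = 160
family_alpha = 0.05

# Threshold actually applied, with the false SNPs inflating the test count.
with_false = bonferroni_threshold(family_alpha, n_tested)
# Threshold that would apply if only the true SNPs had been tested.
true_only = bonferroni_threshold(family_alpha, n_tested - n_false)

print(f"threshold with false SNPs:     {with_false:.2e}")
print(f"threshold with true SNPs only: {true_only:.2e}")
```

The threshold that includes the false SNPs is stricter than the one based on the true SNPs alone, so a true association needs stronger evidence to be declared significant. This is the mechanism by which the false negative rate increases.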
How to cite data from dbSNP
Individual sequences can be referred to by their refSNP cluster ID numbers (e.g. rs206437). dbSNP should be referenced using the 2001 Sherry ''et al.'' paper: Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Research, 29: 308-311.
See also
* SNPedia
* HapMap
* NCBI
* NHGRI
External links
* dbSNP home
* How to Submit to dbSNP (https://www.ncbi.nlm.nih.gov/)