The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a part of the National Institutes of Health in the United States) as part of the International Nucleotide Sequence Database Collaboration (INSDC). GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. The database was started in 1982 by Walter Goad and Los Alamos National Laboratory. GenBank has become an important database for research in biological fields and has grown in recent years at an exponential rate, doubling roughly every 18 months. Release 242.0, produced in February 2021, contained over 12 trillion nucleotide bases in more than 2 billion sequences. GenBank is built from direct submissions by individual laboratories, as well as from bulk submissions by large-scale sequencing centers.


Submissions

Only original sequences can be submitted to GenBank. Direct submissions are made using BankIt, which is a Web-based form, or the stand-alone submission program, Sequin. Upon receipt of a sequence submission, the GenBank staff examines the originality of the data, assigns an accession number to the sequence, and performs quality assurance checks. The submissions are then released to the public database, where the entries are retrievable by Entrez or downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence-tagged site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) data are most often submitted by large-scale sequencing centers. The GenBank direct submissions group also processes complete microbial genome sequences.
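
As an illustration of retrieval through Entrez, the sketch below builds a request URL for NCBI's E-utilities `efetch` service, which is one programmatic front end to Entrez. The endpoint, the `nuccore` database name, and the parameter names come from NCBI's E-utilities interface rather than from the text above, and the helper names `build_efetch_url` and `fetch_record` are hypothetical. The accession `U49845` is used purely as an example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: NCBI E-utilities efetch endpoint (not named in the article text).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, rettype: str = "gb") -> str:
    """Return an efetch URL for one nucleotide record.

    rettype="gb" requests the GenBank flat-file format;
    rettype="fasta" requests the sequence in FASTA format.
    """
    params = {
        "db": "nuccore",      # Entrez nucleotide database
        "id": accession,      # accession number assigned by GenBank staff
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_record(accession: str, rettype: str = "gb", timeout: float = 30.0) -> str:
    """Download the record as text. Requires network access."""
    with urlopen(build_efetch_url(accession, rettype), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch_record("U49845") to download the entry.
    print(build_efetch_url("U49845"))
```

Building the URL separately from the download keeps the request easy to inspect, and it lets the same helper serve both GenBank flat-file and FASTA retrieval.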


History

Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory (LANL) and others established the Los Alamos Sequence Database in 1979, which culminated in 1982 with the creation of the public GenBank. Funding was provided by the National Institutes of Health, the National Science Foundation, the Department of Energy, and the Department of Defense. LANL collaborated on GenBank with the firm Bolt, Beranek, and Newman, and by the end of 1983 more than 2,000 sequences were stored in it. In the mid-1980s, the Intelligenetics bioinformatics company at Stanford University managed the GenBank project in collaboration with LANL (LANL GenBank History). As one of the earliest bioinformatics community projects on the Internet, the GenBank project started the BIOSCI/Bionet news groups to promote open access communications among bioscientists. Between 1989 and 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information.


Growth

The GenBank release notes for release 162.0 (October 2007) state that "from 1982 to the present, the number of bases in GenBank has doubled approximately every 18 months". GenBank release 232.0 has 213,383,758 loci and 329,835,282,370 bases, from 213,383,758 reported sequences. The GenBank database includes additional data sets that are constructed mechanically from the main sequence data collection, and therefore are excluded from this count.
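
The "doubles approximately every 18 months" statement corresponds to a simple exponential model, N(t) = N0 · 2^(t / 18) with t in months. The sketch below is only an illustration of that arithmetic. The function names are hypothetical, and the model is an idealization of the release-notes claim, not a fit to actual release data.

```python
import math


def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `months` under a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


def projected_bases(initial_bases: float, months: float,
                    doubling_months: float = 18.0) -> float:
    """Project a base count forward: N(t) = N0 * 2**(t / doubling_months)."""
    return initial_bases * growth_factor(months, doubling_months)


def implied_doubling_months(bases_start: float, bases_end: float,
                            months_elapsed: float) -> float:
    """Doubling time implied by two observed base counts."""
    return months_elapsed / math.log2(bases_end / bases_start)


if __name__ == "__main__":
    # Three years at an 18-month doubling time is two doublings.
    print(growth_factor(36))                                  # 4.0
    # A fourfold increase over 36 months implies an 18-month doubling time.
    print(implied_doubling_months(1.0e9, 4.0e9, 36))          # 18.0
```

`implied_doubling_months` can be applied to the base counts of any two releases to check how closely a given period matches the 18-month figure.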


Incomplete identifications

Public databases that can be searched with the National Center for Biotechnology Information Basic Local Alignment Search Tool (NCBI BLAST) lack peer-reviewed sequences of type strains and sequences of non-type strains. On the other hand, while commercial databases potentially contain high-quality filtered sequence data, they hold only a limited number of reference sequences. A paper released in the Journal of Clinical Microbiology evaluated the 16S rRNA