The Reference Sequence (RefSeq)
database
In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...
is an
open access
Open access (OA) is a set of principles and a range of practices through which nominally copyrightable publications are delivered to readers free of access charges or other barriers. With open access strictly defined (according to the 2001 de ...
, annotated and curated collection of publicly available
nucleotide
Nucleotides are Organic compound, organic molecules composed of a nitrogenous base, a pentose sugar and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both o ...
sequences (
DNA
Deoxyribonucleic acid (; DNA) is a polymer composed of two polynucleotide chains that coil around each other to form a double helix. The polymer carries genetic instructions for the development, functioning, growth and reproduction of al ...
,
RNA
Ribonucleic acid (RNA) is a polymeric molecule that is essential for most biological functions, either by performing the function itself (non-coding RNA) or by forming a template for the production of proteins (messenger RNA). RNA and deoxyrib ...
) and their
protein
Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residue (biochemistry), residues. Proteins perform a vast array of functions within organisms, including Enzyme catalysis, catalysing metab ...
products. RefSeq was introduced in 2000.
This database is built by
National Center for Biotechnology Information
The National Center for Biotechnology Information (NCBI) is part of the National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It is approved and funded by the government of the United States. The NCBI is lo ...
(NCBI), and, unlike
GenBank
The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a par ...
, provides only a single record for each natural biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from
viruses
A virus is a submicroscopic infectious agent that replicates only inside the living cells of an organism. Viruses infect all life forms, from animals and plants to microorganisms, including bacteria and archaea. Viruses are found in almo ...
to
bacteria
Bacteria (; : bacterium) are ubiquitous, mostly free-living organisms often consisting of one Cell (biology), biological cell. They constitute a large domain (biology), domain of Prokaryote, prokaryotic microorganisms. Typically a few micr ...
to
eukaryotes
The eukaryotes ( ) constitute the domain of Eukaryota or Eukarya, organisms whose cells have a membrane-bound nucleus. All animals, plants, fungi, seaweeds, and many unicellular organisms are eukaryotes. They constitute a major group of ...
.
For each
model organism
A model organism is a non-human species that is extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms. Mo ...
, ''RefSeq'' aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts. ''RefSeq'' is limited to major organisms for which sufficient data are available (121,461 distinct "named"
organisms
An organism is any living thing that functions as an individual. Such a definition raises more problems than it solves, not least because the concept of an individual is also difficult. Many criteria, few of them widely accepted, have been pr ...
as of July 2022),
while
GenBank
The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a par ...
includes sequences for any organism submitted (approximately 504,000 formally described
species
A species () is often defined as the largest group of organisms in which any two individuals of the appropriate sexes or mating types can produce fertile offspring, typically by sexual reproduction. It is the basic unit of Taxonomy (biology), ...
).
RefSeq categories
RefSeq collection comprises different data types, with different origins, so it is necessary to establish standard categories and identifiers to store each data type. The most important categories are:
For more details and more categories, se
Table 1i
Chapter 18 of the book ''The Reference Sequence (RefSeq) Database''
RefSeq Projects
Several projects to improve ''RefSeq'' services are currently in development by the NCBI, often in collaboration with research centers such as EMBL-EBI:
*
Consensus CDS (CCDS): This project aims to identify a core set of human and mouse
protein-coding regions and standardize sets of genes with high and consistent levels of genomic annotation quality. This project was announced in 2009 and is still in development.
* RefSeq Functional Elements (RefSeqFE): It is focused on describing non-genic functional elements which are gene regulatory regions such as:
enhancers
In genetics, an enhancer is a short (50–1500 bp) region of DNA that can be bound by proteins ( activators) to increase the likelihood that transcription of a particular gene will occur. These proteins are usually referred to as transcriptio ...
,
silencers,
DNase I hypersensitive regions,
DNA replication origins etc.). The current scope of this project is restricted to the human and mouse genomes.
* RefSeqGene: Its main goal is to define genomic sequences to be used as reference standards for well-characterized genes. Previously described
mRNA
In molecular biology, messenger ribonucleic acid (mRNA) is a single-stranded molecule of RNA that corresponds to the genetic sequence of a gene, and is read by a ribosome in the process of Protein biosynthesis, synthesizing a protein.
mRNA is ...
, protein and chromosome sequences have the weaknesses of not providing explicit genomic coordinates of gene flanking and intronic regions as well as showing awkwardly large coordinates that change with every new genome assembly. The RefSeqGene project is designed to eliminate these errors.
* Targeted Loci: This project records molecular markers, specially protein-coding and
ribosomal RNA
Ribosomal ribonucleic acid (rRNA) is a type of non-coding RNA which is the primary component of ribosomes, essential to all cells. rRNA is a ribozyme which carries out protein synthesis in ribosomes. Ribosomal RNA is transcribed from ribosomal ...
loci that are used for
phylogenetic
In biology, phylogenetics () is the study of the evolutionary history of life using observable characteristics of organisms (or genes), which is known as phylogenetic inference. It infers the relationship among organisms based on empirical dat ...
and
barcoding analysis. The scope of this project includes sequences for
Archaea
Archaea ( ) is a Domain (biology), domain of organisms. Traditionally, Archaea only included its Prokaryote, prokaryotic members, but this has since been found to be paraphyletic, as eukaryotes are known to have evolved from archaea. Even thou ...
,
Bacteria
Bacteria (; : bacterium) are ubiquitous, mostly free-living organisms often consisting of one Cell (biology), biological cell. They constitute a large domain (biology), domain of Prokaryote, prokaryotic microorganisms. Typically a few micr ...
and
Fungi
A fungus (: fungi , , , or ; or funguses) is any member of the group of eukaryotic organisms that includes microorganisms such as yeasts and mold (fungus), molds, as well as the more familiar mushrooms. These organisms are classified as one ...
organisms, accessible via
Entrez
The Entrez () Global Query Cross-Database Search System is a federated search engine, or web portal that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website. The NCB ...
and
BLAST queries. It also includes
GenBank
The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a par ...
sequences for
Animals
Animals are multicellular, eukaryotic organisms in the biological kingdom Animalia (). With few exceptions, animals consume organic material, breathe oxygen, have myocytes and are able to move, can reproduce sexually, and grow from a ...
,
Plants
Plants are the eukaryotes that form the kingdom Plantae; they are predominantly photosynthetic. This means that they obtain their energy from sunlight, using chloroplasts derived from endosymbiosis with cyanobacteria to produce sugars f ...
and
Protists
A protist ( ) or protoctist is any Eukaryote, eukaryotic organism that is not an animal, Embryophyte, land plant, or fungus. Protists do not form a Clade, natural group, or clade, but are a Paraphyly, paraphyletic grouping of all descendants o ...
, accessible via BLAST queries.
* Virus Variation (ViV): It is a specific resource of sequence data processing pipelines and analysis tools for display and retrieval of sequences from several viral groups such as
influenza virus
''Orthomyxoviridae'' () is a family of negative-sense RNA viruses. It includes nine genera: '' Alphainfluenzavirus'', '' Betainfluenzavirus'', '' Gammainfluenzavirus'', '' Deltainfluenzavirus'', '' Isavirus'', '' Mykissvirus'', '' Quaranjavir ...
,
ebolavirus
The genus ''Ebolavirus'' (- or ; - or ) is a International Committee on Taxonomy of Viruses, virological taxon included in the family ''Filoviridae'' (filament-shaped viruses), order ''Mononegavirales''. The members of this genus are called ebo ...
,
MERS coronavirus
Middle East respiratory syndrome–related coronavirus (MERS-CoV, ''Betacoronavirus cameli'') or EMC/2012 ( HCoV-EMC/2012), is the virus that causes Middle East respiratory syndrome (MERS). It is a species of coronavirus which infects humans, ba ...
or
Zika virus
Zika virus (ZIKV; pronounced or ) is a member of the virus family ''Flaviviridae''. It is spread by daytime-active ''Aedes'' mosquitoes, such as '' A. aegypti'' and '' A. albopictus''. Its name comes from the Ziika Forest of Uganda, where ...
. New viruses, processing pipelines, tools and other features are included regularly.
* RefSeq Select: This project aims to select datasets of RefSeq Select transcripts, as the most representative for every protein-coding gene, based on multiple criteria: prior use in clinical databases, transcript expression,
evolutionary conservation of the coding region etc. Since many genes are represented by multiple ''RefSeq'' transcripts/proteins due to the biological process of
alternative splicing
Alternative splicing, alternative RNA splicing, or differential splicing, is an alternative RNA splicing, splicing process during gene expression that allows a single gene to produce different splice variants. For example, some exons of a gene ma ...
, this complexity is problematic for studies such as
comparative genomics
Comparative genomics is a branch of biological research that examines genome sequences across a spectrum of species, spanning from humans and mice to a diverse array of organisms from bacteria to chimpanzees. This large-scale holistic approach c ...
or exchange of clinical variant data.
* MANE (Matched Annotation from the NCBI and EMBL-EBI): It is a collaborative project between
NCBI
The National Center for Biotechnology Information (NCBI) is part of the National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It is approved and funded by the government of the United States. The NCBI is loca ...
and
EMBL
The European Molecular Biology Laboratory (EMBL) is an intergovernmental organization dedicated to molecular biology research and is supported by 29 member states, two prospect member states, and one associate member state. EMBL was created in ...
-
EBI whose main goal is to define a set of transcripts and their proteins for all the protein-coding genes in the human genome. By doing that, the differences in transcripts annotation between ''RefSeq'' and
Ensembl
Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other v ...
/
GENCODE annotation systems are reduced. A MANE Select transcripts set are created as a useful universal standard for clinical reporting and comparative or evolutionary genomics. A second MANE Plus Clinical set are also created with additional transcripts to report all ''Pathogenic'' (P) or ''Likely Pathogenic'' (LP) clinical variants available in public resources.
This project was announced in 2018 and is expected to finish in 2022.
Statistics
According to the RefSeq release 213 (July 2022), the number of species represented in the database by counting distinct taxonomic IDs are as follows:
The counts of accession and basepairs per molecule type are:
See also
*
GenBank
The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI; a par ...
*
Sequence analysis
In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. It can be performed on the entire genome ...
*
Sequence profiling tool
A sequence profiling tool in bioinformatics is a type of software that presents information related to a genetic sequence, gene name, or keyword input. Such tools generally take a query such as a DNA, RNA, or protein sequence or ‘keyword’ and ...
*
Sequence motif
In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an ''N''-glycosylation site motif can be defined as ''A ...
*
UniProt
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived fro ...
*
List of sequenced eukaryotic genomes
*
List of sequenced archaeal genomes
This list of sequenced archaeal genomes contains all the archaea
Archaea ( ) is a Domain (biology), domain of organisms. Traditionally, Archaea only included its Prokaryote, prokaryotic members, but this has since been found to be paraphyleti ...
References
Sources
*{{NCBI-handbook
External links
RefSeqGenBank, RefSeq, TPA and UniProt: What's in a Name?
Genetics databases
National Institutes of Health