RefSeq

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences (DNA, RNA) and their protein products. RefSeq was first introduced in 2000. The database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from viruses to bacteria to eukaryotes. For each model organism, RefSeq aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts. RefSeq is limited to major organisms for which sufficient data are available (121,461 distinct "named" organisms as of July 2022), while GenBank includes sequences for any organism submitted (approximately 504,000 formally described species).


RefSeq categories

The RefSeq collection comprises different data types with different origins, so standard categories and identifiers are needed to store each data type. For the most important categories and further details, see Table 1 in Chapter 18 of the book The Reference Sequence (RefSeq) Database.
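As an illustration of how category identifiers can be used in practice, the following minimal sketch maps RefSeq accession prefixes to molecule types. The prefix meanings follow NCBI's published accession conventions rather than anything stated in this article, the table is deliberately not exhaustive, and the names `REFSEQ_PREFIXES` and `classify_accession` are hypothetical.

```python
import re

# Prefix -> (molecule type, short description).
# Assumption: taken from NCBI's published accession conventions,
# not from the article text; the list is not exhaustive.
REFSEQ_PREFIXES = {
    "NC": ("genomic", "complete genomic molecule"),
    "NG": ("genomic", "incomplete genomic region"),
    "NT": ("genomic", "contig or scaffold"),
    "NW": ("genomic", "contig or scaffold, primarily WGS"),
    "NM": ("mRNA", "protein-coding transcript"),
    "NR": ("RNA", "non-protein-coding transcript"),
    "XM": ("mRNA", "predicted protein-coding transcript"),
    "XR": ("RNA", "predicted non-protein-coding transcript"),
    "NP": ("protein", "protein product"),
    "XP": ("protein", "predicted protein product"),
    "WP": ("protein", "non-redundant protein"),
}

# Two-letter prefix, underscore, digits, optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_accession(accession):
    """Return a dict describing a RefSeq-style accession, or None if unrecognized."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    if prefix not in REFSEQ_PREFIXES:
        return None
    molecule, description = REFSEQ_PREFIXES[prefix]
    return {
        "prefix": prefix,
        "number": number,
        "version": int(version) if version else None,
        "molecule": molecule,
        "description": description,
    }
```

An accession without an underscore-separated prefix (the GenBank style) is not matched by this pattern, so the function returns None for it; accessions whose body contains letters are likewise outside the scope of this sketch.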


RefSeq Projects

Several projects to improve RefSeq services are currently in development by the NCBI, often in collaboration with research centers such as EMBL-EBI:

* Consensus CDS (CCDS): This project aims to identify a core set of human and mouse protein-coding regions and to standardize sets of genes with high and consistent levels of genomic annotation quality. The project was announced in 2009 and is still in development.
* RefSeq Functional Elements (RefSeqFE): This project focuses on describing non-genic functional elements, that is, gene regulatory regions such as enhancers, silencers, DNase I hypersensitive regions, DNA replication origins, etc. Its current scope is restricted to the human and mouse genomes.
* RefSeqGene: Its main goal is to define genomic sequences to be used as reference standards for well-characterized genes. Previously described mRNA, protein and chromosome sequences have two weaknesses: they do not provide explicit genomic coordinates for gene flanking and intronic regions, and they show awkwardly large coordinates that change with every new genome assembly. The RefSeqGene project is designed to eliminate these problems.
* Targeted Loci: This project records molecular markers, especially protein-coding and ribosomal RNA loci, that are used for phylogenetic and barcoding analysis. Its scope includes sequences for Archaea, Bacteria and Fungi, accessible via Entrez and BLAST queries. It also includes GenBank sequences for Animals, Plants and Protists, accessible via BLAST queries.
* Virus Variation (ViV): A specific resource of sequence data processing pipelines and analysis tools for the display and retrieval of sequences from several viral groups, such as influenza virus, ebolavirus, MERS coronavirus and Zika virus. New viruses, processing pipelines, tools and other features are added regularly.
* RefSeq Select: This project aims to define a dataset of RefSeq Select transcripts, each chosen as the most representative transcript for its protein-coding gene, based on multiple criteria: prior use in clinical databases, transcript expression, evolutionary conservation of the coding region, etc. Because many genes are represented by multiple RefSeq transcripts/proteins due to the biological process of alternative splicing, the resulting complexity is problematic for studies such as comparative genomics or the exchange of clinical variant data.
* MANE (Matched Annotation from the NCBI and EMBL-EBI): A collaborative project between NCBI and EMBL-EBI whose main goal is to define a set of transcripts and their proteins for all the protein-coding genes in the human genome, thereby reducing the differences in transcript annotation between the RefSeq and Ensembl/GENCODE annotation systems. A MANE Select transcript set is created as a useful universal standard for clinical reporting and comparative or evolutionary genomics. A second set, MANE Plus Clinical, is created with additional transcripts to report all Pathogenic (P) or Likely Pathogenic (LP) clinical variants available in public resources. This project was announced in 2018 and is expected to finish in 2022.
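Records that are accessible through Entrez can also be requested programmatically. The sketch below only builds a request URL for the Entrez E-utilities `efetch` endpoint and does not perform any network access. The helper name `efetch_url` and its defaults are hypothetical, and the accession in the usage note is just an example identifier.

```python
from urllib.parse import urlencode

# Base address of the Entrez E-utilities efetch endpoint.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an efetch URL for one accession (hypothetical helper).

    db:      Entrez database name, e.g. "nuccore" for nucleotide records
             or "protein" for protein records.
    rettype: requested record format, e.g. "fasta" or "gb".
    """
    params = {
        "db": db,
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"
```

For example, `efetch_url("NM_000546.6")` yields a URL requesting that transcript in FASTA format from the nucleotide database; fetching the URL, and respecting NCBI's usage limits while doing so, is left to the caller.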


Statistics

RefSeq release 213 (July 2022) reports the number of species represented in the database, counted as distinct taxonomic IDs, together with the counts of accessions and base pairs per molecule type.
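The two kinds of summary described here (distinct taxonomic IDs, and accession and length totals per molecule type) can be sketched as follows. The record layout, the function name `summarize` and the sample values are all hypothetical placeholders, not figures from any RefSeq release.

```python
from collections import defaultdict


def summarize(records):
    """Count distinct taxonomic IDs and total accessions/length per molecule type.

    Each record is assumed to be a dict with the keys
    "accession", "taxid", "molecule" and "length" (hypothetical layout).
    """
    taxids = set()
    per_type = defaultdict(lambda: {"accessions": 0, "length": 0})
    for rec in records:
        taxids.add(rec["taxid"])
        totals = per_type[rec["molecule"]]
        totals["accessions"] += 1
        totals["length"] += rec["length"]
    return len(taxids), dict(per_type)


# Placeholder records for illustration only; lengths are made up.
SAMPLE = [
    {"accession": "NM_000001.1", "taxid": 9606, "molecule": "mRNA", "length": 1000},
    {"accession": "NM_000002.1", "taxid": 10090, "molecule": "mRNA", "length": 2000},
    {"accession": "NP_000001.1", "taxid": 9606, "molecule": "protein", "length": 300},
]

species_count, totals_by_type = summarize(SAMPLE)
```

Counting with a set ensures that a taxonomic ID shared by many records is counted once, which is what "distinct taxonomic IDs" requires.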


See also

* GenBank
* Sequence analysis
* Sequence profiling tool
* Sequence motif
* UniProt
* List of sequenced eukaryotic genomes
* List of sequenced archaeal genomes


Sources

* The NCBI Handbook


External links


RefSeq

GenBank, RefSeq, TPA and UniProt: What's in a Name?