The Sequence Read Archive (SRA, previously known as the Short Read Archive) is a bioinformatics database that provides a public repository for DNA sequencing data, especially the "short reads" generated by high-throughput sequencing, which are typically less than 1,000 base pairs in length.
The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC) and is run as a collaboration between the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).
The archive was established by the National Center for Biotechnology Information (NCBI) in 2007 to provide a repository for data produced by RNA-Seq and ChIP-Seq studies, as well as large-scale studies including the Human Microbiome Project and the 1000 Genomes Project.
Originally called the Short Read Archive, it was renamed in anticipation of future sequencing technologies being able to produce longer sequence reads.
The volume of data deposited in the Sequence Read Archive has grown rapidly. As of September 2010, 65% of the SRA was human genomic sequence, with another 16% relating to human metagenome sequence reads.
Much of this data was deposited through the 1000 Genomes Project. In June 2011, the volume of data held in the SRA passed 100 terabases of DNA.
The preferred data format for files submitted to the SRA is the BAM format, which is capable of storing both aligned and unaligned reads. Internally the SRA relies on the NCBI SRA Toolkit, used at all three INSDC member databases, to provide flexible data compression, API access and conversion to other formats such as FASTQ.
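To make the FASTQ target format concrete, the following is a minimal Python sketch of a reader for the common four-line FASTQ layout (header line starting with `@`, sequence, separator line starting with `+`, quality string). It is not part of the SRA Toolkit. The names `FastqRecord`, `read_fastq` and `phred_scores` are hypothetical, and the sketch assumes unwrapped records with Phred+33 quality encoding, which may not hold for every file.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: name (header without '@'), bases, and raw quality string."""
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with exactly four lines per record.

    Assumption: sequence and quality are not wrapped across several lines.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (Phred+33 assumed by default)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    # Illustrative data only; not taken from any real SRA run.
    sample = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for record in read_fastq(sample):
        print(record.name, record.sequence, phred_scores(record.quality))
```

A reader like this only covers plain-text FASTQ. Files in the SRA's own format or in BAM would first need to be converted with the appropriate tools.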
In February 2011, NCBI announced plans to close the NCBI SRA because of reduced funding.
However, EBI and DDBJ announced that they would continue to support the SRA.
In October 2011, NCBI announced continuation of funding for the SRA.
Deposition of data in the SRA is mandated by most funding agencies and open access journals. Nature Publishing Group journals require that DNA and RNA sequencing data be made available through the SRA.
See also
* List of biological databases
External links
European Nucleotide Archive page for searches in SRA
SRA homepage at NCBI.
ERA submissions at EBI.
DRA homepage at DDBJ.