The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated
DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of
sequence assembly and other metadata related to sequencing projects.
The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database (also known as EMBL-Bank).
The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank.
The ENA grew out of the EMBL Data Library, which was first released in 1982 as the first internationally supported resource for nucleotide sequence data.
As of early 2012, the ENA and other INSDC member databases each contained complete genomes of 5,682 organisms and sequence data for almost 700,000.
Moreover, the volume of data is
increasing exponentially with a doubling time of approximately 10 months.
History
The European Nucleotide Archive originated from separate databases, the earliest of which was the EMBL Data Library, established in October 1980 at the European Molecular Biology Laboratory (EMBL), Heidelberg.
The first release of this database was made in April 1982 and contained a total of 568 separate entries consisting of around 500,000 base pairs.
In 1984, referring to the EMBL Data Library, Kneale and Kennard remarked that "it was clear some years ago that a large computerized database of sequences would be essential for research in Molecular Biology".
Although magnetic tape was the primary distribution method at the time, by 1987 the EMBL Data Library was being used by an estimated 10,000 scientists internationally.
The same year, the EMBL File Server was introduced to serve database records over BITNET, EARN and the early Internet. In May 1988 the journal ''Nucleic Acids Research'' introduced a policy stating that "manuscripts submitted to [''Nucleic Acids Research''] and containing or discussing sequence data must be accompanied by evidence that the data have been deposited with the EMBL Data Library."
During the 1990s the EMBL Data Library was renamed the EMBL Nucleotide Sequence Database and was formally relocated to the European Bioinformatics Institute (EBI) from Heidelberg.
In 2003, the Nucleotide Sequence Database was extended with the addition of the Sequence Version Archive (SVA), which maintains records of all current and previous entries in the database.
A year later in June 2004, limits on the maximum sequence length for each record (then 350 kilobases) were removed, allowing entire genome sequences to be stored as a single database entry.
Following the uptake of Sanger sequencing, the Wellcome Trust Sanger Institute (then known as The Sanger Centre) began cataloguing sequence reads along with quality information in a database called The Trace Archive.
The Trace Archive grew substantially with the commercialisation of high-throughput parallel sequencing technologies by companies such as Roche and Illumina.
In 2008, the EBI combined the Trace Archive, EMBL Nucleotide Sequence Database (now also known as EMBL-Bank) and a newly developed Sequence (or Short) Read Archive (SRA) to make up the ENA, aimed at providing a comprehensive nucleotide sequence archive.
As a member of the International Nucleotide Sequence Database Collaboration, the ENA exchanges data submissions each day with both the DNA Data Bank of Japan and GenBank.
EMBL Nucleotide Sequence Database
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) is the section of the ENA which contains high-level genome assembly details, as well as assembled sequences and their functional annotation.
EMBL-Bank receives data through direct submission from genome consortia and smaller research groups, as well as through the retrieval of sequence data associated with patent applications.
As of release 114 (December 2012), the EMBL Nucleotide Sequence Database contained approximately 5×10^11 nucleotides with an uncompressed file size of 1.6 terabytes.
Data classes
The EMBL Nucleotide Sequence Database supports a variety of data derived from different sources including, but not limited to:
*Expressed sequence tags with their associated sample data.
*Nucleotide sequence being generated from whole genome sequencing projects at varying stages of assembly, including complete contigs and annotated, fully assembled sequence.
*Data relating to transcriptomics, such as complementary DNA, with optional annotation.
*Novel or extended annotations of existing coding sequences, for example new sequence versions with corrected start or stop codons.
EMBL-Bank format
The EMBL Nucleotide Sequence Database uses a flat file plaintext format, typically referred to as EMBL-Bank format, to represent and store data.
EMBL-Bank format uses a different syntax to the records in DDBJ and GenBank, though each format uses certain standardised nomenclature, such as taxonomies as defined by the NCBI Taxonomy database. Each line of an EMBL-format file begins with a two-letter code, such as AC to label the accession number and KW for a list of keywords relevant to the record; each record ends with //.
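The line-oriented layout described above lends itself to simple parsing. The following is a minimal sketch, assuming only the two properties stated here (a two-letter code at the start of each line and `//` as the record terminator). The function name `parse_embl_records` and the sample entries, including their accession numbers, are hypothetical and do not correspond to real ENA records.

```python
from collections import defaultdict


def parse_embl_records(text):
    """Split EMBL-style flat-file text into records.

    Each record is returned as a dict mapping a two-letter line code
    (for example "AC" or "KW") to the list of values found on lines
    carrying that code.
    """
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            # '//' terminates the current record.
            if current:
                records.append(dict(current))
            current = defaultdict(list)
            continue
        if not line.strip():
            continue  # ignore blank lines
        # The first two characters are the line code; the rest is the value.
        code, value = line[:2], line[2:].strip()
        current[code].append(value)
    return records


# Hypothetical sample data, written only to illustrate the layout.
SAMPLE = """\
ID   EXAMPLE1
AC   X00001;
KW   example keyword; illustrative.
//
ID   EXAMPLE2
AC   X00002;
KW   .
//
"""

records = parse_embl_records(SAMPLE)
for record in records:
    print(record["AC"], record["KW"])
```

A production parser would need to handle the remaining line types and multi-line fields of the full format; an established library is likely a better choice than this sketch for real data.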
Sequence Read Archive
The ENA operates an instance of the Sequence Read Archive (SRA), an archival repository of sequence reads and analyses which are intended for public release.
Originally called the Short Read Archive, the name was changed in anticipation of future sequencing technologies being able to produce longer sequence reads.
Currently, the archive accepts sequence reads generated by next-generation sequencing platforms such as the Illumina Genome Analyzer and ABI SOLiD as well as some corresponding analyses and alignments.
The SRA operates under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC) and is the fastest-growing repository in the ENA.
In 2010 the Sequence Read Archive made up approximately 95% of the base pair data available through the ENA, encompassing over 500,000,000,000 sequence reads made up of over 60 trillion (6×10^13) base pairs.
Almost half of this data was deposited in relation to the 1000 Genomes Project, wherein the researchers published their sequence data to the SRA in real time.
In total, as of September 2010, 65% of the Sequence Read Archive was human genomic sequence, with another 16% relating to human metagenome sequence reads.
The preferred
data format for files submitted to the SRA is the BAM format, which is capable of storing both aligned and unaligned reads.
Internally the SRA relies on the NCBI SRA Toolkit, used at all three INSDC member databases, to provide flexible data compression, API access and conversion to other formats such as FASTQ.
Data access
The data contained in the ENA can be accessed manually or programmatically via REST URL through the ENA browser. Initially limited to the Sequence Read Archive, the ENA browser now also provides access to the Trace Archive and EMBL-Bank, allowing file retrieval in a range of formats including XML, HTML, FASTA and FASTQ.
Individual records can be accessed using their accession numbers, and other text queries are enabled through the EB-eye search engine.
Additionally, sequence similarity-based searches implemented using De Bruijn graphs offer another method of retrieving records from the ENA.
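To illustrate the data structure named above, the following sketch builds a De Bruijn graph from the k-mers of a single sequence: each k-mer contributes an edge from its first k−1 symbols to its last k−1 symbols. This is a generic textbook construction, not a description of how the ENA search is implemented; the function name `de_bruijn_graph` and the example sequence are illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(sequence, k):
    """Return a dict mapping each (k-1)-mer to the (k-1)-mers that follow it.

    Every k-mer in `sequence` adds one edge from its prefix (first k-1
    symbols) to its suffix (last k-1 symbols). Repeated k-mers add
    repeated edges.
    """
    if k < 2 or k > len(sequence):
        raise ValueError("k must satisfy 2 <= k <= len(sequence)")
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


graph = de_bruijn_graph("ACGTACG", 3)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

Two sequences that share many k-mers share many edges in such a graph, which is the general property that similarity searches of this kind can exploit.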
The ENA is accessible via the EBI SOAP and REST APIs, which also offer access to other databases hosted at the EBI, such as Ensembl and InterPro.
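Programmatic retrieval by accession number, as described in this section, might look like the sketch below. The base URL and the `/{format}/{accession}` path layout are assumptions that should be checked against the current ENA documentation, the set of formats is limited to two of those named above, and the accession `A00145` is a placeholder used only to show the URL shape. Building the URL makes no network request; `fetch_record` is defined but deliberately not called here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL and path layout; verify against current ENA documentation.
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"

# Subset of the retrieval formats mentioned in the text (assumed path names).
SUPPORTED_FORMATS = {"xml", "fasta"}


def record_url(accession, fmt="fasta"):
    """Build a retrieval URL for one record in the requested format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{ENA_BROWSER_API}/{fmt}/{quote(accession, safe='')}"


def fetch_record(accession, fmt="fasta", timeout=30):
    """Download one record as text (performs a network request)."""
    with urlopen(record_url(accession, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder accession; prints the URL without contacting the server.
print(record_url("A00145", "fasta"))
```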
Storage
The European Nucleotide Archive handles large volumes of data which pose a significant storage challenge.
As of 2012, the ENA's storage requirements continue to
grow exponentially, with a doubling time of approximately 10 months.
To manage this increase, the ENA selectively discards less-valuable sequencing platform data and implements advanced compression strategies.
The CRAM reference-based compression toolkit was developed to help reduce ENA storage requirements.
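The scale of the storage problem follows directly from the doubling time quoted above. As a rough sketch, assuming growth remains purely exponential with a constant 10-month doubling time (a simplification), the data volume after t months is the starting volume multiplied by 2^(t/10). The function name `growth_factor` is illustrative.

```python
def growth_factor(months, doubling_time_months=10.0):
    """Return the multiplier on data volume after `months` of
    exponential growth with the given doubling time."""
    if doubling_time_months <= 0:
        raise ValueError("doubling time must be positive")
    return 2.0 ** (months / doubling_time_months)


# Multipliers after one, two and five years under the stated assumption.
for years in (1, 2, 5):
    print(f"{years} year(s): x{growth_factor(12 * years):.1f}")
```

Under this assumption the archive more than doubles every year and grows 64-fold over five years, which is why compression and selective retention are used in addition to added capacity.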
Funding
Currently the ENA is funded jointly by the European Molecular Biology Laboratory, the European Commission and the Wellcome Trust.
The emerging ELIXIR framework, coordinated by EBI director Janet Thornton, aims to secure a sustainable European funding infrastructure to support the continued availability of life science databases such as the ENA.
See also
*DNA Data Bank of Japan
*ENCODE
*Ensembl Genomes
*GenBank
*RefSeq
*UniGene