The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database (also known as EMBL-Bank). The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank. The ENA grew out of the EMBL Data Library, which was released in 1982 as the first internationally supported resource for nucleotide sequence data. As of early 2012, the ENA and other INSDC member databases each contained complete genomes of 5,682 organisms and sequence data for almost 700,000. Moreover, the volume of data is increasing exponentially with a doubling time of approximately 10 months.

History

The European Nucleotide Archive originated from separate databases, the earliest of which was the EMBL Data Library, established in October 1980 at the European Molecular Biology Laboratory (EMBL), Heidelberg. The first release of this database was made in April 1982 and contained a total of 568 separate entries consisting of around 500,000 base pairs. In 1984, referring to the EMBL Data Library, Kneale and Kennard remarked that "it was clear some years ago that a large computerized database of sequences would be essential for research in Molecular Biology".

Although the primary distribution method at the time was magnetic tape, by 1987 the EMBL Data Library was being used by an estimated 10,000 scientists internationally. The same year, the EMBL File Server was introduced to serve database records over BITNET, EARN and the early Internet. In May 1988 the journal ''Nucleic Acids Research'' introduced a policy stating that "manuscripts submitted to [''Nucleic Acids Research''] and containing or discussing sequence data must be accompanied by evidence that the data have been deposited with the EMBL Data Library."

During the 1990s the EMBL Data Library was renamed the EMBL Nucleotide Sequence Database and was formally relocated to the European Bioinformatics Institute (EBI) from Heidelberg. In 2003, the Nucleotide Sequence Database was extended with the addition of the Sequence Version Archive (SVA), which maintains records of all current and previous entries in the database. A year later, in June 2004, limits on the maximum sequence length for each record (then 350 kilobases) were removed, allowing entire genome sequences to be stored as a single entry.

Following the uptake of Sanger sequencing, the Wellcome Trust Sanger Institute (then known as The Sanger Centre) began cataloguing sequence reads along with quality information in a database called The Trace Archive. The Trace Archive grew substantially with the commercialisation of high-throughput parallel sequencing technologies by companies such as Roche and Illumina. In 2008, the EBI combined the Trace Archive, the EMBL Nucleotide Sequence Database (now also known as EMBL-Bank) and a newly developed Sequence (or Short) Read Archive (SRA) to make up the ENA, aimed at providing a comprehensive nucleotide sequence archive. As a member of the International Nucleotide Sequence Database Collaboration, the ENA exchanges data submissions each day with both the DNA Data Bank of Japan and GenBank.

EMBL Nucleotide Sequence Database

The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) is the section of the ENA which contains high-level genome assembly details, as well as assembled sequences and their functional annotation. EMBL-Bank receives direct submissions from genome consortia and smaller research groups, and also retrieves sequence data associated with patent applications. As of release 114 (December 2012), the EMBL Nucleotide Sequence Database contains approximately 5×10¹¹ nucleotides with an uncompressed file size of 1.6 terabytes.

Data classes

The EMBL Nucleotide Sequence Database supports a variety of data derived from different sources including, but not limited to:
* Expressed sequence tags with their associated sample data.
* Nucleotide sequence being generated from whole genome sequencing projects at varying stages of assembly, including complete contigs and annotated, fully assembled sequence.
* Data relating to transcriptomics, such as complementary DNA, with optional annotation.
* Novel or extended annotations of existing coding sequences, for example new sequence versions with corrected start or stop codons.

EMBL-Bank format

The EMBL Nucleotide Sequence Database uses a flat file plaintext format to represent and store data, typically referred to as EMBL-Bank format. EMBL-Bank format uses a different syntax from the records in DDBJ and GenBank, though each format uses certain standardised nomenclature, such as taxonomies as defined by the NCBI Taxonomy database. Each line of an EMBL-format file begins with a two-letter code, such as AC to label the accession number and KW for a list of keywords relevant to the record; each record ends with //.
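
The line-oriented layout described above lends itself to simple parsing. The following Python sketch groups the lines of each record by their two-letter code and splits records at the // terminator. The function name `parse_embl_records` and the sample records (including their accession numbers) are invented for illustration and are not part of any official ENA tooling. The sketch also ignores details of the real format, such as sequence data lines and feature tables.

```python
from collections import defaultdict


def parse_embl_records(text):
    """Split EMBL-style flat-file text into records keyed by two-letter line code.

    Illustrative sketch only: lines without a two-letter code or without a
    value (for example spacer or sequence lines) are skipped, and a trailing
    record that lacks the "//" terminator is dropped.
    """
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            # "//" terminates the current record.
            if current:
                records.append(dict(current))
            current = defaultdict(list)
            continue
        code, value = line[:2], line[2:].strip()
        if code.strip() and value:
            current[code].append(value)
    return records


# Hypothetical two-record sample, invented for illustration.
SAMPLE = """\
AC   X00001;
KW   example; illustration.
//
AC   X00002;
KW   another example.
//
"""

records = parse_embl_records(SAMPLE)
for record in records:
    print(record["AC"], record["KW"])
```

Keeping a list of values per code allows for codes that span several consecutive lines within one record.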

Sequence Read Archive

The ENA operates an instance of the Sequence Read Archive (SRA), an archival repository of sequence reads and analyses which are intended for public release. Originally called the Short Read Archive, the name was changed in anticipation of future sequencing technologies being able to produce longer sequence reads. Currently, the archive accepts sequence reads generated by next-generation sequencing platforms such as the Illumina Genome Analyzer and ABI SOLiD, as well as some corresponding analyses and alignments. The SRA operates under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC) and is the fastest-growing repository in the ENA.

In 2010 the Sequence Read Archive made up approximately 95% of the base pair data available through the ENA, encompassing over 500,000,000,000 sequence reads made up of over 60 trillion (6×10¹³) base pairs. Almost half of this data was deposited in relation to the 1000 Genomes Project, whose researchers published their sequence data to the SRA in real time. In total, as of September 2010, 65% of the Sequence Read Archive was human genomic sequence, with another 16% relating to human metagenome sequence reads.

The preferred data format for files submitted to the SRA is the BAM format, which is capable of storing both aligned and unaligned reads. Internally the SRA relies on the NCBI SRA Toolkit, used at all three INSDC member databases, to provide flexible data compression, API access and conversion to other formats such as FASTQ.

Data access

The data contained in the ENA can be accessed manually or programmatically via REST URL through the ENA browser. Initially limited to the Sequence Read Archive, the ENA browser now also provides access to the Trace Archive and EMBL-Bank, allowing file retrieval in a range of formats including XML, HTML, FASTA and FASTQ. Individual records can be accessed using their accession numbers, and other text queries are enabled through the EB-eye search engine. Additionally, sequence similarity-based searches implemented using De Bruijn graphs offer another method of retrieving records from the ENA. The ENA is accessible via the EBI SOAP and REST APIs, which also offer access to other databases hosted at the EBI, such as Ensembl and InterPro.
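
Retrieving a record by accession number over REST amounts to building a URL and issuing an HTTP GET request. The Python sketch below only constructs such URLs and makes no network call. The base address and the `/{format}/{accession}` path layout are assumptions modelled on the ENA Browser API and should be checked against the current ENA documentation before use. The function name `record_url`, the set of accepted formats and the accession shown are likewise illustrative.

```python
from urllib.parse import quote

# Assumed base address; verify against the current ENA documentation.
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"

# Formats this sketch accepts; an assumption, not an exhaustive list.
SUPPORTED_FORMATS = {"xml", "fasta", "embl"}


def record_url(accession, fmt="fasta"):
    """Build a retrieval URL for one record in the requested format."""
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Percent-encode the accession so it is safe as a single path segment.
    return f"{ENA_BROWSER_API}/{fmt}/{quote(accession, safe='')}"


# "X00001" is a placeholder accession, not a reference to a real record.
print(record_url("X00001", "xml"))
```

The resulting URL could then be passed to `urllib.request.urlopen` to fetch the record, subject to network access and to the endpoint behaving as assumed here.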

Storage

The European Nucleotide Archive handles large volumes of data, which pose a significant storage challenge. As of 2012, the ENA's storage requirements continue to grow exponentially, with a doubling time of approximately 10 months. To manage this increase, the ENA selectively discards less-valuable sequencing platform data and implements advanced compression strategies. The CRAM reference-based compression toolkit was developed to help reduce ENA storage requirements.
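
To give a sense of what a 10-month doubling time implies, the Python sketch below projects relative growth under the simplifying assumption that the doubling time stays constant. The function name `growth_factor` and the chosen time horizons are illustrative.

```python
DOUBLING_TIME_MONTHS = 10  # figure quoted in the text for 2012


def growth_factor(months, doubling_time=DOUBLING_TIME_MONTHS):
    """Return the multiplicative growth after `months` of exponential growth."""
    return 2 ** (months / doubling_time)


for months in (10, 12, 60):
    print(f"after {months} months: x{growth_factor(months):.1f}")
```

Under this assumption, storage requirements grow roughly 2.3-fold in one year and 64-fold in five years, which illustrates why selective retention and compression are needed rather than added capacity alone.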

Funding

Currently the ENA is funded jointly by the European Molecular Biology Laboratory, the European Commission and the Wellcome Trust. The emerging ELIXIR framework, coordinated by EBI director Janet Thornton, aims to secure a sustainable European funding infrastructure to support the continued availability of life science databases such as the ENA.

External links

* European Nucleotide Archive
* EMBL Nucleotide Sequence Database
* The European Nucleotide Archive: Quick tour