HOME

TheInfoList



OR:

Ensembl genome database project is a scientific project at the
European Bioinformatics Institute The European Bioinformatics Institute (EMBL-EBI) is an Intergovernmental Organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Well ...
, which provides a centralized resource for geneticists, molecular biologists and other researchers studying the
genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ge ...
s of our own species and other
vertebrate Vertebrates () comprise all animal taxa within the subphylum Vertebrata () ( chordates with backbones), including all mammals, birds, reptiles, amphibians, and fish. Vertebrates represent the overwhelming majority of the phylum Chordata, ...
s and
model organism A model organism (often shortened to model) is a non-human species that is extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workin ...
s. Ensembl is one of several well known
genome browser In bioinformatics, a genome browser is a graphical interface for display of information from a biological database Biological databases are libraries of biological sciences, collected from scientific experiments, published literature, high-throughp ...
s for the retrieval of
genomic Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dim ...
information. Similar
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases sp ...
s and browsers are found at
NCBI The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It is approved and funded by the government of the United States. The ...
and the University of California, Santa Cruz (UCSC).


History

The human genome consists of three billion
base pair A base pair (bp) is a fundamental unit of double-stranded nucleic acids consisting of two nucleobases bound to each other by hydrogen bonds. They form the building blocks of the DNA double helix and contribute to the folded structure of both DNA ...
s, which code for approximately 20,000–25,000
gene In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a ba ...
s. However the genome alone is of little use, unless the locations and relationships of individual genes can be identified. One option is manual
annotation An annotation is extra information associated with a particular point in a document or other piece of information. It can be a note that includes a comment or explanation. Annotations are sometimes presented in the margin of book pages. For anno ...
, whereby a team of scientists tries to locate genes using experimental data from scientific journals and public databases. However this is a slow, painstaking task. The alternative, known as automated annotation, is to use the power of computers to do the complex
pattern-matching In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be ...
of
protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respo ...
to DNA. The Ensembl project was launched in 1999 in response to the imminent completion of the
Human Genome Project The Human Genome Project (HGP) was an international scientific research project with the goal of determining the base pairs that make up human DNA, and of identifying, mapping and sequencing all of the genes of the human genome from both a ...
, with the initial goals of automatically annotate the human genome, integrate this annotation with available biological data and make all this knowledge publicly available. In the Ensembl project, sequence data are fed into the gene annotation system (a collection of software "pipelines" written in
Perl Perl is a family of two high-level, general-purpose, interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister language", Perl 6, before the latter's name was offici ...
) which creates a set of predicted gene locations and saves them in a
MySQL MySQL () is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database o ...
database for subsequent analysis and display. Ensembl makes these data freely accessible to the world research community. All the data and code produced by the Ensembl project is available to download, and there is also a publicly accessible database server allowing remote access. In addition, the Ensembl website provides computer-generated visual displays of much of the data. Over time the project has expanded to include additional species (including key
model organisms A model organism (often shortened to model) is a non-human species that is extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the working ...
such as
mouse A mouse ( : mice) is a small rodent. Characteristically, mice are known to have a pointed snout, small rounded ears, a body-length scaly tail, and a high breeding rate. The best known mouse species is the common house mouse (''Mus musculus' ...
, fruitfly and
zebrafish The zebrafish (''Danio rerio'') is a freshwater fish belonging to the minnow family ( Cyprinidae) of the order Cypriniformes. Native to South Asia, it is a popular aquarium fish, frequently sold under the trade name zebra danio (and thus often ...
) as well as a wider range of genomic data, including
genetic variation Genetic variation is the difference in DNA among individuals or the differences between populations. The multiple sources of genetic variation include mutation and genetic recombination. Mutations are the ultimate sources of genetic variation, ...
s and regulatory features. Since April 2009, a sister project,
Ensembl Genomes Ensembl Genomes is a scientific project to provide genome-scale data from non-vertebrate species. The project is run by the European Bioinformatics Institute, and was launched in 2009 using the Ensembl technology. The main objective of the Ensem ...
, has extended the scope of Ensembl into invertebrate
metazoa Animals are multicellular, eukaryotic organisms in the biological kingdom Animalia. With few exceptions, animals consume organic material, breathe oxygen, are able to move, can reproduce sexually, and go through an ontogenetic stage in ...
,
plants Plants are predominantly Photosynthesis, photosynthetic eukaryotes of the Kingdom (biology), kingdom Plantae. Historically, the plant kingdom encompassed all living things that were not animals, and included algae and fungi; however, all curr ...
,
fungi A fungus ( : fungi or funguses) is any member of the group of eukaryotic organisms that includes microorganisms such as yeasts and molds, as well as the more familiar mushrooms. These organisms are classified as a kingdom, separately from ...
,
bacteria Bacteria (; singular: bacterium) are ubiquitous, mostly free-living organisms often consisting of one biological cell. They constitute a large domain of prokaryotic microorganisms. Typically a few micrometres in length, bacteria were among ...
, and
protists A protist () is any eukaryotic organism (that is, an organism whose cells contain a cell nucleus) that is not an animal, plant, or fungus. While it is likely that protists share a common ancestor (the last eukaryotic common ancestor), the excl ...
, focusing on providing taxonomic and evolutionary context to genes, whilst the original project continues to focus on vertebrates. As of 2020, Ensembl supported over 50 000 genomes across both Ensembl and Ensembl Genomes databases, adding some new innovative features such a
Rapid Release
a new website designed to make genome annotation data available more quickly to users, an

a new website to access to
SARS-CoV-2 Severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) is a strain of coronavirus that causes COVID-19 (coronavirus disease 2019), the respiratory illness responsible for the ongoing COVID-19 pandemic. The virus previously had a ...
reference genome.


Displaying genomic data

Central to the Ensembl concept is the ability to automatically generate graphical views of the alignment of genes and other genomic data against a
reference genome A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. As they are assemble ...
. These are shown as data tracks, and individual tracks can be turned on and off, allowing the user to customise the display to suit their research interests. The interface also enables the user to zoom in to a region or move along the genome in either direction. Other displays show data at varying levels of resolution, from whole
karyotype A karyotype is the general appearance of the complete set of metaphase chromosomes in the cells of a species or in an individual organism, mainly including their sizes, numbers, and shapes. Karyotyping is the process by which a karyotype is disce ...
s down to text-based representations of DNA and
amino acid Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although hundreds of amino acids exist in nature, by far the most important are the alpha-amino acids, which comprise proteins. Only 22 alpha am ...
sequences, or present other types of display such as
trees In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are u ...
of similar genes ( homologues) across a range of species. The graphics are complemented by tabular displays, and in many cases data can be exported directly from the page in a variety of standard file formats such as
FASTA FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics. History The original FASTA program ...
. Externally produced data can also be added to the display by uploading a suitable file in one of the supported formats, such as BAM,
BED A bed is an item of furniture that is used as a place to sleep, rest, and relax. Most modern beds consist of a soft, cushioned mattress on a bed frame. The mattress rests either on a solid base, often wood slats, or a sprung base. Many beds ...
, or PSL. Graphics are generated using a suite of custom Perl modules based on GD, the standard Perl graphics display library.


Alternative access methods

In addition to its website, Ensembl provides a REST
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how ...
and a Perl
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how ...
(Application Programming Interface) that models biological objects such as genes and proteins, allowing simple
script Script may refer to: Writing systems * Script, a distinctive writing system, based on a repertoire of specific elements or symbols, or that repertoire * Script (styles of handwriting) ** Script typeface, a typeface with characteristics of handw ...
s to be written to retrieve data of interest. The same API is used internally by the web interface to display the data. It is divided in sections like the core API, the compara API (for
comparative genomics Comparative genomics is a field of biological research in which the genomic features of different organisms are compared. The genomic features may include the DNA sequence, genes, gene order, regulatory sequences, and other genomic structural lan ...
data), the variation API (for accessing SNPs, SNVs, CNVs..), and the functional genomics API (to access regulatory data). The Ensembl website provides extensive information o
how to install and use the API
This software can be used to access the public
MySQL MySQL () is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database o ...
database, avoiding the need to download enormous datasets. The users could even choose to retrieve data from the MySQL with direct SQL queries, but this requires an extensive knowledge of the current database schema. Large datasets can be retrieved using the
BioMart BioMart is a community-driven project to provide a single point of access to distributed research data. The BioMart project contributes open source software and data services to the international scientific community. Although the BioMart software ...
data-mining tool. It provides a web interface for downloading datasets using complex queries. Last, there is a
FTP
server which can be used to download entire MySQL databases as well some selected data sets in other formats.


Current species

The annotated genomes include most fully sequenced vertebrates and selected model organisms. All of them are eukaryotes, there are no prokaryotes. As of 2022, there are 271 species registered, this includes:


Open source/mirrors

All data part of the Ensembl project is open access and all software is open source, being freely available to the scientific community, under a
CC BY 4.0 A Creative Commons (CC) license is one of several public copyright licenses that enable the free distribution of an otherwise copyrighted "work".A "work" is any creative material made by a person. A painting, a graphic, a book, a song/lyric ...
license. Currently, Ensembl database website is mirrored at four different locations worldwide to improve the service.


See also

*
List of sequenced eukaryotic genomes This list of "sequenced" eukaryotic genomes contains all the eukaryotes known to have publicly available complete nuclear and organelle genome sequences that have been sequenced, assembled, annotated and published; draft genomes are not included, ...
*
List of biological databases Biological databases are stores of biological information. The journal ''Nucleic Acids Research'' regularly publishes special issues on biological databases and has a list of such databases. The 2018 issue has a list of about 180 such databases an ...
*
Sequence analysis In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignm ...
*
Sequence profiling tool A sequence profiling tool in bioinformatics is a type of software that presents information related to a genetic sequence, gene name, or keyword input. Such tools generally take a query such as a DNA, RNA, or protein sequence or ‘keyword’ an ...
*
Sequence motif In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an ''N''-glycosylation site motif can be defined as ''As ...
*
UCSC Genome Browser The UCSC Genome Browser is an online and downloadable genome browser hosted by the University of California, Santa Cruz (UCSC). It is an interactive website offering access to genome sequence data from a variety of vertebrate and invertebrate spec ...
*
ENCODE The Encyclopedia of DNA Elements (ENCODE) is a public research project which aims to identify functional elements in the human genome. ENCODE also supports further biomedical research by "generating community resources of genomics data, software ...


References


External links

*
VegaPre-EnsemblEnsembl genomesUCSC Genome BrowserNCBIEnsembl: Browsing chordate genomes on EBI Train OnLine
{{Authority control Genetic engineering in the United Kingdom Genome databases Medical databases in the United Kingdom Medical genetics Science and technology in Cambridgeshire South Cambridgeshire District Wellcome Trust Biological databases Bioinformatics Computational biology