The Ensembl genome database project is a joint scientific project
between the European Bioinformatics Institute and the Wellcome Trust
Sanger Institute, which was launched in 1999 in response to the imminent
completion of the Human Genome Project. After 10 years in existence,
Ensembl's aim remains to provide a centralized resource for
geneticists, molecular biologists and other researchers studying the
genomes of our own species and other vertebrates and model organisms.
Ensembl is one of several well-known genome browsers for the retrieval
of genomic information.
Similar databases and browsers are found at NCBI and the University
of California, Santa Cruz (UCSC).
BACKGROUND
The human genome consists of three billion base pairs, which code
for approximately 20,000–25,000 genes. However, the genome alone is
of little use unless the locations and relationships of individual
genes can be identified. One option is manual annotation, whereby a
team of scientists tries to locate genes using experimental data from
scientific journals and public databases. However, this is a slow,
painstaking task. The alternative, known as automated annotation, is
to use the power of computers to do the complex pattern-matching of
protein to DNA.
In the Ensembl project, sequence data are fed into the gene
annotation system (a collection of software "pipelines" written in
Perl), which creates a set of predicted gene locations and saves them
in a MySQL database for subsequent analysis and display. Ensembl makes
these data freely accessible to the world research community. All the
data and code produced by the Ensembl project are available to
download, and there is also a publicly accessible database server
allowing remote access. In addition, the Ensembl website provides
computer-generated visual displays of much of the data.
Over time the project has expanded to include additional species
(including key model organisms such as mouse, fruit fly and zebrafish)
as well as a wider range of genomic data, including genetic
variations and regulatory features. Since April 2009, a sister
project, Ensembl Genomes, has extended the scope of Ensembl to
invertebrate metazoa, plants, fungi, bacteria and protists,
whilst the original project continues to focus on vertebrates.
DISPLAYING GENOMIC DATA
[Figure: Gene SGCB aligned to the human genome]
Central to the Ensembl concept is the ability to automatically
generate graphical views of the alignment of genes and other genomic
data against a reference genome. These are shown as data tracks, and
individual tracks can be turned on and off, allowing the user to
customise the display to suit their research interests. The interface
also enables the user to zoom in to a region or move along the genome
in either direction.
Other displays show data at varying levels of resolution, from whole
karyotypes down to text-based representations of DNA and amino acid
sequences, or present other types of display such as trees of similar
genes (homologues) across a range of species. The graphics are
complemented by tabular displays, and in many cases data can be
exported directly from the page in a variety of standard file formats
such as FASTA.
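As an illustration of what such an export contains, the following is a minimal Python sketch of a FASTA record writer. It is not Ensembl's own export code; the function name `to_fasta`, the 60-column line width and the example identifier are assumptions chosen for the example.

```python
def to_fasta(seq_id: str, sequence: str, description: str = "", width: int = 60) -> str:
    """Return one FASTA record: a '>' header line followed by wrapped sequence lines.

    The 60-column default is a common convention, not a requirement of the format.
    """
    if width < 1:
        raise ValueError("width must be a positive integer")
    # Header is '>' + identifier, optionally followed by a free-text description.
    header = f">{seq_id} {description}".rstrip()
    # Drop any whitespace or line breaks embedded in the input sequence.
    cleaned = "".join(sequence.split())
    # Split the sequence into fixed-width lines.
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return "\n".join([header, *lines]) + "\n"


# Example with a made-up identifier and sequence.
record = to_fasta("demo_seq", "acgt" * 20, "hypothetical example")
```

Here `record` holds a header line followed by the 80-base sequence wrapped onto two lines, and can be written to a file as-is.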
Externally produced data can also be added to the display, either via
a DAS (Distributed Annotation System) server on the internet, or by
uploading a suitable file in one of the supported formats, such as
BAM, BED, or PSL.
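To make the upload route concrete, here is a small Python sketch that turns a feature into a six-column BED line. BED stores zero-based, half-open intervals, so the sketch assumes the input coordinates are one-based and inclusive and converts them. The `Feature` type, the function name and the example values are hypothetical.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """A genomic feature with 1-based, inclusive coordinates (assumed input convention)."""
    chrom: str
    start: int
    end: int
    name: str
    strand: int  # +1 for forward strand, -1 for reverse strand


def to_bed_line(feature: Feature, score: int = 0) -> str:
    """Format a feature as one tab-separated BED6 line.

    BED uses a 0-based start and an exclusive end, so only the start shifts by one.
    """
    if feature.start < 1 or feature.end < feature.start:
        raise ValueError("expected 1 <= start <= end")
    strand = "+" if feature.strand >= 0 else "-"
    fields = [feature.chrom, str(feature.start - 1), str(feature.end),
              feature.name, str(score), strand]
    return "\t".join(fields)


# Made-up feature, for illustration only.
line = to_bed_line(Feature("4", 1000, 2000, "demo_feature", -1))
```

Writing one such line per feature to a text file gives a minimal BED file of the kind the upload option expects.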
Graphics are generated using a suite of custom Perl modules based on
GD, the standard Perl graphics display library.
ALTERNATIVE ACCESS METHODS
In addition to its website, Ensembl provides a Perl API
(Application Programming Interface) that models biological objects
such as genes and proteins, allowing simple scripts to be written to
retrieve data of interest. The same API is used internally by the web
interface to display the data. It is divided into sections such as the
core API, the compara API (for comparative genomics data), the
variation API (for accessing SNPs, SNVs and CNVs), and the functional
genomics API (for accessing regulatory data). The Ensembl website
provides extensive information on how to install and use the API.
This software can be used to access the public MySQL database,
avoiding the need to download enormous datasets. Users can even
choose to retrieve data from the MySQL server with direct SQL queries,
but this requires extensive knowledge of the current database schema.
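The following Python sketch illustrates why schema knowledge matters for direct SQL access. It does not connect to the Ensembl server. Instead it builds a toy in-memory SQLite database whose two tables, `gene` and `seq_region`, are simplified stand-ins. Their names and columns are assumptions made for this example and should be checked against the schema documentation for the release in use. All rows are invented.

```python
import sqlite3

# Toy stand-in for a gene-annotation schema (assumed, heavily simplified).
SCHEMA = """
CREATE TABLE seq_region (
    seq_region_id INTEGER PRIMARY KEY,
    name          TEXT
);
CREATE TABLE gene (
    gene_id          INTEGER PRIMARY KEY,
    stable_id        TEXT,
    seq_region_id    INTEGER REFERENCES seq_region(seq_region_id),
    seq_region_start INTEGER,
    seq_region_end   INTEGER
);
"""

# Finding genes in a region needs a join: gene rows refer to the
# sequence region by numeric id, not by chromosome name.
QUERY = """
SELECT g.stable_id, g.seq_region_start, g.seq_region_end
FROM gene AS g
JOIN seq_region AS sr ON sr.seq_region_id = g.seq_region_id
WHERE sr.name = ? AND g.seq_region_end >= ? AND g.seq_region_start <= ?
ORDER BY g.seq_region_start
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the toy database and load a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO seq_region VALUES (?, ?)", [(1, "4"), (2, "X")])
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
        [(1, "DEMO0001", 1, 100, 500),
         (2, "DEMO0002", 1, 900, 1500),
         (3, "DEMO0003", 2, 100, 500)],
    )
    return conn


def genes_overlapping(conn: sqlite3.Connection, region: str, start: int, end: int):
    """Return (stable_id, start, end) for genes overlapping region:start-end."""
    return conn.execute(QUERY, (region, start, end)).fetchall()


hits = genes_overlapping(build_demo_db(), "4", 400, 1000)
```

Even in this reduced form, the query only works if the user knows how the tables are keyed to one another. Against a real MySQL server the same idea would go through a MySQL client library, using that driver's own placeholder syntax.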
Large datasets can be retrieved using the BioMart data-mining tool,
which provides a web interface for downloading datasets using complex
queries.
Last, there is an FTP server which can be used to download entire
MySQL databases as well as some selected data sets in other formats.
CURRENT SPECIES
The annotated genomes include most fully sequenced vertebrates and
selected model organisms. All of them are eukaryotes; there are no
prokaryotes. As of 2008, this includes:
* Primates: bushbaby, chimp, human, macaque, mouse lemur, orangutan,
tarsier
* Scandentia: tree shrew
* Glires (= Rodents + Lagomorphs): guinea pig, kangaroo rat, mouse,
rat, ground squirrel, pika, rabbit
* Laurasiatheria: cow, dolphin, alpaca, pig, cat, dog, horse,
megabat, microbat, hedgehog, shrew
* Afrotheria: elephant, hyrax, tenrec
* Xenarthra: armadillo, sloth
* Marsupialia: opossum, wallaby
* Monotremes: platypus
* Birds: chicken, zebra finch
* Lepidosauria: anole lizard (pre)
* Lissamphibia: _Xenopus tropicalis_
* Teleost fishes: _Takifugu rubripes_ (fugu), _Tetraodon
nigroviridis_ (green spotted pufferfish), _Danio rerio_ (zebrafish),
_Oryzias latipes_ (medaka), _Gasterosteus aculeatus_ (stickleback)
* Cyclostomata: _Petromyzon marinus_ (sea lamprey) (pre)
* Tunicates: _Ciona intestinalis_, _Ciona savignyi_
* Insects: _Drosophila melanogaster_ (fruit fly), _Anopheles
gambiae_ (mosquito), _Aedes aegypti_ (mosquito)
* Worm: _Caenorhabditis elegans_
* Yeast: _Saccharomyces cerevisiae_ (baker's yeast)
SEE ALSO
* List of sequenced eukaryotic genomes
* Sequence profiling tool