HOME

TheInfoList



OR:

DNA annotation or genome annotation is the process of identifying the locations of
gene In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a b ...
s and all of the
coding region The coding region of a gene, also known as the coding sequence (CDS), is the portion of a gene's DNA or RNA that codes for protein. Studying the length, composition, regulation, splicing, structures, and functions of coding regions compared to n ...
s in a
genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ...
and determining what those genes do. An annotation (irrespective of the context) is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it. Genes in a eukaryotic genome can be annotated using various annotation tools such as FINDER. A modern annotation pipeline can support a user-friendly web interface and software containerization such as MOSGA. For DNA annotation, a previously unknown sequence representation of genetic material is enriched with information relating genomic position to
intron An intron is any Nucleic acid sequence, nucleotide sequence within a gene that is not expressed or operative in the final RNA product. The word ''intron'' is derived from the term ''intragenic region'', i.e. a region inside a gene."The notion of ...
-
exon An exon is any part of a gene that will form a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing. The term ''exon'' refers to both the DNA sequence within a gene and to the corresponding sequen ...
boundaries,
regulatory sequence A regulatory sequence is a segment of a nucleic acid molecule which is capable of increasing or decreasing the expression of specific genes within an organism. Regulation of gene expression is an essential feature of all living organisms and ...
s,
repeats A rerun or repeat is a rebroadcast of an episode of a radio or television program. There are two types of reruns – those that occur during a hiatus, and those that occur when a program is syndicated. Variations In the United Kingdom, the wor ...
,
gene In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a b ...
names and
protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, res ...
products. This annotation is stored in genomic databases such as
Mouse Genome Informatics Mouse Genome Informatics (MGI) is a free, online database and bioinformatics resource hosted by The Jackson Laboratory, with funding by the National Human Genome Research Institute (NHGRI), the National Cancer Institute (NCI), and the Eunice Kenne ...
,
FlyBase FlyBase is an online bioinformatics database and the primary repository of genetic and molecular data for the insect family Drosophilidae. For the most extensively studied species and model organism, ''Drosophila melanogaster'', a wide range of d ...
, and WormBase. Educational materials on some aspects of biological annotation from the 2006
Gene Ontology The Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and ge ...
annotation camp and similar events are available at the Gene Ontology website. The National Center for Biomedical Ontology (www.bioontology.org) develops tools for automated annotationhttp://bioontology.stanford.edu/annotator-service of database records based on the textual descriptions of those records. As a general method,
dcGO dcGO is a comprehensive ontology database for protein domains. As an ontology resource, dcGO integrates Open Biomedical Ontologies from a variety of contexts, ranging from functional information like Gene Ontology to others on enzymes and pathw ...
has an automated procedure for statistically inferring associations between ontology terms and protein domains or combinations of domains from the existing gene/protein-level annotations.


Process

Genome annotation consists of three main steps:. # identifying portions of the genome that do not code for proteins # identifying elements on the
genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ...
, a process called
gene prediction In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functio ...
# attaching biological information to these elements Automatic annotation tools attempt to perform these steps via computer analysis, as opposed to manual annotation (a.k.a. curation) which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline. A simple method of gene annotation relies on homology based search tools, like
BLAST Blast or The Blast may refer to: *Explosion, a rapid increase in volume and release of energy in an extreme manner *Detonation, an exothermic front accelerating through a medium that eventually drives a shock front Film * ''Blast'' (1997 film), ...
, to search for homologous genes in specific databases, the resulting information is then used to annotate genes and genomes. However, as information is added to the annotation platform, manual annotators become capable of deconvoluting discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases (e.g.
Ensembl Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other v ...
) rely on curated data sources as well as a range of different software tools in their automated genome annotation pipeline. ''Structural annotation'' consists of the identification of genomic elements. * ORFs and their localization * gene structure * coding regions * location of regulatory motifs ''Functional annotation'' consists of attaching biological information to genomic elements. * biochemical function * biological function * involved regulation and interactions * expression These steps may involve both biological experiments and ''
in silico In biology and other experimental sciences, an ''in silico'' experiment is one performed on computer or via computer simulation. The phrase is pseudo-Latin for 'in silicon' (correct la, in silicio), referring to silicon in computer chips. It ...
'' analysis. Proteogenomics based approaches utilize information from expressed proteins, often derived from
mass spectrometry Mass spectrometry (MS) is an analytical technique that is used to measure the mass-to-charge ratio of ions. The results are presented as a '' mass spectrum'', a plot of intensity as a function of the mass-to-charge ratio. Mass spectrometry is u ...
, to improve genomics annotations. A variety of software tools have been developed to permit scientists to view and share genome annotations; for example
MAKER
Genome annotation remains a major challenge for scientists investigating the
human genome The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the ...
, now that the genome sequences of more than a thousand human individuals (The 100,000 Genomes Project, UK) and several model organisms are largely complete. Identifying the locations of genes and other genetic control elements is often described as defining the biological "parts list" for the assembly and normal operation of an organism. Scientists are still at an early stage in the process of delineating this parts list and in understanding how all the parts "fit together". Genome annotation is an active area of investigation and involves a number of different organizations in the life science community which publish the results of their efforts in publicly available
biological databases Biological databases are libraries of biological sciences, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genom ...
accessible via the web and other electronic means. Here is an alphabetical listing of on-going projects relevant to genome annotation: * Encyclopedia of DNA elements (ENCODE) * Entrez Gene *
Ensembl Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other v ...
* GENCODE * Gene Ontology Consortium *
GeneRIF A GeneRIF or Gene Reference Into Function is a short (255 characters or fewer) statement about the function of a gene. GeneRIFs provide a simple mechanism for allowing scientists to add to the functional annotation of genes described in the Entr ...
*
RefSeq The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences ( DNA, RNA) and their protein products. RefSeq was first introduced in 2000. This database is built by National ...
*
Uniprot UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from ...
* Vertebrate and Genome Annotation Project (Vega) At Wikipedia, genome annotation has started to become automated under the auspices of the Gene Wiki portal which operates a bot that harvests gene data from research databases and creates gene stubs on that basis.


References

{{Use dmy dates, date=April 2017 DNA