In bioinformatics, GENSCAN is a program to identify complete gene structures in genomic DNA. It is a GHMM-based program that can be used to predict the location of genes and their exon-intron boundaries in genomic sequences from a variety of organisms. The GENSCAN Web server can be found at MIT. GENSCAN was developed by Christopher Burge in the research group of Samuel Karlin at Stanford University.

History

In 2001, human gene prediction moved into the field of comparative genomics. This resulted in the development of TWINSCAN, an adaptation of GENSCAN with higher accuracy. Other programs such as N-SCAN were later developed by further adapting the GHMM model. As of 2002, GENSCAN remained a popular tool in bioinformatics and had become a standard feature for genomes released on the University of California, Santa Cruz and Ensembl genome browsers.

Implementation

Genomic Model

The primary goal when developing a genomic sequence model for GENSCAN was to capture both the general and specific properties of the individual functional units of eukaryotic genes (e.g. exons, introns, splice sites, promoters). Particular focus was placed on features recognized by the general transcriptional, splicing and translational machinery that processes the majority of protein-coding genes, as opposed to specialized signals associated with the transcription or splicing of particular genes and gene families. General promoter signals such as the TATA box are therefore part of the model. In addition, a general three-periodic fifth-order Markov model of coding regions is used, as opposed to models of specific protein motifs or database homology information.

The model also accounts for differences in gene structure and gene density between compositional regions of the human genome. Because it relies only on these elements, GENSCAN works without reference to similar genes in protein sequence databases, and its predictions are complementary to those of homology-based gene identification methods (e.g. querying protein databases with BLASTX). Overall, the structure of the model used in GENSCAN is similar to a generalized hidden Markov model (GHMM).
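The idea of a three-periodic Markov model of coding regions can be illustrated with a small sketch. It is not GENSCAN's code: the function names `train_periodic_markov` and `log_likelihood`, the pseudocount smoothing, and the toy training data are all assumptions made for illustration. The sketch keeps one set of conditional base probabilities per codon position, each conditioned on the preceding `order` bases (five in GENSCAN's case).

```python
import math
from collections import Counter, defaultdict

BASES = "ACGT"


def train_periodic_markov(coding_seqs, order=5):
    """Count next-base occurrences per (codon position, preceding `order` bases).

    Each training sequence is assumed to start in frame (index 0 is codon position 0).
    """
    counts = defaultdict(Counter)
    for seq in coding_seqs:
        for i in range(order, len(seq)):
            context = seq[i - order:i]
            counts[(i % 3, context)][seq[i]] += 1
    return counts


def log_likelihood(seq, counts, order=5, frame=0, pseudocount=1.0):
    """Log-probability of `seq` under the model, assuming index 0 is codon position `frame`.

    The pseudocount (a hypothetical smoothing choice) keeps unseen contexts from
    producing zero probabilities.
    """
    total = 0.0
    for i in range(order, len(seq)):
        row = counts.get(((i + frame) % 3, seq[i - order:i]), Counter())
        numerator = row[seq[i]] + pseudocount
        denominator = sum(row.values()) + pseudocount * len(BASES)
        total += math.log(numerator / denominator)
    return total
```

Scoring the same sequence with `frame=0`, `frame=1` and `frame=2` gives three log-likelihoods. A model trained on in-frame coding sequence tends to score the correct reading frame highest, which is what makes the periodic structure useful for locating coding regions.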

Features

GENSCAN's implementation differs from that of other programs in several ways. A notable difference is that its genomic sequence model treats DNA as double-stranded, so genes on both strands are analyzed simultaneously. GENSCAN can also analyze sequences that contain partial genes, multiple genes, or no genes at all, whereas other programs of its time could only handle sequences containing a single complete gene. These two properties make GENSCAN particularly useful for analyzing longer human genomic sequences.

In addition, GENSCAN uses Maximal Dependence Decomposition (MDD) to model functional signals in DNA and protein sequences in a way that accounts for dependencies between signal positions. In GENSCAN, MDD is used to build a model of the donor splice signal that captures dependencies associated with the recognition of donor splice sites in pre-mRNA sequences. Together with a new model of acceptor splice sites, this lets GENSCAN capture dependencies between signal positions. GENSCAN can also estimate the probability that each of its predictions is correct by using the forward-backward algorithm.

Finally, GENSCAN captures differences in gene structure and composition between regions of differing C + G content in the human genome, using separate sets of empirically derived model parameters.
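The forward-backward algorithm mentioned above can be sketched on a toy example. GENSCAN applies it to its own, much richer gene model. The version below is a plain two-state hidden Markov model with invented states (`coding`, `noncoding`) and invented probabilities, so it only illustrates how a per-position posterior probability is obtained. It omits numerical scaling and is therefore suitable for short inputs only.

```python
def forward_backward(obs, states, start, trans, emit):
    """Return, for each position, the posterior probability of every state."""
    n = len(obs)
    fwd = [{} for _ in range(n)]
    bwd = [{} for _ in range(n)]

    # Forward pass: probability of the prefix ending in state s at position t.
    for s in states:
        fwd[0][s] = start[s] * emit[s][obs[0]]
    for t in range(1, n):
        for s in states:
            fwd[t][s] = emit[s][obs[t]] * sum(
                fwd[t - 1][r] * trans[r][s] for r in states
            )

    # Backward pass: probability of the suffix after t, given state s at t.
    for s in states:
        bwd[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(
                trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r] for r in states
            )

    total = sum(fwd[n - 1][s] for s in states)
    return [{s: fwd[t][s] * bwd[t][s] / total for s in states} for t in range(n)]


# Toy parameters, chosen only for illustration (not GENSCAN's values).
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

posteriors = forward_backward("GCGCGC", STATES, START, TRANS, EMIT)
```

Each entry of `posteriors` is a dictionary whose values sum to one, and the value for `coding` can be read as the model's confidence that the position is coding, given the whole sequence.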

Efficiency

The run time of GENSCAN scales almost linearly with sequence length for realistically sized sequences (several kilobases or more), but is quadratic in the worst case.

Supplemental Usage

GENSCAN, like other gene prediction programs, does not produce results that fully match those of other programs. The discrepancies arise from factors including differences in algorithms, parameters, and training sets. GENSCAN has therefore been used in combination with a second gene prediction program: if either program is confident in a prediction, that prediction is used; if neither program is confident, the prediction is used only if both programs agree on it.
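The combination rule described above can be written as a short sketch. The `Exon` record, the `combine` function and the confidence threshold of 0.9 are hypothetical names and values chosen for illustration. The source does not specify how confidence is measured or what cut-off is used.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    """A predicted exon with a program-reported confidence (hypothetical format)."""
    start: int
    end: int
    prob: float


def combine(preds_a, preds_b, threshold=0.9):
    """Keep a prediction if either program is confident in it, or if both agree on it."""
    coords_a = {(e.start, e.end) for e in preds_a}
    coords_b = {(e.start, e.end) for e in preds_b}
    accepted = set()
    for preds, other_coords in ((preds_a, coords_b), (preds_b, coords_a)):
        for exon in preds:
            key = (exon.start, exon.end)
            confident = exon.prob >= threshold  # assumed confidence test
            agreed = key in other_coords        # exact coordinate match
            if confident or agreed:
                accepted.add(key)
    return sorted(accepted)
```

For example, with program A predicting `Exon(100, 200, 0.95)`, `Exon(300, 400, 0.5)` and `Exon(500, 600, 0.4)`, and program B predicting `Exon(300, 400, 0.6)` and `Exon(700, 800, 0.3)`, the first exon is kept on confidence alone, the second is kept because both programs agree, and the remaining two are discarded.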

Accuracy

Tests were conducted to evaluate the accuracy of GENSCAN on sets of short sequences. One test used the Burset/Guigó dataset, which contains 570 vertebrate multi-exon gene sequences. The data produced from this test is shown in the table below, along with the data produced by testing other programs on the same dataset. GENSCAN is shown in the table to be generally more accurate than its competitors at both the nucleotide and exon levels. Furthermore, the table shown below describes the accuracy of GENSCAN on genomic sequences grouped by C + G content and by type of organism. The data show that GENSCAN's accuracy was rather insensitive to C + G content and organism type, factors that would have affected the results of comparable gene prediction programs.

A separate test of GENSCAN's accuracy used two GeneParser data sets from which all genes with more than 25% amino acid identity to genes in previous GeneParser test sets had been removed. The results of this test, and of the same test performed on other programs, are shown in the table below. There is little variation between the accuracy of GENSCAN on the Burset/Guigó data set and on the GeneParser data sets. Larger differences in individual data points (e.g. 98% CC, the correlation coefficient, on high C + G nucleotides in GeneParser set II vs. 90% CC on C + G >60 nucleotides in Burset/Guigó) may be attributed to the much smaller sample size of the GeneParser data sets.

The tests on these three data sets provided enough information to draw conclusions for each of them. However, the data sets are not of realistic size, so their reliability and scope are open to question. In 1997, GENSCAN was found to be more accurate than previous gene prediction programs, but it was also shown to predict only 10-15% of genes accurately on realistic data sets. Because of inaccuracies like this, any prediction given by GENSCAN or another program must be verified by comparing it to a complementary DNA (cDNA) sequence, an expressed sequence tag (EST) sequence, or a known protein sequence.
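Nucleotide-level accuracy figures such as the CC values quoted above are typically computed by comparing, base by base, which positions are truly coding and which are predicted as coding. The sketch below assumes the conventions commonly used in gene prediction benchmarks: sensitivity is TP / (TP + FN), specificity is TP / (TP + FP), and CC is the correlation coefficient between the true and predicted labels. The function name and the 0/1 input encoding are illustrative choices.

```python
import math


def nucleotide_accuracy(true_coding, pred_coding):
    """Return (sensitivity, specificity, CC) for per-base 0/1 coding labels."""
    pairs = list(zip(true_coding, pred_coding))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)

    # Each measure is undefined when its denominator is zero; NaN marks that case.
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return sn, sp, cc
```

As a worked example, true labels `[1, 1, 1, 1, 0, 0, 0, 0]` against predicted labels `[1, 1, 1, 0, 1, 0, 0, 0]` give three true positives, three true negatives, one false positive and one false negative, so sensitivity and specificity are both 0.75 and CC is (9 - 1) / 16 = 0.5.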
