GeneMark is a generic name for a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. Developed in 1993, the original GeneMark was used in 1995 as the primary gene prediction tool for annotation of the first completely sequenced bacterial genome, that of ''Haemophilus influenzae'', and in 1996 for the first archaeal genome, that of ''Methanococcus jannaschii''. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in two DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type (protein-coding and non-coding). The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of its being "protein-coding" (carrying genetic code) in each of six possible reading frames (including three frames in the complementary DNA strand) or being "non-coding". The original GeneMark (developed before the HMM era in bioinformatics) is an HMM-like algorithm; it can be viewed as an approximation to the posterior decoding algorithm known in HMM theory, for an appropriately defined HMM.
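The posterior computation described above can be illustrated with a minimal Python sketch. It is not the GeneMark implementation: the first-order chains, the function names and every probability below are toy assumptions chosen only to show how seven hypotheses (six coding frames plus "non-coding") are compared with Bayes' rule.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def coding_loglik(seq, frame, init, trans):
    """Log-likelihood under a three-periodic first-order Markov chain.

    frame is the codon position (0, 1 or 2) of the first nucleotide;
    init[pos][base] and trans[pos][prev][base] are indexed by the codon
    position of the nucleotide being generated.
    """
    pos = frame
    ll = math.log(init[pos][seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        pos = (pos + 1) % 3
        ll += math.log(trans[pos][prev][cur])
    return ll

def noncoding_loglik(seq, init, trans):
    """Log-likelihood under an ordinary (homogeneous) first-order chain."""
    ll = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(trans[prev][cur])
    return ll

def frame_posteriors(seq, coding, noncoding, priors=None):
    """Posterior probability of six coding frames and of 'non-coding'."""
    cod_init, cod_trans = coding
    non_init, non_trans = noncoding
    rc = reverse_complement(seq)
    logs = {"non-coding": noncoding_loglik(seq, non_init, non_trans)}
    for f in range(3):
        logs[f"+{f}"] = coding_loglik(seq, f, cod_init, cod_trans)
        logs[f"-{f}"] = coding_loglik(rc, f, cod_init, cod_trans)
    if priors is None:  # assumption: equal prior for every hypothesis
        priors = {h: 1 / len(logs) for h in logs}
    logs = {h: v + math.log(priors[h]) for h, v in logs.items()}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {h: math.exp(v - top) for h, v in logs.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

# Toy parameters (assumptions): codon position 0 strongly prefers 'A',
# everything else is uniform and independent of the previous nucleotide.
UNIFORM = {b: 0.25 for b in "ACGT"}
A_RICH = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
by_pos = [A_RICH, UNIFORM, UNIFORM]
toy_coding = (by_pos, [{prev: dist for prev in "ACGT"} for dist in by_pos])
toy_noncoding = (UNIFORM, {prev: UNIFORM for prev in "ACGT"})

post = frame_posteriors("ACCACCACCACC", toy_coding, toy_noncoding)
best = max(post, key=post.get)
```

With these toy parameters the fragment has an 'A' at every first codon position of the direct strand, so the hypothesis labelled `+0` receives most of the posterior mass. In a real setting the parameters would be estimated from training sequences, as the text describes.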

Prokaryotic gene prediction

The GeneMark.hmm algorithm (1998) was designed to improve gene prediction accuracy in finding short genes and gene starts. The idea was to integrate the Markov chain models used in GeneMark into a hidden Markov model framework, with transitions between coding and non-coding regions formally interpreted as transitions between hidden states. Additionally, a ribosome binding site model was used to improve the accuracy of gene start prediction. The next step was the development of the self-training gene prediction tool GeneMarkS (2001). GeneMarkS has been in active use by the genomics community for gene identification in new prokaryotic genomic sequences. GeneMarkS+, an extension of GeneMarkS that integrates information on homologous proteins into gene prediction, is used in the NCBI pipeline for prokaryotic genome annotation; the pipeline can annotate up to 2000 genomes daily.
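The idea of treating coding and non-coding regions as hidden states can be sketched with a textbook two-state HMM decoded by the Viterbi algorithm. This is a deliberately simplified illustration, not GeneMark.hmm itself: the state names `N` and `C`, the per-nucleotide emissions and all probabilities are assumptions, and the sketch has no frame-dependent models, explicit gene length handling or ribosome binding site model.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for obs (log space)."""
    # scores[s] is the best log-probability of any path ending in state s.
    scores = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for sym in obs[1:]:
        new_scores, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[p] + log_trans[p][s])
            new_scores[s] = scores[prev] + log_trans[prev][s] + log_emit[s][sym]
            ptr[s] = prev
        scores = new_scores
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

def logs(dist):
    """Convert a dictionary of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in dist.items()}

# Toy model (assumption): 'C' (coding) regions are G+C rich,
# 'N' (non-coding) regions are A+T rich, and states tend to persist.
STATES = "NC"
LOG_START = logs({"N": 0.5, "C": 0.5})
LOG_TRANS = {"N": logs({"N": 0.9, "C": 0.1}),
             "C": logs({"N": 0.1, "C": 0.9})}
LOG_EMIT = {"N": logs({"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}),
            "C": logs({"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4})}

sequence = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
labels = viterbi(sequence, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

Here `labels` marks the G+C-rich middle block as `C` and both flanks as `N`: two state switches are cheaper than emitting eight unlikely nucleotides from the wrong state.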

Heuristic Models and Gene Prediction in Metagenomes and Metatranscriptomes

Accurate identification of species-specific parameters of the GeneMark and GeneMark.hmm algorithms was the key condition for making accurate gene predictions. However, studies of viral genomes raised the question of how to define parameters for gene prediction in a rather short sequence that has no large genomic context. In 1999 this question was addressed by the development of a "heuristic method" that computes the parameters as functions of the sequence G+C content. Since 2004, models built by the heuristic approach have been used to find genes in metagenomic sequences. Subsequently, analysis of several hundred prokaryotic genomes led to the development of a more advanced heuristic method (implemented in MetaGeneMark) in 2010.
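To make "parameters as functions of the sequence G+C content" concrete, the following Python sketch derives nucleotide frequencies from G+C content alone. It is only an illustration of the principle: the function names are hypothetical, the linear form is an assumption, and the coefficients in `PLACEHOLDER_COEFFS` are invented placeholders, not values from the published heuristic method, which is considerably more elaborate.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous nucleotides of seq."""
    seq = seq.upper()
    gc = sum(seq.count(b) for b in "GC")
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / acgt

def background_frequencies(gc):
    """Simplest model: strand-symmetric frequencies fixed by G+C content."""
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}

def positional_frequencies(gc, coeffs):
    """Nucleotide frequencies at the three codon positions.

    coeffs[pos][base] = (slope, intercept) of an assumed linear dependence
    on G+C content; the result is clamped and renormalised per position.
    """
    result = []
    for pos_coeffs in coeffs:
        raw = {b: max(slope * gc + intercept, 1e-6)
               for b, (slope, intercept) in pos_coeffs.items()}
        total = sum(raw.values())
        result.append({b: v / total for b, v in raw.items()})
    return result

# Placeholder coefficients (NOT fitted values): positions 1 and 2 follow
# the background, position 3 reacts more strongly to G+C content.
MILD = {"A": (-0.5, 0.5), "C": (0.5, 0.0), "G": (0.5, 0.0), "T": (-0.5, 0.5)}
STRONG = {"A": (-0.8, 0.65), "C": (0.8, -0.15),
          "G": (0.8, -0.15), "T": (-0.8, 0.65)}
PLACEHOLDER_COEFFS = [MILD, MILD, STRONG]

gc = gc_content("GGCCAATT")
background = background_frequencies(gc)
by_position = positional_frequencies(0.7, PLACEHOLDER_COEFFS)
```

The point of the design is that a short fragment supplies only one reliable statistic, its G+C content, and every other model parameter is read off a relation fitted beforehand on sequences from known genomes.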

Eukaryotic gene prediction

In eukaryotic genomes, modeling exon borders with introns and intergenic regions presents a major challenge, which is addressed by the use of HMMs. The HMM architecture of eukaryotic GeneMark.hmm includes hidden states for initial, internal, and terminal exons, introns, intergenic regions and single-exon genes located in both DNA strands. The initial eukaryotic GeneMark.hmm needed training sets for estimation of the algorithm parameters. In 2005 the first version of the self-training algorithm GeneMark-ES was developed. In 2008 the GeneMark-ES algorithm was extended to fungal genomes by developing a special intron model and a more complex self-training strategy. Then, in 2014, GeneMark-ET, an algorithm that augments self-training with information from unassembled RNA-Seq reads mapped to the genome, was added to the family. Gene prediction in eukaryotic transcripts can be done by the GeneMarkS-T algorithm (2015).
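The hidden states listed above constrain which segment may follow which. The Python sketch below encodes one plausible transition graph for the direct strand only and checks whether a proposed parse respects it. The state names and the graph are assumptions made for illustration; the actual GeneMark.hmm architecture also covers the reverse strand and further details not modeled here.

```python
# Allowed successor states for each segment type (direct strand only;
# the names and the graph are illustrative assumptions).
ALLOWED = {
    "intergenic": {"initial_exon", "single_exon_gene"},
    "initial_exon": {"intron"},
    "intron": {"internal_exon", "terminal_exon"},
    "internal_exon": {"intron"},
    "terminal_exon": {"intergenic"},
    "single_exon_gene": {"intergenic"},
}

def is_valid_parse(segments):
    """True if every consecutive pair of segment labels is an allowed transition."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(segments, segments[1:]))

three_exon_gene = ["intergenic", "initial_exon", "intron", "internal_exon",
                   "intron", "terminal_exon", "intergenic"]
broken = ["intergenic", "internal_exon", "intergenic"]

ok = is_valid_parse(three_exon_gene)
bad = is_valid_parse(broken)
```

A decoding algorithm only ever considers parses for which such a check would succeed, which is what guarantees that every predicted multi-exon gene starts with an initial exon and ends with a terminal exon.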

GeneMark Family of Gene Prediction Programs

Bacteria, Archaea

* GeneMark
* GeneMarkS
* GeneMarkS+

Metagenomes and Metatranscriptomes

* MetaGeneMark

Eukaryotes

* GeneMark
* GeneMark.hmm
* GeneMark-ES: gene finding algorithm for eukaryotic genomes that performs automatic training in unsupervised ab initio mode.
* GeneMark-ET: augments GeneMark-ES with a novel method that integrates RNA-Seq read alignments into the self-training procedure.
* GeneMark-EX: a fully automatic integrated tool for genome annotation that shows robust performance across input data of various size, structure and quality. The algorithm selects the approach to parameter estimation depending on the volume, quality and features of the input data, the size of the RNA-Seq dataset, the phylogenetic position of the species, and the degree of assembly fragmentation. It is able to automatically modify the HMM architecture to fit the features of the genome in question and to integrate transcript and protein information into the process of gene prediction. (https://pag.confex.com/pag/xxvi/meetingapp.cgi/Paper/31299)

Viruses, phages and plasmids

* Heuristic models

Transcripts assembled from RNA-Seq reads

* GeneMarkS-T

References

* Borodovsky M. and McIninch J. GeneMark: parallel gene recognition for both DNA strands. ''Computers & Chemistry'' (1993) 17 (2): 123–133.
* Lukashin A. and Borodovsky M. GeneMark.hmm: new solutions for gene finding. ''Nucleic Acids Research'' (1998) 26 (4): 1107–1115.
* Besemer J. and Borodovsky M. Heuristic approach to deriving models for gene finding. ''Nucleic Acids Research'' (1999) 27 (19): 3911–3920.
* Besemer J., Lomsadze A. and Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. ''Nucleic Acids Research'' (2001) 29 (12): 2607–2618.
* Mills R., Rozanov M., Lomsadze A., Tatusova T. and Borodovsky M. Improving gene annotation in complete viral genomes. ''Nucleic Acids Research'' (2003) 31 (23): 7041–7055.
* Besemer J. and Borodovsky M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. ''Nucleic Acids Research'' (2005) 33 (Web Server Issue): W451–W454.
* Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. ''Nucleic Acids Research'' (2005) 33 (20): 6494–6506.
* Zhu W., Lomsadze A. and Borodovsky M. Ab initio gene identification in metagenomic sequences. ''Nucleic Acids Research'' (2010) 38 (12): e132.
