In bioinformatics, the general feature format (gene-finding format, generic feature format, GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences.
GFF Versions
The following versions of GFF exist:
* General Feature Format Version 2 (GFF2), generally deprecated
* Gene Transfer Format (GTF), a derivative of GFF2 used by Ensembl
* Generic Feature Format Version 3 (GFF3)
* Genome Variation Format (GVF), a GFF3 extension with additional pragmas and attributes for sequence_alteration features
GFF2/GTF had a number of deficiencies, notably that it could only represent two-level feature hierarchies and thus could not handle the three-level hierarchy of gene → transcript → exon. GFF3 addresses this and other deficiencies. For example, it supports arbitrarily many hierarchical levels and gives specific meanings to certain tags in the attributes field.
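As an illustration of how GFF3 expresses a multi-level hierarchy, the following minimal Python sketch links features through the ID and Parent tags of the attributes field. The sample records and the helper names `parse_attributes` and `build_children` are hypothetical and exist only for this example; a real parser would need to handle more of the specification.

```python
from collections import defaultdict
from urllib.parse import unquote

# Hypothetical three-level example: gene -> mRNA -> exon (tab-delimited columns).
SAMPLE = "\n".join([
    "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene1;Name=demo",
    "chr1\texample\tmRNA\t1000\t9000\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\texample\texon\t1000\t1500\t.\t+\t.\tID=exon1;Parent=mrna1",
    "chr1\texample\texon\t3000\t3900\t.\t+\t.\tID=exon2;Parent=mrna1",
])


def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict of tag -> list of values."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Multiple values are comma-separated; reserved characters are URL-escaped.
        attrs[tag] = [unquote(v) for v in value.split(",")]
    return attrs


def build_children(text):
    """Map each feature ID to the IDs of the features that name it as Parent."""
    children = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        attrs = parse_attributes(line.split("\t")[8])
        feature_id = attrs.get("ID", [None])[0]
        for parent in attrs.get("Parent", []):
            children[parent].append(feature_id)
    return dict(children)


children = build_children(SAMPLE)
print(children)
```

Because Parent may point at any feature that has an ID, the same mechanism extends to deeper hierarchies without changing the file layout.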
GTF is identical to GFF version 2.
GFF general structure
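All GFF formats (GFF2, GFF3 and GTF) are tab delimited with 9 fields per line. They all share the same structure for the first 7 fields, while differing in the content and format of the ninth field. Some field names have been changed in GFF3 to avoid confusion. For example, the "seqid" field was formerly referred to as "sequence", which may be confused with a nucleotide or amino acid chain. The general structure, using the GFF3 field names, is as follows:
* seqid (field 1): name of the sequence on which the feature is located
* source (field 2): program or database that generated the feature
* type (field 3): feature type, such as gene, exon or CDS
* start (field 4): start coordinate of the feature, counted from 1
* end (field 5): end coordinate of the feature, inclusive
* score (field 6): numeric score, or "." if undefined
* strand (field 7): "+" or "-", or "." if not stranded
* phase (field 8): 0, 1 or 2 for CDS features, "." otherwise
* attributes (field 9): additional information about the feature, in a format that depends on the GFF version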
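A record line can be split into its nine columns as in the following minimal Python sketch. The field names follow the GFF3 naming; the `GffRecord` type, the `parse_gff_line` function and the sample line are hypothetical names chosen for this example, and the ninth column is deliberately left unparsed because its syntax differs between versions.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int              # 1-based start coordinate
    end: int                # inclusive end coordinate
    score: Optional[float]  # None when the column holds "."
    strand: str
    phase: Optional[int]    # None when the column holds "."
    attributes: str         # raw text; format depends on GFF2, GTF or GFF3


def parse_gff_line(line: str) -> GffRecord:
    """Parse one tab-delimited feature line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited fields, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attributes = fields
    return GffRecord(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


record = parse_gff_line("ctg123\texample\tCDS\t1201\t1500\t.\t+\t0\tID=cds00001")
print(record.type, record.start, record.end, record.phase)
```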
The 8th field: phase of CDS features
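Simply put, CDS means "CoDing Sequence". The exact meaning of the term is defined by the Sequence Ontology (SO). According to the GFF3 specification, the phase of a CDS feature is 0, 1 or 2, indicating the number of bases that should be removed from the beginning of the feature to reach the first base of the next codon.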
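The way phase carries over from one CDS segment to the next can be sketched as follows. This is an illustrative calculation, not code from the specification: the function name `cds_phases` is hypothetical, and it assumes the segment lengths are given in the order in which they are translated (for a feature on the minus strand, that means starting from the segment with the highest coordinates).

```python
def cds_phases(segment_lengths, first_phase=0):
    """Return the phase of each CDS segment, given lengths in translation order."""
    phases = []
    phase = first_phase
    for length in segment_lengths:
        phases.append(phase)
        # Bases left over after the last complete codon of this segment.
        leftover = (length - phase) % 3
        # The next segment must skip the bases that complete that codon.
        phase = (3 - leftover) % 3
    return phases


print(cds_phases([100, 50, 31]))  # [0, 2, 0]
```

In the example, the first segment of 100 bases ends one base into a codon, so the second segment has phase 2; the second segment then ends on a codon boundary, so the third has phase 0.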
Meta Directives
In GFF files, additional meta information can be included on lines that begin with the ## directive prefix. This meta information can detail the GFF version, sequence region, or species (the full list of metadata types can be found in the Sequence Ontology specifications).
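As a rough illustration, directive lines can be separated from ordinary comments and feature lines as in the following Python sketch. The function name `parse_directives` and the sample text are hypothetical; `gff-version` and `sequence-region` are directive names used in GFF3 files.

```python
def parse_directives(text):
    """Collect (name, value) pairs from lines that start with exactly two '#'."""
    directives = []
    for line in text.splitlines():
        # A bare "###" line carries no name or value, so it is skipped here.
        if line.startswith("##") and not line.startswith("###"):
            name, _, value = line[2:].strip().partition(" ")
            directives.append((name, value.strip()))
    return directives


# Hypothetical sample: two directives, one plain comment, one feature line.
SAMPLE = """##gff-version 3
##sequence-region ctg123 1 1497228
# an ordinary comment
ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001
"""

directives = parse_directives(SAMPLE)
print(directives)
```

Lines with a single leading # are treated as plain comments and ignored, which is why only the two directive lines are reported.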
GFF software
Servers
Servers that generate this format:
Clients
Clients that use this format:
Validation
The modENCODE project hosts an online GFF3 validation tool with generous limits of 286.10 MB and 15 million lines.
The Genome Tools software collection contains a gff3validator tool that can be used offline to validate and possibly tidy GFF3 files. An online validation service is also available.
See also
* Distributed Annotation System
* Variant Call Format
* Sequence alignment