General feature format
In bioinformatics, the general feature format (gene-finding format, generic feature format, GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences.


GFF versions

The following versions of GFF exist:
* General Feature Format Version 2, generally deprecated
** Gene Transfer Format (GTF), a derivative used by Ensembl
* Generic Feature Format Version 3
** Genome Variation Format, with additional pragmas and attributes for sequence_alteration features

GFF2/GTF had a number of deficiencies, notably that it could only represent two-level feature hierarchies and thus could not handle the three-level hierarchy of gene → transcript → exon. GFF3 addresses this and other deficiencies. For example, it supports arbitrarily many hierarchical levels and gives specific meanings to certain tags in the attributes field.

The GTF is identical to GFF, version 2.
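As an illustration of the multi-level hierarchies mentioned above, the following Python sketch groups features by the GFF3 reserved attribute tags `ID` and `Parent`. The helper names (`parse_attributes`, `build_children`) and the example records are hypothetical, and the sketch ignores the URL-style escaping that the GFF3 specification allows in attribute values.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict of tag -> list of values."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attrs[tag] = value.split(",")
    return attrs


def build_children(lines):
    """Map each parent ID to the IDs of the features that list it as Parent."""
    children = defaultdict(list)
    for line in lines:
        # Skip blank lines, comments and directives.
        if not line.strip() or line.startswith("#"):
            continue
        attrs = parse_attributes(line.rstrip("\n").split("\t")[8])
        feature_id = attrs.get("ID", [None])[0]
        for parent in attrs.get("Parent", []):
            children[parent].append(feature_id)
    return dict(children)


# Made-up records forming a three-level hierarchy: gene -> mRNA -> exon.
example = [
    "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene1",
    "chr1\texample\tmRNA\t1000\t9000\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\texample\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=mrna1",
    "chr1\texample\texon\t3000\t3900\t.\t+\t.\tID=exon2;Parent=mrna1",
]
tree = build_children(example)
```

Because each feature only names its direct parent, the same approach works for any number of levels.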


GFF general structure

All GFF formats (GFF2, GFF3 and GTF) are tab-delimited with nine fields per line. They share the same structure for the first seven fields and differ in the content and format of the ''ninth field''. Some field names were changed in GFF3 to avoid confusion: for example, the "seqid" field was formerly called "sequence", which could be confused with a nucleotide or amino acid chain. In GFF3 the nine fields are, in order: seqid, source, type, start, end, score, strand, phase and attributes.
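A minimal Python sketch of reading one such nine-field line is shown here. It uses the GFF3 column names (seqid, source, type, start, end, score, strand, phase, attributes); the `Feature` type, the `parse_feature_line` helper and the sample record are assumptions made for illustration, and the attributes column is left as an unparsed string.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: str


def parse_feature_line(line):
    """Parse one tab-delimited GFF feature line into a Feature."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attributes = fields
    return Feature(
        seqid,
        source,
        ftype,
        int(start),
        int(end),
        None if score == "." else float(score),  # "." marks an undefined value
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


record = parse_feature_line("ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\tID=cds00001")
```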


The 8th field: phase of CDS features

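CDS stands for "coding sequence"; the exact meaning of the term is defined by the Sequence Ontology (SO). According to the GFF3 specification, the phase of a CDS feature is one of the integers 0, 1 or 2, indicating the number of bases that should be removed from the beginning of the feature to reach the first base of the next codon.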
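In GFF3, the phase of a CDS feature is 0, 1 or 2: the number of bases to skip from the start of the feature to reach the first base of the next codon. The Python sketch below shows one way to derive the phase of each CDS segment from the segment lengths. The function name `cds_phases` is hypothetical, and the sketch assumes the segments are given in the order in which they are translated and that the first segment starts at a codon boundary unless `first_phase` says otherwise.

```python
def cds_phases(segment_lengths, first_phase=0):
    """Return the phase of each CDS segment, given lengths in translation order."""
    phases = []
    phase = first_phase
    for length in segment_lengths:
        phases.append(phase)
        # After skipping `phase` bases, whole codons are read; any bases left
        # over start a codon that the next segment has to complete.
        phase = (3 - (length - phase) % 3) % 3
    return phases


# A 4-base segment leaves one base over, so the next segment must supply
# two more bases before a new codon begins (phase 2).
phases = cds_phases([4, 5, 3])
```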


Meta directives

GFF files can include additional meta information on lines that begin with the ## directive prefix. This meta information can specify the GFF version, the sequence region, or the species; the full list of metadata types can be found in the Sequence Ontology specifications.
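As a rough sketch, directive lines can be collected in Python by looking for the `##` prefix. The helper name `parse_directives` and the sample lines are illustrative assumptions; `gff-version` and `sequence-region` are directive names used in GFF3 files.

```python
def parse_directives(lines):
    """Collect (name, arguments) pairs from lines that start with '##'."""
    directives = []
    for line in lines:
        line = line.rstrip("\n")
        # Lines starting with '###' are skipped here for simplicity.
        if line.startswith("##") and not line.startswith("###"):
            parts = line[2:].split()
            if parts:
                directives.append((parts[0], parts[1:]))
    return directives


header = [
    "##gff-version 3",
    "##sequence-region ctg123 1 1497228",
    "# an ordinary comment, not a directive",
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001",
]
found = parse_directives(header)
```

Lines with a single `#` are treated as plain comments and ignored, as are ordinary feature lines.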


GFF software


Servers

Servers that generate this format:


Clients

Clients that use this format:


Validation

The modENCODE project hosts an online GFF3 validation tool with generous limits of 286.10 MB and 15 million lines. The Genome Tools software collection contains a ''gff3validator'' tool that can be used offline to validate and possibly tidy GFF3 files. An online validation service is also available.
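For a quick structural check before running a full validator, a few per-line rules can be tested directly. The Python sketch below is not a substitute for the tools mentioned above: the function name `check_feature_line` is hypothetical, and it covers only a handful of GFF3 rules (nine fields, 1-based start no greater than end, a recognised strand symbol, and a phase for CDS features).

```python
def check_feature_line(line):
    """Return a list of problems found in one GFF feature line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated fields, got {len(fields)}"]
    problems = []
    start, end, strand, phase = fields[3], fields[4], fields[6], fields[7]
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("start must be >= 1 and <= end")
    if strand not in {"+", "-", ".", "?"}:
        problems.append(f"unexpected strand {strand!r}")
    if phase not in {"0", "1", "2", "."}:
        problems.append(f"unexpected phase {phase!r}")
    if fields[2] == "CDS" and phase == ".":
        problems.append("CDS features require a phase of 0, 1 or 2")
    return problems


ok_line = "ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\tID=cds00001"
bad_line = "ctg123\t.\tCDS\t1500\t1201\t.\t*\t.\tID=cds00001"
ok_problems = check_feature_line(ok_line)
bad_problems = check_feature_line(bad_line)
```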


See also

* Distributed Annotation System
* Variant Call Format
* Sequence alignment

