HOME

TheInfoList



OR:

InterPro is a database of
protein families A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be ...
,
protein domain In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of s ...
s and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them. The contents of InterPro consist of diagnostic signatures and the proteins that they significantly match. The signatures consist of models (simple types, such as
regular expressions A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" o ...
or more complex ones, such as
Hidden Markov models A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process — call it X — with unobservable ("''hidden''") states. As part of the definition, HMM requires that there be an ob ...
) which describe protein families, domains or sites. Models are built from the amino acid sequences of known families or domains and they are subsequently used to search unknown sequences (such as those arising from novel genome sequencing) in order to classify them. Each of the member databases of InterPro contributes towards a different niche, from very high-level, structure-based classifications ( SUPERFAMILY and CATH-Gene3D) through to quite specific sub-family classifications ( PRINTS and
PANTHER Panther may refer to: Large cats *Pantherinae, the cat subfamily that contains the genera ''Panthera'' and ''Neofelis'' **'' Panthera'', the cat genus that contains tigers, lions, jaguars and leopards. *** Jaguar (''Panthera onca''), found in So ...
). InterPro's intention is to provide a one-stop-shop for protein classification, where all the signatures produced by the different member databases are placed into entries within the InterPro database. Signatures which represent equivalent domains, sites or families are put into the same entry and entries can also be related to one another. Additional information such as a description, consistent names and
Gene Ontology The Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and g ...
(GO) terms are associated with each entry, where possible.


Data contained in InterPro

InterPro contains three main entities: proteins, signatures (also referred to as "methods" or "models") and entries. The proteins in
UniProtKB UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from ...
are also the central protein entities in InterPro. Information regarding which signatures significantly match these proteins are calculated as the sequences are released by UniProtKB and these results are made available to the public (see below). The matches of signatures to proteins are what determine how signatures are integrated together into InterPro entries: comparative overlap of matched protein sets and the location of the signatures' matches on the sequences are used as indicators of relatedness. Only signatures deemed to be of sufficient quality are integrated into InterPro. As of version 81.0 (released 21 August 2020) InterPro entries annotated 73.9% of residues found in UniProtKB with another 9.2% annotated by signatures that are pending integration. InterPro also includes data for splice variants and the proteins contained in the UniParc and UniMES databases.


InterPro consortium member databases

The signatures from InterPro come from 13 "member databases", which are listed below. ;CATH-Gene3D: Describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov models libraries representing CATH and Pfam domains. Functional annotation is provided to proteins from multiple resources. Functional prediction and analysis of domain architectures is available from the Gene3D website. ;CDD:
Conserved Domain Database The Conserved Domain Database (CDD) is a database of well-annotated multiple sequence alignment models and derived database search models, for ancient domains and full-length proteins. Philosophy Domains can be thought of as distinct functional ...
is a protein annotation resource that consists of a collection of annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. ;HAMAP : Stands for High-quality Automated and Manual Annotation of microbial Proteomes. HAMAP profiles are manually created by expert curators they identify proteins that are part of well-conserved bacterial, archaeal and plastid-encoded (i.e. chloroplasts, cyanelles, apicoplasts, non-photosynthetic plastids) proteins families or subfamilies. ;MobiDB: MobiDB is database annotating intrinsic disorder in proteins. ;PANTHER:
PANTHER Panther may refer to: Large cats *Pantherinae, the cat subfamily that contains the genera ''Panthera'' and ''Neofelis'' **'' Panthera'', the cat genus that contains tigers, lions, jaguars and leopards. *** Jaguar (''Panthera onca''), found in So ...
is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function (human-curated molecular function and biological process classifications and pathway diagrams), as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences. ;Pfam: Is large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. ;PIRSF: Protein classification system is a network with multiple levels of sequence diversity from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins and domains. The primary PIRSF classification unit is the homeomorphic family, whose members are both homologous (evolved from a common ancestor) and homeomorphic (sharing full-length sequence similarity and a common domain architecture). ;PRINTS: PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of UniProt. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, their full diagnostic potency deriving from the mutual context afforded by motif neighbours. ; PROSITE:
PROSITE PROSITE is a protein database. It consists of entries describing the protein families, domains and functional sites as well as amino acid patterns and profiles in them. These are manually curated by a team of the Swiss Institute of Bioinformatic ...
is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. ;SMART:
Simple Modular Architecture Research Tool Simple Modular Architecture Research Tool (SMART) is a biological database that is used in the identification and analysis of protein domains within protein sequences. SMART uses profile-hidden Markov models built from multiple sequence alignmen ...
Allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. More than 800 domain families found in signaling, extracellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. ;SUPERFAMILY: SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the
SCOP A ( or ) was a poet as represented in Old English poetry. The scop is the Old English counterpart of the Old Norse ', with the important difference that "skald" was applied to historical persons, and scop is used, for the most part, to designa ...
classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to. SUPERFAMILY has been used to carry out structural assignments to all completely sequenced genomes. ;SFLD : A hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities. ;TIGRFAMs:
TIGRFAMs TIGRFAMs is a database of protein families designed to support manual and automated genome annotation. Each entry includes a multiple sequence alignment and hidden Markov model (HMM) built from the alignment. Sequences that score above the defined ...
is a collection of protein families, featuring curated multiple sequence alignments, hidden Markov models (HMMs) and annotation, which provides a tool for identifying functionally related proteins based on sequence homology. Those entries which are "equivalogs" group homologous proteins which are conserved with respect to function.


Data types

InterPro consists of seven types of data provided by different members of the consortium:


InterPro entry types

InterPro entries can be further broken down into five types: * Homologous Superfamily: A group of proteins that share a common evolutionary origin as seen in their structural similarities, even if their sequences are not highly similar. These entries are specifically only provided by two member databases: CATH-Gene3D and SUPERFAMILY. * Family: A group of proteins that have a common evolutionary origin determined through structural similarities, related functions, or
sequence homology Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a spe ...
. * Domain: A distinct unit in a protein with a particular function, structure, or sequence. *Repeat: A sequence of amino acids, usually no longer than 50 amino acids, that tend to repeat many times in a protein. * Site: A short sequence of amino acids where at least one amino acid is conserved. These include post-translation modification sites, conserved sites,
binding site In biochemistry and molecular biology, a binding site is a region on a macromolecule such as a protein that binds to another molecule with specificity. The binding partner of the macromolecule is often referred to as a ligand. Ligands may inclu ...
s, and
active site In biology and biochemistry, the active site is the region of an enzyme where substrate molecules bind and undergo a chemical reaction. The active site consists of amino acid residues that form temporary bonds with the substrate (binding site) a ...
s.


Access

The database is available for text- and sequence-based searches via a webserver, and for download via anonymous FTP. Like other
EBI Ebrahim Hamedi ( fa, اِبراهیم حامدی, also Romanized as "Ebrāhim Hāmedi"; born 1949), better known by his stage name Ebi (Persian: ), is an Iranian pop singer who first started his career in Tehran, gaining fame as part of a ban ...
databases, it is in the
public domain The public domain (PD) consists of all the creative work A creative work is a manifestation of creative effort including fine artwork (sculpture, paintings, drawing, sketching, performance art), dance, writing (literature), filmmaking, ...
, since its content can be used "by any individual and for any purpose". InterPro aims to release data to the public every 8 weeks, typically within a day of the UniProtKB release of the same proteins.


InterPro application programming interface (API)

InterPro provides an
API An application programming interface (API) is a way for two or more computer programs to communicate with each other. It is a type of software interface, offering a service to other pieces of software. A document or standard that describes how ...
for programmatic access to all InterPro entries and their related entries in
Json JSON (JavaScript Object Notation, pronounced ; also ) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other ser ...
format. There are six main endpoints for the API corresponding to the different InterPro data types: entry, protein, structure, taxonomy, proteome and set.


InterProScan

InterProScan is a software package that allows users to scan sequences against member database signatures. Users can use this signature scanning software to functionally characterize novel nucleotide or protein sequences. InterProScan is frequently used in
genome projects Genome projects are scientific endeavours that ultimately aim to determine the complete genome sequence of an organism (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist or a virus) and to annotate protein-coding genes and ot ...
in order to obtain a "first-pass" characterisation of the genome of interest. , the public version of InterProScan (v5.x) uses a Java-based architecture. The software package is currently only supported on a 64-bit
Linux Linux ( or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, which ...
operating system. InterProScan, along with many other EMBL-EBI bioinformatics tools, can also be accessed programmatically using RESTful and
SOAP Soap is a salt of a fatty acid used in a variety of cleansing and lubricating products. In a domestic setting, soaps are surfactants usually used for washing, bathing, and other types of housekeeping. In industrial settings, soaps are use ...
Web Services APIs.


See also

*
Protein family A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be c ...
*
Domain of unknown function A domain of unknown function (DUF) is a protein domain that has no characterised function. These families have been collected together in the Pfam database using the prefix DUF followed by a number, with examples being DUF2992 and DUF1220. As of 201 ...
*
Sequence motif In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an ''N''-glycosylation site motif can be defined as ''As ...


References


External links

* — webserver {{DEFAULTSORT:Interpro Biological databases Protein classification Science and technology in Cambridgeshire South Cambridgeshire District