A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred (see homology). Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent (due to low sequence similarity). Superfamilies typically contain several protein families which show sequence similarity within each family. The term ''protein clan'' is commonly used for protease and glycosyl hydrolase superfamilies based on the MEROPS and CAZy classification systems.
Identification
Superfamilies of proteins are identified using a number of methods. Closely related members can be identified by different methods to those needed to group the most evolutionarily divergent members.
Sequence similarity
Historically, the similarity of different amino acid sequences has been the most common method of inferring homology. Sequence similarity is considered a good predictor of relatedness, since similar sequences are more likely the result of gene duplication and divergent evolution, rather than the result of convergent evolution. Amino acid sequence is typically more conserved than DNA sequence (due to the degenerate genetic code), so it is a more sensitive detection method. Since some of the amino acids have similar properties (e.g., charge, hydrophobicity, size), conservative mutations that interchange them are often neutral to function. The most conserved sequence regions of a protein often correspond to functionally important regions like catalytic sites and binding sites, since these regions are less tolerant to sequence changes.
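The distinction between identical and merely conservative positions can be illustrated with a minimal Python sketch. It assumes the two sequences are already aligned to equal length with `-` marking gaps, and it uses a hypothetical, simplified grouping of amino acids by property (`GROUPS`); real analyses typically rely on substitution matrices rather than fixed groups.

```python
# Hypothetical, simplified grouping of amino acids by broad property.
# This is an illustrative assumption, not a standard classification.
GROUPS = [
    set("AVLIMFWC"),  # hydrophobic
    set("STNQYGP"),   # polar or small
    set("KRH"),       # positively charged
    set("DE"),        # negatively charged
]


def same_group(a, b):
    """True if both residues fall in the same property group."""
    return any(a in g and b in g for g in GROUPS)


def identity_and_similarity(seq_a, seq_b):
    """Return (percent identity, percent similarity) over non-gap columns.

    Both inputs must be pre-aligned strings of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = identical = similar = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical positions also count as similar
        elif same_group(a, b):
            similar += 1  # conservative substitution
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


# Two of four positions are identical; the other two are conservative swaps.
print(identity_and_similarity("KLDE", "RLEE"))  # (50.0, 100.0)
```

In this toy case the similarity score is twice the identity score, which is the kind of signal that conservative substitutions can leave behind when exact identity is low.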
Using sequence similarity to infer homology has several limitations. There is no minimum level of sequence similarity guaranteed to produce identical structures. Over long periods of evolution, related proteins may show no detectable sequence similarity to one another. Sequences with many insertions and deletions can also be difficult to align, which makes it hard to identify the homologous sequence regions. In the PA clan of proteases, for example, not a single residue is conserved through the superfamily, not even those in the catalytic triad. Conversely, the individual families that make up a superfamily are defined on the basis of their sequence alignment, for example the C04 protease family within the PA clan.
Nevertheless, sequence similarity is the most commonly used form of evidence to infer relatedness, since the number of known sequences vastly outnumbers the number of known tertiary structures. In the absence of structural information, sequence similarity constrains the limits of which proteins can be assigned to a superfamily.
Structural similarity
Structure is much more evolutionarily conserved than sequence, such that proteins with highly similar structures can have entirely different sequences. Over very long evolutionary timescales, very few residues show detectable amino acid sequence conservation; however, secondary structural elements and tertiary structural motifs are highly conserved. Some protein dynamics and conformational changes of the protein structure may also be conserved, as is seen in the serpin superfamily. Consequently, protein tertiary structure can be used to detect homology between proteins even when no evidence of relatedness remains in their sequences. Structural alignment programs, such as DALI, use the 3D structure of a protein of interest to find proteins with similar folds. However, on rare occasions, related proteins may evolve to be structurally dissimilar and relatedness can only be inferred by other methods.
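One way to compare folds without first superimposing them is to compare internal distance matrices, which do not change when a structure is rotated or translated. The Python sketch below illustrates only that general idea and is not the algorithm used by DALI or any other program. It assumes both structures have the same number of residues with a known one-to-one correspondence, and the function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise distances between points (for example, C-alpha atoms)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def distance_matrix_difference(coords_a, coords_b):
    """Mean absolute difference between two internal distance matrices.

    Assumes residue i of one structure corresponds to residue i of the
    other; a real structural alignment has to find that correspondence.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    n = len(coords_a)
    if n < 2:
        return 0.0
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    # Sum over the upper triangle only, since the matrices are symmetric.
    total = sum(
        abs(da[i][j] - db[i][j]) for i in range(n) for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


# Toy coordinates (hypothetical): the second set is the first one rotated
# by 90 degrees about the z axis and then shifted.
fold = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
moved = [(-y + 10.0, x - 5.0, z + 2.0) for x, y, z in fold]
print(distance_matrix_difference(fold, moved) < 1e-9)  # True
```

A score near zero indicates that the two sets of points have the same internal geometry regardless of their orientation in space.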
Mechanistic similarity
The catalytic mechanism of enzymes within a superfamily is commonly conserved, although substrate specificity may be significantly different. Catalytic residues also tend to occur in the same order in the protein sequence. For the families within the PA clan of proteases, although there has been divergent evolution of the catalytic triad residues used to perform catalysis, all members use a similar mechanism to perform covalent, nucleophilic catalysis on proteins, peptides or amino acids. However, mechanism alone is not sufficient to infer relatedness. Some catalytic mechanisms have evolved convergently multiple times, and so occur in separate superfamilies, while some superfamilies display a range of different (though often chemically similar) mechanisms.
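The observation that catalytic residues tend to occur in the same order can be checked with a short Python sketch. The enzymes and residue positions below are hypothetical, made up purely for illustration, and a shared order is only consistent with common ancestry rather than proof of it.

```python
def catalytic_order(residues):
    """Return catalytic residue names sorted by sequence position.

    residues: iterable of (residue_name, position) pairs.
    """
    return tuple(name for name, _pos in sorted(residues, key=lambda r: r[1]))


def same_catalytic_order(residues_a, residues_b):
    """True if both enzymes list their catalytic residues in the same order."""
    return catalytic_order(residues_a) == catalytic_order(residues_b)


# Hypothetical catalytic triads (names and positions are illustrative only).
enzyme_a = [("His", 40), ("Asp", 85), ("Ser", 170)]
enzyme_b = [("Ser", 201), ("His", 62), ("Asp", 110)]
enzyme_c = [("Asp", 30), ("His", 66), ("Ser", 220)]

print(same_catalytic_order(enzyme_a, enzyme_b))  # True: His, Asp, Ser in both
print(same_catalytic_order(enzyme_a, enzyme_c))  # False: Asp comes first in c
```

Here `enzyme_a` and `enzyme_c` use the same three residue types but in a different order along the sequence, the kind of pattern that would prompt a closer look at whether a shared mechanism arose convergently.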
Evolutionary significance
Protein superfamilies represent the current limits of our ability to identify common ancestry. They are the largest evolutionary grouping based on direct evidence that is currently possible. They are therefore amongst the most ancient evolutionary events currently studied. Some superfamilies have members present in all kingdoms of life, indicating that the last common ancestor of that superfamily was in the last universal common ancestor of all life (LUCA).
Superfamily members may be in different species, with the ancestral protein being the form of the protein that existed in the ancestral species (orthology). Conversely, the proteins may be in the same species, but evolved from a single protein whose gene was duplicated in the genome (paralogy).
Diversification
A majority of proteins contain multiple domains. Between 66% and 80% of eukaryotic proteins have multiple domains, compared with about 40-60% of prokaryotic proteins. Over time, many of the superfamilies of domains have mixed together. In fact, it is very rare to find "consistently isolated superfamilies". When domains do combine, the N- to C-terminal domain order (the "domain architecture") is typically well conserved. Additionally, the number of domain combinations seen in nature is small compared to the number of possibilities, suggesting that selection acts on all combinations.
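The gap between possible and observed domain combinations can be made concrete with a small Python sketch. It treats a domain architecture as an ordered tuple of superfamily names, read from N- to C-terminus, and counts how many of the possible ordered combinations appear in a data set. The superfamily labels and proteins are hypothetical placeholders.

```python
def possible_architectures(n_superfamilies, n_domains):
    """Number of ordered arrangements of n_domains drawn from n_superfamilies.

    Order matters (N- to C-terminal) and a superfamily may repeat.
    """
    return n_superfamilies ** n_domains


def observed_fraction(proteins, n_superfamilies, n_domains):
    """Fraction of possible architectures of a given length that are observed."""
    observed = {tuple(p) for p in proteins if len(p) == n_domains}
    return len(observed) / possible_architectures(n_superfamilies, n_domains)


# Hypothetical data set: three superfamilies, four proteins.
proteins = [
    ("SF1", "SF2"),
    ("SF1", "SF2"),  # same architecture seen in a second protein
    ("SF2", "SF3"),
    ("SF3",),        # single-domain protein, ignored for n_domains=2
]

print(possible_architectures(3, 2))       # 9
print(observed_fraction(proteins, 3, 2))  # 2 of 9 possible architectures
```

Because ("SF1", "SF2") and ("SF2", "SF1") count as different architectures, the space of possibilities grows quickly with the number of superfamilies, while the set actually observed stays comparatively small.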
Examples
; α/β hydrolase superfamily: Members share an α/β sheet, containing 8 strands connected by helices, with catalytic triad residues in the same order. Activities include proteases, lipases, peroxidases, esterases, epoxide hydrolases and dehalogenases.
; Alkaline phosphatase superfamily: Members share an αβα sandwich structure as well as performing common promiscuous reactions by a common mechanism.
; Globin superfamily: Members share an 8-alpha helix globular globin fold.
; Immunoglobulin superfamily: Members share a sandwich-like structure of two sheets of antiparallel β strands (Ig-fold), and are involved in recognition, binding, and adhesion.
; PA clan: Members share a chymotrypsin-like double β-barrel fold and similar proteolysis mechanisms but sequence identity of <10%. The clan contains both cysteine and serine proteases (different nucleophiles).
; Ras superfamily: Members share a common catalytic G domain of a 6-strand β sheet surrounded by 5 α-helices.
; RSH superfamily: Members share capability to hydrolyze and/or synthesize ppGpp alarmones in the stringent response.
; Serpin superfamily: Members share a high-energy, stressed fold which can undergo a large conformational change, which is typically used to inhibit serine and cysteine proteases by disrupting their structure.
; TIM barrel superfamily: Members share a large α8β8 barrel structure. It is one of the most common protein folds and the monophyly of this superfamily is still contested.
Protein superfamily resources
Several biological databases document protein superfamilies and protein folds, for example:
* Pfam - Protein families database of alignments and HMMs
* PROSITE - Database of protein domains, families and functional sites
* PIRSF - SuperFamily Classification System
* PASS2 - Protein Alignment as Structural Superfamilies v2
* SUPERFAMILY - Library of HMMs representing superfamilies and database of (superfamily and family) annotations for all completely sequenced organisms
* SCOP and CATH - Classifications of protein structures into superfamilies, families and domains
Similarly there are algorithms that search the PDB for proteins with structural homology to a target structure, for example:
* DALI - Structural alignment based on a distance alignment matrix method
See also
* Structural alignment
* Protein domains
* Protein family
* Protein subfamily
* Protein mimetic
* Protein structure
* Homology (biology)
* Interolog
* List of gene families
* SUPERFAMILY
* CATH