A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred (see homology). Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent (due to low sequence similarity). Superfamilies typically contain several protein families which show sequence similarity within each family. The term "protein clan" is commonly used for protease and glycosyl hydrolase superfamilies based on the MEROPS and CAZy classification systems.
Identification
Superfamilies of proteins are identified using a number of methods. Closely related members can be identified by different methods to those needed to group the most evolutionarily divergent members.
Sequence similarity
Historically, the similarity of different amino acid sequences has been the most common method of inferring homology. Sequence similarity is considered a good predictor of relatedness, since similar sequences are more likely the result of gene duplication and divergent evolution, rather than the result of convergent evolution. Amino acid sequence is typically more conserved than DNA sequence (due to the degenerate genetic code), so it is a more sensitive detection method. Since some of the amino acids have similar properties (e.g., charge, hydrophobicity, size), conservative mutations that interchange them are often neutral to function. The most conserved sequence regions of a protein often correspond to functionally important regions like catalytic sites and binding sites, since these regions are less tolerant to sequence changes.
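The distinction between identical residues and conservative substitutions can be sketched in a few lines of Python. This is a minimal illustration, not a standard scoring scheme: the residue groupings below are an assumption made for the example (published groupings and substitution matrices differ), and the function name `identity_and_similarity` is hypothetical. It expects two sequences that have already been aligned, with `-` marking gaps.

```python
# Illustrative grouping of amino acids by broad physicochemical property.
# ASSUMPTION: these groups are chosen for the example; real tools use
# substitution matrices rather than hard group membership.
CONSERVATIVE_GROUPS = [
    set("AVLIMC"),  # hydrophobic / aliphatic
    set("FWY"),     # aromatic
    set("KRH"),     # positively charged
    set("DE"),      # negatively charged
    set("STNQ"),    # polar, uncharged
    set("G"),
    set("P"),
]


def same_group(a, b):
    """True if both residues fall in the same illustrative property group."""
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, percent similarity) over ungapped columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = similar = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # columns containing a gap are not scored
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif same_group(a, b):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


# Toy aligned fragments (made up for illustration, not real proteins).
print(identity_and_similarity("ACDE", "ACEE"))
```

In the toy pair, the D/E column is not identical but both residues are negatively charged, so it raises similarity without raising identity.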
Using sequence similarity to infer homology has several limitations. There is no minimum level of sequence similarity guaranteed to produce identical structures. Over long periods of evolution, related proteins may show no detectable sequence similarity to one another. Sequences with many insertions and deletions can also be difficult to align, which makes it hard to identify the homologous sequence regions. In the PA clan of proteases, for example, not a single residue is conserved through the superfamily, not even those in the catalytic triad. Conversely, the individual families that make up a superfamily are defined on the basis of their sequence alignment, for example the C04 protease family within the PA clan.
Nevertheless, sequence similarity is the most commonly used form of evidence to infer relatedness, since the number of known sequences vastly outnumbers the number of known tertiary structures. In the absence of structural information, sequence similarity constrains the limits of which proteins can be assigned to a superfamily.
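The observation that families can share conserved positions while the superfamily as a whole shares none can be illustrated with a small sketch. The alignment rows below are made-up toy strings, not real protease sequences, and `fully_conserved_columns` is a hypothetical helper written for this example. It assumes the rows are already aligned to equal length.

```python
def fully_conserved_columns(alignment, gap="-"):
    """Return 0-based indices of columns where every row has the same residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have equal length")
    conserved = []
    for i in range(length):
        column = {row[i].upper() for row in alignment}
        # A column is conserved only if it holds one residue type and no gap.
        if len(column) == 1 and gap not in column:
            conserved.append(i)
    return conserved


# Toy data (ASSUMPTION: invented for illustration only).
family_1 = ["MKHDS", "MRHDS", "MKHES"]
family_2 = ["LTAGC", "ITAGC"]

print(fully_conserved_columns(family_1))             # within one family
print(fully_conserved_columns(family_2))             # within another family
print(fully_conserved_columns(family_1 + family_2))  # across both families
```

Each toy family has several invariant columns, yet pooling the two families leaves none, which mirrors how sequence evidence can define families but fail at the superfamily level.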
Structural similarity
Structure is much more evolutionarily conserved than sequence, such that proteins with highly similar structures can have entirely different sequences. Over very long evolutionary timescales, very few residues show detectable amino acid sequence conservation; however, secondary structural elements and tertiary structural motifs are highly conserved. Some protein dynamics and conformational changes of the protein structure may also be conserved, as is seen in the serpin superfamily. Consequently, protein tertiary structure can be used to detect homology between proteins even when no evidence of relatedness remains in their sequences. Structural alignment programs, such as DALI, use the 3D structure of a protein of interest to find proteins with similar folds. However, on rare occasions, related proteins may evolve to be structurally dissimilar and relatedness can only be inferred by other methods.
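One way to see why structure can be compared independently of sequence is to compare intramolecular distance matrices, which do not change when a structure is rotated or translated. The sketch below is a simplified illustration of that idea and is not the DALI algorithm: it assumes the residue correspondence between the two structures is already known and that each residue is represented by a single 3D point (for example a Cα atom). The function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between points given as (x, y, z) tuples."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def distance_matrix_rmsd(coords_a, coords_b):
    """Root-mean-square difference between two intramolecular distance matrices.

    ASSUMPTION: coords_a[i] and coords_b[i] are equivalent residues, i.e. the
    hard part of structural alignment (finding the correspondence) is done.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of points")
    n = len(coords_a)
    if n < 2:
        return 0.0
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (da[i][j] - db[i][j]) ** 2
            pairs += 1
    return math.sqrt(total / pairs)


# Toy coordinates (made up): b is a rotated and shifted copy of a.
a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
b = [(10.0, 5.0, -2.0), (10.0, 8.8, -2.0), (6.2, 8.8, -2.0)]
print(distance_matrix_rmsd(a, b))
```

Because only internal distances are used, no superposition step is needed. One known limitation of this simplification is that a structure and its mirror image give identical distance matrices.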
Mechanistic similarity
The catalytic mechanism of enzymes within a superfamily is commonly conserved, although substrate specificity may be significantly different. Catalytic residues also tend to occur in the same order in the protein sequence. For the families within the PA clan of proteases, although there has been divergent evolution of the catalytic triad residues used to perform catalysis, all members use a similar mechanism to perform covalent, nucleophilic catalysis on proteins, peptides or amino acids. However, mechanism alone is not sufficient to infer relatedness. Some catalytic mechanisms have evolved convergently multiple times, and so are found in separate superfamilies, and some superfamilies display a range of different (though often chemically similar) mechanisms.
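The order of catalytic residues along the sequence can be compared with a very small helper. The sketch below uses two serine proteases as an illustration; the residue numbers are the conventional textbook numbering for chymotrypsin and subtilisin and are supplied here as an assumption of the example, not taken from the text above. The function `catalytic_order` is hypothetical.

```python
def catalytic_order(residues):
    """Return catalytic residue names ordered by their sequence position.

    residues: iterable of (residue_name, sequence_position) pairs.
    """
    return tuple(name for name, _pos in sorted(residues, key=lambda r: r[1]))


# ASSUMPTION: conventional numbering from the general literature.
chymotrypsin = [("Ser", 195), ("His", 57), ("Asp", 102)]
subtilisin = [("Ser", 221), ("His", 64), ("Asp", 32)]

print(catalytic_order(chymotrypsin))
print(catalytic_order(subtilisin))
print(catalytic_order(chymotrypsin) == catalytic_order(subtilisin))
```

Both enzymes use a Ser-His-Asp triad, but the residues appear in a different order along the chain, which is consistent with the same mechanism having arisen independently in separate superfamilies. Residue order is only one line of evidence and would not establish or rule out homology by itself.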
Evolutionary significance
Protein superfamilies represent the current limits of our ability to identify common ancestry. They are the largest evolutionary grouping based on direct evidence that is currently possible. They are therefore amongst the most ancient evolutionary events currently studied. Some superfamilies have members present in all kingdoms of life, indicating that the last common ancestor of that superfamily was in the last universal common ancestor of all life (LUCA).
Superfamily members may be in different species, with the ancestral protein being the form of the protein that existed in the ancestral species (orthology). Conversely, the proteins may be in the same species, but evolved from a single protein whose gene was duplicated in the genome (paralogy).
Diversification
A majority of proteins contain multiple domains. Between 66% and 80% of eukaryotic proteins have multiple domains, while about 40-60% of prokaryotic proteins do. Over time, many of the superfamilies of domains have mixed together. In fact, it is very rare to find "consistently isolated superfamilies". When domains do combine, the N- to C-terminal domain order (the "domain architecture") is typically well conserved. Additionally, the number of domain combinations seen in nature is small compared to the number of possibilities, suggesting that selection acts on all combinations.
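The comparison between observed and possible domain combinations can be made concrete with a short sketch. The proteins and superfamily labels below are invented placeholders, and `architecture_summary` is a hypothetical helper. Each protein is represented as a list of domain superfamily labels in N- to C-terminal order, and "combinations" are simplified here to ordered pairs of adjacent domains.

```python
def architecture_summary(proteins):
    """Summarise multi-domain content and adjacent domain pairs.

    proteins: mapping of protein name -> list of domain superfamily labels,
    written in N- to C-terminal order.
    """
    total = len(proteins)
    multi = sum(1 for domains in proteins.values() if len(domains) > 1)
    superfamilies = {d for domains in proteins.values() for d in domains}
    # Ordered pairs of neighbouring domains actually seen in the data.
    observed_pairs = {
        (left, right)
        for domains in proteins.values()
        for left, right in zip(domains, domains[1:])
    }
    return {
        "multi_domain_fraction": multi / total if total else 0.0,
        "observed_pairs": len(observed_pairs),
        # Every superfamily could in principle sit next to any other or itself.
        "possible_pairs": len(superfamilies) ** 2,
    }


# Toy data (ASSUMPTION: labels A, B, C are placeholders, not real superfamilies).
toy_proteins = {
    "p1": ["A", "B"],
    "p2": ["A", "B", "C"],
    "p3": ["C"],
    "p4": ["A", "B"],
}
print(architecture_summary(toy_proteins))
```

In the toy set the A-then-B order recurs while B-then-A never appears, a small-scale picture of a conserved domain architecture and of observed pairs being far fewer than possible pairs.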
Examples
; α/β hydrolase superfamily: Members share an α/β sheet, containing 8 strands connected by helices, with catalytic triad residues in the same order. Activities include proteases, lipases, peroxidases, esterases, epoxide hydrolases and dehalogenases.
; Alkaline phosphatase superfamily: Members share an αβα sandwich structure as well as performing common promiscuous reactions by a common mechanism.
; Globin superfamily: Members share an 8-alpha helix globular globin fold.
; Immunoglobulin superfamily: Members share a sandwich-like structure of two sheets of antiparallel β strands (Ig-fold), and are involved in recognition, binding, and adhesion.
; PA clan: Members share a chymotrypsin-like double β-barrel fold and similar proteolysis mechanisms but sequence identity of <10%. The clan contains both cysteine and serine proteases (different nucleophiles).
; Ras superfamily: Members share a common catalytic G domain of a 6-strand β sheet surrounded by 5 α-helices.
; RSH superfamily: Members share capability to hydrolyze and/or synthesize ppGpp alarmones in the stringent response.
; Serpin superfamily: Members share a high-energy, stressed fold which can undergo a large conformational change, which is typically used to inhibit serine and cysteine proteases by disrupting their structure.
; TIM barrel superfamily: Members share a large α8β8 barrel structure. It is one of the most common protein folds and the monophyly of this superfamily is still contested.
Protein superfamily resources
Several biological databases document protein superfamilies and protein folds, for example:
* Pfam - Protein families database of alignments and HMMs
* PROSITE - Database of protein domains, families and functional sites
* PIRSF - SuperFamily Classification System
* PASS2 - Protein Alignment as Structural Superfamilies v2
* SUPERFAMILY - Library of HMMs representing superfamilies and database of (superfamily and family) annotations for all completely sequenced organisms
* SCOP and CATH - Classifications of protein structures into superfamilies, families and domains
Similarly there are algorithms that search the PDB for proteins with structural homology to a target structure, for example:
* DALI - Structural alignment based on a distance alignment matrix method
See also
* Structural alignment
* Protein domains
* Protein family
* Protein mimetic
* Protein structure
* Homology (biology)
* Interolog
* List of gene families
* SUPERFAMILY
* CATH