A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred (see homology). Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent (due to low sequence similarity). Superfamilies typically contain several protein families which show sequence similarity within each family. The term ''protein clan'' is commonly used for protease and glycosyl hydrolase superfamilies based on the MEROPS and CAZy classification systems.
Identification
Superfamilies of proteins are identified using a number of methods. Closely related members can be identified by different methods to those needed to group the most evolutionarily divergent members.
Sequence similarity
Historically, the similarity of different amino acid sequences has been the most common method of inferring homology. Sequence similarity is considered a good predictor of relatedness, since similar sequences are more likely the result of gene duplication and divergent evolution than of convergent evolution. Amino acid sequence is typically more conserved than DNA sequence (due to the degenerate genetic code), so it is a more sensitive detection method. Since some of the amino acids have similar properties (e.g., charge, hydrophobicity, size), conservative mutations that interchange them are often neutral to function. The most conserved sequence regions of a protein often correspond to functionally important regions like catalytic sites and binding sites, since these regions are less tolerant to sequence changes.
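The difference between strict identity and similarity that tolerates conservative substitutions can be sketched in a few lines of Python. This is only an illustration: the property groups below are an assumed, simplified grouping, the function names are hypothetical, and the two sequences are made-up, already aligned strings. Real tools typically score substitutions with matrices rather than fixed groups.

```python
# Assumed, simplified grouping of amino acids by side-chain properties.
# This is illustrative only; it is not a standard substitution matrix.
GROUPS = [
    set("AVLIMC"),  # aliphatic / hydrophobic
    set("FWY"),     # aromatic
    set("KRH"),     # positively charged
    set("DE"),      # negatively charged
    set("STNQ"),    # polar, uncharged
    set("G"),
    set("P"),
]


def same_group(a, b):
    """True if both residues fall in the same (assumed) property group."""
    return any(a in g and b in g for g in GROUPS)


def identity_and_similarity(seq_a, seq_b):
    """Return (identity, similarity) as fractions of aligned, gap-free columns.

    Both inputs must already be aligned to the same length, with '-' for gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    columns = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0, 0.0
    identical = sum(a == b for a, b in columns)
    similar = sum(a == b or same_group(a, b) for a, b in columns)
    return identical / len(columns), similar / len(columns)


# Hypothetical aligned fragments: K/R and D/E are conservative swaps.
ident, sim = identity_and_similarity("MKV-LDE", "MRVALEE")
print(f"identity={ident:.2f} similarity={sim:.2f}")
```

In this toy input, two of the six gap-free columns differ, but both differences stay within a property group, so the similarity score is higher than the identity score.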
Using sequence similarity to infer homology has several limitations. There is no minimum level of sequence similarity guaranteed to produce identical structures. Over long periods of evolution, related proteins may show no detectable sequence similarity to one another. Sequences with many insertions and deletions can also be difficult to align, which makes it hard to identify the homologous sequence regions. In the PA clan of proteases, for example, not a single residue is conserved through the superfamily, not even those in the catalytic triad. Conversely, the individual families that make up a superfamily are defined on the basis of their sequence alignment, for example the C04 protease family within the PA clan.
Nevertheless, sequence similarity is the most commonly used form of evidence to infer relatedness, since the number of known sequences vastly outnumbers the number of known tertiary structures. In the absence of structural information, sequence similarity constrains the limits of which proteins can be assigned to a superfamily.
Structural similarity
Structure is much more evolutionarily conserved than sequence, such that proteins with highly similar structures can have entirely different sequences. Over very long evolutionary timescales, very few residues show detectable amino acid sequence conservation; however, secondary structural elements and tertiary structural motifs are highly conserved. Some protein dynamics and conformational changes of the protein structure may also be conserved, as is seen in the serpin superfamily. Consequently, protein tertiary structure can be used to detect homology between proteins even when no evidence of relatedness remains in their sequences. Structural alignment programs, such as DALI, use the 3D structure of a protein of interest to find proteins with similar folds. However, on rare occasions, related proteins may evolve to be structurally dissimilar and relatedness can only be inferred by other methods.
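One way to see why structures can be compared without reference to sequence is to compare intramolecular distance matrices, which do not change when a structure is rotated or translated. The sketch below is a minimal illustration of that idea and not DALI's actual algorithm: it assumes the residue correspondence between the two structures is already known, the function names are hypothetical, and the coordinates are made-up points rather than real Cα positions.

```python
import math


def distance_matrix(coords):
    """All pairwise distances between points (e.g. one point per residue)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def distance_matrix_deviation(coords_a, coords_b):
    """Root-mean-square difference between corresponding pairwise distances.

    Assumes point i in coords_a corresponds to point i in coords_b.
    A value near zero means the two shapes are the same up to rotation,
    translation, or reflection.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of points")
    da, db = distance_matrix(coords_a), distance_matrix(coords_b)
    n = len(coords_a)
    diffs = [
        (da[i][j] - db[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    ]
    if not diffs:
        return 0.0
    return math.sqrt(sum(diffs) / len(diffs))


# Made-up coordinates for four points.
original = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]
# Same shape, rotated 90 degrees about the z axis and then shifted.
moved = [(-y + 10.0, x - 5.0, z + 2.0) for x, y, z in original]
# A genuinely different shape: the last point is displaced.
distorted = original[:3] + [(10.0, 0.0, 0.0)]

same_shape = distance_matrix_deviation(original, moved)
different_shape = distance_matrix_deviation(original, distorted)
print(same_shape, different_shape)
```

The rotated copy gives a deviation of essentially zero, while the distorted copy does not. Finding the residue correspondence in the first place is the hard part of structural alignment and is not attempted here.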
Mechanistic similarity
The catalytic mechanism of enzymes within a superfamily is commonly conserved, although substrate specificity may be significantly different. Catalytic residues also tend to occur in the same order in the protein sequence. For the families within the PA clan of proteases, although there has been divergent evolution of the catalytic triad residues used to perform catalysis, all members use a similar mechanism to perform covalent, nucleophilic catalysis on proteins, peptides or amino acids. However, mechanism alone is not sufficient to infer relatedness. Some catalytic mechanisms have evolved convergently multiple times, and so occur in separate superfamilies, while some superfamilies display a range of different (though often chemically similar) mechanisms.
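The observation that catalytic residues tend to keep the same order along the sequence can be checked mechanically. The following sketch uses a hypothetical helper, `catalytic_order`, and takes as illustrative input the commonly quoted triad numbering for chymotrypsin (a PA clan member) and subtilisin (a protease from a different superfamily). These residue numbers are an assumption brought in for the example and do not come from the text above.

```python
def catalytic_order(residues):
    """Return catalytic residue names ordered from N- to C-terminus.

    `residues` maps a residue name to its position in the sequence.
    """
    return tuple(name for name, _ in sorted(residues.items(), key=lambda kv: kv[1]))


# Commonly quoted triad positions, used here only as illustrative input.
chymotrypsin = {"His": 57, "Asp": 102, "Ser": 195}
subtilisin = {"Asp": 32, "His": 64, "Ser": 221}

order_a = catalytic_order(chymotrypsin)
order_b = catalytic_order(subtilisin)

print(order_a)             # order of the triad along the chymotrypsin sequence
print(order_b)             # order of the triad along the subtilisin sequence
print(order_a == order_b)  # a differing order is one hint of independent origin
```

Both enzymes use the same three residue types, but in a different order along the chain. A shared order is supporting evidence at best, so it would normally be weighed alongside structural comparison rather than used alone.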
Evolutionary significance
Protein superfamilies represent the current limits of our ability to identify common ancestry. They are the largest evolutionary grouping based on direct evidence that is currently possible. They are therefore amongst the most ancient evolutionary events currently studied. Some superfamilies have members present in all kingdoms of life, indicating that the last common ancestor of that superfamily was in the last universal common ancestor of all life (LUCA).
Superfamily members may be in different species, with the ancestral protein being the form of the protein that existed in the ancestral species (orthology). Conversely, the proteins may be in the same species, but evolved from a single protein whose gene was duplicated in the genome (paralogy).
Diversification
A majority of proteins contain multiple domains. Between 66% and 80% of eukaryotic proteins have multiple domains, while about 40-60% of prokaryotic proteins do. Over time, many of the superfamilies of domains have mixed together. In fact, it is very rare to find “consistently isolated superfamilies”. When domains do combine, the N- to C-terminal domain order (the "domain architecture") is typically well conserved. Additionally, the number of domain combinations seen in nature is small compared to the number of possibilities, suggesting that selection acts on all combinations.
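The gap between observed and possible domain combinations can be made concrete with a small counting exercise. The sketch below is an illustration under stated assumptions: the protein names, the single-letter superfamily labels and the function `architecture_stats` are all hypothetical, and "possible" is taken to mean every ordered pair of adjacent superfamilies, which is only one of several ways the space of combinations could be counted.

```python
def architecture_stats(proteins):
    """Summarise a set of domain architectures.

    `proteins` maps a protein name to a tuple of domain superfamily labels,
    listed in N- to C-terminal order.

    Returns (fraction of multi-domain proteins,
             number of distinct adjacent ordered pairs observed,
             number of ordered pairs that would be possible).
    """
    if not proteins:
        return 0.0, 0, 0
    architectures = list(proteins.values())
    multi_domain = [a for a in architectures if len(a) > 1]
    superfamilies = {domain for a in architectures for domain in a}
    # Adjacent pairs keep their N- to C-terminal order, so (A, B) != (B, A).
    observed_pairs = {pair for a in architectures for pair in zip(a, a[1:])}
    possible_pairs = len(superfamilies) ** 2
    return len(multi_domain) / len(architectures), len(observed_pairs), possible_pairs


# Entirely hypothetical toy data set.
toy_proteins = {
    "p1": ("A", "B"),
    "p2": ("A", "B", "C"),
    "p3": ("C",),
    "p4": ("A", "B"),
    "p5": ("D",),
}

fraction_multi, observed, possible = architecture_stats(toy_proteins)
print(fraction_multi, observed, possible)
```

Even in this toy set only a small share of the possible ordered pairs appears, and the pair that recurs always appears in the same order, which mirrors the conservation of domain architecture described above.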
Examples
; α/β hydrolase superfamily: Members share an α/β sheet, containing 8 strands connected by helices, with catalytic triad residues in the same order. Activities include proteases, lipases, peroxidases, esterases, epoxide hydrolases and dehalogenases.
; Alkaline phosphatase superfamily: Members share an αβα sandwich structure as well as performing common promiscuous reactions by a common mechanism.
; Globin superfamily: Members share an 8-alpha helix globular globin fold.
; Immunoglobulin superfamily: Members share a sandwich-like structure of two sheets of antiparallel β strands (Ig-fold), and are involved in recognition, binding, and adhesion.
; PA clan: Members share a chymotrypsin-like double β-barrel fold and similar proteolysis mechanisms but sequence identity of <10%. The clan contains both cysteine and serine proteases (different nucleophiles).
; Ras superfamily: Members share a common catalytic G domain of a 6-strand β sheet surrounded by 5 α-helices.
; RSH superfamily: Members share the capability to hydrolyze and/or synthesize ppGpp alarmones in the stringent response.
; Serpin superfamily: Members share a high-energy, stressed fold which can undergo a large conformational change, which is typically used to inhibit serine and cysteine proteases by disrupting their structure.
; TIM barrel superfamily: Members share a large α8β8 barrel structure. It is one of the most common protein folds and the monophyly of this superfamily is still contested.
Protein superfamily resources
Several biological databases document protein superfamilies and protein folds, for example:
* Pfam - Protein families database of alignments and HMMs
* PROSITE - Database of protein domains, families and functional sites
* PIRSF - SuperFamily Classification System
* PASS2 - Protein Alignment as Structural Superfamilies v2
* SUPERFAMILY - Library of HMMs representing superfamilies and database of (superfamily and family) annotations for all completely sequenced organisms
* SCOP and CATH - Classifications of protein structures into superfamilies, families and domains
Similarly there are algorithms that search the PDB for proteins with structural homology to a target structure, for example:
* DALI - Structural alignment based on a distance alignment matrix method
See also
* Structural alignment
* Protein domains
* Protein family
* Protein mimetic
* Protein structure
* Homology (biology)
* Interolog
* List of gene families
* SUPERFAMILY
* CATH