Protein Domains

In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of several domains, and a domain may appear in a variety of different proteins.

Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions. In general, domains vary in length from about 50 amino acids up to 250 amino acids. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin. Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.


Background

The concept of the domain was first proposed in 1973 by Wetlaufer after X-ray crystallographic studies of hen lysozyme and papain and by limited proteolysis studies of immunoglobulins. Wetlaufer defined domains as stable units of protein structure that could fold autonomously. In the past domains have been described as units of:
* compact structure
* function and evolution
* folding.
Each definition is valid and they often overlap, i.e. a compact structural domain that is found amongst diverse proteins is likely to fold independently within its structural environment. Nature often brings several domains together to form multidomain and multifunctional proteins with a vast number of possibilities. In a multidomain protein, each domain may fulfill its own function independently, or in a concerted manner with its neighbours. Domains can either serve as modules for building up large assemblies such as virus particles or muscle fibres, or can provide specific catalytic or binding sites as found in enzymes or regulatory proteins.


Example: Pyruvate kinase

An appropriate example is pyruvate kinase (see first figure), a glycolytic enzyme that plays an important role in regulating the flux from fructose-1,6-bisphosphate to pyruvate. It contains an all-β nucleotide binding domain (in blue), an α/β-substrate binding domain (in grey) and an α/β-regulatory domain (in olive green), connected by several polypeptide linkers. Each domain in this protein occurs in diverse sets of protein families.

The central α/β-barrel substrate binding domain is one of the most common enzyme folds. It is seen in many different enzyme families catalysing completely unrelated reactions. The α/β-barrel is commonly called the TIM barrel, named after triose phosphate isomerase, which was the first such structure to be solved. It is currently classified into 26 homologous families in the CATH domain database. The TIM barrel is formed from a sequence of β-α-β motifs closed by the first and last strand hydrogen bonding together, forming an eight-stranded barrel. There is debate about the evolutionary origin of this domain. One study has suggested that a single ancestral enzyme could have diverged into several families, while another suggests that a stable TIM-barrel structure has evolved through convergent evolution.

The TIM-barrel in pyruvate kinase is 'discontinuous', meaning that more than one segment of the polypeptide is required to form the domain. This is likely to be the result of the insertion of one domain into another during the protein's evolution. It has been shown from known structures that about a quarter of structural domains are discontinuous. The inserted β-barrel regulatory domain is 'continuous', made up of a single stretch of polypeptide.


Units of protein structure

The primary structure (string of amino acids) of a protein ultimately encodes its uniquely folded three-dimensional (3D) conformation. The most important factor governing the folding of a protein into 3D structure is the distribution of polar and non-polar side chains. Folding is driven by the burial of hydrophobic side chains into the interior of the molecule so as to avoid contact with the aqueous environment. Generally proteins have a core of hydrophobic residues surrounded by a shell of hydrophilic residues. Since the peptide bonds themselves are polar, they are neutralised by hydrogen bonding with each other when in the hydrophobic environment. This gives rise to regions of the polypeptide that form regular 3D structural patterns called secondary structure. There are two main types of secondary structure: α-helices and β-sheets.

Some simple combinations of secondary structure elements have been found to occur frequently in protein structure and are referred to as supersecondary structure or motifs. For example, the β-hairpin motif consists of two adjacent antiparallel β-strands joined by a small loop. It is present in most antiparallel β structures both as an isolated ribbon and as part of more complex β-sheets. Another common super-secondary structure is the β-α-β motif, which is frequently used to connect two parallel β-strands. The central α-helix connects the C-terminus of the first strand to the N-terminus of the second strand, packing its side chains against the β-sheet and therefore shielding the hydrophobic residues of the β-strands from the surface.

Covalent association of two domains represents a functional and structural advantage since there is an increase in stability when compared with the same structures non-covalently associated. Other advantages are the protection of intermediates within inter-domain enzymatic clefts that may otherwise be unstable in aqueous environments, and a fixed stoichiometric ratio of the enzymatic activity necessary for a sequential set of reactions. Structural alignment is an important tool for determining domains.


Tertiary structure

Several motifs pack together to form compact, local, semi-independent units called domains. The overall 3D structure of the polypeptide chain is referred to as the protein's tertiary structure. Domains are the fundamental units of tertiary structure, each domain containing an individual hydrophobic core built from secondary structural units connected by loop regions. The packing of the polypeptide is usually much tighter in the interior than the exterior of the domain, producing a solid-like core and a fluid-like surface. Core residues are often conserved in a protein family, whereas the residues in loops are less conserved, unless they are involved in the protein's function. Protein tertiary structure can be divided into four main classes based on the secondary structural content of the domain.
* All-α domains have a domain core built exclusively from α-helices. This class is dominated by small folds, many of which form a simple bundle with helices running up and down.
* All-β domains have a core composed of antiparallel β-sheets, usually two sheets packed against each other. Various patterns can be identified in the arrangement of the strands, often giving rise to the identification of recurring motifs, for example the Greek key motif.
* α+β domains are a mixture of all-α and all-β motifs. Classification of proteins into this class is difficult because of overlaps with the other three classes and therefore is not used in the CATH domain database.
* α/β domains are made from a combination of β-α-β motifs that predominantly form a parallel β-sheet surrounded by amphipathic α-helices. The secondary structures are arranged in layers or barrels.


Limits on size

Domains have limits on size. The size of individual structural domains varies from 36 residues in E-selectin to 692 residues in lipoxygenase-1, but the majority, 90%, have fewer than 200 residues with an average of approximately 100 residues. Very short domains, less than 40 residues, are often stabilised by metal ions or disulfide bonds. Larger domains, greater than 300 residues, are likely to consist of multiple hydrophobic cores.


Quaternary structure

Many proteins have a quaternary structure, which consists of several polypeptide chains that associate into an oligomeric molecule. Each polypeptide chain in such a protein is called a subunit. Hemoglobin, for example, consists of two α and two β subunits. Each of the four chains has an all-α globin fold with a heme pocket.


Domain swapping

Domain swapping is a mechanism for forming oligomeric assemblies. In domain swapping, a secondary or tertiary element of a monomeric protein is replaced by the same element of another protein. Domain swapping can range from secondary structure elements to whole structural domains. It also represents a model of evolution for functional adaptation by oligomerisation, e.g. oligomeric enzymes that have their active site at subunit interfaces.


Domains as evolutionary modules

"Nature is a tinkerer and not an inventor": new sequences are adapted from pre-existing sequences rather than invented. Domains are the common material used by nature to generate new sequences; they can be thought of as genetically mobile units, referred to as 'modules'. Often, the C and N termini of domains are close together in space, allowing them to easily be "slotted into" parent structures during the process of evolution. Many domain families are found in all three forms of life, Archaea, Bacteria and Eukarya. Protein modules are a subset of protein domains which are found across a range of different proteins with a particularly versatile structure. Examples can be found among extracellular proteins associated with clotting, fibrinolysis, complement, the extracellular matrix, cell surface adhesion molecules and cytokine receptors. Four concrete examples of widespread protein modules are the following domains: SH2, immunoglobulin, fibronectin type 3 and the kringle.

Molecular evolution gives rise to families of related proteins with similar sequence and structure. However, sequence similarities can be extremely low between proteins that share the same structure. Protein structures may be similar because proteins have diverged from a common ancestor. Alternatively, some folds may be more favored than others as they represent stable arrangements of secondary structures and some proteins may converge towards these folds over the course of evolution. There are currently about 110,000 experimentally determined protein 3D structures deposited within the Protein Data Bank (PDB). However, this set contains many identical or very similar structures. All proteins should be classified into structural families to understand their evolutionary relationships. Structural comparisons are best achieved at the domain level. For this reason many algorithms have been developed to automatically assign domains in proteins with known 3D structure; see 'Domain definition from structural co-ordinates'.

The CATH domain database classifies domains into approximately 800 fold families; ten of these folds are highly populated and are referred to as 'super-folds'. Super-folds are defined as folds for which there are at least three structures without significant sequence similarity. The most populated is the α/β-barrel super-fold, as described previously.


Multidomain proteins

The majority of proteins, two-thirds in unicellular organisms and more than 80% in metazoa, are multidomain proteins. However, other studies concluded that 40% of prokaryotic proteins consist of multiple domains while eukaryotes have approximately 65% multi-domain proteins. Many domains in eukaryotic multidomain proteins can be found as independent proteins in prokaryotes, suggesting that domains in multidomain proteins once existed as independent proteins. For example, vertebrates have a multi-enzyme polypeptide containing the GAR synthetase, AIR synthetase and GAR transformylase domains (GARs-AIRs-GARt; GAR: glycinamide ribonucleotide synthetase/transferase; AIR: aminoimidazole ribonucleotide synthetase). In insects, the polypeptide appears as GARs-(AIRs)2-GARt, in yeast GARs-AIRs is encoded separately from GARt, and in bacteria each domain is encoded separately.
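
The four arrangements above can be written down as a small data structure, which makes the progressive fusion of the three domains easier to compare. This is only an illustrative sketch: the dictionary layout and the helper name `fused_pairs` are assumptions made here, and the domain order within each chain simply follows the order in which the text writes it.

```python
# Domain architectures as described in the text. Each inner list is one
# polypeptide (one gene product); domain order follows the text's notation.
ARCHITECTURES = {
    "vertebrates": [["GARs", "AIRs", "GARt"]],
    "insects": [["GARs", "AIRs", "AIRs", "GARt"]],
    "yeast": [["GARs", "AIRs"], ["GARt"]],
    "bacteria": [["GARs"], ["AIRs"], ["GARt"]],
}


def fused_pairs(polypeptides):
    """Return the set of neighbouring domain pairs that share a chain."""
    return {
        (chain[i], chain[i + 1])
        for chain in polypeptides
        for i in range(len(chain) - 1)
    }


for taxon, polypeptides in ARCHITECTURES.items():
    # Number of separately encoded chains, then the fused neighbours.
    print(taxon, len(polypeptides), sorted(fused_pairs(polypeptides)))
```

Bacteria yield three chains and no fused pairs, whereas vertebrates yield a single chain in which all three domains are linked.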


Origin

Multidomain proteins are likely to have emerged from selective pressure during evolution to create new functions. Various proteins have diverged from common ancestors by different combinations and associations of domains. Modular units frequently move about, within and between biological systems through mechanisms of genetic shuffling:
* transposition of mobile elements including horizontal transfers (between species);
* gross rearrangements such as inversions, translocations, deletions and duplications;
* homologous recombination;
* slippage of DNA polymerase during replication.


Types of organization

The simplest multidomain organization seen in proteins is that of a single domain repeated in tandem. The domains may interact with each other (domain-domain interaction) or remain isolated, like beads on a string. The giant 30,000 residue muscle protein titin comprises about 120 fibronectin-III-type and Ig-type domains. In the serine proteases, a gene duplication event has led to the formation of a two β-barrel domain enzyme. The repeats have diverged so widely that there is no obvious sequence similarity between them. The active site is located at a cleft between the two β-barrel domains, in which functionally important residues are contributed from each domain. Genetically engineered mutants of the chymotrypsin serine protease were shown to have some proteinase activity even though their active site residues were abolished and it has therefore been postulated that the duplication event enhanced the enzyme's activity.

Modules frequently display different connectivity relationships, as illustrated by the kinesins and ABC transporters. The kinesin motor domain can be at either end of a polypeptide chain that includes a coiled-coil region and a cargo domain. ABC transporters are built with up to four domains consisting of two unrelated modules, ATP-binding cassette and an integral membrane module, arranged in various combinations.

Not only do domains recombine, but there are many examples of a domain having been inserted into another. Sequence or structural similarities to other domains demonstrate that homologues of inserted and parent domains can exist independently. An example is that of the 'fingers' inserted into the 'palm' domain within the polymerases of the Pol I family. Since a domain can be inserted into another, there should always be at least one continuous domain in a multidomain protein. This is the main difference between definitions of structural domains and evolutionary/functional domains. An evolutionary domain will be limited to one or two connections between domains, whereas structural domains can have unlimited connections, within a given criterion of the existence of a common core. Several structural domains could be assigned to an evolutionary domain.

A superdomain consists of two or more conserved domains of nominally independent origin, but subsequently inherited as a single structural/functional unit. This combined superdomain can occur in diverse proteins that are not related by gene duplication alone. An example of a superdomain is the protein tyrosine phosphatase-C2 domain pair (PTP-C2) in PTEN, tensin, auxilin and the membrane protein TPTE2. This superdomain is found in proteins in animals, plants and fungi. A key feature of the PTP-C2 superdomain is amino acid residue conservation in the domain interface.


Domains are autonomous folding units


Folding

Protein folding - the unsolved problem: Since the seminal work of Anfinsen in the early 1960s, the goal of completely understanding the mechanism by which a polypeptide rapidly folds into its stable native conformation has remained elusive. Many experimental folding studies have contributed much to our understanding, but the principles that govern protein folding are still based on those discovered in the very first studies of folding. Anfinsen showed that the native state of a protein is thermodynamically stable, the conformation being at a global minimum of its free energy.

Folding is a directed search of conformational space allowing the protein to fold on a biologically feasible time scale. The Levinthal paradox states that if an average-sized protein were to sample all possible conformations before finding the one with the lowest energy, the whole process would take billions of years. Proteins typically fold within 0.1 to 1000 seconds. Therefore, the protein folding process must be directed in some way through a specific folding pathway. The forces that direct this search are likely to be a combination of local and global influences whose effects are felt at various stages of the reaction.

Advances in experimental and theoretical studies have shown that folding can be viewed in terms of energy landscapes, where folding kinetics is considered as a progressive organisation of an ensemble of partially folded structures through which a protein passes on its way to the folded structure. This has been described in terms of a folding funnel, in which an unfolded protein has a large number of conformational states available and there are fewer states available to the folded protein. A funnel implies that for protein folding there is a decrease in energy and loss of entropy with increasing tertiary structure formation. The local roughness of the funnel reflects kinetic traps, corresponding to the accumulation of misfolded intermediates. A folding chain progresses toward lower intra-chain free-energies by increasing its compactness. The chain's conformational options become increasingly narrowed ultimately toward one native structure.


Advantage of domains in protein folding

The organisation of large proteins by structural domains represents an advantage for protein folding, with each domain being able to fold individually, accelerating the folding process and reducing a potentially large combination of residue interactions. Furthermore, given the observed random distribution of hydrophobic residues in proteins, domain formation appears to be the optimal solution for a large protein to bury its hydrophobic residues while keeping the hydrophilic residues at the surface. However, the role of inter-domain interactions in protein folding and in the energetics of stabilisation of the native structure probably differs for each protein. In T4 lysozyme, the influence of one domain on the other is so strong that the entire molecule is resistant to proteolytic cleavage. In this case, folding is a sequential process where the C-terminal domain is required to fold independently in an early step, and the other domain requires the presence of the folded C-terminal domain for folding and stabilisation. It has been found that the folding of an isolated domain can take place at the same rate or sometimes faster than that of the integrated domain, suggesting that unfavourable interactions with the rest of the protein can occur during folding. Several arguments suggest that the slowest step in the folding of large proteins is the pairing of the folded domains. This is either because the domains are not folded entirely correctly or because the small adjustments required for their interaction are energetically unfavourable, such as the removal of water from the domain interface.


Domains and protein flexibility

Protein domain dynamics play a key role in a multitude of molecular recognition and signaling processes. Protein domains, connected by intrinsically disordered flexible linker domains, induce long-range allostery via protein domain dynamics. The resultant dynamic modes cannot be generally predicted from static structures of either the entire protein or individual domains. They can however be inferred by comparing different structures of a protein (as in the Database of Molecular Motions). They can also be suggested by sampling in extensive molecular dynamics trajectories and principal component analysis, or they can be directly observed using spectra measured by neutron spin echo spectroscopy.


Domain definition from structural co-ordinates

The importance of domains as structural building blocks and elements of evolution has brought about many automated methods for their identification and classification in proteins of known structure. Automatic procedures for reliable domain assignment are essential for the generation of the domain databases, especially as the number of known protein structures is increasing. Although the boundaries of a domain can be determined by visual inspection, construction of an automated method is not straightforward. Problems occur when faced with domains that are discontinuous or highly associated. The fact that there is no standard definition of what a domain really is has meant that domain assignments have varied enormously, with each researcher using a unique set of criteria. A structural domain is a compact, globular sub-structure with more interactions within it than with the rest of the protein. Therefore, a structural domain can be determined by two visual characteristics: its compactness and its extent of isolation. Measures of local compactness in proteins have been used in many of the early methods of domain assignment and in several of the more recent methods.
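
The idea that a domain has more interactions within it than with the rest of the protein can be illustrated by counting contacts. The following is a minimal sketch, not any published assignment method: the function name `count_contacts`, the 8 Å contact cutoff, and the toy coordinates (two small cubes of points standing in for two domains) are all assumptions made for illustration.

```python
import math

def count_contacts(coords, labels, cutoff=8.0):
    """Return (intra, inter): contacts within a domain and between domains.

    coords: list of (x, y, z) points, one per residue (e.g. C-alpha atoms).
    labels: candidate domain label for each residue.
    cutoff: contact distance in angstroms (8.0 is an assumed value).
    """
    intra = inter = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                if labels[i] == labels[j]:
                    intra += 1
                else:
                    inter += 1
    return intra, inter

# Toy "protein": two cubes of points (edge 4 A), offset by 10 A along x.
cube = [(x, y, z) for x in (0.0, 4.0) for y in (0.0, 4.0) for z in (0.0, 4.0)]
coords = cube + [(x + 10.0, y, z) for (x, y, z) in cube]

by_cube = ["A"] * 8 + ["B"] * 8   # partition that follows the two cubes
by_parity = ["A", "B"] * 8        # arbitrary partition, for comparison

good = count_contacts(coords, by_cube)
poor = count_contacts(coords, by_parity)
print("cube partition   (intra, inter):", good)
print("parity partition (intra, inter):", poor)
```

The partition that follows the two cubes has many more internal than external contacts, whereas the arbitrary partition does not; this is the "isolation" property in its simplest form.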


Methods

One of the first algorithms used a Cα-Cα distance map together with a hierarchical clustering routine that considered proteins as several small segments, 10 residues in length. The initial segments were clustered one after another based on inter-segment distances; segments with the shortest distances were clustered and considered as single segments thereafter. The stepwise clustering finally included the full protein. Go also exploited the fact that inter-domain distances are normally larger than intra-domain distances; all possible Cα-Cα distances were represented as diagonal plots in which there were distinct patterns for helices, extended strands and combinations of secondary structures.

The method by Sowdhamini and Blundell clusters secondary structures in a protein based on their Cα-Cα distances and identifies domains from the pattern in their dendrograms. As the procedure does not consider the protein as a continuous chain of amino acids, there are no problems in treating discontinuous domains. Specific nodes in these dendrograms are identified as tertiary structural clusters of the protein; these include both super-secondary structures and domains.

The DOMAK algorithm is used to create the 3Dee domain database. It calculates a 'split value' from the number of each type of contact when the protein is divided arbitrarily into two parts. This split value is large when the two parts of the structure are distinct. The method of Wodak and Janin was based on the calculated interface areas between two chain segments repeatedly cleaved at various residue positions. Interface areas were calculated by comparing surface areas of the cleaved segments with that of the native structure. Potential domain boundaries can be identified at a site where the interface area is at a minimum. Other methods have used measures of solvent accessibility to calculate compactness.

The PUU algorithm incorporates a harmonic model used to approximate inter-domain dynamics. The underlying physical concept is that many rigid interactions will occur within each domain and loose interactions will occur between domains. This algorithm is used to define domains in the FSSP domain database. Swindells (1995) developed a method, DETECTIVE, for identification of domains in protein structures based on the idea that domains have a hydrophobic interior. Deficiencies were found to occur when hydrophobic cores from different domains continue through the interface region.
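
The segment-clustering idea described at the start of this section can be sketched as follows. This is an illustrative single-linkage variant, not the original algorithm: taking the inter-segment distance to be the minimum Cα-Cα distance, stopping when two clusters remain, and the helper names are all assumptions, and the coordinates are a toy chain rather than a real structure.

```python
import math

def split_into_segments(n_residues, seg_len=10):
    """Cut the chain into consecutive segments of seg_len residues."""
    return [list(range(start, min(start + seg_len, n_residues)))
            for start in range(0, n_residues, seg_len)]

def min_distance(coords, seg_a, seg_b):
    """Shortest distance between any residue of seg_a and any of seg_b."""
    return min(math.dist(coords[i], coords[j]) for i in seg_a for j in seg_b)

def cluster_segments(coords, seg_len=10, n_clusters=2):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = split_into_segments(len(coords), seg_len)
    while len(clusters) > n_clusters:
        best = None  # (distance, index_p, index_q) of the closest pair
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = min_distance(coords, clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = sorted(clusters[p] + clusters[q])
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return sorted(clusters)

# Toy chain of 40 residues spaced 3.8 A apart, with an extra 30 A gap
# inserted after residue 19 to mimic two spatially separated domains.
chain = [(3.8 * i + (30.0 if i >= 20 else 0.0), 0.0, 0.0) for i in range(40)]
domains = cluster_segments(chain)
for number, residues in enumerate(domains, start=1):
    print("cluster", number, "residues", residues[0], "to", residues[-1])
```

Because merged clusters are stored as residue lists rather than contiguous ranges, the same loop could in principle group segments that are far apart in sequence, which is how clustering approaches accommodate discontinuous domains.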

RigidFinder is a method for identification of protein rigid blocks (domains and loops) from two different conformations. Rigid blocks are defined as blocks where all inter-residue distances are conserved across conformations. The method RIBFIND, developed by Pandurangan and Topf, identifies rigid bodies in protein structures by performing spatial clustering of secondary structural elements in proteins. The RIBFIND rigid bodies have been used to flexibly fit protein structures into cryo-electron microscopy density maps.

A general method to identify ''dynamical domains'', that is, protein regions that behave approximately as rigid units in the course of structural fluctuations, has been introduced by Potestio et al. and, among other applications, was also used to compare the consistency of the dynamics-based domain subdivisions with standard structure-based ones. The method, termed PiSQRD, is publicly available in the form of a webserver. The webserver allows users to optimally subdivide single-chain or multimeric proteins into quasi-rigid domains based on the collective modes of fluctuation of the system. By default these modes are calculated through an elastic network model; alternatively, pre-calculated essential dynamical spaces can be uploaded by the user.
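
The definition of a rigid block given above (all inter-residue distances conserved between two conformations) can be turned into a small sketch. The greedy grouping below is an assumption chosen for brevity and is not the RigidFinder algorithm; the function names, the 0.5 Å tolerance, and the hinge-like toy conformations are likewise hypothetical.

```python
import math

def is_conserved(conf_a, conf_b, i, j, tol=0.5):
    """True if the i-j distance differs by at most tol between conformations."""
    d_a = math.dist(conf_a[i], conf_a[j])
    d_b = math.dist(conf_b[i], conf_b[j])
    return abs(d_a - d_b) <= tol

def rigid_blocks(conf_a, conf_b, tol=0.5):
    """Greedily group residues whose mutual distances are all conserved."""
    unassigned = list(range(len(conf_a)))
    blocks = []
    while unassigned:
        block = [unassigned[0]]
        for r in unassigned[1:]:
            # A residue joins only if it is rigid relative to every member.
            if all(is_conserved(conf_a, conf_b, r, m, tol) for m in block):
                block.append(r)
        blocks.append(block)
        unassigned = [r for r in unassigned if r not in block]
    return blocks

# Conformation A: ten residues on a straight line, 3.8 A apart.
conf_a = [(3.8 * i, 0.0, 0.0) for i in range(10)]
# Conformation B: residues 5-9 swung by 90 degrees about residue 4 (a hinge).
hinge_x = 3.8 * 4
conf_b = [(3.8 * i, 0.0, 0.0) if i <= 4 else (hinge_x, 3.8 * (i - 4), 0.0)
          for i in range(10)]

blocks = rigid_blocks(conf_a, conf_b)
print(blocks)
```

In this toy case the hinge residue is rigid relative to both halves, and the greedy order assigns it to the first block; a real method would need a principled way to resolve such ambiguities.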


Example domains

* Armadillo repeats: named after the β-catenin-like Armadillo protein of the fruit fly ''Drosophila melanogaster''.
* Basic leucine zipper domain (bZIP domain): found in many DNA-binding eukaryotic proteins. One part of the domain mediates sequence-specific DNA binding, and the leucine zipper is required for the dimerization of two DNA-binding regions. The DNA-binding region comprises a number of basic amino acids such as arginine and lysine.
* Cadherin repeats: Cadherins function as Ca2+-dependent cell–cell adhesion proteins. Cadherin domains are extracellular regions which mediate cell-to-cell homophilic binding between cadherins on the surface of adjacent cells.
* Death effector domain (DED): allows protein–protein binding by homotypic interactions (DED-DED). Caspase proteases trigger apoptosis via proteolytic cascades. Pro-caspase-8 and pro-caspase-9 bind to specific adaptor molecules via DED domains, which leads to autoactivation of caspases.
* EF hand: a helix-loop-helix structural motif found in each structural domain of the signaling protein calmodulin and in the muscle protein troponin-C.
* Immunoglobulin-like domains: found in proteins of the immunoglobulin superfamily (IgSF). They contain about 70-110 amino acids and are classified into different categories (IgV, IgC1, IgC2 and IgI) according to their size and function. They possess a characteristic fold in which two beta sheets form a "sandwich" that is stabilized by interactions between conserved cysteines and charged amino acids. They are important for protein–protein interactions in processes of cell adhesion, cell activation, and molecular recognition. These domains are commonly found in molecules with roles in the immune system.
* Phosphotyrosine-binding domain (PTB): PTB domains usually bind to phosphorylated tyrosine residues. They are often found in signal transduction proteins. PTB-domain binding specificity is determined by residues to the amino-terminal side of the phosphotyrosine. Examples: the PTB domains of both SHC and IRS-1 bind to an NPXpY sequence. PTB-containing proteins such as SHC and IRS-1 are important for insulin responses of human cells.
* Pleckstrin homology domain (PH): PH domains bind phosphoinositides with high affinity. Specificity for PtdIns(3)P, PtdIns(4)P, PtdIns(3,4)P2, PtdIns(4,5)P2, and PtdIns(3,4,5)P3 has been observed. Because phosphoinositides are sequestered to various cell membranes (due to their long lipophilic tail), PH domains usually cause recruitment of the protein in question to a membrane, where the protein can exert a certain function in cell signalling, cytoskeletal reorganization or membrane trafficking.
* Src homology 2 domain (SH2): SH2 domains are often found in signal transduction proteins. SH2 domains confer binding to phosphorylated tyrosine (pTyr). They are named after the phosphotyrosine-binding domain of the src viral oncogene, whose product is itself a tyrosine kinase. ''See also'': SH3 domain.
* Zinc finger DNA-binding domain (ZnF_GATA): ZnF_GATA domain-containing proteins are typically transcription factors that usually bind to the DNA sequence (A/T)GATA(A/G) of promoters.


Domains of unknown function

A large fraction of domains are of unknown function. A domain of unknown function (DUF) is a protein domain that has no characterized function. These families have been collected together in the Pfam database using the prefix DUF followed by a number, with examples being DUF2992 and DUF1220. There are now over 3,000 DUF families within the Pfam database, representing over 20% of known families. Surprisingly, the proportion of DUFs in Pfam has increased from 20% (in 2010) to 22% (in 2019), mostly due to an increasing number of new genome sequences. Pfam release 32.0 (2019) contained 3,961 DUFs.


See also

* Binding domain
* PANDIT, a biological database covering protein domains
* Pfam: database of protein domains
* Protein
** Protein structure
** Protein structure prediction
** Protein structure prediction software
** Protein superfamily
** Protein tandem repeats
** Protein family
** Protein subfamily
* Short linear motif
* Structural biology
* Structural Classification of Proteins (SCOP)
* CATH Protein Structure Classification database


References

''This article incorporates text and figures from George, R. A. (2002) "Predicting Structural Domains in Proteins" Thesis, University College London, which were contributed by its author.''




External links


Structural domain databases


* Conserved Domains at the National Center for Biotechnology website
* 3Dee
* CATH
* DALI
* PFAM clan browser


Sequence domain databases


* InterPro
* PROSITE
* ProDom
* SMART
* NCBI Conserved Domain Database
* SUPERFAMILY: Library of HMMs representing superfamilies and database of (superfamily and family) annotations for all completely sequenced organisms


Functional domain databases


* dcGO: A comprehensive database of domain-centric ontologies on functions, phenotypes and diseases.