Macromolecular docking is the computational modelling of the
quaternary structure
of
complexes formed by two or more interacting
biological macromolecules.
Protein
–protein complexes are the most commonly attempted targets of such modelling, followed by protein–
nucleic acid
complexes.
The ultimate goal of docking is the prediction of the three-dimensional structure of the macromolecular complex of interest as it would occur in a living organism. Docking itself only produces plausible candidate structures. These candidates must be ranked using methods such as
scoring functions
to identify structures that are most likely to occur in nature.
The term "docking" originated in the late 1970s, with a more restricted meaning; then, "docking" meant refining a model of a complex structure by optimizing the separation between the
interactor
s but keeping their relative orientations fixed. Later, the relative orientations of the interacting partners in the modelling were allowed to vary, but the internal geometry of each of the partners was held fixed. This type of modelling is sometimes referred to as "rigid docking". With further increases in computational power, it became possible to model changes in internal geometry of the interacting partners that may occur when a complex is formed. This type of modelling is referred to as "flexible docking".
Background
The
biological
roles of most proteins, as characterized by which other
macromolecules they interact with, are known at best incompletely. Even those proteins that participate in a well-studied
biological process
(e.g., the
Krebs cycle
) may have unexpected interaction partners or
function
s which are unrelated to that process.
In cases of known protein–protein interactions, other questions arise.
Genetic disease
s (e.g.,
cystic fibrosis
) are known to be caused by misfolded or
mutated
proteins, and there is a desire to understand what, if any, anomalous protein–protein interactions a given mutation can cause. In the distant future, proteins may be designed to perform biological functions, and a determination of the potential interactions of such proteins will be essential.
For any given set of proteins, the following questions may be of interest, from the point of view of technology or natural history:
* Do these proteins bind ''
in vivo
''?
If they do bind,
* What is the spatial configuration which they adopt in their
bound state
?
* How strong or weak is their interaction?
If they do not bind,
* Can they be made to bind by inducing a mutation?
Protein–protein docking is ultimately envisaged to address all these issues. Furthermore, since docking methods can be based on purely
physical
principles, even proteins of unknown function (or which have been studied relatively little) may be docked. The only prerequisite is that their
molecular structure
has been either determined experimentally, or can be estimated by a
protein structure prediction
technique.
Protein–nucleic acid interactions feature prominently in the living cell.
Transcription factors
, which regulate
gene expression
, and
polymerase
s, which
catalyse
replication, are composed of proteins, and the
genetic material
they interact with is composed of nucleic acids. Modeling protein–nucleic acid complexes presents some unique challenges, as described below.
History
In the 1970s, complex modelling revolved around manually identifying features on the surfaces of the interactors, and interpreting the consequences for binding, function and activity; any computer programmes were typically used at the end of the modelling process, to discriminate between the relatively few configurations which remained after all the heuristic constraints had been imposed. The first use of computers was in a study on
hemoglobin
interaction in
sickle-cell
fibres.
This was followed in 1978 by work on the
trypsin
-
BPTI
complex.
Computers discriminated between good and bad models using a scoring function which rewarded large interface area, and pairs of molecules in contact but not occupying the same space. The computer used a simplified representation of the interacting proteins, with one interaction centre for each residue. Favorable
electrostatic
interactions, including
hydrogen bonds
, were identified by hand.
In the early 1990s, more structures of complexes were determined, and available computational power had increased substantially. With the emergence of
bioinformatics
, the focus moved towards developing generalized techniques which could be applied to an arbitrary set of complexes at acceptable computational cost. The new methods were envisaged to apply even in the absence of phylogenetic or experimental clues; any specific prior knowledge could still be introduced at the stage of choosing between the highest ranking output models, or be framed as input if the algorithm catered for it.
1992 saw the publication of the correlation method,
an algorithm which used the
fast Fourier transform
to give a vastly improved scalability for evaluating coarse shape complementarity on rigid-body models. This was extended in 1997 to cover coarse electrostatics.
In 1996 the results of the first blind trial were published,
in which six research groups attempted to predict the complexed structure of
TEM-1 Beta-lactamase with Beta-lactamase
inhibitor protein
(BLIP). The exercise brought into focus the necessity of accommodating conformational change and the difficulty of discriminating between conformers. It also served as the prototype for the CAPRI assessment series, which debuted in 2001.
Rigid-body docking ''vs''. flexible docking
If the
bond angles, bond lengths and torsion angles of the components are not modified at any stage of complex generation, the procedure is known as ''rigid-body docking''. It remains an open question whether rigid-body docking is sufficiently accurate for most docking problems. When substantial conformational change occurs within the components at the time of complex formation, rigid-body docking is inadequate. However, scoring all possible conformational changes is prohibitively expensive in computer time. Docking procedures which permit conformational change, or ''flexible docking'' procedures, must intelligently select a small subset of possible conformational changes for consideration.
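As a minimal sketch of what "rigid" means here, the following Python fragment moves a set of atom coordinates by a rotation about the z axis followed by a translation, and provides a helper for checking that every interatomic distance is unchanged. The function names and the choice of a single rotation axis are illustrative assumptions, not part of any docking package.

```python
import math

def rotate_z(point, angle):
    """Rotate one (x, y, z) point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def rigid_move(coords, angle, shift):
    """Apply the same rotation and translation to every atom.

    Because all atoms receive an identical transform, bond lengths,
    bond angles and torsion angles are left untouched.
    """
    moved = []
    for point in coords:
        x, y, z = rotate_z(point, angle)
        moved.append((x + shift[0], y + shift[1], z + shift[2]))
    return moved

def pair_distances(coords):
    """All interatomic distances, in a fixed order, for comparison."""
    return [
        math.dist(a, b)
        for i, a in enumerate(coords)
        for b in coords[i + 1:]
    ]
```

Comparing `pair_distances(coords)` with `pair_distances(rigid_move(coords, angle, shift))` should give the same values up to rounding error. A flexible docking move, by contrast, would change at least one of them.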
Methods
Successful docking must satisfy two criteria:
*Generating a set of configurations which reliably includes at least one nearly correct one.
*Reliably distinguishing nearly correct configurations from the others.
For many interactions, the binding site is known on one or more of the proteins to be docked. This is the case for
antibodies
and for
competitive inhibitor
s. In other cases, a binding site may be strongly suggested by
mutagenic
or
phylogenetic
evidence. Configurations where the proteins interpenetrate severely may also be ruled out ''a priori''.
After making exclusions based on prior knowledge or
stereochemical
clash, the remaining space of possible complexed structures must be sampled exhaustively, evenly and with a sufficient coverage to guarantee a near hit. Each configuration must be scored with a measure that is capable of ranking a nearly correct structure above at least 100,000 alternatives. This is a computationally intensive task, and a variety of strategies have been developed.
Reciprocal space methods
Each of the proteins may be represented as a simple cubic lattice. Then, for the class of scores which are discrete
convolution
s, configurations related to each other by translation of one protein by an exact lattice vector can all be scored almost simultaneously by applying the
convolution theorem
.
It is possible to construct reasonable, if approximate, convolution-like scoring functions representing both stereochemical and electrostatic fitness.
Reciprocal space methods have been used extensively for their ability to evaluate enormous numbers of configurations. They lose their speed advantage if torsional changes are introduced. Another drawback is that it is impossible to make efficient use of prior knowledge. The question also remains whether convolutions are too limited a class of scoring function to identify the best complex reliably.
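The convolution-theorem idea can be illustrated with a small sketch. It assumes a tiny periodic 4×4×4 lattice, a made-up receptor grid (one penalised core cell and three rewarded surface cells) and a two-cell ligand; all names and values are hypothetical. For brevity a naive discrete Fourier transform stands in for the FFT: the identity being used is the same, only the speed differs, and a real implementation would call an FFT library instead.

```python
import cmath
from itertools import product

def dft3(grid, n, sign):
    """Naive 3D discrete Fourier transform on an n*n*n periodic lattice.

    sign=-1 gives the forward transform, sign=+1 the unnormalised inverse.
    """
    points = list(product(range(n), repeat=3))
    out = {}
    for k in points:
        total = 0j
        for x in points:
            dot = k[0] * x[0] + k[1] * x[1] + k[2] * x[2]
            total += grid[x] * cmath.exp(sign * 2j * cmath.pi * dot / n)
        out[k] = total
    return out

def scores_by_fourier(receptor, ligand, n):
    """Score all lattice translations t at once: S(t) = sum_x R(x) * L(x - t)."""
    fr = dft3(receptor, n, -1)
    fl = dft3(ligand, n, -1)
    # Pointwise product in reciprocal space equals correlation in real space.
    product_k = {k: fr[k] * fl[k].conjugate() for k in fr}
    inverse = dft3(product_k, n, +1)
    return {t: inverse[t].real / n**3 for t in inverse}

def scores_directly(receptor, ligand, n):
    """The same scores computed one translation at a time, for comparison."""
    points = list(product(range(n), repeat=3))
    out = {}
    for t in points:
        out[t] = sum(
            receptor[x] * ligand[tuple((x[i] - t[i]) % n for i in range(3))]
            for x in points
        )
    return out

def make_grid(n, values):
    """Build an n*n*n lattice of zeros with the given cells filled in."""
    grid = {p: 0.0 for p in product(range(n), repeat=3)}
    grid.update(values)
    return grid

N = 4
# Hypothetical receptor: a core cell that penalises overlap, plus surface cells.
RECEPTOR = make_grid(N, {(0, 0, 0): -9.0, (1, 0, 0): 1.0,
                         (0, 1, 0): 1.0, (0, 0, 1): 1.0})
# Hypothetical ligand occupying two neighbouring cells.
LIGAND = make_grid(N, {(0, 0, 0): 1.0, (1, 0, 0): 1.0})
```

Calling `scores_by_fourier(RECEPTOR, LIGAND, N)` and `scores_directly(RECEPTOR, LIGAND, N)` should give the same score for every translation. Translations that place a ligand cell on the core cell score poorly, while those that touch a surface cell without overlap score best. The sketch also shows the limitation noted above: only translations are covered, so each new rotation or torsional change requires the grids to be rebuilt and transformed again.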
Monte Carlo methods
In
Monte Carlo
, an initial configuration is refined by taking random steps which are accepted or rejected based on their induced improvement in score (see the
Metropolis criterion), until a certain number of steps have been tried. The assumption is that convergence to the best structure should occur from a large class of initial configurations, only one of which needs to be considered. Initial configurations may be sampled coarsely, and much computation time can be saved. Because of the difficulty of finding a scoring function which is both highly discriminating for the correct configuration and also converges to the correct configuration from a distance, the use of two levels of refinement, with different scoring functions, has been proposed.
Torsion can be introduced naturally to Monte Carlo as an additional property of each random move.
Monte Carlo methods are not guaranteed to search exhaustively, so that the best configuration may be missed even using a scoring function which would in theory identify it. How severe a problem this is for docking has not been firmly established.
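The accept-or-reject loop can be sketched as follows. This is a toy illustration under stated assumptions, not a docking protocol: higher scores are taken to be better, the configuration is a list of three numbers (two translations and one angle), `toy_score` is a made-up smooth function standing in for a real scoring function, and the `temperature` and `step_size` parameters are hypothetical tuning knobs.

```python
import math
import random

def metropolis_accept(delta, temperature, rng):
    """Metropolis criterion for a score where higher is better.

    Improvements are always accepted; a worsening by |delta| is
    accepted with probability exp(delta / temperature).
    """
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)

def monte_carlo_refine(score, start, n_steps, step_size, temperature, seed=0):
    """Refine one starting configuration by random steps; return the best seen."""
    rng = random.Random(seed)
    current, current_score = list(start), score(start)
    best, best_score = list(current), current_score
    for _ in range(n_steps):
        # Perturb every coordinate; a torsion angle would be one more entry.
        trial = [v + rng.uniform(-step_size, step_size) for v in current]
        trial_score = score(trial)
        if metropolis_accept(trial_score - current_score, temperature, rng):
            current, current_score = trial, trial_score
            if current_score > best_score:
                best, best_score = list(current), current_score
    return best, best_score

def toy_score(config):
    """Hypothetical score peaking at translation (2, -1) and angle 0.5 rad."""
    tx, ty, angle = config
    return -((tx - 2.0) ** 2) - (ty + 1.0) ** 2 - (1.0 - math.cos(angle - 0.5))
```

A call such as `monte_carlo_refine(toy_score, [0.0, 0.0, 0.0], 2000, 0.3, 0.1)` returns the best configuration visited and its score. Because the search is stochastic and stops after a fixed number of steps, nothing guarantees that the global optimum is among the configurations visited.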
Evaluation
Scoring functions
To find a score which forms a consistent basis for selecting the best configuration, studies are carried out on a standard benchmark (see below) of protein–protein interaction cases. Scoring functions are assessed on the rank they assign to the best structure (ideally the best structure should be ranked 1), and on their coverage (the proportion of the benchmark cases for which they achieve an acceptable result).
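The two assessment measures can be written down in a few lines. The sketch below assumes that higher scores are better and that each benchmark case is supplied as a list of model scores together with the index of its best (near-native) model; both conventions, and the function names, are assumptions made for illustration.

```python
def rank_of_best(scores, best_index):
    """1-based rank given to the best structure (1 is ideal)."""
    target = scores[best_index]
    return 1 + sum(1 for s in scores if s > target)

def coverage(benchmark, max_rank):
    """Fraction of benchmark cases whose best structure ranks within max_rank."""
    hits = sum(
        1 for scores, best_index in benchmark
        if rank_of_best(scores, best_index) <= max_rank
    )
    return hits / len(benchmark)
```

What counts as an "acceptable result" is set here by the hypothetical `max_rank` threshold; published assessments may use other acceptance criteria.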
Types of scores studied include:
*
Heuristic
scores based on
residue
contacts.
*Shape complementarity of
molecular surfaces ("stereochemistry").
*Free energies, estimated using parameters from
molecular mechanics
force fields such as
CHARMM
or
AMBER
.
*Phylogenetic desirability of the interacting regions.
*Clustering coefficients.
*Information based cues.
It is usual to create hybrid scores by combining two or more of the categories above in a weighted sum whose weights are optimized on cases from the benchmark. To avoid bias, the benchmark cases used to optimize the weights must not overlap with the cases used to make the final test of the score.
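A weighted-sum hybrid score and a deliberately simple way of fitting its weights might look like the sketch below. Everything here is an assumption for illustration: each model is a tuple of per-category terms (for example a shape term and an electrostatic term), the near-native model is stored first in each case, higher scores are better, and the weights are chosen by brute-force search over a coarse grid rather than by any published optimisation scheme.

```python
from itertools import product

def hybrid_score(terms, weights):
    """Weighted sum of the individual score terms for one model."""
    return sum(w * t for w, t in zip(weights, terms))

def top1_hits(cases, weights):
    """Number of cases whose near-native model (index 0) is the unique top scorer."""
    hits = 0
    for models in cases:
        scores = [hybrid_score(m, weights) for m in models]
        if scores[0] == max(scores) and scores.count(scores[0]) == 1:
            hits += 1
    return hits

def fit_weights(train_cases, grid=(0.0, 0.5, 1.0)):
    """Pick the grid point with the most top-1 hits on the training cases."""
    n_terms = len(train_cases[0][0])
    return max(
        product(grid, repeat=n_terms),
        key=lambda w: top1_hits(train_cases, w),
    )
```

The weights returned by `fit_weights` should then be evaluated with `top1_hits` on cases that were not passed to `fit_weights`, which is the non-overlap requirement described above.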
The ultimate goal in protein–protein docking is to select the ideal ranking solution according to a scoring scheme that would also give an insight into the affinity of the complex. Such a development would drive ''in silico''
protein engineering
,
computer-aided drug design
and/or high-throughput annotation of which proteins bind or not (annotation of
interactome
). Several scoring functions have been proposed for binding affinity / free energy prediction.
However, experimentally determined binding affinities and the predictions of nine commonly used scoring functions have been found to be nearly orthogonal (R<sup>2</sup> ≈ 0).
It was also observed that some components of the scoring algorithms may display better correlation to the experimental binding energies than the full score, suggesting that a significantly better performance might be obtained by combining the appropriate contributions from different scoring algorithms. Experimental methods for the determination of binding affinities are:
surface plasmon resonance
(SPR),
Förster resonance energy transfer
,
radioligand
-based techniques,
isothermal titration calorimetry
(ITC),
microscale thermophoresis
(MST) or spectroscopic measurements and other fluorescence techniques. Textual information from scientific articles can provide useful cues for scoring.
Benchmarks
A benchmark of 84 protein–protein interactions with known complexed structures has been developed for testing docking methods.
The set is chosen to cover a wide range of interaction types, and to avoid repeated features, such as the profile of interactors' structural families according to the
SCOP
database. Benchmark elements are classified into three levels of difficulty (the most difficult containing the largest change in backbone conformation). The protein–protein docking benchmark contains examples of enzyme-inhibitor, antigen-antibody and homomultimeric complexes.
The latest version of protein-protein docking benchmark consists of 230 complexes. A protein-DNA docking benchmark consists of 47 test cases. A protein-RNA docking benchmark was curated as a dataset of 45 non-redundant test cases with complexes solved by
X-ray crystallography
only as well as an extended dataset of 71 test cases with structures derived from
homology modelling
as well. The protein-RNA benchmark has been updated to include more structures solved by
X-ray crystallography
and now it consists of 126 test cases. The benchmarks have a combined dataset of 209 complexes.
A binding affinity benchmark has been based on the protein–protein docking benchmark.
81 protein–protein complexes with known experimental affinities are included; these complexes span over 11 orders of magnitude in terms of affinity. Each entry of the benchmark includes several biochemical parameters associated with the experimental data, along with the method used to determine the affinity. This benchmark was used to assess the extent to which scoring functions could also predict affinities of macromolecular complexes.
This benchmark was post-peer reviewed and significantly expanded.
The new set is diverse in terms of the biological functions it represents, with complexes that involve G-proteins and receptor extracellular domains, as well as antigen/antibody, enzyme/inhibitor, and enzyme/substrate complexes. It is also diverse in terms of the partners' affinity for each other, with K<sub>d</sub> ranging between 10<sup>−5</sup> and 10<sup>−14</sup> M. Nine pairs of entries represent closely related complexes that have a similar structure, but a very different affinity, each pair comprising a cognate and a noncognate assembly. The unbound structures of the component proteins being available, conformation changes can be assessed. They are significant in most of the complexes, and large movements or disorder-to-order transitions are frequently observed. The set may be used to benchmark biophysical models aiming to relate affinity to structure in protein–protein interactions, taking into account the reactants and the conformation changes that accompany the association reaction, instead of just the final product.
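To relate the dissociation constants quoted above to the binding free energies that scoring functions try to predict, the standard thermodynamic relation ΔG = RT ln(K<sub>d</sub>/c°) can be used. The sketch below assumes a temperature of 298.15 K and a standard concentration c° of 1 M; these are conventional choices made for illustration, not values stated by the benchmark.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal / (mol K)

def binding_free_energy(kd_molar, temperature_kelvin=298.15):
    """Standard binding free energy in kcal/mol from a dissociation constant in M.

    Uses dG = R * T * ln(Kd / c0) with the standard concentration c0 = 1 M,
    so tighter binding (smaller Kd) gives a more negative value.
    """
    return GAS_CONSTANT * temperature_kelvin * math.log(kd_molar)
```

Under these assumptions, the two ends of the quoted range, 10<sup>−5</sup> M and 10<sup>−14</sup> M, correspond to roughly −6.8 and −19.1 kcal/mol respectively.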
The CAPRI assessment
The Critical Assessment of PRediction of Interactions
is an ongoing series of events in which researchers throughout the community try to dock the same proteins, as provided by the assessors. Rounds take place approximately every 6 months. Each round contains between one and six target protein–protein complexes whose structures have been recently determined experimentally. The coordinates are held privately by the assessors, with the cooperation of the
structural biologist
s who determined them. The assessment of submissions is
double blind
.
CAPRI attracts a high level of participation (37 groups participated worldwide in round seven) and a high level of interest from the biological community in general. Although CAPRI results are of little statistical significance owing to the small number of targets in each round, the role of CAPRI in stimulating discourse is significant. (The
CASP
assessment is a similar exercise in the field of protein structure prediction).
See also
*
Biomolecular complex
– any biological complex of protein, RNA and DNA, sometimes also including lipids and carbohydrates
*
Docking (molecular)
– small molecule docking to proteins