Graphical models have become powerful frameworks for protein structure prediction, protein–protein interaction, and free energy calculations for protein structures. Representing a protein structure as a graphical model makes it possible to address many problems, including secondary structure prediction, protein–protein interactions, protein–drug interactions, and free energy calculations.
There are two main approaches to using graphical models in protein structure modeling. The first approach uses discrete variables to represent the coordinates or the dihedral angles of the protein structure. The variables are originally all continuous, so a discretization process is typically applied to transform them into discrete values. The second approach uses continuous variables for the coordinates or dihedral angles.
Discrete graphical models for protein structure
Markov random fields, also known as undirected graphical models, are common representations for this problem. Given an undirected graph ''G'' = (''V'', ''E''), a set of random variables ''X'' = (''X''<sub>''v''</sub>)<sub>''v'' ∈ ''V''</sub> indexed by ''V'' forms a Markov random field with respect to ''G'' if it satisfies the pairwise Markov property:
*any two non-adjacent variables are conditionally independent given all other variables:
:<math>X_u \perp X_v \mid X_{V \setminus \{u,v\}} \quad \text{if } \{u,v\} \notin E</math>
In the discrete model, the continuous variables are discretized into a set of favorable discrete values. If the variables of choice are dihedral angles, the discretization is typically done by mapping each value to the corresponding rotamer conformation.
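The mapping from a continuous dihedral angle to a rotamer state can be sketched as a nearest-center lookup on the circle. This is only an illustration: the three centers near -60°, 60° and 180°, the labels `m`, `p`, `t`, and the helper names are assumptions made here, and a real rotamer library would supply residue-specific values.

```python
# Hypothetical rotamer centers for one side-chain dihedral angle, in degrees.
# The three staggered wells are an illustrative assumption, not library data.
ROTAMER_CENTERS = {"m": -60.0, "p": 60.0, "t": 180.0}


def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    diff = (a - b) % 360.0
    return min(diff, 360.0 - diff)


def discretize_dihedral(angle_deg, centers=ROTAMER_CENTERS):
    """Return the label of the rotamer center closest to angle_deg."""
    return min(centers, key=lambda name: angular_distance(angle_deg, centers[name]))


# Made-up angles; note that -178 degrees wraps around to the 180 degree well.
chi_angles = [-65.2, 58.0, 175.0, -178.0, 10.0]
rotamer_states = [discretize_dihedral(a) for a in chi_angles]
```

Using a circular distance rather than a plain difference matters near ±180°, where two angles that look far apart numerically describe almost the same conformation.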
Model
Let ''X'' = {''X''<sub>''b''</sub>, ''X''<sub>''s''</sub>} be the random variables representing the entire protein structure, where ''X''<sub>''b''</sub> denotes the backbone and ''X''<sub>''s''</sub> the side chains. ''X''<sub>''b''</sub> can be represented by a set of 3-D coordinates of the backbone atoms or, equivalently, by a sequence of bond lengths and dihedral angles. The probability of a particular conformation ''x'' can then be written as:
:<math>p(X = x \mid \Theta) = p(X_b = x_b \mid \Theta)\, p(X_s = x_s \mid X_b = x_b, \Theta)</math>
where ''Θ'' represents any parameters used to describe this model, including sequence information, temperature, etc. Frequently the backbone is assumed to be rigid with a known conformation, and the problem then reduces to a side-chain placement problem. The structure of the graph is also encoded in ''Θ''. This structure shows which pairs of variables are conditionally independent. For example, the side-chain angles of two residues far apart can be independent given all other angles in the protein. To extract this structure, researchers use a distance threshold, and only pairs of residues within that threshold are considered connected (i.e. have an edge between them).
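The distance-threshold rule for building the graph can be sketched as follows. The function name, the choice of one representative point per residue, the coordinates, and the 6.0 Å cutoff are all hypothetical values chosen for illustration, not ones prescribed by the text.

```python
import math
from itertools import combinations


def build_contact_graph(coords, threshold):
    """Return edges (i, j) with i < j for residues whose points lie within threshold."""
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= threshold:
            edges.add((i, j))
    return edges


# Hypothetical representative coordinates, one point per residue, in angstroms.
residue_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (3.8, 5.0, 0.0),
]
contact_edges = build_contact_graph(residue_coords, threshold=6.0)
```

Residue pairs that do not appear in `contact_edges` are treated as conditionally independent given the rest of the structure, which is what keeps the resulting model sparse.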
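Given this representation, the probability of a particular side-chain conformation ''x''<sub>''s''</sub> given the backbone conformation ''x''<sub>''b''</sub> can be expressed as
:<math>p(X_s = x_s \mid X_b = x_b) = \frac{1}{Z} \prod_{c \in C(G)} \Phi_c(x_s^c, x_b^c)</math>
where ''C''(''G'') is the set of all cliques in ''G'', ''Φ''<sub>''c''</sub> is a potential function defined over the variables of clique ''c'', and ''Z'' is the partition function.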
To completely characterize the MRF, it is necessary to define the potential function ''Φ''. To simplify, the cliques of the graph are usually restricted to cliques of size 2, which means the potential function is defined only over pairs of variables. In the Goblin System, these pairwise functions are defined as
:<math>\Phi(x_s^{i_p}, x_s^{j_q}) = \exp\left(-\frac{E(x_s^{i_p}, x_s^{j_q})}{k_B T}\right)</math>
where <math>E(x_s^{i_p}, x_s^{j_q})</math> is the energy of interaction between rotamer state ''p'' of residue ''i'' and rotamer state ''q'' of residue ''j'', ''T'' is the temperature, and ''k''<sub>''B''</sub> is the Boltzmann constant.
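As a rough sketch of how such a pairwise model assigns probabilities, the snippet below turns interaction energies into Boltzmann factors exp(-''E''/(''k''<sub>''B''</sub>''T'')) and normalizes by a brute-force partition function. The energy tables, the three-residue chain, and all function names are made up for illustration. Enumerating every assignment is feasible only for toy systems, which is why approximate inference is used in practice.

```python
import math
from itertools import product

# Boltzmann constant per mole (numerically the gas constant), in kcal/(mol*K).
K_B = 0.0019872041


def pairwise_potential(energy, temperature):
    """Boltzmann factor exp(-E / (k_B * T)) for one pair of rotamer states."""
    return math.exp(-energy / (K_B * temperature))


def unnormalized_probability(states, pair_energies, temperature):
    """Product of pairwise potentials over all edges for one rotamer assignment."""
    weight = 1.0
    for (i, j), table in pair_energies.items():
        weight *= pairwise_potential(table[states[i]][states[j]], temperature)
    return weight


def conformation_probabilities(n_residues, n_rotamers, pair_energies, temperature):
    """Enumerate every assignment and normalize by the partition function Z."""
    assignments = list(product(range(n_rotamers), repeat=n_residues))
    weights = [unnormalized_probability(s, pair_energies, temperature) for s in assignments]
    z = sum(weights)
    return {s: w / z for s, w in zip(assignments, weights)}, z


# Hypothetical energies in kcal/mol; table[p][q] is the interaction energy of
# rotamer p on the first residue of the edge with rotamer q on the second.
pair_energies = {
    (0, 1): [[-1.0, 0.5], [0.2, 0.0]],
    (1, 2): [[0.0, 0.3], [-0.5, 1.0]],
}
probabilities, partition_function = conformation_probabilities(3, 2, pair_energies, 300.0)
most_likely = max(probabilities, key=probabilities.get)
```

Here `most_likely` is the assignment with the lowest total pairwise energy, since lower energy means a larger Boltzmann factor.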
Using a PDB file, this model can be built over the protein structure. From this model, free energy can be calculated.
Free energy calculation: belief propagation
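The free energy of a system is given by
:<math>G = E - TS</math>
where ''E'' is the enthalpy of the system, ''T'' the temperature, and ''S'' the entropy. If we associate a probability ''p''(''x'') with each state of the system (that is, with each conformation ''x'') and use the Gibbs entropy <math>S = -k_B \sum_x p(x) \ln p(x)</math>, ''G'' can be rewritten as
:<math>G = \sum_x p(x) E(x) + k_B T \sum_x p(x) \ln p(x)</math>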
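A small numerical check can make the probabilistic form of the free energy concrete. The sketch below assumes the Gibbs entropy ''S'' = −''k''<sub>''B''</sub> Σ ''p''(''x'') ln ''p''(''x''), so that ''G'' = Σ ''p''(''x'')''E''(''x'') + ''k''<sub>''B''</sub>''T'' Σ ''p''(''x'') ln ''p''(''x''). The four state energies are invented for illustration. For an exact Boltzmann distribution this expression should equal −''k''<sub>''B''</sub>''T'' ln ''Z''.

```python
import math

# Boltzmann constant per mole (numerically the gas constant), in kcal/(mol*K).
K_B = 0.0019872041


def boltzmann_distribution(energies, temperature):
    """Return state probabilities and the partition function Z."""
    weights = [math.exp(-e / (K_B * temperature)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights], z


def free_energy(probabilities, energies, temperature):
    """G = E - T*S with E = sum p*E(x) and S = -k_B * sum p*ln(p)."""
    enthalpy = sum(p * e for p, e in zip(probabilities, energies))
    entropy = -K_B * sum(p * math.log(p) for p in probabilities if p > 0.0)
    return enthalpy - temperature * entropy


# Hypothetical conformation energies in kcal/mol.
state_energies = [-1.0, -0.7, 0.0, 0.2]
temperature = 300.0
p, z = boltzmann_distribution(state_energies, temperature)
g = free_energy(p, state_energies, temperature)
g_from_z = -K_B * temperature * math.log(z)
```

When ''p''(''x'') is only an approximation, for example one produced by message passing, the same `free_energy` function gives the corresponding approximate free energy.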
Calculating ''p''(''x'') on discrete graphs is done by the generalized belief propagation algorithm. This algorithm calculates an approximation to the probabilities, and it is not guaranteed to converge to a final value set. However, in practice, it has been shown to converge successfully in many cases.
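To give a feel for message passing, here is a minimal sketch of ordinary (loopy) sum-product belief propagation on a pairwise model. This is a simpler relative of generalized belief propagation, which passes messages between clusters of variables rather than single nodes. It is not the algorithm used by any particular system; the function names and the toy potentials are assumptions. The `converged` flag reflects the caveat above: on graphs with cycles the iteration may stop at `max_iters` without settling.

```python
from itertools import product


def loopy_belief_propagation(n_nodes, n_states, pair_potentials, max_iters=100, tol=1e-9):
    """Approximate node marginals of a pairwise MRF; returns (beliefs, converged)."""
    # Store each table in both directions so psi[(i, j)][x_i][x_j] always exists.
    psi = {}
    for (i, j), table in pair_potentials.items():
        psi[(i, j)] = table
        psi[(j, i)] = [[table[a][b] for a in range(n_states)] for b in range(n_states)]
    neighbors = {v: [] for v in range(n_nodes)}
    for (i, j) in pair_potentials:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # messages[(i, j)][x_j] is the message from node i to node j, initially uniform.
    messages = {edge: [1.0 / n_states] * n_states for edge in psi}
    converged = False
    for _ in range(max_iters):
        new_messages = {}
        for (i, j) in psi:
            msg = []
            for xj in range(n_states):
                total = 0.0
                for xi in range(n_states):
                    incoming = 1.0
                    for k in neighbors[i]:
                        if k != j:  # exclude the message that came from j itself
                            incoming *= messages[(k, i)][xi]
                    total += psi[(i, j)][xi][xj] * incoming
                msg.append(total)
            norm = sum(msg)
            new_messages[(i, j)] = [m / norm for m in msg]
        delta = max(abs(new_messages[e][s] - messages[e][s])
                    for e in psi for s in range(n_states))
        messages = new_messages
        if delta < tol:
            converged = True
            break

    beliefs = []
    for v in range(n_nodes):
        belief = [1.0] * n_states
        for k in neighbors[v]:
            for s in range(n_states):
                belief[s] *= messages[(k, v)][s]
        norm = sum(belief)
        beliefs.append([value / norm for value in belief])
    return beliefs, converged


def exact_marginals(n_nodes, n_states, pair_potentials):
    """Brute-force marginals by enumeration, for checking small examples."""
    marginals = [[0.0] * n_states for _ in range(n_nodes)]
    for states in product(range(n_states), repeat=n_nodes):
        weight = 1.0
        for (i, j), table in pair_potentials.items():
            weight *= table[states[i]][states[j]]
        for v, s in enumerate(states):
            marginals[v][s] += weight
    return [[m / sum(row) for m in row] for row in marginals]


# Toy three-node chain 0 - 1 - 2 with made-up potentials.
chain_potentials = {
    (0, 1): [[2.0, 1.0], [1.0, 3.0]],
    (1, 2): [[1.0, 4.0], [2.0, 1.0]],
}
beliefs, converged = loopy_belief_propagation(3, 2, chain_potentials)
exact = exact_marginals(3, 2, chain_potentials)
```

On a graph without cycles, such as this chain, the beliefs match the exact marginals. On graphs with cycles they are approximations.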
Continuous graphical models for protein structures
Graphical models can still be used when the variables of choice are continuous. In these cases, the probability distribution is represented as a multivariate probability distribution over continuous variables. Each family of distributions then imposes certain properties on the graphical model.
The multivariate Gaussian distribution is one of the most convenient distributions for this problem. The simple form of the probability and the direct relation with the corresponding graphical model make it a popular choice among researchers.
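The direct relation mentioned above is that, for a Gaussian, a zero entry in the inverse covariance (precision) matrix corresponds to a missing edge, that is, to conditional independence of the two variables given the rest. The sketch below illustrates this with a hypothetical 3×3 precision matrix for a chain of three variables. The matrix values and helper names are assumptions, and the inverse is computed in plain Python only to keep the example dependency-free.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def edges_from_precision(precision, tol=1e-9):
    """Edges of the Gaussian graphical model: nonzero off-diagonal precision entries."""
    n = len(precision)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if abs(precision[i][j]) > tol}


# Hypothetical precision matrix for a chain 0 - 1 - 2: entry (0, 2) is zero.
precision = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]
covariance = invert(precision)
model_edges = edges_from_precision(precision)
```

Variables 0 and 2 are still correlated in `covariance`, yet there is no edge between them: they are independent once variable 1 is known.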
Gaussian graphical models of protein structures
Gaussian graphical models are multivariate probability distributions encoding a network of dependencies among variables. Let