Graphical models have become powerful frameworks for protein structure prediction, protein–protein interaction, and free energy calculations for protein structures. Representing a protein structure as a graphical model makes it possible to address problems such as secondary structure prediction, protein–protein interaction, protein–drug interaction, and free energy calculation. There are two main approaches to using graphical models in protein structure modeling. The first approach uses discrete variables to represent the coordinates or the dihedral angles of the protein structure. These variables are originally continuous, so a discretization process is typically applied to transform them into discrete values. The second approach uses continuous variables for the coordinates or dihedral angles.

Discrete graphical models for protein structure

Markov random fields, also known as undirected graphical models, are common representations for this problem. Given an undirected graph ''G'' = (''V'', ''E''), a set of random variables ''X'' = (''X''_''v'')_{''v'' ∈ ''V''} indexed by ''V'' form a Markov random field with respect to ''G'' if they satisfy the pairwise Markov property: any two non-adjacent variables are conditionally independent given all other variables: :

X_u \perp\!\!\!\perp X_v \mid X_{V \setminus \{u,v\}} \quad \text{if } \{u,v\} \notin E.

In the discrete model, the continuous variables are discretized into a set of favorable discrete values. If the variables of choice are dihedral angles, the discretization is typically done by mapping each value to the corresponding rotamer conformation.
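The mapping from a continuous dihedral angle to a discrete rotamer state can be sketched as a nearest-centre lookup on the circle. This is only a minimal illustration: the three centres and their labels below are assumed values chosen for the example, not taken from any rotamer library, and the function names are hypothetical.

```python
# Illustrative rotamer centres for one side-chain dihedral, in degrees.
# Assumption: these three values and labels are made up for the example.
ROTAMER_CENTRES = {"plus60": 60.0, "trans": 180.0, "minus60": -60.0}


def angular_distance(a, b):
    """Smallest absolute difference between two angles in degrees (periodic)."""
    diff = (a - b) % 360.0
    return min(diff, 360.0 - diff)


def discretize(angle_deg, centres=ROTAMER_CENTRES):
    """Return the label of the rotamer centre closest to angle_deg."""
    return min(centres, key=lambda name: angular_distance(angle_deg, centres[name]))


if __name__ == "__main__":
    for chi in (-65.0, 172.0, -175.0, 48.0):
        print(chi, "->", discretize(chi))
```

The periodic distance matters here: an angle of −175° is only 5° away from the centre at 180°, so it should be assigned to that state rather than to the centre at −60°.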

Model

Let ''X'' = {''X''_''b'', ''X''_''s''} be the random variables representing the entire protein structure, where ''X''_''b'' denotes the backbone and ''X''_''s'' the side chains. ''X''_''b'' can be represented by a set of 3-D coordinates of the backbone atoms or, equivalently, by a sequence of bond lengths and dihedral angles. The probability of a particular conformation ''x'' can then be written as: :

p(X = x \mid \Theta) = p(X_b = x_b)\, p(X_s = x_s \mid X_b, \Theta),

where Θ represents any parameters used to describe this model, including sequence information, temperature, etc. Frequently the backbone is assumed to be rigid with a known conformation, and the problem is then transformed into a side-chain placement problem. The structure of the graph is also encoded in Θ. This structure shows which pairs of variables are conditionally independent. For example, the side-chain angles of two residues far apart can be independent given all other angles in the protein. To extract this structure, researchers use a distance threshold, and only pairs of residues within that threshold are considered connected (i.e., have an edge between them). Given this representation, the probability of a particular side-chain conformation ''x''_''s'' given the backbone conformation ''x''_''b'' can be expressed as :

p(X_s = x_s \mid X_b = x_b) = \frac{1}{Z} \prod_{c \in C(G)} \Phi_c (x_s^c, x_b^c)

where ''C''(''G'') is the set of all cliques in ''G'', Φ is a potential function defined over the variables, and ''Z'' is the partition function. To completely characterize the Markov random field (MRF), it is necessary to define the potential function Φ. To simplify, the cliques of a graph are usually restricted to cliques of size 2, which means the potential function is defined only over pairs of variables. In the Goblin system, these pairwise functions are defined as :

\Phi(x_s^{i_p}, x_b^{j_q}) = \exp\left( -E(x_s^{i_p}, x_b^{j_q}) / k_B T \right)

where E(x_s^{i_p}, x_b^{j_q}) is the energy of interaction between rotamer state ''p'' of residue X_i^s and rotamer state ''q'' of residue X_j^s, ''k''_''B'' is the Boltzmann constant, and ''T'' is the temperature. Using a PDB file, this model can be built over the protein structure. From this model, the free energy can be calculated.
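The construction described in this section (edges from a distance threshold, pairwise Boltzmann potentials, and normalization by the partition function) can be sketched on a toy system. Everything concrete below is an assumption made for illustration: the four residue positions, the cutoff, the two rotamer states per residue, and the energy function are invented, and the backbone is treated as fixed, so it enters only through the positions. Real systems would take coordinates from a PDB file and energies from a force field, and brute-force enumeration is feasible only for very small models.

```python
import itertools
import math

# Hypothetical toy data: one representative 3-D point per residue (angstroms).
POSITIONS = {0: (0.0, 0.0, 0.0), 1: (4.0, 0.0, 0.0),
             2: (8.0, 0.0, 0.0), 3: (30.0, 0.0, 0.0)}
ROTAMERS = (0, 1)   # two rotamer states per residue (illustrative)
CUTOFF = 5.0        # distance threshold for an edge (assumed value)
KT = 1.0            # k_B * T, in the same arbitrary units as the energies


def contact_edges(positions, cutoff):
    """Connect residue pairs whose representative points lie within cutoff."""
    return [(i, j) for i, j in itertools.combinations(sorted(positions), 2)
            if math.dist(positions[i], positions[j]) <= cutoff]


def pair_energy(i, p, j, q):
    """Made-up interaction energy: equal rotamer states are favourable.

    The residue indices i and j are ignored in this toy energy.
    """
    return -1.0 if p == q else 0.5


def potential(i, p, j, q, kt=KT):
    """Boltzmann factor exp(-E / k_B T) for one pair of rotamer states."""
    return math.exp(-pair_energy(i, p, j, q) / kt)


def conformation_distribution(positions, cutoff):
    """Brute-force distribution over side-chain states: product of edge potentials / Z."""
    residues = sorted(positions)
    edges = contact_edges(positions, cutoff)
    weights = {}
    for states in itertools.product(ROTAMERS, repeat=len(residues)):
        x = dict(zip(residues, states))
        weights[states] = math.prod(potential(i, x[i], j, x[j]) for i, j in edges)
    z = sum(weights.values())  # partition function
    return {states: w / z for states, w in weights.items()}, z


if __name__ == "__main__":
    print("edges:", contact_edges(POSITIONS, CUTOFF))
    dist, z = conformation_distribution(POSITIONS, CUTOFF)
    print("Z =", z)
    print("most probable:", max(dist, key=dist.get))
```

In this toy system residue 3 lies beyond the cutoff from every other residue, so it has no edges and its state is independent of the rest.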

Free energy calculation: belief propagation

The free energy of a system is given by :

G = E - TS

where ''E'' is the enthalpy of the system, ''T'' the temperature, and ''S'' the entropy. If a probability ''p''(''x'') is associated with each state (conformation) ''x'' of the system, the enthalpy becomes the expected energy and the entropy is ''S'' = −Σ_''x'' ''p''(''x'') ln ''p''(''x'') (in units where ''k''_''B'' = 1), so ''G'' can be rewritten as :

G = \sum_x p(x) E(x) + T \sum_x p(x) \ln p(x)
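As a small numerical check of the relation ''G'' = ''E'' − ''TS'', the sketch below evaluates the free energy of a discrete distribution, taking the entropy to be ''S'' = −Σ ''p'' ln ''p'' in units where ''k''_''B'' = 1. The three energies and the temperature are made-up values. For the Boltzmann distribution the result should equal −''T'' ln ''Z'', and any other distribution should give a value that is no lower.

```python
import math


def boltzmann(energies, temperature):
    """Return p(x) proportional to exp(-E(x)/T) and the partition function Z (k_B = 1)."""
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)
    return [w / z for w in weights], z


def free_energy(probs, energies, temperature):
    """G = E - T*S with E = sum_x p(x) E(x) and S = -sum_x p(x) ln p(x)."""
    mean_energy = sum(p * e for p, e in zip(probs, energies))
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return mean_energy - temperature * entropy


if __name__ == "__main__":
    energies = [0.0, 1.0, 2.5]   # made-up conformation energies
    temperature = 0.8            # made-up temperature
    p, z = boltzmann(energies, temperature)
    print("G from p(x):", free_energy(p, energies, temperature))
    print("-T ln Z    :", -temperature * math.log(z))
```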

Calculating ''p''(''x'') on discrete graphs is done by the generalized belief propagation algorithm. This algorithm calculates an approximation to the probabilities, and it is not guaranteed to converge to a final set of values. However, in practice, it has been shown to converge successfully in many cases.
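To give a feel for message passing, the sketch below implements plain sum-product belief propagation on a pairwise model. This is the simpler relative of generalized belief propagation, not the generalized algorithm itself, and it is not claimed to be the implementation used in any published system. The three-node chain and the potential tables are invented for illustration. Because the chain is a tree, the beliefs should match the exact marginals obtained by enumeration; on graphs with cycles the same updates yield only approximations and may fail to converge.

```python
import itertools
import math

STATES = (0, 1)
NODES = (0, 1, 2)
# Made-up pairwise potential tables for a chain 0 - 1 - 2: PHI[(i, j)][xi][xj]
PHI = {
    (0, 1): [[3.0, 1.0], [1.0, 2.0]],
    (1, 2): [[1.0, 4.0], [2.0, 1.0]],
}


def pair(i, xi, j, xj):
    """Look up the potential of edge (i, j) in either orientation."""
    if (i, j) in PHI:
        return PHI[(i, j)][xi][xj]
    return PHI[(j, i)][xj][xi]


def neighbours(i):
    """Nodes sharing an edge with node i."""
    return [b if a == i else a for a, b in PHI if i in (a, b)]


def belief_propagation(n_iter=20):
    """Synchronous sum-product updates; returns normalized node beliefs."""
    # msgs[(i, j)][xj] is the message from node i to node j, initially uniform.
    msgs = {(i, j): [1.0] * len(STATES)
            for a, b in PHI for i, j in ((a, b), (b, a))}
    for _ in range(n_iter):
        new = {}
        for (i, j) in msgs:
            out = []
            for xj in STATES:
                total = 0.0
                for xi in STATES:
                    incoming = math.prod(
                        msgs[(k, i)][xi] for k in neighbours(i) if k != j)
                    total += pair(i, xi, j, xj) * incoming
                out.append(total)
            norm = sum(out)
            new[(i, j)] = [v / norm for v in out]
        msgs = new
    beliefs = {}
    for i in NODES:
        b = [math.prod(msgs[(k, i)][xi] for k in neighbours(i)) for xi in STATES]
        total = sum(b)
        beliefs[i] = [v / total for v in b]
    return beliefs


def exact_marginals():
    """Node marginals by brute-force enumeration of all joint states."""
    weights = {x: math.prod(pair(i, x[i], j, x[j]) for i, j in PHI)
               for x in itertools.product(STATES, repeat=len(NODES))}
    z = sum(weights.values())
    return {i: [sum(w for x, w in weights.items() if x[i] == s) / z
                for s in STATES] for i in NODES}


if __name__ == "__main__":
    print("BP   :", belief_propagation())
    print("exact:", exact_marginals())
```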

Continuous graphical models for protein structures

Graphical models can still be used when the variables of choice are continuous. In these cases, the probability distribution is represented as a multivariate probability distribution over continuous variables. Each family of distributions then imposes certain properties on the graphical model. The multivariate Gaussian distribution is one of the most convenient distributions for this problem. The simple form of the probability and the direct relation with the corresponding graphical model make it a popular choice among researchers.

Gaussian graphical models of protein structures

Gaussian graphical models are multivariate probability distributions encoding a network of dependencies among variables. Let Θ = [''θ''_1, ''θ''_2, …, ''θ''_''n''] be a set of ''n'' variables, such as ''n'' dihedral angles, and let ''f''(Θ = ''D'') be the value of the probability density function at a particular value ''D''. A multivariate Gaussian graphical model defines this probability as follows: :

f(\Theta=D) = \frac{1}{Z} \exp\left\{ -\frac{1}{2} (D-\mu)^T \Sigma^{-1} (D-\mu) \right\}

where Z = (2\pi)^{n/2} |\Sigma|^{1/2} is the closed form of the partition function. The parameters of this distribution are μ and Σ. The vector μ contains the mean value of each variable, and Σ^{−1}, the inverse of the covariance matrix, is known as the precision matrix. The precision matrix contains the pairwise dependencies between the variables. A zero entry in Σ^{−1} means that, conditioned on the values of the other variables, the two corresponding variables are independent of each other.

To learn the graph structure as a multivariate Gaussian graphical model, we can use either L1 regularization or neighborhood selection algorithms. These algorithms simultaneously learn a graph structure and the edge strengths of the connected nodes. An edge strength corresponds to the potential function defined on the corresponding two-node clique. We use a training set of PDB structures to learn μ and Σ^{−1}.

Once the model is learned, we can repeat the same step as in the discrete case to get the density functions at each node, and use the analytical form to calculate the free energy. Here, the partition function already has a closed form, so inference, at least for Gaussian graphical models, is trivial. If the analytical form of the partition function is not available, particle filtering or expectation propagation can be used to approximate ''Z'', and then perform the inference and calculate the free energy.
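The link between zeros in the precision matrix and missing edges, and the closed-form partition function, can be illustrated on a three-variable example. The precision matrix `K` and mean `MU` below are hypothetical values chosen so that the graph is a chain 0 - 1 - 2; learning them from data (for example with L1 regularization) is not shown. The helper functions are written for tiny matrices only.

```python
import math


def inverse(m):
    """Gauss-Jordan inverse of a small square matrix given as a list of lists."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def determinant(m):
    """Determinant by cofactor expansion; adequate for tiny matrices."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * determinant([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


def gaussian_density(d, mu, precision):
    """f(D) = exp(-0.5 (D-mu)^T K (D-mu)) / Z, with Z = (2 pi)^(n/2) |Sigma|^(1/2)."""
    n = len(mu)
    diff = [a - b for a, b in zip(d, mu)]
    quad = sum(diff[i] * precision[i][j] * diff[j]
               for i in range(n) for j in range(n))
    # |Sigma| = 1 / |K| because Sigma is the inverse of the precision matrix K.
    z = (2.0 * math.pi) ** (n / 2.0) * determinant(precision) ** -0.5
    return math.exp(-0.5 * quad) / z


# Hypothetical precision matrix for three variables on a chain 0 - 1 - 2.
K = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
MU = [0.0, 0.0, 0.0]

SIGMA = inverse(K)
# Edges of the graphical model are the non-zero off-diagonal entries of K.
EDGES = [(i, j) for i in range(3) for j in range(i + 1, 3) if K[i][j] != 0.0]

if __name__ == "__main__":
    print("edges:", EDGES)
    print("covariance between variables 0 and 2:", SIGMA[0][2])
    print("density at the mean:", gaussian_density(MU, MU, K))
```

Variables 0 and 2 have no edge because the corresponding precision entry is zero, yet their covariance is non-zero: they are dependent marginally but independent once variable 1 is given.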

References

* Time Varying Undirected Graphs, Shuheng Zhou, John D. Lafferty and Larry A. Wasserman, COLT 2008
* Free Energy Estimates of All-atom Protein Structures Using Generalized Belief Propagation, Hetunandan Kamisetty, Eric P. Xing and Christopher J. Langmead, RECOMB 2008

External links

* http://www.liebertonline.com/doi/pdf/10.1089/cmb.2007.0131
* https://web.archive.org/web/20110724225908/http://www.learningtheory.org/colt2008/81-Zhou.pdf
* Predicting Protein Folds with Structural Repeats Using a Chain Graph Model