Statistical coupling analysis

Statistical coupling analysis (SCA) is a method used in bioinformatics to study how pairs of amino acids in a protein sequence evolve together. It analyzes a multiple sequence alignment (MSA), which is a display of the sequences of many related proteins arranged to highlight similarities and differences. SCA measures how much the amino acid makeup at one position in the protein changes when the amino acid makeup at another position is altered. This relationship is quantified as statistical coupling energy. A higher coupling energy indicates that the amino acids at both positions are more likely to have co-evolved and are therefore functionally or structurally linked. In simpler terms, it helps scientists understand which parts of a protein are working together and how they have changed over evolutionary time.
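
As a rough illustration of this idea, the sketch below computes per-column amino acid frequencies for a tiny, made-up alignment, then repeats the calculation on the subset of sequences that carry a chosen residue at another column. The alignment `TOY_MSA` and the helper names `column_frequencies` and `subalignment` are hypothetical, introduced only for this example; real alignments contain far more sequences and positions.

```python
from collections import Counter

# Hypothetical toy alignment: 5 sequences, 3 aligned positions (columns 0-2).
TOY_MSA = [
    "VIH",
    "VIH",
    "LHH",
    "LHH",
    "VMH",
]


def column_frequencies(msa, col):
    """Return the fraction of sequences carrying each residue at column `col`."""
    counts = Counter(seq[col] for seq in msa)
    return {aa: count / len(msa) for aa, count in counts.items()}


def subalignment(msa, col, residue):
    """Keep only the sequences that have `residue` at column `col`."""
    return [seq for seq in msa if seq[col] == residue]


# Makeup of column 0 in the full alignment.
full = column_frequencies(TOY_MSA, 0)

# Makeup of column 0 after restricting column 1 to isoleucine ("I").
restricted = column_frequencies(subalignment(TOY_MSA, 1, "I"), 0)

print("full alignment:", full)
print("subalignment:  ", restricted)
```

In this toy case the makeup of column 0 shifts once column 1 is restricted, which is the kind of dependence that SCA quantifies.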


Definition of statistical coupling energy

Statistical coupling energy measures how a perturbation of the amino acid distribution at one site in an MSA affects the amino acid distribution at another site. For example, consider a multiple sequence alignment with sites (or columns) ''a'' through ''z'', where each site has some distribution of amino acids. At position ''i'', 60% of the sequences have a valine and the remaining 40% have a leucine; at position ''j'' the distribution is 40% isoleucine, 40% histidine and 20% methionine; ''k'' has an average distribution (the 20 amino acids are present at roughly the same frequencies seen in all proteins); and ''l'' has 80% histidine and 20% valine. Since positions ''i'', ''j'' and ''l'' have an amino acid distribution different from the mean distribution observed in all proteins, they are said to have some degree of conservation. In statistical coupling analysis, the conservation ($\Delta G^{stat}$) at each site ''i'' is defined as:
$$\Delta G_i^{stat} = \sqrt{\sum_x \left(\ln P_i^x\right)^2}.$$
Here, $P_i^x$ describes the probability of finding amino acid ''x'' at position ''i'', and is defined by a function in binomial form as follows:
$$P_i^x = \frac{N!}{n_x!\,(N - n_x)!}\, p_x^{n_x} (1 - p_x)^{N - n_x},$$
where $N$ is 100, $n_x$ is the percentage of sequences with residue ''x'' (e.g. methionine) at position ''i'', and $p_x$ corresponds to the approximate distribution of amino acid ''x'' in all positions among all sequenced proteins. The summation runs over all 20 amino acids. After $\Delta G_i^{stat}$ is computed, the conservation for position ''i'' in a subalignment produced after a perturbation of the amino acid distribution at ''j'' ($\Delta G_{i \mid \delta j}^{stat}$) is taken. Statistical coupling energy, denoted $\Delta\Delta G_{i,j}^{stat}$, is simply the difference between these two values. That is:
$$\Delta\Delta G_{i,j}^{stat} = \Delta G_{i \mid \delta j}^{stat} - \Delta G_i^{stat},$$
or, more commonly,
$$\Delta\Delta G_{i,j}^{stat} = \sqrt{\sum_x \left(\ln P_{i \mid \delta j}^x - \ln P_i^x\right)^2}.$$
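
The conservation measure can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes a uniform background frequency of 1/20 for every amino acid (real analyses use observed frequencies across sequenced proteins), and the names `binomial_probability` and `conservation` are hypothetical.

```python
from math import comb, log, sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption for illustration only: every amino acid has background frequency 1/20.
BACKGROUND = {aa: 1 / 20 for aa in AMINO_ACIDS}

N = 100  # site distributions are expressed as whole percentages


def binomial_probability(n_x, p_x, n_total=N):
    """Binomial probability of seeing residue x in n_x of n_total sequences."""
    return comb(n_total, n_x) * p_x ** n_x * (1 - p_x) ** (n_total - n_x)


def conservation(percentages):
    """Square root of the summed squared log-probabilities over all 20 amino acids.

    `percentages` maps one-letter codes to integer percentages;
    amino acids that are absent from the mapping count as 0%.
    """
    total = 0.0
    for aa in AMINO_ACIDS:
        n_x = percentages.get(aa, 0)
        total += log(binomial_probability(n_x, BACKGROUND[aa])) ** 2
    return sqrt(total)


# Sites from the running example.
site_i = {"V": 60, "L": 40}
site_l = {"H": 80, "V": 20}
site_k = {aa: 5 for aa in AMINO_ACIDS}  # matches the assumed uniform background

for name, site in [("i", site_i), ("l", site_l), ("k", site_k)]:
    print(name, round(conservation(site), 2))
```

Under these assumptions, the skewed sites ''i'' and ''l'' score higher than the background-like site ''k''.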
Statistical coupling energy is often systematically calculated between a fixed, perturbed position and all other positions in an MSA. Continuing with the example MSA from the beginning of the section, consider a perturbation at position ''j'' where the amino acid distribution changes from 40% I, 40% H, 20% M to 100% I. If, in the resulting subalignment, this changes the distribution at ''i'' from 60% V, 40% L to 90% V, 10% L, but does not change the distribution at position ''l'', then there would be some amount of statistical coupling energy between ''i'' and ''j'' but none between ''l'' and ''j''.
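
The perturbation example can be worked through numerically. The sketch below uses the second (vector-difference) form of the coupling energy and again assumes a uniform background frequency of 1/20, which is a simplification. The function names are hypothetical, and the binomial term is evaluated in log space with `math.lgamma`.

```python
from math import lgamma, log, sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption for illustration only: uniform background frequency for every amino acid.
BACKGROUND = {aa: 1 / 20 for aa in AMINO_ACIDS}

N = 100  # site distributions are expressed as whole percentages


def log_binomial(n_x, p_x, n_total=N):
    """Natural log of the binomial probability, computed via log-gamma."""
    log_coeff = lgamma(n_total + 1) - lgamma(n_x + 1) - lgamma(n_total - n_x + 1)
    return log_coeff + n_x * log(p_x) + (n_total - n_x) * log(1 - p_x)


def log_probabilities(percentages):
    """Log-probability for each of the 20 amino acids at one site."""
    return [log_binomial(percentages.get(aa, 0), BACKGROUND[aa]) for aa in AMINO_ACIDS]


def coupling_energy(before, after):
    """Norm of the difference between log-probability vectors after and before."""
    lp_before = log_probabilities(before)
    lp_after = log_probabilities(after)
    return sqrt(sum((a - b) ** 2 for a, b in zip(lp_after, lp_before)))


# Site i shifts from 60% V / 40% L to 90% V / 10% L when j is fixed to isoleucine.
i_before = {"V": 60, "L": 40}
i_after = {"V": 90, "L": 10}

# Site l keeps the same distribution in the subalignment.
l_before = {"H": 80, "V": 20}
l_after = dict(l_before)

ddg_ij = coupling_energy(i_before, i_after)
ddg_lj = coupling_energy(l_before, l_after)

print("coupling between i and j:", ddg_ij)
print("coupling between l and j:", ddg_lj)
```

Because the distribution at ''l'' is unchanged, its coupling energy with ''j'' is zero, while the shift at ''i'' yields a positive value.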


Applications

Ranganathan and Lockless originally developed SCA to examine thermodynamic (energetic) coupling of residue pairs in proteins. Using the PDZ domain family, they were able to identify a small network of residues that were energetically coupled to a binding site residue. The network consisted of both residues spatially close to the binding site in the tertiary fold, called contact pairs, and more distant residues that participate in longer-range energetic interactions. Later applications of SCA by the Ranganathan group on the GPCR, serine protease and hemoglobin families also showed energetic coupling in sparse networks of residues that cooperate in allosteric communication.

Statistical coupling analysis has also been used as a basis for computational protein design. In 2005, Socolich et al. used an SCA for the WW domain to create artificial proteins with similar thermodynamic stability and structure to natural WW domains. The fact that 12 out of the 43 designed proteins with the same SCA profile as natural WW domains properly folded provided strong evidence that little information (only coupling information) was required for specifying the protein fold. This support for the SCA hypothesis was made more compelling considering that a) the successfully folded proteins had only 36% average sequence identity to natural WW folds, and b) none of the artificial proteins designed without coupling information folded properly. An accompanying study showed that the artificial WW domains were functionally similar to natural WW domains in ligand binding affinity and specificity.

In ''de novo'' protein structure prediction, it has been shown that, when combined with a simple residue-residue distance metric, SCA-based scoring can fairly accurately distinguish native from non-native protein folds.


See also

Mutual information


External links


What is a WW domain?


* http://www.pandasthumb.org/archives/2005/10/protein-folding.html Protein folding — a step closer? - A summary of the Ranganathan lab's SCA-based design of artificial yet functional WW domains.

