The GOR method (short for Garnier–Osguthorpe–Robson) is an information theory-based method for the prediction of secondary structures in proteins. It was developed in the late 1970s shortly after the simpler Chou–Fasman method. Like Chou–Fasman, the GOR method is based on probability parameters derived from empirical studies of known protein tertiary structures solved by X-ray crystallography. However, unlike Chou–Fasman, the GOR method takes into account not only the propensities of individual amino acids to form particular secondary structures, but also the conditional probability of the amino acid to form a secondary structure given that its immediate neighbors have already formed that structure. The method is therefore essentially Bayesian in its analysis.

Method

The GOR method analyzes sequences to predict alpha helix, beta sheet, turn, or random coil secondary structure at each position based on 17-amino-acid sequence windows. The original description of the method included four scoring matrices of size 17×20, whose entries are log-odds scores reflecting the probability of finding each of the 20 amino acids at each of the 17 positions in the window. The four matrices reflect the probabilities of the central, ninth amino acid being in a helical, sheet, turn, or coil conformation. In subsequent revisions to the method, the turn matrix was eliminated due to the high variability of sequences in turn regions (particularly over such a large window). The method was found to work best when at least four contiguous residues had to score as alpha helix for a region to be classified as helical, and at least two contiguous residues for a beta sheet.
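The window scoring described above can be illustrated with a minimal sketch. It assumes three states (helix `H`, sheet `E`, coil `C`, as in the revisions that dropped the turn matrix), sums the log-odds entries over the 17-residue window, and then applies the minimum run lengths of four for helix and two for sheet. The function names, the `toy_matrix` helper, and all matrix values are hypothetical and chosen only for illustration; they are not the published GOR parameters, and falling back to coil for short runs is an assumption.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 17
HALF = WINDOW // 2  # 8 residues on each side of the central (ninth) position

# One dict of log-odds scores per window offset: 17 rows x 20 amino acids.
Matrix = List[Dict[str, float]]


def toy_matrix(favoured: str, weight: float = 1.0) -> Matrix:
    """Hypothetical matrix: favoured residues score +weight at every offset."""
    return [
        {aa: (weight if aa in favoured else 0.0) for aa in AMINO_ACIDS}
        for _ in range(WINDOW)
    ]


def score_position(seq: str, i: int, matrix: Matrix) -> float:
    """Sum log-odds contributions from the 17-residue window centred on i."""
    total = 0.0
    for k in range(WINDOW):
        j = i + k - HALF
        if 0 <= j < len(seq):  # offsets beyond the termini contribute nothing
            total += matrix[k].get(seq[j], 0.0)
    return total


def predict(seq: str, matrices: Dict[str, Matrix]) -> str:
    """Assign each residue the state whose matrix gives the highest score."""
    states = list(matrices)
    return "".join(
        max(states, key=lambda s: score_position(seq, i, matrices[s]))
        for i in range(len(seq))
    )


def enforce_min_run(pred: str, state: str, min_len: int, fallback: str = "C") -> str:
    """Replace runs of `state` shorter than `min_len` with `fallback`."""
    out = list(pred)
    i = 0
    while i < len(pred):
        if pred[i] == state:
            j = i
            while j < len(pred) and pred[j] == state:
                j += 1
            if j - i < min_len:
                out[i:j] = fallback * (j - i)
            i = j
        else:
            i += 1
    return "".join(out)


# Illustrative values only, not real GOR parameters.
matrices = {"H": toy_matrix("AEL"), "E": toy_matrix("VIY"), "C": toy_matrix("GPS")}
raw = predict("AAAAAAAAAAGPGSVIVIVI", matrices)
smoothed = enforce_min_run(enforce_min_run(raw, "H", 4), "E", 2)
print(raw)
print(smoothed)
```

With real parameters, each matrix entry would be estimated from proteins of known structure; the control flow would stay the same.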

Algorithm

The mathematics and algorithm of the GOR method were based on an earlier series of studies by Robson and colleagues, reported mainly in the ''Journal of Molecular Biology'' and ''The Biochemical Journal''. The latter describes the information-theoretic expansions in terms of conditional information measures. The use of the word "simple" in the title of the GOR paper reflected the fact that these earlier methods provided proofs and techniques that were somewhat daunting, being rather unfamiliar in protein science in the early 1970s; even Bayesian methods were then unfamiliar and controversial.

An important feature of these early studies, which survived in the GOR method, was the treatment of the sparse protein sequence data of the early 1970s by expected information measures, that is, expectations on a Bayesian basis considering the distribution of plausible information measure values given the actual frequencies (numbers of observations). The expectation measures resulting from integration over this and similar distributions may now be seen as composed of "incomplete" or extended zeta functions, e.g. z(s, observed frequency) − z(s, expected frequency), with the incomplete zeta function z(s, n) = 1 + (1/2)^s + (1/3)^s + (1/4)^s + … + (1/''n'')^s. The GOR method used s = 1. Also, in the GOR method and the earlier methods, the measure for the contrary state (e.g. ~H for helix H) was subtracted from that for H, and similarly for beta sheet, turns, and coil or loop. Thus the method can be seen as employing a zeta function estimate of log predictive odds. An adjustable decision constant could also be applied, which implies a decision theory approach; the GOR method allowed the option to use decision constants to optimize predictions for different classes of protein.

The expected information measure used as a basis for the information expansion was less important by the time of publication of the GOR method because protein sequence data had become more plentiful, at least for the terms considered at that time. For s = 1, the expression z(s, observed frequency) − z(s, expected frequency) approaches the natural logarithm of (observed frequency / expected frequency) as frequencies increase. However, this measure (including the use of other values of s) remains important in later, more general applications with high-dimensional data, where data for more complex terms in the information expansion are inevitably sparse.
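The convergence of the s = 1 zeta difference to a natural logarithm can be checked numerically. The sketch below assumes both frequencies are positive integers, which is a simplification, since an expected frequency need not be an integer. The function names `incomplete_zeta` and `expected_information` are illustrative and are not taken from the original papers.

```python
import math


def incomplete_zeta(s: float, n: int) -> float:
    """z(s, n) = 1 + (1/2)^s + (1/3)^s + ... + (1/n)^s."""
    return sum((1.0 / k) ** s for k in range(1, n + 1))


def expected_information(observed: int, expected: int, s: float = 1.0) -> float:
    """z(s, observed) - z(s, expected); integer counts are an assumption here."""
    return incomplete_zeta(s, observed) - incomplete_zeta(s, expected)


# Keep the ratio observed/expected fixed at 3/2 while the counts grow.
for scale in (1, 10, 100, 1000):
    observed, expected = 3 * scale, 2 * scale
    print(
        scale,
        round(expected_information(observed, expected), 5),
        round(math.log(observed / expected), 5),
    )
```

For small counts the two quantities differ noticeably, which is where the expected information measure matters. As the counts grow, the difference shrinks and the measure behaves like an ordinary log ratio.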
