
Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence, that is, the prediction of its secondary and tertiary structure from its primary structure. Structure prediction is different from the inverse problem of protein design.
Protein structure prediction is one of the most important goals pursued by computational biology and addresses Levinthal's paradox. Accurate structure prediction has important applications in medicine (for example, in drug design) and biotechnology (for example, in novel enzyme design).
Since 1994, the performance of current methods has been assessed every two years in the ''Critical Assessment of Structure Prediction'' (CASP) experiment. A continuous evaluation of protein structure prediction web servers is performed by the community project ''Continuous Automated Model EvaluatiOn'' (CAMEO3D).
Protein structure and terminology
Proteins are chains of amino acids joined together by peptide bonds. Many conformations of this chain are possible due to the rotation of the main chain about the two torsion angles φ and ψ at the Cα atom (see figure). This conformational flexibility is responsible for differences in the three-dimensional structure of proteins. The peptide bonds in the chain are polar, i.e. they have separated positive and negative charges (partial charges) in the carbonyl group, which can act as a hydrogen bond acceptor, and in the NH group, which can act as a hydrogen bond donor. These groups can therefore interact in the protein structure. Proteins consist mostly of 20 different types of L-α-amino acids (the proteinogenic amino acids). These can be classified according to the chemistry of the side chain, which also plays an important structural role. Glycine takes on a special position, as it has the smallest side chain, only one hydrogen atom, and therefore can increase the local flexibility in the protein structure. Cysteine, in contrast, can react with another cysteine residue to form one cystine and thereby form a cross link stabilizing the whole structure.
The protein structure can be considered as a sequence of secondary structure elements, such as α helices and β sheets. In these secondary structures, regular patterns of H-bonds are formed between the main chain NH and CO groups of spatially neighboring amino acids, and the amino acids have similar φ and ψ angles.
The formation of these secondary structures efficiently satisfies the hydrogen bonding capacities of the peptide bonds. The secondary structures can be tightly packed in the protein core in a hydrophobic environment, but they can also be present at the polar protein surface. Each amino acid side chain has a limited volume to occupy and a limited number of possible interactions with other nearby side chains, a situation that must be taken into account in molecular modeling and alignments.
[Yousif, Ragheed Hussam, et al. "Exploring the Molecular Interactions between Neoculin and the Human Sweet Taste Receptors through Computational Approaches." ''Sains Malaysiana'' 49.3 (2020): 517-525.]
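As an illustration of the torsion angles discussed above, the following Python sketch computes a signed dihedral angle from four atomic positions. For residue i, φ is conventionally the torsion defined by the atoms C(i−1), N(i), Cα(i), C(i), and ψ the torsion defined by N(i), Cα(i), C(i), N(i+1). The function name `dihedral` and its helper functions are illustrative assumptions and do not come from any particular library.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees (-180..180) about the p1-p2 bond.

    For phi, pass C(i-1), N(i), CA(i), C(i);
    for psi, pass N(i), CA(i), C(i), N(i+1).
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)  # central bond
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    b1_len = math.sqrt(_dot(b1, b1))
    b1_unit = tuple(c / b1_len for c in b1)
    x = _dot(n1, n2)
    y = _dot(b1_unit, _cross(n1, n2))
    return math.degrees(math.atan2(y, x))
```

Applying `dihedral` to consecutive backbone atoms of a solved structure yields one (φ, ψ) pair per residue; residues within the same helix or sheet would be expected to give similar pairs.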
α-helix

The α-helix is the most abundant type of secondary structure in proteins. The α-helix has 3.6 amino acids per turn with an H-bond formed between every fourth residue; the average length is 10 amino acids (3 turns), or about 15 Å, but varies from 5 to 40 amino acids (1.5 to 11 turns). The alignment of the H-bonds creates a dipole moment for the helix with a resulting partial positive charge at the amino end of the helix. Because this region has free NH2 groups, it will interact with negatively charged groups such as phosphates. The most common location of α-helices is at the surface of protein cores, where they provide an interface with the aqueous environment. The inner-facing side of the helix tends to have hydrophobic amino acids and the outer-facing side hydrophilic amino acids. Thus, roughly every third or fourth amino acid along the chain will tend to be hydrophobic, a pattern that can be quite readily detected. In the leucine zipper motif, a repeating pattern of leucines on the facing sides of two adjacent helices is highly predictive of the motif. A helical-wheel plot can be used to show this repeated pattern. Other α-helices buried in the protein core or in cellular membranes have a higher and more regular distribution of hydrophobic amino acids, and are highly predictive of such structures. Helices exposed on the surface have a lower proportion of hydrophobic amino acids. Amino acid content can be predictive of an α-helical region. Regions richer in alanine (A), glutamic acid (E), leucine (L), and methionine (M) and poorer in proline (P), glycine (G), tyrosine (Y), and serine (S) tend to form an α-helix. Proline destabilizes or breaks an α-helix but can be present in longer helices, forming a bend.
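The periodic hydrophobic pattern described above can be detected numerically. The Python sketch below places residues on a helical wheel at 100° intervals (360° divided by 3.6 residues per turn) and computes a mean hydrophobic moment, a measure of how strongly hydrophobic residues cluster on one face of an ideal helix. The binary hydrophobicity scale in `HYDROPHOBIC` is a deliberate simplification assumed for illustration, and the function names are hypothetical; a real analysis would use a published hydrophobicity scale.

```python
import math

# Assumption: a crude binary scale (+1 hydrophobic, -1 otherwise),
# chosen for illustration only.
HYDROPHOBIC = set("AVILMFWC")

# 360 degrees per turn / 3.6 residues per turn
DEGREES_PER_RESIDUE = 100


def wheel_angles(length):
    """Angular position in degrees of each residue on a helical wheel."""
    return [(i * DEGREES_PER_RESIDUE) % 360 for i in range(length)]


def hydrophobic_moment(sequence):
    """Mean hydrophobic moment of a sequence modelled as an ideal alpha-helix.

    Values near 0 mean no sidedness; larger values mean hydrophobic
    residues are concentrated on one face of the helix.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    x = y = 0.0
    for residue, angle in zip(sequence, wheel_angles(len(sequence))):
        h = 1.0 if residue in HYDROPHOBIC else -1.0
        x += h * math.cos(math.radians(angle))
        y += h * math.sin(math.radians(angle))
    return math.hypot(x, y) / len(sequence)
```

Under these assumptions, an 18-residue stretch of identical residues gives a moment close to zero, because 18 residues cover the wheel evenly, whereas a sequence that alternates hydrophobic and polar residues with the helical period, such as `"LKKLLKKLLKLLKKLLKK"`, gives a clearly larger value.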
β-sheet
β-sheets are formed by H-bonds between an average of 5–10 consecutive amino acids in one portion of the chain with another 5–10 farther down the chain. The interacting regions may be adjacent, with a short loop in between, or far apart, with other structures in between. Every chain may run in the same direction to form a parallel sheet, every other chain may run in the reverse chemical direction to form an antiparallel sheet, or the chains may be parallel and antiparallel to form a mixed sheet. The pattern of H-bonding is different in the parallel and antiparallel configurations. Each amino acid in the interior strands of the sheet forms two H-bonds with neighboring amino acids, whereas each amino acid on the outside strands forms only one bond with an interior strand. Looking across the sheet at right angles to the strands, more distant strands are rotated slightly counterclockwise to form a left-handed twist. The Cα-atoms alternate above and below the sheet in a pleated structure, and the R side groups of the amino acids alternate above and below the pleats. The φ and ψ angles of the amino acids in sheets vary considerably in one region of the Ramachandran plot. It is more difficult to predict the location of β-sheets than of α-helices. The situation improves somewhat when the amino acid variation in multiple sequence alignments is taken into account.
Deltas
Some parts of the protein have fixed three-dimensional structure, but do not form any regular structures. They should not be confused with disordered or unfolded segments of proteins or random coil, an unfolded polypeptide chain lacking any fixed three-dimensional structure. These parts are frequently called "deltas" (''Δ'') because they connect β-sheets and α-helices. Deltas are usually located at the protein surface, and therefore mutations of their residues are more easily tolerated. Having more substitutions, insertions, and deletions in a certain region of a sequence alignment may be an indication of a delta. The positions of introns in genomic DNA may correlate with the locations of loops in the encoded protein. Deltas also tend to have charged and polar amino acids and are frequently a component of active sites.
Protein classification
Proteins may be classified according to both structural and sequence similarity. For structural classification, the sizes and spatial arrangements of the secondary structures described above are compared in known three-dimensional structures. Classification based on sequence similarity was historically the first to be used. Initially, similarity based on alignments of whole sequences was performed. Later, proteins were classified on the basis of the occurrence of conserved amino acid patterns. Databases that classify proteins by one or more of these schemes are available.
In considering protein classification schemes, it is important to keep several observations in mind. First, two entirely different protein sequences from different evolutionary origins may fold into a similar structure. Conversely, the sequence of an ancient gene for a given structure may have diverged considerably in different species while at the same time maintaining the same basic structural features. Recognizing any remaining sequence similarity in such cases may be a very difficult task. Second, two proteins that share a significant degree of sequence similarity, either with each other or with a third sequence, also share an evolutionary origin and should share some structural features as well. However, gene duplication and genetic rearrangements during evolution may give rise to new gene copies, which can then evolve into proteins with new function and structure.
Terms used for classifying protein structures and sequences
The more commonly used terms for evolutionary and structural relationships among proteins are listed below. Many additional terms are used for various kinds of structural features found in proteins. Descriptions of such terms may be found at the CATH Web site, the Structural Classification of Proteins (SCOP) Web site, and a Glaxo Wellcome tutorial on the Swiss bioinformatics Expasy Web site.
;Active site: a localized combination of amino acid side groups within the tertiary (three-dimensional) or quaternary (protein subunit) structure that can interact with a chemically specific substrate and that provides the protein with biological activity. Proteins of very different amino acid sequences may fold into a structure that produces the same active site.
;Architecture: the relative orientations of secondary structures in a three-dimensional structure without regard to whether or not they share a similar loop structure.
;Fold (topology): a type of architecture that also has a conserved loop structure.
;Blocks: a conserved amino acid sequence pattern in a family of proteins. The pattern includes a series of possible matches at each position in the represented sequences, but there are no inserted or deleted positions in the pattern or in the sequences. By way of contrast, sequence profiles are a type of scoring matrix that represents a similar set of patterns that includes insertions and deletions.
;Class: a term used to classify protein domains according to their secondary structural content and organization. Four classes were originally recognized by Levitt and Chothia (1976), and several others have been added in the SCOP database. Three classes are given in the CATH database: mainly-α, mainly-β, and α–β, with the α–β class including both alternating α/β and α+β structures.
;Core: the portion of a folded protein molecule that comprises the hydrophobic interior of α-helices and β-sheets. The compact structure brings together side groups of amino acids into close enough proximity so that they can interact. When comparing protein structures, as in the SCOP database, core is the region common to most of the structures that share a common fold or that are in the same superfamily. In structure prediction, core is sometimes defined as the arrangement of secondary structures that is likely to be conserved during evolutionary change.
;Domain (sequence context): a segment of a polypeptide chain that can fold into a three-dimensional structure irrespective of the presence of other segments of the chain. The separate domains of a given protein may interact extensively or may be joined only by a length of polypeptide chain. A protein with several domains may use these domains for functional interactions with different molecules.
;Family (sequence context): a group of proteins of similar biochemical function that are more than 50% identical when aligned. This same cutoff is still used by the Protein Information Resource (PIR). A protein family comprises proteins with the same function in different organisms (orthologous sequences) but may also include proteins in the same organism (paralogous sequences) derived from gene duplication and rearrangements. If a multiple sequence alignment of a protein family reveals a common level of similarity throughout the lengths of the proteins, PIR refers to the family as a homeomorphic family. The aligned region is referred to as a homeomorphic domain, and this region may comprise several smaller homology domains that are shared with other families. Families may be further subdivided into subfamilies or grouped into superfamilies based on respective higher or lower levels of sequence similarity. The SCOP database reports 1296 families and the CATH database (version 1.7 beta) reports 1846 families.
:When the sequences of proteins with the same function are examined in greater detail, some are found to share high sequence similarity. They are obviously members of the same family by the above criteria. However, others are found that have very little, or even insignificant, sequence similarity with other family members. In such cases, the family relationship between two distant family members A and C can often be demonstrated by finding an additional family member B that shares significant similarity with both A and C. Thus, B provides a connecting link between A and C. Another approach is to examine distant alignments for highly conserved matches.
:At a level of identity of 50%, proteins are likely to have the same three-dimensional structure, and the identical atoms in the sequence alignment will also superimpose within approximately 1 Å in the structural model. Thus, if the structure of one member of a family is known, a reliable prediction may be made for a second member of the family, and the higher the identity level, the more reliable the prediction. Protein structural modeling can be performed by examining how well the amino acid substitutions fit into the core of the three-dimensional structure.
;Family (structural context): as used in the FSSP database (Families of structurally similar proteins) and the DALI/FSSP Web site, two structures that have a significant level of structural similarity but not necessarily significant sequence similarity.
;Fold: similar to a structural motif, but includes a larger combination of secondary structural units in the same configuration. Thus, proteins sharing the same fold have the same combination of secondary structures that are connected by similar loops. An example is the Rossmann fold, comprising several alternating α helices and parallel β strands. In the SCOP, CATH, and FSSP databases, the known protein structures have been classified into hierarchical levels of structural complexity with the fold as a basic level of classification.
;Homologous domain (sequence context): an extended sequence pattern, generally found by sequence alignment methods, that indicates a common evolutionary origin among the aligned sequences. A homology domain is generally longer than a motif. The domain may include all of a given protein sequence or only a portion of the sequence. Some domains are complex and made up of several smaller homology domains that became joined to form a larger one during evolution. A domain that covers an entire sequence is called the homeomorphic domain by PIR (Protein Information Resource).
;Module: a region of conserved amino acid patterns comprising one or more motifs and considered to be a fundamental unit of structure or function. The presence of a module has also been used to classify proteins into families.
;Motif (sequence context): a conserved pattern of amino acids that is found in two or more proteins. In the Prosite catalog, a motif is an amino acid pattern that is found in a group of proteins that have a similar biochemical activity, and that often is near the active site of the protein. Examples of sequence motif databases are the Prosite catalog and the Stanford Motifs Database.
;Motif (structural context): a combination of several secondary structural elements produced by the folding of adjacent sections of the polypeptide chain into a specific three-dimensional configuration. An example is the helix-loop-helix motif. Structural motifs are also referred to as supersecondary structures and folds.
;Position-specific scoring matrix (sequence context, also known as weight or scoring matrix): represents a conserved region in a multiple sequence alignment with no gaps. Each matrix column represents the variation found in one column of the multiple sequence alignment.
;Position-specific scoring matrix, 3D (structural context): represents the amino acid variation found in an alignment of proteins that fall into the same structural class. Matrix columns represent the amino acid variation found at one amino acid position in the aligned structures.
;Primary structure: the linear amino acid sequence of a protein, which chemically is a polypeptide chain composed of amino acids joined by peptide bonds.
;Profile (sequence context): a scoring matrix that represents a multiple sequence alignment of a protein family. The profile is usually obtained from a well-conserved region in a multiple sequence alignment. The profile is in the form of a matrix with each column representing a position in the alignment and each row one of the amino acids. Matrix values give the likelihood of each amino acid at the corresponding position in the alignment. The profile is moved along the target sequence to locate the best scoring regions by a dynamic programming algorithm. Gaps are allowed during matching and a gap penalty is included in this case as a negative score when no amino acid is matched. A sequence profile may also be represented by a hidden Markov model, referred to as a profile HMM.
;Profile (structural context): a scoring matrix that represents which amino acids should fit well and which should fit poorly at sequential positions in a known protein structure. Profile columns represent sequential positions in the structure, and profile rows represent the 20 amino acids. As with a sequence profile, the structural profile is moved along a target sequence to find the highest possible alignment score by a dynamic programming algorithm. Gaps may be included and receive a penalty. The resulting score provides an indication as to whether or not the target protein might adopt such a structure.
;Quaternary structure: the three-dimensional configuration of a protein molecule comprising several independent polypeptide chains.
;Secondary structure: the interactions that occur between the C, O, and NH groups on amino acids in a polypeptide chain to form α-helices, β-sheets, turns, loops, and other forms, and that facilitate the folding into a three-dimensional structure.
;Superfamily: a group of protein families of the same or different lengths that are related by distant yet detectable sequence similarity. Members of a given superfamily thus have a common evolutionary origin. Originally, Dayhoff defined the cutoff for superfamily status as a probability of 10⁻⁶ that the sequences are not related, on the basis of an alignment score (Dayhoff et al. 1978). Proteins with few identities in an alignment of the sequences but with a convincingly common number of structural and functional features are placed in the same superfamily. At the level of three-dimensional structure, superfamily proteins will share common structural features such as a common fold, but there may also be differences in the number and arrangement of secondary structures. The PIR resource uses the term ''homeomorphic superfamilies'' to refer to superfamilies that are composed of sequences that can be aligned from end to end, representing a sharing of a single sequence homology domain, a region of similarity that extends throughout the alignment. This domain may also comprise smaller homology domains that are shared with other protein families and superfamilies. Although a given protein sequence may contain domains found in several superfamilies, thus indicating a complex evolutionary history, sequences will be assigned to only one homeomorphic superfamily based on the presence of similarity throughout a multiple sequence alignment. The superfamily alignment may also include regions that do not align either within or at the ends of the alignment. In contrast, sequences in the same family align well throughout the alignment.
;Supersecondary structure: a term with similar meaning to a structural motif.
;Tertiary structure: the three-dimensional or globular structure formed by the packing together or folding of secondary structures of a polypeptide chain.
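The position-specific scoring matrix defined in the list above can be illustrated with a short Python sketch. It builds a log-odds matrix from a gapless alignment and slides it along a target sequence to find the best-scoring window. Several choices here are assumptions made for illustration and are not taken from the text: a uniform background frequency for the 20 amino acids, a pseudocount of 1 per amino acid, and gapless scanning in place of the gapped dynamic programming used for full profiles. The function names `build_pssm` and `best_window` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from a gapless alignment.

    Assumption: uniform background frequency of 1/20 per amino acid.
    """
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(width):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_window(pssm, target):
    """Slide the matrix along target; return (start, score) of the best window.

    Returns None if the target is shorter than the matrix.
    """
    width = len(pssm)
    best = None
    for start in range(len(target) - width + 1):
        score = sum(pssm[i][target[start + i]] for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

For example, a matrix built from the toy alignment `["GDSGG", "GDSGG", "GDAGG", "GDSGA"]` and scanned over a target that contains `GDSGG` should place the best window at that occurrence. The sketch only handles the 20 standard one-letter codes; other characters in the target raise a `KeyError`.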
Secondary structure
Secondary structure prediction is a set of techniques in bioinformatics that aim to predict the local secondary structures of proteins based only on knowledge of their amino acid sequence. For proteins, a prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands (often termed ''extended'' conformations), or turns. The success of a prediction is determined by comparing it to the results of the DSSP algorithm (or a similar method, e.g. STRIDE) applied to the crystal structure of the protein. Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins.
The best modern methods of secondary structure prediction in proteins were claimed to reach 80% accuracy after using machine learning and sequence alignments; this high accuracy allows the predictions to be used as features for improving fold recognition and ab initio protein structure prediction, classification of structural motifs, and refinement of sequence alignments. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.
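The comparison against DSSP described above is commonly summarized as a three-state per-residue accuracy, often called Q3: the fraction of residues whose predicted state (helix, strand, or other) matches the state derived from the solved structure. The Python sketch below assumes one common reduction of the eight DSSP codes to three states (H, G, I to helix; E, B to strand; everything else to coil). Other reductions are in use, so this mapping and the function names are assumptions rather than a fixed standard.

```python
# Assumption: one common 8-to-3 state reduction of DSSP codes.
DSSP_TO_THREE_STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp):
    """Map a DSSP string to three states: H (helix), E (strand), C (other)."""
    return "".join(DSSP_TO_THREE_STATE.get(code, "C") for code in dssp)


def q3(predicted, observed_dssp):
    """Fraction of residues whose predicted three-state label matches DSSP."""
    observed = reduce_dssp(observed_dssp)
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have equal length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)
```

With a ten-residue prediction `"HHHHCCEEEE"` and a DSSP string `"HHHGTT-EEE"`, nine of the ten reduced states agree, giving a Q3 of 0.9.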
Background
Early methods of secondary structure prediction, introduced in the 1960s and early 1970s, focused on identifying likely alpha helices and were based mainly on helix-coil transition models.
Significantly more accurate predictions that included beta sheets were introduced in the 1970s and relied on statistical assessments based on probability parameters derived from known solved structures. These methods, applied to a single sequence, are typically at most about 60–65% accurate, and often underpredict beta sheets.
Since the 1980s, artificial neural networks have been applied to the prediction of protein structures.
The
evolution
Evolution is the change in the heritable Phenotypic trait, characteristics of biological populations over successive generations. It occurs when evolutionary processes such as natural selection and genetic drift act on genetic variation, re ...
ary
conservation of secondary structures can be exploited by simultaneously assessing many
homologous sequences in a
multiple sequence alignment, by calculating the net secondary structure propensity of an aligned column of amino acids. In concert with larger databases of known protein structures and modern
machine learning
methods such as
neural nets and
support vector machines, these methods can achieve up to 80% overall accuracy in
globular proteins.
The theoretical upper limit of accuracy is around 90%,
partly due to idiosyncrasies in DSSP assignment near the ends of secondary structures, where local conformations vary under native conditions but may be forced to assume a single conformation in crystals due to packing constraints. Moreover, the typical secondary structure prediction methods do not account for the influence of
tertiary structure
on formation of secondary structure; for example, a sequence predicted as a likely helix may still be able to adopt a beta-strand conformation if it is located within a beta-sheet region of the protein and its side chains pack well with their neighbors. Dramatic conformational changes related to the protein's function or environment can also alter local secondary structure.
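The alignment-based idea described above, taking the net propensity of each aligned column, can be sketched as follows. This is a minimal illustration rather than any published method: the function name `column_net_propensity`, the toy alignment, and the helix propensity values are all hypothetical, and gaps are simply skipped.

```python
def column_net_propensity(alignment, propensity, gap="-"):
    """Mean propensity of the non-gap residues in each alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for col in range(length):
        values = [propensity[row[col]] for row in alignment if row[col] != gap]
        # A column made only of gaps carries no signal; report 0.0 for it.
        result.append(sum(values) / len(values) if values else 0.0)
    return result

# Hypothetical helix propensities and a made-up three-sequence alignment.
toy_propensity = {"A": 1.4, "L": 1.2, "G": 0.6, "P": 0.6}
toy_alignment = ["ALG", "AL-", "AAP"]
net = column_net_propensity(toy_alignment, toy_propensity)
```

Averaging over homologs smooths out the noise of any single sequence, which is the intuition behind the accuracy gain reported for alignment-based methods; real predictors feed richer column profiles into a trained model rather than taking a plain mean.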
Historical perspective
To date, over 20 different secondary structure prediction methods have been developed. One of the first algorithms was the
Chou–Fasman method, which relies predominantly on probability parameters determined from relative frequencies of each amino acid's appearance in each type of secondary structure.
The original Chou-Fasman parameters, determined from the small sample of structures solved in the mid-1970s, produce poor results compared to modern methods, though the parameterization has been updated since it was first published. The Chou-Fasman method is roughly 50–60% accurate in predicting secondary structures.
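The propensity parameters mentioned above can be illustrated with a short sketch. It computes, for each residue type, how much more often that residue is found in a given state than the average residue is. The function name `propensities` and the toy data are hypothetical, and the sketch covers only the parameter estimation step, not the rules the full Chou–Fasman method applies on top of the parameters.

```python
from collections import Counter

def propensities(sequences, structures, state="H"):
    """Propensity of each residue type for `state`: P(state | residue) / P(state)."""
    total = Counter()      # occurrences of each residue type
    in_state = Counter()   # occurrences of each residue type observed in `state`
    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and annotation lengths differ")
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    n_all = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("state never observed in the training data")
    background = n_state / n_all
    return {aa: (in_state[aa] / total[aa]) / background for aa in total}

# Made-up training data (H = helix, E = strand, C = coil).
toy_seqs = ["AAGA", "GPAG"]
toy_ss = ["HHCH", "CCHC"]
helix = propensities(toy_seqs, toy_ss)
```

A value above 1 marks a residue that favours the state and a value below 1 one that disfavours it. With so few training structures the estimates are noisy, which is consistent with the poor performance of the original parameters noted above.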
The next notable program was the
GOR method, an
information theory-based method. It uses the more powerful probabilistic technique of
Bayesian inference.
The GOR method takes into account not only the probability of each amino acid having a particular secondary structure, but also the
conditional probability
of the amino acid assuming each structure given the contributions of its neighbors (it does not assume that the neighbors have that same structure). The approach is both more sensitive and more accurate than that of Chou and Fasman because amino acid structural propensities are only strong for a small number of amino acids such as
proline and
glycine. Weak contributions from each of many neighbors can add up to strong effects overall. The original GOR method was roughly 65% accurate and was dramatically more successful in predicting alpha helices than beta sheets, which it frequently mispredicted as loops or disorganized regions.
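The way weak neighbor contributions add up can be sketched with a simplified, GOR-like scorer. This is an illustration under stated assumptions, not the published GOR parameterization: the names `train_gor` and `predict`, the pseudocount, the window size, and the toy training data are all hypothetical. For each state, the score of a position is the sum of log-ratio information terms contributed by every residue in a window around it.

```python
import math
from collections import defaultdict

STATES = "HEC"  # helix, strand, coil

def train_gor(sequences, structures, half_window=8, pseudo=1.0):
    """Return info(s, m, aa) = log(P(s | aa at offset m) / P(s)) estimated from counts."""
    joint = defaultdict(float)    # (state, offset, residue) -> count
    pair = defaultdict(float)     # (offset, residue) -> count
    state_n = defaultdict(float)  # state -> count
    for seq, ss in zip(sequences, structures):
        for i, s in enumerate(ss):
            state_n[s] += 1
            for m in range(-half_window, half_window + 1):
                j = i + m
                if 0 <= j < len(seq):
                    joint[(s, m, seq[j])] += 1
                    pair[(m, seq[j])] += 1
    total = sum(state_n.values())

    def info(s, m, aa):
        # Pseudocounts keep the logarithm finite for unseen combinations.
        p_s = (state_n[s] + pseudo) / (total + pseudo * len(STATES))
        p_s_given = (joint[(s, m, aa)] + pseudo) / (pair[(m, aa)] + pseudo * len(STATES))
        return math.log(p_s_given / p_s)

    return info

def predict(seq, info, half_window=8):
    """Assign each position the state with the largest summed window information."""
    out = []
    for i in range(len(seq)):
        scores = {
            s: sum(info(s, m, seq[i + m])
                   for m in range(-half_window, half_window + 1)
                   if 0 <= i + m < len(seq))
            for s in STATES
        }
        out.append(max(STATES, key=scores.get))
    return "".join(out)

# Made-up training data in which each residue type maps to one state.
toy_seqs = ["AAAAAA", "VVVVVV", "GGGGGG"]
toy_ss = ["HHHHHH", "EEEEEE", "CCCCCC"]
info = train_gor(toy_seqs, toy_ss, half_window=1)
helix_pred = predict("AAAA", info, half_window=1)
strand_pred = predict("VVVV", info, half_window=1)
```

Note that the neighbors contribute through their residue identity only, not through their own predicted state, which matches the description above.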
Another big step forward was the use of
machine learning
methods. The first of these were
artificial neural network
methods, which use solved structures as training sets to identify common sequence motifs associated with particular arrangements of secondary structures. These methods are over 70% accurate in their predictions, although beta strands are still often underpredicted due to the lack of three-dimensional structural information that would allow assessment of
hydrogen bonding
patterns that can promote formation of the extended conformation required for the presence of a complete beta sheet.
PSIPRED and
JPRED are among the best-known neural-network-based programs for protein secondary structure prediction.
Support vector machines have also proven particularly useful for predicting the locations of
turns, which are difficult to identify with statistical methods.
Extensions of machine learning techniques attempt to predict more fine-grained local properties of proteins, such as
backbone dihedral angles in unassigned regions. Both SVMs
and neural networks
have been applied to this problem.
More recently, SPINE-X has been used to predict real-valued torsion angles accurately, and these predictions have been employed successfully for ab initio structure prediction.
Other improvements
Secondary structure formation has been reported to depend on factors other than the protein sequence. For example, secondary structure tendencies have been reported to depend on the local environment,
solvent accessibility of residues,
protein structural class,
and even the organism from which the proteins are obtained.
Based on such observations, some studies have shown that secondary structure prediction can be improved by addition of information about protein structural class,
residue accessible surface area
and also
contact number information.
Tertiary structure
The practical role of protein structure prediction is now more important than ever. Massive amounts of protein sequence data are produced by modern large-scale
DNA
sequencing efforts such as the
Human Genome Project. Despite community-wide efforts in
structural genomics, the output of experimentally determined protein structures—typically by time-consuming and relatively expensive
X-ray crystallography
or
NMR spectroscopy—is lagging far behind the output of protein sequences.
Protein structure prediction remains an extremely difficult and unresolved undertaking. The two main problems are the calculation of
protein free energy and
finding the global minimum of this energy. A protein structure prediction method must explore the space of possible protein structures which is
astronomically large. These problems can be partially bypassed in "comparative" or
homology modeling
and
fold recognition methods, in which the search space is pruned by the assumption that the protein in question adopts a structure that is close to the experimentally determined structure of another homologous protein. In contrast, the
de novo protein structure prediction methods must explicitly resolve these problems. The progress and challenges in protein structure prediction have been reviewed by Zhang.
Before modelling
Most tertiary structure modelling methods, such as Rosetta, are optimized for modelling the tertiary structure of single protein domains. A step called domain parsing, or domain boundary prediction, is usually done first to split a protein into potential structural domains. As with the rest of tertiary structure prediction, this can be done comparatively from known structures or ''ab initio'' with the sequence only (usually by
machine learning, assisted by covariation). The structures for individual domains are docked together in a process called domain assembly to form the final tertiary structure.
''Ab initio'' protein modelling
Energy- and fragment-based methods
''Ab initio'' or ''de novo'' protein modelling methods seek to build three-dimensional protein models "from scratch", i.e., based on physical principles rather than (directly) on previously solved structures. There are many possible procedures that either attempt to mimic
protein folding
or apply some
stochastic
method to search possible solutions (i.e.,
global optimization
of a suitable energy function). These procedures tend to require vast computational resources, and have thus only been carried out for tiny proteins. To predict protein structure ''de novo'' for larger proteins will require better algorithms and larger computational resources like those afforded by either powerful supercomputers (such as
Blue Gene or
MDGRAPE-3) or distributed computing (such as
Folding@home, the
Human Proteome Folding Project and
Rosetta@Home). Although these computational barriers are vast, the potential benefits of structural genomics (by predicted or experimental methods) make ''ab initio'' structure prediction an active research field.
As of 2009, a 50-residue protein could be simulated atom-by-atom on a supercomputer for 1 millisecond.
As of 2012, comparable stable-state sampling could be done on a standard desktop with a new graphics card and more sophisticated algorithms.
Much larger simulation timescales can be achieved using
coarse-grained modeling.
Evolutionary covariation to predict 3D contacts
As sequencing became more commonplace in the 1990s, several groups used protein sequence alignments to predict correlated
mutations, and it was hoped that these coevolved residues could be used to predict tertiary structure (using the analogy to distance constraints from experimental procedures such as
NMR). The assumption is that when single-residue mutations are slightly deleterious, compensatory mutations may occur to restabilize residue-residue interactions.
This early work used what are known as ''local'' methods to calculate correlated mutations from protein sequences, but suffered from indirect false correlations which result from treating each pair of residues as independent of all other pairs.
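A ''local'' covariation statistic of the kind described above can be sketched with mutual information between two alignment columns. The choice of mutual information is an assumption made for illustration, since the text does not name a specific measure; the function name `column_mutual_information` and the toy alignment are likewise hypothetical.

```python
import math
from collections import Counter

def column_mutual_information(alignment, i, j):
    """Mutual information (bits) between columns i and j; rows with a gap in either are skipped."""
    pairs = [(s[i], s[j]) for s in alignment if s[i] != "-" and s[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    col_i = Counter(a for a, _ in pairs)
    col_j = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (col_i[a] * col_j[b]))
    return mi

# Made-up alignment: columns 0 and 1 change together, column 2 is invariant.
toy_alignment = ["AKD", "AKD", "ERD", "ERD"]
mi_covarying = column_mutual_information(toy_alignment, 0, 1)
mi_conserved = column_mutual_information(toy_alignment, 0, 2)
```

Because each column pair is scored in isolation, a chain of direct couplings (A with B, B with C) also raises the score of the indirect pair (A with C). This is the source of the false correlations mentioned above.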
In 2011, a different, this time ''global'', statistical approach demonstrated that predicted coevolved residues were sufficient to predict the 3D fold of a protein, provided that enough sequences are available (more than 1,000 homologous sequences are needed).
The method
EVfold uses no homology modeling, threading or 3D structure fragments and can be run on a standard personal computer even for proteins with hundreds of residues. The accuracy of the contacts predicted using this and related approaches has now been demonstrated on many known structures and contact maps, including the prediction of experimentally unsolved transmembrane proteins.
Comparative protein modeling
Comparative protein modeling uses previously solved structures as starting points, or templates. This is effective because it appears that although the number of actual proteins is vast, there is a limited set of
tertiary
structural motifs to which most proteins belong. It has been suggested that there are only around 2,000 distinct protein folds in nature, though there are many millions of different proteins. Comparative protein modeling can also be combined with evolutionary covariation in structure prediction.
These methods may also be split into two groups:
*
Homology modeling
is based on the reasonable assumption that two
homologous proteins will share very similar structures. Because a protein's fold is more evolutionarily conserved than its amino acid sequence, a target sequence can be modeled with reasonable accuracy on a very distantly related template, provided that the relationship between target and template can be discerned through
sequence alignment. It has been suggested that the primary bottleneck in comparative modelling arises from difficulties in alignment rather than from errors in structure prediction given a known-good alignment.
Unsurprisingly, homology modelling is most accurate when the target and template have similar sequences.
*
Protein threading scans the amino acid sequence of an unknown structure against a database of solved structures. In each case, a
scoring function is used to assess the compatibility of the sequence to the structure, thus yielding possible three-dimensional models. This type of method is also known as 3D-1D fold recognition due to its compatibility analysis between three-dimensional structures and linear protein sequences. This method has also given rise to methods performing an inverse folding search by evaluating the compatibility of a given structure with a large database of sequences, thus predicting which sequences have the potential to produce a given fold.
Modeling of side-chain conformations
Accurate packing of the amino acid
side chains represents a separate problem in protein structure prediction. Methods that specifically address the problem of predicting side-chain geometry include
dead-end elimination and the
self-consistent mean field methods. Low-energy side-chain conformations are usually determined on a rigid polypeptide backbone, using a set of discrete side-chain conformations known as
"rotamers". The methods attempt to identify the set of rotamers that minimizes the model's overall energy.
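The optimization problem stated above can be made concrete with a small sketch. It is an exhaustive search, not dead-end elimination or a mean field method, and is only feasible for a handful of positions. The function name `best_rotamers`, the energy tables, and their values are hypothetical.

```python
from itertools import product

def best_rotamers(self_energy, pair_energy):
    """Return (choice, energy) for the rotamer assignment with the lowest total energy.

    self_energy[i][r]        energy of rotamer r at position i on the fixed backbone
    pair_energy[(i, j)][r][s] interaction energy of rotamer r at i with rotamer s at j
    """
    n = len(self_energy)
    best, best_e = None, float("inf")
    # Enumerate every combination of one rotamer per position.
    for choice in product(*(range(len(e)) for e in self_energy)):
        e = sum(self_energy[i][choice[i]] for i in range(n))
        for (i, j), table in pair_energy.items():
            e += table[choice[i]][choice[j]]
        if e < best_e:
            best, best_e = choice, e
    return best, best_e

# Made-up energies for two positions with two rotamers each.
toy_self = [[0.0, 1.0], [0.5, 0.0]]
toy_pair = {(0, 1): [[5.0, 0.2], [0.0, 3.0]]}
choice, energy = best_rotamers(toy_self, toy_pair)
```

The number of combinations grows exponentially with the number of positions, which is why practical methods prune rotamers that cannot be part of the optimum or approximate the search instead of enumerating it.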
These methods use rotamer libraries, which are collections of favorable conformations for each residue type in proteins. Rotamer libraries may contain information about the conformation, its frequency, and the standard deviations about mean dihedral angles, which can be used in sampling.
Rotamer libraries are derived from
structural bioinformatics or other statistical analysis of side-chain conformations in known experimental structures of proteins, such as by clustering the observed conformations for tetrahedral carbons near the staggered (60°, 180°, −60°) values.
Rotamer libraries can be backbone-independent, secondary-structure-dependent, or backbone-dependent. Backbone-independent rotamer libraries make no reference to backbone conformation, and are calculated from all available side chains of a certain type (for instance, the first example of a rotamer library, done by Ponder and
Richards at Yale in 1987). Secondary-structure-dependent libraries present different dihedral angles and/or rotamer frequencies for
α-helix,
β-sheet, or coil secondary structures.
Backbone-dependent rotamer libraries present conformations and/or frequencies dependent on the local backbone conformation as defined by the backbone dihedral angles
φ and
ψ, regardless of secondary structure.
The modern versions of these libraries as used in most software are presented as multidimensional distributions of probability or frequency, where the peaks correspond to the dihedral-angle conformations considered as individual rotamers in the lists. Some versions are based on very carefully curated data and are used primarily for structure validation, while others emphasize relative frequencies in much larger data sets and are the form used primarily for structure prediction, such as the
Dunbrack rotamer libraries.
Side-chain packing methods are most useful for analyzing the protein's
hydrophobic
core, where side chains are more closely packed; they have more difficulty addressing the looser constraints and higher flexibility of surface residues, which often occupy multiple rotamer conformations rather than just one.
Quaternary structure
In the case of
complexes of two or more proteins, where the structures of the proteins are known or can be predicted with high accuracy,
protein–protein docking methods can be used to predict the structure of the complex. Information on the effect of mutations at specific sites on the affinity of the complex helps in understanding the complex structure and in guiding docking methods.
Software
A great number of software tools for protein structure prediction exist. Approaches include
homology modeling,
protein threading, ''ab initio'' methods,
secondary structure prediction, and transmembrane helix and signal peptide prediction. In particular,
deep learning
based on
long short-term memory has been used for this purpose since 2007, when it was successfully applied to protein homology detection
and to
predict subcellular localization of proteins.
Some recent successful methods based on the
CASP experiments include
I-TASSER,
HHpred and
AlphaFold. In 2021, AlphaFold was reported to perform best.
Knowing the structure of a protein often allows functional prediction as well. For instance, collagen folds into a long, extended, fiber-like chain, which makes it a fibrous protein. Recently, several techniques have been developed to predict protein folding and thus protein structure, for example I-TASSER and AlphaFold.
AI methods
AlphaFold was one of the first AIs to predict protein structures. It was introduced by Google's DeepMind in the 13th CASP competition, which was held in 2018.
AlphaFold relies on a
neural network approach, which directly predicts the 3D coordinates of all non-hydrogen atoms for a given protein using the amino acid sequence and aligned
homologous sequences. The
AlphaFold network consists of a trunk which processes the inputs through repeated layers, and a structure module which introduces an explicit 3D structure.
Earlier neural networks for protein structure prediction used
LSTM.

Since
AlphaFold outputs protein coordinates directly,
it produces predictions in graphics processing unit (GPU) minutes to GPU hours, depending on the length of the protein sequence.
The
European Bioinformatics Institute together with
DeepMind have constructed the AlphaFold – EBI database for predicted protein structures.
Current AI methods and databases of predicted protein structures
AlphaFold2 was introduced in CASP14 and is capable of predicting protein structures to near experimental accuracy.
AlphaFold was swiftly followed by RoseTTAFold
and later by OmegaFold and the ESM Metagenomic Atlas.
Sommer et al. (2022) demonstrated the application of protein structure prediction in genome annotation, specifically in identifying functional protein isoforms using computationally predicted structures, available at https://www.isoform.io. This study highlights the promise of protein structure prediction as a genome annotation tool and presents a practical, structure-guided approach that can be used to enhance the annotation of any genome.
In 2024,
David Baker and
Demis Hassabis were awarded the
Nobel Prize in Chemistry
for their contributions to computational protein modeling, including the development of AlphaFold2, an AI-based model for protein structure prediction. AlphaFold2's accuracy has been evaluated against experimentally determined protein structures using metrics such as
root-mean-square deviation (RMSD). The median RMSD between different experimental structures of the same protein is approximately 0.6 Å, while the median RMSD between AlphaFold2 predictions and experimental structures is around 1 Å. For regions where AlphaFold2 assigns high confidence, the median RMSD is about 0.6 Å, comparable to the variability observed between different experimental structures. However, in low-confidence regions, the RMSD can exceed 2 Å, indicating greater deviations. In proteins with multiple domains connected by flexible linkers, AlphaFold2 predicts individual domain structures accurately but may assign random relative positions to these domains. Additionally, AlphaFold2 does not account for structural constraints such as the membrane plane, sometimes placing protein domains in positions that would physically clash with the membrane.
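The RMSD metric quoted above can be written out in a few lines. This sketch assumes the two coordinate sets list the same atoms in the same order and have already been superimposed; in practice an optimal superposition is computed first. The function name `rmsd` and the toy coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2
        for a, b in zip(coords_a, coords_b)
        for xa, xb in zip(a, b)
    )
    return math.sqrt(squared / len(coords_a))

# Made-up coordinates in angstroms: the second set is shifted by 1 Å along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(model, reference)
```

Because the deviations are squared before averaging, a few badly placed atoms, such as those in a low-confidence loop, can dominate the value. This is one reason the figures above are reported separately for high- and low-confidence regions.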
Evaluation of automatic structure prediction servers
CASP, which stands for Critical Assessment of Techniques for Protein Structure Prediction, is a community-wide experiment for protein structure prediction that has taken place every two years since 1994. CASP provides an opportunity to assess the quality of available human, non-automated methodology (human category) and automatic servers for protein structure prediction (server category, introduced in CASP7).
The
CAMEO3D Continuous Automated Model EvaluatiOn Server evaluates automated protein structure prediction servers on a weekly basis using blind predictions for newly released protein structures. CAMEO publishes the results on its website.
See also
*
Protein design
*
Protein function prediction
*
Protein–protein interaction prediction
*
Gene prediction
*
Protein structure prediction software
*
''De novo'' protein structure prediction
*
Molecular design software
*
Molecular modeling software
*
Modelling biological systems
*
Fragment libraries
*
Lattice proteins
*
Statistical potential
*
Structure atlas of human genome
*
Protein circular dichroism data bank
External links
* Protein Structure Prediction Center, CASP experiments
* ExPASy Proteomics tools – list of prediction tools and servers