Protein Fragment Library

Protein backbone fragment libraries have been used successfully in a variety of structural biology applications, including homology modeling [Kolodny, R., Guibas, L., Levitt, M., and Koehl, P. (2005, March). Inverse Kinematics in Biology: The Protein Loop Closure Problem. The International Journal of Robotics Research 24(2-3), 151-163.], de novo structure prediction [Simons, K., Kooperberg, C., Huang, E., and Baker, D. (1997). Assembly of Protein Tertiary Structures from Fragments with Similar Local Sequences using Simulated Annealing and Bayesian Scoring Functions. J Mol Biol 268, 209-225.] [Bujnicki, J. (2006). Protein Structure Prediction by Recombination of Fragments. ChemBioChem 7, 19-27.] [Li, S. et al. (2008). Fragment-HMM: A New Approach to Protein Structure Prediction. Protein Science 17, 1925-1934.], and structure determination [DiMaio, F., Shavlik, J., and Phillips, G. (2006). A probabilistic approach to protein backbone tracing in electron density maps. Bioinformatics 22(14), 81-89.]. By reducing the complexity of the search space, these fragment libraries enable more rapid search of conformational space, leading to more efficient and accurate models.


Motivation

Proteins can adopt an exponential number of states when modeled discretely. Typically, a protein's conformation is represented as a set of dihedral angles, bond lengths, and bond angles between all connected atoms. The most common simplification is to assume ideal bond lengths and bond angles. However, this still leaves the phi-psi angles of the backbone and up to four dihedral angles for each side chain, that is, up to six dihedral angles per residue. This leads to a worst-case complexity of ''k''^(6''n'') possible states of the protein, where ''n'' is the number of residues and ''k'' is the number of discrete states modeled for each dihedral angle. To reduce the conformational space, one can use protein fragment libraries rather than explicitly modeling every phi-psi angle.

Fragments are short segments of the peptide backbone, typically 5 to 15 residues long, and do not include the side chains. They may specify the positions of just the C-alpha atoms in a reduced-atom representation, or of all the backbone heavy atoms (N, C-alpha, carbonyl C, O). To model discrete states of a side chain, one can instead use a rotamer library approach. [Canutescu, A., Shelenkov, A., and Dunbrack, R. (2003). A graph theory algorithm for protein side-chain prediction. Protein Sci. 12, 2001-2014.]

The fragment approach operates under the assumption that local interactions play a large role in stabilizing the overall protein conformation. In any short sequence, the molecular forces constrain the structure, leading to only a small number of possible conformations, which can be modeled by fragments. Indeed, according to Levinthal's paradox, a protein could not possibly sample all possible conformations within a biologically reasonable amount of time. Locally stabilized structures would reduce the search space and allow proteins to fold on the order of milliseconds.


Construction

Libraries of these fragments are constructed from an analysis of the Protein Data Bank (PDB). First, a representative subset of the PDB is chosen; it should cover a diverse array of structures, preferably at good resolution. Then, for each structure, every set of ''n'' consecutive residues is taken as a sample fragment. The samples are then clustered into ''k'' groups according to how similar they are to each other in spatial configuration, using algorithms such as ''k''-means clustering. The parameters ''n'' and ''k'' are chosen according to the application (see discussion on complexity below). The centroid of each cluster is then taken to represent the fragments in that cluster. Because a centroid is derived by averaging other geometries, further optimization can be performed to ensure that it possesses ideal bond geometry. [Kolodny, R., Koehl, P., Guibas, L., and Levitt, M. (2002). Small Libraries of Protein Fragments Model Native Protein Structures Accurately. J Mol Biol 323, 297-307.]

Because the fragments are derived from structures that exist in nature, the segment of backbone they represent will have realistic bonding geometries. This avoids having to explore the full space of conformation angles, much of which would lead to unrealistic geometries.

The clustering above can be performed without regard to the identities of the residues, or it can be residue-specific. That is, for any given input sequence of amino acids, a clustering can be derived using only those samples in the PDB whose ''n''-residue fragment has the same sequence. This requires more computational work than deriving a sequence-independent fragment library but can potentially produce more accurate models. However, a larger sample set is required, and one may not achieve full coverage.
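The construction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any cited paper: each chain is assumed to be given as a list of C-alpha coordinates, fragment similarity is measured on pairwise C-alpha distances (a simple rotation-invariant stand-in for superposition-based RMSD, and one that cannot distinguish mirror images), and each cluster is represented by the sample nearest its centroid rather than by an averaged and re-idealized structure. All function names and the toy coordinates are hypothetical.

```python
import math
import random

def windows(ca_coords, n):
    """Return every run of n consecutive residues from one chain."""
    return [ca_coords[i:i + n] for i in range(len(ca_coords) - n + 1)]

def descriptor(fragment):
    """Pairwise C-alpha distances: unchanged by rotation or translation."""
    return [math.dist(a, b)
            for i, a in enumerate(fragment)
            for b in fragment[i + 1:]]

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def build_library(chains, n, k):
    """Cluster all n-residue windows; return one representative per cluster."""
    samples = [w for chain in chains for w in windows(chain, n)]
    points = [descriptor(w) for w in samples]
    centroids, labels = kmeans(points, k)
    library = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            # Assumption: use the real sample closest to the centroid.
            nearest = min(members, key=lambda i: math.dist(points[i], centroids[c]))
            library.append(samples[nearest])
    return library

# Toy input (hypothetical): one fully extended chain and one helix-like chain.
extended = [(3.8 * i, 0.0, 0.0) for i in range(8)]
helix = [(2.3 * math.cos(math.radians(100 * i)),
          2.3 * math.sin(math.radians(100 * i)),
          1.5 * i) for i in range(8)]
library = build_library([extended, helix], n=5, k=2)
```

With this toy input the two representatives are one extended and one helical 5-residue fragment. Choosing the nearest real sample sidesteps the bond-geometry repair that an averaged centroid would need, at the cost of a representative that is not exactly the cluster mean.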


Example use: loop modeling

In homology modeling, a common application of fragment libraries is to model the loops of the structure. Typically, the alpha helices and beta sheets are threaded against a template structure, but the loops in between are not specified and need to be predicted. Finding the loop with the optimal configuration is NP-hard. To reduce the conformational space that needs to be explored, one can model the loop as a series of overlapping fragments. The space can then be sampled or, if it is now small enough, exhaustively enumerated.

One approach for exhaustive enumeration goes as follows. Loop construction begins by aligning every fragment in the library so that it overlaps the three residues at the N terminus of the loop (the anchor point). Then every possible choice of second fragment is aligned to every possible choice of first fragment, such that the last three residues of the first fragment overlap the first three residues of the second. This ensures that the fragment chain forms realistic angles both within each fragment and between fragments. The process is repeated until a loop with the correct number of residues has been constructed.

The loop must both begin at the anchor on the N side and end at the anchor on the C side. Each candidate loop must therefore be tested to see whether its last few residues overlap the C-terminal anchor. Of the exponentially many candidate loops, very few will close. After filtering out the loops that do not close, one must then determine which remaining loop has the optimal configuration, taken to be the one with the lowest energy under some molecular mechanics force field.
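The enumeration described above can be sketched as a recursive search. The sketch below is deliberately simplified and rests on several assumptions: each residue is reduced to a discrete state label instead of 3D coordinates, so "overlap" means that the labels match exactly rather than that the atoms superimpose; the loop length counts the N-terminal anchor residues; and the energy function is a hypothetical placeholder for a molecular mechanics force field. The names `enumerate_loops`, `closed_loops_by_energy`, and `toy_energy` are illustrative only.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Chain = Tuple[str, ...]  # one conformational-state label per residue

def enumerate_loops(library: Sequence[Chain], n_anchor: Chain,
                    length: int, overlap: int = 3) -> Iterator[Chain]:
    """Yield every chain of `length` residues grown from the N-terminal anchor."""
    def extend(chain: Chain) -> Iterator[Chain]:
        if len(chain) >= length:
            if len(chain) == length:
                yield chain
            return  # chains that overshoot the target length are discarded
        tail = chain[-overlap:]
        for fragment in library:
            # A fragment may follow only if its first residues match the tail.
            if fragment[:overlap] == tail:
                yield from extend(chain + fragment[overlap:])
    yield from extend(tuple(n_anchor))

def closed_loops_by_energy(candidates: Sequence[Chain], c_anchor: Chain,
                           energy: Callable[[Chain], float]) -> List[Chain]:
    """Keep candidates that end on the C-terminal anchor, lowest energy first."""
    c_anchor = tuple(c_anchor)
    closed = [c for c in candidates if c[-len(c_anchor):] == c_anchor]
    return sorted(closed, key=energy)

# Toy library of 5-residue fragments over the labels H, E, L (hypothetical).
toy_library = [
    ("L", "L", "L", "H", "H"),
    ("L", "L", "L", "E", "E"),
    ("L", "H", "H", "E", "E"),
    ("L", "H", "H", "L", "L"),
    ("L", "E", "E", "L", "L"),
]

def toy_energy(chain: Chain) -> float:
    """Placeholder score (not a force field): penalize each E state."""
    return float(chain.count("E"))

candidates = list(enumerate_loops(toy_library, ("L", "L", "L"), length=7))
ranked = closed_loops_by_energy(candidates, ("L", "L"), toy_energy)
```

Here three candidate chains of length 7 are generated, two of them end on the C-terminal anchor, and the first entry of `ranked` is the closed loop with the lowest placeholder energy. A coordinate-based version would replace the label comparison with a superposition of the overlapping residues and a distance tolerance for closure.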


Complexity

The complexity of the state space is still exponential in the number of residues, even after using fragment libraries. However, the exponent is reduced. For a library of ''L'' fragments, each an ''F''-mer, used to model a chain of ''N'' residues with consecutive fragments overlapping by 3 residues, there will be ''L''^(''N''/(''F''-3)+1) possible chains. For typical library sizes this is much less than the ''K''^''N'' possibilities that arise when the phi-psi angles of each residue are modeled explicitly as ''K'' possible combinations, because the exponent grows more slowly than ''N''.

The complexity increases with ''L'', the size of the fragment library. However, libraries with more fragments capture a greater diversity of fragment structures, so there is a trade-off between the accuracy of the model and the speed of exploring the search space. This choice governs the number of clusters ''k'' used when performing the clustering, since each cluster contributes one fragment to the library. Additionally, for any fixed ''L'', the diversity of structures that can be modeled decreases as the length of the fragments increases. Shorter fragments are more capable of covering the diverse array of structures found in the PDB than longer ones. It has been shown that libraries of fragments of up to 15 residues are capable of modeling 91% of the fragments in the PDB to within 2.0 angstroms. [Du, P., Andrec, M., and Levy, R. (2003). Have We Seen All Structures Corresponding to Short Protein Fragments in the Protein Data Bank? An Update. Protein Engineering 16(6), 407-414.]
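To make the comparison concrete, the two counts can be evaluated directly. The snippet below is a small sketch: it uses the chain count ''L''^(''N''/(''F''-3)+1) for fragment libraries and ''K''^''N'' for explicit phi-psi modeling, and it assumes that the quotient is rounded up when ''N'' is not a multiple of ''F''-3. The parameter values are arbitrary and chosen only for illustration.

```python
def fragment_chain_count(n_residues, fragment_len, library_size, overlap=3):
    """Number of fragment chains: L ** (ceil(N / (F - overlap)) + 1)."""
    step = fragment_len - overlap  # new residues contributed by each fragment
    if step <= 0:
        raise ValueError("fragment length must exceed the overlap")
    n_fragments = -(-n_residues // step) + 1  # integer ceiling, then + 1
    return library_size ** n_fragments

def explicit_state_count(n_residues, states_per_residue):
    """Number of states when each residue takes one of K phi-psi combinations."""
    return states_per_residue ** n_residues

# Illustrative values only: N = 12 residues, F = 7, L = 20 fragments, K = 6.
with_fragments = fragment_chain_count(12, 7, 20)   # 20 ** 4  = 160000
explicit = explicit_state_count(12, 6)             # 6 ** 12  = 2176782336
```

With these values the fragment-based search space is several orders of magnitude smaller. The gap narrows as ''L'' grows or ''F'' shrinks, which is the trade-off between accuracy and speed described above.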


See also

* De novo protein structure prediction
* Homology modeling
* Protein design
* Protein structure prediction
* Protein structure prediction software
* Structural alignment

