HOME

TheInfoList



OR:

Protein design is the
rational design In chemical biology and biomolecular engineering, rational design (RD) is an umbrella term which invites the strategy of creating new molecules with a certain functionality, based upon the ability to predict how the molecule's structure (specific ...
of new
protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, respo ...
molecules to design novel activity, behavior, or purpose, and to advance basic understanding of protein function. Proteins can be designed from scratch (''de novo'' design) or by making calculated variants of a known protein structure and its sequence (termed ''protein redesign''). Rational protein design approaches make protein-sequence predictions that will fold to specific structures. These predicted sequences can then be validated experimentally through methods such as
peptide synthesis In organic chemistry, peptide synthesis is the production of peptides, compounds where multiple amino acids are linked via amide bonds, also known as peptide bonds. Peptides are chemically synthesized by the condensation reaction of the carboxyl ...
,
site-directed mutagenesis Site-directed mutagenesis is a molecular biology method that is used to make specific and intentional mutating changes to the DNA sequence of a gene and any gene products. Also called site-specific mutagenesis or oligonucleotide-directed mutagenesi ...
, or
artificial gene synthesis Artificial gene synthesis, or simply gene synthesis, refers to a group of methods that are used in synthetic biology to construct and assemble genes from nucleotides '' de novo''. Unlike DNA synthesis in living cells, artificial gene synthesis d ...
. Rational protein design dates back to the mid-1970s. Recently, however, there were numerous examples of successful rational design of water-soluble and even transmembrane peptides and proteins, in part due to a better understanding of different factors contributing to protein structure stability and development of better computational methods.


Overview and history

The goal in rational protein design is to predict
amino acid Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although hundreds of amino acids exist in nature, by far the most important are the alpha-amino acids, which comprise proteins. Only 22 alpha am ...
sequences In mathematics, a sequence is an enumerated collection of objects in which repetitions are allowed and order matters. Like a set, it contains members (also called ''elements'', or ''terms''). The number of elements (possibly infinite) is called t ...
that will fold to a specific protein structure. Although the number of possible protein sequences is vast, growing exponentially with the size of the protein chain, only a subset of them will fold reliably and quickly to one native state. Protein design involves identifying novel sequences within this subset. The native state of a protein is the conformational free energy minimum for the chain. Thus, protein design is the search for sequences that have the chosen structure as a free energy minimum. In a sense, it is the reverse of
protein structure prediction Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different ...
. In design, a
tertiary structure Protein tertiary structure is the three dimensional shape of a protein. The tertiary structure will have a single polypeptide chain "backbone" with one or more protein secondary structures, the protein domains. Amino acid side chains may int ...
is specified, and a sequence that will fold to it is identified. Hence, it is also termed ''inverse folding''. Protein design is then an optimization problem: using some scoring criteria, an optimized sequence that will fold to the desired structure is chosen. When the first proteins were rationally designed during the 1970s and 1980s, the sequence for these was optimized manually based on analyses of other known proteins, the sequence composition, amino acid charges, and the geometry of the desired structure. The first designed proteins are attributed to Bernd Gutte, who designed a reduced version of a known catalyst, bovine ribonuclease, and tertiary structures consisting of beta-sheets and alpha-helices, including a binder of
DDT Dichlorodiphenyltrichloroethane, commonly known as DDT, is a colorless, tasteless, and almost odorless crystalline chemical compound, an organochloride. Originally developed as an insecticide, it became infamous for its environmental impacts. ...
. Urry and colleagues later designed
elastin Elastin is a protein that in humans is encoded by the ''ELN'' gene. Elastin is a key component of the extracellular matrix in gnathostomes (jawed vertebrates). It is highly elastic and present in connective tissue allowing many tissues in the bod ...
-like
fibrous Fiber or fibre (from la, fibra, links=no) is a #Natural fibers, natural or Fiber#Artificial fibers, artificial substance that is significantly longer than it is wide. Fibers are often used in the manufacture of other materials. The stronge ...
peptides based on rules on sequence composition. Richardson and coworkers designed a 79-residue protein with no sequence homology to a known protein. In the 1990s, the advent of powerful computers, libraries of amino acid conformations, and force fields developed mainly for
molecular dynamics Molecular dynamics (MD) is a computer simulation method for analyzing the physical movements of atoms and molecules. The atoms and molecules are allowed to interact for a fixed period of time, giving a view of the dynamic "evolution" of the ...
simulations enabled the development of structure-based computational protein design tools. Following the development of these computational tools, great success has been achieved over the last 30 years in protein design. The first protein successfully designed completely ''de novo'' was done by Stephen Mayo and coworkers in 1997, and, shortly after, in 1999 Peter S. Kim and coworkers designed dimers, trimers, and tetramers of unnatural right-handed
coiled coil A coiled coil is a structural motif in proteins in which 2–7 alpha-helices are coiled together like the strands of a rope. (Dimers and trimers are the most common types.) Many coiled coil-type proteins are involved in important biological fun ...
s. In 2003, David Baker's laboratory designed a full protein to a fold never seen before in nature. Later, in 2008, Baker's group computationally designed enzymes for two different reactions. In 2010, one of the most powerful broadly neutralizing antibodies was isolated from patient serum using a computationally designed protein probe. Due to these and other successes (e.g., see
examples Example may refer to: * '' exempli gratia'' (e.g.), usually read out in English as "for example" * .example, reserved as a domain name that may not be installed as a top-level domain of the Internet ** example.com, example.net, example.org, e ...
below), protein design has become one of the most important tools available for
protein engineering Protein engineering is the process of developing useful or valuable proteins. It is a young discipline, with much research taking place into the understanding of protein folding and recognition for protein design principles. It has been used to imp ...
. There is great hope that the design of new proteins, small and large, will have uses in
biomedicine Biomedicine (also referred to as Western medicine, mainstream medicine or conventional medicine)
and
bioengineering Biological engineering or bioengineering is the application of principles of biology and the tools of engineering to create usable, tangible, economically-viable products. Biological engineering employs knowledge and expertise from a number o ...
.


Underlying models of protein structure and function

Protein design programs use
computer models Computer simulation is the process of mathematical modelling, performed on a computer, which is designed to predict the behaviour of, or the outcome of, a real-world or physical system. The reliability of some mathematical models can be deter ...
of the molecular forces that drive proteins in ''
in vivo Studies that are ''in vivo'' (Latin for "within the living"; often not italicized in English) are those in which the effects of various biological entities are tested on whole, living organisms or cells, usually animals, including humans, and ...
'' environments. In order to make the problem tractable, these forces are simplified by protein design models. Although protein design programs vary greatly, they have to address four main modeling questions: What is the target structure of the design, what flexibility is allowed on the target structure, which sequences are included in the search, and which force field will be used to score sequences and structures.


Target structure

Protein function is heavily dependent on protein structure, and rational protein design uses this relationship to design function by designing proteins that have a target structure or fold. Thus, by definition, in rational protein design the target structure or ensemble of structures must be known beforehand. This contrasts with other forms of protein engineering, such as
directed evolution Directed evolution (DE) is a method used in protein engineering that mimics the process of natural selection to steer proteins or nucleic acids toward a user-defined goal. It consists of subjecting a gene to iterative rounds of mutagenesis (cre ...
, where a variety of methods are used to find proteins that achieve a specific function, and with
protein structure prediction Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence—that is, the prediction of its secondary and tertiary structure from primary structure. Structure prediction is different ...
where the sequence is known, but the structure is unknown. Most often, the target structure is based on a known structure of another protein. However, novel folds not seen in nature have been made increasingly possible. Peter S. Kim and coworkers designed trimers and tetramers of unnatural coiled coils, which had not been seen before in nature. The protein Top7, developed in David Baker's lab, was designed completely using protein design algorithms, to a completely novel fold. More recently, Baker and coworkers developed a series of principles to design ideal globular-protein structures based on protein folding funnels that bridge between secondary structure prediction and tertiary structures. These principles, which build on both protein structure prediction and protein design, were used to design five different novel protein topologies.


Sequence space

In rational protein design, proteins can be redesigned from the sequence and structure of a known protein, or completely from scratch in ''de novo'' protein design. In protein redesign, most of the residues in the sequence are maintained as their wild-type amino-acid while a few are allowed to mutate. In ''de novo'' design, the entire sequence is designed anew, based on no prior sequence. Both ''de novo'' designs and protein redesigns can establish rules on the
sequence space In functional analysis and related areas of mathematics, a sequence space is a vector space whose elements are infinite sequences of real or complex numbers. Equivalently, it is a function space whose elements are functions from the natural num ...
: the specific amino acids that are allowed at each mutable residue position. For example, the composition of the surface of the RSC3 probe to select HIV-broadly neutralizing antibodies was restricted based on evolutionary data and charge balancing. Many of the earliest attempts on protein design were heavily based on empiric ''rules'' on the sequence space. Moreover, the design of fibrous proteins usually follows strict rules on the sequence space.
Collagen Collagen () is the main structural protein in the extracellular matrix found in the body's various connective tissues. As the main component of connective tissue, it is the most abundant protein in mammals, making up from 25% to 35% of the whole ...
-based designed proteins, for example, are often composed of Gly-Pro-X repeating patterns. The advent of computational techniques allows designing proteins with no human intervention in sequence selection.


Structural flexibility

In protein design, the target structure (or structures) of the protein are known. However, a rational protein design approach must model some ''flexibility'' on the target structure in order to increase the number of sequences that can be designed for that structure and to minimize the chance of a sequence folding to a different structure. For example, in a protein redesign of one small amino acid (such as alanine) in the tightly packed core of a protein, very few mutants would be predicted by a rational design approach to fold to the target structure, if the surrounding side-chains are not allowed to be repacked. Thus, an essential parameter of any design process is the amount of flexibility allowed for both the side-chains and the backbone. In the simplest models, the protein backbone is kept rigid while some of the protein side-chains are allowed to change conformations. However, side-chains can have many degrees of freedom in their bond lengths, bond angles, and χ dihedral angles. To simplify this space, protein design methods use rotamer libraries that assume ideal values for bond lengths and bond angles, while restricting χ dihedral angles to a few frequently observed low-energy conformations termed rotamers. Rotamer libraries are derived from the statistical analysis of many protein structures. Backbone-independent rotamer libraries describe all rotamers. Backbone-dependent rotamer libraries, in contrast, describe the rotamers as how likely they are to appear depending on the protein backbone arrangement around the side chain. Most protein design programs use one conformation (e.g., the modal value for rotamer dihedrals in space) or several points in the region described by the rotamer; the OSPREY protein design program, in contrast, models the entire continuous region. Although rational protein design must preserve the general backbone fold a protein, allowing some backbone flexibility can significantly increase the number of sequences that fold to the structure while maintaining the general fold of the protein. Backbone flexibility is especially important in protein redesign because sequence mutations often result in small changes to the backbone structure. Moreover, backbone flexibility can be essential for more advanced applications of protein design, such as binding prediction and enzyme design. Some models of protein design backbone flexibility include small and continuous global backbone movements, discrete backbone samples around the target fold, backrub motions, and protein loop flexibility.


Energy function

Rational protein design techniques must be able to discriminate sequences that will be stable under the target fold from those that would prefer other low-energy competing states. Thus, protein design requires accurate energy functions that can rank and score sequences by how well they fold to the target structure. At the same time, however, these energy functions must consider the computational
challenges Challenge may refer to: * Voter challenging or caging, a method of challenging the registration status of voters * Euphemism for disability * Peremptory challenge, a dismissal of potential jurors from jury duty Places Geography *Challenge, C ...
behind protein design. One of the most challenging requirements for successful design is an energy function that is both accurate and simple for computational calculations. The most accurate energy functions are those based on quantum mechanical simulations. However, such simulations are too slow and typically impractical for protein design. Instead, many protein design algorithms use either physics-based energy functions adapted from
molecular mechanics Molecular mechanics uses classical mechanics to model molecular systems. The Born–Oppenheimer approximation is assumed valid and the potential energy of all systems is calculated as a function of the nuclear coordinates using Force field (chemi ...
simulation programs, knowledge based energy-functions, or a hybrid mix of both. The trend has been toward using more physics-based potential energy functions. Physics-based energy functions, such as
AMBER Amber is fossilized tree resin that has been appreciated for its color and natural beauty since Neolithic times. Much valued from antiquity to the present as a gemstone, amber is made into a variety of decorative objects."Amber" (2004). In Ma ...
and
CHARMM Chemistry at Harvard Macromolecular Mechanics (CHARMM) is the name of a widely used set of force fields for molecular dynamics, and the name for the molecular dynamics simulation and analysis computer software package associated with them. The CHA ...
, are typically derived from quantum mechanical simulations, and experimental data from thermodynamics, crystallography, and spectroscopy. These energy functions typically simplify physical energy function and make them pairwise decomposable, meaning that the total energy of a protein conformation can be calculated by adding the pairwise energy between each atom pair, which makes them attractive for optimization algorithms. Physics-based energy functions typically model an attractive-repulsive
Lennard-Jones Sir John Edward Lennard-Jones (27 October 1894 – 1 November 1954) was a British mathematician and professor of theoretical physics at the University of Bristol, and then of theoretical science at the University of Cambridge. He was an im ...
term between atoms and a pairwise
electrostatics Electrostatics is a branch of physics that studies electric charges at rest (static electricity). Since classical times, it has been known that some materials, such as amber, attract lightweight particles after rubbing. The Greek word for amber ...
coulombic term between non-bonded atoms. Statistical potentials, in contrast to physics-based potentials, have the advantage of being fast to compute, of accounting implicitly of complex effects and being less sensitive to small changes in the protein structure. These energy functions are based on deriving energy values from frequency of appearance on a structural database. Protein design, however, has requirements that can sometimes be limited in molecular mechanics force-fields. Molecular mechanics force-fields, which have been used mostly in molecular dynamics simulations, are optimized for the simulation of single sequences, but protein design searches through many conformations of many sequences. Thus, molecular mechanics force-fields must be tailored for protein design. In practice, protein design energy functions often incorporate both statistical terms and physics-based terms. For example, the Rosetta energy function, one of the most-used energy functions, incorporates physics-based energy terms originating in the CHARMM energy function, and statistical energy terms, such as rotamer probability and knowledge-based electrostatics. Typically, energy functions are highly customized between laboratories, and specifically tailored for every design.


Challenges for effective design energy functions

Water makes up most of the molecules surrounding proteins and is the main driver of protein structure. Thus, modeling the interaction between water and protein is vital in protein design. The number of water molecules that interact with a protein at any given time is huge and each one has a large number of degrees of freedom and interaction partners. Instead, protein design programs model most of such water molecules as a continuum, modeling both the hydrophobic effect and solvation polarization. Individual water molecules can sometimes have a crucial structural role in the core of proteins, and in protein–protein or protein–ligand interactions. Failing to model such waters can result in mispredictions of the optimal sequence of a protein–protein interface. As an alternative, water molecules can be added to rotamers.


As an optimization problem

The goal of protein design is to find a protein sequence that will fold to a target structure. A protein design algorithm must, thus, search all the conformations of each sequence, with respect to the target fold, and rank sequences according to the lowest-energy conformation of each one, as determined by the protein design energy function. Thus, a typical input to the protein design algorithm is the target fold, the sequence space, the structural flexibility, and the energy function, while the output is one or more sequences that are predicted to fold stably to the target structure. The number of candidate protein sequences, however, grows exponentially with the number of protein residues; for example, there are 20100 protein sequences of length 100. Furthermore, even if amino acid side-chain conformations are limited to a few rotamers (see Structural flexibility), this results in an exponential number of conformations for each sequence. Thus, in our 100 residue protein, and assuming that each amino acid has exactly 10 rotamers, a search algorithm that searches this space will have to search over 200100 protein conformations. The most common energy functions can be decomposed into pairwise terms between rotamers and amino acid types, which casts the problem as a combinatorial one, and powerful optimization algorithms can be used to solve it. In those cases, the total energy of each conformation belonging to each sequence can be formulated as a sum of individual and pairwise terms between residue positions. If a designer is interested only in the best sequence, the protein design algorithm only requires the lowest-energy conformation of the lowest-energy sequence. In these cases, the amino acid identity of each rotamer can be ignored and all rotamers belonging to different amino acids can be treated the same. Let ri be a rotamer at residue position i in the protein chain, and E(ri) the potential energy between the internal atoms of the rotamer. Let E(ri, rj) be the potential energy between ri and rotamer rj at residue position j. Then, we define the optimization problem as one of finding the conformation of minimum energy (ET): The problem of minimizing ET is an
NP-hard In computational complexity theory, NP-hardness ( non-deterministic polynomial-time hardness) is the defining property of a class of problems that are informally "at least as hard as the hardest problems in NP". A simple example of an NP-hard pr ...
problem. Even though the class of problems is NP-hard, in practice many instances of protein design can be solved exactly or optimized satisfactorily through heuristic methods.


Algorithms

Several algorithms have been developed specifically for the protein design problem. These algorithms can be divided into two broad classes: exact algorithms, such as
dead-end elimination The dead-end elimination algorithm (DEE) is a method for optimization (mathematics), minimizing a Function (mathematics), function over a discrete set of independent variables. The basic idea is to identify "dead ends", i.e., combinations of vari ...
, that lack runtime guarantees but guarantee the quality of the solution; and
heuristic A heuristic (; ), or heuristic technique, is any approach to problem solving or self-discovery that employs a practical method that is not guaranteed to be optimal, perfect, or rational, but is nevertheless sufficient for reaching an immediate, ...
algorithms, such as Monte Carlo, that are faster than exact algorithms but have no guarantees on the optimality of the results. Exact algorithms guarantee that the optimization process produced the optimal according to the protein design model. Thus, if the predictions of exact algorithms fail when these are experimentally validated, then the source of error can be attributed to the energy function, the allowed flexibility, the sequence space or the target structure (e.g., if it cannot be designed for). Some protein design algorithms are listed below. Although these algorithms address only the most basic formulation of the protein design problem, Equation (), when the optimization goal changes because designers introduce improvements and extensions to the protein design model, such as improvements to the structural flexibility allowed (e.g., protein backbone flexibility) or including sophisticated energy terms, many of the extensions on protein design that improve modeling are built atop these algorithms. For example, Rosetta Design incorporates sophisticated energy terms, and backbone flexibility using Monte Carlo as the underlying optimizing algorithm. OSPREY's algorithms build on the dead-end elimination algorithm and A* to incorporate continuous backbone and side-chain movements. Thus, these algorithms provide a good perspective on the different kinds of algorithms available for protein design. In 2020 scientists reported the development of an AI-based process using
genome databases In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ge ...
for evolution-based designing of novel proteins. They used
deep learning Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. De ...
to identify design-rules. In 2022, a study reported deep learning software that can design proteins that contain prespecified functional sites.


With mathematical guarantees


Dead-end elimination

The dead-end elimination (DEE) algorithm reduces the search space of the problem iteratively by removing rotamers that can be provably shown to be not part of the global lowest energy conformation (GMEC). On each iteration, the dead-end elimination algorithm compares all possible pairs of rotamers at each residue position, and removes each rotamer r′i that can be shown to always be of higher energy than another rotamer ri and is thus not part of the GMEC: E(r^\prime_i) + \sum_ \min_ E(r^\prime_i,r_j) > E(r_i) + \sum_ \max_ E(r_i,r_j) Other powerful extensions to the dead-end elimination algorithm include the pairs elimination criterion, and the generalized dead-end elimination criterion. This algorithm has also been extended to handle continuous rotamers with provable guarantees. Although the Dead-end elimination algorithm runs in polynomial time on each iteration, it cannot guarantee convergence. If, after a certain number of iterations, the dead-end elimination algorithm does not prune any more rotamers, then either rotamers have to be merged or another search algorithm must be used to search the remaining search space. In such cases, the dead-end elimination acts as a pre-filtering algorithm to reduce the search space, while other algorithms, such as A*, Monte Carlo, Linear Programming, or FASTER are used to search the remaining search space.


Branch and bound

The protein design conformational space can be represented as a
tree In botany, a tree is a perennial plant with an elongated stem, or trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondary growth, plants that are ...
, where the protein residues are ordered in an arbitrary way, and the tree branches at each of the rotamers in a residue.
Branch and bound Branch and bound (BB, B&B, or BnB) is an algorithm design paradigm for discrete and combinatorial optimization problems, as well as mathematical optimization. A branch-and-bound algorithm consists of a systematic enumeration of candidate soluti ...
algorithms use this representation to efficiently explore the conformation tree: At each ''branching'', branch and bound algorithms ''bound'' the conformation space and explore only the promising branches. A popular search algorithm for protein design is the
A* search algorithm A* (pronounced "A-star") is a graph traversal and path search algorithm, which is used in many fields of computer science due to its completeness, optimality, and optimal efficiency. One major practical drawback is its O(b^d) space complexity, ...
. A* computes a lower-bound score on each partial tree path that lower bounds (with guarantees) the energy of each of the expanded rotamers. Each partial conformation is added to a priority queue and at each iteration the partial path with the lowest lower bound is popped from the queue and expanded. The algorithm stops once a full conformation has been enumerated and guarantees that the conformation is the optimal. The A* score f in protein design consists of two parts, f=g+h. g is the exact energy of the rotamers that have already been assigned in the partial conformation. h is a lower bound on the energy of the rotamers that have not yet been assigned. Each is designed as follows, where d is the index of the last assigned residue in the partial conformation. g=\sum_^d (E(r_i ) + \sum_^d E(r_i,r_j) ) h = \sum_^n min_(E(r_j) + \sum_^d E(r_i,r_j) + \sum_^n \min_ E(r_j,r_k))/math>


Integer linear programming

The problem of optimizing ET (Equation ()) can be easily formulated as an
integer linear program An integer programming problem is a mathematical optimization or feasibility program in which some or all of the variables are restricted to be integers. In many settings the term refers to integer linear programming (ILP), in which the objectiv ...
(ILP). One of the most powerful formulations uses binary variables to represent the presence of a rotamer and edges in the final solution, and constraints the solution to have exactly one rotamer for each residue and one pairwise interaction for each pair of residues: \ \min \sum_\sum_ E_i(r_i)q_(r_i) + \sum_\sum_ E_(r_i, r_j)q_(r_i, r_j) \, s.t. \sum_ q_(r_i) = 1, \ \forall i \sum_ q_(r_i,r_j) = q_(r_i), \forall i, r_i, j q_i, q_ \in \ ILP solvers, such as
CPLEX IBM ILOG CPLEX Optimization Studio (often informally referred to simply as CPLEX) is an optimization software package. In 2004, the work on CPLEX earned the first INFORMS Impact Prize. History The CPLEX Optimizer was named for the simplex me ...
, can compute the exact optimal solution for large instances of protein design problems. These solvers use a
linear programming relaxation In mathematics, the relaxation of a (mixed) integer linear program is the problem that arises by removing the integrality constraint of each variable. For example, in a 0–1 integer program, all constraints are of the form :x_i\in\. The relax ...
of the problem, where qi and qij are allowed to take continuous values, in combination with a
branch and cut Branch and cut is a method of combinatorial optimization for solving integer linear programs (ILPs), that is, linear programming (LP) problems where some or all the unknowns are restricted to integer values. Branch and cut involves running a branc ...
algorithm to search only a small portion of the conformation space for the optimal solution. ILP solvers have been shown to solve many instances of the side-chain placement problem.


Message-passing based approximations to the linear programming dual

ILP solvers depend on linear programming (LP) algorithms, such as the
Simplex In geometry, a simplex (plural: simplexes or simplices) is a generalization of the notion of a triangle or tetrahedron to arbitrary dimensions. The simplex is so-named because it represents the simplest possible polytope in any given dimension. ...
or
barrier A barrier or barricade is a physical structure which blocks or impedes something. Barrier may also refer to: Places * Barrier, Kentucky, a community in the United States * Barrier, Voerendaal, a place in the municipality of Voerendaal, Netherl ...
-based methods to perform the LP relaxation at each branch. These LP algorithms were developed as general-purpose optimization methods and are not optimized for the protein design problem (Equation ()). In consequence, the LP relaxation becomes the bottleneck of ILP solvers when the problem size is large. Recently, several alternatives based on message-passing algorithms have been designed specifically for the optimization of the LP relaxation of the protein design problem. These algorithms can approximate both the dual or the
primal Primal may refer to: Psychotherapy * ''Primal'', the core concept in primal therapy, denotes the full reliving and cathartic release of an early traumatic experience * Primal scene (in psychoanalysis), refers to the witnessing by a young child of ...
instances of the integer programming, but in order to maintain guarantees on optimality, they are most useful when used to approximate the dual of the protein design problem, because approximating the dual guarantees that no solutions are missed. Message-passing based approximations include the ''tree reweighted max-product message passing'' algorithm, and the ''message passing linear programming'' algorithm.


Optimization algorithms without guarantees


Monte Carlo and simulated annealing

Monte Carlo is one of the most widely used algorithms for protein design. In its simplest form, a Monte Carlo algorithm selects a residue at random, and in that residue a randomly chosen rotamer (of any amino acid) is evaluated. The new energy of the protein, Enew is compared against the old energy Eold and the new rotamer is ''accepted'' with a probability of: p=e^, where β is the
Boltzmann constant The Boltzmann constant ( or ) is the proportionality factor that relates the average relative kinetic energy of particles in a gas with the thermodynamic temperature of the gas. It occurs in the definitions of the kelvin and the gas constant, ...
and the temperature T can be chosen such that in the initial rounds it is high and it is slowly annealed to overcome local minima.


FASTER

The FASTER algorithm uses a combination of deterministic and stochastic criteria to optimize amino acid sequences. FASTER first uses DEE to eliminate rotamers that are not part of the optimal solution. Then, a series of iterative steps optimize the rotamer assignment.


Belief propagation

In
belief propagation A belief is an attitude that something is the case, or that some proposition is true. In epistemology, philosophers use the term "belief" to refer to attitudes about the world which can be either true or false. To believe something is to take i ...
for protein design, the algorithm exchanges messages that describe the ''belief'' that each residue has about the probability of each rotamer in neighboring residues. The algorithm updates messages on every iteration and iterates until convergence or until a fixed number of iterations. Convergence is not guaranteed in protein design. The message mi→ j(rj that a residue i sends to every rotamer (rj at neighboring residue j is defined as: m_(r_j) = \max_ \Big(e^\Big) \prod_ m_ Both max-product and sum-product belief propagation have been used to optimize protein design.


Applications and examples of designed proteins


Enzyme design

The design of new
enzyme Enzymes () are proteins that act as biological catalysts by accelerating chemical reactions. The molecules upon which enzymes may act are called substrates, and the enzyme converts the substrates into different molecules known as products. A ...
s is a use of protein design with huge bioengineering and biomedical applications. In general, designing a protein structure can be different from designing an enzyme, because the design of enzymes must consider many states involved in the
catalytic mechanism Enzyme catalysis is the increase in the rate of a process by a biological molecule, an "enzyme". Most enzymes are proteins, and most such processes are chemical reactions. Within the enzyme, generally catalysis occurs at a localized site, calle ...
. However protein design is a prerequisite of ''de novo'' enzyme design because, at the very least, the design of catalysts requires a scaffold in which the catalytic mechanism can be inserted. Great progress in ''de novo'' enzyme design, and redesign, was made in the first decade of the 21st century. In three major studies, David Baker and coworkers ''de novo'' designed enzymes for the retro-
aldol reaction The aldol reaction is a means of forming carbon–carbon bonds in organic chemistry. Discovered independently by the Russian chemist Alexander Borodin in 1869 and by the French chemist Charles-Adolphe Wurtz in 1872, the reaction combines two carb ...
, a Kemp-elimination reaction, and for the Diels-Alder reaction. Furthermore, Stephen Mayo and coworkers developed an iterative method to design the most efficient known enzyme for the Kemp-elimination reaction. Also, in the laboratory of
Bruce Donald Bruce Randall Donald (born 1958) is an American computer scientist and computational biologist. He is the James B. Duke Professor of Computer Science and Biochemistry at Duke University. He has made numerous contributions to several fields in ...
, computational protein design was used to switch the specificity of one of the
protein domain In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of s ...
s of the
nonribosomal peptide synthetase Nonribosomal peptides (NRP) are a class of peptide secondary metabolites, usually produced by microorganisms like bacteria and fungi. Nonribosomal peptides are also found in higher organisms, such as nudibranchs, but are thought to be made by bacter ...
that produces
Gramicidin S Gramicidin S or Gramicidin Soviet is an antibiotic that is effective against some gram-positive and gram-negative bacteria as well as some fungi. It is a derivative of gramicidin, produced by the gram-positive bacterium '' Brevibacillus brevis ...
, from its natural substrate phenylalanine to other noncognate substrates including charged amino acids; the redesigned enzymes had activities close to those of the wild-type.


Design for affinity

Protein–protein interaction Protein–protein interactions (PPIs) are physical contacts of high specificity established between two or more protein molecules as a result of biochemical events steered by interactions that include electrostatic forces, hydrogen bonding and th ...
s are involved in most biotic processes. Many of the hardest-to-treat diseases, such as Alzheimer's, many forms of
cancer Cancer is a group of diseases involving abnormal cell growth with the potential to invade or spread to other parts of the body. These contrast with benign tumors, which do not spread. Possible signs and symptoms include a lump, abnormal b ...
(e.g.,
TP53 p53, also known as Tumor protein P53, cellular tumor antigen p53 (UniProt name), or transformation-related protein 53 (TRP53) is a regulatory protein that is often mutated in human cancers. The p53 proteins (originally thought to be, and often ...
), and human immunodeficiency virus (
HIV The human immunodeficiency viruses (HIV) are two species of ''Lentivirus'' (a subgroup of retrovirus) that infect humans. Over time, they cause acquired immunodeficiency syndrome (AIDS), a condition in which progressive failure of the immune ...
) infection involve protein–protein interactions. Thus, to treat such diseases, it is desirable to design protein or protein-like therapeutics that bind one of the partners of the interaction and, thus, disrupt the disease-causing interaction. This requires designing protein-therapeutics for ''affinity'' toward its partner. Protein–protein interactions can be designed using protein design algorithms because the principles that rule protein stability also rule protein–protein binding. Protein–protein interaction design, however, presents challenges not commonly present in protein design. One of the most important challenges is that, in general, the interfaces between proteins are more polar than protein cores, and binding involves a tradeoff between desolvation and hydrogen bond formation. To overcome this challenge, Bruce Tidor and coworkers developed a method to improve the affinity of antibodies by focusing on electrostatic contributions. They found that, for the antibodies designed in the study, reducing the desolvation costs of the residues in the interface increased the affinity of the binding pair.


Scoring binding predictions

Protein design energy functions must be adapted to score binding predictions because binding involves a trade-off between the lowest-
energy In physics, energy (from Ancient Greek: ἐνέργεια, ''enérgeia'', “activity”) is the quantitative property that is transferred to a body or to a physical system, recognizable in the performance of work and in the form of heat a ...
conformations of the free proteins (EP and EL) and the lowest-energy conformation of the bound complex (EPL): \Delta_G = E_ - E_P - E_L . The K* algorithm approximates the binding constant of the algorithm by including conformational entropy into the free energy calculation. The K* algorithm considers only the lowest-energy conformations of the free and bound complexes (denoted by the sets P, L, and PL) to approximate the partition functions of each complex: K^* = \frac


Design for specificity

The design of protein–protein interactions must be highly specific because proteins can interact with a large number of proteins; successful design requires selective binders. Thus, protein design algorithms must be able to distinguish between on-target (or ''positive design'') and off-target binding (or ''negative design''). One of the most prominent examples of design for specificity is the design of specific
bZIP The Basic Leucine Zipper Domain (bZIP domain) is found in many DNA binding eukaryotic proteins. One part of the domain contains a region that mediates sequence specific DNA binding properties and the leucine zipper that is required to hold tog ...
-binding peptides by Amy Keating and coworkers for 19 out of the 20 bZIP families; 8 of these peptides were specific for their intended partner over competing peptides. Further, positive and negative design was also used by Anderson and coworkers to predict mutations in the active site of a drug target that conferred resistance to a new drug; positive design was used to maintain wild-type activity, while negative design was used to disrupt binding of the drug. Recent computational redesign by Costas Maranas and coworkers was also capable of experimentally switching the cofactor specificity of ''Candida boidinii'' xylose reductase from
NADPH Nicotinamide adenine dinucleotide phosphate, abbreviated NADP or, in older notation, TPN (triphosphopyridine nucleotide), is a cofactor used in anabolic reactions, such as the Calvin cycle and lipid and nucleic acid syntheses, which require NAD ...
to
NADH Nicotinamide adenine dinucleotide (NAD) is a coenzyme central to metabolism. Found in all living cells, NAD is called a dinucleotide because it consists of two nucleotides joined through their phosphate groups. One nucleotide contains an aden ...
.


Protein resurfacing

Protein resurfacing consists of designing a protein's surface while preserving the overall fold, core, and boundary regions of the protein intact. Protein resurfacing is especially useful to alter the binding of a protein to other proteins. One of the most important applications of protein resurfacing was the design of the RSC3 probe to select broadly neutralizing HIV antibodies at the NIH Vaccine Research Center. First, residues outside of the binding interface between the gp120 HIV envelope protein and the formerly discovered b12-antibody were selected to be designed. Then, the sequence spaced was selected based on evolutionary information, solubility, similarity with the wild-type, and other considerations. Then the RosettaDesign software was used to find optimal sequences in the selected sequence space. RSC3 was later used to discover the broadly neutralizing antibody VRC01 in the serum of a long-term HIV-infected non-progressor individual.


Design of globular proteins

Globular protein In biochemistry, globular proteins or spheroproteins are spherical ("globe-like") proteins and are one of the common protein types (the others being fibrous, disordered and membrane proteins). Globular proteins are somewhat water-soluble (formi ...
s are proteins that contain a hydrophobic core and a hydrophilic surface. Globular proteins often assume a stable structure, unlike
fibrous protein In molecular biology, fibrous proteins or scleroproteins are one of the three main classifications of protein structure (alongside globular and membrane proteins). Fibrous proteins are made up of elongated or fibrous polypeptide chains which fo ...
s, which have multiple conformations. The three-dimensional structure of globular proteins is typically easier to determine through
X-ray crystallography X-ray crystallography is the experimental science determining the atomic and molecular structure of a crystal, in which the crystalline structure causes a beam of incident X-rays to diffract into many specific directions. By measuring the angles ...
and
nuclear magnetic resonance Nuclear magnetic resonance (NMR) is a physical phenomenon in which nuclei in a strong constant magnetic field are perturbed by a weak oscillating magnetic field (in the near field) and respond by producing an electromagnetic signal with a ...
than both fibrous proteins and
membrane protein Membrane proteins are common proteins that are part of, or interact with, biological membranes. Membrane proteins fall into several broad categories depending on their location. Integral membrane proteins are a permanent part of a cell membrane ...
s, which makes globular proteins more attractive for protein design than the other types of proteins. Most successful protein designs have involved globular proteins. Both RSD-1, and Top7 were ''de novo'' designs of globular proteins. Five more protein structures were designed, synthesized, and verified in 2012 by the Baker group. These new proteins serve no biotic function, but the structures are intended to act as building-blocks that can be expanded to incorporate functional active sites. The structures were found computationally by using new heuristics based on analyzing the connecting loops between parts of the sequence that specify secondary structures.


Design of membrane proteins

Several transmembrane proteins have been successfully designed, along with many other membrane-associated peptides and proteins. Recently, Costas Maranas and his coworkers developed an automated tool to redesign the pore size of Outer Membrane Porin Type-F (OmpF) from ''E.coli'' to any desired sub-nm size and assembled them in membranes to perform precise angstrom scale separation.


Other applications

One of the most desirable uses for protein design is for
biosensor A biosensor is an analytical device, used for the detection of a chemical substance, that combines a biological component with a physicochemical detector. The ''sensitive biological element'', e.g. tissue, microorganisms, organelles, cell rece ...
s, proteins that will sense the presence of specific compounds. Some attempts in the design of biosensors include sensors for unnatural molecules including
TNT Trinitrotoluene (), more commonly known as TNT, more specifically 2,4,6-trinitrotoluene, and by its preferred IUPAC name 2-methyl-1,3,5-trinitrobenzene, is a chemical compound with the formula C6H2(NO2)3CH3. TNT is occasionally used as a reagen ...
. More recently, Kuhlman and coworkers designed a biosensor of the
PAK1 Serine/threonine-protein kinase PAK 1 is an enzyme that in humans is encoded by the ''PAK1'' gene. PAK1 is one of six members of the PAK family of serine/threonine kinases which are broadly divided into group I (PAK1, PAK2 and PAK3) and group II ...
. In a sense, protein design is a subset of battery design.


See also

*
Molecular design software Molecular design software is notable software for molecular modeling, that provides special support for developing molecular models ''de novo''. In contrast to the usual molecular modeling programs, such as for molecular dynamics and quantum chemi ...
*
Protein engineering Protein engineering is the process of developing useful or valuable proteins. It is a young discipline, with much research taking place into the understanding of protein folding and recognition for protein design principles. It has been used to imp ...
*
Protein structure prediction software This list of protein structure prediction software summarizes notable used software tools in protein structure prediction, including homology modeling, protein threading, ''ab initio'' methods, secondary structure prediction, and transmembrane ...
*
Comparison of software for molecular mechanics modeling This is a list of computer programs that are predominantly used for molecular mechanics calculations. See also * Car–Parrinello molecular dynamics * Comparison of force-field implementations *Comparison of nucleic acid simulation software ...


References


Further reading

* * * * {{Design Protein structure Protein engineering