Homology modeling, also known as comparative modeling of protein, refers to constructing an atomic-resolution model of the "''target''"

protein Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perform a vast array of functions within organisms, including catalysing metabolic reactions, DNA replication, res ...

from its

amino acid sequence Protein primary structure is the linear sequence of amino acids in a peptide or protein. By convention, the primary structure of a protein is reported starting from the amino-terminal (N) end to the carboxyl-terminal (C) end. Protein biosynthe ...

and an experimental three-dimensional structure of a related homologous protein (the "''template''"). Homology modeling relies on the identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of an

alignment Alignment may refer to: Archaeology * Alignment (archaeology), a co-linear arrangement of features or structures with external landmarks * Stone alignment, a linear arrangement of upright, parallel megalithic standing stones Biology * Struc ...

that maps residues in the query sequence to residues in the template sequence. It has been seen that protein structures are more conserved than protein sequences amongst homologues, but sequences falling below a 20% sequence identity can have very different structure. Evolutionarily related proteins have similar sequences and naturally occurring homologous proteins have similar protein structure. It has been shown that three-dimensional protein structure is evolutionarily more conserved than would be expected on the basis of sequence conservation alone. The sequence alignment and template structure are then used to produce a structural model of the target. Because protein structures are more conserved than DNA sequences, and detectable levels of sequence similarity usually imply significant structural similarity. The quality of the homology model is dependent on the quality of the sequence alignment and template structure. The approach can be complicated by the presence of alignment gaps (commonly called indels) that indicate a structural region present in the target but not in the template, and by structure gaps in the template that arise from poor resolution in the experimental procedure (usually

X-ray crystallography X-ray crystallography is the experimental science determining the atomic and molecular structure of a crystal, in which the crystalline structure causes a beam of incident X-rays to diffract into many specific directions. By measuring the angles ...

) used to solve the structure. Model quality declines with decreasing sequence identity; a typical model has ~1–2 Å

root mean square deviation The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. The RMSD represents ...

between the matched C^α atoms at 70% sequence identity but only 2–4 Å agreement at 25% sequence identity. However, the errors are significantly higher in the loop regions, where the amino acid sequences of the target and template proteins may be completely different. Regions of the model that were constructed without a template, usually by loop modeling, are generally much less accurate than the rest of the model. Errors in

side chain In organic chemistry and biochemistry, a side chain is a chemical group that is attached to a core part of the molecule called the "main chain" or backbone. The side chain is a hydrocarbon branching element of a molecule that is attached to a ...

packing and position also increase with decreasing identity, and variations in these packing configurations have been suggested as a major reason for poor model quality at low identity.Chung SY, Subbiah S. (1996.) A structural explanation for the twilight zone of protein sequence homology. ''Structure'' 4: 1123–27. Taken together, these various atomic-position errors are significant and impede the use of homology models for purposes that require atomic-resolution data, such as drug design and

protein–protein interaction Protein–protein interactions (PPIs) are physical contacts of high specificity established between two or more protein molecules as a result of biochemical events steered by interactions that include electrostatic forces, hydrogen bonding and th ...

predictions; even the quaternary structure of a protein may be difficult to predict from homology models of its subunit(s). Nevertheless, homology models can be useful in reaching ''qualitative'' conclusions about the biochemistry of the query sequence, especially in formulating hypotheses about why certain residues are conserved, which may in turn lead to experiments to test those hypotheses. For example, the spatial arrangement of conserved residues may suggest whether a particular residue is conserved to stabilize the folding, to participate in binding some small molecule, or to foster association with another protein or nucleic acid. Homology modeling can produce high-quality structural models when the target and template are closely related, which has inspired the formation of a

structural genomics Structural genomics seeks to describe the 3-dimensional structure of every protein encoded by a given genome. This genome-based approach allows for a high-throughput method of structure determination by a combination of experimental and modeling ...

consortium dedicated to the production of representative experimental structures for all classes of protein folds. The chief inaccuracies in homology modeling, which worsen with lower sequence identity, derive from errors in the initial sequence alignment and from improper template selection. Like other methods of structure prediction, current practice in homology modeling is assessed in a biennial large-scale experiment known as the Critical Assessment of Techniques for Protein Structure Prediction, or CASP.

Motive

The method of homology modeling is based on the observation that protein

tertiary structure Protein tertiary structure is the three dimensional shape of a protein. The tertiary structure will have a single polypeptide chain "backbone" with one or more protein secondary structures, the protein domains. Amino acid side chains may i ...

is better conserved than

. Thus, even proteins that have diverged appreciably in sequence but still share detectable similarity will also share common structural properties, particularly the overall fold. Because it is difficult and time-consuming to obtain experimental structures from methods such as

and protein NMR for every protein of interest, homology modeling can provide useful structural models for generating hypotheses about a protein's function and directing further experimental work. There are exceptions to the general rule that proteins sharing significant sequence identity will share a fold. For example, a judiciously chosen set of mutations of less than 50% of a protein can cause the protein to adopt a completely different fold. However, such a massive structural rearrangement is unlikely to occur in

evolution Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation ...

, especially since the protein is usually under the constraint that it must fold properly and carry out its function in the cell. Consequently, the roughly folded structure of a protein (its "topology") is conserved longer than its amino-acid sequence and much longer than the corresponding DNA sequence; in other words, two proteins may share a similar fold even if their evolutionary relationship is so distant that it cannot be discerned reliably. For comparison, the function of a protein is conserved much ''less'' than the protein sequence, since relatively few changes in amino-acid sequence are required to take on a related function.

Steps in model production

The homology modeling procedure can be broken down into four sequential steps: template selection, target-template alignment, model construction, and model assessment. The first two steps are often essentially performed together, as the most common methods of identifying templates rely on the production of sequence alignments; however, these alignments may not be of sufficient quality because database search techniques prioritize speed over alignment quality. These processes can be performed iteratively to improve the quality of the final model, although quality assessments that are not dependent on the true target structure are still under development. Optimizing the speed and accuracy of these steps for use in large-scale automated structure prediction is a key component of structural genomics initiatives, partly because the resulting volume of data will be too large to process manually and partly because the goal of structural genomics requires providing models of reasonable quality to researchers who are not themselves structure prediction experts.

Template selection and sequence alignment

The critical first step in homology modeling is the identification of the best template structure, if indeed any are available. The simplest method of template identification relies on serial pairwise sequence alignments aided by database search techniques such as FASTA and

BLAST Blast or The Blast may refer to: *Explosion, a rapid increase in volume and release of energy in an extreme manner *Detonation, an exothermic front accelerating through a medium that eventually drives a shock front Film * ''Blast'' (1997 film), ...

. More sensitive methods based on

multiple sequence alignment Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutio ...

– of which PSI-BLAST is the most common example – iteratively update their

position-specific scoring matrix A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences. PWMs are often derived from a set ...

to successively identify more distantly related homologs. This family of methods has been shown to produce a larger number of potential templates and to identify better templates for sequences that have only distant relationships to any solved structure. Protein threading, also known as fold recognition or 3D-1D alignment, can also be used as a search technique for identifying templates to be used in traditional homology modeling methods. Recent CASP experiments indicate that some protein threading methods such as RaptorX indeed are more sensitive than purely sequence(profile)-based methods when only distantly-related templates are available for the proteins under prediction. When performing a BLAST search, a reliable first approach is to identify hits with a sufficiently low ''E''-value, which are considered sufficiently close in evolution to make a reliable homology model. Other factors may tip the balance in marginal cases; for example, the template may have a function similar to that of the query sequence, or it may belong to a homologous operon. However, a template with a poor ''E''-value should generally not be chosen, even if it is the only one available, since it may well have a wrong structure, leading to the production of a misguided model. A better approach is to submit the primary sequence to fold-recognition servers or, better still, consensus meta-servers which improve upon individual fold-recognition servers by identifying similarities (consensus) among independent predictions. Often several candidate template structures are identified by these approaches. Although some methods can generate hybrid models with better accuracy from multiple templates, most methods rely on a single template. Therefore, choosing the best template from among the candidates is a key step, and can affect the final accuracy of the structure significantly. This choice is guided by several factors, such as the similarity of the query and template sequences, of their functions, and of the predicted query and observed template

secondary structure Protein secondary structure is the three dimensional form of ''local segments'' of proteins. The two most common secondary structural elements are alpha helices and beta sheets, though beta turns and omega loops occur as well. Secondary struct ...

s. Perhaps most importantly, the ''coverage'' of the aligned regions: the fraction of the query sequence structure that can be predicted from the template, and the plausibility of the resulting model. Thus, sometimes several homology models are produced for a single query sequence, with the most likely candidate chosen only in the final step. It is possible to use the sequence alignment generated by the database search technique as the basis for the subsequent model production; however, more sophisticated approaches have also been explored. One proposal generates an ensemble of

stochastic Stochastic (, ) refers to the property of being well described by a random probability distribution. Although stochasticity and randomness are distinct in that the former refers to a modeling approach and the latter refers to phenomena themselv ...

ally defined pairwise alignments between the target sequence and a single identified template as a means of exploring "alignment space" in regions of sequence with low local similarity. "Profile-profile" alignments that first generate a sequence profile of the target and systematically compare it to the sequence profiles of solved structures; the coarse-graining inherent in the profile construction is thought to reduce noise introduced by sequence drift in nonessential regions of the sequence.

Model generation

Given a template and an alignment, the information contained therein must be used to generate a three-dimensional structural model of the target, represented as a set of

Cartesian coordinates A Cartesian coordinate system (, ) in a plane is a coordinate system that specifies each point uniquely by a pair of numerical coordinates, which are the signed distances to the point from two fixed perpendicular oriented lines, measured in ...

for each atom in the protein. Three major classes of model generation methods have been proposed.

Fragment assembly

The original method of homology modeling relied on the assembly of a complete model from conserved structural fragments identified in closely related solved structures. For example, a modeling study of

serine protease Serine proteases (or serine endopeptidases) are enzymes that cleave peptide bonds in proteins. Serine serves as the nucleophilic amino acid at the (enzyme's) active site. They are found ubiquitously in both eukaryotes and prokaryotes. Seri ...

s in

mammal Mammals () are a group of vertebrate animals constituting the class Mammalia (), characterized by the presence of mammary glands which in females produce milk for feeding (nursing) their young, a neocortex (a region of the brain), fur ...

s identified a sharp distinction between "core" structural regions conserved in all experimental structures in the class, and variable regions typically located in the loops where the majority of the sequence differences were localized. Thus unsolved proteins could be modeled by first constructing the conserved core and then substituting variable regions from other proteins in the set of solved structures. Current implementations of this method differ mainly in the way they deal with regions that are not conserved or that lack a template. The variable regions are often constructed with the help of fragment libraries.

Segment matching

The segment-matching method divides the target into a series of short segments, each of which is matched to its own template fitted from the

Protein Data Bank The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cr ...

. Thus, sequence alignment is done over segments rather than over the entire protein. Selection of the template for each segment is based on sequence similarity, comparisons of

alpha carbon In the nomenclature of organic chemistry, a locant is a term to indicate the position of a functional group or substituent within a molecule. Numeric locants The International Union of Pure and Applied Chemistry (IUPAC) recommends the us ...

coordinates, and predicted

steric Steric effects arise from the spatial arrangement of atoms. When atoms come close together there is a rise in the energy of the molecule. Steric effects are nonbonding interactions that influence the shape ( conformation) and reactivity of ions ...

conflicts arising from the van der Waals radii of the divergent atoms between target and template.

Satisfaction of spatial restraints

The most common current homology modeling method takes its inspiration from calculations required to construct a three-dimensional structure from data generated by

NMR spectroscopy Nuclear magnetic resonance spectroscopy, most commonly known as NMR spectroscopy or magnetic resonance spectroscopy (MRS), is a spectroscopic technique to observe local magnetic fields around atomic nuclei. The sample is placed in a magnetic fi ...

. One or more target-template alignments are used to construct a set of geometrical criteria that are then converted to

probability density function In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) ca ...

s for each restraint. Restraints applied to the main protein internal coordinates –

protein backbone In organic chemistry, a peptide bond is an amide type of covalent chemical bond linking two consecutive alpha-amino acids from C1 (carbon number one) of one alpha-amino acid and N2 (nitrogen number two) of another, along a peptide or protein ch ...

distances and

dihedral angle A dihedral angle is the angle between two intersecting planes or half-planes. In chemistry, it is the clockwise angle between half-planes through two sets of three atoms, having two atoms in common. In solid geometry, it is defined as the un ...

s – serve as the basis for a global optimization procedure that originally used

conjugate gradient In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is positive-definite. The conjugate gradient method is often implemented as an itera ...

energy minimization to iteratively refine the positions of all heavy atoms in the protein. This method had been dramatically expanded to apply specifically to loop modeling, which can be extremely difficult due to the high flexibility of loops in proteins in

aqueous An aqueous solution is a solution in which the solvent is water. It is mostly shown in chemical equations by appending (aq) to the relevant chemical formula. For example, a solution of table salt, or sodium chloride (NaCl), in water would be re ...

solution. A more recent expansion applies the spatial-restraint model to

electron density In quantum chemistry, electron density or electronic density is the measure of the probability of an electron being present at an infinitesimal element of space surrounding any given point. It is a scalar quantity depending upon three spatial va ...

maps derived from

cryoelectron microscopy Cryogenic electron microscopy (cryo-EM) is a cryomicroscopy technique applied on samples cooled to cryogenic temperatures. For biological specimens, the structure is preserved by embedding in an environment of vitreous ice. An aqueous sample so ...

studies, which provide low-resolution information that is not usually itself sufficient to generate atomic-resolution structural models. To address the problem of inaccuracies in initial target-template sequence alignment, an iterative procedure has also been introduced to refine the alignment on the basis of the initial structural fit. The most commonly used software in spatial restraint-based modeling is MODELLER and a database called ModBase has been established for reliable models generated with it.Ursula Pieper, Narayanan Eswar, Hannes Braberg, M.S. Madhusudhan, Fred Davis, Ashley C. Stuart, Nebojsa Mirkovic, Andrea Rossi, Marc A. Marti-Renom, Andras Fiser, Ben Webb, Daniel Greenblatt, Conrad Huang, Tom Ferrin, Andrej Sali. MODBASE, a database of annotated comparative protein structure models, and associated resources. ''Nucleic Acids Res'' 32, D217-D222, 2004.

Loop modeling

Regions of the target sequence that are not aligned to a template are modeled by loop modeling; they are the most susceptible to major modeling errors and occur with higher frequency when the target and template have low sequence identity. The coordinates of unmatched sections determined by loop modeling programs are generally much less accurate than those obtained from simply copying the coordinates of a known structure, particularly if the loop is longer than 10 residues. The first two sidechain

s (χ₁ and χ₂) can usually be estimated within 30° for an accurate backbone structure; however, the later dihedral angles found in longer side chains such as

lysine Lysine (symbol Lys or K) is an α-amino acid that is a precursor to many proteins. It contains an α-amino group (which is in the protonated form under biological conditions), an α-carboxylic acid group (which is in the deprotonated − ...

and

arginine Arginine is the amino acid with the formula (H2N)(HN)CN(H)(CH2)3CH(NH2)CO2H. The molecule features a guanidino group appended to a standard amino acid framework. At physiological pH, the carboxylic acid is deprotonated (−CO2−) and both the am ...

are notoriously difficult to predict. Moreover, small errors in χ₁ (and, to a lesser extent, in χ₂) can cause relatively large errors in the positions of the atoms at the terminus of side chain; such atoms often have a functional importance, particularly when located near the

active site In biology and biochemistry, the active site is the region of an enzyme where substrate molecules bind and undergo a chemical reaction. The active site consists of amino acid residues that form temporary bonds with the substrate ( binding site) ...

Model assessment

A large number of methods have been developed for selecting a native-like structure from a set of models. Scoring functions have been based on both molecular mechanics energy functions (Lazaridis and Karplus 1999; Petrey and Honig 2000; Feig and Brooks 2002; Felts et al. 2002; Lee and Duan 2004), statistical potentials (Sippl 1995; Melo and Feytmans 1998; Samudrala and Moult 1998; Rojnuckarin and Subramaniam 1999; Lu and Skolnick 2001; Wallqvist et al. 2002; Zhou and Zhou 2002), residue environments (Luthy et al. 1992; Eisenberg et al. 1997; Park et al. 1997; Summa et al. 2005), local side-chain and backbone interactions (Fang and Shortle 2005), orientation-dependent properties (Buchete et al. 2004a,b; Hamelryck 2005), packing estimates (Berglund et al. 2004), solvation energy (Petrey and Honig 2000; McConkey et al. 2003; Wallner and Elofsson 2003; Berglund et al. 2004), hydrogen bonding (Kortemme et al. 2003), and geometric properties (Colovos and Yeates 1993; Kleywegt 2000; Lovell et al. 2003; Mihalek et al. 2003). A number of methods combine different potentials into a global score, usually using a linear combination of terms (Kortemme et al. 2003; Tosatto 2005), or with the help of machine learning techniques, such as neural networks (Wallner and Elofsson 2003) and support vector machines (SVM) (Eramian et al. 2006). Comparisons of different global model quality assessment programs can be found in recent papers by Pettitt et al. (2005), Tosatto (2005), and Eramian et al. (2006). Less work has been reported on the local quality assessment of models. Local scores are important in the context of modeling because they can give an estimate of the reliability of different regions of a predicted structure. This information can be used in turn to determine which regions should be refined, which should be considered for modeling by multiple templates, and which should be predicted ab initio. Information on local model quality could also be used to reduce the combinatorial problem when considering alternative alignments; for example, by scoring different local models separately, fewer models would have to be built (assuming that the interactions between the separate regions are negligible or can be estimated separately). One of the most widely used local scoring methods is Verify3D (Luthy et al. 1992; Eisenberg et al. 1997), which combines secondary structure, solvent accessibility, and polarity of residue environments. ProsaII (Sippl 1993), which is based on a combination of a pairwise statistical potential and a solvation term, is also applied extensively in model evaluation. Other methods include the Errat program (Colovos and Yeates 1993), which considers distributions of nonbonded atoms according to atom type and distance, and the energy strain method (Maiorov and Abagyan 1998), which uses differences from average residue energies in different environments to indicate which parts of a protein structure might be problematic. Melo and Feytmans (1998) use an atomic pairwise potential and a surface-based solvation potential (both knowledge-based) to evaluate protein structures. Apart from the energy strain method, which is a semiempirical approach based on the ECEPP3 force field (Nemethy et al. 1992), all of the local methods listed above are based on statistical potentials. A conceptually distinct approach is the ProQres method, which was very recently introduced by Wallner and Elofsson (2006). ProQres is based on a neural network that combines structural features to distinguish correct from incorrect regions. ProQres was shown to outperform earlier methodologies based on statistical approaches (Verify3D, ProsaII, and Errat). The data presented in Wallner and Elofsson's study suggests that their machine-learning approach based on structural features is indeed superior to statistics-based methods. However, the knowledge-based methods examined in their work, Verify3D (Luthy et al. 1992; Eisenberg et al. 1997), Prosa (Sippl 1993), and Errat (Colovos and Yeates 1993), are not based on newer statistical potentials.

Benchmarking

Several large-scale

benchmarking Benchmarking is the practice of comparing business processes and performance metrics to industry bests and best practices from other companies. Dimensions typically measured are quality, time and cost. Benchmarking is used to measure performanc ...

efforts have been made to assess the relative quality of various current homology modeling methods. CASP is a community-wide prediction experiment that runs every two years during the summer months and challenges prediction teams to submit structural models for a number of sequences whose structures have recently been solved experimentally but have not yet been published. Its partner

CAFASP CAFASP, or the Critical Assessment of Fully Automated Structure Prediction, is a large-scale blind experiment in protein structure prediction that studies the performance of automated structure prediction webservers in homology modeling, fold reco ...

has run in parallel with CASP but evaluates only models produced via fully automated servers. Continuously running experiments that do not have prediction 'seasons' focus mainly on benchmarking publicly available webservers.

LiveBench LiveBench is a continuously running benchmark project for assessing the quality of protein structure prediction and secondary structure prediction methods. LiveBench focuses mainly on homology modeling and protein threading but also includes secon ...

and EVA run continuously to assess participating servers' performance in prediction of imminently released structures from the PDB. CASP and CAFASP serve mainly as evaluations of the state of the art in modeling, while the continuous assessments seek to evaluate the model quality that would be obtained by a non-expert user employing publicly available tools.

Accuracy

The accuracy of the structures generated by homology modeling is highly dependent on the sequence identity between target and template. Above 50% sequence identity, models tend to be reliable, with only minor errors in

packing and rotameric state, and an overall RMSD between the modeled and the experimental structure falling around 1 Å. This error is comparable to the typical resolution of a structure solved by NMR. In the 30–50% identity range, errors can be more severe and are often located in loops. Below 30% identity, serious errors occur, sometimes resulting in the basic fold being mis-predicted. This low-identity region is often referred to as the "twilight zone" within which homology modeling is extremely difficult, and to which it is possibly less suited than

fold recognition Protein threading, also known as fold recognition, is a method of protein modeling which is used to model those proteins which have the same fold as proteins of known structures, but do not have homologous proteins with known structure. It diffe ...

methods. At high sequence identities, the primary source of error in homology modeling derives from the choice of the template or templates on which the model is based, while lower identities exhibit serious errors in sequence alignment that inhibit the production of high-quality models. It has been suggested that the major impediment to quality model production is inadequacies in sequence alignment, since "optimal"

structural alignment Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation. This process is usually applied to protein tertiary structures but can also be used for large R ...

s between two proteins of known structure can be used as input to current modeling methods to produce quite accurate reproductions of the original experimental structure. Attempts have been made to improve the accuracy of homology models built with existing methods by subjecting them to

molecular dynamics Molecular dynamics (MD) is a computer simulation method for analyzing the physical movements of atoms and molecules. The atoms and molecules are allowed to interact for a fixed period of time, giving a view of the dynamic "evolution" of th ...

simulation in an effort to improve their RMSD to the experimental structure. However, current force field parameterizations may not be sufficiently accurate for this task, since homology models used as starting structures for molecular dynamics tend to produce slightly worse structures. Slight improvements have been observed in cases where significant restraints were used during the simulation.

Sources of error

The two most common and large-scale sources of error in homology modeling are poor template selection and inaccuracies in target-template sequence alignment. Controlling for these two factors by using a

, or a sequence alignment produced on the basis of comparing two solved structures, dramatically reduces the errors in final models; these "gold standard" alignments can be used as input to current modeling methods to produce quite accurate reproductions of the original experimental structure. Results from the most recent CASP experiment suggest that "consensus" methods collecting the results of multiple fold recognition and multiple alignment searches increase the likelihood of identifying the correct template; similarly, the use of multiple templates in the model-building step may be worse than the use of the single correct template but better than the use of a single suboptimal one. Alignment errors may be minimized by the use of a multiple alignment even if only one template is used, and by the iterative refinement of local regions of low similarity. A lesser source of model errors are errors in the template structure. Th
PDBREPORT
database lists several million, mostly very small but occasionally dramatic, errors in experimental (template) structures that have been deposited in the PDB. Serious local errors can arise in homology models where an

insertion Insertion may refer to: * Insertion (anatomy), the point of a tendon or ligament onto the skeleton or other part of the body * Insertion (genetics), the addition of DNA into a genetic sequence *Insertion, several meanings in medicine, see ICD-10-PC ...

or deletion mutation or a gap in a solved structure result in a region of target sequence for which there is no corresponding template. This problem can be minimized by the use of multiple templates, but the method is complicated by the templates' differing local structures around the gap and by the likelihood that a missing region in one experimental structure is also missing in other structures of the same protein family. Missing regions are most common in loops where high local flexibility increases the difficulty of resolving the region by structure-determination methods. Although some guidance is provided even with a single template by the positioning of the ends of the missing region, the longer the gap, the more difficult it is to model. Loops of up to about 9 residues can be modeled with moderate accuracy in some cases if the local alignment is correct. Larger regions are often modeled individually using ab initio structure prediction techniques, although this approach has met with only isolated success.Kryshtafovych A, Venclovas C, Fidelis K, Moult J. (2005). Progress over the first decade of CASP experiments. ''Proteins'' 61(S7):225–36. The rotameric states of side chains and their internal packing arrangement also present difficulties in homology modeling, even in targets for which the backbone structure is relatively easy to predict. This is partly due to the fact that many side chains in crystal structures are not in their "optimal" rotameric state as a result of energetic factors in the hydrophobic core and in the packing of the individual molecules in a protein crystal. One method of addressing this problem requires searching a rotameric library to identify locally low-energy combinations of packing states. It has been suggested that a major reason that homology modeling so difficult when target-template sequence identity lies below 30% is that such proteins have broadly similar folds but widely divergent side chain packing arrangements.

Utility

Uses of the structural models include protein–protein interaction prediction,

protein–protein docking Macromolecular docking is the computational modelling of the quaternary structure of complexes formed by two or more interacting biological macromolecules. Protein–protein complexes are the most commonly attempted targets of such modelling, fol ...

molecular docking In the field of molecular modeling, docking is a method which predicts the preferred orientation of one molecule to a second when a ligand and a target are bound to each other to form a stable complex. Knowledge of the preferred orientation in ...

, and functional annotation of

gene In biology, the word gene (from , ; "...Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a b ...

s identified in an organism's

genome In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ...

. Even low-accuracy homology models can be useful for these purposes, because their inaccuracies tend to be located in the loops on the protein surface, which are normally more variable even between closely related proteins. The functional regions of the protein, especially its

, tend to be more highly conserved and thus more accurately modeled. Homology models can also be used to identify subtle differences between related proteins that have not all been solved structurally. For example, the method was used to identify

cation An ion () is an atom or molecule with a net electrical charge. The charge of an electron is considered to be negative by convention and this charge is equal and opposite to the charge of a proton, which is considered to be positive by conven ...

binding site In biochemistry and molecular biology, a binding site is a region on a macromolecule such as a protein that binds to another molecule with specificity. The binding partner of the macromolecule is often referred to as a ligand. Ligands may includ ...

s on the Na⁺/K⁺

ATPase ATPases (, Adenosine 5'-TriPhosphatase, adenylpyrophosphatase, ATP monophosphatase, triphosphatase, SV40 T-antigen, ATP hydrolase, complex V (mitochondrial electron transport), (Ca2+ + Mg2+)-ATPase, HCO3−-ATPase, adenosine triphosphatase) are ...

and to propose hypotheses about different ATPases' binding affinity. Used in conjunction with

simulations, homology models can also generate hypotheses about the kinetics and dynamics of a protein, as in studies of the ion selectivity of a

potassium Potassium is the chemical element with the symbol K (from Neo-Latin '' kalium'') and atomic number19. Potassium is a silvery-white metal that is soft enough to be cut with a knife with little force. Potassium metal reacts rapidly with atmos ...

channel. Large-scale automated modeling of all identified protein-coding regions in a

has been attempted for the

yeast Yeasts are eukaryotic, single-celled microorganisms classified as members of the fungus kingdom. The first yeast originated hundreds of millions of years ago, and at least 1,500 species are currently recognized. They are estimated to constit ...

Saccharomyces cerevisiae ''Saccharomyces cerevisiae'' () (brewer's yeast or baker's yeast) is a species of yeast (single-celled fungus microorganisms). The species has been instrumental in winemaking, baking, and brewing since ancient times. It is believed to have b ...

'', resulting in nearly 1000 quality models for proteins whose structures had not yet been determined at the time of the study, and identifying novel relationships between 236 yeast proteins and other previously solved structures.

References

{{Reflist, 30em Bioinformatics Protein methods Protein structure