Quantitative structure–activity relationship
Activity = f(physicochemical properties and/or structural properties) + error
The error includes model error (bias) and observational variability, that is, the variability in observations even on a correct model.
Essential steps in QSAR studies
The principal steps of QSAR/QSPR studies include (i) selection of a
data set and extraction of structural/empirical descriptors, (ii)
variable selection, (iii) model construction, and (iv) validation.
SAR and the SAR paradox
The basic assumption for all molecule-based hypotheses is that similar
molecules have similar activities. This principle is also called
Structure–Activity Relationship (SAR). The underlying problem is
therefore how to define a small difference on a molecular level, since
each kind of activity, e.g. reaction ability, biotransformation
ability, solubility, target activity, and so on, might depend on
a different kind of difference. Good examples were given in the
bioisosterism reviews by Patani/LaVoie and Brown.
In general, one is more interested in finding strong trends.
Hypotheses are usually built from a finite amount of chemical data.
Thus, the induction principle should be respected to avoid overfitted
hypotheses
and deriving overfitted and useless interpretations on
The SAR paradox refers to the fact that it is not the case that all
similar molecules have similar activities.
Fragment based (group contribution)
Analogously, the "partition coefficient"—a measurement of
differential solubility and itself a component of QSAR
predictions—can be predicted either by atomic methods (known as
"XLogP" or "ALogP") or by chemical fragment methods (known as "CLogP"
and other variations). It has been shown that the logP of a compound
can be determined by the sum of its fragments; fragment-based methods
are generally accepted as better predictors than atomic-based methods.
Fragmentary values have been determined statistically, based on
empirical data for known logP values. This method gives mixed results
and is generally not trusted to have an accuracy better than ±0.1
units.
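The additive idea behind fragment methods can be sketched as follows.
This is only a minimal illustration: the fragment names and the
contribution values in `HYPOTHETICAL_FRAGMENT_VALUES` are made up for
the example and are not taken from CLogP or any published parameter set.

```python
# Group-contribution sketch: logP is estimated as the sum of per-fragment
# contributions, each weighted by how often the fragment occurs.
# NOTE: the values below are hypothetical placeholders, not fitted parameters.
HYPOTHETICAL_FRAGMENT_VALUES = {
    "CH3": 0.50,
    "CH2": 0.45,
    "OH": -1.00,
}


def estimate_logp(fragment_counts, fragment_values=HYPOTHETICAL_FRAGMENT_VALUES):
    """Return sum(count * contribution) over the fragments of one molecule."""
    missing = set(fragment_counts) - set(fragment_values)
    if missing:
        raise KeyError(f"no contribution value for fragments: {sorted(missing)}")
    return sum(count * fragment_values[name]
               for name, count in fragment_counts.items())


# 1-butanol decomposed as CH3-(CH2)3-OH
butanol = {"CH3": 1, "CH2": 3, "OH": 1}
print(round(estimate_logp(butanol), 2))
```

In a real fragment method the contribution table would be fitted
statistically to measured logP values, and correction terms are often
added on top of the plain sum.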
Group- or fragment-based QSAR is also known as GQSAR. GQSAR allows
flexibility to study various molecular fragments of interest in
relation to the variation in biological response. The molecular
fragments could be substituents at various substitution sites in
a congeneric set of molecules, or could be defined by pre-defined
chemical rules in the case of non-congeneric sets. GQSAR also considers
cross-term fragment descriptors, which could be helpful in
identifying key fragment interactions that determine variation
in activity. Lead discovery using Fragnomics is an emerging
paradigm. In this context, FB-QSAR proves to be a promising strategy
for fragment library design and fragment-to-lead identification.
An advanced approach to fragment- or group-based QSAR, based on the
concept of pharmacophore similarity, has also been developed. This
method, pharmacophore-similarity-based QSAR (PS-QSAR), uses topological
pharmacophoric descriptors to develop QSAR models. The resulting
activity predictions may help assess the contribution of certain
pharmacophore features, encoded by the respective fragments, toward
activity improvement and/or detrimental effects.
3D-QSAR
The acronym 3D-QSAR or 3-D QSAR refers to the application of force
field calculations requiring three-dimensional structures of a given
set of small molecules with known activities (training set). The
training set needs to be superimposed (aligned) by either experimental
data (e.g. based on ligand-protein crystallography) or molecule
superimposition software. It uses computed potentials, e.g. the
Lennard-Jones potential, rather than experimental constants and is
concerned with the overall molecule rather than a single substituent.
The first 3-D QSAR was named Comparative Molecular Field Analysis
(CoMFA) by Cramer et al. It examined the steric fields (shape of the
molecule) and the electrostatic fields which were correlated by
means of partial least squares regression (PLS).
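The steric part of such a field calculation can be sketched as below.
This is a simplified illustration, not the CoMFA implementation: the
probe parameters `epsilon` and `sigma`, the atom positions and the grid
are all hypothetical, and only a 12-6 Lennard-Jones term is evaluated.

```python
import math


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy at separation r (same length unit as sigma)."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)


def steric_field(atoms, grid_points, epsilon=0.1, sigma=3.4):
    """Sum the probe-atom Lennard-Jones energy at every grid point.

    atoms and grid_points are lists of (x, y, z) tuples. epsilon and sigma
    are hypothetical probe parameters chosen only for illustration.
    A grid point that coincides with an atom would divide by zero.
    """
    field = []
    for point in grid_points:
        energy = 0.0
        for atom in atoms:
            energy += lennard_jones(math.dist(point, atom), epsilon, sigma)
        field.append(energy)
    return field


# Two aligned "atoms" and a three-point grid along the x axis (made-up data).
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
grid = [(4.0, 0.0, 0.0), (6.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
print(steric_field(atoms, grid))
```

Each aligned molecule in the training set yields one such vector of
field values, and these vectors form the descriptor matrix that is
then correlated with activity, for example by PLS.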
The created data space is then usually reduced by a subsequent feature
extraction step (see also dimensionality reduction). The subsequent
learning method can be any of the already mentioned machine learning
methods, e.g. support vector machines. An alternative approach uses
multiple-instance learning by encoding molecules as sets of data
instances, each of which represents a possible molecular conformation.
A label or response is assigned to each set corresponding to the
activity of the molecule, which is assumed to be determined by at
least one instance in the set (i.e. some conformation of the
molecule).
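The multiple-instance assumption described above can be sketched in a
few lines. The instance-level scorer `toy_score` and the two-value
conformation descriptors are hypothetical and stand in for whatever
trained model and descriptors a real study would use.

```python
def predict_molecule(conformations, score_conformation):
    """Bag-level prediction under the multiple-instance assumption:
    the molecule is taken to be as active as its best-scoring conformation."""
    if not conformations:
        raise ValueError("a molecule needs at least one conformation")
    return max(score_conformation(c) for c in conformations)


# Hypothetical instance-level model: a fixed linear score over two
# made-up conformation descriptors.
def toy_score(descriptor):
    return 0.8 * descriptor[0] - 0.3 * descriptor[1]


molecule = [(1.0, 2.0), (2.0, 1.0), (0.5, 0.5)]  # three conformations
print(predict_molecule(molecule, toy_score))
```

Taking the maximum over instances is what lets a single active
conformation determine the label of the whole set.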
As of June 18, 2011, the Comparative Molecular Field Analysis (CoMFA)
patent no longer places any restriction on the use of GRID and partial
least-squares (PLS) technologies.
Chemical descriptor based
In this approach, descriptors quantifying various electronic,
geometric, or steric properties of a molecule are computed and used to
develop a QSAR. This approach is different from the fragment (or
group contribution) approach in that the descriptors are computed for
the system as a whole rather than from the properties of individual
fragments. This approach is different from the 3D-QSAR approach in
that the descriptors are computed from scalar quantities (e.g.,
energies, geometric parameters) rather than from 3D fields.
An example of this approach is the QSARs developed for olefin
polymerization by half-sandwich compounds.
In the literature, chemists often show a preference for partial
least squares (PLS) methods, since these apply feature extraction
and induction in one step.
Matched molecular pair analysis
Typically, QSAR models derived from non-linear machine learning are seen as a "black box" that fails to guide medicinal chemists. A relatively new concept, matched molecular pair analysis (MMPA) or prediction-driven MMPA, is coupled with a QSAR model in order to identify activity cliffs.
Evaluation of the quality of QSAR models
QSAR modeling produces predictive models by applying statistical tools that correlate biological activity (including desirable therapeutic effects and undesirable side effects) or, in QSPR models, physico-chemical properties of chemicals (drugs, toxicants, environmental pollutants) with descriptors representative of molecular structure or properties. QSARs are applied in many disciplines, for example risk assessment, toxicity prediction, and regulatory decisions, in addition to drug discovery and lead optimization. Obtaining a good-quality QSAR model depends on many factors, such as the quality of the input data, the choice of descriptors, and the statistical methods used for modeling and for validation. Any QSAR modeling should ultimately lead to statistically robust and predictive models capable of making accurate and reliable predictions of the modeled response of new compounds.
For validation of QSAR models, various strategies are usually adopted: (i) internal validation or cross-validation (cross-validation measures model robustness: the more robust the model, i.e. the higher its q2, the less leaving data out perturbs the original model); (ii) external validation by splitting the available data set into a training set for model development and a prediction set for checking model predictivity; (iii) blind external validation by applying the model to new external data; and (iv) data randomization or Y-scrambling to verify the absence of chance correlation between the response and the modeling descriptors.
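Two of these strategies, leave-one-out cross-validation and
Y-scrambling, can be sketched for a one-descriptor linear model. The
data below are synthetic and made up for illustration, and q2 is
computed here as 1 - PRESS / SS_total around the mean of all
responses, which is one common convention among several.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loo_q2(xs, ys):
    """Leave-one-out q2 = 1 - PRESS / SS_total."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Refit without compound i, then predict compound i.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total


def y_scrambled_q2(xs, ys, n_rounds=20, seed=0):
    """Mean q2 after randomly permuting the responses n_rounds times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        scores.append(loo_q2(xs, shuffled))
    return sum(scores) / len(scores)


# Synthetic, made-up data: one descriptor and a nearly linear response.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

print("q2 (real responses):     ", round(loo_q2(descriptor, activity), 3))
print("q2 (scrambled, averaged):", round(y_scrambled_q2(descriptor, activity), 3))
```

A model that captures a real relationship should score much higher on
the real responses than on the scrambled ones; similar scores would
point to chance correlation.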
The success of any QSAR model depends on accuracy of the input data,
selection of appropriate descriptors and statistical tools, and most
importantly validation of the developed model. Validation is the
process by which the reliability and relevance of a procedure are
established for a specific purpose; for QSAR models, validation must
mainly address robustness, predictive performance, and the
applicability domain (AD) of the models.
Some validation methodologies can be problematic. For example,
leave-one-out cross-validation generally leads to an overestimation of
predictive capacity. Even with external validation, it is difficult to
determine whether the selection of training and test sets was
manipulated to maximize the predictive capacity of the model being
Different aspects of the validation of QSAR models that need attention
include methods for selecting training set compounds, setting the
training set size, and the impact of variable selection for
training set models on the quality of prediction.
Development of novel validation parameters for judging quality of QSAR
models is also important.
One of the first historical QSAR applications was to predict boiling
points.
It is well known, for instance, that within a particular family of
chemical compounds, especially in organic chemistry, there are
strong correlations between structure and observed properties. A
simple example is the relationship between the number of carbons in
alkanes and their boiling points. There is a clear trend in the
increase of boiling point with an increase in the number of carbons,
and this serves as a means for predicting the boiling points of higher
alkanes.
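The alkane trend can be illustrated with a straight-line fit. This is
a deliberately crude sketch: it uses rounded normal boiling points of
the n-alkanes from butane to octane and assumes a linear relationship,
which is only approximately true.

```python
# Normal boiling points of n-alkanes in degrees Celsius, rounded;
# keys are the number of carbon atoms (butane through octane).
BOILING_POINT_C = {4: -0.5, 5: 36.1, 6: 68.7, 7: 98.4, 8: 125.7}


def fit_trend(data):
    """Least-squares line through (carbon count, boiling point) pairs."""
    xs = list(data.keys())
    ys = list(data.values())
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_trend(BOILING_POINT_C)
predicted_nonane = intercept + slope * 9  # extrapolate to nine carbons
print(f"slope: {slope:.1f} degrees C per carbon")
print(f"predicted boiling point of nonane: {predicted_nonane:.0f} degrees C")
```

The extrapolation gives roughly 160 °C for nonane, somewhat above the
measured value of about 151 °C, because the increase per added carbon
shrinks along the series. A non-linear descriptor would fit better.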
Other applications that remain of interest are the Hammett equation,
the Taft equation, and pKa prediction methods.
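As an illustration of the Hammett approach, the equation
log10(K/K0) = rho * sigma can be rearranged to pKa = pKa0 - rho * sigma.
The sketch below applies it to substituted benzoic acids in water,
the reference reaction for which rho is 1 by definition. The sigma
constants and the parent pKa are approximate, commonly tabulated
values, and the function name `predict_pka` is chosen for this example.

```python
# Approximate, commonly tabulated Hammett substituent constants (sigma).
SIGMA = {
    "H": 0.00,
    "p-NO2": 0.78,
    "m-Cl": 0.37,
    "p-CH3": -0.17,
    "p-OCH3": -0.27,
}

PKA_BENZOIC_ACID = 4.20  # approximate pKa of unsubstituted benzoic acid in water
RHO_BENZOIC_ACID = 1.0   # reaction constant; 1 by definition for this reaction


def predict_pka(substituent, pka_parent=PKA_BENZOIC_ACID, rho=RHO_BENZOIC_ACID):
    """Hammett estimate: pKa = pKa(parent) - rho * sigma(substituent)."""
    return pka_parent - rho * SIGMA[substituent]


for name in SIGMA:
    print(f"{name:7s} predicted pKa = {predict_pka(name):.2f}")
```

Electron-withdrawing groups (positive sigma) lower the predicted pKa,
while electron-donating groups (negative sigma) raise it. For other
reaction series, rho has to be fitted to experimental data.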
The biological activity of molecules is usually measured in assays to
establish the level of inhibition of particular signal transduction or
S.No. | Name | Algorithms | External link
1. | R | RF, SVM, Naïve Bayesian, and ANN | "R: The R Project for Statistical Computing".
2. | libSVM | SVM | "LIBSVM -- A Library for Support Vector Machines".
3. | Orange | RF, SVM, and Naïve Bayesian | "Orange Data Mining".
4. | RapidMiner | SVM, RF, Naïve Bayes, DT, ANN, and k-NN | "RapidMiner #1 Open Source Predictive Analytics Platform".
5. | Weka | RF, SVM, and Naïve Bayes | "Weka 3 - Data Mining with Open Source Machine
6. | KNIME | DT, Naïve Bayes, and SVM | "KNIME Open for Innovation".
7. | AZOrange | RT, SVM, ANN, and RF | "AZCompTox/AZOrange: AstraZeneca add-ons to Orange". GitHub.
8. | Tanagra | SVM, RF, Naïve Bayes, and DT | "TANAGRA - A free DATA MINING software for teaching and research".
9. | ELKI | k-NN | "ELKI Data Mining Framework".
 | MOA | | "MOA Massive Online Analysis Real Time Analytics for Data Streams".
12. | DeepChem | Logistic Regression, Naive Bayes, RF, ANN, and others | "DeepChem". deepchem.io. Retrieved 20 October 2017.
Matched molecular pair analysis
Computer-assisted drug design
^ Nantasenamat C, Isarankura-Na-Ayudhya C, Naenna T, Prachayasittikul
V (2009). "A practical overview of quantitative structure-activity
relationship". Excli J. 8: 74–88.
^ Nantasenamat C, Isarankura-Na-Ayudhya C, Prachayasittikul V (Jul
2010). "Advances in computational methods to predict the biological
activity of compounds". Expert Opinion on Drug Discovery. 5 (7):
633–54. doi:10.1517/17460441.2010.492827. PMID 22823204.
^ a b Yousefinejad S, Hemmateenejad B (2015). "Chemometrics tools in
QSAR/QSPR studies: A historical perspective". Chemometrics and
Intelligent Laboratory Systems. 149, Part B: 177–204.
^ a b Tropsha A, Gramatica P, Gombar VJ (2003). "The Importance of
Being Earnest: Validation is the Absolute Essential for Successful
Application and Interpretation of QSPR Models". QSAR & Comb. Sci.
22: 69–77. doi:10.1002/qsar.200390007.
^ a b Gramatica P (2007). "Principles of QSAR models validation:
internal and external". QSAR & Comb. Sci. 26: 694–701.
^ a b c Chirico N, Gramatica P (Aug 2012). "Real external predictivity
of QSAR models. Part 2. New intercomparable thresholds for different
validation criteria and the need for scatter plot inspection". Journal
of Chemical Information and Modeling. 52 (8): 2044–58.
doi:10.1021/ci300084j. PMID 22721530.
^ Tropsha, Alexander (2010). "Best Practices for QSAR Model
Development, Validation, and Exploitation". Molecular Informatics. 29
(6-7): 476–488. doi:10.1002/minf.201000061.
^ Patani GA, LaVoie EJ (Dec 1996). "Bioisosterism: A Rational Approach
in Drug Design". Chemical Reviews. 96 (8): 3147–3176.
doi:10.1021/cr950066q. PMID 11848856.
^ Brown N (2012). Bioisosteres in Medicinal Chemistry. Weinheim:
Wiley-VCH. ISBN 978-3-527-33015-7.
^ Thompson SJ, Hattotuwagama CK, Holliday JD, Flower DR (2006). "On
the hydrophobicity of peptides: Comparing empirical predictions of
peptide log P values". Bioinformation. 1 (7): 237–41.
doi:10.6026/97320630001237. PMC 1891704.
^ Wildman SA, Crippen GM (1999). "Prediction of physicochemical
parameters by atomic contributions". J. Chem. Inf. Comput. Sci. 39
(5): 868–873. doi:10.1021/ci990307l.
^ a b Ajmani S, Jadhav K, Kulkarni SA. "Group-Based QSAR
^ Manoharan P, Vijayan RS, Ghoshal N (Oct 2010). "Rationalizing
fragment based drug discovery for BACE1: insights from FB-QSAR,
FB-QSSR, multi objective (MO-QSPR) and MIF studies". Journal of
Computer-Aided Molecular Design. 24 (10): 843–64.
^ a b Prasanth Kumar S, Jasrai YT, Pandya HA, Rawal RM (November
2013). "Pharmacophore-similarity-based QSAR (PS-QSAR) for
group-specific biological activity predictions". Journal of
Biomolecular Structure & Dynamics. 33 (1): 56–69.
doi:10.1080/07391102.2013.849618. PMID 24266725.
^ Leach AR (2001). Molecular modelling: principles and applications.
Englewood Cliffs, N.J: Prentice Hall. ISBN 0-582-38210-6.
^ Vert JP, Schölkopf B, Tsuda K (2004). Kernel methods in
computational biology. Cambridge, Mass: MIT Press.
^ Dietterich TG, Lathrop RH, Lozano-Pérez T (1997). "Solving the
multiple instance problem with axis-parallel rectangles". Artificial
Intelligence. 89 (1–2): 31–71.
^ Caruthers JM, Lauterbach JA, Thomson KT, Venkatasubramanian V,
Snively CM, Bhan A, Katare S, Oskarsdottir G (2003). "Catalyst design:
knowledge extraction from high-throughput experimentation". J. Catal.
216: 3776–3777. doi:10.1016/S0021-9517(02)00036-2.
^ Manz TA, Phomphrai K, Medvedev G, Krishnamurthy BB, Sharma S, Haq J,
Novstrup KA, Thomson KT, Delgass WN, Caruthers JM, Abu-Omar MM (Apr
2007). "Structure-activity correlation in titanium single-site olefin
polymerization catalysts containing mixed cyclopentadienyl/aryloxide
ligation". Journal of the American Chemical Society. 129 (13):
3776–7. doi:10.1021/ja0640849. PMID 17348648.
^ Manz TA, Caruthers JM, Sharma S, Phomphrai K, Thomson KT, Delgass
WN, Abu-Omar MM (2012). "Structure–Activity
Selassie CD (2003). "History of Quantitative Structure-Activity Relationships" (PDF). In Abraham DJ. Burger's Medicinal Chemistry and Drug Discovery. 1 (6th ed.). New York: Wiley. pp. 1–48. ISBN 0-471-27401-1.
Shityakov S, Puskás I, Roewer N, Förster C, Broscheit J (2014). "Three-dimensional quantitative structure-activity relationship and docking studies in a series of anthocyanin derivatives as cytochrome P450 3A4 inhibitors". Advances and Applications in Bioinformatics and Chemistry. 7: 11–21. doi:10.2147/AABC.S56478. PMC 3970920. PMID 24741320.
"The Cheminformatics and QSAR Society". Retrieved 2009-05-11.
"The 3D QSAR Server". Retrieved 2011-06-18.
"Nature Protocols: Development of QSAR models using C-QSAR program". Nature Protocols. doi:10.1038/nprot.2007.125. Retrieved 2009-05-11. A regression program that has dual databases of over 21,000 QSAR models
"QSAR World". Archived from the original on 2009-04-25. Retrieved 2009-05-11. A comprehensive web resource for QSAR modelers
Chemoinformatics Tools, Drug Theoretics and Cheminformatics Laboratory
Multiscale Conceptual Model Figures for QSARs in Biological and Environmental Science