Applicability Domain
   HOME

TheInfoList



OR:

The applicability domain (AD) (for both
chemistry Chemistry is the science, scientific study of the properties and behavior of matter. It is a natural science that covers the Chemical element, elements that make up matter to the chemical compound, compounds made of atoms, molecules and ions ...
and
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
) of a QSAR model is the physico-chemical, structural or biological space, knowledge or information on which the
training set In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from ...
of the model has been developed, and for which it is applicable to make predictions for new compounds. The purpose of AD is to state whether the model's assumptions are met, and for which chemicals the model can be reliably applicable. In general, this is the case for interpolation rather than for
extrapolation In mathematics, extrapolation is a type of estimation, beyond the original observation range, of the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between know ...
. Up to now there is no single generally accepted algorithm for determining the AD: a comprehensive survey can be found in a Report and Recommendations of ECVAM Workshop 52. There exists a rather systematic approach for defining interpolation regions. The process involves the removal of outliers and a probability density distribution method using kernel-weighted sampling. Another widely used approach for the structural AD of the regression QSAR models is based on the leverage calculated from the diagonal values of the hat matrix of the modeling molecular descriptors. A recent rigorous benchmarking study of several AD algorithms identified standard-deviation of model predictions as the most reliable approach.Tetko IV, Sushko I, Pandey AK, Zhu H, Tropsha A, Papa E, Oberg T, Todeschini R, Fourches D, Varnek A. Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection. J Chem Inf Model. 2008 Sep;48(9):1733-46. To investigate the AD of a training set of chemicals one can directly analyse properties of the
multivariate Multivariate may refer to: In mathematics * Multivariable calculus * Multivariate function * Multivariate polynomial In computing * Multivariate cryptography * Multivariate division algorithm * Multivariate interpolation * Multivariate optical c ...
descriptor space of the training compounds or more indirectly via
distance Distance is a numerical or occasionally qualitative measurement of how far apart objects or points are. In physics or everyday usage, distance may refer to a physical length or an estimation based on other criteria (e.g. "two counties over"). ...
(or similarity) metrics. When using distance metrics care should be taken to use an orthogonal and significant vector space. This can be achieved by different means of feature selection and successive
principal components analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
.


Notes

Cheminformatics Medicinal chemistry Drug discovery {{medicinal-chem-stub