In machine learning and data mining, quantification (variously called "learning to quantify", "supervised prevalence estimation", or "class prior estimation") is the task of using supervised learning to train models ("quantifiers") that estimate the relative frequencies (also known as prevalence values) of the classes of interest in a sample of unlabelled data items. For instance, in a sample of 100,000 unlabelled tweets known to express opinions about a certain political candidate, a quantifier may be used to estimate the percentage of these tweets that belong to class 'Positive' (i.e., that manifest a positive stance towards this candidate), and to do the same for classes 'Neutral' and 'Negative'.

Quantification may also be viewed as the task of training predictors that estimate a (discrete) probability distribution, i.e., that generate a predicted distribution that approximates the unknown true distribution of the items across the classes of interest. Quantification is different from classification, since the goal of classification is to predict the class labels of individual data items, while the goal of quantification is to predict the class prevalence values of sets of data items. Quantification is also different from regression, since in regression the training data items have real-valued labels, while in quantification the training data items have class labels.

It has been shown in multiple research works that performing quantification by classifying all unlabelled instances and then counting the instances that have been attributed to each class (the 'classify and count' method) usually leads to suboptimal quantification accuracy. This suboptimality may be seen as a direct consequence of 'Vapnik's principle', which states that, when solving a problem, one should try to solve it directly and never solve a more general problem as an intermediate step. In our case, the problem to be solved directly is quantification, while the more general intermediate problem is classification. As a result of the suboptimality of the 'classify and count' method, quantification has evolved as a task in its own right, different (in goals, methods, techniques, and evaluation measures) from classification.
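As a minimal sketch of the 'classify and count' baseline described above, the following Python function turns the labels predicted by some classifier into prevalence estimates. The function name and the example labels are hypothetical and chosen only for illustration; the classifier itself is assumed to exist elsewhere.

```python
from collections import Counter

def classify_and_count(predicted_labels, classes):
    """Estimate the prevalence of each class as the fraction of items
    that a classifier assigned to it (the 'classify and count' baseline)."""
    if not predicted_labels:
        raise ValueError("cannot estimate prevalence from an empty sample")
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    # Classes that were never predicted get an estimated prevalence of 0.
    return {c: counts.get(c, 0) / total for c in classes}

# Hypothetical classifier output for four unlabelled tweets.
labels = ["Positive", "Negative", "Positive", "Neutral"]
estimate = classify_and_count(labels, ["Positive", "Neutral", "Negative"])
# estimate == {'Positive': 0.5, 'Neutral': 0.25, 'Negative': 0.25}
```

Because the estimate inherits every systematic error of the classifier, it tends to be biased whenever the classifier's false positives and false negatives do not cancel out.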

Quantification tasks

The main variants of quantification, according to the characteristics of the set of classes used, are:

* Binary quantification, corresponding to the case in which there are only n=2 classes and each data item belongs to exactly one of them;
* Single-label multiclass quantification, corresponding to the case with n>2 classes and each data item belongs to exactly one of them;
* Ordinal quantification, corresponding to the single-label multiclass case in which a total order is defined on the set of classes.

Most known quantification methods address the binary case or the single-label multiclass case, and only a few of them address the ordinal case. Binary-only methods include the Mixture Model (MM) method, the HDy method, SVM(KLD), and SVM(Q). Methods that can deal with both the binary case and the single-label multiclass case include probabilistic classify and count (PCC), adjusted classify and count (ACC), probabilistic adjusted classify and count (PACC), and the Saerens-Latinne-Decaestecker EM-based method (SLD). Methods for the ordinal case include the Ordinal Quantification Tree (OQT) and ordinal versions of the above-mentioned ACC, PACC, and SLD methods.
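To make two of the methods named above more concrete, here is a rough sketch of ACC and PCC in the binary case, as they are commonly formulated. It assumes that the classifier's true positive rate (tpr) and false positive rate (fpr) have already been estimated on held-out labelled data, and that posterior probabilities for the positive class are available. The function names are illustrative and do not correspond to any particular library.

```python
def adjusted_classify_and_count(cc_estimate, tpr, fpr):
    """Binary ACC: correct a 'classify and count' estimate of the positive
    prevalence using the classifier's estimated tpr and fpr.

    The expected CC estimate is tpr * p + fpr * (1 - p) for true prevalence p;
    solving for p gives (cc - fpr) / (tpr - fpr).
    """
    if tpr == fpr:
        raise ValueError("tpr and fpr must differ for the correction to be defined")
    adjusted = (cc_estimate - fpr) / (tpr - fpr)
    # The raw correction can fall outside [0, 1], so clip it.
    return min(1.0, max(0.0, adjusted))

def probabilistic_classify_and_count(positive_posteriors):
    """Binary PCC: average the posterior probabilities of the positive class
    instead of counting hard decisions."""
    if not positive_posteriors:
        raise ValueError("cannot estimate prevalence from an empty sample")
    return sum(positive_posteriors) / len(positive_posteriors)

# Illustrative numbers (assumed): a classifier with tpr=0.8 and fpr=0.1,
# applied to a sample whose true positive prevalence is 0.1, is expected
# to yield a CC estimate of 0.8 * 0.1 + 0.1 * 0.9 = 0.17.
cc = 0.8 * 0.1 + 0.1 * 0.9
acc = adjusted_classify_and_count(cc, tpr=0.8, fpr=0.1)
```

In this example the adjustment recovers a value close to the true prevalence of 0.1, whereas the unadjusted estimate overshoots it. In practice tpr and fpr are themselves only estimates, so the correction is not exact.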

Evaluation measures for quantification

Several evaluation measures can be used for evaluating the error of a quantification method. Since quantification consists of generating a predicted probability distribution that estimates a true probability distribution, these evaluation measures are ones that compare two probability distributions. Most evaluation measures for quantification belong to the class of divergences.

Evaluation measures for binary quantification and single-label multiclass quantification are

* Absolute Error
* Squared Error
* Relative Absolute Error
* Kullback-Leibler divergence
* Pearson Divergence

Evaluation measures for ordinal quantification are

* Normalized Match Distance (a particular case of the Earth Mover's Distance)
* Root Normalized Order-Aware Distance
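The sketch below shows one common way to compute several of these measures for a true and a predicted distribution given as equal-length lists of prevalence values. It is illustrative only: the function names are hypothetical, the absolute, squared, and relative absolute errors are averaged over the classes, and the optional additive smoothing (parameter eps) is an assumption borrowed from common practice, needed because Relative Absolute Error and Kullback-Leibler divergence are undefined when a prevalence value is zero. The Normalized Match Distance assumes the classes are listed in their total order, with unit distance between consecutive classes.

```python
import math

def smooth(dist, eps):
    """Additively smooth a distribution; with eps=0 it is returned unchanged."""
    n = len(dist)
    return [(eps + p) / (eps * n + 1.0) for p in dist]

def absolute_error(true, pred):
    """Mean over classes of |predicted - true| prevalence."""
    return sum(abs(q - p) for p, q in zip(true, pred)) / len(true)

def squared_error(true, pred):
    """Mean over classes of (predicted - true)^2."""
    return sum((q - p) ** 2 for p, q in zip(true, pred)) / len(true)

def relative_absolute_error(true, pred, eps=0.0):
    """Mean over classes of |predicted - true| / true.
    Raises ZeroDivisionError if a true prevalence is 0 and eps is 0."""
    t, q = smooth(true, eps), smooth(pred, eps)
    return sum(abs(qi - ti) / ti for ti, qi in zip(t, q)) / len(t)

def kl_divergence(true, pred, eps=0.0):
    """Kullback-Leibler divergence of the predicted from the true distribution.
    Terms with zero true prevalence contribute 0 by convention."""
    t, q = smooth(true, eps), smooth(pred, eps)
    return sum(ti * math.log(ti / qi) for ti, qi in zip(t, q) if ti > 0)

def normalized_match_distance(true, pred):
    """Sum of absolute differences between the two cumulative distributions
    over the first n-1 ordered classes, divided by n-1."""
    n = len(true)
    cum_true = cum_pred = total = 0.0
    for ti, qi in list(zip(true, pred))[:-1]:
        cum_true += ti
        cum_pred += qi
        total += abs(cum_pred - cum_true)
    return total / (n - 1)

# Hypothetical three-class example.
true_prev = [0.5, 0.3, 0.2]
pred_prev = [0.4, 0.4, 0.2]
ae = absolute_error(true_prev, pred_prev)
kld = kl_divergence(true_prev, pred_prev)
```

All of these measures are 0 when the predicted distribution equals the true one and grow as the two distributions diverge.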

Applications

Quantification is of special interest in fields such as the social sciences, epidemiology, market research, and ecological modelling, since these fields are inherently concerned with aggregate data. However, quantification is also useful in applications outside these fields, such as measuring classifier bias and enforcing classifier fairness, performing word sense disambiguation, allocating resources, and improving the accuracy of classifiers.

Resources

* LQ 2021: the 1st International Workshop on Learning to Quantify
* LQ 2022: the 2nd International Workshop on Learning to Quantify
* LeQua 2022: A machine learning competition on Learning to Quantify
* QuaPy: An open-source Python-based software library for quantification
