The evaluation of binary classifiers compares two methods of assigning a binary attribute, one of which is usually a standard method and the other is being investigated. There are many metrics that can be used to measure the performance of a classifier or predictor; different fields have different preferences for specific metrics due to different goals. For example, in medicine
sensitivity and specificity
are often used, while in computer science precision and recall are preferred. An important distinction is between metrics that are independent of the
prevalence
(how often each category occurs in the population), and metrics that depend on the prevalence – both types are useful, but they have very different properties.


Contingency table

Given a data set, a classification (the output of a classifier on that set) gives two numbers: the number of positives and the number of negatives, which add up to the total size of the set. To evaluate a classifier, one compares its output to another reference classification – ideally a perfect classification, but in practice the output of another
gold standard
test – and cross tabulates the data into a 2×2 contingency table, comparing the two classifications. One then evaluates the classifier ''relative'' to the gold standard by computing
summary statistics of these 4 numbers. Generally these statistics will be
scale invariant
(scaling all the numbers by the same factor does not change the output), to make them independent of population size, which is achieved by using ratios of homogeneous functions, most simply homogeneous linear or homogeneous quadratic functions. Say we test some people for the presence of a disease. Some of these people have the disease, and our test correctly says they are positive. They are called ''
true positives'' (TP). Some have the disease, but the test incorrectly claims they don't. They are called ''
false negatives'' (FN). Some don't have the disease, and the test says they don't – ''
true negatives'' (TN). Finally, there might be healthy people who have a positive test result – ''
false positives'' (FP). These can be arranged into a 2×2 contingency table (
confusion matrix), with the actual condition on one axis and the test result on the other. These numbers can then be totaled, yielding both a grand total and
marginal totals. Totaling the entire table, the numbers of true positives, false negatives, true negatives, and false positives add up to 100% of the set. Totaling the columns (adding vertically), the numbers of true positives and false positives add up to 100% of the test positives, and likewise for negatives. Totaling the rows (adding horizontally), the numbers of true positives and false negatives add up to 100% of the condition positives (and conversely for negatives).

The basic marginal ratio statistics are obtained by dividing the 2×2 = 4 values in the table by the marginal totals (either rows or columns), yielding 2 auxiliary 2×2 tables, for a total of 8 ratios. These ratios come in 4 complementary pairs, each pair summing to 1, so each of these derived 2×2 tables can be summarized as a pair of 2 numbers together with their complements. Further statistics can be obtained by taking ratios of these ratios, ratios of those ratios, or more complicated functions. Note that the rows correspond to the ''condition'' actually being positive or negative (or classified as such by the gold standard), and the associated statistics are prevalence-independent, while the columns correspond to the ''test'' being positive or negative, and the associated statistics are prevalence-dependent. There are analogous likelihood ratios for predictive values, but these are less commonly used. The most common derived ratios are discussed in the sections that follow.
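
To make these marginal ratios concrete, here is a minimal Python sketch with made-up counts; the abbreviations FDR (false discovery rate) and FOR (false omission rate) are standard names for the complements of PPV and NPV, introduced here only for the illustration.

```python
# Hypothetical confusion-matrix counts, for illustration only.
TP, FN, FP, TN = 20, 10, 5, 65

condition_pos = TP + FN   # row totals (actual condition)
condition_neg = FP + TN
test_pos = TP + FP        # column totals (test outcome)
test_neg = FN + TN

# Row-normalised ratios (prevalence-independent): each pair sums to 1.
TPR, FNR = TP / condition_pos, FN / condition_pos   # sensitivity and miss rate
TNR, FPR = TN / condition_neg, FP / condition_neg   # specificity and fall-out

# Column-normalised ratios (prevalence-dependent): each pair sums to 1.
PPV, FDR = TP / test_pos, FP / test_pos             # precision and false discovery rate
NPV, FOR = TN / test_neg, FN / test_neg             # negative predictive value and false omission rate

print(TPR, FNR, TNR, FPR, PPV, FDR, NPV, FOR)
```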


Sensitivity and specificity

The fundamental prevalence-independent statistics are
sensitivity and specificity
. Sensitivity or
True Positive Rate
(TPR), also known as recall, is the proportion of people who tested positive and are positive (True Positives, TP) among all the people who actually are positive (Condition Positive, CP = TP + FN). It can be seen as ''the probability that the test is positive given that the patient is sick''. With higher sensitivity, fewer actual cases of disease go undetected (or, in a factory quality-control setting, fewer faulty products go to market). Specificity (SPC) or True Negative Rate (TNR) is the proportion of people who tested negative and are negative (True Negatives, TN) among all the people who actually are negative (Condition Negative, CN = TN + FP). As with sensitivity, it can be seen as ''the probability that the test result is negative given that the patient is not sick''. With higher specificity, fewer healthy people are labeled as sick (or, in the factory case, fewer good products are discarded). The relationship between sensitivity and specificity, as well as the performance of the classifier, can be visualized and studied using the
Receiver Operating Characteristic
(ROC) curve. In theory, sensitivity and specificity are independent in the sense that it is possible to achieve 100% in both (for instance, in a contrived example of sorting balls as red or blue by directly observing their colour). In more practical, less contrived instances, however, there is usually a trade-off, such that they are inversely related to one another to some extent. This is because we rarely measure the actual thing we would like to classify; rather, we generally measure an indicator of the thing we would like to classify, referred to as a surrogate marker. The reason 100% is achievable in the ball example is that redness and blueness are determined by directly detecting redness and blueness. However, indicators are sometimes compromised, such as when non-indicators mimic indicators or when indicators are time-dependent, only becoming evident after a certain lag time. The following example of a pregnancy test makes use of such an indicator. Modern pregnancy tests ''do not'' use the pregnancy itself to determine pregnancy status; rather,
human chorionic gonadotropin
, or hCG, present in the urine of
gravid
females, is used as a ''surrogate marker to indicate'' that a woman is pregnant. Because hCG can also be produced by a
tumor
, the specificity of modern pregnancy tests cannot be 100% (because false positives are possible). Also, because hCG is present in the urine in such small concentrations after fertilization and early embryogenesis, the sensitivity of modern pregnancy tests cannot be 100% (because false negatives are possible).
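
As a minimal computational sketch (the helper function, the 1/0 label encoding, and the example vectors are illustrative assumptions, not part of the article), sensitivity and specificity can be computed directly from paired actual and predicted labels:

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels where 1 = positive."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    sens = tp / (tp + fn)   # P(test positive | condition positive)
    spec = tn / (tn + fp)   # P(test negative | condition negative)
    return sens, spec

# Illustrative data: 1 = sick / test positive, 0 = healthy / test negative.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(sensitivity_specificity(actual, predicted))  # (0.75, 0.8333...)
```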


Likelihood ratios
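
As standardly defined (these are the textbook formulas, stated here for reference), the positive and negative likelihood ratios combine sensitivity and specificity into a pair of prevalence-independent summaries:
:: \text{LR}^{+} = \frac{\text{TPR}}{\text{FPR}} = \frac{\text{sensitivity}}{1 - \text{specificity}}, \qquad \text{LR}^{-} = \frac{\text{FNR}}{\text{TNR}} = \frac{1 - \text{sensitivity}}{\text{specificity}}
Their ratio, \text{LR}^{+}/\text{LR}^{-}, is the diagnostic odds ratio discussed under "Single metrics" below.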


Positive and negative predictive values

In addition to sensitivity and specificity, the performance of a binary classification test can be measured with positive predictive value (PPV), also known as
precision
, and negative predictive value (NPV). The positive predictive value answers the question "If the test result is ''positive'', how well does that ''predict'' an actual presence of disease?". It is calculated as TP/(TP + FP); that is, it is the proportion of true positives out of all positive results. The negative predictive value answers the corresponding question for negative results and is calculated as TN/(TN + FN).
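
A minimal sketch (the function name and the example counts are illustrative assumptions) of the two predictive values as proportions of the test-positive and test-negative columns:

```python
def predictive_values(tp, fp, tn, fn):
    """Positive and negative predictive values from confusion-matrix counts."""
    ppv = tp / (tp + fp)  # fraction of positive test results that are truly positive
    npv = tn / (tn + fn)  # fraction of negative test results that are truly negative
    return ppv, npv

print(predictive_values(tp=90, fp=10, tn=880, fn=20))  # hypothetical counts -> (0.9, ~0.978)
```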


Impact of prevalence on predictive values

Prevalence has a significant impact on predictive values. As an example, suppose there is a test for a disease with 99% sensitivity and 99% specificity. If 2000 people are tested and the prevalence (in the sample) is 50%, 1000 of them are sick and 1000 of them are healthy. Thus about 990 true positives and 990 true negatives are likely, with 10 false positives and 10 false negatives. The positive and negative predictive values would be 99%, so there can be high confidence in the result. However, if the prevalence is only 5%, so that of the 2000 people only 100 are really sick, then the predictive values change significantly. The likely result is 99 true positives, 1 false negative, 1881 true negatives and 19 false positives. Of the 19 + 99 = 118 people who tested positive, only 99 really have the disease – meaning, intuitively, that given a positive test result, there is only about an 84% chance that the patient really has the disease. On the other hand, given a negative test result, there is only 1 chance in 1882, or about a 0.05% probability, that the patient has the disease despite the test result.
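
The arithmetic in this example can be reproduced directly; the sketch below (a minimal illustration – only the 99% rates, the population of 2000, and the two prevalences are taken from the text) derives the expected counts and predictive values:

```python
def expected_counts(prevalence, sensitivity, specificity, n=2000):
    """Expected confusion-matrix counts and predictive values for a population of size n."""
    sick = prevalence * n
    healthy = n - sick
    tp = sensitivity * sick        # sick people correctly detected
    fn = sick - tp                 # sick people missed
    tn = specificity * healthy     # healthy people correctly cleared
    fp = healthy - tn              # healthy people falsely flagged
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return tp, fn, tn, fp, ppv, npv

for prev in (0.50, 0.05):
    tp, fn, tn, fp, ppv, npv = expected_counts(prev, 0.99, 0.99)
    print(f"prevalence {prev:.0%}: TP={tp:.0f} FN={fn:.0f} TN={tn:.0f} FP={fp:.0f} "
          f"PPV={ppv:.1%} P(disease | negative test)={1 - npv:.2%}")
# prevalence 50%: PPV = 99.0%; prevalence 5%: PPV ≈ 83.9%, P(disease | negative test) ≈ 0.05%
```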


Precision and recall

Precision and recall can be interpreted as (estimated) conditional probabilities: precision is given by P(C = P \mid \hat{C} = P) while recall is given by P(\hat{C} = P \mid C = P), where \hat{C} is the predicted class and C is the actual class (Thomas Roelleke, ''Information Retrieval Models'', ISBN 9783031023286, p. 76, https://www.google.de/books/edition/Information_Retrieval_Models/YX9yEAAAQBAJ?hl=de&gbpv=1&pg=PA76&printsec=frontcover). Both quantities are therefore connected by Bayes' theorem.
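
Spelled out, the Bayes'-theorem connection between the two quantities (using the notation above, with \hat{C} the predicted class and C the actual class) is:
:: P(C = P \mid \hat{C} = P) = \frac{P(\hat{C} = P \mid C = P)\, P(C = P)}{P(\hat{C} = P)}, \qquad \text{i.e. } \text{precision} = \frac{\text{recall} \cdot P(C = P)}{P(\hat{C} = P)}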


Relationships

There are various relationships between these ratios. If the prevalence, sensitivity, and specificity are known, the positive predictive value can be obtained from the following identity:
:: \text{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity}) \times (1 - \text{prevalence})}
If the prevalence, sensitivity, and specificity are known, the negative predictive value can be obtained from the following identity:
:: \text{NPV} = \frac{\text{specificity} \times (1 - \text{prevalence})}{\text{specificity} \times (1 - \text{prevalence}) + (1 - \text{sensitivity}) \times \text{prevalence}}
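
As a quick numerical sanity check of these identities (the confusion-matrix counts below are arbitrary), each predictive value can be computed both from the counts and from the identity, and the results compared:

```python
# Arbitrary confusion-matrix counts for a sanity check of the identities above.
tp, fn, fp, tn = 80, 20, 30, 870
total = tp + fn + fp + tn

prevalence  = (tp + fn) / total
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

ppv_counts = tp / (tp + fp)
npv_counts = tn / (tn + fn)

ppv_identity = (sensitivity * prevalence) / (
    sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
npv_identity = (specificity * (1 - prevalence)) / (
    specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)

assert abs(ppv_counts - ppv_identity) < 1e-12
assert abs(npv_counts - npv_identity) < 1e-12
```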


Single metrics

In addition to the paired metrics, there are also single metrics that give a single number to evaluate the test. Perhaps the simplest statistic is
accuracy
or ''fraction correct'' (FC), which measures the fraction of all instances that are correctly categorized; it is the ratio of the number of correct classifications to the total number of correct or incorrect classifications: (TP + TN)/total population = (TP + TN)/(TP + TN + FP + FN). As such, it compares estimates of
pre- and post-test probability
. This measure is
prevalence-dependent. If 90% of people with COVID symptoms don't have COVID, the prior probability P(−) is 0.9, and the simple rule "classify all such patients as COVID-free" would be 90% accurate. Diagnosis should do better than that. One can construct a one-proportion z-test with p0 = max(P(−), P(+)) to test whether a diagnostic method beats the simple rule of always predicting the most likely outcome; here the hypotheses are H0: p ≤ 0.9 versus Ha: p > 0.9, and H0 is rejected for large values of z (a short computational sketch appears at the end of this section). One diagnostic rule can be compared to another if the other's accuracy is known and substituted for p0 when calculating the z statistic; if it is not known and must be estimated from data, the two accuracies can instead be compared with a pooled two-proportion z-test of H0: p1 = p2. Less commonly used is the complementary statistic, the ''fraction incorrect'' (FiC): FC + FiC = 1, or (FP + FN)/(TP + TN + FP + FN) – this is the sum of the
antidiagonal
, divided by the total population. Cost-weighted fractions incorrect could compare expected costs of misclassification for different methods. The
diagnostic odds ratio
(DOR) can be a more useful overall metric, which can be defined directly as (TP×TN)/(FP×FN) = (TP/FN)/(FP/TN), or indirectly as a ratio of ratios of ratios (the ratio of the likelihood ratios, which are themselves ratios of true rates or of predictive values). This has a useful interpretation – as an
odds ratio
– and is prevalence-independent.
The likelihood ratio
is generally considered to be prevalence-independent and is easily interpreted as the multiplier to turn
prior probabilities
into
posterior probabilities
. Another useful single measure is "area under the ROC curve", AUC.
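
A sketch of the accuracy comparison described above – a one-proportion z-test of a classifier's accuracy against the baseline p0 = max(P(+), P(−)) – together with the diagnostic odds ratio; the counts and the 5% significance level are illustrative assumptions.

```python
import math

# Hypothetical confusion-matrix counts for illustration.
tp, fn, fp, tn = 60, 40, 50, 850
n = tp + fn + fp + tn

accuracy = (tp + tn) / n
p0 = max((tp + fn) / n, (fp + tn) / n)   # accuracy of always predicting the majority class

# One-proportion z-test of H0: p <= p0 versus Ha: p > p0.
z = (accuracy - p0) / math.sqrt(p0 * (1 - p0) / n)
print(f"accuracy={accuracy:.3f}, baseline p0={p0:.3f}, z={z:.2f}")
# Reject H0 for large z (e.g. z > 1.645 for a one-sided test at the 5% level).

# Diagnostic odds ratio: a prevalence-independent single-number summary.
dor = (tp * tn) / (fp * fn)
print(f"DOR={dor:.1f}")
```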


Alternative metrics

An
F-score
is a combination of the
precision
and the recall, providing a single score. There is a one-parameter family of statistics, with parameter ''β,'' which determines the relative weights of precision and recall. The traditional or balanced F-score (
F1 score
) is the harmonic mean of precision and recall: :F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. F-scores do not take the true negative rate into account and, therefore, are more suited to information retrieval and
information extraction
evaluation where the true negatives are innumerable. Instead, measures such as the
phi coefficient
,
Matthews correlation coefficient
,
informedness
or
Cohen's kappa
may be preferable to assess the performance of a binary classifier. As a
correlation coefficient
, the Matthews correlation coefficient is the geometric mean of the
regression coefficients of the problem and its dual. The component regression coefficients of the Matthews correlation coefficient are
markedness
(deltap) and informedness (
Youden's J statistic
or deltap').
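
For reference, a minimal sketch of the F-score family and the Matthews correlation coefficient computed from confusion-matrix counts; the counts and the choice of β = 2 are illustrative assumptions.

```python
import math

def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily, beta < 1 weights precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient (phi coefficient) of the 2x2 table."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

tp, fp, fn, tn = 45, 5, 15, 935   # hypothetical counts
print(f_beta(tp, fp, fn))             # balanced F1 (harmonic mean of precision and recall)
print(f_beta(tp, fp, fn, beta=2.0))   # recall-weighted F2
print(matthews_corrcoef(tp, fp, fn, tn))
```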


See also

* Population impact measures
* Attributable risk
* Attributable risk percent
* Scoring rule (for probability predictions)

