
Evaluation of a binary classifier typically assigns a numerical value, or values, to a classifier that represent its accuracy. An example is error rate, which measures how frequently the classifier makes a mistake. There are many metrics that can be used; different fields have different preferences. For example, in medicine sensitivity and specificity are often used, while in computer science precision and recall are preferred. An important distinction is between metrics that are independent of the prevalence or skew (how often each class occurs in the population) and metrics that depend on the prevalence – both types are useful, but they have very different properties. Often, evaluation is used to compare two methods of classification, so that one can be adopted and the other discarded. Such comparisons are more directly achieved by a form of evaluation that results in a single unitary metric rather than a pair of metrics.


Contingency table

Given a data set, a classification (the output of a classifier on that set) gives two numbers: the number of positives and the number of negatives, which add up to the total size of the set. To evaluate a classifier, one compares its output to another reference classification – ideally a perfect classification, but in practice the output of another gold standard test – and cross-tabulates the data into a 2×2 contingency table, comparing the two classifications. One then evaluates the classifier ''relative'' to the gold standard by computing summary statistics of these 4 numbers. Generally these statistics will be scale invariant (scaling all the numbers by the same factor does not change the output), to make them independent of population size, which is achieved by using ratios of homogeneous functions, most simply homogeneous linear or homogeneous quadratic functions.

Say we test some people for the presence of a disease. Some of these people have the disease, and our test correctly says they are positive. They are called ''true positives'' (TP). Some have the disease, but the test incorrectly claims they don't. They are called ''false negatives'' (FN). Some don't have the disease, and the test says they don't – ''true negatives'' (TN). Finally, there might be healthy people who have a positive test result – ''false positives'' (FP). These can be arranged into a 2×2 contingency table (confusion matrix), conventionally with the actual condition on the vertical axis (rows) and the test result on the horizontal axis (columns).

These numbers can then be totaled, yielding both a grand total and marginal totals. Totaling the entire table, the number of true positives, false negatives, true negatives, and false positives add up to 100% of the set. Totaling the columns (adding vertically), the number of true positives and false positives add up to 100% of the test positives, and likewise for negatives. Totaling the rows (adding horizontally), the number of true positives and false negatives add up to 100% of the condition positives (conversely for negatives). The basic marginal ratio statistics are obtained by dividing the 2×2 = 4 values in the table by the marginal totals (either rows or columns), yielding 2 auxiliary 2×2 tables, for a total of 8 ratios. These ratios come in 4 complementary pairs, each pair summing to 1, and so each of these derived 2×2 tables can be summarized as a pair of 2 numbers, together with their complements. Further statistics can be obtained by taking ratios of these ratios, or more complicated functions. The most common derived ratios are discussed in the following sections. Note that the rows correspond to the ''condition actually'' being positive or negative (or classified as such by the gold standard), and the associated statistics are prevalence-independent, while the columns correspond to the ''test'' being positive or negative, and the associated statistics are prevalence-dependent. There are analogous likelihood ratios for predictive values, but these are less commonly used.
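To make this bookkeeping concrete, the following is a minimal Python sketch (the function and variable names are ours, not from any particular library) that tallies the four cells from paired actual/predicted labels and derives the eight marginal ratios just described:

    def contingency_counts(actual, predicted):
        # Tally the four cells of the 2x2 table from paired 0/1 labels.
        tp = sum(1 for a, p in zip(actual, predicted) if a and p)          # condition +, test +
        fn = sum(1 for a, p in zip(actual, predicted) if a and not p)      # condition +, test -
        fp = sum(1 for a, p in zip(actual, predicted) if not a and p)      # condition -, test +
        tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)  # condition -, test -
        return tp, fn, fp, tn

    def marginal_ratios(tp, fn, fp, tn):
        # The four complementary pairs (8 ratios in total); each pair sums to 1.
        return {
            "TPR (sensitivity)": tp / (tp + fn),  # condition margins: prevalence-independent
            "FNR": fn / (tp + fn),
            "TNR (specificity)": tn / (tn + fp),
            "FPR": fp / (tn + fp),
            "PPV (precision)": tp / (tp + fp),    # test margins: prevalence-dependent
            "FDR": fp / (tp + fp),
            "NPV": tn / (tn + fn),
            "FOR": fn / (tn + fn),
        }

    actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # gold-standard labels
    predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # test output
    print(marginal_ratios(*contingency_counts(actual, predicted)))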


Pairs of metrics

Often, accuracy is evaluated with a pair of metrics that follow a standard pattern.


Sensitivity and specificity

The fundamental prevalence-independent statistics are sensitivity and specificity.

Sensitivity or True Positive Rate (TPR), also known as recall, is the proportion of people that tested positive and are positive (True Positive, TP) out of all the people that actually are positive (Condition Positive, CP = TP + FN). It can be seen as ''the probability that the test is positive given that the patient is sick''. With higher sensitivity, fewer actual cases of disease go undetected (or, in the case of factory quality control, fewer faulty products go to the market).

Specificity (SPC) or True Negative Rate (TNR) is the proportion of people that tested negative and are negative (True Negative, TN) out of all the people that actually are negative (Condition Negative, CN = TN + FP). As with sensitivity, it can be looked at as ''the probability that the test result is negative given that the patient is not sick''. With higher specificity, fewer healthy people are labeled as sick (or, in the factory case, fewer good products are discarded).

The relationship between sensitivity and specificity, as well as the performance of the classifier, can be visualized and studied using the Receiver Operating Characteristic (ROC) curve.

In theory, sensitivity and specificity are independent in the sense that it is possible to achieve 100% in both, such as when classifying red and blue balls by directly observing their color. In more practical, less contrived instances, however, there is usually a trade-off, such that they are inversely proportional to one another to some extent. This is because we rarely measure the actual thing we would like to classify; rather, we generally measure an indicator of the thing we would like to classify, referred to as a surrogate marker. The reason why 100% is achievable in the ball example is that redness and blueness are determined by directly detecting redness and blueness. However, indicators are sometimes compromised, such as when non-indicators mimic indicators or when indicators are time-dependent, only becoming evident after a certain lag time. The following example of a pregnancy test makes use of such an indicator.

Modern pregnancy tests ''do not'' use the pregnancy itself to determine pregnancy status; rather, human chorionic gonadotropin (hCG), present in the urine of gravid females, is used as a ''surrogate marker to indicate'' that a woman is pregnant. Because hCG can also be produced by a tumor, the specificity of modern pregnancy tests cannot be 100% (because false positives are possible). Also, because hCG is present in the urine in such small concentrations after fertilization and early embryogenesis, the sensitivity of modern pregnancy tests cannot be 100% (because false negatives are possible).
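The trade-off can be illustrated with a short sketch (the names and the surrogate-marker scores below are made up for illustration): sweeping a decision threshold over a scored indicator traces out the points of an ROC curve, raising sensitivity at the cost of specificity and vice versa.

    def roc_points(scores_sick, scores_healthy, thresholds):
        # (1 - specificity, sensitivity) pairs as the decision threshold varies;
        # a higher score is taken to mean "more likely sick".
        points = []
        for t in thresholds:
            sensitivity = sum(s >= t for s in scores_sick) / len(scores_sick)
            fpr = sum(s >= t for s in scores_healthy) / len(scores_healthy)  # 1 - specificity
            points.append((fpr, sensitivity))
        return points

    sick = [0.9, 0.8, 0.6, 0.4]       # surrogate-marker scores, actually sick
    healthy = [0.7, 0.3, 0.2, 0.1]    # surrogate-marker scores, actually healthy
    print(roc_points(sick, healthy, thresholds=[0.25, 0.5, 0.75]))
    # [(0.5, 1.0), (0.25, 0.75), (0.0, 0.5)]: lower thresholds buy sensitivity
    # with false positives; higher thresholds buy specificity with misses.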


Positive and negative predictive values

In addition to sensitivity and specificity, the performance of a binary classification test can be measured with positive predictive value (PPV), also known as precision, and negative predictive value (NPV). The positive predictive value answers the question "If the test result is ''positive'', how well does that ''predict'' an actual presence of disease?". It is calculated as TP/(TP + FP); that is, it is the proportion of true positives out of all positive results. The negative predictive value is the same, but for negatives: TN/(TN + FN), the proportion of true negatives out of all negative test results.
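As a minimal sketch (illustrative helper names, not from any library), the two predictive values follow directly from the cell counts:

    def ppv(tp, fp):
        # Positive predictive value (precision): TP / (TP + FP)
        return tp / (tp + fp)

    def npv(tn, fn):
        # Negative predictive value: TN / (TN + FN)
        return tn / (tn + fn)

    print(ppv(tp=90, fp=10))  # 0.9
    print(npv(tn=80, fn=20))  # 0.8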


Impact of prevalence on predictive values

Prevalence has a significant impact on predictive values. As an example, suppose there is a test for a disease with 99% sensitivity and 99% specificity. If 2000 people are tested and the prevalence (in the sample) is 50%, 1000 of them are sick and 1000 of them are healthy. Thus about 990 true positives and 990 true negatives are likely, with 10 false positives and 10 false negatives. The positive and negative predictive values would be 99%, so there can be high confidence in the result. However, if the prevalence is only 5%, so that of the 2000 people only 100 are really sick, then the predictive values change significantly. The likely result is 99 true positives, 1 false negative, 1881 true negatives and 19 false positives. Of the 19 + 99 people who tested positive, only 99 really have the disease – meaning, intuitively, that given that a patient's test result is positive, there is only an 84% chance that they really have the disease. On the other hand, given that the patient's test result is negative, there is only 1 chance in 1882, or a 0.05% probability, that the patient has the disease despite the test result.
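The arithmetic of this example is easy to reproduce; here is a small sketch (function name ours) that computes the expected cell counts at a given prevalence and derives the predictive values from them:

    def predictive_values(sens, spec, prev, n):
        # Expected cell counts for n people at the given prevalence,
        # then PPV and NPV computed from those counts.
        sick, healthy = n * prev, n * (1 - prev)
        tp, fn = sick * sens, sick * (1 - sens)
        tn, fp = healthy * spec, healthy * (1 - spec)
        return tp / (tp + fp), tn / (tn + fn)

    print(predictive_values(0.99, 0.99, 0.50, 2000))  # (0.99, 0.99)
    print(predictive_values(0.99, 0.99, 0.05, 2000))  # (~0.839, ~0.99947)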


Precision and recall

Precision and recall can be interpreted as (estimated) conditional probabilities: precision is given by P(C = P \mid \hat{C} = P), while recall is given by P(\hat{C} = P \mid C = P), where \hat{C} is the predicted class and C is the actual class. Both quantities are therefore connected by Bayes' theorem.
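Spelled out (writing P(C = P) for the prevalence of actual positives and P(\hat{C} = P) for the fraction of positive predictions), Bayes' theorem gives:

:: \text{precision} = P(C=P \mid \hat{C}=P) = \frac{P(\hat{C}=P \mid C=P)\, P(C=P)}{P(\hat{C}=P)} = \text{recall} \cdot \frac{P(C=P)}{P(\hat{C}=P)}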


Relationships

There are various relationships between these ratios. If the prevalence, sensitivity, and specificity are known, the positive predictive value can be obtained from the following identity:

:: \text{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity}) \times (1 - \text{prevalence})}

If the prevalence, sensitivity, and specificity are known, the negative predictive value can be obtained from the following identity:

:: \text{NPV} = \frac{\text{specificity} \times (1 - \text{prevalence})}{\text{specificity} \times (1 - \text{prevalence}) + (1 - \text{sensitivity}) \times \text{prevalence}}
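As a quick numeric check (a sketch; the helper names are ours), these identities reproduce the predictive values from the worked prevalence example above:

    def ppv_from_rates(sens, spec, prev):
        # PPV from sensitivity, specificity and prevalence (first identity above)
        return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

    def npv_from_rates(sens, spec, prev):
        # NPV from sensitivity, specificity and prevalence (second identity above)
        return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

    print(ppv_from_rates(0.99, 0.99, 0.05))  # ~0.839, matching 99/118
    print(npv_from_rates(0.99, 0.99, 0.05))  # ~0.99947, matching 1881/1882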


Unitary metrics

In addition to the paired metrics, there are also unitary metrics that give a single number to evaluate the test.

Perhaps the simplest statistic is accuracy or ''fraction correct'' (FC), which measures the fraction of all instances that are correctly categorized; it is the ratio of the number of correct classifications to the total number of correct or incorrect classifications: (TP + TN)/total population = (TP + TN)/(TP + TN + FP + FN). As such, it compares estimates of pre- and post-test probability. In total ignorance, one can compare a rule to flipping a coin (p0 = 0.5). This measure is prevalence-dependent. If 90% of people with COVID symptoms don't have COVID, the prior probability P(−) is 0.9, and the simple rule "classify all such patients as COVID-free" would be 90% accurate. Diagnosis should be better than that. One can construct a one-proportion z-test with p0 as max(priors) = max(P(−), P(+)) for a diagnostic method hoping to beat a simple rule that always predicts the most likely outcome. Here, the hypotheses are "H0: p ≤ 0.9 vs. Ha: p > 0.9", rejecting H0 for large values of z. One diagnostic rule could be compared to another if the other's accuracy is known and substituted for p0 in calculating the z statistic. If the other's accuracy is not known and must be calculated from data, an accuracy comparison test could be made using a two-proportion z-test, pooled for H0: p1 = p2. Not used very much is the complementary statistic, the ''fraction incorrect'' (FiC): FC + FiC = 1, or (FP + FN)/(TP + TN + FP + FN) – this is the sum of the antidiagonal, divided by the total population. Cost-weighted fractions incorrect could compare the expected costs of misclassification for different methods.

The diagnostic odds ratio (DOR) can be a more useful overall metric. It can be defined directly as (TP×TN)/(FP×FN) = (TP/FN)/(FP/TN), or indirectly as a ratio of likelihood ratios, which are themselves ratios of true rates or predictive values. This has a useful interpretation – as an odds ratio – and is prevalence-independent. The likelihood ratio is also generally considered to be prevalence-independent and is easily interpreted as the multiplier that turns prior probabilities into posterior probabilities.

An F-score is a combination of precision and recall, providing a single score. There is a one-parameter family of such statistics, with parameter ''β'' determining the relative weights of precision and recall. The traditional or balanced F-score (F1 score) is the harmonic mean of precision and recall:

:F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.

F-scores do not take the true negative rate into account and, therefore, are more suited to information retrieval and information extraction evaluation, where the true negatives are innumerable. Instead, measures such as the phi coefficient, Matthews correlation coefficient (MCC), informedness or Cohen's kappa may be preferable for assessing the performance of a binary classifier. As a correlation coefficient, the Matthews correlation coefficient is the geometric mean of the regression coefficients of the problem and its dual. The component regression coefficients of the Matthews correlation coefficient are markedness (deltap) and informedness (Youden's J statistic, or deltap').
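A few of these unitary metrics can be computed side by side in a short sketch (illustrative code, using the standard formulas named above) so their differing sensitivities to prevalence can be compared directly:

    import math

    def unitary_metrics(tp, fn, fp, tn):
        # Single-number summaries of the 2x2 table.
        total = tp + fn + fp + tn
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return {
            "accuracy": (tp + tn) / total,
            "F1": 2 * precision * recall / (precision + recall),
            "DOR": (tp * tn) / (fp * fn),  # diagnostic odds ratio
            # MCC: correlation between prediction and condition;
            # undefined (zero denominator) if any margin is empty.
            "MCC": (tp * tn - fp * fn) / math.sqrt(
                (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
            ),
        }

    # Cells from the 5%-prevalence example above: high accuracy, but the
    # other metrics reveal more about the balance of error types.
    print(unitary_metrics(tp=99, fn=1, fp=19, tn=1881))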


Choosing the appropriate form of evaluation

Hand has highlighted the importance of choosing an appropriate method of evaluation. However, of the many different methods for evaluating the accuracy of a classifier, there is no general method for determining which should be used in which circumstances. Different fields have taken different approaches. Cullerne Bown has distinguished three basic approaches to evaluation:
* Mathematical – such as the Matthews correlation coefficient, in which both kinds of error are axiomatically treated as equally problematic;
* Cost-benefit – in which a currency is adopted (e.g. money or Quality Adjusted Life Years) and values are assigned to errors and successes on the basis of empirical measurement;
* Judgemental – in which a human judgement is made about the relative importance of the two kinds of error; typically this starts by adopting a pair of indicators such as sensitivity and specificity, precision and recall, or positive predictive value and negative predictive value.
In the judgemental case, he has provided a flow chart for determining which pair of indicators should be used when, and consequently how to choose between the Receiver Operating Characteristic (ROC) curve and the precision-recall curve.


Evaluation of underlying technologies

Often, we want to evaluate not a specific classifier working in a specific way, but an underlying technology. Typically, the technology can be adjusted by altering the threshold of a score function, with the threshold determining whether a result is classified as positive or negative. For such evaluations, a useful single measure is the "area under the ROC curve" (AUC).
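The AUC can be computed without plotting the curve at all, via its probabilistic interpretation: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (with ties counted as half). A minimal pure-Python sketch (names ours):

    def auc(scores_pos, scores_neg):
        # P(random positive scores higher than random negative); ties count 1/2.
        wins = sum(
            1.0 if p > n else 0.5 if p == n else 0.0
            for p in scores_pos
            for n in scores_neg
        )
        return wins / (len(scores_pos) * len(scores_neg))

    print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # 8/9 ~ 0.889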


Accuracy aside

Apart from accuracy, binary classifiers can be assessed in many other ways, for example in terms of their speed or cost.


Evaluation of probabilistic classifiers

Probabilistic classification models go beyond providing binary outputs and instead produce probability scores for each class. These models are designed to assess the likelihood or probability of an instance belonging to different classes. In this context, alternative evaluation metrics have been developed that take into account the probabilistic nature of the classifier's output and provide a more comprehensive assessment of its effectiveness in assigning accurate probabilities to the different classes. These metrics aim to capture the degree of calibration, discrimination, and overall accuracy of the probabilistic classifier's predictions.
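One standard metric of this kind – offered here as a concrete illustration, since the text above does not name specific metrics – is the Brier score, the mean squared error between predicted probabilities and the 0/1 outcomes, which penalizes both poor calibration and poor discrimination:

    def brier_score(probs, outcomes):
        # Mean squared error between predicted probabilities and 0/1 outcomes.
        # Lower is better; always predicting 0.5 scores 0.25.
        return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

    print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # (0.01 + 0.04 + 0.09)/3 ~ 0.047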


In information systems

Information retrieval systems, such as databases and web search engines, are evaluated by many different metrics, some of which are derived from the confusion matrix, which divides results into true positives (documents correctly retrieved), true negatives (documents correctly not retrieved), false positives (documents incorrectly retrieved), and false negatives (documents incorrectly not retrieved). Commonly used metrics include the notions of precision and recall. In this context, precision is defined as the fraction of documents correctly retrieved compared to the documents retrieved (true positives divided by true positives plus false positives), using a set of ground truth relevant results selected by humans. Recall is defined as the fraction of documents correctly retrieved compared to the relevant documents (true positives divided by true positives plus false negatives). Less commonly, the metric of accuracy is used, defined as the fraction of documents correctly classified compared to all the documents (true positives plus true negatives divided by true positives plus true negatives plus false positives plus false negatives).

None of these metrics take into account the ranking of results. Ranking is very important for web search engines because readers seldom go past the first page of results, and there are too many documents on the web to manually classify all of them as to whether they should be included or excluded from a given search. Adding a cutoff at a particular number of results takes ranking into account to some degree. The measure precision at k, for example, is a measure of precision looking only at the top k search results (e.g. the top ten, for k = 10). More sophisticated metrics, such as discounted cumulative gain, take into account each individual ranking, and are more commonly used where this is important.
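Precision at k is simple enough to state in a few lines; here is a sketch (document IDs and helper name are invented for illustration):

    def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
        # Fraction of the top-k retrieved documents that are in the relevant set.
        top_k = ranked_doc_ids[:k]
        return sum(doc in relevant_ids for doc in top_k) / len(top_k)

    ranked = ["d3", "d1", "d7", "d2", "d9"]   # search results, best first
    relevant = {"d1", "d2", "d4"}             # human-judged ground truth
    print(precision_at_k(ranked, relevant, k=5))  # 2/5 = 0.4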


See also

* Population impact measures
* Attributable risk
* Attributable risk percent
* Scoring rule (for probability predictions)
* Pseudo-R-squared
* Likelihood ratios




External links


Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules