A receiver operating characteristic curve, or ROC curve, is a
graphical plot that illustrates the diagnostic ability of a
binary classifier system as its discrimination threshold is varied.
The ROC curve is the plot of the
true positive rate
(TPR) against the
false positive rate
(FPR), at various threshold settings.
The ROC can also be thought of as a plot of the
power
as a function of the
Type I Error
of the decision rule (when the performance is calculated from just a sample of the population, it can be thought of as estimators of these quantities). The ROC curve is thus the sensitivity or recall as a function of
fall-out.
Given that the probability distributions for both true positive and false positive are known, the ROC curve is obtained as the cumulative distribution function (CDF, area under the probability distribution from $-\infty$ to the discrimination threshold) of the detection probability on the y-axis versus the CDF of the false-positive probability on the x-axis.
ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic
decision making.
Terminology
There are a large number of synonyms for components of a ROC curve. They are tabulated on the right.
The true-positive rate is also known as
sensitivity,
recall
or ''probability of detection''.
The false-positive rate is also known as ''probability of false alarm''
and equals (1 −
specificity).
The ROC is also known as a relative operating characteristic curve, because it is a comparison of two operating characteristics (TPR and FPR) as the criterion changes.
Swets, John A. (1996). ''Signal detection theory and ROC analysis in psychology and diagnostics: collected papers''. Mahwah, NJ: Lawrence Erlbaum Associates.
History
The ROC curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battlefields, starting in 1941, which led to its name ("receiver operating characteristic").
It was soon introduced to psychology to account for perceptual detection of stimuli. ROC analysis has since been used in medicine, radiology, biometrics, forecasting of natural hazards, meteorology, model performance assessment, and other areas for many decades, and is increasingly used in machine learning and data mining research.
Basic concept
A classification model (classifier or diagnosis) is a mapping of instances between certain classes/groups. Because the classifier or diagnosis result can be an arbitrary real value (continuous output), the classifier boundary between classes must be determined by a threshold value (for instance, to determine whether a person has hypertension based on a blood pressure measure). Or it can be a discrete class label, indicating one of the classes.
Consider a two-class prediction problem (
binary classification
), in which the outcomes are labeled either as positive (''p'') or negative (''n''). There are four possible outcomes from a binary classifier. If the outcome from a prediction is ''p'' and the actual value is also ''p'', then it is called a ''true positive'' (TP); however if the actual value is ''n'' then it is said to be a ''false positive'' (FP). Conversely, a ''true negative'' (TN) has occurred when both the prediction outcome and the actual value are ''n'', and ''false negative'' (FN) is when the prediction outcome is ''n'' while the actual value is ''p''.
To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but does not actually have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.
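These four outcomes can be tallied directly from paired actual and predicted labels; a minimal sketch in Python, using made-up example labels (1 for positive, 0 for negative):

```python
# Tally the four binary-classification outcomes from paired label lists.
def count_outcomes(actual, predicted):
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1   # true positive: predicted p, actually p
        elif p == 1 and a == 0:
            fp += 1   # false positive: predicted p, actually n
        elif p == 0 and a == 0:
            tn += 1   # true negative: predicted n, actually n
        else:
            fn += 1   # false negative: predicted n, actually p
    return tp, fp, tn, fn

# Hypothetical labels, for illustration only.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
print(count_outcomes(actual, predicted))  # (3, 1, 3, 1)
```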
Consider an experiment with P positive instances and N negative instances for some condition. The four outcomes can be formulated in a 2×2 ''contingency table'' or ''confusion matrix'', as follows:
ROC space
Several evaluation "metrics" can be derived from the contingency table (see infobox). To draw a ROC curve, only the true positive rate (TPR) and false positive rate (FPR) are needed (as functions of some classifier parameter). The TPR defines how many correct positive results occur among all positive samples available during the test. The FPR, on the other hand, defines how many incorrect positive results occur among all negative samples available during the test.
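As a quick sketch of these two rates (the confusion-matrix counts below are hypothetical):

```python
# TPR: correct positives among all actual positives (TP + FN).
# FPR: incorrect positives among all actual negatives (FP + TN).
def tpr_fpr(tp, fp, tn, fn):
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical confusion-matrix counts, for illustration only.
print(tpr_fpr(63, 28, 72, 37))  # (0.63, 0.28)
```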
A ROC space is defined by FPR and TPR as ''x'' and ''y'' axes, respectively, which depicts relative trade-offs between true positive (benefits) and false positive (costs). Since TPR is equivalent to sensitivity and FPR is equal to 1 − specificity, the ROC graph is sometimes called the sensitivity vs (1 − specificity) plot. Each prediction result or instance of a
confusion matrix
represents one point in the ROC space.
The best possible prediction method would yield a point in the upper left corner or coordinate (0,1) of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). The (0,1) point is also called a ''perfect classification''. A random guess would give a point along a diagonal line (the so-called ''line of no-discrimination'') from the bottom left to the top right corners (regardless of the positive and negative
base rates). An intuitive example of random guessing is a decision by flipping coins. As the size of the sample increases, a random classifier's ROC point tends towards the diagonal line. In the case of a balanced coin, it will tend to the point (0.5, 0.5).
The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random); points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor could simply be inverted to obtain a good predictor.
Consider four prediction results from 100 positive and 100 negative instances:
Plots of the four results above in the ROC space are given in the figure. The result of method A clearly shows the best predictive power among A, B, and C. The result of B lies on the random guess line (the diagonal line), and it can be seen in the table that the
accuracy of B is 50%. However, when C is mirrored across the center point (0.5,0.5), the resulting method C′ is even better than A. This mirrored method simply reverses the predictions of whatever method or test produced the C contingency table. Although the original C method has negative predictive power, simply reversing its decisions leads to a new predictive method C′ which has positive predictive power. When the C method predicts p or n, the C′ method would predict n or p, respectively. In this manner, the C′ test would perform the best. The closer a result from a contingency table is to the upper left corner, the better it predicts, but the distance from the random guess line in either direction is the best indicator of how much predictive power a method has. If the result is below the line (i.e. the method is worse than a random guess), all of the method's predictions must be reversed in order to utilize its power, thereby moving the result above the random guess line.
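The mirroring described above is a point reflection through (0.5, 0.5); the starting point below is hypothetical, not taken from the C table:

```python
def invert_point(fpr, tpr):
    # Reflect an ROC point through the center (0.5, 0.5): reversing every
    # prediction turns TPR into 1 - TPR and FPR into 1 - FPR.
    return round(1.0 - fpr, 12), round(1.0 - tpr, 12)

# Hypothetical below-diagonal result (worse than random guessing).
print(invert_point(0.7, 0.4))  # (0.3, 0.6): now above the diagonal
```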
Curves in ROC space
In binary classification, the class prediction for each instance is often made based on a
continuous random variable $X$, which is a "score" computed for the instance (e.g. the estimated probability in logistic regression). Given a threshold parameter $T$, the instance is classified as "positive" if $X > T$, and "negative" otherwise. $X$ follows a probability density $f_1(x)$ if the instance actually belongs to class "positive", and $f_0(x)$ otherwise. Therefore, the true positive rate is given by $\mathrm{TPR}(T) = \int_T^{\infty} f_1(x)\,dx$ and the false positive rate is given by $\mathrm{FPR}(T) = \int_T^{\infty} f_0(x)\,dx$. The ROC curve plots $\mathrm{TPR}(T)$ parametrically versus $\mathrm{FPR}(T)$ with $T$ as the varying parameter.
For example, imagine that the blood protein levels in diseased people and healthy people are
normally distributed with means of 2
g/
dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (green vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
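This threshold sweep can be sketched with the two normal distributions from the example; the text gives only the means (2 g/dL and 1 g/dL), so the shared standard deviation of 1 g/dL is an assumption added for illustration:

```python
import math

def normal_sf(x, mean, sd):
    # Survival function P(X > x) for a normal distribution.
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2)))

def roc_points(mu_pos=2.0, mu_neg=1.0, sd=1.0, n=9):
    # Sweep the decision threshold across both distributions; each
    # threshold yields one (FPR, TPR) point on the ROC curve.
    points = []
    for i in range(n):
        t = -1.0 + 6.0 * i / (n - 1)        # thresholds from -1 to 5 g/dL
        tpr = normal_sf(t, mu_pos, sd)      # diseased classified positive
        fpr = normal_sf(t, mu_neg, sd)      # healthy misclassified positive
        points.append((fpr, tpr))
    return points

for fpr, tpr in roc_points():
    print(f"FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Raising the threshold shrinks both rates, moving the operating point leftwards along the curve, as described above.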
Further interpretations
Sometimes, the ROC is used to generate a summary statistic. Common versions are:
* the intercept of the ROC curve with the line at 45 degrees orthogonal to the no-discrimination line – the balance point where Sensitivity = Specificity
* the intercept of the ROC curve with the tangent at 45 degrees parallel to the no-discrimination line that is closest to the error-free point (0,1) – also called
Youden's J statistic
and generalized as Informedness
* the area between the ROC curve and the no-discrimination line multiplied by two is called the ''Gini coefficient''. It should not be confused with the
measure of statistical dispersion also called Gini coefficient.
* the area between the full ROC curve and the triangular ROC curve including only (0,0), (1,1) and one selected operating point
– Consistency
* the area under the ROC curve, or "AUC" ("area under curve"), or A' (pronounced "a-prime"), or "c-statistic" ("concordance statistic").
* the
sensitivity index ''d′'' (pronounced "d-prime"), the distance between the mean of the distribution of activity in the system under noise-alone conditions and its distribution under signal-alone conditions, divided by their
standard deviation, under the assumption that both these distributions are
normal with the same standard deviation. Under these assumptions, the shape of the ROC is entirely determined by ''d′''.
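Under that equal-variance normal assumption, the AUC follows from ''d′'' via the standard relation AUC = Φ(''d′''/√2); a small sketch:

```python
import math

def phi(x):
    # Standard normal CDF.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def auc_from_dprime(d):
    # With equal-variance normal noise and signal distributions, the
    # difference of a random signal score and a random noise score is
    # normal with mean d' and variance 2, so AUC = Phi(d' / sqrt(2)).
    return phi(d / math.sqrt(2))

print(round(auc_from_dprime(0.0), 3))  # 0.5 -> chance performance
print(round(auc_from_dprime(1.0), 3))  # about 0.76
```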
However, any attempt to summarize the ROC curve into a single number loses information about the pattern of tradeoffs of the particular discriminator algorithm.
Probabilistic interpretation
When using normalized units, the area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').
Fawcett, Tom (2006). "An introduction to ROC analysis", ''Pattern Recognition Letters'', 27, 861–874. In other words, when given one randomly selected positive instance and one randomly selected negative instance, AUC is the probability that the classifier will be able to tell which one is which.
This can be seen as follows: the area under the curve is given by (the integral boundaries are reversed as a large threshold $T$ has a lower value on the x-axis)
: $A = \int_{0}^{1} \mathrm{TPR}\left(\mathrm{FPR}^{-1}(x)\right) dx$
: $= \int_{\infty}^{-\infty} \mathrm{TPR}(T)\, \mathrm{FPR}'(T)\, dT$
: $= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I(T' > T)\, f_1(T')\, f_0(T)\, dT'\, dT = P(X_1 > X_0)$
where $X_1$ is the score for a positive instance and $X_0$ is the score for a negative instance, and $f_0$ and $f_1$ are the probability densities as defined in the previous section.
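This pairwise-ranking interpretation can be checked numerically by sampling; the score distributions below (unit-variance normals with means 1 and 0) are assumptions for illustration:

```python
import random

# Estimate P(X1 > X0) by sampling (positive, negative) score pairs.
random.seed(0)
n = 20000
pos = [random.gauss(1.0, 1.0) for _ in range(n)]  # positive-class scores
neg = [random.gauss(0.0, 1.0) for _ in range(n)]  # negative-class scores

correct = sum(p > q for p, q in zip(pos, neg))    # correctly ranked pairs
estimate = correct / n
print(round(estimate, 2))  # close to Phi(1/sqrt(2)), about 0.76
```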
Area under the curve
It can be shown that the AUC is closely related to the
Mann–Whitney U,
which tests whether positives are ranked higher than negatives. It is also equivalent to the
Wilcoxon test of ranks.
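That rank-based equivalence amounts to counting, over all positive–negative pairs, how often the positive instance receives the higher score; a minimal sketch with hypothetical scores:

```python
# Pair-counting AUC estimate: the fraction of (negative, positive) pairs
# in which the positive instance scores strictly higher. Ties count as 0
# here, matching a strict indicator (a common variant credits ties 1/2).
def wmw_auc(pos_scores, neg_scores):
    wins = sum(1 for t0 in neg_scores for t1 in pos_scores if t0 < t1)
    return wins / (len(neg_scores) * len(pos_scores))

print(wmw_auc([3.0, 5.0], [1.0, 4.0]))  # 0.75
```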
For a predictor $f$, an unbiased estimator of its AUC can be expressed by the following ''Wilcoxon–Mann–Whitney'' statistic:
: $\mathrm{AUC}(f) = \dfrac{\sum_{t_0 \in \mathcal{D}^0} \sum_{t_1 \in \mathcal{D}^1} \mathbf{1}[f(t_0) < f(t_1)]}{|\mathcal{D}^0| \cdot |\mathcal{D}^1|}$
where