A receiver operating characteristic curve, or ROC curve, is a
graphical plot that illustrates the diagnostic ability of a
binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of military radar receivers starting in 1941, which led to its name.
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or ''probability of detection''. The false-positive rate is also known as ''probability of false alarm'' and can be calculated as (1 − specificity). The ROC can also be thought of as a plot of the power as a function of the type I error
of the decision rule (when the performance is calculated from just a sample of the population, it can be thought of as estimators of these quantities). The ROC curve is thus the sensitivity or recall as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function
(area under the probability distribution from −∞ to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.
ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently from (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.
The ROC curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battlefields and was soon introduced to psychology to account for perceptual detection of stimuli. ROC analysis since then has been used in medicine, radiology, biometrics, forecasting of natural hazards, meteorology, model performance assessment, and other areas for many decades and is increasingly used in machine learning and data mining research.
The ROC is also known as a relative operating characteristic curve, because it is a comparison of two operating characteristics (TPR and FPR) as the criterion changes (Swets, John A.; ''Signal detection theory and ROC analysis in psychology and diagnostics: collected papers'', Lawrence Erlbaum Associates, Mahwah, NJ, 1996).
Basic concept
A classification model (classifier or diagnosis) is a mapping of instances between certain classes/groups. Because the classifier or diagnosis result can be an arbitrary real value (continuous output), the classifier boundary between classes must be determined by a threshold value (for instance, to determine whether a person has hypertension based on a blood pressure measure). Alternatively, the result can be a discrete class label, indicating one of the classes.
Consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive (''p'') or negative (''n''). There are four possible outcomes from a binary classifier. If the outcome from a prediction is ''p'' and the actual value is also ''p'', then it is called a ''true positive'' (TP); however, if the actual value is ''n'' then it is said to be a ''false positive'' (FP). Conversely, a ''true negative'' (TN) has occurred when both the prediction outcome and the actual value are ''n'', and a ''false negative'' (FN) is when the prediction outcome is ''n'' while the actual value is ''p''.
To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but does not actually have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.
Let us define an experiment from P positive instances and N negative instances for some condition. The four outcomes can be formulated in a 2×2 ''contingency table'' or ''confusion matrix'', as follows:
                 Actual p               Actual n
Predicted p      true positive (TP)     false positive (FP)
Predicted n      false negative (FN)    true negative (TN)
Total            P                      N
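The four outcomes can be counted directly from paired lists of actual and predicted labels. The following is a minimal sketch; the function name `confusion_counts` and the toy data are assumptions made for illustration, not part of the original text.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for labels given as the strings 'p' and 'n'."""
    tp = fp = tn = fn = 0
    for a, y in zip(actual, predicted):
        if y == "p" and a == "p":
            tp += 1          # predicted p, actually p
        elif y == "p" and a == "n":
            fp += 1          # predicted p, actually n
        elif y == "n" and a == "n":
            tn += 1          # predicted n, actually n
        else:
            fn += 1          # predicted n, actually p (only 'p'/'n' labels assumed)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical toy data: P = 4 actual positives and N = 4 actual negatives.
actual    = ["p", "p", "p", "p", "n", "n", "n", "n"]
predicted = ["p", "p", "p", "n", "p", "n", "n", "n"]
counts = confusion_counts(actual, predicted)
print(counts)
```

By construction, TP + FN equals P and FP + TN equals N, which is a quick sanity check on any such table.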
ROC space
Several evaluation metrics can be derived from the contingency table. To draw a ROC curve, only the true positive rate (TPR) and false positive rate (FPR) are needed (as functions of some classifier parameter). The TPR is the fraction of correct positive results among all positive samples available during the test, TP / (TP + FN). The FPR, on the other hand, is the fraction of incorrect positive results among all negative samples available during the test, FP / (FP + TN).
A ROC space is defined by FPR and TPR as ''x'' and ''y'' axes, respectively, and depicts the relative trade-offs between true positives (benefits) and false positives (costs). Since TPR is equivalent to sensitivity and FPR is equal to 1 − specificity, the ROC graph is sometimes called the sensitivity vs (1 − specificity) plot. Each prediction result or instance of a confusion matrix represents one point in the ROC space.
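As a small illustration of how one confusion matrix becomes one point in ROC space, the sketch below computes the coordinates from the four counts. The helper name `roc_point` and the counts are hypothetical.

```python
def roc_point(tp, fp, tn, fn):
    """Return (FPR, TPR), the coordinates of one confusion matrix in ROC space."""
    tpr = tp / (tp + fn)  # sensitivity: correct positives among all actual positives
    fpr = fp / (fp + tn)  # 1 - specificity: incorrect positives among all actual negatives
    return fpr, tpr


# Hypothetical counts: 4 actual positives and 4 actual negatives.
fpr, tpr = roc_point(tp=3, fp=1, tn=3, fn=1)
specificity = 3 / (3 + 1)  # TN / (TN + FP)
print(fpr, tpr)
```

Here the x coordinate equals 1 − specificity, which is why the same plot is described as sensitivity versus (1 − specificity).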
The best possible prediction method would yield a point in the upper left corner or coordinate (0,1) of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). The (0,1) point is also called a ''perfect classification''. A random guess would give a point along a diagonal line (the so-called ''line of no-discrimination'') from the bottom left to the top right corners (regardless of the positive and negative base rates). An intuitive example of random guessing is a decision by flipping coins. As the size of the sample increases, a random classifier's ROC point tends towards the diagonal line. In the case of a balanced coin, it will tend to the point (0.5, 0.5).
The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random); points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor could simply be inverted to obtain a good predictor.
Let us look into four prediction results from 100 positive and 100 negative instances:
Plots of the four results above in the ROC space are given in the figure. The result of method A clearly shows the best predictive power among A, B, and C. The result of B lies on the random guess line (the diagonal line), and it can be seen in the table that the accuracy of B is 50%. However, when C is mirrored across the center point (0.5,0.5), the resulting method C′ is even better than A. This mirrored method simply reverses the predictions of whatever method or test produced the C contingency table. Although the original C method has negative predictive power, simply reversing its decisions leads to a new predictive method C′ which has positive predictive power. When the C method predicts p or n, the C′ method would predict n or p, respectively. In this manner, the C′ test would perform the best. The closer a result from a contingency table is to the upper left corner, the better it predicts, but the distance from the random guess line in either direction is the best indicator of how much predictive power a method has. If the result is below the line (i.e. the method is worse than a random guess), all of the method's predictions must be reversed in order to utilize its power, thereby moving the result above the random guess line.
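The mirroring argument can be checked numerically: swapping every predicted label turns true positives into false negatives and false positives into true negatives, which moves the point (FPR, TPR) to (1 − FPR, 1 − TPR). The sketch below uses made-up counts for a worse-than-random method; they are not the counts of method C, and the helper names are assumptions.

```python
def roc_point(tp, fp, tn, fn):
    """Return (FPR, TPR) for one confusion matrix."""
    return fp / (fp + tn), tp / (tp + fn)


def invert(tp, fp, tn, fn):
    """Confusion counts after swapping every predicted label (p <-> n)."""
    # A predicted p becomes n, so TP -> FN and FP -> TN; likewise FN -> TP and TN -> FP.
    return {"tp": fn, "fp": tn, "tn": fp, "fn": tp}


# Hypothetical worse-than-random method on 100 positives and 100 negatives.
bad = {"tp": 20, "fp": 80, "tn": 20, "fn": 80}
x, y = roc_point(**bad)                 # lies below the diagonal
xm, ym = roc_point(**invert(**bad))     # mirrored through (0.5, 0.5)
print((x, y), (xm, ym))
```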
Curves in ROC space
In binary classification, the class prediction for each instance is often made based on a continuous random variable $X$, which is a "score" computed for the instance (e.g. the estimated probability in logistic regression). Given a threshold parameter $T$, the instance is classified as "positive" if $X > T$, and "negative" otherwise. $X$ follows a probability density $f_1(x)$ if the instance actually belongs to class "positive", and $f_0(x)$ otherwise. Therefore, the true positive rate is given by $\mathrm{TPR}(T) = \int_T^\infty f_1(x)\,dx$ and the false positive rate is given by $\mathrm{FPR}(T) = \int_T^\infty f_0(x)\,dx$.
The ROC curve plots $\mathrm{TPR}(T)$ versus $\mathrm{FPR}(T)$ parametrically, with $T$ as the varying parameter.
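On a finite sample, the same construction amounts to sweeping the threshold over the observed scores and recording one (FPR, TPR) pair per threshold. The following is a minimal, unoptimized sketch; the function name `roc_curve_points`, the scores and the labels are all illustrative assumptions.

```python
def roc_curve_points(scores, labels):
    """Return ROC points (FPR, TPR), one per candidate threshold.

    labels: 1 for positive, 0 for negative. An instance is called positive
    when its score is strictly greater than the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Candidate thresholds: +inf (nothing called positive), each distinct score
    # from high to low, then -inf (everything called positive).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical scores with their true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
points = roc_curve_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both coordinates are non-decreasing along the sweep and the curve runs from (0, 0) to (1, 1).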
For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (green vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
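The blood-protein example can be sketched with the standard library's `statistics.NormalDist`. The text gives only the means, so the common standard deviation of 1 g/dL used here is an assumption, as is the helper name `rates`.

```python
from statistics import NormalDist

# Means are taken from the example; sigma = 1.0 g/dL is an assumed value.
diseased = NormalDist(mu=2.0, sigma=1.0)
healthy = NormalDist(mu=1.0, sigma=1.0)


def rates(threshold):
    """Return (FPR, TPR) when levels above `threshold` are classified as diseased."""
    tpr = 1.0 - diseased.cdf(threshold)  # diseased people above the threshold
    fpr = 1.0 - healthy.cdf(threshold)   # healthy people above the threshold
    return fpr, tpr


for t in (0.5, 1.0, 1.5, 2.0, 2.5):
    fpr, tpr = rates(t)
    print(f"threshold={t:.1f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Raising the threshold lowers both rates, which is the leftward (and downward) movement along the curve described above.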
Further interpretations
Sometimes, the ROC is used to generate a summary statistic. Common versions are:
* the intercept of the ROC curve with the line at 45 degrees orthogonal to the no-discrimination line - the balance point where Sensitivity = Specificity
* the intercept of the ROC curve with the tangent at 45 degrees parallel to the no-discrimination line that is closest to the error-free point (0,1) - also called Youden's J statistic and generalized as Informedness
* the area between the ROC curve and the no-discrimination line multiplied by two is called the ''Gini coefficient''. It should not be confused with the measure of statistical dispersion also called Gini coefficient.
* the area between the full ROC curve and the triangular ROC curve including only (0,0), (1,1) and one selected operating point - Consistency
* the area under the ROC curve, or "AUC" ("area under curve"), or A' (pronounced "a-prime"), or "c-statistic" ("concordance statistic").
* the sensitivity index ''d′'' (pronounced "d-prime"), the distance between the means of the distributions of activity in the system under noise-alone conditions and under signal-alone conditions, divided by their standard deviation, under the assumption that both these distributions are normal with the same standard deviation. Under these assumptions, the shape of the ROC is entirely determined by ''d′''.
However, any attempt to summarize the ROC curve into a single number loses information about the pattern of tradeoffs of the particular discriminator algorithm.
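Some of the summary statistics listed above can be computed from a list of ROC points. The sketch below assumes a piecewise-linear curve through hypothetical points, takes Youden's J as the largest value of TPR − FPR along the curve, and obtains the Gini coefficient as 2 × AUC − 1 (twice the area between the curve and the diagonal). The function names are illustrative.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve given as (FPR, TPR) points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def youden_j(points):
    """Largest vertical distance TPR - FPR between the curve and the diagonal."""
    return max(tpr - fpr for fpr, tpr in points)


# Hypothetical ROC points, including the end points (0, 0) and (1, 1).
roc = [(0.0, 0.0), (0.0, 0.5), (0.25, 0.75), (0.5, 1.0), (1.0, 1.0)]
auc = auc_trapezoid(roc)
gini = 2.0 * auc - 1.0
j = youden_j(roc)
print(auc, gini, j)
```

Different curves can share the same AUC while having different shapes, which is one concrete way a single number loses information about the trade-offs.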
Probabilistic interpretation
When using normalized units, the area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative') (Fawcett, Tom (2006); ''An introduction to ROC analysis'', Pattern Recognition Letters, 27, 861–874). In other words, when given one randomly selected positive instance and one randomly selected negative instance, AUC is the probability that the classifier will be able to tell which one is which.
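This can be seen as follows: the area under the curve is given by (the integral boundaries are reversed because a large threshold $T$ has a lower value on the x-axis)
: $A = \int_{x=0}^{1} \mathrm{TPR}(\mathrm{FPR}^{-1}(x))\,dx = \int_{\infty}^{-\infty} \mathrm{TPR}(T)\,\mathrm{FPR}'(T)\,dT$
: $= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I(T' > T)\, f_1(T')\, f_0(T)\,dT'\,dT$
: $= P(X_1 > X_0)$
where $X_1$ is the score for a positive instance and $X_0$ is the score for a negative instance, $I(\cdot)$ is the indicator function, and $f_1$ and $f_0$ are the probability densities of the score for positive and negative instances, as defined in the previous section.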
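The identity between the area and the ranking probability can be checked on a small sample by computing the AUC two ways: as the fraction of (positive, negative) pairs in which the positive instance scores higher, and as the trapezoidal area under the empirical ROC curve. This is a sketch on made-up scores; counting ties as one half is a common convention assumed here, and the function names are illustrative.

```python
def auc_pairwise(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count one half (an assumed convention; the toy data has no ties).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_by_area(pos_scores, neg_scores):
    """Trapezoidal area under the empirical ROC curve from a threshold sweep."""
    # Calling "score >= t" positive at each distinct score t traces the same
    # points as "score > T" with T just below t.
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    pts = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        pts.append((fpr, tpr))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical scores for 4 positive and 4 negative instances.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.55, 0.3, 0.1]
a_pairs = auc_pairwise(pos, neg)
a_area = auc_by_area(pos, neg)
print(a_pairs, a_area)
```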
Area under the curve
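It can be shown that the AUC is closely related to the Mann–Whitney U, which tests whether positives are ranked higher than negatives. It is also equivalent to the Wilcoxon test of ranks. For a predictor $f$, an unbiased estimator of its AUC can be expressed by the following ''Wilcoxon-Mann-Whitney'' statistic:
: $\mathrm{AUC}(f) = \frac{\sum_{t_0 \in \mathcal{D}^0} \sum_{t_1 \in \mathcal{D}^1} \mathbf{1}[f(t_0) < f(t_1)]}{|\mathcal{D}^0| \cdot |\mathcal{D}^1|}$,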
where,